- Introduces a low-rank approach to compressing the KV cache, one of the key bottlenecks in long-context AI
- Speeds up attention computation by up to 6.9x and overall generation throughput by up to 3.1x, moving beyond memory savings to faster inference
- Selected as a Spotlight paper at ICML 2026; Spotlight papers make up about 2.2% of reviewed submissions and about 8.4% of accepted papers
- Following the attention around Google’s TurboQuant at ICLR 2026, STAR-KV presents another approach to advancing KV cache compression
- Paper available on arXiv; source code released on GitHub
SEOUL, South Korea, July 2, 2026 /PRNewswire/ — Dnotitia Inc. (Dnotitia), a company specializing in long-term memory AI and semiconductor-based AI infrastructure technologies, has released the paper and source code for “STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control.” The technology was developed through a joint research effort involving UC San Diego’s VVIP Lab and Dnotitia researchers, and the paper was selected as a Spotlight paper at ICML 2026 (International Conference on Machine Learning 2026), one of the world’s leading conferences in machine learning.

Image caption: STAR-KV, contributed by Dnotitia and selected as an ICML 2026 Spotlight paper, achieves up to 20x KV cache compression and faster inference through low-rank compression and GPU optimization.
In the experiments reported in the paper, low-rank compression alone reduced the KV cache by up to 75%. Combined with the mixed-precision quantization method proposed in the paper, STAR-KV compressed the full KV cache by up to 20x. The technology also improves computation speed through custom GPU kernels, increasing attention computation speed by up to 6.9x and overall generation throughput by up to 3.1x. STAR-KV also showed higher accuracy than major existing KV cache compression methods.
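
As a rough illustration of how low-rank compression with soft thresholding can work in general, the sketch below factors a synthetic key matrix with an SVD and shrinks each singular value by a threshold, so the surviving rank adapts to the data. This is not the STAR-KV algorithm, whose details are in the paper; the function name `soft_threshold_low_rank`, the threshold `tau`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold_low_rank(matrix, tau):
    """Factor `matrix` after shrinking its singular values by `tau` (hypothetical helper)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)   # soft thresholding of singular values
    keep = s_shrunk > 0                   # the rank is whatever survives the threshold
    left = u[:, keep] * s_shrunk[keep]    # shape (tokens, rank)
    right = vt[keep, :]                   # shape (rank, dim)
    return left, right

rng = np.random.default_rng(0)
tokens, dim, true_rank = 512, 128, 16

# Synthetic stand-in for a key cache: approximately low rank plus small noise.
keys = rng.standard_normal((tokens, true_rank)) @ rng.standard_normal((true_rank, dim))
keys += 0.01 * rng.standard_normal((tokens, dim))

left, right = soft_threshold_low_rank(keys, tau=5.0)
rank = left.shape[1]
stored_fraction = (left.size + right.size) / keys.size
rel_error = np.linalg.norm(keys - left @ right) / np.linalg.norm(keys)

print(f"rank={rank}, stored fraction={stored_fraction:.2f}, relative error={rel_error:.3f}")
```

Storing the two factors instead of the full matrix is what saves memory. A larger `tau` discards more singular values, trading accuracy for a smaller cache.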
KV cache compression has become a key technical challenge in AI infrastructure. As research into reducing the memory bottleneck of long-context AI gains momentum, including the attention around Google’s TurboQuant at ICLR 2026, STAR-KV presents a new approach that combines low-rank compression with quantization and GPU execution optimization.
The KV cache is temporary memory stored on the GPU so that a large language model (LLM) does not have to recompute context it has already processed. As AI evolves into agentic systems that use multiple documents, conversation history, code, search results, and outputs from external tools, the amount of context a model must process is growing rapidly. In this environment, the KV cache has emerged as a key bottleneck affecting both GPU memory usage and inference cost.
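
To make the idea concrete, the toy sketch below compares single-head attention with and without a cache. The projection matrices, dimensions, and random token embeddings are assumptions chosen for illustration and do not come from any real model.

```python
import numpy as np

def attend(q, keys, values):
    """Scaled dot-product attention for one query vector."""
    scores = keys @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
dim = 8
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))
tokens = rng.standard_normal((5, dim))        # toy token embeddings

# Without a cache: re-project every earlier token at each step.
no_cache = [
    attend(tokens[t] @ w_q, tokens[: t + 1] @ w_k, tokens[: t + 1] @ w_v)
    for t in range(len(tokens))
]

# With a cache: project each token once and append it to the stored keys and values.
k_cache, v_cache, cached = [], [], []
for x in tokens:
    k_cache.append(x @ w_k)
    v_cache.append(x @ w_v)
    cached.append(attend(x @ w_q, np.array(k_cache), np.array(v_cache)))

outputs_match = np.allclose(no_cache, cached)
print(outputs_match)
```

Both paths produce the same attention outputs, but the cached path does a constant amount of projection work per new token. The cost is that `k_cache` and `v_cache` grow with every token processed, which is where the memory pressure comes from.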
According to the STAR-KV paper, when a LLaMA-3.1-8B model processes a 128K-token context at a batch size of 4, the KV cache accounts for about 81% of total GPU memory. As long-context AI becomes more widely used, KV cache compression is increasingly viewed as a core AI infrastructure technology for processing long context at lower cost.
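
The 81% figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below uses the published LLaMA-3.1-8B configuration (32 layers, 8 key-value heads, head dimension 128) and assumes FP16 storage, about 8.03 billion parameters, and a total that counts only model weights plus KV cache. The paper's own accounting may differ.

```python
# Rough KV cache size estimate for a decoder-only transformer.
# Architecture values follow the published LLaMA-3.1-8B configuration;
# FP16 storage and the "weights + KV cache" total are assumptions.

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch, bytes_per_value=2):
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * tokens * batch * bytes_per_value

GIB = 1024 ** 3

kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                    tokens=128 * 1024, batch=4)
weights = 8.03e9 * 2  # about 8.03B parameters at 2 bytes each (assumed FP16)

kv_gib = kv / GIB
share = kv / (kv + weights)

print(f"KV cache: {kv_gib:.0f} GiB")
print(f"Share of weights + KV cache: {share:.0%}")
```

Under these assumptions the cache comes to 64 GiB, about four times the size of the model weights, which is consistent with the roughly 81% share cited above.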
ICML, where the STAR-KV paper was accepted, is widely regarded as one of the top international conferences in AI and machine learning, alongside NeurIPS and ICLR. ICML 2026 will be held from July 6 to 11 at COEX in Seoul. This year, 23,918 papers entered review, 6,352 were accepted, and 536 were selected as Spotlight papers. Spotlight papers account for about 2.2% of all reviewed submissions and about 8.4% of accepted papers.
Going forward, Dnotitia plans to further advance STAR-KV for use in real-world AI service environments and explore its application to open-source LLM inference frameworks such as vLLM.
“Technologies that help AI process longer context faster and at lower cost are advancing rapidly,” said MK Chung, CEO of Dnotitia. “STAR-KV addresses the core bottlenecks in KV cache capacity and attention processing speed, and Dnotitia aims to contribute to the AI inference ecosystem through open sourcing.”
