KV cache offloading trades GPU memory for PCIe bandwidth by compressing or selectively loading key-value pairs. NIAH-style benchmarks support the assumption that this tradeoff is transparent to output quality. Bocharnikov et al.1 introduce Text2JSON, a context-intensive extraction benchmark with 500 samples averaging 20,100 tokens, and show that ShadowKV, ArkVale, LRQK, and InfiniGen all degrade accuracy on Llama 3 and Qwen 3 during multi-needle retrieval. The loss is not random; it is systematic, rooted in two compression heuristics.
Text2JSON Finds the Blind Spot: Why NIAH Benchmarks Miss KV Offloading Accuracy Regressions
The standard test for long-context inference is needle-in-a-haystack: hide one fact in a long prompt and ask the model to retrieve it. RULER and LongBench subsets fall into this category, testing one to three needle retrievals at most. Text2JSON, introduced in the May 8 revision of Bocharnikov et al.’s paper1, uses four structured extraction tasks (product specifications, medical professional records, organizational records, and movie metadata) across 500 samples averaging 20,100 tokens, with a range of 10,000 to 63,500.1 Accuracy is measured by deterministic intersection-over-union against ground-truth JSON, not by an LLM-as-a-judge that might smooth over partial failures.
The paper notes that existing methods score near-losslessly on RULER while dropping accuracy on Text2JSON, MultiNeedle-128K, LongProc HTML-to-TSV, and Loong multi-document QA.1 The difference is context intensity: benchmarks with few needles miss the error modes that appear when a model must perform ten or more retrievals to assemble a structured response.
Two Failure Modes: Low-Rank Key Projection and Unreliable Landmarks
Bocharnikov et al. trace the accuracy loss to two compression heuristics. The first is low-rank key projection. ShadowKV, the ByteDance/CMU ICML 2025 Spotlight method2, compresses keys via SVD at a default rank of 160. That rank is sufficient for single-needle retrieval but loses information when the model must discriminate among many similar keys during multi-needle extraction. Raising the rank to 512 approaches full-attention accuracy1, yet at that point the compression ratio is worse than simple FP8 quantization.
The second failure mode is unreliable group landmarks. ShadowKV averages channel dimensions over 8-token chunks to build retrieval landmarks. ArkVale builds cuboid digests over 16-to-32-token spans. Both produce false positives: the landmark matches, but the specific KV pair loaded is wrong. The paper notes that doubling the outlier budget or local budget does not resolve the issue, suggesting the problem is structural to chunk-based aggregation rather than a tunable parameter.
The Numbers: ShadowKV, ArkVale, LRQK, InfiniGen Accuracy Drops on Llama 3 and Qwen 3
On Text2JSON, ShadowKV, ArkVale, LRQK, and InfiniGen all show accuracy degradation relative to full attention on the tested models. The full evaluation1 spans Llama 3.1 8B, Llama 3.2 3B, Qwen3-4B, and Qwen3-30B-A3B across four benchmarks. The pattern is consistent: methods tuned for RULER-style needle retrieval falter on tasks that require sustained multi-needle lookup.
YAKV’s Simpler Fix: Quantization Over SVD, Per-Key Selection Over Group Landmarks
The authors’ proposed fix, YAKV (Yet Another KV Offloading), discards both problematic heuristics. It replaces SVD with data-free HIGGS 4-bit quantization3 and skips group landmarks entirely in favor of 2-bit HIGGS per-key selection. On equal PCIe bandwidth budgets, YAKV achieves near-lossless accuracy on Text2JSON for both Qwen3-4B and Qwen3-30B-A3B.1
The throughput gains are substantial. On an H200 serving Qwen3-30B-A3B-Instruct at batch size 32,1 YAKV reaches 107.7 tokens per second against a 35.6 tok/s baseline, a roughly 3x improvement.
The Benchmarking Gap: What vLLM and llm-d Should Test Before Shipping KV Offloading
The infrastructure problem is benchmark coverage, not algorithmic novelty. The paper states explicitly that “most popular benchmarks used to evaluate KV offloading are not context-intensive.”1 vLLM and llm-d suites, which production teams rely on for regression testing, measure time-to-first-token, throughput, and NIAH-style retrieval. None of these surface the accuracy drops that appear on extraction, structured conversion, or multi-document QA workloads.
The implication is that a system can pass all standard benchmarks and still silently corrupt output quality for real users. A finance team running long-document extraction or a legal team doing multi-source synthesis would see degraded results without any latency signal to explain why.
What This Means for Production: Workload-Dependent Offloading Is Not Free
For production teams, the takeaway is workload audit, not algorithm dismissal. If your long-context traffic is dominated by single-document summarization or one-off retrieval, existing offloading methods probably work fine. If your traffic involves structured extraction, code analysis, or multi-document synthesis, the accuracy loss is measurable and systematic.
The code and Text2JSON dataset are released under Apache-2.0/MIT4, built on OpenCompass for evaluation and mini-SGLang for inference. The experimental artifacts ran on A100-80G for approximately 8,500 GPU-hours, plus H200 throughput tests.1
Frequently Asked Questions
Does switching to YAKV require a calibration dataset?
No. YAKV uses data-free HIGGS 4-bit quantization, which applies a deterministic scheme without any calibration data. SVD-based methods like ShadowKV must compute singular value decompositions over actual key activations from the specific model being served, creating an initialization cost and a model-specific dependency that YAKV avoids entirely.
Did the paper benchmark vLLM or llm-d directly?
No. The evaluation measures the offloading algorithms themselves using mini-SGLang for inference and OpenCompass for scoring. vLLM and llm-d would need to implement YAKV’s 2-bit per-key HIGGS selection before production throughput numbers transfer, since scheduler behavior, prefix caching, and continuous batching strategies differ between engines.
What were ShadowKV’s full default parameters beyond the SVD rank?
ShadowKV ships with SVD rank 160, chunk size 8, and a sparse budget of 1.56%, all tuned against RULER. The sparse budget constrains how many KV pairs survive compression per layer; combined with chunk-based landmarks, it is too tight for the dense retrieval patterns in context-intensive tasks. Loosening the budget alone does not fix the landmark false-positive problem.
Do these accuracy regressions apply to MoE models or only dense transformers?
Qwen3-30B-A3B, one of the primary test models, is a Mixture-of-Experts architecture where only roughly 3B of 30B parameters are active per token. YAKV achieves near-lossless accuracy on it, indicating the compression failure is independent of feed-forward expert routing—the problem lives in the attention key-value structure shared by both dense and MoE architectures.