KV cache offloading methods that report near-lossless scores on needle-in-a-haystack and RULER can fall apart on workloads requiring synthesis across the full context window. A paper on arXiv (2604.08426v2, revised May 8) [1] identifies this gap and introduces Text2JSON, a benchmark designed to resist landmark heuristics. The authors, from Yandex and HSE, propose YAKV, a retrieval strategy that replaces group-based landmarks with per-key 2-bit quantized selection and claims near-lossless accuracy where ShadowKV-style methods degrade.
What the paper found: standard benchmarks miss the failure mode
The current validation stack for KV cache offloading centers on two task families: needle-in-a-haystack (NIAH), which tests whether a model can retrieve a single fact from long context, and RULER, which stretches context length with varied retrieval patterns. ShadowKV [2], accepted as an ICML 2025 Spotlight, reports near-lossless scores on both. But the authors show that these benchmarks belong to a low-intensity task class [1]. The model only needs to locate and return scattered facts, not combine or reason across the offloaded KV region.
When the workload shifts to context-intensive tasks (structured JSON extraction from 10K-63.5K token documents [1], multi-document synthesis, or HTML-to-TSV conversion), the same methods degrade measurably. The paper tests ShadowKV, ArkVale, LRQK, and InfiniGen across Llama 3.1 8B, Llama 3.2 3B, Qwen3 4B, and Qwen3-30B-A3B [1]. All four methods show accuracy drops on context-intensive benchmarks including Text2JSON, MultiNeedle-128K, LongProc HTML-to-TSV, and Loong [1]. Even raising ShadowKV’s sparse token budget to 10× its default 1.5625% [1] does not fix the problem.
Two root causes: SVD key compression and unreliable landmarks
The paper traces the failure to two design choices common across the tested methods [1].
First, low-rank SVD compression of keys. ShadowKV uses rank-160 SVD by default [1]. The authors demonstrate that this compression preserves enough signal for isolated retrieval (NIAH) but destroys accuracy when the model must attend broadly across the offloaded region to synthesize an answer. The low-rank approximation smooths away the fine-grained key structure needed for context-intensive attention patterns.
Second, group-based landmark selection. ShadowKV groups tokens into chunks of 8 and uses each chunk’s mean key as its landmark. ArkVale uses cuboid digests of 16-32 tokens. These groupings generate false positives: the landmark signals a match, but the loaded KV entries are wrong for the actual query. On low-intensity tasks, a single correct retrieval is enough. On context-intensive tasks, false positives compound because the model must cross-reference many entries across the offloaded window.
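The landmark failure can be reproduced on toy data. The sketch below is not ShadowKV's implementation; it assumes a landmark is simply the mean of 8 consecutive key vectors, and the helper names and the two-dimensional keys are made up for illustration. One group is uniformly lukewarm toward the query, the other holds a single strong match surrounded by opposed keys.

```python
from statistics import fmean

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def group_landmarks(keys, group_size):
    """Mean key vector for each run of `group_size` consecutive keys."""
    landmarks = []
    for start in range(0, len(keys), group_size):
        group = keys[start:start + group_size]
        landmarks.append([fmean(column) for column in zip(*group)])
    return landmarks

def best_group_by_landmark(query, keys, group_size):
    """Index of the group whose landmark scores highest against the query."""
    scores = [dot(query, lm) for lm in group_landmarks(keys, group_size)]
    return max(range(len(scores)), key=scores.__getitem__)

def best_key_exact(query, keys):
    """Index of the single key with the highest exact score."""
    scores = [dot(query, k) for k in keys]
    return max(range(len(scores)), key=scores.__getitem__)

GROUP_SIZE = 8  # assumption for this toy example
query = [1.0, 0.0]

# Group 0: eight keys that are all mildly aligned with the query.
lukewarm = [[0.6, 0.0] for _ in range(GROUP_SIZE)]
# Group 1: one strong match, seven keys pointing the other way.
mixed = [[2.0, 0.0]] + [[-1.0, 0.0] for _ in range(GROUP_SIZE - 1)]
keys = lukewarm + mixed

picked_group = best_group_by_landmark(query, keys, GROUP_SIZE)
best_key = best_key_exact(query, keys)
best_key_group = best_key // GROUP_SIZE

print("landmark picks group", picked_group)
print("best key lives in group", best_key_group)
```

The landmark ranks group 0 first even though the strongest key sits in group 1, because averaging cancels the one strong match against its neighbours. When a task needs only one fact this kind of miss is rare; when it needs many fields at once, each misranked group costs part of the answer.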
Text2JSON: the benchmark designed to break landmarks
The authors built Text2JSON to expose exactly this failure mode. It contains 500 samples [3] across four extraction domains: product specifications, medical professionals, organizational records, and movies. Context lengths range from 10K to 63.5K tokens [3], averaging 20.1K [3]. Each sample requires extracting 3 to 20 JSON entries. The evaluation uses deterministic intersection-over-union (IoU) scoring, not LLM-as-a-Judge, so the metric does not inherit the leniency of a generative evaluator.
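To make the metric concrete, here is a minimal sketch of a deterministic IoU over extracted JSON entries. It assumes an entry counts as matched only when every field is identical; the benchmark's actual matching rule is not described here, and the function names and movie records are illustrative.

```python
import json

def canonical(entry):
    """Serialize a JSON object with sorted keys so equal entries compare equal."""
    return json.dumps(entry, sort_keys=True, separators=(",", ":"))

def entry_iou(predicted, reference):
    """Intersection-over-union of two lists of JSON entries (exact match)."""
    pred = {canonical(e) for e in predicted}
    ref = {canonical(e) for e in reference}
    union = pred | ref
    if not union:
        return 1.0  # both empty: nothing to extract, nothing extracted
    return len(pred & ref) / len(union)

reference = [
    {"title": "Alien", "year": 1979},
    {"title": "Heat", "year": 1995},
    {"title": "Arrival", "year": 2016},
]
predicted = [
    {"year": 1979, "title": "Alien"},   # key order differs, still a match
    {"title": "Heat", "year": 1996},    # one wrong field, counts as a miss
    {"title": "Arrival", "year": 2016},
]

score = entry_iou(predicted, reference)
print(score)  # 2 matched entries / 4 distinct entries = 0.5
```

Under exact matching, a single wrong field is penalized twice: the correct entry is missing and a spurious one is present. That is why a metric of this shape is unforgiving toward retrieval that loads almost the right tokens.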
The benchmark is intentionally adversarial to landmark heuristics. Successful extraction requires attending to many scattered fields and combining them into structured output. A method that approximates attention via coarse-grained group landmarks will miss the cross-field dependencies that Text2JSON demands. The authors have released the dataset and code at github.com/yandex-research/context-intensive-kv-offloading [3].
One confound to note: Llama 3 models score poorly on Text2JSON and LongProc even at full attention, without any offloading. The authors attribute this to model weakness in structured generation, not to offloading artifacts. The offloading degradation is visible on top of this baseline, but the Llama 3 numbers should be read as model-limited, not method-limited.
YAKV: simpler is better
The proposed fix, YAKV, discards both SVD compression and group landmarks [1].
Instead of SVD rank-160 keys [1], YAKV uses data-free HIGGS 4-bit quantization with d=2 and n=256 [1]. Instead of channel-mean group landmarks, it stores a 2-bit per-key HIGGS selection vector. This removes both the outlier handling and the grouping. The authors equalize total PCIe bandwidth against ShadowKV and show that YAKV achieves near-lossless accuracy on context-intensive tasks while matching ShadowKV’s low-intensity performance.
The design principle is per-key selection rather than group-level approximation. Where ShadowKV asks which chunk of 8 tokens might be relevant, YAKV asks which individual keys, and pays the storage cost of a 2-bit selector per key to answer at that finer granularity. On the tested models and benchmarks, this shift from group-level to per-key selection removes the false-positive failure mode.
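Per-key selection from a low-bit copy of the keys can be sketched in a few lines. The example below does not implement HIGGS: it substitutes a plain uniform 2-bit scalar quantizer per coordinate, and the scale, toy keys, and function names are assumptions chosen for illustration. The point is only that every key gets its own approximate score, so no averaging across neighbours takes place.

```python
import math

LEVELS = 4  # 2 bits per coordinate

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def quantize_2bit(vec, scale):
    """Bucket each coordinate into one of 4 uniform bins of width `scale`."""
    return [min(LEVELS - 1, max(0, math.floor(x / scale) + 2)) for x in vec]

def dequantize_2bit(codes, scale):
    """Map each code back to its bin centre: -1.5, -0.5, 0.5 or 1.5 times scale."""
    return [(c - 1.5) * scale for c in codes]

def select_keys(query, key_codes, scale, budget):
    """Score every key from its 2-bit codes and return the top `budget` indices."""
    scores = [dot(query, dequantize_2bit(codes, scale)) for codes in key_codes]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def select_keys_exact(query, keys, budget):
    """Reference selection using the full-precision keys."""
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(ranked[:budget])

SCALE = 1.0  # hypothetical bin width for the toy data
query = [1.0, 0.5]
keys = [
    [0.6, 0.1], [2.0, -0.3], [-1.0, 0.4],
    [1.4, 0.9], [-0.2, -1.7], [0.3, 0.2],
]

# Only the 2-bit codes would need to stay resident for selection.
key_codes = [quantize_2bit(k, SCALE) for k in keys]

approx = select_keys(query, key_codes, SCALE, budget=2)
exact = select_keys_exact(query, keys, budget=2)
print(approx, exact)
```

On this toy data the 2-bit scores pick the same two keys as the full-precision scores. That will not hold in general, and a real quantizer such as HIGGS is built to keep the ranking error much smaller than a uniform grid does.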
Throughput numbers and production implications
YAKV is not free. The paper reports throughput on a single H100 using mini-SGLang with continuous batching [1]. For Qwen3-30B-A3B-Instruct [1], baseline full attention at batch size 1 runs at 35.6 tok/s [1]. YAKV at batch size 8 reaches 73.2 tok/s, and at batch size 32 reaches 107.7 tok/s [1]. The thinking variant shows a similar pattern: baseline 56.6 tok/s [1], YAKV batch size 8 at 129.2 tok/s [1]. These are roughly 2-3× throughput gains over full attention (2.1× and 3.0× for the instruct model, 2.3× for the thinking variant), driven by fitting larger batches rather than faster per-request latency.
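The ratios behind these figures are easy to check. The short script below uses only the numbers quoted above; the per-request column assumes the reported tok/s is aggregate across the whole batch, which is an interpretation rather than something stated explicitly.

```python
# Tokens per second quoted above for Qwen3-30B-A3B.
BASELINE = {"instruct": 35.6, "thinking": 56.6}   # full attention, batch size 1
YAKV = {
    ("instruct", 8): 73.2,
    ("instruct", 32): 107.7,
    ("thinking", 8): 129.2,
}

def speedup(variant, batch):
    """Aggregate throughput relative to the full-attention baseline."""
    return YAKV[(variant, batch)] / BASELINE[variant]

def per_request(variant, batch):
    """Per-request tok/s, assuming the reported figure is summed over the batch."""
    return YAKV[(variant, batch)] / batch

for variant, batch in YAKV:
    print(
        f"{variant:9s} batch={batch:<3d} "
        f"speedup={speedup(variant, batch):.2f}x "
        f"per-request={per_request(variant, batch):.2f} tok/s"
    )
```

The speedups come out at about 2.06×, 3.03×, and 2.28×. Under the aggregate assumption, each individual request at batch size 8 decodes at roughly 9 tok/s for the instruct model, well below the 35.6 tok/s single-request baseline, which is the sense in which the gain is a throughput gain and not a latency gain.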
The hardware specifics also matter. The paper’s inference benchmarks run on H200 GPUs at PCIe Gen5 x16, while evaluation uses A100-80G GPUs within the OpenCompass framework. Self-hosted teams running older hardware or different PCIe topologies should expect different absolute numbers.
What this means for validation before deploying KV offload in production
The practical implication is straightforward and uncomfortable: if your validation suite consists of NIAH and RULER, you have no evidence that your offload tier handles the workloads that actually stress cross-context synthesis. Structured extraction, multi-document QA, and codebase analysis all fall into the context-intensive class that existing methods mishandle.
Teams evaluating ShadowKV-style or FlexKV-style offload paths should add Text2JSON or an equivalent context-intensive benchmark to their gating criteria before any production deployment. The failure mode is not a minor accuracy drift; it is a structural collapse on synthesis tasks that existing benchmarks simply do not measure.
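One way to encode that gating rule is sketched below. Everything in it is hypothetical: the `BenchResult` type, the 2% relative-drop threshold, and the scores in the example are placeholders and are not taken from the paper. The gate compares each offloaded score with the same model's full-attention score, so a weak baseline does not get blamed on the offload tier.

```python
from dataclasses import dataclass

REQUIRED_CLASSES = ("context-intensive", "low-intensity")

@dataclass(frozen=True)
class BenchResult:
    name: str
    task_class: str        # one of REQUIRED_CLASSES
    full_attention: float  # score without offloading
    offloaded: float       # score with the offload tier enabled

def gate(results, max_relative_drop=0.02):
    """Return (passed, failures); fails if a task class is missing or drops too far."""
    present = {r.task_class for r in results}
    failures = [
        f"no {cls} benchmark in suite"
        for cls in REQUIRED_CLASSES
        if cls not in present
    ]
    for r in results:
        if r.full_attention <= 0:
            failures.append(f"{r.name}: baseline score is zero, cannot compare")
            continue
        drop = (r.full_attention - r.offloaded) / r.full_attention
        if drop > max_relative_drop:
            failures.append(f"{r.name}: relative drop of {drop:.1%} exceeds limit")
    return (not failures, failures)

# Placeholder scores, invented purely to exercise the gate.
niah_only = [BenchResult("NIAH", "low-intensity", 0.99, 0.985)]
with_extraction = niah_only + [
    BenchResult("Text2JSON", "context-intensive", 0.80, 0.60),
]

passed_a, failures_a = gate(niah_only)
passed_b, failures_b = gate(with_extraction)
print(passed_a, failures_a)
print(passed_b, failures_b)
```

The NIAH-only suite fails the gate because no context-intensive benchmark is present, even though its one score is fine. The second suite fails because the placeholder Text2JSON score drops by a quarter relative to full attention.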
The YAKV result suggests the fix may be simpler than the problem: swap coarse group landmarks for fine-grained per-key selection, and swap SVD compression for lightweight quantization. But simplicity at the algorithm level does not guarantee simplicity at the integration level. Until YAKV or similar designs ship in production inference engines like vLLM, most teams will be running validation on one workload class and deploying on another.
Frequently Asked Questions
Does any offloading method hold up on context-intensive tasks?
LRQK edges out the other methods on MultiNeedle-128K by 0.09%, but it still drops significantly on Text2JSON and LongProc. None of the four existing methods tested maintains accuracy across both retrieval-heavy and synthesis-heavy tasks, which points to a structural failure rather than an implementation-specific one.
What else is in ShadowKV’s configuration beyond rank-160 SVD?
ShadowKV also reserves a 384-token outlier budget for attention spikes and maintains a 32-token local sliding window for recent-token access. Both are heuristic guardrails that help on narrow-retrieval workloads but do nothing when the task requires distributed attention across the offloaded region, which is exactly what Text2JSON demands.
Does the SVD compression failure apply outside PCIe offloading?
The root cause is information loss in the key representation, not the transfer channel. Any inference optimization that applies low-rank approximation to cached KV states, including in-GPU-memory compression schemes that never touch PCIe, would likely exhibit the same degradation on context-intensive workloads. The failure mode is algorithmic, not I/O-bound.
Has the YAKV paper been peer-reviewed?
As of May 2026 the paper (arXiv 2604.08426v2) has not been accepted at a peer-reviewed venue. The code, Text2JSON dataset, and evaluation methodology are publicly available for independent reproduction, but the claims have not undergone formal peer review. That applies in particular to the head-to-head comparisons against ShadowKV, which comes from a competing lab (ByteDance).
Why does NIAH pass when Text2JSON fails under the same offload method?
NIAH inserts a single fact into filler text, producing a narrow attention pattern where the correct group is easy to distinguish from groups of irrelevant background tokens. Text2JSON inverts this: nearly every position in the context contains structurally relevant fields, so group-level landmarks generate false positives at rates that compound as the model cross-references entries across the offloaded window.