KV Cache Offloading Breaks on Context-Intensive Tasks: Text2JSON Exposes the Landmark Failure Mode

KV cache offloading methods that report near-lossless scores on needle-in-a-haystack and RULER can fall apart on workloads requiring synthesis across the full context window. A paper on arXiv (2604.08426, now at v4 after a May 15 revision) [Updated June 2026]¹ identifies this gap and introduces Text2JSON, a benchmark designed to resist landmark heuristics. The authors from Yandex and HSE (Bocharnikov, Ermakov, Kuznedelev, Zhdanovskiy, and Yershov) propose YAKV, a retrieval strategy that replaces group-based landmarks with per-key 2-bit quantized selection and claims near-lossless accuracy where ShadowKV-style methods degrade.

What the paper found: standard benchmarks miss the failure mode

The current validation stack for KV cache offloading centers on two task families: needle-in-a-haystack (NIAH), which tests whether a model can retrieve a single fact from long context, and RULER, which stretches context length with varied retrieval patterns. ShadowKV², accepted as an ICML 2025 Spotlight, reports near-lossless scores on both. But the authors show that these benchmarks belong to a low-intensity task class¹. The model only needs to locate and return scattered facts, not combine or reason across the offloaded KV region.

When the workload shifts to context-intensive tasks (structured JSON extraction from 10K-63.5K token documents¹, multi-document synthesis, or HTML-to-TSV conversion), the same methods degrade measurably. The paper tests ShadowKV, ArkVale, LRQK, and InfiniGen across Llama 3.1 8B, Llama 3.2 3B, Qwen3 4B, and Qwen3-30B-A3B¹. All four methods lose accuracy on context-intensive benchmarks including Text2JSON, MultiNeedle-128K, LongProc HTML-to-TSV, and Loong¹. Even raising ShadowKV’s sparse token budget to 10× its default 1.5625%¹ does not fix the problem.

Two root causes: SVD key compression and unreliable landmarks

The paper traces the failure to two design choices common across the tested methods¹.

First, low-rank SVD compression of keys. ShadowKV uses rank 160¹ SVD by default. The authors demonstrate that this compression preserves enough signal for isolated retrieval (NIAH) but destroys accuracy when the model must attend broadly across the offloaded region to synthesize an answer. The low-rank approximation smooths away the fine-grained key structure needed for context-intensive attention patterns.

The intuition is about where attention mass lands. NIAH produces a spiky attention distribution: one key dominates, and a rank-160 projection that captures the top directions of variance will still rank that key first. Synthesis tasks produce a flatter distribution across many keys whose query dot-products differ by small margins, and those margins live in the low-variance directions SVD discards first. The paper also tests rank 256 and 512¹; raising the rank narrows the gap but does not close it, which is consistent with the failure being about which structure SVD keeps rather than how much. [Updated June 2026]
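
A toy illustration of the rank argument, not the paper's experiment: the sketch below builds synthetic keys with a decaying spectrum, truncates them with NumPy's SVD, and measures how many of the true top-k keys survive in the low-rank ranking. The matrix sizes, the spectrum, and the helper names low_rank and topk_overlap are assumptions made for illustration; real key caches have structure that random data does not.

```python
import numpy as np

def low_rank(keys: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation of the key matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(keys, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def topk_overlap(keys, approx, query, k):
    """Fraction of the true top-k keys (by dot product) that the approximation keeps."""
    true_top = set(np.argsort(keys @ query)[-k:].tolist())
    approx_top = set(np.argsort(approx @ query)[-k:].tolist())
    return len(true_top & approx_top) / k

rng = np.random.default_rng(0)
n_keys, dim = 2048, 64                      # illustrative sizes, not from the paper
spectrum = 1.0 / np.arange(1, dim + 1)      # assumed decaying per-direction variance
keys = rng.standard_normal((n_keys, dim)) * spectrum
query = rng.standard_normal(dim)

for rank in (8, 16, 32, 64):
    approx = low_rank(keys, rank)
    err = float(np.linalg.norm(keys - approx))
    print(rank, round(err, 4), topk_overlap(keys, approx, query, k=64))
```

The reconstruction error shrinks as the rank grows, but a small error does not by itself guarantee that a broad top-k set is ranked correctly, which is the distinction the paragraph above draws.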

Second, group-based landmark selection. ShadowKV groups tokens into chunks of 8 and uses the channel-wise mean of each chunk's keys as its landmark. ArkVale uses cuboid digests of 16-32 tokens. These groupings generate false positives: the landmark signals a match, but the loaded KV entries are wrong for the actual query. On low-intensity tasks, a single correct retrieval is enough. On context-intensive tasks, false positives compound because the model must cross-reference many entries across the offloaded window.
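
A minimal sketch of how group landmarks can mis-select, assuming a landmark is the mean of a chunk of consecutive keys and that whole chunks are loaded when their landmark scores well. This is not ShadowKV's or ArkVale's code; select_by_landmarks and recall are hypothetical helpers, and the random keys have no locality, so the recall numbers it prints do not transfer to real models.

```python
import numpy as np

def select_by_landmarks(keys, query, chunk, budget):
    """Pick keys by scoring one mean landmark per chunk of consecutive keys."""
    n_chunks = keys.shape[0] // chunk
    landmarks = keys[: n_chunks * chunk].reshape(n_chunks, chunk, -1).mean(axis=1)
    n_pick = max(1, budget // chunk)
    top_chunks = np.argsort(landmarks @ query)[-n_pick:]
    # Expand each chosen chunk back into the token indices it covers.
    return np.sort((top_chunks[:, None] * chunk + np.arange(chunk)).ravel())

def recall(selected, keys, query, budget):
    """Share of the exact top-`budget` keys that the selection actually loaded."""
    true_top = np.argsort(keys @ query)[-budget:]
    return len(set(selected.tolist()) & set(true_top.tolist())) / budget

rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 64))      # illustrative sizes
query = rng.standard_normal(64)
budget = 256

for chunk in (1, 8, 32):
    picked = select_by_landmarks(keys, query, chunk, budget)
    print(chunk, recall(picked, keys, query, budget))
```

With a chunk size of 1 the landmark is the key itself and selection is exact; larger chunks trade that precision for fewer landmarks to score.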

There is a cost angle the paper also measures. The landmark machinery is not cheap: recall and selection account for roughly 73% of ShadowKV’s total latency and about 94% of ArkVale’s¹. The group-based approximation is therefore spending most of its compute budget on the exact step that produces the false positives on synthesis tasks. A method that selects imprecisely and also burns most of its time selecting is paying twice for the wrong answer.

Text2JSON: the benchmark designed to break landmarks

The authors built Text2JSON to expose exactly this failure mode. It contains 500 samples³ across four extraction domains: product specifications, medical professionals, organizational records, and movies. Context lengths range from 10K to 63.5K tokens³, averaging 20.1K³. Each sample requires extracting 3 to 20 JSON entries. The evaluation uses deterministic intersection-over-union (IoU) scoring, not LLM-as-a-Judge, so the metric does not inherit the leniency of a generative evaluator.
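
For readers unfamiliar with set-based IoU, here is a minimal sketch of how such a score can be computed over extracted JSON entries. It assumes exact matching of whole entries after key-order normalization; the benchmark's actual matching rules may differ, and canonical and entry_iou are hypothetical names, not functions from the released code.

```python
import json

def canonical(entry: dict) -> str:
    """Serialize an entry with sorted keys so field order does not matter."""
    return json.dumps(entry, sort_keys=True, ensure_ascii=False)

def entry_iou(predicted: list[dict], gold: list[dict]) -> float:
    """Intersection-over-union of two lists of JSON entries, exact match."""
    pred_set = {canonical(e) for e in predicted}
    gold_set = {canonical(e) for e in gold}
    union = pred_set | gold_set
    if not union:
        return 1.0  # both empty: treated as a perfect match (an assumption)
    return len(pred_set & gold_set) / len(union)

# Illustrative entries, not taken from the dataset.
gold = [{"title": "Alien", "year": 1979}, {"title": "Heat", "year": 1995}]
pred = [{"year": 1979, "title": "Alien"}, {"title": "Heat", "year": 1996}]
print(entry_iou(pred, gold))  # one shared entry out of three distinct ones
```

A single wrong field turns a correct entry into both a miss and a false positive, which is why a deterministic metric of this kind is harsher than a lenient generative judge.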

The benchmark is intentionally adversarial to landmark heuristics. Successful extraction requires attending to many scattered fields and combining them into structured output. A method that approximates attention via coarse-grained group landmarks will miss the cross-field dependencies that Text2JSON demands. The authors have released the dataset and code at github.com/yandex-research/context-intensive-kv-offloading³.

One confound to note: Llama 3 models score poorly on Text2JSON and LongProc even at full attention, without any offloading. The authors attribute this to model weakness in structured generation, not to offloading artifacts. The offloading degradation is visible on top of this baseline, but the Llama 3 numbers should be read as model-limited, not method-limited.

YAKV: simpler is better

The proposed fix, YAKV, discards both SVD compression and group landmarks¹.

Instead of SVD rank-160 keys¹, YAKV uses data-free HIGGS 4-bit quantization with d=2 and n=256¹. Instead of channel-mean group landmarks, it stores a 2-bit per-key HIGGS selection vector. This eliminates outliers and grouping entirely. The authors equalize total PCIe bandwidth against ShadowKV and show that YAKV achieves near-lossless accuracy on context-intensive tasks while matching ShadowKV’s low-intensity performance.

The design principle is per-key selection rather than group-level approximation. Where ShadowKV asks which chunk of 8 tokens might be relevant, YAKV asks which individual keys are, and pays the storage cost of a 2-bit selector per key to answer precisely. On the tested models and benchmarks, this shift from group-level to key-level selection removes the false-positive failure mode.
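
The per-key idea can be sketched with a much simpler quantizer than the one the paper uses. The code below is a stand-in under stated assumptions: it uses a uniform 2-bit scalar quantizer per key, not HIGGS, stores codes unpacked in uint8 for readability, and the names quantize_2bit and select_per_key are hypothetical. It only shows the shape of the approach: score every key from a cheap low-bit proxy, then fetch the winners.

```python
import numpy as np

def quantize_2bit(keys):
    """Uniform 2-bit scalar quantizer per key: codes in {0, 1, 2, 3} plus scale and offset.

    Stand-in only; this is NOT the HIGGS quantizer the paper uses.
    """
    lo = keys.min(axis=1, keepdims=True)
    hi = keys.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 3.0
    codes = np.clip(np.round((keys - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def select_per_key(codes, scale, lo, query, budget):
    """Score every key from its 2-bit proxy and return the top-`budget` indices."""
    proxy = codes.astype(np.float32) * scale + lo
    return np.sort(np.argsort(proxy @ query)[-budget:])

rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 64)).astype(np.float32)   # illustrative sizes
query = rng.standard_normal(64).astype(np.float32)

codes, scale, lo = quantize_2bit(keys)
picked = select_per_key(codes, scale, lo, query, budget=256)
true_top = np.argsort(keys @ query)[-256:]
print(len(set(picked.tolist()) & set(true_top.tolist())) / 256)
```

Unlike a group landmark, each selected index here corresponds to one key, so a selection error costs one slot rather than a whole chunk.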

The accounting is what makes the comparison fair. A 2-bit selector per key sounds expensive next to one channel-mean landmark per group of eight, but YAKV drops the SVD key cache that ShadowKV keeps resident, so the GPU-side memory budget comes out roughly even¹. YAKV is not buying accuracy with extra memory; it is spending the same budget on per-key precision instead of low-rank reconstruction. That is the paper’s actual claim, and it is a stronger one than “more bits help.”

Throughput numbers and production implications

YAKV is not free. The paper reports throughput on a single H100 using mini-SGLang with continuous batching¹. On real-data prompts for Qwen3-30B-A3B-Instruct-2507, baseline full attention at batch size 1 runs at 35.6 tok/s, while YAKV at batch size 32 reaches 107.7 tok/s, a 3.03× gain [Updated June 2026]¹. The thinking variant follows the same shape: baseline 56.6 tok/s at batch 1, YAKV 129.2 tok/s at batch 8, a 2.28× gain¹. On the synthetic 65K-token forced-decoding test the win is smaller, 29.4 to 52.6 tok/s (1.79× at batch 4)¹. The gains come from fitting larger batches rather than faster per-request latency; time-per-output-token actually rises with batch size, from 24 ms to 379 ms in the batch-32 case¹.
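
The quoted gains follow directly from the reported throughput pairs. A short check, using only the figures cited above (the dictionary labels are informal):

```python
# Figures as reported in the text above: (baseline tok/s, YAKV tok/s).
reported = {
    "real prompts, batch 32": (35.6, 107.7),
    "thinking variant, batch 8": (56.6, 129.2),
    "synthetic 65K, batch 4": (29.4, 52.6),
}

# Speedup is the ratio of YAKV throughput to baseline throughput.
speedups = {name: round(yakv / base, 2) for name, (base, yakv) in reported.items()}
for name, gain in speedups.items():
    print(f"{name}: {gain}x")
```

Note that each ratio compares a batched YAKV run against a batch-1 baseline, so it measures aggregate throughput, not per-request speed.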

The hardware specifics also matter. Per the paper’s own table caption, the throughput benchmarks run on a single H100, and the totals fold in prefill time rather than reporting decode-only rates [Updated June 2026]. (Earlier drafts of this article cited H200 and A100 hardware and an OpenCompass harness; none of those appear in the paper, which uses one H100 throughout.) The comparison against ShadowKV is normalized to the same PCIe budget per decoding step, measured in GiB of host-to-device transfer, so the accuracy results are not confounded by one method simply moving more bytes across the bus. Self-hosted teams on different GPUs, bus topologies, or serving stacks should expect different absolute numbers.

What this means for validation before deploying KV offload in production

The practical implication is straightforward and uncomfortable: if your validation suite consists of NIAH and RULER, you have no evidence that your offload tier handles the workloads that actually stress cross-context synthesis. Structured extraction, multi-document QA, and codebase analysis all fall into the context-intensive class that existing methods mishandle.

Teams evaluating ShadowKV-style or FlexKV-style offload paths should add Text2JSON or an equivalent context-intensive benchmark to their gating criteria before any production deployment. The failure mode is not a minor accuracy drift; it is a structural collapse on synthesis tasks that existing benchmarks simply do not measure.
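
One way to encode such a gate is a per-benchmark comparison against the full-attention baseline, so a strong NIAH score cannot mask a Text2JSON collapse. The sketch below is an assumption about how a team might wire this up: the gate function, the 2% relative-drop threshold, and every score in it are made up for illustration and are not results from the paper.

```python
def gate(baseline: dict[str, float], offload: dict[str, float],
         max_rel_drop: float = 0.02) -> dict[str, bool]:
    """Per-benchmark pass/fail: offload must stay within max_rel_drop of baseline."""
    results = {}
    for name, base in baseline.items():
        rel_drop = (base - offload[name]) / base if base > 0 else 0.0
        results[name] = rel_drop <= max_rel_drop
    return results

# Made-up scores for illustration only; not results from the paper.
baseline = {"niah": 0.99, "ruler": 0.90, "text2json": 0.80}
offload = {"niah": 0.99, "ruler": 0.89, "text2json": 0.60}

verdict = gate(baseline, offload)
print(verdict, "deploy" if all(verdict.values()) else "block")
```

Requiring every benchmark to pass, rather than averaging, keeps a context-intensive failure from being diluted by easy retrieval scores.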

The YAKV result suggests the fix may be simpler than the problem: swap coarse group landmarks for fine-grained per-key selection, and swap SVD compression for lightweight quantization. But simplicity at the algorithm level does not guarantee simplicity at the integration level. Until YAKV or similar designs ship in production inference engines like vLLM, most teams will be running validation on one workload class and deploying on another.

Where YAKV sits in the KV-offload landscape

KV-cache management has fractured into several non-overlapping strategies, and YAKV’s per-key thesis pushes against most of them. The dominant production pattern is reuse rather than offload: keep hot prefixes resident and page cold ones out to cheaper tiers. ObjectCache pushes KV reuse out to S3-class object storage, betting that layerwise retrieval hides the latency. That class of system never reconstructs attention from a compressed proxy; it moves the real KV bytes, so it sidesteps the accuracy collapse YAKV documents but pays full storage and bandwidth cost. The methods YAKV critiques sit at the opposite end: they keep a lossy sketch on the GPU and rebuild attention on demand, cheaper until the task needs the parts the sketch discarded.

Quantization is the other axis. Huawei’s KVarn moves KV-cache quantization inside the vLLM backend, and the recurring lesson there, and in the documented accuracy cliff for 4-bit KV caches on long-context agents, is the one Text2JSON makes precise: aggregate scores hide where compression actually breaks. YAKV leans on HIGGS, a data-free grid quantizer co-authored by part of the same Yandex group, to produce its 4-bit values and 2-bit selectors with no calibration pass. The choice is deliberate. A calibration-dependent scheme would couple the offload tier to a specific data distribution, the exact hidden assumption the paper argues against.

The throughput story also depends on the serving substrate. The benchmarks run on mini-SGLang, not a production engine, and the prefill-decode split that governs real latency is handled differently in each stack. Bidirectional KV transfer between prefill and decode workers and disaggregation of the KV cache from weights for cold MoE models both change the bandwidth math that YAKV’s PCIe-budget equalization assumes. Until a per-key selector like YAKV’s lands in one of those engines, the accuracy advantage stays a research result, not a deployable one.

Frequently Asked Questions

Does any offloading method hold up on context-intensive tasks?

LRQK edges out the other methods on MultiNeedle-128K by 0.09%, but it still drops significantly on Text2JSON and LongProc. No method in the benchmark set maintains accuracy across both retrieval-heavy and synthesis-heavy tasks; the failure is structural, not implementation-specific.

What else is in ShadowKV’s configuration beyond rank-160 SVD?

ShadowKV also reserves a 384-token outlier budget for attention spikes and maintains a 32-token local sliding window for recent-token access. Both are heuristic guardrails that help on narrow-retrieval workloads but do nothing when the task requires distributed attention across the offloaded region, which is exactly what Text2JSON demands.

Does the SVD compression failure apply outside PCIe offloading?

The root cause is information loss in the key representation, not the transfer channel. Any inference optimization that applies low-rank approximation to cached KV states, including in-GPU-memory compression schemes that never touch PCIe, would likely exhibit the same degradation on context-intensive workloads. The failure mode is algorithmic, not I/O-bound.

Has the YAKV paper been peer-reviewed?

As of June 2026 the paper (arXiv 2604.08426, latest revision v4 on May 15) remains a preprint with no acceptance at a peer-reviewed venue [Updated June 2026]. The code, Text2JSON dataset, and evaluation methodology are publicly available for independent reproduction, but the claims, particularly the head-to-head comparisons against ShadowKV, which comes from a competing lab (ByteDance), have not undergone formal peer review.

Why does NIAH pass when Text2JSON fails under the same offload method?

NIAH inserts a single fact into filler text, producing a narrow attention pattern where the correct group is easy to distinguish from groups of irrelevant background tokens. Text2JSON inverts this: nearly every position in the context contains structurally relevant fields, so group-level landmarks generate false positives at rates that compound as the model cross-references entries across the offloaded window.