Running Long-Context Agents on a 4-Bit KV Cache: Where Accuracy Breaks

A 4-bit KV cache can make long-context agent sessions cheaper to serve, but the paper that proposes it does not yet tell you where the quality floor is. UltraQuant: 4-bit KV Caching for Context-Heavy Agents (PDF), submitted by Inesh Chakrabarti and co-authors on 18 June 2026, reports large speedups on AMD CDNA4 hardware: a 3.47x cut in P50 time-to-first-token for cache-pressured late rounds and a 1.63x lift in output throughput over an FP8 KV baseline. What it does not report is how much task accuracy degrades at 4 bits. For operators, that omission is the story.

Why do agent workloads break the old KV-cache math?

The abstract opens with a framing that is obvious to anyone running multi-turn agents but is rarely the headline in quantization papers: context-heavy agents stress the KV cache differently from single-shot long-context prompts. In a typical agent session, the prefix is long and shared across many short turns, while concurrency determines whether the GPUs stay fed. Agent sessions force the bottleneck onto the cache state that survives from turn to turn.

The implication is that the cost model flips. For a single long-context query, you care about prompt processing and one decode pass. For an agent loop, you pay for the cache footprint of the full conversation on every subsequent turn, and you need enough concurrent sessions to hide decode latency. The paper argues that this workload is the right place to study 4-bit KV caching because the cache, not the weights, dominates memory. That is a useful corrective to coverage that still treats KV compression as a minor optimization next to INT4/INT8 weight quantization.

What is UltraQuant’s 4-bit path, exactly?

UltraQuant is not a generic INT4 scheme. The authors describe an FP4 approximation path that keeps queries in FP8, stores KV tensors in FP4, and uses UE8M0 group scales, with native scaled-MFMA support on AMD’s CDNA4 GPUs. That hardware specificity matters: the numbers in the abstract are not portable to NVIDIA cards, and treating them as a general recipe for 4-bit KV serving would be a mistake.

The design choices are where the work lives. The authors call out asymmetric K/V treatment, Walsh-Hadamard rotation, removal of QJL, and block-scale variants as the practical decisions that make the 4-bit path robust. The paper uses TurboQuant-style rotation and codebook quantization as a quality anchor, and vLLM FP8 KV caching as the deployment anchor. The FP8 path is the baseline being beaten, not a rival product. That baseline choice is sensible for readers running vLLM today, but it also means the comparison is against an already-compressed KV format, not uncompressed FP16.

The asymmetric treatment of K and V tensors is worth pausing on. In transformer attention, keys and values play different roles during decoding: keys are used in the attention score computation across the full prefix, while values are weighted and accumulated. Quantizing both to the same bit width and scale is the easy choice, but it is not necessarily the efficient one. The authors’ design suggests that tolerating more error in one tensor than the other can preserve end-to-end behavior at lower precision. How much tolerance is acceptable, and for which task types, is left to the full paper.

How big is the speedup over FP8 KV?

The abstract reports only performance metrics. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 TTFT by 3.47x in the cache-pressured late rounds, with a 2.3x improvement across all rounds, and raises output throughput by 1.63x over the FP8 KV baseline. Those are substantial numbers, and they are reported against the right workload shape: not a single 100K-token prompt, but repeated agent turns that reuse a growing prefix.

The distinction between late-round and all-round TTFT is important. Late rounds are where the cache pressure peaks, because the prefix has grown and the working set no longer fits cleanly in fast memory. A 3.47x win there is the number that matters for user-perceived latency in long sessions. The 2.3x across all rounds is more representative of the average session, and the 1.63x throughput gain is what matters for cost per session. All three are useful, but they measure different things.

What is missing from the abstract is any quality metric. The authors state that task quality, cache residency, and serving throughput must be measured jointly, which is true and operationally relevant. But they do not say how much task quality dropped at 4 bits, or on which benchmarks. For a builder deciding whether to turn this on in production, that gap is larger than the throughput number.

Where does accuracy actually break?

This is the question the title raises and the abstract does not answer. The authors explicitly frame 4-bit KV caching as a tradeoff between task quality, cache residency, and throughput, which implies that quality does degrade. They just do not quantify it in the abstract.

There are plausible reasons the degradation could be small for some workloads and large for others. Codebook quantization and Walsh-Hadamard rotation are known techniques for reducing the impact of low-precision storage on attention. Asymmetric K/V treatment can preserve the more sensitive tensor. But agent workloads are heterogeneous: a coding agent that calls tools and inspects long traces may be more sensitive to KV precision than a simple Q&A agent, because errors compound across tool-use boundaries and long-context retrieval. The abstract’s “long-context, multi-turn agentic workload” is a single benchmark; it does not establish a general accuracy floor.

The honest takeaway for operators is that 4-bit KV compression for agents is now a measured performance win on one hardware platform and one workload, but the accuracy boundary is still a hypothesis. If you are already running FP8 KV and your sessions are cache-bound, UltraQuant gives you a reason to test FP4. It does not yet give you a reason to ship it without a task-specific evaluation.

What changes for capacity planning and per-session cache budgets?

If the performance numbers hold and the accuracy numbers turn out acceptable, the second-order effect is a shift in how serving teams plan capacity. Agent serving forces the conversation toward per-session cache budget and concurrency.

Long sessions can consume many gigabytes of KV cache. At FP8 that halves; at FP4 it halves again, which matches the bit-width step vLLM describes for FP8 KV caching. The win is not just memory saved, but the concurrency that saved memory enables. If your cache footprint per session drops, you can fit more sessions on the same hardware, which directly affects throughput and cost. But the planning variable is now per-session size, turn count, and tolerable precision loss.

This also changes the shape of failure. A weight-quantization bug usually shows up as a uniform quality drop across all outputs. A KV-cache precision bug is stateful and turn-dependent: it may only appear after the twentieth turn, or only when a specific token is recalled from the compressed cache. Debugging that is harder than debugging a bad weight matrix. The paper’s focus on robust design choices (asymmetric K/V, Hadamard rotation, block scales) suggests the authors understand this, but operators should expect that the production failure modes of 4-bit KV will be subtle and workload-specific.

The AMD-specificity of UltraQuant adds another planning constraint. If your serving stack is on NVIDIA hardware, the exact numbers do not apply, and the FP4 instruction support is different. The general principle, that 4-bit KV can trade quality for cache capacity on agent workloads, is portable. The 3.47x figure is not.

The broader point is that agent frameworks and serving infrastructure are being forced to co-design. You cannot treat context length as cheap if the cache state that supports it now dominates your memory and your latency. UltraQuant is one of the first papers to make that co-design explicit for 4-bit KV compression. The performance side is promising. The accuracy side is still an open question, and that is exactly where the headline should land.

Frequently Asked Questions

Which hardware stack can actually run UltraQuant’s 4-bit KV path as described?

The paper’s FP4 path needs AMD CDNA4 GPUs with native scaled-MFMA support and UE8M0 group scales. It is not a vLLM configuration toggle; without that specific kernel path, the 3.47x late-round TTFT figure does not transfer to NVIDIA or even to older AMD cards.

How is UltraQuant different from the TurboQuant work it cites?

TurboQuant supplies the rotation and codebook-quantization ideas that UltraQuant uses as a quality anchor. UltraQuant itself is a serving-oriented FP4 path that keeps queries in FP8, drops QJL, and adds asymmetric K/V treatment plus block-scale variants for AMD CDNA4.

What should a vLLM FP8 KV team measure before enabling FP4 KV for production agents?

Replicate your own agent workload on the same AMD CDNA4 hardware, then measure task accuracy across real session lengths and tool-use boundaries. The abstract publishes TTFT and throughput only, so your own traces, especially coding and long-context retrieval loops, are the only valid acceptance test.

What kind of quality bug is hardest to catch with a 4-bit KV cache?

Stateful, turn-dependent errors that look fine for many turns and then fail when a specific earlier token is recalled from compressed cache. A practical guard is adversarial session replay with golden traces and per-turn diffs, because standard per-request benchmarks will often miss late-turn drift.

If accuracy holds up, what planning metric overtakes model weight size?

Per-session cache budget. With FP4 halving the KV footprint again versus FP8, capacity planning must model how many concurrent agent sessions fit before late-round cache pressure spikes, not just how many parameters the model stores.