groundy
infrastructure & runtime

ObjectCache Moves KV Reuse to S3-Class Storage: Why Layerwise Retrieval Beats Full-Prefix Cache Hits

ObjectCache retrieves KV cache per-layer from S3-class storage, adding 5.6% to TTFT at 64K context but a flat 56–75 ms at 4K. Long-context deployments where DRAM is the bottleneck benefit most.

A single Llama 3.1 70B request at 128K context consumes roughly 42.9 GB of KV cache at BF16, according to Spheron’s optimization guide. Eight concurrent users push that past 343 GB, about 2.4× the model’s roughly 140 GB of BF16 weights. ObjectCache responds to this capacity wall by retrieving KV cache layer-by-layer from S3-class object storage, adding only 5.6% time-to-first-token (TTFT) overhead at 64K context on a 100 Gbps RoCE cluster. The approach works at long contexts. At short ones, the calculus flips.

The 343 GB problem

GQA and FP8 quantization shrink the per-token cost of the cache but do not change how it scales. FP8 halves the cache, bringing a 32K-context, 8-user configuration down to ~42.9 GB, per Spheron’s analysis, but the cache still grows linearly in both context length and concurrency. Double the users or double the context and the cache doubles.
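
The figures in this section can be reproduced with back-of-envelope arithmetic. The sketch below assumes the publicly documented Llama 3.1 70B configuration (80 layers, 8 KV heads under GQA, head dimension 128), which the article itself does not spell out, and it counts decimal gigabytes. The function and variable names are illustrative.

```python
# Back-of-envelope KV cache sizing.
# Assumption: model dimensions come from the public Llama 3.1 70B config,
# not from the ObjectCache paper or Spheron's guide.

def kv_cache_bytes(tokens: int, layers: int = 80, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one request: K and V, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

GB = 1e9  # decimal gigabytes

bf16_128k_gb = kv_cache_bytes(128 * 1024) / GB                        # one request, BF16
bf16_128k_x8_gb = 8 * bf16_128k_gb                                    # eight users, BF16
fp8_32k_x8_gb = 8 * kv_cache_bytes(32 * 1024, bytes_per_elem=1) / GB  # eight users, FP8, 32K
fp8_128k_x8_gb = 8 * kv_cache_bytes(128 * 1024, bytes_per_elem=1) / GB

print(f"128K, 1 user, BF16:  {bf16_128k_gb:.1f} GB")
print(f"128K, 8 users, BF16: {bf16_128k_x8_gb:.1f} GB")
print(f"32K, 8 users, FP8:   {fp8_32k_x8_gb:.1f} GB")
print(f"128K, 8 users, FP8:  {fp8_128k_x8_gb:.1f} GB")
```

Every term in the product is a plain multiplier, which is why neither GQA (fewer KV heads) nor FP8 (one byte per element) changes the linear growth in tokens and users.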

The dominant open-source answer has been to tier the cache: keep hot prefixes in GPU memory, warm entries in CPU DRAM, and cold entries on local disk. LMCache implements this strategy inside vLLM, achieving up to 15× throughput improvement on multi-turn QA and document analysis workloads. It also documents an uncomfortable finding: context truncation, a common practice for managing memory pressure, halves the prefix cache hit ratio. The shared-prefix assumption that prefix caching relies on breaks when you chop the context.

FlexKV distributes KV blocks across engine instances via a CPU-backed store. vLLM v0.19.1 ships a FlexKV connector alongside its built-in --enable-prefix-caching flag. Both LMCache and FlexKV assume the KV store lives in memory or on fast local media. ObjectCache questions that assumption.

How ObjectCache retrieves KV from object storage

The core idea: store KV tensors in S3-compatible object storage and fetch them per-layer during inference, rather than loading the entire cached context into GPU memory before generation begins. The non-obvious part is the protocol co-design.
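
A minimal sketch of a layer-ordered fetch, assuming one stored object per layer and a caller-supplied `fetch_layer` function (hypothetical; in practice it might wrap an S3 GET). This is not ObjectCache's implementation, only an illustration of keeping a few reads in flight while layers are consumed in order:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Tuple


def iter_layers_prefetched(
    fetch_layer: Callable[[int], bytes],
    num_layers: int,
    depth: int = 4,
) -> Iterator[Tuple[int, bytes]]:
    """Yield (layer_index, kv_bytes) in layer order with up to `depth` reads in flight."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        next_layer = 0
        # Prime the pipeline with the first `depth` layers.
        while next_layer < min(depth, num_layers):
            pending.append(pool.submit(fetch_layer, next_layer))
            next_layer += 1
        for layer in range(num_layers):
            # Blocks only if this layer's data has not arrived yet.
            data = pending.popleft().result()
            # Refill the window before handing the layer to the consumer.
            if next_layer < num_layers:
                pending.append(pool.submit(fetch_layer, next_layer))
                next_layer += 1
            yield layer, data


def fake_fetch(layer: int) -> bytes:
    """Stand-in for an object-store read; returns a labelled payload."""
    return f"kv-layer-{layer}".encode()


fetched = list(iter_layers_prefetched(fake_fetch, num_layers=8, depth=3))
print([layer for layer, _ in fetched])
```

The consumer (attention for layer N) runs while the reads for the following layers are already outstanding, which is the overlap the design depends on.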

ObjectCache orders storage reads so data arrives in the sequence the GPU consumes it, overlapping transfer latency with compute across concurrent requests. On the benchmark cluster (100 Gbps RoCE with NIXL and Ceph RGW), this overlap limits the TTFT penalty at 64K context to 5.6% over a local-DRAM baseline, per the paper. At 4K context, the overhead is 56–75 ms, a flat penalty that dominates short-request latency budgets.
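
A toy timing model shows why overlap pays off at long contexts and not at short ones. The per-layer fetch and compute times below are made-up values, not measurements from the paper, and the model covers a single request with fetches issued back-to-back in layer order (the paper overlaps across concurrent requests, which this sketch does not capture):

```python
from typing import Sequence


def serial_ttft_ms(fetch_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Load every layer first, then compute: no overlap."""
    return sum(fetch_ms) + sum(compute_ms)


def pipelined_ttft_ms(fetch_ms: Sequence[float], compute_ms: Sequence[float]) -> float:
    """Layer i computes once its data has arrived and layer i-1 has finished."""
    fetch_done = 0.0
    compute_done = 0.0
    for f, c in zip(fetch_ms, compute_ms):
        fetch_done += f  # fetches stream back-to-back in layer order
        compute_done = max(compute_done, fetch_done) + c
    return compute_done


LAYERS = 80

# Hypothetical long-context case: compute per layer outlasts the fetch.
long_fetch, long_compute = [10.0] * LAYERS, [20.0] * LAYERS
# Hypothetical short-context case: compute per layer is shorter than the fetch.
short_fetch, short_compute = [1.5] * LAYERS, [1.0] * LAYERS

for name, f, c in [("long", long_fetch, long_compute),
                   ("short", short_fetch, short_compute)]:
    baseline = sum(c)  # stand-in for a local-DRAM hit: compute only
    added = pipelined_ttft_ms(f, c) - baseline
    print(f"{name}: +{added:.1f} ms over baseline ({added / baseline:.1%}), "
          f"serial would be {serial_ttft_ms(f, c):.1f} ms")
```

When compute per layer exceeds fetch time, only the first fetch is exposed. When fetch time exceeds compute, the transfer sets the pace and the added latency is a large fraction of a small baseline.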

The bandwidth scheduler matters as much as the storage medium. Under shared bandwidth caps, ObjectCache’s scheduler reduces added TTFT by 1.2–1.8× compared with equal bandwidth sharing, the same paper reports. For operators already running Ceph or MinIO alongside their inference cluster, the scheduler is the component worth evaluating first.

Layerwise retrieval vs. full-prefix matching

The architectural distinction matters. LMCache and FlexKV cache at the prefix level: if two requests share the same system prompt and retrieved documents, the shared KV tensors are reused. ObjectCache caches at the layer level and retrieves from object storage regardless of prefix overlap.

This is a different tradeoff space. Prefix caching wins when workloads have high overlap (repeated system prompts, shared retrieved documents in RAG). Layerwise retrieval from a cheap tier wins when prefix overlap is low but you need to serve long contexts to many concurrent users without provisioning DRAM for the worst case.

KVLink attacks the same problem from another angle: precompute KV cache per document independently, concatenate at inference time, and fix positional embeddings post-concatenation. KVLink reports up to 96% TTFT reduction with a 4% QA accuracy improvement over prior art. It trades precompute time for inference speed, which is attractive for static document collections but less applicable to multi-turn conversations where context evolves.

KV-CAR compresses KV cache along the embedding dimension using lightweight autoencoders and reuses tensors across adjacent layers, achieving up to 47.85% memory reduction on GPT-2 and TinyLLaMA. The compression approach is orthogonal to ObjectCache’s storage-tier approach; combining both is theoretically possible but not yet demonstrated.

| Strategy | Storage tier | Best workload profile | Key limitation |
| --- | --- | --- | --- |
| LMCache | CPU DRAM + local disk | Multi-turn QA, document analysis with high prefix overlap | DRAM capacity bound; context truncation kills hit rate |
| FlexKV | Distributed CPU-backed KV store | Multi-instance vLLM sharing prefixes | Requires dedicated KV infrastructure |
| ObjectCache | S3-class object storage | Long-context (64K+), low-prefix-overlap, cost-sensitive | 56–75 ms penalty at 4K; requires fast network |
| KVLink | Precomputed per-document KV | Static document collections, RAG with known corpus | Not designed for evolving multi-turn context |

Where object storage loses

The 56–75 ms of added TTFT at 4K context is the constraint that matters for most production chat workloads. If your p99 TTFT budget is 200 ms and your baseline without caching is already 150 ms, you have 50 ms of slack: even the low end of the object-store penalty exceeds it, and at 75 ms you overshoot the budget by 25 ms. At 64K context the relative penalty shrinks to 5.6% because the baseline TTFT is large enough to amortize the fixed transfer cost, but the absolute latency still grows.

For operators on standard TCP networks or commodity S3, the 100 Gbps RoCE results are aspirational. The paper’s bandwidth scheduler mitigates contention, but the absolute throughput requirement does not disappear. Object storage IOPS, not bandwidth, becomes the bottleneck when concurrent request count scales.

What to deploy now

For self-hosted vLLM operators, the decision splits along three axes: context length, prefix overlap, and latency budget.

Short-context chat (4K and below). Stay on LMCache or vLLM’s built-in prefix caching. The 56–75 ms TTFT penalty from object storage is disproportionate at these lengths. LMCache’s 15× throughput improvement on multi-turn QA at DRAM speed is hard to beat.

Long-context RAG (32K–128K) with shared prefixes. LMCache or FlexKV in distributed mode. The prefix overlap in RAG workloads (shared system prompt, shared retrieved chunks) plays to their architecture. Avoid context truncation; LMCache’s own data shows it halves hit ratio.

Long-context (64K+) with low prefix overlap, cost-constrained. ObjectCache is the early candidate, with two caveats. You need fast network fabric (the paper benchmarks at 100 Gbps RoCE) and your latency budget must tolerate single-digit percentage TTFT overhead. If both hold, S3-class storage at commodity prices beats provisioning hundreds of gigabytes of DRAM per inference node.

Static document collections. KVLink’s precompute-concatenate approach is worth evaluating. The 96% TTFT reduction is compelling, and the positional-embedding correction appears to preserve accuracy. It does not solve the multi-turn problem, but for batch RAG over a fixed corpus it may outperform all caching strategies.
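
The four cases above can be condensed into a small decision helper. This is a sketch of the article's guidance only: the `Workload` fields, the boolean treatment of prefix overlap, and the fallback string are illustrative assumptions, and the article gives no numeric threshold for what counts as high overlap.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    context_tokens: int
    high_prefix_overlap: bool       # assumption: caller decides what counts as "high"
    static_corpus: bool = False     # fixed document collection, e.g. batch RAG
    fast_fabric: bool = False       # e.g. 100 Gbps RoCE class network
    cost_constrained: bool = False  # DRAM provisioning is the limiting cost


def recommend(w: Workload) -> str:
    """Map a workload to the strategy this article suggests evaluating first."""
    if w.context_tokens <= 4 * 1024:
        return "LMCache or vLLM prefix caching"
    if w.static_corpus:
        return "KVLink"
    if w.high_prefix_overlap:
        return "LMCache or FlexKV"
    if w.context_tokens >= 64 * 1024 and w.fast_fabric and w.cost_constrained:
        return "ObjectCache"
    return "no clear pick here; benchmark candidates"


print(recommend(Workload(context_tokens=2048, high_prefix_overlap=True)))
print(recommend(Workload(context_tokens=128 * 1024, high_prefix_overlap=False,
                         fast_fabric=True, cost_constrained=True)))
```

The ordering of the checks encodes one reading of the priorities (latency budget first, then corpus shape, then overlap); a real deployment would replace the booleans with measured hit ratios and TTFT budgets.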

None of these options are mutually exclusive. A practical deployment might use LMCache for hot prefixes in DRAM, ObjectCache for long-context overflow to S3, and KVLink for precomputed document caches. The missing piece is a unified connector that tiers across all three. As of late May 2026, no open-source project ships one.

Frequently Asked Questions

Can ObjectCache be dropped into an existing vLLM or SGLang deployment?

Not without engine-level modifications. ObjectCache requires the inference kernel to issue layer-ordered storage reads that overlap with GPU compute, which goes beyond the connector APIs that FlexKV (FlexKVConnectorV1) and LMCache use to plug into vLLM. No open-source integration for ObjectCache with any inference engine has been published as of May 2026.

How does KVLink differ from ObjectCache?

KVLink precomputes KV cache per document with its own positional range, then uses trainable special tokens to restore cross-document attention after concatenation. ObjectCache retrieves already-computed KV tensors from their original context without modifying positional embeddings. KVLink’s trainable-token mechanism is what produces its 4% QA accuracy gain over baseline concatenation, a property ObjectCache does not replicate.

What bottleneck appears when hundreds of concurrent requests hit ObjectCache?

Object storage IOPS, not bandwidth, dominates at high concurrency. A 70B-class model with roughly 80 transformer layers triggers 80 sequential object reads per request. At 100 concurrent requests, that is 8,000 GET operations per inference step against Ceph RGW, each incurring metadata lookup overhead. The paper’s bandwidth scheduler manages throughput allocation but does not address the metadata amplification that object stores introduce at this operation volume.
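
The operation count above, and the object sizes behind it, follow from simple arithmetic. The sketch assumes one object per layer per request (as the count above implies) and the public Llama 3.1 70B dimensions (8 KV heads, head dimension 128, BF16); the function names are illustrative.

```python
def get_operations(layers: int, concurrent_requests: int) -> int:
    """Object reads issued when every request fetches one object per layer."""
    return layers * concurrent_requests


def per_layer_object_bytes(tokens: int, kv_heads: int = 8, head_dim: int = 128,
                           bytes_per_elem: int = 2) -> int:
    """Size of one layer's K and V tensors for a single request."""
    return 2 * kv_heads * head_dim * bytes_per_elem * tokens


MIB = 2 ** 20

gets = get_operations(layers=80, concurrent_requests=100)
size_4k_mib = per_layer_object_bytes(4 * 1024) / MIB
size_64k_mib = per_layer_object_bytes(64 * 1024) / MIB

print(f"GET operations: {gets}")
print(f"per-layer object at 4K:  {size_4k_mib:.0f} MiB")
print(f"per-layer object at 64K: {size_64k_mib:.0f} MiB")
```

Under these assumptions the request count is the same at every context length while the objects shrink 16× from 64K to 4K, so per-operation overhead weighs more heavily on short requests.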

Do GQA and FP8 together eliminate the need for S3-tier KV offloading?

They reduce but do not eliminate it. The 343 GB figure for 128K/8-user already reflects grouped-query attention, since Llama 3.1 uses GQA natively. FP8 halves this to roughly 171 GB, which still exceeds typical CPU DRAM budgets on inference nodes. At 32K contexts and moderate concurrency, GQA plus FP8 compress the working set enough to fit in DRAM, making ObjectCache unnecessary. The crossover shifts toward S3 as context length or concurrency grows.

Could KV-CAR’s compression lower ObjectCache’s 4K-context latency penalty?

KV-CAR compresses along the embedding dimension using lightweight autoencoders and reuses tensors across adjacent layers, cutting memory by up to 47.85%. Applied before writing KV tensors to S3, this would reduce per-layer object size and could narrow the 56–75 ms transfer penalty at short contexts. The combination is untested, and KV-CAR has only been benchmarked on GPT-2 and TinyLLaMA, not 70B-class models. Compression artifacts could also interact unpredictably with ObjectCache’s transfer-compute overlap protocol.

sources · 6 cited

  1. KV Cache Optimization: Serve 10x More Users on the Same GPU (analysis, accessed 2026-05-26)
  2. ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse (primary, accessed 2026-05-26)
  3. LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference (primary, accessed 2026-05-26)
  4. Prefix Caching FlexKV - vLLM (vendor, accessed 2026-05-26)
  5. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse (primary, accessed 2026-05-26)
  6. KV-CAR: KV Cache Compression using Autoencoders and KV Reuse in Large Language Models (primary, accessed 2026-05-26)