ObjectCache Moves KV Reuse to S3-Class Storage: Why Layerwise Retrieval Beats Full-Prefix Cache Hits

A single Llama 3.1 70B request at 128K context consumes roughly 42.9 GB of KV cache at BF16, according to Spheron’s optimization guide. Eight concurrent users push that past 343 GB, exceeding model weights by 2.4×. ObjectCache responds to this capacity wall by retrieving KV cache layer-by-layer from S3-class object storage, adding only 5.6% time-to-first-token (TTFT) overhead at 64K context on a 100 Gbps RoCE cluster. The approach works at long contexts. At short ones, the calculus flips.

The 343 GB problem

GQA and FP8 quantization shrink the per-token footprint; they do not change how the total scales. FP8 halves the cache, bringing a 32K-context, 8-user configuration down to ~42.9 GB, per Spheron’s analysis, but the footprint remains linear in both context length and concurrency. Double the users or double the context and the cache doubles.
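The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming Llama 3.1 70B's published attention geometry (80 layers, 8 KV heads under GQA, head dimension 128) and decimal gigabytes; the function name is illustrative and does not come from Spheron's guide.

```python
# Llama 3.1 70B attention geometry (assumed from the published model config).
NUM_LAYERS = 80
NUM_KV_HEADS = 8   # GQA: 8 KV heads serve all query heads
HEAD_DIM = 128


def kv_cache_bytes(context_tokens: int, users: int, bytes_per_value: int) -> int:
    """KV cache size in bytes: keys plus values for every layer, KV head, and token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens * users


if __name__ == "__main__":
    configs = [
        ("128K, 1 user, BF16", 128 * 1024, 1, 2),
        ("128K, 8 users, BF16", 128 * 1024, 8, 2),
        ("32K, 8 users, FP8", 32 * 1024, 8, 1),
    ]
    for label, ctx, users, width in configs:
        print(f"{label}: {kv_cache_bytes(ctx, users, width) / 1e9:.1f} GB")
```

Because the formula is a plain product, doubling either the context length or the user count doubles the result, which is the linear scaling described above.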

The dominant open-source answer has been to tier the cache: keep hot prefixes in GPU memory, warm layers in CPU DRAM, cold layers on local disk. LMCache implements this strategy inside vLLM, achieving up to 15× throughput improvement on multi-turn QA and document analysis workloads. It also documents an uncomfortable finding: context truncation, a common practice for managing memory pressure, halves prefix cache hit ratio. Prefix caching depends on requests sharing an identical leading token sequence, and chopping the context breaks that match.

FlexKV, a distributed KV store from Tencent and NVIDIA, distributes KV blocks across engine instances via a CPU-backed store. vLLM ships a FlexKV connector (FlexKVConnectorV1) alongside its built-in --enable-prefix-caching flag. The tier kept growing in 2026: FlexKV added distributed reuse over the Mooncake Transfer Engine, and in May 2026 vLLM gained first-class support for the Mooncake Store as a cross-instance KV cache engine. [Updated June 2026] Both LMCache and FlexKV still assume the KV store lives in memory or on fast local media. ObjectCache questions that assumption.

How ObjectCache retrieves KV from object storage

The core idea: store KV tensors in S3-compatible object storage and fetch them per-layer during inference, rather than loading the entire cached context into GPU memory before generation begins. The non-obvious part is the protocol co-design.

ObjectCache keeps each request’s KV as fine-grained, hash-addressed chunks rather than one monolithic blob, then extends the S3 GET with a descriptor that names the matched chunks, the model’s layer layout, the delivery order, and an RDMA target on the serving node. The storage server reads that descriptor, gathers the scattered chunk ranges, assembles one layer-major payload at a time, and pushes each layer directly into GPU-adjacent memory over RDMA, per the paper. That server-side gather is what keeps a 70B model’s roughly eighty per-layer reads from collapsing into eighty separate client round trips. This is not a stock S3 client talking to a stock bucket; the layout-aware assembly runs inside the storage server, which is why the benchmark targets Ceph RGW rather than a hosted endpoint. [Updated June 2026]
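To make the descriptor idea concrete, here is a hypothetical sketch of what such a request might carry and how a storage server could turn it into per-layer read plans. Every field name, the `plan_layer_reads` helper, and the assumption that each chunk object stores its layers back to back are illustrative; the paper's actual wire format and on-disk layout are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# (chunk_hash, byte_offset, byte_length) inside one stored chunk object
ReadRange = Tuple[str, int, int]


@dataclass(frozen=True)
class GatherDescriptor:
    """Hypothetical stand-in for the descriptor attached to the extended GET."""
    chunk_hashes: List[str]     # matched chunks, in token order
    num_layers: int             # the model's layer layout
    layer_bytes_per_chunk: int  # bytes one layer occupies inside one chunk
    layer_order: List[int]      # delivery order requested by the serving node
    rdma_target: str            # opaque handle for the destination buffer


def plan_layer_reads(desc: GatherDescriptor) -> Iterator[Tuple[int, List[ReadRange]]]:
    """Yield (layer, ranges) in delivery order, one layer-major payload at a time.

    Assumes layer i starts at byte i * layer_bytes_per_chunk within each chunk.
    """
    for layer in desc.layer_order:
        offset = layer * desc.layer_bytes_per_chunk
        yield layer, [
            (chunk, offset, desc.layer_bytes_per_chunk)
            for chunk in desc.chunk_hashes
        ]
```

In this sketch the client sends one descriptor, and the server iterates over the plan and pushes each assembled layer to `rdma_target`, so the number of client round trips does not grow with the layer count.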

The whole design rests on one inequality: per-layer fetch time has to stay below per-layer compute time. The forward pass is layer-sequential, so while the GPU runs attention on layer N, the fetch for layer N+1 is already in flight. When each layer’s transfer finishes before the GPU needs it, the round trip is fully hidden and the only visible cost is filling the first layer. That inequality is why the 4K penalty is a fixed 56 to 75 ms while the 64K penalty is a 5.6% ratio: at 64K there is enough per-layer compute to mask the transfer, and at 4K there is not, so the fetch latency shows through as a flat tax on every short request. [Updated June 2026]
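The inequality can be illustrated with a toy pipeline model. This is a sketch under simplifying assumptions (one link, constant per-layer fetch and compute times, no contention), not the paper's performance model; `added_latency` is a made-up name.

```python
def added_latency(num_layers: int, fetch_s: float, compute_s: float) -> float:
    """Extra time versus having every layer's KV already in local memory.

    Layers are fetched back to back over a single link. Layer i can start
    computing only once its fetch has landed and layer i-1 has finished.
    """
    fetch_done = 0.0
    compute_done = 0.0
    for _ in range(num_layers):
        fetch_done += fetch_s
        compute_done = max(compute_done, fetch_done) + compute_s
    return compute_done - num_layers * compute_s


if __name__ == "__main__":
    # Fetch faster than compute: only the first layer's fill is visible.
    print(added_latency(80, fetch_s=1.0, compute_s=2.0))
    # Fetch slower than compute: every layer stalls and the gap accumulates.
    print(added_latency(80, fetch_s=2.0, compute_s=1.0))
```

When `fetch_s <= compute_s` the model returns exactly one fetch time, the first-layer fill. Once `fetch_s` exceeds `compute_s`, the stall grows with the number of layers.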

ObjectCache orders storage reads so data arrives in the sequence the GPU consumes it, overlapping transfer latency with compute across concurrent requests. On the benchmark cluster (100 Gbps RoCE with NIXL and Ceph RGW), this overlap limits the TTFT penalty at 64K context to 5.6% over a local-DRAM baseline, per the paper. At 4K context, the overhead is 56–75 ms, a flat penalty that dominates short-request latency budgets.

The bandwidth scheduler matters as much as the storage medium. Under shared bandwidth caps, ObjectCache’s scheduler reduces added TTFT by 1.2–1.8× compared with equal bandwidth sharing, the same paper reports. For operators already running Ceph or MinIO alongside their inference cluster, the scheduler is the component worth evaluating first.
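To see why equal sharing is a weak baseline, consider a deliberately simple comparison. The policy below is generic shortest-transfer-first, swapped in purely for illustration; it is not ObjectCache's scheduler, whose algorithm is not described here, and both function names are hypothetical.

```python
from typing import List, Sequence


def equal_share_finish(sizes: Sequence[float], capacity: float) -> List[float]:
    """Finish times when all active transfers split the link evenly."""
    finish: List[float] = []
    elapsed, previous = 0.0, 0.0
    ordered = sorted(sizes)
    for index, size in enumerate(ordered):
        active = len(ordered) - index  # transfers still in flight
        # Each active transfer gets capacity / active until the next one ends.
        elapsed += (size - previous) * active / capacity
        finish.append(elapsed)
        previous = size
    return finish


def shortest_first_finish(sizes: Sequence[float], capacity: float) -> List[float]:
    """Finish times when transfers run one at a time, smallest first."""
    finish: List[float] = []
    elapsed = 0.0
    for size in sorted(sizes):
        elapsed += size / capacity
        finish.append(elapsed)
    return finish
```

For two equal transfers of 10 units on a 10-unit-per-second link, equal sharing finishes both at 2 s, while running them in turn finishes one at 1 s and the other at 2 s. The last transfer lands at the same time, but the mean added wait drops, and a request whose data arrives earlier can start computing earlier.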

Layerwise retrieval vs. full-prefix matching

The architectural distinction matters. LMCache and FlexKV cache at the prefix level: if two requests share the same system prompt and retrieved documents, the shared KV tensors are reused. ObjectCache caches at the layer level and retrieves from object storage regardless of prefix overlap.

This is a different tradeoff space. Prefix caching wins when workloads have high overlap (repeated system prompts, shared retrieved documents in RAG). Layerwise retrieval from a cheap tier wins when prefix overlap is low but you need to serve long contexts to many concurrent users without provisioning DRAM for the worst case.

ObjectCache also lands in a serving world that has already accepted moving KV cache across the network. Prefill-decode disaggregation splits the compute-bound prefill phase and the memory-bound decode phase onto separate nodes and ships the KV between them, and LMCache’s own PD-disaggregation mode does the same across engines. Once KV is already a network-resident object rather than a GPU-local one, parking the cold tail of it on object storage is a smaller architectural step than it first sounds. What changes most is the cost model, not the data path: you trade interconnect bandwidth and a custom storage server for the DRAM you would otherwise overprovision. [Updated June 2026]

KVLink attacks the same problem from another angle: precompute KV cache per document independently, concatenate at inference time, and fix positional embeddings post-concatenation. KVLink reports up to 96% TTFT reduction with a 4% QA accuracy improvement over prior art. It trades precompute time for inference speed, which is attractive for static document collections but less applicable to multi-turn conversations where context evolves.

KV-CAR compresses KV cache along the embedding dimension using lightweight autoencoders and reuses tensors across adjacent layers, achieving up to 47.85% memory reduction on GPT-2 and TinyLLaMA. The compression approach is orthogonal to ObjectCache’s storage-tier approach; combining both is theoretically possible but not yet demonstrated.

Strategy	Storage tier	Best workload profile	Key limitation
LMCache	CPU DRAM + local disk	Multi-turn QA, document analysis with high prefix overlap	DRAM capacity bound; context truncation kills hit rate
FlexKV	Distributed CPU-backed KV store	Multi-instance vLLM sharing prefixes	Requires dedicated KV infrastructure
ObjectCache	S3-class object storage	Long-context (64K+), low-prefix-overlap, cost-sensitive	56–75 ms penalty at 4K; requires fast network
KVLink	Precomputed per-document KV	Static document collections, RAG with known corpus	Not designed for evolving multi-turn context

Where object storage loses

The 56–75 ms of added TTFT at 4K context is the constraint that matters for most production chat workloads. If your p99 TTFT budget is 200 ms and your baseline without caching is already 150 ms, you have 50 ms of slack, and even the low end of the object-store penalty, 56 ms, overshoots it. At 64K context the percentage penalty shrinks to 5.6% because the baseline TTFT is large enough to amortize the fixed transfer cost, but the absolute latency still grows.

For operators on standard TCP networks or commodity S3, the 100 Gbps RoCE results are aspirational. The paper’s bandwidth scheduler mitigates contention, but the absolute throughput requirement does not disappear. Object storage IOPS, not bandwidth, becomes the bottleneck when concurrent request count scales.

There is a subtler cost the percentages hide. Because the gather logic lives in the storage server, ObjectCache is S3-compatible at the API surface but not S3-portable in practice. Point it at AWS S3, or any endpoint that cannot run the descriptor-driven gather and RDMA push, and you fall back to client-side assembly over HTTP, where the per-layer overlap that makes the scheme work disappears. The object-storage framing oversells the portability; what the paper actually demonstrates is a custom protocol bolted onto Ceph RGW over an RDMA fabric. Chunk granularity is the other knob with no free setting. Smaller chunks raise hit rates on partial-prefix matches but multiply the metadata lookups that already dominate at high concurrency, while larger chunks cut lookups but waste bandwidth on partial reuse. [Updated June 2026]

The hit-rate question is not only unmeasured, it is the part most likely to bite. Offloading methods that pass synthetic recall benchmarks have been shown to collapse on context-intensive synthesis tasks, where the cache a request needs is not the cache a prefix matcher would have kept warm. ObjectCache’s premise is that you can serve low-prefix-overlap workloads from a cheap tier, which is exactly the regime where naive offloading heuristics have historically degraded output quality, not just latency. The paper measures transfer cost, not generation quality, so that risk is uncharacterized here. [Updated June 2026]

What to deploy now

For self-hosted vLLM operators, the decision splits along three axes: context length, prefix overlap, and latency budget.

Short-context chat (4K and below). Stay on LMCache or vLLM’s built-in prefix caching. The 56–75 ms TTFT penalty from object storage is disproportionate at these lengths. LMCache’s 15× throughput improvement on multi-turn QA at DRAM speed is hard to beat.

Long-context RAG (32K–128K) with shared prefixes. LMCache or FlexKV in distributed mode. The prefix overlap in RAG workloads (shared system prompt, shared retrieved chunks) plays to their architecture. Avoid context truncation; LMCache’s own data shows it halves hit ratio.

Long-context (64K+) with low prefix overlap, cost-constrained. ObjectCache is the early candidate, with two caveats. You need fast network fabric (the paper benchmarks at 100 Gbps RoCE) and your latency budget must tolerate single-digit percentage TTFT overhead. If both hold, S3-class storage at commodity prices beats provisioning hundreds of gigabytes of DRAM per inference node.

Static document collections. KVLink’s precompute-concatenate approach is worth evaluating. The 96% TTFT reduction is compelling, and the positional-embedding correction appears to preserve accuracy. It does not solve the multi-turn problem, but for batch RAG over a fixed corpus it may outperform all caching strategies.
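The four recommendations above can be condensed into a small decision helper. This is a sketch, not a tuned policy: the 0.5 overlap threshold, the parameter names, and the order in which the checks run are assumptions, and combinations the guidance does not cover fall through to an explicit "benchmark first" answer.

```python
def recommend(
    context_tokens: int,
    prefix_overlap: float,
    static_corpus: bool = False,
    fast_fabric: bool = False,
    overlap_threshold: float = 0.5,  # assumed cut-off for "shared prefixes"
) -> str:
    """Map a workload profile onto the deployment guidance in this section."""
    if static_corpus:
        return "KVLink"
    if context_tokens <= 4 * 1024:
        return "LMCache or vLLM prefix caching"
    high_overlap = prefix_overlap >= overlap_threshold
    if high_overlap and context_tokens >= 32 * 1024:
        return "LMCache or FlexKV (distributed)"
    if not high_overlap and context_tokens >= 64 * 1024 and fast_fabric:
        return "ObjectCache"
    return "not covered by this guidance; benchmark first"


if __name__ == "__main__":
    print(recommend(2_000, prefix_overlap=0.9))
    print(recommend(100_000, prefix_overlap=0.1, fast_fabric=True))
```

The `fast_fabric` flag stands in for the first ObjectCache caveat (an RDMA-class network); the latency-budget caveat still has to be checked separately.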

None of these options are mutually exclusive. A practical deployment might use LMCache for hot prefixes in DRAM, ObjectCache for long-context overflow to S3, and KVLink for precomputed document caches. The missing piece is a unified connector that tiers across all three. vLLM’s May 2026 integration of the Mooncake Store moves toward cross-instance, multi-tier KV sharing across DRAM and SSD, but it does not subsume layerwise object-storage retrieval or KVLink’s precompute-concatenate path. As of June 2026, no open-source project ships a connector that spans all three strategies. [Updated June 2026]

Frequently Asked Questions

Can ObjectCache be dropped into an existing vLLM or SGLang deployment?

Not without engine-level modifications. ObjectCache requires the inference kernel to issue layer-ordered storage reads that overlap with GPU compute, which goes beyond the connector APIs that FlexKV (FlexKVConnectorV1) and LMCache use to plug into vLLM. No open-source integration for ObjectCache with any inference engine has been published as of May 2026.

How does KVLink’s positional embedding strategy differ from ObjectCache’s retrieval?

KVLink precomputes KV cache per document with its own positional range, then uses trainable special tokens to restore cross-document attention after concatenation. ObjectCache retrieves already-computed KV tensors from their original context without modifying positional embeddings. KVLink’s trainable-token mechanism is what produces its 4% QA accuracy gain over baseline concatenation, a property ObjectCache does not replicate.

What bottleneck appears when hundreds of concurrent requests hit ObjectCache?

Object storage IOPS, not bandwidth, dominates at high concurrency. A 70B-class model with roughly 80 transformer layers triggers 80 sequential object reads per request. At 100 concurrent requests, that is 8,000 GET operations per inference step against Ceph RGW, each incurring metadata lookup overhead. The paper’s bandwidth scheduler manages throughput allocation but does not address the metadata amplification that object stores introduce at this operation volume.
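The operation count is simple multiplication, sketched below. The metadata throughput passed to `drain_seconds` is a placeholder for whatever your own cluster sustains; it is not a measured Ceph RGW figure, and both function names are illustrative.

```python
def gets_per_step(num_layers: int, concurrent_requests: int) -> int:
    """Object reads issued when every request fetches one object per layer."""
    return num_layers * concurrent_requests


def drain_seconds(total_gets: int, metadata_ops_per_s: float) -> float:
    """Time to serve the lookups alone at an assumed metadata throughput."""
    return total_gets / metadata_ops_per_s


if __name__ == "__main__":
    total = gets_per_step(num_layers=80, concurrent_requests=100)
    print(total)  # 8000
    # 20_000 ops/s is a hypothetical ceiling chosen only for illustration.
    print(drain_seconds(total, metadata_ops_per_s=20_000.0))
```

The count grows linearly with concurrency, so the metadata path saturates well before the link does if per-operation overhead is not amortized.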

Do GQA and FP8 together eliminate the need for S3-tier KV offloading?

They reduce but do not eliminate it. The 343 GB figure for 128K/8-user already reflects grouped-query attention, since Llama 3.1 uses GQA natively. FP8 halves this to roughly 171 GB, which still exceeds typical CPU DRAM budgets on inference nodes. At 32K contexts and moderate concurrency, GQA plus FP8 compress the working set enough to fit in DRAM, making ObjectCache unnecessary. The crossover shifts toward S3 as context length or concurrency grows.

Could KV-CAR’s compression lower ObjectCache’s 4K-context latency penalty?

KV-CAR compresses along the embedding dimension using lightweight autoencoders and reuses tensors across adjacent layers, cutting memory by up to 47.85%. Applied before writing KV tensors to S3, this would reduce per-layer object size and could narrow the 56–75 ms transfer penalty at short contexts. The combination is untested, and KV-CAR has only been benchmarked on GPT-2 and TinyLLaMA, not 70B-class models. Compression artifacts could also interact unpredictably with ObjectCache’s transfer-compute overlap protocol. Aggressive cache compression also has a track record of breaking accuracy before it breaks latency, which is precisely the failure ObjectCache’s transfer-cost benchmarks would not catch.