When a vLLM replica boots from cold, the seconds pile up across six CPU-bound startup phases before it serves a single token, according to the first systematic cold-start study of the project, posted to arXiv on 5 June 2026 and accepted at MLSys 2026. For operators running inference behind autoscalers, the finding reopens a question serverless was supposed to have closed: whether you can actually afford to spin GPU pods down to zero between bursts of traffic.
What happens when a vLLM replica boots from cold
“Breaking the Ice: Analyzing Cold Start Latency in vLLM” (arXiv:2606.07362, submitted 5 June 2026) locates the boot bottleneck on the host rather than in GPU initialization, per the IBM Research publication page. Dropping in a faster accelerator does not necessarily shorten boot.
The arXiv abstract names six foundational steps but does not enumerate them inline; the full phase taxonomy lives in the paper body. The step-by-step breakdowns circulating outside the paper come from vendors measuring their own stacks, not from the authors’ results.
What did the MLSys 2026 paper actually build?
Beyond the characterization, the paper contributes a lightweight analytical model that predicts vLLM startup latency for a given hardware configuration, framed explicitly toward serverless scheduling and resource planning, per IBM Research’s summary. A predictive model is the useful artifact here: it lets an autoscaler estimate boot time for a model-and-accelerator pair before committing a pod, instead of discovering empirically that a cold replica missed the latency budget.
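The paper’s own model is not reproduced here, so the following is only an illustrative sketch of what an additive boot-time estimate could look like. Every field name, the assumption that phases run sequentially, and the example numbers are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BootProfile:
    """Hypothetical per-configuration inputs; NOT the paper's parameters."""
    weights_gb: float           # checkpoint size in GB
    storage_gb_per_s: float     # read bandwidth from storage to host
    compile_s: float            # measured torch.compile time
    graphs_to_capture: int      # number of batch sizes with a CUDA graph
    capture_s_per_graph: float  # measured capture cost per graph
    engine_init_s: float        # everything else, lumped together


def estimate_boot_seconds(p: BootProfile) -> float:
    """Additive estimate: assumes phases are sequential and independent."""
    load_s = p.weights_gb / p.storage_gb_per_s
    capture_s = p.graphs_to_capture * p.capture_s_per_graph
    return load_s + p.compile_s + capture_s + p.engine_init_s


def fits_budget(p: BootProfile, budget_s: float) -> bool:
    """True if the estimated boot lands inside the latency budget."""
    return estimate_boot_seconds(p) <= budget_s


# Placeholder numbers for illustration only.
profile = BootProfile(
    weights_gb=16.0,
    storage_gb_per_s=2.0,
    compile_s=52.0,
    graphs_to_capture=8,
    capture_s_per_graph=1.0,
    engine_init_s=10.0,
)
print(estimate_boot_seconds(profile), fits_budget(profile, budget_s=60.0))
```

An autoscaler could call something like `fits_budget` before scheduling a pod. In practice the inputs would have to be measured per model and hardware pair.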
Where do the seconds go during a vLLM cold start?
The most concrete phase breakdown available comes from a vendor, not the paper. Tensorfuse’s account, published on the vendor’s documentation blog, groups the boot into three buckets: model loading (download the weights into storage, then load them into GPU memory), the torch.compile pipeline (Dynamo bytecode transformation, graph compilation, graph capture), and init engine.
The torch.compile step alone takes about 52 seconds for Tensorfuse’s example model, per the same post.
Those are one vendor’s phases on one stack. They are not the MLSys paper’s published results, and the abstract’s “six steps” may not map one-to-one onto this vendor list. Treat the taxonomy as directional.
Which knobs cut vLLM cold-start time?
vLLM exposes several startup knobs that trade boot latency against steady-state serving performance, documented in the project’s optimization configuration and CUDA Graphs design notes.
The optimization levels are the coarsest lever. vLLM provides four levels that trade startup time for performance, per the optimization docs; the table shows the three with distinct behavior:
| Level | Cudagraph mode | Startup / serving tradeoff |
|---|---|---|
| -O0 | not specified | Fastest startup, lowest serving performance |
| -O1 | PIECEWISE | Simple compilation and fast fusions |
| -O2 (default) | FULL_AND_PIECEWISE | Additional compilation, most performant once warm |
-O3 is listed as aggressive optimization but is currently equal to -O2, per the same docs.
The default -O2 mode is the one that hurts cold boot. It pulls in additional compilation ranges and FULL_AND_PIECEWISE cudagraphs, which is more capture work at startup than -O0 or -O1 in exchange for higher steady-state performance. Because CUDA Graphs are captured per batch size, per the CUDA Graphs design notes, scoping --cuda-graph-sizes to the batch sizes you actually serve limits how many graphs that step must record.
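As a rough sketch, a launcher script might assemble different command lines for cold burst replicas and long-lived ones. The helper name `build_serve_argv` and the model ID are hypothetical. The flag spellings are the ones quoted above, and passing the sizes as space-separated values is an assumption to check against the installed vLLM version.

```python
from typing import List, Optional, Sequence


def build_serve_argv(
    model: str,
    opt_level: int = 2,
    graph_sizes: Optional[Sequence[int]] = None,
) -> List[str]:
    """Build a `vllm serve` command line (hypothetical helper).

    Flag names follow the text; the value syntax for --cuda-graph-sizes
    is an assumption and may differ between vLLM versions.
    """
    if opt_level not in (0, 1, 2, 3):
        raise ValueError(f"unknown optimization level: {opt_level}")
    argv = ["vllm", "serve", model, f"-O{opt_level}"]
    if graph_sizes:
        # Deduplicate and sort so only the batch sizes actually served are listed.
        argv.append("--cuda-graph-sizes")
        argv.extend(str(size) for size in sorted(set(graph_sizes)))
    return argv


# Cold burst replica: cheaper startup, graphs only for the batch sizes served.
burst = build_serve_argv("my-org/my-model", opt_level=1, graph_sizes=[1, 2, 4, 8])
# Long-lived replica: keep the default level and default capture set.
steady = build_serve_argv("my-org/my-model")
print(" ".join(burst))
print(" ".join(steady))
```

Keeping the two profiles in one function makes the tradeoff explicit in code review: the burst replica gives up some steady-state performance to shorten boot.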
Weight loading is the second lever. The load_format parameter selects the loader. vLLM’s LoadConfig documents built-in formats including runai_streamer, instanttensor, and sharded_state, per the LoadConfig docs; Tensorfuse’s post additionally names vendor extensions fastsafetensors and run-ai for the same hook, per the Tensorfuse blog. A separate safetensors_load_strategy field exists on LoadConfig, but its enumerated values are not described on that page.
Why cold-start latency breaks scale-to-zero economics
If a cold boot cannot land inside your latency SLA, scale-to-zero stops saving money. You are forced to keep a warm pool of GPU pods idle, which is the overprovisioning serverless was meant to eliminate, and the autoscaler’s job shifts from scaling to zero to sizing that pool.
This trade is no longer academic. vLLM became a PyTorch Foundation-hosted project in 2025, and in January 2026 TechCrunch reported that its creators had launched Inferact to commercialize the project, raising $150 million in seed funding, per Wikipedia. With that much capital behind it, serving-cost economics are a funded-startup problem, and every second of cold boot is a second a paying customer waits or a warm GPU sits billing.
The practical move is to measure your cold start per model and hardware config, subtract what the known levers remove (cached volume, scoped graph capture, -O0/-O1), and compare the residual boot latency to your tail-latency SLA. The paper’s analytical model is aimed at exactly that prediction, per IBM Research’s framing. If the residual still breaches the SLA, you do not have a scale-to-zero system. You have a warm pool with an optimistic autoscaler bolted on.
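That procedure reduces to a small calculation. The sketch below uses hypothetical function names and placeholder numbers. Real inputs would come from your own measurements, and the savings per lever are rarely perfectly additive.

```python
from typing import Dict


def residual_boot_seconds(baseline_s: float, lever_savings_s: Dict[str, float]) -> float:
    """Baseline cold start minus what each lever removes (assumed additive)."""
    return max(0.0, baseline_s - sum(lever_savings_s.values()))


def scale_to_zero_viable(residual_s: float, sla_s: float) -> bool:
    """A cold replica is only usable if its boot fits inside the SLA."""
    return residual_s <= sla_s


# Placeholder measurements, for illustration only.
savings = {
    "cached_volume": 150.0,
    "scoped_graph_capture": 40.0,
    "lower_opt_level": 30.0,
}
residual = residual_boot_seconds(baseline_s=300.0, lever_savings_s=savings)
print(residual, scale_to_zero_viable(residual, sla_s=60.0))
```

With these placeholder inputs the residual is still above the SLA, which is the case where a warm pool is unavoidable.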
Frequently Asked Questions
How much boot time do caching and scoped CUDA graph capture actually remove?
On the only published end-to-end measurement, a vendor benchmark of Llama 3.1 8B, the baseline cold start was 294 seconds. Caching the model on a volume and scoping --cuda-graph-sizes to a fixed batch-size set cut that to 82 seconds, with CUDA graph capture falling from 54 seconds to 7. Those two levers alone recovered 212 seconds on one stack, which is why they come before dropping optimization levels in any tuning sequence.
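The arithmetic behind those figures can be checked directly. This sketch uses only the four numbers quoted above; the function name is hypothetical.

```python
def decompose_savings(baseline_s, tuned_s, phase_before_s, phase_after_s):
    """Split a total boot-time saving into one named phase and the remainder."""
    total_saved = baseline_s - tuned_s
    phase_saved = phase_before_s - phase_after_s
    return {
        "total_saved_s": total_saved,
        "phase_saved_s": phase_saved,
        "other_saved_s": total_saved - phase_saved,
        "fraction_removed": total_saved / baseline_s,
    }


# Vendor figures quoted above: 294 s -> 82 s overall, graph capture 54 s -> 7 s.
savings = decompose_savings(294, 82, 54, 7)
print(savings)
```

Graph capture accounts for 47 of the 212 seconds saved. The remaining 165 seconds came from the rest of the vendor’s changes, which the figures quoted here do not itemize further.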
Does load_format choice matter differently on network storage?
Yes. On NFS or Lustre, where random reads are expensive, the eager safetensors strategy reads the entire checkpoint into CPU RAM upfront and then pushes it to the GPU, avoiding the seek pattern that makes lazy loading slow over a network filesystem. On local NVMe the gap narrows because random reads are cheap. The loader hook (runai_streamer, fastsafetensors, instanttensor) and the load strategy are independent fields on LoadConfig, so you set both.
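A deployment script could encode that rule of thumb as below. This is a sketch under stated assumptions: the function names are hypothetical, the string values "eager" and "lazy" are inferred from the description above rather than taken from the LoadConfig page, and the returned dict only mirrors the two field names; it is not passed to vLLM here.

```python
# Filesystem types treated as "random reads are expensive" (assumed list).
NETWORK_FILESYSTEMS = {"nfs", "nfs4", "lustre"}


def pick_safetensors_strategy(fs_type: str) -> str:
    """Eager read on network filesystems, lazy otherwise (assumed value names)."""
    return "eager" if fs_type.lower() in NETWORK_FILESYSTEMS else "lazy"


def load_settings(fs_type: str, load_format: str) -> dict:
    """Both fields are set independently, as described in the text."""
    return {
        "load_format": load_format,
        "safetensors_load_strategy": pick_safetensors_strategy(fs_type),
    }


print(load_settings("lustre", "runai_streamer"))
print(load_settings("ext4", "runai_streamer"))
```

Whether a given loader honors the strategy field is worth confirming on your vLLM version before relying on it.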
Is there a reason not to keep the default -O2 mode on every replica?
Memory pressure. The default FULL_AND_PIECEWISE cudagraph mode is the most performant once warm but, per vLLM’s own design notes, requires the most memory and takes the longest to capture. On a GPU already near its limit from a large KV cache or a heavy model, that capture can fail or force you to shrink the cache to make room. That is the concrete failure behind the advice to run -O0 or -O1 on cold burst replicas.
When does bulk-provisioning beat pay-per-use for LLM serving?
When traffic is sustained or predictable, which is the carve-out ISO/IEC 22123-2, the reference definition for serverless, calls out: for those workloads it states bulk-provisioned servers can be more cost-effective than pay-per-use. LLM inference lands in that bucket once cold boot blows the latency budget, because you pay for a warm pool either way. The remaining question is whether you overprovision deliberately or pay a vendor markup on top of idle GPUs.
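The break-even point can be sketched as a utilization threshold. The function names and prices below are hypothetical, and the model ignores the vendor markup and cold-start penalties mentioned above.

```python
def break_even_utilization(bulk_per_hour: float, pay_per_use_per_hour: float) -> float:
    """Fraction of each hour a GPU must be busy before bulk is cheaper.

    Bulk capacity is billed every hour; pay-per-use only for busy time.
    """
    return bulk_per_hour / pay_per_use_per_hour


def cheaper_option(utilization: float, bulk_per_hour: float, pay_per_use_per_hour: float) -> str:
    """Compare the cost of one wall-clock hour under each pricing model."""
    pay_per_use_cost = utilization * pay_per_use_per_hour
    return "bulk" if bulk_per_hour < pay_per_use_cost else "pay-per-use"


# Placeholder prices for illustration only.
print(break_even_utilization(2.0, 5.0))
print(cheaper_option(0.5, 2.0, 5.0), cheaper_option(0.2, 2.0, 5.0))
```

Once a warm pool is mandatory, effective utilization of the pay-per-use option rises toward 1.0 because the GPUs are billed while idle, and that pushes the comparison toward bulk provisioning.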