Red Hat’s April 21 announcement wires KServe’s new LLMInferenceService abstraction into llm-d’s cross-runtime router and claims up to a 57× reduction in P90 time-to-first-token via disaggregated prefill/decode. That number is real in a narrow sense: it applies only at the 90th percentile, measured against a deliberately unoptimized vLLM StatefulSet baseline. Teams already running KServe get a release candidate, a known routing deadlock, and no migration guide.
What Red Hat Announced (and What the Headline Numbers Mean)
Red Hat’s April 21 article[1] frames the integration as the production path for disaggregated LLM inference on Kubernetes: KServe provides lifecycle governance and autoscaling via the new LLMInferenceService CRD, while llm-d supplies the prefix-cache-aware router that directs prefill and decode work to separate pools. The headline figures are up to 57× P90 TTFT reduction and roughly 2× token throughput, from approximately 4,400 to 8,730 tokens per second.
The companion KServe blog post[2] documents a real deployment running Llama 3.1 70B on 4 AMD MI300X GPUs. That deployment produced 3× output tokens per second and 2× TTFT reduction — different hardware, different scale, and a different magnitude from the headline claim. The two numbers measure different things in different configurations, and Red Hat’s article[1] does not provide hardware or batch-size specifics for the 57× figure.
The Baseline Problem: 57× P90 Against Naive vLLM, Not Tuned vLLM
The llm-d.ai blog[3] explicitly describes the comparison baseline as a straightforward vLLM deployment wrapped in a Kubernetes StatefulSet with simple round-robin load-balancing that fails to utilize the KV cache across requests. That is not a tuned vLLM setup. A production vLLM deployment with prefix caching enabled and session-affinity routing would close a significant portion of this gap before llm-d’s router enters the picture.
The 57× figure is therefore not a comparison against a reasonable alternative — it is a ceiling on how much you leave on the table if you never configured your existing vLLM stack for cache efficiency. For teams that have already done that work, the incremental gain from disaggregated prefill/decode routing will depend heavily on workload characteristics: request length distribution, prefix reuse rate, and GPU memory ratio between prefill and decode pools. For more on the architecture that makes this split useful at all, see Groundy’s prefill/decode disaggregation overview.
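To make the baseline critique concrete, here is a toy Python model of the two routing policies. This is not llm-d’s actual algorithm: the hash-based affinity, the backend count, and the no-eviction cache are all simplifying assumptions chosen for illustration.

```python
import hashlib

def cache_hits(requests, n_backends, policy):
    """Toy model: each backend remembers every prefix it has served
    (no eviction), and a request 'hits' when its prefix is already
    cached on the backend it lands on."""
    caches = [set() for _ in range(n_backends)]
    hits = 0
    for i, prefix in enumerate(requests):
        if policy == "round_robin":
            b = i % n_backends
        else:  # prefix-affinity: pin each prefix to one backend via hashing
            b = int(hashlib.sha256(prefix.encode()).hexdigest(), 16) % n_backends
        if prefix in caches[b]:
            hits += 1
        caches[b].add(prefix)
    return hits

# 100 requests cycling through 5 shared system-prompt prefixes, 4 backends
reqs = [f"system-prompt-{i % 5}" for i in range(100)]
print(cache_hits(reqs, 4, "round_robin"))      # 80: each prefix warms all 4 backends
print(cache_hits(reqs, 4, "prefix_affinity"))  # 95: one cold miss per distinct prefix
```

With real caches that evict under memory pressure, the round-robin number drops much further; the point is only that the gap exists before disaggregated prefill/decode enters the picture at all.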
What’s in KServe v0.18.0-rc1: llm-d 0.6, Autoscaling, and Storage Migration
KServe v0.18.0-rc1[4], shipped April 22, 2026, bundles four significant additions: llm-d upgraded to 0.6, LLMInferenceService autoscaling, storage migration APIs, and Gateway API Inference Extension CRDs. The llm-d 0.6 upgrade is what enables the cross-runtime router the 57× benchmark relies on. The LLMInferenceService CRD represents an inference deployment as a Kubernetes-native object with lifecycle semantics rather than a raw StatefulSet — which is the actual Kubernetes-native governance story Red Hat is selling.
The storage migration APIs address a pain point documented in the KServe blog[2]: the Llama 3.1 70B deployment on MI300X GPUs experienced storage drag and LVM infrastructure lock-in, and manual PVC deletion was required on hardware failure. The new storage migration surface in RC1 is intended to reduce that manual intervention, but it carries no production track record under real failure conditions as of this release candidate.
The rc suffix matters. RC1 is not GA, has no documented upgrade path from prior KServe versions beyond the release notes, and the routing layer has open issues that are covered next.
The Platform Team Reality: Routing Deadlocks, WIP Merges, and Manual Workarounds
Before teams can evaluate whether the throughput claims apply to their workloads, they have to clear the current routing bugs.
Issue #5385[5], filed April 14, 2026, reports that the LLMInferenceService auto-generated HTTPRoute gets permanently stuck with empty backendRefs. The root cause is a self-reinforcing deadlock in InferencePool API version detection: the controller cannot identify the correct API version, so it never populates the backend references, and the route stays broken indefinitely. The only documented workaround is manually creating an HTTPRoute that references the correct backend — the auto-generation path does not function.
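A sketch of what that manual workaround might look like, using the standard Gateway API HTTPRoute shape: the route name, namespace, Gateway name, pool name, and port below are all hypothetical, and the backend group shown is the pre-v1 Inference Extension group — picking the right API version is exactly what the controller fails to do in issue #5385, so verify which version your cluster actually serves.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llama-route            # hypothetical; match your LLMInferenceService
  namespace: inference         # hypothetical
spec:
  parentRefs:
    - name: inference-gateway  # your existing Gateway
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io  # pre-v1 Inference Extension group
          kind: InferencePool
          name: llama-decode-pool               # hypothetical pool name
          port: 8000                            # hypothetical
```

Because this route is hand-written rather than controller-generated, it will not track backend changes automatically; treat it as a stopgap until the deadlock is fixed.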
That is the production state of the feature Red Hat announced on April 21.
PR #5041[6], which implements gateway auto-migration to the v1 InferencePool API, was merged March 30, 2026, while the author still had it marked WIP and was actively protesting the premature merge. A reviewer noted that failed gateway discovery in that PR could cause incorrect fallback to policy controller parents — meaning a misconfigured cluster could silently route traffic through the wrong parent without raising an error. That code is now in v0.18.0-rc1.
What You Actually Have to Wire Together
Even setting aside the open bugs, the composability story requires more operator effort than the announcement implies. KServe’s lifecycle controller manages LLMInferenceService objects: creation, scaling, and deletion with Kubernetes RBAC semantics. llm-d’s Gateway API router handles runtime routing decisions: prefix-cache-aware dispatch, prefill/decode pool selection, and backend health signaling. These two control planes do not share state by default — operators have to configure the InferencePool CRDs, wire the Gateway API Inference Extension into their existing Gateway setup, and ensure the llm-d router has visibility into KServe-managed backends.
No reference deployment has shipped as of RC1. The KServe blog[2] documents the MI300X deployment as a proof of concept, not as a reproducible configuration template. Teams migrating from InferenceService to LLMInferenceService have no migration guide to follow, and the gateway auto-migration path (PR #5041[6]) carries the silent-fallback risk noted above.
The operational pain points documented in the MI300X proof of concept — storage drag, LVM lock-in, and manual PVC deletion on hardware failure — are partially addressed by the new storage migration APIs in RC1, but those APIs are untested at scale under failure conditions.
Verdict: Wait for GA or a Reference Deployment
The architecture underneath the announcement is sound: prefill/decode disaggregation does reduce TTFT at scale for workloads with meaningful prefix reuse, and integrating that with KServe’s Kubernetes-native lifecycle model is the right direction. The problem is that the RC1 stack ships with a known routing deadlock in issue #5385, a prematurely merged WIP gateway migration in PR #5041, and no reference deployment for teams to follow.
For teams already running KServe, the practical path forward is to track the GA release and wait for either a reference deployment or a confirmed fix to issue #5385 before committing to LLMInferenceService. For teams evaluating disaggregated inference options more broadly, the 57× headline should not drive evaluation criteria. The real comparison is against a tuned vLLM deployment with prefix caching enabled — and that number does not appear anywhere in Red Hat’s announcement[1] or the llm-d.ai blog[3].
Frequently Asked Questions
What workload patterns see the least benefit from disaggregated prefill/decode?
Single-turn RAG over diverse document sets, open-ended creative generation, and adversarial prompting produce minimal prefix overlap between requests — the llm-d router’s dispatch depends on repeated prefixes to route to backends holding warm KV-cache entries. At low reuse rates, the overhead of maintaining two separate GPU pools and the cross-pool routing hop can increase end-to-end latency relative to a single pooled deployment, negating the architecture’s core advantage.
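One hedged way to check where a workload falls is to estimate prefix reuse directly from a request-log sample. In this sketch, whitespace words stand in for real tokens and the 32-token threshold is arbitrary — substitute your tokenizer and your runtime’s cache-block granularity.

```python
def prefix_reuse_rate(prompts, min_shared_tokens=32):
    """Fraction of requests sharing at least `min_shared_tokens` leading
    tokens with some earlier request. Whitespace split is a stand-in
    for a real tokenizer."""
    seen = []
    reused = 0
    for p in prompts:
        toks = p.split()
        for prev in seen:
            n = 0
            for a, b in zip(toks, prev):
                if a != b:
                    break
                n += 1
            if n >= min_shared_tokens:
                reused += 1
                break
        seen.append(toks)
    return reused / len(prompts) if prompts else 0.0

# e.g. five requests sharing a 40-token system prompt, five unrelated ones
shared = " ".join(["tok"] * 40)
prompts = [shared + f" question-{i}" for i in range(5)] + \
          [" ".join([f"u{i}"] * 40) for i in range(5)]
print(prefix_reuse_rate(prompts))  # → 0.4
```

The quadratic scan is fine for a log sample; this is a measurement aid, not production code. Workloads scoring near zero here are the ones where the two-pool overhead can outweigh the routing benefit.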
What’s the minimum GPU footprint needed to run prefill and decode as separate pools?
Splitting into two pools requires enough total GPUs that neither pool is memory-starved under peak load. The documented MI300X proof of concept uses 4 GPUs for Llama 3.1 70B; teams running smaller models on single-GPU nodes or consumer hardware likely cannot meaningfully partition into two pools. The architecture also requires right-sizing the prefill-to-decode GPU ratio, which demands load-testing data that most teams don’t collect before their first inference deployment.
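As an illustration of that right-sizing problem, a back-of-envelope split might proportion GPUs to the token load each phase generates. Every number below — per-GPU prefill and decode throughput, request rate, token counts — is an assumption you would replace with your own load-test data, and the linear model ignores batching effects and KV-cache memory limits.

```python
def pool_split(total_gpus, req_per_s, prompt_toks, output_toks,
               prefill_tps_per_gpu, decode_tps_per_gpu):
    """Illustrative sizing: split GPUs in proportion to the
    token-processing demand (GPU-equivalents) of each phase."""
    prefill_load = req_per_s * prompt_toks / prefill_tps_per_gpu
    decode_load = req_per_s * output_toks / decode_tps_per_gpu
    total_load = prefill_load + decode_load
    prefill_gpus = max(1, round(total_gpus * prefill_load / total_load))
    decode_gpus = max(1, total_gpus - prefill_gpus)
    return prefill_gpus, decode_gpus

# 8 GPUs, 20 req/s, 2000-token prompts, 300-token outputs,
# assumed 40k prefill tok/s and 2.5k decode tok/s per GPU
print(pool_split(8, 20, 2000, 300, 40000, 2500))  # → (2, 6)
```

Even this crude model makes the dependency visible: the split swings with prompt-to-output token ratio, which is precisely the load-testing data most teams lack before their first deployment.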
Do clusters need Gateway API already installed before LLMInferenceService works?
Yes — RC1 requires Gateway API Inference Extension CRDs and a functioning Gateway controller. Many production Kubernetes clusters still run Ingress controllers (nginx, Traefik) without Gateway API at all, and not every managed Kubernetes offering exposes the Inference Extension CRDs. PR #5041’s silent-fallback behavior compounds the setup risk: if gateway discovery fails during initial configuration, traffic can route through a policy controller parent without raising an error, making misconfiguration hard to detect until production traffic is affected.
How does the HTTPRoute deadlock interact with autoscaling events?
The RC1 release notes and companion blog do not document how in-flight requests are handled during LLMInferenceService scale-down — there is no described graceful connection draining mechanism. Combined with issue #5385’s deadlock, a scale-down event could strand active requests on backends that the auto-generated HTTPRoute no longer recognizes as valid targets, since the route may already be stuck with empty backendRefs. Teams should plan to test scale-down behavior under load before relying on autoscaling in any environment.