Llama 4 on Vercel's AI Model Gateway: Hosted Inference vs Self-Hosted vLLM

Teams deploying Llama 4 hit a fork every frontier model creates: run the mixture-of-experts weights on your own GPUs through llama.cpp or a vLLM-style server, or call them through a hosted gateway. Vercel’s homepage markets an “AI Model Gateway” under its “Agentic Infrastructure” framing, but names no model roster and publishes no per-token rates.

What Llama 4 actually is

Llama 4 is a natively multimodal mixture-of-experts family with text and image input, official support for 12 languages, and pre-training data spanning roughly 200 languages, per its Ollama model card. The same card lists Scout as a 109B-parameter MoE with 17B active parameters and Maverick as a 400B-parameter MoE, also with 17B active. The 17B-active figure governs inference cost per token, because only the activated experts run on the forward pass; the total parameter count governs how much VRAM you need to hold the weights resident.
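
The split between active and total parameters can be made concrete with a back-of-the-envelope estimate. The sketch below multiplies total parameters by bytes per parameter; the parameter counts come from the model card figures above, while the bit widths (16, 8, 4) are illustrative assumptions about how the weights might be stored. It ignores the KV cache, activations, and any quantization overhead, so read the output as a lower bound on memory.

```python
# Rough weight-memory estimate: total parameters times bytes per parameter.
# Parameter counts are from the text above; bit widths are illustrative assumptions.

def weight_memory_gb(total_params_billion: float, bits_per_param: float) -> float:
    """Return approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = total_params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


MODELS = {"Scout": 109, "Maverick": 400}  # total parameters, in billions

for name, params_b in MODELS.items():
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_memory_gb(params_b, bits):.1f} GB")
```

Under these assumptions, 16-bit weights alone come to about 218 GB for Scout and 800 GB for Maverick, even though both activate only 17B parameters per token.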

On Meta’s reported benchmarks, Maverick scores 80.5 on MMLU Pro and 69.8 on GPQA Diamond, with 43.4 pass@1 on the Oct 2024 to Feb 2025 LiveCodeBench slice; Scout scores 74.3 and 57.2 on the first two, per the Ollama card. These are vendor-reported numbers on slices Meta selected; treat them as a ceiling, not a floor.

The hosted-gateway path: Vercel’s AI Model Gateway

Vercel’s pitch, under its Agentic Infrastructure framing, is a stack built from an AI Model Gateway, Sandboxed Environments, Durable Orchestration, and Fluid Compute. Per Vercel’s Wikipedia entry, the company closed a $300M Series F, co-led by Accel and GIC, at a $9.3B valuation in September 2025.

The attraction of the gateway path is real: the same project that ships your frontend calls the model, there is no GPU to provision or keep warm, and the per-request cost is someone else’s accounting problem. The cost of that convenience is the part to price carefully. You inherit the gateway’s model availability, its rate limits, its context-window caps, and its pricing schedule, with no lever to tune any of them. If the operator raises inference pricing, reorders its model queue, or drops a variant, your application moves with it.

The self-hosted path: llama.cpp and the vLLM baseline

The canonical self-hosted baseline is llama.cpp, which ships a single binary with hand-tuned kernels across CPUs and GPUs, including Apple Silicon, the RTX 5090/4090/3090 line, H100, MI300, B200, Intel Arc, Radeon RX, Jetson, and DGX Spark. For a 17B-active MoE, that matrix is the practical reason self-hosting is even an option: as long as the full weight set fits in memory, you can serve Maverick-class activations on hardware you already own, and the per-token marginal cost is the electricity and the amortized GPU, with no per-call markup.

Meta’s own reference stack has thinned out, which matters if you are choosing between a third-party gateway and Meta’s first-party tooling. The original meta-llama/llama inference repo is deprecated as of the Llama 3.1 consolidation; Meta now points users to llama-models, llama-toolchain, llama-agentic-system, and PurpleLlama. Teams that need a real serving layer with batching and request handling generally land on llama.cpp, vLLM, or a managed equivalent rather than Meta’s reference code.

What you give up on cost, context, and latency

The decision reduces to three variables, in roughly this order of importance: per-token cost, the context window you actually get, and tail latency.

On cost, the only durable statement is structural. A self-hosted 17B-active MoE forward pass costs the electricity and the GPU lease; a gateway call costs the gateway’s per-token rate plus whatever margin the operator takes. The crossover depends on your request volume and on the specific rates. Model both against your own traffic; breakeven points for MoE inference move sharply with batch size and with how many concurrent requests you can coalesce.
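
One way to model the crossover is sketched below. Every number in it is a placeholder assumption, not a published rate: the gateway price per million tokens, the GPU hourly cost, the GPU count, and the sustained throughput all have to come from your own quotes and benchmarks. The sketch treats self-hosting as a fixed monthly cost and the gateway as a purely per-token cost, then checks whether the breakeven volume is even reachable at the throughput you measured.

```python
# Breakeven sketch: fixed self-hosting cost vs. per-token gateway cost.
# All rates and throughput figures are hypothetical placeholders.

HOURS_PER_MONTH = 730.0  # average hours in a month


def gateway_monthly_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Gateway spend, assuming a flat per-token rate."""
    return tokens_per_month / 1e6 * price_per_million_tokens


def self_hosted_monthly_cost(gpu_hourly_cost: float, gpu_count: int,
                             hours: float = HOURS_PER_MONTH) -> float:
    """Self-hosting spend, assuming GPUs are leased (or amortized) around the clock."""
    return gpu_hourly_cost * gpu_count * hours


def breakeven_tokens_per_month(price_per_million_tokens: float, gpu_hourly_cost: float,
                               gpu_count: int, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly token volume at which both paths cost the same."""
    fixed = self_hosted_monthly_cost(gpu_hourly_cost, gpu_count, hours)
    return fixed / price_per_million_tokens * 1e6


def capacity_tokens_per_month(sustained_tokens_per_second: float,
                              hours: float = HOURS_PER_MONTH) -> float:
    """Upper bound on tokens the self-hosted server can produce in a month."""
    return sustained_tokens_per_second * 3600 * hours


if __name__ == "__main__":
    # Hypothetical inputs: replace with your own quotes and measured throughput.
    breakeven = breakeven_tokens_per_month(price_per_million_tokens=1.0,
                                           gpu_hourly_cost=2.0, gpu_count=1)
    capacity = capacity_tokens_per_month(sustained_tokens_per_second=1000)
    print(f"breakeven: {breakeven:,.0f} tokens/month")
    print(f"capacity:  {capacity:,.0f} tokens/month")
    print("self-hosting can win" if capacity >= breakeven else "breakeven is out of reach")
```

The throughput input is where batch size enters: the more concurrent requests the server can coalesce, the higher the sustained tokens per second and the easier it is for capacity to clear the breakeven volume.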

On context, a hosted gateway may cap the window well below the model’s native limit for cost or safety reasons; if your workload is long-context (large document analysis, long agent traces), the effective context window behind the gateway is a question to ask the operator, not to assume from the model card. Self-hosting preserves the native window up to whatever VRAM the KV cache consumes, which is substantial at long context and a real deployment constraint.
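
To see why the KV cache becomes a deployment constraint, the sketch below uses the standard sizing formula for a transformer with plain full attention: two tensors (keys and values) per layer, each of size KV heads times head dimension, per token, per sequence in the batch. The architecture values in the example are hypothetical placeholders, not Llama 4's published configuration; read the real ones from the model's config file. Architectures that use local or chunked attention, or that quantize the cache, will come in lower.

```python
# KV cache sizing for standard full attention.
# The architecture numbers used in the example call are hypothetical placeholders.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer and token."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    for seq_len in (8_000, 128_000, 1_000_000):
        size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                              seq_len=seq_len, batch_size=1, bytes_per_elem=2)
        print(f"{seq_len:>9,} tokens: ~{size / 1e9:.1f} GB per sequence")
```

The cache grows linearly with both sequence length and batch size, so a long-context workload with several concurrent requests competes directly with the weights for VRAM.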

On latency, a gateway adds a network hop, a possible queue, and the gateway’s own scheduling; a self-hosted server adds your own queue and your own cold-start if you let the GPUs scale to zero. Neither is unconditionally faster; it depends on whether you can keep weights resident.
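
Because the comparison is about the tail and not the average, it helps to measure percentiles on both paths under the same traffic. The sketch below is a minimal nearest-rank percentile over a list of latency samples; the sample values are made up for illustration, and collecting real samples (timing actual requests against the gateway and against your own server) is left out.

```python
import math

# Nearest-rank percentile over latency samples.
# The sample values below are invented for illustration only.

def percentile(samples: list[float], pct: float) -> float:
    """Return the nearest-rank percentile (0 < pct <= 100) of the samples."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [120, 135, 128, 140, 131, 900, 126, 133, 129, 2400]
    for pct in (50, 95, 99):
        print(f"p{pct}: {percentile(latencies_ms, pct)} ms")
```

In the invented sample, a cold start or queue stall barely moves the median but dominates p95 and p99, which is the pattern to look for when weights are not kept resident.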

Security and lock-in: the April 2026 breach

Any “let the hosted gateway hold your inference keys” recommendation has to reckon with Vercel’s April 19, 2026 breach disclosure. Per Vercel’s Wikipedia entry, the initial foothold, as reported by Hudson Rock, was Lumma Stealer malware on an employee machine, delivered via Roblox cheat scripts. From there the intrusion pivoted through the third-party AI tool Context.ai into a Vercel Google Workspace account, exposing environment variables not marked “sensitive.”

The detail that matters for an inference-routing decision is the “not marked sensitive” qualifier: the exposed values were environment variables that had not been flagged sensitive, which makes the sensitive flag the relevant protective lever. The general lesson, regardless of Vercel’s exact defaults, is that a platform which holds your inference credentials is a platform whose compromise can leak them, and that the defensive work (flagging secrets, rotating keys, scoping env vars narrowly) lands on you no matter who runs the model.
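
As a rough illustration of the flagging step, the sketch below audits an inventory of environment variables and reports names that look like credentials but are not marked sensitive. The inventory format (a dict mapping each name to a metadata dict with a boolean "sensitive" field) is a hypothetical stand-in, not Vercel's API or export format, and the name pattern is a heuristic that will miss secrets with innocuous names.

```python
import re

# Heuristic audit: secret-looking names that are not flagged sensitive.
# The inventory shape is a hypothetical stand-in for whatever your platform exports.

SECRET_NAME_PATTERN = re.compile(r"KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL", re.IGNORECASE)


def unflagged_secret_names(env_inventory: dict[str, dict]) -> list[str]:
    """Return sorted variable names that match the pattern but lack the sensitive flag."""
    return sorted(
        name
        for name, meta in env_inventory.items()
        if SECRET_NAME_PATTERN.search(name) and not meta.get("sensitive", False)
    )


if __name__ == "__main__":
    inventory = {
        "INFERENCE_API_KEY": {"sensitive": False},
        "DB_PASSWORD": {"sensitive": True},
        "PUBLIC_SITE_URL": {"sensitive": False},
    }
    for name in unflagged_secret_names(inventory):
        print(f"flag as sensitive and rotate: {name}")
```

A check like this is cheap to run in CI; it complements, and does not replace, rotating keys and scoping variables to the environments that need them.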

Which path

For a side project, a prototype, or a workload where developer speed matters more than per-token margin, calling Llama 4 through whatever gateway your platform offers is defensible. For a production workload with steady traffic, long context, or cost sensitivity, self-hosting the 17B-active MoE weights on hardware you control gives you the knobs the gateway hides: cost, context window, uptime, and the model itself.

A caution applies to both paths. Meta’s Llama line has moved before, and model roadmaps shift; do not build a multi-year architecture on the assumption that any specific Llama 4 variant stays available, supported, or competitively priced.

Frequently Asked Questions

Is Llama 4 still Meta’s flagship model line as of 2026?

Llama 4 shipped on April 5, 2025, and in April 2026 Meta Superintelligence Labs released Muse Spark as a replacement for the Llama line. A team standardizing on Scout or Maverick today is locking into a family Meta has already signaled it is moving past, which shortens the expected support window for any deployment decision built around Llama 4 availability.

What happened to the larger Llama 4 Behemoth variant?

Meta announced a Behemoth model at roughly 2T total parameters, 288B active, and 16 experts, but never released it; Maverick was codistilled from Behemoth. The Scout and Maverick weights you can actually serve or route through a gateway are the smaller siblings in a lineup originally headed for a much larger flagship, which is part of why the 17B-active economics look the way they do.

Does Vercel’s Fluid Compute remove the cold-start problem for inference?

Vercel’s infrastructure runs on AWS, and Fluid, introduced in 2025, lets a single local instance handle multiple concurrent requests while keeping serverless elasticity. That coalesces warm paths for the gateway’s own compute, but it does not touch the network hop to the model, the gateway’s scheduling queue, or the resident-weights question that governs whether a self-hosted server is faster on the tail.

Who is behind Vercel’s push into AI infrastructure?

Mitchell Hashimoto joined Vercel’s board in March 2026, alongside the “AI Cloud” thesis Vercel marketed around its September 2025 Series F at a $9.3B valuation. The board addition is a signal that the AI Model Gateway direction carries the backing of a HashiCorp co-founder, which reframes Vercel’s pitch from pure frontend deployment toward serving as an inference-routing layer.