Gemma 4 31B on Cloud TPU vs GPU: The Serving Cost Crossover Point

Gemma 4 31B scores 85.2% on MMLU Pro and 84.3% on GPQA Diamond per Google’s model card, competitive with models several times its parameter count. The question for infra teams is not capability but cost: a community benchmark on Google’s Trillium TPU v6e reports 463,345 tok/s peak prefill at ~$0.40/hr Flex-start pricing, a number that, if it survives production decode workloads, challenges the GPU-default assumption for open-weight inference.

The GPU baseline: H100 cost context for 30B-class inference

Spheron’s 2026 cross-GPU cost benchmark does not include a Gemma 4 31B throughput row. The closest proxy is Qwen 3 32B dense on 2×H100, which delivers 3,200 tok/s at $5.80/hr on-demand, for a cost of $0.50 per million tokens. Gemma 4 31B, at 30.7B parameters per Google’s model card, fits on a single H100 (80 GB) in FP16 with roughly 61 GB of weights, so its hardware cost profile should land below the 2×H100 Qwen 3 figure, but no published benchmark confirms the exact throughput.

That figure holds only at full batch. Spheron’s benchmark methodology uses batch 256 across all runs, while production deployments typically run at effective batch sizes of 8-32, which pushes actual cost well above the headline figure. Spheron’s own Llama 4 Scout result shows $0.19/M tokens on a single H100 at batch 256; that number climbs fast as concurrency drops.

TPU v6e Flex-start: 308M tokens per dollar at peak prefill

Flex-start is not a v6e-specific promotion. [Updated June 2026] It is Google’s Dynamic Workload Scheduler provisioning mode, which hands out idle accelerator capacity for runs of up to seven days at a discount off on-demand rates, and it predates the v6e benchmarks below. A single community submission to Google’s Gemma 4 Challenge benchmarked Gemma 4 31B on a v6e-4 pod using vLLM v0.20.2 and reported:

Peak prefill throughput: 463,345 tok/s at concurrency 256
Time-to-first-token: 0.55s at concurrency 1
Flex-start rate: ~$0.40/hr
Claimed cost efficiency: ~308 million tokens per dollar (per the benchmark author)

For comparison, Spheron’s verifiable GPU benchmarks put the Qwen 3 32B dense model (the closest size proxy to Gemma 4 31B) at roughly 2M tokens per dollar on 2×H100 at batch 256. Against the TPU’s 308M prefill figure, the raw ratio exceeds 100x, but the two numbers measure different things: Spheron reports sustained batch throughput (prefill + decode), while the TPU figure is pure prefill.
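The arithmetic behind that ratio is easy to check. The sketch below is a minimal Python illustration that uses only figures quoted in this article; the function name `tokens_per_dollar` is made up for the example. Note that plugging the headline 463,345 tok/s and ~$0.40/hr into the same formula gives a far larger number than 308M, so the benchmark author's figure presumably rests on inputs this article does not state. It is treated here as a quoted claim, not a derived value.

```python
def tokens_per_dollar(tokens_per_second: float, usd_per_hour: float) -> float:
    """Tokens processed per dollar at a sustained rate."""
    return tokens_per_second * 3600 / usd_per_hour

# Spheron's Qwen 3 32B proxy: 3,200 tok/s on 2xH100 at $5.80/hr.
gpu_tpd = tokens_per_dollar(3_200, 5.80)

# Quoted as-is from the benchmark author, not derived here.
TPU_CLAIMED_TPD = 308e6

ratio = TPU_CLAIMED_TPD / gpu_tpd

# Naive recomputation from the two headline TPU numbers, for comparison only.
tpu_naive_tpd = tokens_per_dollar(463_345, 0.40)

print(f"GPU proxy: {gpu_tpd / 1e6:.2f}M tokens per dollar")
print(f"Claimed TPU / GPU ratio: {ratio:.0f}x")
print(f"Naive TPU recomputation: {tpu_naive_tpd / 1e9:.2f}B tokens per dollar")
```

Either way the comparison is between prefill and sustained throughput, so the ratio is an upper bound on what production traffic would see.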

Two structural caveats:

Flex-start pricing is opportunistic, not a list rate. It reflects idle TPU capacity allocated through Dynamic Workload Scheduler, not committed or on-demand rates, and the ~$0.40/hr figure comes from one benchmark author rather than a published tariff. [Updated June 2026] Google’s public TPU pricing still does not break out a clean sustained v6e chip-hour rate, and third-party trackers put on-demand v6e well above $0.40/hr, so the cost-per-dollar headline rests on a discount tier that may not be schedulable when you need the capacity.
463K tok/s is prefill throughput, not generation. Prefill measures prompt ingestion speed; production autoregressive decode is sequential per token and typically an order of magnitude slower. The benchmark does not report decode throughput, so the 308M tokens-per-dollar figure cannot be applied directly to production cost per million tokens (CPM).
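To see how much the decode gap matters, the sketch below blends the two phases for a single request shape. Only the prefill rate comes from the benchmark. The decode rate is an assumption, set to one tenth of prefill to match the order-of-magnitude rule of thumb, and the prompt and completion lengths are hypothetical. The model also runs the phases back to back, which ignores any scheduler overlap.

```python
def effective_tokens_per_second(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Total tokens handled divided by total time, phases run back to back."""
    seconds = prompt_tokens / prefill_tps + output_tokens / decode_tps
    return (prompt_tokens + output_tokens) / seconds

PREFILL_TPS = 463_345            # reported peak prefill, concurrency 256
DECODE_TPS = PREFILL_TPS / 10    # ASSUMPTION: not measured by the benchmark

# Hypothetical traffic shapes: (prompt tokens, completion tokens)
shapes = {
    "prefill-heavy": (4_000, 200),
    "chat-like": (500, 500),
    "decode-heavy": (200, 2_000),
}

fractions = {}
for name, (p, o) in shapes.items():
    eff = effective_tokens_per_second(p, o, PREFILL_TPS, DECODE_TPS)
    fractions[name] = eff / PREFILL_TPS
    print(f"{name:14s} {eff:>10,.0f} tok/s ({fractions[name]:.0%} of headline prefill)")
```

Under these assumptions even prefill-heavy traffic gives up a meaningful share of the headline rate, and decode-heavy traffic gives up most of it.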

Earlier TPU generations: v5e and v5p context

The v6e numbers are not an isolated data point. Easecloud’s TPU v5p analysis reports that TPU v5e at $1.60/chip-hr delivers 56% cost savings versus A100 for batch inference. TPU v5p spot and preemptible pricing drops to $1.20/chip-hr, a 70% discount off the $4.00/chip-hr on-demand rate.

Spheron listed H100 on-demand at $2.90/hr in early 2026, though per-token cost depends on model throughput, which varies by architecture and batch configuration. [Updated June 2026] By mid-2026 that rate reads high. Cross-provider pricing indexes put median H100 on-demand near $2.59/hr, with budget hosts (Thunder Compute, Hyperstack, Runpod) under $2.00/hr and spot capacity around $1.40/hr. Blackwell B200 and B300 availability has pushed H100 into the mid-tier, which compresses any TPU advantage that was computed against a $2.90 baseline. Recompute the crossover against the rate you can actually buy, not the one in a six-month-old benchmark.
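A minimal sketch of that recomputation, holding the Qwen 3 32B proxy throughput (3,200 tok/s on 2×H100) fixed and varying only the hourly rate. The rates are the ones quoted above; the $2.00 row is the ceiling of the "under $2.00/hr" budget tier, and holding throughput constant across providers is an assumption.

```python
def usd_per_million_tokens(tokens_per_second: float, usd_per_hour: float) -> float:
    """Serving cost per one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1e6

PROXY_TPS = 3_200   # Qwen 3 32B dense on 2xH100, batch 256 (Spheron)
GPUS = 2

# Per-GPU hourly rates quoted in this article.
h100_rates = {
    "Spheron early 2026": 2.90,
    "median on-demand": 2.59,
    "budget hosts (ceiling)": 2.00,
    "spot": 1.40,
}

costs = {
    label: usd_per_million_tokens(PROXY_TPS, rate * GPUS)
    for label, rate in h100_rates.items()
}

for label, cost in costs.items():
    print(f"{label:24s} ${cost:.2f} per M tokens")
```

The first row reproduces the $0.50 per million tokens quoted for the proxy; the others show how far the GPU baseline moves on price alone.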

The trend across TPU generations is consistent: Google’s own silicon undercuts GPU on-demand pricing for models built in the JAX/XLA ecosystem. Gemma, sharing its lineage with Gemini, fits that profile by design.

TPU v7 Ironwood: the v6e comparison is already a generation behind

[Updated June 2026] Any v6e-versus-H100 cost model written in mid-2026 has a dating problem. Google brought Ironwood, its seventh-generation TPU, to general availability on April 22, 2026 at Cloud Next, weeks before the v6e benchmarks above circulated. Ironwood is the first TPU Google explicitly positions for inference rather than training, and the headline claim is more than 4x better performance per chip than v6e for both training and inference, with a 9,216-chip superpod sharing 1.77 PB of HBM over a 9.6 Tb/s interconnect.

For a 31B dense model that fits on a fraction of a single chip’s memory, the superpod scale is irrelevant. What matters is the per-chip improvement and, more than that, the per-chip price. Google has not published a clean Ironwood chip-hour rate, so the crossover cannot be recomputed yet. The direction is clear enough: if Ironwood delivers even half its claimed per-chip throughput gain at a price within reach of v6e, the decode-phase economics that currently favor the GPU baseline tilt back toward the TPU. The question for an infra team in mid-2026 is no longer “v6e or H100” but whether it is worth waiting for Ironwood quota in your region before committing to either.
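The break-even can be stated even without a price. Cost per token scales with chip-hour price divided by throughput, so Ironwood matches v6e on cost per token when its price multiple equals its throughput multiple. The sketch below assumes the per-chip gain applies uniformly to the workload, which the 4x headline does not guarantee for decode; the function name and the price multiples are illustrative.

```python
def relative_cost_per_token(price_multiple: float, throughput_multiple: float) -> float:
    """New chip's cost per token relative to the old one (1.0 = parity)."""
    return price_multiple / throughput_multiple

# Throughput multiples vs v6e: the headline claim (4x) and half of it (2x).
for gain in (4.0, 2.0):
    # HYPOTHETICAL price multiples: no Ironwood chip-hour rate is published.
    for price in (1.0, 2.0, 3.0):
        rel = relative_cost_per_token(price, gain)
        verdict = "cheaper" if rel < 1 else "parity" if rel == 1 else "pricier"
        print(f"gain {gain:.0f}x, price {price:.0f}x -> {rel:.2f}x cost per token ({verdict})")
```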

The practical constraint is availability. GA does not mean ungated, and inference-optimized silicon that hyperscale buyers have pre-committed to in bulk tends to reach smaller tenants last. Until Ironwood quota and pricing are routine, the v6e numbers remain the live comparison, with an expiration date stamped on them.

Dense vs MoE: the 26B A4B may be the real winner

Gemma 4 31B is a dense 30.7B-parameter model across 60 layers, per Google’s model card. Its Mixture-of-Experts sibling, the 26B A4B, activates only 3.8B of 25.2B total parameters per token.

On the same v6e-4 hardware, the community benchmark reports the dense 31B edges out the 26B MoE in peak prefill (463K vs ~457K tok/s). But the MoE achieves comparable throughput with 7.5x less compute per token. In serving terms, the MoE handles more concurrent requests at the same hardware budget, or matches throughput on cheaper silicon.
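As a rough cross-check, forward-pass compute per token for a transformer is commonly approximated as 2 FLOPs per active parameter. Under that approximation, which is an assumption that ignores attention over the context and router overhead, the parameter counts from the model card give a ratio of about 8x, in the same range as the figure above.

```python
DENSE_ACTIVE = 30.7e9   # Gemma 4 31B, all parameters active
MOE_ACTIVE = 3.8e9      # Gemma 4 26B A4B, active per token
MOE_TOTAL = 25.2e9      # Gemma 4 26B A4B, total

def approx_flops_per_token(active_params: float) -> float:
    """Rule of thumb: one multiply and one add per active weight."""
    return 2 * active_params

dense_flops = approx_flops_per_token(DENSE_ACTIVE)
moe_flops = approx_flops_per_token(MOE_ACTIVE)
compute_ratio = dense_flops / moe_flops
active_fraction = MOE_ACTIVE / MOE_TOTAL

print(f"dense: {dense_flops / 1e9:.1f} GFLOPs/token")
print(f"MoE:   {moe_flops / 1e9:.1f} GFLOPs/token")
print(f"dense/MoE compute ratio: {compute_ratio:.1f}x")
print(f"MoE active fraction: {active_fraction:.1%}")
```

The saving is in compute, not memory: all 25.2B MoE parameters still have to be resident, so the weight footprint tracks total parameters.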

Quality-wise, the 26B A4B scores within 2-3 points of the dense 31B across MMLU Pro (where 31B scores 85.2%), AIME 2026 (89.2%), LiveCodeBench v6 (80.0%), and GPQA Diamond (84.3%), per Google’s model card. For most online serving workloads, that quality gap does not justify the 8x parameter activation cost of the dense model (dev.to benchmark).

Runtime choice sets the throughput ceiling as much as silicon does. Gemma 4 ships multi-token-prediction (MTP) heads that some serving stacks can exploit and others cannot: LiteRT-LM exposes them, while llama.cpp cannot reach them. That matters more for on-device decode than for datacenter TPU serving, but it makes the point that a cost model anchored to one runtime’s numbers is also anchored to that runtime’s bugs and missing features.

Metric	Gemma 4 31B (dense)	Gemma 4 26B A4B (MoE)
Total params	30.7B	25.2B
Active params/token	30.7B	3.8B
Peak prefill (v6e-4)	463K tok/s	~457K tok/s
Compute efficiency (relative)	1x	7.5x more efficient
Quality gap vs dense	baseline	within 2-3 pts on major benchmarks

Sources: dev.to community benchmark, Gemma 4 model card

Hidden costs: JAX migration, kernel maturity, and the prefill-decode gap

The throughput numbers tell one story. The operational economics tell another.

JAX conversion overhead. Easecloud’s analysis estimates 1-3 days of engineering time to port a PyTorch model to JAX for TPU deployment. For teams with deep CUDA investments (custom kernels, Triton ops, vLLM plugin chains), the migration tax is not trivial. Gemma’s shared lineage with Gemini means JAX kernels for this model family are likely more mature than for arbitrary open-weight models, but “more mature” is not the same as “drop-in.”

vLLM TPU support. The benchmark uses vLLM v0.20.2 with its experimental TPU backend. Production reliability under sustained load, mixed sequence lengths, and multi-step scheduling is less proven than the CUDA path. Teams evaluating TPU for production serving should budget for validation work. [Updated June 2026] The TPU path has consolidated since: vLLM now ships a unified tpu-inference plugin with JAX and PyTorch front ends over one backend, listing v6e and v5e as recommended and older v5p, v4, and v3 as experimental. That is a real maturity step from the v0.20-era backend, but recommended is not the same as proven under sustained production load, and the CUDA stack keeps a multi-year lead in operator coverage and community-debugged edge cases. (vLLM’s own roadmap keeps moving the prefill/decode boundary: v0.21 added bidirectional KV-cache transfers between prefill and decode workers.)

Prefill vs decode. This is the largest unquantified variable. The 463K tok/s figure measures how fast the TPU ingests prompt tokens in parallel. Production serving is dominated by autoregressive decode, where each token depends on the previous one and parallelism is limited. Without decode-phase benchmarks for Gemma 4 31B on v6e, the production CPM remains unknown. The gap between prefill and decode throughput on TPU architectures can be substantial, and it is the decode number that determines real serving cost.

The reason is structural. Prefill is compute-bound: a long prompt is a large matrix multiply that saturates the matrix units, which is exactly what TPUs are built for. Decode is memory-bandwidth-bound: generating one token reads the full model weights plus the growing KV cache for a single-row matrix-vector product, so throughput tracks HBM bandwidth, not peak FLOPS. A chip can win prefill by a wide margin and lose decode, because the two phases stress different parts of the silicon. This is why prefill/decode disaggregation has become standard in serious serving stacks: the phases scale on different hardware ratios, and pinning both to one accelerator type overpays for one of them. A v6e-versus-H100 decision that ignores this picks a winner for the easy half of the workload.
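A back-of-the-envelope roofline makes the bandwidth bound concrete. The sketch below estimates a ceiling on decode throughput as memory bandwidth divided by bytes read per decode step. Every hardware and cache figure in it is an assumption for illustration: the bandwidth values are approximate public spec-sheet numbers that should be checked against current vendor documentation, and the per-sequence KV-cache size is a hypothetical placeholder, since it depends on attention configuration and context length. Real throughput will sit below this ceiling.

```python
def decode_ceiling_tps(hbm_bytes_per_s, weight_bytes, kv_bytes_per_seq, batch):
    """Upper bound on decode tokens/s when memory traffic is the bottleneck.

    Each decode step reads the weights once (shared by the whole batch) plus
    every sequence's KV cache, and emits one token per sequence.
    """
    bytes_per_step = weight_bytes + batch * kv_bytes_per_seq
    steps_per_s = hbm_bytes_per_s / bytes_per_step
    return steps_per_s * batch

WEIGHT_BYTES = 30.7e9 * 2    # 30.7B params at 2 bytes each (16-bit weights)
KV_BYTES_PER_SEQ = 0.5e9     # HYPOTHETICAL: depends on context length and KV layout

# ASSUMED aggregate HBM bandwidth in bytes/s; verify against vendor specs.
# The multi-chip figure also assumes even sharding so bandwidth adds up.
hardware = {
    "1x H100 SXM (~3.35 TB/s)": 3.35e12,
    "v6e-4 (4 x ~1.64 TB/s)": 4 * 1.64e12,
}

for name, bw in hardware.items():
    for batch in (1, 8, 32):
        tps = decode_ceiling_tps(bw, WEIGHT_BYTES, KV_BYTES_PER_SEQ, batch)
        print(f"{name:26s} batch {batch:>2}: <= {tps:>6,.0f} tok/s")
```

The ceiling rises with batch size because the weight read is amortized across more sequences, which is the same mechanism behind the batch sensitivity of cost per token.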

The crossover decision framework

The TPU-vs-GPU question for Gemma 4 31B is not a single break-even calculation. It is a matrix of three variables: pricing tier, batch concurrency, and workload composition.

TPU v6e Flex-start is favorable when:

Batch workloads dominate (high concurrency, latency-tolerant)
The team has JAX competency or is serving Gemma-family models with mature XLA kernels
Spot/preemptible capacity is acceptable (the workload can tolerate preemption)
Traffic is prefill-heavy (long prompts, short completions)

H100 is favorable when:

The team has existing CUDA infrastructure (custom kernels, TorchServe, TensorRT-LLM pipelines)
Latency-sensitive serving with low batch sizes is the primary workload
On-demand capacity guarantees are required
Mixed-model serving (Gemma alongside Llama, Mistral, etc.) on shared infrastructure

The batch-size sensitivity alone makes blanket CPM comparisons misleading. Spheron’s benchmark data shows a steep cost curve as concurrency drops from batch 256 toward batch 1. The TPU v6e’s advantage at Flex-start pricing is real, but the magnitude of that advantage collapses the moment you compare prefill benchmarks to decode-heavy production traffic, or Flex-start rates to committed pricing.

For infra teams defaulting to H100 for open-weight serving, the exercise is straightforward: benchmark your actual decode throughput and concurrency distribution, then cost both hardware paths at your real batch profile. The TPU numbers are good enough now that skipping the comparison is leaving money on the table. But the headline gap between prefill TPU throughput and batch-256 GPU throughput is not the number to take to finance.
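One way to structure that exercise is to weight cost by where your traffic actually sits. The sketch below takes a measured throughput table per batch size and the share of wall-clock time spent at each batch size, then computes a blended cost per million tokens for each hardware path. All throughput numbers, the time-share distribution, and the hourly rates in it are hypothetical placeholders, to be replaced with your own measurements and quotes.

```python
def blended_usd_per_million(usd_per_hour, tps_by_batch, time_share_by_batch):
    """Cost per 1M tokens when serving time is split across batch sizes.

    tps_by_batch: measured end-to-end tokens/s at each batch size.
    time_share_by_batch: fraction of wall-clock time spent at each batch size.
    """
    if abs(sum(time_share_by_batch.values()) - 1.0) > 1e-9:
        raise ValueError("time shares must sum to 1")
    tokens_per_hour = sum(
        tps_by_batch[b] * share * 3600 for b, share in time_share_by_batch.items()
    )
    return usd_per_hour / tokens_per_hour * 1e6

# HYPOTHETICAL inputs: replace with your own benchmark results and quotes.
time_share = {8: 0.5, 32: 0.4, 256: 0.1}
paths = {
    "GPU path": {"usd_per_hour": 2.59, "tps": {8: 300, 32: 1_000, 256: 3_000}},
    "TPU path": {"usd_per_hour": 1.60, "tps": {8: 150, 32: 600, 256: 4_000}},
}

results = {
    name: blended_usd_per_million(p["usd_per_hour"], p["tps"], time_share)
    for name, p in paths.items()
}
for name, cost in results.items():
    print(f"{name}: ${cost:.2f} per M tokens at this batch profile")
```

The output says nothing about which path wins in general; it only shows that the answer depends on the time-share weights as much as on the peak throughput row.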

Frequently Asked Questions

Does FP8 quantization close the GPU cost gap with TPU v6e?

FP8 shrinks Gemma 4 31B weights to roughly 31 GB, freeing H100 VRAM for a larger KV cache and enabling longer context windows at the same batch size. The throughput gain is roughly 1.5-2x, which narrows the gap but does not close it: the headline TPU Flex-start advantage at batch 256 is two orders of magnitude, although that headline compares TPU prefill against sustained GPU throughput.
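The memory arithmetic can be sketched as follows. It counts weights only, using the 30.7B parameter figure from the model card and decimal gigabytes. Activations, runtime overhead, and any tensors a given FP8 recipe keeps in higher precision are ignored, so real headroom will be smaller.

```python
PARAMS = 30.7e9        # Gemma 4 31B parameter count (model card)
H100_HBM_GB = 80

BYTES_PER_PARAM = {"FP16": 2, "FP8": 1}

def weight_gb(params: float, dtype: str) -> float:
    """Weight footprint in decimal GB for the given storage dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9

headroom = {}
for dtype in BYTES_PER_PARAM:
    weights = weight_gb(PARAMS, dtype)
    headroom[dtype] = H100_HBM_GB - weights
    print(f"{dtype}: weights {weights:.1f} GB, headroom {headroom[dtype]:.1f} GB")

print(f"extra room from FP8: {headroom['FP8'] - headroom['FP16']:.1f} GB")
```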

What batch size do most teams actually run, and how much does that inflate GPU cost?

Spheron’s data shows H100 cost for 70B-class models climbs from $2.30/M tokens at batch 256 to roughly $258/M at batch 1, a spread of more than 100x. Most production deployments land at effective batch 8-32, which means they pay roughly 10-30x the batch-256 headline numbers that appear in vendor benchmarks.

How do older TPU generations compare to A100 for smaller models?

TPU v5e-8 serves 7B-class models at $1.08/M tokens versus $3.82/M on A100 and $2.73/M on H100, per Easecloud’s analysis. The tradeoff is the same JAX conversion overhead (1-3 days for PyTorch teams) that applies to v6e, since all TPU generations share the XLA compiler stack.

What production metrics are still unmeasured for Gemma 4 31B on TPU v6e?

Beyond decode throughput, the benchmark does not report inter-token latency, behavior under mixed sequence-length distributions, or performance when Flex-start instances are preempted. The 0.55s TTFT at concurrency 1 is the only latency figure published; there is no P90/P99 data and no published results for generation quality or throughput stability under sustained multi-hour runs.