groundy

Gemma 4 31B on Cloud TPU vs GPU: The Serving Cost Crossover Point

TPU v6e Flex-start delivers 308M tokens per dollar for Gemma 4 31B prefill, undercutting H100 rates for open-weight serving, but production decode costs remain unquantified.


Gemma 4 31B scores 85.2% on MMLU Pro and 84.3% on GPQA Diamond per Google’s model card, competitive with models several times its parameter count. The question for infra teams is not capability but cost. A community benchmark on Google’s Trillium TPU v6e reports 463,345 tok/s peak prefill at ~$0.40/hr Flex-start pricing. If that number survives production decode workloads, it challenges the GPU-default assumption for open-weight inference.

The GPU baseline: H100 cost context for 30B-class inference

Spheron’s 2026 cross-GPU cost benchmark does not include a Gemma 4 31B throughput row. The closest proxy is Qwen 3 32B dense on 2×H100, which delivers 3,200 tok/s at $5.80/hr on-demand, for a cost of $0.50 per million tokens. Gemma 4 31B, at 30.7B parameters per Google’s model card, fits on a single H100 (80 GB) in FP16, so its hardware cost profile should land below the 2×H100 Qwen 3 figure, but no published benchmark confirms the exact throughput.
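The $0.50 figure follows directly from the hourly price and the throughput. A minimal sketch of that arithmetic, where the function name `cost_per_million_tokens` is illustrative and not part of Spheron’s tooling:

```python
def cost_per_million_tokens(tokens_per_second: float, usd_per_hour: float) -> float:
    """Dollars per one million tokens at a sustained throughput and hourly price."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000


# Qwen 3 32B dense on 2xH100, figures as quoted above from Spheron's benchmark.
qwen_cpm = cost_per_million_tokens(tokens_per_second=3_200, usd_per_hour=5.80)
print(f"${qwen_cpm:.2f} per million tokens")
```

The same helper applies to any throughput and price pair, which makes it easy to re-cost the proxy once a measured Gemma 4 31B throughput is available.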

The operative qualifier is batch size. Spheron’s benchmark methodology uses batch 256 across all runs. Production deployments typically run at effective batch sizes of 8-32, which pushes actual cost well above the headline figure. Spheron’s own Llama 4 Scout result shows $0.19/M tokens on a single H100 at batch 256; that number climbs quickly as concurrency drops.

TPU v6e Flex-start: 308M tokens per dollar at peak prefill

Google’s Trillium TPU v6e entered Flex-start pricing in May 2026. A community submission to Google’s Gemma 4 Challenge benchmarked Gemma 4 31B on a v6e-4 pod using vLLM v0.20.2 and reported:

  • Peak prefill throughput: 463,345 tok/s at concurrency 256
  • Time-to-first-token: 0.55s at concurrency 1
  • Flex-start rate: ~$0.40/hr
  • Claimed cost efficiency: ~308 million tokens per dollar (per the benchmark author)

For comparison, Spheron’s verifiable GPU benchmarks put the Qwen 3 32B dense model (the closest size proxy to Gemma 4 31B) at roughly 2M tokens per dollar on 2×H100 at batch 256. Against the TPU’s 308M prefill figure, the raw ratio exceeds 100x, but the two numbers measure different things: Spheron reports sustained batch throughput (prefill + decode), while the TPU figure is pure prefill.
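The tokens-per-dollar comparison can be sketched as below. The helper name is illustrative, the 308M figure is taken as reported by the benchmark author, and the recomputation assumes the ~$0.40/hr rate covers the whole v6e-4 slice, which the source does not state.

```python
def tokens_per_dollar(tokens_per_second: float, usd_per_hour: float) -> float:
    """Tokens processed per dollar at a sustained throughput and hourly price."""
    if usd_per_hour <= 0:
        raise ValueError("usd_per_hour must be positive")
    return tokens_per_second * 3600 / usd_per_hour


# GPU side: Qwen 3 32B on 2xH100 at batch 256 (Spheron figures quoted above).
gpu_tpd = tokens_per_dollar(3_200, 5.80)

# TPU side: the benchmark author's claimed figure, taken as reported.
CLAIMED_TPU_TPD = 308e6
ratio = CLAIMED_TPU_TPD / gpu_tpd

# Naive recomputation from the two published inputs (peak prefill, ~$0.40/hr).
# ASSUMPTION: $0.40/hr is the price of the whole v6e-4 slice; the source does not say.
naive_tpu_tpd = tokens_per_dollar(463_345, 0.40)

print(f"GPU: {gpu_tpd / 1e6:.2f}M tokens per dollar")
print(f"Claimed TPU / GPU ratio: {ratio:.0f}x")
print(f"Naive TPU recomputation: {naive_tpu_tpd / 1e9:.2f}B tokens per dollar")
```

With these inputs the GPU figure lands near 2M tokens per dollar and the claimed ratio near 155x. The naive recomputation comes out well above the claimed 308M, which suggests the author used a different price or throughput basis; both TPU figures are best treated as unverified until the methodology is clear.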

Two structural caveats:

  1. Flex-start pricing is promotional. It reflects idle TPU capacity, not committed or on-demand rates. Google has not published sustained on-demand pricing for v6e as of late May 2026.
  2. 463K tok/s is prefill throughput, not generation. Prefill measures prompt ingestion speed; production autoregressive decode is sequential per token and typically an order of magnitude slower. The benchmark does not report decode throughput, so the 308M tokens-per-dollar figure cannot be directly applied to production cost per million tokens (CPM).

Earlier TPU generations: v5e and v5p context

The v6e numbers are not an isolated data point. Easecloud’s TPU v5p analysis reports that TPU v5e at $1.60/chip-hr delivers 56% cost savings versus A100 for batch inference. TPU v5p spot and preemptible pricing drops to $1.20/chip-hr, a 70% discount off the $4.00/chip-hr on-demand rate.
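As a quick check on the v5p discount quoted above, a minimal sketch (the function name is illustrative):

```python
def discount_fraction(list_price: float, discounted_price: float) -> float:
    """Fractional discount of discounted_price relative to list_price."""
    if list_price <= 0:
        raise ValueError("list_price must be positive")
    return 1 - discounted_price / list_price


# TPU v5p: $4.00/chip-hr on-demand vs $1.20/chip-hr spot/preemptible (Easecloud figures).
v5p_discount = discount_fraction(list_price=4.00, discounted_price=1.20)
print(f"{v5p_discount:.0%} off on-demand")
```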

Spheron lists H100 on-demand at $2.90/hr, though per-token cost depends on model throughput, which varies by architecture and batch configuration.

The trend across TPU generations is consistent: Google’s own silicon undercuts GPU on-demand pricing for models built in the JAX/XLA ecosystem. Gemma, sharing its lineage with Gemini, fits that profile by design.

Dense vs MoE: the 26B A4B may be the real winner

Gemma 4 31B is a dense 30.7B-parameter model across 60 layers, per Google’s model card. Its Mixture-of-Experts sibling, the 26B A4B, activates only 3.8B of 25.2B total parameters per token.

On the same v6e-4 hardware, the community benchmark reports the dense 31B edging out the 26B MoE in peak prefill (463K vs ~457K tok/s). But the MoE achieves comparable throughput with roughly 7.5-8x less compute per token (3.8B active parameters versus 30.7B). In serving terms, the MoE handles more concurrent requests at the same hardware budget, or matches throughput on cheaper silicon.

Quality-wise, the 26B A4B scores within 2-3 points of the dense 31B on the major benchmarks, where the dense model scores 85.2% on MMLU Pro, 89.2% on AIME 2026, 80.0% on LiveCodeBench v6, and 84.3% on GPQA Diamond, per Google’s model card. For most online serving workloads, that quality gap does not justify the roughly 8x higher active-parameter cost of the dense model (dev.to benchmark).

Metric                        | Gemma 4 31B (dense) | Gemma 4 26B A4B (MoE)
Total params                  | 30.7B               | 25.2B
Active params/token           | 30.7B               | 3.8B
Peak prefill (v6e-4)          | 463K tok/s          | ~457K tok/s
Compute efficiency (relative) | 1x                  | 7.5x more efficient
Quality gap vs dense          | baseline            | within 2-3 pts on major benchmarks

Sources: dev.to community benchmark, Gemma 4 model card
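The active-parameter gap can be worked out from the model-card figures. The sketch below also uses the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token; that rule is an approximation introduced here for illustration (it ignores attention and routing overhead), not a number from the benchmark.

```python
DENSE_ACTIVE_B = 30.7  # Gemma 4 31B: all parameters active per token
MOE_TOTAL_B = 25.2     # Gemma 4 26B A4B: total parameters
MOE_ACTIVE_B = 3.8     # Gemma 4 26B A4B: parameters active per token


def approx_forward_gflops_per_token(active_params_billions: float) -> float:
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_billions  # billions of params -> GFLOPs


active_ratio = DENSE_ACTIVE_B / MOE_ACTIVE_B
moe_active_fraction = MOE_ACTIVE_B / MOE_TOTAL_B

print(f"Dense/MoE active-parameter ratio: {active_ratio:.1f}x")
print(f"Share of MoE parameters active per token: {moe_active_fraction:.1%}")
print(f"Approx. GFLOPs/token, dense: {approx_forward_gflops_per_token(DENSE_ACTIVE_B):.1f}")
print(f"Approx. GFLOPs/token, MoE:   {approx_forward_gflops_per_token(MOE_ACTIVE_B):.1f}")
```

The ratio is about 8x, close to the 7.5x efficiency figure in the table. The MoE still has to keep all 25.2B parameters resident in memory, so the saving is in compute per token rather than in weight footprint.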

Hidden costs: JAX migration, kernel maturity, and the prefill-decode gap

The throughput numbers tell one story. The operational economics tell another.

JAX conversion overhead. Easecloud’s analysis estimates 1-3 days of engineering time to port a PyTorch model to JAX for TPU deployment. For teams with deep CUDA investments (custom kernels, Triton ops, vLLM plugin chains), the migration tax is not trivial. Gemma’s shared lineage with Gemini means JAX kernels for this model family are likely more mature than for arbitrary open-weight models, but “more mature” is not the same as “drop-in.”

vLLM TPU support. The benchmark uses vLLM v0.20.2 with its experimental TPU backend. Production reliability under sustained load, mixed sequence lengths, and multi-step scheduling is less proven than the CUDA path. Teams evaluating TPU for production serving should budget for validation work.

Prefill vs decode. This is the largest unquantified variable. The 463K tok/s figure measures how fast the TPU ingests prompt tokens in parallel. Production serving is dominated by autoregressive decode, where each token depends on the previous one and parallelism is limited. Without decode-phase benchmarks for Gemma 4 31B on v6e, the production CPM remains unknown. The gap between prefill and decode throughput on TPU architectures can be substantial, and it is the decode number that determines real serving cost.
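To see why the decode number dominates, consider a deliberately simple model in which a request spends its prompt tokens at the prefill rate and its output tokens at the decode rate. The decode throughput below is a hypothetical placeholder, set about 10x below prefill to match the "order of magnitude" rule of thumb, because no measured decode figure exists for this setup. The model also ignores scheduling overlap between the two phases.

```python
def blended_tokens_per_second(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> float:
    """Effective tokens/s when prompt tokens run at prefill_tps and outputs at decode_tps."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughputs must be positive")
    seconds = prompt_tokens / prefill_tps + output_tokens / decode_tps
    return (prompt_tokens + output_tokens) / seconds


PREFILL_TPS = 463_345             # reported peak prefill on v6e-4
HYPOTHETICAL_DECODE_TPS = 46_000  # ASSUMPTION: placeholder, not a measurement

# (prompt tokens, output tokens) for three illustrative workload shapes
for prompt, output in [(4000, 100), (1000, 1000), (200, 2000)]:
    tps = blended_tokens_per_second(prompt, output, PREFILL_TPS, HYPOTHETICAL_DECODE_TPS)
    print(f"prompt={prompt:>4} output={output:>4} -> {tps:,.0f} tok/s blended")
```

The blended rate is a token-weighted harmonic mean, so it always sits between the decode and prefill rates and drops toward the decode rate as completions get longer. Swapping in a measured decode throughput turns this into a usable estimate.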

The crossover decision framework

The TPU-vs-GPU question for Gemma 4 31B is not a single break-even calculation. It is a matrix of three variables: pricing tier, batch concurrency, and workload composition.

TPU v6e Flex-start is favorable when:

  • Batch workloads dominate (high concurrency, latency-tolerant)
  • The team has JAX competency or is serving Gemma-family models with mature XLA kernels
  • Spot/preemptible capacity is acceptable (the workload can tolerate preemption)
  • Traffic is prefill-heavy (long prompts, short completions)

H100 is favorable when:

  • The team has existing CUDA infrastructure (custom kernels, TorchServe, TensorRT-LLM pipelines)
  • Latency-sensitive serving with low batch sizes is the primary workload
  • On-demand capacity guarantees are required
  • Mixed-model serving (Gemma alongside Llama, Mistral, etc.) on shared infrastructure

The batch-size sensitivity alone makes blanket CPM comparisons misleading. Spheron’s benchmark data shows a steep cost curve as concurrency drops from batch 256 toward batch 1. The TPU v6e’s advantage at Flex-start pricing is real, but the magnitude of that advantage collapses the moment you compare prefill benchmarks to decode-heavy production traffic, or Flex-start rates to committed pricing.

For infra teams defaulting to H100 for open-weight serving, the exercise is straightforward: benchmark your actual decode throughput and concurrency distribution, then cost both hardware paths at your real batch profile. The TPU numbers are good enough now that skipping the comparison is leaving money on the table. But the headline gap between prefill TPU throughput and batch-256 GPU throughput is not the number to take to finance.
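One way to cost a hardware path at a real batch profile is to weight measured throughput by the share of traffic served at each concurrency level. The sketch below assumes you have already measured end-to-end (prefill plus decode) throughput per concurrency bucket; the `Bucket` type, the function name, and every number in the example profile are hypothetical placeholders, not benchmark results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Bucket:
    traffic_share: float      # fraction of all tokens served at this concurrency
    tokens_per_second: float  # measured end-to-end throughput at this concurrency


def blended_cost_per_million(buckets: Sequence[Bucket], usd_per_hour: float) -> float:
    """Cost per million tokens, weighting each bucket by its share of traffic."""
    total_share = sum(b.traffic_share for b in buckets)
    if abs(total_share - 1.0) > 1e-6:
        raise ValueError("traffic shares must sum to 1")
    if any(b.tokens_per_second <= 0 for b in buckets):
        raise ValueError("throughputs must be positive")
    # Hours of instance time needed to serve one million tokens across the profile.
    hours = sum(
        b.traffic_share * 1_000_000 / (b.tokens_per_second * 3600) for b in buckets
    )
    return hours * usd_per_hour


# PLACEHOLDER profile: replace with your own measurements for each hardware path.
example_profile = [
    Bucket(traffic_share=0.2, tokens_per_second=400),
    Bucket(traffic_share=0.5, tokens_per_second=1_200),
    Bucket(traffic_share=0.3, tokens_per_second=2_500),
]
example_cpm = blended_cost_per_million(example_profile, usd_per_hour=2.90)
print(f"${example_cpm:.2f} per million tokens at this (placeholder) profile")
```

Running the same function with a TPU profile and its hourly rate gives a like-for-like figure. The comparison is only as good as the measurements, which should include decode and reflect the pricing tier you would actually buy.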

Frequently Asked Questions

Does FP8 quantization close the GPU cost gap with TPU v6e?

FP8 shrinks Gemma 4 31B weights to roughly 31 GB, freeing H100 VRAM for a larger KV cache and enabling longer context windows at the same batch size. The throughput gain is roughly 1.5-2x, which narrows the gap but does not close it: the headline TPU Flex-start advantage at concurrency 256 is about two orders of magnitude, and that headline itself compares prefill-only TPU throughput against mixed prefill-and-decode GPU throughput.
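The memory arithmetic behind that answer, as a minimal sketch. It counts weights only, treats 1 GB as 10^9 bytes, and ignores activations and runtime overhead, so the headroom figures are upper bounds.

```python
H100_MEMORY_GB = 80
GEMMA_4_31B_PARAMS_B = 30.7  # billions of parameters, per the model card


def weight_gigabytes(params_billions: float, bytes_per_param: int) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billions * bytes_per_param


for precision, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weights = weight_gigabytes(GEMMA_4_31B_PARAMS_B, bytes_per_param)
    headroom = H100_MEMORY_GB - weights
    print(f"{precision}: {weights:.1f} GB weights, up to {headroom:.1f} GB left for KV cache")
```

Under these assumptions FP16 leaves under 20 GB beside the weights on a single H100, while FP8 leaves close to 50 GB.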

What batch size do most teams actually run, and how much does that inflate GPU cost?

Spheron’s data shows H100 cost for 70B-class models climbs from $2.30/M tokens at batch 256 to roughly $258/M at batch 1, a spread of more than 100x. Most production deployments land at effective batch 8-32, which means they overpay by 10-30x compared to the batch-256 headline numbers that appear in vendor benchmarks.

How do older TPU generations compare to A100 for smaller models?

TPU v5e-8 serves 7B-class models at $1.08/M tokens versus $3.82/M on A100 and $2.73/M on H100, per Easecloud’s analysis. The tradeoff is the same JAX conversion overhead (1-3 days for PyTorch teams) that applies to v6e, since all TPU generations share the XLA compiler stack.

What production metrics are still unmeasured for Gemma 4 31B on TPU v6e?

Beyond decode throughput, the benchmark does not report inter-token latency, behavior under mixed sequence-length distributions, or performance when Flex-start instances are preempted. The 0.55s TTFT at concurrency 1 is the only latency figure published; there is no P90/P99 data and no published results for generation quality or throughput stability under sustained multi-hour runs.

sources · 4 cited

  1. Gemma 4 model card (vendor; accessed 2026-05-27)
  2. Gemma-4-31B on v6e-4 TPU Benchmarks (community; accessed 2026-05-27)
  3. GPU Cost Per Token: Benchmark 7 Major LLMs Across GPU Types in 2026 (analysis; accessed 2026-05-27)
  4. Google Cloud TPU v5p Specs, Pricing and LLM Benchmarks (analysis; accessed 2026-05-27)