LiteRT-LM v0.10.1 Ships Gemma 4 MTP Heads That llama.cpp Can't Access

Google shipped LiteRT-LM v0.10.1 on April 3, 2026, one day after Gemma 4’s public release, with full Gemma 4 support, a rewritten developer CLI, and Qualcomm NPU acceleration. (LiteRT-LM Releases) Alongside it, Google made a quiet distribution decision: MTP prediction heads were present in LiteRT-exported Gemma 4 weights and absent from the public HuggingFace release at launch. (Gemma 4 Was Released Without MTP Data — FlowHunt) Google corrected the omission on May 5, 2026, shipping standalone MTP drafter models for all four Gemma 4 variants under Apache 2.0, with Transformers, vLLM, SGLang, MLX, and Ollama all adding support on day one. (Accelerating Gemma 4: faster inference with multi-token prediction drafters — Google Blog) Teams deploying on llama.cpp still pay the speed penalty: the framework has an open feature request for Gemma 4 MTP drafter support and no implementation as of this writing. (Feature Request: support MTP drafters · llama.cpp #22747)

What LiteRT-LM v0.10.1 Actually Ships

The CLI migrated from fire to click, adding --verbose and --version flags and zero-code model experimentation with tool calling for agentic workflows. (LiteRT-LM Releases) HuggingFace direct model import removes a manual conversion step that previously required offline tooling. Speculative decoding is now supported at the framework level. The CMake build system was refactored to support Android cross-compilation.

Gemma 4 itself launched April 2 under Apache 2.0 in four configurations: E2B (edge, 2B parameters, 128K context, multimodal text/image/audio), E4B (edge, 4B parameters, 128K context), 26B A4B MoE (256K context), and 31B Dense (256K context). (Google Pushes Multimodal AI Further Onto Edge Devices with Gemma 4 — Edge AI and Vision Alliance)

The CLI runs on Linux, macOS, and Raspberry Pi. Python and Kotlin (Android) APIs are stable. The Swift API is “In Development”, so iOS teams cannot ship production native Swift apps from this release. (LiteRT-LM Overview — Google AI for Developers)

LiteRT-LM v0.11.0, released May 7, 2026, added two notable capabilities to this base: native Gemma 4 Single Position MTP (delivering over 2x faster decode speeds on mobile GPUs with no quality degradation) and Windows CLI support with CPU and GPU backends. (LiteRT-LM Releases)

The Benchmark Numbers: LiteRT-LM Performance on Mobile, Raspberry Pi, and Qualcomm NPU

For Gemma 4 E2B, Google’s published figures on the Qualcomm Dragonwing IQ8 NPU reach 3,700 prefill tokens/second and 31 decode tokens/second. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog) On a Raspberry Pi 5 CPU, those numbers drop to 133 prefill tokens/second and 7.6 decode tokens/second — usable for embedded pipelines, though not interactive-chat latency at longer contexts. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog)

The device-level breakdown from official documentation (LiteRT-LM Overview — Google AI for Developers):

Device	Backend	Prefill (tokens/s)	Decode (tokens/s)
Samsung S26 Ultra	CPU	557	47
Samsung S26 Ultra	GPU	3,808	52
iPhone 17 Pro	CPU	532	25
iPhone 17 Pro	GPU	2,878	56
MacBook Pro M4	GPU	7,835	160
Qualcomm Dragonwing IQ8	NPU	3,700	31
Raspberry Pi 5	CPU	133	7.6

Two caveats about these numbers. First, the Samsung S26 Ultra, iPhone 17 Pro, and Qualcomm Dragonwing IQ8 are 2026 flagship devices not yet in widespread developer hands, so real-world figures on 2024-era hardware will be lower. Second, these are E2B benchmarks, and the larger variants should be expected to run slower on the same hardware. The E2B model fits under 1.5GB with 2-bit and 4-bit quantization and memory-mapped embeddings, and processes 4,000 input tokens across two tasks in under 3 seconds on GPU. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog)
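To turn the table into a rough latency budget, divide prompt length by prefill rate and output length by decode rate. The sketch below does that arithmetic for three of the rows above. The function name and the workload (a 4,000-token prompt and a 200-token reply) are illustrative assumptions, and the estimate ignores model load time and sampling overhead.

```python
# Rough latency budget from published prefill/decode throughput.
# Throughput figures are the E2B numbers from the table above; the
# workload (4,000 prompt tokens, 200 generated tokens) is an assumption.

THROUGHPUT = {
    # device/backend: (prefill tokens/s, decode tokens/s)
    "Samsung S26 Ultra GPU": (3808, 52),
    "Qualcomm Dragonwing IQ8 NPU": (3700, 31),
    "Raspberry Pi 5 CPU": (133, 7.6),
}


def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token_s, total_s), ignoring load and sampling overhead."""
    ttft = prompt_tokens / prefill_tps
    total = ttft + output_tokens / decode_tps
    return ttft, total


if __name__ == "__main__":
    for name, (prefill, decode) in THROUGHPUT.items():
        ttft, total = estimate_latency(4000, 200, prefill, decode)
        print(f"{name}: first token after {ttft:.1f}s, done after {total:.1f}s")
```

With the Raspberry Pi 5 figures, prefilling a 4,000-token prompt alone takes about 30 seconds, which is the practical meaning of "not interactive-chat latency at longer contexts."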

NPU deployment requires device-specific QNN pre-compiled binaries (TF_LITE_AUX), and the runtime implements a fallback chain, NPU → GPU (OpenCL) → CPU (XNNPack), in case a delegate fails. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN — Google Developer Experts / Medium) Android 15 Scoped Storage constraints add a deployment wrinkle: large model files must be placed in an external data directory for ADB-push workflows. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN — Google Developer Experts / Medium)
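The fallback behaviour can be pictured as an ordered list of backends tried in turn. The following is a minimal sketch of that pattern only: every name in it (`DelegateError`, the `init_*` functions, `load_with_fallback`, the model path) is hypothetical and does not correspond to LiteRT-LM's actual API.

```python
# Illustrative delegate fallback chain: try NPU, then GPU, then CPU.
# All names here are hypothetical stand-ins, not LiteRT-LM's real API.

class DelegateError(RuntimeError):
    """Raised when a backend cannot be initialised on this device."""


def init_npu(model_path):
    # Stand-in: pretend the device has no matching pre-compiled QNN binary.
    raise DelegateError("no pre-compiled QNN binary for this SoC")


def init_gpu(model_path):
    # Stand-in for a successful OpenCL delegate initialisation.
    return f"gpu-session({model_path})"


def init_cpu(model_path):
    # Stand-in for the XNNPack CPU path.
    return f"cpu-session({model_path})"


def load_with_fallback(model_path, chain):
    """Return (backend_name, session) for the first backend that initialises."""
    errors = []
    for name, init in chain:
        try:
            return name, init(model_path)
        except DelegateError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


if __name__ == "__main__":
    chain = [("npu", init_npu), ("gpu", init_gpu), ("cpu", init_cpu)]
    backend, session = load_with_fallback("model.bin", chain)
    print(backend, session)
```

Returning the name of the backend that was actually selected is a deliberate choice here: a silent drop from NPU to CPU changes throughput substantially, so it is worth logging.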

The MTP Head Decision: How Google Engineered a Performance Asymmetry Into an ‘Open’ Release

Multi-Token Prediction (MTP) allows a model to predict multiple future tokens simultaneously rather than one at a time, accelerating inference without degrading output quality. At launch, Google included MTP prediction heads in Gemma 4’s LiteRT-exported model weights but omitted them from the Apache 2.0 HuggingFace release. (Gemma 4 Was Released Without MTP Data — FlowHunt)

At launch, the only published MTP speedup figure was 1.8x from DeepSeek V3 benchmarks; Google had not published Gemma 4-specific MTP throughput numbers. On May 5, 2026, Google released standalone MTP drafter models for all four Gemma 4 variants and published framework-specific performance data. The Gemma 4 31B drafter is a 0.5B-parameter model documented at approximately 2x decoding speedup, with a claimed ceiling of 3x on specific hardware configurations. (Accelerating Gemma 4: faster inference with multi-token prediction drafters — Google Blog) The core of the launch-day argument still holds: MTP acceleration requires the prediction heads, and the gap now sits specifically between llama.cpp and the rest of the ecosystem.
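The drafter arrangement follows the general draft-and-verify pattern of speculative decoding: a small model proposes several tokens, and the large model checks them in one pass and keeps only the prefix it agrees with. The toy sketch below illustrates that pattern under greedy decoding. It uses deterministic stand-in functions (`target_next` and `draft_next`, both hypothetical) rather than Gemma 4 or its drafter, and is meant to show why the output matches plain decoding, not how Google implements MTP.

```python
# Toy greedy draft-and-verify loop (speculative decoding).
# The "models" are deterministic stand-in functions, not Gemma 4 or its drafter.

def target_next(tokens):
    """Stand-in for the large model: next token is a fixed function of context."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Stand-in for the small drafter: agrees with the target most of the time."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def greedy_decode(prompt, n_new):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, keep the prefix the target agrees with, add one target token."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. Drafter proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. Target checks every drafted position. A real implementation scores
        #    all k positions in one batched forward pass; we count it as one.
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: use the target's token
                break
            accepted.append(tok)
        else:
            # All k drafts accepted: the same pass also yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(prompt, 20)
    fast, passes = speculative_decode(prompt, 20)
    print("identical output:", fast == baseline)
    print("target passes:", passes, "vs", 20, "for plain decoding")
```

Because every kept token is one the target model would have produced anyway, the speedup comes entirely from needing fewer target passes, and it shrinks as the drafter's guesses get worse.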

At launch, vLLM, llama.cpp, and SGLang could not implement MTP-based speculative decoding for Gemma 4 because the necessary model components were absent from the publicly distributed weights. As of May 5, 2026, vLLM, SGLang, Transformers, MLX, and Ollama all support the official drafters. llama.cpp cannot: feature request #22747 was filed May 6, 2026, with no implementation linked. (Feature Request: support MTP drafters · llama.cpp #22747)

Without MTP, Gemma 4 31B achieves approximately 11 tokens/second on consumer GPUs under llama.cpp. Community benchmarks put comparable models, including Qwen3 Coder at more than twice the parameter count, at 50+ tokens/second on identical hardware. (Gemma 4 Was Released Without MTP Data — FlowHunt)

llama.cpp’s Position: Day-One GGUF Support, the Speed Penalty, and the EAGLE3 Workaround

llama.cpp shipped GGUF support for all four Gemma 4 variants on launch day, which matters for cross-platform portability. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers) The 26B A4B MoE variant fits in approximately 18GB RAM at Q4_K_M quantization under llama.cpp. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers)
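A quick way to sanity-check a memory figure like this is to multiply parameter count by bits per weight. The sketch below does that arithmetic; the value of roughly 4.85 bits per weight for Q4_K_M is an assumed approximate average, not a number from the sources cited here, and actual GGUF files vary with the mix of tensor types.

```python
# Back-of-envelope weight memory for a quantized model.
# ASSUMPTION: ~4.85 bits per weight for Q4_K_M is an approximate average,
# not a figure from the article; real GGUF files vary by tensor mix.

def weight_gb(params_billions, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    for label, bpw in [("16-bit", 16.0), ("8-bit", 8.0), ("~Q4_K_M (assumed)", 4.85)]:
        print(f"26B at {label}: {weight_gb(26, bpw):.1f} GB")
```

Under that assumed bit width the weights alone come to about 15.8 GB, which is consistent with the roughly 18GB figure once KV cache and runtime buffers are added.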

The community response to the MTP gap at launch was a 277MB EAGLE3 draft head trained on Gemma 4’s tokenizer, exploiting shared tokenizer architecture across Gemma 4 variant sizes to enable traditional speculative decoding. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers) The EAGLE3 workaround achieves approximately 2x speedup. Google’s official MTP drafters, released May 5, 2026, supersede EAGLE3 for vLLM, SGLang, and Transformers-based workflows. The official drafters are jointly trained with the base model weights and EAGLE3 is not, so EAGLE3’s roughly 2x speedup does not come with the quality guarantee of integrated MTP. For llama.cpp specifically, EAGLE3 remains the only available option; the framework has not implemented Gemma 4 MTP drafter support. (Feature Request: support MTP drafters · llama.cpp #22747)

Community testing at launch flagged missing HuggingFace Transformers architecture support, PEFT incompatibility with new layer types, and stability problems on Apple Silicon — characterizing the release as incomplete outside Google’s toolchain. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers) These issues illustrate that GGUF support on day one and cross-framework inference readiness are not the same claim.

The Lock-In Mechanics: What Adopting LiteRT-LM Means for Your Inference Stack

The technical lock-in is layered.

At the model layer: maximum Gemma 4 throughput on mobile NPU still requires LiteRT-converted model files. Google’s May 5, 2026 release of standalone MTP drafters on HuggingFace resolves the framework-portability half of this constraint: vLLM, SGLang, Transformers, MLX, and Ollama users can now run MTP-accelerated Gemma 4 inference without LiteRT. llama.cpp cannot access these drafters yet.

At the runtime layer: NPU acceleration on Qualcomm hardware requires device-specific QNN pre-compiled binaries tied to LiteRT’s delegate architecture. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN — Google Developer Experts / Medium)

At the API layer: Kotlin is stable for Android, Python is stable for server/desktop, but Swift remains in development, so any team building iOS production apps in native Swift cannot ship on LiteRT-LM today without wrapping the C++ library. (LiteRT-LM Overview — Google AI for Developers)

The pattern is this: a vendor releases a high-quality open-weight model under a permissive license, distributes a version of those weights missing a performance-critical component, and routes full capability through its own runtime. Google partially reversed the MTP component gap on May 5 by opening access to Transformers-compatible frameworks. llama.cpp remains without a path to native MTP support.

Decision Framework: When to Use LiteRT-LM vs. llama.cpp vs. ONNX Runtime for Edge Deployment

Teams should choose based on which constraint is load-bearing for their deployment.

LiteRT-LM is the clear choice if: your target hardware is Android with Qualcomm Snapdragon 8 Elite, you are deploying Gemma 4 specifically, and maximum throughput per watt matters more than cross-framework portability. The NPU path — Qualcomm Dragonwing IQ8 at 3,700 prefill tokens/second (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog) — is not matched by any current llama.cpp configuration on equivalent hardware. If you are on Raspberry Pi and running batch workloads where 7.6 decode tokens/second is acceptable, LiteRT-LM also gives you a tested, supported path. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog)

llama.cpp remains the right choice if: your deployment spans multiple hardware backends without a dominant Android/Qualcomm bias, you need HuggingFace ecosystem compatibility (fine-tuning, PEFT, Transformers-based tooling), or your team’s operational model depends on not being tied to a single vendor’s model distribution pipeline. The EAGLE3 workaround recovers a substantial fraction of MTP’s benefit for llama.cpp users (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers), though users of vLLM or SGLang now have access to Google’s official MTP drafters at comparable speedup with a first-party quality guarantee. llama.cpp’s cross-platform record is more established than LiteRT-LM’s for non-Android targets.

ONNX Runtime is relevant if your organization already runs ONNX-based inference infrastructure, but no Gemma 4-specific performance data is available to compare.

The decision also depends on timing. LiteRT-LM v0.11.0 (May 7, 2026) extended the framework with native Gemma 4 MTP and Windows CLI support. (LiteRT-LM Releases) Google’s May 5 MTP drafter release narrowed the performance gap for vLLM and SGLang users substantially. The Swift API gap remains open. The remaining asymmetry is now specifically between llama.cpp and the rest of the ecosystem: whether llama.cpp ships native Gemma 4 MTP drafter support will determine whether the gap closes or becomes a durable feature of the llama.cpp deployment profile.

Frequently Asked Questions

Does LiteRT-LM’s Qualcomm NPU path work on older Snapdragon devices, or only the Dragonwing IQ8?

NPU acceleration via QNN delegate targets the Hexagon NPU on Snapdragon 8 Elite specifically — older Snapdragon generations lack the same Hexagon configuration and require device-specific QNN pre-compiled binaries that Google has not published for legacy silicon. Teams deploying to a mixed device fleet will hit the runtime fallback chain (NPU → GPU → CPU) on non-Elite hardware, meaning the 3,700 tk/s prefill figure is only reproducible on Snapdragon 8 Elite or the Dragonwing IQ8 reference board.

How does the EAGLE3 workaround’s quality compare to Google’s native MTP implementation?

EAGLE3 is a community-trained 277MB draft head, and the training process introduces its own quality tradeoffs absent from Google’s first-party MTP implementation — the draft head approximates the target model’s distribution but is not jointly trained with Gemma 4, so acceptance rates and output distribution can diverge from what integrated MTP would produce. Google’s official MTP drafters, released May 5, 2026, are co-trained with the base model weights and carry a published quality-equivalence guarantee that EAGLE3 cannot replicate. For llama.cpp users, EAGLE3 remains the only available path; for vLLM and SGLang users, the official drafters are preferable.
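The link between acceptance rate and speedup can be made concrete with a standard simplified model of draft-and-verify decoding, in which each drafted token is assumed to be accepted independently with probability alpha. The sketch below computes the expected tokens per target-model pass under that assumption. The alpha values are illustrative only and are not measurements of EAGLE3 or of Google's drafters.

```python
# Expected tokens produced per target-model pass in draft-and-verify decoding,
# under the simplifying assumption that each drafted token is accepted
# independently with probability alpha. The alpha values used below are
# illustrative, not measurements of EAGLE3 or of Google's drafters.

def expected_tokens_per_pass(alpha, k):
    """k drafted tokens per pass. Returns the expected number of tokens kept,
    including the one token the target model always contributes.
    Equal to the geometric sum 1 + alpha + alpha**2 + ... + alpha**k."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.8, 0.9):
        print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, 4):.2f} tokens/pass")
```

Under this model, four drafted tokens yield about 1.9 tokens per pass at an acceptance rate of 0.5 and about 3.4 at 0.8. Wall-clock speedup is lower than tokens per pass, because running the drafter and verifying its output both cost time.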

What does it actually cost a team to switch from llama.cpp to LiteRT-LM mid-project?

Beyond code changes to swap inference APIs, teams must acquire LiteRT-exported model files separately from the Apache 2.0 HuggingFace weights — the NPU-optimized weights are only available through Google’s LiteRT export pipeline, not as a drop-in download. Any fine-tuned adapters built with PEFT against the HuggingFace weights are currently incompatible with LiteRT’s layer types, so custom fine-tunes cannot be ported without retraining against LiteRT-compatible model artifacts.

Is Gemma 4’s MoE variant practical to run outside LiteRT-LM given PEFT and architecture gaps?

The 26B A4B MoE variant fits in roughly 18GB RAM at Q4_K_M under llama.cpp, but the HuggingFace Transformers architecture definition for Gemma 4’s MoE layer types was missing at launch — meaning Transformers-based workflows (training, evaluation, adapter attachment) could not load the model at all in the initial release window. PEFT incompatibility compounds this: teams that planned to fine-tune the MoE variant with LoRA adapters cannot attach them to the new layer types without upstream library updates, making the 26B variant essentially inference-only outside Google’s toolchain for now.

What would cause a team currently committed to LiteRT-LM to reconsider that choice?

Google released the MTP drafters to HuggingFace on May 5, 2026, available for Transformers, vLLM, SGLang, MLX, and Ollama. For teams already on those frameworks, the primary performance reason to accept LiteRT lock-in is substantially weaker than it was at this article’s original publication. The remaining trigger is llama.cpp shipping native Gemma 4 MTP drafter support: feature request #22747 is open as of this writing with no implementation linked. (Feature Request: support MTP drafters · llama.cpp #22747) A secondary trigger remains the Swift API reaching stable status — once iOS teams can ship production native Swift apps, the platform coverage argument for LiteRT-LM strengthens considerably, but until then iOS deployments require wrapping the C++ library, which adds maintenance surface that may tip borderline decisions toward ONNX Runtime.
