Google shipped LiteRT-LM v0.10.1 on April 3, 2026 — one day after Gemma 4’s public release — with full Gemma 4 support, a rewritten developer CLI, and Qualcomm NPU acceleration. (LiteRT-LM Releases) Alongside it, Google made a quiet distribution decision that transforms an inference framework update into an infrastructure power move: MTP prediction heads are present in LiteRT-exported Gemma 4 weights and absent from the public HuggingFace release. (Gemma 4 Was Released Without MTP Data — FlowHunt) Teams evaluating edge AI face a concrete cost for staying framework-agnostic.

What LiteRT-LM v0.10.1 Actually Ships

The CLI migrated from Python Fire to Click, adding --verbose and --version flags and enabling zero-code model experimentation, including tool calling for agentic workflows. (LiteRT-LM Releases) HuggingFace direct model import removes a manual conversion step that previously required offline tooling. Speculative decoding is now supported at the framework level. The CMake build system was refactored to support Android cross-compilation.

Gemma 4 itself launched April 2 under Apache 2.0 in four configurations: E2B (edge, 2B parameters, 128K context, multimodal text/image/audio), E4B (edge, 4B parameters, 128K context), 26B A4B MoE (256K context), and 31B Dense (256K context). (Google Pushes Multimodal AI Further Onto Edge Devices with Gemma 4 — Edge AI and Vision Alliance)

The CLI runs on Linux, macOS, and Raspberry Pi. Python and Kotlin (Android) APIs are stable. The Swift API is “In Development”, so iOS teams cannot ship production native Swift apps from this release. (LiteRT-LM Overview — Google AI for Developers)

The Benchmark Numbers: LiteRT-LM Performance on Mobile, Raspberry Pi, and Qualcomm NPU

For Gemma 4 E2B, Google’s published figures on the Qualcomm Dragonwing IQ8 NPU reach 3,700 prefill tokens/second and 31 decode tokens/second. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog) On a Raspberry Pi 5 CPU, those numbers drop to 133 prefill tokens/second and 7.6 decode tokens/second — usable for embedded pipelines, though not interactive-chat latency at longer contexts. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog)
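These rates translate directly into user-visible latency. A back-of-envelope sketch using only the published figures; the 1,000-token prompt and 200-token reply are illustrative workloads, not benchmarked ones:

```python
# Back-of-envelope latency from the published Gemma 4 E2B figures.
# The prefill/decode rates are the source's numbers; the workload
# sizes (1,000-token prompt, 200-token reply) are illustrative.

def latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    # Time to first token is prompt/prefill; total adds output/decode.
    ttft = prompt_tokens / prefill_tps
    return ttft, ttft + output_tokens / decode_tps

# Qualcomm Dragonwing IQ8 NPU: 3,700 prefill, 31 decode tokens/s
npu_ttft, npu_total = latency_s(1000, 200, 3700, 31)

# Raspberry Pi 5 CPU: 133 prefill, 7.6 decode tokens/s
pi_ttft, pi_total = latency_s(1000, 200, 133, 7.6)

print(f"NPU: first token {npu_ttft:.2f}s, full reply {npu_total:.1f}s")
print(f"Pi5: first token {pi_ttft:.2f}s, full reply {pi_total:.1f}s")
```

On these assumptions the NPU answers in under seven seconds while the Pi takes over half a minute, most of it spent waiting for the first token, which is the sense in which the Pi numbers suit batch pipelines rather than interactive chat.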

The device-level breakdown from official documentation (LiteRT-LM Overview — Google AI for Developers):

| Device | Backend | Prefill (tk/s) | Decode (tk/s) |
| --- | --- | --- | --- |
| Samsung S26 Ultra | CPU | 557 | 47 |
| Samsung S26 Ultra | GPU | 3,808 | 52 |
| iPhone 17 Pro | CPU | 532 | 25 |
| iPhone 17 Pro | GPU | 2,878 | 56 |
| MacBook Pro M4 | GPU | 7,835 | 160 |
| Qualcomm Dragonwing IQ8 | NPU | 3,700 | 31 |
| Raspberry Pi 5 | CPU | 133 | 7.6 |

Two caveats about these numbers: first, the Samsung S26 Ultra, iPhone 17 Pro, and Qualcomm Dragonwing IQ8 are 2026 flagship devices not yet in widespread developer hands — real-world figures on 2024-era hardware will be lower. Second, these are E2B benchmarks; E2B is the smallest Gemma 4 variant, and the larger configurations will not match these rates. The E2B model fits under 1.5GB with 2-bit and 4-bit quantization and memory-mapped embeddings, and processes 4,000 input tokens across two tasks in under 3 seconds on GPU. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog)
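The sub-1.5GB claim is consistent with simple arithmetic. The 2-bit/4-bit split below is an illustrative assumption; Google has not published the per-tensor allocation:

```python
# Rough resident-memory estimate for a 2B-parameter model under mixed
# 2-bit/4-bit quantization. The split is an assumption for illustration;
# memory-mapped embeddings add little resident overhead on top of this.

PARAMS = 2e9  # Gemma 4 E2B parameter count

def footprint_gb(frac_4bit):
    # 4-bit weights cost 0.5 bytes each, 2-bit weights 0.25 bytes.
    bytes_total = PARAMS * (frac_4bit * 0.5 + (1 - frac_4bit) * 0.25)
    return bytes_total / 1e9

print(f"all 4-bit : {footprint_gb(1.0):.2f} GB")  # upper bound
print(f"50/50 mix : {footprint_gb(0.5):.2f} GB")
print(f"all 2-bit : {footprint_gb(0.0):.2f} GB")
```

Even the all-4-bit upper bound (1.0 GB) sits comfortably under the stated 1.5GB budget before quantization-scale metadata and runtime buffers are counted.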

Deployment requires device-specific QNN pre-compiled binaries (TF_LITE_AUX), and the runtime implements a fallback chain — NPU → GPU (OpenCL) → CPU (XNNPack) — in case a delegate fails. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN — Google Developer Experts / Medium) Android 15 Scoped Storage constraints add a deployment wrinkle: large model files must be placed in an external data directory for ADB-push workflows. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN — Google Developer Experts / Medium)
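The fallback chain can be sketched as follows. The function, class, and engine names are hypothetical stand-ins rather than LiteRT-LM's API; only the NPU → GPU → CPU ordering comes from the documented behavior:

```python
# Illustrative sketch of a delegate fallback chain (NPU -> GPU via
# OpenCL -> CPU via XNNPack). All names here are hypothetical stand-ins,
# not LiteRT-LM's API; only the fallback ordering is from the article.

class DelegateUnavailable(Exception):
    """Raised when a backend's delegate cannot be created on this device."""

def _try_create(backend, model_path, device_caps):
    # Stand-in for delegate creation: fails when the hardware or the
    # device-specific pre-compiled binaries are missing.
    if backend not in device_caps:
        raise DelegateUnavailable(f"{backend}: no delegate on this device")
    return f"engine({backend}, {model_path})"

def load_with_fallback(model_path, device_caps):
    errors = []
    for backend in ("npu/qnn", "gpu/opencl", "cpu/xnnpack"):
        try:
            return _try_create(backend, model_path, device_caps)
        except DelegateUnavailable as exc:
            errors.append(str(exc))
    raise RuntimeError("no usable backend: " + "; ".join(errors))

# A device without QNN binaries silently lands on the GPU path:
engine = load_with_fallback("gemma4-e2b.model", {"gpu/opencl", "cpu/xnnpack"})
print(engine)
```

The operational consequence is that a fleet-wide deployment does not fail on non-Elite hardware; it quietly runs slower, which is worth surfacing in telemetry.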

The MTP Head Decision: How Google Engineered a Performance Asymmetry Into an ‘Open’ Release

Multi-Token Prediction (MTP) allows a model to predict multiple future tokens simultaneously rather than one at a time, accelerating inference without degrading output quality. Google included MTP prediction heads in Gemma 4’s LiteRT-exported model weights but omitted them from the Apache 2.0 HuggingFace release. (Gemma 4 Was Released Without MTP Data — FlowHunt)
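In accelerated inference, extra prediction heads typically play the draft role in a draft-and-verify loop. The miniature sketch below shows that loop with deterministic toy functions standing in for the models; nothing here reflects Gemma 4's actual head architecture:

```python
# Toy draft-and-verify loop of the kind MTP heads accelerate.
# target_next and draft_k are deterministic toy stand-ins for the base
# model and the prediction heads; tokens are just small integers.

def target_next(context):
    # Toy "base model": the next token is the context sum mod 10.
    return sum(context) % 10

def draft_k(context, k):
    # Toy "draft heads": agree with the target except on the last guess.
    out, ctx = [], list(context)
    for i in range(k):
        tok = target_next(ctx) if i < k - 1 else 0
        out.append(tok)
        ctx.append(tok)
    return out

def speculative_step(context, k=4):
    # Draft k tokens cheaply, verify them in one target pass, and keep
    # the longest matching prefix plus the target's correction on a miss.
    drafted = draft_k(context, k)
    accepted, ctx = [], list(context)
    for tok in drafted:
        correct = target_next(ctx)
        if tok != correct:
            accepted.append(correct)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Multiple tokens emitted for a single verification pass:
print(speculative_step([3, 1, 4]))
```

The more often the draft agrees with the target, the more tokens each verification pass yields; that agreement rate is exactly what removing the co-trained heads degrades.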

The 1.8x speedup figure cited in analysis of MTP acceleration originates from DeepSeek V3 benchmarks, not Gemma 4 specifically — Google has not published Gemma 4-specific MTP throughput numbers as of this writing. (Gemma 4 Was Released Without MTP Data — FlowHunt) The directional claim holds: MTP acceleration requires the prediction heads, the public weights lack them, and alternative inference frameworks cannot replicate the feature without them.

vLLM, llama.cpp, and SGLang cannot implement MTP-based speculative decoding for Gemma 4 because the necessary model components are not in the publicly distributed weights. (Gemma 4 Was Released Without MTP Data — FlowHunt) The model is technically open-source under Apache 2.0. The weights that unlock its full inference performance are not available outside Google’s toolchain.

Without MTP, Gemma 4 31B achieves approximately 11 tokens/second on consumer GPUs under llama.cpp. Community benchmarks put comparable models — Qwen3 Coder at more than twice the parameter count — at 50+ tokens/second on identical hardware. (Gemma 4 Was Released Without MTP Data — FlowHunt)

llama.cpp’s Position: Day-One GGUF Support, the Speed Penalty, and the EAGLE3 Workaround

llama.cpp shipped GGUF support for all four Gemma 4 variants on launch day, which matters for cross-platform portability. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers) The 26B A4B MoE variant fits in approximately 18GB RAM at Q4_K_M quantization under llama.cpp. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers)

The community response to the MTP gap was a 277MB EAGLE3 draft head trained on Gemma 4’s tokenizer, exploiting the shared tokenizer architecture across Gemma 4 variant sizes to enable traditional speculative decoding. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers) The EAGLE3 workaround achieves approximately 2x speedup, but it is a community artifact, not a first-party capability, and it is not equivalent to integrated MTP: the draft-head training process introduces quality tradeoffs that Google’s native implementation avoids. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers)

Community testing at launch flagged missing HuggingFace Transformers architecture support, PEFT incompatibility with new layer types, and stability problems on Apple Silicon — characterizing the release as incomplete outside Google’s toolchain. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers) These issues illustrate that GGUF support on day one and cross-framework inference readiness are not the same claim.

The Lock-In Mechanics: What Adopting LiteRT-LM Means for Your Inference Stack

The technical lock-in is layered. At the model layer: maximum Gemma 4 throughput requires LiteRT-converted model files, not the Apache 2.0 HuggingFace weights. At the runtime layer: NPU acceleration on Qualcomm hardware requires device-specific QNN pre-compiled binaries tied to LiteRT’s delegate architecture. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN — Google Developer Experts / Medium) At the API layer: Kotlin is stable for Android, Python is stable for server/desktop, but Swift remains in development — so any team building iOS production apps in native Swift cannot ship on LiteRT-LM today without wrapping the C++ library. (LiteRT-LM Overview — Google AI for Developers)

The pattern is worth stating plainly: a vendor releases a high-quality open-weight model under a permissive license, distributes a version of those weights missing a performance-critical component, and routes full capability through its own runtime.

Decision Framework: When to Use LiteRT-LM vs. llama.cpp vs. ONNX Runtime for Edge Deployment

Teams should choose based on which constraint is load-bearing for their deployment.

LiteRT-LM is the clear choice if: your target hardware is Android with Qualcomm Snapdragon 8 Elite, you are deploying Gemma 4 specifically, and maximum throughput per watt matters more than cross-framework portability. The NPU path — Qualcomm Dragonwing IQ8 at 3,700 prefill tokens/second (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog) — is not matched by any current llama.cpp configuration on equivalent hardware. If you are on Raspberry Pi and running batch workloads where 7.6 decode tokens/second is acceptable, LiteRT-LM also gives you a tested, supported path. (Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog)

llama.cpp remains the right choice if: your deployment spans multiple hardware backends without a dominant Android/Qualcomm bias, you need HuggingFace ecosystem compatibility (fine-tuning, PEFT, Transformers-based tooling), or your team’s operational model depends on not being tied to a single vendor’s model distribution pipeline. The EAGLE3 workaround recovers a substantial fraction of MTP’s benefit (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers), and llama.cpp’s cross-platform record is more established than LiteRT-LM’s for non-Android targets.

ONNX Runtime is relevant if your organization already runs ONNX-based inference infrastructure, but no Gemma 4-specific performance data has been published for comparison.

The decision also depends on timing: Google is iterating quickly — v0.10.2 arrived eleven days after v0.10.1. (LiteRT-LM Releases) The Swift API gap will likely close. Whether Google releases the MTP heads for public HuggingFace weights will determine if the performance asymmetry is temporary or a durable feature of the Gemma ecosystem.

Frequently Asked Questions

Does LiteRT-LM’s Qualcomm NPU path work on older Snapdragon devices, or only the Dragonwing IQ8?

NPU acceleration via QNN delegate targets the Hexagon NPU on Snapdragon 8 Elite specifically — older Snapdragon generations lack the same Hexagon configuration and require device-specific QNN pre-compiled binaries that Google has not published for legacy silicon. Teams deploying to a mixed device fleet will hit the runtime fallback chain (NPU → GPU → CPU) on non-Elite hardware, meaning the 3,700 tk/s prefill figure is only reproducible on Snapdragon 8 Elite or the Dragonwing IQ8 reference board.

How does the EAGLE3 workaround’s quality compare to Google’s native MTP implementation?

EAGLE3 is a community-trained 277MB draft head, and the training process introduces its own quality tradeoffs absent from Google’s first-party MTP implementation — the draft head approximates the target model’s distribution but is not jointly trained with Gemma 4, so acceptance rates and output distribution can diverge from what integrated MTP would produce. Google’s native MTP heads are co-trained with the base model weights, which is why the 1.8x speedup (derived from DeepSeek V3 data) can be achieved with zero quality loss — a guarantee the EAGLE3 workaround cannot replicate.
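The acceptance-rate point can be made quantitative. Under the standard speculative decoding analysis (Leviathan et al., 2023), with an i.i.d. per-token acceptance rate alpha and draft length gamma, the expected tokens emitted per target verification pass is (1 − alpha^(gamma+1)) / (1 − alpha); the sketch below ignores draft-head compute, so the numbers are upper bounds:

```python
# Expected tokens generated per target-model verification pass in
# standard speculative decoding (Leviathan et al., 2023), assuming an
# i.i.d. per-token acceptance rate `alpha` and draft length `gamma`.
# Draft-head compute is ignored, so these are optimistic upper bounds.

def expected_tokens_per_pass(alpha, gamma):
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(f"alpha={alpha}: ~{expected_tokens_per_pass(alpha, 4):.2f} tokens/pass")
```

A draft whose distribution diverges from the target (a separately trained EAGLE3 head) lowers alpha, and the expected yield per pass drops with it; a co-trained head keeps alpha high, which is the mechanical reason the two are not equivalent.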

What does it actually cost a team to switch from llama.cpp to LiteRT-LM mid-project?

Beyond code changes to swap inference APIs, teams must acquire LiteRT-exported model files separately from the Apache 2.0 HuggingFace weights — the MTP-capable weights are only available through Google’s LiteRT export pipeline, not as a drop-in download. Any fine-tuned adapters built with PEFT against the HuggingFace weights are currently incompatible with LiteRT’s layer types, so custom fine-tunes cannot be ported without retraining against LiteRT-compatible model artifacts.

Is Gemma 4’s MoE variant practical to run outside LiteRT-LM given PEFT and architecture gaps?

The 26B A4B MoE variant fits in roughly 18GB RAM at Q4_K_M under llama.cpp, but the HuggingFace Transformers architecture definition for Gemma 4’s MoE layer types was missing at launch — meaning Transformers-based workflows (training, evaluation, adapter attachment) could not load the model at all in the initial release window. PEFT incompatibility compounds this: teams that planned to fine-tune the MoE variant with LoRA adapters cannot attach them to the new layer types without upstream library updates, making the 26B variant essentially inference-only outside Google’s toolchain for now.
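The ~18GB figure passes a rough sanity check. The sketch below assumes the commonly cited ~4.8 effective bits per weight for Q4_K_M; the true per-tensor mix varies by model, and KV cache plus runtime buffers account for the remainder:

```python
# Sanity check on the ~18 GB figure for the 26B MoE at Q4_K_M.
# ~4.8 bits/weight is a commonly cited effective rate for this quant
# type; the exact per-tensor allocation differs between models.

PARAMS = 26e9
BITS_PER_WEIGHT = 4.8

weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9
print(f"weights alone: {weights_gb:.1f} GB")  # KV cache and buffers add the rest
```

Roughly 15.6 GB of weights plus a 256K-capable KV cache and working buffers lands in the vicinity of the reported 18 GB, so the community figure is plausible rather than surprising.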

What would cause a team currently committed to LiteRT-LM to reconsider that choice within the next six months?

Two events would force a reassessment: Google releasing the MTP heads to HuggingFace (eliminating the performance asymmetry and removing the primary reason to accept LiteRT lock-in), or llama.cpp shipping native MTP support via a separately distributed draft head that Google officially sanctions. A secondary trigger is the Swift API reaching stable status — once iOS teams can ship production native Swift apps, the platform coverage argument for LiteRT-LM strengthens considerably, but until then iOS deployments require wrapping the C++ library, which adds maintenance surface that may tip borderline decisions toward ONNX Runtime.

Sources

  1. LiteRT-LM Releases (community), accessed 2026-04-23
  2. Gemma 4 Was Released Without MTP Data — FlowHunt (analysis), accessed 2026-04-23
  3. Bring state-of-the-art agentic skills to the edge with Gemma 4 — Google Developers Blog (vendor), accessed 2026-04-23
  4. LiteRT-LM Overview — Google AI for Developers (vendor), accessed 2026-04-23
  5. Google Pushes Multimodal AI Further Onto Edge Devices with Gemma 4 — Edge AI and Vision Alliance (analysis), accessed 2026-04-23
  6. Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN — Google Developer Experts / Medium (community), accessed 2026-04-23
  7. Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most — XDA Developers (community), accessed 2026-04-23
