Google shipped LiteRT-LM v0.10.1 on April 3, 2026 (one day after Gemma 4’s public release) with full Gemma 4 support, a rewritten developer CLI, and Qualcomm NPU acceleration. (LiteRT-LM Releases) Alongside it, Google made a quiet distribution decision: MTP prediction heads were present in LiteRT-exported Gemma 4 weights and absent from the public HuggingFace release at launch. (Gemma 4 Was Released Without MTP Data, FlowHunt) Google corrected the omission on May 5, 2026, shipping standalone MTP drafter models for all four Gemma 4 variants under Apache 2.0, with Transformers, vLLM, SGLang, MLX, and Ollama all adding support on day one. (Accelerating Gemma 4: faster inference with multi-token prediction drafters, Google Blog) Teams deploying on llama.cpp were left without MTP support for roughly another month. [Updated June 2026] llama.cpp has since closed the gap: PR #23398, authored by am17an, merged to master on June 7, 2026 and shipped in build b9549, adding native Gemma 4 MTP support for the 31B dense and 26B A4B variants, with the E2B and E4B assistant drafters following in PR #24282. Feature request #22747 is now closed. The asymmetry this article was built on has largely resolved, and the body below preserves the original analysis with inline updates marking what changed.
What LiteRT-LM v0.10.1 Actually Ships
The CLI migrated from fire to click, adding --verbose and --version flags and zero-code model experimentation with tool calling for agentic workflows. (LiteRT-LM Releases) HuggingFace direct model import removes a manual conversion step that previously required offline tooling. Speculative decoding is now supported at the framework level. The CMake build system was refactored to support Android cross-compilation.
Gemma 4 itself launched April 2 under Apache 2.0 in four configurations: E2B (edge, 2B parameters, 128K context, multimodal text/image/audio), E4B (edge, 4B parameters, 128K context), 26B A4B MoE (256K context), and 31B Dense (256K context). (Google Pushes Multimodal AI Further Onto Edge Devices with Gemma 4, Edge AI and Vision Alliance)
The CLI runs on Linux, macOS, and Raspberry Pi. Python and Kotlin (Android) APIs are stable. At v0.10.1 the Swift API was “In Development”, so iOS teams could not ship production native Swift apps from that release. (LiteRT-LM Overview, Google AI for Developers) [Updated June 2026] That gap has since closed: v0.12.0 (May 18, 2026) shipped native Swift APIs with Metal GPU acceleration for iOS, and v0.13.0 (June 2, 2026) extended the Swift package to macOS. (Release v0.12.0, google-ai-edge/LiteRT-LM)
LiteRT-LM v0.11.0, released May 7, 2026, added two notable capabilities to this base: native Gemma 4 Single Position MTP (delivering over 2x faster decode speeds on mobile GPUs with no quality degradation) and Windows CLI support with CPU and GPU backends. (LiteRT-LM Releases) [Updated June 2026] The framework has shipped four more releases since. v0.12.0 (May 18) added the native Swift and iOS path noted above, a Web JavaScript API running models in-browser over WebGPU or CPU, NPU backend support in the CLI across Linux, macOS, and Windows via Intel OpenVINO, and community-maintained Flutter APIs. v0.13.0 (June 2) added agent-skill support to the Android demo app, an OpenAI API-compatible server in the CLI, and the macOS Swift package. The current release is v0.13.1 (June 3), a bug-fix follow-up to 0.13.0. (LiteRT-LM Releases)
The Benchmark Numbers: LiteRT-LM Performance on Mobile, Raspberry Pi, and Qualcomm NPU
For Gemma 4 E2B, Google’s published figures on the Qualcomm Dragonwing IQ8 NPU reach 3,700 prefill tokens/second and 31 decode tokens/second. (Bring state-of-the-art agentic skills to the edge with Gemma 4, Google Developers Blog) On a Raspberry Pi 5 CPU, those numbers drop to 133 prefill tokens/second and 7.6 decode tokens/second, usable for embedded pipelines though not interactive-chat latency at longer contexts. (Bring state-of-the-art agentic skills to the edge with Gemma 4, Google Developers Blog)
The device-level breakdown from official documentation (LiteRT-LM Overview, Google AI for Developers):
| Device | Backend | Prefill (tk/s) | Decode (tk/s) |
|---|---|---|---|
| Samsung S26 Ultra | CPU | 557 | 47 |
| Samsung S26 Ultra | GPU | 3,808 | 52 |
| iPhone 17 Pro | CPU | 532 | 25 |
| iPhone 17 Pro | GPU | 2,878 | 56 |
| MacBook Pro M4 | GPU | 7,835 | 160 |
| Qualcomm Dragonwing IQ8 | NPU | 3,700 | 31 |
| Raspberry Pi 5 | CPU | 133 | 7.6 |
Two caveats about these numbers. First, the Samsung S26 Ultra, iPhone 17 Pro, and Qualcomm Dragonwing IQ8 are 2026 flagship devices not yet in widespread developer hands, so real-world figures on 2024-era hardware will be lower. Second, these are E2B benchmarks, and the larger Gemma 4 variants should be expected to run slower on the same hardware. The E2B model fits under 1.5GB with 2-bit and 4-bit quantization and memory-mapped embeddings, and processes 4,000 input tokens across two tasks in under 3 seconds on GPU. (Bring state-of-the-art agentic skills to the edge with Gemma 4, Google Developers Blog)
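To make the table easier to reason about, the sketch below turns the published prefill and decode rates into a rough end-to-end latency estimate. It is a back-of-the-envelope model only: the function name `estimate_latency_s` and the simple "prefill time plus decode time" formula are assumptions for illustration, and the estimate ignores model load time, thermal throttling, and context-length effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Throughput:
    prefill_tps: float  # prompt tokens processed per second
    decode_tps: float   # generated tokens per second


# Figures copied from the table above (Gemma 4 E2B).
BENCHMARKS = {
    ("Samsung S26 Ultra", "CPU"): Throughput(557, 47),
    ("Samsung S26 Ultra", "GPU"): Throughput(3808, 52),
    ("iPhone 17 Pro", "CPU"): Throughput(532, 25),
    ("iPhone 17 Pro", "GPU"): Throughput(2878, 56),
    ("MacBook Pro M4", "GPU"): Throughput(7835, 160),
    ("Qualcomm Dragonwing IQ8", "NPU"): Throughput(3700, 31),
    ("Raspberry Pi 5", "CPU"): Throughput(133, 7.6),
}


def estimate_latency_s(device: str, backend: str,
                       prompt_tokens: int, output_tokens: int) -> float:
    """Naive estimate: time to prefill the prompt plus time to decode the output."""
    t = BENCHMARKS[(device, backend)]
    return prompt_tokens / t.prefill_tps + output_tokens / t.decode_tps


if __name__ == "__main__":
    # Hypothetical workload: a 4,000-token prompt and a 64-token reply.
    for (device, backend) in BENCHMARKS:
        seconds = estimate_latency_s(device, backend, 4000, 64)
        print(f"{device:24s} {backend:3s} {seconds:6.1f} s")
```

Under this model, prefilling 4,000 tokens takes about 1.05 seconds on the S26 Ultra GPU and about 30 seconds on the Raspberry Pi 5 CPU, which is the gap behind the "embedded pipelines, not interactive chat" framing.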
Deployment on the Qualcomm NPU path requires device-specific QNN pre-compiled binaries (TF_LITE_AUX), and the runtime implements a fallback chain (NPU → GPU (OpenCL) → CPU (XNNPack)) in case a delegate fails. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN, Google Developer Experts) Android 15 Scoped Storage constraints add a deployment wrinkle: large model files must be placed in an external data directory for ADB-push workflows. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN, Google Developer Experts)
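The fallback chain described above follows a common pattern: try the fastest backend first and drop down on failure. The sketch below illustrates that pattern in plain Python. It is not LiteRT-LM's API: `DelegateError`, `load_with_fallback`, and the three loader functions are hypothetical stand-ins for real delegate initialization.

```python
from typing import Callable, List, Sequence, Tuple


class DelegateError(RuntimeError):
    """Raised by a (hypothetical) backend loader that cannot initialize."""


def load_with_fallback(
    loaders: Sequence[Tuple[str, Callable[[], object]]]
) -> Tuple[str, object]:
    """Try each (name, loader) in order; return the first backend that loads."""
    errors: List[str] = []
    for name, loader in loaders:
        try:
            return name, loader()
        except DelegateError as exc:
            # Record why this backend was skipped, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no backend available: " + "; ".join(errors))


# Hypothetical loaders; a real app would initialize the actual delegates here.
def load_npu() -> object:
    raise DelegateError("no pre-compiled QNN binary for this SoC")


def load_gpu() -> object:
    return "gpu-engine"


def load_cpu() -> object:
    return "cpu-engine"


CHAIN = [("npu", load_npu), ("gpu", load_gpu), ("cpu", load_cpu)]

if __name__ == "__main__":
    backend, engine = load_with_fallback(CHAIN)
    print(f"selected backend: {backend}")
```

Whatever the real mechanism looks like, it is worth logging which backend was actually selected: a silent drop from NPU to CPU changes throughput dramatically without producing an error.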
The MTP Head Decision: How Google Engineered a Performance Asymmetry Into an ‘Open’ Release
Multi-Token Prediction (MTP) allows a model to predict multiple future tokens simultaneously rather than one at a time, accelerating inference without degrading output quality. At launch, Google included MTP prediction heads in Gemma 4’s LiteRT-exported model weights but omitted them from the Apache 2.0 HuggingFace release. (Gemma 4 Was Released Without MTP Data, FlowHunt)
At launch, the only published MTP speedup figure was 1.8x from DeepSeek V3 benchmarks; Google had not published Gemma 4-specific MTP throughput numbers. On May 5, 2026, Google released standalone MTP drafter models for all four Gemma 4 variants and published framework-specific performance data. The Gemma 4 31B drafter is a 0.5B parameter model documented at approximately 2x decoding speedup, with a claimed ceiling of 3x on specific hardware configurations. (Accelerating Gemma 4: faster inference with multi-token prediction drafters, Google Blog) The directional claim holds: MTP acceleration requires the prediction heads, and the gap now sits specifically between llama.cpp and the rest of the ecosystem.
At launch, vLLM, llama.cpp, and SGLang could not implement MTP-based speculative decoding for Gemma 4 because the necessary model components were absent from the publicly distributed weights. As of May 5, 2026, vLLM, SGLang, Transformers, MLX, and Ollama all supported the official drafters. [Updated June 2026] llama.cpp now does too. Feature request #22747 was filed May 6, 2026; the implementation landed a month later in PR #23398 (merged June 7, build b9549) for the 31B dense and 26B A4B models, followed by PR #24282 (merged June 8) for the E2B and E4B assistant drafters. Loading these models requires a llama.cpp build dated after the merge, since older binaries do not recognize the gemma4-assistant architecture.
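Because older binaries fail on the new architecture, a deployment script may want to gate on the llama.cpp build tag before attempting to load an MTP drafter. The following is a minimal sketch, assuming the build is available as a release-style tag such as `b9549`; the helper names are hypothetical, and the threshold covers only the 31B and 26B A4B support from PR #23398, since no build number is given here for the later E2B/E4B merge.

```python
import re

# b9549 is the build cited for PR #23398 (31B dense and 26B A4B MTP support).
MIN_MTP_BUILD = 9549


def parse_build_tag(tag: str) -> int:
    """Extract the integer build number from a tag like 'b9549'."""
    match = re.fullmatch(r"b(\d+)", tag.strip())
    if match is None:
        raise ValueError(f"unrecognized build tag: {tag!r}")
    return int(match.group(1))


def supports_gemma4_mtp(tag: str) -> bool:
    """True if the build is at or after the first build with native Gemma 4 MTP."""
    return parse_build_tag(tag) >= MIN_MTP_BUILD


if __name__ == "__main__":
    for tag in ("b9400", "b9549", "b9600"):
        print(tag, supports_gemma4_mtp(tag))
```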
Without MTP, Gemma 4 31B achieves approximately 11 tokens/second on consumer GPUs under llama.cpp. Community benchmarks put comparable models (Qwen3 Coder at more than twice the parameter count) at 50+ tokens/second on identical hardware. (Gemma 4 Was Released Without MTP Data, FlowHunt)
llama.cpp’s Position: Day-One GGUF Support, the Speed Penalty, and the EAGLE3 Workaround
llama.cpp shipped GGUF support for all four Gemma 4 variants on launch day, which matters for cross-platform portability. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most, XDA Developers) The 26B A4B MoE variant fits in approximately 18GB RAM at Q4_K_M quantization under llama.cpp. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most, XDA Developers)
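The 18GB figure can be sanity-checked with simple arithmetic. The sketch below estimates weight storage from parameter count and average bits per weight. Two inputs are assumptions rather than figures from the sources: a nominal 26 billion parameters (taken from the model name) and roughly 4.85 bits per weight for Q4_K_M, an approximate average that varies with the tensor mix.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed inputs (not from the cited sources): nominal parameter count and
# an approximate average bits-per-weight for Q4_K_M quantization.
PARAMS_BILLION = 26.0
Q4_K_M_BITS_PER_WEIGHT = 4.85

if __name__ == "__main__":
    weights_gb = estimate_weight_gb(PARAMS_BILLION, Q4_K_M_BITS_PER_WEIGHT)
    print(f"estimated weights: {weights_gb:.1f} GB")
```

Under these assumptions the weights alone come to roughly 15.8 GB, which would leave the rest of the reported ~18GB for KV cache and runtime buffers. Note that an MoE model still has to hold all experts in memory even though only a fraction of parameters is active per token.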
The community response to the MTP gap at launch was a 277MB EAGLE3 draft head trained on Gemma 4’s tokenizer, exploiting shared tokenizer architecture across Gemma 4 variant sizes to enable traditional speculative decoding. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most, XDA Developers) The EAGLE3 workaround achieves approximately 2x speedup. Google’s official MTP drafters, released May 5, 2026, supersede EAGLE3 for vLLM, SGLang, and Transformers-based workflows: the official drafters are jointly trained with the base model weights, where EAGLE3 is not, so EAGLE3’s roughly 2x speedup does not carry the quality guarantee of integrated MTP. [Updated June 2026] EAGLE3 is no longer the only path for llama.cpp users. The native Gemma 4 MTP merge (PR #23398) loads the official assistant drafters directly, so llama.cpp users get the same first-party, co-trained draft model that vLLM and SGLang use rather than a community-distilled approximation.
The mechanism is standard draft-then-verify speculative decoding, but the drafter is the part that matters. Gemma 4’s assistant model shares activations with the target rather than running as a fully independent small model, which is what keeps acceptance rates high without a separate KV cache for the draft pass. On a DGX Spark running the dense 31B target with --spec-draft-n-max 4, the PR author measured wall-clock generation dropping from 290 seconds to 121 seconds, roughly 2.4x, at an acceptance rate near 0.59. Community runs reported wider swings: around 162 decode tokens/second with MTP versus 52 without on a single B200 with the 12B target (0.70 acceptance), and a reported ~74% decode throughput gain on dual RTX 3090s. The smaller E-series drafters give back much of this. Because the E2B and E4B assistants were tuned for draft speed over draft quality, acceptance falls to roughly 47-48%, and the measured uplift on a Galaxy S26 was a single-digit tokens-per-second bump rather than a multiple. (PR #24282) Two caveats are worth flagging before you wire MTP into a serving path: the 26B A4B MoE variant sees limited speedup relative to the dense models, and an early bug where a q8_0-quantized KV cache produced 0% draft acceptance was fixed only after the initial merge by adding Hadamard-rotation support, so a build that predates that fix will silently disable the benefit while appearing to run.
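The draft-then-verify loop is easier to follow in code. Below is a toy sketch of greedy speculative decoding over integer tokens, with `target` and `draft` as hypothetical stand-in functions rather than real models. It does not model Gemma 4's shared activations, and a real runtime verifies all drafted positions in one batched forward pass of the target (and uses a probabilistic acceptance rule when sampling) instead of calling the target once per token as this toy does.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, draft_n: int = 4) -> Tuple[List[int], float]:
    """Greedy draft-then-verify; returns (new tokens, fraction of drafts accepted)."""
    out = list(prompt)
    proposed = accepted = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap drafter proposes draft_n tokens autoregressively.
        ctx = list(out)
        drafts = []
        for _ in range(draft_n):
            tok = draft(ctx)
            drafts.append(tok)
            ctx.append(tok)
        proposed += draft_n
        # 2. The target checks each drafted token in order.
        for tok in drafts:
            expected = target(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                out.append(expected)  # first mismatch: keep the target's token
                break
        else:
            out.append(target(out))  # all drafts accepted: one bonus token
    rate = accepted / proposed if proposed else 0.0
    return out[len(prompt):len(prompt) + max_new], rate


# Toy models: the target counts upward mod 10; the drafter errs after a 2.
def target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rate = speculative_decode(target, draft, [0], max_new=20)
    print(tokens == greedy_decode(target, [0], 20), round(rate, 2))
```

The property to notice is that the speculative output is identical to plain greedy decoding from the target; the drafter only changes how many target steps are needed. That is why acceptance rate, not drafter quality in isolation, determines the speedup.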
Community testing at launch flagged missing HuggingFace Transformers architecture support, PEFT incompatibility with new layer types, and stability problems on Apple Silicon, characterizing the release as incomplete outside Google’s toolchain. (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most, XDA Developers) These issues illustrate that GGUF support on day one and cross-framework inference readiness are not the same claim.
The Lock-In Mechanics: What Adopting LiteRT-LM Means for Your Inference Stack
The technical lock-in was layered, and most of those layers have since thinned. At the model layer: maximum Gemma 4 throughput on mobile NPU still requires LiteRT-converted model files. Google’s May 5, 2026 release of standalone MTP drafters on HuggingFace resolved the framework-portability half of this constraint: vLLM, SGLang, Transformers, MLX, and Ollama users can run MTP-accelerated Gemma 4 inference without LiteRT. [Updated June 2026] llama.cpp joined that list on June 7 via PR #23398, so the framework-portability story for MTP is now closed across the major runtimes. At the runtime layer: NPU acceleration on Qualcomm hardware still requires device-specific QNN pre-compiled binaries tied to LiteRT’s delegate architecture, and that piece has not opened. (Bringing Multimodal Gemma 4 E2B to the Edge: LiteRT-LM and Qualcomm QNN, Google Developer Experts) At the API layer: Kotlin is stable for Android and Python is stable for server and desktop. [Updated June 2026] The Swift API is no longer marked “In Development”: v0.12.0 shipped native Swift APIs with Metal GPU for iOS, so the iOS-team argument for staying on the C++ library has gone away. (Release v0.12.0, google-ai-edge/LiteRT-LM)
A vendor releases a high-quality open-weight model under a permissive license, distributes a version of those weights missing a performance-critical component, and routes full capability through its own runtime. [Updated June 2026] That was the shape of the launch. Google reversed the MTP component gap on May 5 by publishing standalone drafters, llama.cpp implemented native support a month later, and the Swift gap closed in v0.12.0. What remains is narrower and more honest about itself: peak mobile NPU throughput on Qualcomm silicon still routes through LiteRT-converted weights and Google’s QNN delegate. The lock-in that survived is the hardware-acceleration layer, not the model-portability layer.
Decision Framework: When to Use LiteRT-LM vs. llama.cpp vs. ONNX Runtime for Edge Deployment
Teams should choose based on which constraint is load-bearing for their deployment.
LiteRT-LM is the clear choice if: your target hardware is Android with Qualcomm Snapdragon 8 Elite, you are deploying Gemma 4 specifically, and maximum throughput per watt matters more than cross-framework portability. The Qualcomm Dragonwing IQ8 NPU path reaches 3,700 prefill tokens/second (Bring state-of-the-art agentic skills to the edge with Gemma 4, Google Developers Blog) and is not matched by any current llama.cpp configuration on equivalent hardware. If you are on Raspberry Pi and running batch workloads where 7.6 decode tokens/second is acceptable, LiteRT-LM also gives you a tested, supported path. (Bring state-of-the-art agentic skills to the edge with Gemma 4, Google Developers Blog)
llama.cpp remains the right choice if: your deployment spans multiple hardware backends without a dominant Android/Qualcomm bias, you need HuggingFace ecosystem compatibility (fine-tuning, PEFT, Transformers-based tooling), or your team’s operational model depends on not being tied to a single vendor’s model distribution pipeline. [Updated June 2026] The MTP penalty that argued against llama.cpp at publication is gone: native Gemma 4 MTP merged on June 7, so llama.cpp users now load the same first-party drafters as vLLM and SGLang, with the EAGLE3 head (Google’s Gemma 4 isn’t the smartest local LLM I’ve run, but it’s the one I reach for most, XDA Developers) relegated to a fallback for builds that predate b9549. llama.cpp’s cross-platform record is more established than LiteRT-LM’s for non-Android targets, and that advantage no longer comes with a throughput tax on Gemma 4.
ONNX Runtime is relevant if your organization already runs ONNX-based inference infrastructure, but no Gemma 4-specific performance data is available to compare.
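The criteria above can be written down as a small checklist function. This is a simplified encoding for illustration, not a complete decision procedure: the `Deployment` fields and the `candidate_runtimes` helper are hypothetical names, the Raspberry Pi batch case is omitted, and the function returns every runtime whose stated conditions are met rather than picking a single winner.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Deployment:
    targets_qualcomm_android: bool    # Android on Qualcomm NPU hardware
    gemma4_only: bool                 # deploying Gemma 4 specifically
    throughput_over_portability: bool
    spans_multiple_backends: bool     # no dominant Android/Qualcomm bias
    needs_hf_ecosystem: bool          # fine-tuning, PEFT, Transformers tooling
    avoids_single_vendor: bool        # no dependence on one vendor's pipeline
    has_onnx_infra: bool              # already runs ONNX-based inference


def candidate_runtimes(d: Deployment) -> List[str]:
    """Return the runtimes whose conditions, as described above, are satisfied."""
    candidates: List[str] = []
    # LiteRT-LM needs all three conditions to hold together.
    if d.targets_qualcomm_android and d.gemma4_only and d.throughput_over_portability:
        candidates.append("LiteRT-LM")
    # Any one of the llama.cpp conditions is enough.
    if d.spans_multiple_backends or d.needs_hf_ecosystem or d.avoids_single_vendor:
        candidates.append("llama.cpp")
    if d.has_onnx_infra:
        candidates.append("ONNX Runtime")
    return candidates


if __name__ == "__main__":
    android_team = Deployment(True, True, True, False, False, False, False)
    print(candidate_runtimes(android_team))
```

When more than one runtime comes back, the remaining question is which constraint is load-bearing, which the checklist cannot answer for you.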
The decision also depends on timing, and the timing has moved. [Updated June 2026] The two open questions this article posed have both been answered. llama.cpp shipped native Gemma 4 MTP across the dense, MoE, and assistant variants by June 8, and LiteRT-LM closed the Swift API gap in v0.12.0 (May 18) with iOS Metal GPU support, reaching v0.13.1 by June 3. The asymmetry that justified the lock-in framing has narrowed to one layer: Qualcomm NPU acceleration on Android, where LiteRT-LM’s QNN delegate path still has no direct equivalent in the portable runtimes. For Apple Silicon teams weighing the runtime choice rather than the NPU question, the MLX versus llama.cpp tradeoff is the more relevant comparison, and for the broader on-device picture the LiteRT phone-inference primer covers the framework’s origins.
Frequently Asked Questions
Does LiteRT-LM’s Qualcomm NPU path work on older Snapdragon devices, or only the Dragonwing IQ8?
NPU acceleration via the QNN delegate targets the Hexagon NPU on Snapdragon 8 Elite specifically. Older Snapdragon generations lack the same Hexagon configuration and require device-specific QNN pre-compiled binaries that Google has not published for legacy silicon. Teams deploying to a mixed device fleet will hit the runtime fallback chain (NPU → GPU → CPU) on non-Elite hardware, meaning the 3,700 tk/s prefill figure is only reproducible on Snapdragon 8 Elite or the Dragonwing IQ8 reference board.
How does the EAGLE3 workaround’s quality compare to Google’s native MTP implementation?
EAGLE3 is a community-trained 277MB draft head, and the training process introduces its own quality tradeoffs absent from Google’s first-party MTP implementation: the draft head approximates the target model’s distribution but is not jointly trained with Gemma 4, so acceptance rates and output distribution can diverge from what integrated MTP would produce. Google’s official MTP drafters, released May 5, 2026, are co-trained with the base model weights and carry a published quality-equivalence guarantee that EAGLE3 cannot replicate. [Updated June 2026] EAGLE3 is no longer the only path for llama.cpp users: native support for the official co-trained drafters merged on June 7 (PR #23398), so the quality-equivalence guarantee now applies to llama.cpp deployments too. EAGLE3 stays relevant only for builds older than b9549 or for Gemma 4 variants whose assistant drafter you do not want to ship.
What does it actually cost a team to switch from llama.cpp to LiteRT-LM mid-project?
Beyond code changes to swap inference APIs, teams must acquire LiteRT-exported model files separately from the Apache 2.0 HuggingFace weights: the NPU-optimized weights are only available through Google’s LiteRT export pipeline, not as a drop-in download. Any fine-tuned adapters built with PEFT against the HuggingFace weights are currently incompatible with LiteRT’s layer types, so custom fine-tunes cannot be ported without retraining against LiteRT-compatible model artifacts.
Is Gemma 4’s MoE variant practical to run outside LiteRT-LM given PEFT and architecture gaps?
The 26B A4B MoE variant fits in roughly 18GB RAM at Q4_K_M under llama.cpp, but the HuggingFace Transformers architecture definition for Gemma 4’s MoE layer types was missing at launch, meaning Transformers-based workflows (training, evaluation, adapter attachment) could not load the model at all in the initial release window. PEFT incompatibility compounds this: teams that planned to fine-tune the MoE variant with LoRA adapters cannot attach them to the new layer types without upstream library updates, making the 26B variant essentially inference-only outside Google’s toolchain for now.
What would cause a team currently committed to LiteRT-LM to reconsider that choice?
Google released the MTP drafters to HuggingFace on May 5, 2026, available for Transformers, vLLM, SGLang, MLX, and Ollama. For teams already on those frameworks, the primary performance reason to accept LiteRT lock-in is substantially weaker than it was at this article’s original publication. [Updated June 2026] Both triggers this answer originally flagged have now fired. llama.cpp shipped native Gemma 4 MTP support in PR #23398 (June 7), closing feature request #22747, so the model-portability reason to accept LiteRT lock-in is essentially gone for runtime-portable deployments. The Swift API also reached stable iOS support with Metal GPU acceleration in v0.12.0, removing the C++-wrapping maintenance surface that previously pushed iOS teams toward ONNX Runtime. The one reason left to accept LiteRT lock-in is the Qualcomm NPU throughput path on Android, which the portable runtimes still cannot match. (Release v0.12.0, google-ai-edge/LiteRT-LM)