Google’s LiteRT — renamed from TensorFlow Lite in September 2024 — is now a full on-device inference platform for large language models. Using the LiteRT-LM layer, developers can deploy Gemma 3, Phi-4-mini, Qwen 2.5, and other SLMs entirely on Android, iOS, or desktop hardware, with no cloud round-trips required. On flagship Snapdragon silicon, prefill speeds exceed 11,000 tokens per second.
What Is LiteRT? The Rebrand That Signals a Shift
Google renamed TensorFlow Lite (TFLite) to LiteRT — short for “Lite Runtime” — on September 4, 2024.[1] The name change is more than cosmetic. TFLite launched in 2017 as a TensorFlow-specific mobile runtime; by 2024, it had grown to accept models authored in PyTorch, JAX, and Keras. Anchoring the name to TensorFlow had become actively misleading.
The .tflite file format, FlatBuffers encoding, and class APIs remained untouched. Production apps using TFLite through Google Play Services were unaffected. What changed was the documentation domain (ai.google.dev/edge/litert), the Maven and PyPI package names, and, critically, the stated direction: LiteRT is now positioned as Google’s universal framework for deploying any model on any edge device — not a TensorFlow companion project.
Google states that LiteRT’s heritage (inherited from TFLite) now powers over 100,000 apps across 2.7 billion devices worldwide.[2] That installed base matters when evaluating the new GenAI additions: LiteRT-LM slots into an existing ecosystem, not a greenfield one.
The GenAI Layer: LiteRT-LM
Running classical computer vision or NLP models on-device is a solved problem. Running a 1-4B parameter autoregressive language model is not — it requires tokenizer management, KV-cache handling, context window orchestration, and hardware-specific quantization paths. LiteRT-LM is the layer that handles all of this.
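The bookkeeping LiteRT-LM takes on can be illustrated with a toy decode loop. The pure-Python sketch below is not LiteRT-LM's implementation; it only shows why a KV cache exists: each generated token appends one key/value pair, so attention at step *n* reuses the *n−1* cached pairs instead of recomputing projections for the whole prefix.

```python
import math

def attention(q, K, V):
    # Scaled dot-product attention for one query over cached keys/values.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    return [sum(wi * v[d] for wi, v in zip(w, V)) for d in range(len(q))]

def decode_step(x, cache):
    # With a KV cache, each decode step appends one new key/value pair
    # instead of re-projecting every previous token on every step.
    cache["K"].append(x)  # toy identity "projections"
    cache["V"].append(x)
    return attention(x, cache["K"], cache["V"])

cache = {"K": [], "V": []}
for t in range(5):
    out = decode_step([float(t + d) for d in range(4)], cache)

print(len(cache["K"]))  # one cached entry per generated token
```

Session objects in LiteRT-LM own exactly this kind of per-conversation state, which is why they are separate from the shared Engine.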
LiteRT-LM was publicly launched as a production framework on September 24, 2025.[3] It introduced a new model format — .litertlm — which extends the legacy .tflite FlatBuffers format with metadata for generative models: tokenizer configs, quantization headers, and vision encoder definitions for multimodal inputs. The API surface is organized around two core objects:
- Engine: A singleton per application. Loads and caches the model. Shared across features.
- Session: A stateful, per-conversation interface. Manages KV-cache, supports task-specific LoRA weight injection, and can be cloned in under 10 milliseconds for branching conversations or parallel evaluation.
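As a rough mental model (illustrative Python, not the LiteRT-LM API), the Engine/Session split works like this: the engine owns the expensive, shared model load, while each session carries only lightweight per-conversation state. Because a clone copies that state but shares the engine's weights, it can be cheap.

```python
import copy

class Engine:
    # One Engine per application: loads model weights once, shared by features.
    def __init__(self, model_path):
        self.model_path = model_path
        self.weights = object()  # stand-in for the loaded model

    def create_session(self):
        return Session(self)

class Session:
    # One Session per conversation: owns the KV cache / conversation state.
    def __init__(self, engine):
        self.engine = engine
        self.history = []

    def add_query_chunk(self, text):
        self.history.append(text)

    def clone(self):
        # Copies lightweight conversation state but shares the engine's
        # weights, which is why LiteRT-LM can cite sub-10 ms clones.
        twin = Session(self.engine)
        twin.history = copy.copy(self.history)
        return twin

engine = Engine("gemma3-1b-it-int4.litertlm")
s1 = engine.create_session()
s1.add_query_chunk("Draft a reply.")
s2 = s1.clone()                  # branch the conversation
s2.add_query_chunk("Make it formal.")

print(len(s1.history), len(s2.history), s1.engine is s2.engine)
```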
LiteRT-LM also introduces a new CompiledModel API to replace the legacy Interpreter interface — the entry point for GPU and NPU execution paths, zero-copy buffer sharing via AHardwareBuffer, and asynchronous inference. The Interpreter API remains supported for backward compatibility, but new projects should use CompiledModel.
A minimal Android session using the Kotlin API:
```kotlin
val engine = LlmInferenceEngine.create(
    context,
    LlmInferenceEngine.Options.builder()
        .setModelPath("/sdcard/gemma3-1b-it-int4.litertlm")
        .build()
)
val session = engine.createSession()
session.addQueryChunk("Explain transformer attention in two sentences.")
session.generateResponseAsync { partial, done ->
    if (done) displayResult(partial)
}
```
How On-Device LLM Inference Actually Works
The performance gap between CPU, GPU, and NPU inference is not linear — it is categorical. On the Samsung Galaxy S25 Ultra (Snapdragon 8 Elite), Gemma 3 1B achieves 5,836 tokens per second prefill on the Qualcomm NPU versus significantly lower throughput on CPU or GPU alone.[4] The NPU path requires a hardware-specific accelerator layer.
LiteRT provides two production NPU backends as of early 2026:
- LiteRT Qualcomm AI Engine Direct Accelerator (announced November 2025): Supports 90 operators; enables 64 of 72 benchmarked canonical models to run fully on the NPU. On the Snapdragon 8 Elite, FastVLM-0.5B achieves prefill speeds exceeding 11,000 tokens per second — a number that reflects the NPU’s ability to parallelize matrix multiplication at hardware level.
- LiteRT NeuroPilot Accelerator for MediaTek (announced December 2025): Targets Dimensity 9000-series SoCs. On the Dimensity 9500 (Vivo X300 Pro), Gemma 3n E2B reaches 1,600+ tokens per second prefill and 28 tokens per second decode at 4K context — Google states this is up to 12× faster than CPU and 10× faster than GPU for the same model.[5]
The CPU path uses XNNPACK, a high-performance kernel library, across all platforms including iOS, macOS, Windows, and Linux. GPU paths use OpenCL or OpenGL on Android and Metal/WebGPU on Apple silicon.
Compilation strategy matters significantly. LiteRT-LM supports both ahead-of-time (AOT) and just-in-time (JIT) compilation. For production deployments, AOT is necessary: JIT compilation for a model like Gemma 3 270M can take over a minute on-device — an unacceptable cold-start penalty in a user-facing application. AOT-compiled model packages ship ready to execute.
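The AOT-versus-JIT trade-off amounts to paying the compile cost once, offline or on first run, and reusing the artifact afterward. A hypothetical caching flow illustrates the shape of it; the function names, file layout, and the simulated compile below are illustrative, not LiteRT tooling.

```python
import hashlib
import os
import tempfile
import time

def compile_model(model_bytes):
    # Stand-in for an expensive device-specific compile; a real JIT
    # compile of a model like Gemma 3 270M can take over a minute.
    time.sleep(0.01)
    return b"compiled:" + hashlib.sha256(model_bytes).digest()

def load_compiled(model_bytes, cache_dir):
    # AOT-style flow: reuse a previously compiled artifact when present,
    # paying the compile cost only once (or never, if shipped precompiled).
    key = hashlib.sha256(model_bytes).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read(), "cache hit"
    artifact = compile_model(model_bytes)
    with open(path, "wb") as f:
        f.write(artifact)
    return artifact, "compiled"

with tempfile.TemporaryDirectory() as d:
    model = b"fake-model-weights"
    _, first = load_compiled(model, d)   # cold start: compiles
    _, second = load_compiled(model, d)  # warm start: reuses artifact
    print(first, "->", second)
```

Shipping an AOT-compiled package is equivalent to pre-populating this cache before the app ever runs.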
Performance: What the Numbers Say
The benchmark landscape for on-device LLMs is fragmented — devices, quantization levels, context lengths, and hardware backends vary. The following table consolidates published data from official Google sources.
| Device | SoC | Backend | Model | Prefill (tok/s) | Decode (tok/s) |
|---|---|---|---|---|---|
| Samsung Galaxy S25 Ultra | Snapdragon 8 Elite | NPU (QNN) | Gemma 3 1B INT4 | 5,836 | 84.8 |
| Samsung Galaxy S25 Ultra | Snapdragon 8 Elite | NPU (QNN) | FastVLM-0.5B INT8 | >11,000 | >100 |
| Samsung Galaxy S24 Ultra | Snapdragon 8 Gen 3 | GPU | Gemma 3n E2B INT4 | 816 | 15.6 |
| Vivo X300 Pro | Dimensity 9500 | NPU (NeuroPilot) | Gemma 3n E2B INT4 | >1,600 | 28 |
| MacBook Pro M3 | Apple M3 | CPU (XNNPACK) | Gemma 3 1B INT4 | 422 | 66.9 |
Decode speed — the rate at which a model generates new tokens during a response — is the most user-perceptible metric. At 84.8 tokens per second (Gemma 3 1B on Snapdragon 8 Elite NPU), output renders visibly faster than a human reads. At 15.6 tokens per second (Gemma 3n E2B on mid-tier GPU), it is usable but noticeably slower.
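A back-of-the-envelope conversion makes the perceptibility claim concrete. The tokens-per-word ratio below is a common rule of thumb for English text, not a LiteRT figure.

```python
# Assume roughly 1.3 tokens per English word (rule of thumb, not a LiteRT figure).
TOKENS_PER_WORD = 1.3

def words_per_minute(tokens_per_second):
    # Convert decode throughput into an approximate reading-speed equivalent.
    return tokens_per_second / TOKENS_PER_WORD * 60

fast = words_per_minute(84.8)   # Gemma 3 1B on Snapdragon 8 Elite NPU
slow = words_per_minute(15.6)   # Gemma 3n E2B on mid-tier GPU

print(round(fast), round(slow))
```

At roughly 3,900 words per minute versus about 720, both outpace a skilled reader's ~250–300 wpm, but the slower path leaves far less headroom once longer responses or post-processing enter the pipeline.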
Google states that on TensorFlow 2.21 (released March 6, 2026), the LiteRT GPU backend is 1.4× faster than the legacy TFLite GPU implementation for the same models, and that asynchronous execution with zero-copy buffer interoperability adds up to a 2× performance improvement for pipelines that chain inference steps.[6]
Supported Models and Deployment Scale
LiteRT-LM ships pre-tested against the following models, available pre-converted and quantized at the LiteRT Hugging Face Community:
| Model | Format | Size | Quantization | Context |
|---|---|---|---|---|
| Gemma 3 1B Instruct | .litertlm | 557 MB | INT4 | 4,096 |
| Gemma 3n E2B | .litertlm | 2,965 MB | INT4 | 4,096 |
| Gemma 3n E4B | .litertlm | 4,235 MB | INT4 | 4,096 |
| FunctionGemma 270M | .litertlm | 288 MB | INT8 | 1,024 |
| Phi-4-mini Instruct | .litertlm | 3,728 MB | INT8 | 4,096 |
| Qwen 2.5 1.5B Instruct | .litertlm | 1,524 MB | INT8 | 4,096 |
| Llama 3.2 1B / 3B Instruct | .litertlm | — | — | — |
| DeepSeek-R1-Distill-Qwen-1.5B | .litertlm | — | — | — |
Gemma 3n is the first multimodal model in the stack, supporting text and image inputs (with audio recently added in the AI Edge Gallery demo). At 288 MB quantized, FunctionGemma 270M is notable for function-calling use cases on extremely constrained devices.
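Function calling with a small on-device model typically means the model emits a structured tool call that the app parses and dispatches locally. The sketch below mocks the model side entirely; FunctionGemma's actual prompt and output format is not shown here, and the tool registry is hypothetical.

```python
import json

# Hypothetical local tool registry the app exposes to the model.
TOOLS = {"set_alarm": lambda hour, minute: f"alarm set for {hour:02d}:{minute:02d}"}

def mock_model_output(prompt):
    # A real call would go through a LiteRT-LM session running a
    # function-calling model; this stub just returns a structured call.
    return json.dumps({"name": "set_alarm", "args": {"hour": 7, "minute": 30}})

call = json.loads(mock_model_output("wake me at 7:30"))
result = TOOLS[call["name"]](**call["args"])
print(result)  # alarm set for 07:30
```

Because the whole loop runs on-device, a 288 MB model that reliably emits well-formed calls can be more useful than a larger model with better prose.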
LiteRT vs. the Alternatives
On-device LLM inference has attracted several competing frameworks. The differences are real and architecture-dependent:
| Framework | Primary Format | NPU Support | iOS Support | Cross-Platform | Status |
|---|---|---|---|---|---|
| LiteRT-LM | .litertlm | Qualcomm, MediaTek (Android) | CPU + GPU (Metal) | Yes | Production (alpha API) |
| llama.cpp | GGUF | No | CPU + Metal GPU | Yes | Open source, widely used |
| ONNX Runtime | ONNX | Via Execution Providers | Yes | Yes | Production |
| Core ML | .mlmodel | Apple Neural Engine | Apple only | No | Production (Apple) |
| MediaPipe LLM | .task | Limited | Yes | Yes | Deprecated |
Google’s own published benchmarks show LiteRT outperforming llama.cpp on both CPU and GPU for Gemma 3 1B on the Galaxy S25 Ultra — with the NPU path adding a further dimension unavailable in llama.cpp entirely.[7]
The more interesting comparison is with ONNX Runtime. Both target cross-platform deployment and use delegation models for hardware acceleration. LiteRT has deeper Android ecosystem integration (Play Services, AICore, Chrome, Pixel hardware partnerships) and production NPU paths for Qualcomm and MediaTek silicon. ONNX Runtime has broader enterprise coverage outside mobile. For Android-first development with flagship hardware targets, LiteRT-LM’s NPU path is currently the highest-performance option.
Production Deployments
LiteRT-LM’s architecture claims are corroborated by shipping products. As of late 2025, Google confirmed that LiteRT-LM powers Gemini Nano-based features in three production environments:
- Chrome browser: Tab management and text analysis AI features
- Chromebook Plus: On-device AI features tied to the product line
- Pixel Watch: Smart Replies feature
Google states these deployments collectively reach “hundreds of millions of devices.”[3] This is a production inference stack running inside a browser and a wearable OS, not a demo.
The Google AI Edge Gallery — an open beta Android app demonstrating on-device LLMs, image analysis, and speech-to-text with zero internet dependency — reached 500,000 APK downloads within two months of its GitHub launch and is now available on Google Play.[8] An iOS version is planned.
Limitations Practitioners Need to Know
The performance narrative is real, but bounded:
Device requirements are non-negotiable. The flagship NPU benchmarks require flagship hardware. At 4.2 GB, Gemma 3n E4B fits only on phones with 8+ GB of available memory; mid-range devices fall back to GPU or CPU paths with commensurately lower throughput.
Quantization trades accuracy for size. INT4 post-training quantization reduces model size 2.5–4× versus BF16 full precision. Published evaluations show ~2% degradation on MMLU-Pro reasoning benchmarks for INT4 models — acceptable for most applications, but a real trade-off that increases with smaller model sizes.
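The size arithmetic behind that range is straightforward. Counting weight storage alone (real files add embeddings, tokenizer data, and quantization scales, which is why realized ratios land at 2.5–4× rather than the theoretical 4×):

```python
def model_size_gb(params_billion, bits_per_weight):
    # Weight storage only: parameters x bits per weight, in GB.
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

bf16 = model_size_gb(1.0, 16)   # 1B params at 16-bit: 2.0 GB
int4 = model_size_gb(1.0, 4)    # 1B params at 4-bit:  0.5 GB
print(bf16 / int4)              # theoretical upper bound on the ratio
```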
NPU support is Android-only, and not all Android. The Qualcomm and MediaTek NPU backends cover the majority of current flagship Android devices, but older Qualcomm SoCs, Samsung Exynos devices, and all iOS devices have no NPU path through LiteRT-LM. iOS inference runs on CPU (XNNPACK) and GPU (Metal/WebGPU). An experimental Core ML delegate exists as of LiteRT 2.4.0 that routes some ops through the Apple Neural Engine, but it is labeled beta and has limited model support.[9]
LiteRT-LM is still in alpha. As of late March 2026, the API is at v0.9.0-alpha. Swift and Python bindings are “in development.” The API surface can change between releases. Factor this into production adoption timelines.
Getting Started
The conversion pipeline for custom models follows a consistent path:
```shell
# Install conversion tooling
pip install litert-torch ai-edge-quantizer

# Convert a PyTorch model to LiteRT format
python -m litert_torch.export \
    --model my_model.pt \
    --output my_model.litertlm

# Apply INT4 post-training quantization
python -m ai_edge_quantizer.quantize \
    --model my_model.litertlm \
    --quantization_config int4_default \
    --output my_model_int4.litertlm
```
For benchmarking at scale before deployment, AI Edge Portal provides access to Google’s lab of 100+ physical Android device models — useful for understanding performance distribution across the long tail of Android hardware before shipping.
Pre-converted models are available at huggingface.co/litert-community. For most developers, starting with an existing Gemma 3 1B or Qwen 2.5 1.5B quantized model is faster than converting from scratch.
Frequently Asked Questions
Q: Does LiteRT-LM replace TensorFlow Lite for classical ML tasks?
A: LiteRT is the renamed TFLite runtime and is backward-compatible with existing .tflite models and apps. LiteRT-LM is a new component within the ecosystem specifically for generative AI models — it does not replace classical ML workflows, which continue to use the standard LiteRT runtime.
Q: Can I run LiteRT-LM models on an iPhone?
A: Yes. LiteRT-LM supports iOS using CPU (XNNPACK) and GPU (Metal/WebGPU) backends. An experimental Core ML delegate enables partial Apple Neural Engine access, but it is currently beta and has limited operator coverage. Expect CPU/GPU-level performance on iOS rather than the NPU benchmarks published for Android flagship devices.
Q: What is the minimum device required for a usable on-device LLM experience?
A: There is no published minimum spec, but practical experience suggests 6+ GB RAM for Gemma 3 1B (557 MB INT4) and a mid-range or better GPU for acceptable decode speeds. Flagship Snapdragon 8-series or Dimensity 9000-series devices with NPU support deliver the best experience. Entry-level devices may struggle with models above 600 MB.
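That RAM guidance can be sanity-checked by adding the model file to a rough KV-cache estimate. The transformer shape below is illustrative only, not Gemma 3's published configuration:

```python
def kv_cache_mb(layers, kv_heads, head_dim, context, bytes_per_elem=2):
    # K and V tensors per layer: 2 * heads * head_dim values per token
    # position, stored at bytes_per_elem (2 for FP16/BF16).
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e6

# Illustrative shape for a ~1B-parameter model (NOT Gemma 3's actual config):
cache_mb = kv_cache_mb(layers=24, kv_heads=4, head_dim=128, context=4096)
model_mb = 557  # Gemma 3 1B INT4 .litertlm size from the table above

print(round(cache_mb), round(model_mb + cache_mb))
```

Even this conservative estimate lands near 750 MB before counting runtime buffers and the rest of the app, which is why 6+ GB devices are the practical floor.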
Q: How does LiteRT-LM compare to llama.cpp for Android development?
A: On CPU and GPU, Google’s published benchmarks show LiteRT-LM outperforming llama.cpp for Gemma 3 1B on current flagship hardware. More significantly, LiteRT-LM provides production NPU paths (Qualcomm QNN, MediaTek NeuroPilot) that llama.cpp does not support — enabling the 5,000–11,000+ tokens/sec prefill speeds that make on-device inference practically competitive with cloud latency.
Q: Is LiteRT-LM production-ready today?
A: The underlying LiteRT runtime reached production-ready status with TensorFlow 2.21 in March 2026. LiteRT-LM itself is at v0.9.0-alpha, with the Kotlin and C++ APIs stable but the Swift and Python bindings still in development. Google runs LiteRT-LM in production in Chrome, Chromebook Plus, and Pixel Watch, which is meaningful evidence of stability — but the public API may still change before a stable release designation.
Footnotes
1. Google Developers Blog. “TensorFlow Lite is now LiteRT.” September 4, 2024. developers.googleblog.com/tensorflow-lite-is-now-litert/
2. Google Developers Blog. “LiteRT: The Universal Framework for On-Device AI.” developers.googleblog.com/en/litert-the-universal-framework-for-on-device-ai/
3. Google Developers Blog. “On-device GenAI in Chrome, Chromebook Plus, and Pixel Watch with LiteRT-LM.” September 24, 2025. developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/
4. Google Developers Blog. “Unlocking Peak Performance on Qualcomm NPU with LiteRT.” November 2025. developers.googleblog.com/unlocking-peak-performance-on-qualcomm-npu-with-litert/
5. Google Developers Blog. “MediaTek NPU and LiteRT: Powering the Next Generation of On-Device AI.” December 2025. developers.googleblog.com/mediatek-npu-and-litert-powering-the-next-generation-of-on-device-ai/
6. Dev Journal. “Google Launches TensorFlow 2.21 and LiteRT.” March 7, 2026. earezki.com/ai-news/2026-03-07-google-launches-tensorflow-221-and-litert-faster-gpu-performance-new-npu-acceleration-and-seamless-pytorch-edge-deployment-upgrades/
7. Google Developers Blog. “On-device SLMs with multimodality, RAG, and Function Calling.” developers.googleblog.com/google-ai-edge-small-language-models-multimodality-rag-function-calling/
8. Google Developers Blog. “Google AI Edge Gallery: Now with audio and on Google Play.” September 2025. developers.googleblog.com/google-ai-edge-gallery-now-with-audio-and-on-google-play/
9. Google AI Edge. “LiteRT Core ML delegate.” ai.google.dev/edge/litert/ios/coreml