Google’s LiteRT — renamed from TensorFlow Lite in September 2024 — is now a full on-device inference platform for large language models. Using the LiteRT-LM layer, developers can deploy Gemma 3, Phi-4-mini, Qwen 2.5, and other SLMs entirely on Android, iOS, or desktop hardware, with no cloud round-trips required. On flagship Snapdragon silicon, prefill speeds exceed 11,000 tokens per second.


What Is LiteRT? The Rebrand That Signals a Shift

Google renamed TensorFlow Lite (TFLite) to LiteRT — short for “Lite Runtime” — on September 4, 2024.[1] The name change is more than cosmetic. TFLite launched in 2017 as a TensorFlow-specific mobile runtime; by 2024, it had grown to accept models authored in PyTorch, JAX, and Keras. Anchoring the name to TensorFlow had become actively misleading.

The .tflite file format, FlatBuffers encoding, and class APIs remained untouched. Production apps using TFLite through Google Play Services were unaffected. What changed was the documentation domain (ai.google.dev/edge/litert), the Maven and PyPI package names, and, critically, the stated direction: LiteRT is now positioned as Google’s universal framework for deploying any model on any edge device — not a TensorFlow companion project.

Google states that LiteRT’s heritage (inherited from TFLite) now powers over 100,000 apps across 2.7 billion devices worldwide.[2] That installed base matters when evaluating the new GenAI additions: LiteRT-LM slots into an existing ecosystem, not a greenfield one.


The GenAI Layer: LiteRT-LM

Running classical computer vision or NLP models on-device is a solved problem. Running a 1-4B parameter autoregressive language model is not — it requires tokenizer management, KV-cache handling, context window orchestration, and hardware-specific quantization paths. LiteRT-LM is the layer that handles all of this.
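To get a sense of scale for the KV-cache bookkeeping involved, here is a back-of-envelope sizing calculation; the layer, head, and dimension values below are illustrative stand-ins for a 1B-class decoder, not any published model config.

```python
# Back-of-envelope KV-cache sizing for an autoregressive decoder.
# Config values are illustrative, NOT the published config of any real model.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Keys + values: one vector per layer, per KV head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 1B-class decoder at a 4,096-token context with an FP16 cache:
size = kv_cache_bytes(layers=26, kv_heads=4, head_dim=256, seq_len=4096)
print(f"{size / 2**20:.0f} MiB")  # hundreds of MiB before the weights load
```

Even with grouped-query attention keeping the KV head count low, the cache grows linearly with context length, which is why per-session cache management is a first-class concern rather than an implementation detail.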

LiteRT-LM was publicly launched as a production framework on September 24, 2025.[3] It introduced a new model format — .litertlm — which extends the legacy .tflite FlatBuffers format with metadata for generative models: tokenizer configs, quantization headers, and vision encoder definitions for multimodal inputs. The API surface is organized around two core objects:

  • Engine: A singleton per application. Loads and caches the model. Shared across features.
  • Session: A stateful, per-conversation interface. Manages KV-cache, supports task-specific LoRA weight injection, and can be cloned in under 10 milliseconds for branching conversations or parallel evaluation.
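The Engine/Session split can be sketched as a toy Python model; this illustrates the concept only, and every class and method name here is invented for illustration rather than taken from the real LiteRT-LM API.

```python
# Toy model of the Engine/Session split described above -- a conceptual
# sketch, not the actual LiteRT-LM API (all names here are invented).
from copy import deepcopy

class Engine:
    """One per app: owns the loaded model weights, shared by all sessions."""
    def __init__(self, model_path: str):
        self.model_path = model_path  # weights loaded once and cached here

    def create_session(self) -> "Session":
        return Session(self)

class Session:
    """Per-conversation state: the history list stands in for the KV-cache."""
    def __init__(self, engine: Engine):
        self.engine = engine
        self.history: list[str] = []

    def add_query_chunk(self, text: str) -> None:
        self.history.append(text)

    def clone(self) -> "Session":
        # Cheap state snapshot -- LiteRT-LM clones a real KV-cache in <10 ms.
        other = Session(self.engine)
        other.history = deepcopy(self.history)
        return other

engine = Engine("gemma3-1b-it-int4.litertlm")
main = engine.create_session()
main.add_query_chunk("Summarize this document.")
branch = main.clone()          # branch a parallel conversation
branch.add_query_chunk("Now translate the summary to French.")
assert main.history != branch.history  # divergent branches, one shared engine
```

The design point the sketch captures: weights are loaded once per process, while conversation state is cheap to snapshot, which is what makes branching and parallel evaluation practical.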

LiteRT-LM also introduces a new CompiledModel API to replace the legacy Interpreter interface — the entry point for GPU and NPU execution paths, zero-copy buffer sharing via AHardwareBuffer, and asynchronous inference. The Interpreter API remains supported for backward compatibility, but new projects should use CompiledModel.

A minimal Android session using the Kotlin API:

```kotlin
val engine = LlmInferenceEngine.create(
    context,
    LlmInferenceEngine.Options.builder()
        .setModelPath("/sdcard/gemma3-1b-it-int4.litertlm")
        .build()
)
val session = engine.createSession()
session.addQueryChunk("Explain transformer attention in two sentences.")
session.generateResponseAsync { partial, done ->
    if (done) displayResult(partial)
}
```


How On-Device LLM Inference Actually Works

The performance gap between CPU, GPU, and NPU inference is not linear — it is categorical. On the Samsung Galaxy S25 Ultra (Snapdragon 8 Elite), Gemma 3 1B achieves 5,836 tokens per second prefill on the Qualcomm NPU versus significantly lower throughput on CPU or GPU alone.[4] The NPU path requires a hardware-specific accelerator layer.

LiteRT provides two production NPU backends as of early 2026:

  • LiteRT Qualcomm AI Engine Direct Accelerator (announced November 2025): Supports 90 operators; enables 64 of 72 benchmarked canonical models to run fully on the NPU. On the Snapdragon 8 Elite, FastVLM-0.5B achieves prefill speeds exceeding 11,000 tokens per second — a number that reflects the NPU’s ability to parallelize matrix multiplication at hardware level.
  • LiteRT NeuroPilot Accelerator for MediaTek (announced December 2025): Targets Dimensity 9000-series SoCs. On the Dimensity 9500 (Vivo X300 Pro), Gemma 3n E2B reaches 1,600+ tokens per second prefill and 28 tokens per second decode at 4K context — Google states this is up to 12× faster than CPU and 10× faster than GPU for the same model.[5]
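Some quick arithmetic shows what the stated 12× CPU gap means for a full 4K-token context; note that the CPU rate below is derived from Google's ratio, not independently measured.

```python
# Prefill latency for a full 4K-token context, from the figures above.
# The CPU rate is implied by the stated "12x faster" claim, not a measurement.
context_tokens = 4096
npu_prefill_rate = 1600                   # tok/s, Dimensity 9500 NPU
cpu_prefill_rate = npu_prefill_rate / 12  # derived from the 12x claim

npu_latency = context_tokens / npu_prefill_rate   # seconds before first token
cpu_latency = context_tokens / cpu_prefill_rate
print(f"NPU: {npu_latency:.1f} s, CPU: {cpu_latency:.1f} s")
```

Roughly 2.6 seconds versus over 30 seconds before the first output token: the difference between an interactive feature and one users abandon.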

The CPU path uses XNNPACK, a high-performance kernel library, across all platforms including iOS, macOS, Windows, and Linux. GPU paths use OpenCL or OpenGL on Android and Metal/WebGPU on Apple silicon.

Compilation strategy matters significantly. LiteRT-LM supports both ahead-of-time (AOT) and just-in-time (JIT) compilation. For production deployments, AOT is necessary: JIT compilation for a model like Gemma 3 270M can take over a minute on-device — an unacceptable cold-start penalty in a user-facing application. AOT-compiled model packages ship ready to execute.


Performance: What the Numbers Say

The benchmark landscape for on-device LLMs is fragmented — devices, quantization levels, context lengths, and hardware backends vary. The following table consolidates published data from official Google sources.

| Device | SoC | Backend | Model | Prefill (tok/s) | Decode (tok/s) |
|---|---|---|---|---|---|
| Samsung Galaxy S25 Ultra | Snapdragon 8 Elite | NPU (QNN) | Gemma 3 1B INT4 | 5,836 | 84.8 |
| Samsung Galaxy S25 Ultra | Snapdragon 8 Elite | NPU (QNN) | FastVLM-0.5B INT8 | >11,000 | >100 |
| Samsung Galaxy S24 Ultra | Snapdragon 8 Gen 3 | GPU | Gemma 3n E2B INT4 | 816 | 15.6 |
| Vivo X300 Pro | Dimensity 9500 | NPU (NeuroPilot) | Gemma 3n E2B INT4 | >1,600 | 28 |
| MacBook Pro M3 | Apple M3 | CPU (XNNPACK) | Gemma 3 1B INT4 | 422 | 66.9 |

Decode speed — the rate at which a model generates new tokens during a response — is the most user-perceptible metric. At 84.8 tokens per second (Gemma 3 1B on Snapdragon 8 Elite NPU), output renders visibly faster than a human reads. At 15.6 tokens per second (Gemma 3n E2B on mid-tier GPU), it is usable but noticeably slower.
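Translating these rates into user-facing latency is straightforward arithmetic; the words-per-token (~0.75) and reading-speed (~250 wpm) figures below are common rules of thumb, not numbers from the benchmarks.

```python
# Rough end-to-end latency for a 300-token reply, using the table's numbers.
# Words-per-token (~0.75) and reading speed (~250 wpm) are rough assumptions.

def response_seconds(prompt_tokens, reply_tokens, prefill_rate, decode_rate):
    ttft = prompt_tokens / prefill_rate        # time to first token
    return ttft + reply_tokens / decode_rate   # plus generation time

# Gemma 3 1B on the S25 Ultra NPU: 5,836 tok/s prefill, 84.8 tok/s decode
fast = response_seconds(512, 300, 5836, 84.8)
# Gemma 3n E2B on a mid-tier GPU: 816 tok/s prefill, 15.6 tok/s decode
slow = response_seconds(512, 300, 816, 15.6)

# 84.8 tok/s decode in words per minute, versus ~250 wpm reading speed:
decode_wpm = 84.8 * 0.75 * 60
print(f"fast: {fast:.1f} s, slow: {slow:.1f} s, decode: {decode_wpm:.0f} wpm")
```

Under these assumptions the NPU path finishes a full reply in under 4 seconds while the mid-tier GPU path takes nearly 20, even though both decode faster than a human reads.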

Google states that on TF 2.21 (released March 6, 2026), the LiteRT GPU backend is 1.4× faster than the legacy TFLite GPU implementation for the same models, and that asynchronous execution with zero-copy buffer interoperability adds up to 2× performance improvement for pipelines that chain inference steps.[6]


Supported Models and Deployment Scale

LiteRT-LM ships pre-tested against the following models, available pre-converted and quantized at the LiteRT Hugging Face Community:

| Model | Format | Size | Quantization | Context |
|---|---|---|---|---|
| Gemma 3 1B Instruct | .litertlm | 557 MB | INT4 | 4,096 |
| Gemma 3n E2B | .litertlm | 2,965 MB | INT4 | 4,096 |
| Gemma 3n E4B | .litertlm | 4,235 MB | INT4 | 4,096 |
| FunctionGemma 270M | .litertlm | 288 MB | INT8 | 1,024 |
| Phi-4-mini Instruct | .litertlm | 3,728 MB | INT8 | 4,096 |
| Qwen 2.5 1.5B Instruct | .litertlm | 1,524 MB | INT8 | 4,096 |
| Llama 3.2 1B / 3B Instruct | .litertlm | | | |
| DeepSeek-R1-Distill-Qwen-1.5B | .litertlm | | | |

Gemma 3n is the first multimodal model in the stack, supporting text and image inputs (with audio recently added in the AI Edge Gallery demo). At 288 MB quantized, FunctionGemma 270M is notable for function-calling use cases on extremely constrained devices.


LiteRT vs. the Alternatives

On-device LLM inference has attracted several competing frameworks. The differences are real and architecture-dependent:

| Framework | Primary Format | NPU Support | iOS Support | Cross-Platform | Status |
|---|---|---|---|---|---|
| LiteRT-LM | .litertlm | Qualcomm, MediaTek (Android) | CPU + GPU (Metal) | Yes | Production (alpha API) |
| llama.cpp | GGUF | No | CPU + Metal GPU | Yes | Open source, widely used |
| ONNX Runtime | ONNX | Via Execution Providers | Yes | Yes | Production |
| Core ML | .mlmodel | Apple Neural Engine | Apple only | No | Production (Apple) |
| MediaPipe LLM | .task | Limited | Yes | Yes | Deprecated |

Google’s own published benchmarks show LiteRT outperforming llama.cpp on both CPU and GPU for Gemma 3 1B on the Galaxy S25 Ultra — with the NPU path adding a further dimension unavailable in llama.cpp entirely.[7]

The more interesting comparison is with ONNX Runtime. Both target cross-platform deployment and use delegation models for hardware acceleration. LiteRT has deeper Android ecosystem integration (Play Services, AICore, Chrome, Pixel hardware partnerships) and production NPU paths for Qualcomm and MediaTek silicon. ONNX Runtime has broader enterprise coverage outside mobile. For Android-first development with flagship hardware targets, LiteRT-LM’s NPU path is currently the highest-performance option.


Production Deployments

LiteRT-LM’s architecture claims are corroborated by shipping products. As of late 2025, Google confirmed that LiteRT-LM powers Gemini Nano-based features in three production environments:

  • Chrome browser: Tab management and text analysis AI features
  • Chromebook Plus: On-device AI features tied to the product line
  • Pixel Watch: Smart Replies feature

Google states these deployments collectively reach “hundreds of millions of devices.”[3] This is a production inference stack running inside a browser and a wearable OS, not a demo.

The Google AI Edge Gallery — an open beta Android app demonstrating on-device LLMs, image analysis, and speech-to-text with zero internet dependency — reached 500,000 APK downloads within two months of its GitHub launch and is now available on Google Play.[8] An iOS version is planned.


Limitations Practitioners Need to Know

The performance narrative is real, but bounded:

Device requirements are non-negotiable. The flagship NPU benchmarks require flagship hardware. Gemma 3n E4B at 4.2 GB of RAM fits only in phones with 8+ GB of available memory. Mid-range devices fall back to GPU or CPU paths with commensurately lower throughput.

Quantization trades accuracy for size. INT4 post-training quantization reduces model size 2.5–4× versus BF16 full precision. Published evaluations show ~2% degradation on MMLU-Pro reasoning benchmarks for INT4 models — acceptable for most applications, but a real trade-off that increases with smaller model sizes.
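The size arithmetic behind that 2.5–4× figure is simple bits-per-weight accounting; parameter counts below are nominal, and real shipped files add tokenizer and metadata overhead on top.

```python
# Why INT4 lands near a 4x size reduction: bits-per-weight accounting.
# Parameter counts are nominal; real files add tokenizer/metadata overhead.

def model_mb(params_billions: float, bits_per_weight: int) -> float:
    """Weight storage in MB for a given parameter count and precision."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e6

bf16 = model_mb(1.0, 16)  # nominal 1B model at full BF16 precision
int4 = model_mb(1.0, 4)   # same model at INT4 -- near the shipped 557 MB
print(f"BF16: {bf16:.0f} MB, INT4: {int4:.0f} MB, ratio: {bf16 / int4:.1f}x")
```

The pure-weights ratio is exactly 4×; the published 2.5–4× range reflects mixed-precision layers and the non-weight payload that does not shrink with quantization.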

NPU support is Android-only, and not all Android. The Qualcomm and MediaTek NPU backends cover the majority of current flagship Android devices, but older Qualcomm SoCs, Samsung Exynos devices, and all iOS devices have no NPU path through LiteRT-LM. iOS inference runs on CPU (XNNPACK) and GPU (Metal/WebGPU). An experimental Core ML delegate exists as of LiteRT 2.4.0 that routes some ops through the Apple Neural Engine, but it is labeled beta and has limited model support.[9]

LiteRT-LM is still in alpha. As of late March 2026, the API is at v0.9.0-alpha. Swift and Python bindings are “in development.” The API surface can change between releases. Factor this into production adoption timelines.


Getting Started

The conversion pipeline for custom models follows a consistent path:

1. Install conversion tooling:

```shell
pip install litert-torch ai-edge-quantizer
```

2. Convert a PyTorch model to LiteRT format:

```shell
python -m litert_torch.export \
  --model my_model.pt \
  --output my_model.litertlm
```

3. Apply INT4 post-training quantization:

```shell
python -m ai_edge_quantizer.quantize \
  --model my_model.litertlm \
  --quantization_config int4_default \
  --output my_model_int4.litertlm
```
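If you script this pipeline, a small wrapper can assemble the commands for you; the tool names and flags below are taken verbatim from the steps above, while the file paths are placeholders.

```python
# Illustrative helper that assembles the conversion-pipeline commands above.
# Tool names and flags follow the listed steps; paths are example placeholders.
import shlex

def pipeline_commands(model_pt: str, out_stem: str) -> list[str]:
    """Return the convert and quantize commands, ready for subprocess.run."""
    convert = (f"python -m litert_torch.export "
               f"--model {model_pt} --output {out_stem}.litertlm")
    quantize = (f"python -m ai_edge_quantizer.quantize "
                f"--model {out_stem}.litertlm "
                f"--quantization_config int4_default "
                f"--output {out_stem}_int4.litertlm")
    return [convert, quantize]

for cmd in pipeline_commands("my_model.pt", "my_model"):
    print(shlex.split(cmd))  # argv list, e.g. for subprocess.run(...)
```

Keeping the commands as data rather than running them inline makes it easy to log, dry-run, or parallelize conversions across a model zoo.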

For benchmarking at scale before deployment, AI Edge Portal provides access to Google’s lab of 100+ physical Android device models — useful for understanding performance distribution across the long tail of Android hardware before shipping.

Pre-converted models are available at huggingface.co/litert-community. For most developers, starting with an existing Gemma 3 1B or Qwen 2.5 1.5B quantized model is faster than converting from scratch.


Frequently Asked Questions

Q: Does LiteRT-LM replace TensorFlow Lite for classical ML tasks? A: LiteRT is the renamed TFLite runtime and is backward-compatible with existing .tflite models and apps. LiteRT-LM is a new component within the ecosystem specifically for generative AI models — it does not replace classical ML workflows, which continue to use the standard LiteRT runtime.

Q: Can I run LiteRT-LM models on an iPhone? A: Yes. LiteRT-LM supports iOS using CPU (XNNPACK) and GPU (Metal/WebGPU) backends. An experimental Core ML delegate enables partial Apple Neural Engine access, but it is currently beta and has limited operator coverage. Expect CPU/GPU-level performance on iOS rather than the NPU benchmarks published for Android flagship devices.

Q: What is the minimum device required for a usable on-device LLM experience? A: There is no published minimum spec, but practical experience suggests 6+ GB RAM for Gemma 3 1B (557 MB INT4) and a mid-range or better GPU for acceptable decode speeds. Flagship Snapdragon 8-series or Dimensity 9000-series devices with NPU support deliver the best experience. Entry-level devices may struggle with models above 600 MB.

Q: How does LiteRT-LM compare to llama.cpp for Android development? A: On CPU and GPU, Google’s published benchmarks show LiteRT-LM outperforming llama.cpp for Gemma 3 1B on current flagship hardware. More significantly, LiteRT-LM provides production NPU paths (Qualcomm QNN, MediaTek NeuroPilot) that llama.cpp does not support — enabling the 5,000–11,000+ tokens/sec prefill speeds that make on-device inference practically competitive with cloud latency.

Q: Is LiteRT-LM production-ready today? A: The underlying LiteRT runtime reached production-ready status with TensorFlow 2.21 in March 2026. LiteRT-LM itself is at v0.9.0-alpha, with the Kotlin and C++ APIs stable but the Swift and Python bindings still in development. Google runs LiteRT-LM in production in Chrome, Chromebook Plus, and Pixel Watch, which is meaningful evidence of stability — but the public API may still change before a stable release designation.


Footnotes

  1. Google Developers Blog. “TensorFlow Lite is now LiteRT.” September 4, 2024. developers.googleblog.com/tensorflow-lite-is-now-litert/

  2. Google Developers Blog. “LiteRT: The Universal Framework for On-Device AI.” developers.googleblog.com/en/litert-the-universal-framework-for-on-device-ai/

  3. Google Developers Blog. “On-device GenAI in Chrome, Chromebook Plus, and Pixel Watch with LiteRT-LM.” September 24, 2025. developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/

  4. Google Developers Blog. “Unlocking Peak Performance on Qualcomm NPU with LiteRT.” November 2025. developers.googleblog.com/unlocking-peak-performance-on-qualcomm-npu-with-litert/

  5. Google Developers Blog. “MediaTek NPU and LiteRT: Powering the Next Generation of On-Device AI.” December 2025. developers.googleblog.com/mediatek-npu-and-litert-powering-the-next-generation-of-on-device-ai/

  6. Dev Journal. “Google Launches TensorFlow 2.21 and LiteRT.” March 7, 2026. earezki.com/ai-news/2026-03-07-google-launches-tensorflow-221-and-litert-faster-gpu-performance-new-npu-acceleration-and-seamless-pytorch-edge-deployment-upgrades/

  7. Google Developers Blog. “On-device SLMs with multimodality, RAG, and Function Calling.” developers.googleblog.com/google-ai-edge-small-language-models-multimodality-rag-function-calling/

  8. Google Developers Blog. “Google AI Edge Gallery: Now with audio and on Google Play.” September 2025. developers.googleblog.com/google-ai-edge-gallery-now-with-audio-and-on-google-play/

  9. Google AI Edge. “LiteRT Core ML delegate.” ai.google.dev/edge/litert/ios/coreml
