Google LiteRT is the production runtime powering on-device large language models across hundreds of millions of Chrome browsers, Chromebook Plus laptops, and Pixel Watches—without any cloud call. Formerly known as TensorFlow Lite, it now supports full generative AI inference through LiteRT-LM, delivering 1.4x faster GPU performance than its predecessor and up to 100x speedup over CPU inference via NPU acceleration.
What Is Google LiteRT?
LiteRT is Google’s on-device inference framework for running ML and GenAI models directly on edge hardware—phones, laptops, browsers, and IoT devices. Google announced the rebrand from TensorFlow Lite in late 2024, not as a cosmetic change but as a signal of expanded scope.1
The old name reflected a single origin: TensorFlow. The new name reflects the reality that the runtime now supports models authored in PyTorch, JAX, and Keras—not just TensorFlow. LiteRT captures a multi-framework vision that TFLite couldn’t.
What changed structurally: LiteRT shipped a new CompiledModel API alongside the original Interpreter API. The CompiledModel API is designed for modern hardware acceleration pipelines—automated accelerator selection, hardware memory buffers via the TensorBuffer API, and async execution. The Interpreter API stays available for backward compatibility, but new projects should default to CompiledModel.
As of the TensorFlow 2.21 release in March 2026, LiteRT has fully graduated from preview to production.2 The framework now officially replaces TFLite for all future development.
How LiteRT Handles LLM Inference
Running a large language model on a phone requires more than a generic inference runtime. LLMs have stateful computation, key-value caches, multi-step token generation, and session management that traditional ML inference pipelines don’t accommodate well.
Google’s answer is LiteRT-LM: a C++-based orchestration layer built on top of LiteRT specifically for language model inference.3 It handles:
- KV-cache management — keeping attention states across turns without redundant recomputation
- Session cloning — branching inference sessions for parallel generation paths
- Prompt caching and scoring — reusing cached prefill computations across requests
- Stateful inference — maintaining conversation context properly across multi-turn exchanges
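The payoff of KV-cache management is easiest to see in miniature: once the attention states for a conversation prefix are cached, only the newly appended tokens of each turn need computing. A toy sketch of that bookkeeping—this is not the LiteRT-LM API, and every name here is invented for illustration:

```python
# Toy illustration of KV-cache reuse across conversation turns.
# Not the LiteRT-LM API -- all names here are invented for illustration.

class ToyKVCache:
    def __init__(self):
        self.cached_tokens = []   # tokens whose attention states are stored
        self.compute_count = 0    # attention computations performed so far

    def prefill(self, tokens):
        """Compute attention states only for tokens not already cached."""
        new = tokens[len(self.cached_tokens):]
        self.compute_count += len(new)
        self.cached_tokens.extend(new)

cache = ToyKVCache()
turn1 = ["system", "hello", "how", "are", "you"]
cache.prefill(turn1)                 # 5 computations
turn2 = turn1 + ["fine", "thanks", "and", "you"]
cache.prefill(turn2)                 # only 4 new computations
assert cache.compute_count == 9      # 5 + 4, not 5 + 9
```

Without the cache, turn two would recompute the entire 9-token prefix; across a long conversation that difference compounds on every turn.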
LiteRT-LM isn’t experimental. It’s the production backend running Gemini Nano in Chrome’s built-in AI APIs, Chromebook Plus tab management and text features, and the Smart Replies feature on Pixel Watch.4
Supported open-weight models include the full Gemma family (Gemma 3, Gemma 3n, EmbeddingGemma, FunctionGemma), Qwen, Phi, and FastVLM for multimodal tasks. The LiteRT Community on Hugging Face hosts pre-converted checkpoints ready for deployment.
```kotlin
// Android integration via MediaPipe LLM Inference API
val options = LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma-3-1b-it.bin")
    .setMaxTokens(1024)
    .setTopK(40)
    .setTemperature(0.8f)
    .setRandomSeed(101)
    .build()

val llmInference = LlmInference.createFromOptions(context, options)
val result = llmInference.generateResponse("Summarize this article:")
```

For PyTorch models, LiteRT provides the Torch Generative API—a Python module for reauthoring and converting PyTorch GenAI models into a format deployable via LiteRT-LM.
```shell
# Convert a PyTorch GenAI model for LiteRT deployment
pip install ai-edge-torch
python -m ai_edge_torch.generative.examples.gemma.convert_gemma_to_tflite \
    --checkpoint_path /path/to/model \
    --output_path /output/gemma.tflite
```

Hardware Acceleration: NPUs Change the Equation
The headline story isn’t the software—it’s what the software unlocks on modern chipsets.
Every flagship phone shipped since 2022 includes a Neural Processing Unit. Until LiteRT-LM, most of that silicon sat underutilized for LLM inference because vendor NPU APIs (Qualcomm’s AI Engine Direct, MediaTek’s APU) required navigating separate SDKs, compilers, and runtime dependencies.
LiteRT abstracts all of that into a unified acceleration interface. Developers write to one API; LiteRT handles the vendor-specific translation.5
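Behind a unified interface like this, backend selection amounts to a priority walk over whatever the device actually reports, with CPU as the guaranteed floor. A hypothetical sketch of that fallback logic—LiteRT's real API differs, and these names are illustrative only:

```python
# Hypothetical accelerator selection by priority with graceful fallback.
# Illustrative only -- not LiteRT's actual API.

PRIORITY = ["npu", "gpu", "cpu"]  # fastest first; cpu is always available

def select_accelerator(available: set) -> str:
    """Pick the fastest backend the device supports."""
    for backend in PRIORITY:
        if backend in available:
            return backend
    return "cpu"

assert select_accelerator({"npu", "gpu", "cpu"}) == "npu"  # flagship phone
assert select_accelerator({"gpu", "cpu"}) == "gpu"         # no NPU driver
assert select_accelerator({"cpu"}) == "cpu"                # baseline
```

The practical consequence is that developers can target the NPU path without writing a separate code path for devices that lack one.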
The performance numbers are significant. According to Google’s benchmarks on the Snapdragon 8 Elite Gen 5:6
- Time-to-first-token (TTFT): 0.12 seconds on a 1024-token prompt
- Prefill speed: over 11,000 tokens/second
- Decode speed: over 100 tokens/second
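These figures are mutually consistent: at the quoted prefill rate, pushing a 1024-token prompt through takes roughly 0.09 seconds, which accounts for most of the 0.12-second time-to-first-token. A quick arithmetic check, using the quoted numbers:

```python
# Sanity-check the quoted Snapdragon 8 Elite Gen 5 figures.
prompt_tokens = 1024
prefill_tps = 11_000          # quoted as "over 11,000 tokens/second"
prefill_time = prompt_tokens / prefill_tps
print(f"prefill alone: {prefill_time:.3f}s")   # ~0.093s of the 0.12s TTFT

decode_tps = 100              # quoted as "over 100 tokens/second"
reply_tokens = 256
print(f"256-token reply: {reply_tokens / decode_tps:.1f}s")
```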
On the Samsung Galaxy S25 Ultra with Gemma 3 1B, LiteRT outperforms llama.cpp on both CPU and GPU for prefill and decode. The NPU adds a further 3x performance gain over GPU for prefill specifically.
The quantization strategy affects which hardware can be maximally utilized. LiteRT’s Qualcomm NPU path uses INT8 weight quantization with INT16 activations—the configuration that unlocks the NPU’s highest-speed kernel paths. INT4 quantization is available for smaller model footprints with additional accuracy trade-offs.
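The weight-quantization step itself is standard symmetric integer quantization; LiteRT's converter handles it internally, but a minimal sketch of the underlying INT8 arithmetic shows why accuracy is largely preserved:

```python
# Symmetric per-tensor INT8 quantization: w_q = round(w / scale), with
# scale chosen so the largest-magnitude weight maps to +/-127.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.51, -1.27, 0.02, 0.89]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step (~0.005 here)
assert all(abs(a - b) <= scale / 2 for a, b in zip(w, w_hat))
```

INT4 halves the storage again but with only 16 representable levels, which is where the additional accuracy trade-offs come from.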
MediaTek support follows a similar pattern. The LiteRT-MediaTek plugin enables NPU acceleration on Dimensity chipsets using both ahead-of-time (AOT) and on-device compilation paths.7
Real-World Performance: What the Numbers Mean
Academic benchmarks describe ceilings. Developer experience reveals floors.
The broader landscape of on-device LLM performance as of early 2026 shows wide variation by device class:8
| Device | Chipset | Approximate Decode Speed |
|---|---|---|
| Mac M4 Pro | Apple Silicon | ~173 tokens/sec |
| iPhone 17 Pro | A19 Pro | ~136 tokens/sec |
| Galaxy S25 Ultra | Snapdragon 8 Elite | ~91 tokens/sec |
| Mid-range Android | Snapdragon 7s Gen 3 | ~30-45 tokens/sec |
| Raspberry Pi 5 | Cortex-A76 | ~24 tokens/sec |
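To translate the table's decode speeds into user experience: tokens per second inverts to per-token latency, which determines how quickly text streams on screen. A rough conversion for a 200-token reply, assuming decode dominates (i.e., ignoring prefill):

```python
# Convert decode throughput into streaming latency, using the table's figures.
devices = {
    "Mac M4 Pro": 173,
    "iPhone 17 Pro": 136,
    "Galaxy S25 Ultra": 91,
    "Raspberry Pi 5": 24,
}
for name, tps in devices.items():
    per_token_ms = 1000 / tps
    reply_s = 200 / tps
    print(f"{name}: {per_token_ms:.1f} ms/token, 200-token reply in {reply_s:.1f}s")
```

The spread matters: a flagship streams a paragraph in about two seconds, while a Raspberry Pi takes over eight—usable, but a different product experience.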
LiteRT's most comprehensive support currently covers Android and Linux. iOS support runs through the MediaPipe LLM Inference API (which uses LiteRT-LM under the hood); full native Swift API support is available, but iOS NPU acceleration through the Apple Neural Engine remains less mature within the LiteRT ecosystem than the Qualcomm or MediaTek paths.
One friction point practitioners should anticipate: model size. Gemma 3 1B in INT4 quantization runs around 500MB-700MB. Gemma 3n E4B runs larger. Distributing these models with an app or prompting users to download them on first launch creates UX considerations that don’t exist with cloud API calls.
The power trade-off is also real. Research published in 2025 found on-device inference consumes 4-9x more energy than retrieving equivalent results from a remote server.9 The energy calculus favors on-device only where privacy requirements, offline capability, or latency sensitivity justify the cost.
LiteRT vs. The Alternatives
LiteRT isn’t the only on-device LLM runtime. Practitioners evaluating the space should understand where it fits.
| Runtime | Vendor | Primary Platform | NPU Support | GenAI Focus |
|---|---|---|---|---|
| LiteRT + LiteRT-LM | Google | Android, Linux, Web | Qualcomm, MediaTek | Strong |
| ExecuTorch | Meta | Android, iOS | Limited | Growing |
| llama.cpp | Community | All | Minimal | Strong |
| Core ML | Apple | iOS, macOS | Apple Neural Engine | Strong |
| ONNX Runtime | Microsoft | All | Varies | Moderate |
| TensorRT-LLM | NVIDIA | NVIDIA hardware | NVIDIA GPU/NPU | Strong |
ExecuTorch hit 1.0 GA in October 2025 and represents Meta’s serious commitment to on-device inference, with a notably small 50KB base footprint. For iOS-first developers, Core ML with Apple Neural Engine acceleration often outperforms LiteRT. For cross-platform Android development where Qualcomm or MediaTek NPU acceleration matters, LiteRT-LM has the most production-proven path.
llama.cpp remains the benchmark reference and the easiest path for experimentation, but LiteRT outperforms it on supported hardware, as the Galaxy S25 Ultra benchmarks confirm.
What’s Already Shipping
The most important signal about LiteRT maturity is production deployment scale.
Google’s own products rely on LiteRT-LM today:
- Chrome Built-in AI APIs: Web developers can call `window.ai.createTextSession()` to access on-device Gemini Nano through Chrome, with LiteRT-LM handling inference behind the browser API surface
- Chromebook Plus: Tab organization suggestions, text analysis, and writing assistance run on-device without cloud round-trips
- Pixel Watch: Smart Replies generation runs locally on the watch, enabling AI features without phone connectivity
That third case is instructive. A smartwatch running LLM inference represents a meaningful constraint environment—limited compute, limited memory, battery-critical operation—and LiteRT-LM ships there in production. The engineering decisions about model size, quantization, and hardware routing that make Pixel Watch AI work are the same decisions third-party developers need to make.10
Google also shipped the AI Edge Gallery app (available on Google Play and the App Store) as an experimental showcase. It lets users download and run Gemma and other open-weight models from Hugging Face entirely on-device, with no network requirement after the initial download. The app supports AI Chat, image question-answering, code generation, and a fully offline mini game powered by on-device natural language inference—a deliberate proof-of-concept for what locally-run GenAI can enable.11
The Infrastructure Argument
The cloud AI model rests on a latency tax, a privacy cost, and a connectivity requirement. For many applications, those are acceptable trade-offs. For a growing category, they aren’t.
Medical applications handling patient data face regulatory constraints that cloud inference complicates. Enterprise applications processing proprietary documents face similar pressure. Consumer applications that need to work in subways, on planes, or in rural areas can’t bet on connectivity. Wearables operating at the edge of battery life can’t afford the radio power for every inference call.
LiteRT’s architecture addresses the infrastructure argument directly: move compute to the data, rather than data to the compute. The 1.4x GPU improvement over TFLite and the NPU acceleration paths that deliver up to 100x speedup over CPU inference are engineering achievements that make this argument viable for increasingly capable models.
The Gemma 3n architecture—which Google designed specifically for mobile deployment—demonstrates where model design and runtime optimization converge. The 3n in the name refers to the mobile-native design; the E2B and E4B variants (effective 2B and 4B parameters) support multimodal inputs including text, vision, and audio, running on hardware that cost $800 four years ago.12
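The "effective parameters" framing matters for memory budgeting. A back-of-envelope for raw weight storage at common bit widths—ignoring KV-cache and activation memory, which add more on top:

```python
# Rough weight-storage footprint by quantization bit width.
def weight_bytes_gb(params_billions: float, bits: int) -> float:
    """Gigabytes needed for the weights alone at a given bit width."""
    return params_billions * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"4B params @ {bits}-bit: {weight_bytes_gb(4, bits):.1f} GB")
# INT4 brings an effective-4B model down to ~2 GB of weights
```

The same arithmetic explains the earlier figure for Gemma 3 1B: one billion parameters at 4 bits is roughly 0.5 GB before packaging overhead.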
Whether on-device AI represents a fundamental architectural shift or a specialized niche depends on use case. The runtime infrastructure no longer limits the answer.
Frequently Asked Questions
Q: Is LiteRT backward compatible with TensorFlow Lite models?
A: Yes. LiteRT reads existing .tflite files without modification, and the Interpreter API is fully compatible with TFLite. Migration requires only a package name update in your build configuration.
Q: Which Android devices support LiteRT NPU acceleration?
A: Devices with Qualcomm Snapdragon 8 series and MediaTek Dimensity chipsets with dedicated NPUs are supported. LiteRT automatically falls back to GPU or CPU on unsupported hardware, so NPU acceleration is an enhancement rather than a requirement.
Q: How large are LiteRT-compatible LLM models, and how should developers handle distribution?
A: Quantized models range from approximately 500MB (Gemma 3 1B INT4) to several gigabytes. The standard approach is on-demand download after app install, not bundling with the APK. Google’s AICore service on Android 14+ devices can manage model caching across apps.
Q: Can LiteRT models run on iOS?
A: Yes, via the MediaPipe LLM Inference API with Swift bindings. However, iOS NPU acceleration through the Apple Neural Engine is not currently available through LiteRT—inference runs on CPU and GPU. For ANE acceleration on iOS, Core ML is the more mature option.
Q: What’s the difference between LiteRT and LiteRT-LM?
A: LiteRT is the general-purpose on-device inference runtime (the successor to TFLite). LiteRT-LM is a specialized orchestration layer built on top of LiteRT that handles LLM-specific requirements: KV-cache management, session state, multi-turn conversation, and prompt caching. Most developers working with language models will interact with LiteRT-LM APIs or higher-level wrappers like the MediaPipe LLM Inference API.
Sources:
- On-device GenAI in Chrome, Chromebook Plus, and Pixel Watch with LiteRT-LM
- LiteRT: The Universal Framework for On-Device AI
- GitHub: google-ai-edge/LiteRT-LM
- Unlocking Peak Performance on Qualcomm NPU with LiteRT
- MediaTek NPU and LiteRT
- Google Launches TensorFlow 2.21 And LiteRT
- On-Device LLMs: State of the Union, 2026
- Google AI Edge Gallery
- On-Device or Remote? Energy Efficiency Paper
- Deploy GenAI Models with LiteRT — Official Docs
Footnotes
1. Google Developers Blog. “TensorFlow Lite is now LiteRT.” Google, 2024. https://developers.googleblog.com/tensorflow-lite-is-now-litert/
2. MarkTechPost. “Google Launches TensorFlow 2.21 And LiteRT: Faster GPU Performance, New NPU Acceleration, And Seamless PyTorch Edge Deployment Upgrades.” March 2026. https://www.marktechpost.com/2026/03/06/google-launches-tensorflow-2-21-and-litert-faster-gpu-performance-new-npu-acceleration-and-seamless-pytorch-edge-deployment-upgrades/
3. GitHub. “google-ai-edge/LiteRT-LM.” https://github.com/google-ai-edge/LiteRT-LM
4. Google Developers Blog. “On-device GenAI in Chrome, Chromebook Plus, and Pixel Watch with LiteRT-LM.” https://developers.googleblog.com/en/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/
5. Google AI for Developers. “NPU acceleration with LiteRT.” https://ai.google.dev/edge/litert/next/npu
6. Google Developers Blog. “Unlocking Peak Performance on Qualcomm NPU with LiteRT.” https://developers.googleblog.com/unlocking-peak-performance-on-qualcomm-npu-with-litert/
7. Google Developers Blog. “MediaTek NPU and LiteRT: Powering the next generation of on-device AI.” https://developers.googleblog.com/mediatek-npu-and-litert-powering-the-next-generation-of-on-device-ai/
8. v-chandra.github.io. “On-Device LLMs: State of the Union, 2026.” https://v-chandra.github.io/on-device-llms/
9. Malavolta, I. et al. “On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Content.” CAIN 2025. http://www.ivanomalavolta.com/files/papers/CAIN_2025.pdf
10. Google Developers Blog. “LiteRT: The Universal Framework for On-Device AI.” https://developers.googleblog.com/litert-the-universal-framework-for-on-device-ai/
11. TechCrunch. “Google quietly released an app that lets you download and run AI models locally.” May 2025. https://techcrunch.com/2025/05/31/google-quietly-released-an-app-that-lets-you-download-and-run-ai-models-locally/
12. Hugging Face. “google/gemma-3n-E4B-it-litert-lm.” https://huggingface.co/google/gemma-3n-E4B-it-litert-lm