Off Grid's v0.0.88/v0.0.89 release cluster, shipped on April 17, 2026, adds Hexagon HTP/NPU acceleration for text inference and an explicit backend selector: CPU, OpenCL, HTP, or Metal. The developer reports on Hacker News that models previously running at 10 tok/s hit 30 tok/s on the HTP path with q4_0 KV cache quantization. That 3× figure is self-reported and not independently verified, but it marks the first time the app exposes Snapdragon NPU acceleration for LLM text generation.

What v0.0.88/v0.0.89 Actually Ships

To be precise: v0.0.89 is a maintenance release — it restricts a DownloadRetrying event listener to Android only and bumps the version number. The substantive changes are in v0.0.88, published the same day. That release ships:

  • Inference backend selector: explicit choice between CPU, OpenCL, Hexagon HTP (Snapdragon NPU), and Metal (Apple).
  • HTP/NPU text acceleration: the first time the app exposes Snapdragon’s Hexagon DSP for LLM text generation, not just image workloads.
  • Image generation without a text model: previously the two were coupled; now they run independently.
  • Download migration: backend switched to Room database and WorkManager for improved Android reliability.
  • Remote API key support: users can point the app at a remote server if full local isolation is not required.
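
To make the selector concrete: a minimal sketch of what explicit backend choice with fallback might look like. This is illustrative pseudologic, not Off Grid's actual implementation — the function name, preference order, and backend labels are all assumptions.

```python
# Illustrative backend-selection fallback: prefer the platform's
# accelerated path, fall back to CPU. Not Off Grid's real logic.
def pick_backend(platform: str, has_htp: bool, available: set) -> str:
    if platform == "ios":
        prefs = ["metal", "cpu"]
    else:  # android: try the NPU first if the SoC exposes it
        prefs = (["htp"] if has_htp else []) + ["opencl", "cpu"]
    for backend in prefs:
        if backend in available:
            return backend
    return "cpu"  # CPU is always assumed present

print(pick_backend("android", True, {"htp", "opencl", "cpu"}))  # -> htp
print(pick_backend("ios", False, {"metal", "cpu"}))             # -> metal
```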

The developer’s HN post frames the speed claim as: “On q4_0, models that were doing 10 tok/s are hitting 30.” The mechanism is the Hexagon HTP path combined with configurable KV cache quantization — switching from f16 to q4_0 KV cache, the developer writes on dev.to, “roughly triples inference speed with minimal quality impact.” Both are developer self-assessments with no independent benchmark reproduction in public sources as of April 24, 2026.
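
The KV cache switch is about memory as much as speed. A rough footprint sketch, using hypothetical 7B-class dimensions (32 layers, 8 KV heads, head dim 128, 4096-token context — illustrative values, not Off Grid's defaults) and llama.cpp's q4_0 block layout of roughly 4.5 bits per value:

```python
# Rough KV cache footprint per element type. K and V each store
# n_layers * n_ctx * n_kv_heads * head_dim values, hence the factor 2.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    values = 2 * n_layers * n_ctx * n_kv_heads * head_dim
    return values * bits_per_value / 8

F16_BITS = 16.0
Q4_0_BITS = 4.5  # 4-bit values plus a per-block fp16 scale in llama.cpp

f16 = kv_cache_bytes(32, 8, 128, 4096, F16_BITS)
q4 = kv_cache_bytes(32, 8, 128, 4096, Q4_0_BITS)
print(f"f16 KV cache:  {f16 / 2**20:.0f} MiB")   # 512 MiB
print(f"q4_0 KV cache: {q4 / 2**20:.0f} MiB")    # 144 MiB
print(f"memory reduction: {f16 / q4:.1f}x")      # ~3.6x
```

The ~3.6× memory reduction is what the math gives; whether that translates into the developer's reported ~3× throughput gain depends on how memory-bandwidth-bound the device is, which the public sources do not break down.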

The vendor-documented performance tiers, per the official docs:

  • 6 GB RAM (minimum): 5–10 tok/s text
  • 8 GB RAM: 10–20 tok/s text
  • Snapdragon 8 Gen 3: 15–30 tok/s text
  • NPU image generation: 5–10 s per image
  • Vision inference: ~7 s

Typical English reading speed is approximately 4 tok/s. At 15 tok/s, streamed output comfortably outpaces reading; at 5 tok/s, output arrives at roughly reading pace: usable but not fast. All figures are vendor self-reported.
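
The pacing claim is easy to make concrete. A back-of-envelope comparison for a 300-token reply (the reply length is an assumption for illustration):

```python
# Seconds to stream a 300-token reply at each vendor tier, vs. the
# ~75 s it takes to read one at the ~4 tok/s reading pace cited above.
READ_RATE = 4  # tok/s, approximate English reading speed
reply = 300    # tokens, illustrative reply length

print(f"read time: {reply / READ_RATE:.0f}s")
for label, rate in [("6 GB floor (5 tok/s)", 5),
                    ("8 GB (15 tok/s)", 15),
                    ("HTP path (30 tok/s)", 30)]:
    print(f"{label}: {reply / rate:.0f}s to generate")
```

At 5 tok/s the reply finishes in 60 s against 75 s of reading time, which is why the floor tier reads as "functional"; at 30 tok/s the full reply lands in 10 s, well ahead of the reader.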

The Capability Ceiling: What 4–8B On-Device Quants Can and Cannot Do

The app’s practical ceiling is 4–8B quantized models. The developer is explicit about this: “smaller models (1B to 7B parameters) produce less sophisticated output for complex reasoning, but for everyday tasks like quick questions, summarization, drafting, and document analysis, it’s surprisingly capable.” That description is both accurate and the bound. Multi-step logical chains, deep code generation, and extended context work are where 4–8B quants lose fidelity relative to 70B+ models.

The minimum hardware bar is 6 GB RAM on an ARM64 chip from the last four to five years. v0.0.86 (April 8, 2026) pushed the floor further down by adding SmolVLM2 500M and SmolLM2 360M — explicitly targeting phones under 6 GB RAM, which represents a substantial slice of the global Android installed base.

The full model surface per the repository:

  • Text: Qwen 3, Llama 3.2, Gemma 3, Phi-4, any GGUF
  • Vision: SmolVLM, Qwen3-VL, Gemma 3n
  • Image generation: 20+ Stable Diffusion variants (Absolute Reality, DreamShaper, Anything V5)
  • Voice: on-device Whisper via whisper.cpp

Bring-your-own-GGUF is supported, so the model list is not hard-capped by the app’s release cadence.

Competitive Landscape: MLC LLM, PocketPal, and Why Distribution Changes the Adoption Curve

Off Grid is not the first open-weight mobile inference stack. MLC LLM Chat and PocketPal AI are direct predecessors on Android and iOS — both run open-weight models on-device and have been doing so for longer. The technical capability gap between these projects is narrower than Off Grid’s app store presence implies.

What Off Grid adds is consumer-grade distribution: it is packaged and listed as a standard app on the App Store and Play Store. A non-developer can install it in one tap without sideloading, configuring model directories, or building from source. For the privacy-first and offline-first use case, that packaging changes the adoption curve from “developer project” to “install and run.” The 1.7k GitHub stars as of April 24, 2026 and the earlier Hacker News thread (124 points, 66 comments) reflect early-adopter traction; the audience is largely technical users exploring the privacy and offline angles, not a general-consumer wave yet.

The repository is MIT-licensed, carries no telemetry, requires no accounts, and is built on React Native (TypeScript 93.5%), Kotlin, and Swift. Inference runs on llama.cpp for text, whisper.cpp for voice, Core ML/ANE for the iOS Metal path, and a Stable Diffusion backend for image generation. The code is auditable line-by-line, and the MIT license means forks and custom builds face no legal friction.

Apple Intelligence and Gemini Nano Are Not the Same Problem

Positioning Off Grid against Apple Intelligence and Gemini Nano as equivalent “network-bound” alternatives elides real architectural differences. Apple Intelligence uses on-device models under 3B parameters for tasks it can handle locally and routes more complex requests to Private Cloud Compute — Apple’s server-side processing infrastructure with hardware-attested privacy guarantees. It is a tiered system where off-device compute has explicit attestation, not simply a cloud API call that happens when the model struggles.

Gemini Nano is architecturally more aggressive about staying on-device and targets OEM Android integration rather than developer-installed apps. The two do not belong in the same comparison bucket.

Off Grid’s genuine differentiation against both is not inference speed and not model quality — proprietary on-device stacks have hardware integration advantages that llama.cpp on a Snapdragon 8 Gen 3 cannot match. The differentiation is auditability. An engineer can read every line of Off Grid’s codebase and independently confirm no network egress occurs during inference. Neither Apple Intelligence nor Gemini Nano offers that property; their privacy claims rest on attestation and policy, not inspectable source code.

The Off-Grid Use Case: SIM-Less, Wi-Fi-Less, Surveillance-Resistant

The name is a precise use case specification. The scenarios Off Grid genuinely serves are narrow: devices operating without SIM or Wi-Fi, users whose threat model includes network traffic analysis, and field deployments where cloud dependencies are either unavailable or operationally unacceptable. Summarization, drafting, document analysis, quick Q&A, and voice transcription within that constraint — that is the functional envelope the developer describes.

What it does not serve: tasks requiring better-than-4–8B quant quality, extended context windows, or real-time information access. The capability ceiling is honest, and the developer’s own framing reflects it without overselling.

What Comes Next: Memory Constraints, Context Length, and the Path to 13B On-Device

The binding constraint for the next capability tier is accessible RAM. 13B models require substantially more addressable memory per process than most current Android phones expose to a single application, even on hardware that ships with 12 GB physical RAM. Getting there requires OS-level memory limit changes, quantization schemes below q4_0 that maintain acceptable quality, or hardware that catches up on per-app memory ceilings.
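
The arithmetic behind that constraint is straightforward. A sketch of the weights-only footprint of a 13B model at common llama.cpp quant levels (bits-per-weight figures include block scales; KV cache and runtime overhead come on top of these numbers):

```python
# Approximate weights-only footprint of a 13B-parameter model.
# Bits per weight: f16 = 16, q8_0 = 8.5, q4_0 = 4.5 (block scales included).
def weights_gib(n_params, bits_per_weight):
    return n_params * bits_per_weight / 8 / 2**30

for quant, bits in [("f16", 16.0), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"13B @ {quant}: {weights_gib(13e9, bits):.1f} GiB")
```

Even at q4_0, the weights alone come to roughly 6.8 GiB — before KV cache — which is why a 12 GB phone that caps individual apps well below physical RAM cannot load the model, regardless of inference speed.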

Context length is a secondary ceiling: at 4–8B parameter scale, practical context windows shrink under memory pressure on mobile hardware, limiting document analysis to shorter inputs than the same model handles on a workstation.

The v0.0.84 release introduced device-aware model sorting (“For You”), trending badges, and lazy conversation creation — signaling that development attention is splitting between inference performance and model discovery UX. For a consumer-facing app at this stage of adoption, that split is rational. It also means future inference backend improvements compete with UI polish for velocity in what appears to be a solo-developer project.

Frequently Asked Questions

Does the 3× speedup apply to iOS users too?

The 10→30 tok/s claim is measured exclusively on the Hexagon HTP path for Snapdragon devices. iOS users have CPU and Metal backend options, but the developer has not published Metal-specific benchmarks, and community comparisons suggest Apple-side throughput may lag significantly behind the Snapdragon HTP numbers for equivalent models.

Can Gemini Nano be installed like Off Grid?

No. Gemini Nano ships embedded in specific Android device firmware at the OEM level — it is not available as a standalone download or sideload. Off Grid runs on any ARM64 device with 6 GB+ RAM regardless of manufacturer, making it accessible on the majority of Android phones sold in the last four to five years that Gemini Nano will never reach.

What happens with a custom GGUF outside the tested model list?

Off Grid supports bring-your-own-GGUF, but models outside the curated set fall outside the vendor’s documented performance tiers. Users are responsible for their own compatibility and quality testing — there is no guarantee a given GGUF will fit in per-app memory or produce usable inference speed on a specific device tier.

What quantization level are the tier benchmarks based on?

The vendor’s 5–30 tok/s performance tiers assume q4_0 quantization. Running models at less aggressive quantization (q8_0, f16) significantly increases the memory footprint, which can push a model out of a 6 GB phone’s available per-app RAM entirely — making it unloadable regardless of how fast or slow inference would be.
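
The "unloadable" outcome follows directly from the footprint math. A sketch for a hypothetical 7B model against an assumed ~4 GiB usable per-app budget on a 6 GB phone (the budget figure is an illustration, not a documented limit):

```python
# Weights-only footprint of a 7B model at each quant level, checked
# against an assumed ~4 GiB usable budget on a 6 GB phone.
BUDGET_GIB = 4.0  # assumption for illustration

for quant, bits in [("q4_0", 4.5), ("q8_0", 8.5), ("f16", 16.0)]:
    gib = 7e9 * bits / 8 / 2**30
    verdict = "fits" if gib <= BUDGET_GIB else "does not fit"
    print(f"7B @ {quant}: {gib:.1f} GiB -> {verdict}")
```

Under these assumptions only q4_0 (~3.7 GiB) fits; q8_0 (~6.9 GiB) and f16 (~13 GiB) exceed the budget outright, matching the vendor's choice of q4_0 as the benchmark baseline.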

What OS-level change would 13B on-device models require?

The binding constraint is Android’s per-process memory allocation ceiling — typically 512 MB to 2 GB depending on device and Android version — not total physical RAM. Phones shipping 12 GB still restrict individual apps to a fraction of that. Running 13B models at usable quantization would require OS-level policy changes to raise that per-app cap, not just faster hardware.

Sources

  1. Off Grid Mobile AI — GitHub Releases (primary; accessed 2026-04-24)
  2. Hacker News: Off Grid — On-device AI, 3x faster (developer post) (community; accessed 2026-04-24)
  3. Dev.to: How to Run LLMs Locally on Your Android Phone in 2026 (vendor; accessed 2026-04-24)
  4. Off Grid Official Documentation (vendor; accessed 2026-04-24)
  5. Hacker News: Show HN — Off Grid run AI offline on your phone (community; accessed 2026-04-24)
  6. Off Grid v0.0.86 Release Notes — SmolVLM2 500M support (primary; accessed 2026-04-24)
  7. Off Grid Mobile AI — GitHub Repository (primary; accessed 2026-04-24)
