Running RAG on a Snapdragon NPU: The On-Device Retrieval Tradeoff

Running an entire RAG pipeline, embedding through generation, on a laptop NPU is a plausible deployment target on current Snapdragon X Elite hardware. The Qualcomm Hexagon NPU provides 45 TOPS of dedicated AI compute with shared access to the SoC’s unified memory, making it architecturally capable of handling the interleaved embedding-then-generation pattern that RAG requires. The constraint is physical: 16GB of soldered LPDDR5x shared across the CPU, GPU, NPU, and operating system is the hard ceiling on how much you can retrieve against.

What mobile NPUs deliver for inference

Google’s LiteRT Qualcomm AI Engine Direct (QNN) Accelerator, announced November 2025, positions the NPU as a dedicated AI compute path that runs in parallel with the GPU and CPU. According to Google, the NPU offers tens of TOPS of compute, is more power-efficient per TOPS than both CPUs and GPUs, and is present on over 80% of recent Qualcomm SoCs. The LiteRT accelerator abstracts away vendor-specific SDK complexity and SoC version fragmentation, letting developers deploy .tflite models to the NPU through a unified API with either ahead-of-time or on-device compilation.

The NPU’s advantage over the integrated GPU for inference workloads is partly architectural and partly a software-path issue. Mobile integrated GPUs like the Snapdragon X Elite’s Adreno are designed primarily for rendering. Running transformer workloads through an unoptimized compute backend appears to hit kernel dispatch overhead and memory bandwidth penalties that negate whatever raw compute advantage the GPU might offer. The LiteRT blog’s framing is straightforward: the NPU exists to offload the GPU, freeing it for rendering while the NPU handles heavy AI processing.

Memory: the fixed ceiling

Many Snapdragon X Elite laptops ship with 16GB of soldered LPDDR5x shared across the CPU, GPU, NPU, and the operating system. Higher-memory configurations exist, but the memory is soldered either way, so whatever the machine ships with is the ceiling. The SoC combines a 12-core Oryon CPU, integrated Adreno graphics, and a Hexagon NPU. A production RAG index that might comfortably sit in a cloud VM’s 128GB RAM does not fit here.

The tradeoff is structural. You gain zero per-token API cost and full data locality. You lose elastic scaling. A cloud vector database can absorb a growing corpus by adding nodes. A soldered-memory laptop cannot. As the retrieval index grows toward the memory bound, bandwidth contention and eviction pressure will degrade whatever throughput the NPU provides.
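To make the ceiling concrete, here is a rough back-of-envelope sketch of how many chunks a flat index could hold once everything else is resident. Only the 16GB total comes from the text above. The reserves for the OS, generator, KV cache, and embedder, along with the 384-dimension vectors and 1,000 bytes of stored text per chunk, are illustrative assumptions, and the names MemoryBudget and max_chunks are hypothetical. A real index would also carry graph or metadata overhead that this ignores.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class MemoryBudget:
    """All figures except total_gib are illustrative assumptions."""
    total_gib: float = 16.0       # shared LPDDR5x pool cited in the text
    os_and_apps_gib: float = 6.0  # assumption: OS, browser, background apps
    llm_gib: float = 2.0          # assumption: quantized ~3B generator weights
    kv_cache_gib: float = 1.0     # assumption: context cache during generation
    embedder_gib: float = 0.5     # assumption: small embedding model

    def index_bytes(self) -> int:
        """Bytes left for the retrieval index after everything else is resident."""
        reserved = (self.os_and_apps_gib + self.llm_gib
                    + self.kv_cache_gib + self.embedder_gib)
        return max(0, int((self.total_gib - reserved) * GIB))


def max_chunks(index_bytes: int, dim: int, bytes_per_value: int,
               text_bytes_per_chunk: int) -> int:
    """Chunks that fit in a flat index: one vector plus stored text per chunk."""
    per_chunk = dim * bytes_per_value + text_bytes_per_chunk
    return index_bytes // per_chunk


if __name__ == "__main__":
    budget = MemoryBudget()
    for label, width in (("float32", 4), ("int8", 1)):
        n = max_chunks(budget.index_bytes(), dim=384,
                       bytes_per_value=width, text_bytes_per_chunk=1000)
        print(f"{label}: about {n:,} chunks fit in "
              f"{budget.index_bytes() / GIB:.1f} GiB")
```

Changing the reserves shifts the answer far more than changing the vector precision does, which is the practical meaning of a fixed ceiling: every other tenant of the shared pool competes directly with the corpus.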

What the quantization study reveals about model choice

A separate arXiv study (2603.26603) profiling eight LLMs from 0.5B to 9B parameters on a flagship Android device found that model architecture, not quantization scheme, is the decisive factor for battery life. The authors document a “quantization energy paradox” where importance-aware quantization yields negligible energy savings compared to mixed-precision methods. Mixture-of-experts architectures offered 7B-parameter storage capacity with 1B-to-2B energy profiles, per their measurements. The study identifies a pragmatic sweet spot at mid-sized models like Qwen2.5-3B that balance response quality with sustainable energy consumption.

This finding matters directly for on-device RAG: if you are picking a model for NPU-based inference, the architecture decision (dense vs. MoE, parameter count) dominates the quantization decision. Spending optimization budget on quantization scheme rather than model architecture is the wrong order of operations.
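One way to see why architecture can outweigh quantization is to separate what a model stores from what it computes per token. The sketch below is a simplified proxy, not the study’s method: it assumes the common rule of thumb of roughly 2 FLOPs per active parameter per generated token and treats that as a stand-in for compute cost. The two model specs are hypothetical, chosen to mirror the 7B-storage, 1B-to-2B-active MoE shape described above.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelSpec:
    name: str
    total_params: float   # parameters that must be stored
    active_params: float  # parameters actually used for each token

    def weight_gib(self, bits_per_weight: float) -> float:
        """Weight storage at a given quantization width."""
        return self.total_params * bits_per_weight / 8 / GIB

    def flops_per_token(self) -> float:
        """Rule-of-thumb compute proxy: ~2 FLOPs per active parameter."""
        return 2.0 * self.active_params


# Hypothetical models for illustration only.
DENSE_3B = ModelSpec("dense-3B", total_params=3e9, active_params=3e9)
MOE_7B = ModelSpec("moe-7B (1.5B active)", total_params=7e9, active_params=1.5e9)

if __name__ == "__main__":
    for model in (DENSE_3B, MOE_7B):
        for bits in (4, 8):
            print(f"{model.name:22s} {bits}-bit: "
                  f"{model.weight_gib(bits):5.2f} GiB weights, "
                  f"{model.flops_per_token() / 1e9:.1f} GFLOPs/token")
```

In this proxy, changing the bit width moves only the storage column, while changing the architecture moves the per-token compute. Real energy use also depends on memory traffic and kernel efficiency, so this is a way to frame the decision order, not a substitute for measurement.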

What remains unproven

Production RAG systems handle heterogeneous corpora with PDFs, tables, code, and images, not just short passages from a homogeneous source. Those workloads exercise the retrieval and reranking stages differently and may expose memory bandwidth bottlenecks that simplified benchmarks do not surface.

Cross-vendor NPU generalization is an open question. Apple’s Neural Engine, Intel’s NPU, and MediaTek’s APU each have different memory architectures, compute topologies, and software stacks. Performance and energy ratios could shift significantly on any of them.

The WebAssembly angle is worth flagging. If NPU-accelerated on-device inference becomes a reliable deployment target, WASM runtimes that can target NPUs would extend the reach beyond native apps to browser-based inference. That path exists in prototype form but is not widely benchmarked.

For practitioners evaluating on-device RAG today, the tradeoff is clear: zero marginal inference cost and full data locality, in exchange for a fixed retrieval corpus that cannot grow beyond what the device’s soldered memory will hold. The hardware is shipping. The question is whether the corpus sizes real products require fit within the memory that hardware provides.

Frequently Asked Questions

How much worse is the integrated Adreno GPU than the NPU for this RAG workload?

On the laptop RAG paper’s benchmark (the Dell XPS 13 study, not the Android quantization study), the Adreno GPU was 1.7× slower than the CPU baseline and consumed 6.5× more system energy than the NPU, making it the worst of the three backends rather than a middle ground. The gap comes from OpenCL kernel dispatch overhead and memory bandwidth penalties on a rendering-oriented GPU, not from a raw compute deficiency. A discrete GPU with a mature compute stack (CUDA, ROCm) would produce very different results.

Does running RAG on the NPU degrade answer quality compared to CPU or GPU?

No. A GPT-4.1 LLM-as-judge evaluation scored NPU answers at 9.32 on a 10-point rubric, versus 8.95 for CPU and 9.03 for GPU. Across 120 queries, 86.7% produced identical scores on all three backends. The NPU marginally outscored the others, though the differences fall within measurement noise.

How representative is the paper’s benchmark of production RAG workloads?

The entire study profiles a single Dell XPS 13 configuration running 120 Wikipedia-passage queries against a fixed index. It does not vary index size, context length, or document type, so it cannot predict where memory bandwidth contention begins to erode the NPU’s throughput advantage. Google’s separate LiteRT benchmarks on the Snapdragon 8 Elite Gen 5 phone SoC show up to 100× CPU speedup across 72 models, suggesting the Hexagon NPU has headroom beyond what one laptop configuration captures, but no multi-device or multi-corpus data exists yet.

Which RAG pipeline stage benefits most from NPU acceleration?

The gains are stage-dependent. Embedding throughput hits 9.1× the CPU rate with 12.3× less system energy on indexing workloads. LLM prefilling, the prompt-processing phase before token generation begins, runs 18.1× faster than on the CPU. The paper does not isolate decode speed, so the token generation phase may narrow the overall 4.0× end-to-end latency advantage as output length grows.
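The narrowing effect is ordinary Amdahl-style arithmetic, sketched below under loud assumptions. The 9.1× and 18.1× figures are the paper’s embedding and prefill numbers, with the embedding figure borrowed from indexing workloads. The 1.5× decode speedup and all CPU stage timings are made-up placeholders, since the paper does not report decode separately.

```python
def end_to_end_speedup(cpu_seconds: dict[str, float],
                       speedup: dict[str, float]) -> float:
    """Overall speedup when each stage is accelerated by its own factor."""
    cpu_total = sum(cpu_seconds.values())
    npu_total = sum(t / speedup[stage] for stage, t in cpu_seconds.items())
    return cpu_total / npu_total


# Embedding and prefill factors are from the paper; decode is an assumption.
STAGE_SPEEDUP = {"embed": 9.1, "prefill": 18.1, "decode": 1.5}


def cpu_stage_seconds(output_tokens: int) -> dict[str, float]:
    """Hypothetical CPU baseline timings; decode grows with output length."""
    return {
        "embed": 0.05,                     # assumption: one query embedding
        "prefill": 2.0,                    # assumption: prompt plus retrieved context
        "decode": 0.08 * output_tokens,    # assumption: 80 ms per token on CPU
    }


def sweep(token_counts: list[int]) -> list[tuple[int, float]]:
    """End-to-end speedup for each output length."""
    return [(n, end_to_end_speedup(cpu_stage_seconds(n), STAGE_SPEEDUP))
            for n in token_counts]


if __name__ == "__main__":
    for n, s in sweep([0, 16, 64, 256, 1024]):
        print(f"{n:5d} output tokens -> {s:5.2f}x end-to-end")
```

With any decode speedup lower than the prefill speedup, the end-to-end figure drifts toward the decode factor as answers get longer, so a headline latency number is only meaningful alongside the output length it was measured at.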

Can a phone NPU handle RAG, or is this limited to laptops?

Phone NPUs are already in the ballpark. Google’s LiteRT benchmarks on the Snapdragon 8 Elite Gen 5 show FastVLM-0.5B achieving 0.12s time-to-first-token, over 11,000 tokens per second prefill, and over 100 tokens per second decode on the phone’s NPU. The laptop chip has more thermal headroom and a larger unified memory pool, but the phone numbers suggest compute is not the bottleneck. The constraint shifts to the phone’s tighter RAM budget and battery drain under sustained retrieval workloads.
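As a rough illustration of what those throughput figures imply for a RAG-shaped request, the sketch below converts prefill and decode rates into a response-time estimate. The default rates are the quoted FastVLM-0.5B figures treated as round numbers. A larger generator model would be slower, and the prompt and answer lengths in the example are assumptions.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float = 11_000.0,
                     decode_tps: float = 100.0) -> float:
    """Estimated time to process the prompt and generate the full answer."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def decode_share(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float = 11_000.0,
                 decode_tps: float = 100.0) -> float:
    """Fraction of the estimated response time spent generating tokens."""
    total = response_seconds(prompt_tokens, output_tokens,
                             prefill_tps, decode_tps)
    return (output_tokens / decode_tps) / total if total > 0 else 0.0


if __name__ == "__main__":
    # Assumed request shape: 2,000 prompt tokens (query plus retrieved
    # passages) and a 200-token answer.
    total = response_seconds(2_000, 200)
    share = decode_share(2_000, 200)
    print(f"estimated response time: {total:.2f} s, decode share: {share:.0%}")
```

At these rates a 2,000-token prompt costs about 0.18 s of prefill while a 200-token answer costs 2 s of decode, so retrieved-context length is cheap relative to answer length on the compute side. The memory cost of holding that context is a separate matter.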