groundy
infrastructure & runtime

The RTX Spark Bet on Unified Memory for Local LLMs: Where Bandwidth Caps It

LLM decode is memory-bandwidth-bound, not capacity-bound. A 70B model on 128-bit LPDDR5X hits roughly 1 tok/s. When evaluating inference hardware, count GB/s, not GB.

8 min · · · 5 sources ↓

When Nvidia ships a Grace-Blackwell-class developer inference box with LPDDR5X unified memory, the pitch will sound straightforward: enough memory to fit large language models locally, no cloud subscription required. But capacity is not the binding constraint for token-generation throughput. Autoregressive decode is memory-bandwidth-bound, and the LPDDR5X memory that unified-memory boxes rely on delivers a fraction of the bandwidth of discrete GPU memory. The question is not whether the model fits. It is how fast it runs once it does.

The Pitch: Why Memory Size Gets the Headlines

Unified memory is the headline feature. Nvidia has delivered its first Vera CPUs to partners including Anthropic, OpenAI, SpaceXAI, and OCI as of June 2026, initially for GB200 NVL72 data center systems. A developer-class inference box using Grace-Blackwell-class silicon with LPDDR5X unified memory has not been formally announced, but the architectural direction is visible. The appeal for local inference is straightforward: one pool of memory shared between CPU and GPU means no model sharding, no multi-GPU coordination, and no PCIe bottleneck between host and device memory. One machine, one model, no partitioning.

Marketing for this class of hardware emphasizes total memory capacity. The number that sticks is the gigabyte count, because that is the number that enables the “runs a 70B model locally” claim. Nvidia controls more than 80% of the GPU market for AI training and deployment as of 2025 and became the first company to surpass US$5 trillion in market capitalization in 2025. When a vendor with that position ships a developer inference box, the capacity figure dominates the conversation.

Capacity enables fitting the model. It does not enable fast inference.

The Physics: Autoregressive Decode Is Bandwidth-Bound

LLM inference has two phases. Prefill processes the prompt in parallel and is compute-bound: the GPU’s arithmetic units stay busy. Decode generates tokens one at a time. At each decode step, the model reads its full weight matrix from memory to produce a single token. The arithmetic intensity is near zero because there is no batch to amortize the weight read over.

This makes decode throughput a function of memory bandwidth, not compute. For a 70B-parameter model stored in FP16, the weights occupy roughly 140 GB. Generating one token requires reading all 140 GB. If the memory subsystem delivers 170 GB/s, the theoretical ceiling is approximately 1.2 tokens per second before accounting for KV-cache overhead, attention computation, or any system-level tax.

That is not a marketing claim. It is a physical limit derived from the ratio of model size to available bandwidth.

LPDDR5X by the Numbers: Pin Speed Is Not System Bandwidth

Samsung’s premium LPDDR5X reaches data processing speeds of up to 10.7 Gbps as of June 2026, a 1.25x increase over the previous generation, with nearly 25% lower power consumption through FDVFS and low-frequency mode extensions. Micron’s 1γ LPDDR5X matches the 10.7 Gbps peak, marketed as the highest speed grade for mobile DRAM, with up to 20% power savings over the prior 1β generation in a 0.61mm package. Both vendors target edge AI and data center inference as primary use cases.

The JEDEC LPDDR5X standard peaks at 8,533 Mbps with 32n prefetch at 1.05V, compared to LPDDR5 at 6,400 Mbps with 16n prefetch at 1.1V. The per-pin speed is real. But per-pin speed is a chip specification, not a system specification. System bandwidth is per-pin speed multiplied by bus width, divided by eight to convert bits to bytes. Without the bus width, the 10.7 Gbps figure tells you nothing about tokens per second.

Samsung’s LPDDR5X is available at up to 32GB per chip, and both Samsung and Micron market the memory class for environments where power efficiency is the priority: flagship smartphones, thin-and-light laptops, automotive systems. The design priority is baked into the silicon. LPDDR5X optimizes for watts per gigabyte, not for gigabytes per second.

The Bandwidth Gap: LPDDR5X vs Discrete GPU Memory

The memory classes that power discrete GPUs operate on a different order of magnitude. GDDR6X, GDDR7, and HBM stacks run at higher per-pin speeds on wider buses (256-bit to 384-bit per consumer GPU, thousands of bits per HBM package). The resulting system-level bandwidth is an order of magnitude higher than LPDDR5X on a narrow bus.

Even assuming a generous 256-bit bus at 10.7 Gbps, the result is approximately 340 GB/s, derived from Samsung’s published pin speed. Discrete GPUs with GDDR6X or GDDR7 on wide buses deliver several times that bandwidth, and data center accelerators with HBM stacks reach higher still. The gap is roughly 3x to 10x depending on configuration, and it maps directly to the decode speed ceiling.

This is not a flaw in the hardware. It is the engineering tradeoff that unified memory demands. LPDDR5X is soldered, low-power, and capacity-dense. Discrete GPU memory is high-throughput, power-hungry, and capacity-limited per chip. The two memory classes optimize for different constraints, and the bandwidth gap is the consequence of choosing the low-power, high-capacity path.

Apple Silicon machines face the same constraint. The M-series unified memory architecture uses LPDDR5X and delivers capacity at low power, but bandwidth sits in a similar range. A developer choosing between a Grace-Blackwell-class box and an Apple Silicon machine for local inference is comparing two bandwidth-constrained systems. The differentiator is which one delivers more bandwidth per dollar for the target model size.

Second-Order Economics: Fitting the Model vs Running It

The economic distortion works like this. A developer buys a unified-memory inference box because the spec sheet says it has enough memory for a 70B or 109B model. The model loads. It runs. At 1 to 2 tokens per second for the larger models, the interactive experience is slow enough that it is usable only for batch generation, not for conversation or agentic loops that require back-and-forth reasoning.

The developer then faces a choice: run a smaller model that fits within the bandwidth budget for acceptable speed, or accept that the hardware’s advertised capability (running large models) and its practical performance (running them at usable speed) are different things. The capacity was the purchase justification. The bandwidth is the actual constraint.

This does not make the hardware a bad purchase. For workloads where power consumption matters (always-on edge inference, thermally constrained deployments, battery-powered systems), LPDDR5X’s 20% power savings at the 1γ node over the prior generation and nearly 25% lower consumption via FDVFS are genuine advantages. For a developer workstation where wall power is available and the goal is interactive-speed inference, the tradeoff tilts toward discrete GPUs with higher bandwidth, even if they require model sharding.

Buyer’s Framework: Evaluating Beyond the GB Count

For engineers evaluating local-inference hardware, the decision framework should run on bandwidth, not capacity.

Bandwidth first, capacity second. Calculate the minimum bandwidth for your target model size and token speed. For FP16: required bandwidth ≈ 2 × parameters_in_billions × target_tokens_per_second. A 70B model at 10 tok/s needs approximately 1,400 GB/s. If the hardware cannot deliver that, additional memory capacity will not close the gap.

Bus width is not optional. Per-pin speed (10.7 Gbps for LPDDR5X per Samsung’s spec) is a chip spec, not a system spec. Without the bus width, you cannot calculate system bandwidth. If a vendor does not publish it, assume the worst case for your calculation.

Quantization helps but does not eliminate the constraint. A 70B model in 4-bit quantization reads roughly 35 GB per token instead of 140 GB, improving throughput proportionally. The ceiling is still bandwidth. Run the same calculation for the quantized weight size.

Multi-GPU trades simplicity for bandwidth. Two discrete GPUs deliver their rated bandwidth independently, but sharding introduces NVLink or PCIe communication overhead. The math is less clean than unified memory, but the raw bandwidth budget is significantly higher. For decode-heavy workloads, the multi-GPU approach often wins despite the complexity.

Grace-Blackwell-class developer inference boxes with LPDDR5X unified memory will be a specific hardware tradeoff: maximum memory capacity at minimum power, with bandwidth that reflects those priorities. For workloads that need a large model in memory and can tolerate slow decode, the tradeoff works. For workloads that need interactive-speed inference from large models, the bandwidth ceiling is the constraint, and the gigabyte count on the spec sheet will not change it.

Frequently Asked Questions

What did the LPDDR5-to-LPDDR5X generational jump actually buy in decode throughput?

The JEDEC standard moved from 6,400 Mbps with a 16n prefetch at 1.1V (LPDDR5) to 8,533 Mbps with a 32n prefetch at 1.05V (LPDDR5X). Samsung and Micron then pushed beyond spec to 10.7 Gbps. On a 128-bit bus, the LPDDR5 baseline would deliver roughly 102 GB/s, placing a 70B FP16 model’s decode ceiling below 0.8 tok/s. LPDDR5X at 10.7 Gbps moves that ceiling to roughly 1.2 tok/s. The generational gain is real but keeps the hardware an order of magnitude below discrete GPU memory bandwidth.

When does LPDDR5X unified memory actually win over discrete GPUs?

Samsung’s LPDDR5X carries AEC-Q100 certification for automotive stress environments (temperature cycling, vibration), and Micron packages its 1γ die at 0.61mm thickness. These are not workstation parts. For vehicle inference, battery-powered edge nodes, or thermally constrained enclosures, the bandwidth concession is deliberate: discrete GPUs drawing 300W+ to hit their throughput numbers are impractical. LPDDR5X unified memory wins where power budget and physical form factor outweigh raw tok/s.

Does 4-bit quantization get a 70B model to interactive speed on LPDDR5X?

At 4-bit precision, a 70B model reads roughly 35 GB per token. On a 128-bit LPDDR5X bus at 170 GB/s, the ceiling becomes approximately 4.9 tok/s. A 256-bit bus at 340 GB/s pushes that to about 9.7 tok/s. Both figures approach but do not reliably clear the 10 tok/s threshold most practitioners consider the floor for fluid back-and-forth interaction. Quantization narrows the bandwidth gap by 4x but leaves the developer trading model quality for speed, a compromise discrete GPUs sidestep for the same model.

What hardware change would actually close the bandwidth gap?

Bus width is the only lever that scales system bandwidth linearly at a given per-pin speed. A 512-bit LPDDR5X interface at 10.7 Gbps would deliver roughly 680 GB/s, still well below a single RTX-class discrete GPU with GDDR7 (roughly 1,800 GB/s) but enough to push a 4-bit 70B model past 19 tok/s. The cost is silicon area, package pin count, and power draw. A future LPDDR6 standard may increase per-pin speed, but the architectural tradeoff (LPDDR optimizes for watts per gigabyte, not gigabytes per second) is structural and unlikely to reverse in one generation.

sources · 5 cited

  1. Nvidia Homepage — Vera CPU Delivery to Partners vendor accessed 2026-06-06
  2. Nvidia — Wikipedia analysis accessed 2026-06-06
  3. LPDDR5X — Samsung Semiconductor vendor accessed 2026-06-06
  4. LPDDR5X — Micron vendor accessed 2026-06-06
  5. LPDDR5 and LPDDR5X Explained — ComputerCity community accessed 2026-06-06