The RTX Spark Bet on Unified Memory for Local LLMs: Where Bandwidth Caps It

Nvidia shipped the box. The DGX Spark, built on the GB10 Grace Blackwell Superchip with 128 GB of LPDDR5X unified memory, started reaching buyers in late 2025, and the pitch is exactly as advertised: enough memory to fit large language models locally, no cloud subscription required. But capacity is not the binding constraint for token-generation throughput. Autoregressive decode is memory-bandwidth-bound, and the LPDDR5X memory these unified-memory boxes rely on delivers a fraction of the bandwidth of discrete GPU memory. The question is not whether the model fits. It is how fast it runs once it does. The shipped hardware settled that question: a 70B model decodes at roughly 2.7 tokens per second [Updated June 2026].

The Pitch: Why Memory Size Gets the Headlines

Unified memory is the headline feature. The GB10 pairs a Blackwell GPU with a 20-core Arm Grace CPU on one package, sharing a single 128 GB LPDDR5X pool [Updated June 2026]. The appeal for local inference is straightforward: one pool of memory shared between CPU and GPU means no model sharding, no multi-GPU coordination, and no PCIe bottleneck between host and device memory. One machine, one model, no partitioning. Nvidia rates the part at up to 1 PFLOP of FP4 tensor throughput and inference on models up to 200 billion parameters, and the marketing leads with both numbers.

Marketing for this class of hardware emphasizes total memory capacity. The number that sticks is the gigabyte count, because that is the number that enables the “runs a 200B model locally” claim. Nvidia controls more than 80% of the GPU market for AI training and deployment as of 2025 and became the first company to surpass US$5 trillion in market capitalization in 2025. A vendor in that position shipping a desktop inference box guarantees the capacity figure dominates the conversation. The 200B claim is true. It is also the wrong number to evaluate the box on.

Capacity enables fitting the model. It does not enable fast inference.

The Physics: Autoregressive Decode Is Bandwidth-Bound

LLM inference has two phases. Prefill processes the prompt in parallel and is compute-bound: the GPU’s arithmetic units stay busy. Decode generates tokens one at a time. At each decode step, the model reads its full weight matrix from memory to produce a single token. The arithmetic intensity is near zero because there is no batch to amortize the weight read over.

This makes decode throughput a function of memory bandwidth, not compute. For a 70B-parameter model stored in FP16, the weights occupy roughly 140 GB. Generating one token requires reading all 140 GB. The DGX Spark’s memory subsystem delivers 273 GB/s, which puts the theoretical ceiling at approximately 1.95 tokens per second before accounting for KV-cache overhead, attention computation, or any system-level tax [Updated June 2026]. Measured single-stream decode on Llama 3.1 70B lands at about 2.7 tok/s, which is what you get when the model runs at reduced precision and the bandwidth tax is the floor, not the only term.

That is not a marketing claim. It is a physical limit derived from the ratio of model size to available bandwidth, and the shipped numbers sit right on it.

LPDDR5X by the Numbers: Pin Speed Is Not System Bandwidth

Samsung’s premium LPDDR5X reaches data processing speeds of up to 10.7 Gbps as of June 2026, a 1.25x increase over the previous generation, with nearly 25% lower power consumption through FDVFS and low-frequency mode extensions. Micron’s 1γ LPDDR5X matches the 10.7 Gbps peak, marketed as the highest speed grade for mobile DRAM, with up to 20% power savings over the prior 1β generation in a 0.61mm package. Both vendors target edge AI and data center inference as primary use cases.

The JEDEC LPDDR5X standard peaks at 8,533 Mbps with 32n prefetch at 1.05V, compared to LPDDR5 at 6,400 Mbps with 16n prefetch at 1.1V. The per-pin speed is real. But per-pin speed is a chip specification, not a system specification. System bandwidth is per-pin speed multiplied by bus width, divided by eight to convert bits to bytes. Without the bus width, the 10.7 Gbps figure tells you nothing about tokens per second.

Samsung’s LPDDR5X is available at up to 32GB per chip, and both Samsung and Micron market the memory class for environments where power efficiency is the priority: flagship smartphones, thin-and-light laptops, automotive systems. The design priority is baked into the silicon. LPDDR5X optimizes for watts per gigabyte, not for gigabytes per second.

The Bandwidth Gap: LPDDR5X vs Discrete GPU Memory

The memory classes that power discrete GPUs operate on a different order of magnitude. GDDR6X, GDDR7, and HBM stacks run at higher per-pin speeds on wider buses (256-bit to 384-bit per consumer GPU, thousands of bits per HBM package). The resulting system-level bandwidth is an order of magnitude higher than LPDDR5X on a narrow bus.

The DGX Spark’s confirmed 273 GB/s sits at the low end of that comparison. Discrete GPUs with GDDR6X or GDDR7 on wide buses deliver several times that bandwidth, and data center accelerators with HBM stacks reach higher still. The gap is roughly 3x to 10x depending on configuration, and it maps directly to the decode speed ceiling. One review measured a workstation RTX 6000 clearing over 240 tok/s on a 120B model where the Spark managed roughly 12, a ~20x gap that tracks the bandwidth difference, not the FP4 compute spec the Spark wins on [Updated June 2026].

This is not a flaw in the hardware. It is the engineering tradeoff that unified memory demands. LPDDR5X is soldered, low-power, and capacity-dense. Discrete GPU memory is high-throughput, power-hungry, and capacity-limited per chip. The two memory classes optimize for different constraints, and the bandwidth gap is the consequence of choosing the low-power, high-capacity path.

Apple Silicon machines face the same constraint. The M-series unified memory architecture uses LPDDR5X and delivers capacity at low power, with the higher-end Max and Ultra parts reaching into the hundreds of GB/s, well above the Spark’s 273. A developer choosing between the DGX Spark and an Apple Silicon machine for local inference is comparing two bandwidth-constrained systems, and the runtime matters as much as the silicon (see MLX vs llama.cpp on Apple Silicon). The differentiator is which one delivers more bandwidth per dollar for the target model size.

Second-Order Economics: Fitting the Model vs Running It

The economic distortion works like this. A developer buys a unified-memory inference box because the spec sheet says it has enough memory for a 70B or 120B model. The model loads. It runs. At roughly 2.7 tok/s for a 70B and around 12 tok/s for a 120B MoE at batch size one [Updated June 2026], the single-user interactive experience is slow enough that it suits batch generation and overnight jobs more than conversation or agentic loops that require back-and-forth reasoning.

The developer then faces a choice: run a smaller model that fits within the bandwidth budget for acceptable speed, or accept that the hardware’s advertised capability (running large models) and its practical performance (running them at usable speed) are different things. The capacity was the purchase justification. The bandwidth is the actual constraint.

This does not make the hardware a bad purchase. For workloads where power consumption matters (always-on edge inference, thermally constrained deployments, battery-powered systems), LPDDR5X’s 20% power savings at the 1γ node over the prior generation and nearly 25% lower consumption via FDVFS are genuine advantages. For a developer workstation where wall power is available and the goal is interactive-speed inference, the tradeoff tilts toward discrete GPUs with higher bandwidth, even if they require model sharding.

What Shipping Confirmed: Price, Throughput, and the Concurrency Caveat

The DGX Spark started as Project DIGITS at CES in January 2025 and reached buyers as a Founders Edition through 2025. Nvidia launched it at $3,999. In February 2026 it raised the price to $4,699, a $700 jump it blamed on rising DRAM and storage costs. That is the same memory-supply squeeze pushing up prices on consumer machines (it forced Apple to raise Mac and iPad prices over the same window), and it lands hardest on a box whose whole value proposition is 128 GB of soldered LPDDR5X. Partner variants from Acer, ASUS, Dell, and MSI ship the same GB10 at adjacent prices, usually with 1 TB of storage instead of the Founders Edition’s 4 TB [Updated June 2026].

The single-stream decode figures above describe one user typing at one model. They are not the throughput of a served endpoint, and conflating the two is the most common way the bandwidth argument gets misread. Decode is bandwidth-bound only because batch size one cannot amortize the weight read. Raise the batch size and the same weight matrix fan-outs across many concurrent sequences, so each token-step’s memory read serves many requests at once. One benchmark drove a Spark to roughly 695 tokens per second of aggregate output across 256 concurrent streams, two orders of magnitude above its single-stream rate. The per-stream latency stays slow, but for a backend serving many users, the box trades interactive speed for aggregate efficiency, and the bandwidth ceiling stops being the only number that matters. The physics is unchanged; the workload moved into the regime where arithmetic intensity is no longer near zero.

The 200B inference claim leans on a second lever: the 200 Gbps ConnectX-7 NIC. Two Sparks linked over that fabric pool their memory across the cable, which is how Nvidia gets to larger models than 128 GB can hold on its own. The bandwidth math does not improve, it just spreads a bigger model across two bandwidth-limited boxes, so decode on a 405B model split this way is slow for the same reason a 70B is slow on one. The link is a capacity story, not a throughput one, and it is worth keeping those two straight when reading the spec sheet.

The Cheaper Bandwidth Twin: AMD’s Strix Halo

Nvidia did not ship the only desktop unified-memory box. AMD’s Ryzen AI Max+ 395, codenamed Strix Halo, pairs 16 Zen 5 cores and a 40-CU RDNA 3.5 integrated GPU with up to 128 GB of LPDDR5X-8000 on a 256-bit bus, good for about 256 GB/s. That is within a rounding error of the Spark’s 273 GB/s, which means the two sit in the same decode regime: a 70B FP16 model is bandwidth-capped at roughly the same low single-digit tokens per second on either. Where they diverge is price. Strix Halo mini PCs ship in the $1,499 to $1,999 range, and the part hits around 100 tok/s on a 30B model under LM Studio, the size where its bandwidth budget actually delivers an interactive experience. A reviewer running both found the Spark faster overall on the CUDA-native paths, which is the part of the stack AMD does not match, but that gap is software and FP4 compute, not memory bandwidth.

The comparison sharpens the original point. Two boxes with nearly identical memory bandwidth produce nearly identical large-model decode speeds, and the one charging triple the price wins on tooling and prefill compute, not on the constraint that governs how fast a big model talks back. If the target is a 30B-class model at conversational speed, the cheaper part is sufficient. If the target is a 200B model, both are batch-generation machines, and the choice is about software ecosystem and budget rather than tokens per second. The same calculation that governs a DeepSeek R1 deployment on local hardware applies here: pick the quantization and model size that fit the bandwidth, then choose the box.

Buyer’s Framework: Evaluating Beyond the GB Count

For engineers evaluating local-inference hardware, the decision framework should run on bandwidth, not capacity.

Bandwidth first, capacity second. Calculate the minimum bandwidth for your target model size and token speed. For FP16: required bandwidth ≈ 2 × parameters_in_billions × target_tokens_per_second. A 70B model at 10 tok/s needs approximately 1,400 GB/s. If the hardware cannot deliver that, additional memory capacity will not close the gap.

Bus width is not optional. Per-pin speed (10.7 Gbps for LPDDR5X per Samsung’s spec) is a chip spec, not a system spec. Without the bus width, you cannot calculate system bandwidth. If a vendor does not publish it, assume the worst case for your calculation.

Quantization helps but does not eliminate the constraint. A 70B model in 4-bit quantization reads roughly 35 GB per token instead of 140 GB, improving throughput proportionally. The ceiling is still bandwidth. Run the same calculation for the quantized weight size.

Multi-GPU trades simplicity for bandwidth. Two discrete GPUs deliver their rated bandwidth independently, but sharding introduces NVLink or PCIe communication overhead. The math is less clean than unified memory, but the raw bandwidth budget is significantly higher. For decode-heavy workloads, the multi-GPU approach often wins despite the complexity.

The DGX Spark turned out to be exactly this tradeoff: maximum memory capacity at minimum power, with bandwidth that reflects those priorities. For workloads that need a large model in memory and can tolerate slow decode, or that batch many requests through one box, the tradeoff works. For a single developer who needs interactive-speed inference from a large model, the 273 GB/s ceiling is the constraint, and the gigabyte count on the spec sheet does not change it.

Frequently Asked Questions

What did the LPDDR5-to-LPDDR5X generational jump actually buy in decode throughput?

The JEDEC standard moved from 6,400 Mbps with a 16n prefetch at 1.1V (LPDDR5) to 8,533 Mbps with a 32n prefetch at 1.05V (LPDDR5X). Samsung and Micron then pushed beyond spec to 10.7 Gbps. The DGX Spark runs a 256-bit interface at an effective rate that nets 273 GB/s, which places a 70B FP16 model’s decode ceiling at roughly 1.95 tok/s [Updated June 2026]. A narrower 128-bit LPDDR5 baseline would deliver closer to 100 GB/s and a sub-0.8 tok/s ceiling, so the generational and bus-width gains are real, but they keep the hardware an order of magnitude below discrete GPU memory bandwidth.

When does LPDDR5X unified memory actually win over discrete GPUs?

Samsung’s LPDDR5X carries AEC-Q100 certification for automotive stress environments (temperature cycling, vibration), and Micron packages its 1γ die at 0.61mm thickness. These are not workstation parts. For vehicle inference, battery-powered edge nodes, or thermally constrained enclosures, the bandwidth concession is deliberate: discrete GPUs drawing 300W+ to hit their throughput numbers are impractical. LPDDR5X unified memory wins where power budget and physical form factor outweigh raw tok/s.

Does 4-bit quantization get a 70B model to interactive speed on LPDDR5X?

At 4-bit precision, a 70B model reads roughly 35 GB per token. On the DGX Spark’s 273 GB/s, the theoretical ceiling becomes approximately 7.8 tok/s, and real runs land lower once KV-cache reads and the attention pass are counted [Updated June 2026]. That approaches but does not reliably clear the 10 tok/s threshold most practitioners consider the floor for fluid back-and-forth interaction. Quantization narrows the bandwidth gap by 4x but leaves the developer trading model quality for speed, a compromise discrete GPUs sidestep for the same model.

What hardware change would actually close the bandwidth gap?

Bus width is the only lever that scales system bandwidth linearly at a given per-pin speed. A 512-bit LPDDR5X interface at 10.7 Gbps would deliver roughly 680 GB/s, still well below a single RTX-class discrete GPU with GDDR7 (roughly 1,800 GB/s) but enough to push a 4-bit 70B model past 19 tok/s. The cost is silicon area, package pin count, and power draw. A future LPDDR6 standard may increase per-pin speed, but the architectural tradeoff (LPDDR optimizes for watts per gigabyte, not gigabytes per second) is structural and unlikely to reverse in one generation.