NVIDIA Open-Sources SANA-WM: 60s 720p Video From One RTX 5090 With Hybrid Linear Attention

NVIDIA Labs released SANA-WM, a 2.6B-parameter world model¹ that turns a single image and a 6-DoF camera trajectory into 60 seconds of 720p video. The headline is not the resolution but the throughput: a distilled, quantized variant can denoise a full minute-long clip in 34 seconds on one RTX 5090², and the paper claims 36× the throughput of prior open baselines on a fraction of the hardware¹.

What SANA-WM Actually Does (and What It Doesn’t)

SANA-WM is not a text-to-video system. It is a world model: you feed it a starting image and a metric-scale 6-DoF camera trajectory, and it generates 961 latent frames representing 60 seconds of 720p video¹. The model was trained on 212,975 public video clips¹ with pose supervision over roughly 15 to 18.5 days on 64 H100 GPUs.

What it does not do is maintain an explicit 3D scene representation. The authors note that without structured scene memory, the model can drift in dynamic environments or when asked to hold a rare viewpoint for an extended duration. This is a known failure mode, not a footnote; it means SANA-WM excels at camera motion through mostly static scenes and struggles when the world itself moves unpredictably.

The Hybrid Linear Attention Trick: Why O(T²) Was the Enemy

The architectural bet is hybrid linear attention. Of the 20 transformer layers, 15 are Gated DeltaNet (GDN) blocks and five are periodic softmax attention blocks. GDN keeps its recurrent state at a fixed D×D matrix regardless of video length, which avoids the quadratic memory growth that normally chokes long-sequence transformers. The remaining softmax layers inject periodic global context so the model does not lose track of large-scale structure.
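To make the fixed-size state concrete, here is a minimal single-head sketch of the gated delta rule as it is usually written for Gated DeltaNet: S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ, with output o_t = S_t q_t. It is a plain NumPy loop for illustration only. The function name, the scalar per-step gates, and the toy dimensions are assumptions, and nothing here is taken from the SANA-WM code.

```python
import numpy as np

def gated_delta_scan(q, k, v, alpha, beta):
    """Run a single-head gated delta rule over T steps (illustrative only).

    q, k:  (T, D_k) queries and keys; keys are assumed L2-normalized
    v:     (T, D_v) values
    alpha: (T,) decay gates in [0, 1]
    beta:  (T,) write-strength gates in [0, 1]
    Returns the outputs (T, D_v) and the final state (D_v, D_k).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    state = np.zeros((d_v, d_k))      # size never depends on T
    outputs = np.empty((T, d_v))
    for t in range(T):
        k_t, v_t = k[t], v[t]
        # Decay old memory and erase whatever is currently stored under k_t.
        state = alpha[t] * (state - beta[t] * np.outer(state @ k_t, k_t))
        # Write the new value under that key.
        state = state + beta[t] * np.outer(v_t, k_t)
        # Read out with the query.
        outputs[t] = state @ q[t]
    return outputs, state

# Toy check: the state shape is the same for a short and a long sequence.
rng = np.random.default_rng(0)
D = 8  # hypothetical head dimension
for T in (16, 961):
    k = rng.normal(size=(T, D))
    k /= np.linalg.norm(k, axis=1, keepdims=True)
    q = rng.normal(size=(T, D))
    v = rng.normal(size=(T, D))
    _, final_state = gated_delta_scan(q, k, v, np.full(T, 0.95), np.full(T, 0.5))
    print(T, final_state.shape)
```

Production kernels compute this recurrence in parallel chunks rather than step by step, but the memory argument is the same: the state is one matrix per head however many frames have passed.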

This design choice is what collapses the hardware floor. Prior open world models needed multi-GPU clusters just to hold the attention maps for minute-scale video. By keeping memory usage flat in time, SANA-WM makes 60-second generation addressable on a single high-end GPU¹. The tradeoff is a more complex training setup and the risk of recurrent state degradation over long horizons, though the paper reports stable results up to the 60-second target¹.
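A rough way to see the scaling argument is to count elements per attention head. The sketch below compares a fully materialized score matrix (quadratic in token count), a key/value cache (linear), and a D×D recurrent state (constant). The tokens-per-frame and head-dimension values are hypothetical placeholders, not figures from the paper, and fused attention kernels avoid storing the full score matrix even though the quadratic compute remains.

```python
def score_matrix_elems(num_tokens: int) -> int:
    """Entries in one head's full T x T attention score matrix."""
    return num_tokens * num_tokens

def kv_cache_elems(num_tokens: int, head_dim: int) -> int:
    """Keys plus values cached for one head across T tokens."""
    return 2 * num_tokens * head_dim

def recurrent_state_elems(head_dim: int) -> int:
    """One head's D x D linear-attention state, independent of T."""
    return head_dim * head_dim

# Hypothetical values chosen only to show the growth rates.
TOKENS_PER_LATENT_FRAME = 256
HEAD_DIM = 128

for latent_frames in (61, 241, 961):
    t = latent_frames * TOKENS_PER_LATENT_FRAME
    print(
        f"{latent_frames:4d} latent frames | "
        f"scores {score_matrix_elems(t):.3e} | "
        f"kv cache {kv_cache_elems(t, HEAD_DIM):.3e} | "
        f"state {recurrent_state_elems(HEAD_DIM):.3e}"
    )
```

Whatever the real token counts are, the first column grows with the square of clip length, the second grows linearly, and the third does not grow at all.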

Three Ways to Run It: From H100 Cluster to RTX 5090

The paper describes three inference configurations. Bidirectional generation requires approximately 49.2 GB of VRAM¹. Chunk-causal autoregressive generation needs slightly more, around 51.1 GB¹. Both of these are technically single-GPU modes, but only if that GPU is an H100 or similarly capacious datacenter card.

The configuration getting attention is the distilled autoregressive variant with NVFP4 quantization. NVIDIA reports this setup denoises a 60-second 720p clip in 34 seconds on one RTX 5090 and lists its throughput as 24.1 videos per hour². The two numbers measure different things: 34 seconds per clip would work out to roughly 106 clips per hour, so the throughput figure presumably covers more than the denoising pass. For context, the full two-stage pipeline that feeds base outputs through a 17B-parameter LTX-2 refiner with rank-384 LoRA hits 22.0 videos per hour², but only when spread across eight H100s. The refiner improves temporal consistency, raising IQ scores by 1.17 on simple trajectories and 0.31 on hard ones².

Benchmarks: 36× Throughput vs. What, Exactly?

NVIDIA’s throughput claim is 36× over LingBot-World, a 14B+14B parameter model running at 480p on eight GPUs at 0.6 videos per hour². SANA-WM also outruns HY-WorldPlay (8B, 480p, eight GPUs, 1.1 videos per hour)² and Matrix-Game 3.0 (5B, 720p, eight GPUs, 3.1 videos per hour)² while matching or exceeding their VBench scores².
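The ratios behind these comparisons are easy to recompute from the figures quoted in this article. The snippet below is only arithmetic on those reported numbers. The dictionary labels are shorthand introduced here, and which SANA-WM configuration the 36× claim is measured from is not something the article spells out.

```python
# Videos per hour, as quoted in this article.
REPORTED_VIDEOS_PER_HOUR = {
    "SANA-WM distilled NVFP4 (1x RTX 5090)": 24.1,
    "SANA-WM + LTX-2 refiner (8x H100)": 22.0,
    "LingBot-World (8 GPUs)": 0.6,
    "HY-WorldPlay (8 GPUs)": 1.1,
    "Matrix-Game 3.0 (8 GPUs)": 3.1,
}

def speedup(system: str, baseline: str) -> float:
    """Throughput of `system` divided by throughput of `baseline`."""
    rates = REPORTED_VIDEOS_PER_HOUR
    return rates[system] / rates[baseline]

baselines = ["LingBot-World (8 GPUs)", "HY-WorldPlay (8 GPUs)", "Matrix-Game 3.0 (8 GPUs)"]
for system in list(REPORTED_VIDEOS_PER_HOUR)[:2]:
    for baseline in baselines:
        print(f"{system} vs {baseline}: {speedup(system, baseline):.1f}x")
```

Against LingBot-World, the single-GPU figure works out to about 40.2× and the eight-H100 refiner pipeline to about 36.7×, so the quoted 36× sits closest to the latter ratio.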

The VBench Overall scores are 80.62 for simple trajectories and 81.89 for hard ones². Camera rotation error sits at 4.50° and 8.34° respectively, with translation error holding steady at 1.39 across both sets². CamMC scores are 1.41 and 1.44².

The License Ambiguity

There is a material discrepancy between the code and the paper. The NVlabs/Sana GitHub repository lists an Apache 2.0 license³. The arXiv paper, however, states CC BY-NC-SA 4.0, which carries a non-commercial clause¹. Until NVIDIA clarifies which terms govern the model weights, anyone building a product on SANA-WM is operating under legal uncertainty. The code may be free to fork; the weights may not be free to ship.

What This Means for Open Video Research

For the past two years, the open video community has been gated by cluster economics. Generating a minute of coherent video required racks of H100s and patience measured in hours. SANA-WM does not solve every problem in open video generation, but it does remove the hardware excuse. A 2.6B model on one GPU outruns 14B competitors on eight².

The question that matters now is evaluation honesty. Closed models like Veo and Sora set the perceptual standard, yet they are black boxes with no reproducible benchmarks. Open models now have the speed to iterate, but they still lack agreed-upon metrics for motion quality, physical plausibility, and long-term consistency. The bottleneck has shifted from who can afford the GPUs to who can design evaluations that resist benchmark gaming. That is a harder problem than engineering a faster transformer, and it is the one that will determine whether open world models become viable alternatives or merely efficient curiosities.

Frequently Asked Questions

Can SANA-WM generate video from a text prompt?

No. SANA-WM requires a starting image and a metric-scale 6-DoF camera trajectory. Internally it uses a dual-branch camera control system: UCPE operates at latent-frame rate for coarse trajectory following, while Plücker Raymap mixing handles fine motion at raw-frame rate to correct compression mismatch. Text-to-video requires a different system or a separate model upstream.
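For readers unfamiliar with Plücker ray maps, the sketch below shows the generic construction: each pixel's ray is encoded as its unit direction d and its moment o × d, where o is the camera center. It assumes a pinhole intrinsics matrix K and a 4×4 camera-to-world pose, both hypothetical inputs, and it illustrates the standard encoding rather than SANA-WM's actual conditioning code or its mixing scheme.

```python
import numpy as np

def plucker_raymap(K, c2w, height, width):
    """Return an (H, W, 6) map of per-pixel Plücker coordinates (d, o x d).

    K:   (3, 3) pinhole intrinsics
    c2w: (4, 4) camera-to-world pose
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Back-project to camera-space directions, then rotate into world space.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs = dirs_cam @ c2w[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Every ray starts at the camera center.
    origin = np.broadcast_to(c2w[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)
    return np.concatenate([dirs, moment], axis=-1)

# Example with made-up intrinsics and a camera translated from the origin.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 180.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
raymap = plucker_raymap(K, pose, height=360, width=640)
print(raymap.shape)
```

Because the moment depends on the camera center, the encoding changes with translation as well as rotation, which is why it suits metric-scale 6-DoF trajectories.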

Can I run the full-precision model on an RTX 5090?

Not the bidirectional or chunk-causal modes. Those need ~49–51 GB of VRAM and the RTX 5090 has 32 GB. The only consumer-fit option is the distilled autoregressive variant with NVFP4 4-bit quantization, which trades an unmeasured fidelity delta for the headline 34-second generation time. Full-precision inference remains gated behind H100 or A100 80 GB datacenter cards.
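The fit check is simple enough to write down. The snippet below uses only the VRAM figures quoted in this article plus the 80 GB capacity of the datacenter cards. The `headroom_gb` parameter is a hypothetical knob for whatever else is resident on the card and is not a number from the paper.

```python
# Reported VRAM needs per inference mode (GB), as quoted in this article.
MODE_VRAM_GB = {
    "bidirectional": 49.2,
    "chunk-causal autoregressive": 51.1,
}

# Card capacities (GB).
GPU_VRAM_GB = {
    "RTX 5090": 32,
    "H100 80 GB": 80,
    "A100 80 GB": 80,
}

def fits(mode: str, gpu: str, headroom_gb: float = 0.0) -> bool:
    """True if the mode's reported VRAM need plus headroom fits on the card."""
    return MODE_VRAM_GB[mode] + headroom_gb <= GPU_VRAM_GB[gpu]

for mode in MODE_VRAM_GB:
    for gpu in GPU_VRAM_GB:
        verdict = "fits" if fits(mode, gpu) else "does not fit"
        print(f"{mode} on {gpu}: {verdict}")
```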

Where does SANA-WM’s camera control break down in practice?

The Gated DeltaNet maintains a fixed D×D recurrent matrix that implicitly encodes scene appearance; there is no object-level or geometry representation. Degradation scales with scene dynamism rather than trajectory complexity: a static room viewed from extreme angles holds up better than a scene with moving subjects under a simple camera path. Beyond the validated 60-second window, recurrent state decay is uncharacterized.

How do SANA-WM’s benchmarks compare to Veo or Sora?

There is no direct comparison. Closed models do not publish VBench scores under reproducible conditions, so SANA-WM’s 80.62/81.89 can only be measured against open baselines it already outperforms on throughput. The camera metrics reported alongside VBench (rotation error, translation error, and CamMC) capture geometric fidelity, not perceptual quality or physical plausibility, which are the dimensions where closed models currently lead.

What would extending generation beyond 60 seconds require?

The GDN recurrent state is length-agnostic in principle, but the paper only trains and validates at 60 seconds (961 latent frames). Pushing further would stress the implicit scene representation and likely amplify the drift the authors already flag. A robust extension would need explicit 3D scene memory or a hierarchical architecture that periodically refreshes the recurrent state. Either would increase VRAM demand and break the single-GPU economics that make SANA-WM notable.