DeepSeek V3 achieved GPT-4-level benchmark performance using approximately 2.664 million H800 GPU hours—roughly $5.6 million in compute at market rental rates. Its successor R1 matched OpenAI o1 on advanced reasoning tasks without a reward model. The architectural innovations behind both results are reshaping how the industry thinks about the relationship between compute and capability.

What Is DeepSeek and Why Does It Matter?

DeepSeek is a Chinese AI research lab founded in July 2023 by Liang Wenfeng, who also co-founded High-Flyer, a quantitative hedge fund managing over $10 billion in assets. High-Flyer’s AI infrastructure ambitions predated the US export controls: the firm accumulated approximately 10,000 Nvidia A100 GPUs before restrictions took effect in October 2022.1

That backstory matters for interpreting what happened next. When the US government restricted high-bandwidth chips like the H100 from Chinese buyers, DeepSeek was left training on H800s: hardware with comparable compute throughput but meaningfully reduced chip-to-chip interconnect bandwidth (NVLink cut from the H100's 900 GB/s to 400 GB/s). The engineering decisions this forced weren't obstacles to performance. They became the source of it.

DeepSeek-V3 launched in December 2024; DeepSeek-R1 followed on January 20, 2025. Within a week, the DeepSeek app had surpassed ChatGPT as the most downloaded free app on the US iOS App Store.2 On January 27, 2025, Nvidia’s stock fell 16.9%, erasing approximately $589 billion in market capitalization—the largest single-day market cap loss in stock market history.3

How DeepSeek V3 Achieves Frontier Performance at Radically Lower Compute

DeepSeek-V3 is a 671-billion-parameter model. That number is misleading in isolation. The model activates only 37 billion parameters per token during inference—about 5.5% of its total weights. The remaining parameters sit dormant for any given token, routed past by the model’s expert selection logic.

This is Mixture of Experts (MoE) architecture, and it explains more about DeepSeek's economics than any other single factor. A 671B dense model would require roughly 18x the compute per forward pass. DeepSeek-V3 delivers the representational capacity of a very large model while executing inference at the effective cost of a 37B model.
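The routing mechanism can be sketched in a few lines. The sketch below is illustrative, not DeepSeek's implementation: it uses simple softmax gating over the top-k scores, and only the expert count (256 routed experts, 8 active per token) follows V3's reported configuration.

```python
import numpy as np

def top_k_route(router_logits, k):
    """Select the k highest-scoring experts per token (illustrative top-k gating)."""
    top = np.argsort(router_logits, axis=-1)[:, -k:]           # indices of chosen experts
    weights = np.take_along_axis(router_logits, top, axis=-1)  # their raw scores
    weights = np.exp(weights) / np.exp(weights).sum(-1, keepdims=True)  # normalize gates
    return top, weights

tokens, n_experts, k = 4, 256, 8   # V3 routes each token to 8 of 256 routed experts
logits = np.random.default_rng(0).normal(size=(tokens, n_experts))
chosen, gate = top_k_route(logits, k)
print(chosen.shape, gate.shape)    # (4, 8) (4, 8) — only 8 of 256 experts run per token
```

The point of the sketch: the forward pass touches only the selected experts' weights, which is why activated parameters, not total parameters, drive per-token compute cost.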

But MoE alone doesn’t explain the training cost figure. Three additional innovations compound the efficiency gains:

Multi-Head Latent Attention (MLA)

Standard transformer attention is memory-hungry at scale. Each token in the context window requires storing full key and value tensors—at large batch sizes and long contexts, this becomes the primary bottleneck. DeepSeek-V3 compresses the KV representation into a 512-dimensional latent vector per token rather than storing the full representation (roughly 14,000 values in comparable models). The result is approximately a 28x reduction in KV cache size and 5.76x higher generation throughput compared to a dense multi-head attention baseline.4

MLA was introduced in DeepSeek-V2 and refined for V3. Its practical consequence: the same hardware serves larger batches, meaning more tokens processed per GPU-hour.
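The memory arithmetic is easy to verify from the article's own figures. The context length, batch size, and FP16 byte width below are hypothetical workload assumptions; only the per-token value counts come from the text.

```python
# Back-of-envelope KV-cache comparison using the article's per-token figures
n_values_mha = 14_000   # approx. K+V values per token in a comparable dense-attention model
n_values_mla = 512      # MLA compressed latent dimension per token

context, batch, bytes_per_val = 32_768, 8, 2   # assumed workload, FP16 storage
mha_gb = context * batch * n_values_mha * bytes_per_val / 1e9
mla_gb = context * batch * n_values_mla * bytes_per_val / 1e9
print(f"MHA cache: {mha_gb:.1f} GB, MLA cache: {mla_gb:.2f} GB, "
      f"ratio: {n_values_mha / n_values_mla:.1f}x")
```

The ratio works out to roughly 27x, consistent with the cited ~28x reduction; the freed memory is what allows larger serving batches on the same GPU.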

FP8 Mixed Precision Training

DeepSeek-V3 uses 8-bit floating point (FP8) for matrix operations during training—the first validated application of FP8 at this model scale, according to the technical report. FP8 versus the more common BF16/FP16 effectively halves memory bandwidth requirements for matrix multiplications. The implementation uses tile-wise quantization for activations and block-wise quantization for weights, with precision promoted to higher formats at 128-element intervals to prevent error accumulation.5

The practical effect: the same H800s produce more usable computation per watt, per hour.
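The tile-wise scaling idea can be sketched as follows. This is a conceptual stand-in, not real FP8: `np.round` substitutes for the cast to E4M3, and only the 128-element tile size and the per-tile scale factor reflect the described scheme.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest representable magnitude in the E4M3 format

def quantize_tiles(x, tile=128):
    """Per-tile scaling sketch: each 128-element tile gets its own scale factor,
    so a single outlier only degrades precision within its own tile."""
    x = x.reshape(-1, tile)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    q = np.round(x / scales)   # stand-in for the actual cast to FP8 E4M3
    return q, scales

def dequantize(q, scales):
    return q * scales

acts = np.random.default_rng(1).normal(size=(4, 512)).astype(np.float32)
q, s = quantize_tiles(acts)
err = np.abs(dequantize(q, s).reshape(acts.shape) - acts).max()
print(f"max abs round-trip error: {err:.4f}")
```

Fine-grained scaling is the reason FP8 is viable at this scale: with one scale per 128-element tile, activation outliers cannot blow out the dynamic range of an entire tensor.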

Auxiliary-Loss-Free Load Balancing

MoE models face a structural failure mode called routing collapse: without explicit incentives, gradient descent discovers it can minimize loss by routing most tokens to the same few experts. The standard fix is an auxiliary loss term penalizing uneven distribution—but too-strong auxiliary loss degrades task performance.

DeepSeek-V3 eliminates the auxiliary loss entirely, replacing it with dynamic bias terms added to expert routing scores. Overloaded experts receive a small negative bias; underloaded experts receive a positive one. The biases update each training step based on observed load. This achieves balanced expert utilization without any task-performance trade-off—a cleaner solution than the penalty approach most prior MoE work used.5
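The bias-update rule described above is simple enough to sketch directly. The update step size and the frozen load vector below are illustrative (in training, load is re-measured every step); the sign-based nudge mirrors the described mechanism.

```python
import numpy as np

def update_biases(biases, expert_load, target_load, gamma=0.001):
    """Auxiliary-loss-free balancing sketch: nudge an expert's routing bias down
    if it is overloaded, up if underloaded. The bias affects only which experts
    are selected, not the gating weights used to combine their outputs."""
    overloaded = expert_load > target_load
    return np.where(overloaded, biases - gamma, biases + gamma)

n_experts = 8
biases = np.zeros(n_experts)
# Hypothetical load snapshot: expert 0 is hogging tokens (held fixed for illustration)
load = np.array([0.40, 0.05, 0.05, 0.10, 0.10, 0.10, 0.10, 0.10])
for _ in range(100):
    biases = update_biases(biases, load, target_load=1.0 / n_experts)
print(biases)  # expert 0's bias driven negative; underloaded experts' biases rise
```

Because the bias never enters the loss, there is no gradient pressure trading balance against task performance, which is the point of the design.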

DeepSeek R1: Reasoning Without a Reward Model

DeepSeek R1 was built to compete with OpenAI o1 on tasks requiring multi-step reasoning: mathematical competition problems, complex code generation, and scientific question-answering. Its training approach, Group Relative Policy Optimization (GRPO), departs significantly from the standard reinforcement learning from human feedback (RLHF) pipeline.

Standard RL for language models requires a critic model (typically the same size as the policy model) to estimate baseline value functions. This doubles compute and memory requirements. GRPO eliminates the critic entirely: for each question, the model generates a group of candidate responses, each receives a correctness reward via verifiable checks (regex matching for math, unit tests for code), and the group’s mean and standard deviation normalize the advantage estimates.6

The rewards are ground-truth verifiable, not learned neural reward models. This matters because neural reward models can be exploited—the policy learns to game the reward signal rather than solve the underlying task. With GRPO, there is no reward model to game.
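The group normalization at the heart of GRPO is a one-liner. The sketch below shows only the advantage computation; the sampled rewards are hypothetical, and the surrounding policy-gradient machinery is omitted.

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO advantage estimate: normalize each response's reward by the group
    mean and standard deviation — no learned critic/value model required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One question, a group of 8 sampled answers scored by a verifiable check
# (1.0 = correct final answer, 0.0 = incorrect) — hypothetical rewards:
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
adv = grpo_advantages(rewards)
print(adv)  # correct answers get positive advantage, incorrect get negative
```

The group itself serves as the baseline: responses better than their siblings are reinforced, worse ones suppressed, which is what makes the critic model dispensable.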

R1-Zero demonstrated this approach in its most extreme form: GRPO applied directly to a base model with no supervised fine-tuning, rewarding only correctness. The model spontaneously developed extended chain-of-thought reasoning, self-verification, and backtracking behaviors. It also developed “poor readability and language mixing”—alternating between Chinese and English mid-thought—which prompted the four-stage training pipeline used for the final R1 model.6

Benchmark Results: What the Numbers Actually Show

Comparing benchmark scores across DeepSeek V3, R1, and OpenAI’s models requires careful category matching. V3 is a non-reasoning model; R1 is a reasoning model. Comparing V3 to o1 or R1 to GPT-4o produces misleading results.

DeepSeek-V3 vs. Non-Reasoning Frontier Models (as of December 2024)5

Benchmark                 DeepSeek-V3   GPT-4o   Claude 3.5 Sonnet
MMLU                      88.5%         87.2%    88.3%
MATH-500                  90.2%         74.6%    78.3%
AIME 2024                 39.2%         9.3%     16.0%
GPQA-Diamond              59.1%         49.9%    65.0%
Codeforces (percentile)   51.6          23.6     20.3
SWE-bench Verified        42.0%         n/a      n/a

DeepSeek-R1 vs. OpenAI o1 (as of January 2025)6

Benchmark             DeepSeek-R1               OpenAI o1-1217
AIME 2024 (pass@1)    79.8%                     79.2%
MATH-500              97.3%                     96.4%
GPQA Diamond          71.5%                     ~75.7%
MMLU                  90.8%                     ~91.8%
Codeforces            2,029 Elo (96.3 pctile)   96.6 pctile

R1 leads narrowly on math benchmarks; o1 leads narrowly on general knowledge and code competition ranking. The practical conclusion: R1 is legitimately on par with o1-class performance, not merely close.

DeepSeek also released six distilled models derived from R1: the 32B variant outperforms OpenAI o1-mini across benchmarks. The 1.5B distill scores 28.9% on AIME and 83.9% on MATH-500—outperforming GPT-4o on advanced math despite running on consumer hardware.6

The $6 Million Figure: What It Includes and What It Doesn’t

The figure derives from the technical report: 2,664,000 H800 GPU hours for pre-training on 14.8 trillion tokens, at approximately $2/GPU-hour market rental rates. Full training including fine-tuning stages totals 2,788,000 GPU hours—approximately $5.576 million.5
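The arithmetic behind the headline figure is straightforward to reproduce from the numbers in the technical report:

```python
# Reproducing the article's cost arithmetic
pretrain_hours = 2_664_000   # H800 GPU hours, pre-training only
total_hours = 2_788_000      # including context extension and post-training
rate = 2.0                   # assumed $/GPU-hour market rental rate

print(f"pre-training: ${pretrain_hours * rate / 1e6:.3f}M")   # $5.328M
print(f"full run:     ${total_hours * rate / 1e6:.3f}M")      # $5.576M
```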

DeepSeek owns its hardware; it paid no rental fees. The $2/hour figure is a convenient reference point, not an actual cost.

SemiAnalysis estimated DeepSeek’s total server capital expenditure at approximately $1.6 billion, with a compute infrastructure of roughly 50,000 Hopper-generation GPUs.7 Researchers at INSAIT put total development cost at $1.3–1.6 billion when accounting for the full investment stack.8

The accurate framing: DeepSeek's final training run consumed roughly $5.6M worth of H800 time to reach a level of performance that had cost other labs an order of magnitude more in compute. That is the meaningful data point, not that DeepSeek built a frontier model for $6M from scratch.

Why Export Controls May Have Accelerated the Breakthroughs

The US export control rationale was to slow Chinese AI development by restricting access to the most capable chips. The result may have been more complicated.

H800s have reduced NVLink bandwidth relative to H100s. Training a model at V3’s scale on H800s requires more communication-efficient distributed training algorithms, more aggressive compute-per-byte optimization (hence FP8), and architectural designs that minimize inter-chip data movement. DeepSeek’s engineers solved those problems because they had to—and the solutions are now available to every lab under the MIT license.

As Jensen Huang told reporters in February 2025: “the market got it wrong” about DeepSeek’s implications for Nvidia’s long-term position.9 His argument: inference scaling and post-training RL still require substantial compute, and DeepSeek’s efficient training doesn’t reduce that demand. Whether that framing holds as techniques mature remains an open question.

Market Impact and What It Means for Practitioners

The immediate market reaction—Nvidia -16.9%, energy companies down 20–28%—reflected investor concern that AI’s compute appetite might be smaller than infrastructure build-out plans assumed. Sam Altman acknowledged on X that R1 was “an impressive model” and that OpenAI would “deliver much better models.” OpenAI subsequently cut API prices.

For practitioners, the implications are more actionable than the stock moves:

  • Both V3 and R1 weights are MIT licensed and available on Hugging Face—commercial use, fine-tuning, and distillation are permitted without restrictions.10
  • The R1 distilled models provide reasoning capability at 1.5B–70B scale, deployable on-premises without API dependency.
  • The GRPO training methodology is documented and replicable—any lab training reasoning models can adopt it.
  • DeepSeek’s API is available but subject to Chinese data laws, including requirements to store user data in China. Several governments and US agencies have restricted its use on government devices.

The efficiency innovations in V3 and R1 are not isolated to DeepSeek. They are now part of the published literature. The more consequential question is how quickly other labs will adopt and extend them—and whether the efficiency gains continue to compound.


Frequently Asked Questions

Q: Did DeepSeek really train a GPT-4-level model for only $6 million? A: The $6 million figure represents the compute cost of the final pre-training run only—2.664 million H800 GPU hours at ~$2/hour. It excludes prior experiments, R&D salaries, data curation, and infrastructure costs. SemiAnalysis estimates DeepSeek’s total compute infrastructure investment at approximately $1.6 billion.

Q: What is the difference between DeepSeek V3 and DeepSeek R1? A: V3 is a standard instruction-following model—comparable to GPT-4o or Claude 3.5 Sonnet—designed for general tasks. R1 is a reasoning model that generates extended chains of thought before answering, comparable to OpenAI o1. R1 is slower and more expensive per query but substantially stronger on math, code, and multi-step reasoning tasks.

Q: Can I use DeepSeek models commercially? A: Yes. Both V3 and R1 are released under the MIT license, which permits commercial use, modification, and redistribution. Weights are available on Hugging Face. Using DeepSeek’s hosted API subjects user data to Chinese data law requirements; self-hosting the weights does not.

Q: What is GRPO and why does it matter for reasoning models? A: Group Relative Policy Optimization is the RL algorithm DeepSeek used to train R1. Unlike standard PPO, it eliminates the need for a critic model by normalizing rewards across a group of sampled responses. This roughly halves the compute and memory requirements for RL training and uses verifiable ground-truth rewards (correct math answers, passing unit tests) rather than a learnable reward model that can be gamed.

Q: Do the efficiency innovations require restricted hardware to replicate? A: No. The architectural innovations—MLA, FP8 training, GRPO, auxiliary-loss-free load balancing—are hardware-agnostic techniques described in the published technical reports. They were developed on H800s but apply to any training cluster. The papers are publicly available, and the techniques are already being integrated into other labs’ training pipelines.


Footnotes

  1. Fortune. “DeepSeek Founder Liang Wenfeng.” January 27, 2025.

  2. Wikipedia. “DeepSeek.” Accessed February 2026.

  3. CNBC. “Nvidia sheds almost $600 billion in market cap, biggest drop ever.” January 27, 2025.

  4. DeepWiki. “Multi-Head Latent Attention (MLA) — DeepSeek-V3.” Accessed February 2026.

  5. DeepSeek. “DeepSeek-V3 Technical Report.” arXiv:2412.19437, December 2024.

  6. DeepSeek. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv:2501.12948, January 2025.

  7. SemiAnalysis. “DeepSeek Debates.” Newsletter, January 2025.

  8. The Recursive. “Martin Vechev of INSAIT: DeepSeek $6M cost of training is misleading.” February 2025.

  9. TechCrunch. “Nvidia CEO Jensen Huang says market ‘got it wrong’ about DeepSeek’s impact.” February 21, 2025.

  10. GitHub. “deepseek-ai/DeepSeek-R1.” MIT License. Accessed February 2026.
