Doubao vs Qwen 3.7 vs GLM-5.2: Route by Axis, Not Leaderboard

Doubao-Seed-2.1 Pro's parity claims are vendor-graded and unreproduced. Route the Chinese flagship tier by axis: Qwen for ZH→EN, Doubao for EN→ZH, DeepSeek for code.

The honest answer is axis-dependent, and Doubao-Seed-2.1 Pro has not yet been independently evaluated. ByteDance’s freshly launched flagship claims parity with GPT-5.5 and Claude Opus 4.7, per Groundy’s analysis of the launch deck, but none of those scores has been reproduced by a third party as of 2026-06-23. The only independent Doubao numbers on record belong to the prior 2.0 Pro, and they split by workload rather than anoint a single winner.

A third Chinese flagship, vendor-graded

ByteDance launched Doubao-Seed-2.1 Pro at the Volcano Engine Force Conference in Beijing on June 23, 2026, per Dataconomy’s coverage, placing a third credible flagship next to Alibaba’s Qwen 3.7 and Zhipu’s GLM-5.2. The launch did not, however, come with independent benchmark coverage. Every benchmark number on the Force deck was produced by ByteDance, and the comparison slides mix model versions in ways that suit the headline. A vendor deck and an independent evaluation are not comparable claims. For anyone building a multi-model router, the useful question is not who tops a single aggregate score but which model wins which axis.

Are ByteDance’s Doubao-Seed-2.1 Pro scores independently reproduced?

No. Per the deck analysis, ByteDance claims Doubao-Seed-2.1 Pro matches Claude Opus 4.7 on Terminal Bench, beats GPT-5.5 on NL2Repo-Bench, scores 59.8 on SciCode ahead of both Opus 4.7 and GPT-5.5, and leads the MCP-Atlas tool-calling benchmark. None of those scores has been reproduced by a third party as of 2026-06-23, and MCP-Atlas and NL2Repo-Bench are too new to carry any third-party baseline at all. The lead is against a field of one.

The version drift inside the deck compounds the problem. Capability slides compare against Claude Opus 4.7; cost slides compare against Opus 4.6; Anthropic’s current frontier has since moved to Opus 4.8 and Fable 5. The parity claim is anchored to a model that is already a point release behind.

What independent data exists for Doubao?

The only independent numbers on record are for the prior Doubao-Seed-2.0-Pro, tested by NovAI in Hong Kong in May 2026, not the 2.1 Pro that launched at the Force Conference. NovAI measured 89.5 on HumanEval+, 86.4 on MMLU-Pro, 47.3 BLEU on EN→ZH FLORES-200, and a P50 first-token latency of 312 ms, placing the model within roughly two points of GPT-5.5 on most categories at about one-twelfth the price. Those are real, third-party numbers. They belong to a different model version, and conflating them with the 2.1 Pro vendor scores is the fastest way to misread this market.

Which model wins translation, code, and reasoning?

On the workloads NovAI measured, the Chinese models split by axis rather than rank in a line, which is the finding a router builder can actually use. Doubao Pro wins EN→ZH translation at 47.3 BLEU versus GPT-5.5’s 42.1, a training-corpus advantage in the direction that matters most inside China. Qwen3-Max wins ZH→EN at 41.1 versus Doubao’s 38.2 and Japanese at 33.5 versus 32.1. DeepSeek V4 Pro leads LeetCode-Hard at 38 out of 50 versus Doubao’s 31, per NovAI’s benchmarks.

Note the version attribution: those are the May 2026 cohort (Qwen3-Max, DeepSeek V4 Pro, Doubao-Seed-2.0-Pro), not the current Qwen3.7-Max-Preview or Doubao-Seed-2.1 Pro. The directional pattern is the durable part. Whoever tops the aggregate index is rarely the model that tops a given axis.
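
As a rough illustration of how those axis results could feed a router, the sketch below stores the NovAI figures quoted above in a lookup table and picks the top scorer per axis. The axis keys, the `route` function, and the table layout are assumptions made for this example and are not part of any NovAI or vendor tooling. The numbers belong to the May 2026 cohort, and the source does not name the metric behind the Japanese comparison.

```python
# Scores as quoted from NovAI's May 2026 tests (prior model versions).
# Axis names are hypothetical labels chosen for this sketch.
NOVAI_SCORES = {
    "en_to_zh": {"Doubao-Seed-2.0-Pro": 47.3, "GPT-5.5": 42.1},
    "zh_to_en": {"Qwen3-Max": 41.1, "Doubao-Seed-2.0-Pro": 38.2},
    "japanese": {"Qwen3-Max": 33.5, "Doubao-Seed-2.0-Pro": 32.1},
    "leetcode_hard": {"DeepSeek V4 Pro": 38, "Doubao-Seed-2.0-Pro": 31},
}


def route(axis: str) -> str:
    """Return the model with the highest recorded score on the given axis."""
    scores = NOVAI_SCORES[axis]  # raises KeyError for an unmeasured axis
    return max(scores, key=scores.get)


for axis in NOVAI_SCORES:
    print(f"{axis}: {route(axis)}")
```

An unmeasured axis raises `KeyError` rather than falling back to a default model, which keeps the router from silently treating an aggregate leader as the winner on an axis nobody tested.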

What does Doubao cost, and is the price real for agent loops?

Doubao-Seed-2.1 Pro lists at 6 CNY per million input tokens and 30 CNY per million output, with cache-hit input at 1.2 CNY, and Volcano Engine claims total cost of ownership roughly 80% lower than Claude Opus 4.6. Two caveats bound that claim.

First, the 80% figure is anchored to Opus 4.6, not the current Opus 4.8 frontier, so the comparison is against a previous-generation price. Second, the cache-hit rate is the load-bearing assumption. Claude Code and Codex-style agent loops keep a stable system prompt and tool prefix across many calls, so they hit cache; one-shot reasoning, varied-context retrieval, and cold-start traffic do not. Your blended cost sits between the 1.2 CNY cache-hit floor and the 6 CNY cold rate, depending on your prefix-stability profile.
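
The blended input rate can be sketched as a weighted average of the two listed prices. This is a minimal illustration that assumes a simple linear blend on input tokens only; the function name and the example hit fractions are hypothetical, and output tokens (30 CNY per million) are left out.

```python
CACHE_HIT_CNY_PER_M = 1.2   # listed cache-hit input rate
COLD_INPUT_CNY_PER_M = 6.0  # listed cold input rate


def blended_input_rate(cache_hit_fraction: float) -> float:
    """Weighted average input price (CNY per million tokens).

    cache_hit_fraction is the share of input tokens served from cache, 0..1.
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    hit_part = cache_hit_fraction * CACHE_HIT_CNY_PER_M
    cold_part = (1.0 - cache_hit_fraction) * COLD_INPUT_CNY_PER_M
    return hit_part + cold_part


# Hypothetical traffic profiles; measure your own hit rate before budgeting.
for label, hit in [("cold-start or varied context", 0.0),
                   ("mixed traffic", 0.5),
                   ("stable-prefix agent loop", 0.9)]:
    print(f"{label}: {blended_input_rate(hit):.2f} CNY/M input")
```

The point of the exercise is that the headline figure depends on one measured quantity, the cache-hit fraction, which has to come from your own traffic logs rather than the vendor deck.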

The access wall matters more than the unit price. The 6 CNY/M rate, the cache-hit discount, and the benchmark claims apply only to the China-served model on Volcano Engine. ByteDance’s international app Dola runs on OpenAI GPT and Google Gemini, not the Doubao model itself, so developers outside China cannot evaluate the same model through Dola. The China-served Doubao is not testable from the West without a Volcano Engine account.

The backing scale is real, if self-reported. ByteDance reports that Doubao models now process over 180 trillion tokens per day across Volcano Engine, which it describes as a 10x year-over-year increase, and IDC data gives Volcano Engine a 49.5% share of China’s public-cloud Model-as-a-Service market, first domestically. Both numbers say ByteDance routes enormous inference volume; neither says anything about model quality.

What does the Capability Frontier paper mean for routing?

The Capability Frontier paper (arXiv:2606.26836), which studied 21 LLMs across 16 benchmarks, found that correcting single-model evaluation yields a 54% error-rate reduction, and additionally correcting for single runs yields an 82% improvement. The paper also reports that higher query topic entropy produces a near-monotonic widening of the gap between oracle routing and the best single model. The more varied the workload, the more a per-query router beats committing to one model.

This is the theoretical backing for treating GLM-5.2, Qwen 3.7, and Doubao as a heterogeneous tier rather than a single fallback. The axis splits NovAI measured are exactly what a router exploits, and the single-run correction result says even small aggregate gaps may be partly measurement artifact.
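
To make the oracle-routing gap concrete, the sketch below compares the best single model against a per-query oracle on a toy correctness table. The data is entirely hypothetical and chosen only to show the calculation; the model names, the `RESULTS` table, and the helper functions are assumptions for this example and do not come from the paper.

```python
# Hypothetical per-query outcomes: 1 = correct, 0 = wrong.
# Each list covers the same eight queries in the same order.
RESULTS = {
    "model_a": [1, 1, 1, 1, 0, 0, 0, 0],
    "model_b": [1, 0, 0, 0, 1, 1, 0, 0],
    "model_c": [0, 1, 0, 0, 0, 1, 1, 0],
}


def accuracy(outcomes):
    """Fraction of queries answered correctly."""
    return sum(outcomes) / len(outcomes)


def best_single_model(results):
    """Name and accuracy of the model you would pick if forced to commit to one."""
    name = max(results, key=lambda model: accuracy(results[model]))
    return name, accuracy(results[name])


def oracle_accuracy(results):
    """Accuracy of a perfect router: a query counts if any model gets it right."""
    per_query = zip(*results.values())
    return accuracy([any(outcomes) for outcomes in per_query])


name, single = best_single_model(RESULTS)
oracle = oracle_accuracy(RESULTS)
print(f"best single model: {name} at {single:.3f}")
print(f"oracle router:     {oracle:.3f}")
print(f"routing headroom:  {oracle - single:.3f}")
```

A real router will land somewhere between the two figures, since it has to predict the right model before seeing the answer. The headroom is an upper bound on what routing can add.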

What should Western builders route on?

Route by workload specialization, pin the comparison-model version, cache-adjust the price, and verify before treating any vendor deck as a leaderboard result.

  • Demand independent reproduction. If a model’s benchmark scores have not been reproduced by a third party, its parity claims carry no external anchor. Doubao-Seed-2.1 Pro currently fails this test.
  • Cache-adjust the headline price. 6 CNY/M is the cold input rate; 1.2 CNY/M is the cache-hit rate. Your blended cost depends on prefix stability, not on the marketing figure.
  • Pin the version. ByteDance anchors to Opus 4.6 on cost and Opus 4.7 on capability; Anthropic is at Opus 4.8. Re-run the comparison against the current frontier before quoting savings.
  • Do not conflate versions. Doubao-Seed-2.0-Pro independent numbers, the only ones that exist, do not validate Doubao-Seed-2.1 Pro vendor numbers.
  • Confirm access. The China-served Doubao is not reachable via Dola. Without a Volcano Engine account, you cannot reproduce ByteDance’s own results, let alone the third-party ones that do not yet exist.
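
The checklist above could be encoded as a simple gate in a router's model-onboarding step. This is one possible sketch: the `VendorClaimChecks` class and its field names are hypothetical, and the two values marked as builder-specific are placeholders that depend on your own setup rather than on anything the source reports.

```python
from dataclasses import dataclass, fields


@dataclass
class VendorClaimChecks:
    """One flag per checklist item; True means the check passes."""
    independently_reproduced: bool
    price_cache_adjusted: bool
    comparison_version_current: bool
    evidence_matches_model_version: bool
    access_confirmed: bool


def failed_checks(checks: VendorClaimChecks) -> list[str]:
    """Names of the checklist items that do not pass, in declaration order."""
    return [f.name for f in fields(checks) if not getattr(checks, f.name)]


doubao_2_1_pro = VendorClaimChecks(
    independently_reproduced=False,        # no third-party reproduction as of 2026-06-23
    price_cache_adjusted=False,            # builder-specific placeholder
    comparison_version_current=False,      # deck anchors to Opus 4.6 and 4.7
    evidence_matches_model_version=False,  # independent numbers are for 2.0 Pro
    access_confirmed=False,                # builder-specific placeholder
)

print(failed_checks(doubao_2_1_pro))
```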

The Chinese frontier is now three credible flagships. Doubao-Seed-2.1 Pro’s claims are vendor-graded, several of them on benchmarks that have no third-party baseline yet. Until an independent SWE-bench Verified run appears for Doubao, the honest routing answer is the axis split NovAI already exposed: Qwen for ZH→EN and Japanese, Doubao for EN→ZH, DeepSeek for hard code. Doubao-Seed-2.1 Pro itself remains unproven but cheap for in-China, cache-friendly workloads.

Frequently Asked Questions

Where does DeepSeek-V4-Pro sit relative to GLM-5.2 and Qwen 3.7 on the independent index?

DeepSeek-V4-Pro ranks #27 at 48.20 on the datalearner AA Intelligence Index, far below GLM-5.2 at #7 with 54.70 and Qwen3.7-Max-Preview at #10 with 53.50, despite DeepSeek leading LeetCode-Hard in the NovAI axis test. The aggregate ranking and the per-axis win disagree, which is the routing tension the Capability Frontier paper predicts for high topic-entropy workloads.

How do Doubao’s Claude Code and Codex compatibility claims differ from its benchmark claims?

ByteDance lists WPS, DeDao, and Unity’s Tongjie Engine as early integrations and states that Doubao 2.1 is compatible with the Claude Code and OpenAI Codex frameworks, but those are API and framework-compatibility statements, not eval results. A model can plug into Claude Code and still post zero verified SWE-bench numbers, which is the gap between plug-in support and a reproduced score.

What blended cost should a router budget for Doubao under real agent traffic?

ByteDance quotes a blended Coding and Agent cost of 1.96 CNY per million tokens under stable-prefix caching, sitting between the 1.2 CNY cache-hit floor and the 6 CNY cold input rate. That figure assumes Claude Code-style loops with a fixed system prompt and tool prefix; retrieval-heavy or varied-context traffic drifts toward the cold rate and erodes the savings.
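
One way to sanity-check the 1.96 CNY figure is to back out the cache-hit fraction it would imply. The sketch below assumes the quoted number is a linear blend of the two input rates only, with output tokens excluded; the source does not confirm how ByteDance computed it, so treat the result as illustrative. The function name is hypothetical.

```python
CACHE_HIT_CNY_PER_M = 1.2   # listed cache-hit input rate
COLD_INPUT_CNY_PER_M = 6.0  # listed cold input rate
QUOTED_BLENDED_CNY_PER_M = 1.96  # ByteDance's quoted Coding and Agent blend


def implied_cache_hit_fraction(blended: float) -> float:
    """Solve blended = h * hit + (1 - h) * cold for h."""
    if not CACHE_HIT_CNY_PER_M <= blended <= COLD_INPUT_CNY_PER_M:
        raise ValueError("blended rate must lie between the two listed rates")
    spread = COLD_INPUT_CNY_PER_M - CACHE_HIT_CNY_PER_M
    return (COLD_INPUT_CNY_PER_M - blended) / spread


h = implied_cache_hit_fraction(QUOTED_BLENDED_CNY_PER_M)
print(f"implied cache-hit fraction: {h:.1%}")
```

Under that assumption the quoted blend corresponds to roughly 84% of input tokens hitting cache, which is a useful threshold to compare against your own measured prefix stability.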

Why does the datalearner index rank GLM-5.2 and Qwen 3.7 but not Doubao?

GLM-5.2 sits at #7 with 54.70 and Qwen3.7-Max-Preview at #10 with 53.50 on the datalearner AA Intelligence Index, while Doubao-Seed-2.1 Pro has no entry. Volcano Engine’s China-only serving blocks the third-party API access an independent leaderboard needs to score a model, so the absence is an access problem as much as a coverage gap.

What would move Doubao from vendor-graded to independently verified?

An independent SWE-bench Verified or LiveCodeBench run on Doubao-Seed-2.1 Pro would give it a slot comparable to GLM-5.2’s #7 or Qwen 3.7’s #10 on the datalearner index. ALE, MCP-Atlas, and NL2Repo-Bench currently have no third-party baseline anywhere, so even a single verified coding run would shift the routing picture more than another vendor deck.

Sources

  1. ByteDance Launches Doubao 2.1 Pro Language Model. dataconomy.com. Accessed 2026-06-28.
  2. Doubao. Wikipedia, en.wikipedia.org. Accessed 2026-06-28.