Open-Weight LLM Leaderboards 2026: Where DeepSeek, Qwen, and GLM Rank

On the independent leaderboards tracking open-weight models as of late June 2026, no Chinese flagship sits in the overall top three. DataLearner’s June 23 revalidation ranks GLM-5.2 seventh by HLE at 54.70 and reserves the top of its AA Intelligence Index for Claude Fable 5, Claude Opus 4.8 (max), and GPT-5.5 (xhigh). The release-day “#1” graphics from the spring 2026 Chinese launch wave do not survive contact with third-party aggregation.

Why can’t a launch-day “#1” claim settle model selection?

Every lab in the spring 2026 Chinese release wave shipped a launch graphic claiming its new model led some axis. The claim rests on choices the vendor made: which benchmark, which comparison set, and which aggregation method, with a bar chart drawn to match. A launch post can truthfully say “#1 on X” while a third-party board ranks the same model seventh, because the two measure different things on different populations.

DataLearner is one of the few places where the claim has to survive without the marketing framing. Its June 23 snapshot (stamped 2026-06-23 16:05:27) is the first stable third-party read on where DeepSeek, Qwen, and GLM flagships actually sit once the release wave settled.

For a team picking a model, the consequence is concrete. A single leaderboard, let alone a single launch graphic, is no longer sufficient evidence. Four aggregators now track these weights, each on a different axis, and the rank a model earns depends on which axis you read.

Where do the Chinese flagships actually rank on DataLearner?

Ranked by HLE on DataLearner’s June 23 snapshot, GLM-5.2 (Zhipu AI) leads the Chinese field at seventh overall with 54.70. The public page truncates below rank #9 (Kimi K2.6), so the exact HLE positions of the rest of the Chinese field (GLM 5.1, GLM-5, Kimi K2.5, Qwen3.5-397B-A17B, DeepSeek-V4-Pro, DeepSeek-V4-Flash) are not confirmable from the cached snapshot; GLM-5.2 is the only Chinese flagship whose HLE figure surfaces in the visible top of the board.

Note where the field is missing entirely. On the AA Intelligence Index, DataLearner’s composite of ten standardized benchmarks, the top tier is Claude Fable 5, Claude Opus 4.8 (max), and GPT-5.5 (xhigh). No Chinese open-weight model appears there. HLE is the one column where the Chinese flagships surface near the front, and even there none cracks the top five.

How do the scores break down benchmark by benchmark?

The single-board view hides which benchmark is doing the lifting. Spread across DataLearner’s per-benchmark columns and Onyx’s vendor-reported columns, the picture fragments by task:

Model	HLE (DataLearner)	MMLU-Pro (Onyx)	AIME 2025 (Onyx)	SWE-bench Verified (Onyx)
GLM-5.2	54.70 (#7)	n/r	n/r	n/r
GLM-5	n/r	70.4	84.0	77.8
Kimi K2.5 (1T)	n/r	87.1	96.1	76.8
Qwen3.5-397B-A17B	n/r	87.8	n/r	76.4
DeepSeek-V4-Pro	n/r	n/r	n/r	n/r
DeepSeek-V4-Flash	n/r	n/r	n/r	n/r

“n/r” means the board does not report that figure. llm-stats, which ranks open models by performance, price, and speed, separately scores DeepSeek-V4-Pro-Max at GPQA 90.1%.

Two things stand out in the table. GLM-5.2 carries the top HLE score in the Chinese field at 54.70 but has no reported figure in any of the Onyx columns. On those columns, Kimi K2.5 and Qwen3.5-397B-A17B both clear GLM-5 on MMLU-Pro (87.1 and 87.8 versus 70.4), while GLM-5 holds a narrow SWE-bench Verified lead (77.8 versus 76.8 and 76.4).
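The flip between columns is easy to reproduce from the Onyx figures in the table above. The following is a minimal Python sketch, not any board's actual ranking code: the `rank_by` helper is a hypothetical name, and it assumes an unreported score simply drops a model out of that benchmark's ranking.

```python
# Onyx vendor-reported scores as quoted in the table above; None = not reported.
ONYX = {
    "GLM-5": {"MMLU-Pro": 70.4, "AIME 2025": 84.0, "SWE-bench Verified": 77.8},
    "Kimi K2.5": {"MMLU-Pro": 87.1, "AIME 2025": 96.1, "SWE-bench Verified": 76.8},
    "Qwen3.5-397B-A17B": {"MMLU-Pro": 87.8, "AIME 2025": None, "SWE-bench Verified": 76.4},
}


def rank_by(scores, benchmark):
    """Return model names sorted best-first on one benchmark, skipping unreported scores."""
    reported = {m: s[benchmark] for m, s in scores.items() if s.get(benchmark) is not None}
    return sorted(reported, key=reported.get, reverse=True)


for benchmark in ("MMLU-Pro", "AIME 2025", "SWE-bench Verified"):
    print(benchmark, "->", rank_by(ONYX, benchmark))
```

On these three models the MMLU-Pro order is the exact reverse of the SWE-bench Verified order, and Qwen3.5-397B-A17B falls out of the AIME 2025 ranking because Onyx reports no figure for it.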

When switching boards moves a model’s position more than switching models does, the choice of source is doing more work than the score.

Why does the same model rank differently on every board?

GLM-5.2 is fourth overall on llm-leaderboard.com’s composite but seventh on DataLearner’s HLE sort, and that gap is structural rather than a bug. Each board sorts on a different axis.

llm-leaderboard.com ranks GLM-5.2 at #4 with a 59.0 proprietary composite, a 1.0M-token context window, 176 characters-per-second throughput, and $1.73 pricing. It is the highest-placed open-source model on that board’s top 15, behind Claude Mythos Preview, GPT-5.2 Pro, and Claude Opus 4.8. DataLearner, by contrast, ranks on HLE and folds ten benchmarks into the AA Intelligence Index, which is why GLM-5.2 sits at #7 there and off the composite top tier. llm-stats takes a third approach, ranking its models by performance, price, and speed. Onyx skips composites and presents raw benchmark columns sourced from official tech reports.

The axis a board chooses determines which model wins, and each axis maps to a different procurement question: DataLearner’s HLE sort answers one, llm-stats’ blend of performance, price, and speed answers another, and Onyx’s raw columns leave the sorting to the buyer. A model strong on one axis is not necessarily strong on another, which is exactly why GLM-5.2 is seventh on DataLearner and fourth on llm-leaderboard.com.

The boards also disagree on freshness. Onyx’s data, pulled from vendor tech reports, is stamped 2026-03-24 and lags DataLearner’s June 23 revalidation by three months, which matters when flagships land monthly. A model can be “#1 open-source” on one board and seventh on another, and both can be correct, because they answered different questions.

Did GLM-5 actually beat its predecessor, GLM-4.7?

No. On Onyx’s head-to-head, GLM-4.7 beats GLM-5 on five of nine benchmarks, including the ones a buyer weighs most.

Onyx’s columns give GLM-5 an MMLU-Pro score of 70.4, GPQA Diamond 86.0, SWE-bench Verified 77.8, HumanEval 90.0, LiveCodeBench 52.0, AIME 2025 84.0, and MMLU 85.0. The same board puts GLM-4.7 ahead on the load-bearing metrics: MMLU-Pro 84.3 to 70.4, LiveCodeBench 84.9 to 52.0, and AIME 2025 95.7 to 84.0. Newer did not mean better, and on MMLU-Pro the margin is not close.
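The size of each gap is simple subtraction on the Onyx figures quoted above. A short Python sketch makes the margins explicit; `margins` is a hypothetical helper name, and only the three benchmarks for which both models’ scores are quoted here are included.

```python
# Onyx figures quoted above, stored as (GLM-4.7, GLM-5) pairs.
HEAD_TO_HEAD = {
    "MMLU-Pro": (84.3, 70.4),
    "LiveCodeBench": (84.9, 52.0),
    "AIME 2025": (95.7, 84.0),
}


def margins(pairs):
    """Return the predecessor-minus-successor margin per benchmark, rounded to 1 decimal."""
    return {name: round(old - new, 1) for name, (old, new) in pairs.items()}


for name, delta in margins(HEAD_TO_HEAD).items():
    print(f"{name}: GLM-4.7 leads by {delta} points")
```

The margins come out to 13.9 points on MMLU-Pro, 32.9 on LiveCodeBench, and 11.7 on AIME 2025, so the widest regression is on LiveCodeBench rather than MMLU-Pro.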

The wider field makes the same point: no model sweeps. On the same Onyx board, Kimi K2.5 (Moonshot, 1T parameters) scores MMLU-Pro 87.1, SWE-bench Verified 76.8, HumanEval 99.0, AIME 2025 96.1, and MMLU 92.0. Qwen3.5-397B-A17B posts MMLU-Pro 87.8, GPQA Diamond 88.4, and SWE-bench Verified 76.4, with AIME 2025 unreported. DeepSeek V3.2 (685B) posts MMLU-Pro 85.0, GPQA Diamond 79.9, and SWE-bench Verified 67.8. By Onyx’s vendor-reported columns, Kimi and Qwen both clear GLM-5 on MMLU-Pro by roughly 17 points, which is not what a GLM launch graphic implies.

What does “open” actually mean here, and what do these models cost?

Most of these flagships are “free commercial” weights, not OSI-approved open source, and one is non-commercial only. DataLearner labels GLM-5.2, GLM 5.1, GLM-5, Kimi K2.6, Qwen3.5-27B, and DeepSeek-V4-Pro and -Flash as “Free commercial,” and marks MiniMax-M2.7 as “Non-commercial.” The distinction matters: free-commercial terms permit commercial use but stop short of the open-source definition, and that gap rarely surfaces in a launch graphic.

Pricing is its own axis, and it is where the Chinese field is genuinely competitive. On llm-stats, GLM-5 runs $1.00/$3.20 per million tokens, GLM-5.1 runs $1.40/$4.40, and Kimi K2.6 runs $0.95/$4.00. On llm-leaderboard.com, GLM-5.2 lists at $1.73 with a 1.0M-token context window and 176 c/s throughput. Context length and throughput belong to the cost calculus in ways a benchmark rank does not, and a model that is cheap per token but slow or short-context can lose on total cost for a latency-sensitive workload.
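To see how the per-token prices interact with workload shape, the sketch below costs two hypothetical workloads against the llm-stats prices quoted above. It rests on stated assumptions: the first figure in each pair is taken to be the input price and the second the output price, the workload sizes are invented for illustration, and throughput and context length are not modeled at all.

```python
# llm-stats prices quoted above, in USD per million tokens.
# Assumption: the first figure is the input price, the second the output price.
PRICES = {
    "GLM-5": (1.00, 3.20),
    "GLM-5.1": (1.40, 4.40),
    "Kimi K2.6": (0.95, 4.00),
}


def workload_cost(model, input_mtok, output_mtok):
    """Token cost in USD for a workload measured in millions of input/output tokens."""
    price_in, price_out = PRICES[model]
    return round(input_mtok * price_in + output_mtok * price_out, 2)


def cheapest(input_mtok, output_mtok):
    """Return the model with the lowest token cost for the given workload."""
    return min(PRICES, key=lambda m: workload_cost(m, input_mtok, output_mtok))


# Two hypothetical workloads: one input-heavy, one output-heavy.
print("input-heavy (100M in, 1M out):", cheapest(100, 1))
print("output-heavy (1M in, 10M out):", cheapest(1, 10))
```

Under these assumptions Kimi K2.6 is cheapest for the input-heavy workload and GLM-5 for the output-heavy one, so even the pricing axis has no single winner until the workload is specified.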

For a team shortlisting a Chinese open-weight flagship in June 2026, the rule is to read a board’s axis before its rank. DataLearner’s HLE snapshot puts GLM-5.2 seventh and no Chinese model in the composite top three; llm-leaderboard.com’s composite puts GLM-5.2 fourth; Onyx’s vendor-reported columns show GLM-5 losing to its own predecessor on MMLU-Pro, LiveCodeBench, and AIME 2025. None of those contradicts the others. They are four answers to four different questions, and the launch graphics answer only the one the lab chose.

Frequently Asked Questions

How quickly do these leaderboards go stale after a snapshot?

DataLearner revalidates its index every five minutes, while Onyx’s columns are stamped 2026-03-24, roughly three months behind DataLearner’s June 23 snapshot, a gap during which GLM-5.1 and GLM-5.2 shipped. A rank pulled Monday morning can shift by Monday afternoon on one board and sit months out of date on another, so a procurement memo should carry a re-pull date rather than a fixed rank.
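One way to make the re-pull date hard to omit is to generate every rank citation from a record that carries it. The sketch below is illustrative only: `memo_line` and `staleness_days` are hypothetical helpers, and the snapshot dates are the two quoted in this article.

```python
from datetime import date

# Snapshot dates quoted in this article.
SNAPSHOTS = {
    "DataLearner": date(2026, 6, 23),
    "Onyx": date(2026, 3, 24),
}


def staleness_days(snapshot, pulled_on):
    """Days elapsed between a board's snapshot date and the day the rank was pulled."""
    return (pulled_on - snapshot).days


def memo_line(board, model, rank, pulled_on):
    """Format a rank citation that carries its board, snapshot date, and pull date."""
    snap = SNAPSHOTS[board]
    age = staleness_days(snap, pulled_on)
    return f"{model}: #{rank} on {board} (snapshot {snap}, pulled {pulled_on}, {age} days old)"


print(memo_line("DataLearner", "GLM-5.2", 7, date(2026, 6, 23)))
print("Onyx lag behind DataLearner:", staleness_days(SNAPSHOTS["Onyx"], SNAPSHOTS["DataLearner"]), "days")
```

The gap between the two stamps works out to 91 days, which is the “roughly three months” cited above.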

How far apart can two boards score the same model on one benchmark?

Far enough to flip a shortlist. Onyx reports GLM-5’s SWE-bench Verified at 77.8 from the vendor’s tech report, while DataLearner’s per-benchmark column for the same model posts 2.10, and Qwen3.5-397B-A17B reads SWE-bench at 76.4 on Onyx versus 86.7 on llm-stats. A score quoted without the board it came from is a citation question, not a number.
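A simple guard is to store every score keyed by the board it came from and flag any model and benchmark pair where the boards disagree. The sketch below uses only the figures quoted above; the `spread` and `flag_disagreements` helpers and the 5-point threshold are hypothetical choices, not something any of the boards publishes.

```python
# (model, benchmark) -> {board: score}, using only figures quoted above.
SCORES = {
    ("GLM-5", "SWE-bench Verified"): {"Onyx": 77.8, "DataLearner": 2.10},
    ("Qwen3.5-397B-A17B", "SWE-bench"): {"Onyx": 76.4, "llm-stats": 86.7},
}


def spread(by_board):
    """Gap between the highest and lowest score reported for one model and benchmark."""
    values = list(by_board.values())
    return round(max(values) - min(values), 1)


def flag_disagreements(scores, threshold=5.0):
    """Return entries whose cross-board spread exceeds an illustrative threshold."""
    return {key: spread(boards) for key, boards in scores.items() if spread(boards) > threshold}


for (model, benchmark), gap in flag_disagreements(SCORES).items():
    print(f"{model} / {benchmark}: boards differ by {gap} points")
```

Both pairs trip the threshold, at 75.7 and 10.3 points respectively. A gap as large as the first more plausibly signals a difference in what the column measures than a difference in the model, which is one more reason to keep the board name attached to the number.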

What does each board’s sort key hide about model selection?

The four boards sort on different keys, and the key picks the winner. llm-stats ranks its 302 tracked models by coding-arena and then GPQA Diamond, which elevates coding-heavy specialists regardless of their reasoning scores. A model optimized for general reasoning can sit below a coding specialist it outperforms on every other column.

Which benchmark dimension do all four leaderboards underweight?

Agentic and long-horizon tool use. None of the four boards elevates ARC-AGI-2 or τ²-Bench into its headline sort, even though the Chinese field is competitive there: DeepSeek-V4-Pro posts ARC-AGI-2 80.60, GLM-5 posts ARC-AGI-2 77.80 with τ²-Bench 89.70, and Qwen3.5-397B-A17B posts ARC-AGI-2 76.40 with τ²-Bench 86.70. A team buying for autonomous workloads should pull those columns directly instead of trusting the composite rank.