GLM-5.2 Benchmarks: What 62.1% SWE-bench Pro and 99.2% AIME Actually Mean

When Zhipu launched GLM-5.2 on June 13, 2026, the GitHub README contained no benchmark numbers. Six days later, the canonical repository was updated with a full evaluation suite covering software engineering, mathematics, scientific reasoning, and long-horizon agent work. ¹ The delay matters: it gives readers a cleaner signal than scores published on launch day, when cherry-picking pressure is highest. It also creates a useful case study in reading AI benchmark tables: each score on the GLM-5.2 card measures something distinct, with different contamination exposure and different relevance to real work.

GLM-5.2 is a 753B-parameter Mixture-of-Experts model ² from Zhipu/Z.ai, a 2019 Tsinghua University spin-off now listed in Hong Kong (02513.HK) since January 2026. ⁶ The model carries MIT-licensed weights on HuggingFace ² and deploys via SGLang, vLLM, Transformers, or KTransformers. ¹ Context window is 1,000,000 input tokens, a 5x increase over GLM-5.1’s 200K limit. ¹

What Is SWE-bench Pro, and What Does 62.1% Mean?

SWE-bench Pro is a harder variant of SWE-bench Verified. Where Verified pulls real GitHub issues from popular Python repositories and asks the model to generate a passing patch, Pro restricts the test set to issues that stronger models (those scoring above 50% on Verified) consistently solve, meaning the remaining problems are genuinely hard for frontier systems.

GLM-5.2 scores 62.1% on SWE-bench Pro, up from 58.4% for GLM-5.1. ¹ That 3.7-point gain is a notable lift on a benchmark where even top closed models rarely move more than a few points between generations.

Several caveats apply to any SWE-bench result. The benchmark is evaluated by running the model’s generated patch against the issue’s test suite, not by a human reviewer, which means patches that happen to pass the tests but introduce regressions elsewhere are counted as successes. Evaluation harness versions also vary between labs: minor differences in scaffolding, tool access, or retry logic can shift scores by several points without any change to model weights. Zhipu’s evaluation methodology for the Pro score is self-reported; the 58.4% GLM-5.1 comparison figure comes from third-party analysis. ¹

What the number reliably signals: GLM-5.2 can resolve a majority of hard real-world GitHub issues in an automated single-pass evaluation. For teams building agentic coding workflows, that places it among the strongest open-weight options available under a permissive license.

What Is Terminal-Bench 2.1, and Why Does 81.0 vs 85.0 Matter?

Terminal-Bench 2.1 evaluates a model’s ability to complete multi-step tasks inside a terminal environment: file manipulation, shell scripting, environment setup, package installation, and debugging across a running process. It is more representative of AI coding agent work than static code generation tests because it requires the model to observe state, issue commands, and recover from errors across multiple turns.

GLM-5.2 scores 81.0 on Terminal-Bench 2.1. ¹ The same table in Zhipu’s README places Claude Opus 4.8 at 85.0, a 4-point lead for Anthropic’s model on this specific benchmark. ¹ GLM-5.1 scored 62.0, so the 5.2 generation improved by 19 points, a large single-generation gain.

The 4-point gap to Opus 4.8 is the most practically relevant comparison in the GLM-5.2 release. On software engineering (SWE-bench Pro), the comparison to Opus 4.8 is not published by Zhipu, so Terminal-Bench is the primary head-to-head data point available. A 4-point deficit at this tier is within the range where evaluation harness differences, scaffolding choices, or problem sampling could account for the gap, but the stated numbers show Opus 4.8 ahead.

What Is AIME 2026, and Why Is 99.2% a Contamination Red Flag?

AIME is the American Invitational Mathematics Examination, a competition for high school students. AIME 2026 refers to the most recent iteration of that contest. GLM-5.2 scores 99.2% on it. ¹

A near-perfect AIME score from a frontier model is expected given training data practices, but it carries the highest contamination exposure of any score in the GLM-5.2 suite. AIME problems circulate widely on math forums, solution repositories, and competitive programming sites: exactly the type of content that appears in large web crawls. A model can achieve a high AIME score by pattern-matching to memorized solutions rather than by generalizing mathematical reasoning to genuinely novel problems.

The practical implication: AIME 2026 scores are a credible ceiling signal on well-specified algebra and combinatorics problems, but they overstate general mathematical reasoning ability. Use AIME to confirm that a model is not weak on structured competition math; do not use it as a proxy for novel technical problem-solving.

HMMT Nov 2025 (Harvard-MIT Mathematics Tournament, 94.4% ¹) is a complementary data point. HMMT problems are harder than AIME and drawn from a narrower community, so contamination is lower, though not zero for a model trained on the November 2025 internet.

What Is GPQA-Diamond, and What Does 91.2% Cover?

GPQA-Diamond is the hardest subset of the Graduate-Level Google-Proof Q&A benchmark, which tests questions in biology, chemistry, and physics that are designed to be answerable only with domain expertise. The “Diamond” split comprises the questions where domain experts agree on the answer but non-experts cannot reliably choose it even with internet access.

GLM-5.2 scores 91.2% on GPQA-Diamond. ¹ This is a strong result. The benchmark’s contamination risk is moderate: the question set was constructed to be resistant to web lookup at creation time, but frontier model training corpora now include discussions of GPQA-class problems.

What the score reliably measures: GLM-5.2 has strong calibration on graduate-level science questions. For teams using LLMs as scientific literature assistants or for chemistry/biology reasoning tasks, GPQA-Diamond is one of the better single-number proxies available.

What Is HLE, and Why Does 40.5% Stand Out?

HLE (Humanity’s Last Exam) is a benchmark designed by Scale AI and CAIS to be genuinely hard for current frontier models. It draws from PhD-level and research-level questions across many fields, chosen specifically because high-scoring models on existing benchmarks still fail them. The benchmark deliberately resists saturation.

GLM-5.2 scores 40.5% on HLE. ¹ Unlike the 99.2% AIME figure, a 40.5% HLE score reflects actual frontier-level difficulty: the best closed models currently score in roughly the same range. HLE is notable precisely because it is not saturated, which means it provides signal where AIME and GPQA-Diamond are approaching ceiling effects.

For readers comparing GLM-5.2 to closed-source alternatives, HLE is the benchmark in this suite least susceptible to harness gaming or simple contamination. A model that scores 40.5% on HLE has demonstrated broad reasoning ability on questions that are hard to answer by pattern-matching to training data.

How the Benchmark Suite Maps to Use Cases

Benchmark	What it measures	Contamination risk	Best use-case signal
SWE-bench Pro 62.1% ¹	Real GitHub issue resolution	Moderate (fixed test repos)	Coding agents, automated bug fixing
Terminal-Bench 2.1 81.0 ¹	Multi-step terminal agent tasks	Low (stateful, dynamic)	CLI agents, DevOps automation
AIME 2026 99.2% ¹	Competition math (structured)	High (widely circulated)	Well-specified algebra/combinatorics
HMMT Nov 2025 94.4% ¹	Hard competition math	Moderate	Research math reasoning
GPQA-Diamond 91.2% ¹	Graduate-level science Q&A	Moderate	Science literature, lab assistance
HLE 40.5% ¹	PhD-plus reasoning (unsaturated)	Low (designed resistant)	General frontier-tier comparison

The practical takeaway: GLM-5.2’s coding and agent scores (SWE-bench Pro, Terminal-Bench) are the most directly actionable for engineering teams. The math and science scores confirm capability but carry contamination exposure proportional to how widely those problem sets have circulated online.

Architecture Choices That Affect Benchmark Results

Two architectural features of GLM-5.2 are directly relevant to how it achieves these scores.

IndexShare sparse attention reduces per-token computation by reusing the same index across every four sparse attention layers, cutting per-token FLOPs by 2.9x at 1M context length. ² This is Zhipu’s name for the technique; the arXiv paper uses the abbreviation “DSA,” though Zhipu does not describe it as DeepSeek-derived. The practical consequence is that GLM-5.2 can run at 1M-token context without the per-token cost increase that makes most MoE models prohibitive at long range. This directly supports the Terminal-Bench and agentic SWE-bench scores, both of which require maintaining state across long interaction histories.

The MTP (multi-token prediction) speculative decoding layer ² improves generation throughput without changing the model’s output distribution. This matters for benchmark runtime and for production cost, but does not alter the benchmark numbers themselves.

The README designation “744B-A40B” strongly implies approximately 40B active parameters per forward pass, though Zhipu has not stated this number explicitly. ¹ The confirmed total parameter count from HuggingFace model cards is 753B. ²

Deployment and Access

Weights are publicly available as BF16 at huggingface.co/zai-org/GLM-5.2 and FP8 at huggingface.co/zai-org/GLM-5.2-FP8. ³ FP8 had approximately 93.9K downloads vs the BF16 variant’s 11.9K as of June 19, reflecting that most self-hosters prioritize the quantized form. The MIT license permits commercial use and modification with no per-token fees. ²

For teams that prefer managed access, Z.ai’s GLM Coding Plan offers three tiers: Lite at $18/month (approximately 400 prompts per week), Pro at 5x Lite usage, and Max at 20x Lite usage (roughly $112/month on a yearly plan). ⁵ The API endpoint is compatible with the Anthropic Messages API format, so migrating Claude Code or similar agent tooling requires only a base URL change. ⁴

At launch, eight coding agents were listed as validated integrations: Claude Code, Cline, OpenCode, Roo Code, Goose, Crush, OpenClaw, and Kilo Code. ⁵

Frequently Asked Questions

Q: Is GLM-5.2 better than Claude Opus 4.8?

A: On Terminal-Bench 2.1, the numbers Zhipu published show Opus 4.8 at 85.0 and GLM-5.2 at 81.0, a 4-point lead for Opus 4.8. ¹ A head-to-head SWE-bench Pro comparison is not in the published data. GLM-5.2 is the strongest open-weight option on the benchmarks reported; Opus 4.8 leads on the one direct comparison available.

Q: What does the 99.2% AIME score actually tell me?

A: It tells you GLM-5.2 is not weak on structured competition mathematics. It does not tell you the model generalizes to novel mathematical reasoning: AIME problems are widely available in training corpora. The 40.5% HLE score is a better proxy for general reasoning under genuinely hard conditions. ¹

Q: Should I trust self-reported benchmark numbers?

A: Self-reported numbers should be treated as upper bounds until independently reproduced. The most credible figures in the GLM-5.2 release are the Terminal-Bench scores, which are relatively hard to inflate via scaffolding tricks, and the HLE score, which is designed to resist saturation. SWE-bench scores are more harness-sensitive; the 3.7-point gain over GLM-5.1 is plausible but should be confirmed on independent evals before making procurement decisions.

Q: Can I self-host GLM-5.2 on commodity hardware?

A: The FP8 weights require substantial GPU memory: 753B parameters at FP8 precision needs roughly 800GB of VRAM across multiple GPUs. The MIT license permits it, but hardware cost is the binding constraint for most teams. KTransformers may reduce requirements via further quantization, but Zhipu has not published minimum hardware specs. ²