GLM-5.2's 753B MoE Costs More to Self-Host Than the MIT License Suggests

Q: Is GLM-5.2 the same model as the community-rumored 744B?

No. The 744B-A40B designation appears in Zhipu's GitHub README as the model size label, but the actual parameter count on Zhipu's HuggingFace model cards is 753B. The 744B figure circulated in community write-ups before the model cards were available. Use 753B as the authoritative count.

Zhipu released GLM-5.2 on June 13, 2026, under a MIT license with weights live on HuggingFace.² The FP8 quantized variant has already logged roughly 93,900 downloads.³ Both numbers are real. So is the benchmark story: 62.1% on SWE-bench Pro¹ and 81.0 on Terminal-Bench 2.1¹ put this model in serious contention for agentic coding tasks. What the launch narrative underweights is the gap between “MIT license” and “runs on hardware you already own.”

What GLM-5.2 actually is

GLM-5.2 is a 753B-parameter mixture-of-experts model,² confirmed by Zhipu’s own HuggingFace model cards. The GitHub README uses the designation 744B-A40B,¹ which strongly implies approximately 40B active parameters per forward pass, though Zhipu has not stated “40B active” in plain text. The architecture includes IndexShare sparse attention, which reuses the same indexer across every four sparse attention layers, cutting per-token FLOPs by 2.9x at 1M context length.² An MTP speculative decoding layer sits on top.

Context window is 1,000,000 input tokens, a 5x increase over GLM-5.1’s 200K ceiling,¹ with a 128K output token cap.⁴ The model ships under model identifier glm-5.2 (or glm-5.2[1m] in some API contexts)⁴ and is deployable via SGLang, vLLM, Transformers, and KTransformers.¹

This is a point release, not a full architecture overhaul. GLM-5 launched February 11, GLM-5-Turbo on March 15, GLM-5.1 on April 7, and GLM-5.2 on June 13: four releases in roughly four months since Zhipu’s January 2026 Hong Kong IPO as 02513.HK.⁶

What the benchmarks actually show

Zhipu posted the following scores in the GitHub README:¹

Benchmark	GLM-5.2	GLM-5.1 (prior gen)
SWE-bench Pro	62.1%	58.4%
Terminal-Bench 2.1	81.0	62.0
AIME 2026	99.2	n/a
HMMT Nov 2025	94.4	n/a
GPQA-Diamond	91.2	n/a
HLE	40.5	n/a

The Terminal-Bench jump from 62.0 to 81.0 is the most striking delta between generations.¹ On that same benchmark, Claude Opus 4.8 scores 85.0,¹ meaning GLM-5.2 trails by 4 points. That gap is worth naming clearly: the strongest open-weights model on this task is not yet at parity with the frontier closed model, on this particular benchmark.

The AIME 2026 result of 99.2¹ and GPQA-Diamond at 91.2¹ indicate strong mathematical and scientific reasoning, but these benchmarks measure different skills than agentic code execution. A team evaluating GLM-5.2 for a coding agent loop should weight the SWE-bench and Terminal-Bench numbers more heavily than the math evals.

Why the MIT license does not make self-hosting cheap

MIT is a permissive license. It imposes no per-token fee, no usage cap, no commercial restriction. That is the accurate description of what MIT does.² It says nothing about what it costs to run 753B parameters.

Mixture-of-experts inference has a specific property that matters here: the full parameter set must reside in memory, even though only a fraction fires per token. With 753B total parameters,² the BF16 weights alone occupy approximately 1.5 TB of GPU memory before any KV-cache or activation overhead. The FP8 quantized variant cuts that by half, landing around 750 GB.³ Neither figure fits a single server. A minimum multi-GPU configuration handling FP8 would require eight to ten A100 80GB cards, or equivalent high-memory hardware. That is before accounting for KV-cache at 1M context: a million-token context with standard attention heads adds hundreds of gigabytes per request.

The IndexShare sparse attention reduces per-token FLOPs at long context by 2.9x,² which helps throughput once you have the hardware loaded. It does not reduce the memory needed to hold the model weights. FLOPs savings apply at inference time, after the full 753B are already resident.

This is the structural gap between what a license permits and what the hardware requires. MIT gives you the right to run GLM-5.2 on your own infrastructure at no per-token cost. It does not provide the infrastructure.

What the hosted pricing actually costs

Zhipu prices GLM-5.2 access through flat-tier subscriptions on the GLM Coding Plan.⁵ The tiers are:

Lite: $18/month (monthly) or ~$12.60/month (yearly), approximately 400 prompts per week
Pro: 5x Lite usage
Max: 20x Lite usage, ~$112/month yearly⁵

These are flat subscriptions, not per-token meters. For teams with predictable usage patterns, flat pricing is easy to budget. For teams running high-context agentic loops where a single 1M-token pass could consume a large share of a weekly allowance, the cost model is less transparent. The subscription rate does not change based on context length, but usage allowances draw down against prompt count regardless of token volume.

The economics of self-host versus subscription depend on one question: does your team already operate multi-GPU inference infrastructure? If yes, the MIT weights shift marginal cost to electricity and amortized hardware. If no, you are buying or leasing hardware against a subscription cost that is designed to be affordable for most development teams.

What actually works now

As of June 19, 2026, the following are confirmed live:²³

BF16 weights at huggingface.co/zai-org/GLM-5.2
FP8 weights at huggingface.co/zai-org/GLM-5.2-FP8
Z.ai chatbot at chat.z.ai powered by GLM-5.2⁵
API compatible with the Anthropic Messages API (base URL swap sufficient for Claude Code, Cline, and similar agents)⁴
Eight coding agent integrations at launch: Claude Code, Cline, OpenCode, Roo Code, Goose, Crush, OpenClaw, Kilo Code⁵

Deployment is via SGLang, vLLM, Transformers, or KTransformers.¹ The HuggingFace organization is zai-org, not THUDM, which was Zhipu’s prior release organization and now has zero public models.²

Frequently Asked Questions

Is GLM-5.2 the same model as the community-rumored 744B?

No. The 744B-A40B designation appears in Zhipu’s GitHub README¹ as the model size label, but the actual parameter count on Zhipu’s HuggingFace model cards is 753B.² The 744B figure circulated in community write-ups before the model cards were available. Use 753B as the authoritative count.

How does GLM-5.2 compare to GLM-5.1 on coding tasks?

SWE-bench Pro improved from 58.4% to 62.1%, a 3.7 percentage point gain.¹ Terminal-Bench 2.1 improved from 62.0 to 81.0, a 19-point gain.¹ The Terminal-Bench delta is the most significant measurable improvement between the two generations on agentic coding benchmarks.

Can a single high-end workstation self-host GLM-5.2?

In practice, no. The BF16 weights for a 753B-parameter model require approximately 1.5 TB of GPU memory, which rules out single-card or dual-card workstations.² The FP8 variant halves the memory requirement but still requires a multi-GPU server configuration.³ KTransformers supports CPU-GPU hybrid offloading, which can reduce GPU memory requirements, but at the cost of inference speed.

Does GLM-5.2 beat Opus 4.8 on coding benchmarks?

Not on Terminal-Bench 2.1. GLM-5.2 scores 81.0 versus Opus 4.8’s 85.0, a 4-point gap in Opus 4.8’s favor.¹ On SWE-bench Pro, GLM-5.2 posts 62.1%.¹ Direct Opus 4.8 comparisons on SWE-bench Pro are not included in Zhipu’s published benchmark table.

What is IndexShare and why does it matter for long-context inference?

IndexShare is Zhipu’s name for a sparse attention mechanism that reuses the same indexer across every four sparse attention layers.² This reduces per-token FLOPs by 2.9x at 1M context length compared to standard dense attention. The practical benefit is that inference cost does not scale as steeply with context length, making the 1M-token window more viable in throughput terms than a naive dense-attention model of the same size would be.