GLM-5.2 on vLLM and Ascend: Open Weights Beyond NVIDIA

GLM-5.2 shipped as MIT-licensed weights on June 16, 2026, and the unusual part is the serving story, not the model. Within the same week it runs on NVIDIA H200 nodes through vLLM and on Huawei Ascend NPUs through the vLLM-Ascend plugin, with W8A8 quantization and speculative-decoding recipes already written. Most frontier open weights assume you own NVIDIA silicon. This one ships with deployment docs for chips the US stack usually ignores.

What is GLM-5.2?

GLM-5.2 is Zhipu’s Mixture-of-Experts coding model: roughly 744B total parameters with 40B active per token, built on a DeepSeek Sparse Attention backbone and released under an MIT license with no regional limits via Hugging Face and ModelScope. It inherits GLM-5’s scaling trajectory (355B to 744B parameters, 23T to 28.5T pretraining tokens), and some reporting cites a ~753B total figure. Treat 744B as the document value with a small measurement band around it.

The headline capability change is context length. GLM-5.2 raises the window to a claimed 1M tokens, a first for the GLM family, with expanded 1M-context training aimed at coding-agent work: large-scale implementation, automated research, performance optimization, and complex debugging. That 1M figure is Zhipu’s own framing. No independent long-horizon stress test appears in the published material, so read it as a stated ceiling rather than a verified working point.

How do you serve GLM-5.2 on NVIDIA GPUs?

The supported NVIDIA path is vLLM, which runs the FP8 weights directly, with SGLang as the main documented alternative. The release is concurrent enough that the framework support, the weight files, and the serving configs all landed in the same window rather than trailing by a quarter.

A representative serve command, from the GLM-5 repository, documents the documented shape:

vllm serve zai-org/GLM-5.1-FP8 \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 3 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45

How do you serve GLM-5.2 on Huawei Ascend?

On Ascend NPUs, GLM-5.2 is served through the vLLM-Ascend plugin, xLLM, or SGLang, with hardware sizing that depends on whether you run BF16 or the W8A8 quantized variant. BF16 needs two Atlas 800 A3 nodes (128G×8) or four Atlas 800 A2 nodes (64G×8); the W8A8 build fits on a single Atlas 800 A3 (128G×8) or two Atlas 800 A2 (64G×8).

The Ascend recipe is not “it loads.” The deployment guide specifies accelerated Multi-Token Prediction (MTP) for speculative decoding, hybrid W8A8 quantization combining QuaRot, Flex SmoothQuant, and SSZ, and prefill-decode disaggregation with prefix caching. That is a production serving posture, not a proof-of-concept.

This is the part US coverage has mostly skipped. The CNBC framing leans on a cheap-but-good pricing shock and an open-versus-revocable-US-frontier geopolitics angle. Neither connects the detail that matters to an infrastructure reader: the same weights ship with tuned NPU serving recipes alongside the NVIDIA ones.

How much VRAM does GLM-5.2 actually need?

Budget roughly 744GB for FP8 or INT8 weights and roughly 1,488GB for BF16, before KV cache and runtime overhead. The published sizing guidance stops at per-precision weights memory; translating that into a node count depends on KV-cache sizing, which in turn depends on the layer and KV-head counts carried on the model card.

For self-hosters outside a hyperscaler, the practical takeaway is that FP8 is the only realistic local path. BF16 fidelity costs roughly double the memory of FP8 to hold the same model, with no evidence in the published material that BF16 buys a measurable quality gain over the W8A8 or FP8 build.

How does GLM-5.2 stay fast at 1M tokens?

Sparse attention alone does not get you to a usable 1M context. GLM-5.2 adds IndexShare, which reuses one indexer across every four sparse-attention layers and cuts per-token FLOPs by 2.9× at 1M context. The same post reports that an improved MTP layer raises speculative-decoding acceptance length by up to 20%.

Those are architecture-level claims from Zhipu, not third-party measurements, but the mechanism is specific enough to take seriously. Indexer reuse is a concrete way to hold down the attention cost that normally dominates long-context inference, and a higher MTP acceptance length directly improves tokens-per-second on the decode path.

Which benchmarks should you trust?

On the standard, independently-comparable coding benchmarks, GLM-5.2 posts 81.0 on Terminal-Bench 2.1 against 63.5 for GLM-5.1, and 62.1 on SWE-bench Pro against 58.4. That puts it within a few points of Claude Opus 4.8’s 85.0 on Terminal-Bench 2.1 and ahead of Gemini 3.1 Pro on the same suite, per Zhipu’s own reporting.

Model	Terminal-Bench 2.1	SWE-bench Pro
GLM-5.2	81.0	62.1
GLM-5.1	63.5	58.4
Claude Opus 4.8	85.0	not reported in sources

Treat the vendor-graded agentic numbers as directional, not settled. CNBC reports GLM-5.2 sits within a percentage point of Anthropic’s Opus 4.8 on a closely watched agentic benchmark at roughly a fifth of the cost, with OpenRouter token traffic climbing faster than after DeepSeek’s V4 launch in April. The pricing and traffic claims are sourced. The underlying agentic benchmark is young and vendor-graded, so “within a point of Opus 4.8” is a marketing-grade comparison until an independent evaluation reproduces it.

What changes for the open-weight stack?

The benchmark score is the headline. The structural shift is the deployment story: MIT-licensed frontier weights now ship with same-week inference recipes for non-NVIDIA silicon, alongside the NVIDIA one. GLM-5.2’s Ascend deployment guide landed in the same window as the weights, a posture no US-centric open-weight release matches.

The honest caveat is that for GLM-5.2 specifically, only the Ascend non-NVIDIA tooling is published so far. The coverage is partial for the current model, and it is worth watching whether other silicon vendors add support.

For self-hosters, the practical effect is that the usual open-weights-but-NVIDIA-only-serving trade-off no longer holds. If you can get Ascend hardware, you can run a frontier-class coding model on it the week the weights drop, with quantization, speculative decoding, and disaggregation already specified. The open-weight stack now has to compete on context length and hardware independence, not just raw capability. Whether it does is a separate question from whether it should.

Frequently Asked Questions

Can I run GLM-5.2 on AMD GPUs or other non-NVIDIA, non-Ascend silicon?

GLM-5.2 has published serving recipes only for NVIDIA (vLLM, SGLang) and Huawei Ascend (vLLM-Ascend, xLLM, SGLang). The predecessor GLM-5 was ported to Moore Threads, Cambricon, Kunlun Chip, MetaX, Enflame, and Hygon, so community ports for 5.2 to those vendors are plausible but undocumented today.

What if my GPUs can’t hold the full FP8 weights? Any offload path?

vLLM and SGLang are not the only options. KTransformers (v0.5.12+) and Unsloth (v0.1.47-beta+) also list GLM-5.2 support, and KTransformers is built around CPU-GPU offload for MoE weights, trading tokens-per-second for a lower hardware floor than an 8x H200 node.

How did Zhipu keep the coding-agent RL from gaming its own benchmarks?

GLM-5.2’s RL stage runs a two-stage anti-hack module: a rule-based filter for recall and an LLM judge for precision. It flags shortcut patterns like curling raw GitHub URLs or reading secret_cases.json, which suppresses reward hacking before it inflates coding-agent scores.

What drives the 20 percent jump in GLM-5.2’s MTP acceptance length?

Zhipu credits three coupled changes to the Multi-Token Prediction head: the IndexShare indexer reuse, a companion KVShare scheme, and end-to-end TV loss. Together they move the speculative-decoding acceptance length from 4.56 to 5.47 tokens, which directly improves decode throughput.

How does GLM-5.2’s context window compare to the previous GLM generation?

GLM-5.1 topped out at 200K context. GLM-5.2 jumps to a stated 1M, a 5x increase, and pairs it with IndexShare’s 2.9x per-token FLOP cut at full context. That is why the indexer-reuse trick matters more on this generation than on the older 200K window.