Running GLM-5.2 at Home: SGLang, vLLM, Transformers, and KTransformers Setup Guide

Q: Does GLM-5.2 require special framework versions to run?

The GitHub README lists SGLang, vLLM, Transformers, and KTransformers as supported deployment frameworks but does not specify minimum version requirements in its public documentation as of June 19, 2026. Check the zai-org/GLM-5 repository's installation section for the current version pins before setting up your environment.

Q: What is the model identifier to use in API calls?

The model identifier is glm-5.2, with glm-5.2[1m] used in some documentation to explicitly reference the 1M-context variant. Confirm the exact identifier string against Zhipu's current API documentation before wiring it into a production agent loop, as model identifier strings can change between documentation revisions.

GLM-5.2 weights landed on HuggingFace on June 13, 2026,¹ and as of June 19, 2026, the FP8 variant has roughly 93,900 downloads against around 11,900 for the BF16 original.³ That ratio tells you something: the community already voted for the quantized path. This article covers what you need to run either variant on your own hardware, how SGLang, vLLM, Transformers, and KTransformers differ for this workload, and how to think about hardware cost against the subscription alternative.

What is GLM-5.2 and what makes self-hosting it unusual?

GLM-5.2 is a 753B-parameter² mixture-of-experts (MoE) model released by Zhipu, a 2019 Tsinghua University KEG lab spin-off now publicly listed in Hong Kong as 02513.HK.⁶ The model ships under an MIT license² (not Apache 2.0, not a custom community license), which is the cleanest permissive term you can get on a frontier-scale weight set.

Two architectural details shape the inference setup:²

IndexShare sparse attention reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9x at 1M context length compared to dense attention at the same context size.
MTP (speculative decoding) layer adds a draft head that predicts multiple tokens ahead, reducing wall-clock latency per output token on capable hardware.

The README designation is 744B-A40B,¹ which strongly implies approximately 40B active parameters per forward pass, but Zhipu has not stated that figure in plain text. The total parameter count confirmed on the HuggingFace model card is 753B.² Do not use the community-circulated 744B figure; it is superseded.

Context window is 1,000,000 input tokens with a 128K (131,072) output ceiling.⁴ That is a 5x jump over GLM-5.1’s 200K window.¹ At that scale, MoE memory handling matters more than it does for dense models of the same active-parameter count, because all 753B parameters must reside in memory even though only a fraction fire per token.

What are the verified benchmark scores?

Zhipu posted the following numbers in the GitHub README:¹

Benchmark	GLM-5.2	GLM-5.1 (prior gen)
SWE-bench Pro	62.1%	58.4%
Terminal-Bench 2.1	81.0	62.0
AIME 2026	99.2	n/a
HMMT Nov 2025	94.4	n/a
GPQA-Diamond	91.2	n/a
HLE	40.5	n/a

For context: Claude Opus 4.8 scores 85.0 on Terminal-Bench 2.1,¹ leading GLM-5.2 by 4 points on that benchmark. These are vendor-reported figures from the GitHub README; independent replications have not yet run as of June 19, 2026.

BF16 or FP8: which weight variant to download?

Both variants are publicly downloadable from the zai-org HuggingFace organization (not THUDM, which now has zero public models).²³

BF16 (zai-org/GLM-5.2) is the reference weight format. It preserves full precision and is the right choice if you have the memory and want the highest-fidelity baseline.²

FP8 (zai-org/GLM-5.2-FP8) cuts memory footprint by roughly half compared to BF16 on a per-parameter basis, which is why it leads on downloads by approximately 8:1.³ The tradeoff is that FP8 quantization can affect accuracy on tasks at the edge of the model’s capability, and the magnitude of that effect on GLM-5.2 specifically has not been published by Zhipu.

For most teams doing an initial self-hosting evaluation, FP8 is the practical starting point: lower GPU memory, faster to iterate on, and the download community has clearly converged on it. Regression testing on your specific task distribution against BF16 is the right follow-up, not a prerequisite.

How do the four deployment frameworks compare?

GLM-5.2’s GitHub repo¹ lists four supported inference frameworks. Here is how they differ for a 753B MoE:

SGLang is a structured-generation serving layer designed around batched, concurrent requests. It has native support for constrained decoding (JSON schemas, regex) and implements RadixAttention for KV-cache sharing across requests. For workloads with many concurrent clients or structured output requirements (agent loops calling the model as a service), SGLang is the natural starting point. It is also the framework most likely to expose the MTP speculative decoding gain at serving scale.

vLLM is the more established open-source serving engine, with wider hardware support and a larger operator community. Its paged-attention memory management handles variable-length sequences efficiently and is well-tested on large MoE models from other families. If your team already runs vLLM for other models, adding GLM-5.2 behind the same stack is the lowest operational delta. Speculative decoding support is present but the configuration path differs from SGLang.

Transformers (HuggingFace’s library) is the reference path for single-machine, lower-concurrency evaluation. It is slower than either serving framework under load but requires the least custom setup and is the canonical way to verify that weight loading works before investing in a serving configuration. Use it to validate the download and run a few thousand-token samples; do not use it as your production serving layer for a 753B model.

KTransformers is a kernel-optimized inference library targeting high-throughput on consumer and prosumer hardware. It is the least established of the four by community size, but it is the framework most likely to make sense if you are running on a non-datacenter GPU cluster where raw throughput per dollar matters more than operational familiarity.

For a first deployment, the decision tree is: if you need serving concurrency and structured output, start with SGLang; if you already run vLLM and want operational consistency, add GLM-5.2 there; if you are evaluating on a single node with no concurrency requirement, use Transformers to validate before migrating to a serving layer.

What hardware do you actually need?

GLM-5.2 is a 753B-parameter MoE model. Every parameter must fit in addressable GPU memory, not just the active subset, because the full expert set is needed to route correctly. FP8 packs each weight into one byte; BF16 uses two bytes.

Rough memory floor estimates (weights only, before activations and KV cache):

Variant	Parameter memory	Minimum GPU config
FP8	~753 GB	10x H100 80GB or equivalent
BF16	~1,506 GB	20x H100 80GB or equivalent

These are lower bounds. A 1M-token context window means KV cache at full context can add hundreds of additional gigabytes depending on batch size and implementation. Plan for headroom beyond the weight floor.

IndexShare’s 2.9x FLOP reduction at 1M context² does not change the memory requirement for weights; it reduces compute cost per token at long context, which matters for throughput rather than minimum hardware.

How does self-hosting cost compare to the subscription?

Z.ai prices hosted GLM-5.2 access through flat subscription tiers:⁵

Tier	Monthly	Yearly effective	Usage
Lite	$18/month	~$12.6/month	~400 prompts/week
Pro	n/a	~$50.4/month	5x Lite
Max	n/a	~$112/month	20x Lite

Under the MIT license,² self-hosting incurs no per-token fee and no subscription. The cost is hardware acquisition or rental plus electricity and operational overhead.

The crossover point depends on utilization. At low usage (a few hundred prompts per week), the Lite tier at $12.6/month yearly is almost certainly cheaper than renting the GPU cluster you need for a 753B model. At high, sustained utilization, where you are running the model continuously for production traffic, the hardware cost amortizes against the subscription fast.

A rough framing: renting 10x H100s on a cloud provider runs roughly $30–$50 per hour at current spot rates (this is an illustrative range; actual rates vary by provider and region). At $40/hour, a model running 24/7 costs roughly $29,000/month in compute alone, before storage, networking, and engineering time. That math only works if the model is serving enough traffic that a flat per-query subscription would cost more.

For most teams evaluating GLM-5.2 for the first time, the subscription path is cheaper until you have confirmed the model performs on your workload and have a throughput level that makes the economics of self-hosting close. The MIT license means that calculation stays open: you can switch to self-hosting later without renegotiating a license.

How does the Anthropic-compatible endpoint affect migration?

GLM-5.2 exposes an Anthropic Messages API-compatible endpoint.⁴ For teams running Claude Code, Cline, OpenCode, Roo Code, Goose, Crush, OpenClaw, or Kilo Code,⁵ the migration path is a base-URL and model-name swap rather than a full SDK integration.

This matters for self-hosted deployments too: frameworks like vLLM and SGLang can expose an OpenAI-compatible REST interface, but if your agent tooling is wired to the Anthropic Messages API shape, you will need a thin translation layer or a framework that natively supports the Anthropic API format. Verify this at the framework level before committing to a serving stack, because it affects both the agent integration path and whether you can reuse existing prompt code that relies on Anthropic-specific message fields.

Thinking effort presets and when to use them

GLM-5.2’s API exposes High and Max thinking-effort presets for long multi-step coding tasks.⁴ These presets direct the model to extend its chain-of-thought before producing an output, at the cost of higher latency and token consumption.

For self-hosted deployments, thinking presets interact directly with your hardware budget. A Max-effort request on a 1M-context coding task can produce a substantially longer output than a default request, which fills KV cache faster and increases per-request memory pressure. If you are running near the memory ceiling, High is the safer default until you have profiled your actual request distribution.

Frequently Asked Questions

Does GLM-5.2 require special framework versions to run?

The GitHub README¹ lists SGLang, vLLM, Transformers, and KTransformers as supported deployment frameworks but does not specify minimum version requirements in its public documentation as of June 19, 2026. Check the zai-org/GLM-5 repository’s installation section for the current version pins before setting up your environment.

Can I run the FP8 weights on consumer GPUs?

The FP8 variant still requires roughly 753 GB of GPU memory for weights alone, which puts it beyond any consumer GPU. The FP8 advantage is that it halves the memory requirement relative to BF16, making the model approachable on a larger but still datacenter-class multi-GPU rig rather than an extreme enterprise configuration.

Is the MIT license on the weights the same as the code license?

No. The GitHub repository at zai-org/GLM-5 is licensed under Apache 2.0 for the code.¹ The model weight files hosted on HuggingFace carry a separate MIT license.² The permissive self-hosting right applies to the weights, not to the inference code in the repo.

What is the model identifier to use in API calls?

The model identifier is glm-5.2, with glm-5.2[1m] used in some documentation to explicitly reference the 1M-context variant.⁴ Confirm the exact identifier string against Zhipu’s current API documentation before wiring it into a production agent loop, as model identifier strings can change between documentation revisions.

Can I use GLM-5.2 in Claude Code directly?

Yes. Z.ai lists Claude Code as one of eight supported coding agent integrations at launch.⁵ The Anthropic Messages API-compatible endpoint means a base-URL change in your Claude Code configuration is sufficient; no SDK swap is required.⁴