Both MiniMax’s M3 and Zhipu’s GLM-5.2 advertise a 1M-token context window. Only one ships a benchmark table or an architecture explanation to back the number, and neither publishes independent needle-in-haystack data, so the 1M figure in each launch post is a maximum-token spec rather than a measured effective-retrieval length. GLM-5.2’s claim is the better-evidenced of the two, but only because Z.ai published something to evaluate; the effective-retrieval question is still open for both.
What does each vendor’s 1M claim actually rest on?
Zhipu, which now operates as Z.ai, brands GLM-5.2’s extension to a 1M-token window “Solid 1M Context.” The same launch post undercuts its own headline one paragraph later: “A 1M context is easy to claim, but much harder to keep reliable under real engineering pressure.” That sentence is doing the work the tagline refuses to do. The rest of the post is unusually dense for a model launch: a named sparse-attention trick (IndexShare) with a claimed 2.9× reduction in per-token FLOPs, and a speculative-decoding improvement quantified at a 20% acceptance-length gain.
MiniMax markets M3 as “a frontier coding & agentic model built on a novel attention architecture (MSA) with 1M context.” That is the entire technical disclosure. The homepage carries no benchmark numbers, no needle-in-haystack data, and no link to a technical report. The company publishes scale figures elsewhere (157M+ users across 200+ countries, 50,000+ enterprise clients across 90+ countries), but nothing that characterizes how M3 behaves inside its advertised window.
The asymmetry is the story. Two vendors, the same headline number, a completely different disclosure load.
What does GLM-5.2 back its 1M claim with?
The GLM-5.2 post makes the architecture legible. IndexShare reuses one indexer across every four sparse attention layers, which Z.ai reports cuts per-token FLOPs by 2.9× at 1M context. The same post claims a 20% improvement in speculative-decoding acceptance length from upgrades to the multi-token-prediction (MTP) layer. Both figures are vendor-reported.
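To make the "one indexer across every four sparse attention layers" description concrete, here is a toy sketch of the sharing pattern only. It is not Z.ai's implementation: the layer count is hypothetical, the choice that the first layer of each group runs the indexer is an assumption, and the names `indexer_schedule` and `count_indexer_runs` are invented for illustration. It counts indexer invocations and makes no attempt to reproduce the 2.9× FLOPs figure.

```python
from typing import List

GROUP_SIZE = 4          # from the post: one indexer shared across every four layers
NUM_LAYERS = 48         # hypothetical layer count, not a published GLM-5.2 figure


def indexer_schedule(num_layers: int, group_size: int = GROUP_SIZE) -> List[bool]:
    """True where a layer runs the indexer, False where it reuses its group's result.

    Assumption: the first layer of each group of `group_size` layers runs the
    indexer and the remaining layers in that group reuse its token selection.
    """
    return [layer % group_size == 0 for layer in range(num_layers)]


def count_indexer_runs(num_layers: int, group_size: int = GROUP_SIZE) -> int:
    """Number of indexer invocations per forward pass under the sharing pattern."""
    return sum(indexer_schedule(num_layers, group_size))


unshared_runs = NUM_LAYERS                      # one indexer run per layer
shared_runs = count_indexer_runs(NUM_LAYERS)    # one indexer run per group of four
print(f"indexer runs without sharing: {unshared_runs}")
print(f"indexer runs with sharing:    {shared_runs}")
```

The indexer is only one part of per-token compute, so a 4× drop in indexer runs would not by itself translate into any particular overall FLOPs ratio.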
The benchmark table is also legible, with the caveat that every number in it is Z.ai’s own. On Terminal-Bench 2.1, GLM-5.2 scores 81.0 against GLM-5.1’s 63.5 and Claude Opus 4.8’s 85.0. On SWE-bench Pro it moves to 62.1 from 58.4. On FrontierSWE, a long-horizon benchmark where an agent completes open-ended technical projects over hours to tens of hours, Z.ai reports GLM-5.2 trails Opus 4.8 by 1%, edges out GPT-5.5 by 1%, and beats Opus 4.7 by 11%. Across the three long-horizon benchmarks Z.ai chose to lead with, GLM-5.2 ranks as the top open-weight model and second overall to the Opus series.
None of these are context-retrieval scores. They are coding and agentic benchmarks run inside the window, and Z.ai ran them. Useful, but they do not answer the question the 1M marketing raises.
What has MiniMax actually disclosed about M3?
Nothing beyond the tagline. No M3 technical report is linked from the homepage, no benchmark table ships with the launch, and no architecture diagram accompanies the MSA mention. MSA, presumably a MiniMax variant of sparse or structured attention, is named but not defined in any M3-specific document the launch surfaces.
Why an advertised 1M rarely means a usable 1M
The window length a vendor advertises is the number of tokens the model will accept. Effective context (how many of those tokens the model can retrieve from and reason over coherently) is a different quantity and a smaller one. An arXiv preprint from May 2026 frames the gap directly: large models “degrade in performance over long conversational horizons due to context window limitations and inefficient token usage,” which is why its authors propose context recycling over replaying the full window every turn (arXiv:2606.26105). Both M3 and GLM-5.2 are nominally attacking that degradation. Only Z.ai has tried to characterize it in writing.
Neither vendor publishes independent needle-in-haystack results, RULER-style depth tests, or multi-needle recall curves. The 1M figure is therefore a ceiling on input length, not a guarantee that an instruction placed near the end of the window survives to the output. Practitioners who have watched frontier models lose needles past a few hundred thousand tokens will recognize the pattern; the marketing has not caught up to the measurement.
The honest read is that effective context for both models past roughly 200K to 300K tokens is unverified. Below that range the windows are likely fine. Above it, the only evidence available is vendor self-report on coding benchmarks that stress long-horizon agent execution rather than distributed retrieval.
What does it cost, and when does long-context beat RAG?
Z.ai’s GLM Coding Plan undercuts Western per-seat pricing hard: Lite at $18/month ($12.60 billed yearly), Pro at $72/month ($50.40 yearly), Max at $160/month ($112 yearly), with access through 20+ coding tools including Claude Code, Cline, and Kilo Code. CNBC reported June 26 that GLM-5.2 sits “within a percentage point of Anthropic’s Opus 4.8 on one closely watched agentic benchmark, at roughly a fifth of the cost,” with OpenRouter token traffic climbing faster than after DeepSeek’s V4 launch in April. Harvey co-founder Gabe Pereyra told CNBC: “GLM 5.2, you’re seeing the first model where it’s really competitive with some of these closed-source frontier models.”
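A quick arithmetic check on the plan prices quoted above, assuming the yearly figures are per-month equivalents under annual billing (the text does not say so explicitly). Under that assumption every tier carries the same discount.

```python
# Monthly price and yearly-billing price per tier, as quoted in the text.
# Assumption: the yearly figure is a per-month equivalent, not an annual total.
PLANS = {
    "Lite": (18.0, 12.6),
    "Pro": (72.0, 50.4),
    "Max": (160.0, 112.0),
}

# Fractional discount for choosing yearly billing on each tier.
discounts = {
    name: 1 - yearly / monthly
    for name, (monthly, yearly) in PLANS.items()
}

for name, discount in discounts.items():
    print(f"{name}: {discount:.0%} off with yearly billing")
```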
Two hedges on the cost story. First, the CNBC benchmark is agentic capability, not context retrieval; the within-a-point comparison maps to FrontierSWE-style long-horizon coding, where GLM-5.2 trails Opus 4.8 by 1% on Z.ai’s own table. Second, cheap open weights do not guarantee cheap hosted throughput. The weights are MIT-licensed and freely servable, but if you depend on Z.ai’s hosted endpoint, capacity is the real price.
The long-context-versus-RAG tradeoff depends on how often you genuinely need the full window. Long-context collapses retrieval into a single forward pass and is simpler to operate, but per-request cost scales linearly with input tokens and you inherit the model’s untested effective-context limit. RAG caps the per-request token count behind a retrieval layer, which is more infrastructure to run and tune, but it bounds your exposure to the depth-degradation problem. The break-even point is the depth at which the model’s retrieval actually fails, which, again, neither vendor has published.
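The linear-scaling point can be put in numbers with a minimal sketch. The per-token prices and the 8K-token RAG prompt size below are hypothetical placeholders, not quotes from either vendor, and the RAG figure ignores the cost of running the retrieval layer itself.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
USD_PER_M_INPUT = 1.00
USD_PER_M_OUTPUT = 3.00


@dataclass(frozen=True)
class RequestProfile:
    input_tokens: int
    output_tokens: int


def request_cost(profile: RequestProfile) -> float:
    """Token cost of one request in USD; linear in input and output tokens."""
    return (
        profile.input_tokens * USD_PER_M_INPUT
        + profile.output_tokens * USD_PER_M_OUTPUT
    ) / 1_000_000


# Full-window call versus a retrieval-capped call (both sizes are illustrative).
long_context = RequestProfile(input_tokens=1_000_000, output_tokens=1_000)
rag = RequestProfile(input_tokens=8_000, output_tokens=1_000)

long_cost = request_cost(long_context)
rag_cost = request_cost(rag)

print(f"long-context: ${long_cost:.4f} per request")
print(f"RAG:          ${rag_cost:.4f} per request")
print(f"ratio:        {long_cost / rag_cost:.1f}x")
```

The ratio is only half the decision. It says what the full window costs per call, not whether the model can still retrieve from the far end of it.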
How should a practitioner test retrieval at depth before committing?
Run your own depth sweep before swapping a RAG pipeline for a 1M-context call. The spec sheet is not the budget.
A workable plan: embed needles at increasing depths (128K, 256K, 512K, 1M tokens) in realistic documents, not synthetic filler, and measure recall at each depth. Layer a multi-needle variant to catch interference between facts. Then run RULER-style probes that the single-needle test misses: variable tracking across the window, multi-hop aggregation, and long-form questions whose answers require combining distant spans. Pin the depth at which recall drops below whatever threshold your application tolerates. That depth, not the vendor’s 1M, is the real working budget.
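The single-needle part of that plan can be sketched as a small harness. Everything here is an illustrative assumption: whitespace-separated words stand in for tokens, `stub_model` is a fake that only sees the last 300 words so the example runs offline, the context sizes are scaled-down stand-ins for 128K to 1M, and the function names are invented. In a real run `model` would wrap your API client, sizes would be measured with the model's own tokenizer, and `corpus_sentences` would come from your actual documents.

```python
import random
from typing import Callable, Dict, List

QUESTION = "What is the vault code for project Heron?"


def build_prompt(corpus_sentences: List[str], needle: str,
                 context_tokens: int, position: float) -> str:
    """Build ~context_tokens words of context with the needle at a relative position."""
    words: List[str] = []
    i = 0
    while len(words) < context_tokens:
        words.extend(corpus_sentences[i % len(corpus_sentences)].split())
        i += 1
    words = words[:context_tokens]
    cut = int(len(words) * position)  # 0.0 = start of context, 1.0 = end
    body = " ".join(words[:cut] + needle.split() + words[cut:])
    return f"{body}\n\nQuestion: {QUESTION}\nAnswer:"


def sweep(model: Callable[[str], str], corpus_sentences: List[str],
          context_sizes: List[int], positions: List[float],
          trials: int, seed: int = 0) -> Dict[int, float]:
    """Return recall per context size, averaged over positions and trials."""
    rng = random.Random(seed)  # fixed seed so the same needles are reused across runs
    recall: Dict[int, float] = {}
    for size in context_sizes:
        hits, total = 0, 0
        for position in positions:
            for _ in range(trials):
                secret = str(rng.randint(100000, 999999))
                needle = f"The vault code for project Heron is {secret}."
                prompt = build_prompt(corpus_sentences, needle, size, position)
                hits += int(secret in model(prompt))
                total += 1
        recall[size] = hits / total
    return recall


def working_budget(recall: Dict[int, float], threshold: float) -> int:
    """Largest tested size where it and every smaller size meet the threshold."""
    budget = 0
    for size in sorted(recall):
        if recall[size] < threshold:
            break
        budget = size
    return budget


def stub_model(prompt: str) -> str:
    """Fake model that only 'sees' the last 300 words; replace with a real API call."""
    for word in prompt.split()[-300:]:
        digits = word.strip(".,?")
        if len(digits) == 6 and digits.isdigit():
            return digits
    return "unknown"


CORPUS = [
    "The deployment checklist requires a signed review before any release.",
    "Quarterly audits compare the access logs against the staffing roster.",
    "Each service owner documents rollback steps in the shared runbook.",
]

recall = sweep(stub_model, CORPUS, context_sizes=[128, 256, 512, 1024],
               positions=[0.1, 0.5, 0.9], trials=2)
for size, score in recall.items():
    print(f"{size:>5} tokens: recall {score:.2f}")
print("working budget:", working_budget(recall, threshold=0.9))
```

Multi-needle and RULER-style probes would slot in as additional case generators feeding the same `sweep` loop. The harness deliberately reports the largest size with no failure below it rather than the best single score.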
Apply the same test to both models with identical inputs and identical retrieval targets, so the comparison is controlled. Document the prompt format, because sparse-attention architectures are sensitive to needle placement relative to attention boundaries, and a result at one depth is not a result at all depths.
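One way to keep the comparison controlled is to freeze the case set once, hash it together with the prompt template, and run every model against that same frozen list. The sketch below assumes hypothetical model callables (`echo_model` and `blind_model` are stand-ins, not real clients) and an illustrative template; the hash exists so a report can show that two runs used byte-identical inputs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List

# Illustrative prompt format; record whichever template you actually use.
PROMPT_TEMPLATE = "{context}\n\nQuestion: {question}\nAnswer:"


@dataclass(frozen=True)
class Case:
    context_tokens: int
    needle_position: float
    prompt: str
    expected: str


def make_case(context: str, question: str, expected: str,
              context_tokens: int, needle_position: float) -> Case:
    """Render one frozen test case from the shared template."""
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return Case(context_tokens, needle_position, prompt, expected)


def fingerprint(cases: List[Case]) -> str:
    """SHA-256 over the template and every case, to prove inputs were identical."""
    payload = json.dumps(
        {"template": PROMPT_TEMPLATE, "cases": [asdict(c) for c in cases]},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare(models: Dict[str, Callable[[str], str]], cases: List[Case]) -> dict:
    """Run every model on the same cases and report recall alongside the inputs' hash."""
    report = {
        "prompt_template": PROMPT_TEMPLATE,
        "case_set_sha256": fingerprint(cases),
        "recall": {},
    }
    for name, model in models.items():
        hits = sum(case.expected in model(case.prompt) for case in cases)
        report["recall"][name] = hits / len(cases)
    return report


# Stand-in models so the sketch runs offline; swap in real API wrappers.
def echo_model(prompt: str) -> str:
    return prompt  # always "retrieves" because it repeats the whole prompt


def blind_model(prompt: str) -> str:
    return ""  # never retrieves anything


cases = [
    make_case("Notes. The badge number is 4417. More notes.",
              "What is the badge number?", "4417", 8, 0.5),
    make_case("The pager alias is heron-oncall. Trailing notes.",
              "What is the pager alias?", "heron-oncall", 7, 0.0),
]
report = compare({"model_a": echo_model, "model_b": blind_model}, cases)
print(json.dumps(report, indent=2))
```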
Until one of them publishes a depth test, or a third party does, the headline is a marketing number. GLM-5.2 has earned more benefit of the doubt than M3, because it shipped a benchmark table, an architecture explanation, and an honest sentence conceding that 1M is easy to claim but hard to keep reliable. M3 has shipped a tagline. The 1M in both cases is something you should measure, not something you should believe.
Frequently Asked Questions
If I build on Z.ai’s hosted endpoint instead of self-serving the weights, what’s the capacity risk?
Z.ai restricted signups after its shares fell 23% in late February 2026 on compute shortages, so hosted throughput is not guaranteed even though the weights are MIT-licensed and freely servable. Self-hosting on your own GPUs sidesteps that queue but adds operations overhead.
What specific changes drove GLM-5.2’s 20% speculative-decoding gain?
Z.ai attributes the gain to four combined changes in the multi-token-prediction layer: IndexShare, KVShare, rejection sampling, and end-to-end TV loss, lifting acceptance length from 4.56 to 5.47 tokens. None of these address retrieval depth, so the decoding-speed win is orthogonal to whether the 1M window stays coherent at the back.
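The two acceptance-length figures are consistent with the headline percentage, as a one-line check shows. Only the numbers quoted above are used; the variable names are illustrative.

```python
# Vendor-reported acceptance lengths, in tokens per speculative step.
acceptance_before = 4.56
acceptance_after = 5.47

# Relative improvement in acceptance length.
relative_gain = (acceptance_after - acceptance_before) / acceptance_before
print(f"relative gain: {relative_gain:.1%}")  # prints "relative gain: 20.0%"
```

Acceptance length is not the same thing as end-to-end decoding speed, so the 20% should not be read as a 20% latency reduction.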
Where does Z.ai sit among Chinese open-weight vendors relative to MiniMax?
Z.ai IPO’d on the Hong Kong Stock Exchange on January 8, 2026 as China’s first major LLM company to go public, and IDC ranks it the third-largest LLM market player. MiniMax, founded in early 2022, is a comparable Chinese open-weight challenger but trails on launch-day disclosure for its 1M claim.
Do the open weights matter if U.S. frontier models keep getting access-capped?
Z.ai ships GLM-5.2 under MIT as “Pure Open, no regional limits,” a hedge against revoked access after Anthropic pulled a Fable model following a Trump-administration order and OpenAI limited GPT-5.6 to trusted partners. Open weights let you self-serve, but Z.ai itself was added to the U.S. Entity List in January 2025, so a U.S. team depending on the hosted endpoint still carries sanctions risk.
Does GLM-5.2’s 1M window represent a jump from GLM-5.1?
GLM-5.2 extends Z.ai’s prior 200K-token limit to 1M, a 5× headline increase over GLM-5.1. The jump is real on the spec sheet, but neither generation shipped independent needle-in-haystack data, so the 5× figure is an input-length claim, not a measured 5× in effective retrieval.