GLM 5.2, Qwen 3.7, and DeepSeek in 2026: A Routing Map by Workload, Not by Rank

Picking between DeepSeek, Qwen, GLM and Kimi in mid-2026 is a routing problem, not a leaderboard race. GLM-5.2 owns long-horizon coding; Qwen 3.7’s Max variant sits closest to the agentic-action slot Alibaba built it for; DeepSeek pre-segments its own lineup by task. The model that clears a workload cheaply and the one that tops an aggregate board are rarely the same, and the four-way headline is messier than it looks when you check which of the four are open weights.

Why picking by task beats picking by rank

Aggregate leaderboards collapse the tradeoff that decides cost: which model clears the workload, at the latency you need, for the fewest tokens. A team that defaults to the top-ranked name on every API call is paying frontier prices for jobs a cheaper or differently-shaped model clears, and the per-task comparison that would expose that overspend is exactly what vendor rankings are built to obscure. The decision that matters is not who wins the headline; it is which workload each model should own.
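
The selection rule in that paragraph can be sketched in a few lines of Python. Everything here is illustrative: the candidate names, scores, latencies and costs are placeholders rather than measurements, and `pick_cheapest_passing` is a hypothetical helper, not part of any vendor tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    task_score: float         # score on the benchmark closest to YOUR workload
    p95_latency_s: float      # latency measured on your own traffic
    cost_per_task_usd: float  # tokens consumed per task times price per token


def pick_cheapest_passing(
    candidates: list[Candidate], min_score: float, max_latency_s: float
) -> Optional[Candidate]:
    """Return the cheapest candidate that clears both the quality and latency bar."""
    passing = [
        c for c in candidates
        if c.task_score >= min_score and c.p95_latency_s <= max_latency_s
    ]
    return min(passing, key=lambda c: c.cost_per_task_usd, default=None)


# Placeholder numbers, not real measurements.
CANDIDATES = [
    Candidate("top-ranked", task_score=0.92, p95_latency_s=9.0, cost_per_task_usd=0.40),
    Candidate("mid-tier", task_score=0.88, p95_latency_s=4.0, cost_per_task_usd=0.07),
    Candidate("budget", task_score=0.71, p95_latency_s=2.0, cost_per_task_usd=0.02),
]

choice = pick_cheapest_passing(CANDIDATES, min_score=0.85, max_latency_s=10.0)
print(choice.name if choice else "no candidate clears the bar")
```

With these placeholder values the top-ranked candidate is never selected unless the quality bar rises above what the mid-tier one clears, which is the overspend the aggregate ranking hides.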

That decision also forces measurement the marketing never provides. Coding-agent reliability over a messy ten-hour trajectory, the price of bulk 128K-token inference, and the licensing terms for running a model on your own hardware are three different questions with three different answers, and none of them is “the model ranked first.” The routing cut below is built from two primary vendor sources and one community reference, and every score should be treated as a vendor claim until a neutral aggregator confirms it.

Which model should own long-horizon coding?

GLM-5.2, by a clear margin on the workloads that matter to coding agents. It delivers a 1M-token context built for long, messy coding-agent trajectories rather than token-count bragging, and it keeps that context affordable: IndexShare reuses one indexer across every four sparse-attention layers to cut per-token FLOPs 2.9× at 1M context, and a reworked multi-token-prediction layer raises speculative-decoding acceptance length by up to 20%. The weights ship under MIT on HuggingFace and ModelScope, with no regional limits.

The long-horizon coding results are where GLM-5.2 separates from the open pack. On FrontierSWE, which measures hour-to-tens-of-hours agentic projects, GLM-5.2 posts a 74.4 dominance score, trailing Claude Opus 4.8 (75.1) and beating GPT-5.5 (72.6). On SWE-Marathon, the ultra-long compiler-and-kernel benchmark, it scores 13.0 against Opus 4.8’s 26.0, still second only to the Opus series. On PostTrainBench it lands at 34.3, again second to Opus 4.8 (37.2) and ahead of GPT-5.5 (28.4). Across FrontierSWE, PostTrainBench and SWE-Marathon combined, GLM-5.2 is the highest-ranked open-source model.
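
The three long-horizon results read differently in relative terms than in absolute points. A minimal sketch, using only the vendor-reported scores quoted above: `gap_to_leader` is a hypothetical helper, and dividing one score by another is only a rough proxy, since the three benchmarks do not share a scale.

```python
# Scores as quoted above from z.ai's comparison table (vendor-reported).
SCORES = {
    "FrontierSWE": {"GLM-5.2": 74.4, "Claude Opus 4.8": 75.1, "GPT-5.5": 72.6},
    "SWE-Marathon": {"GLM-5.2": 13.0, "Claude Opus 4.8": 26.0},
    "PostTrainBench": {"GLM-5.2": 34.3, "Claude Opus 4.8": 37.2, "GPT-5.5": 28.4},
}


def gap_to_leader(benchmark: str, model: str, leader: str) -> tuple[float, float]:
    """Return (absolute point gap, model score as a fraction of the leader's)."""
    row = SCORES[benchmark]
    return round(row[leader] - row[model], 1), round(row[model] / row[leader], 3)


for name in SCORES:
    points, fraction = gap_to_leader(name, "GLM-5.2", "Claude Opus 4.8")
    print(f"{name}: {points} points behind, {fraction:.1%} of the leader's score")
```

On these figures GLM-5.2 reaches about 99% of Opus 4.8’s score on FrontierSWE, about 92% on PostTrainBench, and 50% on SWE-Marathon, so “second to Opus” covers three quite different distances.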

The standard coding benchmarks tell the same story. GLM-5.2 scores 81.0 on Terminal-Bench 2.1 and 62.1 on SWE-bench Pro, up from GLM-5.1’s 63.5 and 58.4 respectively, and four points behind Claude Opus 4.8’s 85.0 on Terminal-Bench. The next-best open-ish model on SWE-bench Pro, Qwen3.7-Max at 60.6, sits 1.5 points back; MiniMax M3 (59.0) and DeepSeek-V4-Pro (55.4) trail further.

Two practical details finish the case. GLM-5.2 ships effort-level control with High or Max thinking effort, so you trade capability against latency per call instead of paying for max reasoning on every trivial edit. The GLM Coding Plan is tiered: Lite at $12.60/month for small repos, Pro at $50.40/month, and Max at $112/month (yearly billing, roughly 30% off), all usable inside Claude Code, Cline and Kilo Code.
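
For budgeting, the tier prices convert to an annual figure with simple arithmetic. This sketch assumes the quoted monthly rate is charged for twelve months and that plans are bought per seat; neither assumption is confirmed by the plan description above, and `annual_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Monthly prices quoted above for yearly billing (USD).
PLAN_MONTHLY_USD = {
    "Lite": Decimal("12.6"),
    "Pro": Decimal("50.4"),
    "Max": Decimal("112"),
}


def annual_cost(plan: str, seats: int = 1) -> Decimal:
    """Yearly outlay, assuming the monthly rate is billed for 12 months per seat."""
    return PLAN_MONTHLY_USD[plan] * 12 * seats


for plan in PLAN_MONTHLY_USD:
    print(f"{plan}: ${annual_cost(plan):.2f} per seat per year")
```

`Decimal` is used instead of floats so the currency arithmetic stays exact.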

Where does Qwen 3.7 fit in the stack?

In the agentic and mobile-action slot Alibaba explicitly built the 2026 Qwen line for. The current stable releases are Qwen3.7 Max and Qwen3.7 Plus, dated May 18, 2026, following the open-weights Qwen3.6-35B-A3B (April 2026, Apache 2.0) and Qwen3.5 (February 16, 2026). Alibaba positions Qwen3.5 as able to take actions independently across mobile and desktop apps, and the Qwen app had reached 234 million users by May 2026.

The agentic crown rests on Alibaba’s positioning more than on a neutral table. The one agentic benchmark in z.ai’s comparison set, MCP-Atlas, puts Qwen3.7-Max at 76.4, within a point of GLM-5.2 (76.8) and ahead of MiniMax M3 (74.2) and DeepSeek-V4-Pro (73.6). Competitive, not dominant. On standard coding, Qwen3.7-Max’s 60.6 on SWE-bench Pro is the closest any open-ish model gets to GLM-5.2, and its 75.0 on Terminal-Bench 2.1 is solid. There is a real coding model in Qwen3.7-Max, even if the headline Alibaba sells is action over code.

The release cadence is the part of the Qwen story that ages this routing map fastest. Qwen3.5 shipped February 16, 2026, Qwen3.6 followed in April, Qwen3.7 Max landed May 18, and Alibaba announced Qwen3.8 in July. The named qwen3.8-max-preview is available through Token Plan, Qoder and QoderWork, but Qwen’s documentation describes it as a moving preview that will be iteratively upgraded and later retired or replaced. With no model card, neutral benchmark record, production API rate card or downloadable weights yet, Qwen3.7 remains the last generation with stable external comparison data. The half-life on a version-specific recommendation is now measured in weeks, which is why the routing job is less “pick a model” than “maintain a table.”

How does DeepSeek’s lineup already segment itself?

DeepSeek does the routing work for you. Its own site segments the lineup by task: V3.2 for everyday tasks, R1 for deep reasoning, V3 for fast responses, and R1-Distill (1.5B to 70B) for local deployment. V3.2 itself is a 685B-parameter mixture-of-experts model with a 128K-token context. That is a ready-made routing table: cheap bulk inference and general work to V3.2, hard reasoning to R1, and a self-hosted distill ladder for the on-device or air-gapped tier.
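
That segmentation translates almost directly into a lookup. A minimal sketch follows: the `Task` enum and `route` function are hypothetical, the strings are display labels rather than API model IDs, and treating “128K” as exactly 128,000 tokens is an assumption.

```python
from enum import Enum


class Task(Enum):
    EVERYDAY = "everyday"
    DEEP_REASONING = "deep_reasoning"
    FAST_RESPONSE = "fast_response"
    LOCAL = "local"


# Labels follow DeepSeek's own segmentation; they are not API model IDs.
DEEPSEEK_ROUTES = {
    Task.EVERYDAY: "DeepSeek V3.2",
    Task.DEEP_REASONING: "DeepSeek R1",
    Task.FAST_RESPONSE: "DeepSeek V3",
    Task.LOCAL: "DeepSeek R1-Distill",
}

V32_CONTEXT_TOKENS = 128_000  # "128K-token context"; the exact limit is an assumption


def route(task: Task, prompt_tokens: int = 0) -> str:
    """Pick a DeepSeek tier for a task, rejecting prompts V3.2 cannot hold."""
    if task is Task.EVERYDAY and prompt_tokens > V32_CONTEXT_TOKENS:
        raise ValueError("prompt exceeds V3.2's 128K context; route elsewhere")
    return DEEPSEEK_ROUTES[task]


print(route(Task.DEEP_REASONING))
```

Keeping the mapping in data rather than in branching code makes it cheap to update when the lineup changes.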

The naming is messier than the segmentation. DeepSeek’s vendor site still leads with V3.2, while z.ai’s GLM-5.2 benchmark table reports a column for DeepSeek-V4-Pro at SWE-bench Pro 55.4, Terminal-Bench 2.1 64.0, and GPQA-Diamond 90.1. Whether V4-Pro is a shipping variant or a comparison target the competitor chose, that gap is worth pinning down before you commit spend to a model name. Either way, DeepSeek’s per-task shape is consistent: mid-pack on coding benchmarks, with one genuine bright spot on agentic tool use, where DeepSeek-V4-Pro edges GLM-5.2 on Tool-Decathlon, 52.8 to 48.2.

That result is the routing argument in miniature. On the coding benchmarks, GLM-5.2 leads and DeepSeek-V4-Pro trails it by 6.7 points on SWE-bench Pro and 17 on Terminal-Bench 2.1. On one agentic-tool benchmark, the order flips. If your workload is high-volume tool chaining rather than sustained code construction, the cheaper DeepSeek tier is the better economic bet, and a headline leaderboard that ranks GLM-5.2 first would never tell you that.

The routing map, per z.ai’s comparison table

Every score below is vendor-reported by z.ai, and the DeepSeek-V4-Pro naming is unresolved. Treat the table as a hypothesis to check, not a procurement decision.

Workload	Routing pick	Score	Closest open-ish rival
Long-horizon coding (FrontierSWE)	GLM-5.2	74.4	none close; closed-weights Opus 4.8 scores 75.1
Standard coding (SWE-bench Pro)	GLM-5.2	62.1	Qwen3.7-Max 60.6
Agentic tool use (Tool-Decathlon)	DeepSeek-V4-Pro	52.8	GLM-5.2 48.2
Agentic MCP (MCP-Atlas)	GLM-5.2	76.8	Qwen3.7-Max 76.4
Cheap bulk / self-host	DeepSeek V3.2 + R1-Distill	n/a	n/a
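
One way to keep the map checkable is to hold it as data and compute how thin each lead is. This is a sketch under stated assumptions: the workload keys, `margin` and `contested` are hypothetical names, and the 2-point threshold for calling a pick contested is an arbitrary choice, not a figure from any source.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    pick: str
    score: Optional[float]
    rival: Optional[str]
    rival_score: Optional[float]


# Transcribed from the table above; all scores are vendor-reported by z.ai.
ROUTING_MAP = {
    "long_horizon_coding": Route("GLM-5.2", 74.4, None, None),
    "standard_coding": Route("GLM-5.2", 62.1, "Qwen3.7-Max", 60.6),
    "agentic_tool_use": Route("DeepSeek-V4-Pro", 52.8, "GLM-5.2", 48.2),
    "agentic_mcp": Route("GLM-5.2", 76.8, "Qwen3.7-Max", 76.4),
    "cheap_bulk_self_host": Route("DeepSeek V3.2 + R1-Distill", None, None, None),
}


def margin(workload: str) -> Optional[float]:
    """Points separating the pick from its closest open-ish rival, if both are scored."""
    r = ROUTING_MAP[workload]
    if r.score is None or r.rival_score is None:
        return None
    return round(r.score - r.rival_score, 1)


def contested(threshold: float = 2.0) -> list[str]:
    """Workloads where the rival is within `threshold` points, so the pick could flip."""
    return [w for w in ROUTING_MAP if (m := margin(w)) is not None and m < threshold]


print(contested())
```

With the table’s numbers, standard coding (1.5 points) and agentic MCP (0.4 points) come back as contested, which marks them as the rows most worth re-testing on your own workload.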

What about Kimi K3 and MiniMax M3?

The opening frames this as a four-way comparison. Kimi now has enough public evidence to fill that fourth slot, but not enough deployment evidence to own it outright.

Kimi K3 shipped July 16 as a 2.8-trillion-parameter multimodal MoE with a 1-million-token context window, and Moonshot prices its hosted API at $3 per million uncached input tokens, $0.30 for cached input and $15 for output. Artificial Analysis records a 57 Intelligence Index score, which puts K3 in the current frontier cohort rather than the old empty slot. It is a sensible hosted canary for long-context and agentic work, especially when prompt-prefix caching is high.
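
The effect of prompt-prefix caching on K3’s bill is easy to work out from the three quoted prices. A minimal sketch, assuming billing is linear per token and that a single cache-hit rate applies to the whole input; `request_cost` and the request shape are hypothetical.

```python
from decimal import Decimal

# Moonshot's quoted hosted-API prices for Kimi K3, USD per million tokens.
UNCACHED_INPUT = Decimal("3")
CACHED_INPUT = Decimal("0.30")
OUTPUT = Decimal("15")
PER_MILLION = Decimal(1_000_000)


def request_cost(input_tokens: int, output_tokens: int, cache_hit_rate: Decimal) -> Decimal:
    """Cost of one request in USD, given the share of input tokens served from cache."""
    if not Decimal(0) <= cache_hit_rate <= Decimal(1):
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    total = cached * CACHED_INPUT + uncached * UNCACHED_INPUT + output_tokens * OUTPUT
    return total / PER_MILLION


# Hypothetical request shape: 200K-token prompt, 4K-token reply.
cold = request_cost(200_000, 4_000, Decimal("0"))
warm = request_cost(200_000, 4_000, Decimal("0.9"))
print(f"cold cache: ${cold:.3f}, 90% cache hits: ${warm:.3f}")
```

For that hypothetical request the cost falls from $0.66 with a cold cache to $0.174 at a 90% hit rate, at which point the 4K output tokens cost as much as the uncached input.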

MiniMax M3 is half-verifiable. It appears only as a comparison column in z.ai’s GLM-5.2 benchmark table: SWE-bench Pro 59.0, Terminal-Bench 2.1 65.0, NL2Repo 42.1, DeepSWE 20.0, MCP-Atlas 74.2, GPQA-Diamond 93.0, with an AIME 2026 score that went unreported. Those are competitor-reported numbers with no MiniMax primary source behind them and no license terms in the material. Strong enough to note, not strong enough to route production traffic on.

How do you pressure-test each pairing?

Bind every score to a neutral aggregator before the spend. The routing above leans on two vendor blogs and one community reference, and the long-horizon coding numbers that make GLM-5.2 look dominant come from z.ai’s own comparison table. Vendor-reported benchmarks and vendor FAQ claims, DeepSeek’s “comparable to GPT-4” framing included, need a leaderboard check before they become a procurement decision. Run each candidate on DataLearner and Artificial Analysis for the specific task you are routing for, not for the aggregate score.
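
The check can be made mechanical by refusing to treat a score as settled until a neutral figure sits next to it. This is a sketch only: `ScoreClaim`, `verdict` and `procurement_ready` are hypothetical names, the neutral score is assumed to be entered by hand after consulting an aggregator, and the 2-point tolerance is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreClaim:
    model: str
    benchmark: str
    vendor_score: float
    neutral_score: Optional[float] = None  # entered by hand after checking an aggregator


def verdict(claim: ScoreClaim, tolerance: float = 2.0) -> str:
    """Classify a claim as unverified, confirmed, or disputed."""
    if claim.neutral_score is None:
        return "unverified"
    if abs(claim.vendor_score - claim.neutral_score) <= tolerance:
        return "confirmed"
    return "disputed"


def procurement_ready(claims: list[ScoreClaim]) -> bool:
    """Proceed only when every claim backing the routing pick is confirmed."""
    return bool(claims) and all(verdict(c) == "confirmed" for c in claims)


claim = ScoreClaim("GLM-5.2", "SWE-bench Pro", vendor_score=62.1)
print(verdict(claim))
```

Until a neutral score is recorded, every claim in the routing table stays “unverified” and `procurement_ready` returns `False`.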

Then re-check on a monthly cadence, not quarterly. Alibaba shipped four Qwen point releases between February and July 2026: 3.5, 3.6, 3.7 and the 3.8 preview. A quarterly review lets two releases land before you notice. GLM-5.2 shipped as a 1M-context flagship in late June; Kimi K3 followed July 16. The FrontierSWE gap between GLM-5.2 and Opus 4.8, whether DeepSeek V4 has actually shipped under that name, K3’s post-weights serving economics and Qwen3.8’s eventual fixed release will all move inside a quarter, probably sooner. The routing taxonomy (agentic coding, long-context retrieval, cheap bulk, local distill) is the durable part. Every version number, benchmark score, context length and pricing tier attached to it is not.
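
A monthly cadence is easier to keep when stale entries surface on their own. A minimal sketch, assuming each routing entry carries a last-verified date: the entries and dates below are hypothetical bookkeeping, and 30 days is one reading of “monthly.”

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=30)  # "monthly"; the exact number of days is an assumption

# Hypothetical bookkeeping: when each routing entry was last re-checked.
LAST_VERIFIED = {
    "long_horizon_coding -> GLM-5.2": date(2026, 7, 1),
    "agentic_tool_use -> DeepSeek-V4-Pro": date(2026, 5, 20),
}


def stale_entries(today: date, last_verified: dict[str, date] = LAST_VERIFIED) -> list[str]:
    """Return routing entries whose last check is older than the review interval."""
    return sorted(k for k, seen in last_verified.items() if today - seen > REVIEW_INTERVAL)


print(stale_entries(date(2026, 7, 20)))
```

Run against a date of July 20, 2026, only the entry last checked in May is flagged; the taxonomy keys stay fixed while the model names and dates attached to them change.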

Frequently Asked Questions

Is GLM-5.2’s 1M-token context useful outside of coding, such as for document retrieval or summarization?

Z.ai positions the 1M context for coding-agent trajectory length, not general document retrieval. The benchmarks used to validate it (FrontierSWE, SWE-Marathon, PostTrainBench) all measure sustained code construction over hours. No long-document retrieval or RAG benchmark appears in the comparison set, so routing it toward document search means testing an assumption the vendor has not validated against neutral data.

Where is GLM-5.2 closest to and furthest from Claude Opus 4.8 across the benchmark set?

The gap is smallest on FrontierSWE, where GLM-5.2 trails Opus 4.8 by 0.7 points (74.4 to 75.1). It widens to exactly 2x on SWE-Marathon, which covers compiler and kernel work across the longest trajectories: 13.0 versus 26.0. Teams counting on open-source near-parity with the frontier will find that it holds on hour-to-tens-of-hours coding-agent runs but breaks down on the hardest benchmark category in the set.

Has Alibaba’s claimed speed and cost advantage for Qwen’s agentic capabilities been independently confirmed?

Alibaba has claimed Qwen3.5 beats US rivals on speed and cost for agentic workloads, but no neutral benchmark has confirmed this for the Qwen3.7 generation. The figure of 234 million Qwen app users by May 2026 measures consumer adoption, not throughput or cost per action. Any cost-based routing decision for Qwen3.7 Max should be validated against Artificial Analysis latency and pricing data, since the vendor claim predates the 3.7 release.

What breaks down in the routing model if a team’s primary workload is high-volume tool chaining?

DeepSeek-V4-Pro leads on Tool-Decathlon (52.8 vs GLM-5.2’s 48.2), the benchmark closest to high-volume tool chaining. If that workload dominates, GLM-5.2’s Pro or Max tiers charge for long-context and speculative-decoding optimizations a tool-chaining pipeline does not exercise. The cheaper DeepSeek V3.2 tier is the correct default for that shape, with R1 reserved for the calls requiring sustained reasoning.

What would cause the four-way routing approach to collapse into a single-vendor decision?

Among the current flagships compared here, GLM-5.2 is the only one whose weights are confirmed under a permissive license (MIT), which allows self-hosting with no per-token cost. If a team’s throughput justifies running a 1M-context model on local hardware, the routing problem collapses to one endpoint and the API-tier comparisons become irrelevant. The licensing asymmetry (MIT versus Apache 2.0 versus proprietary API-only) is the most durable differentiator in this stack once benchmark scores converge, since it determines whether the team can exit the API cost structure entirely.