MiniMax M3 Claims GPT-5.5-Beating Code With 1M Context and Open Weights

MiniMax released M3 in June 2026 as an open-weight model that the vendor says combines frontier coding, a 1M-token context window, and native multimodal input in a single download. The headline claim is that it edges GPT-5.5 on SWE-bench Pro. That edge is 0.4 points, it is vendor-measured, and nobody outside MiniMax has reproduced it. Treat the comparison as a launch-week claim, not a settled result.

What shipped with M3, and what ‘open weight’ actually covers here

MiniMax launched M3 via API on June 1, 2026, followed by Hugging Face weights on June 7 and a MiniMax Sparse Attention technical report on arXiv on June 11, according to nerova’s launch analysis. The vendor positions M3 as the first open-weight model to combine frontier coding, a 1M-token context window, and native multimodal input (image and video) in one model, per its model page. “First to bundle all three” is MiniMax’s framing, not an independently audited distinction; in a field where DeepSeek and Qwen are already well-established open-weight options, the line between first-to-ship and fastest-to-announce is thin.

The license is the first place “open weight” stops meaning what buyers might assume. The weights ship under the MiniMax Community License, which Localaimaster characterizes as open weight but not fully open source: commercial use of M3 or derivatives reportedly requires a separate license agreement with MiniMax. Open weights here mean inspectable, redistributable under conditions, and self-hostable, not an MIT-style release. For anyone planning to ship a product on the weights, the commercial-use clause is the thing to read before the benchmark table.

Even the architecture is reported inconsistently. Localaimaster describes M3 as a 229.9B-parameter mixture-of-experts model with 9.8B parameters active per token across 256 experts. nerova, reading the Hugging Face model card, describes roughly 428B total parameters with about 23B activated. That is not a rounding difference: the two figures disagree by nearly a factor of two on total size and roughly 2.4x on active parameters. The checkpoint exists, but the reporting around its scale is not yet reconciled.
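The size of that disagreement is easy to check. The sketch below is a minimal one that uses only the four figures quoted above; the dictionary layout and function names are illustrative, not taken from either source.

```python
# Parameter counts (in billions) exactly as quoted from the two write-ups.
reports = {
    "Localaimaster": {"total_b": 229.9, "active_b": 9.8},
    "nerova": {"total_b": 428.0, "active_b": 23.0},
}


def ratio(field: str) -> float:
    """Larger reported value divided by the smaller one for a given field."""
    values = [r[field] for r in reports.values()]
    return max(values) / min(values)


def active_fraction(name: str) -> float:
    """Share of total parameters activated per token under one report."""
    r = reports[name]
    return r["active_b"] / r["total_b"]


total_ratio = ratio("total_b")
active_ratio = ratio("active_b")
print(f"total parameters differ by {total_ratio:.2f}x")
print(f"active parameters differ by {active_ratio:.2f}x")
for name in reports:
    print(f"{name}: {active_fraction(name):.1%} of parameters active per token")
```

Both reports describe a model that activates only a few percent of its parameters per token, so the two accounts differ on scale rather than on the basic sparse design.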

Do the GPT-5.5-beating benchmark numbers hold up?

The verification gap is the story, and the headline margin is thin. M3’s SWE-bench Pro score of 59.0% is reported by Localaimaster as edging GPT-5.5 at 58.6% and beating Gemini 3.1 Pro at 54.2%, while trailing Claude Opus 4.7 at roughly 64.3%. On the same vendor reporting, M3 posts about 80.5% on SWE-bench Verified, 66.0% on Terminal-Bench 2.1, 74.2% on MCP-Atlas, and 83.5 on BrowseComp, where MiniMax’s model page claims it surpasses Opus 4.7 at 79.3.

Two things break the “M3 beats GPT-5.5” reading. First, the SWE-bench Pro margin over GPT-5.5 is 0.4 points, well inside the run-to-run noise of a single leaderboard submission, and Localaimaster notes the figures are not independently verified. Scale AI, which maintains the official SWE-bench Pro benchmark, has not confirmed MiniMax’s numbers. Second, a separate third-party comparison on benchlm.ai reaches the opposite conclusion: it puts GPT-5.5 ahead on a provisional aggregate score, 87 to 78, with GPT-5.5 averaging 81.5 versus M3’s 71.9 on agentic tasks. On Terminal-Bench 2.0, benchlm.ai scores GPT-5.5 at 82% against M3’s 66%. benchlm.ai’s own summary concedes that M3 “hits back in coding,” but the agentic gap is large.
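A rough way to see why 0.4 points sits inside the noise is to treat each pass rate as a binomial proportion and compare the gap with its sampling error. The sketch below does that under loud assumptions: N_TASKS is a placeholder rather than the benchmark's real task count, and the two runs are treated as independent samples, which overstates the error somewhat when both models are scored on the same tasks.

```python
import math

# ASSUMPTION: placeholder task count, not the actual size of the benchmark split.
N_TASKS = 500


def pass_rate_se(p: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured on n_tasks tasks."""
    return math.sqrt(p * (1.0 - p) / n_tasks)


def diff_z(p_a: float, p_b: float, n_tasks: int) -> float:
    """z-score of the gap between two pass rates, treating runs as independent."""
    se_diff = math.sqrt(pass_rate_se(p_a, n_tasks) ** 2 + pass_rate_se(p_b, n_tasks) ** 2)
    return (p_a - p_b) / se_diff


def tasks_needed(p_a: float, p_b: float, z_crit: float = 1.96) -> int:
    """Tasks required before the observed gap would reach z_crit."""
    variance_sum = p_a * (1.0 - p_a) + p_b * (1.0 - p_b)
    return math.ceil(z_crit ** 2 * variance_sum / (p_a - p_b) ** 2)


m3, gpt = 0.590, 0.586  # SWE-bench Pro scores as quoted above
z = diff_z(m3, gpt, N_TASKS)
needed = tasks_needed(m3, gpt)
print(f"standard error per model: {pass_rate_se(m3, N_TASKS):.3f}")
print(f"z-score of the 0.4-point gap: {z:.2f}")
print(f"tasks needed for the gap to reach z=1.96: {needed}")
```

With these inputs the gap is a small fraction of one standard error, and the task count needed to separate the two scores runs above 100,000. Substituting the real split size changes the exact numbers but not the conclusion, unless the benchmark is orders of magnitude larger than assumed here.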

There is also a category mismatch the leaderboard line skips. GPT-5.5 is a reasoning model in this comparison; M3 is not. Beating a reasoning model on a coding slice while losing the agentic aggregate is a narrower claim than “M3 is better,” and the headline framing flattens that into a single ranking.

What does MiniMax Sparse Attention buy at 1M context?

MiniMax Sparse Attention (MSA) is the serving mechanism that makes a 1M-token window usable rather than a theoretical spec, and it is the subject of the June 11 arXiv report. According to Localaimaster, MSA delivers roughly 15.6x faster decode and 9.7x faster prefill at 1M context compared to M2, with per-token compute dropping to about 1/20 of the prior generation and output throughput near 100 tokens per second. Those are figures from a vendor technical report, not third-party measurements; treat them as architecture claims until someone reproduces them on independent hardware.

The context claim itself is concrete on paper. The M3 API supports up to 1M tokens of context with a guaranteed minimum of 512K, per the model page, positioned for coding assistants, automated workflows, and long-range agent tasks. “Guaranteed minimum” is the operative phrase: it tells you the floor of the window, not the average, which matters for budgeting agent loops that run for hours.
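To make the floor-versus-ceiling point concrete, the sketch below estimates how many agent steps fit before accumulated context crosses each limit. Everything apart from the 512K and 1M figures is a hypothetical input: the starting context, the tokens added per step, and the assumption that the loop never compacts or drops history. It also assumes “1M” means 1,000,000 tokens.

```python
GUARANTEED_FLOOR = 512_000      # guaranteed minimum, per the model page as cited above
ADVERTISED_CEILING = 1_000_000  # ASSUMPTION: "1M" taken as 1,000,000 tokens


def steps_until(limit: int, base_tokens: int, tokens_per_step: int) -> int:
    """Whole agent steps that fit before the context would exceed `limit`.

    Assumes every step appends `tokens_per_step` tokens and nothing is
    ever summarized or evicted from the context.
    """
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    if base_tokens > limit:
        return 0
    return (limit - base_tokens) // tokens_per_step


# Hypothetical workload: 20K tokens of system prompt and repository context,
# then 2,500 tokens per step of tool call plus tool output.
base, per_step = 20_000, 2_500
floor_steps = steps_until(GUARANTEED_FLOOR, base, per_step)
ceiling_steps = steps_until(ADVERTISED_CEILING, base, per_step)
print(f"steps inside the guaranteed floor: {floor_steps}")
print(f"steps inside the advertised ceiling: {ceiling_steps}")
```

Planning against the floor roughly halves the step budget for this workload, which is the number an agent design should be sized against if the upper band cannot be relied on.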

The showcase demos are where the long-context and agent story gets concrete enough to judge. MiniMax’s model page lists a roughly 12-hour autonomous reproduction of an ICLR 2025 Outstanding Paper (18 commits, 23 figures) and a roughly 24-hour CUDA FP8 GEMM optimization run (147 submissions, 1,959 tool calls, taking utilization from 7.6% to 71.3% for a 9.4x speedup). These are vendor demos, not independent audits, but they are the kind of multi-hour agent sessions that a thin context window would choke on, which is at least suggestive that the 1M window is being exercised rather than merely advertised.

What does M3 cost, and what does self-hosting really take?

The hosted-API cost gap versus GPT-5.5 is large and easy to quote; the self-hosting hardware is where the real math lives. On benchlm.ai, GPT-5.5 runs $5.00 per 1M input tokens and $30.00 per 1M output, against M3 at $0.30 / $1.20, roughly a 25x gap on output cost alone. Localaimaster places M3’s standard API rate at about $0.60 input / $2.40 output per million tokens, with a temporary launch promo at $0.30 / $1.20 and roughly double the pricing for requests above about 512K tokens.

Two caveats sit on that headline number. The $0.30 / $1.20 figure is a temporary launch promotion, so it is not the steady-state rate to plan against; at the standard $2.40 output rate the gap to GPT-5.5 narrows to about 12.5x. And the hosted API is China-hosted, which is a data-residency concern for teams with EU or US compliance constraints; the cheap tokens come with a jurisdiction.
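The pricing figures above are easier to compare per request than per million tokens. The following is a minimal sketch built on the quoted rates; the Rate class, the request sizes, and the exact form of the long-context surcharge (a flat 2x applied when input exceeds 512K tokens) are assumptions for illustration, since the reporting describes the surcharge only as “roughly double.”

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    """USD per one million tokens."""
    input_per_m: float
    output_per_m: float


# Rates as quoted above.
GPT_5_5 = Rate(5.00, 30.00)
M3_PROMO = Rate(0.30, 1.20)
M3_STANDARD = Rate(0.60, 2.40)

LONG_CONTEXT_THRESHOLD = 512_000  # tokens
LONG_CONTEXT_MULTIPLIER = 2.0     # ASSUMPTION: "roughly double" modeled as exactly 2x


def request_cost(rate: Rate, input_tokens: int, output_tokens: int,
                 long_context_surcharge: bool = False) -> float:
    """Cost in USD of one request; the surcharge is keyed on input size (an assumption)."""
    base = (input_tokens * rate.input_per_m + output_tokens * rate.output_per_m) / 1_000_000
    if long_context_surcharge and input_tokens > LONG_CONTEXT_THRESHOLD:
        return base * LONG_CONTEXT_MULTIPLIER
    return base


# Hypothetical request: 100K tokens in, 10K tokens out.
gpt_cost = request_cost(GPT_5_5, 100_000, 10_000)
promo_cost = request_cost(M3_PROMO, 100_000, 10_000)
standard_cost = request_cost(M3_STANDARD, 100_000, 10_000)
# Hypothetical long-context request at the standard rate: 600K in, 10K out.
long_cost = request_cost(M3_STANDARD, 600_000, 10_000, long_context_surcharge=True)

print(f"GPT-5.5:       ${gpt_cost:.3f}")
print(f"M3 (promo):    ${promo_cost:.3f}  ({gpt_cost / promo_cost:.1f}x cheaper)")
print(f"M3 (standard): ${standard_cost:.3f}  ({gpt_cost / standard_cost:.1f}x cheaper)")
print(f"M3 standard, 600K-token input with surcharge: ${long_cost:.3f}")
```

For an input-heavy request like this one, the blended gap is smaller than the output-only ratio, and it halves again once the promotion ends.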

Self-hosting inverts the comparison. nerova flags the roughly 427B-parameter BF16 checkpoint as a large data-center, multi-GPU deployment target, and argues the real near-term value of the open weights is control, inspection, and optionality rather than instant cost reduction. A checkpoint that size is not a drop-in replacement for an API call; it is an infrastructure commitment, and for most teams the open weights buy the option to leave a closed provider more than they buy savings on day one.
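The scale of that commitment follows from simple arithmetic: BF16 stores each parameter in two bytes. The sketch below turns the reported parameter count into a weights-only memory figure and a minimum GPU count. The function names are illustrative, and the estimate deliberately ignores KV cache, activations, and framework overhead, so real deployments need more headroom than it suggests.

```python
import math

BYTES_PER_PARAM_BF16 = 2  # BF16 uses 16 bits per parameter


def weights_gb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes) for a parameter count in billions."""
    return params_billion * bytes_per_param


def min_gpus(footprint_gb: float, gpu_memory_gb: float) -> int:
    """Fewest GPUs whose combined memory holds the weights, ignoring all overhead."""
    return math.ceil(footprint_gb / gpu_memory_gb)


footprint = weights_gb(427)  # roughly 427B parameters, as reported above
node_capacity = 8 * 80       # one 8-way node of 80 GB GPUs
gpus_80gb = min_gpus(footprint, 80)

print(f"weights alone: {footprint:.0f} GB")
print(f"fits in one 8x80GB node: {footprint <= node_capacity}")
print(f"minimum 80 GB GPUs just to load weights: {gpus_80gb}")
```

Passing a different per-GPU capacity to min_gpus gives the equivalent floor for other accelerators; in every case the figure is a lower bound before any long-context memory is budgeted.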

How should you evaluate M3 before trusting the leaderboard?

Treat M3 as a credible candidate worth benchmarking yourself, not as a confirmed GPT-5.5 replacement, and weight whatever survives your own eval above any vendor leaderboard line. As of late June 2026, some things are settled: the weights exist on Hugging Face, the API is live, the context window is 1M with a 512K floor, and the license is open weight with commercial-use strings attached. Several things are not settled: the parameter count (two analyst snapshots disagree by nearly 2x), the SWE-bench Pro result (Scale AI has not confirmed it), the agentic capability (benchlm.ai’s aggregate favors GPT-5.5), the MSA speedups (vendor-reported), and the data-residency posture of the hosted endpoint.

The pattern is a familiar one. Vendor-run benchmarks have outpaced independent verification across most 2026 model launches, and the leaderboard line that moves on launch week rarely matches the one that holds three months later. The defensible move is to run M3 against your own workload, on your own retrieval and tool stacks, before committing a serving path to it.
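One way to structure that in-house comparison is to score both models on the same tasks and look at where they disagree, rather than comparing two headline percentages. The sketch below is a minimal version of that idea using an exact two-sided sign test on the discordant tasks. The pass/fail lists are made-up placeholder data, and the function names are illustrative rather than part of any benchmark tooling.

```python
from math import comb


def paired_summary(a: list[bool], b: list[bool]) -> dict:
    """Pass rates plus counts of tasks that only one of the two models solved."""
    if len(a) != len(b):
        raise ValueError("both models must be scored on the same tasks")
    n = len(a)
    a_only = sum(1 for x, y in zip(a, b) if x and not y)
    b_only = sum(1 for x, y in zip(a, b) if y and not x)
    return {
        "pass_rate_a": sum(a) / n,
        "pass_rate_b": sum(b) / n,
        "a_only": a_only,
        "b_only": b_only,
    }


def sign_test_p(a_only: int, b_only: int) -> float:
    """Exact two-sided sign test on discordant pairs (null: either model equally likely to win)."""
    n = a_only + b_only
    if n == 0:
        return 1.0
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder results for ten hypothetical tasks (True = task passed).
model_a = [True, True, True, False, True, False, True, True, False, True]
model_b = [True, False, True, False, True, True, True, False, False, True]

summary = paired_summary(model_a, model_b)
p_value = sign_test_p(summary["a_only"], summary["b_only"])
print(summary)
print(f"two-sided sign test p-value: {p_value:.2f}")
```

A one-task lead on a small suite, as in this toy data, carries no statistical weight. The useful output is the list of tasks where the models diverge, which is also where a workload-specific eval tells you something the leaderboard cannot.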

The bundled pitch is the part worth taking seriously. If the coding and 1M-context claims survive independent eval, M3 is the first self-hostable model to put long context, native multimodal input, and frontier coding in one download, and that lowers the practical cost of leaving closed multimodal APIs for agent work. If they do not, it is another leaderboard entry whose numbers nobody has reproduced. Either way, the decision belongs to whoever runs the eval, not to the model card.

Frequently Asked Questions

How does M3’s open-weight pitch differ from DeepSeek or Qwen?

DeepSeek and Qwen ship permissive text-and-code checkpoints under licenses like MIT or Apache 2.0, while M3 layers native image and video input onto a 1M-token window and uses the more restrictive MiniMax Community License that requires a separate agreement for commercial use. The differentiator is the multimodal bundle, not raw coding strength.

What hardware is the realistic floor for self-hosting M3?

A BF16 checkpoint at roughly 427B parameters is about 854GB of weights before any KV cache, so a single 8-way H100 node at 640GB cannot hold it; you need around eleven 80GB GPUs or seven 141GB H200s just to load the model, before budgeting for long-context memory. That gap is why the API, not the weights, is the practical entry point for most teams.

Where does the 1M context claim break for long agent runs?

Only 512K tokens are guaranteed, so a request in the 512K-to-1M band may be refused or truncated, and the API roughly doubles pricing above that threshold. Multi-hour agent loops that accumulate tool output past 512K hit both the cost step and the truncation risk, which is why the floor, not the ceiling, sets the usable budget.

How does MiniMax Sparse Attention compare to other long-context attention work?

MSA targets the same problem as DeepSeek’s Multi-head Latent Attention, which shrinks the KV cache rather than sparsifying attention, and as the 2026 sparse-attention schemes that cut decode cost at long context. The 15.6x decode and 9.7x prefill figures are vendor-reported, however, and have not been measured against Multi-head Latent Attention on identical hardware. The direction is shared; the specific multiplier is not yet independently confirmed.

What would actually settle the GPT-5.5 comparison?

Confirmation needs either Scale AI publishing an official SWE-bench Pro entry for M3 or a third party running the coding benchmark on independent infrastructure, because the current 59.0% to 58.6% margin comes from a single vendor run. Until one of those lands, the leaderboard line is a launch-week claim, not a verified result.