MiniMax launched M3 on June 1, 2026, billing it as the first open-weight model to combine a 1M-token context window, frontier coding ability, and native multimodality in a single release. The headline number, 59.0% on SWE-Bench Pro according to MiniMax’s own announcement, would place it above GPT-5.5 and Gemini 3.1 Pro if independently confirmed. But “if independently confirmed” is doing a lot of work there: BenchLM’s provisional ranking puts M3 at #29 of 119 models overall, a long way from frontier.
What MiniMax M3 Claims
According to MiniMax’s launch post, M3 is a single model handling text, images, and video natively, with a 1M-token context window and a computer-use agent for desktop automation. The company positions it as the first open-weight release to unite all three capabilities: long context, strong coding, and multimodality.
The launch page also highlights two internal demonstrations: an autonomous reproduction of an ICLR 2025 paper completed in 12 hours across 18 commits, and a CUDA kernel speedup achieved over 24 hours. Both are MiniMax’s own evaluations with no independent replication reported as of June 2026.
The open-weight commitment matters here. MiniMax has promised to release M3 weights on HuggingFace and GitHub within approximately ten days of the June 1 launch, alongside a technical report, according to Toolworthy’s coverage. As of June 10, the weights do not appear to have landed; the promised window is nearly up, and whether the release arrives will determine whether this is a genuine open-weight launch or a controlled-access announcement.
The MSA Architecture
MiniMax attributes M3’s efficiency gains to what it calls MSA (MiniMax Sparse Attention) architecture. According to the launch announcement, MSA cuts per-token compute to 1/20 of the previous generation. The claimed speedups are 9x on prefill and 15x on decode at 1M-context length.
Sparse attention is not new; models from Mistral, Google, and others have used variants to reduce the quadratic cost of full attention. What MiniMax appears to be claiming is a more aggressive sparsity ratio than prior approaches, enabling the 1M window without the memory and latency penalties that normally make long-context inference expensive. The technical details backing these figures are promised in the upcoming technical report, which is not yet available as of June 2026.
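To see why long context pushes designs toward sparsity, it helps to count the query-key pairs that attention has to score. The sketch below is a generic illustration, not MSA itself: MiniMax has not published how MSA selects positions, so the fixed per-token `budget` and its value of 4,096 are assumptions made up for this example.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by causal full attention.

    Each token attends to itself and every earlier token, so the total
    grows quadratically with sequence length.
    """
    return seq_len * (seq_len + 1) // 2


def sparse_attention_pairs(seq_len: int, budget: int) -> int:
    """Pairs scored when each token attends to at most `budget` positions.

    `budget` is a hypothetical per-token cap (itself included); it is not
    a published MSA parameter.
    """
    # Early tokens have fewer than `budget` predecessors, so they behave
    # exactly like full attention.
    warmup = min(seq_len, budget)
    warmup_pairs = warmup * (warmup + 1) // 2
    # Every later token scores exactly `budget` positions.
    steady_pairs = max(0, seq_len - budget) * budget
    return warmup_pairs + steady_pairs


if __name__ == "__main__":
    n = 1_000_000
    budget = 4096  # assumption for illustration only
    full = full_attention_pairs(n)
    sparse = sparse_attention_pairs(n, budget)
    print(f"full:   {full:,} pairs")
    print(f"sparse: {sparse:,} pairs")
    print(f"ratio:  {full / sparse:.1f}x fewer pairs with the cap")
```

Attention scoring is only one part of per-token compute, so this ratio should not be read as a reconstruction of the claimed 1/20 figure.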
The practical question for self-hosting teams is GPU memory. MiniMax has not published the model’s parameter count, memory requirements for 1M-context serving, or recommended hardware configurations. Without these numbers, teams cannot evaluate whether the claimed compute savings translate into something runnable on their infrastructure.
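Until MiniMax publishes those numbers, teams can still bound the problem with the standard key-value cache estimate for a dense transformer. The sketch below is a back-of-the-envelope calculator; every architecture value in the example call (layer count, KV heads, head dimension) is a placeholder, not an M3 specification, and a sparse-attention design may store its cache differently.

```python
def kv_cache_gib(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # 2 bytes = fp16/bf16
) -> float:
    """Estimate dense KV-cache size in GiB for one sequence.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    total_bytes = (
        2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    )
    return total_bytes / 2**30


if __name__ == "__main__":
    # Placeholder architecture: NOT M3's real configuration, which is unpublished.
    gib = kv_cache_gib(
        num_layers=60, num_kv_heads=8, head_dim=128, seq_len=1_000_000
    )
    print(f"KV cache for one 1M-token sequence: {gib:.1f} GiB")
```

The estimate covers the cache for a single sequence only; model weights, activations, and concurrent requests all add to the total.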
Benchmarks: Self-Reported vs. Independent Rankings
This is where the picture gets uneven.
According to MiniMax’s announcement, M3 scores 59.0% on SWE-Bench Pro, surpassing GPT-5.5 and Gemini 3.1 Pro, and approaching Opus 4.7.
BenchLM.ai provides an independent cross-reference, and the rankings tell a different story depending on the slice:
| Category | Rank | Models ranked |
|---|---|---|
| Overall (provisional) | #29 | 119 |
| Verified models | #12 | 28 |
| Agentic | #13 | not stated |
| Multimodal & Grounded | #69 | not stated |
The agentic ranking (#13) aligns with MiniMax’s emphasis on tool use and computer control. The multimodal ranking (#69) does not. For a model marketed on native multimodality, placing 69th on multimodal benchmarks is a significant gap between marketing and measurement.
The SWE-Bench Pro figure of 59.0% is also worth treating with caution. It comes from MiniMax’s own evaluation. The BenchLM ranking does not break out SWE-Bench scores by model, so there is no independent cross-check on that specific number as of June 2026.
Availability, Pricing, and the Open-Weight Timeline
MiniMax offers M3 through its own API with standard pricing of $0.60 per million input tokens and $2.40 per million output tokens, according to Toolworthy. A 7-day launch discount halved both rates, expiring around June 8.
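The per-token rates translate into request costs with simple arithmetic. The sketch below assumes flat pricing at the rates quoted above regardless of context length, which the cited coverage does not confirm; the function name and the example request sizes are illustrative.

```python
INPUT_USD_PER_M = 0.60   # standard rate per million input tokens
OUTPUT_USD_PER_M = 2.40  # standard rate per million output tokens


def api_cost_usd(
    input_tokens: int, output_tokens: int, discount: float = 0.0
) -> float:
    """Cost of one request at the quoted rates.

    `discount` is a fraction: 0.5 models the halved launch pricing.
    Assumes the same rate applies at every context length.
    """
    base = (
        input_tokens / 1_000_000 * INPUT_USD_PER_M
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_M
    )
    return base * (1.0 - discount)


if __name__ == "__main__":
    # Hypothetical long-document request: 800k tokens in, 4k tokens out.
    standard = api_cost_usd(800_000, 4_000)
    launch = api_cost_usd(800_000, 4_000, discount=0.5)
    print(f"standard: ${standard:.4f}  launch week: ${launch:.4f}")
```

For long-context requests like this one, the input side dominates the bill even though the output rate is four times higher.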
For subscription users, Datacamp’s overview reports three tiers as of June 2026:
- Plus: $20/month for approximately 1.7B tokens
- Max: $50/month for approximately 5.1B tokens
- Ultra: $120/month for approximately 9.8B tokens
The API pricing at $0.60/M input tokens is competitive with, though not dramatically cheaper than, the 1M-context offerings from Anthropic and Google. Where M3 could differentiate is in self-hosted deployment, if the open-weight release materializes with usable documentation. A self-hosted 1M-context model at any price point changes the build-vs-buy calculus for teams currently routing long documents through paid APIs.
Verification Gaps and Hardware Unknowns
Several critical details remain undisclosed or unverified as of June 2026:
Parameter count and hardware requirements. MiniMax has not stated how large M3 is, how much VRAM it needs for 1M-context serving, or what GPU configurations are viable. Self-hosting teams cannot plan capacity without these numbers.
Weight availability. The ~10-day window for HuggingFace/GitHub release is a promise, not a commitment with a hard date. Toolworthy’s coverage frames it as “approximately ten days,” leaving room for slippage.
Benchmark independence. All SWE-Bench Pro, MCP Atlas, ICLR reproduction, and CUDA speedup figures come from MiniMax’s own evaluations. BenchLM’s independent ranking supports the agentic claim but contradicts the multimodal one.
Bottom Line for Engineering Teams
M3 is worth tracking, not worth adopting on launch claims alone.
The combination of open weights, 1M-context, and competitive API pricing addresses a real gap. Teams currently paying premium rates for long-context inference through closed APIs would benefit from a viable self-hosted alternative, and M3’s pricing structure suggests MiniMax understands this market.
But the decision to adopt requires three inputs that do not exist yet: the actual weight files on HuggingFace, independent benchmark confirmation (particularly on SWE-Bench Pro and the multimodal gap), and published hardware requirements for 1M-context serving. Until those arrive, the prudent move is to treat M3 as a promising announcement rather than a deployment target.
The BenchLM agentic ranking of #13 suggests M3 has real strength in tool-use scenarios. If the open-weight release lands with reasonable memory requirements, teams running agentic pipelines on closed models should benchmark M3 against their current stack. The multimodal ranking of #69, meanwhile, suggests that use cases relying on vision or grounded understanding should wait for independent evaluation before switching.
Frequently Asked Questions
Does Anthropic’s distillation accusation against MiniMax affect M3’s reliability?
Anthropic accused MiniMax of model distillation in February 2026, four months before M3 launched. The claim is unresolved and does not confirm M3 used distilled training data, but it introduces a structural risk: models trained on distillation targets can carry systematic blind spots on tasks outside the distillation distribution, weaknesses that do not surface in vendor-selected benchmarks like SWE-Bench Pro. Teams considering M3 for production pipelines should weight independent, third-party evaluations more heavily than MiniMax’s own figures, particularly given the post-IPO commercial pressure to announce strong numbers.
When do the flat-rate subscriptions save money over per-token API billing?
The subscription tiers (Plus $20/mo for ~1.7B tokens, Max $50/mo for ~5.1B) pool input and output tokens together. Taken at face value, those allowances work out to roughly $0.01 per million tokens ($20 for 1,700M tokens is about $0.012/M), far below either API rate, so Plus comes out ahead once monthly usage passes roughly 8M tokens for all-output traffic or 33M tokens for all-input traffic. The key detail: output tokens cost 4x the input rate ($2.40/M vs $0.60/M) on the API, so output-heavy workloads (code generation, long-form synthesis) reach break-even soonest and capture most of the subscription discount. Input-heavy pipelines (document ingestion, RAG retrieval over long-context windows) see a smaller marginal benefit because their API costs are already anchored to the cheaper input tier. The Ultra tier at $120/mo for ~9.8B tokens targets teams running sustained multi-billion-token throughput, not intermittent queries.
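The break-even point can be worked out from the published rates. The sketch below assumes the allowance counts input and output tokens equally, as the pooled description implies, and ignores any rate limits, which are not described in the coverage cited here; the function names are illustrative.

```python
INPUT_USD_PER_M = 0.60
OUTPUT_USD_PER_M = 2.40

# (monthly price in USD, approximate token allowance) as reported.
TIERS = {
    "Plus": (20, 1.7e9),
    "Max": (50, 5.1e9),
    "Ultra": (120, 9.8e9),
}


def blended_api_rate(output_share: float) -> float:
    """API price in USD per million tokens for a given output fraction."""
    return INPUT_USD_PER_M * (1 - output_share) + OUTPUT_USD_PER_M * output_share


def break_even_tokens(monthly_price: float, output_share: float) -> float:
    """Monthly token volume above which the subscription beats the API."""
    return monthly_price / blended_api_rate(output_share) * 1_000_000


def effective_subscription_rate(monthly_price: float, allowance: float) -> float:
    """USD per million tokens if the whole allowance is consumed."""
    return monthly_price / (allowance / 1_000_000)


if __name__ == "__main__":
    for name, (price, allowance) in TIERS.items():
        rate = effective_subscription_rate(price, allowance)
        heavy_in = break_even_tokens(price, output_share=0.1)
        heavy_out = break_even_tokens(price, output_share=0.9)
        print(
            f"{name}: ${rate:.4f}/M at full use; break-even "
            f"{heavy_out / 1e6:.1f}M tokens (90% output) "
            f"to {heavy_in / 1e6:.1f}M tokens (10% output)"
        )
```

The `output_share` values of 0.1 and 0.9 are arbitrary examples of input-heavy and output-heavy traffic; substituting a measured ratio from real logs gives a more useful threshold.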
What does M3’s 74.2% on MCP Atlas reveal about tool-use performance?
MCP Atlas evaluates models on the Model Context Protocol, an open standard for connecting models to external tools and data sources. M3’s 74.2% there, combined with BenchLM’s #13 agentic ranking, suggests competence at structured single-turn tool invocations. What neither score captures is multi-step agent reliability: error recovery across failed tool calls, coherent planning over extended tool-use sessions, and state persistence when the 1M-context window fills with intermediate results. Teams evaluating M3 for agent loops should test multi-turn tool chains directly rather than relying on any single benchmark.
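One way to set up such a test is a small harness that injects tool failures and records whether the agent recovers. The sketch below is model-agnostic and makes no assumptions about MiniMax’s API: `scripted_agent` is a hypothetical stand-in that a team would replace with a wrapper around its own model client, and `FlakyTool`, `run_episode`, and the `lookup` tool are names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FlakyTool:
    """Wraps a tool so its first `failures` calls raise an error."""

    fn: Callable[[str], str]
    failures: int = 1
    calls: int = 0

    def __call__(self, arg: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("simulated transient tool failure")
        return self.fn(arg)


def run_episode(agent, tools, max_steps: int = 10) -> dict:
    """Drive an agent through a tool loop and record errors and step count.

    `agent(history)` returns ("call", tool_name, arg) or ("finish", answer).
    Each history entry is (tool_name, arg, "ok" | "error", payload).
    """
    history = []
    errors = 0
    for step in range(max_steps):
        action = agent(history)
        if action[0] == "finish":
            return {"answer": action[1], "steps": step, "errors": errors}
        _, name, arg = action
        try:
            history.append((name, arg, "ok", tools[name](arg)))
        except Exception as exc:  # surface the failure to the agent
            errors += 1
            history.append((name, arg, "error", str(exc)))
    # Step budget exhausted without a final answer.
    return {"answer": None, "steps": max_steps, "errors": errors}


def scripted_agent(history):
    """Hypothetical stand-in for a model: retry `lookup` until it succeeds."""
    successes = [entry for entry in history if entry[2] == "ok"]
    if successes:
        return ("finish", successes[-1][3])
    return ("call", "lookup", "M3")


if __name__ == "__main__":
    tools = {"lookup": FlakyTool(lambda query: f"result for {query}")}
    print(run_episode(scripted_agent, tools))
```

Running many episodes with different failure counts and longer tool chains gives a recovery rate that can be compared across models, which a single benchmark score does not provide.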
Which M3 use cases should teams deprioritize given the BenchLM rankings?
BenchLM places M3 at #69 in Multimodal & Grounded, a poor result for a model marketed with native multimodality. The ranking suggests that tasks requiring visual understanding (screenshot parsing, document OCR, chart interpretation) or grounded reasoning (spatial relationships, multimodal chain-of-thought) are not M3’s strength. Teams should read the “native multimodality” claim as support for image and video input rather than as evidence of strong visual reasoning, and wait for independent multimodal benchmarks before switching away from models that score well on vision-specific evaluations.