MiniMax M3 Bundles 1M Context and Native Multimodal Into One Open-Weight Model

MiniMax released M3 on June 1, 2026, and the spec sheet reads like a bet that one model can do what teams currently spread across three: long-context reasoning, multimodal input, and coding at or near the frontier. If the open weights ship as promised, M3 would be the first model in its class to offer that combination with self-hosting as an option, at roughly one-tenth the per-token cost of Claude Opus 4.8 or GPT-5.5, and about one-seventeenth the cost of Claude Fable 5, Anthropic’s most capable widely released model, which launched on June 9, 2026.

What MiniMax M3 actually is

M3 is a sparse-attention model built by Shanghai-based MiniMax on a grouped-query attention (GQA) backbone, accepting text, images, and video as input with text-only output. Its context window extends to 1M tokens (guaranteed minimum 512K). According to MiniMax’s developer guide, the model was trained on mixed-modality data from step zero, rather than retrofitting vision onto a text-only base. M3 also supports computer use for desktop operation.

The predecessor, M2.7 (released April 2026), was text-only with a shorter context window and no multimodal input. The gap between M2.7 and M3 is not incremental.

The MSA architecture: sparse attention, round two

The technical hook is MiniMax Sparse Attention (MSA). The mechanism performs block-level selection on uncompressed key-values, dropping per-token compute at 1M context to roughly 1/20th of the M2 generation while preserving accuracy, per MiniMax’s developer guide. The headline throughput numbers: 9.7× faster prefill and 15.6× faster decoding at 1M context compared to M2.
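To make "block-level selection on uncompressed key-values" concrete, here is a minimal sketch of the general idea for a single query. It is not MiniMax's implementation: the function name, the mean-key scoring heuristic, and the `block_size` and `top_k` parameters are all assumptions chosen for illustration. Cheap block summaries decide which blocks to read, and ordinary softmax attention then runs over the original keys and values in those blocks only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size=4, top_k=2):
    """Attend over only the top_k key/value blocks most relevant to the query.

    Hypothetical sketch: blocks are ranked by the dot product between the
    query and each block's mean key. That is one common heuristic and is
    not necessarily how MSA scores blocks.
    """
    n, dim = len(keys), len(query)
    # Split the sequence into contiguous blocks of token indices.
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def block_score(block):
        # A real system would cache these summaries instead of recomputing them.
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Standard scaled dot-product attention over the uncompressed keys and
    # values, restricted to tokens inside the selected blocks.
    token_ids = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(dim)
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    output = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return output, selected


# Three blocks of four tokens; the query aligns with the middle block only.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[-1.0, 0.0]] * 4
values = [[float(i)] for i in range(12)]
out, picked = block_sparse_attention([0.0, 2.0], keys, values, block_size=4, top_k=1)
print(picked, out)
```

In this toy case the attention step touches 4 of 12 tokens. The per-token savings MiniMax reports at 1M context would come from the same kind of ratio, at a much larger scale and with a selection rule the sketch does not attempt to reproduce.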

What makes this notable is that MiniMax abandoned sparse attention for the entire M2 generation. According to FelloAI’s coverage, the company killed sparse attention in M2 and brought it back for M3. MSA is a public self-correction on that position. Sparse attention is harder to productionize, the tooling is weaker, and the accuracy tradeoffs are nontrivial. MiniMax walked into those problems once, walked away, and walked back. Whether the second attempt holds up under sustained use is the open question.

Benchmark breakdown: where M3 wins, where it doesn’t

The coding numbers are strong for an open-weight model. According to FelloAI’s report, M3 scores 59.0% on SWE-Bench Pro, edging GPT-5.5 (58.6%) and trailing Claude Opus 4.7 (64.3%). On BrowseComp, M3 scores 83.5, though no comparable Opus score was independently available. Terminal-Bench 2.1 comes in at 66.0%.

MiniMax has demonstrated M3 on long-horizon agentic tasks, but these demos are unreplicated and no independent source has confirmed the results.

The weakness is abstract reasoning. BenchLM’s independent ranking shows a Reasoning score of 0 out of 100, which reflects zero published benchmarks in that category rather than a measured failure. It is still a real gap in the evidence, not a rounding error.

BenchLM’s independent ranking places M3 at #29 of 119 on the provisional leaderboard (overall 76/100) and #12 of 28 on the verified leaderboard. Strongest category: Agentic (#13, 82.4/100). Weakest category with published scores: Multimodal and Grounded (#69, 47.3/100). Only 16 of 224 tracked benchmarks have published scores, so the leaderboard picture will shift as more results come in.

Pricing and the cost question

As of June 2026, OpenRouter lists M3 at $0.30 per million input tokens and $1.20 per million output tokens during a 7-day 50% launch promotion, per AI Made Tools. Regular pricing is $0.60/$2.40. Cache reads cost $0.12/M. The long-context tier (512K to 1M) is 2× standard rates.
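A rough per-request estimate follows from these rates. The sketch below is an illustration, not OpenRouter's billing logic: the function name is made up, and it assumes that cached tokens are a subset of the prompt billed at the cache-read rate, that the 2× long-context multiplier applies to every rate once the prompt exceeds 512K tokens, and that the promotion halves only the input and output rates.

```python
# Regular rates quoted above, in USD per million tokens.
INPUT_RATE = 0.60
OUTPUT_RATE = 2.40
CACHE_READ_RATE = 0.12
LONG_CONTEXT_THRESHOLD = 512_000  # tokens
LONG_CONTEXT_MULTIPLIER = 2.0
PROMO_DISCOUNT = 0.5


def estimate_cost(input_tokens, output_tokens, cached_tokens=0, promo=False):
    """Estimate the USD cost of one request under the assumptions stated above."""
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")

    input_rate, output_rate = INPUT_RATE, OUTPUT_RATE
    if promo:
        input_rate *= PROMO_DISCOUNT
        output_rate *= PROMO_DISCOUNT

    # Assumption: the long-context tier is triggered by prompt size alone.
    multiplier = LONG_CONTEXT_MULTIPLIER if input_tokens > LONG_CONTEXT_THRESHOLD else 1.0

    fresh_tokens = input_tokens - cached_tokens
    cost = (
        fresh_tokens * input_rate
        + cached_tokens * CACHE_READ_RATE
        + output_tokens * output_rate
    ) / 1_000_000
    return cost * multiplier


print(f"200K prompt, regular: ${estimate_cost(200_000, 4_000):.4f}")
print(f"200K prompt, promo:   ${estimate_cost(200_000, 4_000, promo=True):.4f}")
print(f"800K prompt, regular: ${estimate_cost(800_000, 4_000):.4f}")
```

Crossing the 512K threshold is the expensive step: under these assumptions, an 800K-token prompt costs about eight times as much as a 200K-token prompt, not four times.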

For comparison:

Model	Input $/M tokens	Output $/M tokens	Context	Multimodal
MiniMax M3 (promo)	$0.30	$1.20	1M	Yes
MiniMax M3 (regular)	$0.60	$2.40	1M	Yes
Claude Opus 4.8	$5.00	$25.00	1M	Yes
Claude Fable 5	$10.00	$50.00	1M	Yes
GPT-5.5	$5.00	$30.00	128K	Yes

MiniMax pricing per AI Made Tools. Claude pricing per Anthropic (June 2026). GPT-5.5 pricing is the listed vendor rate as of June 2026. Claude Fable 5 is Anthropic’s most capable widely released model, sitting above Opus 4.8 in the lineup.

At regular pricing, the per-token gap against Claude Opus 4.8 is roughly 8× on input and 10× on output; against Claude Fable 5 it widens to roughly 17× on input and 21× on output. The launch promotion doubles those ratios while it lasts. For teams running long-context agentic pipelines, the savings compound fast. But the comparison is only clean if M3’s quality holds under independent testing and if the open weights allow commercial self-hosting.
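The ratios can be checked directly against the comparison table. A minimal sketch, using only the listed rates (the dictionary and function names are illustrative):

```python
# (input, output) prices in USD per million tokens, copied from the table above.
PRICES = {
    "MiniMax M3 (promo)": (0.30, 1.20),
    "MiniMax M3 (regular)": (0.60, 2.40),
    "Claude Opus 4.8": (5.00, 25.00),
    "Claude Fable 5": (10.00, 50.00),
    "GPT-5.5": (5.00, 30.00),
}


def price_ratio(expensive, cheap):
    """Return (input_ratio, output_ratio) of one model's prices over another's."""
    e_in, e_out = PRICES[expensive]
    c_in, c_out = PRICES[cheap]
    return e_in / c_in, e_out / c_out


for rival in ("Claude Opus 4.8", "Claude Fable 5", "GPT-5.5"):
    for tier in ("MiniMax M3 (regular)", "MiniMax M3 (promo)"):
        in_x, out_x = price_ratio(rival, tier)
        print(f"{rival} vs {tier}: {in_x:.1f}x input, {out_x:.1f}x output")
```

Against M3's regular rates this gives about 8.3× on input and 10.4× on output for Claude Opus 4.8, and about 16.7× on input for Claude Fable 5. The promotional rates double each ratio.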

The competitive set

Assuming the weights ship as promised, M3 is the only open-weight model with native multimodality and 1M context as of June 2026. DeepSeek V4-Pro competes on coding benchmarks and publishes open weights, but it lacks multimodal input entirely. Qwen 3.7-Max-Preview is closed-weight. MiniMax has not published head-to-head numbers against DeepSeek V4 or Kimi K2.6, per Apidog’s comparison.

The framing matters: if your workload is coding-only and you do not need multimodal input, DeepSeek V4-Pro is worth evaluating. M3’s advantage is the bundle, the context window, and (potentially) self-hosting.

What is still unproven

Three things separate “interesting launch” from “changed the market,” and none are resolved yet:

Open weights. MiniMax promises weights on Hugging Face within roughly 10 days of the June 1 launch (expected June 10 to 11), per FelloAI. At time of writing, they are not available. Until they appear and the community can inspect the architecture, the “open-weight” claim is a promise, not a fact.
License terms. M3’s license terms have not been disclosed. If the weights ship with commercial-use restrictions, the self-hosting economics argument narrows to research and noncommercial use only.
Independent benchmarks. Sixteen of 224 BenchLM benchmarks have scores. Vendor-reported numbers are useful for direction; they are not useful for procurement decisions.

Who should use M3 right now

Teams evaluating M3 fall into three buckets as of early June 2026:

API users running long-context agentic work. The pricing makes M3 worth testing immediately against current Claude or GPT pipelines. The BrowseComp numbers suggest genuine web-agent competence. The coding benchmarks are competitive with GPT-5.5. Run parallel tests and compare.

Teams planning to self-host. Wait for the weights and the license. The economics only work if unrestricted commercial use is permitted and if the model fits on hardware you can actually provision. MSA’s efficiency claims need validation under your workload, not just MiniMax’s demos.
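One way to sanity-check hardware fit before the weights arrive is to size the key-value cache for a GQA model at long context. The formula below is the standard one for a dense cache; every configuration value in the example is hypothetical, since MiniMax has not published M3's layer count, KV-head count, or head dimension. The sketch also assumes the full cache still has to be stored even when sparse attention reads only part of it.

```python
def kv_cache_gib(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a dense KV cache for one sequence, in GiB.

    Each layer stores two tensors (keys and values), each holding
    num_kv_heads * head_dim values per token.
    """
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30


# Hypothetical configuration for illustration only; these are NOT M3's
# published dimensions. bytes_per_value=2 corresponds to 16-bit storage.
example = kv_cache_gib(seq_len=1_000_000, num_layers=60, num_kv_heads=8, head_dim=128)
print(f"KV cache at 1M tokens: {example:.0f} GiB per sequence")
```

With these placeholder numbers, a single 1M-token sequence needs roughly 229 GiB for the cache alone, before model weights. Swap in the real dimensions once the architecture is published.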

Teams that need abstract reasoning or top-tier coding. BenchLM’s Reasoning category shows zero published scores for M3, making it a poor choice if that matters for your use case. If SWE-bench Verified is your primary metric and you do not need multimodal input, DeepSeek V4-Pro is worth evaluating as an alternative.

The honest summary: M3’s spec sheet is genuinely unusual for an open-weight model, the pricing is aggressive, and the sparse-attention comeback is technically interesting. But “announced” is not “shipped,” vendor benchmarks are not independent benchmarks, and a license that restricts commercial use would undercut the central pitch. Check back after June 11.

Frequently Asked Questions

How fast is M3’s output compared to Claude or GPT?

M3 generates text at roughly 100 tokens per second, about 3× the throughput of Claude Opus, according to FelloAI’s report. The speed comes from MSA’s block-level KV selection during decoding, but applies only to text output: M3 accepts images and video as input but cannot generate either.

How does M3 stack up against DeepSeek V4-Pro on coding benchmarks?

DeepSeek V4-Pro scores 80.6% on SWE-bench Verified versus M3’s 59.0% on SWE-Bench Pro, a related but distinct benchmark. DeepSeek V4-Pro also undercuts M3’s regular pricing at $0.435/$0.87 per million input/output tokens. The gap closes only if your pipeline requires multimodal input, which DeepSeek lacks entirely.

What did MiniMax’s own engineers say about sparse attention before M3?

MiniMax’s engineering blog previously stated that ‘the infrastructure for linear and sparse attention is much less mature’ than full attention, which is why the entire M2 generation dropped the approach. M3’s MSA is a reversal of that engineering judgment, not a steady technical progression, and whether the tooling has actually caught up remains an open question.

What does M2.7’s license suggest about M3’s commercial-use terms?

M2.7’s license restricts commercial use without written authorization from MiniMax. If M3 ships under similar terms, self-hosting would be limited to research and noncommercial applications, which would remove the primary economic argument for choosing it over per-token API access to Claude or GPT.

What specific long-horizon tasks has MiniMax demonstrated with M3?

MiniMax reports M3 autonomously reproduced an ICLR 2025 paper in 12 hours across 18 commits producing 23 figures, and optimized CUDA kernels over 24 hours, pushing FP8 hardware utilization from 7.6% to 71.3% (a 9.4× speedup across 147 submissions). M3 also scored 0.37 on PostTrainBench. No independent lab has replicated any of these results as of June 2026.