groundy
industry & business

Moonshot's Kimi K2.7 Code Loses 11 of 12 Benchmark Cells, Leads on Efficiency Instead

Moonshot's Kimi K2.7 Code loses 11 of 12 benchmark cells to GPT-5.5 and Opus 4.8, leading on token efficiency and price, which pushes buyers to run their own evals.

7 min · · · 4 sources ↓

Moonshot shipped Kimi K2.7 Code on 2026-06-12 with open weights, an OpenAI-compatible API, and a 256K context window, per analysis of the release. The notable part is what its own model card does not do: claim a capability win. Across the six benchmarks Moonshot ran itself, K2.7-Code trails GPT-5.5 and Claude Opus 4.8 in 11 of 12 head-to-head cells, per the open-source release breakdown. The pitch leads on token efficiency and price instead.

What Moonshot shipped on June 12

Kimi K2.7-Code is a 1-trillion-parameter Mixture-of-Experts model with 32 billion activated parameters per token, open-sourced on Hugging Face under a Modified MIT license with OpenAI- and Anthropic-compatible API endpoints, per the release analysis. The architecture carries over from K2.6 per the open-source release breakdown: 384 experts with 8 selected plus 1 shared per token, 61 layers, Multi-head Latent Attention, native INT4 weights, a 256K context window, and a 400M-parameter MoonViT vision encoder.

For a coding follow-up, the version delta is generational rather than architectural. Moonshot reports K2.7-Code gains over K2.6 per the same coverage: 21.8% on Kimi Code Bench v2 (50.9 to 62.0), 11.0% on Program Bench, and 31.5% on MLS Bench Lite. Two of those six benchmarks (Kimi Code Bench v2 and Kimi Claw 24/7) are Moonshot’s own in-house suites, and Moonshot executed all the comparison runs itself.

Did K2.7-Code beat GPT-5.5 and Opus 4.8 on benchmarks?

Moonshot published a benchmark table. It just did not win it. The model card’s six-benchmark comparison puts K2.7-Code behind GPT-5.5 and Claude Opus 4.8 on almost every cell, per the release breakdown. Its single win is MCPMark Verified, where it scores 81.1 against Opus 4.8’s 76.4, and it still loses that benchmark to GPT-5.5 at 92.9.

That is where the headline framing needs a correction. K2.7-Code did not skip the benchmark race; Moonshot showed up, ran the table, and lost most of it. What changed is the posture. Rather than leading with a cherry-picked SOTA claim, the launch leads on efficiency. That is a narrower and more honest story than “skipped the race,” but it is the one the numbers support.

One more wrinkle: no SWE-Bench Verified score appears in Moonshot’s launch table. A third-party model directory lists a 78.2% SWE-Bench figure for K2.7-Code, but that number is absent from the vendor’s own materials and should be treated as unverified.

Worth noting on methodology: GPT-5.5 and Opus 4.8 were run inside Codex and Claude Code at “maximum-effort” settings rather than default API calls, per the release breakdown. That is a defensible choice for a coding model, since you want to compare agentic coding at agentic settings, but it is also a knob that favors whichever harness a benchmark is tuned for.

Why token efficiency is the actual pitch

K2.7-Code’s pitch leads on token economics because the capability comparison is a loss. Moonshot claims roughly 30% lower thinking-token usage versus K2.6 while scoring higher, per the release breakdown, and it ships thinking forced on via preserve_thinking with no instant or non-thinking mode. As of June 2026 that figure is vendor-reported and vendor-measured.

For a coding agent that runs hundreds of turns per task, thinking-token volume dominates the bill. A model that thinks less per accepted change can win on cost-per-merged-PR even while losing the leaderboard. That is the trade Moonshot is selling: concede the benchmark headline, keep the efficiency win.

The constraint is structural. Thinking cannot be disabled, so for trivial requests (a one-line rename, a formatting pass) the model still pays reasoning overhead. Workloads dominated by cheap, high-volume edits will not capture the full efficiency benefit and may spend tokens where a non-thinking model would not.

What K2.7-Code costs as of June 2026

As of June 2026, K2.7-Code’s API is priced per the release analysis at $0.19 per million cached-input tokens, $0.95 per million on cache misses, and $4.00 per million output tokens. Moonshot frames that as roughly a tenth of Claude Fable 5’s $10/$50 per-million list rates. An order-of-magnitude price gap matters more than a couple of benchmark points for teams running coding agents at volume.

The open weights are the other half of the cost story. Self-hosting avoids the per-token bill, but the model is 340GB-plus to run, which is its own DRAM-spike problem, per the open-source release breakdown. For most teams the API economics, not the weights, are the actionable comparison.

The Modified MIT license adds one commercial condition worth reading before assuming standard MIT terms. Per the release analysis, a product exceeding 100 million monthly active users or US$20 million per month in revenue must prominently display “Kimi K2.7 Code” in its UI. Below those thresholds it behaves like vanilla MIT, so the attribution requirement only applies to hyperscaler-class products.

What coding-agent buyers must now test themselves

When the vendor stops selling a leaderboard delta, the evaluation work moves in-house. K2.7-Code’s differentiation (efficiency and price against a capability deficit) does not map to any public score a procurement team can look up. The comparison that matters is cost-per-accepted-task on your own repository, not K2.7-Code’s MCPMark number against Opus 4.8.

Concretely, that means running the model on real pull requests from your codebase and measuring accept and merge rates, tokens consumed, and rework frequency. Public benchmarks will not tell you whether K2.7-Code’s tool-calling loop holds together on your stack, and Moonshot’s own comparison was run in two different agentic harnesses at non-default settings.

Integration constraints will surface in that eval too. Multi-turn tool use requires feeding back the assistant’s previous reasoning_content in the message history, and tool_choice can only be set to auto or none, per a community integration guide. Existing coding-agent frameworks may need adapter work to handle both.

The caveats: vendor-reported numbers, forced thinking, and tool-loop limits

Almost everything load-bearing in this launch is self-reported. Moonshot ran all six comparison benchmarks itself, two of them (Kimi Code Bench v2 and Kimi Claw 24/7) are in-house suites, and the thinking-token-reduction claim is self-measured. The headline SWE-Bench figure floating on third-party directories is not in Moonshot’s launch table and should be treated as unverified. Independent replication is not yet reported as of 2026-06-16.

The broader read is that benchmark fatigue is reaching Chinese coding-model vendors too. A lab that a few quarters ago would have led with a claimed SWE-bench win is instead shipping a table it loses and betting on efficiency and price. For buyers that is not a relief from evaluation. It is a transfer: the work of telling these models apart now sits on your repo, not on a leaderboard.

Frequently Asked Questions

How does the cached-input price gap change how teams should prompt K2.7 Code?

Cached input runs $0.19 per million tokens against $0.95 on a miss, a fivefold spread that is wider than most coding APIs expose. Teams that reuse large system prompts or repository context across turns pay roughly a fifth of the uncached rate, so prompt-caching strategy, not model selection, becomes the dominant cost lever for high-volume coding agents.

Does the maximum-effort benchmark setting undercut K2.7 Code’s efficiency claim?

Partially. Moonshot ran GPT-5.5 and Opus 4.8 inside Codex and Claude Code at maximum-effort rather than default API calls, which raises their token spend and makes the 30 percent thinking-token comparison favorable to K2.7 Code. The efficiency delta is real but measured against competitors running hotter than their default configuration, so the gap shrinks if a buyer runs the closed models at lower effort.

Why is MCPMark Verified the one benchmark K2.7 Code wins?

MCPMark Verified scores Model Context Protocol tool-server integration, the agentic task of discovering and calling external tools, rather than single-shot code generation. K2.7 Code’s 81.1 to Opus 4.8’s 76.4 on that one cell suggests its tool-calling loop is comparatively strong even as its raw generation loses elsewhere, which matters more for autonomous coding agents than for autocomplete.

What breaks when pointing an existing coding agent at K2.7 Code’s OpenAI-compatible endpoint?

Two constraints beyond the endpoint. Multi-turn tool use requires replaying the prior reasoning_content block in message history, so clients that strip reasoning between turns drop into broken tool loops, and tool_choice accepts only auto or none, which blocks frameworks that force a specific tool call. Both need adapter code despite the OpenAI-compatible surface.

What does the MoonViT vision encoder buy in a coding model?

The 400M-parameter MoonViT encoder lets K2.7 Code ingest screenshots, UI mockups, and image-based error output alongside text, which dense coding models that lack a vision path cannot do without a separate OCR or captioning step. For tasks like reproducing a design from a mockup or debugging a rendered layout, the vision path removes a manual transcription step, though Moonshot’s launch benchmarks do not isolate vision-grounded coding performance.

sources · 4 cited

  1. Moonshot AI's Kimi K2.7 Code Is Out. Here's What Actually Changed for Coding-Agent Teams. analysis accessed 2026-06-16
  2. Kimi K2.7-Code Is Open-Source. Running It Yourself Is Another Story. analysis accessed 2026-06-16
  3. Kimi K2.7 Code: 262k Context, 78.2% SWE-Bench, $0.95/M Tokens community accessed 2026-06-16
  4. RayCodes_Kimi_2.7: Integration guide for Kimi K2.7 Code in Claude Code, Cline, Roo Code community accessed 2026-06-16