JetBrains Mellum2: A 12B Open-Weights Code Model for Self-Hosted Completion

JetBrains released Mellum2 on June 1, 2026: a 12-billion-parameter mixture-of-experts code model, open-weight under Apache 2.0, with only 2.5B parameters active per token. The pitch is straightforward: serve your own inline code completion on a single commodity GPU, keep proprietary code off metered cloud APIs, and pay nothing per seat.

What is Mellum2, and why did JetBrains open-source it now?

Mellum2 is the successor to the original Mellum, a 4B-parameter dense code-completion model that JetBrains debuted in late 2024 for its IDEs before open-sourcing it in April 2025. Where the first Mellum handled single-line and multi-line completions, Mellum2 extends to broader software-engineering tasks including tool use, retrieval-augmented summarization, and what JetBrains calls “agentic coding” sub-tasks, according to the technical report on arXiv.

JetBrains open-sourced it under Apache 2.0, publishing weights on Hugging Face. The motivation is partly ecosystem positioning. JetBrains reported revenue of 15.06 billion CZK in 2024 (Wikipedia) and serves more than 15 million developers across its IDE lineup. Giving away a code model costs little for a company whose revenue comes from IDE subscriptions, and it creates gravitational pull toward JetBrains-specific tooling and integrations.

The release includes two post-trained variants: an Instruct model that answers directly, and a Thinking model that emits an explicit reasoning trace before producing its final answer (arXiv:2605.31268).

How does a 12B MoE with 2.5B active parameters actually work?

Mellum2’s architecture is built around three design choices that reduce per-token cost:

Mixture-of-Experts with sparse activation. The model has 64 experts per layer but activates only 8 per token, giving it 12B total parameters but a 2.5B active-parameter compute footprint. Per-token FLOPs are comparable to running a dense 2.5B model, which is what makes single-GPU serving plausible (CodersEra analysis).
Sliding Window Attention and Grouped-Query Attention. Three of every four layers use sliding-window attention, paired with Grouped-Query Attention using 4 KV heads. This gives up full-sequence attention in the windowed layers in exchange for lower memory pressure on long contexts, which matters when the model shares a developer workstation with a running IDE.
Multi-Token Prediction head. The MTP head predicts multiple tokens in parallel and doubles as a draft model for speculative decoding. In speculative decoding, a small draft model proposes tokens and the main model verifies them in a single forward pass; when the draft is accurate, this produces multiplicative throughput gains without changing output quality (arXiv:2605.31268).
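
To make the sparse activation in the first design choice concrete, here is a minimal top-k routing sketch in plain Python. The expert and top-k counts come from the figures above; everything else is an assumption for illustration: the scalar toy "experts", the random router logits, and the choice to softmax-normalize over only the selected experts (one common variant, not necessarily what Mellum2 does).

```python
import math
import random

NUM_EXPERTS = 64  # experts per layer, per the article
TOP_K = 8         # experts activated per token, per the article


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest logits, highest first.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (subtract the max for stability).
    # Assumption: real routers may normalize differently.
    peak = router_logits[chosen[0]]
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, experts, k=TOP_K):
    """Weighted sum of the outputs of the selected experts only."""
    out = 0.0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](x)  # the other experts are never called
    return out


# Toy setup: each "expert" is a scalar function, and the logits are random
# stand-ins for the output of a learned router.
rng = random.Random(0)
experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_EXPERTS)]
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(logits)
y = moe_forward(2.0, logits, experts)
print(len(routes), "experts used out of", NUM_EXPERTS)
```

All 64 experts have to exist in memory, but only the 8 selected ones do any work for a given token, which is where the gap between total and active parameters comes from.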

Pre-training spanned approximately 10.6 trillion tokens through a three-phase curriculum that shifted from general web data toward curated code and mathematical content. The model supports a 128K context window, extended via layer-selective YaRN (arXiv:2605.31268).
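
The speculative-decoding loop described for the MTP head can be sketched with toy stand-ins. This is the greedy variant only, and both `target_next` and `draft_next` are hypothetical functions invented for the example; in a real deployment the draft would come from the MTP head and verification would be one batched forward pass rather than a Python loop.

```python
def target_next(prefix):
    """Toy stand-in for the main model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix):
    """Toy draft: agrees with the target except at every fourth position."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq


def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, keep the prefix the target agrees with, add one target token."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(seq) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. The target checks each proposed position. A real model scores
        #    all k positions in a single forward pass; here it is a loop.
        verify_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, then stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the same pass yields one bonus token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], verify_passes


prompt = [3, 1, 4]
baseline = greedy_decode(prompt, 24)
fast, passes = speculative_decode(prompt, 24)
print("identical output:", fast == baseline, "| verification passes:", passes)
```

Because every kept token is one the target would have chosen anyway, the output matches plain greedy decoding; the saving is in the number of target passes, which depends entirely on how often the draft is right.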

How fast is Mellum2, and what do the benchmarks actually say?

JetBrains claims Mellum2 is “competitive with open-weight baselines in the 4B-14B range” on code generation, math, reasoning, tool use, and safety benchmarks, while delivering “more than 2x faster inference” as of the June 2026 release (Hugging Face launch post).

Both claims need context.

The “>2x faster inference” figure is vendor-claimed, and it compares Mellum2 against other open-weight models in the 4B-14B parameter range, not against commercial copilots like GitHub Copilot or Cursor. The Hugging Face launch post directs readers to the technical report for benchmark details. As of June 2026, no third party has published independent benchmarks or reproduced the reported scores.

The 2.5B active-parameter footprint does make a rough throughput prediction possible: per-token compute should resemble that of a dense 2.5B model on the same hardware, while memory must still hold all 12B parameters, inactive experts included. In FP16 the weights alone occupy about 24 GB (12B parameters at 2 bytes each), so a 24 GB card leaves almost no headroom for the KV cache and activations and is workable only with short contexts, if at all. Quantized variants are the realistic route on consumer boards: INT8 weights come to about 12 GB and INT4 to about 6 GB, so both fit on a 16 GB board, INT4 comfortably.
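
The memory side of that estimate is simple arithmetic, sketched below. It counts weights only and uses decimal gigabytes; KV cache, activations, runtime overhead, and the scale metadata that real quantization formats carry all come on top. The function name is made up for the example.

```python
def weight_gb(total_params, bits_per_weight):
    """Weight-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 12e9  # total parameter count, per the article

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
```

Note that the total parameter count, not the 2.5B active count, drives this number: every expert has to be resident even if a given token never touches it.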

What does self-hosting Mellum2 actually cost?

The economic argument for self-hosting breaks down along three axes:

GPU hardware. The 2.5B active-parameter footprint means Mellum2 is servable on commodity GPUs, according to the CodersEra analysis. A single consumer or workstation GPU should be sufficient for inference, and that GPU is the largest capital expense. Exact hardware recommendations have not been published as of June 2026.

Per-seat licensing. Commercial copilots charge per seat per month. As a back-of-envelope comparison at a 50-developer scale, the annual subscription bill for a product like GitHub Copilot Business exceeds the cost of a single workstation GPU, making a self-hosted Mellum2 server amortizable in under a year, assuming one GPU handles the team’s completion traffic.

Operational overhead. Someone has to maintain the inference server, handle updates, and monitor latency. This is the cost most teams underestimate. Self-hosting trades a metered cloud bill for engineering time.
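
A back-of-envelope break-even calculation ties the three axes together. Every number in the usage lines is a placeholder to replace with your own: the seat price is an assumed list price, and the hardware and operations figures are invented for illustration.

```python
def break_even_months(hardware_cost, seats, seat_price_per_month,
                      ops_cost_per_month=0.0):
    """Months until avoided subscription fees cover the hardware outlay.

    Returns None when monthly operating cost eats the whole saving.
    """
    monthly_saving = seats * seat_price_per_month - ops_cost_per_month
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


SEATS = 50            # team size used in the article's comparison
SEAT_PRICE = 19.0     # assumed per-seat monthly price; check current pricing
HARDWARE_COST = 4000  # hypothetical GPU workstation cost
OPS_COST = 500.0      # hypothetical monthly engineering time, in the same currency

print("ignoring ops:", break_even_months(HARDWARE_COST, SEATS, SEAT_PRICE))
print("with ops:    ", break_even_months(HARDWARE_COST, SEATS, SEAT_PRICE, OPS_COST))
```

The operations term is the one that moves the answer most: once maintenance time approaches the subscription bill, the function returns `None` and there is no break-even point at all.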

The wildcard is quality. If Mellum2’s completion quality is noticeably below Copilot or Cursor on a team’s actual codebase, the cost savings are irrelevant. Developers who find completions unhelpful simply stop using them, at which point the hardware investment is sunk.

Where does Mellum2 fit in a real code-assistance stack?

JetBrains is explicit: Mellum2 is a “focal model,” not a frontier replacement. In the JetBrains AI blog announcement, the company positions it for high-frequency, latency-sensitive tasks where a large model would be wasteful: routing requests to the right tool, summarizing RAG results, orchestrating sub-agents, and generating inline completions that need to appear within roughly 200ms.

The implied architecture is multi-model. A large frontier model handles complex reasoning and multi-step planning; Mellum2 handles the fast, repetitive work that does not require frontier-level intelligence but does require low latency and low per-token cost. Inline completion needs sub-200ms responses at high throughput; complex refactoring needs minutes of chain-of-thought reasoning. Running a frontier model for the former is expensive; running a small model for the latter produces bad results. This is consistent with JetBrains’ own IDE strategy: IntelliJ IDEA supports multiple AI agents side-by-side, including Junie, Claude Agent, Codex, and GitHub Copilot, with new agents discoverable via the Agent Client Protocol.

JetBrains also unified IntelliJ IDEA into a single distribution in December 2025, merging Community and Ultimate editions (Wikipedia). This consolidation suggests a platform strategy where the IDE itself is the integration point and models are pluggable components behind it. Open-sourcing Mellum2 under Apache 2.0 extends that strategy: third-party tooling can adopt the model without licensing friction, and the model’s improvements benefit the broader JetBrains ecosystem.

For teams building their own stack, Mellum2’s sweet spot is the middle layer: fast enough for inline completion, capable enough for structured sub-tasks like tool calls, summarization, and classification, and small enough to run on hardware you already own. It is not the model you call for architectural reasoning across a 50-file refactor. Whether multi-model orchestration actually outperforms a single good model in practice remains an open question. The engineering complexity of routing, context management, and failure handling across multiple models is non-trivial. JetBrains has the IDE integration layer to make this work, but teams building their own stacks will shoulder that complexity themselves.
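
As a rough illustration of what that routing layer involves, here is a minimal sketch. The tier names, task labels, and latency threshold are all hypothetical and do not correspond to any JetBrains API; the backends are plain callables standing in for model clients.

```python
from dataclasses import dataclass

# Hypothetical tier names, invented for this example.
FOCAL = "focal-model"        # small, fast, self-hosted (the Mellum2 role)
FRONTIER = "frontier-model"  # large, slower, typically remote

# Task labels are assumptions drawn from the use cases named in the article.
FAST_TASKS = {"inline_completion", "tool_call", "summarization", "classification"}


@dataclass
class Request:
    task: str
    latency_budget_ms: int


def route(req):
    """Send known high-frequency tasks to the focal tier, everything else up."""
    return FOCAL if req.task in FAST_TASKS else FRONTIER


def handle(req, backends, frontier_min_budget_ms=2000):
    """Call the routed backend and decide what to do when the focal tier fails."""
    tier = route(req)
    try:
        return tier, backends[tier](req)
    except RuntimeError:
        # Escalate only if the caller can wait for the slower tier; an
        # inline completion is better dropped than delivered late.
        if tier == FOCAL and req.latency_budget_ms >= frontier_min_budget_ms:
            return FRONTIER, backends[FRONTIER](req)
        return tier, None


backends = {
    FOCAL: lambda req: f"focal handled {req.task}",
    FRONTIER: lambda req: f"frontier handled {req.task}",
}
print(handle(Request("inline_completion", 200), backends))
print(handle(Request("multi_file_refactor", 120_000), backends))
```

Even this toy version has to make a policy decision about failures, which hints at where the engineering cost of a multi-model stack accumulates.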

Frequently Asked Questions

How does Mellum2’s per-token compute compare to the original Mellum?

The original Mellum was a 4B dense model, so all 4B parameters fired on every forward pass. Mellum2, despite being three times larger in storage at 12B total, activates only 2.5B parameters per token, so its per-token compute is actually lower than its predecessor's. The tradeoff is VRAM: the full 12B weight set must reside in memory even though only a fraction is exercised per token, so serving hardware costs more than a dense 2.5B model would demand.
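
The comparison can be put in rough numbers with the common rule of thumb of about 2 FLOPs per active parameter per generated token. That rule is an approximation that ignores attention and router overhead, so treat the result as an order-of-magnitude guide rather than a measurement.

```python
def forward_gflops_per_token(active_params):
    """Rule of thumb: ~2 FLOPs per active parameter per generated token."""
    return 2 * active_params / 1e9


mellum1 = forward_gflops_per_token(4e9)    # dense: all 4B parameters active
mellum2 = forward_gflops_per_token(2.5e9)  # MoE: 2.5B of 12B parameters active

print(f"Mellum:  {mellum1:.1f} GFLOPs per token")
print(f"Mellum2: {mellum2:.1f} GFLOPs per token")
print(f"ratio:   {mellum2 / mellum1:.3f}")
```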

When should teams pick the Thinking variant over Instruct?

The Thinking model emits a reasoning trace before its final answer, adding output tokens and latency. For inline completion, where JetBrains targets sub-200ms response times, the Instruct variant is the practical choice. The Thinking variant suits agentic tasks like multi-step tool calls or debugging, where a visible chain of thought improves reliability and makes the model’s decision process auditable, at a higher per-query token cost.

Can Mellum2 process images or diagrams in a repository?

Mellum2 is text-and-code only. It cannot parse screenshots, architecture diagrams, or image assets. Teams whose workflows involve visual context, such as reproducing UI bugs from screenshots or generating code from design mockups, need a separate vision-capable model. This is a deliberate scope choice: the sliding-window attention and MTP head are built for token sequences, not cross-modal reasoning.

Why would a company open-source a model competitors could adopt?

JetBrains has never accepted external investment, and its revenue comes from IDE subscriptions, not model inference fees. Releasing Mellum2 under Apache 2.0 costs nothing directly while making JetBrains IDEs the native integration point for a model whose architecture was designed for JetBrains workflows. Competitors can use the weights, but adopting them means building on architectural decisions optimized for a rival’s product line.