Million-token context windows let you feed entire codebases, legal contracts, and hours of audio to an LLM in a single prompt. In practice, most models lose coherence well before their advertised limit—and the middle of your document is where accuracy degrades most. The sweet spot for reliable use sits around 60–70% of the claimed maximum, with specific failure modes that every practitioner needs to understand before building on top of these capabilities.
The Long-Context Landscape in 2026
Ultra-long context has gone from a Google research demonstration to a mainstream spec in under two years. In early 2024, a 200K window was a premium differentiator. By early 2026, several models offer 1 million tokens or more as standard, and outliers push into double-digit millions.
| Model | Context Window | Provider | Notes |
|---|---|---|---|
| Llama 4 Scout | 10M tokens | Meta | 17B active / 109B total params (MoE); released April 2025 |
| Gemini 3 Pro | 1M tokens | Google | [Updated March 2026] Official context window is 1M tokens |
| Gemini 2.5 Pro | 1M tokens | Google | [Updated March 2026] Ships at 1M; 2M expansion announced |
| Claude Opus 4.6 / Sonnet 4.6 | 1M tokens | Anthropic | [Updated March 2026] GA at standard pricing as of March 13, 2026 |
| Llama 4 Maverick | 1M tokens | Meta | 400B MoE model |
| Gemini 2.0 Flash | 1M tokens | Google | Released February 2025 |
| GPT-4.1 | 1M tokens | OpenAI | Extended from GPT-4o’s 128K; released April 2025 |
| Claude 4 Sonnet | 200K standard / 1M beta | Anthropic | 1M beta available since August 2025 (tier 4+ API) |
| GPT-4o | 128K tokens | OpenAI | Previous-generation flagship |
| Claude 4 Haiku | 200K tokens | Anthropic | Standard API limit is 200K; the 500K Enterprise window was a claude.ai product-tier feature, not specific to Haiku [Updated March 2026] |
| Qwen2.5-1M | 1M tokens | Alibaba | Open-source; available for self-hosting |
The open-weight story is notable: Meta’s Llama 4 Scout reaches 10 million tokens—exceeding any closed-source model’s production-available window [Updated March 2026]—while running on a 17B active parameter mixture-of-experts architecture. Alibaba’s Qwen2.5-1M enables self-hosted million-token processing for organizations with the infrastructure to run it.1
What You Can Actually Do Well
Full-Repository Code Analysis
The clearest real-world application is codebase comprehension. Gemini 1.5 Pro can process codebases exceeding 30,000 lines of code in a single pass, and tools like Repomix—nominated for the JSNation Open Source Awards in 2025—exist specifically to pack entire repositories into a single AI-readable file targeting Claude, Gemini, GPT-4, and DeepSeek.2
The practical workflow: serialize your codebase to a single file, feed it with a precise question about architecture, dependency chains, or bug location. This eliminates the fragmentation problem of RAG pipelines where relevant context is split across retrieved chunks. For debugging complex interactions between files, a 200K+ context with the full codebase loaded often outperforms a RAG system that retrieves individual files in isolation. For a detailed breakdown of how frontier models compare on code tasks specifically, see AI code generation benchmarks 2026.
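The serialization step can be as simple as a directory walk. The sketch below packs source files into one prompt-ready string with path headers; the extension filter, header format, and `pack_repository` name are illustrative choices, not Repomix's actual output format:

```python
# Sketch: serialize a repository into one AI-readable file, in the
# spirit of tools like Repomix. Path headers let the model resolve
# cross-file references in a single prompt.
from pathlib import Path

SOURCE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java"}

def pack_repository(repo_root: str, max_chars: int = 2_000_000) -> str:
    """Concatenate source files under repo_root, newest-path-order,
    stopping before the pack exceeds max_chars."""
    parts = []
    total = 0
    for path in sorted(Path(repo_root).rglob("*")):
        if path.suffix not in SOURCE_EXTENSIONS or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        section = f"\n===== FILE: {path.relative_to(repo_root)} =====\n{text}"
        total += len(section)
        if total > max_chars:  # stay under the model's effective window
            break
        parts.append(section)
    return "".join(parts)
```

A character budget rather than a token budget keeps the sketch dependency-free; a production version would count tokens with the target model's tokenizer.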
Llama 4 Scout’s 10M window is positioned explicitly for “reasoning over vast codebases”—large enough to handle most real enterprise monorepos in a single context.3
Legal Due Diligence
Long-context LLMs are entering legal workflows as auxiliary tools for due diligence and contract analysis. One documented pattern: investment funds use LLMs to search large document sets for answers to a query set defined by human analysts and lawyers, with the LLM acting as a fast first-pass filter.4 Claude 3.5 Sonnet’s 200K context handles complex legal texts—full merger agreements, regulatory filings, multi-party contracts—that previously required chunking; some evaluations cite accuracy degradation of less than 5% across its full range.
The implementation standard across serious legal deployments involves mandatory human review of LLM outputs. The LLM accelerates review, not replaces it.
Multimodal Long-Context Processing
Gemini 1.5 Pro demonstrated the most striking capability here: processing one hour of video and eleven hours of audio in a single inference call.5 Near-perfect retrieval (>99.7% recall) on multi-needle tasks across the full 1M token window in Google’s technical report makes this the most validated long-context application in the research literature.
For audio, practical applications span doctor dictation processing, interview transcription at scale, and long-form podcast analysis—workflows where maintaining context across an entire session matters more than per-segment accuracy.
Where Long Context Actually Breaks Down
Lost in the Middle
The foundational research on long-context failure comes from Nelson Liu et al. at Stanford and the University of Washington, published in Transactions of the Association for Computational Linguistics in 2024.6 The paper documents a U-shaped performance curve: accuracy peaks when relevant information appears at the beginning or end of the context, and degrades when critical information sits in the middle.
In extreme cases, models with long context prompts performed worse than zero-shot prompting with no context at all. Performance drops by more than 30% when relevant information shifts from position-optimal (start or end) to middle placement.
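The position effect is straightforward to measure yourself: place a known fact at varying depths in filler text and score retrieval at each depth. A minimal sketch of the prompt-construction side (the `build_niah_prompt` helper and the passcode needle are illustrative; a real harness would send each prompt to the model under test and score the answers):

```python
# Sketch: build needle-in-a-haystack prompts with the needle at varying
# depths, the standard way to chart the U-shaped position curve.
def build_niah_prompt(filler: str, needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    cut = int(len(filler) * depth)
    return filler[:cut] + "\n" + needle + "\n" + filler[cut:]

# Sweep depths; lost-in-the-middle predicts the worst scores near 0.5.
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
prompts = [
    build_niah_prompt("lorem ipsum " * 1000, "The passcode is 7291.", d)
    for d in depths
]
```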
The root cause is structural: Rotary Position Embedding (RoPE) introduces a long-term decay effect that causes models to systematically de-emphasize middle-context tokens. This isn’t a bug that will be patched—it’s a property of the attention mechanism itself.
Context Rot
Chroma Research’s “Context Rot” study evaluated 18 state-of-the-art models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3, finding that reliability decreases significantly with longer inputs even on simple retrieval and text replication tasks.7 The degradation accelerates when:
- Question-evidence similarity is low (the answer looks semantically different from the question)
- Distractors in the context are semantically similar to the answer
- The task requires two-hop reasoning rather than direct retrieval
Adobe Research in February 2025 demonstrated this empirically in the NoLiMa benchmark (arXiv:2502.05167, ICML 2025): when questions and answers lack literal lexical overlap—requiring latent-association inference rather than direct matching—performance degrades sharply with context length; 11 models drop below 50% of their short-context baseline by 32K tokens. [Updated March 2026]

The framing that crystallizes the problem: each new token in the context depletes a finite “attention budget,” diluting the model’s capacity to attend to relevant information. A million-token context doesn’t give you a million tokens of reliable attention—it distributes a fixed attention capacity across a much larger pool.
The Effective Window Gap
A 2025 paper (arXiv: 2510.05381) found that even when models can retrieve all relevant information, performance still degrades by 13.9% to 85% as input length increases within the model’s claimed context limits.8 The advertised number and the effective number diverge substantially. GPT-4o, with a 128K advertised window, has shown strong primacy and recency effects with only approximately 8K effective tokens for certain tasks—the middle 70–80% of the prompt is largely ignored.
The Performance Gap: Benchmarks Confirm the Theory
RULER (arXiv: 2404.06654), NVIDIA’s synthetic benchmark released in 2024, evaluates long-context models across 13 tasks at lengths from 4K to 128K tokens.9 The key finding: “Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as context length increases.” Simple needle-in-a-haystack retrieval flatters model capabilities; any task requiring aggregation, multi-hop reasoning, or variable tracking reveals the degradation earlier.
The canonical Needle-in-a-Haystack test results illustrate the spread:
- Gemini 1.5 Pro: 99% recall at 1M tokens; >99.7% on multi-needle tasks (100 unique needles) at 1M tokens; 99.2% at 10M tokens5
- GPT-4: Accuracy drops significantly when processing sequences beyond 10% of maximum capacity (BABILong benchmark)
- Sequential-NIAH (EMNLP 2025): A harder variant requiring sequential needle extraction saw the best-performing model reach only 63.5% accuracy on 8K–128K contexts10
The gap between Gemini’s retrieval performance and the field illustrates that long-context quality varies enormously by model and task. Advertised context length is not a proxy for long-context capability.
The Cost Reality
| Provider / Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Gemini Flash-Lite | $0.075 | $0.30 |
| GPT-4.1 [Updated March 2026] | $2.00 | $8.00 |
| Claude Sonnet 4 (≤200K) | $3.00 | $15.00 |
| Claude Sonnet 4 (>200K) | $6.00 | $22.50 |
| Claude Opus 4 | $15.00 | varies |
| DeepSeek (self-hosted) | ~$0.48 | varies |
Pricing as of March 2026; check provider documentation for current rates.
The cost structure creates a meaningful architecture decision. A 500K token prompt to Claude Sonnet 4 crosses into the extended-context pricing tier, costing $3.00 for that single input call. For high-volume applications, that arithmetic favors retrieval-augmented generation over full-context loading—even when the context window technically fits.
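The tier arithmetic is worth making explicit. A sketch using the March 2026 rates from the table above (the model keys and `input_cost` helper are illustrative; check provider docs for current rates):

```python
# Sketch: input-cost arithmetic for a long-context call, using the
# March 2026 rates from the table above.
RATES_PER_M_INPUT = {
    "gemini-flash-lite": 0.075,
    "gpt-4.1": 2.00,
    "claude-sonnet-4-standard": 3.00,   # prompts up to 200K tokens
    "claude-sonnet-4-extended": 6.00,   # prompts over 200K tokens
}

def input_cost(model: str, prompt_tokens: int) -> float:
    """Dollar cost of the input side of a single call."""
    if model == "claude-sonnet-4":
        # Anthropic's extended-context rate applies once the prompt
        # crosses 200K tokens.
        tier = "extended" if prompt_tokens > 200_000 else "standard"
        model = f"claude-sonnet-4-{tier}"
    return RATES_PER_M_INPUT[model] * prompt_tokens / 1_000_000

# A 500K-token prompt lands in the extended tier:
# 0.5M tokens at $6.00/M = $3.00 for one input call.
```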
The infrastructure cost is just as significant. A 128K context prompt on Llama 3.1-70B consumes roughly 40GB of HBM just for KV cache storage. At 1M tokens, KV cache requires approximately 15GB per concurrent user. This means million-token context serving is GPU-memory-intensive per session, not just per-token computationally.
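The ~40GB figure follows from Llama 3.1-70B's public architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128) assuming fp16 cache entries; a back-of-envelope sketch:

```python
# Sketch: back-of-envelope KV cache sizing for a decoder-only model.
# Defaults are Llama 3.1-70B's published architecture numbers; fp16
# (2 bytes per element) is assumed for the cache.
def kv_cache_bytes(seq_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Keys and values for every layer, KV head, and position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len

# ~0.33 MB per token, so a single 128K-token session needs ~40GB of HBM.
gb_at_128k = kv_cache_bytes(128_000) / 1e9
```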
Latency compounds the challenge. A 100K token prefill on a 70B model takes 8–10 seconds of GPU compute. Prefill latency exceeds two minutes for maximum context lengths without optimization on current hardware. Microsoft Research’s MInference technique (NeurIPS 2024 Spotlight) reduces 1M-token prefill on LLaMA-3-8B from 30 minutes to 3 minutes on a single A100 GPU by exploiting sparse attention patterns—still not instant, but a 10× improvement.11
Engineering Around the Limits
Given the failure modes, production long-context applications use a set of mitigations:
```python
# Structuring prompts to maximize long-context reliability
system_prompt = """[CRITICAL CONTEXT - READ FIRST]
{most_important_facts}

[BACKGROUND DOCUMENTS]
{long_document_body}

[TASK - REFER BACK TO CRITICAL CONTEXT]
{question_with_reference_to_key_facts}"""
```

Placing critical information at both the beginning and end of the context—the “bookending” pattern—directly counters the lost-in-the-middle effect. For tasks requiring synthesis across the full document, chunked analysis with a final aggregation pass often outperforms single-pass full-context processing.
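The chunked-analysis pattern is a plain map-reduce over the document. The sketch below assumes an arbitrary `call_model` client and illustrative chunk sizes; the overlap preserves context across chunk boundaries:

```python
# Sketch: chunked analysis with a final aggregation pass.
# `call_model` stands in for any LLM client (prompt in, text out).
from typing import Callable, List

def chunk_text(text: str, chunk_chars: int = 100_000,
               overlap: int = 2_000) -> List[str]:
    """Split text into overlapping windows."""
    step = chunk_chars - overlap
    return [text[i:i + chunk_chars] for i in range(0, len(text), step)]

def map_reduce_analyze(document: str, question: str,
                       call_model: Callable[[str], str]) -> str:
    # Map: ask the question of each chunk independently.
    partials = [call_model(f"{question}\n\n{chunk}")
                for chunk in chunk_text(document)]
    # Reduce: one aggregation pass over the partial answers, which is
    # short enough to sit inside the model's reliable window.
    joined = "\n---\n".join(partials)
    return call_model(f"Synthesize a final answer to: {question}\n\n{joined}")
```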
For cost-sensitive applications, Microsoft’s LLMLingua achieves up to 20× prompt compression with minimal accuracy loss—reducing a 1M token input to 50K before inference while preserving the semantic content that matters.12
Where the Benchmark Frontier Stands Now (March 2026)
The retrieval benchmark that has emerged as the clearest discriminator between models at long range is MRCR v2 (Multi-Range Contextual Retrieval), which hides multiple pieces of information at different depths inside a million-token document and requires the model to surface all of them accurately. Unlike Needle-in-a-Haystack, which tests single-fact retrieval at a known depth, MRCR v2 requires the model to maintain attention across the full distribution of the document.
As of March 2026, Claude Opus 4.6 leads the field on MRCR v2 at 1M tokens with a score of 78.3%, edging past Gemini 2.5 Pro and GPT-4.1.13 The score is instructive not just for its absolute number but for the gap it reveals: even the best-performing model is wrong roughly one in five times when performing multi-range retrieval at maximum context. For applications requiring complete accuracy—legal extraction, compliance auditing, reference verification—human review of LLM long-context outputs remains a production requirement, not an optional safeguard.
The Gemini line has historically led on the raw Needle-in-a-Haystack benchmark (>99% recall at 1M tokens in the Gemini 1.5 Pro technical report), but MRCR v2 is harder to game by architectural choice. The comparison across these two benchmark types illustrates a recurring pattern in long-context evaluation: models that excel at targeted single-needle retrieval do not uniformly excel at distributed multi-needle extraction.
One structural development worth tracking: the competitive gap between closed-source and open-weight models at long context has narrowed substantially in the past year. Llama 4 Scout’s 10M token window and Qwen2.5-1M’s self-hosted capability mean that organizations with GPU infrastructure no longer need to route long-context workloads exclusively through API providers. The primary remaining advantage of commercial APIs at long context is retrieval quality and the absence of infrastructure overhead—not maximum advertised length.
For RAG pipelines operating at scale, this has a practical implication: the economics of self-hosted long-context inference are now comparable to chunked retrieval over a vector database for certain workload profiles, particularly when prompt caching is not available or effective.
Frequently Asked Questions
Q: What’s the practical maximum context length for reliable LLM performance? A: At time of writing, approximately 60–70% of a model’s advertised maximum. For a 200K token window, budget for reliable performance up to roughly 130K tokens. Gemini 1.5 Pro is the notable outlier, with near-perfect retrieval demonstrated at 1M tokens in its technical report—but for most models and most tasks, the effective window is substantially shorter than the spec sheet.
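As a planning rule of thumb, that budget reduces to one line; the 0.65 fraction below is simply the midpoint of the 60–70% range:

```python
# Sketch: budget prompts against the effective window, not the spec sheet.
def reliable_token_budget(advertised: int, fraction: float = 0.65) -> int:
    """Tokens to plan for, given an advertised context window."""
    return int(advertised * fraction)

reliable_token_budget(200_000)  # ~130K tokens for a 200K-advertised model
```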
Q: Is a million-token context window better than RAG for codebase analysis? A: For moderate-sized codebases that fit within the effective context window, direct ingestion outperforms RAG by preserving cross-file relationships that retrieval systems miss. For enterprise monorepos exceeding even 10M tokens, RAG or intelligent snapshot tools remain necessary. The right approach depends on codebase size and query type.
Q: Why does LLM performance degrade in the middle of long contexts? A: Rotary Position Embedding (RoPE), which most transformer models use to encode token positions, introduces a long-term decay effect that systematically de-emphasizes middle-context tokens. The architecture is optimized for primacy and recency. Research mitigation strategies include sparse attention mechanisms and modified position encodings, but no production model has fully eliminated this property.
Q: How much does a 1M token prompt cost across major providers? A: At current early 2026 rates: approximately $6.00 for input on Claude Sonnet 4 (in the extended-context tier), $2.00 on GPT-4.1 [Updated March 2026], and $0.075 on Gemini Flash-Lite. Output tokens are priced 4–8× higher than input tokens across the market, so for analysis-heavy workloads with short outputs, input cost dominates. DeepSeek’s self-hosted pricing at ~$0.48/M tokens disrupts the economics for on-premise deployments. Note that Anthropic’s Claude Opus 4.6 and Sonnet 4.6 removed the extended-context pricing surcharge in March 2026, making 1M-token calls billable at standard rates.
Q: Can open-source models handle million-token contexts effectively? A: Meta’s Llama 4 Scout officially supports 10M tokens, and Alibaba’s Qwen2.5-1M supports 1M for self-hosting. The infrastructure cost is the barrier: a 1M token context requires approximately 15GB of KV cache memory per concurrent session, making this practical only for organizations with dedicated GPU infrastructure. Optimization techniques like NVFP4 KV cache quantization cut memory requirements by up to 50%, but the hardware floor remains significant.
Footnotes
1. Elvex. “Context Length Comparison: Leading AI Models in 2026.” https://www.elvex.com/blog/context-length-comparison-ai-models-2026
2. Repomix GitHub repository. https://github.com/yamadashy/repomix
3. VentureBeat. “Meta’s Llama 4 launches with long-context Scout and Maverick models.” https://venturebeat.com/ai/metas-answer-to-deepseek-is-here-llama-4-launches-with-long-context-scout-and-maverick-models-and-2t-parameter-behemoth-on-the-way
4. Nature/Humanities & Social Sciences Communications. “Large Language Models in Legal Systems.” https://www.nature.com/articles/s41599-025-05924-3
5. Google DeepMind. “Gemini 1.5 Technical Report.” arXiv:2403.05530. https://arxiv.org/abs/2403.05530
6. Liu, Nelson F. et al. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics, 2024. arXiv:2307.03172. https://arxiv.org/abs/2307.03172
7. Chroma Research. “Context Rot: How Increasing Input Tokens Impacts LLM Performance.” https://research.trychroma.com/context-rot
8. “Context Length Alone Hurts LLM Performance Despite Perfect Retrieval.” arXiv:2510.05381. https://arxiv.org/html/2510.05381v1
9. Hsieh, Cheng-Ping et al. “RULER: What’s the Real Context Size of Your Long-Context Language Models?” arXiv:2404.06654. https://arxiv.org/abs/2404.06654
10. “Sequential-NIAH: A Harder Needle-in-a-Haystack Benchmark.” arXiv:2504.04713. https://arxiv.org/abs/2504.04713
11. Jiang, Huiqiang et al. “MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention.” NeurIPS 2024 Spotlight. arXiv:2407.02490. https://arxiv.org/abs/2407.02490
12. Microsoft. LLMLingua. https://github.com/microsoft/LLMLingua
13. Anthropic. “Claude Opus 4.6 and Sonnet 4.6 now support 1M tokens at standard pricing.” https://www.anthropic.com/news/1m-context