
Do Reasoning LLMs Waste Tokens? OckBench Tries to Measure It

OckBench scores 37 reasoning LLMs on token efficiency alongside accuracy, finding comparably accurate models differ by up to 26× in token cost under per-token billing.


Frontier reasoning models are getting more accurate, but the token cost of that accuracy is spiraling. OckBench, an open-source benchmark from researchers at Georgia Tech, MIT, and NVIDIA, now scores LLMs on how many tokens they burn to reach a correct answer, not just whether the answer is right. The authors’ 37-model leaderboard (v3, updated June 3, 2026) shows comparably accurate models differing by up to 26× in token usage. For anyone paying per output token, that ratio is the entire question.

What OckBench measures

The benchmark, described in arXiv:2511.05722 (Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu), covers three domains: math (open-ended problems scored by answer extraction), coding (code generation scored against test cases), and science (multiple-choice questions). Datasets are available on HuggingFace.

The composite metric is OckScore, a unified score that rewards models achieving high accuracy with fewer output tokens, penalizing verbosity without banning it outright. For the formula and its parameters, see the paper.

A key methodological detail: OckBench applies a Differentiation Filter that isolates tasks exposing the efficiency gap between models, selecting problems where token variance is maximized and where accuracy avoids floor and ceiling effects. This filter surfaces problems that reveal efficiency differences, but it also means the benchmark problem set is curated for discriminative power rather than for representing typical production workloads.
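The selection logic described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the accuracy band (0.2 to 0.8), the use of population variance, and the toy data are all assumptions made for illustration. It only shows the general idea of dropping floor and ceiling problems and then ranking the rest by how much token usage varies across models.

```python
from statistics import mean, pvariance

def select_discriminative(problems, min_acc=0.2, max_acc=0.8, top_k=2):
    """Keep problems whose cross-model accuracy avoids floor and ceiling,
    then rank the survivors by cross-model token variance (highest first).

    `problems` maps a problem id to a list of (is_correct, output_tokens)
    pairs, one pair per model. Thresholds are illustrative assumptions,
    not values taken from the OckBench paper.
    """
    scored = []
    for pid, runs in problems.items():
        acc = mean(1.0 if ok else 0.0 for ok, _ in runs)
        if not (min_acc <= acc <= max_acc):
            continue  # everyone solves it or nobody does: no signal
        token_var = pvariance([tokens for _, tokens in runs])
        scored.append((token_var, pid))
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]

# Toy data, three models per problem (all values are made up).
toy = {
    "p1": [(True, 500), (True, 520), (True, 510)],        # ceiling
    "p2": [(True, 800), (False, 9000), (True, 30000)],    # mixed, wide spread
    "p3": [(False, 4000), (False, 4100), (False, 3900)],  # floor
    "p4": [(True, 1000), (False, 1200), (True, 1100)],    # mixed, narrow spread
}
selected = select_discriminative(toy)
print(selected)  # ['p2', 'p4']
```

A filter of this shape keeps the problems where models disagree most on token spend, which is why the resulting set says more about relative efficiency than about typical workloads.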

The benchmark is open source under the MIT license. It supports OpenAI, Gemini, vLLM, and SGLang endpoints, tracks reasoning tokens separately from answer tokens, and supports incremental caching for fault-tolerant runs.

The leaderboard: efficiency favors commercial models

According to the OckBench Selected leaderboard (v3, updated June 3, 2026), GPT-5.5 at medium reasoning effort ranks first with an OckScore of 82.2, achieving 86% accuracy at 4,692 average tokens. The top eight settings are all GPT-5.x or Claude Opus variants. As of that update, no open-weight model reaches the top ten.

The accuracy gap between open-source and commercial models has narrowed, but the token-efficiency gap has not: the authors report that open-source models consume far more output tokens than commercial counterparts for comparable accuracy. The per-token-intelligence gap persists.

The Overthinking Tax

The benchmark’s most striking finding is what the authors call the Overthinking Tax: smaller models over-generate reasoning tokens to compensate for lower capacity, paradoxically increasing deployment cost while delivering worse results.

The authors frame this as the core proposition: token cost, not correctness alone, decides deployment value. A model that is somewhat more accurate but far more expensive per query is not automatically the better choice.

The reasoning-effort sweet spot

GPT-5.5 at medium reasoning effort leads the leaderboard, and the authors report that frontier commercial models are “rapidly co-optimizing both dimensions” of accuracy and efficiency. Each new generation pushes further into the upper-left corner of the accuracy-versus-tokens plot. The authors also demonstrate that efficiency is tractable: training-free model interpolation and difficulty-aware reinforcement learning both significantly improve OckScore, according to their experiments.

The engineering implication is straightforward: if your task tolerates a few percentage points of accuracy loss, running a current-generation model at moderate reasoning effort is likely the cost-optimal configuration. Extra tokens at higher effort buy accuracy at a rate that may not justify the linear cost increase under per-token billing.

What this means for procurement

The OckBench numbers, if they hold under independent replication, change how teams should evaluate reasoning-mode models. The paper reports that models solving the same problem with similar accuracy can exhibit up to a 5.0× difference in token length; across the full model set, the gap widens substantially. Under per-token billing, token ratios translate directly into cost ratios.

A procurement framework based on OckBench’s logic would weigh three axes: accuracy on your specific task distribution, average tokens per correct answer, and the marginal cost of each additional accuracy point. The third axis is the one most teams currently ignore. Whether a model that spends significantly more tokens for marginally better accuracy is worth the cost depends on the cost of a wrong answer in your specific application, not on the benchmark number itself.
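A rough sketch of the second and third axes, assuming the leaderboard's average token count is taken over all queries (correct or not). The helper names are hypothetical, and the two settings use the Claude Opus 4.6 and 4.7 figures quoted in the FAQ below. Results are kept in tokens rather than dollars so that no price has to be invented; multiply by your own per-token rate.

```python
from dataclasses import dataclass

@dataclass
class Setting:
    name: str
    accuracy: float    # fraction of problems solved, between 0 and 1
    avg_tokens: float  # average output tokens per query

def tokens_per_correct(s: Setting) -> float:
    """Expected output tokens spent for each correct answer."""
    return s.avg_tokens / s.accuracy

def marginal_tokens_per_point(cheap: Setting, costly: Setting) -> float:
    """Extra average tokens per query paid for each extra accuracy point."""
    gain_points = (costly.accuracy - cheap.accuracy) * 100
    if gain_points <= 0:
        raise ValueError("costly setting must be more accurate than cheap one")
    return (costly.avg_tokens - cheap.avg_tokens) / gain_points

# Figures as reported on the OckBench leaderboard and quoted in this article.
opus_47 = Setting("Claude Opus 4.7", 0.830, 7_481)
opus_46 = Setting("Claude Opus 4.6", 0.845, 28_582)

print(round(tokens_per_correct(opus_47)))                   # about 9,013
print(round(tokens_per_correct(opus_46)))                   # about 33,825
print(round(marginal_tokens_per_point(opus_47, opus_46)))   # about 14,067
```

On these numbers, each of the 1.5 extra accuracy points costs roughly 14,000 additional output tokens per query. Whether that is worth paying is the application-specific judgment the third axis is meant to force.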

The Overthinking Tax also has implications for open-weight model adoption. If a smaller open model burns far more tokens than a commercial model for comparable accuracy, the hardware cost of self-hosting that model at production throughput may exceed the API cost of the commercial alternative. The benchmark makes that comparison explicit.
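That comparison can be set up as back-of-the-envelope arithmetic. Everything numeric in the sketch below is a placeholder, not a figure from the paper or from any vendor: the API price, the GPU hourly rate, the aggregate decoding throughput, and both token counts are assumptions chosen only to show the shape of the calculation.

```python
def api_cost_per_query(output_tokens: float, usd_per_million_tokens: float) -> float:
    """Per-query cost under linear per-output-token billing."""
    return output_tokens * usd_per_million_tokens / 1_000_000

def self_host_cost_per_query(output_tokens: float,
                             gpu_usd_per_hour: float,
                             tokens_per_second: float) -> float:
    """Per-query cost when paying for GPU time.

    `tokens_per_second` is the aggregate decode throughput of one GPU
    across all batched requests, so GPU time is shared between queries.
    """
    gpu_seconds = output_tokens / tokens_per_second
    return gpu_seconds * gpu_usd_per_hour / 3600

# Placeholder inputs: replace every value with your own measurements.
api = api_cost_per_query(output_tokens=5_000, usd_per_million_tokens=10.0)
hosted = self_host_cost_per_query(output_tokens=40_000,
                                  gpu_usd_per_hour=2.0,
                                  tokens_per_second=500.0)
print(f"API: ${api:.4f} per query, self-hosted: ${hosted:.4f} per query")
```

Because the self-hosted cost scales with tokens generated, a model that needs several times more tokens per answer needs correspondingly cheaper or faster hardware to break even.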

Caveats and limitations

Every figure in this article comes from the OckBench authors’ own measurements in a single preprint (submitted November 2025; v3 updated June 2026). No independent third-party replication has been published. OckScore’s formula and its weighting parameters are the authors’ design choices; alternative formulations would produce different rankings. The Differentiation Filter selects for problems that maximally discriminate between models on token variance, which may not reflect any given team’s actual workload.

Models referenced are post-2025 versions whose capabilities and behaviors are reported by the benchmark authors, not independently verified. Open-source models may have been run at different quantization levels or serving configurations that affect both token usage and accuracy, and the paper does not fully standardize these.

The benchmark covers only math, coding, and science tasks. Results should not be generalized to multilingual generation, long-context retrieval, or agentic tool use without specific evidence.

OckBench’s contribution is not the final word on reasoning efficiency. It is the first benchmark that makes the token-cost dimension legible enough to act on. Whether the rankings survive replication is an open question. The conceptual frame, that reasoning tokens have a price and that price should be measured, is likely to persist regardless.

Frequently Asked Questions

Does OckBench apply to streaming or multi-turn reasoning tasks?

No. Every run uses single-shot greedy decoding with temperature set to zero, so the benchmark does not capture how models behave under sampling, multi-turn tool use, or chain-of-thought with retries. Teams whose workflows depend on temperature-scaled outputs or agentic loops should treat OckBench numbers as a lower bound on actual token spend, not a predictor of production costs.

How much does switching from Claude Opus 4.6 to 4.7 save on tokens?

Opus 4.7 reaches 83.0% accuracy at 7,481 average tokens, while Opus 4.6 hits 84.5% at 28,582. That is a 3.8× token reduction for a 1.5 percentage point accuracy trade-off, lifting OckScore from 71.0 to 77.4. The same pattern shows up in Gemini: 3.1 Pro gains 22.5 accuracy points over 2.5 Pro while using 6.4× fewer tokens.

What happens if you run GPT-5.5 with reasoning entirely disabled?

With reasoning effort set to “none,” GPT-5.5 answers at 26.0% accuracy using only 260 tokens (OckScore 25.74). That score nearly matches MiniMax-M2.5, which spends 57,346 tokens to reach 44.5% accuracy. A team that needs only rough correctness on simple tasks can get competitive per-query economics by turning reasoning off entirely, accepting that accuracy drops sharply.

How does the per-problem token variance affect budget forecasting?

Two models with similar accuracy can differ by 5.0× in token length on a single problem, stretching to about 26× (roughly 1,600 vs 42,000 tokens) across the full 37-model set. Per-token billing is linear, so that spread becomes a roughly 26× cost spread per query. Budget forecasts that average token usage across models can underestimate spending on the most verbose models by up to an order of magnitude.
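To make the linearity concrete, here is a small sketch using the rounded token counts quoted above. The price and the monthly query volume are hypothetical placeholders; the ratio between the two bills does not depend on either.

```python
def monthly_cost(avg_output_tokens: float,
                 queries_per_month: int,
                 usd_per_million_tokens: float) -> float:
    """Monthly spend under linear per-output-token billing."""
    return avg_output_tokens * queries_per_month * usd_per_million_tokens / 1_000_000

PRICE = 10.0       # hypothetical USD per million output tokens
QUERIES = 100_000  # hypothetical monthly query volume

terse = monthly_cost(1_600, QUERIES, PRICE)
verbose = monthly_cost(42_000, QUERIES, PRICE)
ratio = verbose / terse

print(terse, verbose, ratio)  # 1600.0 42000.0 26.25
```

With these rounded counts the ratio works out to about 26, and it stays the same at any price or volume, which is why a forecast should be built per model rather than from a cross-model average.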

Sources

  1. OckBench: Measuring the Efficiency of LLM Reasoning (primary; accessed 2026-06-05)
  2. OckBench (arXiv 2511.05722) (primary; accessed 2026-06-05)
  3. ockbench/ockbench dataset on HuggingFace (primary; accessed 2026-06-05)
  4. OckBench GitHub repository (community; accessed 2026-06-05)