IBM Research’s STaD framework, published April 20, 2026, decomposes benchmark tasks into isolated sub-skills and measures where models fail when those skills must compose.[1] The result is a per-model failure profile rather than a single pass rate, and it exposes a problem aggregate leaderboards cannot solve: two models with identical scores may break for entirely different reasons.[2]

The Aggregate Score Problem: Why a 5-Point MMLU-Pro Delta Doesn’t Tell You Where the Model Breaks

Aggregate benchmarks are losing their resolving power. Community-compiled estimates place GPT-5.4 at roughly 78% on MMLU-Pro, Claude Opus 4.6 around 76%, Gemini 3.1 Pro near 75%, and DeepSeek V4 at 74%.[3] When the top tier clusters within five points, the headline number no longer distinguishes which model to deploy for a specific workload. Meanwhile, contamination concerns remain acute for older, widely used benchmarks like GSM8K and MMLU, where test questions have been public long enough to appear in training data.[4]

IBM Research has separately argued that critical evaluation details, such as whether MMLU questions were presented as multiple-choice, are often buried in academic papers. As Elizabeth Daly put it: “Is the benchmark telling you that the model’s really good at algebra, or only when the test questions are presented as multiple-choice?”[5] STaD does not fix contamination or saturation directly, but it changes the question from “which model scored higher?” to “which sub-skill is missing, and can you fix it?”[2]

How STaD Works: Task Decomposition + Scaffolded Variation as a Diagnostic Probe

STaD, or Scaffolded Task Design, requires three components: a benchmark to decompose, a capable teacher model to generate scaffolded variants, and a judge model to verify outputs.[2] The IBM team used GPT-OSS-120B as the teacher and Llama-3.3-70B and Mistral-Large as judges.[2]
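To make the moving parts concrete, here is a minimal sketch of how those three components might be wired together. The dataclass and its field names are illustrative assumptions made for this article, not the authors' released code or API.

```python
# Minimal sketch of the three STaD components described above; the structure and
# field names are assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class STaDConfig:
    benchmark: str            # benchmark to decompose, e.g. "ToT Arithmetic"
    teacher_model: str        # generates the scaffolded task variants
    judge_models: list[str]   # verify whether a candidate model's output is correct

config = STaDConfig(
    benchmark="ToT Arithmetic",
    teacher_model="gpt-oss-120b",                     # teacher used by the IBM team
    judge_models=["llama-3.3-70b", "mistral-large"],  # judges used by the IBM team
)
```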

The process works by breaking a benchmark task into its constituent sub-skills, then generating variants that provide intermediate scaffolding at different levels. If a model solves the fully scaffolded version but fails the minimally scaffolded one, the gap identifies a compositional bottleneck rather than a missing base capability. An ablation control replaced scaffolded intermediate values with placeholder tokens, dropping accuracy to approximately 12% and confirming that the gains come from revealing sub-tasks, not from reformatting the prompt.[2]
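As a rough illustration of that diagnostic logic, the sketch below compares a model's outcome at two scaffolding levels and interprets the gap. The helper `solve_and_judge`, the level names, and the decision rule are assumptions made for exposition; the paper's actual variant-generation and judging pipeline is more involved.

```python
# Sketch of the scaffolding-gap diagnostic described above. `solve_and_judge` is a
# hypothetical callable that runs the task variant and returns True if a judge
# model accepts the answer; it is not part of any released STaD code.
def diagnose(model, task, solve_and_judge):
    results = {
        level: solve_and_judge(model, task, scaffolding=level)
        for level in ("minimal", "full")
    }
    if results["full"] and not results["minimal"]:
        # Base sub-skills are present (the fully scaffolded variant is solved), but
        # the model cannot assemble them unaided: a compositional bottleneck.
        return "compositional bottleneck"
    if not results["full"]:
        # Fails even with every intermediate step exposed: the task counts as
        # intractable for this model, pointing at a missing base capability.
        return "intractable even with full scaffolding"
    return "solved with minimal scaffolding"
```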

What IBM Research Found: Same Score, Different Failure — Qwen3B vs Granite8B and Llama3B vs Granite2B on ToT Arithmetic

The team evaluated six small-to-mid-size models: Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, Granite-3.3-2B-Instruct, and Granite-3.3-8B-Instruct.[2] Baseline pass rates on ToT Arithmetic ranged from 18.16% to 46.12%, GSM8K from 78.4% to 93.62%, and Math-Hard from 25.36% to 46.44%.[2]

The clearest demonstration of the aggregate-score limitation came from two model pairs. Qwen2.5-3B-Instruct and Granite-3.3-8B-Instruct both scored approximately 32% on ToT Arithmetic (31.84% vs 32.46%). Yet on the skill combination Overlapping + Overlapping + Discrete + Discrete + JSON, where skills repeat because the same capability is exercised at multiple reasoning steps, 23.07% of Qwen3B's tasks remained intractable even with full scaffolding, compared to 7.69% for Granite8B.[2] A leaderboard ranks them as equivalent; STaD says one finds temporal-overlap reasoning fundamentally harder.
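The sketch below shows one way such a per-combination figure can be tallied from task-level outcomes. The record layout is an assumption for illustration; the 23.07% and 7.69% figures come from the paper, not from this code.

```python
# Fraction of tasks that stay unsolved even with full scaffolding, per skill
# combination. The (combination, solved) record format is an assumption.
from collections import defaultdict

def intractable_fraction(task_results):
    totals, failures = defaultdict(int), defaultdict(int)
    for combo, solved_with_full_scaffolding in task_results:
        totals[combo] += 1
        if not solved_with_full_scaffolding:
            failures[combo] += 1
    return {combo: failures[combo] / totals[combo] for combo in totals}
```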

A second pair illustrates a structurally different failure pattern. Llama-3.2-3B-Instruct and Granite-3.3-2B-Instruct scored 19.2% and 18.16% on ToT Arithmetic respectively, but their bottlenecks diverge in kind rather than degree. Llama3B tends to require more support on isolated skills such as Natural-language date/time parsing and Calendar arithmetic, whereas Granite2B’s bottlenecks appear when multiple skills must be composed together — for example, Unit conversion + Arithmetic on durations and Relative-date reasoning + Calendar arithmetic + Multi-format date conversion.[2] A leaderboard ranks these models as equivalent too. STaD says one needs stronger base skills and the other needs compositional training.

The Sub-Skill Taxonomy: 80 Skills Across Three Benchmarks and the Bottlenecks That Appear Repeatedly

Across the three benchmarks, the IBM authors identified 80 sub-skills: 20 for ToT Arithmetic, 40 for GSM8K, and 20 for Math-Hard.[2] The bottlenecks that recurred were mathematical and specific: time-interval overlap, date parsing, unit conversion, sequential quantity tracking, and translating word problems to algebra.[2]

On the skill combination “Unit conversion + Arithmetic on durations + JSON output,” the scaffolding requirement, measured as the number of intermediate steps needed, ranged from 1.45 to 3.68 across the six models.[2] The fraction of tasks deemed intractable even with full scaffolding ranged from 0% to 59%, a model-specific spread that no aggregate score captures.[2]
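The sketch below shows one way that scaffolding-requirement figure could be computed from per-task outcomes, averaging the number of intermediate steps a model needed before it succeeded. The record format, and the use of None to mark intractable tasks, are assumptions rather than the paper's implementation.

```python
# Average intermediate scaffolding steps needed per skill combination, ignoring
# tasks that remain intractable (represented here as None). The record format is
# an assumption for illustration.
def mean_steps_needed(task_results):
    by_combo = {}
    for combo, steps in task_results:
        if steps is not None:   # None marks tasks unsolved at every scaffolding level
            by_combo.setdefault(combo, []).append(steps)
    return {combo: sum(v) / len(v) for combo, v in by_combo.items()}
```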

Frequently Asked Questions

Does STaD diagnose failures on code-generation or NLP benchmarks?

Not yet. The authors explicitly limit scope to multi-step mathematical reasoning — NLP, code generation, and instruction-following benchmarks are untested. Teams selecting models for those workloads have no STaD-style sub-skill diagnostic available today.

How does STaD differ from LiveBench’s approach to benchmark problems?

LiveBench (ICLR 2025 Spotlight) fixes contamination by rotating in fresh questions monthly from recent arXiv papers and news across 18 tasks in 6 categories. STaD fixes a different problem: even with clean data, aggregate scores mask which compositional skills are missing. They are complementary — one sanitizes the input, the other raises diagnostic resolution.

Once STaD identifies a bottleneck sub-skill, what should a team actually do?

The practical fork is either fine-tune on the specific bottleneck sub-skills (e.g., targeted exercises on date parsing or unit conversion) or route tasks that trigger the failing composition to a model that doesn’t share that gap. Isolated base-skill weaknesses are good fine-tuning candidates; compositional-assembly failures are often cheaper to route around.
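A hedged sketch of that fork as a simple decision rule is shown below. The profile format, the "isolated" versus "compositional" labels, and the routing fallback are assumptions made for illustration, not guidance from the paper.

```python
# Turn a STaD-style failure profile into a remediation plan. The profile format
# and the policy itself are illustrative assumptions; the paper prescribes neither.
def plan_remediation(failure_profile, fallback_model):
    plan = []
    for combo, info in failure_profile.items():
        if info["kind"] == "isolated":
            # Weak base skill (e.g. date parsing): a natural target for focused fine-tuning.
            plan.append(("fine-tune", combo))
        else:
            # Compositional-assembly failure: often cheaper to route affected tasks
            # to a model that does not share the gap.
            plan.append(("route", combo, fallback_model))
    return plan
```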

What would it take to apply STaD to code-generation evaluation?

Two missing pieces: a hand-crafted skill taxonomy for code tasks analogous to the 80 mathematical sub-skills IBM defined, and a teacher model capable of generating scaffolded code-task variants. Expanding to new domains would also likely amplify the existing teacher-model sensitivity — cross-teacher agreement is already only 3.7/5 within mathematical reasoning.

Footnotes

  1. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (arXiv abstract)

  2. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (full HTML paper)

  3. MMLU Leaderboard 2026 — TokenMix

  4. LLM Benchmarks in 2026: What They Prove and What Your Business Actually Needs — LXT

  5. Benchmark-to-benchmark comparisons made easier — IBM Research

Sources

  1. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (arXiv abstract). Primary source. Accessed 2026-04-24.
  2. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (full HTML paper). Primary source. Accessed 2026-04-24.
  3. Benchmark-to-benchmark comparisons made easier — IBM Research. Vendor. Accessed 2026-04-24.
  4. MMLU Leaderboard 2026 — TokenMix. Community. Accessed 2026-04-24.
  5. LLM Benchmarks in 2026: What They Prove and What Your Business Actually Needs — LXT. Analysis. Accessed 2026-04-24.
