STaD treats language models as black boxes and generates controlled, incremental scaffolded variations of benchmark tasks to measure the minimum support a model needs to succeed. Posted to arXiv on April 20, 2026 and accepted at ACL Findings 2026, the IBM Research framework reveals that models with nearly identical aggregate scores can fail on entirely different subskill combinations — a blind spot with direct implications for how teams read coding benchmark leaderboards1.

What STaD Measures (and What It Doesn’t)

STaD — Scaffolded Task Design — is not a new benchmark score. It is a diagnostic method that takes existing tasks and generates scaffolded versions with varying levels of intermediate support, then measures the minimum scaffolding level a model requires to succeed2. The framework needs no access to weights, activations, or training data; it treats the model as a black box and relies entirely on input-output behavior.
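As a rough illustration of that probing loop, the sketch below sweeps scaffolding levels for a single task and records the lowest level at which the model's answer matches the reference. The `scaffold_task` and `query_model` callables and the exact-match check are stand-ins for illustration, not STaD's published interface.

```python
from typing import Callable, Optional

def min_scaffolding_level(
    task: str,
    gold_answer: str,
    scaffold_task: Callable[[str, int], str],  # hypothetical: returns the task with `level` support steps injected
    query_model: Callable[[str], str],         # hypothetical: black-box call to the model under test
    max_level: int = 4,
) -> Optional[int]:
    """Return the lowest scaffolding level (0 = no support) at which the answer is correct."""
    for level in range(max_level + 1):
        prediction = query_model(scaffold_task(task, level))
        if prediction.strip() == gold_answer.strip():
            return level
    return None  # unsolved even at full scaffolding
```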

The current experiments cover three reasoning benchmarks — Tree-of-Thought Arithmetic, GSM8K, and Math-Hard — tested on six instruction-tuned models: Qwen2.5-3B and 7B-Instruct, Granite-3.3-2B and 8B-Instruct, Llama-3.2-3B-Instruct, and Llama-3.1-8B-Instruct2. The study does not include code-generation benchmarks such as HumanEval or MBPP. Its scope is strictly mathematical and temporal reasoning on small-to-medium models, and the authors — Sungeun An, Swanand Ravindra Kadhe, Shailja Thakur, Chad DeLuca, and Hima Patel — acknowledge this constraint explicitly2.

When Two Models Tie on the Leaderboard But Fail Differently

Original aggregate scores on the three benchmarks span wide ranges: GSM8K from 78.4% to 93.62%, ToT Arithmetic from 18.16% to 46.12%, and Math-Hard from 25.36% to 46.44%2. Those percentages compress critical differences. On ToT Arithmetic, Qwen2.5-3B and Granite-3.3-8B scored 31.84% and 32.46% respectively, a gap of just over six-tenths of a percentage point. STaD’s subskill decomposition showed they struggled with different reasoning combinations, demonstrating that comparable aggregate performance does not imply identical skill gaps2.
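A minimal sketch of the comparison that aggregate scores hide: given each model's set of failing subskill combinations, the overlap can be near zero even when pass rates are tied. The subskill labels below are hypothetical, not the paper's taxonomy.

```python
def failure_profile_overlap(fails_a: set, fails_b: set) -> float:
    """Jaccard overlap between two models' sets of failing subskill combinations."""
    if not fails_a and not fails_b:
        return 1.0
    return len(fails_a & fails_b) / len(fails_a | fails_b)

# Hypothetical labels: tied aggregate scores, disjoint gaps.
model_a_fails = {"overlap+discrete", "multi-step"}
model_b_fails = {"discrete-only", "combinatorial+multi-step"}
print(failure_profile_overlap(model_a_fails, model_b_fails))  # 0.0
```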

The pattern repeats at larger scale. For Qwen2.5-7B on ToT Arithmetic, scaffolding the Overlap and Discrete reasoning skills together lifted success to 44%, versus 22% when scaffolding Overlap alone2. Half of the joint-scaffolding success rate disappears when only one of the two skills is supported; the gain reflects the interaction between the skills, not the remediation of a single missing one. No aggregate pass rate captures that distinction.
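One way to quantify that interaction under black-box access, sketched below on hypothetical helpers, is to compare success rates under four scaffold conditions (neither skill, each skill alone, both together) and check how far the joint condition exceeds the sum of the single-skill gains.

```python
def interaction_probe(tasks, answers, scaffold, query_model,
                      skill_a="overlap", skill_b="discrete"):
    """Success rates under four scaffold conditions plus a simple super-additivity term.

    `scaffold(task, skills)` and `query_model(prompt)` are hypothetical callables.
    """
    conditions = {"none": (), "a_only": (skill_a,), "b_only": (skill_b,),
                  "both": (skill_a, skill_b)}
    rates = {}
    for name, skills in conditions.items():
        hits = sum(query_model(scaffold(t, skills)).strip() == a.strip()
                   for t, a in zip(tasks, answers))
        rates[name] = hits / len(tasks)
    gain_a = rates["a_only"] - rates["none"]
    gain_b = rates["b_only"] - rates["none"]
    interaction = (rates["both"] - rates["none"]) - (gain_a + gain_b)
    return rates, interaction  # interaction > 0: the pair helps more than the sum of its parts
```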

The Minimum Scaffolding Level: A New Diagnostic Metric

STaD introduces a granular diagnostic signal: the minimum scaffolding level required for a model to solve a task. The framework injects intermediate answers into scaffolded steps and measures whether accuracy recovers. A placeholder ablation confirms the gains are driven by the injected information, not by task restructuring alone. When researchers replaced injected values with placeholders, accuracy collapsed to an average of 11.8% across models — with individual scores between 7.6% and 14.4%, near baseline2.
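A hedged sketch of the ablation's mechanics: keep the scaffold structure but blank out the injected intermediate values, then compare accuracy. The 'Answer:' pattern and the `query_model` callable are assumptions for illustration; the framework's actual prompt format may differ.

```python
import re

def to_placeholder(scaffolded_prompt: str, placeholder: str = "<value>") -> str:
    """Blank out injected numeric intermediate answers while keeping the scaffold steps.

    Assumes injected values follow an 'Answer:' cue; the real format may differ.
    """
    return re.sub(r"(Answer:\s*)-?\d+(?:\.\d+)?", r"\1" + placeholder, scaffolded_prompt)

def ablation_gap(prompts, answers, query_model):
    """Accuracy with real injected values minus accuracy with placeholders."""
    def accuracy(ps):
        return sum(query_model(p).strip() == a.strip()
                   for p, a in zip(ps, answers)) / len(answers)
    return accuracy(prompts) - accuracy([to_placeholder(p) for p in prompts])
```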

On Math-Hard, the dominant bottleneck across all six models was translating word problems into algebra, accounting for 41 to 63 failure cases per model2. Multi-step reasoning and combinatorial reasoning followed. This level of specificity — naming the exact subskill and counting its failures — is what aggregate benchmarks discard when they compress performance into a single percentage.

From Math-Hard to Code-Hard: Applying the STaD Lens to HumanEval and MBPP

The paper’s experiments are confined to math and temporal reasoning, so extending its conclusions to code generation is an analytical leap, not a direct result. HumanEval and MBPP test compositional skills — parsing requirements, selecting algorithms, managing state across functions, handling edge cases — that are structurally similar to the multi-step reasoning STaD isolates. The same logic applies: a model that passes the majority of a benchmark’s tasks could still be failing the remainder on a single persistent bottleneck, such as API boundary reasoning, or on a combinatorial interaction between loop logic and exception handling.
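As a purely hypothetical illustration (nothing like this appears in the paper), scaffold levels for a HumanEval/MBPP-style task might prepend increasing amounts of intermediate support to the prompt (a plan, then edge-case hints, then a helper stub) before asking the model to complete the function.

```python
def scaffold_code_task(problem: str, level: int) -> str:
    """Level 0 = raw problem; higher levels prepend a plan, edge-case hints, then a helper stub."""
    supports = [
        "",  # level 0: no support
        "# Step 1: restate the requirement as concrete input/output examples before coding.\n",
        "# Step 2: list the edge cases to handle (empty input, single element, duplicates).\n",
        "# Step 3: a helper you may reuse:\n#   def normalize(xs): return sorted(set(xs))\n",
    ]
    return "".join(supports[: level + 1]) + problem
```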

For engineering teams choosing models for production code generation, the STaD lens suggests a shift in eval pipeline design. Subskill-level diagnostics should supplement — and in some cases replace — leaderboard rank as the primary selection signal when the downstream workload is compositional. A model that ranks third on an aggregate coding benchmark but shows clean skill-composition heatmaps on targeted scaffolding may be more reliable than a higher-ranked model with clustered failure modes.

Limitations: Teacher Bias, Filtering Skew, and Scope

The framework carries three acknowledged constraints that shape how much weight to give its conclusions. First, STaD depends on a teacher model to decompose tasks into subskills. The authors run a cross-teacher robustness check but report that, on average, only 3.7 of the top 5 bottleneck categories agree across different teachers2. Second, dataset filtering introduces a difficulty skew: because the framework retains only examples where teacher decompositions are consistent, the Math-Hard scaffolded set shows a 9.81-percentage-point accuracy gap versus the original benchmark, suggesting an easier subset2. Third, the scope is limited to small and medium instruction-tuned models; whether the same skill-composition patterns hold at 70B+ scale or on frontier code models remains unverified.
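The cross-teacher check can be approximated, assuming per-subskill failure tallies are available for each teacher's decomposition, by counting how many top-k bottleneck categories two teachers share; averaging that count over models and benchmarks yields a figure like the reported 3.7 out of 5. A minimal sketch:

```python
from collections import Counter

def topk_agreement(bottlenecks_a: Counter, bottlenecks_b: Counter, k: int = 5) -> int:
    """Number of subskills shared between two teachers' top-k bottleneck categories."""
    top_a = {skill for skill, _ in bottlenecks_a.most_common(k)}
    top_b = {skill for skill, _ in bottlenecks_b.most_common(k)}
    return len(top_a & top_b)
```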

The release is nonetheless a complete artifact drop. The paper1, datasets, and source code all arrived within a 72-hour window starting April 20, 2026 — a level of transparency that makes replication and adaptation feasible even as the methodology’s generalization to code benchmarks awaits independent testing.

Frequently Asked Questions

How does STaD differ from CheckList or Decomposed Prompting?

CheckList and Decomposed Prompting also break tasks into subcomponents, but they use decomposition as a training or prompting intervention intended to improve model performance. STaD uses scaffolding purely as a diagnostic probe — it varies the information available at test time to isolate which compositional gaps are load-bearing without ever attempting to fix them. The framework is measurement infrastructure, not a remediation technique.

What does a team need to adapt STaD to their own coding benchmarks?

The open-source release (Apache 2.0) supplies the scaffolding infrastructure, but teams must provide their own teacher model for task decomposition — and that choice is the gating factor. Cross-teacher experiments showed that only 3.7 out of 5 top bottleneck categories agreed when different teachers decomposed the same tasks, so switching teachers can reorder which failures appear most critical. Teams should budget for validating decomposition quality against human-annotated references before trusting the resulting heatmaps.

Can the dataset filtering step hide the hardest failure modes?

Yes. STaD retains only examples where teacher decompositions are internally consistent, which systematically excludes the hardest problems — precisely those where compositional failures are most likely to cluster. The retained Math-Hard subset was 9.81 percentage points easier than the original benchmark, and the same exclusion effect could be even more pronounced in domains like code generation, where decompositions are noisier and disagreement among teachers tends to be higher.
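A simple sanity check for this skew, sketched below with hypothetical inputs, is to compare baseline accuracy on the retained subset against the full benchmark; a large positive gap signals that the filter kept an easier slice.

```python
def filtering_skew(full_results: dict, retained_ids: set) -> float:
    """Baseline accuracy on the retained subset minus accuracy on the full benchmark, in points.

    `full_results` maps task IDs to pass/fail booleans for the unscaffolded model;
    `retained_ids` holds the task IDs that survived the consistency filter.
    """
    full_acc = sum(full_results.values()) / len(full_results)
    kept = [ok for task_id, ok in full_results.items() if task_id in retained_ids]
    return 100 * (sum(kept) / len(kept) - full_acc)
```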

Would code-specific fine-tuning change the interaction patterns STaD finds?

The study tests only general instruction-tuned models between 2B and 8B parameters. Code-specialized checkpoints often redistribute competence across subskills — a model fine-tuned on function-call corpora may handle API boundary reasoning well but show new gaps in algorithmic selection. That means the skill-interaction profiles STaD maps for math reasoning (where algebraic translation paired with multi-step computation dominates) may not structurally resemble code-task profiles, where the dominant interaction is more likely between API surface reasoning and internal state management.

Footnotes

  1. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (arXiv Abstract)

  2. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (arXiv HTML Full Text)

  3. IBM Research STaD Scaffolded Benchmarks Dataset on Hugging Face

  4. IBM Granite Debug Tools repository (STaD source code) on GitHub

Sources

  1. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (arXiv Abstract). Primary source, accessed 2026-04-24.
  2. STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs (arXiv HTML Full Text). Primary source, accessed 2026-04-24.
  3. IBM Research STaD Scaffolded Benchmarks Dataset on Hugging Face. Vendor source, accessed 2026-04-24.
  4. IBM Granite Debug Tools repository (STaD source code) on GitHub. Vendor source, accessed 2026-04-24.
