Topic

#model-evaluation

1 article exploring model-evaluation. Expert insights and analysis from our editorial team.

Showing 1–1 of 1 articles

Articles

Newest first
Models & Research

STaD's Scaffolded Tasks Isolate the Compositional Skill Gaps That Aggregate LLM Benchmarks Hide

IBM Research's STaD framework exposes compositional skill gaps aggregate benchmarks miss: two models at 32% on ToT Arithmetic needed fundamentally different fixes.