Topic

#llm-evaluation

1 article exploring llm-evaluation. Expert insights and analysis from our editorial team.

Articles

Models & Research

STaD Exposes What HumanEval Hides: Compositional Skill Gaps in LLMs That Aggregate Benchmarks Miss

IBM Research's STaD shows models with identical benchmark scores can fail on different subskills, making leaderboard rank a poor proxy for compositional code generation.