Does Giving AI Agents More Skills Help? A Controlled SkillsBench Study

The answer is yes, but only barely, and only up to a point. Three arXiv preprints from 2026, anchored by a controlled study from May, find that whether a skill exists in an agent’s catalog matters far more than how that skill is described. Adding worked examples or tuning abstraction levels produces deltas so small they’re indistinguishable from noise, and letting models write their own skills yields no benefit on average.

Having a skill matters. How it’s written does not.

The headline study, arXiv:2605.31408 (submitted May 29), covers 1,800 evaluation runs (30 tasks, two reasoning-enabled models, six skill conditions, five trials per cell). It tested two questions: whether agents perform better with curated skills than with no skills, and whether the granularity of skill documentation shifts outcomes independently.
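The run count follows directly from the factorial design. A minimal sketch of the grid, assuming placeholder names for the six skill conditions (the text above gives only their number, not their labels):

```python
from itertools import product

N_TASKS = 30
MODELS = ["GPT-5.5", "DeepSeek V4-Flash"]
# Placeholder labels: only the count of six conditions is taken from the study.
CONDITIONS = [f"condition_{i}" for i in range(1, 7)]
N_TRIALS = 5

# One cell = one (task, model, condition) combination.
cells = list(product(range(N_TASKS), MODELS, CONDITIONS))
total_runs = len(cells) * N_TRIALS

print(len(cells), total_runs)  # 360 1800
```

Each of the 360 cells is run five times, which is what makes per-cell pass rates and resampled confidence intervals possible.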

Skill availability, the binary condition of having a relevant curated skill at all, raised task-mean pass rates by 26.7 to 36.0 percentage points for GPT-5.5 and 18.0 to 26.0 pp for DeepSeek V4-Flash, relative to a no-skill baseline (Section 4, availability contrasts). This is the clearest signal in the paper.

Then the authors varied how those skills were presented: low-abstraction guidance (concrete, step-by-step) versus high-abstraction guidance (conceptual, principle-based). The difference was +0.7 pp for GPT-5.5 and −6.7 pp for DeepSeek V4-Flash, with 95% bootstrap confidence intervals crossing zero (Section 4, granularity contrasts). Adding a single worked example to medium-abstraction guidance shifted pass rates by +0.7 pp (GPT-5.5) and +1.3 pp (DeepSeek V4-Flash) (Section 4, worked-example contrasts).
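To make "confidence intervals crossing zero" concrete, here is a minimal sketch of a percentile bootstrap over per-task pass-rate differences. The resampling unit (tasks), the number of resamples, the function names, and the sample deltas are all assumptions for illustration; they are not the study's procedure or data.

```python
import random

def bootstrap_ci(task_deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-task deltas.

    task_deltas: one value per task, e.g. pass rate under condition A
    minus pass rate under condition B for that task.
    """
    rng = random.Random(seed)
    n = len(task_deltas)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(task_deltas, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def crosses_zero(ci):
    """True when the interval includes zero, i.e. no detectable effect."""
    lo, hi = ci
    return lo <= 0.0 <= hi

# Made-up deltas for 30 tasks: half improve by 0.2, half get worse by 0.2.
example_deltas = [0.2, -0.2] * 15
example_ci = bootstrap_ci(example_deltas)
print(example_ci, crosses_zero(example_ci))
```

An interval that includes zero means the data cannot distinguish the contrast from no effect at all, which is the situation described for the granularity contrasts.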

These are not rounding errors on top of a real effect. They are the noise floor. The study’s design isolates the variable cleanly, and the pattern is consistent: the availability effect is several times larger than any presentation effect, and more than an order of magnitude larger for GPT-5.5.

The original SkillsBench baseline

This controlled study extends the earlier SkillsBench benchmark (arXiv:2602.12670, February 2026), which evaluated 84 tasks across 11 domains using 7 agent configurations and 7,308 total trajectories. That earlier work established the broader context: curated skills raised average pass rate by 16.2 pp overall (Section 5, aggregate results).

The domain-level variance was enormous. Software Engineering saw only a +4.5 pp lift from skills, while Healthcare jumped +51.9 pp (Section 5, domain breakdown). The original benchmark also identified a structural finding worth noting: focused skills covering 2 to 3 modules outperformed comprehensive documentation, and smaller models equipped with curated skills matched the performance of larger models without them (Section 6, skill design analysis).

When skills actively hurt

The most uncomfortable finding in the original SkillsBench is not that skills sometimes fail to help. It is that 16 of 84 tasks showed negative deltas when skills were added (Section 5, negative-delta analysis). Skills can make agents worse.

This happens when retrieval selects a near-relevant but wrong skill, or when the agent follows the skill’s procedure faithfully but the procedure itself is misaligned with the actual task. The 16-task negative-delta set is a concrete argument for catalog pruning: every additional skill in the catalog increases the surface area for misretrieval, and the original study already shows that comprehensive documentation underperforms focused skills of 2 to 3 modules.

For teams building Claude Skills catalogs or MCP tool libraries, the implication is direct. Adding a skill has a fixed benefit (the existence effect) and a variable cost (the retrieval-selection penalty). The first few high-coverage, well-targeted skills deliver most of the value. Every skill after that competes for the model’s attention without proportional upside, and some of them actively interfere.

Self-generated skills are a dead end

Both SkillsBench papers converge on the same finding about self-generated skills: they provide no benefit on average (Section 5, self-generation results). Models cannot reliably author the procedural knowledge they need to consume. This is a specific failure mode worth highlighting.

The companion skill-evolution study (arXiv:2605.30621, May 28) reinforces this from a different angle. It separates the ability to update an external skill configuration (the skill-evolution step) from the ability to benefit from that configuration (the task-solving step). The finding: skill-updating capability is flat across model tiers. Even Qwen3.5-9B produces configuration updates with gains comparable to Claude Opus 4.6 (Section 4, cross-model evolution comparison).

But the benefit from those configurations is non-monotonic. Weak models benefit little (they can’t follow the updated skills they’re given), mid-tier models benefit most, and strong models benefit less than mid-tier (Section 5, benefit-by-tier analysis). The authors trace weak-tier failures to two modes: failure to activate relevant skill artifacts, and activation followed by failure to follow through faithfully (Section 5, failure-mode analysis).

The practical upshot: investing in better skill-evolving agents is misplaced effort. The bottleneck is the task-solving agent’s ability to retrieve and follow skills, not the skill-authoring agent’s ability to write them.

The mid-tier sweet spot

The non-monotonic benefit curve deserves attention. If mid-tier models gain the most from skills, then skill catalogs are most valuable not at the frontier (where models can muddle through without help) but in the middle of the capability distribution. This has obvious implications for cost optimization: a cheaper mid-tier model equipped with a focused skill set can match a frontier model that doesn’t need skills but costs 5 to 10x more per inference.

The original SkillsBench reported this directly: smaller models with skills matched larger models without them (Section 6). The controlled study’s granularity non-results reinforce it. If the skill just needs to exist, the documentation bar for effective mid-tier deployment is low.

A practical playbook for skill catalogs

The three papers together suggest a specific strategy for teams maintaining agent skill catalogs:

Cover the right task classes first. The existence effect (16 to 36 pp lift) dwarfs everything else. Identify the task domains where your agents struggle most and write one focused skill per domain.
Keep skills narrow. The original SkillsBench shows focused skills of 2 to 3 modules outperform comprehensive documentation (Section 6). A skill that tries to cover every edge case is a skill that will be misapplied.
Stop polishing. The granularity study’s worked-example and abstraction-level contrasts are indistinguishable from zero (arXiv:2605.31408, Section 4). Time spent refining a skill’s prose is better spent writing a new skill for an uncovered task class.
Prune aggressively. The 16/84 negative-delta tasks mean some skills in your catalog are actively harmful. Review task-level performance, not just aggregate metrics. A skill that helps 8 tasks and hurts 2 may be a net negative if those 2 are high-stakes.
Do not rely on self-generation. Models produce unhelpful procedural skills on average (arXiv:2602.12670, Section 5). Human-authored or curated skill sets remain the only validated approach.
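The pruning step in the list above can be sketched as a task-level review. This is a minimal illustration, assuming you already log per-task pass rates with and without a given skill; the function names, the data layout, and the stakes weights are hypothetical, not something the papers prescribe.

```python
def negative_delta_tasks(results, threshold=0.0):
    """Return tasks where adding the skill lowered the pass rate.

    results maps task id -> (pass_rate_without_skill, pass_rate_with_skill).
    """
    flagged = {}
    for task, (without_skill, with_skill) in results.items():
        delta = with_skill - without_skill
        if delta < threshold:
            flagged[task] = delta
    return flagged

def weighted_net_delta(deltas, weights):
    """Sum per-task deltas, scaled by how much each task matters.

    weights maps task id -> stakes multiplier; missing tasks default to 1.0.
    """
    return sum(weights.get(task, 1.0) * d for task, d in deltas.items())

# Hypothetical log for one skill: it helps task_a and hurts task_b.
results = {"task_a": (0.5, 0.75), "task_b": (0.5, 0.25)}
print(negative_delta_tasks(results))  # {'task_b': -0.25}

# Eight low-stakes wins and two high-stakes losses can still net out negative.
deltas = {f"win_{i}": 0.125 for i in range(8)}
deltas.update({"loss_0": -0.25, "loss_1": -0.25})
weights = {"loss_0": 4.0, "loss_1": 4.0}
print(weighted_net_delta(deltas, weights))  # -1.0
```

The point of the weighting is that an aggregate pass-rate gain can hide a skill that fails on exactly the tasks you care about most.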

What the study doesn’t establish

The controlled study uses 30 tasks, two models (GPT-5.5 and DeepSeek V4-Flash), and a pinned SkillsBench version. Generalization to all agent architectures, all model families, and all task domains is not demonstrated. The skill-evolution study references Claude Opus 4.6; as of this writing, the current release is Opus 4.8. And the domain-specific variance in the original benchmark (a +4.5 pp lift in Software Engineering versus +51.9 pp in Healthcare) means any blanket recommendation should be qualified by domain.

What holds across all three papers is the directional signal: existence beats granularity, self-authored skills don’t work, and the skill-catalog design problem is one of curation and coverage, not documentation quality. That finding is sturdy enough to survive the next model generation.

Frequently Asked Questions

If even small models can produce decent skill updates, why do self-generated skills fail?

The evolution study measured whether models can update an already-structured configuration, while SkillsBench measured whether models can generate procedural knowledge from scratch. A Qwen3.5-9B model can tweak a structured config file and produce gains comparable to Claude Opus 4.6, but the same model cannot reliably author the step-by-step procedures it needs to consume. Auto-tuning existing skill parameters is viable; auto-authoring new ones is not.

How should teams pressure-test vendor claims about skill-driven performance gains?

The SkillsBench setup (fixed task set, multiple trials per cell, bootstrap confidence intervals) provides a template. Vendor demos typically show single-run results on cherry-picked tasks without reporting the no-skill baseline. When a vendor claims skills improve performance by X percent, check whether X exceeds the 16-36 pp controlled availability effect and whether they isolated availability from prompt-engineering or model-selection confounds.

Does the domain variance mean some teams should skip building skills entirely?

The +4.5 pp Software Engineering lift suggests coding tasks are already well-represented in training data, reducing the marginal value of added procedural knowledge. Healthcare’s +51.9 pp reflects the opposite: specialized procedures models lack. Teams in data-rich domains should expect smaller gains and weigh the 16/84 negative-delta risk more heavily; teams in specialized domains should prioritize coverage.

What’s the practical risk that the 30-task controlled result doesn’t generalize?

The original benchmark’s 16 negative-delta tasks warn that task-specific effects are large, and a practitioner’s specific task mix may land in a negative-delta region even when the aggregate signal is positive. The controlled study also uses a pinned SkillsBench version, meaning absolute pass rates are tied to an evaluation framework that could change between releases.

What happens to skill-catalog value as baseline model capabilities improve?

If the non-monotonic benefit curve holds across generations, today’s mid-tier (which gains most from skills) becomes tomorrow’s weak tier (which gains least), and the value peak migrates upward. The dollar savings from equipping a cheaper model with skills instead of paying for a frontier model depend on which capability gap you’re bridging, and that gap narrows as baselines rise.