When Agent Skill Libraries Scale, Dependency-Aware Retrieval Beats Flat Search

Agent skill libraries are running into the same wall that broke vector-search RAG in 2024: flat embedding retrieval silently discards prerequisite chains. When an agent matches a skill by description similarity, it gets the skill but not the skills that skill depends on. Graph-of-Skills, a paper whose v3 revision landed May 27, 2026, treats skill retrieval as dependency-graph traversal rather than top-k similarity search, and reports a 25.55% reward increase on its SkillsBench benchmark while cutting token consumption by 56.72% compared to loading the full skill set.

The Prerequisite Gap: Why Flat Skill Search Degrades in Large Catalogs

The paper identifies two scaling problems that emerge once a skill catalog grows past a few hundred entries. First, loading the entire skill set into context saturates the context window, which drives up token costs, increases hallucination rates, and adds latency. Second, standard semantic retrieval surfaces topically relevant skills but misses their prerequisite chain, creating what the authors call a “prerequisite gap”: the retrieved bundle is topically correct but execution-incomplete (arXiv:2604.05333).

The dynamic is familiar to anyone who watched RAG systems fail on multi-hop reasoning. Embedding similarity finds the thing that looks like the query. It does not find the thing the query’s answer depends on. When a skill library has 50 entries, an agent can brute-force the gap by loading everything. At 2,000 entries, that strategy collides with context-window limits and per-token cost.

How Graph-of-Skills Works

GoS separates the problem into an offline phase and an inference-time phase.

Offline: Building the Executable Skill Graph

The system parses each skill package into a node and identifies dependency edges between skills. The result is a directed graph where an edge from skill A to skill B means B requires A as a prerequisite. This construction happens once, before any agent queries arrive, and the graph structure is what gets indexed rather than just skill descriptions.
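A minimal sketch of what such a graph might look like in memory, assuming each skill package declares its prerequisites by name. The `SkillGraph` class, `build_graph` helper, and the example skill names are illustrative and do not come from the paper; the edge direction follows the convention above (prerequisite -> dependent skill).

```python
from dataclasses import dataclass, field


@dataclass
class SkillGraph:
    # prerequisites[s]: skills that s requires (incoming edges of s)
    prerequisites: dict[str, set[str]] = field(default_factory=dict)
    # dependents[s]: skills that require s (outgoing edges of s)
    dependents: dict[str, set[str]] = field(default_factory=dict)

    def add_skill(self, name: str) -> None:
        self.prerequisites.setdefault(name, set())
        self.dependents.setdefault(name, set())

    def add_edge(self, prerequisite: str, skill: str) -> None:
        """Record the edge prerequisite -> skill, meaning `skill` requires `prerequisite`."""
        self.add_skill(prerequisite)
        self.add_skill(skill)
        self.prerequisites[skill].add(prerequisite)
        self.dependents[prerequisite].add(skill)


def build_graph(packages: dict[str, list[str]]) -> SkillGraph:
    """`packages` maps each skill name to the names it declares as prerequisites."""
    graph = SkillGraph()
    for skill, requires in packages.items():
        graph.add_skill(skill)
        for prerequisite in requires:
            graph.add_edge(prerequisite, skill)
    return graph


# Hypothetical catalog: deploying needs an image and credentials,
# and building the image needs a checked-out repository.
packages = {
    "deploy_service": ["build_image", "load_credentials"],
    "build_image": ["checkout_repo"],
    "load_credentials": [],
    "checkout_repo": [],
}
graph = build_graph(packages)
```

Storing both directions costs a little memory but lets the retrieval step walk from a skill to its prerequisites without scanning every edge.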

Inference: Retrieval as Graph Traversal

At query time, GoS uses hybrid semantic-lexical seeding to identify initial candidate skills. Then, instead of returning those candidates directly, it runs reverse-aware Personalized PageRank over the skill graph to pull in prerequisite skills that the seeding step missed. A context-budgeted hydration step ensures the final bundle fits within the agent’s token budget (arXiv:2604.05333).
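The traversal step can be sketched roughly as follows. This is not the paper's implementation: it assumes the seeding step has already produced weighted seed skills, it walks only the reversed edges (from a skill to its prerequisites), and it uses a plain power-iteration Personalized PageRank with a greedy budget fill. The function names, the `alpha` default, and the token costs are all illustrative assumptions.

```python
def personalized_pagerank(prerequisites, seeds, alpha=0.15, iterations=50):
    """Power-iteration PPR that walks from each skill to its prerequisites.

    `prerequisites` maps every skill to the set of skills it requires.
    `seeds` maps seed skills (assumed to be nodes of the graph) to positive weights.
    """
    total = sum(seeds.values())
    if total <= 0:
        raise ValueError("at least one seed needs positive weight")
    nodes = list(prerequisites)
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: alpha * restart[n] for n in nodes}
        for node, score in scores.items():
            targets = prerequisites[node]
            if targets:
                # Spread the walking mass evenly over this skill's prerequisites.
                share = (1 - alpha) * score / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # No prerequisites: send the walking mass back to the seeds.
                for n in nodes:
                    nxt[n] += (1 - alpha) * score * restart[n]
        scores = nxt
    return scores


def hydrate(scores, token_cost, budget):
    """Greedily add the highest-scoring skills that still fit the token budget."""
    bundle, used = [], 0
    for skill in sorted(scores, key=scores.get, reverse=True):
        if scores[skill] <= 0:
            continue  # unreachable from the seeds, so never loaded
        if used + token_cost[skill] <= budget:
            bundle.append(skill)
            used += token_cost[skill]
    return bundle


# Hypothetical graph; "format_report" is unrelated to the seed.
prerequisites = {
    "deploy_service": {"build_image", "load_credentials"},
    "build_image": {"checkout_repo"},
    "load_credentials": set(),
    "checkout_repo": set(),
    "format_report": set(),
}
scores = personalized_pagerank(prerequisites, seeds={"deploy_service": 1.0})
token_cost = {name: 100 for name in prerequisites}  # made-up per-skill costs
bundle = hydrate(scores, token_cost, budget=400)
```

With a single seed, the walk assigns nonzero score to the whole prerequisite chain and zero to the unrelated skill. A greedy fill like this one can still drop a prerequisite when the budget is tight, so a production version would likely need to check that each selected skill's prerequisites made it into the bundle.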

The design choice that matters: dependency edges, not description embeddings, are the primary retrieval signal. Embeddings get the agent into the right neighborhood of the graph. Graph traversal ensures the returned bundle is executable.

Benchmarks: SkillsBench and ALFWorld Across Three Model Families

The paper evaluates GoS on SkillsBench and ALFWorld across three model families: Claude Sonnet 4.5, MiniMax M2.7, and GPT-5.2 Codex (arXiv:2604.05333). The headline numbers come from GPT-5.2 Codex: a 25.55% peak reward increase and a 56.72% reduction in total tokens compared to the vanilla baseline of loading all skills into context.

Ablations across the three models confirm the pattern holds as skill libraries grow from 200 to 2,000 entries. The token reduction is the more operationally relevant number. If an agent can achieve higher task reward while consuming roughly half the tokens, the cost economics of maintaining a large skill catalog change: running 2,000 skills no longer requires paying for 2,000 skills’ worth of context on every inference call.

What SkillRevise Adds: Execution-Grounded Skill Improvement

A companion paper, SkillRevise (arXiv:2606.01139), addresses a related problem: the quality of the skills themselves. The paper observes that expert-authored skills are expensive to produce and may not align with how LLM agents actually execute tasks. Meanwhile, one-shot LLM-generated skills can be syntactically well-formed yet behaviorally weak, motivating iterative, execution-grounded refinement.

On SkillsBench, SkillRevise improves a base agent’s success rate from 36.05% to 61.63% (arXiv:2606.01139). The revised skills exhibit cross-model transferability: procedural knowledge extracted from execution traces generalizes across model families rather than encoding model-specific artifacts. The SkillRevise evaluation uses its own protocol, so cross-benchmark comparability with other skill-improvement methods is untested.

The two papers are complementary. GoS solves retrieval: given a skill graph, return the right subgraph. SkillRevise solves quality: given a skill, improve it through execution feedback. A production system needs both: refined skills indexed in a dependency graph, retrieved with prerequisite awareness.

Practitioner Playbook: Adding Dependency Edges to Your Tool Registry

As of June 2026, the GoS results suggest a concrete engineering checklist for teams managing growing skill or tool catalogs:

Audit your retrieval method. If your system selects tools or skills by embedding similarity alone, it has the prerequisite gap. The gap is invisible at small catalog sizes and degrades as the catalog grows.
Make dependency edges first-class metadata. For each skill, record which other skills it calls or requires. This is the graph GoS indexes. Without these edges, no amount of embedding tuning will recover missing prerequisites.
Budget for graph construction. The offline graph-building phase is a one-time cost per catalog update. It is cheaper than the ongoing token cost of loading unnecessary skills or the failure cost of missing prerequisites.
Measure retrieval completeness, not just relevance. A retrieval system that returns three highly relevant skills but omits their two prerequisites produces a bundle that looks correct and fails at execution. Completeness metrics should count prerequisite coverage.
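One way to make the completeness check in the last item concrete is a prerequisite-coverage score: of all the prerequisites that the retrieved skills transitively need, what fraction is actually in the bundle? The sketch below is an assumed definition for illustration, not a metric taken from the paper, and the skill names are hypothetical.

```python
def prerequisite_closure(skill, prerequisites):
    """Return every direct and transitive prerequisite of `skill`."""
    seen = set()
    stack = list(prerequisites.get(skill, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(prerequisites.get(current, ()))
    return seen


def prerequisite_coverage(retrieved, prerequisites):
    """Fraction of required prerequisites that are present in the retrieved bundle."""
    retrieved = set(retrieved)
    required = set()
    for skill in retrieved:
        required |= prerequisite_closure(skill, prerequisites)
    if not required:
        return 1.0  # nothing is required, so nothing can be missing
    return len(required & retrieved) / len(required)


prerequisites = {
    "deploy_service": {"build_image", "load_credentials"},
    "build_image": {"checkout_repo"},
    "load_credentials": set(),
    "checkout_repo": set(),
}
partial = prerequisite_coverage({"deploy_service", "build_image"}, prerequisites)
complete = prerequisite_coverage(set(prerequisites), prerequisites)
```

Here the partial bundle needs three prerequisites and contains only one of them, so it scores one third even though both retrieved skills are relevant to the task.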

Where the Ecosystem Is Heading: MCP, A2A, and the Missing Skill Layer

As of mid-2026, the agent interoperability stack is consolidating around two protocols. The LAP protocol paper (arXiv:2606.03755) positions Anthropic’s MCP as standardizing the agent-to-tool edge and Google’s A2A as the agent-to-agent edge. Neither protocol addresses dependency-aware skill retrieval. MCP handles how an agent invokes a tool. A2A handles how agents coordinate. The question of how an agent discovers and bundles the right set of interdependent tools for a given task is unaddressed at the protocol level.

This is the gap GoS fills, and it is a gap that will widen as skill catalogs grow. Frameworks that retrieve tools by description similarity should expect the prerequisite gap to become a visible failure mode as their users’ tool registries expand. (That characterization of current frameworks follows the problem framing in the GoS paper; it is not drawn from vendor documentation as of June 2026.) The fix is not better embeddings. It is structural indexing over dependency edges.

The 56.72% token reduction is the number that should concentrate minds. If dependency-aware retrieval can cut token consumption roughly in half while improving task outcomes, the economic argument for flat top-k skill search weakens. Graph construction is a fixed cost paid once per catalog update. Token consumption is a variable cost that accrues on every inference call.
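The fixed-versus-variable argument reduces to a break-even count. The sketch below uses the paper's reported 56.72% token reduction; the baseline token count, the per-token price, and the graph-build cost are made-up inputs, and the reduction measured on SkillsBench may not carry over to another workload.

```python
import math

TOKEN_REDUCTION = 0.5672  # reported reduction versus loading the full skill set


def calls_to_break_even(baseline_tokens_per_call, price_per_token, graph_build_cost):
    """Number of inference calls after which token savings cover one graph build."""
    saving_per_call = baseline_tokens_per_call * TOKEN_REDUCTION * price_per_token
    return math.ceil(graph_build_cost / saving_per_call)


# Hypothetical inputs: 100k tokens per call when loading everything,
# 1 USD per million tokens, and 5 USD of compute per graph rebuild.
break_even = calls_to_break_even(
    baseline_tokens_per_call=100_000,
    price_per_token=1e-6,
    graph_build_cost=5.0,
)
print(break_even)  # 89 calls under these assumed inputs
```

Under these assumptions each call saves a little under six cents, so a rebuild pays for itself in fewer than a hundred calls. The point of the exercise is the shape of the trade-off, not the specific figure.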

Frequently Asked Questions

Is flat skill search sufficient for catalogs under 200 skills?

GoS’s ablations cover the 200-to-2,000 skill range, so the paper does not directly measure smaller catalogs. Teams with fewer than roughly 200 skills can usually load the full catalog without hitting context-window limits or incurring prohibitive token costs, which closes the prerequisite gap by brute force. Graph-retrieval overhead only pays for itself once the catalog grows large enough that full loading becomes expensive or impossible.

Can SkillRevise refine skills on a cheaper model and deploy on a pricier one?

In principle, yes. The cross-model transferability result suggests a team could run SkillRevise’s iterative refinement loop on a smaller, cheaper model (tested across Claude Sonnet 4.5, MiniMax M2.7, and GPT-5.2 Codex) and deploy the refined skills on a more capable one. The refined skills are reported to capture generalized procedural knowledge rather than model-specific quirks, so the refinement investment should carry over.

What happens when skill dependencies change after the graph is built?

GoS treats graph construction as a one-time cost per catalog update, so stale edges persist until the next rebuild. In a registry where skills are added or their dependencies change frequently, the rebuild cadence becomes an operational parameter the paper does not benchmark. Teams adopting this approach should factor graph-refresh latency into their deployment pipeline: a skill whose prerequisites changed mid-cycle will be retrieved with the old dependency structure until the next offline build.
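A cheap way to detect stale edges between rebuilds is to fingerprint the dependency declarations at build time and compare on a schedule or at deploy. This is a sketch under the assumption that dependencies are declared as name lists per skill; the helper names are hypothetical and nothing here comes from the paper.

```python
import hashlib
import json


def dependency_fingerprint(packages):
    """Hash the declared dependencies, ignoring declaration order."""
    canonical = json.dumps(
        {name: sorted(requires) for name, requires in packages.items()},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def graph_is_stale(built_fingerprint, current_packages):
    """True when the live declarations no longer match the ones the graph was built from."""
    return dependency_fingerprint(current_packages) != built_fingerprint


# Fingerprint recorded when the offline build ran (hypothetical catalog).
built_from = {
    "deploy_service": ["build_image", "load_credentials"],
    "build_image": ["checkout_repo"],
}
built_fingerprint = dependency_fingerprint(built_from)

# Later, someone adds a new prerequisite to build_image.
live = {
    "deploy_service": ["load_credentials", "build_image"],
    "build_image": ["checkout_repo", "login_registry"],
}
needs_rebuild = graph_is_stale(built_fingerprint, live)
```

The check only says that something changed, not what. It is enough to trigger a rebuild or to flag retrievals served from an outdated graph.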

Would this approach apply to MCP tool registries or plugin marketplaces?

MCP standardizes the invocation interface (how an agent calls a tool) but does not define a schema for declaring inter-tool dependencies. Applying GoS to an MCP registry would require extending each tool’s metadata with a depends_on or prerequisites field, then building the dependency graph from those declarations. Without that metadata layer, MCP registries would continue to retrieve tools by description matching, which is the exact failure mode GoS demonstrates at catalog sizes above roughly 200 tools.
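As a rough illustration of that metadata layer, the sketch below attaches a `depends_on` list to tool records and derives a prerequisite-first ordering with Python's standard `graphlib`. The `depends_on` field is hypothetical and is not part of MCP; the tool names and the `load_order` helper are likewise invented for the example.

```python
from graphlib import TopologicalSorter

# Tool records with a HYPOTHETICAL `depends_on` extension field.
tools = [
    {"name": "deploy_service", "description": "Deploy a built image.",
     "depends_on": ["build_image", "load_credentials"]},
    {"name": "build_image", "description": "Build a container image.",
     "depends_on": ["checkout_repo"]},
    {"name": "load_credentials", "description": "Fetch deployment credentials."},
    {"name": "checkout_repo", "description": "Clone the source repository."},
]


def load_order(tools):
    """Validate the declarations and return tool names with prerequisites first."""
    declared = {tool["name"] for tool in tools}
    graph = {}
    for tool in tools:
        deps = tool.get("depends_on", [])
        missing = [dep for dep in deps if dep not in declared]
        if missing:
            raise ValueError(f"{tool['name']} depends on undeclared tools: {missing}")
        # TopologicalSorter expects node -> set of predecessors.
        graph[tool["name"]] = set(deps)
    # static_order() raises graphlib.CycleError if the declarations are circular.
    return list(TopologicalSorter(graph).static_order())


order = load_order(tools)
```

Validating at registration time catches dangling or circular declarations before they reach retrieval, which is where they would otherwise surface as incomplete bundles.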