Why Skill Creation and Reward Optimization Collide in Agentic RL

When an RL agent invents a new skill mid-training, the skill and the policy both draw from the same gradient budget. ReSkill, published on arXiv in June 2026, shows that decoupled skill creation actively degrades reward because the skill library drifts out of sync with the evolving policy. The fix is not more skills; it is co-optimization.

Skill Libraries Hate Moving Policies

Most agent frameworks treat skills as a static library bolted onto a policy. You pre-define a set of tool-use patterns, chain-of-thought strategies, or action templates, and the policy selects among them. This works when the policy is frozen. It breaks when the policy keeps learning.

ReSkill (arXiv:2606.01619) formalizes why. In existing skill-augmented RL methods, skill creation runs as a separate process from policy optimization. The skill inventor observes agent failures, proposes new reusable skills, and deposits them into a library. The policy then draws from that library during rollouts. The problem: by the time the policy samples a newly created skill, the policy itself has already shifted. Skills that were useful when invented may now conflict with the policy’s current action distribution, and the gradient signal that would have corrected the skill is instead spent on policy updates.

The consequence is not just missed opportunity. It is active degradation. The agent adopts skills that interfere with its current strategy, and the reward curve dips instead of climbing. This is the gradient conflict at the core of ReSkill’s analysis.

What ReSkill Proves About the Skill-Policy Conflict

The paper’s lifecycle analysis shows a clear trajectory for how skills evolve alongside the policy: skills are automatically created during training, tested against the current policy, refined when they underperform, and eventually pruned when the policy outgrows them (arXiv:2606.01619). This lifecycle finding carries a direct implication the paper calls out: static skill libraries bolted onto a fixed policy, the pattern used in most agent frameworks, are suboptimal. The skill set that works early in training is not the skill set that works later, and a framework that does not account for this drift is leaving performance on the table.

ReSkill also reports that its reconciled approach produces the largest gains on unseen tasks, not just on the training distribution. Skills that co-evolve with the policy generalize better because they capture policy-compatible strategies rather than policy-agnostic heuristics. The AIssential.tech analysis frames the practitioner takeaway: integrate skill creation directly into policy optimization via assertion-driven skill revision and controlled skill version comparison.

Three Mechanisms That Make Co-Evolution Work

ReSkill embeds three mechanisms inside GRPO’s group-wise structure with only marginal overhead (arXiv:2606.01619):

Assertion-driven skill creation. When a rollout fails, the system diagnoses the failure and proposes a conditional, trigger-based skill revision. The assertion acts as a contract: if the agent encounters condition X, try strategy Y. This is not a prompt-level heuristic; it is a structured, testable skill definition that the RL loop can evaluate.
Within-group rollout sampling. GRPO evaluates groups of rollouts together. ReSkill uses this structure to run controlled experiments: within a single group, some rollouts use the current skill version and others use a proposed revision. This gives the optimizer a direct comparison under identical conditions without requiring a separate evaluation phase.
Thompson Sampling with adaptive discounting. Skill selection is itself an exploration-exploitation problem. ReSkill uses Thompson Sampling, a bandit algorithm, to balance trying new skills against exploiting known-good ones, and adapts the discount rate as training progresses to shift the balance toward exploitation in later stages.
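The third mechanism is the easiest to picture in code. What follows is a minimal sketch, not ReSkill's implementation: it assumes binary rollout outcomes, a Beta posterior per skill version, and a linear discount schedule. The names SkillArm and DiscountedThompsonSelector, the gamma_start and gamma_end values, and the decay-toward-prior rule are all hypothetical choices made for illustration; the paper's actual schedule is not reproduced here.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SkillArm:
    """Beta posterior over one skill version's success probability."""
    alpha: float = 1.0  # prior pseudo-count of successes
    beta: float = 1.0   # prior pseudo-count of failures


@dataclass
class DiscountedThompsonSelector:
    """Hypothetical selector: Thompson Sampling with a discount that relaxes over training."""
    arms: dict = field(default_factory=dict)  # skill name -> SkillArm
    gamma_start: float = 0.90  # assumed: forget quickly early on (more exploration)
    gamma_end: float = 1.00    # assumed: stop forgetting late (more exploitation)
    total_steps: int = 1000

    def gamma(self, step: int) -> float:
        # Linear schedule from gamma_start to gamma_end, clamped to [0, total_steps].
        frac = min(max(step / self.total_steps, 0.0), 1.0)
        return self.gamma_start + frac * (self.gamma_end - self.gamma_start)

    def select(self, rng: random.Random) -> str:
        # Draw one sample per posterior and pick the skill with the highest draw.
        draws = {name: rng.betavariate(arm.alpha, arm.beta)
                 for name, arm in self.arms.items()}
        return max(draws, key=draws.get)

    def update(self, name: str, success: bool, step: int) -> None:
        g = self.gamma(step)
        # Decay every posterior toward the Beta(1, 1) prior before adding new evidence.
        for arm in self.arms.values():
            arm.alpha = 1.0 + g * (arm.alpha - 1.0)
            arm.beta = 1.0 + g * (arm.beta - 1.0)
        chosen = self.arms[name]
        if success:
            chosen.alpha += 1.0
        else:
            chosen.beta += 1.0
```

With gamma below 1, old evidence fades and the posteriors stay wide, so the sampler keeps trying newer skill versions. As gamma approaches 1, evidence accumulates and the sampler settles on whichever version has performed best.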

The GRPO dependency is worth noting. Whether these mechanisms transfer to other RL trainers, such as vanilla PPO or DPO, is unstated in the paper. Practitioners using non-GRPO setups should treat the three-mechanism architecture as a design pattern rather than a drop-in module.
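To make the GRPO dependency concrete, here is a rough sketch of a within-group comparison. It assumes the standard group-relative advantage (reward minus the group mean, divided by the group standard deviation) and a hypothetical labeling of each rollout with the skill version it used. The function names, the use of the population standard deviation, and the idea of averaging advantages per version are illustrative assumptions, not the paper's code.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward is scored against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumed: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def compare_versions(rollouts):
    """Mean advantage per skill version within ONE group.

    rollouts: list of (version_label, reward) pairs that share the same task,
    so both versions are judged against the same baseline.
    """
    advantages = group_advantages([reward for _, reward in rollouts])
    by_version = {}
    for (version, _), adv in zip(rollouts, advantages):
        by_version.setdefault(version, []).append(adv)
    return {version: mean(values) for version, values in by_version.items()}


# Hypothetical group of four rollouts on one task, split between two skill versions.
group = [("current", 0.0), ("current", 1.0), ("candidate", 1.0), ("candidate", 1.0)]
scores = compare_versions(group)
print(scores["candidate"] > scores["current"])
```

Because both versions sit in the same group, the shared mean and standard deviation act as the control. A trainer without this grouping would have to build the paired batch explicitly.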

What This Means for Agent Frameworks

The framework gap is real but indirect. ReSkill does not benchmark against CrewAI, LangGraph, or any specific agent framework. What it does is expose a structural problem those frameworks share: skill acquisition treated as a library management problem rather than an optimization problem.

As of mid-2026, CrewAI’s model configures agent tool sets and skill definitions at design time. The agent can compose these tools in novel ways, but the skill library itself does not evolve during deployment. LangGraph’s current approach is similar: the graph topology and available tools are specified by the developer, and the LLM reasons over them at inference time. Neither framework has a mechanism for the policy (the LLM’s behavior) to feed back into the skill definitions and trigger revision.

This is the architecture choice ReSkill challenges. If the policy is learning, whether through fine-tuning, RLHF, or online adaptation, the skill library must co-evolve or it will actively interfere. The cost of fixing this is non-trivial. Version-gated rollout for skills, Thompson-sampled selection, and assertion-driven revision are not features you can bolt on with a config flag. They require the skill layer to participate in the training loop.

A concurrent study on agentic misalignment (arXiv:2605.24197) underscores the fragility of naive skill-sharing. Agents acting on implicit proxy utilities suffer posterior collapse, and the paper proposes Agentic Evidence Attribution (AEA) as a fix. The overlap with ReSkill is structural: both papers identify that implicit, unaudited behavioral modules, whether skills or proxy objectives, degrade agent performance when not explicitly tracked.

The Reliability Gap in Agentic Systems

ReSkill’s findings sit inside a broader pattern. The τ-Rec benchmark for agentic recommender systems, published June 8, 2026, tested six models from five model families (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, and GPT-5 mini). Even the best model achieves roughly 57% at pass^1 and 38% at pass^4, per the benchmark’s evaluation (arXiv:2606.10156). This is not a narrow benchmark; it spans the current frontier.
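A short sketch helps show what the gap between pass^1 and pass^4 measures. It assumes τ-Rec follows the pass^k definition popularized by τ-bench: the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) for c successes in n trials and then averaged over tasks. That assumption, the function names, and the toy task counts below are illustrative and are not taken from the τ-Rec paper.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimated chance that k trials drawn from the observed ones all succeed."""
    if k > trials:
        raise ValueError("k cannot exceed the number of trials")
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task, k: int) -> float:
    """Average pass^k over tasks; per_task is a list of (successes, trials)."""
    return sum(pass_hat_k(c, n, k) for c, n in per_task) / len(per_task)


# Two hypothetical benchmarks with the same pass^1 but very different reliability.
consistent = [(4, 4), (0, 4)]  # one task always solved, one never solved
flaky = [(2, 4), (2, 4)]       # every task solved half the time

print(benchmark_pass_hat_k(consistent, 1), benchmark_pass_hat_k(consistent, 4))
print(benchmark_pass_hat_k(flaky, 1), benchmark_pass_hat_k(flaky, 4))

# If every task shared one independent 57% success rate, pass^4 would be 0.57 ** 4.
independent_baseline = 0.57 ** 4
```

Under that definition, a uniform and independent 57% success rate would give a pass^4 near 0.106. The reported 38% is well above that, which would be consistent with successes concentrating on a subset of tasks the model solves reliably, while the remaining tasks fail often. Either way, fewer than four in ten tasks are solved on every one of four attempts.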

These numbers tell a consistent story alongside ReSkill. Agentic systems are unreliable in part because their component modules, whether skills, tools, or policy abstractions, are not co-optimized for the deployment environment. Skill-policy conflict is one instance of a general problem: agentic architectures assemble components designed in isolation and expect coherent behavior at inference time. The evidence suggests this expectation is poorly founded.

For practitioners building self-improving agents, the actionable takeaway from ReSkill is specific: treat skill acquisition as an optimization track in its own right, one that runs inside the policy training loop rather than alongside it. This means version-gated skill rollout, controlled comparison against the current policy, and explicit pruning when skills drift out of alignment. It raises the engineering cost of the autonomous-skill-growth pitch. But a growing skill library that silently degrades as the policy evolves is the worse outcome.

Frequently Asked Questions

Does ReSkill’s approach transfer to agents trained with PPO or DPO instead of GRPO?

The paper does not test this, so treat what follows as a reasoned expectation rather than a result. Two of the three mechanisms should adapt cleanly: Thompson Sampling for skill selection and assertion-driven skill creation do not depend on the optimizer and can, in principle, be wrapped around any training loop. Within-group rollout sampling is the exception: it depends on GRPO’s native batch structure, which groups rollouts and shares a baseline. With PPO or DPO, you would need to construct explicit evaluation batches for skill version comparison, adding a separate evaluation phase whose cost GRPO absorbs as part of its existing rollout budget.

What does an assertion-driven skill look like compared to a CrewAI tool definition?

A CrewAI skill is a tool plus a natural-language description that the LLM reasons over at inference time. A ReSkill assertion is a conditional contract: when the agent hits a specific failure trigger during rollout, the skill proposes a strategy revision tied to that trigger. It is closer to exception-handling logic than tool invocation. The RL loop then evaluates whether the trigger-strategy pair improves reward before committing it to the library, which is a test-and-commit cycle CrewAI’s static definitions never run.
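The contrast is easier to see with a small sketch of the conditional contract and the test-and-commit step. Everything here is hypothetical: the AssertionSkill fields, the first_triggered and commit_if_better helpers, the rate-limit example, and the reward numbers are invented for illustration and do not come from ReSkill or from CrewAI's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AssertionSkill:
    """A conditional contract: if trigger(state) holds, try the given strategy."""
    name: str
    version: int
    trigger: Callable[[dict], bool]  # condition X, checked against rollout state
    strategy: str                    # strategy Y, handed to the agent as guidance


def first_triggered(state: dict, library: list) -> Optional[AssertionSkill]:
    """Return the first skill whose trigger fires for this state, if any."""
    for skill in library:
        if skill.trigger(state):
            return skill
    return None


def commit_if_better(library, candidate, mean_reward_current,
                     mean_reward_candidate, margin=0.0):
    """Replace the same-named skill only when the candidate earns more reward."""
    if mean_reward_candidate <= mean_reward_current + margin:
        return list(library)  # keep the library unchanged
    kept = [s for s in library if s.name != candidate.name]
    return kept + [candidate]


rate_limit_v1 = AssertionSkill(
    name="handle_rate_limit", version=1,
    trigger=lambda s: s.get("http_status") == 429,
    strategy="Wait, then retry the same call once.",
)
rate_limit_v2 = AssertionSkill(
    name="handle_rate_limit", version=2,
    trigger=lambda s: s.get("http_status") == 429,
    strategy="Back off exponentially and retry up to three times.",
)

library = [rate_limit_v1]
# Assumed mean rewards from a controlled comparison of the two versions.
library = commit_if_better(library, rate_limit_v2, 0.40, 0.55)
print([(s.name, s.version) for s in library])
```

The trigger makes the skill testable: the training loop can check when it fired and whether reward improved afterward, which a free-text tool description does not allow.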

Does skill co-optimization still matter if the policy is frozen after training?

A static skill library aligned to a frozen policy will not drift, so co-optimization adds no value in that specific scenario. The catch is that frozen policies are unlikely to remain frozen. On the concurrent τ-Rec benchmark, even the best of the frontier models tested (GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B, GPT-5 mini) reaches only ~57% pass^1 on agentic recommendation tasks, which strongly incentivizes ongoing policy refinement through fine-tuning, RLHF, or online adaptation. Any such update reopens the skill-drift window.

How does the misalignment paper’s posterior collapse relate to ReSkill’s skill-policy conflict?

Both papers diagnose the same structural failure from different directions. The misalignment study (arXiv:2605.24197) finds that agents accumulating implicit proxy utilities lose track of the original objective as behavioral modules multiply. ReSkill finds that skills evolving separately from the policy inject conflicting gradient signals. The shared lesson: any behavioral module that is not explicitly tracked against the policy’s current state, whether a skill, a proxy objective, or an implicit utility function, degrades performance. The misalignment paper’s proposed fix, Agentic Evidence Attribution, and ReSkill’s Thompson Sampling with version-gated rollout address the same tracking gap with different mechanisms.