SkillOpt, published by Microsoft Research and collaborators (arXiv paper), treats an agent’s skill document as trainable state. The system applies structured add, delete, and replace edits governed by a textual learning-rate budget, validates each edit against held-out trajectories, and emits a compact best_skill.md artifact. The reported result is a sweep of all 52 evaluated benchmark cells, with average accuracy on GPT-5.5 direct chat rising by 23.5 points across six benchmarks. The story is not the numbers. It is the assumption those numbers break.
What SkillOpt actually optimizes
The system operates on a single Markdown skill document per domain. Rather than appending new heuristics to an ever-growing list, SkillOpt treats skill text the way a deep-learning optimizer treats weights: each training step proposes an edit (add a line, delete a line, replace a line), a validation gate scores the edit against held-out task trajectories, and the edit is either committed or rolled back. A “slow update” mechanism and a “meta skill” document govern the optimization schedule and reject edits that degrade validation performance.
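The propose, validate, commit-or-roll-back cycle can be sketched in a few lines. This is a minimal illustration, not the SkillOpt implementation: the `Edit` type, `apply_edit`, and `train_step` are hypothetical names, `propose` stands in for the optimizer-model call, and `validate` stands in for scoring against held-out trajectories.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    op: str          # "add", "delete", or "replace"
    index: int       # line position the edit targets
    text: str = ""   # new line content (unused for "delete")


def apply_edit(lines: List[str], edit: Edit) -> List[str]:
    """Return a new list of lines with the edit applied; the input is not mutated."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def train_step(
    lines: List[str],
    propose: Callable[[List[str]], Edit],       # assumption: wraps the optimizer model
    validate: Callable[[List[str]], float],     # assumption: held-out score, higher is better
) -> Tuple[List[str], bool]:
    """One step: propose an edit and keep it only if the validation score improves."""
    baseline = validate(lines)
    candidate = apply_edit(lines, propose(lines))
    if validate(candidate) > baseline:
        return candidate, True    # commit
    return lines, False           # roll back: the original skill is returned unchanged
```

Because `apply_edit` copies the list, rolling back is just a matter of discarding the candidate, which keeps the gate cheap to reason about.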
The constraint that matters is the textual learning rate. The optimizer does not rewrite the skill from scratch each iteration. It makes bounded edits, measured in token delta, and stops when the validation score plateaus. This is the mechanism that keeps skills compact and that prevents the skill document from accumulating contradictory instructions the way an append-only registry does.
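A bounded edit and a plateau stop might look like the sketch below. Everything here is an assumption made for illustration: the whitespace tokenizer, the definition of token delta as tokens in removed lines plus tokens in added lines, and the `patience` and `min_gain` parameters are not taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def token_count(text: str) -> int:
    # Whitespace split as a stand-in for a real tokenizer (assumption).
    return len(text.split())


def token_delta(old_lines: List[str], new_lines: List[str]) -> int:
    """Tokens touched by an edit: tokens in removed lines plus tokens in added lines."""
    old, new = Counter(old_lines), Counter(new_lines)
    changed = (old - new) + (new - old)   # multiset of lines removed or added
    return sum(token_count(line) * n for line, n in changed.items())


def within_budget(old_lines: List[str], new_lines: List[str], budget: int) -> bool:
    """The 'learning rate' check: reject edits that touch more tokens than allowed."""
    return token_delta(old_lines, new_lines) <= budget


def plateaued(scores: Sequence[float], patience: int = 3, min_gain: float = 0.0) -> bool:
    """True when the last `patience` scores fail to beat the earlier best by min_gain."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent - best_before <= min_gain
```

Under this definition a one-line replace costs the tokens of both the old and the new line, so a small budget naturally favors short, targeted edits over wholesale rewrites.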
The ablation that makes the argument
The paper’s abstract names the textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update as the three mechanisms that keep skill training stable. An append-only registry has no counterpart to any of them. It can only grow.
The procedural benchmarks show the largest absolute gains. SpreadsheetBench and OfficeQA both improve substantially. These are domains where the skill document encodes a precise sequence of operations, and where a single wrong instruction in the skill text corrupts the entire trajectory. The optimizer’s ability to delete or replace bad instructions, rather than supplement them, appears to be where most of the lift concentrates.
The rejected-edit buffer is not a side feature. It is part of the optimizer’s core loop: rejected edits feed back into later update proposals, so the optimizer does not keep retrying changes that validation has already turned down.
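One simple way to picture that feedback is a bounded buffer that the proposer consults before suggesting an edit. This is a sketch under stated assumptions: the `RejectedEditBuffer` class, its capacity, and the `next_untried` helper are hypothetical, and the paper’s actual mechanism for feeding rejections back may differ.

```python
from collections import deque
from typing import Deque, Iterable, Optional, Tuple

EditKey = Tuple[str, int, str]  # (op, line index, text); hypothetical representation


class RejectedEditBuffer:
    """Remembers the most recent rejected edits, dropping the oldest when full."""

    def __init__(self, capacity: int = 32):
        self._items: Deque[EditKey] = deque(maxlen=capacity)

    def record(self, edit: EditKey) -> None:
        if edit not in self._items:      # avoid storing the same rejection twice
            self._items.append(edit)

    def __contains__(self, edit: object) -> bool:
        return edit in self._items

    def as_prompt_context(self) -> str:
        """Render rejections as text that could be shown to a proposer model."""
        return "\n".join(f"- {op} line {i}: {text!r}" for op, i, text in self._items)


def next_untried(candidates: Iterable[EditKey], buffer: RejectedEditBuffer) -> Optional[EditKey]:
    """Return the first candidate edit that has not already been rejected."""
    for edit in candidates:
        if edit not in buffer:
            return edit
    return None
```

The bounded capacity is a design choice in this sketch: an edit rejected long ago may be worth retrying once the surrounding skill text has changed.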
Transfer changes the build-vs-reuse calculus
The transfer results are where SkillOpt moves from “interesting optimizer” to “practical concern for framework builders.”
The paper reports that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
The practical implication is that skill artifacts are reusable across models, environments, and related tasks, but only if the skill is compact and validated. A bloated, unvalidated skill accumulated through append-only registration will not transfer cleanly. The optimization step is what makes transfer work, and that step requires deletion.
Append-only registries are a design error
Current agent frameworks expose skill addition as the default operation. Register a tool, add a capability, append to the list. None of the major frameworks expose a skill-eviction API or a compute-budget primitive. The assumption, rarely stated, is that skill addition is monotonic and free. More tools, more capabilities, more context. SkillOpt’s results suggest this assumption is wrong, and not at the margin.
Framework maintainers who want their users to benefit from optimized skills need to expose three primitives: a budget mechanism (how many tokens of skill context the agent can afford), a validation gate (what counts as evidence that a skill edit helped or hurt), and an eviction operation (how to remove or replace a skill). As of mid-2026, this author is not aware of any major framework that exposes all three. Most appear to expose none.
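To make the three primitives concrete, here is a minimal sketch of what a registry exposing them could look like. `BudgetedSkillRegistry` and its methods are hypothetical and do not correspond to any existing framework’s API; the gate is any caller-supplied function that scores a set of skills, and token counting uses a whitespace split as a placeholder.

```python
from typing import Callable, Dict

Gate = Callable[[Dict[str, str]], float]  # scores a skill set; higher is better


class BudgetedSkillRegistry:
    def __init__(self, token_budget: int, gate: Gate):
        self.token_budget = token_budget   # primitive 1: budget
        self.gate = gate                   # primitive 2: validation gate
        self.skills: Dict[str, str] = {}

    @staticmethod
    def _tokens(skills: Dict[str, str]) -> int:
        # Whitespace split as a stand-in for a real tokenizer (assumption).
        return sum(len(text.split()) for text in skills.values())

    def tokens_used(self) -> int:
        return self._tokens(self.skills)

    def register(self, name: str, text: str) -> bool:
        """Add a skill only if it fits the budget and improves the gate score."""
        candidate = {**self.skills, name: text}
        if self._tokens(candidate) > self.token_budget:
            return False
        if self.gate(candidate) <= self.gate(self.skills):
            return False
        self.skills = candidate
        return True

    def evict(self, name: str) -> bool:    # primitive 3: eviction
        """Remove a skill; returns False if it was not registered."""
        return self.skills.pop(name, None) is not None
```

The point of the sketch is that `register` can fail. In an append-only registry it never does, which is exactly the monotonic-and-free assumption under question.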
SkillOS adds the curation layer
Published two weeks before SkillOpt, SkillOS trains an RL-based skill curator to manage a SkillRepo over streaming tasks. Where SkillOpt optimizes a single skill’s content, SkillOS optimizes the policy for deciding which skills to keep, merge, or retire across an entire repository. The two systems are complementary: SkillOpt is the optimizer that makes individual skills good; SkillOS is the curator that decides which skills deserve optimization in the first place.
The framing shift is the same in both papers. Skill curation is a learnable policy. Skill content is trainable state. Neither is a static configuration problem, and both require deletion as a first-class operation.
What to build now
The 52-of-52 result is self-reported on self-selected benchmarks without independent reproduction. Treat it as a strong signal, not a settled fact. But the architectural claim does not depend on the exact numbers. If skill optimization with bounded budgets and validation gates outperforms unoptimized skill accumulation by even half the reported margin, the framework implication holds: append-only skill registries leave performance on the table, and the gap widens as skill count grows.
For teams building agent systems today, the concrete takeaway is to stop treating skill registration as a one-way append operation. Add a validation step that scores the agent’s performance with and without each skill. Set a token budget for the total skill context and evict skills that fail validation. The SkillOpt repository ships with support for six benchmarks, three execution environments (direct chat, Codex, Claude Code), and models from GPT, Claude, and Qwen families. It is a working reference implementation for the skill-optimizer primitive, not just a paper.
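The with-and-without check and the budget-driven eviction can be prototyped without any framework support. The sketch below assumes a caller-supplied `evaluate` function that runs the agent on a validation set with a given set of skills and returns a score; `prune_skills` is a hypothetical helper, and token counts again use a whitespace split.

```python
from typing import Callable, Dict, Tuple


def prune_skills(
    skills: Dict[str, str],
    evaluate: Callable[[Dict[str, str]], float],   # assumption: higher is better
    token_budget: int,
) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Evict skills that do not help, then trim the rest to the token budget."""
    full_score = evaluate(skills)

    # Leave-one-out: how much does the score drop when each skill is removed?
    contribution = {
        name: full_score - evaluate({k: v for k, v in skills.items() if k != name})
        for name in skills
    }

    # Validation gate: keep only skills whose removal hurts the score.
    kept = {k: v for k, v in skills.items() if contribution[k] > 0}

    def tokens(d: Dict[str, str]) -> int:
        return sum(len(text.split()) for text in d.values())

    # Budget: evict the least useful remaining skills until the set fits.
    for name in sorted(kept, key=lambda k: contribution[k]):
        if tokens(kept) <= token_budget:
            break
        del kept[name]

    return kept, contribution
```

Leave-one-out scoring ignores interactions between skills, so it is a first approximation rather than a substitute for an optimizer, and it costs one extra evaluation per skill.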
The frameworks that add eviction APIs and budget primitives first will be the ones whose users can actually benefit from optimized skills. Everyone else will be stuck accumulating context until the prompt fills up and performance degrades, with no mechanism to diagnose why.
Frequently Asked Questions
How much compute does SkillOpt’s training loop consume before a skill pays for itself?
Each training step requires rollout batches and a separate optimizer-model call, so the upfront cost is substantial. The paper notes that this cost amortizes only for skills reused across many sessions; a skill tuned for a one-off task would cost more in optimization compute than it returns in inference accuracy.
How well do optimized skills transfer across model sizes and execution environments?
A SpreadsheetBench skill trained on GPT-5.4 improved that model by +10.7 points, GPT-5.4-mini by +9.4, and GPT-5.4-nano by +3.0, with transfer degrading as the gap widens. Cross-environment transfer is stronger: a Codex-trained spreadsheet skill applied to Claude Code produced a +59.7 point gain, suggesting the skill encodes domain logic rather than model-specific prompt patterns.
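One rough way to read the model-scale numbers is as a fraction of the source-model gain that survives the move. The snippet below does that arithmetic using only the three gains quoted above; `retained_fraction` is a hypothetical helper, and the ratio is a crude summary because each model starts from a different baseline.

```python
# Gains quoted in the text for one SpreadsheetBench skill trained on GPT-5.4.
gains = {"GPT-5.4": 10.7, "GPT-5.4-mini": 9.4, "GPT-5.4-nano": 3.0}


def retained_fraction(gains: dict, source: str) -> dict:
    """Each model's gain divided by the gain on the model the skill was trained on."""
    base = gains[source]
    return {model: gain / base for model, gain in gains.items()}


retention = retained_fraction(gains, "GPT-5.4")
for model, frac in retention.items():
    print(f"{model}: {frac:.0%} of the source-model gain")
```

By this measure the mini model keeps roughly 88% of the gain and the nano model roughly 28%.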
What do the ablation numbers reveal about which components matter most?
Dropping the meta skill and slow update collapsed SpreadsheetBench from 77.5 to 55.0. Removing only the rejected-edit buffer cost 1.6 to 4.6 points depending on the benchmark. LiveMathematicianBench’s entire 29.3-point improvement came from a single accepted edit, so for some domains the optimizer’s value is finding one high-signal instruction rather than accumulating many.
Does the 52-of-52 result hold up against a composite baseline?
SkillOpt beat an oracle that selects the best result among all six competing methods for each evaluated cell, by an average of +5.4 points. The result is self-reported on benchmarks that all have automatic verifiers, with no independent reproduction published yet.
What happened when an optimized skill was applied to a different benchmark in the same domain?
An OlympiadBench skill produced positive gains on Omni-MATH without further tuning, suggesting the optimizer captures generalizable domain reasoning rather than overfitting to question format. The paper reports only the direction of transfer, not its magnitude, so the practical extent of cross-benchmark reuse remains unclear.