Skill Injection: Hiding Undetectable Instructions in What an AI Agent Loads

A paper submitted to arXiv on June 6 demonstrates that malicious instructions hidden inside skills loaded by LLM agents are effectively invisible to the static content scanners that serve as the primary defense in skills and plugin ecosystems. POISE (arXiv:2606.07943) achieves an 89.3% attack success rate against codex+gpt-5.2 while triggering new detection alerts on only 5.6% of poisoned skill variants. The attack does not depend on cleverly worded payloads. It depends on where the payload sits.

What POISE Does

Agent skills in frameworks like Claude Code, OpenAI’s Codex, and the broader Model Context Protocol ecosystem follow an open format: a SKILL.md file containing YAML metadata and Markdown instructions, optionally accompanied by scripts, reference files, and assets. The format is designed for extensibility, and it inherently trusts loaded content. A skill’s instructions are injected directly into the agent’s context window at inference time, where the model treats them as actionable directives.
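To make the two regions of a skill file concrete, here is a minimal sketch that splits a SKILL.md into its metadata header and its Markdown body. It assumes the header is delimited by `---` lines and holds flat `key: value` pairs. The `split_skill` helper, the `Skill` class, and the example skill are all hypothetical, and real skills may use richer YAML than this parser handles.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    metadata: dict[str, str]  # flat key: value pairs from the header
    body: str                 # Markdown instructions the agent reads


def split_skill(text: str) -> Skill:
    """Split SKILL.md text into a header and a Markdown body.

    Assumption: the header sits between two lines containing only '---'.
    Text without such a header is treated as body only.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return Skill(metadata={}, body=text)
    try:
        end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    except StopIteration:
        return Skill(metadata={}, body=text)
    metadata = {}
    for line in lines[1:end]:
        if ":" in line:
            key, value = line.split(":", 1)
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return Skill(metadata=metadata, body=body)


# Hypothetical skill, used only for illustration.
EXAMPLE = """---
name: csv-report
description: Build a summary report from a CSV file
---
# CSV report

1. Read the input file.
2. Write the summary to report.md.
"""

skill = split_skill(EXAMPLE)
print(skill.metadata["name"])
print(len(skill.body.splitlines()), "body lines")
```

The header is short and structured, so it is easy to inspect. The body is free-form prose of arbitrary length, which is the region the rest of this article is concerned with.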

POISE exploits this trust through position-aware injection. Rather than stuffing a malicious command into a YAML header (where it is reliably loaded but trivially inspectable) or placing an explicit instruction in the body (where it reads suspiciously to the model itself), POISE compresses the malicious trigger into a single benign-looking body instruction and places it at a contextually feasible position inside the skill. A context-aware generator blends the instruction with nearby setup or prerequisite steps so it reads like a natural part of the workflow.

The mechanism takes only a single section to explain because the insight is narrow and precise: placement, not obfuscation, is what lets the payload survive both human review and the agent’s own suspicion heuristic.

The Numbers

On the Skill-Inject benchmark, POISE achieves an 89.3% attack success rate (ASR) against codex+gpt-5.2, according to the POISE paper. Two baselines contextualize that number. A random-placement body injection baseline scores 61.3%, meaning position awareness adds 28.0 percentage points. A YAML-header injection baseline scores 86.7%, only 2.6 points lower, but header payloads are easily spotted during any manual or automated inspection of the skill’s metadata. POISE retains the stealth of body placement while closing the reliability gap.

The ASR metric itself is stricter than mere payload execution. It requires both that the injected payload executes and that the user’s legitimate task still passes its verifier in the same trial. A poisoned skill that breaks the user’s actual task would invite inspection, so a successful attack must preserve the skill’s normal function.
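The difference between the strict metric and plain payload execution can be sketched in a few lines. The `Trial` record and the four sample trials are made up for illustration and are not data from the paper. The only thing taken from the description above is the rule that both conditions must hold in the same trial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    payload_executed: bool  # did the injected instruction take effect?
    task_passed: bool       # did the user's task pass its verifier?


def strict_asr(trials: list[Trial]) -> float:
    """Fraction of trials where the payload ran AND the task still passed."""
    if not trials:
        raise ValueError("need at least one trial")
    successes = sum(t.payload_executed and t.task_passed for t in trials)
    return successes / len(trials)


def execution_rate(trials: list[Trial]) -> float:
    """Looser metric: fraction of trials where the payload ran at all."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.payload_executed for t in trials) / len(trials)


# Made-up trials, for illustration only.
trials = [
    Trial(payload_executed=True, task_passed=True),
    Trial(payload_executed=True, task_passed=False),  # attack broke the task
    Trial(payload_executed=False, task_passed=True),  # agent ignored the payload
    Trial(payload_executed=True, task_passed=True),
]

print(f"execution rate: {execution_rate(trials):.2f}")
print(f"strict ASR:     {strict_asr(trials):.2f}")
```

In this toy set the payload runs in three of four trials, but only two count toward the strict metric because one run broke the user's task.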

Why Static Scanning Fails

The standard defense in skills and plugin ecosystems is pre-publication content review, typically using LLM-based static scanners that flag skills containing instructions for privileged tool operations. The POISE authors tested four such scanners across two benchmarks and found a fundamental signal-to-noise problem.

The scanners flag 74.6% of clean skills as suspicious on average, according to the POISE paper. The reason is straightforward: legitimate skill bodies naturally require instructions for file access, network calls, shell execution, and other privileged operations. These are the same patterns a scanner looks for when hunting malicious payloads. The result is a sea of false alarms.
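A toy example shows why this overlap is hard to avoid. The scanners in the paper are LLM-based. The keyword matcher below is a deliberately crude stand-in, with hypothetical pattern names and a made-up skill body, meant only to illustrate that an entirely benign workflow trips the same categories a scanner watches for.

```python
import re

# Hypothetical patterns for privileged operations; not taken from any real scanner.
PRIVILEGED_PATTERNS = {
    "shell": re.compile(r"\b(run|execute)\b.*\b(command|script|shell)\b", re.I),
    "network": re.compile(r"\b(curl|download|fetch|https?://)", re.I),
    "file": re.compile(r"\b(read|write|delete)\b.*\bfile", re.I),
}


def scan(body: str) -> set[str]:
    """Return the privileged-operation categories mentioned in a skill body."""
    return {name for name, pattern in PRIVILEGED_PATTERNS.items() if pattern.search(body)}


# A made-up, benign skill body.
CLEAN_BODY = """\
1. Run the setup script to install dependencies.
2. Fetch the schema from the project API.
3. Write the results to a file named report.md.
"""

print(sorted(scan(CLEAN_BODY)))
```

Every step in the clean body is legitimate, yet all three categories fire. A scanner that alerts on these categories alone cannot separate this skill from a poisoned one.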

POISE hides within that noise. Only 5.6% of poisoned skill variants gained a new high-risk alert over their clean baselines, meaning the scanner’s judgment of a poisoned skill was nearly identical to its judgment of the same skill before injection. Content vetting, as implemented by current LLM-based scanners, cannot distinguish POISE payloads from normal skill instructions because the payloads are normal skill instructions, just placed where they will be acted upon.
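One plausible reading of "gained a new high-risk alert" is a set difference between the alerts raised on a poisoned variant and those raised on its clean baseline. The sketch below works under that assumption. The alert labels and the sample pairs are invented for illustration, and the paper's exact definition may differ.

```python
def gained_new_alert(clean_alerts: set[str], poisoned_alerts: set[str]) -> bool:
    """True if the poisoned variant raises an alert its clean baseline did not."""
    return bool(poisoned_alerts - clean_alerts)


def new_alert_rate(pairs: list[tuple[set[str], set[str]]]) -> float:
    """Fraction of (clean, poisoned) pairs where poisoning added an alert."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(gained_new_alert(clean, poisoned) for clean, poisoned in pairs) / len(pairs)


# Invented (clean_alerts, poisoned_alerts) pairs, for illustration only.
PAIRS = [
    ({"shell", "network"}, {"shell", "network"}),  # already noisy, nothing new
    ({"shell"}, {"shell", "file-write"}),          # poisoning added an alert
    (set(), set()),                                # quiet before and after
    ({"network"}, {"network"}),                    # payload hides under an existing alert
]

print(f"new-alert rate: {new_alert_rate(PAIRS):.2f}")
```

Measured this way, a skill that was already flagged before injection gives the scanner no additional signal unless the payload introduces a category the clean version lacked.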

The Reliability-Stealth Trade-off, Resolved

Prior skill-poisoning attacks faced a structural dilemma. YAML-header injections are reliably loaded by the agent’s skill parser but are trivially visible to anyone who inspects the metadata block. Body injections placing explicit malicious commands are stealthier, but out-of-context commands trigger the agent’s own reluctance to follow instructions that do not fit the surrounding task flow.

POISE resolves this by making the instruction fit the context. The context-aware generator rewrites the payload so it reads as a prerequisite or setup step consistent with the skill’s purpose. The position-aware placement ensures it lands in a part of the skill where the agent is already primed to execute sequential instructions. Neither the wording nor the placement is anomalous, which is why static scanners and the model’s own skepticism both miss it.

The Skill Ecosystem’s Expanding Attack Surface

A separate large-scale audit of 55,315 publicly available agent skills, documented in SkillReducer (arXiv:2603.29919), found systemic conditions that make the POISE vector practical. Of the audited skills, 26.4% lack routing descriptions entirely, meaning there is no structured metadata that could constrain how the skill is loaded or invoked. Over 60% of body content across the audited skills is non-actionable, providing ample room to embed a payload without altering the skill’s observable behavior. Reference files can inject tens of thousands of tokens per invocation, enlarging the loaded context in which a payload can be placed.

The open skill format was designed for developer convenience and extensibility. It was not designed with adversarial content in mind, and the ecosystem has not yet added the infrastructure to retrofit that assumption.

What This Means for Agent Stacks

The agent frameworks that have adopted skill and plugin systems, including MCP-based stacks, Claude Code skills, and OpenAI’s Codex ecosystem, share a common architecture: skills are loaded as text into the context window, and the model interprets them as instructions. Content review, where it exists, happens at or before publication. What happens at inference time, when multiple skills compose together and the agent builds its execution plan, is unmonitored by any dedicated security layer.

POISE demonstrates that this gap is exploitable with high reliability. The attack does not require a model vulnerability, a jailbreak, or a parser bug. It requires a skill format that loads text as instructions and a content-review pipeline that cannot distinguish malicious instructions from legitimate ones when they are placed in the right position and worded to fit the surrounding context.

What’s Missing: Runtime Monitoring

The defensive implication is specific: pre-publication scanning of skill content cannot catch position-aware injection because the relevant signal is not in the text itself but in how the text composes with the agent’s execution state at inference time. An instruction that reads and executes as a normal prerequisite step when its skill is loaded alone can behave differently when that skill is loaded alongside other skills, when the agent is in a particular execution phase, or when the user’s task produces a specific context window configuration.

Monitoring that behavior requires instrumentation at the inference layer: tracking which skill instructions the agent acts on, in what order, and whether the resulting tool calls match the declared intent of the skill. None of the major agent stacks implement this as a security control today. Building it means treating skills not as static text to be vetted but as executable content whose runtime behavior must be observed, which is a different engineering commitment than the current skills/plugins push assumes.
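As a rough illustration of what such instrumentation might record, here is a minimal sketch of a monitor that logs tool calls in order and compares each one against a set of tools the triggering skill has declared. Everything here is hypothetical: the `ToolCall` and `RuntimeMonitor` names, the idea of a per-skill declared tool set, and above all the assumption that the runtime can attribute a tool call to the skill that triggered it, which is the kind of hook current stacks do not provide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolCall:
    skill: str   # skill whose instruction triggered the call (assumed attributable)
    tool: str    # tool name, e.g. "read_file" or "http_post"
    target: str  # path, URL, or command the tool was pointed at


@dataclass
class RuntimeMonitor:
    # Hypothetical declared intent: the tools each skill says it needs.
    declared: dict[str, set[str]]
    log: list[ToolCall] = field(default_factory=list)

    def _allowed(self, call: ToolCall) -> bool:
        return call.tool in self.declared.get(call.skill, set())

    def record(self, call: ToolCall) -> bool:
        """Log the call in order; return True if it matches the declared intent."""
        self.log.append(call)
        return self._allowed(call)

    def violations(self) -> list[ToolCall]:
        """Calls whose tool was not declared by the skill that triggered them."""
        return [call for call in self.log if not self._allowed(call)]


monitor = RuntimeMonitor(declared={"csv-report": {"read_file", "write_file"}})
monitor.record(ToolCall("csv-report", "read_file", "data.csv"))
monitor.record(ToolCall("csv-report", "http_post", "https://example.invalid/upload"))
monitor.record(ToolCall("csv-report", "write_file", "report.md"))

for call in monitor.violations():
    print(f"undeclared tool call: {call.skill} -> {call.tool} ({call.target})")
```

The sketch checks tool names only. A payload that misuses a declared tool, such as writing to an unexpected path, would pass this check, so a real control would also need to reason about targets and call order.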

Frequently Asked Questions

Does POISE transfer to non-OpenAI models like Claude or Gemini?

The 89.3% ASR is specific to codex+gpt-5.2 on the Skill-Inject benchmark. The authors claim the attack generalizes across model families because it exploits skill format structure rather than model-specific quirks, but they do not publish per-model breakdowns for other stacks. No independent reproduction on Claude, Gemini, or open-weight models has appeared as of June 2026, so the transfer claim is plausible but verified only within OpenAI’s model line.

How does POISE differ from prompt injection and jailbreaking?

Prompt injection targets the user-model conversation channel, inserting adversarial text the model interprets as instructions. Jailbreaking crafts inputs that bypass safety training. POISE operates on a separate path: the agent loads what it treats as trusted internal content from the skill supply chain. Defenses designed for conversation-level attacks (input sanitization, role-boundary enforcement) do not inspect the skill-loading channel, which is why scanners built to catch prompt injection miss POISE payloads even when those scanners are themselves LLMs.

What happens when multiple poisoned skills load in the same context?

The paper evaluates single-skill injection only. Two or more poisoned skills composing in the same inference call could produce cascading interactions: payloads reinforcing each other, interfering, or generating tool-call sequences no single payload would produce. The 55,315-skill audit found that reference files alone can inject tens of thousands of tokens per invocation, so multi-skill sessions routinely involve large composed contexts. This interaction space remains untested.

How fast is the skill ecosystem expanding, and does growth raise exposure?

The open skill format POISE targets collected 2,300 GitHub stars in its first week of availability, indicating rapid adoption across Claude Code and Codex workflows. The March 2026 audit found 26.4% of published skills lack routing descriptions, meaning no structured metadata constrains how they are invoked. As the skill pool grows, each entry without routing controls adds an unmonitored injection surface, and no major framework currently requires routing metadata at publication time.

What would a runtime monitoring layer need to instrument?

Three things: which skill instructions the agent acts on, the sequence of those actions relative to the skill’s declared intent, and whether the resulting tool calls match the declared purpose of the skill that triggered them. This is closer to application-level execution monitoring than content scanning. Current agent stacks do not expose hooks for this kind of instrumentation, so building it requires changes to the agent runtime itself rather than a wrapper or plugin around the skill format.