OpenAI’s instruction hierarchy trains models to rank system prompts above user inputs above tool output, but the best-performing model still fails 6% of hierarchy-conflict tests in controlled benchmarks. The burden of that gap lands on every developer deploying agents that trust the model to enforce its own privilege ladder.
What Instruction Hierarchy Actually Is
OpenAI organizes the messages in an LLM conversation into four privilege levels: system instructions at the top, then developer messages, then user messages, then tool output at the bottom. When a conflict arises, higher-trust instructions take precedence. OpenAI’s ChatGPT agent implements this as its first defense layer: system-level developer instructions get the highest trust, user instructions get moderate trust, and external content pulled from web pages, emails, and documents gets the lowest.
The model is not parsing a mandatory access-control scheme. It is reading a ranked list of message types and applying learned weights to resolve contradictions. That distinction matters because a model can be trained to prefer higher-privilege instructions, but it cannot be prevented from following lower-privilege ones when the context is adversarial. The hierarchy is a soft preference, not a sandbox boundary.
The IH-Challenge Numbers: Improvement Without Perfection
On March 10, 2026, OpenAI published IH-Challenge, a reinforcement-learning dataset of tasks designed to train models to respect the four-level ranking. The tasks are deliberately simple and objectively gradable with a Python script, structured so that blanket refusal does not score well. The design targets a known failure mode: models that learn to refuse everything rather than correctly resolving which instruction wins.
The benchmark results for GPT-5 Mini-R trained on IH-Challenge:
| Benchmark | Before | After |
|---|---|---|
| TensorTrust system-user conflict | 0.86 | 0.94 |
| TensorTrust developer-user conflict | 0.76 | 0.91 |
| RealGuardrails | 0.82 | 0.89 |
| System IFEval | 0.92 | 0.96 |
All four scores improved. None reached 1.0. The TensorTrust developer-user conflict score, the weakest starting point, climbed 15 points but still leaves a 9% failure rate; per the IH-Challenge report, capability benchmarks held steady: GPQA Diamond stayed at 0.83 and AIME 2024 moved from 0.93 to 0.94, so the training did not trade reasoning for compliance.
The takeaway is not that IH-Challenge failed. The gains are real and the capability preservation is encouraging. The takeaway is that the ceiling on hierarchy compliance is below 1.0 on every measured task, and these tasks are controlled single-turn conflicts. Real-world multi-turn agent workflows present wider attack surfaces.
Why Hierarchy Is Probabilistic, Not Protocol-Level
Instruction hierarchy is a trained behavior, not a protocol guarantee. There is no system-level enforcement that prevents the model from acting on a user instruction that contradicts a developer instruction. The model has learned to prefer the higher-ranked message, and the training makes that preference stronger, but preference is not prohibition.
OpenAI itself draws attention to this limit. Its April 2026 defense guide states that “prompt injection cannot be fully solved at the model level alone” and that application-layer defenses are required for production deployments. The company also describes the defense landscape as “an ongoing arms race rather than a solved problem.”
This is the right framing. The wrong framing, common in optimistic coverage, treats the 0.94 TensorTrust score as evidence that hierarchy “blocks” injection. It reduces injection success rates in benchmark conditions. It does not block injection, and the difference matters the moment an agent is given access to tools that modify state.
What Hierarchy Cannot Stop: The Gemini CLI Case Study
In May 2026, Pillar Security disclosed a CVSS-10 (maximum severity) supply-chain attack against Google’s Gemini CLI. A malicious npm package hid prompt-injection payloads inside code comments. When Gemini CLI read the file as part of its normal workflow, the agent followed the injected instructions and executed arbitrary shell commands, including exfiltrating environment variables.
Instruction hierarchy offers no defense here. The malicious content arrives through a legitimate file-read channel, not through a message the model can rank against other messages. The file content does not carry a privilege label. From the model’s perspective, it is just text that the tool returned, same as any other file. The attack exploits the trust the agent places in tool output, not a weakness in how the model ranks conflicting instructions.
Supply-chain injection is not a theoretical edge case. It is the natural attack path once direct prompt injection gets harder. If an attacker cannot fool the model through the user message, they embed the payload in a dependency the model will read on its own. Hierarchy training does not address this vector because the payload never competes with a higher-privilege instruction. It arrives as part of a task the model has already been instructed to perform.
What App Builders Must Do Beyond Hierarchy
OpenAI’s March 11 agent defense guide lays out a defense-in-depth architecture that treats hierarchy as one layer among many. The practical measures for production agents:
Separate untrusted inputs from high-trust instructions. External content should never be concatenated into the same message as developer instructions. Structure your prompts so that web pages, file contents, and user-supplied text occupy clearly bounded sections, and test that the model does not act on instructions embedded in those sections.
Tighten tool permissions to least privilege. OpenAI’s ChatGPT agent categorizes actions along a risk spectrum: low-risk read-only operations proceed, medium-risk actions get contextual analysis for intent mismatch, and high-risk actions (sending emails, modifying files, making purchases) always require explicit user confirmation regardless of injected instructions. Apply the same tiering to your own agents.
Validate tool outputs before passing them to the model. If an agent reads files, sanitize or strip comment blocks from untrusted sources before including them in context. If an agent fetches web pages, parse structured data rather than dumping raw HTML into the prompt.
Build repeatable security eval suites. Test your specific agent workflows against hierarchy violations, not just the generic IH-Challenge tasks. The benchmarks measure controlled conflict resolution; your deployment has its own message-layering patterns, tool combinations, and multi-turn flows. OpenAI’s defense guide explicitly recommends building eval suites that probe these domain-specific failure modes.
Use canary tokens and output monitoring. Embed unique markers in developer instructions and monitor whether those markers appear in model outputs or tool calls. Detecting a canary leak means catching a hierarchy violation in real time rather than discovering it in a post-mortem.
How to Test Your Own Message Layering
The IH-Challenge benchmark gives a useful baseline, but it measures generic performance, not your agent’s failure modes. Building a hierarchy eval for your own deployment requires:
Enumerate your message flows. Map every channel through which untrusted content enters the model context: user messages, file reads, API responses, web fetches, database queries. Each channel is an injection surface.
Write conflict probes per channel. For each untrusted-input channel, craft test cases where the content contains an instruction that contradicts a developer-level instruction. Measure how often the model follows the lower-privilege instruction.
Test multi-turn escalation. Single-turn conflict resolution is what IH-Challenge measures. Real attacks spread across turns, building context that shifts the model’s attention away from the original developer instructions. Probe for gradual hierarchy erosion over 5, 10, and 20 turns.
Measure refusal rates alongside compliance. A model that refuses all user requests when a system instruction is present is not correctly resolving hierarchy; it is over-refusing. IH-Challenge was specifically designed to avoid rewarding this behavior. Your eval should do the same: include cases where the user instruction is benign and should be followed, and verify that the model still complies.
Run the suite on every model update. Hierarchy compliance is a trained behavior. It can regress when the model weights change. Treat the eval as a regression test, not a one-time check.
The Practical Bottom Line
OpenAI’s instruction hierarchy is a genuine improvement. The TensorTrust scores moved from 0.76/0.86 to 0.91/0.94, and capability held. But 0.94 is not 1.0, the benchmarks are controlled, and OpenAI’s own publications state that model-level defenses are insufficient for production. The Pillar Security disclosure of a CVSS-10 supply-chain attack against Gemini CLI demonstrates a concrete attack path that hierarchy does not address, because the payload arrives through a channel the model has no reason to distrust.
For teams deploying LLM agents, the operating assumption should be that the model will sometimes follow lower-privilege instructions when it should not, and that some injection vectors bypass the hierarchy entirely. Build your defense stack accordingly: least-privilege tool access, structured output validation, human confirmation for high-risk actions, and eval suites that test your specific agent workflows rather than relying on generic benchmark scores.
Frequently Asked Questions
Does the four-level hierarchy protect against injection through images or audio?
IH-Challenge evaluates text-message conflicts only. A model that receives an image with embedded text or an audio clip containing spoken instructions has no mechanism to assign a privilege level to that content, because the hierarchy ranks message sources (system, developer, user, tool), not modalities. Multimodal inputs pass through without a privilege label, making them ungated injection surfaces even on hierarchy-trained models.
Can teams apply IH-Challenge training to their own fine-tuned models?
The dataset is public and ships with a Python grading script, so teams can fold it into RLHF or fine-tuning pipelines for any OpenAI-compatible model. The limitation is scope: the tasks are generic privilege-conflict scenarios (system says X, user says Y). A domain-specific agent handling medical records or financial transactions faces conflict patterns the generic dataset never touches. Teams deploying in regulated verticals should layer domain-adversarial examples on top of the IH-Challenge baseline to cover their actual attack surface.
Why do multi-turn attacks bypass hierarchy when single-turn benchmarks score 0.94?
The benchmarks test direct contradictions: a system instruction and a user instruction that explicitly disagree. Multi-turn attacks avoid direct contradiction entirely. Each turn adds context that gradually reframes the original developer instruction as irrelevant or superseded, without any single message triggering a hierarchy conflict. By turn 10 or 15, the model may comply with a request it would have rejected at turn 1. The 0.94 TensorTrust score measures none of this degradation, because TensorTrust evaluates isolated conflict pairs, not cumulative context shifts.
Would moving enforcement from model weights to the inference runtime fix the gap?
Runtime enforcement would strip or sandbox low-privilege content before the model sees it, turning the preference into a hard filter. The cost is rigidity: a filter that blocks all instructions in tool output also blocks legitimate cases where a file contains natural-language directions the agent should follow. The current soft-preference design exists partly because hard enforcement produces false positives that break useful agent workflows. Any protocol-level solution will need to distinguish malicious instructions from benign content in the same message, which is the same classification problem the model is already being trained to solve.