Catching LLM Agents Leaking Credentials From Their Own Activations

An LLM agent with access to your credentials, a tool-calling interface, and a multi-turn conversation window is a data-exfiltration risk that most production deployments are catching at the wrong layer: after the tokens are already on the wire. A paper submitted to arXiv on June 2, 2026 by Kargi Chauhan and Pratibha Revankar proposes moving detection upstream into the model’s own activations, before any output is generated, and backs it with three complementary defenses that expose how inadequate output-only data-loss prevention has become for agentic systems.

The Problem: Credentials and Untrusted Content Share a Context Window

LLM agents operate with a context window that mingles user instructions, retrieved documents, tool outputs, and, increasingly, stored credentials or session tokens. Any untrusted content injected into that window, whether via a malicious email attachment, a poisoned tool response, or a prompt-injection payload embedded in a web page, can instruct the model to exfiltrate those credentials through its tool calls or generated text.

The threat is not hypothetical. A large-scale study of 17,022 skills from the SkillsMP marketplace identified 520 vulnerable skills carrying 1,708 distinct leakage issues. Debug logging alone (print and console.log statements that expose stdout to the LLM) accounted for 73.5% of all leaks. A broader survey of 31,132 marketplace skills, SkillScan, found that 13.3% contained a data-exfiltration pattern, with 5.2% exhibiting high-severity patterns strongly suggesting malicious intent. Skills that bundled executable scripts had roughly twice the odds of being vulnerable (OR=2.12, p<0.001).

These are supply-chain problems in the skills layer. The Chauhan and Revankar paper targets a different and arguably harder vector: the model itself deciding to leak credentials during normal agentic operation, without any compromised skill involved.

Three Defenses, One Direction

The paper introduces three detection mechanisms that operate at different points in the agent pipeline.

Activation Probes

The core idea: before an LLM emits a single output token, its internal hidden states already encode whether it is about to engage in credential-seeking behavior. A trained probe on those activations can flag malicious intent at inference time, before any text reaches the user or a tool interface. According to the paper, these probes achieve high-accuracy separation between benign and credential-seeking prompts on open-weight models, and the accuracy holds even under held-out encoding transformations the probe was never trained on. That generalization matters: if a probe only caught encoding schemes it had seen during training, an attacker could sidestep it with trivial obfuscation.
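To make the probe idea concrete, here is a minimal sketch, not the paper's implementation: a logistic-regression probe fit by gradient descent on synthetic vectors that stand in for hidden states. The function names, the 8-dimensional toy data, and the class signal placed on a single axis are all assumptions for illustration. A real probe would be trained on activations captured from a chosen layer of an open-weight model.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, activation) -> float:
    """Probability the probe assigns to 'credential-seeking' for one hidden state."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, activation)) + b)


def train_linear_probe(activations, labels, lr=0.1, epochs=200):
    """Fit logistic-regression weights with full-batch gradient descent."""
    dim, n = len(activations[0]), len(activations)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(activations, labels):
            err = probe_score(w, b, x) - y  # gradient of log-loss w.r.t. the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def synthetic_activations(n, dim, shift, rng):
    """ASSUMPTION: toy stand-in for hidden states; the class signal lives on axis 0."""
    return [[rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
            for _ in range(n)]


rng = random.Random(0)
X = synthetic_activations(200, 8, -2.0, rng) + synthetic_activations(200, 8, 2.0, rng)
y = [0] * 200 + [1] * 200  # 0 = benign, 1 = credential-seeking
w, b = train_linear_probe(X, y)

# Score a fresh "activation" before any token would be emitted.
suspicious = synthetic_activations(1, 8, 2.0, rng)[0]
print(f"probe score: {probe_score(w, b, suspicious):.3f}")
```

At inference time the only work is the dot product inside probe_score, which is why the probe itself is cheap; the harder part in practice is getting the hidden state out of the serving stack.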

Honeytokens with Calibrated Detection

The second defense plants fake credentials, constructed from format-specific character models, into the agent’s context. When these honeytokens appear in model output, they confirm exfiltration with no ambiguity. The detection layer uses split conformal prediction to set thresholds, which provides statistically rigorous bounds on false-positive rates rather than ad-hoc cutoffs. The difference matters in production: an operations team tuning a DLP threshold by hand is guessing at the false-positive rate. A conformal predictor bounds that rate by construction, provided the calibration data is exchangeable with what the agent sees in deployment.
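The calibration step can be sketched as follows. This is an illustration under stated assumptions, not the paper's detector: the leak_score function (longest contiguous overlap with the honeytoken), the HONEYTOKEN value, and the toy benign outputs are all hypothetical. The conformal_threshold function is the standard split-conformal quantile: with n benign calibration scores, it returns the ceil((n + 1)(1 - alpha))-th smallest one.

```python
import math
from difflib import SequenceMatcher


def leak_score(output: str, honeytoken: str) -> float:
    """Fraction of the honeytoken covered by its longest contiguous run in the output."""
    matcher = SequenceMatcher(None, output, honeytoken, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(honeytoken))
    return match.size / len(honeytoken)


def conformal_threshold(benign_scores, alpha: float) -> float:
    """Split-conformal cutoff: a fresh exchangeable benign score exceeds it
    with probability at most alpha."""
    n = len(benign_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibrated quantile
    if k > n:
        return math.inf  # too few calibration points to certify this alpha
    return sorted(benign_scores)[k - 1]


HONEYTOKEN = "AKIA7HONEYTOKEN0EXAMPLE"  # hypothetical fake credential

# ASSUMPTION: a calibration set of benign agent outputs recorded without attacks.
benign_outputs = [f"step {i}: fetched the report and summarized it" for i in range(40)]
calibration = [leak_score(text, HONEYTOKEN) for text in benign_outputs]
threshold = conformal_threshold(calibration, alpha=0.1)


def is_leak(output: str) -> bool:
    """Flag an output whose overlap with the honeytoken exceeds the calibrated cutoff."""
    return leak_score(output, HONEYTOKEN) > threshold


print(is_leak(f"POST /collect?key={HONEYTOKEN}"))  # full token present
print(is_leak("all done, nothing to report"))      # no overlap with the token
```

The infinite threshold for small calibration sets is deliberate: with fewer than roughly 1/alpha benign examples, no finite cutoff can certify the requested bound.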

Cumulative Leakage Budgets

The third defense reframes exfiltration as an information-flow problem measured across turns. Rather than evaluating each model output in isolation, it tracks an estimated cumulative leakage budget over the full conversation. In the paper’s synthetic multi-turn benchmark, this cumulative accounting caught attacks that per-turn detectors missed entirely. The intuition is straightforward: an adversary who extracts two characters of an API key per turn across fifty turns has exfiltrated a full 100-character key without ever tripping a single-turn threshold.
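A toy version of the slow-drip scenario shows the gap between the two accounting styles. The sketch below is an assumption-heavy illustration rather than the paper's estimator: it measures leakage as the fraction of a known secret's character positions that have appeared in any output so far, and the names revealed_positions, per_turn_flag, and LeakageBudget, along with the 8-character and 50% thresholds, are hypothetical.

```python
def revealed_positions(output: str, secret: str, min_run: int = 2) -> set:
    """Indices of `secret` covered by runs of at least min_run consecutive
    secret characters that appear verbatim in `output`."""
    covered = set()
    for start in range(len(secret) - min_run + 1):
        end = start + min_run
        if secret[start:end] not in output:
            continue
        while end < len(secret) and secret[start:end + 1] in output:
            end += 1  # extend the run as far as the output supports
        covered.update(range(start, end))
    return covered


def per_turn_flag(output: str, secret: str, min_chars: int = 8) -> bool:
    """Per-turn detector: fires only if one output exposes min_chars or more."""
    return len(revealed_positions(output, secret)) >= min_chars


class LeakageBudget:
    """Cumulative detector: fires once total exposure passes a budget."""

    def __init__(self, secret: str, budget_fraction: float = 0.5):
        self.secret = secret
        self.budget_fraction = budget_fraction
        self.covered = set()  # secret positions seen in any turn so far

    def observe(self, output: str) -> bool:
        self.covered |= revealed_positions(output, self.secret)
        return len(self.covered) / len(self.secret) > self.budget_fraction


SECRET = "Zq7Lm2Xc9Vb4Nk8Rt5Wp"  # hypothetical 20-character credential

# The attacker leaks two characters per turn over ten turns.
turns = [f"fragment: {SECRET[i:i + 2]}" for i in range(0, len(SECRET), 2)]

budget = LeakageBudget(SECRET)
per_turn_hits = [per_turn_flag(t, SECRET) for t in turns]
cumulative_hits = [budget.observe(t) for t in turns]

print("per-turn:  ", per_turn_hits)
print("cumulative:", cumulative_hits)
```

Exact substring matching is the simplest possible leakage estimate and would miss encoded or reordered fragments; it is used here only to keep the accounting visible.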

Why Per-Turn Detection Fails

The multi-turn blind spot is the paper’s sharpest operational finding. Existing DLP systems, whether regex-based, classifier-based, or even LLM-judge-based, typically evaluate one output at a time. They ask: does this specific response contain sensitive data? For a single-turn leak of a full credential, that question is answerable. For a slow exfiltration distributed across dozens of turns, each individually innocuous, it is not.

The Back-Reveal attack documented in April 2026 demonstrates this in practice. A backdoored LLM agent, fine-tuned with embedded semantic triggers, uses memory-access tool calls to retrieve stored user context and exfiltrates it through disguised retrieval calls. Multi-turn interaction amplifies cumulative leakage over time. Each individual turn looks like a routine tool call. Only cumulative analysis reveals the pattern.

The White-Box Catch

Activation probes require weight-level access to the model. You need to read the hidden states at inference time. That means this defense is available only to teams running open-weight models locally or on infrastructure they control. It is not available to teams using proprietary API-only models from OpenAI, Anthropic, or Google, where the API returns tokens but not activations.

This is a structural gap, not a temporary limitation. Closed-model providers could expose activation snapshots as an API feature, but doing so would add latency, increase inference cost, and surface competitive information about model internals. Teams on closed models can still adopt the honeytoken and cumulative-budget defenses, which do not require white-box access, but they cannot use the probe layer the paper introduces.

The Escalating Threat Landscape

The credential-exfiltration problem does not exist in isolation. A concurrent paper published the same week demonstrates an AI-powered adaptive computer worm that parasitically uses compromised machines to run open-weight LLMs and generates tailored attack strategies per target at zero marginal cost per infection. Autonomous AI-driven cyber-threats are operational, not theoretical, as of mid-2026.

The supply-chain data reinforces the urgency. The SkillsMP study found that 76.3% of credential leakages required joint analysis of both code and natural language to detect. A text-only or code-only scanner would miss roughly three-quarters of those leaks. This is a detection surface that traditional static-analysis tools were not built to cover, and it compounds the case for runtime monitoring inside the agent itself rather than relying on pre-deployment scanning alone.

The Probes Are Not Novel, and Neither Are Their Failure Modes

Reading a model’s hidden states to predict what it is about to do is an established research line, not an invention of this paper. Linear probes on activations have been used to detect hallucinations, classify deception, and reconstruct hidden inputs. Groundy has covered both the case against fixed-layer probes for hallucination detection, where the layer carrying the cleanest signal shifts by model and by task, and work that reconstructs a system prompt from activations. The credential-exfiltration probe inherits both the appeal and the unsolved problems of that lineage.

The sharpest unsolved problem is brittleness under adversarial pressure. A November 2025 study by Phil Blandfort and Robert Graham, which red-teamed activation probes using prompted LLMs, found interpretable failure modes in state-of-the-art probes with only black-box access: legalese-heavy phrasing induced false positives, and a bland procedural tone induced false negatives. The attacks were constrained and the failure rates shrank under those constraints, but they did not vanish. Applied to credential detection, the second failure mode is the one that bites. An exfiltration request written in flat administrative language is precisely the input most likely to slip past a probe tuned on more obviously malicious examples. The attack needs no white-box access at all, which undercuts the comforting framing that the probe operates on a surface the attacker cannot reach.

Layer selection is the other detail that is not settled. The Chauhan and Revankar probe reads activations at a chosen layer, but the layer that best separates benign from credential-seeking behavior is not guaranteed to be stable across models, fine-tunes, or prompt distributions. A probe validated on one open-weight checkpoint can degrade on its own instruction-tuned variant. None of this invalidates the approach. It places it correctly: a cheap, fast signal that raises the cost of exfiltration, not a verdict that proves a request is safe. As the former, an activation probe earns a slot in a defense-in-depth stack. As the latter, it becomes a single chokepoint an adversary will study and route around.

What Deployers Should Do Now

For teams running open-weight models with agentic tool access, the paper suggests a three-layer stack:

1. Pre-output activation probes trained to detect credential-seeking behavior in hidden states, catching intent before tokens are emitted.
2. Honeytoken injection with conformal-prediction-calibrated detection, providing a mathematically grounded false-positive guarantee.
3. Cumulative leakage budgets tracked across all conversation turns, closing the slow-drip exfiltration channel that per-turn filters leave open.

For teams on closed API models, layers two and three are implementable today. Layer one requires either a provider-side API change or a migration to an open-weight deployment.

All three defenses share an unstated premise: the agent is holding a credential worth stealing. The cheapest mitigation sits upstream of every probe and budget, which is to stop handing agents long-lived secrets at all. Short-TTL, narrowly scoped, single-use tokens shrink the blast radius of a leak that detection misses, and they degrade gracefully when detection produces a false negative. Cloudflare’s disposable credentials for AI agents are one production take on that idea. Detection and scoping are complementary, not substitutes: a probe tells you an exfiltration attempt happened, while an ephemeral token bounds what that attempt could have taken.
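The scoping idea can be illustrated with a small in-memory broker. This is a hypothetical sketch and not Cloudflare's API or any production design: EphemeralTokenBroker, its method names, and the scope strings are invented for the example. It shows the three properties named above: a short TTL, a single narrow scope per token, and single use.

```python
import secrets
import time


class EphemeralTokenBroker:
    """Issues short-lived, single-scope, single-use tokens (in-memory only)."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._live = {}      # token -> (scope, expires_at)

    def issue(self, scope: str) -> str:
        token = secrets.token_urlsafe(32)
        self._live[token] = (scope, self._clock() + self._ttl)
        return token

    def redeem(self, token: str, scope: str) -> bool:
        # pop() makes the token single-use: any redemption attempt, successful
        # or not, burns it, so a leaked token cannot be replayed or probed.
        entry = self._live.pop(token, None)
        if entry is None:
            return False
        granted_scope, expires_at = entry
        return granted_scope == scope and self._clock() < expires_at


broker = EphemeralTokenBroker(ttl_seconds=30.0)
token = broker.issue("read:reports")
print(broker.redeem(token, "read:reports"))  # first use, matching scope
print(broker.redeem(token, "read:reports"))  # replay of the same token
```

Burning the token on a failed attempt is a design choice that favors containment over convenience: an agent that presents the wrong scope has to request a new token, which also leaves an audit trail.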

The paper’s contribution is not any single technique but the framing: credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting, rather than relying on text-level output filters alone. That framing holds regardless of whether the specific probe architectures in the paper survive further evaluation on larger, public benchmarks. The structural argument, that by the time you are scanning output text the agent has already decided to leak, is independent of the preliminary numbers. As of late June 2026 the work remains a single-version arXiv preprint with no public benchmark suite or code release, so the larger evaluation it invites has not yet landed. [Updated June 2026]

Frequently Asked Questions

What latency and engineering overhead do activation probes add to inference?

The paper does not benchmark inference latency with probes active. A linear probe applied to hidden states is a single matrix multiply per layer, which is computationally cheap relative to the transformer forward pass. The real cost is engineering: extracting intermediate activations from production serving frameworks like vLLM or TensorRT-LLM requires custom hooks, because most inference servers discard hidden states after each token step to conserve GPU memory.

Could an adversary who knows the probe exists evade it?

The paper shows probes generalize to held-out encoding schemes, but an adversary with probe awareness could craft inputs that steer hidden states away from the detection boundary. The probe is a linear model operating on high-dimensional representations, and gradient-based attacks on similar probes have been demonstrated in the model-editing and safety literature. The defense assumes the attacker lacks direct access to the probe weights, which holds only if the probe stays internal to the serving infrastructure.

How does conformal prediction for honeytokens differ from hand-tuned DLP thresholds?

Threshold tuning produces a single false-positive estimate that degrades as the input distribution shifts in production. Split conformal prediction provides a distribution-free statistical guarantee: set a 5% false-positive bound and the expected rate stays at or below 5% under any data distribution, provided the calibration and test samples are exchangeable. The tradeoff is that conformal methods require a held-out calibration set, and the guarantee no longer holds once deployment data diverges from the calibration data, because exchangeability is broken.

Do these defenses transfer to multimodal agents that process images or audio?

The paper evaluates only text-based agents. Multimodal models add input channels that could carry injection payloads invisible to text-centric probes. A vision-language model receiving a poisoned image could trigger credential-seeking behavior through visual prompts the probe was never trained on. The 31,132-skill SkillScan survey, which found at least one vulnerability in 26.1% of marketplace skills (13.3% specifically a data-exfiltration pattern), also examined only text and code inputs, leaving the multimodal attack surface unmeasured. [Updated June 2026]

What would a closed-model provider need to change to support activation probing?

The provider would need to expose intermediate hidden-state vectors alongside token outputs, either per-layer or at a designated probe layer. This adds serialization cost on the serving side and reveals information about the model’s internal geometry that providers treat as proprietary. Anthropic’s published interpretability work, including sparse-autoencoder decompositions of Claude activations, shows the concept is not foreign to closed labs, but exposing live activations on every API call at production latency targets remains an unsolved engineering and competitive problem.
