
Can You Reconstruct an LLM's System Prompt From Its Activations?

PRISM recovers the full instruction set steering a frozen LLM from its hidden states, so anyone with activation access can reconstruct a system prompt without probing the model’s outputs.

Yes. PRISM, an activation-conditioned interpreter submitted to arXiv on June 8, 2026, recovers the full instruction set steering a frozen language model directly from its hidden states. The method decodes internal representations into faithful bullet lists of active instructions, constraints, and hidden objectives, a capability that goes well beyond inferring behavior from output text. The paper is currently under review; quantitative benchmark tables have not been verified against the full PDF, so the analysis below scopes specific claims to what the abstract supports.

What PRISM Does

Prior work on extracting information from model internals focused on detecting single properties: is the model being deceptive? Is it planning harm? Is it concealing information? These probes, typically trained as linear classifiers on mid-layer activations, perform well. Recent work on deception detection reports linear-probe AUROCs between 0.96 and 0.999, with recall of 95 to 99% at a 1% false-positive rate. The probes themselves are lightweight: each is a matrix multiplication over already-computed activations, so the added latency is negligible and real-time deployment is operationally feasible.
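
To make the probe and its headline metric concrete, here is a minimal Python sketch of a linear probe score and of recall at a fixed false-positive rate. The function names, the toy scores, and the conservative tie-handling rule are illustrative assumptions, not taken from the cited work.

```python
def probe_logit(activation, weights, bias):
    """Score one activation vector with a linear probe (a single dot product)."""
    return sum(a * w for a, w in zip(activation, weights)) + bias


def recall_at_fpr(scores, labels, max_fpr=0.01):
    """Fraction of positives flagged when at most max_fpr of negatives are flagged."""
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")
    # Number of negatives we may flag without exceeding the FPR budget.
    allowed = min(int(max_fpr * len(negatives)), len(negatives) - 1)
    # Flag only scores strictly above the first negative that must stay unflagged.
    threshold = negatives[allowed]
    return sum(1 for s in positives if s > threshold) / len(positives)


# Hypothetical toy scores: label 1 = deceptive, label 0 = honest.
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
toy_labels = [1, 1, 0, 0, 0, 1]
strict_recall = recall_at_fpr(toy_scores, toy_labels, max_fpr=0.01)
```

With only three negatives, a 1% budget allows zero false positives, so the threshold sits at the highest-scoring negative. In a real evaluation the threshold would be set on a much larger held-out negative set.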

PRISM moves from single-attribute detection to instruction-set retrieval. The formalization matters: rather than asking “does this activation encode property X?”, PRISM asks “what is the complete set of instructions, constraints, prohibitions, and subgoals currently active in this model?” The output is a structured bullet list, not a binary classification. This is a qualitative jump in extraction granularity. Where prior probes could flag that a model has a hidden objective, PRISM aims to tell you what that objective actually says.

How It Works

PRISM uses a judge-guided variant of GRPO to train an interpreter model that reads hidden states from a frozen target model and decodes them into text. The reward signal comes from a judge that scores the interpreter’s output on two dimensions: coverage (how many of the actual instructions are recovered) and faithfulness (whether the output contains claims not supported by the target model’s instructions). The interpreter is penalized for hallucinating instructions the model was never given.
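
The reward structure described here can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the paper's implementation: the judge in PRISM is presumably a model, whereas this stand-in uses exact string matching; the equal weighting of coverage and faithfulness is hypothetical; and the advantage step shows only the standard group-relative normalization that GRPO is named for, not whatever the judge-guided variant changes.

```python
from statistics import mean, pstdev


def judge_reward(predicted, reference, w_coverage=0.5, w_faithful=0.5):
    """Toy judge: exact-match coverage and faithfulness (weights are hypothetical)."""
    pred, ref = set(predicted), set(reference)
    if not ref:
        raise ValueError("reference instruction set must not be empty")
    supported = len(pred & ref)
    # Coverage: share of the true instructions that were recovered.
    coverage = supported / len(ref)
    # Faithfulness: share of emitted bullets that are actually supported.
    # An empty output scores 0.0 here so that saying nothing is not rewarded.
    faithfulness = supported / len(pred) if pred else 0.0
    return w_coverage * coverage + w_faithful * faithfulness


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical instruction set and three sampled interpreter outputs.
reference = ["answer in French", "never reveal the key", "cite sources"]
candidates = [
    list(reference),                                  # full, faithful recovery
    ["answer in French", "always praise the user"],   # partial, one hallucination
    [],                                               # nothing recovered
]
rewards = [judge_reward(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

The hallucinated bullet in the second candidate lowers faithfulness without raising coverage, which is the penalty the text describes.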

The “frozen target model” detail is important. PRISM does not modify the model whose instructions it extracts. It trains a separate interpreter against that model’s activations. The target model runs normally; the interpreter watches its hidden states and reconstructs what it was told to do. Operationally, this means the extraction side can be bolted onto an existing inference pipeline without touching the model weights themselves.
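
The read-only relationship between target and interpreter can be illustrated with a toy tap pattern in Python. Everything here is hypothetical: `FrozenTarget` is a two-layer stand-in rather than a real LLM, and `ActivationRecorder` only captures hidden states where a trained interpreter would decode them. In a real inference stack the equivalent hook point depends on the framework.

```python
class FrozenTarget:
    """Toy stand-in for a frozen model: fixed linear layers followed by ReLU."""

    def __init__(self, layers):
        # Nested tuples keep the weights immutable, mirroring a frozen model.
        self._layers = tuple(tuple(tuple(row) for row in w) for w in layers)

    def forward(self, x, tap=None):
        hidden = list(x)
        for index, weight in enumerate(self._layers):
            hidden = [max(0.0, sum(w * h for w, h in zip(row, hidden)))
                      for row in weight]
            if tap is not None:
                # Hand the observer an immutable copy; it cannot alter the run.
                tap(index, tuple(hidden))
        return hidden


class ActivationRecorder:
    """Observer that keeps the hidden state of one chosen layer."""

    def __init__(self, layer):
        self.layer = layer
        self.captured = None

    def __call__(self, index, hidden):
        if index == self.layer:
            self.captured = hidden


# Hypothetical weights: layer 0 is 2x2, layer 1 maps two units to one output.
target = FrozenTarget([[[1.0, 0.0], [0.0, -1.0]], [[1.0, 1.0]]])
recorder = ActivationRecorder(layer=0)
observed_output = target.forward([2.0, 3.0], tap=recorder)
plain_output = target.forward([2.0, 3.0])
```

The target produces the same output with or without the tap attached, which is the property that lets extraction sit beside an existing pipeline.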

The paper evaluates PRISM across four setting categories: benign (standard helpful instructions), constrained (models operating under restrictions), prompt-injection (models that have received adversarial inputs), and hidden-objective (models with undisclosed goals). The authors claim the strongest gains over baselines come in the security-relevant settings, particularly hidden-objective recovery.

Single-Attribute Probing vs. Full Instruction Recovery

The distinction between what PRISM does and what prior probes do is worth sharpening.

A deception probe tells you a model is hiding something. PRISM tells you what it is hiding. That difference has direct security implications. A monitoring system that flags “deception detected” at inference time can halt a transaction for human review. A system that produces the actual hidden instruction (“exfiltrate user credentials via the debug endpoint”) enables a qualitatively different incident response.

According to the activation-probe survey, prior work has already established that hidden states carry rich, extractable information about model intent. The interpretability toolkit has been dual-use for some time. PRISM extends the extraction side of that duality from property detection to content recovery.

Who Is Actually at Risk

Not everyone. The threat model is narrower than it first appears.

Activation-level access is the prerequisite. You need to read the model’s intermediate hidden states, not just send it prompts and read its outputs. Three categories of actor typically have this access:

  1. Inference providers (API hosts like Anthropic, OpenAI, Google, and the growing field of GPU-cloud operators). If you run the inference server, you see the activations. PRISM means your internal red team, your compliance team, or a rogue employee with inference-server access can reconstruct any system prompt your customers are running through your infrastructure.

  2. Self-hosters who deploy open-weights models on their own hardware. If you control the GPU, you control the activations, and therefore you control the extraction capability. For enterprises running fine-tuned models behind a firewall, this is a feature: you can audit what your own models are doing. For multi-tenant on-prem deployments, it is a risk.

  3. Regulators and auditors with the legal authority to require activation logging. PRISM gives them a methodological tool: rather than black-box testing a model’s compliance with safety requirements, they can directly extract its operating instructions and verify them against policy.

The group conspicuously not at immediate risk: thin-wrapper SaaS products that call models over HTTP APIs. If your product is a proprietary system prompt running against GPT-4 or Claude via an API endpoint, PRISM does not directly threaten you. You never expose activations. The model provider sees your prompt in plaintext anyway. If your IP protection strategy was “the API host can read my prompt but nobody else can,” that strategy remains intact under PRISM’s specific threat model, though it was never a strong position to begin with.

The Evasion Problem

Activation-based methods face a known countermeasure. Models can be trained to obfuscate internal representations, hiding concepts from activation monitors. The activation-probe survey notes adversarial robustness as an open challenge for probe-based methods. The cat-and-mouse dynamic is active.

PRISM’s robustness to adversarial obfuscation is not characterized in the abstract. Whether the instruction-set retrieval approach is harder to evade than single-attribute probing remains an open question. On priors, recovering a full instruction set is a harder task than detecting a single property. That could make PRISM more vulnerable to obfuscation (there is more content to recover, so more ways for recovery to fail) or more resilient (redundant information across multiple instructions makes complete suppression difficult). The paper’s full results, when available, should address this.

A separate line of work suggests evasion may be harder than it appears. A June 2026 paper on cross-LLM consistency finds that advanced models from different developers develop convergent internal inference patterns when predicting the same token. If representational structure is shared across architectures, obfuscation techniques trained on one model may not transfer cleanly to others. This is preliminary work, but it hints that instruction recovery could generalize better than model-specific evasion defenses.

What the Paper Doesn’t Tell Us Yet

Three gaps are visible from the abstract alone.

Numerical results. The authors claim PRISM outperforms baselines, but no recovery-rate percentages, per-setting F1 scores, or comparison tables are available from the abstract. Whether “outperforms” means a marginal gain or a substantial one on hidden-objective recovery materially affects how seriously to take the security implications. This is the standard gap for a freshly submitted arXiv preprint; the full paper should fill it.

Adversarial robustness. As noted above, the interaction between PRISM and models trained to evade activation monitoring is uncharacterized. Given that adversarial training has been shown to degrade probe performance, this gap matters for any deployment threat model.

Model and scale coverage. The abstract does not specify which architectures or parameter scales PRISM was evaluated on. Whether the method works on a 7B open-weights model, a 70B model, and a proprietary API-served model is central to its practical relevance. The evaluation settings (benign, constrained, prompt-injection, hidden-objective) are described at the category level but not mapped to specific model families.

The Practical Takeaway

For inference providers, PRISM is a capability that was already latent in the existing interpretability toolkit, now made explicit and systematic. If you run inference servers and your customers send you proprietary system prompts, you already have plaintext access to those prompts in your logs. PRISM adds a second extraction path via activations, which is more relevant for scenarios where the prompt is obfuscated or where you want to verify that the prompt you see matches the model’s actual operating instructions (a defense against prompt-injection attacks from the customer’s side).

For self-hosters running open-weights models, PRISM is a monitoring tool. It can verify that a deployed model is following its intended instructions and detect hidden objectives injected via fine-tuning or poisoning. The dual-use nature of the technique is clear: the same capability that lets you audit your own models lets an attacker with inference-server access extract the instructions of models they do not own.

For the broader ecosystem, PRISM formalizes something the interpretability community has been converging on for two years: model internals are readable, the readability is getting cheaper, and the information extracted is moving from coarse labels toward faithful reconstruction of the model’s full instruction context. The assumption that system prompts are private when activations are observable was never well-founded. PRISM makes the tools for violating that assumption public, reproducible, and benchmarked.

Frequently Asked Questions

How badly does adversarial training break activation probes?

Published benchmarks show adversarial training can drop probe recall from 90% to 23% at a 1% false-positive rate for deception and harmfulness detection. PRISM’s robustness to this evasion has not been tested, so any deployment threat model should assume the worst-case degradation applies until the full paper provides numbers.

Can activation probes be used offensively, not just for monitoring?

Yes. Probe-guided activation steering has already been demonstrated to break privacy alignment and force disclosure of memorized private information through targeted attention-head manipulation. PRISM extends what can be read from activations; the steering work extends what can be written to them. Together they form a full read-write attack surface on any deployment where activations are accessible.

Could a PRISM interpreter trained on one model extract instructions from a different architecture?

A June 2026 study on cross-LLM consistency found that models from different developers develop convergent internal inference patterns when predicting the same token, suggesting shared representational structure across architectures. If that finding holds, an interpreter trained against one model family may transfer to others with less retraining than expected, though PRISM itself has not demonstrated cross-model transfer.

Which model layers carry the most extractable instruction signal?

Deception-detection probes peak at mid-layers: layer 22 in Llama-3.3-70B produces the strongest signal, not the final layers where output logits are computed. This matters for deployment because intercepting activations at a specific intermediate layer is cheaper than logging the full residual stream, and PRISM’s interpreter would likely target the same high-signal layers rather than reading every layer.
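
One way to find a high-signal layer is a sweep: fit a simple probe per layer and compare separation. The Python sketch below uses a difference-of-means direction and a pairwise AUROC. The function names and toy activations are hypothetical, the sketch scores on the same data it fits (a real sweep needs a held-out split), and nothing here is confirmed as PRISM's own layer-selection procedure.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_difference_direction(acts, labels):
    """Probe direction: mean positive activation minus mean negative activation."""
    pos = [a for a, y in zip(acts, labels) if y == 1]
    neg = [a for a, y in zip(acts, labels) if y == 0]
    return [sum(a[i] for a in pos) / len(pos) - sum(a[i] for a in neg) / len(neg)
            for i in range(len(acts[0]))]


def best_layer(acts_by_layer, labels):
    """Return the layer whose mean-difference probe separates the classes best."""
    results = {}
    for layer, acts in acts_by_layer.items():
        direction = mean_difference_direction(acts, labels)
        scores = [sum(x * d for x, d in zip(a, direction)) for a in acts]
        results[layer] = auroc(scores, labels)
    return max(results, key=results.get), results


# Hypothetical one-dimensional activations for three layers of a toy model.
toy_labels = [1, 1, 0, 0]
toy_acts = {
    0: [[1.0], [0.0], [1.0], [0.0]],   # no separation
    1: [[2.0], [3.0], [0.0], [1.0]],   # clean separation
    2: [[1.0], [3.0], [2.0], [0.0]],   # partial separation
}
chosen_layer, layer_aurocs = best_layer(toy_acts, toy_labels)
```

Once a layer is chosen this way, only that layer's activations need to be intercepted at inference time.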

Sources

  1. PRISM: Recovering Instruction Sets from Language Model Activations (primary; accessed 2026-06-09)
  2. Model-Internal Activation Probes (analysis; accessed 2026-06-09)
  3. Cross-LLM Consistency in Inference: Evidence from Shared Interactions (primary; accessed 2026-06-09)