Poisoning a RAG Retriever: How Conflict-Aware Edits Inject False Knowledge

RAG pipelines carry a trust assumption embedded in their architecture: if the document corpus is clean and the LLM is guarded, the system is safe. A preprint submitted June 16, 2026, arXiv:2606.18310, proposes an attack that invalidates exactly this assumption by targeting neither component. CAREATTACK edits the retriever’s model parameters directly, promoting attacker-controlled passages into the top-k results for targeted queries without touching a single document in the knowledge base.

How RAG’s Security Perimeter Got Built

The standard RAG architecture couples a dense retriever with an LLM-based generator, and most security work has followed the data flow toward the corpus. At inference time, the retriever encodes the incoming query into an embedding vector, scores that vector against pre-computed embeddings of corpus documents, and selects the top-k matches. Those matches are appended to the LLM’s context window, and the LLM generates a response conditioned on both the query and the retrieved passages, per the Wikipedia RAG overview.
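
The retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming cosine similarity as the scoring function and a toy three-document corpus; the function name `top_k` and all embedding values are hypothetical and not taken from any specific retriever.

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int) -> list[int]:
    """Return indices of the k documents with the highest cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                              # one similarity score per document
    order = np.argsort(-scores, kind="stable")  # highest score first
    return order[:k].tolist()

# Toy corpus: three pre-computed 3-d embeddings (illustrative values only).
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
query_embedding = np.array([0.9, 0.1, 0.0])

retrieved = top_k(query_embedding, doc_embeddings, k=2)
print(retrieved)
```

Everything the LLM later sees depends on the ranking this function returns, and that ranking depends entirely on how the encoder produced the vectors.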

The dominant prior attack class, data poisoning, involves injecting crafted synthetic documents into the knowledge base so a retriever scores them highly against target queries. AWS’s RAG explainer, GeeksForGeeks’ RAG documentation, and most practitioner guides treat the retriever itself as fixed infrastructure, a static dependency that scores documents, nothing more. The embedding model that does the scoring doesn’t get the same scrutiny as the documents it operates on.

The retriever’s weights occupied an implicit trust position. Operators verify document provenance, monitor for anomalous content, and run output filters. None of those controls touch the retriever model itself.

What Changed: The Retriever Is Now the Target

CAREATTACK moves the attack surface from the knowledge base to the retriever’s own weights. The threat model, according to arXiv:2606.18310, requires access to the retriever’s model parameters rather than to the corpus or the LLM. This inverts the standard assumption in prior RAG poisoning work, in which the adversary was granted write access to the external knowledge base but not to retrieval or LLM parameters, as the RAG poisoning survey on emergentmind.com notes.

The operational consequence is direct. If an attacker can supply or modify the retriever weights that end up in deployment, through a poisoned artifact on a model hub, for instance, a scrupulously clean, well-governed document corpus provides no protection. Every query the attacker has targeted will return their chosen passages in the top-k, and the LLM will read from them as if they were authoritative sources. The knowledge base stays pristine. The output is wrong.

This is a different threat than corpus poisoning, with different detection characteristics and different defenses.

How CAREATTACK Works: Two Stages, One Intact Corpus

The attack proceeds in two stages, both operating entirely on the retriever’s parameters. No document is modified, inserted, or deleted.

The first stage is conflict-aware retriever editing. According to arXiv:2606.18310, the approach uses closed-form parameter edits guided by graph-based conflict detection and parameter editing projection. The goal is to shift the retriever’s embedding geometry so that malicious passages score higher than benign ones for targeted queries. The “conflict-aware” component addresses a practical editing problem: naive parameter edits that boost one passage’s relevance can degrade performance on unrelated queries, leaving a behavioral fingerprint detectable through standard quality monitoring. Conflict detection identifies which parameter regions carry load for non-target queries; the projection step avoids overwriting them.
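
To make the idea of a closed-form, projected edit concrete, here is a generic sketch in the style of the model-editing literature. It is not the paper’s algorithm: it assumes a single hypothetical linear layer `W`, uses random vectors as stand-ins for real activations, and replaces graph-based conflict detection with a hand-picked set of inputs (`K_keep`) whose outputs must stay unchanged. The edit is a rank-one update restricted to directions orthogonal to those inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))        # hypothetical linear layer inside the encoder
K_keep = rng.normal(size=(d, 3))   # inputs whose outputs must not change (assumed given)
k_target = rng.normal(size=d)      # input associated with a targeted query
v_target = rng.normal(size=d)      # output the edit should produce for that input

# Projector onto the orthogonal complement of span(K_keep):
# any update of the form (something) @ P leaves W @ K_keep untouched.
P = np.eye(d) - K_keep @ np.linalg.pinv(K_keep)

# Closed-form rank-one update: move W @ k_target to v_target using only
# the component of k_target that does not overlap with the protected inputs.
residual = v_target - W @ k_target
pk = P @ k_target
delta = np.outer(residual, pk) / (k_target @ pk)
W_edited = W + delta

print(np.allclose(W_edited @ k_target, v_target))      # target output reached
print(np.allclose(W_edited @ K_keep, W @ K_keep))      # protected outputs preserved
```

The projection is what keeps the edit quiet: outputs for the protected inputs are identical before and after, so a quality check built only on those inputs would see no change.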

The second stage is attack-preserving anchor repair, a lightweight calibration pass that preserves the injected retrieval bias for target prompts while minimizing measurable impact on non-target prompts. An operator running retrieval quality checks on a representative query set would see roughly normal behavior. The attack only surfaces for the specific queries the adversary has targeted.

Why Open-Source Retrievers Are the Practical Entry Point

CAREATTACK is instantiated on Qwen3-Embedding-0.6B and BGE-M3, two widely deployed open-source embedding models, and evaluated across three benchmark datasets. The paper’s authors note that most RAG systems are built on open-source retrieval models, making them a practical target rather than a theoretical one.

This is a supply chain problem in a specific form. Open-source embedding models flow through model hubs, get downloaded into on-premise deployments, containerized, and treated as pinned dependencies. The typical infrastructure team pins a model version the same way it pins a package version: download once, verify a checksum at download time if at all, then treat the file as immutable. But a weight file is not a library, and a hash check only proves that the file matches what the hub published. If an attacker controls the artifact before initial publication, the hash on the hub matches the poisoned file, and the poisoned model carries the attack into every deployment that pulls it.

Managed embedding APIs (OpenAI, Cohere) represent a separate case. The operator there doesn’t control or verify the embedding model directly, but retriever weights also aren’t accessible to third-party adversaries via supply chain compromise. The threat model and the mitigations differ substantially from the self-hosted open-source deployment.

What Breaks in the Existing Defense Stack

Retrieval-stage defenses cannot stop CAREATTACK once the retriever weights are poisoned. The RAG poisoning survey notes that model-editing-based retriever poisoning alters attention geometry through low-rank, stealthy updates; once a poisoned retriever scores attacker-chosen passages as highly similar to target queries, the attack is already through the retrieval layer before any downstream check runs.

Corpus-level defenses (perplexity filters, anomaly detectors, document provenance checks) never touch the retriever model. If the corpus is clean, they have nothing to scan. Data poisoning attacks, the prior-generation threat, inject synthetic documents optimized for embedding similarity; those crafted documents often carry statistical signatures that perplexity filters or anomaly detectors can flag. CAREATTACK avoids that surface entirely by staying at the weight level, per arXiv:2606.18310. There is no document to inspect.

Generation-stage defenses (cross-checking retrieved content against known facts, source attribution in LLM prompts, output monitoring) can in principle catch the resulting bad outputs. But they operate downstream: the malicious content has already been retrieved and placed in context. A 2025 RAG architecture survey (arXiv:2506.00054) identifies adversarial robustness and privacy-preserving retrieval as open research directions, noting recurring tradeoffs between retrieval precision and generation flexibility. The survey predates CAREATTACK but frames the right gap: defenses deployed at the corpus and output layers are necessary but not sufficient. The retriever is now part of the trust boundary, not behind it.

What Operators Should Do: Retriever Integrity as Infrastructure Security

The retriever model file requires the same controls applied to other critical dependencies: verify the artifact, pin the version, detect drift, and treat a weight update as a security-relevant event.

For teams running open-source embedding models in self-hosted or containerized RAG deployments, the minimal control set includes cryptographic verification of model artifacts at deployment time (not just at initial download), retrieval behavior regression tests that cover targeted query classes, and monitoring for retrieval pattern shifts over time. A baseline of top-k results for a representative query set, computed from a verified model artifact at deploy time, gives a reference point for behavioral drift detection. Population-level behavioral tests across many query types are harder to evade than single-query spot checks.
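
A behavioral baseline of this kind can be checked with a simple overlap metric. The sketch below is one possible approach, assuming top-k results are stored as lists of document IDs per query and that Jaccard overlap with a fixed threshold is an acceptable drift signal; the function names, query strings, document IDs, and the 0.6 threshold are all illustrative assumptions.

```python
def jaccard(a, b):
    """Set overlap between two result lists, from 0.0 (disjoint) to 1.0 (identical)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def drifted_queries(baseline, current, min_overlap=0.6):
    """Return (query, overlap) pairs whose current top-k diverges from the baseline."""
    flagged = []
    for query, base_ids in baseline.items():
        overlap = jaccard(base_ids, current.get(query, []))
        if overlap < min_overlap:
            flagged.append((query, overlap))
    return flagged

# Baseline captured from a verified artifact at deploy time (hypothetical IDs).
baseline = {
    "reset password": ["d1", "d2", "d3"],
    "refund policy": ["d7", "d8", "d9"],
}
# Results from the model currently serving traffic.
current = {
    "reset password": ["d1", "d2", "d3"],
    "refund policy": ["d7", "x1", "x2"],
}

flagged = drifted_queries(baseline, current)
print(flagged)
```

The check is only as good as the query set behind it: a query that is not in the baseline can never be flagged, which is why broad coverage of high-value topics matters more than the choice of metric.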

The supply chain hygiene question for model weights differs from library package hygiene in one critical respect. Package managers publish checksums alongside artifacts, and build tooling has decades of investment in reproducibility. Model weight files are large binaries without equivalent ecosystem maturity. An operator who verifies a checksum against an artifact that was already poisoned at the hub gets no protection. Comparing a local artifact’s hash against a reference hash published through a separate channel, or better, against the hash of a weight file built from a reproducible training run in a controlled environment, closes the gap that download-time verification alone leaves open.
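
A deploy-time hash comparison needs nothing beyond the standard library. The sketch below assumes SHA-256 and a reference digest obtained out of band; the helper names and the `model.safetensors` file name are hypothetical, and the demo computes its reference from the same bytes only so that the example is self-contained.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file in chunks so large weight files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def artifact_matches(path, reference_hex):
    """Compare the local file's digest with a reference digest from a separate channel."""
    return hmac.compare_digest(sha256_of_file(path), reference_hex.strip().lower())

# Demo with a stand-in file. In practice the reference digest must NOT come
# from the same place as the weights; here it is derived locally for illustration.
with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model.safetensors"
    weights.write_bytes(b"stand-in for real weights")
    reference = hashlib.sha256(b"stand-in for real weights").hexdigest()
    ok = artifact_matches(weights, reference)
    tampered = artifact_matches(weights, "0" * 64)

print(ok, tampered)
```

Running this check at every deployment, rather than once at download, is what turns the hash from a one-time formality into a drift control.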

The relevant operational question shifts from “is our corpus clean?” to “can we verify that the retriever model we deployed is the retriever model we intended to deploy?” These are different questions with different answers depending on how model artifacts enter the environment.

One note on scope: arXiv:2606.18310 is a preprint submitted June 16, 2026, and has not yet been peer-reviewed. The full paper covers evaluation across three benchmark datasets on Qwen3-Embedding-0.6B and BGE-M3, but specific attack success rate figures aren’t available in the abstract, so assessing the quantitative results requires reading the paper directly. What the threat model contribution establishes, independent of specific numbers, is that the retriever weight file is a mutable attack surface requiring active controls, not a fixed dependency that earns trust by default.

Frequently Asked Questions

Does CAREATTACK apply if the team fine-tuned the retriever on proprietary data before deployment?

The supply chain risk sits at the base checkpoint download, not the fine-tuning step. Poisoned parameter geometry introduced before fine-tuning can survive the update if the fine-tuning data does not strongly contradict the attacker’s targeted embedding alignments. Teams that train a retriever from scratch, or that start only from internally audited base weights, eliminate this specific supply chain vector, but that requires controlled access to training infrastructure as well as the model artifact.

How does this retriever-weight attack differ from prompt injection in a RAG system?

Prompt injection embeds adversarial instructions inside retrieved documents so the LLM follows attacker directives when it reads them. CAREATTACK removes the need for adversarial document content entirely: the poisoned retriever delivers attacker-selected passages as top-k results before the LLM ever reads them, so those passages can look entirely benign. Prompt injection defenses that harden the LLM’s instruction hierarchy do nothing against a retriever that has already decided what context to surface.

What is the fastest containment step if a team suspects its retriever weights were tampered with?

Swap to a known-good retriever artifact: either a locally archived copy with a verified hash from before the suspected compromise window, or a temporary fallback to a managed embedding API. Reverting retriever weights does not require corpus re-ingestion if the replacement uses a compatible vector space. If the vector space differs, the corpus must be re-embedded, but the fix still stays confined to the retrieval layer and requires no changes to the LLM or the knowledge base.

Why does broad query coverage in a behavioral baseline improve detection against anchor repair evasion?

The anchor repair calibration suppresses behavioral drift on queries the attacker does not target. Its stealth advantage depends on the operator’s evaluation set and the attacker’s target queries staying disjoint. If the baseline includes queries covering the specific high-value topics the attacker has poisoned, top-k results for those queries will diverge from the stored baseline immediately. Evasion holds only when the operator’s test distribution excludes the targeted topic class.