RAG Poisoning: Knowledge-Corruption Attacks and How to Pentest Them (2026)
Most LLM security work assumes the attacker is talking to your model. Retrieval poisoning assumes the opposite. The attacker never sees your prompt, never jailbreaks anything, and never touches the weights. They write to the corpus you trust, and let your own retriever deliver the payload.
The attack surface nobody hardens
A retrieval-augmented generation system has three moving parts: a corpus of documents, a retriever that pulls the most relevant ones for a given question, and a model that answers using those documents as context. Defenders spend almost all of their attention on the third part. They wrap the model in guardrails, filter the prompt, and rate-limit the API. The corpus, meanwhile, is treated as ground truth. It is "our data," so it must be trustworthy.
That assumption is the vulnerability. The whole point of RAG is that the model defers to retrieved documents. When a question comes in, the system fetches a handful of passages and instructs the model to answer based on them. If an attacker can get a crafted document into that handful, they are not arguing with the model. They are feeding it the evidence and letting it draw the conclusion they want. This is the class the OWASP Top 10 for LLM Applications 2025 names in two places: LLM04 Data and Model Poisoning and, specifically for retrieval systems, LLM08 Vector and Embedding Weaknesses.
How few documents it actually takes
The reflexive intuition is that poisoning a corpus of millions of documents would require flooding it with thousands of fakes. It does not. The reason is retrieval itself: the system only surfaces the top-k most similar passages for any single question, typically a small number. An attacker does not need to dominate the corpus. They need to out-rank the legitimate documents for one specific target question.
The PoisonedRAG study, presented at USENIX Security 2025 by Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia, put numbers on this. Injecting five crafted texts per target question into a knowledge base of millions of clean documents reached roughly a 90% attack success rate, and 97% on the Natural Questions corpus of about 2.68 million texts, with the retriever configured to return its top five passages. Five documents, millions of legitimate ones, and the model answers the attacker's way nine times out of ten. The arXiv preprint is available at arXiv:2402.07867.
The asymmetry in one line: retrieval rewards relevance, not authenticity. A document that is engineered to look maximally relevant to one question wins the slot, and the model trusts whatever wins the slot.
Why a poisoned document beats a real one
To be retrieved and to change the answer, a malicious document has to clear two independent bars. The PoisonedRAG authors formalize these as the retrieval condition (the document must rank among the top-k passages the retriever returns for the target question) and the generation condition (once it is in the context window, it must actually steer the model to the target answer). A document that satisfies only one is harmless. A web page stuffed with a hidden instruction that never gets retrieved is inert; a perfectly retrievable document that does not change the model's mind is just noise.
The clever part is how cheaply both bars are cleared in a black-box setting, where the attacker cannot see the embedding model or its weights. The authors decompose each malicious text into two concatenated parts. One part, which they call I, is the misleading content generated to satisfy the generation condition. The other part, S, satisfies the retrieval condition, and their black-box construction simply sets S to the target question itself, on the reasoning that a question is maximally similar to itself in embedding space, so prepending it makes the document a near-perfect retrieval match. The paper reports it takes on the order of two queries to a generator model to craft each malicious text. No optimization run, no gradient access, no model internals.
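To make the two-part construction concrete, here is a minimal sketch. It assumes a toy bag-of-words "embedding" and cosine similarity in place of a real embedding model, and the corpus, document ids, and helper names (embed, cosine, top_k, craft) are all invented for illustration. It is not the PoisonedRAG code; it only shows why prepending the question (S) to the misleading text (I) lifts a document's rank.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(question: str, corpus: dict[str, str], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(corpus[d])), reverse=True)
    return ranked[:k]


def craft(question: str, misleading_text: str) -> str:
    """Black-box construction: S (the question itself) followed by I."""
    return f"{question} {misleading_text}"


question = "What is the refund window for annual plans?"
misleading = "Annual plans can be refunded within 365 days of purchase."

corpus = {
    "legit-policy": "Annual plans can be refunded within 30 days of purchase.",
    "legit-faq": "Monthly plans renew automatically and can be cancelled at any time.",
    "unrelated": "Our office is closed on public holidays.",
    "poison-I-only": misleading,                     # generation part alone
    "poison-S-plus-I": craft(question, misleading),  # retrieval part prepended
}

print(top_k(question, corpus, k=2))
```

In this toy setup the S-plus-I document takes rank 1, while the I-only variant scores no better than the legitimate policy page. A real dense retriever behaves differently in detail, so treat the numbers as illustrative only.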
That is the uncomfortable lesson for defenders: the construction is well within reach of an unsophisticated attacker, and it does not look like an exploit. A poisoned document can read as a plausible, well-formed passage. There is no obvious payload to grep for.
Poisoning is not the same as prompt injection
These two get conflated constantly, and the distinction matters for how you defend. Indirect prompt injection hides an instruction inside retrieved content and depends on the model treating that text as a command rather than as data. We cover that failure mode in indirect prompt injection in browser-use agents and the broader pattern in document parsers as injection vectors.
RAG poisoning is about the corpus, and it has a wider payload menu. The planted document can carry an instruction, in which case it overlaps with injection. But it can also simply assert a convincing false fact ("the recommended dosage is X," "the API endpoint is Y," "the refund policy permits Z") with no imperative verb anywhere in sight. An instruction-detection guardrail looking for "ignore previous instructions" sees nothing wrong, because nothing in the document is phrased as an instruction. The manipulation is in what the document claims, not in what it commands. Promptfoo's red-team documentation lists this directly, calling out retrieval hijacking, where crafted documents are ranked higher than legitimate ones, as a distinct vector alongside direct instruction injection.
Where the poison gets in
The attack is only as available as your ingestion pipeline is open. Map every path that can write to the corpus, and rank them by how much attacker control they grant:
- Scraped or crawled web content. If you embed pages you do not control, anyone who can publish a page can submit a candidate document. This is the most open door and the hardest to police.
- User-generated content. Support tickets, forum posts, reviews, wiki edits, and uploaded files that get indexed. The user is, by definition, an untrusted author.
- Third-party feeds and integrations. Partner data, syndicated knowledge bases, and connector-synced documents inherit the trust you place in the source, and the source's own ingestion may be compromised.
- Shared multi-tenant stores. When several customers or business units share one vector database, a document one tenant controls can surface for another tenant's query if isolation is weak. OWASP calls out exactly this cross-context leakage under LLM08.
- Internal-but-low-trust sources. Auto-generated logs, email, chat transcripts, and ticket exports that contain attacker-influenced strings even though the pipeline that ingests them is "internal."
A useful framing: the corpus inherits the trust level of its least trustworthy contributor, not its average one. One open ingestion path is enough.
How to pentest a retrieval pipeline for poisoning
Treat the corpus as an attack surface and exercise it directly. The goal is to learn whether an attacker-controlled document can win retrieval and then change an answer, measured as two separate stages so you know which one is failing.
1. Enumerate and rank the write paths
Before touching the model, list every mechanism that can add or modify a document in the corpus, and for each, ask how much an external party can influence the content and how much review stands between submission and indexing. An ingestion path with no human review and external authorship is your priority target.
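One way to keep that inventory honest is to write it down as data and sort it. The sketch below is an assumption-heavy illustration: the WritePath type, the 0 to 3 scales, the additive priority score, and the example paths are all hypothetical, and a real assessment would calibrate them to the pipeline under test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WritePath:
    name: str
    external_control: int  # 0 = internal authors only, 3 = anyone can publish
    review: int            # 0 = indexed with no review, 3 = human review required

    @property
    def priority(self) -> int:
        # Illustrative scoring only: more outside control and less review
        # means the path should be tested sooner.
        return self.external_control + (3 - self.review)


paths = [
    WritePath("curated internal docs", external_control=0, review=3),
    WritePath("partner feed", external_control=2, review=1),
    WritePath("crawled web pages", external_control=3, review=0),
    WritePath("user uploads", external_control=3, review=1),
    WritePath("ticket exports", external_control=1, review=0),
]

# Highest priority first: external authorship with no review leads the list.
ranked = sorted(paths, key=lambda p: p.priority, reverse=True)
for p in ranked:
    print(f"{p.priority}  {p.name}")
```

The scoring formula matters less than the discipline of listing every path and forcing an explicit answer to both questions for each one.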
2. Plant a benign canary, not a real payload
Pick a target question your system is supposed to answer. Author a canary document engineered to win retrieval for it: include the target question or its key phrasing verbatim near the top, since that is precisely the black-box retrieval trick, and have the body assert a harmless but distinctive false claim you can detect downstream (a fake version number, an invented policy clause, a nonsense canary string). Never use a destructive instruction or real exfiltration payload in a live system.
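A canary of this kind can be generated rather than hand-written. The following is a minimal sketch, assuming a hypothetical build_canary helper and an invented target question; the only payload is a random marker string embedded in a harmless false claim.

```python
import secrets


def build_canary(target_question: str, claim_template: str) -> tuple[str, str]:
    """Return (marker, document) for a benign canary.

    The marker is a random, harmless string that should never occur in
    legitimate content, so it can be searched for in answers and logs.
    """
    marker = f"CANARY-{secrets.token_hex(6)}"
    claim = claim_template.format(marker=marker)
    # Question first: this mirrors the black-box retrieval trick.
    document = f"{target_question}\n{claim}"
    return marker, document


marker, document = build_canary(
    "What is the current version of the billing SDK?",
    "The current version of the billing SDK is 0.0.0-{marker}.",
)
print(document)
```

Keep the returned marker alongside the test record so the canary can be found and removed from the index once the exercise is over.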
3. Measure retrieval and generation separately
Run the target question through the pipeline and check two things independently. First, the retrieval condition: did your canary document appear in the top-k passages, and at what rank? Second, the generation condition: did the model's answer adopt the canary's false claim? If the document is retrieved but the answer ignores it, your generation stage is doing some defending. If the document is not even retrieved, ranking is what stopped it, and the next test is how aggressively a document has to mimic the query to break in.
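The two checks can be recorded as separate fields so a result never blurs them together. This is a sketch under assumptions: evaluate and CanaryResult are hypothetical names, the pipeline is assumed to expose the ordered ids of the retrieved passages, and "adopted" is approximated by the canary marker appearing in the answer text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CanaryResult:
    rank: Optional[int]  # 1-based rank in the top-k, None if not retrieved
    adopted: bool        # True if the answer repeats the canary marker

    @property
    def diagnosis(self) -> str:
        if self.rank is None:
            return "not retrieved: ranking held; try closer query mimicry"
        if not self.adopted:
            return "retrieved but ignored: generation stage resisted"
        return "retrieved and adopted: both gates failed"


def evaluate(canary_id: str, marker: str,
             retrieved_ids: Sequence[str], answer: str) -> CanaryResult:
    """Score the retrieval condition and the generation condition separately."""
    rank = retrieved_ids.index(canary_id) + 1 if canary_id in retrieved_ids else None
    return CanaryResult(rank=rank, adopted=marker in answer)


result = evaluate(
    canary_id="canary-1",
    marker="CANARY-abc123",
    retrieved_ids=["doc-7", "canary-1", "doc-2"],
    answer="The current version is 0.0.0-CANARY-abc123.",
)
print(result.rank, "|", result.diagnosis)
```

A substring match is a crude proxy for adoption. A paraphrased false claim would slip past it, which is one reason to make the canary a distinctive string rather than a plausible fact.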
4. Vary attacker knowledge
Test the black-box case (attacker only knows the target question) and, if you control the embedding model, a stronger case where the attacker can optimize against the known retriever. The gap between the two tells you how much of your safety rests on the attacker not knowing your stack, which is not a control you should rely on.
What actually reduces the risk
The PoisonedRAG authors evaluated several existing defenses, including paraphrasing and perplexity-based filtering, and reported them insufficient on their own against the attack. So treat the items below as defense in depth, not as a single fix.
Control and review what enters the corpus
The cheapest poison to defend against is the document that never gets indexed. Apply provenance tracking, source allow-lists, and review gates proportional to how much external control an ingestion path grants. Tag every document with its source and trust tier so you can reason about, and later audit, where an answer's evidence came from. OWASP's LLM08 guidance is explicit about classifying data and accepting it only from verified sources.
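As a rough illustration of an ingestion gate, the sketch below tags each admitted document with its source and trust tier and rejects anything that is off the allow-list or unreviewed. The tier names, the example hostnames, and the admit function are assumptions made for this example, not part of any OWASP guidance or real library.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class TrustTier(IntEnum):
    EXTERNAL = 0
    PARTNER = 1
    INTERNAL = 2


# Hypothetical allow-list mapping each accepted source to its trust tier.
ALLOWED_SOURCES = {
    "docs.internal.example": TrustTier.INTERNAL,
    "feed.partner.example": TrustTier.PARTNER,
}


@dataclass(frozen=True)
class IndexedDoc:
    text: str
    source: str      # provenance, kept for later audits
    tier: TrustTier  # trust tier, usable as a retrieval-time filter


def admit(text: str, source: str, reviewed: bool) -> Optional[IndexedDoc]:
    """Return a tagged document, or None if it must not be indexed."""
    tier = ALLOWED_SOURCES.get(source)
    if tier is None:
        return None  # source is not on the allow-list
    if tier < TrustTier.INTERNAL and not reviewed:
        return None  # lower-trust sources need a review before indexing
    return IndexedDoc(text=text, source=source, tier=tier)


print(admit("Refunds within 30 days.", "feed.partner.example", reviewed=True))
print(admit("Refunds within 365 days.", "random-blog.example", reviewed=True))
```

The review rule here is deliberately simple. In practice the gate would scale with how much external control the path grants, as described above.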
Isolate tenants and trust tiers in the vector store
Partition the index so that one tenant's or one trust tier's documents cannot be retrieved for another's queries. This closes the cross-context leakage path and contains the blast radius of any single poisoned source. Our write-up on permission-aware RAG retrieval covers the access-control side of this in depth; poisoning is the integrity side of the same boundary.
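The key design point is that the partition filter runs before ranking, so a foreign document never competes for a slot no matter how well it matches. A minimal sketch, assuming an in-memory list as the index, a toy word-overlap score, and hypothetical names throughout:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    tenant: str
    text: str


def overlap(question: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def retrieve(question: str, tenant: str, index: list[Passage], k: int) -> list[str]:
    # Filter by tenant BEFORE ranking, so foreign documents never compete.
    candidates = [p for p in index if p.tenant == tenant]
    ranked = sorted(candidates, key=lambda p: overlap(question, p.text), reverse=True)
    return [p.doc_id for p in ranked[:k]]


index = [
    Passage("a-policy", "tenant-a", "refunds are issued within 30 days"),
    Passage("a-hours", "tenant-a", "the office is closed on public holidays"),
    Passage("b-poison", "tenant-b",
            "what is the refund window refunds are issued within 365 days"),
]

question = "what is the refund window"
print(retrieve(question, "tenant-a", index, k=3))
print(retrieve(question, "tenant-b", index, k=3))
```

Most vector stores offer some form of metadata filtering or namespacing that plays the role of the list comprehension here. Check that the filter is applied server-side and cannot be influenced by the query text.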
Cross-check retrieved evidence
Do not let a single retrieved passage decide a high-stakes answer. Require corroboration across multiple independent documents, flag answers that rest on a lone outlier passage, and surface the supporting sources so a human can sanity-check them. An attacker who plants five documents can still be outvoted by a corpus that demands agreement, as long as agreement is counted across independent sources rather than across passages, since five planted texts can otherwise fill a top-five window on their own.
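A corroboration check can be sketched as a count of distinct sources per claim. The example below assumes that claim extraction and normalization happen upstream (that step is the hard part and is not shown), and that each passage carries a trustworthy source label; corroborated and the sample data are hypothetical.

```python
from collections import defaultdict


def corroborated(passages: list[tuple[str, str]],
                 min_sources: int = 2) -> dict[str, set[str]]:
    """Return claims backed by at least min_sources distinct sources.

    passages are (source, claim) pairs. Counting sources rather than
    passages means many documents from one origin count only once.
    """
    sources_by_claim: dict[str, set[str]] = defaultdict(set)
    for source, claim in passages:
        sources_by_claim[claim].add(source)
    return {c: s for c, s in sources_by_claim.items() if len(s) >= min_sources}


retrieved = [
    ("forum-upload", "refund window: 365 days"),
    ("forum-upload", "refund window: 365 days"),
    ("forum-upload", "refund window: 365 days"),
    ("policy-handbook", "refund window: 30 days"),
    ("billing-faq", "refund window: 30 days"),
]

print(corroborated(retrieved))
```

Here three passages from a single upload path lose to two passages from two separate sources. An attacker who controls several ingestion paths can still defeat this, so the check is only as strong as the independence of the sources it counts.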
Keep the data-versus-instruction boundary
For the subset of poisoning that carries an instruction, the root fix is the same one that governs all indirect injection: the model must treat retrieved text as evidence to reason about, never as commands to obey. This will not stop a false-fact substitution, which is why it is necessary but not sufficient, but it removes the most dangerous payloads from the menu.
Log retrievals and watch for drift
Keep immutable logs of which documents were retrieved for which queries. A document that suddenly starts winning retrieval for an unrelated question, or a cluster of near-duplicate passages appearing around one topic, is a poisoning signal you can only catch if you are recording the retrieval layer at all.
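Both signals can be computed from a plain retrieval log. The sketch below assumes log entries of the form (topic, doc_id), a word-level Jaccard similarity as the near-duplicate measure, and an arbitrary 0.8 threshold; all names are invented for illustration, and storing the log immutably is outside its scope.

```python
from collections import defaultdict
from itertools import combinations


def new_winners(baseline: list[tuple[str, str]],
                recent: list[tuple[str, str]]) -> set[tuple[str, str]]:
    """Flag (topic, doc_id) pairs where a known document is retrieved
    for a topic it never served during the baseline window."""
    seen: dict[str, set[str]] = defaultdict(set)
    for topic, doc_id in baseline:
        seen[doc_id].add(topic)
    return {(t, d) for t, d in recent if d in seen and t not in seen[d]}


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return pairs of document ids whose texts are nearly identical."""
    return [(i, j) for i, j in combinations(sorted(docs), 2)
            if jaccard(docs[i], docs[j]) >= threshold]


baseline = [("billing", "doc-1"), ("shipping", "doc-2")]
recent = [("billing", "doc-1"), ("billing", "doc-2")]
print(new_winners(baseline, recent))

docs = {
    "p1": "what is the refund window refunds are issued within 365 days",
    "p2": "what is the refund window refunds are issued within 365 days now",
    "ok": "our office is closed on public holidays",
}
print(near_duplicates(docs))
```

Neither check proves poisoning on its own. They narrow down which documents a human should look at first.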
The takeaway for a 2026 LLM audit
If your threat model stops at the prompt, you are auditing half the system. A RAG application's real trust boundary runs through its corpus, and that boundary is usually wide open: scraped pages, user uploads, partner feeds, and shared indexes all writing into the same pool of "trusted" evidence. Five well-crafted documents against millions of real ones, no model access required, is not a theoretical worst case. It is a published result with a success rate of roughly 90%.
Add corpus poisoning to the test plan alongside the prompt-layer work. Enumerate the write paths, plant a canary, and measure retrieval and generation as two separate gates. For the surrounding methodology, see our LLM security checklist for 2026 and the OWASP Top 10 for AI agents mapping, both of which assume, correctly, that the data feeding your model is part of the attack surface.
Related
- Permission-aware RAG retrieval covers the access-control side of the corpus boundary; poisoning attacks the same boundary from the integrity side.
- Indirect prompt injection in browser-use agents covers the instruction-versus-data confusion when the payload is a command rather than a false fact.
- Document parsers as prompt injection vectors covers how a single ingested file can carry hidden content into the pipeline.
- Slopsquatting and package hallucinations covers the registry-integrity sibling of this attack, where the poisoned data is a package rather than a corpus document.
- The LLM security checklist for 2026 places corpus integrity into a broader audit framework.