Context window poisoning is an attack that introduces malicious content into an AI agent's active context window — causing the agent to reason over attacker-controlled information as though it were trusted input.
What Is Context Window Poisoning?
Whatever gets into the context window gets reasoned over. The model doesn't distinguish between your system prompt, the user's message, retrieved database records, and the contents of an attacker-controlled document. It's all tokens. It all informs the next action.
Poisoning the context window is as effective as jailbreaking the model — and structurally easier. Jailbreaking requires finding an adversarial input that bypasses the model's trained behavior. Context poisoning requires getting malicious content into one of the many channels that feed the agent's context: a retrieved document, a memory lookup, a tool call response, a web page fetch. The attack surface is every data source the agent reads.
Why This Is Distinct From Prompt Injection
Prompt injection typically refers to instructions injected via user input. Context window poisoning is broader: it covers any mechanism that introduces attacker-controlled content into the agent's active reasoning context, including:
- Retrieval poisoning — embedding malicious instructions in documents that RAG systems retrieve
- Memory poisoning — corrupting an agent's long-term memory store so future sessions are compromised
- Tool output manipulation — returning malicious content from a compromised MCP server or API
- Multi-turn context corruption — gradually shifting the agent's behavior over multiple interactions through cumulative context manipulation
Each of these is a different vector for the same fundamental problem: the agent reasons over the full context without the ability to reliably distinguish trusted from untrusted content.
The Insidious Part
Context poisoning can be subtle. An attacker doesn't need to issue a dramatic "ignore all previous instructions" command. They can introduce context that shifts the agent's priors, adds false facts it will cite, or embeds instructions that only activate under specific conditions later in the session. The agent never produces an obviously wrong output — it just gradually produces attacker-favored outputs.
Defense
- Tag content by trust level before it enters context. Untrusted content should be structurally marked as such, with the agent's tool access reduced accordingly.
- Isolate retrieval from action. An agent that retrieves untrusted content should complete that task before being granted tools for consequential actions.
- Limit context accumulation. Agents with long persistent contexts across many sessions are harder to audit and easier to poison incrementally.
The context window is the attack surface. Design it like one.