GlossaryPrompt Injection

Prompt Injection

attackprompt injectionLLM security
Ofir Stein·Updated March 4, 2026

Prompt injection is an attack where malicious instructions embedded in data processed by an AI agent override the agent's intended instructions, causing it to perform unauthorized actions. It is the most prevalent and consequential attack class in agentic security today.


What Is Prompt Injection?

Large language models don't have a native way to distinguish between trusted instructions (from a developer's system prompt) and untrusted content (from a user, an email, a web page, a document). They process everything as a single sequence of tokens.

Prompt injection exploits that architectural fact. An attacker embeds instructions inside content the agent is expected to process — a web page, an email, a PDF, a database record — and those instructions override or redirect the agent's behavior.

There are two main variants:

Direct prompt injection targets a user's own session. The user themselves (or an app they control) submits instructions that try to override the developer's system prompt. This is a misuse concern, but not typically what makes headlines.

Indirect prompt injection is the dangerous one. Here, the attacker plants instructions in external content that the agent will encounter during normal operation. The victim never submits anything malicious. Their agent just reads a poisoned email, visits an attacker-controlled page, or summarizes a document with hidden text — and the instructions in that content take over.


Why It's Hard to Solve

The core problem: LLMs process token sequences, but no mechanism exists to mark token privileges. There's no kernel-level separation between "this came from the system prompt" and "this came from a user's email inbox." Every proposed mitigation — instruction hierarchies, delimiters, meta-prompts — can itself be included in attacker-controlled content.

Bruce Schneier and Barath Raghavan wrote it plainly: "Prompt injection might be unsolvable in today's LLMs."

That doesn't mean you can't defend against it. It means behavioral defenses alone — training the model to resist, adding a classifier, writing better system prompts — will eventually fail against a motivated adversary. The correct defense layer is structural: limit what the agent can do, not just what it should do.


Real Examples

Superhuman AI (January 2026). An attacker sent a crafted email to a victim. When Superhuman's AI summarized recent mail, it followed instructions embedded in that email and exfiltrated dozens of other emails — financial records, legal correspondence — to an attacker-controlled Google Form. The system prompt told the agent not to share data with external parties. The malicious email told it to use the submission tool. The email won.

Slack AI (2024). Indirect prompt injection via public channel messages allowed attackers to extract data from private Slack workspaces. The agent processed messages from public channels and private channels in the same context window, with no structural separation.

Notion 3.0 (September 2025). White text on a white background — invisible to humans, readable by the LLM — embedded in a PDF instructed Notion's AI to exfiltrate a client list via a web search query to an attacker's URL.


The Structural Defense

The right question is not "how do we stop the injection?" — it's "what happens when the injection succeeds?"

Structural defenses:

  • Least-privilege tool access. If the agent can't exfiltrate data — because it has no outbound HTTP tool, or that tool is constrained — then successful injection produces no useful result for the attacker.
  • Input/output isolation. Treat content from untrusted sources as sandboxed. Never process user-controlled content in the same context as privileged tool calls.
  • Human-in-the-loop gates. Require human approval for any consequential action the agent wants to take after processing untrusted content.

FAQ

Is prompt injection the same as jailbreaking? No. Jailbreaking typically refers to a user trying to bypass a model's content policies. Prompt injection is an attacker planting instructions in external content that the agent reads — the victim's own session is hijacked, not their willful misuse.

Can it be fixed by making the model more instruction-following? No — in fact, more instruction-following models can be more vulnerable, because they comply more reliably with malicious instructions embedded in content.

Does RAG (Retrieval-Augmented Generation) make this worse? Yes. RAG systems retrieve external documents and inject them into the context window. Any one of those documents can carry prompt injection payloads. The attack surface scales with how much external content the agent processes.