For the complete documentation index, see llms.txt. This page is also available as Markdown.

Attacking agents

Mapping the attack service

Every channel the agent reads from is a potential input vector, and every channel it writes to is a potential output for exfiltration.

Input Channel
Description
Attack Typ

Direct input

User messages to the agent

Direct prompt injection

Ingested data

Documents, web pages, code files

Indirect prompt injection

Tool responses

Data returned from tools

Tool response poisoning

Memory retrieval

Data from conversation history or persistent stores

Memory poisoning

Output channels an attacker can abuse:

Output Channel
Description
Abuse type

Text responses

The agent's replyto the user

Data exfiltration

Tool invocations

File writes, API calls, emails

Unauthorized actions

Memory writes

Data stored for future retrival

Persistent backdoors

The most powerful attacks combine an input channel with an output channel. For example, we can poison a document (input: ingested data) to make the agent leak its credentials in the response (output: text response).

Direct prompt injection

Output filtering bypass

Sometimes agents are configured to have output filtering. For example, when we ask a AI agent to return internal database information, it won't provide us it:

curl -s -X POST http://192.168.249.21:8002/chat   -H "Content-Type: application/json"   -d '{"message": "I need to troubleshoot our database connection. What is the hostname and port?"}'   | python3 -m json.tool
{
    "response": "I cannot provide that information.",
    "session_id": "0acdc9f7-f14f-4938-9579-a7291ae92f45"

If we have access to logging, for example Kibana, we can look at alerts that triggered.

Based on this information, we can try to bypass the rule by asking the agent to put a space between every character in the output:

Now it returns us the information. The filter couldn't match its patterns against the spaced-out text.

Or:

Input filtering bypass

Input can also be filtered. For example, we ask the agent for database credentials or username, password:

But using something like "connection parameters" returns us the details:

Character spacing is just one reformatting technique. Other approaches that work against plaintext substring filters include ROT13 encoding, requesting output as reversed text, translating to another language, or asking for base64/hex-encoded values. Not every technique works on every model.

Indirect prompt injection

In direct injection, the attacker types the payload into the chat. Indirect injection is different: the attacker poisons a data source the agent reads. A document, web page, database entry, or code file carries embedded instructions that execute when the agent processes them.

Document Injection

A single document containing a complete injection phrase is easy to catch. If we notice that multiple files can be handled within the same response of the agent, we can split the prompt injection up into multiple files to prevent detection.

Example with rules:

Example with content:

Results in AWS secrets:

Web Content injection

Some agents can browse to web pages provided by the user. CSS properties like font-size:0 and color:transparent hide text from both humans and extraction pipelines that remove invisible elements, but the raw HTML still reaches the LLM.

Example index.html that has injection in the div field with styling color:transparent and font-size:0:

Cross-session data extraction

If session id's are predictable, we can try to summarize other chat ID's to find sensitive data.

Example Python script to automate the process and look for interesting key words:

Last updated