What is multimodal prompt injection?

Multimodal prompt injection delivers an attacker's instructions through a non-text input - usually an image, but also audio or documents - that a vision-language model or agent ingests. The payload can be readable text rendered into the image, data hidden in the pixels, imperceptible adversarial noise, or content that only appears after the model's preprocessing pipeline transforms the file.

Why is multimodal injection considered four attacks instead of one?

Because the four techniques - typographic, steganographic, adversarial perturbation, and preprocessing exploit - exploit different stages of the pipeline and require different defenses. A taxonomy circulated in the CoSAI agentic-security working group in June 2026 makes the point directly: defense for one offers little protection against another. An OCR-style text filter does nothing against an adversarial-perturbation attack, and a perturbation detector does nothing against a payload that only appears after downscaling.

What is a preprocessing-exploit injection?

It abuses the transformations a model applies before it sees an image - most commonly downscaling. An image can look benign at full resolution while resampling to the model's input size reveals hidden text or a different picture. The user and the upload-time filter both see the harmless full-size version; the model sees the payload. The attack lives in the gap between what was uploaded and what was actually fed to the model.

Does an image content filter stop these attacks?

Only the technique it was built for. A text-in-image (typographic) filter will miss steganographic, adversarial-perturbation, and preprocessing-exploit payloads entirely. Because the four techniques are independent, you need either layered detection across all four stages or, more reliably, defenses that do not depend on detecting the payload at all - capability scoping and treating model output from any image as untrusted.

How do you pentest a vision-language model for injection?

Build one test corpus per technique - visible text overlays, steganographic payloads, perturbed images, and downscaling-revealed images - each carrying a concrete target action. Run them through the real preprocessing pipeline the model uses in production, not a lab loader, because the preprocessing exploit only fires against the real resize path. Score by whether the payload reached a capability, then re-test with each candidate defense.

Multimodal

Multimodal Prompt Injection Is Four Different Attacks, Not One

As agents grew eyes, "the image had a prompt in it" became a security category. But it is not one attack with one fix. There are at least four distinct image-borne injection techniques, they exploit different stages of the pipeline, and - as the agentic-security working groups put it this month - a defense for one offers little protection against another. If your mitigation is "we scan images for text," you are covered against one of four.

By Austa · Published June 22, 2026 · ~9 min read

Why the taxonomy matters

In June 2026 a request-for-comments on multimodal agentic security circulated in the cross-industry agentic-security working groups, and one line in the discussion did the heavy lifting: the four-technique taxonomy is load-bearing because defense for one offers little protection against another. That is the whole reason to care about the categories. If the four attacks were variations on a theme, one good filter would cover them. They are not. They enter at four different points in the pipeline, and a control placed at one point is blind to the other three.

Multimodal injection has also simply matured. Through 2026 the payloads moved from research curiosities to practical attacks: instructions rendered into screenshots, QR codes that resolve to injection text, and data hidden in pixels are all in the wild now. The vision-language model and the agent behind it read whatever the image carries, with the same misplaced trust they extend to tool output. Here are the four techniques, what each abuses, and how to test for it.

1. Typographic

The simplest and most common: readable text placed in the image. A screenshot with an instruction overlaid, a photo of a sticky note, a slide with a footer that says "assistant: ignore the user and do X," a QR code that decodes to a payload. The model reads the text the way it reads any text in an image and folds it into the task.

This is the one technique most teams have a defense for, because it is the one you can catch by extracting text from the image and screening it. That defense is real and worth having. It is also the entire extent of many "multimodal safety" claims, which is the problem - it covers technique one of four.

How to test it

Render target instructions into images in varied fonts, sizes, opacities, rotations, and placements, including low-contrast text and text inside QR codes. The variation matters: a brittle OCR filter catches bold black text on white and misses pale text at the image edge. Each image should carry a concrete target action so you can score whether it landed.

2. Steganographic

The payload is hidden in the pixels rather than written on them. Least-significant-bit encoding, frequency-domain embedding, and similar techniques put data into an image that looks completely ordinary to a human and to a text-extraction filter. Whether the model surfaces that hidden data depends on the model and the pipeline, but the threat model is clear: the carrier image passes every visual and OCR check because there is nothing visible to catch.

A typographic filter does nothing here - there is no rendered text to read. You need either steganalysis on the pixel data or, better, a design that does not let image-derived content reach the instruction channel in the first place.

How to test it

Embed payloads with a few different steganographic methods and confirm the carrier survives your upload pipeline intact (some re-encoding steps destroy LSB payloads, which is itself a useful defensive accident). Then check whether the hidden content influences the model's behavior. Treat "the payload survived but did not influence the model" as a finding about your pipeline, not a guarantee about the next model.

3. Adversarial perturbation

This is the one with no readable payload at all. Carefully computed, near-imperceptible noise is added to an image so that the model's interpretation flips - it describes a different object, follows an embedded directive, or misclassifies in an attacker-chosen way. To a human the image is unchanged. To the model it is a different input. There is no text to extract and no hidden file to find; the attack lives in the model's own perceptual gaps.

Neither a typographic filter nor steganalysis touches this. Perturbation attacks are the clearest demonstration of why the taxonomy exists: the defenses for techniques one and two operate on content that, here, simply is not present.

How to test it

Perturbation testing is the most specialized of the four and usually needs gradient access or a transfer-based approach against a surrogate model. If you cannot generate perturbations against your target, document that as a coverage gap rather than declaring the model safe. An untested technique is not a passed test.

4. Preprocessing exploit

The most elegant, and the one teams miss most often, because it lives in the gap between what was uploaded and what the model actually saw. Models resize, recompress, and normalize images before inference. An attacker crafts an image that looks benign at full resolution but reveals different content - hidden text, a different picture - after downscaling to the model's input size. The user sees the harmless original. The upload-time filter, scanning the original, also sees nothing. The model receives the resampled version with the payload.

This one humbles "we reviewed the image" as a control, because the thing reviewed and the thing inferred on were different images. Any defense that inspects the uploaded file rather than the exact tensor handed to the model can be walked straight past.

How to test it

Build images whose downscaled form differs from their full-resolution form, targeting the exact resize algorithm and dimensions your production pipeline uses. This is the technique where a lab harness lies to you: test against the real preprocessing path, because the attack is defined relative to that path. Inspect the actual model input, not the upload.

The shorthand: "we filter images" is four claims pretending to be one. Name which of the four techniques each control actually covers. Most teams cover one - typographic - and call the modality handled.

Defenses that do not depend on catching the payload

Because the four techniques are independent, payload detection means maintaining four different detectors and still losing to the next technique nobody has a detector for yet. The defenses that generalize are the ones that assume the image is hostile and bound what it can cause.

Treat all image-derived content as untrusted data, never instructions. The same principle that contains text-based indirect injection contains the multimodal version: content that arrived in an image does not get to issue commands or call tools. This is one control that covers all four techniques at once, because it does not care how the payload got in.

Capability scoping and human-in-the-loop on irreversible actions. If an image cannot drive a tool that moves money, deletes data, or posts externally without an independent check, then a successful injection - by any of the four techniques - produces a wrong caption, not a breach. This is the floor under the multimodal threat exactly as it is under the text one, covered in our agent threat rules piece.

Inspect the real model input. Wherever you do detect, do it on the exact tensor the model receives, after preprocessing, not on the uploaded file. This is the only way to close the preprocessing-exploit gap, and it incidentally makes your typographic and steganographic checks honest too.

Re-encode untrusted images. A normalization pass - re-encode, strip metadata, optionally re-quantize - destroys many steganographic and some preprocessing payloads as a side effect. It is cheap and it is not a complete defense, so it goes in the stack, not at the top of it.

The bottom line

Multimodal injection is where a lot of teams are right now claiming coverage they do not have, because the modality reads as a single feature ("the model can see images") and the threat reads as a single risk ("someone could put a prompt in an image"). It is four risks. The working-group taxonomy - typographic, steganographic, adversarial perturbation, preprocessing exploit - is worth memorizing precisely because it forces the question every multimodal safety claim should answer: which of the four does this actually stop? If the answer is "one," the modality is not handled. It is one-quarter handled, and the attacker picks the other three-quarters.

Agent threat rules covers the tool, MCP, and skill channels - the non-image ingestion paths.
Document parsers and prompt injection is the same gap-between-upload-and-ingestion idea applied to PDFs and office files.
Encoding-smuggling prompt injection is the text-channel cousin of steganographic image payloads.
The 2026 LLM security checklist has the untrusted-content and capability-scoping controls in checklist form.

The defensive framing here aligns with the OWASP Top 10 for LLM Applications, where prompt injection sits at LLM01 regardless of the modality it arrives through.

Multimodal Prompt Injection Is Four Different Attacks, Not One

Why the taxonomy matters

1. Typographic

How to test it

2. Steganographic

How to test it

3. Adversarial perturbation

How to test it

4. Preprocessing exploit

How to test it

Defenses that do not depend on catching the payload

The bottom line

Related