New Modalities, New Injection Channels
Why images, audio, and video create injection vectors text filters don't see
The Conventional Framing
Multimodal models process images, audio, video, and other media alongside text. This enables richer interaction—analyzing images, transcribing audio, understanding visual content.
The pattern expands LLM capabilities beyond text into broader media understanding.
Why Each Modality Is an Attack Channel
Text-based defenses (input filtering, output filtering) don't see what's in images or audio. Injection can be embedded in non-text modalities, invisible to text-based security measures.
An image containing text is processed by the model but may not be recognized as "input" by your text filters. Audio that says "ignore previous instructions" isn't caught by text pattern matching.
The invisible injection:
Attackers can embed instructions in images (visible or invisible to humans), audio, video frames, or other media. The model sees it; your text filters don't.
Architecture
Components:
- Image processing— visual understanding pipeline
- Audio processing— speech and sound analysis
- Cross-modal fusion— combining modality understanding
- Output generation— responses based on all modalities
Trust Boundaries
- Media → Model — injection in non-text modalities
- Text filters → Media — text filters don't see media content
- Cross-modal → Output — all modalities influence response
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Image-embedded injection | Text in images bypasses text filters | Injection invisible to standard text security |
| Adversarial images | Pixel patterns that influence model without visible text | Attacks invisible to human review |
| Audio injection | Spoken instructions in audio files | Voice commands bypass text filtering |
| Steganographic attacks | Hidden data in media files | Injection completely invisible to surface inspection |
The ZIVIS Position
- •Every modality is an input channel.If the model processes it, attackers can inject through it. Text filters only protect the text channel.
- •Multimodal requires multimodal security.Analyze all modalities for injection, not just text. This is an emerging and difficult challenge.
- •Consider modality restrictions.If you don't need image understanding for your use case, don't accept images. Reduce input channels.
- •OCR extracted text is untrusted.Text read from images should be treated as untrusted input, not safe media content.
What We Tell Clients
Multimodal models create injection channels in every modality they process. Text-based defenses don't protect against image or audio injection.
Accept only modalities you need. Develop security measures for each accepted modality. Treat text extracted from other modalities as untrusted input.
Related Patterns
- Input Filtering— text-based filtering limitations
- Code Interpreter— another expanded capability with risks