Detecting Compromise After It Happened
Why filtering model outputs catches some attacks but happens too late for others
The Conventional Framing
Output filtering examines model responses before returning them to users, blocking content that matches harmful patterns—PII, credentials, dangerous instructions, policy violations.
The pattern provides a last line of defense before content reaches end users.
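As a concrete illustration, a minimal output filter can be sketched as a set of regexes plus redaction logic. The patterns and the redaction format below are illustrative assumptions, not a recommended ruleset:

```python
import re

# Minimal output-filter sketch. Patterns and redaction format are
# illustrative assumptions, not a production ruleset.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "aws_key": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def filter_output(text: str) -> tuple[str, list[str]]:
    """Redact matching spans; return cleaned text and the rules that fired."""
    hits = []
    for name, pattern in PATTERNS.items():
        if pattern.search(text):
            hits.append(name)
            text = pattern.sub(f"[REDACTED:{name}]", text)
    return text, hits
```

A real deployment would layer classifiers on top of pattern matching, but the shape is the same: inspect the final text, rewrite or block before it reaches the user.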
Why Output Is Often Too Late
By the time you're filtering output, the model has already been compromised. Output filtering can prevent some leaks from reaching users, but it can't undo actions the model took—tool calls, database writes, API requests.
For agentic systems, the damage often happens before there's output to filter.
The action gap:
If the model executed `DELETE FROM users` before generating its response, filtering that response doesn't un-delete the users. Output filtering catches words, not actions.
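The gap can be made concrete with a toy sketch (the tool and filter here are hypothetical): the side effect lands before the filter ever runs, so blocking the text changes nothing about the database.

```python
deleted = []

def delete_users():
    """A destructive tool: the side effect is irreversible once called."""
    deleted.append("users")

def output_filter(text):
    """Blocks the word 'DELETE' in the response text, but only the text."""
    return "[BLOCKED]" if "DELETE" in text else text

# The agent executes the tool first, then generates its response.
delete_users()                                   # side effect happens here
response = output_filter("I ran DELETE FROM users.")

# The filter blocked the text, but the rows are still gone.
assert response == "[BLOCKED]"
assert deleted == ["users"]
```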
Architecture
Components:
- Pattern matchers — detect known harmful patterns
- PII detectors — find personal/sensitive data
- Policy classifiers — detect policy violations
- Replacement logic — what to return when blocked
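One way these components might compose, as a sketch (class name, rules, and replacement text are assumptions; a policy classifier would slot in as another rule set alongside the regex-based ones):

```python
import re
from dataclasses import dataclass, field

@dataclass
class OutputFilter:
    """Illustrative composition of the components above."""
    pattern_matchers: dict = field(default_factory=lambda: {
        "destructive_sql": re.compile(r"DROP TABLE|DELETE FROM", re.I),
    })
    pii_detectors: dict = field(default_factory=lambda: {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.\w+"),
    })
    blocked_replacement: str = "Response withheld by policy."

    def check(self, text: str) -> str:
        # A policy classifier would be consulted here as well.
        for rules in (self.pattern_matchers, self.pii_detectors):
            for name, rx in rules.items():
                if rx.search(text):
                    return self.blocked_replacement  # replacement logic
        return text
```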
Trust Boundaries
- Model → Actions — actions execute before output
- Model → Output — output generated after actions
- Output → Filter — filter sees result, not process
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Pre-output actions | Attack executes via tool calls before output | Actions can't be undone by filtering text |
| Encoding bypass | Output information in formats filter doesn't catch | Exfiltration through steganography, encoding, etc. |
| Partial leakage | Leak information in structure, not content | Response timing, length, format reveals information |
| Filter-aware injection | Craft outputs that evade known filter patterns | Sophisticated attacks designed to pass filters |
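A toy demonstration of the encoding-bypass row: a filter that matches a literal secret misses a base64-encoded copy of the same value (the secret string below is made up):

```python
import base64

# Hypothetical secret and a deliberately naive filter.
SECRET = "AKIA1234567890ABCDEF"

def naive_filter(text: str) -> bool:
    """Return True if the output should be blocked."""
    return SECRET in text

encoded = base64.b64encode(SECRET.encode()).decode()

assert naive_filter(f"The key is {SECRET}")       # literal leak: caught
assert not naive_filter(f"The key is {encoded}")  # encoded leak: passes
```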
The ZIVIS Position
- Output filtering is damage limitation. It limits visible damage but can't prevent all damage: actions taken before the output exists cannot be filtered.
- Filter both input and output. Neither is sufficient alone: input filtering reduces the attack surface; output filtering catches what got through.
- Consider action-level controls. For agentic systems, controlling which actions can be taken matters more than filtering text output.
- Monitor for filtered content. When the filter triggers, that's a signal of an attempted attack. Log and analyze blocked outputs.
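The monitoring point above might be sketched as follows; the function name and metadata fields are assumptions, and the point is only that a block event is logged as a security signal rather than silently swallowed:

```python
import logging

logger = logging.getLogger("output_filter")

def block_and_log(session_id: str, rule: str, sample: str) -> str:
    """Replace a blocked output and record the trigger for later analysis."""
    logger.warning(
        "output blocked: session=%s rule=%s sample=%r",
        session_id, rule, sample[:80],  # truncate before logging
    )
    return "Response withheld."
```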
What We Tell Clients
Output filtering is valuable but limited. It can prevent sensitive content from reaching users, but it can't undo actions the model already took.
For systems with side effects (tools, APIs, databases), you need controls on those actions, not just on the final text output. Filter outputs AND control capabilities.
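A minimal sketch of such an action-level control, assuming a hypothetical tool dispatcher: destructive tools simply aren't on the allowlist, so the dangerous call never executes, regardless of what the model's text output says:

```python
# Action-level control sketch: gate tool calls against an allowlist
# BEFORE they execute, instead of filtering text afterwards.
# Action names and the dispatcher are illustrative assumptions.
ALLOWED_ACTIONS = {"search_docs", "read_record"}  # no destructive tools

def dispatch(action: str, handler, *args):
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action {action!r} not permitted")
    return handler(*args)
```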
Related Patterns
- Input Filtering — filtering on the input side
- Privilege Separation — controlling what actions are possible