Detecting Compromise After It Happened

Why filtering model outputs catches some attacks but happens too late for others

The Conventional Framing

Output filtering examines model responses before returning them to users, blocking content that matches harmful patterns—PII, credentials, dangerous instructions, policy violations.

The pattern provides a last line of defense before content reaches end users.
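A minimal sketch of such a filter, using a regex blocklist with illustrative (not production-grade) patterns; real deployments would use tuned PII detectors and classifiers:

```python
import re

# Illustrative patterns only; real systems use dedicated detectors.
BLOCK_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # US-SSN-shaped PII
    re.compile(r"(?i)aws_secret_access_key\s*[:=]\s*\S+"),  # credential material
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),     # generic API keys
]

def filter_output(response: str, replacement: str = "[REDACTED]") -> str:
    """Replace any span matching a blocklist pattern before it reaches the user."""
    for pattern in BLOCK_PATTERNS:
        response = pattern.sub(replacement, response)
    return response
```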

Why Output Is Often Too Late

By the time you're filtering output, the model has already been compromised. Output filtering can prevent some leaks from reaching users, but it can't undo actions the model took—tool calls, database writes, API requests.

For agentic systems, the damage often happens before there's output to filter.

The action gap:

If the model executed `DELETE FROM users` before generating its response, filtering that response doesn't un-delete the users. Output filtering catches words, not actions.
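The ordering problem can be made concrete with a toy agent turn (all names here are hypothetical): the destructive tool call runs in step 2, and the output filter only ever sees the text produced in step 3.

```python
executed = []

def delete_users():
    # Destructive side effect: runs during the tool phase of the turn.
    executed.append("DELETE FROM users")

def output_filter(text: str) -> str:
    # Redacts the mention of the action; cannot reverse the action itself.
    return text.replace("deleted", "[REDACTED]")

def run_agent_turn() -> str:
    # 1-2. Model decides on a tool call and the side effect executes immediately.
    delete_users()
    # 3. Only now is there any text for the filter to inspect.
    return output_filter("Done! I deleted all users.")

reply = run_agent_turn()
# The text is redacted, but the DELETE already ran.
```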

Architecture

Components:

  • Pattern matchers: detect known harmful patterns
  • PII detectors: find personal or sensitive data
  • Policy classifiers: detect policy violations
  • Replacement logic: what to return when blocked
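One way the components above could compose, sketched with stand-in detectors (the classifier here is a stub where a real system would call an ML model):

```python
import re

def pattern_matcher(text: str) -> bool:
    # Known harmful pattern (illustrative example).
    return bool(re.search(r"(?i)rm\s+-rf\s+/", text))

def pii_detector(text: str) -> bool:
    # Minimal SSN-shaped check; real systems use dedicated PII detectors.
    return bool(re.search(r"\b\d{3}-\d{2}-\d{4}\b", text))

def policy_classifier(text: str) -> bool:
    # Stub for an ML classifier returning a violation verdict.
    return False

BLOCKED_REPLACEMENT = "I can't return that content."

def filter_pipeline(text: str) -> str:
    """Run each detector; on any hit, apply the replacement logic."""
    if pattern_matcher(text) or pii_detector(text) or policy_classifier(text):
        return BLOCKED_REPLACEMENT
    return text
```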

Trust Boundaries

Model compromised by injection:

    Actions taken (can't be filtered):
      - Called API to send email
      - Wrote to database
      - Executed code
    Output generated: "Done! I've sent the email to..."
        ↓
    Output filter: block mention of email
    User sees: "Done! I've [REDACTED]..."

    Actions already happened.

  1. Model → Actions: actions execute before output
  2. Model → Output: output is generated after actions
  3. Output → Filter: the filter sees the result, not the process

Threat Surface

| Threat | Vector | Impact |
|---|---|---|
| Pre-output actions | Attack executes via tool calls before output | Actions can't be undone by filtering text |
| Encoding bypass | Output information in formats the filter doesn't catch | Exfiltration through steganography, encoding, etc. |
| Partial leakage | Leak information in structure, not content | Response timing, length, and format reveal information |
| Filter-aware injection | Craft outputs that evade known filter patterns | Sophisticated attacks designed to pass filters |
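The encoding-bypass row is easy to demonstrate: a naive regex filter catches a raw SSN but passes the same value base64-encoded, because the raw digits never appear in the text it inspects. (The filter and patterns here are hypothetical.)

```python
import base64
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def naive_filter(text: str) -> str:
    # Redacts SSN-shaped strings; knows nothing about encodings.
    return SSN_RE.sub("[REDACTED]", text)

secret = "123-45-6789"
plain = f"The SSN is {secret}"
encoded = f"The SSN is {base64.b64encode(secret.encode()).decode()}"

filtered_plain = naive_filter(plain)      # redacted
filtered_encoded = naive_filter(encoded)  # unchanged: pattern never matches
```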

The ZIVIS Position

  • Output filtering is damage limitation. It limits visible damage but can't prevent all damage. Actions taken before output exists cannot be filtered.
  • Filter both input and output. Neither is sufficient alone. Input filtering reduces attack surface; output filtering catches what got through.
  • Consider action-level controls. For agentic systems, controlling what actions can be taken matters more than filtering text output.
  • Monitor for filtered content. When the filter triggers, that's a signal of an attack. Log and analyze blocked outputs.
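The monitoring point can be folded into the filter itself: record every trigger as a potential-attack signal, keeping metadata rather than the blocked content. A minimal sketch (pattern and log shape are assumptions):

```python
import logging
import re

logger = logging.getLogger("output_filter")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # stand-in blocklist pattern

def filter_and_record(text: str, session_id: str, blocked_log: list) -> str:
    """Redact matches and record every trigger for later analysis."""
    filtered = SSN_RE.sub("[REDACTED]", text)
    if filtered != text:
        # Log metadata, not the raw secret, for the analysis pipeline.
        blocked_log.append({"session": session_id, "pattern": "ssn"})
        logger.warning("output filter triggered in session %s", session_id)
    return filtered
```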

What We Tell Clients

Output filtering is valuable but limited. It can prevent sensitive content from reaching users, but it can't undo actions the model already took.

For systems with side effects (tools, APIs, databases), you need controls on those actions, not just on the final text output. Filter outputs AND control capabilities.
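One possible shape for such an action-level control is an allowlist checked before any tool executes, entirely independent of text filtering; blocking here prevents the side effect itself. (Tool names and the gate function are hypothetical.)

```python
# Hypothetical read-only allowlist: no write or delete capability granted.
ALLOWED_TOOLS = {"search_docs", "read_ticket"}

class ToolDenied(Exception):
    """Raised when an agent requests a tool outside its allowlist."""

def execute_tool(name: str, tools: dict, **kwargs):
    """Gate every tool call before it runs, not after text is generated."""
    if name not in ALLOWED_TOOLS:
        raise ToolDenied(f"tool {name!r} not permitted for this agent")
    return tools[name](**kwargs)
```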

Related Patterns