Detecting Compromise After It Happened

Why filtering model outputs catches some attacks but happens too late for others

The Conventional Framing

Output filtering examines model responses before returning them to users, blocking content that matches harmful patterns—PII, credentials, dangerous instructions, policy violations.

The pattern provides a last line of defense before content reaches end users.
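A minimal sketch of such a filter, using a regex blocklist with illustrative (not production-grade) patterns; real deployments would use tuned PII detectors and classifiers:

```python
import re

# Illustrative patterns only; real systems use dedicated detectors.
BLOCK_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # US-SSN-shaped PII
    re.compile(r"(?i)aws_secret_access_key\s*[:=]\s*\S+"),  # credential material
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*\S+"),     # generic API keys
]

def filter_output(response: str, replacement: str = "[REDACTED]") -> str:
    """Replace any span matching a blocklist pattern before it reaches the user."""
    for pattern in BLOCK_PATTERNS:
        response = pattern.sub(replacement, response)
    return response
```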

Why Output Is Often Too Late

By the time you're filtering output, the model has already been compromised. Output filtering can prevent some leaks from reaching users, but it can't undo actions the model took—tool calls, database writes, API requests.

For agentic systems, the damage often happens before there's output to filter.

The action gap:

If the model executed `DELETE FROM users` before generating its response, filtering that response doesn't un-delete the users. Output filtering catches words, not actions.
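The ordering problem can be made concrete with a toy agent turn (all names here are hypothetical): the destructive tool call runs in step 2, and the output filter only ever sees the text produced in step 3.

```python
executed = []

def delete_users():
    # Destructive side effect: runs during the tool phase of the turn.
    executed.append("DELETE FROM users")

def output_filter(text: str) -> str:
    # Redacts the mention of the action; cannot reverse the action itself.
    return text.replace("deleted", "[REDACTED]")

def run_agent_turn() -> str:
    # 1-2. Model decides on a tool call and the side effect executes immediately.
    delete_users()
    # 3. Only now is there any text for the filter to inspect.
    return output_filter("Done! I deleted all users.")

reply = run_agent_turn()
# The text is redacted, but the DELETE already ran.
```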

Architecture

Components:

  • Pattern matchers: detect known harmful patterns
  • PII detectors: find personal or sensitive data
  • Policy classifiers: detect policy violations
  • Replacement logic: what to return when blocked
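One way the components above could compose, sketched with stand-in detectors (the classifier here is a stub where a real system would call an ML model):

```python
import re

def pattern_matcher(text: str) -> bool:
    # Known harmful pattern (illustrative example).
    return bool(re.search(r"(?i)rm\s+-rf\s+/", text))

def pii_detector(text: str) -> bool:
    # Minimal SSN-shaped check; real systems use dedicated PII detectors.
    return bool(re.search(r"\b\d{3}-\d{2}-\d{4}\b", text))

def policy_classifier(text: str) -> bool:
    # Stub for an ML classifier returning a violation verdict.
    return False

BLOCKED_REPLACEMENT = "I can't return that content."

def filter_pipeline(text: str) -> str:
    """Run each detector; on any hit, apply the replacement logic."""
    if pattern_matcher(text) or pii_detector(text) or policy_classifier(text):
        return BLOCKED_REPLACEMENT
    return text
```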

Trust Boundaries

Model compromised by injection:

    Actions taken (can't be filtered):
      - Called API to send email
      - Wrote to database
      - Executed code
    Output generated: "Done! I've sent the email to..."
        ↓
    Output filter: block mention of email
    User sees: "Done! I've [REDACTED]..."

    Actions already happened.

  1. Model → Actions: actions execute before output
  2. Model → Output: output is generated after actions
  3. Output → Filter: the filter sees the result, not the process

Threat Surface

| Threat | Vector | Impact |
|---|---|---|
| Pre-output actions | Attack executes via tool calls before output | Actions can't be undone by filtering text |
| Encoding bypass | Output information in formats the filter doesn't catch | Exfiltration through steganography, encoding, etc. |
| Partial leakage | Leak information in structure, not content | Response timing, length, and format reveal information |
| Filter-aware injection | Craft outputs that evade known filter patterns | Sophisticated attacks designed to pass filters |
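The encoding-bypass row is easy to demonstrate: a naive regex filter catches a raw SSN but passes the same value base64-encoded, because the raw digits never appear in the text it inspects. (The filter and patterns here are hypothetical.)

```python
import base64
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def naive_filter(text: str) -> str:
    # Redacts SSN-shaped strings; knows nothing about encodings.
    return SSN_RE.sub("[REDACTED]", text)

secret = "123-45-6789"
plain = f"The SSN is {secret}"
encoded = f"The SSN is {base64.b64encode(secret.encode()).decode()}"

filtered_plain = naive_filter(plain)      # redacted
filtered_encoded = naive_filter(encoded)  # unchanged: pattern never matches
```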

The ZIVIS Position

  • Output filtering is damage limitation. It limits visible damage but can't prevent all damage. Actions taken before output exists cannot be filtered.
  • Filter both input and output. Neither is sufficient alone. Input filtering reduces attack surface; output filtering catches what got through.
  • Consider action-level controls. For agentic systems, controlling what actions can be taken matters more than filtering text output.
  • Monitor for filtered content. When the filter triggers, that's a signal of an attack. Log and analyze blocked outputs.
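The monitoring point can be folded into the filter itself: record every trigger as a potential-attack signal, keeping metadata rather than the blocked content. A minimal sketch (pattern and log shape are assumptions):

```python
import logging
import re

logger = logging.getLogger("output_filter")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # stand-in blocklist pattern

def filter_and_record(text: str, session_id: str, blocked_log: list) -> str:
    """Redact matches and record every trigger for later analysis."""
    filtered = SSN_RE.sub("[REDACTED]", text)
    if filtered != text:
        # Log metadata, not the raw secret, for the analysis pipeline.
        blocked_log.append({"session": session_id, "pattern": "ssn"})
        logger.warning("output filter triggered in session %s", session_id)
    return filtered
```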

What We Tell Clients

Output filtering is valuable but limited. It can prevent sensitive content from reaching users, but it can't undo actions the model already took.

For systems with side effects (tools, APIs, databases), you need controls on those actions, not just on the final text output. Filter outputs AND control capabilities.
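One possible shape for such an action-level control is an allowlist checked before any tool executes, entirely independent of text filtering; blocking here prevents the side effect itself. (Tool names and the gate function are hypothetical.)

```python
# Hypothetical read-only allowlist: no write or delete capability granted.
ALLOWED_TOOLS = {"search_docs", "read_ticket"}

class ToolDenied(Exception):
    """Raised when an agent requests a tool outside its allowlist."""

def execute_tool(name: str, tools: dict, **kwargs):
    """Gate every tool call before it runs, not after text is generated."""
    if name not in ALLOWED_TOOLS:
        raise ToolDenied(f"tool {name!r} not permitted for this agent")
    return tools[name](**kwargs)
```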

Related Patterns