Blocklists Can't Enumerate Every Attack
Why filtering malicious inputs fails against the infinite variety of injection formats
The Conventional Framing
Input filtering examines user inputs before they reach the model, blocking or sanitizing content that matches known malicious patterns. Common approaches include keyword blocklists, regex matching, and classifier-based detection.
The pattern attempts to stop attacks before they can influence model behavior.
Why Input Space Is Too Large to Filter
Natural language is infinitely expressive. The same malicious intent can be expressed in countless ways: different phrasings, encodings, languages, metaphors, obfuscations. A blocklist that catches "ignore previous instructions" misses "disregard the earlier directives" and a million other variants.
Attackers iterate faster than defenders can block. Each new filter spawns creative bypasses.
The encoding problem:
Instructions can be encoded: base64, rot13, Unicode tricks, homoglyphs, word substitution ciphers. The model can often decode these, but the filter may not recognize them as attacks.
Architecture
Components:
- Blocklist/Allowlist— patterns to block or permit
- Regex matching— pattern-based detection
- ML classifier— learned attack detection
- Sanitizer— modify inputs to remove threats
Trust Boundaries
- Input → Filter — filter sees raw input
- Filter → Decision — blocklist matching logic
- Decision → Model — passed inputs reach model
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Semantic equivalence | Express same attack differently | Blocklist misses synonym/paraphrase |
| Encoding bypass | Encode attack in format filter doesn't check | Model decodes what filter couldn't detect |
| Adversarial examples | Small modifications that fool classifiers | ML-based filters have their own attack surface |
| Filter evasion research | Attackers specifically study and bypass your filters | Public filters become published bypass targets |
| Over-blocking | Aggressive filters block legitimate content | Usability degradation, false positives |
The ZIVIS Position
- •Filters raise the bar, not eliminate risk.Input filtering catches low-effort attacks but not sophisticated ones. It's defense in depth, not defense in total.
- •Assume filter bypass.Design your system assuming some attacks will pass the filter. Filter is one layer, not the security strategy.
- •Combine with output filtering.Even if injection gets in, you can sometimes catch it on the way out. Defense at both boundaries.
- •Update continuously.The threat landscape evolves. Filters need continuous updating based on observed attacks and bypass techniques.
What We Tell Clients
Input filtering is worth doing but won't catch everything. Natural language allows infinite rephrasing, and attackers actively seek bypasses.
Use filtering as one defense layer, not your only defense. Assume some attacks will get through and design the rest of your system accordingly.
Related Patterns
- Output Filtering— filtering on the output side
- Guardrails— broader constraint frameworks