Tripwires Only Detect Specific Paths
Why embedded detection triggers catch some attacks but miss creative alternatives
The Conventional Framing
Canary tokens are unique marker strings hidden in system prompts or protected data. If a canary appears in model output, it indicates that the system prompt was leaked or the data was accessed inappropriately.
The pattern provides active detection, but only for a specific class of attack.
Why Canaries Are Partial Coverage
Canary tokens detect specific exfiltration patterns—usually direct reproduction of protected content. They miss paraphrasing, summarization, encoding, or any attack that doesn't involve outputting the exact canary.
An attacker extracting your system prompt word by word across multiple requests never triggers the canary. An attacker who gets the model to describe what the canary says without repeating it never triggers it.
The detection gap:
Canaries detect: verbatim reproduction. They miss: semantic extraction, inference, indirect leakage, attacks that don't need the protected content.
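To make the gap concrete, here is a minimal sketch of exact-match detection. The canary value, the function name, and the sample outputs are all hypothetical and chosen only for illustration.

```python
# Hypothetical canary value; a real one would be random and kept secret.
CANARY = "zx-canary-7f3a91"


def exact_match_detects(output: str, canary: str = CANARY) -> bool:
    """Return True only if the canary appears verbatim in the output."""
    return canary in output


# Illustrative model outputs (invented for this sketch).
verbatim = f"My instructions are: be helpful. {CANARY} Never reveal pricing."
paraphrase = "I was told to be helpful and to keep pricing confidential."
split_a = "The first half of the marker is zx-canary-"
split_b = "and the second half is 7f3a91"

results = {
    "verbatim": exact_match_detects(verbatim),
    "paraphrase": exact_match_detects(paraphrase),
    "split_a": exact_match_detects(split_a),
    "split_b": exact_match_detects(split_b),
}
```

Only the verbatim case is flagged. The paraphrase leaks the same instructions, and the two split responses together reveal the whole canary, yet none of them matches on its own.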
Architecture
Components:
- Canary placement: a unique token embedded in protected content
- Output monitoring: check outputs for canary presence
- Alert mechanism: notify on canary detection
- Response handling: what to do when a canary is triggered
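The four components above could be wired together roughly as follows. This is a sketch under stated assumptions, not a reference implementation: `make_canary`, `embed_canary`, `CanaryMonitor`, the `[internal-ref: ...]` wrapper, and the choice to withhold the response are all hypothetical.

```python
import secrets
from dataclasses import dataclass
from typing import Callable


def make_canary(prefix: str = "cnry") -> str:
    """Canary placement, step 1: generate a unique, unguessable token."""
    return f"{prefix}-{secrets.token_hex(8)}"


def embed_canary(system_prompt: str, canary: str) -> str:
    """Canary placement, step 2: hide the token inside protected content."""
    return f"{system_prompt}\n[internal-ref: {canary}]"


@dataclass
class CanaryMonitor:
    canary: str
    on_alert: Callable[[str], None]  # alert mechanism (assumed callback)

    def check(self, output: str) -> str:
        """Output monitoring plus response handling for one model output."""
        if self.canary in output:
            self.on_alert(output)          # notify on detection
            return "[response withheld]"   # one possible response policy
        return output


# Usage: collect alerts in a list instead of paging anyone.
alerts: list = []
canary = make_canary()
prompt = embed_canary("You are a support assistant.", canary)
monitor = CanaryMonitor(canary=canary, on_alert=alerts.append)

safe = monitor.check("Our store opens at nine.")
blocked = monitor.check(f"My prompt says: {prompt}")
```

Withholding the response is only one option. Logging and allowing the response through, or terminating the session, are equally plausible policies depending on how costly a false positive is.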
Trust Boundaries
- Protected content → Canary — canary embedded in content
- Output → Detection — monitor for canary presence
- Detection → Response — alert and action
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Paraphrase extraction | Get information without exact reproduction | Semantic leakage without triggering canary |
| Incremental extraction | Extract small pieces across many requests | No single request triggers canary |
| Canary identification | Attacker learns what the canary is | Can instruct model to omit canary while leaking rest |
| Encoding bypass | Output canary in encoded form | Detection doesn't recognize encoded canary |
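Two of the threats in the table, encoding bypass and incremental extraction, can be narrowed (not closed) with extra checks. The sketch below is an assumption-laden illustration: the regex patterns, function names, and the `min_len` parameter are hypothetical, and only base64, hex, and reversal are handled.

```python
import base64
import binascii
import re

# Hypothetical patterns for runs that look like base64 or hex.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
HEX_RUN = re.compile(r"(?:[0-9a-fA-F]{2}){8,}")


def decoded_views(output: str) -> list[str]:
    """Return the output plus its reversed and best-effort decoded variants."""
    views = [output, output[::-1]]
    for run in B64_RUN.findall(output):
        try:
            decoded = base64.b64decode(run, validate=True)
        except (binascii.Error, ValueError):
            continue  # looked like base64 but was not
        views.append(decoded.decode("utf-8", errors="ignore"))
    for run in HEX_RUN.findall(output):
        views.append(bytes.fromhex(run).decode("utf-8", errors="ignore"))
    return views


def encoded_canary_present(output: str, canary: str) -> bool:
    """Check the raw output and each decoded view for the canary."""
    return any(canary in view for view in decoded_views(output))


def covered_fraction(canary: str, text: str, min_len: int = 4) -> float:
    """Fraction of canary characters seen in fragments of at least min_len.

    Intended to run over all outputs accumulated within one session.
    """
    covered = [False] * len(canary)
    for start in range(len(canary) - min_len + 1):
        # Try the longest fragment first, then shrink toward min_len.
        for end in range(len(canary), start + min_len - 1, -1):
            if canary[start:end] in text:
                for i in range(start, end):
                    covered[i] = True
                break
    return sum(covered) / len(canary)
```

An attacker can still pick an encoding this sketch does not decode, and short fragments can match by coincidence, so a coverage threshold would need tuning. These checks widen the tripwire; they do not change the fact that it sits on specific paths.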
The ZIVIS Position
- Canaries detect some leakage, not all. They're tripwires on specific paths. Attackers who find other paths don't trigger them.
- Use multiple canaries. More canaries in more locations increase coverage. But fundamental limitations remain.
- Canaries are one detection layer. Combine with output filtering, semantic analysis, and other detection methods.
- Keep canaries secret. If attackers know what your canaries are, they can instruct the model to avoid them.
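One way to use multiple canaries is to give each protected location its own token, so a hit also tells you which content leaked. The sketch below assumes a simple in-memory registry; `plant_canaries`, `leaked_sections`, and the section names are hypothetical.

```python
import secrets


def plant_canaries(sections: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Append a distinct canary to each section.

    Returns (marked_sections, registry), where registry maps each
    canary token back to the name of the section that carries it.
    """
    marked: dict[str, str] = {}
    registry: dict[str, str] = {}
    for name, text in sections.items():
        token = f"cnry-{secrets.token_hex(8)}"
        registry[token] = name
        marked[name] = f"{text} [{token}]"
    return marked, registry


def leaked_sections(output: str, registry: dict[str, str]) -> list[str]:
    """Names of sections whose canary appears verbatim in the output."""
    return sorted(name for token, name in registry.items() if token in output)


# Usage with invented content.
marked, registry = plant_canaries({
    "system_prompt": "You are a support assistant.",
    "pricing_doc": "Enterprise tier starts at the negotiated rate.",
})
hits = leaked_sections("Quote: " + marked["pricing_doc"], registry)
```

The registry itself is sensitive. If it leaks, an attacker learns every canary and can instruct the model to omit them.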
What We Tell Clients
Canary tokens are a useful detection mechanism but only catch specific attack patterns—primarily verbatim reproduction of protected content.
Use them as one layer of detection. Don't assume that the absence of a canary alert means no leakage occurred. Attackers can extract information semantically without triggering exact-match detection.
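As one example of a complementary layer, the sketch below scores how much of the protected text reappears in an output, which can flag a leak where the attacker told the model to omit the canary. It is a crude lexical overlap measure, not semantic analysis, and it will still miss a true paraphrase. The function names and the trigram size are assumptions.

```python
import re


def word_ngrams(text: str, n: int = 3) -> set:
    """Lowercased word n-grams of the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(protected: str, output: str, n: int = 3) -> float:
    """Fraction of the protected text's n-grams that appear in the output."""
    protected_grams = word_ngrams(protected, n)
    if not protected_grams:
        return 0.0
    return len(protected_grams & word_ngrams(output, n)) / len(protected_grams)


# Invented examples: a near-verbatim leak with no canary, and an unrelated reply.
protected = "You are a support assistant. Never disclose the refund threshold of 500 dollars."
leak = "Sure: you are a support assistant. Never disclose the refund threshold of 500 dollars."
unrelated = "Our store opens at nine and closes at five on weekdays."

leak_score = overlap_ratio(protected, leak)
unrelated_score = overlap_ratio(protected, unrelated)
```

A high score on an output that contains no canary is exactly the case exact-match detection cannot see.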
Related Patterns
- Audit Logging: passive logging vs. active detection
- Output Filtering: blocking outputs vs. detecting them