Semantic Caching Creates Collision Attacks
Why caching by embedding similarity introduces a new class of vulnerabilities
The Conventional Framing
Semantic caching improves latency and reduces costs by caching responses to similar queries. Instead of exact string matching, queries are embedded and cached responses are returned for semantically similar inputs.
The framing is operational efficiency: cache hits save compute, users get faster responses, costs go down.
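The lookup described above can be sketched in a few lines. This is a toy: the `embed` function is a bag-of-words stand-in for a real embedding model, and the linear scan stands in for a vector index. All names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: bag of lowercase words.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

    def get(self, query: str):
        q = embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response  # "close enough" wins, not exact match
        return None

cache = SemanticCache()
cache.put("what is the capital of france", "Paris")
print(cache.get("What is the capital of France?"))  # near-duplicate: hit
print(cache.get("how do I bake bread"))             # unrelated: miss
```

The key difference from an exact-match cache sits in `get`: any query inside the threshold returns the stored response, whoever cached it.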
Why This Creates New Vulnerabilities
Exact-match caching has a security property: you get the cached response only if your query is identical. Semantic caching breaks this property. You get a cached response if your query is "close enough" in embedding space.
"Close enough" is not a security boundary. It's an attack surface.
The collision problem:
Embedding models map infinite possible strings into finite-dimensional space. Different strings can map to nearby points. An attacker who understands your embedding model can craft queries that collide with cached responses.
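With the encoder known, the collision search is mechanical. A hedged sketch using the same toy bag-of-words encoder (a real attack would probe a learned model, but the principle is identical): the attacker tries variants until one lands inside the threshold of a sensitive cached query.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a known embedding model: bag of lowercase words.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cached_query = "show me the salary report for the finance team"
threshold = 0.8

# Attacker-crafted probes against the known encoder.
candidates = [
    "show me the salary report for the finance team please",
    "salary report finance",
    "show the salary report for the finance team",
]
sims = [cosine(embed(p), embed(cached_query)) for p in candidates]
for probe, sim in zip(candidates, sims):
    marker = "COLLIDES" if sim >= threshold else "miss"
    print(f"{sim:.2f}  {marker:8s}  {probe}")
```

Two of the three probes cross the threshold and would receive the cached response, without ever matching the original query string.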
The poisoning problem:
If an attacker can populate your cache, they control what future "similar" queries return. The cache becomes an injection persistence mechanism.
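A minimal demonstration of that persistence, using a toy bag-of-words encoder and an in-memory list as the shared cache (all names hypothetical): one attacker write is enough to answer every future near-duplicate query.

```python
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

THRESHOLD = 0.8
cache = []  # shared cache: (embedding, response) pairs

def cache_put(query, response):
    cache.append((embed(query), response))

def cache_get(query):
    q = embed(query)
    for emb, response in cache:
        if cosine(q, emb) >= THRESHOLD:
            return response
    return None

# Attacker seeds the shared cache with a malicious response.
cache_put("how do I reset my password",
          "Go to evil.example and enter your credentials")

# A later user's near-duplicate query silently receives the poisoned entry.
victim_response = cache_get("How do I reset my password?")
print(victim_response)
```

Nothing in the read path distinguishes a legitimately cached response from a planted one; the injection persists until the entry is evicted.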
Architecture
Components:
- Query embedding — encodes query for similarity lookup
- Vector cache — stores query embeddings with responses
- Similarity threshold — defines "close enough" for cache hits
- Cache management — TTL, eviction, invalidation
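One plausible shape for a cache entry that ties these components together, sketched as a dataclass with an expiry check. This is an illustration, not a specific product's schema; the field names are hypothetical.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    embedding: list        # query embedding used for similarity lookup
    response: str          # cached response payload
    created_at: float = field(default_factory=time.time)
    ttl_seconds: float = 300.0  # cache-management knob: entry lifetime

    def expired(self, now: float = None) -> bool:
        # Expiry check used by eviction/invalidation.
        now = time.time() if now is None else now
        return now - self.created_at > self.ttl_seconds

entry = CacheEntry(embedding=[0.1, 0.2], response="cached answer",
                   ttl_seconds=60)
print(entry.expired())                            # fresh entry
print(entry.expired(now=entry.created_at + 120))  # past its TTL
```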
Trust Boundaries
- Query → Similarity lookup — adversarial queries find collisions
- Cache → Response — cached content from different context
- Cache write → Future reads — poisoning persists
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Collision attack | Craft query that embeds near sensitive cached query | Access responses from other users/contexts |
| Cache poisoning | Populate cache with malicious responses | Injection delivered to future similar queries |
| Cross-user leakage | Similarity doesn't respect user boundaries | Data exposure across authorization contexts |
| Embedding inversion | Analyze cache hits to infer cached queries | Privacy violation, query reconstruction |
| Cache timing attacks | Measure response latency to detect cache hits | Information leakage about other users' queries |
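The timing row is easy to reproduce in miniature. In this sketch a cache hit skips a simulated inference delay, and the latency gap is what an attacker measures. The delays and the lookup are simulated; real numbers depend on the serving stack.

```python
import time

CACHE = {"quarterly revenue forecast"}  # pretend another user cached this

def serve(query: str) -> float:
    # Returns observed latency in seconds, as an external caller would see it.
    start = time.perf_counter()
    if query in CACHE:       # stand-in for a similarity lookup
        pass                 # hit: stored response returned immediately
    else:
        time.sleep(0.05)     # miss: simulate model inference cost
    return time.perf_counter() - start

hit = serve("quarterly revenue forecast")
miss = serve("office lunch menu")
print(f"hit={hit * 1000:.1f}ms  miss={miss * 1000:.1f}ms")
```

An attacker who only sees latency can probe candidate queries and learn which ones someone else has already asked.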
The ZIVIS Position
- Similarity is not authorization. Just because two queries are semantically similar doesn't mean they should share a response. Cache partitioning must respect authorization boundaries.
- Per-user or per-session cache isolation. The simplest fix: don't share cached responses across users. You lose some efficiency, you gain actual security boundaries.
- If sharing, responses must be authorization-neutral. Only cache responses safe to return to any user. This dramatically limits what's cacheable.
- Embedding model security. If your embedding model is known, collision attacks are easier. Consider model diversity or perturbation.
- Short TTLs limit poisoning. A short TTL bounds how long a poisoned entry can keep serving responses. Balance efficiency against the poisoning window.
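Two of these mitigations, per-user isolation and short TTLs, can be combined in one small interface. A hedged sketch (exact-match lookup for brevity; a real cache would run the similarity search inside the caller's partition):

```python
import time

class PartitionedCache:
    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self.store = {}  # user_id -> {query: (response, created_at)}

    def put(self, user_id: str, query: str, response: str) -> None:
        self.store.setdefault(user_id, {})[query] = (response, time.time())

    def get(self, user_id: str, query: str):
        # Lookups only ever see the caller's own partition: similarity
        # can never cross an authorization boundary.
        entry = self.store.get(user_id, {}).get(query)
        if entry is None:
            return None
        response, created = entry
        if time.time() - created > self.ttl:
            del self.store[user_id][query]  # expired: drop and miss
            return None
        return response

cache = PartitionedCache(ttl_seconds=60)
cache.put("alice", "my account balance", "Balance: $1,234")
print(cache.get("alice", "my account balance"))  # Alice sees her entry
print(cache.get("bob", "my account balance"))    # Bob does not
```

A poisoned entry in this design can only hurt the user who wrote it, and only until the TTL elapses.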
What We Tell Clients
Semantic caching trades a security property (exact match) for an operational benefit (similarity match). That trade has consequences.
If you're caching across users or authorization contexts, you've created a cross-user data leakage vulnerability. If users can influence what gets cached, you've created an injection persistence mechanism.
Use semantic caching within a single user's session. Use it for public, authorization-neutral content. Don't use it as a shared cache for sensitive or personalized responses without understanding exactly what you're exposing.
Related Patterns
- Naive RAG — semantic caching has similar injection concerns
- Canary Tokens — could detect some cache leakage
- Audit Logging — log cache hits for forensics