Self-Grading Doesn't Fix Poisoned Retrieval
Why models evaluating their own retrieval quality can't detect injections
The Conventional Framing
Self-RAG has the model decide when to retrieve and evaluate retrieval quality. The model determines if retrieved content is relevant and if the generated response is supported by the retrieval.
The pattern reduces unnecessary retrieval and improves response quality through self-evaluation.
Why Self-Evaluation Fails Here
The model evaluating retrieval quality is the same model that will be affected by injections in that retrieval. It's evaluating poisoned content using context that includes the poison.
Self-grading improves factuality, not security. A document containing a sophisticated injection can appear highly relevant—because it's designed to.
Architecture
Components:
- Retrieval decision— model decides whether to retrieve
- Relevance check— model evaluates retrieval quality
- Support check— model verifies response is grounded
Trust Boundaries
- Retrieval → Self-evaluation — evaluating poisoned content
- Self-evaluation → Generation — approved content may contain injection
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Relevance-preserving injection | Injection embedded in genuinely relevant content | Passes relevance check while carrying payload |
| Evaluation bypass | Injection includes instructions for self-evaluation | Model approves its own compromise |
| Support spoofing | Injection makes malicious response look grounded | Support check passes for wrong reasons |
The ZIVIS Position
- •Self-evaluation targets quality, not security.Relevance and support checks improve factuality. They don't detect or prevent injections.
- •Evaluation context is compromised.The model evaluating retrieval sees the same poisoned content. It can't evaluate from a clean perspective.
- •Security checks need security focus.If you want to detect injections, implement specific injection detection, not general quality evaluation.
What We Tell Clients
Self-RAG improves retrieval quality and response factuality. It doesn't improve security. A sophisticated injection in retrieved content will likely pass relevance and support checks.
Don't conflate self-grading with safety. If you need injection detection, implement it specifically—self-evaluation won't catch it.
Related Patterns
- Reflection— same self-evaluation issues in agent context
- CRAG— corrective retrieval with similar limitations