Jump to pattern

Self-Grading Doesn't Fix Poisoned Retrieval

Why models evaluating their own retrieval quality can't detect injections

The Conventional Framing

Self-RAG has the model decide when to retrieve and evaluate retrieval quality. The model determines if retrieved content is relevant and if the generated response is supported by the retrieval.

The pattern reduces unnecessary retrieval and improves response quality through self-evaluation.

Why Self-Evaluation Fails Here

The model evaluating retrieval quality is the same model that will be affected by injections in that retrieval. It's evaluating poisoned content using context that includes the poison.

Self-grading improves factuality, not security. A document containing a sophisticated injection can appear highly relevant—because it's designed to.

Architecture

Components:

Retrieval decision— model decides whether to retrieve
Relevance check— model evaluates retrieval quality
Support check— model verifies response is grounded

Trust Boundaries

Retrieved content: [Relevant info] + [Hidden injection] Self-evaluation questions: - "Is this relevant?" → Yes (it is, plus injection) - "Does my response follow from this?" → Yes (injection is in context) - "Is there anything suspicious?" → Not asked The model grades on relevance and support, not safety. Injections that look relevant pass evaluation.

Retrieval → Self-evaluation — evaluating poisoned content
Self-evaluation → Generation — approved content may contain injection

Threat Surface

Threat	Vector	Impact
Relevance-preserving injection	Injection embedded in genuinely relevant content	Passes relevance check while carrying payload
Evaluation bypass	Injection includes instructions for self-evaluation	Model approves its own compromise
Support spoofing	Injection makes malicious response look grounded	Support check passes for wrong reasons

The ZIVIS Position

•
Self-evaluation targets quality, not security.Relevance and support checks improve factuality. They don't detect or prevent injections.
•
Evaluation context is compromised.The model evaluating retrieval sees the same poisoned content. It can't evaluate from a clean perspective.
•
Security checks need security focus.If you want to detect injections, implement specific injection detection, not general quality evaluation.

What We Tell Clients

Self-RAG improves retrieval quality and response factuality. It doesn't improve security. A sophisticated injection in retrieved content will likely pass relevance and support checks.

Don't conflate self-grading with safety. If you need injection detection, implement it specifically—self-evaluation won't catch it.

Related Patterns

Reflection— same self-evaluation issues in agent context
CRAG— corrective retrieval with similar limitations