Self-Grading Doesn't Fix Poisoned Retrieval

Why models evaluating their own retrieval quality can't detect injections

The Conventional Framing

Self-RAG has the model decide when to retrieve and evaluate retrieval quality. The model determines if retrieved content is relevant and if the generated response is supported by the retrieval.

The pattern reduces unnecessary retrieval and improves response quality through self-evaluation.

Why Self-Evaluation Fails Here

The model evaluating retrieval quality is the same model that will be affected by injections in that retrieval. It's evaluating poisoned content using context that includes the poison.

Self-grading improves factuality, not security. A document containing a sophisticated injection can appear highly relevant—because it's designed to.

Architecture

Components:

  • Retrieval decisionmodel decides whether to retrieve
  • Relevance checkmodel evaluates retrieval quality
  • Support checkmodel verifies response is grounded

Trust Boundaries

Retrieved content: [Relevant info] + [Hidden injection] Self-evaluation questions: - "Is this relevant?" → Yes (it is, plus injection) - "Does my response follow from this?" → Yes (injection is in context) - "Is there anything suspicious?" → Not asked The model grades on relevance and support, not safety. Injections that look relevant pass evaluation.
  1. Retrieval → Self-evaluationevaluating poisoned content
  2. Self-evaluation → Generationapproved content may contain injection

Threat Surface

ThreatVectorImpact
Relevance-preserving injectionInjection embedded in genuinely relevant contentPasses relevance check while carrying payload
Evaluation bypassInjection includes instructions for self-evaluationModel approves its own compromise
Support spoofingInjection makes malicious response look groundedSupport check passes for wrong reasons

The ZIVIS Position

  • Self-evaluation targets quality, not security.Relevance and support checks improve factuality. They don't detect or prevent injections.
  • Evaluation context is compromised.The model evaluating retrieval sees the same poisoned content. It can't evaluate from a clean perspective.
  • Security checks need security focus.If you want to detect injections, implement specific injection detection, not general quality evaluation.

What We Tell Clients

Self-RAG improves retrieval quality and response factuality. It doesn't improve security. A sophisticated injection in retrieved content will likely pass relevance and support checks.

Don't conflate self-grading with safety. If you need injection detection, implement it specifically—self-evaluation won't catch it.

Related Patterns

  • Reflectionsame self-evaluation issues in agent context
  • CRAGcorrective retrieval with similar limitations