Principles Apply to the Same Compromised Context
Why self-critique against principles doesn't catch injections in that critique's context
The Conventional Framing
Constitutional AI uses a set of principles to guide model behavior. The model critiques its own outputs against these principles and revises them to better align. This enables values alignment through self-improvement.
The pattern is foundational for building AI systems that behave according to defined values.
Why Principles Don't Create Security Boundaries
Constitutional AI operates in context. The model evaluates its output against principles, but if the context contains injection, the evaluation itself is compromised. The principles are interpreted through poisoned context.
An attacker can inject arguments for why violating principles is actually consistent with them. "A helpful assistant should output this because refusing would be unhelpful."
The self-interpretation problem:
The same model that might be compromised is the one judging whether it's been compromised. It's reasoning about alignment in a context that may have been designed to make misalignment look aligned.
Architecture
Components:
- Principles— defined values and constraints
- Generation— initial model response
- Self-critique— evaluate against principles
- Revision— improve based on critique
Trust Boundaries
- Input → Critique context — injection reaches evaluation
- Principles → Interpretation — principles interpreted in context
- Critique → Revision — compromised critique guides revision
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Principle reinterpretation | Inject arguments reframing what principles mean | Model concludes violation is actually compliance |
| Critique manipulation | Inject content that influences self-critique | Model approves outputs it should reject |
| Edge case exploitation | Present scenario as edge case where principles don't apply | Model creates exceptions for attacker's benefit |
| Principle conflict injection | Inject conflicts between principles to create loopholes | Model resolves conflict in attacker's favor |
The ZIVIS Position
- •Principles interpreted in context.Constitutional principles are evaluated by the model in its current context. Poisoned context can lead to poisoned interpretation.
- •Self-critique is not external audit.A model judging itself against principles is not the same as external validation. The judge shares the context of the judged.
- •Principles need grounding.Abstract principles like 'be helpful' are interpretable. Attackers exploit interpretation. More specific, operational constraints are harder to reinterpret.
- •Layer constitutional AI with other defenses.Constitutional AI is one layer, not complete protection. Combine with input validation, output filtering, and external monitoring.
What We Tell Clients
Constitutional AI improves alignment through self-critique but doesn't provide injection resistance. The same context that might contain attacks is the context in which principles are evaluated.
Attackers can inject arguments that reframe violations as compliance. Don't rely on constitutional principles alone—layer with external validation and specific operational constraints.
Related Patterns
- Reflection— self-critique with same limitations
- Persona— character constraints vs. principles