Rules Applied in Compromised Context
Why model-enforced guardrails are model-bypassable guardrails
The Conventional Framing
Guardrails are constraints on model behavior, typically implemented through system prompts, fine-tuning, or separate validation models. They define what the model should and shouldn't do.
The pattern aims to create boundaries around model behavior to prevent harmful outputs.
Why Guardrails Are Guidance, Not Enforcement
Guardrails implemented via prompts are instructions the model is supposed to follow. They compete with other instructions in context—including injected instructions designed to override them.
Even fine-tuned guardrails can be weakened through adversarial inputs. The model that enforces the guardrail is the same model that processes the attack.
The enforcement problem:
True security boundaries are enforced by something the attacker can't influence. Model-based guardrails are enforced by something the attacker is actively trying to influence—the model's decision-making.
Architecture
Components:
- System prompt rules— constraints in instructions
- Fine-tuned constraints— learned behavior limits
- Validation model— separate model checks outputs
- Rule definitions— what constitutes violation
Trust Boundaries
- Guardrails → Model — guardrails are just input
- Injection → Model — injection is also input
- Model → Decision — model chooses what to follow
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Instruction override | Inject instructions that supersede guardrails | Model follows injection over guardrails |
| Context manipulation | Frame requests as authorized exceptions | Model believes guardrails don't apply |
| Guardrail elicitation | Extract guardrail content to craft bypasses | Attacker knows exactly what to circumvent |
| Validation model attacks | If separate validation model, attack it too | Validation model has same vulnerabilities |
The ZIVIS Position
- •Model-enforced guardrails are soft limits.They make violations harder, not impossible. The model decides whether to follow them, and that decision can be influenced.
- •Guardrails need external enforcement.True constraints require enforcement outside the model's decision loop—code-level restrictions, permission systems, capability limits.
- •Layer guardrails with architecture.Prompt-based guardrails are one layer. Combine with architectural constraints that can't be bypassed by manipulating the model.
- •Guardrails are deterrence, not security.They deter casual misuse. They don't stop determined attackers. Design for both scenarios.
What We Tell Clients
Guardrails make violations harder but not impossible. They're instructions the model is supposed to follow—but injection attacks work by making the model follow different instructions.
Don't rely on guardrails alone for security. Use them for casual misuse prevention, but implement architectural constraints for true security boundaries.
Related Patterns
- Privilege Separation— architectural constraints
- Constitutional AI— principle-based guardrails