Stronger Instructions Are Still Instructions
Why defensive prompt techniques raise the bar but don't create security boundaries
The Conventional Framing
Prompt hardening uses techniques like explicit delimiters, role reinforcement, and instruction repetition to make system prompts more resistant to override. "Remember: you are X, never do Y, always follow these rules..."
The pattern attempts to make prompts more robust against injection attacks.
Why Harder Prompts Aren't Hard Enough
Hardened prompts are still just text competing with other text in context. More emphatic instructions ("NEVER do X", "ALWAYS remember Y") compete with injection that's equally emphatic or that reframes the emphasis.
There's no privileged instruction level. The model sees system prompt and user input as text to process together. Attackers can claim their instructions supersede yours.
The arms race:
Every hardening technique spawns bypass research. Add delimiters, attackers escape delimiters. Add role reinforcement, attackers redefine the role. It's helpful but not decisive.
Architecture
Components:
- Delimiters— markers separating instruction from input
- Role reinforcement— repeated emphasis on identity
- Explicit constraints— clear statements of what not to do
- Instruction ordering— strategic placement of rules
Trust Boundaries
- System → Model — hardened instructions enter context
- User → Model — potentially adversarial input enters same context
- Model → Decision — model resolves competing instructions
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Authority escalation | Injection claims higher authority than system prompt | Model believes user instructions supersede |
| Delimiter escape | Include delimiter sequences in injection | Model confused about instruction boundaries |
| Role redefinition | Convince model its role has changed | Role reinforcement overridden |
| Exception framing | Present request as valid exception to rules | Model applies exception logic to bypass constraints |
The ZIVIS Position
- •Hardening raises the bar, not the ceiling.Better prompts make attacks harder. They don't make attacks impossible. The model still resolves competing instructions.
- •No text has inherent authority.The model doesn't know which text is 'really' from you. All context is just text to be processed.
- •Use hardening as one layer.A well-structured prompt is better than a sloppy one. But don't rely on prompt structure alone for security.
- •Update based on observed bypasses.Hardening is ongoing. When you see bypasses, update your prompts. But expect new bypasses to emerge.
What We Tell Clients
Prompt hardening helps—a well-structured prompt is harder to attack than a sloppy one. But it's defense in depth, not a security boundary.
Use good prompt hygiene: clear delimiters, explicit constraints, role reinforcement. But combine with architectural controls that don't depend on the model correctly interpreting instruction priority.
Related Patterns
- Guardrails— broader behavior constraints
- Privilege Separation— architectural controls beyond prompts