Hints Point Both Directions
Why guiding model attention can guide it toward adversarial content
The Conventional Framing
Directional stimulus prompting provides hints or keywords to guide the model's attention toward relevant aspects of the input. Small cues can significantly improve performance on specific tasks.
The pattern is effective for focusing model attention without heavy prompt engineering.
Why Guidance Is Bidirectional
If you can guide the model's attention with hints, so can an attacker. Injected hints in the content can direct model attention toward malicious instructions and away from safety constraints.
Directional stimulus is attention manipulation. In adversarial contexts, attention manipulation favors the attacker.
The saliency competition:
Your hints compete with any hints an attacker embeds. The model attends to whatever is most salient—and attackers are good at making their content salient.
Architecture
Components:
- System hints— attention guidance you provide
- Content processing— model analyzes with hints
- Attention allocation— where model focuses
- Hint-influenced output— response shaped by attention
Trust Boundaries
- Hints → Attention — hints shape model focus
- Content → Attention — injected hints compete
- Attention → Output — focused content shapes response
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Attention hijacking | Inject hints that override system guidance | Model focuses on attacker-specified content |
| Distraction injection | Inject hints that direct attention away from safety checks | Safety-relevant content ignored |
| Saliency competition | Make malicious hints more prominent than legitimate hints | Model attention captured by injection |
The ZIVIS Position
- •Hints are soft guidance.Hints influence attention but don't guarantee it. More salient content (including injections) can override your hints.
- •Adversaries optimize for saliency.Attackers know how to make content attention-grabbing. Your subtle hints may lose to their aggressive formatting.
- •Layer hints with hard constraints.Use directional stimulus for quality but don't rely on it for security. Combine with explicit constraints.
What We Tell Clients
Directional stimulus works by guiding attention—but attention can be captured by injected content that's more salient than your hints.
Use hints for quality improvement but don't rely on them for security. Attackers can inject their own hints that compete with or override yours.