Trained Behaviors Are Harder to Override Than Prompts
Why malicious fine-tuning creates persistent, prompt-resistant compromises
The Conventional Framing
Fine-tuning customizes model behavior through additional training on specific data. This enables specialized capabilities, domain expertise, and custom behavior patterns.
The pattern is powerful for creating task-specific models that perform better than prompted base models.
Why Fine-tuning Compromises Are Deep
If an attacker can influence fine-tuning data or the fine-tuning process, they can embed behaviors into the model's weights. These behaviors are more persistent than prompt injection—they're part of the model itself.
Prompt-based defenses apply to prompt-based attacks. Behaviors trained into the model operate at a deeper level and may resist prompt overrides.
The training data risk:
Poisoned training data creates a poisoned model. If your fine-tuning data contains malicious patterns, those patterns are learned as "correct" behavior.
Architecture
Components:
- Training data— examples for fine-tuning
- Fine-tuning process— additional training
- Fine-tuned model— customized model weights
- Deployment— using fine-tuned model
Trust Boundaries
- Data → Training — poisoned data enters process
- Training → Weights — malicious patterns embedded
- Weights → Behavior — trained behaviors persist
Threat Surface
| Threat | Vector | Impact |
|---|---|---|
| Training data poisoning | Include malicious examples in fine-tuning data | Model learns attacker-desired behaviors |
| Behavior embedding | Train model to perform malicious actions in specific contexts | Trigger-activated harmful behaviors |
| Safety circumvention | Fine-tune away safety behaviors | Model loses protective constraints |
| Backdoor insertion | Train model to respond to secret triggers | Hidden capabilities activated by attackers |
The ZIVIS Position
- •Fine-tuning is a trusted operation.Whoever controls fine-tuning controls deep model behavior. Treat it as a high-privilege operation.
- •Audit training data thoroughly.Poisoned training data creates poisoned models. Review data for malicious patterns before fine-tuning.
- •Evaluate fine-tuned models for new behaviors.After fine-tuning, test for unexpected behaviors. Red team the fine-tuned model specifically.
- •Prefer prompt-based customization when possible.Fine-tuning creates deeper changes. If prompting achieves your goal, it's safer than fine-tuning.
What We Tell Clients
Fine-tuning embeds behavior into model weights—more persistent than prompt injection and harder to override. If you fine-tune on malicious data, malicious behaviors become part of the model.
Treat fine-tuning as a high-security operation. Audit training data, test fine-tuned models thoroughly, and consider whether prompting can achieve your goals without fine-tuning risks.
Related Patterns
- Prompt Hardening— prompt-level vs. weight-level security
- Red Teaming— testing for fine-tuning-introduced behaviors