Bodies, Guardians, and Other Failed Safety Features for AI Agents

Fazm Team··2 min read

Bodies, Guardians, and Other Failed Safety Features for AI Agents

The standard playbook for AI agent safety is to add a guardian - a second model that watches the first model and blocks dangerous actions. It sounds reasonable. In practice, guardians fail precisely when you need them most.

Why Guardians Fail

A guardian model evaluates actions against a set of rules. The problem is that adversarial inputs are designed to look safe to exactly this kind of evaluation. If an attacker knows there is a guardian (and they always do, because the architecture is usually documented), they craft inputs that pass the guardian's checks while still achieving the malicious goal.

This is the fundamental issue with anticipated attacks - the defense is known, so the attack is shaped to bypass it. Adding more guardians just adds more known defenses to bypass.

The Pattern Repeats

We have seen this cycle before:

  • Antivirus signatures get bypassed by polymorphic malware
  • Web application firewalls get bypassed by encoding tricks
  • Content filters get bypassed by rephrasing
  • Guardian models get bypassed by prompt injection

Each layer adds latency and cost without fundamentally solving the problem.

What Actually Works

Instead of trying to detect bad actions before they happen, focus on limiting the blast radius after they happen:

Sandboxing - The agent can only access files and APIs it explicitly needs. Even if compromised, damage is contained.

Reversibility - Every action the agent takes can be undone. Git stash before risky operations. Snapshot before file modifications.

Rate limiting - Even a compromised agent cannot do much damage if it can only perform 10 actions per minute.

Human checkpoints - For irreversible actions (sending emails, deploying code, deleting data), require human approval regardless of what the guardian says.

The unsexy truth is that good security architecture beats smart detection every time. Build systems where a compromised agent cannot cause catastrophic damage, rather than trying to prevent compromise entirely.

Fazm is an open source macOS AI agent. Open source on GitHub.

More on This Topic

Related Posts