The Observer Hierarchy - Beyond First-Order Guardians
Most AI agent safety discussions stop at a single layer: put a guardian in place to watch the agent. But what watches the guardian? And what watches that? The observer hierarchy problem is real, and the solution is to build the hierarchy backwards.
The First-Order Problem
A first-order guardian watches an agent and flags or blocks dangerous actions. This is table stakes - things like preventing file deletion, blocking unauthorized API calls, or requiring approval before sending emails.
The problem is that first-order guardians have the same failure modes as the agents they watch. They can hallucinate that an action is safe. They can miss edge cases. They can be fooled by prompt injection embedded in tool outputs.
Building Backwards
Instead of asking "what should watch the agent?" start with "what is the worst thing this agent could do?" Then work backwards through the chain of events that would lead there.
If the worst case is deleting production data, the observer hierarchy becomes: (1) the agent requests a delete, (2) a rule-based check confirms the target is not production, (3) a separate LLM evaluates the context, (4) a human gets notified if both pass. Each layer is simpler and more constrained than the one below it.
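The four layers above can be sketched as a short pipeline. This is a hypothetical illustration, not an actual implementation: the helper names (`is_production`, `llm_review`, `notify_human`) and the string-matching rule are stand-ins for whatever real checks a system would use.

```python
def is_production(target: str) -> bool:
    # Layer 2 stand-in: a rule-based check. Here, a crude name match;
    # a real system would check against an explicit inventory.
    return "prod" in target.lower()

def llm_review(target: str, context: str) -> bool:
    # Layer 3 stand-in: a separate LLM evaluates the context.
    # Placeholder defaults to the conservative answer: block.
    return False

def notify_human(message: str) -> None:
    # Layer 4 stand-in: notify a human once both checks pass.
    print(message)

def guarded_delete(target: str, context: str) -> bool:
    """Layer 1: the agent has requested a delete of `target`.
    Returns True only if every layer above it allows the action."""
    if is_production(target):      # Layer 2: rule-based check
        return False
    if not llm_review(target, context):  # Layer 3: separate LLM
        return False
    notify_human(f"Delete passed automated checks: {target}")  # Layer 4
    return True
```

Note that each layer only needs to answer a narrower question than the one below it, which is what makes the stack cheap to reason about.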
The key is that higher layers should be less intelligent but more conservative. The top layer might just be "if more than 5 destructive actions happen in 10 minutes, pause everything." No LLM needed - just a counter.
Practical Implementation
For desktop agents, the observer hierarchy maps naturally to the permission system. The agent operates within its granted permissions. A monitor checks that actions match declared intent. A rate limiter prevents runaway execution. A log captures everything for post-hoc review.
Each layer is cheap individually. Together they provide defense in depth without requiring a perfect guardian at any single level.
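The four desktop-agent layers can be composed into one gate, as in this minimal sketch. The interfaces here are assumptions for illustration, not Fazm's actual architecture; the intent check in particular is a crude string match standing in for a real monitor.

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str             # e.g. "delete_file"
    target: str
    declared_intent: str  # what the agent said it was going to do

@dataclass
class ObserverStack:
    granted_permissions: set[str]
    audit_log: list[str] = field(default_factory=list)
    destructive_count: int = 0
    max_destructive: int = 5

    def allow(self, action: Action) -> bool:
        # Layer: log everything, even denied actions, for post-hoc review.
        self.audit_log.append(f"{action.kind} -> {action.target}")
        # Layer: the agent operates within its granted permissions.
        if action.kind not in self.granted_permissions:
            return False
        # Layer: a monitor checks the action matches declared intent
        # (here a naive substring check, purely for illustration).
        if action.kind not in action.declared_intent:
            return False
        # Layer: a rate limiter prevents runaway execution.
        if action.kind.startswith("delete"):
            self.destructive_count += 1
            if self.destructive_count > self.max_destructive:
                return False
        return True
```

No single layer here is a perfect guardian; the point is that a failure has to slip past all four at once.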
Fazm is an open-source macOS AI agent, available on GitHub.