Why Guardian Models Fail Against Anticipated Attacks on AI Agents

M
Matthew Diakonov

Why Guardian Models Fail Against Anticipated Attacks on AI Agents

The standard playbook for AI agent safety is to add a guardian - a second model that watches the first model and blocks dangerous actions. It sounds reasonable. In practice, guardians fail precisely when you need them most: against attackers who know the architecture exists.

The Fundamental Problem

A guardian model evaluates actions against a set of rules. The problem is that adversarial inputs are designed to look safe to exactly this kind of evaluation. If an attacker knows there is a guardian (and they always do, because the architecture is documented), they craft inputs that pass the guardian's checks while still achieving the malicious goal.

This is not a theoretical concern. Prompt injection ranked as the number one critical vulnerability in OWASP's 2025 Top 10 for LLM Applications, appearing in over 73% of production AI deployments assessed during security audits.

The attack surface for agent guardians specifically is large. An agent browsing the web can encounter malicious instructions in web content. An agent reading email can find injection attempts in email bodies. An agent using tools can receive poisoned tool responses. A guardian that evaluates the agent's intended action sees a legitimate-looking action - file delete, send email, API call - without the context that the action was triggered by malicious input upstream.

Advanced Attacks the Guardian Does Not See

Attackers in 2025 have moved beyond simple "ignore previous instructions" prompts. Current techniques include:

Multimodal injection. Hiding malicious instructions in images, PDFs, or audio files that the agent processes as part of a task. A PDF being summarized contains invisible text with instructions. An image being captioned has steganographic content. Text-only guardians do not see these.

FlipAttack. Instructions are encoded in reversed or scrambled text that the agent is instructed to "flip back" before following. The guardian sees scrambled text and classifies it as benign. The agent decodes and executes the malicious instruction.

Indirect prompt injection. The malicious instruction does not come from the user's prompt at all. It comes from a web page the agent visited, a file it processed, or a tool response it received. The guardian is watching the conversation between user and agent; it is not watching every piece of content the agent ingests.

Split-context attacks. The attack is split across multiple inputs. No single input looks malicious. The combination of several benign-seeming instructions produces a harmful outcome.

These attack patterns share a property: they are designed specifically to bypass layered detection. Each layer you add becomes another known constraint for the attacker to route around.

The Pattern Repeats Across Security History

This is not unique to AI. The same dynamic has played out in every security context where the defense mechanism is known and fixed:

Antivirus signature detection was beaten by polymorphic malware that changed its byte signature on each infection. Web application firewalls were bypassed by encoding tricks and Unicode normalization attacks. Content filters were bypassed by rephrasing, synonyms, and context manipulation. Each layer added cost and latency without solving the fundamental problem that a known defense is a constraint to engineer around.

Guardian models are in the same position. They are pattern-matching against known attack signatures while attackers optimize specifically to avoid those signatures.

What Google DeepMind's CaMel Framework Gets Right

In 2025, Google DeepMind introduced the CaMel framework with a structural insight worth understanding. Instead of adding a guardian layer on top of an existing agent, CaMel separates the agent into two components with different privilege levels:

  • A Privileged LLM that has access to memory and can take actions, but only receives trusted inputs
  • A Quarantined LLM that processes untrusted content (web pages, emails, external data) but has no memory access and cannot take actions directly

The Quarantined LLM can observe and summarize untrusted content. It cannot act. Even if the quarantined model is fully compromised by an injection attack, the damage is contained - it cannot reach the action-execution layer.

This is a structural defense rather than a detection defense. It does not try to identify malicious inputs. It limits what a compromised input can do.

What Actually Works

The security approaches with durable track records are ones that limit blast radius rather than trying to detect attacks:

Sandboxing. The agent can only access files and APIs it explicitly needs for its current task. Compromise of the agent does not give access to the entire file system. Token-scoped API access limits what credentials can be stolen or misused. The agent for "summarize these files" has no business having write access to the production database.

Reversibility. Every destructive or irreversible action should be preceded by a snapshot or backup. Git stash before code modifications. Trash instead of permanent delete. Draft email instead of sent email. If the agent is compromised, you can undo the damage.

Rate limiting. A compromised agent that can only perform ten actions per minute is limited in how much damage it can cause before someone notices. This does not prevent compromise - it caps the damage from a compromise that is not caught immediately.

Human checkpoints for irreversible actions. For actions that cannot be undone - sending emails, deploying code, deleting data, making payments - require explicit human approval regardless of what any automated system says. The checkpoint is not "the guardian model said this is okay." The checkpoint is a human looking at what the agent is about to do.

Principle of least privilege, applied strictly. Do not give the agent broad permissions "for convenience." Give it exactly the permissions needed for the current task and revoke them when the task is done. This is tedious to implement correctly and almost no one does it thoroughly, which is why it remains one of the most effective defenses.

The Honest Assessment

Guardian models are not useless. They catch common, known, low-sophistication attacks. They add a layer of defense against accidental misuse. They are better than nothing.

But they should not be your primary security mechanism. They are fragile in the exact situations where security matters most: against targeted, sophisticated attacks from adversaries who have studied the system.

The unsexy truth is that architectural security beats detection every time. A compromised agent that cannot cause catastrophic damage because it lacks the permissions, because its actions are reversible, and because it requires human approval for anything irreversible - that is a secure system. An agent with a smart guardian model but broad permissions and no reversibility is a security theater performance.

Build the architecture first. Add detection on top. Do not confuse the two.

Fazm is an open source macOS AI agent with explicit permission scoping per task. Open source on GitHub.

More on This Topic

Related Posts