AI Agent Hallucination Detection - Safeguards That Actually Work
An AI chatbot that hallucinates gives you wrong information. An AI agent that hallucinates takes wrong actions and then reports everything went fine. That is a different category of problem.
The agent says "file renamed successfully" when the file does not exist. It says "email sent" when the SMTP connection timed out. It reports completion because completion is the statistically likely next token - not because it verified the outcome.
How Bad Is the Problem?
Current research puts hallucination rates at 2-5% for mainstream models, dropping to below 1% for top-tier models like OpenAI's o3 or Gemini 2.0 Flash. Those numbers sound small until you multiply them across agentic workflows.
An agent running 50 discrete actions per hour at a 3% error rate produces 1-2 confident failures per hour. Over an 8-hour workday that is 10-15 silent failures - each one potentially compounding on the next because the agent believed its own previous output.
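Worked out as a rough sketch, assuming each action fails independently at that 3% rate:

error_rate = 0.03
actions_per_hour = 50
hours = 8

# Expected silent failures over a workday.
expected_failures = error_rate * actions_per_hour * hours      # 12

# Probability that a 50-step chain completes with zero hallucinated steps.
clean_chain = (1 - error_rate) ** actions_per_hour             # ~0.22

print(f"Expected silent failures per day: {expected_failures:.0f}")
print(f"Chance of a clean 50-step chain: {clean_chain:.0%}")

Under those assumptions, a chained 50-step run comes out clean less than a quarter of the time.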
The 2025 Stanford/Harvard study on legal RAG hallucinations found that even frontier models hallucinated on 17-33% of hard factual queries when not given retrieval support. In agentic contexts where the agent acts on its own prior outputs, these errors cascade.
The Three Safeguard Layers
Layer 1 - State Diffing (Never Trust Self-Reports)
The most fundamental safeguard: after every action, verify the system state changed in the expected way. Do not ask the agent "did it work?" - check the actual outcome yourself.
import hashlib

def file_exists_with_content(path: str, expected_hash: str = None) -> bool:
    """Verify file state independently of agent's report."""
    try:
        with open(path, 'rb') as f:
            content = f.read()
        if expected_hash:
            actual_hash = hashlib.sha256(content).hexdigest()
            return actual_hash == expected_hash
        return True
    except FileNotFoundError:
        return False

def verify_action(agent_claim: dict, verifier_fn) -> dict:
    """Wrap any agent action with independent verification."""
    actual_state = verifier_fn()
    if not actual_state:
        return {
            "verified": False,
            "agent_claim": agent_claim,
            "actual_state": actual_state,
            "action": "escalate"
        }
    return {"verified": True, "agent_claim": agent_claim}
In Fazm, we take accessibility tree snapshots before and after every UI action. If the agent clicks "Submit" and the form is still visible in the post-action snapshot, the form was not submitted - regardless of what the agent reported.
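The diff itself does not need to be fancy. Here is a minimal sketch of the diff_snapshots and matches_expected helpers used later in this post, assuming each snapshot is a flat dict mapping element identifiers to their observed state (the real implementation walks the accessibility tree):

def diff_snapshots(pre: dict, post: dict) -> dict:
    """Group elements that appeared, disappeared, or changed between snapshots."""
    return {
        "appeared": {k: post[k] for k in post.keys() - pre.keys()},
        "disappeared": {k: pre[k] for k in pre.keys() - post.keys()},
        "changed": {k: post[k] for k in pre.keys() & post.keys() if pre[k] != post[k]},
    }

def matches_expected(diff: dict, expected: dict) -> bool:
    """Check that every expected change actually shows up in the diff.

    expected looks like {"disappeared": ["submit_form"]}: the form should be gone.
    """
    return all(
        element in diff.get(category, {})
        for category, elements in expected.items()
        for element in elements
    )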
Layer 2 - Confidence Calibration
Well-calibrated models should express uncertainty when they do not have evidence. The problem is that most agent pipelines do not prompt for calibrated confidence - they prompt for results. You can fix this by explicitly building uncertainty expression into your system prompt and output schema.
from pydantic import BaseModel

class AgentAction(BaseModel):
    action: str
    confidence: float          # 0.0 to 1.0
    evidence: str              # what the agent observed to support this
    verification_needed: bool

SYSTEM_PROMPT = """
Before reporting any action as complete, state:
1. What observable evidence confirms the action succeeded
2. Your confidence level (0.0-1.0)
3. Whether independent verification is needed

If you cannot observe confirming evidence, set confidence below 0.7
and verification_needed to true. Never report success without evidence.
"""
A confidence threshold below 0.7 should trigger an automatic retry or human escalation. This catches the "I think I sent the email" class of errors before they compound.
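Wiring that threshold into the loop takes a few lines. A sketch, where run_agent_action and escalate_to_human are placeholders for your own pipeline:

CONFIDENCE_THRESHOLD = 0.7
MAX_RETRIES = 2

def gated_action(task: str):
    """Retry low-confidence actions, then escalate instead of accepting the claim."""
    for _ in range(MAX_RETRIES + 1):
        result = run_agent_action(task)   # placeholder: returns an AgentAction
        if result.confidence >= CONFIDENCE_THRESHOLD and not result.verification_needed:
            return result
    # Still below threshold after retries: hand off rather than trust the claim.
    return escalate_to_human(task, last_attempt=result)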
Retrieval-augmented generation (RAG) architectures reduce hallucination rates by approximately 42% compared to baseline LLM calls by grounding claims in retrieved source material rather than parametric memory. For agents that make factual claims about external systems, RAG-style verification - retrieve the current state, then reason about it - applies the same principle.
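For an agent, that looks less like document retrieval and more like state retrieval: fetch the ground truth first, then let the model reason over it. A sketch, where fetch_record and llm stand in for whatever system client and model call you actually use:

def grounded_answer(record_id: str, question: str) -> str:
    """Answer a question about an external system from freshly retrieved state,
    not from the model's memory of its own earlier turns."""
    current_state = fetch_record(record_id)   # placeholder: query the real system
    prompt = (
        "Using ONLY the record below, answer the question. "
        "If the record does not contain the answer, say so.\n\n"
        f"Record: {current_state}\n\nQuestion: {question}"
    )
    return llm(prompt)   # placeholder for your model client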
Layer 3 - Bounded Blast Radius
The safeguard that matters most for irreversible actions: stage before executing.
Classify every action by reversibility:
- Reversible - reading files, opening apps, taking screenshots: execute directly
- Soft-reversible - moving files (can be moved back), drafting emails (not sent): execute with logging
- Hard to reverse - sending emails, submitting forms, deleting files: require an explicit confirmation step
REVERSIBILITY = {
    "read_file": "safe",
    "write_file": "soft_reversible",
    "delete_file": "irreversible",
    "send_email": "irreversible",
    "api_post": "irreversible",
    "submit_form": "irreversible",
}

def execute_action(action_type: str, params: dict, auto_confirm: bool = False):
    """Gate irreversible actions behind a human-visible preview and confirmation."""
    # Unknown action types default to irreversible: fail safe, not silent.
    rev = REVERSIBILITY.get(action_type, "irreversible")
    if rev == "irreversible" and not auto_confirm:
        preview = generate_preview(action_type, params)
        confirmation = request_human_confirmation(preview)
        if not confirmation:
            return {"status": "cancelled", "reason": "user declined"}
    return run_action(action_type, params)
This pattern lets the agent draft and preview actions without executing them. For a desktop agent this means: generate the email, show it, wait for approval. Only then send.
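As a usage sketch (the recipient and subject are purely illustrative, and drafted_body is whatever the agent generated earlier in the run):

# The agent stages the email; nothing is sent until a human approves the preview.
result = execute_action(
    "send_email",
    {"to": "client@example.com", "subject": "Q3 report", "body": drafted_body},
)
if result.get("status") == "cancelled":
    print("User declined the send; the draft is kept for editing.")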
Detecting Confident Failures in Practice
The most dangerous hallucination pattern is when an agent generates a plausible-sounding completion message for an action that was never taken. Common triggers:
- Tool call timeout (the agent assumes success because no error was thrown)
- Permission denied (silently fails in some environments)
- Wrong element targeted (click registered but on the wrong UI element)
- Network failure mid-request (the agent only saw the outbound call, not the response)
For each of these, the defense is the same: check the post-condition independently, not just the action log.
def safe_click_and_verify(
    element_ref: str,
    expected_post_state: dict,
    snapshot_fn
) -> dict:
    """Click an element and verify the expected state change occurred."""
    pre_snapshot = snapshot_fn()
    click_result = click_element(element_ref)
    post_snapshot = snapshot_fn()

    diff = diff_snapshots(pre_snapshot, post_snapshot)
    if not matches_expected(diff, expected_post_state):
        return {
            "success": False,
            "claimed": click_result,
            "actual_diff": diff,
            "expected": expected_post_state
        }
    return {"success": True}
Putting It Together
The three-layer approach - state diffing, confidence calibration, bounded blast radius - is not complex to implement. Most of it is wrapping existing action calls in verification logic.
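A minimal sketch of how the pieces from the earlier examples compose (AgentAction, CONFIDENCE_THRESHOLD, execute_action, verify_action), assuming run_action returns a dict:

def guarded_step(claim: AgentAction, params: dict, verifier_fn) -> dict:
    """Run one proposed action through all three layers."""
    # Layer 2: refuse to act on a claim the model itself is unsure about.
    if claim.confidence < CONFIDENCE_THRESHOLD or claim.verification_needed:
        return {"status": "escalated", "reason": "low confidence", "claim": claim}

    # Layer 3: irreversible actions go through the preview/confirm gate.
    result = execute_action(claim.action, params)
    if result.get("status") == "cancelled":
        return result

    # Layer 1: check the post-condition independently; never trust the self-report.
    check = verify_action(result, verifier_fn)
    if not check["verified"]:
        return {"status": "escalated", "reason": "state diff mismatch", "claim": claim}
    return {"status": "verified", "claim": claim}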
The order matters. State diffing is your ground truth layer. Confidence calibration is your early-warning layer. Blast radius bounding is your last-resort layer for when the other two fail. Together they convert an agent that fails silently into one that fails loudly and safely.
Hallucination detection is not optional for production agents. It is the difference between a useful tool and a liability that you cannot trust.
Fazm is an open-source macOS AI agent that uses accessibility tree diffing for all UI action verification. The source is on GitHub.
- AI Agents Lie About What They Did - Why action verification matters
- Post-Action Verification in AI Agents - Going beyond HTTP 200 OK
- Building AI Agents That Explain Their Reasoning - Transparency in agent outputs