Error Handling in Production AI Agents - Why One Try-Except Is Never Enough

Matthew Diakonov

Updated March 19, 2026

error-handling production ai-agent reliability debugging

The first version of every agent wraps the main loop in a single try-except that catches Exception and logs "something went wrong." This works for demos. It does not work when real users depend on the agent completing tasks reliably.

The problem is that a network timeout, a missing UI element, a permission denial, and an out-of-memory error all require completely different responses. Catching them all the same way means recovering from none of them correctly.

Different Errors Need Different Strategies

A network timeout should trigger a retry with exponential backoff. The request probably would have worked - the connection just dropped. Retrying is the right call.
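As a rough illustration, retrying with exponential backoff might look like the sketch below. The helper name `retry_with_backoff`, its parameters, and the default delays are assumptions made for this example, not part of any particular agent framework. It retries only the built-in `TimeoutError` and `ConnectionError`, so other failures propagate immediately.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on timeout or connection errors.

    The delay doubles after each failed attempt, plus a little random jitter.
    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter exists only so the waiting can be swapped out in tests. A `PermissionError` raised by `operation` is not caught here, which keeps non-retryable failures out of the retry loop.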

A missing UI element should trigger a re-scan of the current screen. Maybe the page layout changed, maybe the element has a different label now, maybe a popup is blocking it. The agent needs to reassess the visual state before trying again.
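A minimal sketch of that re-scan step, assuming the agent keeps a dictionary of element labels to screen positions. `ElementNotFound`, `scan_screen`, and `click` are hypothetical names standing in for whatever the agent actually uses to read and act on the screen.

```python
class ElementNotFound(Exception):
    """Hypothetical error: the target is not in the agent's view of the screen."""


def click_with_rescan(label, elements, scan_screen, click, max_rescans=2):
    """Click the element called `label`, re-scanning the screen if it is missing.

    elements:    dict of label -> position from the last scan (may be stale)
    scan_screen: callable returning a fresh dict of label -> position
    click:       callable that performs the click at a position
    """
    for rescans_used in range(max_rescans + 1):
        if label in elements:
            click(elements[label])
            return elements[label]
        if rescans_used < max_rescans:
            elements = scan_screen()  # reassess the visual state before retrying
    raise ElementNotFound(f"{label!r} not found after {max_rescans} re-scans")
```

A real agent would probably also look for a renamed label or a blocking popup after each scan. This sketch only covers the case where the element reappears under the same label.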

A permission-denied error should not be retried at all. The agent does not have access, so every retry will fail in the same way and waste cycles. Escalate to the user immediately.
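One way to express "never retry, escalate" is to convert the built-in `PermissionError` into a dedicated signal that the rest of the agent treats as a request for user help. `NeedsUserAction` and `run_or_escalate` are hypothetical names used only for this sketch.

```python
class NeedsUserAction(Exception):
    """Hypothetical signal that the agent must stop and ask the user."""


def run_or_escalate(action, description):
    """Run action() exactly once; on a permission denial, escalate to the user."""
    try:
        return action()
    except PermissionError as exc:
        # Retrying cannot succeed, so hand the problem to the user right away.
        raise NeedsUserAction(
            f"Permission denied while trying to {description}"
        ) from exc
```

Because the original exception is chained with `from exc`, the log still shows the underlying denial alongside the escalation.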

A timeout while waiting for a page to load is ambiguous: the page might be slow, or the navigation might have failed silently. Check the current URL before deciding whether to wait longer or try the navigation again.
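The URL check can be reduced to a small decision function. The sketch below assumes the agent can read the browser's current URL and knows which URL it was navigating to. The function name and the two returned labels are made up for illustration, and the comparison ignores redirects and query strings, which a real agent would likely need to handle.

```python
def decide_after_load_timeout(expected_url, current_url):
    """Pick a recovery step after a page-load timeout.

    If the browser already reports the expected URL, the page is probably
    just slow, so wait longer. Otherwise the navigation likely failed
    silently, so navigate again.
    """
    if current_url.rstrip("/") == expected_url.rstrip("/"):
        return "wait_longer"
    return "navigate_again"
```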

The Pattern That Works

Wrap each action type in its own error handler. Click actions get element-not-found and stale-reference handlers. Network actions get timeout and connection-refused handlers. File operations get permission and disk-space handlers.
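The per-action-type split could be sketched as a small dispatch table. `ElementNotFound` and `StaleReference` are hypothetical exception classes, and the recovery names returned by each handler are placeholders. Only `TimeoutError`, `ConnectionRefusedError`, `PermissionError`, and `errno.ENOSPC` come from the Python standard library.

```python
import errno


class ElementNotFound(Exception):
    """Hypothetical: the target element is not on screen."""


class StaleReference(Exception):
    """Hypothetical: the element handle no longer points at a live element."""


def handle_click(exc):
    if isinstance(exc, ElementNotFound):
        return "rescan_screen"
    if isinstance(exc, StaleReference):
        return "refetch_element"
    return None


def handle_network(exc):
    if isinstance(exc, TimeoutError):
        return "retry_with_backoff"
    if isinstance(exc, ConnectionRefusedError):
        return "report_service_down"
    return None


def handle_file(exc):
    if isinstance(exc, PermissionError):
        return "escalate_to_user"
    if isinstance(exc, OSError) and exc.errno == errno.ENOSPC:
        return "report_disk_full"
    return None


# One handler per action type; the recovery names are illustrative only.
HANDLERS = {"click": handle_click, "network": handle_network, "file": handle_file}


def run_action(action_type, action):
    """Run action(); on a known failure return (None, recovery_name)."""
    try:
        return action(), None
    except Exception as exc:
        recovery = HANDLERS[action_type](exc)
        if recovery is None:
            raise  # unknown failure: do not pretend to handle it
        return None, recovery
```

Errors that a handler does not recognize are re-raised rather than swallowed, so an unexpected failure still surfaces instead of being recorded as "handled."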

Log the error type, the context (what the agent was trying to do), and the recovery action taken. When you review logs later, you want to see "click failed on element X, re-scanned screen, found element at new position, retried successfully" - not just "error occurred, retried."
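A small helper built on Python's standard `logging` module can capture all three pieces in one line. The logger name `agent.recovery`, the function name, and the field layout are assumptions for this sketch.

```python
import logging

logger = logging.getLogger("agent.recovery")


def log_recovery(error, context, recovery, outcome):
    """Record what failed, what the agent was doing, and how it recovered."""
    logger.warning(
        "error_type=%s context=%r recovery=%r outcome=%r",
        type(error).__name__,
        context,
        recovery,
        outcome,
    )


# Example call for a click that succeeded after a re-scan.
log_recovery(
    LookupError("Save button missing"),
    context="click Save button",
    recovery="re-scanned screen, found element at new position",
    outcome="retried successfully",
)
```

Keeping the fields in a fixed `key=value` layout makes the entries easy to search later, for example to find every recovery that ended in something other than success.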

Build the granular error handling before you need it. By the time you are debugging a production failure at 2am, it is too late to wish you had better error categorization.

Fazm is an open source macOS AI agent, available on GitHub.
