How Are You Testing Agents in Production?

Fazm Team · 2 min read

Your tool tests pass. Every function works in isolation. Then you deploy the agent and it clicks the wrong button, loops on an error, or confidently does the opposite of what was asked. Welcome to the gap between tool tests and agent tests.

The Testing Gap

Traditional testing validates components. Does the click function click? Does the text extraction return text? These pass with flying colors. But agent behavior is emergent: it comes from the model's decisions about which tools to use, in what order, and how to interpret the results.

You cannot unit test judgment. And that is exactly what makes agent testing hard.

What Actually Works

Scenario replay testing - Record real agent sessions, including the model's decisions and tool outputs. Replay them to catch regressions. When the model changes behavior after an update, you see it immediately.
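A minimal sketch of scenario replay, assuming a session can be reduced to (observation, action) pairs and that the agent's decision step can be called as a plain function. `Step`, `replay`, and both policies are hypothetical names for illustration, not Fazm's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    observation: str  # recorded tool output the agent saw
    action: str       # action the agent chose in the recorded session


def replay(session: list[Step], decide: Callable[[list[str]], str]) -> list[tuple[int, str, str]]:
    """Feed recorded observations to `decide`; return (index, recorded, actual) for each divergence."""
    divergences = []
    history: list[str] = []
    for i, step in enumerate(session):
        history.append(step.observation)
        actual = decide(list(history))
        if actual != step.action:
            divergences.append((i, step.action, actual))
    return divergences


recorded = [
    Step("login form visible", "type_credentials"),
    Step("error: wrong password", "ask_user"),
]


def old_policy(history: list[str]) -> str:
    # Stand-in for the agent version that produced the recording.
    return "ask_user" if "error" in history[-1] else "type_credentials"


def new_policy(history: list[str]) -> str:
    # Hypothetical regression: retries blindly instead of asking the user.
    return "type_credentials"


print(replay(recorded, old_policy))
print(replay(recorded, new_policy))
```

Because the tool outputs come from the recording, the replay needs no live UI. With a real model behind `decide`, exact string equality is probably too strict, and a looser comparison of actions may be a better fit.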

Golden path assertions - Define the expected sequence of actions for common tasks. Not rigid step-by-step scripts, but assertions like "should not click more than 3 times" or "must verify result before reporting success."
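The two example assertions above could be checked against a recorded action trace roughly like this. The action names (`click`, `verify_result`, `report_success`) and the function names are assumptions made for the sketch.

```python
def click_count(trace: list[str]) -> int:
    """Number of click actions in the trace."""
    return sum(1 for action in trace if action == "click")


def verified_before_success(trace: list[str]) -> bool:
    """True if a verify_result appears somewhere before every report_success."""
    verified = False
    for action in trace:
        if action == "verify_result":
            verified = True
        elif action == "report_success" and not verified:
            return False
    return True


def check_golden_path(trace: list[str], max_clicks: int = 3) -> list[str]:
    """Return a list of violated assertions; an empty list means the trace passes."""
    violations = []
    if click_count(trace) > max_clicks:
        violations.append(f"clicked more than {max_clicks} times")
    if not verified_before_success(trace):
        violations.append("reported success without verifying the result")
    return violations


good_trace = ["click", "type", "click", "verify_result", "report_success"]
bad_trace = ["click", "click", "click", "click", "report_success"]

print(check_golden_path(good_trace))
print(check_golden_path(bad_trace))
```

The checks constrain properties of the trace rather than its exact sequence, so the agent stays free to reach the goal by a different route.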

Shadow mode - Run the agent alongside a human doing the same task. Record what the agent would have done at each step and compare it with the human's choice, without letting the agent actually execute. This surfaces the small fraction of cases, say 5%, where the agent would have done something wrong.
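A shadow-mode comparison might look like the following sketch. It assumes each step yields one observation and one human action, and that the agent exposes a decision function (`propose`, a hypothetical name) that returns an action without executing it.

```python
from typing import Callable


def shadow_run(
    observations: list[str],
    human_actions: list[str],
    propose: Callable[[str], str],
) -> tuple[list[dict], float]:
    """Compare the agent's proposals to the human's actions; nothing is executed."""
    pairs = list(zip(observations, human_actions))
    mismatches = []
    for i, (observation, human) in enumerate(pairs):
        proposed = propose(observation)  # decision only, no side effects
        if proposed != human:
            mismatches.append(
                {"step": i, "observation": observation, "human": human, "agent": proposed}
            )
    rate = len(mismatches) / len(pairs) if pairs else 0.0
    return mismatches, rate


def toy_agent(observation: str) -> str:
    # Stand-in for the real model call.
    return "dismiss_dialog" if "dialog" in observation else "click_submit"


observations = ["form ready", "dialog: update available", "form incomplete"]
human_actions = ["click_submit", "dismiss_dialog", "fill_missing_field"]

mismatches, rate = shadow_run(observations, human_actions, toy_agent)
print(mismatches, rate)
```

Each mismatch keeps the observation alongside both decisions, which gives a reviewer enough to judge whether the agent or the human made the better call.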

The Production Reality

Some failures only show up in production. Apps update their UI, screen layouts change, unexpected dialogs appear. The best testing strategy accepts this and focuses on detection and recovery rather than prevention.
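One simple form of detection and recovery is to notice when the agent repeats the same action with the same outcome, then stop and escalate. The sketch below is an illustration under that assumption; `LoopDetector`, `run`, and the `escalate_to_human` action are hypothetical names, not a description of how Fazm handles this.

```python
from collections import deque


class LoopDetector:
    """Flags when the last `window` (action, outcome) pairs are all identical."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)

    def observe(self, action: str, outcome: str) -> bool:
        self.recent.append((action, outcome))
        full = len(self.recent) == self.recent.maxlen
        return full and len(set(self.recent)) == 1


def run(steps: list[tuple[str, str]], window: int = 3) -> list[str]:
    """Process (action, outcome) pairs; stop and escalate once a loop is detected."""
    detector = LoopDetector(window)
    executed = []
    for action, outcome in steps:
        executed.append(action)
        if detector.observe(action, outcome):
            executed.append("escalate_to_human")  # recovery step
            break
    return executed


stuck = [("click_submit", "error: dialog blocked")] * 5
healthy = [("click_submit", "ok"), ("verify_result", "ok"), ("report_success", "ok")]

print(run(stuck))
print(run(healthy))
```

The window size trades off how quickly a stuck agent is stopped against how often legitimate retries are flagged.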

At Fazm, we log every agent action with screenshots. When something goes wrong, we have the full context to understand why - not just that a test failed, but what the agent was seeing and thinking.
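As a rough illustration of this kind of logging (not Fazm's actual implementation), each step can be written as one JSON line that points to a saved screenshot. The file layout, field names, and the `log_action` and `load_actions` helpers are all assumptions made for the sketch.

```python
import json
import tempfile
import time
from pathlib import Path


def log_action(log_dir: Path, step: int, action: str, reasoning: str, screenshot_png: bytes) -> dict:
    """Save the screenshot and append one JSON line describing the step."""
    log_dir.mkdir(parents=True, exist_ok=True)
    shot = log_dir / f"step_{step:04d}.png"
    shot.write_bytes(screenshot_png)
    record = {
        "step": step,
        "timestamp": time.time(),
        "action": action,
        "reasoning": reasoning,      # what the agent said it was trying to do
        "screenshot": shot.name,     # what the agent was seeing
    }
    with (log_dir / "actions.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def load_actions(log_dir: Path) -> list[dict]:
    """Read the action log back in order, for post-incident review."""
    with (log_dir / "actions.jsonl").open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo: placeholder bytes stand in for a real screenshot capture.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run-001"
    log_action(run_dir, 0, "click('Save')", "Save button is visible", b"fake-png")
    log_action(run_dir, 1, "verify_result", "Confirm the file was written", b"fake-png")
    replayed = load_actions(run_dir)
    screenshots_exist = all((run_dir / r["screenshot"]).exists() for r in replayed)
```

An append-only log with one record per line keeps whatever was written before a crash, which matters most in exactly the runs that go wrong.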

Testing agents is less about catching bugs before deployment and more about building systems that catch failures fast and recover gracefully.

Fazm is an open source macOS AI agent, available on GitHub.
