How to Monitor AI Agent Health in Production

Fazm Team · 3 min read


An AI agent that crashes loudly is easy to fix. An AI agent that silently stops working - still running, still responding to health checks, but no longer completing tasks - can go unnoticed for days. Production monitoring for agents requires different thinking than monitoring traditional services.

Heartbeats Are Not Enough

A heartbeat tells you the process is alive. It does not tell you the agent is working. An agent stuck in an infinite retry loop sends perfect heartbeats while accomplishing nothing.

Effective heartbeats should include a work signal. Not just "I am alive" but "I am alive and I completed 3 tasks in the last 15 minutes." If the completion count drops to zero while the heartbeat continues, something is wrong even though nothing crashed.
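A work-carrying heartbeat can be sketched as a small payload builder. This is a minimal illustration, not Fazm's actual implementation; the field names and the 15-minute window are assumptions chosen to match the example above.

```python
import time

def build_heartbeat(completion_times: list[float], window_seconds: int = 900) -> dict:
    """Build a heartbeat payload that carries a work signal, not just liveness.

    completion_times holds epoch timestamps of recently finished tasks.
    """
    now = time.time()
    recent = [t for t in completion_times if now - t <= window_seconds]
    return {
        "alive": True,
        "completed_in_window": len(recent),
        "window_seconds": window_seconds,
        # Zero completions while still "alive" is the condition worth alerting on.
        "suspect": len(recent) == 0,
    }
```

The monitoring side then alerts when `suspect` stays true across consecutive heartbeats, rather than on missed heartbeats alone.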

The Four Metrics That Matter

Task completion rate. How many tasks did the agent finish per hour? Track this over time. A sudden drop and a gradual decline both indicate problems - different problems, but problems.
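Distinguishing the two failure shapes can be done with a simple check over hourly counts. This is a hypothetical sketch; the 50% drop threshold and four-hour decline window are illustrative assumptions, not recommendations from the text.

```python
def classify_completion_trend(hourly_counts: list[int]) -> str:
    """Classify a series of hourly task-completion counts.

    Flags a sudden drop (last hour far below the recent average) or a
    gradual decline (each of the last four hours lower than the one before).
    """
    if len(hourly_counts) < 4:
        return "insufficient data"
    *history, last = hourly_counts
    baseline = sum(history) / len(history)
    if baseline > 0 and last < 0.5 * baseline:
        return "sudden drop"
    recent = hourly_counts[-4:]
    if all(a > b for a, b in zip(recent, recent[1:])):
        return "gradual decline"
    return "stable"
```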

Error rate by category. Not all errors are equal. A failed API call that retries successfully is noise. A permission error that blocks an entire workflow is critical. Categorize errors by impact: recoverable, blocking, and data-corrupting. Alert only on the latter two.
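The three-tier categorization maps directly to a routing rule. A minimal sketch, assuming an enum of the categories named above; `should_alert` is a hypothetical helper:

```python
from enum import Enum

class ErrorImpact(Enum):
    RECOVERABLE = "recoverable"          # e.g. a failed API call that retried successfully
    BLOCKING = "blocking"                # halts a workflow until someone intervenes
    DATA_CORRUPTING = "data_corrupting"  # wrote wrong state somewhere

def should_alert(impact: ErrorImpact) -> bool:
    # Recoverable errors are logged as noise; only the latter two categories alert.
    return impact in (ErrorImpact.BLOCKING, ErrorImpact.DATA_CORRUPTING)
```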

End-to-end latency. How long does a typical task take from start to finish? Track the P50 and P99. If median latency is stable but P99 is climbing, the agent is hitting edge cases that take disproportionately long. These are often the precursors to failures.
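Tracking the P50/P99 split can be done with a nearest-rank percentile over recorded task latencies. This is an illustrative sketch; the 1.5x drift tolerance is an assumed threshold, not one stated in the text.

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

def p99_drifting(latencies: list[float], baseline_p99: float, tolerance: float = 1.5) -> bool:
    """True when P99 climbs past tolerance x baseline, even if P50 looks stable."""
    return percentile(latencies, 99) > tolerance * baseline_p99
```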

Silent failure rate. This is the hardest to measure and the most important. How many tasks did the agent report as complete that actually were not? Periodic spot-checks - verifying a sample of "completed" tasks against reality - catch agents that learned to mark things done without finishing them.

Alerting on What Matters

Avoid alert fatigue by tiering your alerts. Use logs for expected errors and retries. Use warnings for elevated error rates or latency. Use pages for zero-completion periods, data corruption signals, or complete agent hangs.
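The tiering above can be expressed as a routing table. The event names here are hypothetical categories invented for illustration; `page` stands in for whatever triggers your on-call system.

```python
import logging

# Assumed mapping from event kind to alert tier, mirroring the tiers above.
TIERS = {
    "retry": "log", "expected_error": "log",
    "elevated_error_rate": "warn", "latency_climb": "warn",
    "zero_completions": "page", "data_corruption": "page", "agent_hang": "page",
}

def route_event(kind: str, logger: logging.Logger, page) -> None:
    """Send an agent event to the log, a warning, or the pager."""
    tier = TIERS.get(kind, "warn")  # unknown events default to a warning
    if tier == "log":
        logger.info("agent event: %s", kind)
    elif tier == "warn":
        logger.warning("agent event: %s", kind)
    else:
        page(kind)
```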

The most valuable alert is the absence of expected activity. If your agent processes emails every morning at 8 AM and today it did not, that non-event should trigger an alert louder than any error message.
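Alerting on a non-event is a dead man's switch: the check fires when an expected run is overdue. A minimal sketch using the 8 AM example; the 30-minute grace period is an assumption.

```python
from datetime import datetime, timedelta

def missed_expected_run(last_run: datetime, expected_hour: int,
                        now: datetime, grace_minutes: int = 30) -> bool:
    """Dead man's switch: True when today's expected run did not happen.

    expected_hour is the hour of day the agent normally runs (e.g. 8 for 8 AM).
    """
    expected = now.replace(hour=expected_hour, minute=0, second=0, microsecond=0)
    deadline = expected + timedelta(minutes=grace_minutes)
    if now < deadline:
        return False  # too early to judge today's run
    return last_run < expected  # no run at or after the expected time: alert
```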

Log Structure for Agent Debugging

Every agent action should log three things: what it intended to do, what it actually did, and what it observed after doing it. When something goes wrong, these three data points make debugging possible without reproducing the issue.
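The three-part entry maps naturally onto a structured (JSON) log line. A minimal sketch; the field names are assumptions, not a schema the post prescribes.

```python
import json
import time

def log_action(intended: str, actual: str, observed: str) -> str:
    """Emit one structured log line for an agent action."""
    entry = {
        "ts": time.time(),
        "intended": intended,   # what the agent planned to do
        "actual": actual,       # what it actually did
        "observed": observed,   # what it saw after doing it
    }
    return json.dumps(entry)
```

When a task goes wrong, diffing `intended` against `actual` and `observed` usually localizes the failure without rerunning the agent.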

Fazm is an open-source macOS AI agent, available on GitHub.
