What Breaks When You Evaluate an AI Agent in Production
Dev environments are kind to AI agents. APIs respond fast. Schemas are consistent. Edge cases do not exist because your test data is clean. Then you deploy to production and everything breaks in ways you did not anticipate.
Latency Variance
In dev, your API calls return in 200ms. In production, they return in 200ms to 4 seconds depending on load, network conditions, and which region the request hits. Your agent's timeout was set to 2 seconds because that seemed generous in testing. Now half your tool calls fail intermittently.
Worse, the agent interprets a timeout as a failure and retries - sometimes creating duplicate actions. An agent that places an order, times out, and retries just placed two orders.
Schema Validation Failures
The API returned a slightly different JSON structure than expected. A field that was always present in dev is sometimes null in production. A list that always had items occasionally comes back empty. Your agent's parser throws an exception, and the entire workflow halts.
The fix is defensive parsing everywhere, but that is boring to build and easy to forget. Every field access needs a null check. Every list operation needs an empty check. Every type needs validation.
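A small sketch of what that defensive layer looks like in Python — the field names (`items`, `name`) are illustrative, not from any particular API:

```python
from typing import Any

def safe_get(payload: dict[str, Any], key: str, default: Any = None) -> Any:
    """Return payload[key] only if the key exists and the value is non-null."""
    value = payload.get(key)
    return default if value is None else value

def parse_items(response: dict[str, Any]) -> list[str]:
    # The field may be missing, null, empty, or the wrong type; assume nothing.
    items = safe_get(response, "items", default=[])
    if not isinstance(items, list):
        return []
    return [str(safe_get(i, "name", default="unknown"))
            for i in items if isinstance(i, dict)]

parse_items({"items": [{"name": "a"}, {}]})  # ['a', 'unknown']
parse_items({"items": None})                 # []
parse_items({})                              # []
```

In a larger codebase, a schema-validation library that produces typed objects with defaults gets you the same guarantees without hand-writing every check.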
The Dev-to-Prod Gap
Dev environments have fresh data, consistent state, and predictable behavior. Production has stale caches, concurrent users modifying the same resources, rate limits that only kick in under real load, and third-party services that go down at 3 AM.
The agents that survive production are the ones built with the assumption that everything will fail. Retry with backoff. Validate every response. Log every action for debugging. And most importantly - have a graceful degradation path so a single failure does not cascade into a broken workflow.
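Those four habits compose into one wrapper. A minimal sketch, assuming the wrapped call raises on failure and that returning a fallback value is an acceptable degraded result:

```python
import random
import time

def call_with_backoff(fn, attempts: int = 4, base: float = 0.05, fallback=None):
    """Retry fn with exponential backoff plus jitter; degrade to fallback."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            print(f"attempt {attempt + 1} failed: {exc}")  # log every action
            if attempt == attempts - 1:
                return fallback  # graceful degradation instead of a crash
            # 0.05s, 0.1s, 0.2s, ... plus jitter to avoid thundering herds
            time.sleep(base * 2 ** attempt + random.uniform(0, 0.01))

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

call_with_backoff(flaky)  # succeeds on the third attempt
```

Note this only belongs on reads and idempotent writes; a non-idempotent action behind this wrapper recreates the duplicate-order problem from earlier.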
Test your agent against production-like conditions before you ship it. Shadow mode - running the agent alongside manual work without executing actions - catches most of these issues before they hit real users.
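Shadow mode can be as simple as an executor that logs intended actions instead of performing them. A minimal sketch — `real_execute` is a hypothetical stand-in for whatever side-effecting call your agent makes:

```python
def real_execute(action: str, params: dict) -> dict:
    # Stand-in for the real side-effecting call (assumption for this sketch).
    return {"status": "done", "action": action}

class ShadowExecutor:
    """Records every action the agent wants to take; executes only if live."""

    def __init__(self, live: bool = False):
        self.live = live
        self.log: list[tuple[str, dict]] = []

    def execute(self, action: str, **params) -> dict:
        self.log.append((action, params))  # always record, for later comparison
        if self.live:
            return real_execute(action, params)
        return {"status": "shadow", "action": action}  # no side effects

executor = ShadowExecutor(live=False)
result = executor.execute("place_order", item="widget")
# result["status"] == "shadow"; the order was logged, never placed
```

Comparing the shadow log against what a human actually did surfaces timeout handling, parsing gaps, and duplicate actions before the agent touches real users.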
Fazm is an open-source macOS AI agent, available on GitHub.