Production AI Guide

AI Agent Post-Deployment Monitoring: What Happens After You Ship

The AI agent space has an obsession with building. Every tutorial, every demo, every launch post focuses on getting agents running. But the hard part starts after deployment. When your agent has been running in production for a week and starts making unexpected decisions at 3 AM, you realize the gap between a working demo and a reliable production system is enormous. This guide covers what actually breaks, how to monitor it, and the architectural principles that keep deployed agents stable.

1. What Actually Breaks in Production

AI agents fail differently from traditional software. A REST API either returns a 200 or it does not. An AI agent can return a 200 with confidently wrong output, take a correct-looking action that has subtle downstream consequences, or enter a loop that burns through your API budget in minutes.

The most common production failures in deployed AI agents, ranked by frequency:

  • Drift in output quality - The agent worked perfectly during testing but gradually produces worse results as it encounters edge cases. This is especially common with agents that rely on multi-step reasoning chains. A 95% accuracy rate on each step means 77% accuracy over 5 steps and 60% over 10 steps.
  • Context window overflow - Long-running agents accumulate context until they hit the window limit, at which point they either fail or start losing important earlier context. This is silent - there is no error, just degraded performance.
  • Tool call loops - The agent calls the same tool repeatedly with slightly different parameters, trying to achieve a result that is not possible. Without a loop detector, this can run for hours.
  • Permission escalation - The agent finds creative workarounds to accomplish tasks, sometimes using tools in unintended ways. A coding agent asked to "fix the deployment" might decide to modify CI/CD configuration files it should not touch.
  • Upstream API changes - MCP servers and tool APIs change their response formats. The agent continues to work but misparses responses, leading to incorrect actions based on garbled data.
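The compounding math in the drift bullet is easy to verify: when every step of a reasoning chain must succeed, per-step accuracy multiplies across steps. A quick sketch:

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """End-to-end success rate of a chain where every step must succeed."""
    return per_step ** steps

# 95% per-step accuracy decays quickly over longer chains
print(round(chain_accuracy(0.95, 5), 2))   # 0.77
print(round(chain_accuracy(0.95, 10), 2))  # 0.6
```

This is why shorter chains of focused agents tend to be more reliable than one long reasoning chain.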

Traditional monitoring (uptime checks, error rates) catches maybe 20% of these issues. The rest require agent-specific observability.

2. The One-Job-Per-Agent Principle

The most impactful architectural decision for agent reliability is scope limitation. An agent that "handles customer support, processes refunds, updates the database, and sends notifications" will fail in ways that are impossible to debug. An agent that "classifies incoming support tickets into categories" is testable, monitorable, and replaceable.

The one-job-per-agent principle means:

  • Each agent has a single, clearly defined responsibility
  • The agent's success criteria can be expressed in one sentence
  • You can write automated tests for the agent's output
  • When the agent fails, you know exactly what failed and why
  • You can replace the agent without redesigning the whole system

This mirrors the Unix philosophy of small, composable tools. Instead of one mega-agent, you build a pipeline of focused agents. A classification agent feeds into a routing agent, which feeds into specialized handler agents. Each can be monitored, tested, and improved independently.

Desktop agents like Fazm apply this principle naturally - each task runs as a discrete operation with clear start and end states, rather than as a long-running omnibus agent. Claude Code's subagent system also encourages this: spawn focused agents for specific subtasks rather than asking one agent to do everything.

3. Permission Layers and Security Boundaries

Production agents need a layered permission system. The principle of least privilege applies more strongly to AI agents than to traditional software because agents make autonomous decisions about which tools to use and how.

Layer 1: Tool-level permissions

Restrict which tools each agent can access. A report-generation agent needs read access to the database but should never have write access. Claude Code's hooks system lets you enforce this - you can block specific tool calls or require human approval for dangerous operations.
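A tool-level permission check can be as simple as a per-agent allowlist enforced at the call boundary. This is a generic sketch, not the actual Claude Code hooks API or MCP configuration format; the agent and tool names are illustrative:

```python
# Hypothetical per-agent tool allowlists; real systems enforce the same
# idea in MCP server config or pre-tool-call hooks.
AGENT_TOOLS = {
    "report-agent": {"db.read", "file.read"},
    "deploy-agent": {"db.read", "ci.trigger"},
}

class ToolPermissionError(Exception):
    pass

def check_tool_call(agent: str, tool: str) -> None:
    """Raise before execution if the agent is not allowed to use this tool."""
    allowed = AGENT_TOOLS.get(agent, set())
    if tool not in allowed:
        raise ToolPermissionError(f"{agent} may not call {tool}")
```

The important property is that the check runs outside the agent's own reasoning - the agent cannot talk itself past it.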

Layer 2: Scope restrictions

Even within allowed tools, limit the scope. A file system agent should be restricted to specific directories. A database agent should be limited to specific tables. A GitHub agent should only access designated repositories.
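For file-system scope, the subtle part is path traversal: a naive prefix check lets `../` escape the sandbox. A minimal sketch, assuming the agent is confined to `/src`:

```python
from pathlib import Path

ALLOWED_ROOT = Path("/src").resolve()

def is_in_scope(requested: str) -> bool:
    """Reject any path that resolves outside the agent's allowed directory,
    including '../' traversal attempts."""
    resolved = Path(ALLOWED_ROOT, requested).resolve()
    return resolved == ALLOWED_ROOT or ALLOWED_ROOT in resolved.parents
```

The same resolve-then-compare pattern applies to database table scoping and repository allowlists: normalize the requested resource first, then compare against the grant.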

Layer 3: Action budgets

Set hard limits on what an agent can do in a single session: maximum number of tool calls, maximum tokens spent, maximum files modified, maximum API requests made. These are your circuit breakers.
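A circuit breaker of this kind is a small amount of code. The limits below are illustrative defaults, not recommendations:

```python
class BudgetExceeded(Exception):
    pass

class ActionBudget:
    """Hard caps on what a single agent session may consume."""
    def __init__(self, max_tool_calls: int = 50, max_tokens: int = 200_000):
        self.max_tool_calls = max_tool_calls
        self.max_tokens = max_tokens
        self.tool_calls = 0
        self.tokens = 0

    def record_tool_call(self, tokens_used: int) -> None:
        """Call once per tool invocation; raises when any budget is blown."""
        self.tool_calls += 1
        self.tokens += tokens_used
        if self.tool_calls > self.max_tool_calls or self.tokens > self.max_tokens:
            raise BudgetExceeded("session budget exhausted - stopping agent")
```

When the exception fires, the harness stops the agent and reports what was accomplished so far, rather than letting it grind on.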

Layer 4: Human-in-the-loop gates

For high-consequence actions (deploying to production, sending emails to customers, modifying billing data), require explicit human approval. This should be the last layer, not the first - if you rely on human approval for everything, you have not built an agent, you have built a suggestion engine.
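A gate like this reduces to checking the action against a high-consequence list before execution. The action names are hypothetical, and in a real system `approved` would come from an approval workflow (a Slack button, a CLI prompt, a ticket):

```python
# Hypothetical high-consequence actions that require a human sign-off.
HIGH_CONSEQUENCE = {"deploy_production", "send_customer_email", "modify_billing"}

def execute(action: str, approved: bool = False) -> str:
    """Run an action, holding high-consequence ones until a human approves."""
    if action in HIGH_CONSEQUENCE and not approved:
        return "pending-approval"
    return "executed"
```

Everything outside the high-consequence set runs autonomously, which keeps the agent an agent rather than a suggestion engine.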

| Permission Layer | Enforcement | Example |
|---|---|---|
| Tool access | MCP server config | Read-only DB access for report agents |
| Scope limits | Server-side filtering | Agent can only access /src directory |
| Action budgets | Client-side counters | Max 50 tool calls per session |
| Human gates | Approval workflow | Deploy requires human confirmation |

4. Building an Agent Monitoring Stack

An effective agent monitoring stack tracks three dimensions: behavior (what the agent does), quality (whether its outputs are correct), and cost (how many resources it consumes).

  • Trace logging - Record every tool call, every decision point, every piece of context the agent considered. Tools like LangSmith, Braintrust, and Helicone provide trace visualization. The key is logging at the decision level, not just the API call level.
  • Output validation - Automated checks on agent outputs. For a coding agent: does the code compile? Do tests pass? For a data agent: are values within expected ranges? For a communication agent: does the message match the expected format?
  • Anomaly detection - Alert when agent behavior deviates from baseline. If an agent normally makes 10-15 tool calls per task and suddenly makes 100, something is wrong. If response latency doubles, the agent may be struggling with the task.
  • Cost tracking per agent per task - Break down token usage and API costs by agent and task type. This reveals which agents are efficient and which are burning tokens on retries or unnecessary context.
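The anomaly-detection bullet above can be implemented with a simple z-score against a baseline of recent runs. A minimal sketch (a production version would use per-task-type baselines and a rolling window):

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag a run whose metric (e.g. tool-call count) deviates sharply
    from the baseline of previous runs."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold
```

The same function works for latency, token spend, or retry counts - anything with a stable per-task baseline.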

For desktop agents specifically, monitoring also includes screenshot or state logging - capturing what the agent sees and does on screen at each step. Fazm logs the accessibility tree state at each action, which provides a complete audit trail of UI interactions without the storage overhead of screenshot sequences.

5. Common Failure Patterns and Recovery

Knowing the failure patterns lets you build automated recovery:

  • Infinite loop - Agent repeats the same action. Recovery: count consecutive similar tool calls, kill after threshold (typically 3-5 identical calls). Restart with modified instructions that address why it was looping.
  • Hallucinated tool - Agent tries to call a tool that does not exist. Recovery: return a clear error message listing available tools. Most agents self-correct on the next attempt.
  • Partial completion - Agent completes 4 of 5 steps then stops or errors. Recovery: checkpoint after each step so you can resume from the last successful point rather than starting over.
  • Cascading failure - One agent's bad output causes downstream agents to fail. Recovery: validate outputs at each handoff point. Use schema validation for structured data and sanity checks for unstructured content.
  • Silent degradation - Agent continues running but output quality drops. Recovery: sample-based quality checks. Run a validator agent on a random sample of outputs to catch drift early.
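The infinite-loop recovery above amounts to counting consecutive identical tool calls and killing the run at a threshold. A minimal detector, assuming tool calls arrive as a name plus a parameter dict:

```python
class LoopDetector:
    """Signal a kill when the same tool call repeats N times in a row
    (a threshold of 3-5 is typical, per the pattern above)."""
    def __init__(self, threshold: int = 4):
        self.threshold = threshold
        self.last_call = None
        self.repeats = 0

    def observe(self, tool: str, params: dict) -> bool:
        """Record one tool call; return True when the agent should be stopped."""
        call = (tool, tuple(sorted(params.items())))
        if call == self.last_call:
            self.repeats += 1
        else:
            self.last_call, self.repeats = call, 1
        return self.repeats >= self.threshold
```

A stricter variant would also fuzzy-match "slightly different parameters" (e.g. by hashing normalized params), since looping agents often jitter their arguments.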

6. Cost Runaway and Resource Limits

The scariest production agent failure is cost runaway. A single misconfigured agent can burn through hundreds of dollars in API credits in under an hour. Real numbers:

  • A Claude Opus agent using tool calls averages $0.05-0.15 per turn
  • An agent stuck in a loop doing 200 turns burns $10-30
  • Five parallel agents in loops: $50-150 in an hour
  • Overnight unmonitored: potentially $500+

Essential safeguards:

  • Per-session token budgets - Hard cap on total tokens per agent session. When the budget is exhausted, the agent stops and reports what it accomplished.
  • Per-hour spend alerts - API provider dashboards (Anthropic, OpenAI) support spend alerts. Set them at 2x your expected hourly cost.
  • Kill switches - A simple way to stop all agents immediately. For Claude Code agents, this means being able to kill processes. For cloud-deployed agents, it means API key rotation or rate limit overrides.
  • Model routing - Use expensive models only when necessary. Route simple classification tasks to Haiku, moderate tasks to Sonnet, complex reasoning to Opus. This alone can cut costs 60-80%.
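Model routing can start as a lookup from task complexity to model tier. The tier labels and model identifiers below are illustrative placeholders, not exact API model names:

```python
# Hypothetical complexity tiers mapped to model tiers (IDs are placeholders).
ROUTES = {
    "simple": "claude-haiku",     # classification, extraction
    "moderate": "claude-sonnet",  # summarization, routine tool use
    "complex": "claude-opus",     # multi-step reasoning, planning
}

def pick_model(task_complexity: str) -> str:
    """Route to the cheapest adequate model; fall back to the most
    capable tier when the complexity label is unknown."""
    return ROUTES.get(task_complexity, ROUTES["complex"])
```

The fallback direction is a deliberate choice: an unknown task gets the capable (expensive) model, so routing mistakes cost money rather than quality.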

7. Operational Maturity Model for AI Agents

Teams deploying AI agents typically progress through four maturity levels:

  • Level 1: Manual monitoring - Someone watches the agent run and checks outputs manually. Works for demos and POCs. Does not scale.
  • Level 2: Automated validation - Output checks run automatically. Failures trigger alerts. The agent itself does not have recovery logic, but humans are notified quickly.
  • Level 3: Self-healing - Agents can detect and recover from common failures automatically. Loop detection, automatic retries with modified prompts, fallback to simpler approaches when complex ones fail.
  • Level 4: Continuous improvement - Agent performance data feeds back into prompt optimization, tool selection, and architecture decisions. Failed runs are automatically analyzed and used to improve future runs.

Most teams are at Level 1 or 2. Getting to Level 3 requires the monitoring and permission infrastructure described above. Level 4 is emerging: tools like Braintrust and custom eval pipelines are making it practical, but it remains the frontier. The investment in post-deployment infrastructure pays for itself quickly - the difference between a team that ships agents and one that runs agents reliably in production lies entirely in the monitoring and operational layer.

Run AI Agents You Can Actually Monitor

Fazm provides built-in observability for desktop agent actions - every click, every tool call, every decision logged and auditable.

Try Fazm Free