AI Agents for On-Call Incident Response - The Trust Boundary Problem

Fazm Team · 2 min read
on-call · incident-response · trust · ai-agent · devops

It is 3am. Your pager goes off. A production database is running out of disk space and queries are timing out. You are half asleep and need to act fast. This is exactly the scenario where an AI agent could help - and exactly where the trust problem is most acute.

An agent that can read metrics, identify the root cause, and suggest a fix is incredibly valuable at 3am. An agent that can execute the fix automatically is terrifying. What if it drops the wrong table? What if it restarts the wrong service? What if the "fix" makes things worse?

Dry-Run Mode Is Not Optional

Every destructive action an AI agent can take needs a dry-run mode that shows exactly what will happen without executing it. Not a summary, not a description - the actual commands it would run, the actual queries it would execute, the actual API calls it would make.

At 3am, you do not want to read a paragraph explaining what the agent plans to do. You want to see kubectl delete pod web-server-abc123 -n production and decide whether to approve it.
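One way to sketch this, under the assumption of a hypothetical `Action` wrapper around each remediation step: the action renders the literal command for approval, and execution is a separate, explicit call that only happens after the operator has seen that exact string.

```python
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class Action:
    """A remediation step the agent proposes (illustrative sketch)."""
    argv: list[str]  # the exact command, e.g. ["kubectl", "delete", "pod", ...]

    def dry_run(self) -> str:
        # Render the literal command string the operator will approve -
        # not a summary, not a description.
        return shlex.join(self.argv)

    def execute(self) -> subprocess.CompletedProcess:
        # Only called after the dry-run output has been explicitly approved.
        return subprocess.run(self.argv, capture_output=True, text=True)


action = Action(["kubectl", "delete", "pod", "web-server-abc123", "-n", "production"])
print(action.dry_run())  # the operator sees exactly this before anything runs
```

The key design point is that `dry_run` and `execute` share one source of truth (`argv`), so what the operator approves is byte-for-byte what runs.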

The Confirmation Ladder

Not all actions carry equal risk. Reading logs and metrics should be automatic - no confirmation needed. Restarting a single pod might need a quick approval. Scaling down a database cluster or modifying DNS records should require explicit typed confirmation.

Build a tiered permission system where the risk level determines how much friction the agent encounters. Low-risk actions execute immediately. Medium-risk actions need a single approval. High-risk actions need confirmation plus a reason.
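A minimal sketch of such a ladder, with hypothetical risk tiers and an injectable prompt so the gate itself is testable; the tier names and confirmation rules here are illustrative, not a fixed API:

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"        # read logs/metrics: execute immediately
    MEDIUM = "medium"  # restart a single pod: one quick approval
    HIGH = "high"      # scale down a cluster, change DNS: typed confirmation + reason


def approve(risk: Risk, command: str, prompt=input) -> bool:
    """Return True only if the operator clears the friction for this tier."""
    if risk is Risk.LOW:
        return True
    if risk is Risk.MEDIUM:
        return prompt(f"Run `{command}`? [y/N] ").strip().lower() == "y"
    # HIGH: the operator must retype the exact command and give a reason.
    typed = prompt("Type the command to confirm: ").strip()
    reason = prompt("Reason: ").strip()
    return typed == command and bool(reason)
```

Making the friction proportional to the blast radius means the agent stays fast for diagnosis while the dangerous paths stay slow on purpose.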

Audit Trails

Every action the agent takes during an incident needs to be logged with timestamps, the context that led to the decision, and the actual result. Post-incident reviews depend on being able to reconstruct exactly what happened and why. If the agent made a bad call, you need to understand its reasoning to prevent it from happening again.

The agents that will earn trust in production are the ones that make it easy to verify what they did and why.

Fazm is an open source macOS AI agent. Open source on GitHub.