AI Agents for On-Call Incident Response - The Trust Boundary Problem

Fazm Team · 2 min read
on-call · incident-response · trust · ai-agent · devops

It is 3am. Your pager goes off. A production database is running out of disk space and queries are timing out. You are half asleep and need to act fast. This is exactly the scenario where an AI agent could help - and exactly where the trust problem is most acute.

An agent that can read metrics, identify the root cause, and suggest a fix is incredibly valuable at 3am. An agent that can execute the fix automatically is terrifying. What if it drops the wrong table? What if it restarts the wrong service? What if the "fix" makes things worse?

Dry-Run Mode Is Not Optional

Every destructive action an AI agent can take needs a dry-run mode that shows exactly what will happen without executing it. Not a summary, not a description - the actual commands it would run, the actual queries it would execute, the actual API calls it would make.

At 3am, you do not want to read a paragraph explaining what the agent plans to do. You want to see kubectl delete pod web-server-abc123 -n production and decide whether to approve it.
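One way to sketch this, under the assumption of a hypothetical `Action` wrapper around each remediation step: the action renders the literal command for approval, and execution is a separate, explicit call that only happens after the operator has seen that exact string.

```python
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class Action:
    """A remediation step the agent proposes (illustrative sketch)."""
    argv: list[str]  # the exact command, e.g. ["kubectl", "delete", "pod", ...]

    def dry_run(self) -> str:
        # Render the literal command string the operator will approve -
        # not a summary, not a description.
        return shlex.join(self.argv)

    def execute(self) -> subprocess.CompletedProcess:
        # Only called after the dry-run output has been explicitly approved.
        return subprocess.run(self.argv, capture_output=True, text=True)


action = Action(["kubectl", "delete", "pod", "web-server-abc123", "-n", "production"])
print(action.dry_run())  # the operator sees exactly this before anything runs
```

The key design point is that `dry_run` and `execute` share one source of truth (`argv`), so what the operator approves is byte-for-byte what runs.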

The Confirmation Ladder

Not all actions carry equal risk. Reading logs and metrics should be automatic - no confirmation needed. Restarting a single pod might need a quick approval. Scaling down a database cluster or modifying DNS records should require explicit typed confirmation.

Build a tiered permission system where the risk level determines how much friction the agent encounters. Low-risk actions execute immediately. Medium-risk actions need a single approval. High-risk actions need confirmation plus a reason.
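A minimal sketch of such a ladder, with hypothetical risk tiers and an injectable prompt so the gate itself is testable; the tier names and confirmation rules here are illustrative, not a fixed API:

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"        # read logs/metrics: execute immediately
    MEDIUM = "medium"  # restart a single pod: one quick approval
    HIGH = "high"      # scale down a cluster, change DNS: typed confirmation + reason


def approve(risk: Risk, command: str, prompt=input) -> bool:
    """Return True only if the operator clears the friction for this tier."""
    if risk is Risk.LOW:
        return True
    if risk is Risk.MEDIUM:
        return prompt(f"Run `{command}`? [y/N] ").strip().lower() == "y"
    # HIGH: the operator must retype the exact command and give a reason.
    typed = prompt("Type the command to confirm: ").strip()
    reason = prompt("Reason: ").strip()
    return typed == command and bool(reason)
```

Making the friction proportional to the blast radius means the agent stays fast for diagnosis while the dangerous paths stay slow on purpose.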

Audit Trails

Every action the agent takes during an incident needs to be logged with timestamps, the context that led to the decision, and the actual result. Post-incident reviews depend on being able to reconstruct exactly what happened and why. If the agent made a bad call, you need to understand its reasoning to prevent it from happening again.

The agents that will earn trust in production are the ones that make it easy to verify what they did and why.

Fazm is an open source macOS AI agent. Open source on GitHub.