
Multi-step action chain reliability is mostly a harness problem

Most writing on this topic does compounding-error math: 95% per step, 77% by step five, 60% by step ten. The math is right and the conclusion is incomplete. The thing that actually nukes a 20-step chain on a real Mac is almost never the model picking the wrong button. It is one tool call hanging, a poisoned SDK session, or a deferred response from the previous turn arriving on the new prompt. The fix lives in the harness, not in the model.

Direct answer (verified 2026-05-01 against acp-bridge/src/index.ts)

Multi-step computer-use chains break for three reasons that have nothing to do with the model:

  1. A single tool call hangs and blocks the agent loop forever waiting on a result that will never arrive.
  2. The SDK silently delivers a previously-cancelled prompt's deferred output as the response to the next prompt.
  3. session/resume lands on an upstream session that is already broken, so every subsequent call runs against a phantom.

Fazm fixes all three at the harness layer: per-tool wall-clock watchdogs that synthesize a failure event, a 5-second time-to-first-token race after any interrupt, and priorContext replay into a fresh session when resume is unsafe. Constants and line numbers below.

Matthew Diakonov
8 min read

10s internal tool timeout · 120s MCP tool timeout · 5s TTFT race after interrupt · 20 messages replayed on session reset

What every other guide on this gets wrong

I read the pages currently answering this topic before writing this one. They all follow the same pattern. State the compounding-error math (one paragraph). Argue for fewer steps, more verification, or multi-agent orchestration (three paragraphs). Stop. The framing assumes the harness is invisible: a tool call is a function call, it returns or it raises, the model retries.

That assumption holds for hosted browser agents in a sandbox. It does not hold on a real desktop. On a real desktop the tool can click into an app that froze, hit an OS permission prompt that never resolves, send a Bash command that waits on stdin, or call an MCP server whose Python process wedged. The function call does not return and does not raise. It hangs. And while it hangs, the agent loop is blocked on its result, which means the model cannot advance to step nine, which means the chain is dead even though no step has actually failed.

Compounding error rates are not the load-bearing failure mode for chains on a real machine. Hangs and bad sessions are. The rest of this guide is what Fazm's acp-bridge actually does about them, with the exact constants you can grep for in the open source repo.

The three failure modes, with the recovery for each

Each entry below names a failure mode I hit while building this thing, the constant or function in acp-bridge/src/index.ts that catches it, and what the agent loop actually sees afterward. None of this is theoretical. It is all in the file.


Failure 1: a tool call hangs forever

MCP server wedged, Bash waiting on stdin, click into a frozen app. The SDK is still waiting for the tool's response, the model can't continue, and the parent process eventually falls through to its 180-second inactivity timeout, killing the whole turn with no useful error.

Recovery: startToolTimer sets one setTimeout per tool call (line 140). On fire: synthesize a tool_activity completion, emit a tool_result_display with the timeout error, abort the in-flight query, mark the session interrupted, unregister it. The next prompt routes through priorContext replay.

Caps: 10s for internal tools, 120s for any tool prefixed mcp__, 300s for everything else (Bash, Edit, Write, Read). Override via FAZM_TOOL_TIMEOUT_SECONDS or Settings > Advanced > Tool Timeout.
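
The exact sequence lives in the repo; here is a minimal sketch of the watchdog's shape, not the verbatim source. emitEvent, inFlightTools, and interruptedSessions are illustrative stand-ins for the bridge's real bookkeeping.

// A minimal sketch of the per-tool watchdog. Helper names are assumptions.
declare function emitEvent(event: Record<string, unknown>): void;
declare function getToolTimeoutMs(title: string, isInternal: boolean): number; // shown later in this guide

const inFlightTools = new Map<string, { title: string }>();
const interruptedSessions = new Set<string>();

function startToolTimerSketch(
  toolCallId: string,
  title: string,
  isInternal: boolean,
  sessionKey: string,
  abort: AbortController,
): ReturnType<typeof setTimeout> {
  const budgetMs = getToolTimeoutMs(title, isInternal);
  return setTimeout(() => {
    // Tool completed inside the race window: nothing to synthesize.
    if (!inFlightTools.has(toolCallId)) return;
    inFlightTools.delete(toolCallId); // stop the pending-tool spinner
    // Synthesize completion + failure so the agent loop exits cleanly.
    emitEvent({ type: "tool_activity", toolCallId, status: "completed" });
    emitEvent({
      type: "tool_result_display",
      toolCallId,
      error: `Tool TIMEOUT after ${budgetMs / 1000}s`,
    });
    abort.abort();                       // kill the in-flight query
    interruptedSessions.add(sessionKey); // next prompt replays priorContext
  }, budgetMs);
}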


Failure 2: the SDK delivers a stale prompt's deferred output

Specific SDK bug: a session cancelled mid-stream can have its deferred output chunks delivered as the response to the next prompt. The user types a fresh question and gets a one-line reply about a tool call from three minutes ago.

Recovery: Interrupt-replay heuristic at lines 2080 to 2108. Trigger if the session was previously interrupted, the new prompt is > 50 characters, and inputTokens <= 20 (meaning only the prompt-shell overhead arrived at the model, not the user's actual content).

Action: Unregister the session, set msg._priorStuckSessionId, recurse into handleQuery so the next attempt skips resume entirely and goes straight to session/new + priorContext replay. Bounded by MAX_QUERY_RETRIES = 2 at line 1635.
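
The trigger condition reduces to a three-part predicate. A sketch, with field names that are assumptions rather than the verbatim source:

// Illustrative predicate for the interrupt-replay heuristic.
function looksLikeSdkReplaySketch(turn: {
  wasInterrupted: boolean; // session was cancelled mid-stream earlier
  promptLength: number;    // characters in the user's new prompt
  inputTokens: number;     // input tokens the SDK reports for this turn
}): boolean {
  // inputTokens <= 20 on a >50-char prompt means only the prompt-shell
  // overhead reached the model, so the "reply" must be deferred output
  // from the cancelled turn.
  return turn.wasInterrupted && turn.promptLength > 50 && turn.inputTokens <= 20;
}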


Failure 3: session/resume lands on a poisoned upstream session

The session id is valid. The SDK accepts the resume. But the session state on the other end is half-broken from a prior interrupt. The next session/prompt silently never gets dispatched. The bridge sits there waiting for any notification at all and the user sees a spinner that never resolves.

Recovery: TTFT (time-to-first-token) watchdog at line 2028. TTFT_WATCHDOG_MS = 5_000. On any session previously marked as interrupted, race the session/prompt promise against a 5-second timer. If the notification count is still zero when it fires, throw TTFT_WATCHDOG: session unresponsive after interrupt. The outer retry creates a fresh session and replays priorContext.

Cost on the happy path: Zero. The race only runs on previously-interrupted sessions. Healthy sessions never start the timer. If notifications start flowing before the timer fires, the timer is released and the prompt finishes normally.
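
The race itself is a plain Promise.race. A minimal sketch, assuming the bridge exposes a per-prompt notification counter; names are illustrative:

// A minimal sketch of the TTFT race on previously-interrupted sessions.
const TTFT_WATCHDOG_MS = 5_000;

async function promptWithTtftWatchdog<T>(
  sendPrompt: () => Promise<T>,
  getNotificationCount: () => number,
  wasInterrupted: boolean,
): Promise<T> {
  const prompt = sendPrompt();
  if (!wasInterrupted) return prompt; // healthy sessions never start the timer

  let timer: ReturnType<typeof setTimeout> | undefined;
  const watchdog = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Zero notifications after 5s: the SDK silently dropped the prompt.
      if (getNotificationCount() === 0) {
        reject(new Error("TTFT_WATCHDOG: session unresponsive after interrupt"));
      }
    }, TTFT_WATCHDOG_MS);
  });
  try {
    // Either the prompt finishes first, or the watchdog throws and the
    // outer retry falls through to session/new + priorContext replay.
    return await Promise.race([prompt, watchdog]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}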

Anchor fact: the constants live in three lines

You can grep these in github.com/m13v/fazm right now. The whole timeout policy is three named constants plus one user-override env var. That is the entire surface.

// acp-bridge/src/index.ts, lines 113 to 120

const TOOL_TIMEOUT_INTERNAL_MS = 10_000;   // ToolSearch and similar: 10s
const TOOL_TIMEOUT_MCP_MS      = 120_000;  // any mcp__ tool: 2 min
const TOOL_TIMEOUT_DEFAULT_MS  = 300_000;  // Bash, Edit, Write, Read: 5 min

const toolTimeoutOverrideSec = process.env.FAZM_TOOL_TIMEOUT_SECONDS
  ? parseInt(process.env.FAZM_TOOL_TIMEOUT_SECONDS, 10)
  : 0;

The classifier picks one of these per tool call. MCP gets a tighter cap because MCP servers are the most common source of hangs in practice (subprocess crashes, stuck stdio reads, broken IPC). The 5-minute default for Bash/Write/Edit is generous on purpose so that a long compile or a deliberate sleep does not get false-flagged.

// acp-bridge/src/index.ts, lines 132 to 138

function getToolTimeoutMs(title: string, isInternal: boolean): number {
  if (toolTimeoutOverrideSec > 0) return toolTimeoutOverrideSec * 1000;
  if (isInternal) return TOOL_TIMEOUT_INTERNAL_MS;
  if (title.startsWith("mcp__")) return TOOL_TIMEOUT_MCP_MS;
  return TOOL_TIMEOUT_DEFAULT_MS;
}

What this looks like in the bridge log

A real transcript from a chain that hit a stuck Playwright click on step 8 of 20. The watchdog fires, the bridge synthesizes a failure event, the session is marked interrupted, the next user prompt replays the prior 14 messages into a fresh session, and the chain resumes from where it stalled.

[Embedded acp-bridge log transcript]

Naive harness behavior versus what Fazm does

Stripped down to the chain-reliability mechanics. Both columns assume the model is identical. The difference is entirely what the process around the model does when something goes wrong.

  • Tool call exceeds wall-clock budget. Naive harness: agent loop blocks indefinitely; parent eventually times out the whole turn. Fazm: watchdog synthesizes a completed event with TIMEOUT and the agent loop continues.
  • Per-tool-class timeout. Naive harness: a single global timeout (often the model's max thinking time) or none. Fazm: 10s internal, 120s mcp__, 300s default; user-overridable.
  • Session interrupted mid-stream. Naive harness: resumed blindly; the SDK may deliver deferred output on the new prompt. Fazm: marked dirty; the next prompt routes through priorContext replay.
  • session/resume into a poisoned session. Naive harness: waits on the parent's 180-second inactivity timeout while the user sees an endless spinner. Fazm: 5-second TTFT race; throws fast and falls through to session/new.
  • SDK replays a cancelled prompt's output. Naive harness: the user sees a one-line response to a question they didn't ask. Fazm: the heuristic catches it (inputTokens <= 20 and prompt > 50 chars) and recurses to a fresh session.
  • Recovery context after a fresh session. Naive harness: fresh session, no preamble; the user re-explains what they were doing. Fazm: replays the last 20 messages from the local store as a [SESSION RESTORED] preamble.
  • Recursion bound on the retry path. Naive harness: often unbounded retries, or no retry at all. Fazm: MAX_QUERY_RETRIES = 2, no infinite loops.
  • Stuck-tool diagnostics. Naive harness: a generic timeout error with no clue which tool actually wedged. Fazm: inFlightTools dump on interrupt with the exact Bash command or file path that hung.
Three failure modes: tool hang, SDK replay, poisoned resume. Each one maps to a single named constant or function in acp-bridge/src/index.ts. None of them is about the model picking the wrong button.

acp-bridge/src/index.ts, github.com/m13v/fazm

Why this lives in the bridge, not in the model prompt

You could try to write all of this into the system prompt. Tell the model: notice when a tool has been pending for over two minutes, assume it failed, plan around it. In practice this does not work. The model cannot detect the wall clock from inside its own loop. The SDK never delivers a tool_call_update with a timeout signal unless something synthesizes one. The model is waiting for the tool result and there is no result, so it has nothing to reason against.

The same is true of session-level corruption. The model has no way to know that the session it is running in has been silently broken by a prior cancel. It just sees that its tool call seems to have worked but the user is not responding. The bridge is the only process in the loop that can hold a wall clock and a session identity and decide that something is wrong before the user has to notice.

That is why per-step accuracy improvements (better prompts, more examples, smarter models) hit a ceiling on chain reliability that you cannot break through from inside the model. The ceiling is set by the harness. Lift the ceiling first.

If you are building your own harness

The principles port to any agent harness, not just ACP + MCP. The specific code is tied to those protocols, but the shape transfers to LangGraph, AutoGen, raw OpenAI tool-use, anything else.

  • One wall-clock timer per tool call. Started outside the model loop. On expiry, synthesize a tool failure event. Do not just kill the process; the agent loop must be able to plan against the failure. (A portable sketch of this item and the bounded-retry item follows the list.)
  • Different timeouts per tool class. A search tool and a long-running build are not the same. Pick three or four classes, not one global cap.
  • A fast time-to-first-token race after any interrupt. 5 seconds is enough. If no notifications arrive, fail fast and create a fresh session.
  • A bounded retry path with a known recovery state. Two retries is plenty. Each retry should converge on a specific recovery state (in Fazm's case: priorContext replay), not on a vague "try again".
  • Local message store as the source of truth. Never trust the upstream session as the authoritative transcript. The user's local store is yours; the upstream session can be reset out from under you.
  • Diagnostic capture on every interrupt. On every cancel, log the exact tool inputs that were in flight. This is what lets you fix the actual cause of a hang instead of staring at a generic timeout error.
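
A portable sketch of the timer and bounded-retry items above, independent of ACP and MCP. All names are illustrative; plug in your own session and tool types.

// Sketch 1: race a tool call against a wall-clock budget and synthesize a
// failure result the agent loop can plan against, instead of killing the
// process or blocking forever.
type ToolResult = { ok: boolean; output: string };

async function callToolWithBudget(
  run: () => Promise<ToolResult>,
  budgetMs: number,
  toolName: string,
): Promise<ToolResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<ToolResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, output: `${toolName} timed out after ${budgetMs}ms` }),
      budgetMs,
    );
  });
  try {
    return await Promise.race([run(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Sketch 2: a bounded retry that converges on one known recovery state
// (here, a flag the caller uses to open a fresh session primed with the
// local transcript), never a vague "try again".
async function runTurnWithBoundedRetry(
  attempt: (recovered: boolean) => Promise<void>,
  maxRetries = 2,
): Promise<void> {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return await attempt(i > 0);
    } catch (err) {
      if (i === maxRetries) throw err;
    }
  }
}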

The one-sentence version

Multi-step action chain reliability for a desktop computer-use agent is mostly the harness deciding, on a wall clock and outside the model, when to give up on a hung tool, when to distrust a session, and how to hand the next prompt enough context to continue without the user re-explaining anything.

Fazm's acp-bridge does this in roughly 250 lines of TypeScript: three named timeouts, one watchdog setTimeout per tool, a TTFT race on previously-interrupted sessions, a heuristic for catching SDK replay, a bounded retry into a priorContext replay path. The code is at github.com/m13v/fazm.

Want to see the watchdog catch a real chain failure?

Fifteen minutes. We open Fazm, point it at a 20-step task, kill an MCP tool mid-chain, and you watch the bridge synthesize the failure and resume into a fresh session.

Frequently asked questions

Why does a 20-step computer-use chain fail more often than ten 2-step chains?

Two reasons that have almost nothing to do with the model. First, the harness sits in front of the model and forwards tool calls; if any one of those tool calls hangs (a click into a frozen app, a shell command waiting on stdin, an MCP server that wedged), the agent loop blocks waiting for a result that will never arrive. Two-step chains return to the user before they hit a hang; 20-step chains accumulate hang risk with each call. Second, every reset between chains gives the harness a clean session. A long single-prompt chain can land on a session that the SDK silently broke (mid-stream interrupt, deferred response from the prior turn). The model never sees the failure; it just runs against a phantom session and the user sees a spinner. Compounding 95% per-step accuracy is the textbook answer. The real failure modes are timeouts and stale sessions, and they are harness concerns.

What are the actual per-tool timeouts Fazm enforces, and where in the code are they set?

Three classes, three constants, all in acp-bridge/src/index.ts at lines 113 to 115. TOOL_TIMEOUT_INTERNAL_MS = 10_000 covers tools like ToolSearch (anything tagged internal). TOOL_TIMEOUT_MCP_MS = 120_000 covers any tool whose name starts with mcp__ (the MCP tool prefix). TOOL_TIMEOUT_DEFAULT_MS = 300_000 covers everything else (Bash, Edit, Write, Read, etc.). The classifier is the function getToolTimeoutMs at line 132. Users can override the default with the FAZM_TOOL_TIMEOUT_SECONDS env var, exposed in Settings under Advanced > Tool Timeout. The watchdog itself, startToolTimer at line 140, sets one setTimeout per tool call and clears it on tool completion.

What happens at the moment the watchdog fires?

Six steps, in order. (1) Check inFlightTools for the toolCallId; if the tool already completed in the race window, log it and bail without synthesizing anything. (2) Log Tool TIMEOUT with the elapsed seconds. (3) Remove the tool from pendingTools so the UI spinner stops. (4) Emit a tool_activity event with status: completed so the UI is unblocked. (5) Emit a tool_result_display event with the timeout message and a link to fazm://settings/tool-timeouts so the user can adjust the cap. (6) Abort the in-flight query (abortController.abort()), notify ACP to cancel the session (session/cancel), add the session to interruptedSessions, and unregister the sessionKey. The net effect: the agent loop exits with a synthesized failure event for the hung tool, the next user prompt gets a fresh session, and the prior conversation is replayed via priorContext. The full sequence is at acp-bridge/src/index.ts lines 150 to 222.

What is priorContext replay and why does it exist?

When session/resume fails (or when a session was previously marked as interrupted, including by the watchdog above), Fazm cannot trust the upstream conversation history. Instead of resuming, the bridge calls session/new and prepends a transcript of the most recent turns from the user's local message store. The cap is MAX_REPLAY = 20 messages (acp-bridge/src/index.ts line 1837). Each replayed entry is hard-truncated to 4000 characters so a single huge tool dump cannot dominate the preamble. The whole transcript is wrapped in a [SESSION RESTORED FROM LOCAL HISTORY] marker so the model treats it as authoritative context, not a new request. This is what lets a multi-step task survive an upstream session reset without the user re-explaining what they were doing.
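
A minimal sketch of the preamble builder, assuming a local store of role-tagged messages. The two caps and the marker string come from the article; the builder itself is illustrative.

// Illustrative priorContext builder over a local message store.
function buildPriorContextSketch(
  history: { role: "user" | "assistant"; text: string }[],
): string {
  const MAX_REPLAY = 20;        // most recent messages to replay
  const MAX_ENTRY_CHARS = 4000; // hard cap per replayed entry

  const recent = history.slice(-MAX_REPLAY).map((m) => {
    // Hard-truncate so a single huge tool dump cannot dominate the preamble.
    const body = m.text.length > MAX_ENTRY_CHARS ? m.text.slice(0, MAX_ENTRY_CHARS) : m.text;
    return `${m.role}: ${body}`;
  });
  return ["[SESSION RESTORED FROM LOCAL HISTORY]", ...recent].join("\n");
}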

What is the time-to-first-token watchdog?

When the bridge sends session/prompt to a session that was previously interrupted, the SDK sometimes silently drops the prompt (the cancel landed mid-tool-call and the session is in a half-broken state). To catch this without waiting for the parent's 180s inactivity timeout, the bridge races the prompt against a 5-second timer (TTFT_WATCHDOG_MS = 5_000 at line 2028). If five seconds pass with zero notifications, the watchdog throws TTFT_WATCHDOG: session unresponsive after interrupt and the outer retry logic creates a fresh session with priorContext replay. If notifications start flowing, the watchdog releases and the prompt finishes normally. The race only runs on previously-interrupted sessions, so healthy sessions pay no overhead.

What is interrupt-replay detection?

A defensive heuristic for a specific SDK bug. When a session is cancelled mid-stream, the SDK can deliver the cancelled prompt's deferred output chunks as the response to the next prompt. The model never sees the new prompt; the user gets the previous turn's reply with a delay. Fazm catches this at acp-bridge/src/index.ts lines 2080 to 2108 with a heuristic: if the session was previously interrupted, the new prompt is at least 50 characters, and the SDK reports inputTokens <= 20 (meaning only the prompt-shell overhead, not the actual user content), it is treated as a replay. The bridge then unregisters the session, sets msg._priorStuckSessionId, and recursively calls handleQuery so the next attempt skips resume entirely and goes straight to session/new + priorContext replay. The recursion is bounded by MAX_QUERY_RETRIES = 2 at line 1635.

What is stale-task retry?

A second defensive heuristic. If the prompt completes in under 2 seconds, returns fewer than 100 output tokens, and there were stale task notifications from a previous turn arriving during this turn (staleTaskNotificationCount > 0), the bridge assumes the model responded to the leftover background task event instead of the user's actual question. It clears state and re-sends the same prompt once (capped at one retry to avoid loops). This lives at lines 2115 to 2150. It is rare but it is the difference between the user seeing a confusing one-line reply about something they didn't ask, and getting the actual answer.
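
As a sketch, the condition reduces to another small predicate; the per-turn fields are assumptions about what the bridge tracks, not the verbatim source.

// Illustrative predicate for the stale-task heuristic.
function looksLikeStaleTaskReply(turn: {
  elapsedMs: number;                  // wall-clock time for the whole turn
  outputTokens: number;               // tokens in the model's reply
  staleTaskNotificationCount: number; // leftovers from the previous turn
  alreadyRetried: boolean;            // capped at one retry to avoid loops
}): boolean {
  // A fast, tiny reply while stale notifications were arriving: the model
  // probably answered the leftover event, not the user's question.
  return (
    !turn.alreadyRetried &&
    turn.elapsedMs < 2_000 &&
    turn.outputTokens < 100 &&
    turn.staleTaskNotificationCount > 0
  );
}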

Are these recovery mechanisms specific to MCP, or do they generalize?

The shape generalizes; the specific code is wired to ACP (Agent Client Protocol) and MCP. The principles are: (1) every tool invocation has a wall-clock timeout enforced outside the model; (2) timeout means synthesize a failure event so the agent loop can exit cleanly, do not just kill the process; (3) every session has a TTFT race after a known-bad event so the harness fails fast instead of hanging on the parent's slowest timeout; (4) every retry path has a bounded depth and a known recovery state (priorContext replay) instead of an unbounded retry loop; (5) the user's local message store, not the upstream session, is the source of truth for what was discussed. Any agent harness can implement these. Most don't.

Where can I read this code?

It is open source. The whole file is acp-bridge/src/index.ts in the m13v/fazm repo on GitHub (github.com/m13v/fazm). Tool watchdog is lines 108 to 238. handleQuery (the function that owns the chain-level recovery state machine) is lines 1637 to roughly 2150. The line numbers will drift as the file is edited, but the constant names (TOOL_TIMEOUT_INTERNAL_MS, TOOL_TIMEOUT_MCP_MS, TOOL_TIMEOUT_DEFAULT_MS, MAX_QUERY_RETRIES, MAX_REPLAY, TTFT_WATCHDOG_MS) are stable search anchors.

Does Fazm fix the per-action reliability problem too, or only chain-level?

Both, at different layers. The per-action layer is the macos-use MCP server (separate Swift binary, Sources/MCPServer/main.swift), which after every click, type, or keypress returns a structured EnrichedTraversalDiff: added, removed, modified accessibility nodes with attribute-level deltas, coordinate-only noise filtered out, in_viewport flags on every changed element. That is what makes each individual primitive reliable. This guide is about the layer above: what happens when twenty of those primitives chain together and one of them stalls. Different code, different failure modes, different recovery primitives.
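
The diff's real definition is Swift. As a purely hypothetical TypeScript rendering of the JSON shape the fields above imply (the actual schema in Sources/MCPServer/main.swift may differ in names and structure):

// Hypothetical shape of an EnrichedTraversalDiff payload.
interface AccessibilityNodeChange {
  role: string;                 // e.g. "AXButton"
  title?: string;
  in_viewport: boolean;         // set on every changed element
  attribute_deltas?: Record<string, { before: string; after: string }>;
}

interface EnrichedTraversalDiffShape {
  added: AccessibilityNodeChange[];
  removed: AccessibilityNodeChange[];
  modified: AccessibilityNodeChange[]; // coordinate-only moves filtered as noise
}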