Field notes from one shipping harness

The part of agent harness scaffolding that earns its keep when a tool stalls

Most write-ups on this topic talk about loops, prompts, tools, memory, and context compaction. Useful, but those parts work on the happy path. The part that earns its keep is the failure-recovery layer: what the harness does when a tool stalls, how it converts a hang into a line the model can read, what the diagnostic dump on user interrupt actually contains. This is a walkthrough of that layer in one shipping harness, with exact line numbers and constant values, anchored to a file you can open yourself.

Matthew Diakonov
13 min read

The setup

Fazm is a macOS computer-use agent. The agent itself is Claude, reached over the Agent Client Protocol. The harness is a Node.js process called acp-bridge that gets spawned by the Swift app on first chat, lives for the rest of the session, and acts as the seam between the user-facing chat UI on one side and the agent plus its MCP tool servers on the other. The whole bridge is one file on disk: acp-bridge/src/index.ts, 2772 lines, MIT-licensed, in the public repo at github.com/m13v/fazm.

The bridge wires up five MCP servers by default: the framework's own internal tools (fazm_tools), the Playwright MCP for browser control, a native macOS accessibility binary (mcp-server-macos-use), the WhatsApp Catalyst app controller, and a bundled Python Google Workspace server. A user can append more in ~/.fazm/mcp-servers.json. That is the happy-path scaffolding everyone covers.

The watchdog is what sits underneath all of that. It is also the part nobody writes about, because it doesn't show up in the architecture diagram and it doesn't have a marketing name. It is a Map of timers and one setTimeout callback. That callback is the difference between a harness that runs for an afternoon and one that the user has to restart every twenty minutes.

Three timeout tiers, one classifier

The harness picks a wall-clock budget per tool call. There are three tiers and the rule for which tier applies is six lines long. The values come from real production: 10 seconds is the longest an internal lookup ever legitimately takes; 2 minutes is the upper bound on a Playwright navigation against a slow site; 5 minutes is the longest a model-issued Bash command should run before something is wrong.

  Tier       Budget   Applies to
  internal   10 s     Built-in agent SDK tools whose work is purely local lookups (ToolSearch and similar)
  mcp        120 s    Anything whose tool name starts with the mcp__ prefix (every MCP server, built-in or user-added)
  default    300 s    Everything else: Bash, Read, Edit, Write, the long tail of agent SDK built-ins

The classifier is a single function that tests one boolean and one string prefix. It runs once at the moment the timer is created. The values can be globally overridden by setting the environment variable FAZM_TOOL_TIMEOUT_SECONDS before the bridge starts; the Settings page surfaces this as Tool Timeout.

acp-bridge/src/index.ts
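
A sketch of those constants and the classifier, reconstructed from the values and behavior quoted on this page. The two-argument signature is an assumption (the "one boolean" could be derived differently in the source), and the env override is parsed once at process start in the real file:

  // Constants from the source (acp-bridge/src/index.ts):
  const TOOL_TIMEOUT_INTERNAL_MS = 10_000;   // internal agent SDK lookups
  const TOOL_TIMEOUT_MCP_MS = 120_000;       // any tool named mcp__*
  const TOOL_TIMEOUT_DEFAULT_MS = 300_000;   // Bash, Read, Edit, Write, the rest

  // Sketch of getToolTimeoutMs: the override wins ahead of the tier branch.
  function getToolTimeoutMs(toolName: string, isInternal: boolean): number {
    const overrideSec = Number(process.env.FAZM_TOOL_TIMEOUT_SECONDS);
    if (Number.isFinite(overrideSec) && overrideSec > 0) return overrideSec * 1000;
    if (isInternal) return TOOL_TIMEOUT_INTERNAL_MS;
    if (toolName.startsWith("mcp__")) return TOOL_TIMEOUT_MCP_MS;
    return TOOL_TIMEOUT_DEFAULT_MS;
  }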

What happens when the timer fires

The hard problem is that the harness cannot kill the work. The MCP subprocess holds the actual tool implementation. Maybe Playwright is stuck on a page that never fired DOMContentLoaded. Maybe the macOS accessibility tree is locked because the system showed a security prompt. The harness does not own those failures. It owns one thing: the chat UI is frozen because the tool call is still pending, and the model is waiting for a result it will never receive.

The recovery is a four-step synthesis. The harness pretends a tool result arrived. It removes the tool from the pending-tools list so the loop's bookkeeping matches reality. It emits a tool_activity event with status completed so the spinner stops. It emits a tool_result_display whose body is a literal text result the model can read on its next turn ("Tool X timed out after Ns"), and which embeds the deep link fazm://settings/tool-timeouts so the user can click straight into the override. And it logs the string "Tool completed" with a TIMEOUT body to stderr, because the Swift host process counts running tools by tailing those log lines and decrementing on every "Tool completed". Without the log line, the Swift counter never reaches zero, and the host's own bookkeeping drifts.

The four steps are deliberate. The first three serve the chat. The fourth serves the host. None of them is the "right" thing in the sense that the work actually completed. They are the right thing in the sense that everyone above the harness can now make a decision.

The synthesized event, drawn out

From the model's perspective the message flow looks identical to a tool that took a long time and then returned a short text body. From the user's perspective the chat thread shows one greyed-out tool call and a paragraph of explanation. From the host's perspective the running-tools counter goes from one to zero. All three are reading the same four messages.

Tool stalls, harness recovers, four messages

  Agent (Claude) → Harness:   tool_call(name, args)
  Harness:                    setTimeout(120s)
  MCP tool:                   (no reply, hang)
  Harness → Agent/User:       tool_activity completed
  Harness → Agent/User:       tool_result_display + deep link
  Harness → Swift host:       stderr: Tool completed TIMEOUT
  Agent (Claude):             next turn, model decides

The MCP tool is still running on the right side of that diagram. The harness does not draw the line that kills it. That line never gets drawn. The subprocess is allowed to finish whenever the OS lets it finish; if its eventual reply does arrive, the harness sees a tool_call_update for a toolCallId no longer in activeToolTimers, and silently ignores it.
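
In code, that guard is one membership check. A sketch, with a hypothetical handler name:

  // Stand-in for the bridge's timer map (defined for real further down):
  declare const activeToolTimers: Map<string, unknown>;

  function onToolCallUpdate(update: { toolCallId: string }): void {
    if (!activeToolTimers.has(update.toolCallId)) return; // watchdog already fired; drop the late reply
    // ...normal path: clear the timer, forward the real result
  }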

The literal code

The above prose corresponds to one setTimeout callback. The sketch below follows its shape closely, with comments added for readability. Variable names match the source.

acp-bridge/src/index.ts (synthesized completion)
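
A sketch, assuming emit() and pendingTools as stand-ins for the bridge's real plumbing:

  // Stand-ins for state and event channel the real file owns (assumptions):
  declare const pendingTools: Map<string, unknown>;
  declare function emit(event: string, payload: unknown): void;

  const activeToolTimers = new Map<string, ReturnType<typeof setTimeout>>();

  function startToolTimer(toolCallId: string, toolName: string, timeoutMs: number): void {
    const timer = setTimeout(() => {
      activeToolTimers.delete(toolCallId);   // a late real reply is now silently ignored
      pendingTools.delete(toolCallId);       // 1. loop bookkeeping matches reality
      emit("tool_activity", { toolCallId, status: "completed" }); // 2. spinner stops
      emit("tool_result_display", {          // 3. a result the model can read next turn
        toolCallId,
        text: `Tool "${toolName}" timed out after ${timeoutMs / 1000}s. ` +
          `Adjust timeout: fazm://settings/tool-timeouts`,
      });
      // 4. the Swift host counts running tools by tailing these stderr lines
      console.error(
        `Tool completed: ${toolName} (id=${toolCallId}) output=TIMEOUT after ${timeoutMs / 1000}s`
      );
    }, timeoutMs);
    activeToolTimers.set(toolCallId, timer);
  }

  // Called when the bridge unloads; with startToolTimer, this is the whole watchdog.
  function clearAllToolTimers(): void {
    for (const timer of activeToolTimers.values()) clearTimeout(timer);
    activeToolTimers.clear();
  }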

The function this lives in is called startToolTimer. Its only job is to register the callback and stash a reference in activeToolTimers so a real completion can cancel it before it fires. There is a companion clearAllToolTimers that is called when the bridge unloads, and that is everything the watchdog is. Two functions, one Map, one setTimeout.

What a naive harness does, and why we don't

Most reference implementations of an agent loop you find online have no per-tool wall clock at all. They look like this: build the tool list, send the prompt, await the model's tool calls, await each tool, feed the results back, repeat. That is a fine teaching shape and a terrible production shape, because the third await is unbounded. One bad tool freezes the loop forever.

Two ways to handle a stuck tool

The naive loop, written out:

  while (model.wantsTools()) {
    const results = [];
    for (const call of model.toolCalls()) {
      results.push(await tool.run(call));   // unbounded: a hung tool stops everything
    }
    model.feed(results);
  }

Every await is unbounded, one stuck tool freezes the agent forever, and the only recovery is killing the process. The user sees a spinner with no exit.

  • No wall-clock budget on tool execution
  • Hang propagates to the chat UI as a permanent spinner
  • Recovery requires restarting the bridge or the whole app
  • User has no way to read or change the timeout
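
The smallest bounded version of that loop races each tool await against its budget. Fazm's watchdog goes further (its timer fires events independently of the await, so the loop survives even if the promise never settles), but the race shows the boundary being enforced; names here are hypothetical:

  // Resolve with a readable failure string if the tool outlasts its budget.
  function withTimeout<T>(work: Promise<T>, ms: number, name: string): Promise<T | string> {
    return Promise.race([
      work,
      new Promise<string>((resolve) =>
        setTimeout(() => resolve(`Tool "${name}" timed out after ${ms / 1000}s`), ms)
      ),
    ]);
  }

  // Inside the loop:
  // results.push(await withTimeout(tool.run(call), getToolTimeoutMs(call.name, false), call.name));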

The naive shape is what most "build your own agent" tutorials end on. The watchdog shape is what survives the first week of running a harness against real users on real machines with real Wi-Fi.

The diagnostic dump on user interrupt

The watchdog handles wall-clock-style hangs. The other failure mode is the user pressing stop because the agent is doing the wrong thing, not because anything is technically wrong. The harness handles those too, but the visible artifact is different: a single log line per stuck tool, dumped at the moment the interrupt arrives.

The data structure is a second Map called inFlightTools, keyed by toolCallId, with one entry per tool the bridge has seen start but not finish. Each entry holds the tool's title, kind, sessionId, sessionKey, startedAt timestamp, last status, and a JSON-stringified copy of the input the model passed in. When the user hits stop, the function logStuckToolsOnInterrupt walks the map and writes one line per still-running tool.

Concretely, the line looks like this: Tool STUCK on user interrupt (key=onboarding): Bash (id=t_42, kind=execute, session=onboarding, elapsed=18.4s, lastStatus=in_progress) [command=npm install --legacy-peer-deps, description=install deps]. The summarizeToolInput helper truncates Bash commands to 300 characters and Edit old_strings to 80, so the log stays grep-friendly without leaking entire file contents into the user's machine logs.
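
A sketch of the dump, reconstructed from the fields and line format above. The entry type and the summarizer's field handling are simplified stand-ins; the truncation limits match the source:

  // One entry per tool seen started but not finished (fields from the source).
  interface InFlightTool {
    title: string;
    kind: string;
    sessionId: string;
    sessionKey: string;
    startedAt: number;    // epoch ms
    lastStatus: string;
    inputJson: string;    // JSON-stringified model input
  }

  const inFlightTools = new Map<string, InFlightTool>();

  function logStuckToolsOnInterrupt(): void {
    for (const [id, t] of inFlightTools) {
      const elapsed = ((Date.now() - t.startedAt) / 1000).toFixed(1);
      console.error(
        `Tool STUCK on user interrupt (key=${t.sessionKey}): ${t.title} ` +
          `(id=${id}, kind=${t.kind}, session=${t.sessionId}, ` +
          `elapsed=${elapsed}s, lastStatus=${t.lastStatus}) [${summarizeToolInput(t.inputJson)}]`
      );
    }
  }

  // Simplified summarizeToolInput: 300 chars for Bash commands, 80 for Edit old_strings.
  function summarizeToolInput(inputJson: string): string {
    const input = JSON.parse(inputJson) as Record<string, string>;
    const parts: string[] = [];
    if (input.command) parts.push(`command=${input.command.slice(0, 300)}`);
    if (input.old_string) parts.push(`old_string=${input.old_string.slice(0, 80)}`);
    if (input.description) parts.push(`description=${input.description}`);
    return parts.join(", ");
  }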

The reason this exists is that "user pressed stop while a tool was mid-flight" is the most common single source of frustrating bug reports against an agent. Without the diagnostic line, the support conversation is "the agent froze when I pressed stop." With the diagnostic line, it is "the agent was 18.4 seconds into a Bash command that ran npm install --legacy-peer-deps when you stopped it," which is a different conversation.

Why these specific values

10 seconds for internal lookups is generous. ToolSearch and the other internal handlers are local, in-process, with no network dependency. If they take more than a few seconds, something in the harness is wrong, not slow. 10 seconds is a sanity check, not a budget.

120 seconds for MCP tools is the tightest interesting number. It covers the long tail of legitimate browser navigations, the heavier macOS accessibility traversals on cold app launches, and the occasional Google Workspace API call against a slow OAuth flow. Anything past two minutes is almost always a hang, not slow work. The two-minute mark is also short enough that the user has not yet given up on the agent.

300 seconds (5 minutes) for everything else is mostly there to keep Bash commands from becoming infinite. Anything legitimately longer than 5 minutes is something the user explicitly asked for (a long build, a deep research script, a video render) and they should either set FAZM_TOOL_TIMEOUT_SECONDS globally or break the work into smaller steps. Five minutes is the wall the model has to learn to live within.

All three are user-overridable through the same env var and the same settings panel. There is no per-tool override; the design choice there was that a per-tool table grows complexity faster than it adds value, and a single global override covers every observed case.

What this means for anyone scaffolding their own

The list is short. If you are building a harness from scratch, the failure-recovery layer is the part you will be tempted to skip first and regret first. Three things are worth taking from this file as a starting template, regardless of which agent SDK and which transport you are on:

  1. Have a tier system, not one global timeout. The shape that holds up is something close to internal vs MCP vs everything-else. The exact numbers can be tuned. The tiering is what makes the system feel responsive instead of either trigger-happy or sluggish.
  2. Synthesize a completion, do not try to kill the worker. You usually cannot. Pretending the tool returned a short text body keeps the loop alive and lets the model decide what to do, which is almost always smarter than what the harness would have decided.
  3. Embed a deep link into the failure message. The user is staring at the chat. The setting they need to change is two clicks away. A literal URL the chat UI can render as a button is the cheapest way to bridge the gap between "I see the error" and "I changed the timeout."

None of this is novel research. It is an inventory of the small decisions that, in aggregate, are the difference between an agent harness people complain about and one they keep open during the workday. The whole watchdog and diagnostic layer in Fazm's bridge is somewhere around 200 lines. That is a small piece of the file that does the largest amount of work in keeping the product usable.

Want to see the watchdog do its thing on your own machine?

20 minutes, screen-share. Bring a workflow that has historically frozen your other agents and we'll run it through Fazm together.

Questions, answered specifically

What is an agent harness, in one paragraph, before this page goes deep on one corner of it?

Everything around the model that turns a stateless next-token predictor into something that can pick up tasks, call tools, fail safely, and resume. The system prompt, the loop that re-asks the model after each tool result, the registry of tools and MCP servers, the context manager, the memory store, the logger, the permission layer, the cancellation path, the timeout watchdog. Agent equals model plus harness. The model is bought, the harness is built, and almost all of the differentiation between two products on the same model lives in the harness.

Why does a harness need a per-tool timeout watchdog at all? Doesn't the model decide when to stop?

The model decides when to stop calling tools. It cannot decide when a tool that is already running has died. A browser MCP server can hang on a navigation, a native binary can deadlock waiting for an OS permission prompt that the user dismissed, a remote MCP can lose its socket and stop replying. From inside the agent loop, none of those look different from a tool that just hasn't returned yet. Without a wall-clock watchdog, the loop sits there forever, the UI spinner never stops, and the only way out is to kill the whole bridge process. The watchdog is what converts a hang into a recoverable error the model can read.

What are the exact timeout values used in Fazm's harness, and where are they?

Three constants at the top of acp-bridge/src/index.ts. TOOL_TIMEOUT_INTERNAL_MS is 10000 milliseconds, used for internal tools like ToolSearch where a slow response is almost certainly a bug. TOOL_TIMEOUT_MCP_MS is 120000 milliseconds, used for any tool whose name starts with the prefix mcp__ (Playwright, macos-use, whatsapp, google-workspace, and any user-installed MCP). TOOL_TIMEOUT_DEFAULT_MS is 300000 milliseconds, used for everything else (Bash, Read, Edit, Write, the long tail of built-ins). The classifier is the function getToolTimeoutMs and it is six lines long.

What does the harness do when one of those timers fires? Does it kill the tool process?

No. It cannot reach into the MCP subprocess and stop the work. What it does instead is synthesize a completion event so the model and the UI both move on. Concretely: it removes the tool from the pending-tools list, emits a tool_activity message with status "completed", emits a tool_result_display whose body is the string 'Tool "X" timed out after Ns. Adjust timeout: fazm://settings/tool-timeouts', and writes a stderr line 'Tool completed: X (id=...) output=TIMEOUT after Ns' so the Swift side decrements its acpToolsRunning counter. The tool subprocess may still be running in the background. The model now sees a tool result it can reason about.

What is the in-flight tool diagnostic dump, and when does it run?

It runs when the user hits the stop button. The harness keeps a Map called inFlightTools, keyed by toolCallId. Each entry stores the tool's title, kind, sessionId, sessionKey, startedAt timestamp, last status, and a JSON-stringified copy of the input the model passed to the tool. When an interrupt arrives, the function logStuckToolsOnInterrupt walks the map and writes one stderr line per stuck tool: 'Tool STUCK on user interrupt (key=...): TOOL_NAME (id=..., kind=..., session=..., elapsed=N.Ns, lastStatus=...) [command=..., description=...]'. The summarizer truncates Bash commands to 300 chars and Edit old_strings to 80 chars, so the log is grep-friendly without leaking entire file contents.

Can a user override these timeouts? How?

Yes. There is a single environment variable, FAZM_TOOL_TIMEOUT_SECONDS, parsed once at process start. If set to a positive integer, it replaces all three tiers (the override is multiplied by 1000 and returned ahead of the tier branch in getToolTimeoutMs, so internal, MCP, and default tools all use the same value). The Settings page exposes this via Settings, Advanced, Tool Timeout, which is also reachable through the deep link fazm://settings/tool-timeouts that the synthetic completion event embeds. The reason an override exists at all is that long-running tasks (deep research scripts, video edits) legitimately need more than 5 minutes and shouldn't have to fight the watchdog.

How does this fit into ACP, the wire protocol the harness speaks?

ACP, the Agent Client Protocol, is the JSON-RPC layer between the harness and the Claude Code agent. The harness sends session/new, session/prompt, and session/cancel, and receives streamed notifications including tool_call, tool_call_update, and message. The watchdog and the synthetic completion event sit one layer above ACP: when a tool times out, the harness does not lie to the agent over ACP, it just pretends the tool result arrived from the user side. The agent reads it like any other tool result and decides what to do next. ACP itself has no concept of a wall-clock per-tool budget; that is a harness-side decision, which is exactly the point of harness scaffolding.

Is this code I can read myself, or does the page just describe it?

It is open source. The full file is in github.com/m13v/fazm under acp-bridge/src/index.ts. The watchdog block runs from roughly line 72 to line 163. The in-flight diagnostic block runs from roughly line 165 to line 254. The MCP server registration that determines which tools fall under the mcp__ prefix is the function buildMcpServers, starting around line 992. Every line number, constant value, and string in this guide can be checked by opening that file. The repo is MIT-licensed.

Is the watchdog itself the differentiator? Lots of harnesses must do something like this.

The watchdog is table stakes once you operate a harness in production for more than a week. The differentiator is the choice of three tiers (most public examples either use one global timeout or none), the synthetic completion (versus killing the bridge or showing a forever-spinner), the diagnostic dump on user interrupt (versus the standard 'tool was cancelled' line), and the deep link back to the user-facing setting in the failure message. Each of those is a small piece. Together they are why the harness can run for hours without the user ever needing to restart the app.

How is this different from how Claude Code or Codex or Cursor handle stuck tools?

Claude Code, Codex, and Cursor are all coding-tool harnesses that primarily run in a developer terminal. Their assumption is that you, a developer, are the one watching the screen and you can ctrl-C or kill the process when something stalls. Fazm's harness runs inside a consumer macOS app where the user is not in a terminal and cannot read stderr. That changes the failure model. The synthetic-completion approach exists because the only feedback channel back to the user is the chat UI; if a tool hangs and never completes, the chat sits frozen with a spinner the user has no way to clear. The watchdog converts the freeze into a recoverable line in the chat thread.