SOURCE: MCP-SERVER-MACOS-USE / MAIN.SWIFT

Computer use agent reliability is a verification loop, not a benchmark score

The quiet difference between a reliable desktop agent and an unreliable one is not the model, not the prompt, and not the vision stack. It is whether the agent gets a structured diff back after every primitive action, so it can plan the next step against ground truth. This guide walks through the loop Fazm ships for Mac (capture the accessibility tree, act, capture again, diff, return the delta): the exact Swift structs, the filter that removes coordinate-only noise, and why that matters for every click the agent makes.

Matthew Diakonov
10 min read
Read from the Fazm source tree, not from a whitepaper:

  • Swift-level struct names
  • Line numbers you can verify
  • Noise filter explained
  • Sheet + viewport handling
  • Open source MCP server
The loop at a glance: traversalBefore → primary action → traversalAfter → computeDiff() → isScrollBarNoise() → isStructuralNoise() → coordinateAttrs filter → in_viewport enrichment → AXSheet detection → summary: N added, M removed, K modified.

What every other guide on this gets wrong

I read the pages that currently answer this topic. The research papers (the arxiv 2604.17849 reliability paper, the UI-CUBE operational reliability benchmark, Microsoft Research's verifiers piece) frame reliability as stochasticity plus task ambiguity plus behavioral variability, then propose clarification of task instructions and LLM-based user simulators as the fix. The industry posts frame it as a latency contest: accessibility APIs are faster than screenshots, which are faster than vision-only. Both framings are useful, and both miss the mechanic that actually moves the needle for a consumer-facing desktop agent.

The mechanic is simple, and it lives one level below benchmarks or latency. After every single primitive action (one click, one keypress, one scroll, one typed string), the agent needs to know: did the screen change, and if so, how. Not a picture of the new screen. A structured answer: these elements appeared, these elements disappeared, these elements changed these attributes. Without that, every subsequent decision is a guess.

Fazm ships this as a first-class tool response. Every `_and_traverse` call in the bundled macos-use MCP server returns an `EnrichedTraversalDiff`. The rest of this guide shows you that struct, the code that builds it, the filter that keeps the signal clean, and why the agent can chain twenty actions in a row without losing the plot.

The reliability loop, one action at a time

Agent plan → macos-use → Mac app under control → next agent step, with the previous diff feeding each new plan.
  • All `_and_traverse` tools emit a diff
  • 5 s AX messaging timeout per node
  • 4 noise filters applied before emit
  • 0 screenshots required to know what changed

The shape the agent actually sees

Four Codable structs. One hundred and thirty lines total.

These four structs are the entire contract between the macos-use Swift binary and the Claude agent driving it. Any attribute the model reasons about after an action comes through one of these fields. There is no hidden state, no second pass, no vision model second-guessing.

mcp-server-macos-use/Sources/MCPServer/main.swift
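As a rough sketch of that contract: the struct names `EnrichedTraversalDiff` and `DiffElementData` and the `in_viewport` key are from the source; every other field name below is an illustrative assumption, not necessarily what main.swift declares.

```swift
import Foundation

// Hedged sketch of the diff contract. Only EnrichedTraversalDiff,
// DiffElementData, and the in_viewport key are confirmed names; the
// remaining fields are illustrative.

struct AttributeChangeData: Codable {
    let name: String        // e.g. "AXValue", "AXFocused"
    let oldValue: String?
    let newValue: String?
}

struct DiffElementData: Codable {
    let role: String        // e.g. "AXButton"
    let text: String?       // resolved title or value text
    let inViewport: Bool?   // reachability flag described in the article

    enum CodingKeys: String, CodingKey {
        case role, text
        case inViewport = "in_viewport"
    }
}

struct ModifiedElementData: Codable {
    let element: DiffElementData
    let changes: [AttributeChangeData]  // semantic deltas that survived filtering
}

struct EnrichedTraversalDiff: Codable {
    let added: [DiffElementData]
    let removed: [DiffElementData]
    let modified: [ModifiedElementData]

    // The "N added, M removed, K modified" line the agent branches on.
    var summary: String {
        "\(added.count) added, \(removed.count) removed, \(modified.count) modified"
    }
}
```

The important property is that everything here is `Codable`: the diff travels as plain JSON over stdio, so nothing about the contract depends on which MCP client is on the other end.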

Anchor fact: coordinate-only changes get dropped on line 681

This is the part nobody else mentions. An accessibility tree diff that did not filter coordinate-only deltas would be useless, because real apps reflow on every action. A button moves two pixels when a tooltip renders. A table row repositions when content loads below it. If the agent saw all of that as "modified", its working set would balloon to dozens of false positives per action.

Fazm's macos-use binary walks every modified element, strips out attribute changes whose names are in the coordinate set, and skips the element entirely if nothing meaningful remains. The result: the modified list contains only semantic changes (value, text, enabled state, focus, title) that a reasoning model should actually care about.

mcp-server-macos-use/Sources/MCPServer/main.swift
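A minimal sketch of that filter, assuming a simplified attribute-change type: the set name `coordinateAttrs` comes from the article, but the surrounding types are illustrative, not the exact ones on line 681 of main.swift.

```swift
// Sketch of the coordinate-only filter. `coordinateAttrs` is named in
// the article; AttributeChange is an illustrative stand-in.

let coordinateAttrs: Set<String> = ["x", "y", "width", "height"]

struct AttributeChange {
    let name: String
    let oldValue: String
    let newValue: String
}

/// Strip spatial deltas; return nil when nothing semantic remains,
/// so the caller can skip the element entirely.
func filterCoordinateNoise(_ changes: [AttributeChange]) -> [AttributeChange]? {
    let meaningful = changes.filter { !coordinateAttrs.contains($0.name) }
    return meaningful.isEmpty ? nil : meaningful
}
```

A button that only moved (x/y deltas) yields nil and drops out of the modified list; a value change on the same element survives alongside it.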

Watch it compose an email

This is a real transcript of two consecutive tool calls against Mail.app. Every line after the command is returned by the macos-use binary to the agent. The second tool call plans itself from the diff.added list produced by the first.

tool transcript
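To make the shape concrete, here is a hedged illustration of decoding a response like the one the first tool call returns. The JSON keys are assumptions about the wire format, and the values are invented for Mail's compose window, not a captured payload.

```swift
import Foundation

// Illustrative only: keys and values are assumptions, not a captured
// Fazm wire payload.
struct Element: Codable {
    let role: String
    let text: String?
    let inViewport: Bool?
    enum CodingKeys: String, CodingKey {
        case role, text
        case inViewport = "in_viewport"
    }
}

struct Diff: Codable {
    let added: [Element]
    let removed: [Element]
    let summary: String
}

let payload = """
{
  "added": [
    {"role": "AXWindow",    "text": "New Message", "in_viewport": true},
    {"role": "AXTextField", "text": "To:",         "in_viewport": true},
    {"role": "AXTextField", "text": "Subject:",    "in_viewport": true}
  ],
  "removed": [],
  "summary": "3 added, 0 removed, 0 modified"
}
""".data(using: .utf8)!

let diff = try! JSONDecoder().decode(Diff.self, from: payload)

// The second tool call plans itself from diff.added: click the first
// reachable text field.
let nextTarget = diff.added.first { $0.role == "AXTextField" && $0.inViewport == true }
```

This is the whole trick: the model never re-reads the screen to find the compose fields; it filters `diff.added` from the previous response.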

One action across four processes

Every click_and_traverse call you see above actually crosses four process boundaries. The Claude agent running inside Fazm speaks MCP JSON-RPC to acp-bridge (the TypeScript process launched per chat). acp-bridge forwards it to mcp-server-macos-use over stdio. The Swift binary calls into the macOS accessibility API, which is the one that actually reads and mutates the target app.

tool_call lifecycle

Agent → acp-bridge: tool_call click_and_traverse
acp-bridge → macos-use: spawn, stdin JSON-RPC
macos-use → macOS AX API: traverse tree (before), AXUIElement snapshot
macos-use → macOS AX API: AXPress / CGEvent at (x, y)
macOS AX API → macos-use: action result + new AX tree
macos-use: diff added / removed / modified
macos-use → acp-bridge: EnrichedTraversalDiff JSON
acp-bridge → Agent: summary + visible_elements

Four layers of noise removal

Before the diff reaches the agent, it passes through four filters in the Swift binary. Each one exists because of a specific class of false positive I saw while building the thing. They are not hypothetical; they are the difference between a clean "a dialog opened with three fields" and a 40-element soup.

1. isScrollBarNoise

Drops added/removed/modified entries whose role is AXScrollBar or AXValueIndicator. Scroll thumbs and indicators move on every frame; they are not useful signal.

Applied to both added and removed lists before any text resolution.

2. coordinateAttrs filter

Strips attribute changes named x, y, width, or height from every modified element, then discards any element whose remaining change set is empty. This is the one on line 681.

Keeps AXValue, AXEnabled, AXFocused, AXTitle, and text deltas. Drops pure reflow.

3. isStructuralNoise

Drops roles that are pure containers with no text: AXRow, AXCell, AXColumn, AXMenu. These exist as tree structure, not as interactive surface.

An empty outline row reports 'added' during content load even when no visible information changed.

4. AXSheet-aware viewport

If a modal sheet (file dialog, save dialog) overlays the window, findSheetBounds returns the sheet frame. The in_viewport flag is computed against that frame so the agent sees the sheet's controls instead of the whole window's.

main.swift lines 241 to 278. Prevents the agent from reasoning about hidden controls behind a modal.
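Filters 1, 3, and 4 can be sketched in a few lines. The function names match the article; the exact role sets are a guess, and `Point`/`Rect` are simplified stand-ins for the `CGPoint`/`CGRect` the server actually uses.

```swift
// Sketch of filters 1, 3, and 4. Function names are from the article;
// role sets and geometry types are simplified stand-ins.

let scrollBarRoles: Set<String> = ["AXScrollBar", "AXValueIndicator"]
let structuralRoles: Set<String> = ["AXRow", "AXCell", "AXColumn", "AXMenu"]

func isScrollBarNoise(role: String) -> Bool {
    scrollBarRoles.contains(role)
}

// Pure containers with no text are tree structure, not interactive surface.
func isStructuralNoise(role: String, text: String?) -> Bool {
    structuralRoles.contains(role) && (text?.isEmpty ?? true)
}

struct Point { var x: Double; var y: Double }
struct Rect {
    var x: Double, y: Double, width: Double, height: Double
    func contains(_ p: Point) -> Bool {
        p.x >= x && p.x < x + width && p.y >= y && p.y < y + height
    }
}

// Filter 4: when a sheet overlays the window, the sheet frame becomes
// the effective viewport for the in_viewport flag.
func inViewport(origin: Point, windowFrame: Rect, sheetFrame: Rect?) -> Bool {
    (sheetFrame ?? windowFrame).contains(origin)
}
```

Note the ordering consequence: because the role filters run before text resolution, a noisy scroll thumb never costs an attribute read at all.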

Where this matches and diverges from the academic framing

The arxiv paper on this topic decomposes unreliability into three sources: stochasticity in execution, ambiguity in task specification, and variability in agent strategy. The diff loop does not solve any of those directly. What it does is collapse the window in which those three sources can compound. An agent that verifies after every primitive cannot drift for more than one action before catching itself.

Feature | Screenshot + vision loop | Fazm (AX diff loop)
Observation surface | Rendered pixels | Structured accessibility tree
What changed after an action | New image; agent re-infers the delta | Added / removed / modified list with attribute deltas
Cost per action verification | Image encode + vision inference, seconds | One AX tree walk, hundreds of ms
Coordinate noise handling | Implicit; any pixel shift risks re-OCR | Explicit filter at main.swift line 681
Reachable element signal | Off-screen vs on-screen must be inferred | in_viewport flag per diff element
Dark mode / DPI / theme robustness | Can shift OCR accuracy | Unaffected (semantic, not visual)
Fallback when the approach fails | Already at the fallback | capture_screenshot for Canvas/games/Electron

Why this is bundled, not pip-installable

The reliability loop only works if the OS grants the process accessibility permission. That is a TCC prompt with a reboot-prone failure mode: macOS caches the grant, and after a crash or an app rename the cache can go stale, so the grant exists in System Settings but fails at runtime. Fazm's Desktop/Sources/AppState.swift (lines 431 to 504) wraps this with a retry loop: probe with a real AXUIElement call, detect the stuck state, surface a reset prompt. None of that is in the library; it is in the app. That is why the MCP server is shipped bundled inside a signed .app rather than installed via pip.

AXUIElementCreateApplication

The probe. A real call against the frontmost app. If it returns .apiDisabled the permission is not granted. If it returns .cannotComplete the grant is stuck and a restart is needed.

accessibilityRetryInterval = 5.0

Re-checks the probe every five seconds. Defined in AppState.swift. Prevents the app from hammering TCC on every frame.

maxAccessibilityRetries = 3

After three failed probes Fazm shows a reset prompt. The agent never silently runs against a broken permission state.
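The retry policy separates cleanly from the macOS-specific probe. Here is a sketch with the probe abstracted as a closure; the real code calls AXUIElementCreateApplication and inspects the AXError, while the type and method names below are invented for illustration.

```swift
// Sketch of the probe-and-retry policy. ProbeResult mirrors the AXError
// cases the article names; AccessibilityGate is an invented type, and
// the constant mirrors maxAccessibilityRetries = 3 from AppState.swift.

enum ProbeResult { case granted, apiDisabled, cannotComplete }

struct AccessibilityGate {
    let maxRetries = 3
    var failures = 0

    /// Returns true when the grant works. After `maxRetries` failed
    /// probes, the caller surfaces a reset prompt instead of silently
    /// retrying against a broken permission state.
    mutating func check(probe: () -> ProbeResult) -> Bool {
        switch probe() {
        case .granted:
            failures = 0
            return true
        case .apiDisabled, .cannotComplete:
            failures += 1
            return false
        }
    }

    var shouldPromptReset: Bool { failures >= maxRetries }
}
```

The closure boundary is the point: everything above it is portable policy, everything below it is the TCC-specific probe that only makes sense inside a signed .app.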

capture_screenshot fallback

When the accessibility tree is opaque (Canvas, games, some Electron apps), the agent is prompted to fall back to capture_screenshot. ChatPrompts.swift line 66.

Bundled binary at Contents/MacOS

mcp-server-macos-use ships inside the signed Fazm .app. No user-level pip install, no separate code signing, no runtime download.

ACP bridge (v0.29.2)

acp-bridge speaks Agent Client Protocol to Claude Code. The diff loop is transparent: the agent thinks it is calling a regular MCP tool.

If you are building your own, steal this checklist

The filters and signals below are the ones I would port first if I were implementing this on a different OS or a different observation surface. Every one of them was added because it removed a class of false positives that was fooling the agent in real sessions.

what goes in the diff contract

  • Capture the observation state BEFORE the action, then AFTER. Never just after.
  • Emit a structured delta (added, removed, modified), not a full replay of the new state.
  • Drop changes whose only attribute deltas are spatial (x/y/w/h). That is reflow, not information.
  • Flag every changed element with a reachability boolean (in_viewport, on_screen, enabled).
  • Special-case modal overlays (sheets, dialogs). Their frame is the effective viewport when they are present.
  • Drop pure-container role deltas (empty rows, cells, columns) that only exist as tree structure.
  • Keep attribute-level changes (oldValue -> newValue) instead of just 'this element is different'.
  • Summarize the diff as 'N added, M removed, K modified' so the agent can branch cheaply on the scale of the change.
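The checklist above collapses into a small core diff. A sketch under simplifying assumptions: elements are keyed by a stable id and attributes are flattened to strings, and while the article mentions a computeDiff step, the `Snapshot` type and this exact signature are illustrative.

```swift
// Minimal before/after diff following the checklist. Simplifications:
// stable string ids, string attribute values, and attributes deleted
// from an element are not reported (only added/changed keys are).

struct Snapshot {
    // element id -> semantic attributes (role, value, title, ...)
    let elements: [String: [String: String]]
}

struct DiffResult {
    var added: [String] = []
    var removed: [String] = []
    var modified: [String: [(attr: String, old: String, new: String)]] = [:]
    var summary: String {
        "\(added.count) added, \(removed.count) removed, \(modified.count) modified"
    }
}

func computeDiff(before: Snapshot, after: Snapshot) -> DiffResult {
    var result = DiffResult()
    result.added = after.elements.keys.filter { before.elements[$0] == nil }.sorted()
    result.removed = before.elements.keys.filter { after.elements[$0] == nil }.sorted()
    for (id, old) in before.elements {
        guard let new = after.elements[id] else { continue }
        var changes: [(attr: String, old: String, new: String)] = []
        // Keep attribute-level old -> new pairs, not just "this changed".
        for (key, newValue) in new where old[key] != newValue {
            changes.append((attr: key, old: old[key] ?? "", new: newValue))
        }
        if !changes.isEmpty { result.modified[id] = changes }
    }
    return result
}
```

The noise filters from earlier slot in between the change collection and the emit: drop coordinate-only change sets, drop scroll-bar and structural roles, then compute in_viewport on whatever survives.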

The one-sentence version

A computer use agent is reliable when every primitive action returns a cheap, structured answer to "what changed" before the next action starts.

Fazm gets that answer from the macOS accessibility tree, twice per action, diffed through four noise filters, tagged with viewport reachability, and emitted as an `EnrichedTraversalDiff` that the Claude agent driving it can read in the same tool response. The entire thing is about 1,900 lines of Swift inside a single Sources/MCPServer/main.swift, open source, bundled at Contents/MacOS inside the signed .app.

4 filters

Coordinate-only noise, scroll-bar thrash, empty-row structure, and off-window children all get stripped before the agent sees the diff. Reliability is mostly in what you refuse to report.

mcp-server-macos-use, Sources/MCPServer/main.swift lines 600 to 718

Want to see the diff loop driving your apps?

Fifteen minutes. We open Fazm, point it at Mail or Calendar, and you watch the traversal diff come back in real time.

Book a call

Frequently asked questions

What actually makes a computer use agent reliable in practice?

Not a benchmark score. Reliability is the agent's ability to know whether its last action worked before it plans the next one. Benchmarks measure the end state after a long chain of actions; an unreliable agent can still get lucky on a benchmark, and a reliable one can still fail a benchmark because of a single mis-typed selector. The mechanic that matters is a verification step after every primitive: after clicking, the agent needs to know which elements changed on screen. In Fazm, that verification is the EnrichedTraversalDiff returned by every `_and_traverse` tool in mcp-server-macos-use. Three added, two removed, one modified. Structured. Cheap to parse. Independent of any visual model.

Why is 'act, re-read, diff' more reliable than just taking a screenshot after the action?

A screenshot is a flat image; the agent then has to re-run OCR or vision to figure out what changed, which is slow, expensive, and itself unreliable. A structured diff is the set of accessibility nodes that changed, keyed by role, text, and frame. In Fazm's implementation (mcp-server-macos-use main.swift lines 648 to 718), the server captures the accessibility tree before the action, runs the action, captures the tree again, diffs them, and returns only the meaningful changes: added nodes, removed nodes, and modified nodes with their attribute-level deltas. Coordinate-only changes get filtered out on line 681 so a button that just repainted one pixel over does not look like a real change to the agent.

How does the diff avoid confusing the agent with noise?

Three filters, all in mcp-server-macos-use. First, `isScrollBarNoise` drops scroll-bar-only deltas (the scroll thumb moves every frame). Second, `isStructuralNoise` drops empty outline rows, cells, and columns that have no text. Third, the coordinate-only filter on line 681 drops any modified element whose only change was x/y/width/height with the same text and role; if an element just moved, the agent does not need to reason about it. What survives is the small set of genuinely meaningful changes: a new dialog opened, a text field got a value, a button title changed from 'Send' to 'Sending'. That is exactly what the model needs to plan its next step.

Why does Fazm use real accessibility APIs instead of screenshots for its primary read path?

Two reasons. Latency: AXUIElementCopyAttributeValue returns in milliseconds per node, while a screenshot round-trip through a vision model is one to several seconds. Fidelity: the accessibility tree reports role, value, focused state, AXRoleDescription, frame, and hierarchy directly from the app. A screenshot only reports pixels. Dark mode, high-DPI scaling, theme swaps, and font rendering all change the pixels without changing the semantics. The accessibility tree survives all of those. For the rare case where a visual assertion is genuinely required (a PDF figure, a Figma canvas), Fazm still captures screenshots; it just does not use them as the default observation surface.

Which apps break this approach, and what does Fazm do about them?

Any app that ignores accessibility. Games, some Electron apps, Canvas-rendered web content. For those, the accessibility tree reports a single opaque AXWindow with no meaningful children, and the diff after an action will look empty. Fazm's fallback is the screenshot path through `capture_screenshot` (modes: 'screen' or 'window'), defined separately from the macos-use tool family. The agent is prompted to prefer macos-use tools for structured apps (Finder, Calendar, Mail, Messages, Notes, Settings, and the WhatsApp Catalyst app, which has its own dedicated MCP server) and to fall back to screenshots only when the tree is bare. This is in Desktop/Sources/Chat/ChatPrompts.swift around line 66 to 72.

What is an AXSheet and why does the diff loop care about it?

An AXSheet is macOS's accessibility role for modal sheets: file open dialogs, Save As dialogs, permission prompts. They overlay the main window and can trap focus. If the agent clicked something that spawned a sheet, the naive diff would report dozens of added elements (every control inside the sheet). Fazm's `findSheetBounds` (main.swift lines 241 to 278) detects AXSheet children of any window and returns their frame. When a sheet is present, the viewport used to compute `in_viewport` for each diff element is the sheet's frame, not the window's. The agent gets a clean 'this sheet opened with these controls inside it' summary instead of 'forty-seven new elements appeared somewhere'.

What does 'in_viewport' mean in the diff, and why is it load-bearing for reliability?

Every DiffElementData has an optional `in_viewport: Bool?` field (main.swift line 191). It is true if the element's top-left coordinate falls inside any window of the target app, false otherwise. Off-screen nodes are legal in the accessibility tree — apps often keep hidden tabs, off-screen menu items, and lazy rows for performance — but they cannot be clicked until the viewport changes. By telling the agent `in_viewport: true/false` on every changed node, the diff gives the model the exact information it needs to decide whether an element is reachable from the current screen. That alone cuts a large class of 'clicked an invisible element and got silently no-op'd' failure modes.
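Under the semantics described in this answer, the flag reduces to a point-in-any-window test. A sketch, with `Point` and `Rect` as simplified stand-ins for the CGPoint/CGRect the server actually uses:

```swift
// in_viewport per this FAQ: true when the element's top-left corner
// falls inside any window frame of the target app. Simplified geometry
// types stand in for CGPoint/CGRect.

struct Point { var x: Double; var y: Double }
struct Rect {
    var x: Double, y: Double, width: Double, height: Double
    func contains(_ p: Point) -> Bool {
        p.x >= x && p.x < x + width && p.y >= y && p.y < y + height
    }
}

func computeInViewport(topLeft: Point, windowFrames: [Rect]) -> Bool {
    windowFrames.contains { $0.contains(topLeft) }
}
```
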

How does this compare to the screenshot-plus-vision approach used by browser-only computer use agents?

A screenshot-plus-vision loop (Claude Computer Use on OS-level screenshots, OpenAI CUA, Gemini browser agents) has three reliability costs the tree-diff loop does not pay. First, vision latency per action (image encode, upload, decode, reason). Second, OCR fragility: rendered text can be missed, especially at custom DPI or with non-standard fonts. Third, no attribute-level deltas: the agent sees the final pixels, not the specific attributes that changed. Tree-diff gives attribute-level changes for free; it is the difference between being told 'the value of this text field went from empty to hello' and being told 'here is a picture of the screen, figure out what happened'.

Is this approach Fazm-specific, or can I use it outside the app?

The MCP server is open source: github.com/mediar-ai/mcp-server-macos-use, a single ~1900-line Swift binary in Sources/MCPServer/main.swift. You can run it against Claude Desktop, Cline, Zed's ACP, or anything that speaks MCP. Fazm bundles it at Contents/MacOS/mcp-server-macos-use inside the signed .app (resolved in acp-bridge/src/index.ts around line 63) so users get the reliability loop without wiring any of the plumbing. Outside Fazm, you handle accessibility permission prompts, the MCP client, and the agent loop yourself.

How long does the traverse-act-traverse loop take per action?

The bottleneck is the accessibility tree walk, not the action itself. A click or keypress completes in under 50 ms. The traversal is bounded by AXUIElementSetMessagingTimeout set to 5.0 seconds per element (main.swift line 245 and elsewhere) but in practice a window of a hundred or so interactive elements traverses in the low hundreds of milliseconds. Total round trip per action is typically well under a second, which is why the tool response includes both a visible_elements sample and the full diff without hitting any chat-level latency ceiling. Compare that to a screenshot-plus-vision loop, which pays image I/O on every action.