SOURCE: MCP-SERVER-MACOS-USE / MAIN.SWIFT

One chain link is one action plus an AX-tree diff

A computer-use action chain on macOS is not click, screenshot, click, screenshot. Each link is a single tool call that performs the action and returns the accessibility tree diff: added nodes, removed nodes, attribute-level changes on modified nodes, with coordinate-only churn filtered out and an in_viewport flag on every changed element. The model reads the diff and picks the next link. The diff is what lets a 20-step chain stay coherent in a normal context window.

Direct answer (verified 2026-05-01 against mcp-server-macos-use main.swift)

Each link in the chain has the same shape:

  1. The model picks one of nine action primitives (open, click, type, press_key, scroll, set_value, press_ax, set_selected, refresh) and supplies a target.
  2. The MCP server walks the accessibility tree of the target app once, runs the action on the live Mac app, then walks the tree a second time.
  3. It returns an EnrichedTraversalDiff with three buckets (added, removed, modified) plus an in_viewport flag on each changed element.
  4. Coordinate-only modifications and scrollbar churn are filtered out of the diff before it leaves the server.
  5. The model reads the diff (typically 100-300 tokens) and picks the next link.

The structs and filters are at lines 187-216 and 592-649 of mcp-server-macos-use/Sources/MCPServer/main.swift. The whole binary is open source.

Matthew Diakonov

The four stages inside one chain link

Zooming in on one link. Everything between “model picks an action” and “model gets a diff back” happens inside a single MCP tool call. The model never has to issue a separate observation call.

One chain link, four internal stages

  1. Pick action. Model selects one of 9 primitives plus a target (element string or coordinates).
  2. Pre-traverse. Server walks the AX tree of the target app to capture the before state.
  3. Run action. CGEvent for click/type/press, AX API for set_value or press_ax. Optional inline chain (text + pressKey) before re-traverse.
  4. Diff and return. Walk again, compute added/removed/modified, filter coordinate noise, mark in_viewport, return as one MCP response.
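
As a rough illustration of stage 4, the sketch below computes the three buckets from two snapshots. It is not the server's code: the `Snapshot` type, the string ids (role plus text), and the function name `diffSnapshots` are all assumptions made for this example, since the article does not show how the server identifies an element across two walks.

```swift
// Hypothetical snapshot type: element id -> attribute name -> value.
// How the real server keys elements across walks is not shown in this article.
typealias Snapshot = [String: [String: String]]

struct AttributeChange: Equatable {
    let name: String
    let before: String?
    let after: String?
}

struct SnapshotDiff {
    var added: [String] = []
    var removed: [String] = []
    var modified: [String: [AttributeChange]] = [:]
}

func diffSnapshots(before: Snapshot, after: Snapshot) -> SnapshotDiff {
    var diff = SnapshotDiff()
    // Present only after the action: added.
    diff.added = after.keys.filter { before[$0] == nil }.sorted()
    // Present only before the action: removed.
    diff.removed = before.keys.filter { after[$0] == nil }.sorted()
    // Present in both walks: compare attribute by attribute.
    for (id, oldAttrs) in before {
        guard let newAttrs = after[id] else { continue }
        let names = Set(oldAttrs.keys).union(newAttrs.keys).sorted()
        let changes = names.compactMap { name -> AttributeChange? in
            let old = oldAttrs[name], new = newAttrs[name]
            return old == new ? nil : AttributeChange(name: name, before: old, after: new)
        }
        if !changes.isEmpty { diff.modified[id] = changes }
    }
    return diff
}

// Made-up example: clicking a File button opens a menu and removes a status label.
let menuBefore: Snapshot = [
    "AXButton:File": ["AXSelected": "false", "x": "10"],
    "AXStaticText:Ready": ["x": "0"],
]
let menuAfter: Snapshot = [
    "AXButton:File": ["AXSelected": "true", "x": "10"],
    "AXMenuItem:New": ["x": "10"],
    "AXMenuItem:Open": ["x": "10"],
]
let linkDiff = diffSnapshots(before: menuBefore, after: menuAfter)
print(linkDiff.added, linkDiff.removed, linkDiff.modified.keys.sorted())
```

Unchanged elements never appear in the result, which is why the response stays small even when the tree is large.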

The diff struct, verbatim

This is the data shape the model receives after every action. Three buckets at the top level. Each modified element carries before, after, and a list of attribute-level changes so the model can see exactly which AX attribute flipped.

Sources/MCPServer/main.swift

A typical diff after clicking a button that opens a menu has 8 to 20 entries in added, 0 in removed, and 1 to 3 in modified (the button itself flipping AXFocused or AXSelected). That is the unit of attention the model needs to plan the next link.
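
To make the shape concrete from the consumer's side, here is a minimal sketch that decodes a diff with three buckets. It is a reconstruction from the prose above, not the verbatim struct: the type names prefixed with `Sketch`, the `role`, `text`, `attribute`, and `changes` field names, and the JSON layout are assumptions. Only the bucket names (added, removed, modified) and the in_viewport flag come from the article.

```swift
import Foundation

// Hypothetical mirror of the described shape; not copied from main.swift.
struct SketchElement: Codable {
    let role: String
    let text: String?
    let inViewport: Bool?
    enum CodingKeys: String, CodingKey {
        case role, text
        case inViewport = "in_viewport"
    }
}

struct SketchAttributeChange: Codable {
    let attribute: String
    let before: String?
    let after: String?
}

struct SketchModified: Codable {
    let before: SketchElement
    let after: SketchElement
    let changes: [SketchAttributeChange]
}

struct SketchDiff: Codable {
    let added: [SketchElement]
    let removed: [SketchElement]
    let modified: [SketchModified]
}

// Made-up payload for a click that opens a menu.
let sampleJSON = """
{
  "added": [
    {"role": "AXMenuItem", "text": "New Window", "in_viewport": true},
    {"role": "AXMenuItem", "text": "Close", "in_viewport": true}
  ],
  "removed": [],
  "modified": [
    {
      "before": {"role": "AXMenuBarItem", "text": "File", "in_viewport": true},
      "after": {"role": "AXMenuBarItem", "text": "File", "in_viewport": true},
      "changes": [{"attribute": "AXSelected", "before": "false", "after": "true"}]
    }
  ]
}
"""

let sampleDiff = try! JSONDecoder().decode(SketchDiff.self, from: Data(sampleJSON.utf8))
print("added: \(sampleDiff.added.count), removed: \(sampleDiff.removed.count), modified: \(sampleDiff.modified.count)")
```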

Three noise filters that make the diff usable

The raw diff between two AX-tree walks is full of churn that has nothing to do with the action that just ran. Three filters run server-side before the diff leaves the binary.

Sources/MCPServer/main.swift
Filter 1: Coordinate-only changes

A modified node whose only delta is in x, y, width, or height is dropped. Window resizes and reflow churn would otherwise dominate the diff.

Filter 2: Scroll-bar churn

scrollbar, scroll bar, value indicator, page button, and arrow button roles are dropped entirely. None of them are actionable targets.

Filter 3: Empty structural rows

axrow, axcell, axcolumn, axmenu containers without text are dropped. They are scaffolding, not anything the model can click.
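
The three filters can be approximated with a few predicates. This is a sketch under assumptions, not the server's implementation: `coordinateAttrs`, `isScrollBarNoise`, and `isStructuralNoise` are names the article mentions, but the `NoiseCandidate` type, the `isCoordinateOnly` and `keepInDiff` helpers, and the case-insensitive matching rules are invented here for illustration.

```swift
import Foundation

// Attribute names as quoted in the article.
let coordinateAttrs: Set<String> = ["x", "y", "width", "height"]

// Hypothetical element shape used only by this sketch.
struct NoiseCandidate {
    let role: String
    let text: String?
    let changedAttributes: [String]  // empty for added or removed nodes
}

// Filter 1: a modified node whose every changed attribute is a coordinate.
func isCoordinateOnly(_ element: NoiseCandidate) -> Bool {
    !element.changedAttributes.isEmpty
        && element.changedAttributes.allSatisfy { coordinateAttrs.contains($0) }
}

// Filter 2: scroll-bar roles, matched by lowercase substring (an assumption).
func isScrollBarNoise(_ element: NoiseCandidate) -> Bool {
    let role = element.role.lowercased()
    return ["scrollbar", "scroll bar", "value indicator", "page button", "arrow button"]
        .contains { role.contains($0) }
}

// Filter 3: structural containers that carry no text.
func isStructuralNoise(_ element: NoiseCandidate) -> Bool {
    let role = element.role.lowercased()
    let isContainer = ["axrow", "axcell", "axcolumn", "axmenu"].contains(role)
    return isContainer && (element.text ?? "").isEmpty
}

func keepInDiff(_ element: NoiseCandidate) -> Bool {
    !(isCoordinateOnly(element) || isScrollBarNoise(element) || isStructuralNoise(element))
}

let candidates = [
    NoiseCandidate(role: "AXButton", text: "Send", changedAttributes: ["x", "y"]),
    NoiseCandidate(role: "AXScrollBar", text: nil, changedAttributes: ["AXValue"]),
    NoiseCandidate(role: "AXRow", text: nil, changedAttributes: []),
    NoiseCandidate(role: "AXTextField", text: "hello", changedAttributes: ["AXValue", "x"]),
]
let kept = candidates.filter(keepInDiff)
print(kept.map { $0.role })  // prints ["AXTextField"]
```

Note that the text field survives even though its x changed, because AXValue changed too; only nodes whose changes are purely positional are dropped.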

3 buckets

added, removed, modified, with attribute-level before/after on every modified node and an in_viewport flag on every node that changed. That is the entire data shape the model reasons against between actions.

EnrichedTraversalDiff, mcp-server-macos-use main.swift line 212

Why in_viewport changes the chain

The accessibility tree is a logical graph. It contains nodes that exist in the app but are scrolled off-screen or under a hidden tab. A naive agent finds a button in the tree, picks its coordinates, and fires a click; the OS routes that click to whatever pixel is there right now, which is usually not the button. Long chains accumulate this kind of silent miss until the chain is dead and nobody knows when it died.

The fix is symbolic: every node in the diff carries in_viewport: Bool?, set by checking whether the node's center point falls inside the app's window rectangles (or the bounds of an open AXSheet, for save/file dialogs). The model knows, before clicking, whether the target it picked is actually clickable from the current scroll position. If not, it issues a scroll first.

The MCP server also runs scrollIntoViewIfNeeded before any click on coordinates that fall outside the visible window, so in most cases the model can fire and forget. The in_viewport flag is the safety net for the rest.
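
The center-point check described above can be sketched in a few lines. The `ViewRect` type and the `inViewport(elementFrame:windowRects:)` function are hypothetical names for this example, and returning nil for an element without a frame is an assumption about why the flag is an optional Bool.

```swift
// Minimal rectangle type so the sketch does not depend on CoreGraphics.
struct ViewRect {
    let x: Double, y: Double, width: Double, height: Double

    func contains(x px: Double, y py: Double) -> Bool {
        px >= x && px <= x + width && py >= y && py <= y + height
    }
}

// Assumption: nil means "no usable frame", matching the optional Bool? flag.
func inViewport(elementFrame: ViewRect?, windowRects: [ViewRect]) -> Bool? {
    guard let frame = elementFrame else { return nil }
    let centerX = frame.x + frame.width / 2
    let centerY = frame.y + frame.height / 2
    // Visible if the center lies inside any window (or sheet) rectangle.
    return windowRects.contains { $0.contains(x: centerX, y: centerY) }
}

let appWindow = ViewRect(x: 0, y: 0, width: 800, height: 600)
let visibleButton = ViewRect(x: 100, y: 100, width: 50, height: 20)
let scrolledOffRow = ViewRect(x: 100, y: 900, width: 50, height: 20)

print(inViewport(elementFrame: visibleButton, windowRects: [appWindow]) as Any)
print(inViewport(elementFrame: scrolledOffRow, windowRects: [appWindow]) as Any)
```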

Inline action fusion: three actions, one chain link

Sending a Slack message is naively three actions: click into the input, type the message, press Return. Three tool calls means three AX-tree walks and three diffs to read. The MCP server lets one tool call do all three by accepting optional text and pressKey parameters on click_and_traverse.

Sources/MCPServer/main.swift

The pre-traverse runs once. The click fires. The type and press actions run in sequence as additionalActions before the post-traverse. One diff comes back, showing the message appearing in the conversation log and the input field clearing. The server's own instructions explicitly tell agents to do this rather than splitting the actions: “ALWAYS prefer a single combined call over multiple sequential calls.”
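
The ordering can be modelled as a plan of steps built from the call's parameters. This is only a sketch of the sequencing described above: the `LinkStep` enum and `planClickLink` function are hypothetical, and the real handler performs the actions rather than returning a list.

```swift
// Hypothetical step type; the article names the parameters, not this enum.
enum LinkStep: Equatable {
    case preTraverse
    case click(element: String)
    case typeText(String)
    case pressKey(String)
    case postTraverse
}

func planClickLink(element: String, text: String? = nil, pressKey: String? = nil) -> [LinkStep] {
    var steps: [LinkStep] = [.preTraverse, .click(element: element)]
    // Optional inline actions run after the click and before the single post-traverse.
    if let text = text { steps.append(.typeText(text)) }
    if let key = pressKey { steps.append(.pressKey(key)) }
    steps.append(.postTraverse)
    return steps
}

let slackLink = planClickLink(element: "Message field", text: "hello", pressKey: "Return")
let plainClick = planClickLink(element: "OK")
print(slackLink.count, plainClick.count)
```

Whatever combination of optional parameters is supplied, the plan contains exactly one pre-traverse and one post-traverse, so the model reads one diff per call.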

What sending a Slack message looks like as one chain link

  1. click_and_traverse (element=Message field, text=hello, pressKey=Return)
  2. Pre-traverse
  3. Click input
  4. Type text
  5. Press Return
  6. Post-traverse
  7. Return diff

The nine action primitives in the chain

Every primitive except refresh_traversal ends in _and_traverse. The suffix is the contract: the action runs and the tree gets re-walked in the same call. The seven action primitives all return a diff; the other two (open_application_and_traverse and refresh_traversal) return a full traversal because there is no meaningful before state.

| Primitive | What it does | Returns |
| --- | --- | --- |
| open_application_and_traverse | Launch (or focus) an app by bundle id or name. | Full traversal |
| click_and_traverse | Click by element string or x/y/w/h. Optional inline text + pressKey. | Diff |
| type_and_traverse | Type into the focused field. Optional inline pressKey. | Diff |
| press_key_and_traverse | Press a named key (Return, Escape, Tab) with optional modifiers. | Diff |
| scroll_and_traverse | Scroll at a point by deltaX/deltaY. | Diff |
| set_value_and_traverse | Set the AXValue of an element directly (skips typing animation). | Diff |
| press_ax_and_traverse | Trigger an element's AXPress action without sending a CGEvent. | Diff |
| set_selected_and_traverse | Toggle AXSelected on a target (checkboxes, list rows). | Diff |
| refresh_traversal | Re-walk the tree without performing any action. | Full traversal |

The list is from allTools at line 1482 of MCPServer/main.swift. The MCP server's reported version is 1.6.0 (line 1487).
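
A client that wants to know in advance whether to expect a diff or a full traversal could encode the table as an enum. The tool names below are the ones listed in the table; the `Primitive` enum, its case names, and the `returnsDiff` property are hypothetical conveniences for this sketch and are not part of the server.

```swift
enum Primitive: String, CaseIterable {
    case open = "open_application_and_traverse"
    case click = "click_and_traverse"
    case typeText = "type_and_traverse"
    case pressKey = "press_key_and_traverse"
    case scroll = "scroll_and_traverse"
    case setValue = "set_value_and_traverse"
    case pressAX = "press_ax_and_traverse"
    case setSelected = "set_selected_and_traverse"
    case refresh = "refresh_traversal"

    // open and refresh have no meaningful before state, so they return a full traversal.
    var returnsDiff: Bool {
        switch self {
        case .open, .refresh: return false
        default: return true
        }
    }
}

let diffTools = Primitive.allCases.filter { $0.returnsDiff }
print(diffTools.map { $0.rawValue })
```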

AX-tree action chain versus screenshot-driven action chain

Same task, same model. The chain mechanics differ. This is what changes per link.

| Feature | Screenshot action chain | AX-tree action chain |
| --- | --- | --- |
| What the model sees per link | Two raw images plus a coordinate guess from the model | Symbolic diff: role, text, attribute changes, in_viewport flag |
| Tokens consumed per link | Full screenshot, often 1000 to 3000 vision tokens | 100 to 300 (typical filtered diff) |
| Round trips per chain link | Two minimum: take screenshot, then act | One MCP call (action and observation fused) |
| Targeting stability across app updates | Pixel coordinates break the moment a button moves | Role plus text identifies a button across UI redesigns |
| Off-screen elements | Element is invisible to the model; chain stalls | in_viewport=false; agent issues scroll first |
| Cross-app handoff | New screenshot, model has to recognise the new app | appSwitchPid + appSwitchTraversal in same response |
| Inline action fusion | Three separate model turns for click, type, enter | click_and_traverse(element, text, pressKey) is one call |
| Noise floor | Cursor blink, antialiasing, spinner all show as visual diff | Coordinate-only and scrollbar diffs filtered server-side |

What this means for chain length

Every additional link adds the cost of one diff to the next prompt. With filtered, symbolic diffs averaging 100 to 300 tokens, a 20-link chain stays well inside a normal context window with room for the system prompt, the user's instructions, and the model's own reasoning. With full-screenshot links, the same chain runs out of room around link 8 to 10 unless aggressive summarisation kicks in, and summarisation tends to drop the exact visual detail the next click depends on.

That is the practical reason AX-tree chains scale. It is not that screenshot-based agents cannot work. It is that the per-link cost of a screenshot puts a hard cap on how many actions you can chain before the prompt either bloats past the limit or drops too much information to plan reliably.
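
The per-link arithmetic is simple enough to write down. The sketch below uses the token ranges quoted in this article; the 30,000-token history budget is a made-up figure chosen only to illustrate the calculation, and the function names are hypothetical.

```swift
// Tokens that accumulated observations add to the prompt after a number of links.
func historyTokens(links: Int, tokensPerLink: Int) -> Int {
    links * tokensPerLink
}

// How many links fit before observations alone exhaust a given budget.
func linksThatFit(budget: Int, tokensPerLink: Int) -> Int {
    budget / tokensPerLink
}

let axHistory = historyTokens(links: 20, tokensPerLink: 300)            // upper end of the diff range
let screenshotHistory = historyTokens(links: 20, tokensPerLink: 3_000)  // upper end of the screenshot range

// Hypothetical budget left for observations after system prompt and reasoning.
let assumedBudget = 30_000
print(axHistory, screenshotHistory)
print(linksThatFit(budget: assumedBudget, tokensPerLink: 300),
      linksThatFit(budget: assumedBudget, tokensPerLink: 3_000))
```

Under that assumed budget the ratio is what matters: a tenfold difference in per-link cost becomes a tenfold difference in how many links fit.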

Symbolic diffs also let the model use the chain history. After link 12, the model can grep its own conversation for “when did I last see element AXButton with text Send” and find a useful answer. Doing that against a sequence of screenshots requires a second OCR pass per turn, which most agent harnesses do not run.

Frequently asked questions

What is one link in a computer-use AX tree action chain, exactly?

One tool call that does two things in a single response: it performs an action (click, type, press a key, scroll, set a value, set selected, press an AX action) on a target Mac app, and it returns the accessibility-tree diff between the moment before the action and the moment after. The diff has three buckets: added nodes, removed nodes, and modified nodes (with attribute-level before/after deltas). The model reads the diff and picks the next action. There is no separate 'see what changed' tool call. The action and the observation are fused into one round-trip.

Why a diff instead of returning the full AX tree after every action?

A real macOS app's AX tree contains hundreds of nodes. Returning the full tree after every click means the model spends most of its context reading mostly-unchanged scaffolding. The diff is two orders of magnitude smaller: a typical click that opens a menu adds maybe 8 to 20 nodes, modifies 1 to 3, and removes nothing. The model reads exactly the part of the UI that changed because of the action it just took, which is what it needs to decide the next action. Full traversal is reserved for the two cases where it is actually useful: opening an app for the first time, and explicit `refresh_traversal` calls when something looks off.

What gets filtered out of the diff before the model sees it?

Three classes of noise. (1) Pure coordinate moves: any modified node whose only changed attributes are x, y, width, or height is dropped (`coordinateAttrs: Set<String> = ["x", "y", "width", "height"]`, line 649 of mcp-server-macos-use/Sources/MCPServer/main.swift). Layout reflows after a window resize would otherwise flood the diff. (2) Scroll-bar churn: scrollbar, scroll bar, value indicator, page button, and arrow button roles are dropped via `isScrollBarNoise` (lines 592-597). Scrolling triggers many of these, and none of them are actionable. (3) Structural scaffolding: rows, cells, columns, and menu containers that carry no text are dropped via `isStructuralNoise` (lines 600-607). The result is a diff the model can actually reason against in 100-300 tokens.

What is the in_viewport flag and why does it matter for the chain?

Every element in the diff carries `in_viewport: Bool?`. It is set by checking whether the element's center point falls inside any of the app's window rectangles (or the bounds of an active AXSheet, when one is open). The accessibility tree includes elements that exist in the tree but are scrolled off-screen or in a hidden tab. Without the viewport check, the model often clicks an element it found in the tree, the click coordinates land outside the visible window, the OS routes it to whatever is on top there, and the chain is dead. With the flag, the model knows which added elements are actually visible and clickable right now, and falls back to a scroll if the target is in_viewport=false.

How does the MCP server fuse multiple actions into one chain link?

click_and_traverse accepts optional `text` and `pressKey` parameters (lines 1678-1684 of MCPServer/main.swift). If both are set, the server clicks the target, types the text, presses the key, and only then runs the post-action traversal. That is one model turn, one tool call, one diff. Without this, sending a Slack message would be three separate tool calls (click into the input, type the message, press Return) and three separate diffs to read. type_and_traverse has the same trick with its own `pressKey` argument (lines 1694-1698). The MCP server's instructions explicitly tell agents to chain inside a single call: 'do NOT split into separate click, type, and press calls.'

What happens when the action triggers a different app to come to the front?

Cross-app handoff is detected and surfaced in the same response. The ToolResponse struct carries optional `appSwitchPid`, `appSwitchAppName`, and `appSwitchTraversal` fields (lines 227-230). After the action, the server checks the current frontmost app and if it changed, it traverses that app too and includes its tree alongside the diff for the original target. The model sees both 'here is what changed in the app you clicked into' and 'oh, also you are now in this other app, here is its tree.' That makes Cmd+Tab, Open With, deep-linking out of an email, and any other app-launching action a single chain link instead of two.
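
From the agent's side, handling the handoff amounts to checking the optional fields before choosing the next target. In this sketch only the three `appSwitch` field names come from the article; the `SketchToolResponse` type, the `targetPid` and `diffSummary` fields, the simplification of a traversal to a list of strings, and the `nextTarget` function are assumptions for illustration.

```swift
// Hypothetical, simplified mirror of a tool response.
struct SketchToolResponse {
    let targetPid: Int32              // app the action was sent to (assumed field)
    let diffSummary: [String]         // stand-in for the diff of that app
    let appSwitchPid: Int32?
    let appSwitchAppName: String?
    let appSwitchTraversal: [String]? // stand-in for the new app's tree
}

// Decide which process the next chain link should target, and what to read first.
func nextTarget(after response: SketchToolResponse) -> (pid: Int32, context: [String]) {
    if let newPid = response.appSwitchPid, let tree = response.appSwitchTraversal {
        return (newPid, tree)  // frontmost app changed: continue there
    }
    return (response.targetPid, response.diffSummary)
}

let stayed = SketchToolResponse(targetPid: 501, diffSummary: ["+ AXMenuItem New"],
                                appSwitchPid: nil, appSwitchAppName: nil, appSwitchTraversal: nil)
let switched = SketchToolResponse(targetPid: 501, diffSummary: ["~ AXLink visited"],
                                  appSwitchPid: 777, appSwitchAppName: "Safari",
                                  appSwitchTraversal: ["AXWindow", "AXTextField Address"])

print(nextTarget(after: stayed).pid, nextTarget(after: switched).pid)
```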

Why does the action automatically scroll the target into view?

Because the AX tree contains elements that exist but are not currently scrolled into the visible region of the window. If the model picks coordinates from such an element and the click fires at those raw coordinates, the click lands somewhere else (or off-screen, depending on the OS). Before clicking, the server calls `scrollIntoViewIfNeeded` (line 1662) which checks the target point against window bounds and scrolls the relevant container so the point becomes visible, then translates the click coordinate to wherever the element ended up. The model writes 'click element X' once; the server handles the visibility plumbing.

How is this different from screenshot-driven action chains?

A screenshot-driven chain is 'take screenshot, model picks pixel coords, click, take screenshot, model picks pixel coords, click.' The model has no symbolic identity for any element. The 'diff' between two screenshots is a pixel diff, which is information-poor and noisy: a flashing cursor, an antialiasing change, a progress spinner all show up as differences with no semantic meaning. The AX-tree diff is symbolic: the model gets the role, the text, the attribute changes (like AXValue going from empty to 'hello world'), and the visibility flag, all without sending a single image. Round-trip is faster, the prompt is dramatically smaller, and the chain is reproducible because role+text identifies a button across app updates while pixel coordinates do not.

Where can I read this code?

It is open source. The structs and the filtering live in mcp-server-macos-use/Sources/MCPServer/main.swift in the public repo. EnrichedTraversalDiff is at lines 212-216, DiffElementData at lines 187-196, the filter pipeline at lines 648-718, the noise predicates at lines 592-607, and the per-tool action handlers at lines 1598-1795. The MCP server version on the binary is 1.6.0 (line 1487). Fazm consumes this server through its acp-bridge, but the chain mechanics described here are entirely the MCP server's job; the bridge sits a layer up.

Does Fazm use this for the browser too, or only native Mac apps?

For native apps the AX-tree action chain is the primary path. For the browser, Fazm uses a Playwright bridge that exposes the rendered DOM through the same accessibility surface (the browser's own AX tree, not the raw DOM), which gives a chain shape that is similar but not identical: the diff is computed against accessibility-tree snapshots of the page rather than against the OS-level macOS AX tree. The principle, action plus diff plus visibility flag, is the same in both. The mechanics of how diffs are computed and noise filtered are tuned per surface.