SOURCE: MCP-SERVER-MACOS-USE / MAIN.SWIFT

The four places an AX-tree chain breaks at app boundaries

An AX-tree action is bound to a PID. The boundary between two apps is where most chains die: a click opens a different process, Cmd+Tab moves you to a new app, a save sheet appears in front of the target window, or another app's modal steals focus mid-action. Each of those is a distinct failure with a distinct handler in the open-source MCP server. Most public writeups on AX-tree agents talk about one app at a time and pretend the boundary does not exist.

Direct answer (verified 2026-05-01 against mcp-server-macos-use main.swift v1.6.0)

AX-tree actions cross app boundaries in four ways, and the chain only stays alive if the server detects all four:

  1. An action in app A activates app B (deeplink, “Open With”, notification banner, web link).
  2. The action itself was an app switch (Cmd+Tab, app launcher, clicking the Dock).
  3. An AXSheet appears on top of the target window inside the same PID (Save dialog, file picker, permission prompt).
  4. A modal from a third process steals focus mid-action and the next click would land on it instead of the target.

mcp-server-macos-use has a named handler for each. Cross-app handoff at lines 1925-1948, sheet detection at 243-278, focus restore at 1913-1920, PID-bound dispatch on every action tool. Source: mediar-ai/mcp-server-macos-use, MIT licensed.

Matthew Diakonov

Why the boundary is a structural problem, not a polish problem

The accessibility API is per-process. You build an AXUIElement from a PID, you walk its tree, you send AX messages to its elements. There is no “system tree” that contains every running app. There is one tree per process, and the tree for app A has no nodes belonging to app B in it.

A screenshot-driven agent has the opposite property: it sees pixels of whatever happens to be on top, with no notion of which process produced those pixels. Both surfaces have boundary problems, but they are different boundary problems. The screenshot agent never knows which app it is in. The AX-tree agent always knows which app it intends to act on, but if the world moves under it, the next call goes to a stale PID and the chain ends.

So the question is not “can the AX-tree path work across apps” (it can, the API supports targeting any running process), but “how does the agent learn the world moved.” That learning happens in the response to the action that caused the move. The four failure modes below are the four ways that response can communicate “the next call should not use the PID you just used.”

The dispatch shape

Every action call has the same shape: input is a PID plus an element target plus an optional inline chain (text, pressKey). Output is a diff of the original PID's tree, optionally plus a full traversal of a different PID that became frontmost as a side effect.

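[Diagram: One action call, two possible outputs. On the left, the action tools click_and_traverse, type_and_traverse and press_key_and_traverse call the MCP server; on the right, the server returns the diff of the original PID, plus appSwitchTraversal and sheetDetected when they apply.]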

The interesting part is on the right side. A vanilla AX-tree binding would only return the first item: a diff of the app you targeted. The other two are how the boundary cases are surfaced without the model having to make an extra observation call to figure out what just happened.

The response struct, verbatim

Three fields carry the cross-app information. Two more carry the in-app modal information. The model checks which fields were populated and routes its next call accordingly.

Sources/MCPServer/main.swift

When appSwitchPid is set, the next tool call uses that PID. When sheetDetected is set, the model knows the target window is occluded and should drive the sheet first. When neither is set, the chain stays in the original PID. That is the entire boundary contract.
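The routing rule in that paragraph can be sketched in a few lines of Swift. This is an illustration of the contract, not the server's declaration: `BoundaryInfo`, `NextStep` and `nextStep` are hypothetical names, the traversal payload is reduced to an array of strings, only the sheet field named in the text (sheetDetected) is modeled, and letting an app switch take precedence over a sheet is an assumption.

```swift
// Hypothetical mirror of the boundary fields described in the article.
// Only the field names appSwitchPid, appSwitchAppName, appSwitchTraversal
// and sheetDetected come from the text; everything else is illustrative.
typealias ProcessID = Int32

struct BoundaryInfo {
    var appSwitchPid: ProcessID? = nil
    var appSwitchAppName: String? = nil
    var appSwitchTraversal: [String]? = nil   // stand-in for the real tree payload
    var sheetDetected: Bool? = nil
}

// What the caller should do with its next tool call.
enum NextStep: Equatable {
    case stayInApp(ProcessID)    // neither field set: keep the original PID
    case driveSheet(ProcessID)   // same PID, but the sheet must be handled first
    case switchTo(ProcessID)     // a different process became frontmost
}

func nextStep(after response: BoundaryInfo, originalPid: ProcessID) -> NextStep {
    // Assumption: an app switch is checked before a sheet.
    if let newPid = response.appSwitchPid, newPid != originalPid {
        return .switchTo(newPid)
    }
    if response.sheetDetected == true {
        return .driveSheet(originalPid)
    }
    return .stayInApp(originalPid)
}
```

A caller would run `nextStep(after:originalPid:)` on every tool result and use the PID inside the returned case for its next call.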

Failure mode 1: a click in app A activates app B

The most common boundary case. The model clicks a link in Mail, which opens a browser tab. Or a button in a chat client, which opens a meeting app. Or anything that fires a deeplink via NSWorkspace.open under the hood. The click was honest, the target element existed, the AX action succeeded. The diff in the original app is small or empty. From the model's side, this looks like the click did nothing. From the user's side, an entire new app just opened.

The handler is a frontmost-PID comparison on the action response, attached for every diff-based tool call:

Sources/MCPServer/main.swift

Three things to notice. First, the check runs unconditionally, not gated on a flag the model has to set. Second, the new app gets a full traversal in the same response, so the model never has to make a follow-up “what is this app I am in” call. Third, the original app's diff is still returned alongside the app-switch payload, because both are useful: the diff tells the model “the click landed and Mail closed the message”, the appSwitchTraversal tells it “you are now in Safari with this tree.”
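As a rough sketch of that check, assuming the frontmost lookup and the tree walk are injected as closures: in the server the lookup reads NSWorkspace.shared.frontmostApplication, while `FrontmostApp`, `AppSwitch` and `detectAppSwitch` below are hypothetical names used only for illustration, and the traversal is simplified to an array of strings.

```swift
typealias ProcessID = Int32

// Minimal stand-in for the frontmost application as reported by the OS.
struct FrontmostApp {
    let pid: ProcessID
    let name: String
}

// The three cross-app values that land on the response when a switch happened.
struct AppSwitch: Equatable {
    let pid: ProcessID
    let appName: String
    let traversal: [String]   // simplified: the real payload is a full AX tree walk
}

// Runs after every action. Returns nil when the original app is still frontmost
// (or nothing is frontmost); otherwise returns the new app plus a fresh traversal.
func detectAppSwitch(
    originalPid: ProcessID,
    frontmost: () -> FrontmostApp?,
    traverse: (ProcessID) -> [String]
) -> AppSwitch? {
    guard let front = frontmost(), front.pid != originalPid else { return nil }
    return AppSwitch(pid: front.pid, appName: front.name, traversal: traverse(front.pid))
}
```

Injecting the two closures keeps the comparison itself free of AppKit, so it can be exercised without a running window server.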

Failure mode 2: the action itself is the app switch

Cmd+Tab is just a key press. macos-use_press_key_and_traverse accepts keyName: "Tab" plus modifierFlags: ["Command"]. The key event fires, the OS swaps the frontmost app, and the same frontmost-PID check from failure mode 1 picks up the change and traverses the new app. The handler does not care whether the switch was intentional. From the response shape it looks identical to a deeplink: appSwitchPid is set, and the next call uses that PID.

The same path works for clicking the Dock, clicking an item in the macOS app switcher overlay, and clicking a notification banner. All of them are AX actions whose side effect happens to be a process becoming frontmost. The server does not need a special “app switch tool”; the handoff falls out of the standard action plus diff plus frontmost-check pipeline.
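For concreteness, here is a minimal sketch of what a Cmd+Tab request could look like when built on the client side. The tool name and the pid, keyName and modifierFlags parameters come from the text above; the `ToolCall` and `PressKeyArguments` types, the simplified name-plus-arguments envelope, and the PID value 4321 are assumptions made for the example.

```swift
import Foundation

// Arguments for the key-press tool, using the parameter names quoted in the article.
struct PressKeyArguments: Codable, Equatable {
    let pid: Int32
    let keyName: String
    let modifierFlags: [String]
}

// Simplified, hypothetical envelope: a tool name plus its arguments.
struct ToolCall: Codable, Equatable {
    let name: String
    let arguments: PressKeyArguments
}

// 4321 is a placeholder PID for the app that currently holds the chain.
let cmdTab = ToolCall(
    name: "macos-use_press_key_and_traverse",
    arguments: PressKeyArguments(pid: 4321, keyName: "Tab", modifierFlags: ["Command"])
)

// Sorted keys make the encoded JSON stable across runs.
let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]
let json = String(data: try! encoder.encode(cmdTab), encoding: .utf8)!
print(json)
```

Nothing in the request marks it as an app switch. The switch only becomes visible in the response, through the same appSwitchPid field as in failure mode 1.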

Failure mode 3: an AXSheet covers the target window

Sheets do not cross the app boundary in the PID sense. The sheet still belongs to the same process as the parent window. But they create the same chain-breaking effect: the diff still contains all the buttons of the parent window, all of those buttons are in the AX tree, and in_viewport may report them as visible because their AX coordinates are inside the window frame. None of them are clickable, because the sheet is on top.

The handler walks the app's AXWindows looking for an AXSheet child (lines 243-278). When one is found, the in-app viewport gets re-scoped: viewport bounds are the sheet's frame, not the window's frame. Every diff element gets re-checked against the sheet bounds. Buttons behind the sheet flip to in_viewport: false, the sheet's buttons stay in_viewport: true, and the model's decision rule (“only target in-viewport elements”) does the right thing without any new logic.
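The re-scoping step can be illustrated with plain geometry. In this sketch `Rect`, `Element` and `rescope` are hypothetical names, and testing an element by its center point is an assumption; the text only says that elements are re-checked against the sheet bounds.

```swift
// Axis-aligned rectangle in screen coordinates.
struct Rect {
    var x: Double
    var y: Double
    var width: Double
    var height: Double

    func contains(_ px: Double, _ py: Double) -> Bool {
        return px >= x && px <= x + width && py >= y && py <= y + height
    }
}

// Simplified diff element: a role, a center point, and the in_viewport flag.
struct Element {
    let role: String
    let centerX: Double
    let centerY: Double
    var inViewport: Bool = false
}

// When a sheet is present, its frame replaces the window frame as the viewport,
// and every element's flag is recomputed against it.
func rescope(_ elements: [Element], window: Rect, sheet: Rect?) -> [Element] {
    let viewport = sheet ?? window
    return elements.map { element in
        var copy = element
        copy.inViewport = viewport.contains(element.centerX, element.centerY)
        return copy
    }
}
```

With no sheet, the function falls back to the window frame, so callers can apply it unconditionally.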

This is the failure mode that bites people testing AX-driven agents in the wild. A Save dialog, a permissions prompt, a confirmation sheet: all of them break a naive “just-walk-the-AX-tree” agent because the tree looks fine and the model picks an unreachable target. Without the sheet-aware viewport, no amount of cross-app handoff detection helps, because no app switch happened.

Failure mode 4: a third app steals focus mid-action

A meeting starts. A Slack call comes in. A 1Password prompt appears. The agent was halfway through a click on the target app when a different process became frontmost. For screenshot-driven agents, the next screenshot shows the interrupting app and the chain pivots into nonsense. AX-driven actions have a structural defense: they are sent to a PID, not to whatever is on top. The click that already fired went to the original PID via the AX API and does not care which window is drawn on top of it.

The server still has to clean up the visible state. Two mechanisms cooperate. First, the InputGuard engaged for every disruptive action (line 1835) blocks user input during the action and shows an overlay, so the human cannot accidentally click the interrupting app and confuse the agent further. Second, after the action, the frontmost-app restore reasserts the original frontmost so the user's work continues where they left it:

Sources/MCPServer/main.swift

If the interrupting app is a real third process, the cross-app handoff detection from failure mode 1 still fires, and the model can decide whether the interruption is something to handle (answer the call) or something to defer (mark unread, return to the original target). The point is that the decision happens with full structured context, not on the basis of a screenshot of the new app drawn over the old one.
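The restore rule itself is small enough to sketch as a pure decision function. The names `RestoreDecision` and `restoreDecision` are hypothetical, and the liveness check is injected as a closure. The actual server works with the saved NSWorkspace frontmost application rather than bare PIDs.

```swift
typealias ProcessID = Int32

enum RestoreDecision: Equatable {
    case nothingToDo            // frontmost did not change, or nothing was saved
    case reactivate(ProcessID)  // original is alive but no longer on top
    case originalGone           // original process exited; nothing to restore
}

// savedFrontmost is captured before the action, currentFrontmost after it.
func restoreDecision(
    savedFrontmost: ProcessID?,
    currentFrontmost: ProcessID?,
    isAlive: (ProcessID) -> Bool
) -> RestoreDecision {
    guard let saved = savedFrontmost, saved != currentFrontmost else {
        return .nothingToDo
    }
    return isAlive(saved) ? .reactivate(saved) : .originalGone
}
```

Keeping the decision separate from the reactivation call makes the three outcomes easy to test without touching the window server.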

What the “just take a screenshot” path actually loses

A screenshot agent does not have process identity in its payload. There is no PID on a screenshot. So none of the four handlers above can be expressed in that surface. The closest analogue is a heuristic on the visual diff: did the menu bar change, did the app icon flash in the Dock, did the title bar rename. Those are guesses, and they fail silently when the interrupting app looks similar (a second instance of the same browser, a sheet rendered with the parent app's style, a modal from a helper process bundled into the main app).

The AX-tree path makes “which process am I driving” a structured field on every response. Screenshot pipelines treat it as a property of the camera angle. That is the gap. Everything else (latency, token cost, prompt size) is a consequence of the same structural difference.

Fazm consumes mcp-server-macos-use through its acp-bridge. The bridge does not implement boundary handling itself; it inherits all four behaviours from the MCP server, so any model wired through Fazm gets the same cross-app fields on every tool response without the bridge layer doing anything clever.

Frequently asked questions

Why is the PID a required parameter on every AX-tree action?

Because an AX action is sent to a specific process, not to whatever happens to be on top. macos-use_click_and_traverse, macos-use_type_and_traverse, macos-use_press_key_and_traverse and the others all declare `required: ["pid"]` in their input schema (lines 1330, 1347, 1367, 1402 of mcp-server-macos-use/Sources/MCPServer/main.swift). Without a PID, the server has nothing to attach the AXUIElementCreateApplication call to. The model has to know which process it is driving. This is the structural reason cross-app failures show up at all: the moment the model's intended PID and the actual frontmost PID disagree, the chain is in a boundary state and has to either notice or break.

What does the MCP server actually do when a click in app A causes app B to come to the front?

After every diff-based action, the server reads NSWorkspace.shared.frontmostApplication.processIdentifier and compares it to the original PID it just acted on (lines 1925 to 1948). If they differ, three things land on the same response: appSwitchPid (the new frontmost PID), appSwitchAppName (its localized name), and appSwitchTraversal (a fresh accessibility-tree walk of the new app). The model gets the diff for the original app PLUS the full tree of the app that came up, both in one tool result. The next tool call uses the new PID. No second round trip needed to discover that the world changed.

Why does taking another screenshot not fix this for screenshot-driven agents?

A screenshot agent does not have a process identity for what it is looking at. It sees pixels of "the frontmost window," no matter which process drew them. So a click that launches a different app produces a screenshot that looks fine. The model picks new coordinates, the click goes to the new app, and the chain limps along while the model's picture of which app it is in slowly drifts from reality. Subtle versions of this are the worst: a small modal sheet from a helper process appears over the target window, the screenshot still looks like the target app, and the model's clicks land on the helper's buttons. With the AX-tree path the PID is in every payload, so a frontmost change is a structured event the model has to handle on the very next link.

What happens when Cmd+Tab is pressed intentionally?

Cmd+Tab is just a key press emitted by macos-use_press_key_and_traverse. The server sends the key, then runs the same frontmost-PID check. If the press caused a different app to become frontmost (which is the whole point of Cmd+Tab), the appSwitchPid + appSwitchTraversal fields are populated with the new app and the model can immediately see it. Same plumbing as the deeplink case, different cause. From the model's side, intentional cross-app and accidental cross-app look identical: a single response that contains both the original app's diff and the new frontmost's tree.

What is the isDisruptive flag and why does it matter at app boundaries?

Every action tool is marked disruptive except for macos-use_refresh_traversal (line 1800). For disruptive tools the server saves NSWorkspace.shared.frontmostApplication and the cursor position before running the action (lines 1804-1808), engages an InputGuard that blocks user keystrokes during the action (line 1835), and at the end checks whether the frontmost app changed (lines 1914-1920). If the original frontmost is still alive but is no longer on top, the server reactivates it. This is why the user can keep typing in their editor while the agent operates a different app in the background: AX actions do not require the target app to be frontmost, and the server actively undoes any focus theft as a side effect of the action. Cross-app handoff detection runs in addition to this restore step, so an intentional app-switch is preserved while accidental focus theft is reversed.

How does the server handle a save sheet or file picker that appears mid-action?

Sheets are detected separately from cross-app handoff because a sheet still belongs to the same PID. The findSheetBounds function (lines 243 to 278 of main.swift) walks the app's AXWindows, looks for an AXSheet child on each window, and if it finds one, returns the sheet's frame. The bounds are then used as the active viewport instead of the window: the in_viewport flag on every diff element is computed against the sheet, not the parent window. Without this, the model sees a tree full of "in viewport" buttons that are actually behind the sheet and unreachable. Sheets are the most common source of "my click did nothing" failures, and they would not show up at all in a cross-app boundary check because the PID has not changed.

What happens when a modal from another app pops on top while the agent is mid-action?

The InputGuard and the focus-restore logic catch this together. While the action is running, InputGuard blocks user input and shows an overlay (line 1835); after the action, the frontmost-restore step (lines 1914-1920) puts the original frontmost back on top if a different app stole focus. If the modal belongs to a third process, the cross-app handoff detection then surfaces that third process so the model can decide whether to deal with it. Crucially, the action that already fired was sent to the original PID through the AX API, not to whatever was on top, so the click did not land in the modal. This is the failure mode that screenshot-driven agents have no defense against, because their entire input pipeline assumes the target is the frontmost window.

Where is this code? Is it open source?

Yes. mcp-server-macos-use is open source on GitHub at github.com/mediar-ai/mcp-server-macos-use, MIT licensed. The cross-app fields on ToolResponse are at lines 227-230 of Sources/MCPServer/main.swift. The handoff-detection block is lines 1925-1948. The disruptive-action save/restore is lines 1799-1836 and 1913-1920. Sheet detection is the findSheetBounds function at lines 243-278. The MCP server reports version 1.6.0 (line 1487). Fazm consumes this server through its acp-bridge and inherits all four boundary handlers without doing anything special at the bridge layer.