Accessibility tree limits beyond the browser: four boundaries you cross at once.

If your AX intuitions come from Playwright getByRole, axe-core, or Chrome DevTools, the address bar is the boundary where the model you have stops working. Four things change at the same moment, and the failures show up as one error code with three meanings. This page names the four, with the exact Swift the Fazm desktop agent ships to keep going when the layer goes ambiguous.

Matthew Diakonov, Written with AI

Published May 1, 2026 · 7 min read

Direct answer (verified 2026-05-01)

Outside the browser the accessibility tree is no longer one spec, one process, one permission, and one tidy set of error codes. You cross four boundaries at once: a different AX API per OS (NSAccessibility on macOS, UI Automation on Windows, AT-SPI on Linux, instead of an ARIA-derived in-page tree); a system trust model (the user has to grant accessibility access, and the cache for that access can lie); an addressing model rooted in process and window instead of URL and selector; and OS error codes that overload meanings the browser never had to express. The worst is AXError.cannotComplete, which is returned for revoked permission, for apps that never implemented AX, and for processes that are alive but unresponsive. A useful agent treats each boundary explicitly.

The four boundaries, in the order you hit them

A browser-AX dev moving to the desktop usually hits these in this order: pick an OS, ask the user for a permission, figure out how to address an element, then debug a confusing error code. Each one is small. Together they are the gap.

API surface

Three OSes, three accessibility APIs, no shared interface.

macOS gives you AXUIElement (NSAccessibility under the hood), declared in ApplicationServices/HIServices. Windows gives you UI Automation (UIAutomationCore.dll), with a COM-style interface. Linux gives you AT-SPI over DBus. The roles map across cleanly enough (button is AXButton is ControlType.Button) but the call shape, the threading rules, and the lifecycle of element references are different on each OS. The Chromium accessibility overview documents how Chrome itself bridges the DOM and ARIA into each platform's native API; outside Chrome you call the native API directly.
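To make the "roles map, call shapes do not" point concrete, here is a minimal sketch of a role lookup table. The type and function names are hypothetical, and the table covers only three roles; real mappings are larger and not always one-to-one.

```swift
// Hypothetical lookup table: one ARIA role, three native spellings.
// Only the role names line up; the calls that fetch them differ per OS.
struct NativeRoles {
    let macOS: String    // value of the AXRole attribute
    let windows: String  // UI Automation control type
    let linux: String    // AT-SPI role name
}

let roleTable: [String: NativeRoles] = [
    "button": NativeRoles(macOS: "AXButton", windows: "ControlType.Button", linux: "push button"),
    "checkbox": NativeRoles(macOS: "AXCheckBox", windows: "ControlType.CheckBox", linux: "check box"),
    "link": NativeRoles(macOS: "AXLink", windows: "ControlType.Hyperlink", linux: "link"),
]

/// Returns the native role names for an ARIA role, or nil when this
/// sketch has no entry for it.
func nativeRoles(forARIA role: String) -> NativeRoles? {
    roleTable[role.lowercased()]
}
```

A table like this is the easy part of a port. The threading rules and element lifetimes it says nothing about are where the per-OS work goes.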

Trust model

The OS gates cross-process AX behind a user-granted permission.

In the browser your script runs in-process; AX access is implicit. On the desktop you are reading another process's tree, which the OS treats as a privileged action. macOS routes it through TCC and the System Settings > Privacy & Security > Accessibility toggle. Windows uses UIA client privileges and an asymmetric elevation model. Linux uses AT-SPI bus permissions. On macOS specifically the per-process AXIsProcessTrusted cache can go stale on macOS 26 (Tahoe) after an app re-sign or update, which means a naive agent reports 'broken' to a user who already granted the permission.
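For the macOS case, the cached trust check looks roughly like the sketch below. The wrapper name accessibilityTrust is hypothetical; AXIsProcessTrustedWithOptions and kAXTrustedCheckOptionPrompt are the system calls it wraps. Note that this reads the per-process answer, which is exactly the value that can go stale.

```swift
import Foundation
#if os(macOS)
import ApplicationServices
#endif

/// Returns the trust state macOS reports for this process, or nil when
/// not running on macOS (there is no TCC toggle to query there).
/// With `promptIfNeeded` set, macOS may show its own dialog pointing the
/// user at the Accessibility toggle.
func accessibilityTrust(promptIfNeeded: Bool) -> Bool? {
    #if os(macOS)
    let key = kAXTrustedCheckOptionPrompt.takeUnretainedValue() as String
    let options = [key: promptIfNeeded] as CFDictionary
    return AXIsProcessTrustedWithOptions(options)
    #else
    return nil
    #endif
}
```

A false result here is a reason to probe further, not yet a reason to tell the user the permission is missing.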

Addressing model

Process plus window plus element, not URL plus selector.

There is no document.querySelector for the desktop. You start from a process identifier, get an AX root with AXUIElementCreateApplication(pid), read the focused window via kAXFocusedWindowAttribute or enumerate all windows via kAXWindowsAttribute, then recursively walk kAXChildrenAttribute. The graph keeps moving as the user opens and closes apps. The practical shape an agent gives a model is a flattened text dump scoped to one window, with one line per element, refreshed after every action because the desktop AX tree can lag what is on screen by hundreds of milliseconds.
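The process, window, element walk can be sketched as follows. This is an illustration under assumptions, not the Fazm traversal: the names FlatElement, render, attribute, walk, and dumpFocusedWindow are hypothetical, the depth cap of 25 is arbitrary, and frames are left out to keep the sketch short.

```swift
import Foundation
#if os(macOS)
import ApplicationServices
#endif

/// One flattened element (hypothetical shape; frame omitted).
struct FlatElement: Equatable {
    let role: String
    let title: String
}

/// Renders the flat list as one line per element.
func render(_ elements: [FlatElement]) -> String {
    elements.map { "[\($0.role)] \"\($0.title)\"" }.joined(separator: "\n")
}

#if os(macOS)
/// Reads one attribute; returns nil on any AXError.
func attribute(_ element: AXUIElement, _ name: String) -> CFTypeRef? {
    var value: CFTypeRef?
    let err = AXUIElementCopyAttributeValue(element, name as CFString, &value)
    return err == .success ? value : nil
}

/// Depth-first walk of kAXChildrenAttribute, capped so a dump stays bounded.
func walk(_ element: AXUIElement, depth: Int = 0, maxDepth: Int = 25,
          into out: inout [FlatElement]) {
    let role = attribute(element, kAXRoleAttribute) as? String ?? "AXUnknown"
    let title = attribute(element, kAXTitleAttribute) as? String ?? ""
    out.append(FlatElement(role: role, title: title))
    guard depth < maxDepth,
          let children = attribute(element, kAXChildrenAttribute) as? [AXUIElement]
    else { return }
    for child in children {
        walk(child, depth: depth + 1, maxDepth: maxDepth, into: &out)
    }
}

/// Process -> focused window -> elements. Call again after every action
/// instead of caching the result.
func dumpFocusedWindow(pid: pid_t) -> [FlatElement] {
    let app = AXUIElementCreateApplication(pid)
    guard let window = attribute(app, kAXFocusedWindowAttribute) else { return [] }
    var out: [FlatElement] = []
    walk(window as! AXUIElement, into: &out)
    return out
}
#endif
```

Swallowing every AXError in attribute keeps the walk simple, but it also hides the ambiguous error case, so a real agent would inspect the error at the root before walking.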

Error codes

One OS error, three different meanings. The browser never had this.

AXError.cannotComplete is returned for at least three conditions: revoked accessibility permission, target process that never implemented NSAccessibility (most Qt apps, Tk apps), and target process alive but unresponsive. An agent that retries on cannotComplete will spin on a Qt app forever. An agent that maps it to 'permission broken' will nag the user every time they open Blender. The disambiguation pattern is to probe a known-good control app like Finder, which is the anchor of this page.
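The decision this implies can be written as a small pure function. The sketch below uses a hypothetical AXOutcome enum that mirrors a few AXError cases so the logic can be read without the AX API; the Diagnosis names are also assumptions.

```swift
/// Hypothetical mirror of the AXError cases discussed on this page.
enum AXOutcome {
    case success, cannotComplete, apiDisabled, notImplemented, attributeUnsupported
}

enum Diagnosis: Equatable {
    case ok
    case permissionBroken  // ask the user to fix the Accessibility toggle
    case appSpecific       // app is opaque to AX or hung: fall back, do not nag
    case attributeMissing  // element exists, this attribute does not
}

/// `target` is what the suspect app returned. `control` runs the same call
/// against a known-good app such as Finder, and is only invoked when the
/// target result is ambiguous.
func diagnose(target: AXOutcome, control: () -> AXOutcome) -> Diagnosis {
    switch target {
    case .success:
        return .ok
    case .apiDisabled:
        return .permissionBroken
    case .notImplemented, .attributeUnsupported:
        return .attributeMissing
    case .cannotComplete:
        // Ambiguous on its own: only the control probe can split the cases.
        switch control() {
        case .cannotComplete, .apiDisabled:
            return .permissionBroken
        default:
            return .appSpecific
        }
    }
}
```

Passing the control probe as a closure keeps the extra cross-process call off the common path, where the target app answers normally.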

What the model actually reads, and where to find it

The string an LLM gets when a desktop agent traverses an app does not look like a Chrome DevTools accessibility tree. It is one line per element, scoped to one window. Here is the start of a real traversal, written by the bundled mcp-server-macos-use tool to /tmp/macos-use after the Fazm app was opened. Every line you see is the format the model has in context for that turn.

/tmp/macos-use/1777425476971_open_application_and_traverse.txt

Three things to notice. First, the header line counts elements and reports a traversal time, which is the only honest way to talk about cost on the desktop because it is not page-load bound. Second, the frame is in CGFloat points and the menu bar x extends past 7680 because of multi-monitor coordinates, which a browser AX tree never has to express. Third, the tree is flattened into a single list per turn, because for an LLM that is cheaper to read than a deeply nested structure. This shape is what makes the desktop AX surface usable for an agent at all, and it is why a literal port of a browser AX harness does not work: the address space is bigger and the structure has to be squashed differently.

The pattern that does not exist in the browser

When the boundaries combine, the worst case is this: a user opens a Qt app like PyMOL or OBS, the agent calls AXUIElementCopyAttributeValue, the OS returns AXError.cannotComplete, and the agent has to decide whether the user just lost their permission or whether this app is simply opaque to AX. The browser never had this question because the browser is one process and one document. The desktop solution is to probe a known-good control. The Fazm agent uses Finder.

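// Desktop/Sources/AppState.swift, line 468
// confirmAccessibilityBrokenViaFinder
// Returns true when Finder answers (the permission is fine and the failure
// was app-specific), false when Finder also fails (the permission is stuck).
// If Finder is not running, the result comes from the event-tap probe.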

private func confirmAccessibilityBrokenViaFinder(
  suspectApp: String
) -> Bool {
  if let finder = NSRunningApplication
       .runningApplications(withBundleIdentifier: "com.apple.finder")
       .first {
    let finderElement = AXUIElementCreateApplication(
      finder.processIdentifier
    )
    var finderWindow: CFTypeRef?
    let finderResult = AXUIElementCopyAttributeValue(
      finderElement,
      kAXFocusedWindowAttribute as CFString,
      &finderWindow
    )
    if finderResult == .cannotComplete
        || finderResult == .apiDisabled {
      // Finder also fails: permission is truly stuck.
      return false
    } else {
      // Finder works: failure was app-specific.
      return true
    }
  } else {
    // Finder not running: tie-break with an event-tap probe.
    return probeAccessibilityViaEventTap()
  }
}

The companion routine is probeAccessibilityViaEventTap at line 490 of the same file. CGEvent tap creation hits the live TCC database directly, bypassing the per-process cache that goes stale on macOS 26. A tap that succeeds while AXIsProcessTrusted returns false is the canonical signature of a stale cache. The two routines together cover the two-by-two table of (this app implements AX, yes or no) by (process cache is fresh, yes or no) without nagging the user when the system is actually fine.
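A stale-cache check along these lines might look like the sketch below. It is not the Fazm routine: the names TrustState, trustState, eventTapProbe, and currentTrustState are hypothetical, and using .defaultTap with a mouse-moved mask is an assumption about how such a probe could be built.

```swift
import Foundation
#if os(macOS)
import ApplicationServices
#endif

enum TrustState: Equatable {
    case granted     // the cached answer already says yes
    case staleCache  // cache says no, but a live probe succeeds
    case denied      // cache says no and the live probe fails too
}

/// Combines the cached answer with a live probe. The probe only runs when
/// the cache says no, since that is the only case it can change.
func trustState(cachedTrusted: Bool, liveProbe: () -> Bool) -> TrustState {
    if cachedTrusted { return .granted }
    return liveProbe() ? .staleCache : .denied
}

#if os(macOS)
/// Tries to create an event tap and tears it down immediately.
/// Creation returns nil when the system refuses the tap.
func eventTapProbe() -> Bool {
    let mask = CGEventMask(1) << CGEventMask(CGEventType.mouseMoved.rawValue)
    guard let tap = CGEvent.tapCreate(
        tap: .cgSessionEventTap,
        place: .headInsertEventTap,
        options: .defaultTap,
        eventsOfInterest: mask,
        callback: { _, _, event, _ in Unmanaged.passUnretained(event) },
        userInfo: nil
    ) else { return false }
    CFMachPortInvalidate(tap)  // never added to a run loop; just release it
    return true
}

func currentTrustState() -> TrustState {
    trustState(cachedTrusted: AXIsProcessTrusted(), liveProbe: eventTapProbe)
}
#endif
```

In the staleCache case the right message to the user is "quit and reopen", not "grant the permission", because the grant already exists.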

No browser-AX library has anything like this. The closest analogue in the browser is checking if a frame has loaded before querying it, which is a much smaller question. Outside the browser, the question is whether the OS can talk to the app at all, and the only honest way to answer is by reference to a control.

1 error, 3 meanings

“AXError.cannotComplete is returned for at least three different conditions. An agent that maps it directly to 'permission broken' will nag the user. An agent that maps it to 'transient' will spin on a Qt app forever.”

AppState.swift, Fazm open source repo, github.com/m13v/fazm

The honest summary, for someone porting browser-AX habits

Most of what you know transfers. Roles and names map. Tree walks are tree walks. The mental model of "find element by role plus name, then act on it" still works. What does not transfer is the assumption that the tree is dense, fresh, and one process away. Outside the browser the tree can be sparse (Electron, Qt, OpenGL canvases), can lag what is on screen, and lives behind a permission you have to ask the user to grant.

The practical advice is to take a fresh tree dump after every action, treat AXError.cannotComplete as a signal to disambiguate (not a hard failure), and plan for a vision fallback on the apps where the tree is thin. Fazm does these three things in 200 lines of Swift in AppState.swift, and uses the same approach across every macOS app it drives. That code is open source on GitHub if you want to compare against your own harness.


Questions browser-AX devs ask before adopting a desktop agent

I use Playwright's getByRole and axe-core. What is actually different about AX outside the browser?

Inside Chrome the accessibility tree is a single in-process structure derived from the live DOM and ARIA. You query it with one API, addressing is by selector or role, and access is implicit because your script runs inside the same page. Step outside the browser and four things change at once. The API splits into NSAccessibility on macOS, UI Automation on Windows, and AT-SPI on Linux, with no shared interface. The trust model splits too: you need a system-wide accessibility permission granted by the user in System Settings, not zero-config in-page access. Addressing becomes a process plus window plus element triple, not a URL plus selector. And the error codes overload meanings the browser never had to express, like one OS error returned for both 'permission revoked' and 'this app never implemented AX'. Each boundary on its own is a small change. Together they are why a browser-AX harness does not port to the desktop with a tweak.

Why three different APIs instead of one cross-platform spec?

Each OS accessibility layer predates the W3C ARIA spec and grew out of screen-reader support in the 90s and 00s. macOS has NSAccessibility, exposed in C as the AXUIElement family declared in the AppKit-adjacent ApplicationServices framework. Windows has Microsoft UI Automation, an evolution of Microsoft Active Accessibility (MSAA). Linux has AT-SPI, the GNOME assistive technology service interface. They were designed for VoiceOver, Narrator, and Orca respectively. Browser AX trees are layered on top of these per-platform APIs: Chromium maps the DOM and ARIA to the native AX layer of whatever OS it runs on, which is documented in Chromium's accessibility overview. So in a sense the browser AX tree you query in DevTools is already a translation of the platform API. Outside the browser there is no translation: you call the platform API directly.

What does 'trust model' mean in practice?

In the browser, your code can read the AX tree of the page it is in for free. There is no permission to grant. On the desktop, an external process that wants to read another process's AX tree needs to be granted accessibility access by the user. On macOS this is the System Settings > Privacy & Security > Accessibility toggle, gated by TCC. On Windows it is UI Automation client privileges, generally available without a special grant but elevated processes have an asymmetric view. On Linux it is at-spi-bus access plus DBus session policy. The macOS case is the most user-visible: every desktop AX agent on a Mac has to walk a user through a settings toggle on first run. Worse, the toggle has a cache that goes stale on macOS 26 after an app re-sign or update, which means an agent that does not probe around AXIsProcessTrusted will report 'broken' to a user who already granted the permission.

How does addressing change?

In the browser, an element is addressed by a CSS selector, an ARIA role plus name, or an XPath rooted at the document, and there is exactly one document. Outside the browser, you address by process identifier first, then by window inside that process, then by element inside that window. On macOS that is AXUIElementCreateApplication(pid) to get a per-app root, then kAXFocusedWindowAttribute for the focused window (or kAXWindowsAttribute to enumerate all of them), then a recursive walk of kAXChildrenAttribute to find a leaf. There is no single root for 'the desktop'. There is no document.querySelector. An agent has to discover the process graph, which is also moving in real time as the user opens apps. The output you give the model is usually a flattened text dump scoped to one window, not a query result.

What error codes does the desktop add that the browser does not have?

The interesting one is AXError.cannotComplete, which is what AXUIElementCopyAttributeValue returns when the call cannot be answered. It is overloaded: returned for revoked accessibility permission, returned for apps that never implemented NSAccessibility (most Qt builds such as PyMOL or OBS, and Tk apps), and returned for a target process that is alive but not responding. The browser never had to express any of these because the browser's AX tree is in-process and either populated or not. A naive desktop agent that maps cannotComplete to 'permission broken' will nag the user every time they open a Qt app. The fix is to disambiguate against a control. AXError.apiDisabled is the unambiguous 'system AX is off' code. AXError.notImplemented and AXError.attributeUnsupported mean the element exists but the attribute does not, which is fine to fall back from.

Show me one concrete pattern that does not exist in the browser at all.

Probe a known-good control. The Fazm agent, when it gets AXError.cannotComplete from a frontmost app, runs the same AXUIElementCopyAttributeValue against com.apple.finder before doing anything else. Finder always implements NSAccessibility and is almost always running on a Mac; when it is not running, the agent falls back to an event-tap probe. If Finder fails too, the permission is genuinely stuck and the agent surfaces a Quit and Reopen dialog. If Finder succeeds, the original failure was app-specific, the permission is fine, and the agent logs the suspect app as AX-incompatible and falls back to vision. This pattern is at AppState.swift line 468 in github.com/m13v/fazm under the function name confirmAccessibilityBrokenViaFinder. There is no equivalent in the browser because the browser has one document and one process, so 'control element' is a category that does not exist.

Is there a way to keep my browser-AX intuitions when I move to the desktop?

Mostly. Roles map across cleanly: a button in ARIA is AXButton on macOS and ControlType.Button in UI Automation. Names map: aria-label becomes AXTitle / AXValue on macOS and Name on UIA. Hierarchy maps: a tree walk is a tree walk. What does not map is the assumption that the structure is dense. Browser AX trees are dense because the DOM is dense. Desktop AX trees can be very sparse, especially in Electron and Qt apps where the renderer never built the nodes. The other thing that does not map is the assumption that 'the tree is up to date with what is on screen'. On the desktop, an animation, a modal that just opened, or a window minimization can leave the AX tree out of sync for hundreds of milliseconds. A productive working assumption: take a fresh tree dump after every action, do not cache.

Where can I read what the model actually sees?

If you run the Fazm agent locally, every time it calls a desktop tool it writes a file to /tmp/macos-use/<timestamp>_<tool>.txt that contains the exact AX tree dump the LLM received as context. Open one in any editor. Each line is one element, format is [AXRole (subrole)] "title" x:N y:N w:W h:H, and the file starts with a header like '# Fazm Dev — 146 elements (0.19s)' showing the element count and traversal time. That is ground truth: if a click landed, the element was in the .txt. If the action failed, you can grep the file to find out whether the model was guessing from a stale tree or whether the element was simply absent (AX-thin app, screenshot fallback territory).
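Checking a dump by hand can also be scripted. The sketch below assumes only the header and line format described above; the function names are hypothetical and the matching is deliberately loose (a role prefix plus a quoted title).

```swift
import Foundation

/// Reads the element count from a header line of the form
/// "# <window> ... N elements (T s)". Returns nil if the first line
/// is not a header or has no count.
func declaredElementCount(in dump: String) -> Int? {
    guard let header = dump.split(separator: "\n").first,
          header.hasPrefix("#") else { return nil }
    let words = header.split(separator: " ")
    guard let i = words.firstIndex(of: "elements"), i > words.startIndex else {
        return nil
    }
    return Int(words[i - 1])
}

/// True when some non-header line starts with the given role and
/// contains the title in double quotes.
func dumpContains(role: String, title: String, in dump: String) -> Bool {
    dump.split(separator: "\n").contains { line in
        !line.hasPrefix("#")
            && line.hasPrefix("[\(role)")
            && line.contains("\"\(title)\"")
    }
}
```

If dumpContains returns false for the element the model tried to click, the failure was an absent element rather than a bad coordinate, which points at a thin AX tree or a stale dump.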

Does this mean accessibility-API agents are worse than screenshot agents on the desktop?

Not in general. On the apps where the AX tree is well-implemented, which on macOS is most native AppKit and SwiftUI apps plus the Apple-shipped suite, an AX agent is faster, cheaper, and more deterministic than a screenshot agent. No frame capture, no model pass to tokenize pixels into elements, no coordinate prediction. The honest answer is hybrid: AX-first for apps that expose a usable tree, vision fallback for the apps where AX is thin or absent. The boundary work is what makes hybrid possible, because without disambiguating cannotComplete you cannot tell whether to fall back or to ask the user to fix a permission.

What about Windows and Linux specifically?

Windows UI Automation is closer to a single coherent API than macOS or Linux, and most Win32, WinUI, and WPF apps participate. WinForms and Electron apps on Windows have similar gaps to their macOS counterparts. The big Windows-specific quirk is the elevation asymmetry: an unelevated UIA client cannot read elements in elevated processes. Linux AT-SPI is the most fragmented surface, because it works only when the toolkit (GTK, Qt with the AT-SPI bridge plug-in) opts in. On both OSes the four boundaries crossed are the same conceptually: a different API, a permission or capability check, a process-rooted addressing model, and ambiguous error codes. The function names change. The shape does not.
