Accessibility Tree Desktop Automation: the text format an LLM actually reads
The W3C spec describes the accessibility tree. Microsoft documents UIAutomation. Neither shows the actual text an LLM sees when it clicks a button on your Mac. This guide opens up the mcp-server-macos-use v1.6.0 binary bundled inside Fazm, the six tools it exposes, and the one-line-per-element format that comes back from every action.
What the SERP misses about accessibility-tree desktop automation
Search this term and you get Microsoft's UIAutomation specification, the W3C Core Accessibility API Mappings, a couple of blog posts about Accessibility Insights for Windows, and one developer-oriented Fazm post about MCP plumbing. Useful, all of them. None of them answers the question a normal person asks first: what does the AI actually see when I tell it to click something?
The answer is a small text file. Less than ten kilobytes for most apps. One line per element. A role in brackets, a title in quotes, four small integers for the frame, and a flag if it is currently visible. That is the entire substrate the model reasons against. Once you have seen it, screenshot-first agents start to look like they are doing extra work for no reason.
That format is what this guide is about. Where it lives in Fazm, what it looks like, and why six MCP tools are enough to drive any app on your Mac.
The same protocol VoiceOver uses, on every app
If a screen reader can describe it, an accessibility-tree automation tool can read it. That is the entire surface area.
The anchor fact: one line per element
Every traversal Fazm runs ends up as a UTF-8 text file at /tmp/macos-use/<timestamp>_<tool>.txt with a sibling .png. The header is the application name, the element count, and the traversal time. After that, every line is a single AX element.
Here is a real dump from an open Fazm Dev window. Notice the negative y values: this Mac has a second monitor positioned above the laptop screen, so the menu bar lives at y = -2160. The model does not need to know that, it just reads the frame and uses it.
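Abridged here to the header and one element line, both of which also appear verbatim in the FAQ below; the remaining 440 lines follow the same grammar:

```
# Fazm Dev — 441 elements (0.72s)
[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30
… 440 more element lines …
```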
The grammar of one line
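Reassembled from the examples quoted throughout this guide, a single element line breaks down as:

```
[AXRole (subrole)] "title" x:N y:N w:W h:H visible
```

The role and optional subrole sit in brackets, the title (or visible text) in quotes, the frame is four integers in screen points, and the trailing visible flag appears only when the element is currently rendered rather than merely attached to the tree.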
Why this is uncopyable: no other consumer Mac AI app exposes this format publicly. The file lives on your disk. You can open it. You can grep it. You can verify the model is acting on real data, not pixels it halfway interpreted.
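If you want to see that for yourself, a one-line grep over the saved traversals is enough (filenames carry a timestamp, so yours will differ):

```sh
# Every button the model could have clicked, across all saved traversals
grep '\[AXButton' /tmp/macos-use/*.txt
```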
Where it lives in the app
The MCP server is a standalone Swift-native binary, version 1.6.0, shipped inside the .app bundle. The Node.js bridge that registers it is two short blocks of TypeScript. You can audit both.
What the binary advertises
Pipe a JSON-RPC tools/list request into the binary and it logs the six tools it exposes:
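A minimal way to try it from a terminal, assuming Fazm is installed in /Applications (the path under Contents/MacOS is documented below; the exact log formatting is the binary's own):

```sh
echo '{"jsonrpc":"2.0","id":1,"method":"tools/list"}' \
  | /Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use
# Expect the six tool names: open_application_and_traverse, click_and_traverse,
# type_and_traverse, press_key_and_traverse, scroll_and_traverse, refresh_traversal
```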
The data path, end to end
What the model receives, what it returns, and what hits the OS. Every arrow is a real call you can trace in the source.
One round trip of accessibility-tree automation
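Sketched as plain text (each hop maps onto a step in the walkthrough further down the page):

```
chat prompt
  └─▶ model picks a tool (e.g. click_and_traverse)
        └─▶ acp-bridge routes the call to the bundled binary
              └─▶ binary acts via AXUIElement / CGEvent
                    └─▶ tree saved to /tmp/macos-use/<ts>_<tool>.txt (+ sibling .png)
                          └─▶ result returns to the model's context for the next turn
```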
What ships in the box
6 tools, every call re-traverses
open, click, type, press_key, scroll, refresh. Every one returns a fresh accessibility tree alongside the action result.
1 line per element
[AXRole (subrole)] "title" x:N y:N w:W h:H visible. Header shows app name, element count, traversal time.
Bundled v1.6.0 binary
Contents/MacOS/mcp-server-macos-use, Mach-O 64-bit arm64, ~21 MB. No npm install, no pip install.
Falls back to pixels
When AX is empty (some Qt apps, OpenGL games), Fazm switches to screen capture so the model still has something to see.
Files, not blobs
Tree saved to /tmp/macos-use/<ts>_<tool>.txt. Screenshot saved to <ts>_<tool>.png. The MCP response carries paths, not megabytes.
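Illustratively, and with placeholder field names rather than the server's real schema, the response to one action is on the order of:

```
// illustrative only; field names are placeholders, not the server's schema
{
  "tree_path": "/tmp/macos-use/1776457209163_refresh_traversal.txt",
  "screenshot_path": "/tmp/macos-use/1776457209163_refresh_traversal.png"
}
```

A couple of short paths instead of a base64 image is the entire point.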
Accessibility tree vs screenshot agents, line by line
Same task, two substrates. The tree is structured text the model can search. A screenshot is a million pixels the model has to interpret.
| Feature | Screenshot agents | Accessibility tree (Fazm) |
|---|---|---|
| Token cost per page | 1 to 5 MB image, base64 encoded | few KB of UTF-8 text |
| Click 'Send' button | vision model picks coordinates | search title='Send', use frame |
| Off-screen elements | invisible to model | present in tree, no visible flag |
| Role of an element | guessed from pixels | AXButton, AXTextField, etc. |
| Round-trip per action | screenshot, decode, infer, click | 1 call: act + return new tree |
| Works on Retina displays | yes, but Retina = bigger image | yes, frame is in points |
| Multi-monitor | yes, but image scales linearly | yes, negative coords are valid |
| Setup | varies, often dev framework | install Fazm, grant Accessibility once |
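The "Click 'Send' button" row is worth making concrete. Using the Send-button example from the FAQ below (an AXButton titled 'Send' with frame x:842 y:1004 w:60 h:32), the lookup is literally a text search; the line shown is assembled from those FAQ values rather than copied from a real dump, and the filename follows the <ts>_<tool> pattern described above:

```sh
grep '"Send"' /tmp/macos-use/<ts>_click_and_traverse.txt
# [AXButton] "Send" x:842 y:1004 w:60 h:32 visible
```

Click the midpoint of that frame and the action is done; no vision pass needed.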
How a single turn flows through the stack
From the moment you press Return to the moment the chat replies. Every step is text, except for one: the actual CGEvent click.
You ask Fazm to do something
Plain English prompt in the chat window.
The model picks an MCP tool
Fazm exposes the macos-use server (and Playwright, WhatsApp, Google Workspace, fazm_tools) to the model.
The bundled binary opens the app and walks the tree
Swift code calls AXUIElementCreateApplication, then traverses children depth-first using AXUIElementCopyAttributeValue for kAXChildrenAttribute.
The model reads the tree text
It searches for elements by role and title (e.g. AXButton with title containing 'tomorrow' or AXCell containing today's date).
Action + re-traverse in one call
click_and_traverse takes either an element string or x,y,w,h. It synthesizes the click via CGEvent, then re-walks the tree (a minimal code sketch of both calls follows this walkthrough).
Loop until done, then reply
Fazm keeps the loop running until the model is satisfied, then writes a chat reply that quotes the actual data it read out of the tree.
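Steps 3 and 5 are the only places native code touches the OS. Here is a minimal Swift sketch of that pattern, not Fazm's actual source: attribute handling is stripped down, the PID is hypothetical, and the click coordinates reuse the Send-button frame quoted in the FAQ below.

```swift
import ApplicationServices
import CoreGraphics

// Read one attribute off an AX element, or nil if the app does not expose it.
func attribute(_ element: AXUIElement, _ name: CFString) -> CFTypeRef? {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, name, &value) == .success else { return nil }
    return value
}

// Depth-first walk, one printed line per element (frame and visibility omitted here).
func traverse(_ element: AXUIElement) {
    let role  = attribute(element, kAXRoleAttribute as CFString)  as? String ?? "?"
    let title = attribute(element, kAXTitleAttribute as CFString) as? String ?? ""
    print("[\(role)] \"\(title)\"")
    if let children = attribute(element, kAXChildrenAttribute as CFString) as? [AXUIElement] {
        for child in children { traverse(child) }   // kAXChildrenAttribute drives the recursion
    }
}

// Step 3: build the root element for a running app and walk it.
// Requires the Accessibility permission; the PID here is hypothetical.
let targetPID: pid_t = 12345
traverse(AXUIElementCreateApplication(targetPID))

// Step 5: synthesize a left click at the midpoint of a frame, the CGEvent way.
// x:842 y:1004 w:60 h:32 is the Send-button example from the FAQ.
let midpoint = CGPoint(x: 842 + 30, y: 1004 + 16)
for type in [CGEventType.leftMouseDown, .leftMouseUp] {
    CGEvent(mouseEventSource: nil, mouseType: type,
            mouseCursorPosition: midpoint, mouseButton: .left)?
        .post(tap: .cghidEventTap)
}
```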
Numbers from one real traversal
Numbers come from the header of /tmp/macos-use/1776457209163_refresh_traversal.txt and the binary's own startup log. Reproduce with ls /tmp/macos-use after Fazm runs an action.
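On the machine those numbers came from, that looks like this (your timestamps, app name, and counts will differ):

```sh
ls /tmp/macos-use
# 1776457209163_refresh_traversal.txt  1776457209163_refresh_traversal.png  ...

head -n 1 /tmp/macos-use/1776457209163_refresh_traversal.txt
# # Fazm Dev — 441 elements (0.72s)
```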
See your own tree in /tmp/macos-use
Install Fazm, ask it to open any app, then peek inside /tmp/macos-use. The .txt file is the same text the model just read. The .png is what your screen looked like at that moment.
Download Fazm →

“The model never has to call a separate 'see what changed' tool after a click. Round-trip cost goes from two MCP calls to one, and the tree is always less than a second old.”
mcp-server-macos-use v1.6.0, six tools that each end with a fresh traversal
Frequently asked questions
What is an accessibility tree, in plain language?
Every windowed app on macOS, Windows, and Linux exposes a hierarchical map of its UI to the operating system. Roots are the desktop and application windows. Children are menus, buttons, text fields, lists, and so on. Each node has a role (AXButton, AXMenuItem, AXTextField), a title, sometimes a value, and a frame in screen coordinates. VoiceOver, NVDA, and Orca read it to describe the screen out loud. A desktop automation tool can read it for the same reason: it is a structured, machine-friendly description of what is on screen, without any pixel parsing.
Why use the accessibility tree instead of screenshots for desktop automation?
Three reasons. First, it is text, not pixels. A typical app's tree is a few kilobytes; a 4K screenshot is tens of megabytes uncompressed. Second, the LLM does not have to guess where 'the Send button' is by interpreting an image. The button is right there in the tree, with a role of AXButton, a title of 'Send', and a frame of x:842 y:1004 w:60 h:32. Third, the tree includes things pixels cannot: invisible items, role information, accessibility hints, and the exact bounding box. Pixels alone require a vision model to read back the UI, which is slower and more expensive on every step.
Where is Fazm's accessibility tree code, and how can I verify it?
Fazm bundles a standalone Swift-native MCP server inside the app: Contents/MacOS/mcp-server-macos-use, version 1.6.0. The Node.js bridge that registers it lives at acp-bridge/src/index.ts, line 63 (binary path) and lines 1056 to 1064 (registration). The list of built-in MCP servers is hardcoded as fazm_tools, playwright, macos-use, whatsapp, google-workspace at line 1266. You can run the binary directly: pipe a JSON-RPC tools/list request into it and it will print the six tool names.
What does an accessibility tree dump from Fazm actually look like?
One line per AX element. The header shows the application name, element count, and traversal time, e.g. '# Fazm Dev — 441 elements (0.72s)'. Each element line is '[AXRole (subrole)] "text" x:N y:N w:W h:H visible'. A real example: '[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30'. Negative coordinates appear when you have multiple displays. The 'visible' flag at the end is set when the element is currently rendered, not just attached to the tree. The full file is saved to /tmp/macos-use/<timestamp>_<tool>.txt and a screenshot to a sibling .png.
What is the '_and_traverse' suffix on the macos-use tools?
Five of the six tools end in _and_traverse: open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, and scroll_and_traverse; the sixth, refresh_traversal, is the bare re-walk. The suffix is not decorative. It means the tool performs the requested action and then immediately rebuilds the accessibility tree, returning both in a single response. The model never has to call a separate 'see what changed' tool after a click. Round-trip cost goes from two MCP calls to one, and the tree the model reasons against is always less than a second old.
Does this only work on macOS?
The mcp-server-macos-use binary is macOS-only because it talks to AXUIElement, the native macOS accessibility C API. The same architectural pattern (read accessibility tree, send compact text to model, post action back to OS) works on Windows via UIAutomation and Linux via AT-SPI. Fazm's Mediar group also ships an MIT-licensed Rust library called Terminator that wraps both UIAutomation on Windows and AX on macOS, which is what Fazm uses for cross-platform automation features. None of that requires the user to install a separate runtime: on macOS, the binary is bundled inside Fazm.app.
Do I need to write code to use this?
No. Fazm is a consumer Mac app, not a developer framework. You install it, grant Accessibility permission once (the same TCC dialog VoiceOver uses), and the macos-use MCP server is registered automatically alongside Playwright, WhatsApp, and Google Workspace. You then talk to Fazm in plain English. Behind the scenes, the model picks tools, the bridge routes them, and the accessibility tree text comes back into the LLM's context. No Python, no Node, no manual MCP config. Free to start.
How does this compare to Anthropic Computer Use, OpenAI Operator, and screenshot-based agents?
Anthropic's reference Computer Use implementation and OpenAI Operator are both pixel-first: they capture the screen, send it to a vision model, and click on coordinates the model returns. That is general but slow, expensive, and blind to non-visible structure. Accessibility-tree automation reads structured text, which is faster and lets you click 'Send' by name rather than by image patch. The trade-off is that some apps (Qt, OpenGL games, certain Electron apps) implement the AX protocol poorly, and you fall back to pixels. Fazm uses both: AX first, vision when AX is empty.
More on the accessibility-API-first approach to desktop automation.
Read next
macOS Accessibility API vs Screenshot Agents
50ms element lookup vs 2500ms vision pass: where the gap comes from and when it does not matter.
Accessibility API vs Screenshot Agents (Concept)
The two camps in desktop automation, the trade-offs, and where each one breaks.
Hybrid Desktop Automation: Vision plus Accessibility (2026)
Why production agents end up using both layers, and how Fazm decides which one to read first.