
Accessibility Tree Desktop Automation: the text format an LLM actually reads

The W3C spec describes the accessibility tree. Microsoft documents UIAutomation. Neither shows the actual text an LLM sees when it clicks a button on your Mac. This guide opens up the bundled mcp-server-macos-use v1.6.0 inside Fazm, the six tools it exposes, and the one-line-per-element format that arrives back from every action.

Fazm
9 min read
Built on real AXUIElement, not screenshots
Works in any app on your Mac, not just the browser
Consumer app, no MCP config required

What the SERP misses about accessibility-tree desktop automation

Search this term and you get Microsoft's UIAutomation specification, the W3C Core Accessibility API Mappings, a couple of blog posts about Accessibility Insights for Windows, and one developer-oriented Fazm post about MCP plumbing. Useful, all of them. None of them answers the question a normal person asks first: what does the AI actually see when I tell it to click something?

The answer is a small text file. Less than ten kilobytes for most apps. One line per element. A role in brackets, a title in quotes, four small integers for the frame, and a flag if it is currently visible. That is the entire substrate the model reasons against. Once you have seen it, screenshot-first agents start to look like they are doing extra work for no reason.

That format is what this guide is about. Where it lives in Fazm, what it looks like, and why six MCP tools are enough to drive any app on your Mac.

The same protocol VoiceOver uses, on every app

If a screen reader can describe it, an accessibility-tree automation tool can read it. That is the entire surface area.

Mail
Safari
Chrome
Arc
Slack
Messages
Notes
Calendar
Finder
VS Code
Xcode
Figma
Spotify
Terminal
Pages
Numbers
Keynote
Cursor
Linear
1Password

The anchor fact: one line per element

Every traversal Fazm runs ends as a UTF-8 text file at /tmp/macos-use/<timestamp>_<tool>.txt with a sibling .png. The header is the application name, the element count, and the traversal time. After that, every line is a single AX element.

Here is a real dump from an open Fazm Dev window. Notice the negative y values: this Mac has a second monitor positioned above the laptop screen, so the menu bar lives at y = -2160. The model does not need to know that; it just reads the frame and uses it.

/tmp/macos-use/1776457209163_refresh_traversal.txt
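The header and one element line from that file, both quoted verbatim elsewhere in this article, give the flavor (the remaining lines are elided here):

```
# Fazm Dev — 441 elements (0.72s)
[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30
…
```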

The grammar of one line

format.txt
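The grammar is simple enough to parse with one regular expression. This is an illustrative sketch, not Fazm's code: it assumes only the line shape shown in this article, '[AXRole (subrole)] "title" x:N y:N w:W h:H visible', with the subrole and the visible flag optional.

```python
import re

# One real element line, quoted from the article's FAQ.
LINE = '[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30'

# [AXRole (subrole)] "title" x:N y:N w:W h:H visible
ELEMENT_RE = re.compile(
    r'\[(?P<role>\w+)(?: \((?P<subrole>[^)]*)\))?\]'  # role + optional subrole
    r' "(?P<title>[^"]*)"'                            # title in quotes
    r' x:(?P<x>-?\d+) y:(?P<y>-?\d+)'                 # position in points (may be negative)
    r' w:(?P<w>\d+) h:(?P<h>\d+)'                     # size in points
    r'(?P<visible> visible)?$'                        # flag set when currently rendered
)

def parse_element(line: str) -> dict:
    m = ELEMENT_RE.match(line)
    if not m:
        raise ValueError(f"not an element line: {line!r}")
    d = m.groupdict()
    for k in ("x", "y", "w", "h"):
        d[k] = int(d[k])
    d["visible"] = d["visible"] is not None
    return d

el = parse_element(LINE)
print(el["role"], el["title"], el["y"])  # AXMenuBarItem Apple -2160
```

Note the second monitor again: y is -2160, and the parse is still trivial, which is exactly the point of a structured text substrate.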

Why this is uncopyable: no other consumer Mac AI app exposes this format publicly. The file lives on your disk. You can open it. You can grep it. You can verify the model is acting on real data, not pixels it halfway interpreted.

Where it lives in the app

The MCP server is a standalone Swift-native binary, version 1.6.0, shipped inside the .app bundle. The Node.js bridge that registers it is two short blocks of TypeScript. You can audit both.

acp-bridge/src/index.ts

What the binary advertises

Pipe a JSON-RPC tools/list into the binary and it logs the six tools it exposes:

stderr from mcp-server-macos-use
Run it yourself
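Building that request yourself is one line of JSON. The bundle path and the bare one-shot pipe are assumptions for illustration; MCP servers speak JSON-RPC 2.0 over stdio, and a conforming client would normally send an initialize handshake before tools/list.

```python
import json

# A minimal JSON-RPC 2.0 tools/list request.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
payload = json.dumps(request)
print(payload)

# Then, from a shell (binary path assumed from the bundle layout described above):
#   echo '{"jsonrpc": "2.0", "id": 1, "method": "tools/list"}' \
#     | /Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use
```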

The data path, end to end

What the model receives, what it returns, and what hits the OS. Every arrow is a real call you can trace in the source.

One round trip of accessibility-tree automation

Into the model: your prompt, the last action result, the frontmost app + window, and the AXUIElement traversal. Out of the macos-use MCP server: the tree as compact text plus a screenshot path (.png). Down to the OS: a CGEvent click / type, followed by a re-traversal on every step.

What ships in the box

6 tools, all '_and_traverse'

open, click, type, press_key, scroll, refresh. Every one returns a fresh accessibility tree alongside the action result.

1 line per element

[AXRole (subrole)] "title" x:N y:N w:W h:H visible. Header shows app name, element count, traversal time.

Bundled v1.6.0 binary

Contents/MacOS/mcp-server-macos-use, Mach-O 64-bit arm64, ~21 MB. No npm install, no pip install.

Falls back to pixels

When AX is empty (some Qt apps, OpenGL games), Fazm switches to screen capture so the model still has something to see.

Files, not blobs

Tree saved to /tmp/macos-use/<ts>_<tool>.txt. Screenshot saved to <ts>_<tool>.png. The MCP response carries paths, not megabytes.

Accessibility tree vs screenshot agents, line by line

Same task, two substrates. The tree is structured text the model can search. A screenshot is a million pixels the model has to interpret.

Feature | Screenshot agents | Accessibility tree (Fazm)
Token cost per page | 1 to 5 MB image, base64 encoded | few KB of UTF-8 text
Click 'Send' button | vision model picks coordinates | search title='Send', use frame
Off-screen elements | invisible to model | present in tree, no visible flag
Role of an element | guessed from pixels | AXButton, AXTextField, etc.
Round-trip per action | screenshot, decode, infer, click | 1 call: act + return new tree
Works on Retina displays | yes, but Retina = bigger image | yes, frame is in points
Multi-monitor | yes, but image scales linearly | yes, negative coords are valid
Setup | varies, often dev framework | install Fazm, grant Accessibility once
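The first row is easy to sanity-check with back-of-envelope arithmetic, assuming an uncompressed 4K RGBA frame against a roughly 8 KB tree dump:

```python
# Uncompressed 4K RGBA frame vs a small tree dump, with base64's 4/3 inflation.
raw = 3840 * 2160 * 4   # 33,177,600 bytes of pixels
b64 = raw * 4 // 3      # what would actually enter a text prompt
tree = 8 * 1024         # a typical few-KB dump (assumed 8 KB)
print(raw // 1_000_000, b64 // 1_000_000, b64 // tree)  # 33 44 5400
```

Real screenshot agents send compressed images, so the practical gap is smaller than 5400x, but it stays orders of magnitude in the tree's favor.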

How a single turn flows through the stack

From the moment you press Return to the moment the chat replies. Every step is text, except for one: the actual CGEvent click.

1. You ask Fazm to do something

Plain English prompt in the chat window.

"Open the Calendar app and find what is on my schedule for tomorrow."
2. The model picks an MCP tool

Fazm exposes the macos-use server (and Playwright, WhatsApp, Google Workspace, fazm_tools) to the model.

For "open Calendar" the right tool is macos-use_open_application_and_traverse with arg appName="Calendar".
3. The bundled binary opens the app and walks the tree

Swift code calls AXUIElementCreateApplication, then traverses children depth-first using AXUIElementCopyAttributeValue for kAXChildrenAttribute.

Result is serialized to one element per line, saved to /tmp/macos-use/<ts>_open_application_and_traverse.txt, and returned in the MCP response alongside the .png screenshot path.
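The walk-then-serialize step can be sketched in miniature. Node here is a toy stand-in for AXUIElement (the real server does this in Swift against the AX C API), and the output follows the one-line grammar this article describes:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    role: str
    title: str
    frame: tuple          # (x, y, w, h) in screen points
    visible: bool = True
    children: list = field(default_factory=list)

def serialize(node: Node, out: list) -> None:
    """Depth-first walk: emit this node's line, then recurse into children."""
    x, y, w, h = node.frame
    line = f'[{node.role}] "{node.title}" x:{x} y:{y} w:{w} h:{h}'
    if node.visible:
        line += " visible"
    out.append(line)
    for child in node.children:
        serialize(child, out)

# A hypothetical two-element window.
root = Node("AXWindow", "Calendar", (0, 0, 1440, 900), children=[
    Node("AXButton", "Today", (100, 50, 60, 24)),
])
lines = []
serialize(root, lines)
print("\n".join(lines))
# [AXWindow] "Calendar" x:0 y:0 w:1440 h:900 visible
# [AXButton] "Today" x:100 y:50 w:60 h:24 visible
```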
4. The model reads the tree text

It searches for elements by role and title (e.g. AXButton with title containing 'tomorrow' or AXCell containing today's date).

No vision pass needed. The text is structured enough that text search wins.
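That search is nothing fancier than a substring scan over the dump. The two element lines below are hypothetical, written in the one-line format this article describes:

```python
# Hypothetical mini-dump in the one-line-per-element format.
tree = """# Calendar — 2 elements (0.05s)
[AXButton] "Today" x:100 y:50 w:60 h:24 visible
[AXCell] "Tomorrow, 14 June" x:400 y:300 w:120 h:80 visible"""

# Role filter plus substring match: no vision pass, no OCR.
hits = [ln for ln in tree.splitlines()
        if ln.startswith("[AXCell]") and "Tomorrow" in ln]
print(hits[0])  # [AXCell] "Tomorrow, 14 June" x:400 y:300 w:120 h:80 visible
```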
5. Action + re-traverse in one call

click_and_traverse takes either an element string or x,y,w,h. It synthesizes the click via CGEvent, then re-walks the tree.

Round-trip is one MCP call. The model gets the new tree as part of the same response and decides the next move.
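When the model passes a frame rather than an element string, the natural click target is the frame's center. A hypothetical helper (click_point is not a real tool), using the Send-button frame quoted in the FAQ below:

```python
def click_point(x: int, y: int, w: int, h: int) -> tuple:
    """Center of an element frame: the obvious coordinate for a synthesized click."""
    return (x + w // 2, y + h // 2)

# A Send button with frame x:842 y:1004 w:60 h:32 gets clicked at its center.
print(click_point(842, 1004, 60, 32))  # (872, 1020)
```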
6. Loop until done, then reply

Fazm keeps the loop running until the model is satisfied, then writes a chat reply that quotes the actual data it read out of the tree.

For the Calendar example: "You have Standup at 9, Lunch with Sara at 12, Dentist at 4." Pulled from AXStaticText nodes inside AXCell children, not OCR.

Numbers from one real traversal

441
AX elements in one Fazm Dev window dump
0.72s
Wall clock time to traverse all 441
6
MCP tools, all suffixed _and_traverse

Numbers come from the header of /tmp/macos-use/1776457209163_refresh_traversal.txt and the binary's own startup log. Reproduce with ls /tmp/macos-use after Fazm runs an action.

See your own tree in /tmp/macos-use

Install Fazm, ask it to open any app, then peek inside /tmp/macos-use. The .txt file is the same text the model just read. The .png is what your screen looked like at that moment.

Download Fazm
1 round-trip

The model never has to call a separate 'see what changed' tool after a click. Round-trip cost goes from two MCP calls to one, and the tree is always less than a second old.

mcp-server-macos-use v1.6.0, six tools all suffixed _and_traverse

Frequently asked questions

What is an accessibility tree, in plain language?

Every windowed app on macOS, Windows, and Linux exposes a hierarchical map of its UI to the operating system. Roots are the desktop and application windows. Children are menus, buttons, text fields, lists, and so on. Each node has a role (AXButton, AXMenuItem, AXTextField), a title, sometimes a value, and a frame in screen coordinates. VoiceOver, NVDA, and Orca read it to describe the screen out loud. A desktop automation tool can read it for the same reason: it is a structured, machine-friendly description of what is on screen, without any pixel parsing.

Why use the accessibility tree instead of screenshots for desktop automation?

Three reasons. First, it is text, not pixels. A typical app's tree is a few kilobytes; a 4K screenshot is tens of megabytes uncompressed. Second, the LLM does not have to guess where 'the Send button' is by interpreting an image. The button is right there in the tree, with a role of AXButton, a title of 'Send', and a frame of x:842 y:1004 w:60 h:32. Third, the tree includes things pixels cannot: invisible items, role information, accessibility hints, and the exact bounding box. Pixels alone require a vision model to read back the UI, which is slower and more expensive on every step.

Where is Fazm's accessibility tree code, and how can I verify it?

Fazm bundles a standalone Swift-native MCP server inside the app: Contents/MacOS/mcp-server-macos-use, version 1.6.0. The Node.js bridge that registers it lives at acp-bridge/src/index.ts, line 63 (binary path) and lines 1056 to 1064 (registration). The list of built-in MCP servers is hardcoded as fazm_tools, playwright, macos-use, whatsapp, google-workspace at line 1266. You can run the binary directly: pipe a JSON-RPC tools/list request into it and it will print the six tool names.

What does an accessibility tree dump from Fazm actually look like?

One line per AX element. The header shows the application name, element count, and traversal time, e.g. '# Fazm Dev — 441 elements (0.72s)'. Each element line is '[AXRole (subrole)] "text" x:N y:N w:W h:H visible'. A real example: '[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30'. Negative coordinates appear when you have multiple displays. The 'visible' flag at the end is set when the element is currently rendered, not just attached to the tree. The full file is saved to /tmp/macos-use/<timestamp>_<tool>.txt and a screenshot to a sibling .png.

What is the 'and_traverse' suffix on every macos-use tool?

All six tools end in _and_traverse: open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, refresh_traversal. The suffix is not decorative. It means the tool performs the requested action and then immediately rebuilds the accessibility tree, returning both in a single response. The model never has to call a separate 'see what changed' tool after a click. Round-trip cost goes from two MCP calls to one, and the tree the model reasons against is always less than a second old.

Does this only work on macOS?

The mcp-server-macos-use binary is macOS-only because it talks to AXUIElement, the native macOS accessibility C API. The same architectural pattern (read accessibility tree, send compact text to model, post action back to OS) works on Windows via UIAutomation and Linux via AT-SPI. Fazm's Mediar group also ships an MIT-licensed Rust library called Terminator that wraps both UIAutomation on Windows and AX on macOS, which is what Fazm uses for cross-platform automation features. None of that requires the user to install a separate runtime: on macOS, the binary is bundled inside Fazm.app.

Do I need to write code to use this?

No. Fazm is a consumer Mac app, not a developer framework. You install it, grant Accessibility permission once (the same TCC dialog VoiceOver uses), and the macos-use MCP server is registered automatically alongside Playwright, WhatsApp, and Google Workspace. You then talk to Fazm in plain English. Behind the scenes, the model picks tools, the bridge routes them, and the accessibility tree text comes back into the LLM's context. No Python, no Node, no manual MCP config. Free to start.

How does this compare to Anthropic Computer Use, OpenAI Operator, and screenshot-based agents?

Anthropic's reference Computer Use implementation and OpenAI Operator are both pixel-first: they capture the screen, send it to a vision model, and click on coordinates the model returns. That is general but slow, expensive, and blind to non-visible structure. Accessibility-tree automation reads structured text, which is faster and lets you click 'Send' by name rather than by image patch. The trade-off is that some apps (Qt, OpenGL games, certain Electron apps) implement the AX protocol poorly, and you fall back to pixels. Fazm uses both: AX first, vision when AX is empty.
