Accessibility Tree Computer Use: six signals a screenshot cannot carry
Every thread about accessibility tree computer use stops at "it uses fewer tokens than a screenshot." That is the thin version. A single tree line carries six discrete signals a JPEG simply cannot carry. This is the long version, with a real 441-element dump from the Swift binary Fazm bundles.
The thin argument, and the real one
Search "accessibility tree computer use" on Google and you get the same pitch six times: accessibility trees are structured text, so they use 93% fewer tokens than a full-screen JPEG. That is true. That is not the reason to care.
A tree line is not a compressed picture. It is a different kind of evidence. It carries explicit information a pixel image does not have, no matter how many tokens you throw at it. The model on the other end is not just reading a smaller input; it is reading a richer one.
The rest of this page counts the specific signals, shows a real dump, and walks through how an agent grounds "click send" on a line of text.
Six signals every tree line carries
A vision model can guess most of these from a pixel patch. It will be wrong often enough to matter. The tree makes every one of them explicit.
1. Role
AXButton, AXTextField, AXMenuItem, AXCell. A vision model has to infer role from shape. The tree states it.
2. Subrole
close button, pop up button, search field. The qualifier that follows the role, and the thing that disambiguates three AXButtons in the same window.
3. Accessible title
The kAXTitleAttribute string. Localized, stable, present even for icon-only buttons. Pixels would need OCR plus semantic guessing.
4. Point-based frame
x, y, w, h in CGFloat points. Retina stable. Negative values are legal on multi-monitor setups. No scaling math required.
5. State
Visible, enabled, focused, selected. Explicit in the tree. A JPEG can only hint at a disabled button through greyed-out pixels the model has to interpret; the tree states it outright.
6. Hierarchy
Every element sits under a parent. 'The second cell in the third row' is trivial to express. In a screenshot it is a geometry problem.
Anatomy of one line
Here is a real line from a traversal of an open Mac app, exactly as the Swift binary emits it and exactly as the LLM reads it:
[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30
Five visible fields plus one implicit one (hierarchy, from the line's position relative to its parent). Every field maps to a macOS accessibility attribute the OS already tracks for VoiceOver.
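To make the field layout concrete, here is a minimal Python sketch that splits a line of this format into its named parts. The regex is inferred from the sample line above, not from a published grammar, so treat it as an illustration rather than the binary's actual parser.

```python
import re

# Minimal sketch, assuming the line grammar shown above; the real emitter's
# format may differ in edge cases (missing subrole, empty title, etc.).
LINE = re.compile(
    r'\[(?P<role>\w+)(?:\s+\((?P<subrole>[^)]+)\))?\]\s+'  # role + optional subrole
    r'"(?P<title>[^"]*)"\s+'                               # accessible title
    r'x:(?P<x>-?\d+)\s+y:(?P<y>-?\d+)\s+'                  # origin; negative y is legal
    r'w:(?P<w>\d+)\s+h:(?P<h>\d+)'                         # size in points
)

def parse(line: str) -> dict:
    """Split one tree line into its named fields."""
    m = LINE.search(line)
    if m is None:
        raise ValueError(f"unrecognized tree line: {line!r}")
    fields = m.groupdict()
    for key in ("x", "y", "w", "h"):
        fields[key] = int(fields[key])
    return fields

el = parse('[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30')
# el["role"] == "AXMenuBarItem", el["subrole"] == "menu bar item", el["y"] == -2160
```

Six dictionary keys, one regex pass, no vision model anywhere in the loop.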
Why this is hard to fake: the file at /tmp/macos-use/<ts>_<tool>.txt lands on your disk after any Fazm tool call. You can open it and grep it. No other consumer Mac AI app exposes this format publicly. If the model claims it read something, the file either contains it or it does not.
Same element, two different kinds of evidence
For the menu bar Apple logo above, here is what a pixel-first agent holds in context versus what a tree-first agent holds:
Apple menu bar item
PIXEL VIEW: what the vision model gets for the same element
--------------------------------------------------------------
[raw RGB patch, ~34x30 pixels]
... pixel values ...
model has to infer:
- is this clickable?
- what role does it have?
- where exactly does it start?
- what is its text (even if icon-only)?
- is it visible or attached?
- what is its parent?
A real 441-element dump
This is the header plus the first fifteen lines of a real traversal of an open Fazm Dev window. Notice the negative y-coordinates. This Mac has a second monitor positioned above the laptop screen, so the menu bar lives at y = -2160 points. The model does not need to know why. It just reads the line and clicks.
Full file is a few kilobytes. That is the entire substrate the model reasons against for this app. No vision pass, no OCR, no screenshot encoding.
The numbers from the dump above
These are not benchmarks. They are the numbers a real traversal reports in its own header line and file system metadata.
Compare to a typical screenshot-first round trip: capture the frame, base64-encode it, ship it to a vision model, wait for a coordinates response. That is 1.5 seconds on a good day and 3 on a bad one, before the model has even decided what to do.
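The arithmetic behind that comparison, using only the figures this piece quotes (a 0.72-second walk of 441 elements versus a 1.5-to-3-second screenshot round trip), is straightforward:

```python
# Back-of-envelope latency math using the figures quoted in this piece.
TREE_S = 0.72                    # full 441-element walk, serialization included
SHOT_GOOD_S, SHOT_BAD_S = 1.5, 3.0  # screenshot capture + vision round trip
STEPS = 20                       # a typical multi-step workflow

saved_good = STEPS * (SHOT_GOOD_S - TREE_S)
saved_bad = STEPS * (SHOT_BAD_S - TREE_S)
print(f"20-step savings: {saved_good:.1f}s to {saved_bad:.1f}s")
```

Tens of seconds per workflow, before counting the vision tokens that were never spent.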
How the six signals reach the model
Every piece of text the LLM reasons against has a path from the OS into its context window. Fazm's bundled binary is the pipe.
AX API → Swift binary → LLM context
How 'click send' grounds on a tree line
User types a plain-English request
They say: "Click send."
Model asks the bridge for the current tree
macos-use_refresh_traversal. The Swift binary walks the frontmost app's AX tree depth-first.
Model substring-searches the text
It substring-searches for an AXButton line whose quoted title contains "Send".
Model extracts the frame from that line
It reads x, y, w, h directly from the line it matched. No inference. No pixel math.
Act and re-observe in one call
macos-use_click_and_traverse. The binary synthesizes a CGEvent click, then immediately re-walks the tree.
Loop until the model is done
Each iteration is act + re-observe in a single round trip. No screenshots unless AX goes empty.
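Steps 3 and 4 above can be sketched in a few lines of Python. The line format follows the sample earlier in the piece; the helper name and the center-click heuristic are illustrative, not Fazm's actual code.

```python
import re

def find_click_point(tree_text: str, title: str) -> tuple[int, int]:
    """Substring-match an AXButton by title, return the center of its frame."""
    # Pattern assumes the one-line-per-element format shown earlier.
    pattern = re.compile(
        r'\[AXButton[^\]]*\]\s+"' + re.escape(title) +
        r'"\s+x:(-?\d+)\s+y:(-?\d+)\s+w:(\d+)\s+h:(\d+)'
    )
    for line in tree_text.splitlines():
        m = pattern.search(line)
        if m:
            x, y, w, h = map(int, m.groups())
            return (x + w // 2, y + h // 2)  # click the element's center
    raise LookupError(f'no AXButton titled "{title}" in tree')

tree = '[AXButton] "Send" x:100 y:200 w:80 h:30'
# find_click_point(tree, "Send") -> (140, 215)
```

No pixel math, no vision pass: the coordinates come straight off the matched line.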
Six tools, all _and_traverse
Each one performs an action and rebuilds the tree in the same MCP call. Round-trip cost for observe-act-observe is one call, not two.
Verify the binary yourself
The binary lives inside the Fazm bundle. You can pipe a JSON-RPC tools/list request straight into it and watch it announce the same six tools the bridge registers:
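A minimal sketch of that request, in Python. The bundle path is the one quoted later in this piece and may differ on your machine; strict MCP servers may also require an initialize handshake before answering tools/list.

```python
import json

# Path inside the app bundle (an assumption on your machine).
BIN = "/Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use"

# Minimal JSON-RPC 2.0 request asking an MCP server to enumerate its tools.
request = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}})
print(request)

# On a Mac with Fazm installed, pipe it in with subprocess and read the reply,
# which should name the six *_and_traverse tools:
#   import subprocess
#   out = subprocess.run([BIN], input=request + "\n", capture_output=True, text=True)
#   print(out.stdout)
```

The same one-liner works from a shell with echo and a pipe; the point is that the binary is an ordinary stdio MCP server you can interrogate yourself.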
What an accessibility tree line carries that a JPEG cannot
- AX role (AXButton, AXTextField, AXMenuItem...)
- AX subrole (close button, pop up button, search field...)
- Accessible title, localized, survives icon-only buttons
- Frame in CGFloat points, Retina-stable
- visible flag, distinct from attached-but-hidden
- Parent pointer (implicit from depth-first line order)
- AXEnabled / AXSelected / AXFocused state
- Accessibility hint string for unlabeled controls
Tree-first vs pixel-first computer use
A line-by-line accounting of what each primitive gives the model, for the same task.
| Feature | Screenshot + vision | Accessibility tree |
|---|---|---|
| Input the model reasons against | ~500 KB to 5 MB base64 PNG | ~3 to 10 KB of UTF-8 text, structured |
| Signals per element | 3 fuzzy (visual patch, approximate position, rendered text) | 6 (role, subrole, title, frame, state, parent) |
| Off-screen / attached elements | invisible by definition | present in tree, visible flag is false |
| Icon-only buttons | OCR fails, model guesses from shape | title comes from AX, e.g. "Close" |
| Click target derivation | vision model returns coordinates | read x,y,w,h from the matched line |
| Observe-act-observe | 2 calls: action + screenshot | 1 MCP call via _and_traverse |
| Retina / multi-monitor | image scales linearly, coordinates shift | CGFloat points, negative y legal |
| Setup | varies, usually developer framework | install Fazm, grant Accessibility once |
Where the tree runs out
Not every Mac app implements AX well. Some Electron builds expose only the window shell. Qt apps often ship a half-broken bridge. SDL games and Metal canvases return nothing. When the tree text comes back empty or suspiciously short, Fazm switches to the sibling screenshot and the model issues coordinate-based clicks against pixels instead.
That is the hybrid mode most production computer-use agents converge on after a few months in the wild. Start with the tree, because it is cheaper and richer. Fall back to pixels only when the tree has nothing to say. The goal is not to purity-test one approach. It is to pay for vision tokens only when they actually buy information.
Roughly speaking: menu-driven native apps (Mail, Calendar, Finder, Slack, Safari, Messages) are almost entirely tree-navigable. Chrome and Arc fall in between, with the DOM reflected through AX for the page but some chrome elements ambiguous. Games and fully custom canvases drop to pixel mode.
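The fallback decision itself can be sketched as a tiny heuristic. The element-count threshold below is an illustrative guess, not Fazm's actual rule:

```python
def choose_substrate(tree_text: str, min_elements: int = 5) -> str:
    """Return 'tree' when the AX dump looks usable, else 'pixels'.

    Illustrative heuristic: count lines that look like AX elements and
    fall back to the screenshot when there are too few to navigate by.
    """
    lines = [ln for ln in tree_text.splitlines() if ln.strip().startswith("[AX")]
    return "tree" if len(lines) >= min_elements else "pixels"

# An Electron app exposing only the window shell falls back to pixels:
# choose_substrate('[AXWindow] "app" x:0 y:0 w:800 h:600') -> 'pixels'
```

Cheap to evaluate on every step, which is why hybrid agents can afford to re-ask the question after each action.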
Want to see the tree on your own Mac?
30 minutes on a call. We open /tmp/macos-use/ together and show you what the model actually reads when it runs your first workflow.
Book a call →
Frequently asked questions
Is accessibility tree computer use actually different from Anthropic's Computer Use?
Anthropic's reference Computer Use is screenshot first. The model receives a base64 PNG of the screen plus a list of coordinate-based actions (mouse_move, left_click, type, key). It decides where to click by interpreting pixels. Accessibility tree computer use flips the default input. The model receives a compact text listing of every on-screen element, each with a role, a localized title, a frame, and a visible flag. It decides where to click by substring-searching that text. Same computer, same task surface, very different primitive. Fazm uses the tree first and only falls back to pixels when the tree is empty (some Qt apps, OpenGL games).
What are the six signals an accessibility tree line carries that a pixel image does not?
Look at a single line from a real dump: [AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30. That one line carries (1) the AX role, AXMenuBarItem; (2) the disambiguating subrole, menu bar item; (3) the accessible title, Apple, which survives localization and icon-only buttons; (4) a CGFloat-point frame that is stable across Retina scaling and multi-monitor negative coordinates; (5) implicit state (visible is the explicit flag at the end of the line, but presence in the tree also implies AXEnabled and parent focus); (6) hierarchy (the line immediately above is the parent AXMenuBar). A JPEG of the same menu bar pixel area carries only RGB values and has to guess every single one of those signals.
What file should I open to verify what the model is reading?
After any tool call Fazm makes, look in /tmp/macos-use/. You will see pairs of files timestamped by millisecond: <ts>_<tool>.txt (the tree) and <ts>_<tool>.png (the screenshot). Open the .txt in any editor. That is the exact string the LLM has in context. For the computer-use agent story, that file is ground truth: if you can grep the element the model clicked, the model read it from the tree. If the element isn't in the .txt but is in the .png, the tree failed and the model fell back to vision.
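A small sketch of how those dumps group into (tree, screenshot) pairs by their shared stem. The filenames below are made up; the pairing logic is the point:

```python
from collections import defaultdict
from pathlib import Path

def pair_dumps(names: list[str]) -> dict[str, dict[str, str]]:
    """Group <ts>_<tool>.txt / <ts>_<tool>.png files by their shared stem."""
    pairs: dict[str, dict[str, str]] = defaultdict(dict)
    for name in names:
        p = Path(name)
        if p.suffix in (".txt", ".png"):
            pairs[p.stem][p.suffix] = name
    return dict(pairs)

# Illustrative filenames, not real dump names:
names = ["1712000000123_click.txt", "1712000000123_click.png"]
# pair_dumps(names)["1712000000123_click"] holds both the .txt and the .png
```

Point the same function at a real listing of /tmp/macos-use/ and each key is one tool call, with its tree and screenshot side by side.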
Why does 'visible' deserve its own field if the frame is already on the line?
Because AX trees contain elements that exist but are not rendered. Menu items under a collapsed menu, rows below the scroll clip, popover children that are attached but not yet visible on-screen. Their frame is still meaningful (the OS knows where they would draw), but acting on them will silently fail. The visible flag lets the model filter: when the user asks to click Send, the agent picks the AXButton with title Send and visible set to true. A screenshot cannot represent this at all. It either shows the element or doesn't. There is no 'attached but hidden' state in a 2D image.
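That filter is a one-liner in practice. The trailing visible:true token is an assumption about the dump format, following the description above:

```python
def visible_send(tree_text: str):
    """Return the first visible AXButton line titled Send, or None."""
    for line in tree_text.splitlines():
        if '[AXButton' in line and '"Send"' in line and 'visible:true' in line:
            return line
    return None

tree = (
    '[AXButton] "Send" x:10 y:10 w:60 h:24 visible:false\n'  # attached, hidden
    '[AXButton] "Send" x:300 y:500 w:60 h:24 visible:true'   # the real target
)
# visible_send(tree) returns the second line, skipping the hidden sibling
```

The hidden sibling is exactly the element a screenshot-based agent can never reason about, because it was never drawn.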
How fast is this compared to taking a screenshot every step?
A real Fazm Dev dump ran 441 elements in 0.72 seconds. That is the full walk, including the bundled Swift binary's serialization and file write. A 4K screenshot capture plus a vision model pass (base64 encode, API round trip, token cost) is usually 1.5 to 3 seconds before the model has read anything useful. Over a 20-step workflow, tree-first saves tens of seconds and cuts vision-token cost to zero unless the tree actually is empty.
What does the '_and_traverse' suffix actually do for computer use?
All six macos-use tools end in _and_traverse: open_application, click, type, press_key, scroll, and refresh. The suffix means the tool performs the requested action and immediately rebuilds the AX tree, returning both in the same MCP response. For a computer use agent, this collapses the observe-act-observe loop to a single round trip. Screenshot-based agents have to call a separate screenshot tool after every action; tree-first agents get the new world state free.
Where exactly is the binary, and can I check it myself?
Inside Fazm.app at Contents/MacOS/mcp-server-macos-use, Mach-O 64-bit arm64, ~21 MB, version 1.6.0. It is registered by the Node bridge at acp-bridge/src/index.ts line 63 (path resolution) and lines 1056 through 1064 (conditional registration). The full built-in MCP set is hardcoded at line 1266: fazm_tools, playwright, macos-use, whatsapp, google-workspace. You can pipe a JSON-RPC tools/list request into the binary directly and it will print the six _and_traverse tool names.
What happens in apps that implement AX poorly?
Some Electron builds, a lot of Qt apps, SDL games, and anything rendering to a raw OpenGL or Metal canvas either expose an empty tree or return only the window shell. Fazm treats those as the fallback path: when the tree text is empty or suspiciously short, the model reads the sibling .png instead and issues coordinate-based clicks. That is the hybrid mode most production computer-use agents converge on. Start with the tree, fall back to pixels, never pay for vision tokens you don't need.
Do I need to write code to use Fazm's accessibility tree computer use?
No. Fazm is a consumer Mac app, not a developer framework. Install it, grant Accessibility permission once (the same TCC dialog VoiceOver uses), and the macos-use MCP server is registered automatically alongside Playwright, WhatsApp, and Google Workspace. You then talk to Fazm in plain English. Behind the scenes the model picks tools, the bridge routes them, and the tree text enters the model's context. No Python, no Node, no manual MCP config, free to start.
Related reading
Adjacent angles on accessibility-tree-based computer use.
Accessibility Tree Desktop Automation: the Text Format an LLM Reads
The other half of the story. Where the bundled binary lives, what the six tools do, and how the one-line-per-element format is shaped.
macOS Accessibility API vs Screenshot Agents
Where the 50ms-vs-2500ms gap comes from, and when the latency difference stops mattering.
Hybrid Desktop Automation: Vision plus Accessibility (2026)
Why production computer-use agents end up using both layers and how Fazm decides which one to read first.