Accessibility Tree Computer Use: six signals a screenshot cannot carry
Every thread about accessibility tree computer use stops at "it uses fewer tokens than a screenshot." That is the thin version. A single tree line carries six discrete signals a JPEG simply cannot carry. This is the long version, with a real 441-element dump from the Swift binary Fazm bundles.
The thin argument, and the real one
Search "accessibility tree computer use" on Google and you get the same pitch six times: accessibility trees are structured text, so they use 93% fewer tokens than a full-screen JPEG. That is true. That is not the reason to care.
A tree line is not a compressed picture. It is a different kind of evidence. It carries explicit information a pixel image does not have, no matter how many tokens you throw at it. The model on the other end is not just reading a smaller input; it is reading a richer one.
The rest of this page counts the specific signals, shows a real dump, and walks through how an agent grounds "click send" on a line of text.
Six signals every tree line carries
A vision model can guess most of these from a pixel patch. It will be wrong often enough to matter. The tree makes every one of them explicit.
1. Role
AXButton, AXTextField, AXMenuItem, AXCell. A vision model has to infer role from shape. The tree states it.
2. Subrole
close button, pop up button, search field. The qualifier that follows the role, and the thing that disambiguates three AXButtons in the same window.
3. Accessible title
The kAXTitleAttribute string. Localized, stable, present even for icon-only buttons. Pixels would need OCR plus semantic guessing.
4. Point-based frame
x, y, w, h in CGFloat points. Retina stable. Negative values are legal on multi-monitor setups. No scaling math required.
5. State
Visible, enabled, focused, selected. Explicit in the tree. A JPEG can only hint at a disabled button through greyed-out pixels the model has to interpret; the tree states it outright.
6. Hierarchy
Every element sits under a parent. 'The second cell in the third row' is trivial to express. In a screenshot it is a geometry problem.
Anatomy of one line
Here is a real line from a traversal of an open Mac app, exactly as the Swift binary emits it and exactly as the LLM reads it:
[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30
Five visible fields plus one implicit one (hierarchy, from the line's position relative to its parent). Every field maps to a macOS accessibility attribute the OS already tracks for VoiceOver.
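To make the field layout concrete, here is a minimal Python sketch that splits a line of this format into its named parts. The regex is inferred from the sample line above, not from a published grammar, so treat it as an illustration rather than the binary's actual parser.

```python
import re

# Minimal sketch, assuming the line grammar shown above; the real emitter's
# format may differ in edge cases (missing subrole, empty title, etc.).
LINE = re.compile(
    r'\[(?P<role>\w+)(?:\s+\((?P<subrole>[^)]+)\))?\]\s+'  # role + optional subrole
    r'"(?P<title>[^"]*)"\s+'                               # accessible title
    r'x:(?P<x>-?\d+)\s+y:(?P<y>-?\d+)\s+'                  # origin; negative y is legal
    r'w:(?P<w>\d+)\s+h:(?P<h>\d+)'                         # size in points
)

def parse(line: str) -> dict:
    """Split one tree line into its named fields."""
    m = LINE.search(line)
    if m is None:
        raise ValueError(f"unrecognized tree line: {line!r}")
    fields = m.groupdict()
    for key in ("x", "y", "w", "h"):
        fields[key] = int(fields[key])
    return fields

el = parse('[AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30')
# el["role"] == "AXMenuBarItem", el["subrole"] == "menu bar item", el["y"] == -2160
```

Six dictionary keys, one regex pass, no vision model anywhere in the loop.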
Why this is hard to fake: the file at /tmp/macos-use/<ts>_<tool>.txt lands on your disk after any Fazm tool call. You can open it and grep it. No other consumer Mac AI app exposes this format publicly. If the model claims it read something, the file either contains it or it does not.
Same element, two different kinds of evidence
For the menu bar Apple logo above, here is what a pixel-first agent holds in context versus what a tree-first agent holds:
Apple menu bar item
PIXEL VIEW: what the vision model gets for the same element
--------------------------------------------------------------
[raw RGB patch, ~34x30 pixels]
... pixel values ...
model has to infer:
- is this clickable?
- what role does it have?
- where exactly does it start?
- what is its text (even if icon-only)?
- is it visible or attached?
- what is its parent?
A real 441-element dump
This is the header plus the first fifteen lines of a real traversal of an open Fazm Dev window. Notice the negative y-coordinates. This Mac has a second monitor positioned above the laptop screen, so the menu bar lives at y = -2160 points. The model does not need to know why. It just reads the line and clicks.
Full file is a few kilobytes. That is the entire substrate the model reasons against for this app. No vision pass, no OCR, no screenshot encoding.
The numbers from the dump above
These are not benchmarks. They are the numbers a real traversal reports in its own header line and file system metadata.
Compare to a typical screenshot-first round trip: capture the frame, base64-encode it, ship it to a vision model, wait for a coordinates response. That is 1.5 seconds on a good day and 3 on a bad one, before the model has even decided what to do.
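The arithmetic behind that comparison, using only the figures this piece quotes (a 0.72-second walk of 441 elements versus a 1.5-to-3-second screenshot round trip), is straightforward:

```python
# Back-of-envelope latency math using the figures quoted in this piece.
TREE_S = 0.72                    # full 441-element walk, serialization included
SHOT_GOOD_S, SHOT_BAD_S = 1.5, 3.0  # screenshot capture + vision round trip
STEPS = 20                       # a typical multi-step workflow

saved_good = STEPS * (SHOT_GOOD_S - TREE_S)
saved_bad = STEPS * (SHOT_BAD_S - TREE_S)
print(f"20-step savings: {saved_good:.1f}s to {saved_bad:.1f}s")
```

Tens of seconds per workflow, before counting the vision tokens that were never spent.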
How the six signals reach the model
Every piece of text the LLM reasons against has a path from the OS into its context window. Fazm's bundled binary is the pipe.
AX API → Swift binary → LLM context
How 'click send' grounds on a tree line
User types a plain-English request
They say: "Click send."
Model asks the bridge for the current tree
macos-use_refresh_traversal. The Swift binary walks the frontmost app's AX tree depth-first.
Model substring-searches the text
It substring-searches for an AXButton line whose quoted title contains "Send".
Model extracts the frame from that line
It reads x, y, w, h directly from the line it matched. No inference. No pixel math.
Act and re-observe in one call
macos-use_click_and_traverse. The binary synthesizes a CGEvent click, then immediately re-walks the tree.
Loop until the model is done
Each iteration is act + re-observe in a single round trip. No screenshots unless AX goes empty.
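Steps 3 and 4 above can be sketched in a few lines of Python. The line format follows the sample earlier in the piece; the helper name and the center-click heuristic are illustrative, not Fazm's actual code.

```python
import re

def find_click_point(tree_text: str, title: str) -> tuple[int, int]:
    """Substring-match an AXButton by title, return the center of its frame."""
    # Pattern assumes the one-line-per-element format shown earlier.
    pattern = re.compile(
        r'\[AXButton[^\]]*\]\s+"' + re.escape(title) +
        r'"\s+x:(-?\d+)\s+y:(-?\d+)\s+w:(\d+)\s+h:(\d+)'
    )
    for line in tree_text.splitlines():
        m = pattern.search(line)
        if m:
            x, y, w, h = map(int, m.groups())
            return (x + w // 2, y + h // 2)  # click the element's center
    raise LookupError(f'no AXButton titled "{title}" in tree')

tree = '[AXButton] "Send" x:100 y:200 w:80 h:30'
# find_click_point(tree, "Send") -> (140, 215)
```

No pixel math, no vision pass: the coordinates come straight off the matched line.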
Six tools, all _and_traverse
Each one performs an action and rebuilds the tree in the same MCP call. Round-trip cost for observe-act-observe is one call, not two.
Verify the binary yourself
The binary lives inside the Fazm bundle. You can pipe a JSON-RPC tools/list request straight into it and watch it announce the same six tools the bridge registers:
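A minimal sketch of that request, in Python. The bundle path is the one quoted later in this piece and may differ on your machine; strict MCP servers may also require an initialize handshake before answering tools/list.

```python
import json

# Path inside the app bundle (an assumption on your machine).
BIN = "/Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use"

# Minimal JSON-RPC 2.0 request asking an MCP server to enumerate its tools.
request = json.dumps({"jsonrpc": "2.0", "id": 1, "method": "tools/list", "params": {}})
print(request)

# On a Mac with Fazm installed, pipe it in with subprocess and read the reply,
# which should name the six *_and_traverse tools:
#   import subprocess
#   out = subprocess.run([BIN], input=request + "\n", capture_output=True, text=True)
#   print(out.stdout)
```

The same one-liner works from a shell with echo and a pipe; the point is that the binary is an ordinary stdio MCP server you can interrogate yourself.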
What an accessibility tree line carries that a JPEG cannot
- AX role (AXButton, AXTextField, AXMenuItem...)
- AX subrole (close button, pop up button, search field...)
- Accessible title, localized, survives icon-only buttons
- Frame in CGFloat points, Retina-stable
- visible flag, distinct from attached-but-hidden
- Parent pointer (implicit from depth-first line order)
- AXEnabled / AXSelected / AXFocused state
- Accessibility hint string for unlabeled controls
Tree-first vs pixel-first computer use
A line-by-line accounting of what each primitive gives the model, for the same task.
| Feature | Screenshot + vision | Accessibility tree |
|---|---|---|
| Input the model reasons against | ~500 KB to 5 MB base64 PNG | ~3 to 10 KB of UTF-8 text, structured |
| Signals per element | 3 fuzzy (visual patch, approximate position, rendered text) | 6 (role, subrole, title, frame, state, parent) |
| Off-screen / attached elements | invisible by definition | present in tree, visible flag is false |
| Icon-only buttons | OCR fails, model guesses from shape | title comes from AX, e.g. "Close" |
| Click target derivation | vision model returns coordinates | read x,y,w,h from the matched line |
| Observe-act-observe | 2 calls: action + screenshot | 1 MCP call via _and_traverse |
| Retina / multi-monitor | image scales linearly, coordinates shift | CGFloat points, negative y legal |
| Setup | varies, usually developer framework | install Fazm, grant Accessibility once |
Where the tree runs out
Not every Mac app implements AX well. Some Electron builds expose only the window shell. Qt apps often ship a half-broken bridge. SDL games and Metal canvases return nothing. When the tree text comes back empty or suspiciously short, Fazm switches to the sibling screenshot and the model issues coordinate-based clicks against pixels instead.
That is the hybrid mode most production computer-use agents converge on after a few months in the wild. Start with the tree, because it is cheaper and richer. Fall back to pixels only when the tree has nothing to say. The goal is not to purity-test one approach. It is to pay for vision tokens only when they actually buy information.
Roughly speaking: menu-driven native apps (Mail, Calendar, Finder, Slack, Safari, Messages) are almost entirely tree-navigable. Chrome and Arc fall in between, with the DOM reflected through AX for the page but some chrome elements ambiguous. Games and fully custom canvases drop to pixel mode.
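The fallback decision itself can be sketched as a tiny heuristic. The element-count threshold below is an illustrative guess, not Fazm's actual rule:

```python
def choose_substrate(tree_text: str, min_elements: int = 5) -> str:
    """Return 'tree' when the AX dump looks usable, else 'pixels'.

    Illustrative heuristic: count lines that look like AX elements and
    fall back to the screenshot when there are too few to navigate by.
    """
    lines = [ln for ln in tree_text.splitlines() if ln.strip().startswith("[AX")]
    return "tree" if len(lines) >= min_elements else "pixels"

# An Electron app exposing only the window shell falls back to pixels:
# choose_substrate('[AXWindow] "app" x:0 y:0 w:800 h:600') -> 'pixels'
```

Cheap to evaluate on every step, which is why hybrid agents can afford to re-ask the question after each action.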
Want to see the tree on your own Mac?
30 minutes on a call. We open /tmp/macos-use/ together and show you what the model actually reads when it runs your first workflow.
Book a call →
Frequently asked questions
Is accessibility tree computer use actually different from Anthropic's Computer Use?
Anthropic's reference Computer Use is screenshot first. The model receives a base64 PNG of the screen plus a list of coordinate-based actions (mouse_move, left_click, type, key). It decides where to click by interpreting pixels. Accessibility tree computer use flips the default input. The model receives a compact text listing of every on-screen element, each with a role, a localized title, a frame, and a visible flag. It decides where to click by substring-searching that text. Same computer, same task surface, very different primitive. Fazm uses the tree first and only falls back to pixels when the tree is empty (some Qt apps, OpenGL games).
What are the six signals an accessibility tree line carries that a pixel image does not?
Look at a single line from a real dump: [AXMenuBarItem (menu bar item)] "Apple" x:1681 y:-2160 w:34 h:30. That one line carries (1) the AX role, AXMenuBarItem; (2) the disambiguating subrole, menu bar item; (3) the accessible title, Apple, which survives localization and icon-only buttons; (4) a CGFloat-point frame that is stable across Retina scaling and multi-monitor negative coordinates; (5) implicit state (visible is the explicit flag at the end of the line, but presence in the tree also implies AXEnabled and parent focus); (6) hierarchy (the line immediately above is the parent AXMenuBar). A JPEG of the same menu bar pixel area carries only RGB values and has to guess every single one of those signals.
What file should I open to verify what the model is reading?
After any tool call Fazm makes, look in /tmp/macos-use/. You will see pairs of files timestamped by millisecond: <ts>_<tool>.txt (the tree) and <ts>_<tool>.png (the screenshot). Open the .txt in any editor. That is the exact string the LLM has in context. For the computer-use agent story, that file is ground truth: if you can grep the element the model clicked, the model read it from the tree. If the element isn't in the .txt but is in the .png, the tree failed and the model fell back to vision.
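A small sketch of how those dumps group into (tree, screenshot) pairs by their shared stem. The filenames below are made up; the pairing logic is the point:

```python
from collections import defaultdict
from pathlib import Path

def pair_dumps(names: list[str]) -> dict[str, dict[str, str]]:
    """Group <ts>_<tool>.txt / <ts>_<tool>.png files by their shared stem."""
    pairs: dict[str, dict[str, str]] = defaultdict(dict)
    for name in names:
        p = Path(name)
        if p.suffix in (".txt", ".png"):
            pairs[p.stem][p.suffix] = name
    return dict(pairs)

# Illustrative filenames, not real dump names:
names = ["1712000000123_click.txt", "1712000000123_click.png"]
# pair_dumps(names)["1712000000123_click"] holds both the .txt and the .png
```

Point the same function at a real listing of /tmp/macos-use/ and each key is one tool call, with its tree and screenshot side by side.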
Why does 'visible' deserve its own field if the frame is already on the line?
Because AX trees contain elements that exist but are not rendered. Menu items under a collapsed menu, rows below the scroll clip, popover children that are attached but not yet visible on-screen. Their frame is still meaningful (the OS knows where they would draw), but acting on them will silently fail. The visible flag lets the model filter: when the user asks to click Send, the agent picks the AXButton with title Send and visible set to true. A screenshot cannot represent this at all. It either shows the element or doesn't. There is no 'attached but hidden' state in a 2D image.
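That filter is a one-liner in practice. The trailing visible:true token is an assumption about the dump format, following the description above:

```python
def visible_send(tree_text: str):
    """Return the first visible AXButton line titled Send, or None."""
    for line in tree_text.splitlines():
        if '[AXButton' in line and '"Send"' in line and 'visible:true' in line:
            return line
    return None

tree = (
    '[AXButton] "Send" x:10 y:10 w:60 h:24 visible:false\n'  # attached, hidden
    '[AXButton] "Send" x:300 y:500 w:60 h:24 visible:true'   # the real target
)
# visible_send(tree) returns the second line, skipping the hidden sibling
```

The hidden sibling is exactly the element a screenshot-based agent can never reason about, because it was never drawn.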
How fast is this compared to taking a screenshot every step?
A real Fazm Dev dump ran 441 elements in 0.72 seconds. That is the full walk, including the bundled Swift binary's serialization and file write. A 4K screenshot capture plus a vision model pass (base64 encode, API round trip, token cost) is usually 1.5 to 3 seconds before the model has read anything useful. Over a 20-step workflow, tree-first saves tens of seconds and cuts vision-token cost to zero unless the tree actually is empty.
What does the '_and_traverse' suffix actually do for computer use?
All six macos-use tools end in _and_traverse: open_application, click, type, press_key, scroll, and refresh. The suffix means the tool performs the requested action and immediately rebuilds the AX tree, returning both in the same MCP response. For a computer use agent, this collapses the observe-act-observe loop to a single round trip. Screenshot-based agents have to call a separate screenshot tool after every action; tree-first agents get the new world state free.
Where exactly is the binary, and can I check it myself?
Inside Fazm.app at Contents/MacOS/mcp-server-macos-use, Mach-O 64-bit arm64, ~21 MB, version 1.6.0. It is registered by the Node bridge at acp-bridge/src/index.ts line 63 (path resolution) and lines 1056 through 1064 (conditional registration). The full built-in MCP set is hardcoded at line 1266: fazm_tools, playwright, macos-use, whatsapp, google-workspace. You can pipe a JSON-RPC tools/list request into the binary directly and it will print the six _and_traverse tool names.
What happens in apps that implement AX poorly?
Some Electron builds, a lot of Qt apps, SDL games, and anything rendering to a raw OpenGL or Metal canvas either expose an empty tree or return only the window shell. Fazm treats those as the fallback path: when the tree text is empty or suspiciously short, the model reads the sibling .png instead and issues coordinate-based clicks. That is the hybrid mode most production computer-use agents converge on. Start with the tree, fall back to pixels, never pay for vision tokens you don't need.
Do I need to write code to use Fazm's accessibility tree computer use?
No. Fazm is a consumer Mac app, not a developer framework. Install it, grant Accessibility permission once (the same TCC dialog VoiceOver uses), and the macos-use MCP server is registered automatically alongside Playwright, WhatsApp, and Google Workspace. You then talk to Fazm in plain English. Behind the scenes the model picks tools, the bridge routes them, and the tree text enters the model's context. No Python, no Node, no manual MCP config, free to start.
Related reading
Adjacent angles on accessibility-tree-based computer use.
Accessibility Tree Desktop Automation: the Text Format an LLM Reads
The other half of the story. Where the bundled binary lives, what the six tools do, and how the one-line-per-element format is shaped.
macOS Accessibility API vs Screenshot Agents
Where the 50ms-vs-2500ms gap comes from, and when the latency difference stops mattering.
Hybrid Desktop Automation: Vision plus Accessibility (2026)
Why production computer-use agents end up using both layers and how Fazm decides which one to read first.