AX TREE / NOT A SCREENSHOT

A macOS computer use agent that reads the accessibility tree, not a picture of your screen

The common answer to "give me a macOS computer use agent" in 2026 is a cloud model that screenshots your display, runs OCR and a vision pass over it, and returns pixel coordinates to click. That works and it is expensive. Fazm does the opposite: it walks the live accessibility tree that VoiceOver narrates, hands Claude a list of typed elements with role, text, and frame, and only falls back to pixels when visual reasoning is genuinely needed. This page is the schema, the caps, the exact file paths, and what the JSON payload actually looks like when Claude perceives your Mac.

Matthew Diakonov
11 min read
Written from the MacosUseSDK and Fazm source trees
6-field ElementData schema
2000-element BFS cap per traversal
5-second traversal budget
14 non-interactable roles pruned
Signed Mach-O shipped inside the .app

What most guides on this topic cover, and what they leave out

Search for a macOS computer use agent in April 2026 and you will find three kinds of articles. There are cloud announcements: OpenAI shipped Codex with Mac control on April 16 2026, built on screen-reading plus an action model. There are comparison matrices: Claude versus Codex versus Gemini across latency, reliability, and concurrency. There are research frameworks: the BIGAI MacAgent paper describes a three-tier hierarchical planner with app-specific sub-agents for Word, Excel, Finder, QuickTime, and so on.

All three are about perception through pixels. None of them show what the alternative input shape looks like. The macOS accessibility API has existed since Mac OS X 10.2; VoiceOver has driven every shipping Mac app through it for two decades. It is the highest-fidelity structured representation of a Mac window that does not require a vision model. The pages that rank for this topic do not show the schema, do not show the caps, and do not show what the payload looks like before it reaches the model.

The rest of this guide is that schema, those caps, and that payload, lifted straight from the MacosUseSDK source and the Fazm ACP bridge.

6 fields

ElementData is the entire schema a language model sees per UI node: role, text, x, y, width, height. No color, no font, no bitmap. The list is sorted y-first, then x, so elements arrive in reading order.

MacosUseSDK/Sources/MacosUseSDK/AccessibilityTraversal.swift lines 32-57
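A minimal Swift sketch of that six-field record, assuming a plain Codable struct (field names come from this page; the real declaration at lines 32 to 57 may differ in optionality and extra metadata):

```swift
import Foundation

// Sketch of the six-field element record described above.
// Field names follow this page; the SDK's exact declaration may differ.
struct ElementData: Codable {
    let role: String      // e.g. "AXButton", "AXTextField"
    let text: String?     // concatenated label attributes, if any
    let x: Double?
    let y: Double?
    let width: Double?
    let height: Double?
}

let button = ElementData(role: "AXButton", text: "Send",
                         x: 812, y: 44, width: 64, height: 28)
let encoder = JSONEncoder()
encoder.outputFormatting = [.sortedKeys]
let json = String(data: try encoder.encode(button), encoding: .utf8)!
print(json)
```

Everything the model learns about a button fits in that one flat record; there is nothing to downscale and nothing to OCR.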

Three input shapes a computer use agent can feed a model

Diagram: a PNG bitmap, a DOM tree, or the AX tree flows into Claude Sonnet 4.6, which responds with click (x, y), type text, or press key actions.

The schema Claude actually perceives

Every time Claude asks the agent "what is on screen right now", the macos-use MCP server runs a BFS over the frontmost app's accessibility tree and returns a list of elements. Each element is a six-field record. That is the entire shape. No nested children, no parent pointers, no bitmap data. The tree is flattened into a list, sorted y-first so the top of the window arrives first and Claude reads it like a page.

MacosUseSDK/Sources/MacosUseSDK/AccessibilityTraversal.swift
acp-bridge/dist payload (illustrative)
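The y-first ordering described above can be sketched with a toy sort (illustrative types, not the SDK's own code):

```swift
import Foundation

struct Element {
    let role: String
    let text: String
    let x: Double
    let y: Double
}

// Toy elements out of order; the traversal sorts y-first, then x,
// so the window reads top-to-bottom, left-to-right like a page.
var elements = [
    Element(role: "AXButton",    text: "OK",     x: 300, y: 200),
    Element(role: "AXStaticText", text: "Title", x: 20,  y: 10),
    Element(role: "AXButton",    text: "Cancel", x: 200, y: 200),
    Element(role: "AXTextField", text: "Name",   x: 20,  y: 80),
]

elements.sort { ($0.y, $0.x) < ($1.y, $1.x) }
print(elements.map { $0.text })  // top of the window arrives first
```

The flattening is deliberate: with no nested children and no parent pointers, the model never has to walk a tree, only read a page.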

The caps that keep the payload cheap

Left unconstrained, an accessibility walk can fall into a cycle or balloon into tens of thousands of nodes on a busy app. The traversal is bounded in three ways: a BFS that stops at depth 100, a global cap of 2000 collected elements, and a hard stopwatch of five seconds. Whichever hits first ends the walk and sets stats.truncated to true so the caller knows the tree was clipped.

A second trim runs at insertion time. Fourteen AX roles are considered non-interactable by default and skipped unless the caller asks for them. They are structural containers like AXGroup, presentational atoms like AXStaticText and AXHeading, and chrome like AXToolbar and AXScrollArea. Keeping them out cuts a typical traversal from several thousand raw nodes down to a few hundred actionable ones.

MacosUseSDK/Sources/MacosUseSDK/AccessibilityTraversal.swift
2000 max elements per traversal (BFS cap)
100 max tree depth before the walker gives up
5-second hard stopwatch on a single read
14 non-interactable AX roles pruned by default
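The three caps and the insertion-time pruning combine into a bounded BFS. As a hedged sketch over a toy tree (constants from this page; the Node type, role subset, and function shape are illustrative, not the SDK's):

```swift
import Foundation

// Toy node standing in for an AXUIElement; the real walker reads
// children through the accessibility API.
final class Node {
    let role: String
    let children: [Node]
    init(_ role: String, _ children: [Node] = []) {
        self.role = role; self.children = children
    }
}

// Caps named on this page: 2000 elements, depth 100, 5-second budget.
let maxElements = 2000
let maxDepth = 100
let maxSeconds = 5.0
// Roles skipped at insertion time (a subset of the 14 listed here).
let nonInteractable: Set<String> = ["AXGroup", "AXStaticText",
    "AXScrollArea", "AXToolbar", "AXSeparator", "AXHeading"]

func traverse(_ root: Node) -> (roles: [String], truncated: Bool) {
    let deadline = Date().addingTimeInterval(maxSeconds)
    var collected: [String] = []
    var truncated = false
    var queue: [(Node, Int)] = [(root, 0)]
    while let (node, depth) = queue.first {
        queue.removeFirst()
        if collected.count >= maxElements || depth > maxDepth
            || Date() > deadline {
            truncated = true       // caller sees stats.truncated = true
            break
        }
        if !nonInteractable.contains(node.role) {
            collected.append(node.role)   // insertion-time pruning
        }
        for child in node.children { queue.append((child, depth + 1)) }
    }
    return (collected, truncated)
}

let window = Node("AXWindow", [
    Node("AXGroup", [Node("AXButton"), Node("AXStaticText")]),
    Node("AXTextField"),
])
let result = traverse(window)
print(result.roles)  // interactable roles only; AXGroup's children survive
```

Note that pruning skips the container but still enqueues its children, which is why an AXButton inside an AXGroup still reaches the model.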

One perception step, end to end


1. Claude decides it needs to see the frontmost window

Session on claude-sonnet-4-6 emits a tool call to macos-use_refresh_traversal with the current PID.

Accessibility tree versus screenshot, same Mac window

Neither approach is universally better. The two primitives have different failure modes and different cost structures. For the common case of a native Mac app that implements the accessibility API, the AX tree wins on every axis except visual fidelity. For the cases where pixels actually matter (PDFs, figures, Canva, a design file in Figma), screenshots are the right primitive and Fazm keeps that path open through ScreenCaptureManager.

AX tree input vs screenshot input for a language model

Rough numbers for a standard Finder or Mail window on a 14-inch MacBook Pro. They shift with window size and model.

Feature | Screenshot + vision | AX tree (Fazm default)
Payload shape | PNG bitmap, downscaled to model image budget | JSON list, 6 typed fields per element
Typical perception cost per window | ~1 to 3 MB PNG, full image token budget | ~10 to 40 KB of text, a few hundred tokens
Element targeting | Pixel coordinates from a vision pass | Exact frame from the app itself
Dark mode, high-DPI, custom themes | Requires the vision model to handle it | Unaffected; the app reports its own semantics
Dialogs partially off-screen | Clipped pixels, partial OCR | Still returned; frame is reported
Non-Latin text, ligatures, emoji in labels | OCR approximation | Exact string from kAXTitleAttribute
PDFs, images, hand-drawn figures | Only path that works | Not readable this way
Required permission | Screen Recording (kTCCServiceScreenCapture) | Accessibility (kTCCServiceAccessibility)
Apps that do not implement AX well | Always works | Some Qt, OpenGL, and older Electron builds fall back to screenshot

How Fazm wires the macos-use server into Claude

There are two pieces. First, the macos-use binary is built from the open source project at github.com/mediar-ai/mcp-server-macos-use, which depends on MacosUseSDK for the traversal and input controller primitives. The build step is a standard Swift Package Manager release build. Fazm's build.sh copies the resulting binary into the app bundle so end users never need to install Swift or run a command.

Second, the Node.js ACP bridge inside Fazm registers the binary as an MCP server at boot. The name 'macos-use' and the absolute path Contents/MacOS/mcp-server-macos-use are hardcoded in acp-bridge/src/index.ts. Claude Sonnet 4.6 sees the tool list on the first prompt and can start calling macos-use_open_application_and_traverse without any user setup.

acp-bridge/src/index.ts
The first-run permission handshake, as seen in Fazm logs

macos-use_open_application_and_traverse

Opens or activates an app by name, bundle ID, or file path. Returns the post-activation AX tree. The first tool Claude calls after 'open Mail and show me yesterday's messages'.

macos-use_click_and_traverse

Simulates a CGEvent click at (x, y) within the target PID, waits, then re-traverses. The diff between before and after is how Claude verifies the click did something.

macos-use_type_and_traverse

Types text into the currently focused field inside the PID. Claude uses it after a click lands on an AXTextField or AXTextArea.

macos-use_press_key_and_traverse

Named-key press with optional modifier flags (Command, Option, Shift, Control, Function, NumPad). Enter, Tab, ArrowUp, Cmd-S, Cmd-W, and the rest.

macos-use_refresh_traversal

Re-reads the AX tree without acting. Claude calls this when it needs to perceive the screen without changing it, for example while planning a multi-step action.

What you can hand a macOS computer use agent that uses this schema

Anything that maps to "look at a native Mac window, pick an element by role and text, click or type, verify". The AX tree is unusually good at exactly the tasks most users describe when they say "I want Claude to do things on my Mac". Below are prompts that complete in one or two tool rounds because the elements Claude needs are present in a single traversal.

Prompts the AX path handles in a few tool calls

  • Rename every file in this Finder window to follow YYYY-MM-DD-kebab-case, based on the modification date.
  • In Calendar, move my 3pm to the next free slot tomorrow that overlaps with the shared work calendar.
  • Open the WhatsApp conversation with the contact named Marwan and reply to the last message with a short thank-you.
  • In Numbers, find the row where the Status column is 'Pending' and the Amount is above 1000, and set Status to 'Paid'.
  • Close every Finder window, open my Drafts folder in Mail, and flag messages that mention 'invoice' in the subject.
  • Delete every screenshot on Desktop from before last Tuesday and show me the list before moving them to Trash.

The install path, end to end

From a dmg to a first prompt that reaches other Mac apps takes five steps. Every step after the first is automated.

1. Install the signed Fazm .app

fazm.ai serves the notarized dmg. Drag to /Applications. First launch triggers the usual 'app downloaded from the internet' prompt once. macos-use is already inside Contents/MacOS/; no separate Swift build.

2. Grant Accessibility once

AppState.swift calls AXIsProcessTrustedWithOptions with the prompt option. If the TCC cache is stale after a reset, the fallback probe calls AXUIElementCopyAttributeValue against Finder and, if needed, CGEvent.tapCreate on the live TCC database.

3. The ACP bridge boots

Node subprocess starts from acp-bridge/. Five MCP servers register, macos-use among them. ChatProvider.swift opens three sessions on claude-sonnet-4-6 (main, floating, observer).

4. First prompt

You type a sentence. Claude receives the tool list and the session context. If the task touches a Mac app, the first tool call is usually macos-use_open_application_and_traverse. The AX tree comes back in under 250 ms for most windows.

5. Plan, act, verify

Claude proposes a sequence of click_and_traverse, type_and_traverse, press_key_and_traverse calls. Each call returns the post-action AX tree (optionally with a diff), so Claude can verify and recover before moving on.
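The verify step leans on the before/after diff. As a simplified sketch, it can be modeled as set subtraction over role-plus-text keys (the real diff in mcp-server-macos-use may key on more than this):

```swift
import Foundation

// Key an element by role and text; a simplification of whatever
// the server actually diffs on.
typealias Key = String
func key(role: String, text: String) -> Key { "\(role):\(text)" }

// Hypothetical before/after traversals around a Cmd-S press.
let before: Set<Key> = [key(role: "AXButton",    text: "Save"),
                        key(role: "AXTextField", text: "Untitled")]
let after: Set<Key>  = [key(role: "AXButton",    text: "Save"),
                        key(role: "AXTextField", text: "Saved"),
                        key(role: "AXSheet",     text: "")]

let added = after.subtracting(before).sorted()
let removed = before.subtracting(after).sorted()
print("added:", added)     // new elements prove the action did something
print("removed:", removed)
```

An empty diff after a click is itself a signal: the action landed nowhere, and Claude can re-traverse and retry before compounding the error.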

Anchor fact

The entire payload Claude sees is 6 fields per element

Role, text, x, y, width, height. That is it. A BFS walker with a hard cap of 2000 elements, 100 levels deep, and a 5-second stopwatch produces a JSON list sorted y-first. 14 non-interactable roles are pruned before the list leaves the Swift side. A typical Finder or Calendar window fits in a few hundred tokens; the equivalent screenshot fits in a vision budget that can be 10 to 100 times the cost.

Read it yourself in MacosUseSDK/Sources/MacosUseSDK/AccessibilityTraversal.swift at the ElementData struct (lines 32 to 57) and the AccessibilityTraversalOperation class caps (lines 96 to 114).

AXUIElementCreateApplication
kAXFocusedWindowAttribute
kAXTitleAttribute
kAXValueAttribute
kAXDescriptionAttribute
CGEvent mouse click
AXObserver
ApplicationServices.framework
NSWorkspace.frontmostApplication
MCP stdio transport
ACP v0.29.2
claude-sonnet-4-6

Where this sits next to Claude Computer Use, Codex, and Gemini

Anthropic's own Computer Use, OpenAI Codex's desktop mode, and Google's Gemini Computer Use all use pixels as the primary input primitive. That is the right call for a portable API surface: pixels work on any OS, in any VM, over any remote desktop. It is the wrong call for a local-first Mac app where the OS already exposes a cheaper, more accurate structured representation of every window.

Fazm keeps both paths open and prefers the cheaper one when the app cooperates. The model selection is the same Claude Sonnet 4.6 you would get from the other options; the difference is what arrives in the model's context window before it decides where to click.

Want to see the AX payload Claude sees on your own Mac?

Book a 15-minute walkthrough. I will screen-share a live macos-use traversal on a window of your choice and show the JSON Claude receives.

macOS computer use agent, frequently asked

What is a macOS computer use agent?

A macOS computer use agent is a program that takes a natural language instruction, reads the state of your Mac, and performs mouse clicks, keystrokes, and app commands on your behalf. There are two common architectures. The first captures a screenshot and hands it to a multimodal language model, which returns pixel coordinates to click. The second reads the macOS accessibility tree, the same data VoiceOver narrates, and hands a structured list of UI elements to a language model, which returns element targets. Fazm uses the second approach as its default path and falls back to screenshots only when visual reasoning is actually required.

How is Fazm different from Claude's built-in Computer Use tool?

Anthropic's Computer Use tool, introduced in Claude Sonnet 3.5 and refined through Claude Sonnet 4.6 and Claude Opus 4.7, is a screenshot plus mouse and keyboard surface. It works across any OS because it has no OS dependency, but every read of the screen costs an image token budget and survives OCR only as well as the vision model does. Fazm packages a signed, notarized consumer Mac app that routes Claude through a second MCP server named macos-use. That server calls AXUIElementCreateApplication and walks the tree into JSON. For most native Mac tasks the AX tree is smaller, cheaper, and more precise than a PNG. Fazm still supports screenshots for the cases where they win, like PDFs and hand-drawn figures.

What is the exact schema Fazm sends to Claude for each UI element?

Six fields, some optional. The struct is ElementData in MacosUseSDK/Sources/MacosUseSDK/AccessibilityTraversal.swift at lines 32 to 57. 'role' is a string like 'AXButton', 'AXTextField', 'AXMenuItem'. 'text' is the concatenated label from kAXValueAttribute, kAXTitleAttribute, kAXDescriptionAttribute, AXLabel, and AXHelp. 'x', 'y', 'width', 'height' come from the element's frame. That is it. No color, no font, no opacity, no pixel data. Spatial sort is y-first, then x, so elements arrive in reading order. The entire ResponseData struct wraps that list with app_name, stats, and processing_time_seconds.

How big is a typical Fazm traversal?

AccessibilityTraversalOperation in AccessibilityTraversal.swift caps traversal at maxElements = 2000, maxDepth = 100, and maxTraversalSeconds = 5.0. Most app windows fall well under those caps. A Finder window with 40 files is around 60 to 120 elements. A Calendar month view is around 300 to 500. A busy Gmail inbox can push above 1000. At roughly 80 to 120 bytes of JSON per element, that maps to 10 to 40 kilobytes of structured text per window, compared to a 1 to 3 megabyte PNG if you were using screenshots. 14 non-interactable roles are pruned by default to keep the output tight: AXGroup, AXStaticText, AXUnknown, AXSeparator, AXHeading, AXLayoutArea, AXHelpTag, AXGrowArea, AXOutline, AXScrollArea, AXSplitGroup, AXSplitter, AXToolbar, AXDisclosureTriangle.
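The size estimate above is plain arithmetic. Spelled out for the Calendar month-view case (the per-element byte range is this page's rough figure, not a measured constant):

```swift
// Back-of-envelope payload sizing from the numbers above.
let bytesPerElement = 80...120       // rough JSON cost per element
let calendarMonthView = 300...500    // elements in a month view

let low = bytesPerElement.lowerBound * calendarMonthView.lowerBound
let high = bytesPerElement.upperBound * calendarMonthView.upperBound
print("\(low / 1000) to \(high / 1000) KB")  // vs a 1 to 3 MB PNG
```

Even the heavy Calendar case stays two orders of magnitude under the screenshot payload, which is the whole economic argument for the AX-first path.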

Where does the macos-use binary actually live inside the app?

acp-bridge/src/index.ts at line 63 resolves the binary with 'const macosUseBinary = join(contentsDir, "MacOS", "mcp-server-macos-use")'. The file is a signed Mach-O bundled into Contents/MacOS/ of the Fazm .app. At startup the ACP bridge registers it as an MCP server with the name 'macos-use' (index.ts lines 1057 to 1060) and hands it to Claude alongside four other bundled servers (playwright, whatsapp, google-workspace, fazm_tools). Source is open at github.com/mediar-ai/mcp-server-macos-use; it depends on github.com/mediar-ai/MacosUseSDK.

What tools does the macos-use MCP server expose?

Five, all documented in mcp-server-macos-use/README.md. macos-use_open_application_and_traverse opens or activates an app and returns the tree. macos-use_click_and_traverse clicks at (x, y) inside the target PID and returns the new tree. macos-use_type_and_traverse types text. macos-use_press_key_and_traverse presses a named key with optional modifier flags (Shift, Command, Option, Control, Fn, and so on). macos-use_refresh_traversal re-reads the tree without acting. Each call can optionally include traverseBefore, showDiff, onlyVisibleElements, and animationDuration flags. The diff output highlights what changed between the before and after traversal, which is how Claude verifies its own click actually did something.

Which macOS permission does this approach need, and why does Fazm probe it so carefully?

Accessibility, which in TCC terms is kTCCServiceAccessibility. Fazm still requests Screen Recording for the screenshot fallback path, but the default read path only needs Accessibility. AppState.swift wraps the permission check in a multi-stage probe because AXIsProcessTrusted() caches per-process on macOS 26 Tahoe; after a permission reset it can keep returning true even when AX is dead. The probe calls AXUIElementCopyAttributeValue against the frontmost app, disambiguates AXError.cannotComplete by retrying against Finder (which is known to implement AX correctly), and falls back to CGEvent.tapCreate on the live TCC database when Finder is not available.

Why not just use OpenAI Codex or Claude Computer Use for Mac tasks?

You can. Codex on Mac, shipped April 16 2026, is a cloud-first screen-reading plus action model. Claude Computer Use is a screenshot plus mouse and keyboard surface. Both are strong when you want a portable, OS-agnostic agent or when you need vision for genuinely visual tasks. They are slower and more expensive than AX-tree reads for native Mac automation because the agent pays an image token budget on every perception step. Fazm is the local-first consumer path: accessibility tree by default, screenshots on demand, Claude Sonnet 4.6 pinned as the model, five MCP servers bundled into the signed .app so the first prompt already reaches Finder, Safari, WhatsApp, and Google Workspace without setup.

What can I actually tell a macOS computer use agent to do through Fazm?

Anything that maps to a sequence of 'look at a window, find an element, click or type'. Concrete examples users run: rename every Finder file in a folder to kebab-case by date, pull every invoice PDF in ~/Downloads from the last two weeks and file them in Google Drive, reply to every WhatsApp message from a named contact with a templated response, update a column in a running Numbers sheet from a terminal-visible log, drag a meeting in Calendar to the next free slot that matches a teammate's availability, close all Slack notifications and summarize what mattered. The first prompt is the hard one; the rest are refinements on the tool call list Claude proposes before it acts.

Is the accessibility-first approach open source?

Yes, both halves. The SDK at github.com/mediar-ai/MacosUseSDK provides the traversal, activation, and input controller primitives. The MCP wrapper at github.com/mediar-ai/mcp-server-macos-use turns those into MCP tool calls. Fazm bundles the compiled binary inside the signed app so end users do not need a developer toolchain, but anyone building a different Mac agent can drop the same open source pieces into their stack and get the same 6-field ElementData schema and the same 2000-element BFS cap.