Alternative to screenshot-first computer-use agents

Accessibility Tree vs Screenshots: the code path that drops image bytes from every tool response in a shipping Mac agent.

Every SERP result for this keyword compares accessibility trees and screenshots as abstract trade-offs. Fazm actually implements the choice in code. The MCP tool-result handler at acp-bridge/src/index.ts lines 2271 to 2307 has exactly two branches, both text, and no type:"image" branch. A 500 KB base64 screenshot becomes zero bytes of context. A 691-char Playwright AX snapshot and a 1,847-char Mail AX tree flow through untouched.

Zero image branches in the MCP tool-result handler (index.ts:2271-2307)
MAX_IMAGE_TURNS = 20 per-session cap at index.ts:793
macos-use native MCP registered at index.ts:1056-1063 for any Mac app

Works on any AX-compliant Mac app, not just a browser

The filter lets a text AX tree flow through for any app. That text comes from either Playwright's browser_snapshot (for Chrome) or macos-use's native traversal (for everything else). Same bridge code, two tree sources, one filter.

Apple Mail, Calendar, Notes, Reminders, Finder, System Settings, Safari, Chrome via Playwright, Slack (Catalyst), Discord (Catalyst), WhatsApp (Catalyst), Figma Desktop, VS Code, Cursor, Obsidian, iA Writer, Xcode, Zoom, Linear Desktop, Spotify
0 image branches in the tool-result handler
2 text branches (lines 2282 and 2287)
20 per-session image cap, index.ts:793
691 chars per Playwright snapshot
0 B of screenshot base64 surviving the filter
350K+ tokens saved per turn vs screenshot-first

The anchor: lines 2271 to 2307 of acp-bridge/src/index.ts

This is the literal code from the Fazm bridge. Read the two if-branches; neither checks for image content. The comment above states the intent. The rawOutput fallback is also strictly text-only. Every MCP tool call, for every Mac app Fazm touches, runs through this code.

acp-bridge/src/index.ts:2271-2307
0 image branches

We extract only text items and skip images to keep context small.

acp-bridge/src/index.ts, comment above line 2273

Screenshot-first path vs Fazm's AX-first path

Below: the typical computer-use handler that forwards screenshot base64 to the model, set against Fazm's filter, which has no image branch at all. The two do the same job (turn a tool result into model input) at very different runtime costs.

Same interface. Different realities per turn.

// Typical screenshot-first handler in a computer-use agent
type ToolResultItem =
  | { type: "text"; text: string }
  | { type: "image"; data: string };

interface ToolResult { content?: ToolResultItem[] }

function handleToolResult(update: ToolResult) {
  const parts: object[] = [];
  for (const item of update.content ?? []) {
    if (item.type === "text") {
      parts.push({ type: "text", text: item.text });
    } else if (item.type === "image") {
      // Forward the full base64 to the model.
      parts.push({
        type: "image",
        source: {
          type: "base64",
          media_type: "image/png",
          data: item.data,      // ~500 KB per 1920x1200 screen
        },
      });
    }
  }
  return { role: "user", content: parts };
}
// Per-turn: ~350K image tokens. Per 10 turns: ~3.5M tokens of pixels.
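Fazm's side is not reproduced verbatim here, but the two documented branches (item.type === 'text' at line 2282, the ACP-wrapped inner.type === 'text' at line 2287, and no image branch at all) can be sketched. This is an illustration of the described behavior, not the code at index.ts:2271-2307:

```typescript
// Sketch of the text-only filter described in this article: two text
// branches, no image branch. Illustrative, not the verbatim Fazm source.
type McpItem =
  | { type: "text"; text: string }
  | { type: "image"; data: string }
  | { type: "content"; content: { type: string; text?: string } };

function filterToolResult(content: McpItem[]): string[] {
  const parts: string[] = [];
  for (const item of content) {
    if (item.type === "text") {
      parts.push(item.text);                    // direct MCP shape
    } else if (item.type === "content" && item.content.type === "text") {
      parts.push(item.content.text ?? "");      // ACP-wrapped shape
    }
    // No image branch: a {type:"image"} item falls through and is dropped.
  }
  return parts;
}
```

A 500 KB base64 payload entering this function contributes zero characters to the returned parts, which is the whole point of the pattern.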

How four tool sources collapse into one text channel

Fazm wires four potential inputs into one filter. Two of them (accessibility tree sources) flow through to LLM context. Two of them (image sources) get routed to file output or dropped. The per-session MAX_IMAGE_TURNS cap at line 793 is the hard stop when even the fallback is unwelcome.

Four inputs, one filter, two text-only destinations

  • playwright browser_snapshot → filter at lines 2271-2307 → LLM context
  • macos-use traversal → filter at lines 2271-2307 → LLM context
  • playwright browser_take_screenshot → written to /tmp/playwright-mcp/
  • capture_screenshot → base64 dropped on the floor
  • Either image path: capped at MAX_IMAGE_TURNS = 20 per session

The Playwright flags that enforce the same rule for web pages

The bridge also asks Playwright MCP to stop returning inline screenshots. Snapshots get written to file and the tool response references them by path. This is redundant with the handler filter, but redundancy is the point: two independent mechanisms both point at the same outcome.

acp-bridge/src/index.ts:1027-1033
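As a sketch, passing those flags might look like the snippet below. Only the two flags themselves (--output-mode file, --image-responses omit) come from the article's citations; the argv construction and helper name are illustrative assumptions:

```typescript
// Hypothetical construction of the Playwright MCP launch arguments.
// The two flags are the ones cited at index.ts:1033; everything else
// here is an illustrative assumption, not Fazm source.
const playwrightMcpArgs: string[] = [
  "--output-mode", "file",       // snapshots land under /tmp/playwright-mcp/
  "--image-responses", "omit",   // never return inline base64 screenshots
];

function renderCommand(bin: string, args: string[]): string {
  return [bin, ...args].join(" ");
}
```

Either flag alone would be enough; shipping both means a regression in one mechanism still leaves the other enforcing text-only responses.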

What the accessibility tree gives you that a screenshot does not

Each item below is a named AX attribute that arrives as a field, not a pixel inference. That is the difference between asking a vision model to locate the Send button and asking AXUIElementCopyAttributeValue for an AXButton with title 'Send'.

Delivered as named fields, per element

  • Element role (kAXRole): AXButton, AXTextField, AXCheckBox, AXMenuItem
  • Element title (kAXTitle): 'Send', 'Reply', 'Compose'
  • Element value (kAXValue): textbox content, slider position, toggle state
  • Position and size (kAXPosition, kAXSize): x, y, w, h in screen space
  • Focused state (kAXFocused): which element currently has keyboard focus
  • Children tree (kAXChildren): the full window subtree, recursively
  • Identity across renders: same element, same handle, across turns
  • Hidden UI: overflow menus, off-screen rows, collapsed sidebars with role/title intact
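A later FAQ answer gives the textual form each macos-use element arrives in: [Role] 'title' x:N y:N w:W h:H. To show how machine-parseable those named fields are compared to pixels, here is a hypothetical parser for that line shape (the function and regex are illustrative, not Fazm source):

```typescript
// Hypothetical parser for one line of a macos-use text AX tree in the
// form described in this article: [Role] 'title' x:N y:N w:W h:H
interface AxElement {
  role: string;
  title: string;
  x: number; y: number; w: number; h: number;
}

function parseAxLine(line: string): AxElement | null {
  const m = line.match(
    /^\[(\w+)\] '([^']*)' x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+)/,
  );
  if (!m) return null;          // not an element line
  return {
    role: m[1], title: m[2],
    x: Number(m[3]), y: Number(m[4]),
    w: Number(m[5]), h: Number(m[6]),
  };
}
```

Matching "click Send" against parsed fields is a string comparison on role and title, with no vision pass anywhere in the loop.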

A full turn from user prompt to AX click, with the log

This is a lightly trimmed trace from a Fazm dev build's /tmp/fazm-dev.log. Three turns, zero base64 bytes, about six thousand tokens total across the conversation.

acp-bridge stderr

Same user prompt, two different runtimes

'Click Send in Mail' is the minimum viable Mac automation task. A screenshot-first agent solves it with 350K image tokens and a pixel coordinate. Fazm solves it with an AX tree and a role-title match. The visible difference is per-turn cost; the invisible difference is what survives a window move.

'Click Send in Mail' on the two architectures

A user asks 'click the Send button in Mail'. Screenshot-first agent: capture full-screen PNG, base64-encode, ship 350K tokens. Model gets pixel blob; has to visually locate the Send button. Output: 'click at (1342, 188)' plus confidence number. Wrong coords if the window moves. Wrong app if focus shifts. 10 turns cost 3.5M image tokens.

  • 350K input tokens per 1920x1200 screen
  • Pixel coords that break on window move
  • One turn per click and one turn to verify
  • Vision model has to OCR every screen

The full message trace, from user click to next token

Twelve messages across five actors. Note the filter step at line 2271-2307 happens inside the bridge process before the tool_result ever reaches the model. No downstream component has to make an image vs text decision.

AX-tree turn: click to next token

Actors: User, ChatProvider.swift, acp-bridge (Node), MCP server, LLM.

1. User → ChatProvider.swift: 'Click Send in Mail'
2. ChatProvider.swift → acp-bridge: query { prompt, sessionKey: 'main' }
3. acp-bridge → LLM: session/prompt with AX tree from last turn
4. LLM → acp-bridge: tool_call mcp__macos-use__click_and_traverse { element: 'Send' }
5. acp-bridge → MCP server: spawn MCP request { name, args }
6. MCP server: AXUIElementCopyAttributeValue probes the window, finds AXButton 'Send' (Mail.app pid)
7. MCP server → acp-bridge: result: { content: [{ type: 'text', text: axTreeYaml }] }
8. acp-bridge: filter at index.ts:2271-2307; item.type === 'text' matches at 2282; forward
9. acp-bridge → LLM: tool_result content, text only, 0 bytes of pixels
10. LLM → acp-bridge: next turn streams 'Send clicked, compose window closed.'
11. acp-bridge → ChatProvider.swift: onTextDelta renders
12. ChatProvider.swift → User: UI updates; Mail really did click Send

How a click travels from user intent to a real AX action

Six steps from prompt to pixel. The screenshot path is an optional branch at step six, gated by MAX_IMAGE_TURNS and by an explicit prompt instruction that it 'costs extra tokens' (ChatPrompts line 61).

1. Agent plans a tool call in the LLM

The model decides to click Send in Mail. It emits a tool_use for mcp__macos-use__click_and_traverse { element: 'Send' }. No screenshot involved.

2. acp-bridge routes it to the right MCP server

index.ts dispatches to the macos-use subprocess registered at lines 1056-1063. The tool name prefix decides the route; there is no fallback to screenshot.

3. macos-use traverses the frontmost app's AX tree

Native binary calls AXUIElementCreateApplication on the Mail pid, walks kAXChildren, extracts kAXRole, kAXTitle, kAXValue, kAXPosition, kAXSize per element. Returns a text accessibility tree.

4. Bridge filter at lines 2271-2307 picks up the result

item.type === 'text' matches at line 2282 and the text tree is appended. If an image sneaked in as item.type === 'image', there is no branch to catch it; it is dropped.

5. Model sees an accessibility tree, not a PNG

The next turn streams into session/prompt carrying the AX tree as plain text. The model picks by role and title, then calls the next tool. One RPC round trip, no vision pass.

6. Screenshot only if explicitly asked

capture_screenshot or browser_take_screenshot are available, but ChatPrompts.swift line 61 says 'only use when you need visual confirmation, it costs extra tokens.' MAX_IMAGE_TURNS = 20 caps the damage.
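The cap itself is one constant. A minimal sketch of how such a per-session budget could gate screenshot calls follows; only MAX_IMAGE_TURNS = 20 comes from index.ts:793, while the gating class and method names are assumptions:

```typescript
// Sketch of a per-session image-turn budget. The constant is real
// (index.ts:793); the gating logic here is illustrative, not Fazm source.
const MAX_IMAGE_TURNS = 20;

class SessionImageBudget {
  private used = 0;

  /** Returns true and spends one slot if a screenshot turn is still allowed. */
  tryConsume(): boolean {
    if (this.used >= MAX_IMAGE_TURNS) return false;
    this.used += 1;
    return true;
  }
}
```

After the twentieth screenshot in a session, every further attempt is refused, so even a model that over-asks for visual confirmation cannot flood the context with pixels.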

What breaks in a screenshot-only pipeline on Mac

Six concrete failure modes that the accessibility tree path side-steps. Each one is something a Mac user running a screenshot-first computer-use agent will hit within a handful of turns.

Screenshot OCR misreads dense UI

Tables with narrow columns, small icon-only buttons, right-to-left text, and accent-colored pills routinely get mis-OCR'd even by GPT-4V and Claude 3.5 Sonnet vision. The AX tree does not care: kAXRole says AXRow, kAXValue gives the exact cell contents. Fazm's path at index.ts:2271-2307 forwards that text verbatim.

Pixel coordinates drift on window moves

Every screenshot-first agent clicks at (x,y). The window moves between capture and click. macos-use re-traverses after every action and returns the new x, y, w, h. Fazm's tool pattern auto-centers on the AX bounds.
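The auto-centering rule (x + w/2, y + h/2, as the comparison table later in this article states) is small enough to show directly; this helper is illustrative, not Fazm source:

```typescript
// Click target from fresh AX bounds: the center of the element,
// recomputed after every re-traversal so a window move between
// capture and click cannot leave the coordinates stale.
interface AxBounds { x: number; y: number; w: number; h: number }

function clickPoint(b: AxBounds): { x: number; y: number } {
  return { x: b.x + b.w / 2, y: b.y + b.h / 2 };
}
```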

Context blows up on turn 3

A 1920x1200 screenshot is ~350K input tokens. At that rate, three turns of screenshots already exceed a 200K-token Claude context window several times over. Zero tokens with Fazm's filter, because no image ever enters the array at lines 2280-2290.

Off-screen UI is invisible to pixels

Overflow menus, collapsed sidebars, and rows below the scroll fold show up in the AX tree with kAXChildren even when not visible on screen. A screenshot-first agent must scroll, capture, OCR, repeat. macos-use returns the full subtree in one call.

Native non-browser apps need AX anyway

Slack Catalyst, Discord, Figma desktop, VS Code, Cursor, Apple Mail, Calendar all expose AX trees but do not expose a DOM. A screenshot-first agent either OCRs them or cannot touch them. Fazm's macos-use at index.ts:1056-1063 reads them directly.

Per-image tokenizer limits kick in

Anthropic caps image long-edge to 2000px, and per-session image counts affect error rates. MAX_IMAGE_TURNS = 20 at index.ts:793 is the backstop; in practice, Fazm rarely touches the cap because the AX path already resolves the task.

Fazm's live AX probe: AppState.swift lines 433 to 463

Before any tool call runs, Fazm confirms the accessibility API is actually working for the frontmost app. A screenshot-first agent has no equivalent: it will happily OCR a broken permission prompt and assume it clicked something. Fazm probes with a real AX call, then disambiguates with a Finder probe and a CGEvent tap probe if needed.

Desktop/Sources/AppState.swift:433-463

The native binary behind the Mac-app coverage

mcp-server-macos-use is what turns 'Mail', 'Calendar', 'Slack' into first-class AX targets for the agent. Registered once at bridge startup, it shares the same tool-result filter as Playwright, so a macos-use traversal result flows through lines 2271-2307 the same way a browser_snapshot does.

acp-bridge/src/index.ts:1056-1063

Fazm vs a generic screenshot-first computer-use agent

Eight head-to-head rows. Each is backed by a specific file and line number in the Fazm source tree.

Feature | Screenshot-first agent | Fazm (AX tree path)
Tool-result handling: what flows into model context | Screenshot base64 forwarded verbatim as the primary signal | Only text items, enforced at index.ts:2271-2307 (no image branch exists)
Input size per 1920x1200 screen | PNG base64, roughly 350K tokens | AX tree YAML or text, typically 500-2000 tokens
How an element is located | Vision model infers pixel coordinates from the PNG | Role + title + value + AX bounds, x/y/w/h returned directly
Stability when the window moves mid-turn | Pixel coords go stale; requires a fresh screenshot and retry | Re-traverse returns new bounds; click auto-centers on x + w/2, y + h/2
Works on non-browser Mac apps | Browser-only or screenshot-only; no structured UI data off the web | macos-use MCP at index.ts:1056-1063 for any AX-compliant app
Pre-flight that AX actually works | No equivalent; screenshots always 'work' even when broken | AXUIElementCopyAttributeValue on kAXFocusedWindowAttribute (AppState.swift:439-441)
When screenshots get used | Every turn, regardless of whether a text alternative exists | Visual verification only; capped at MAX_IMAGE_TURNS = 20 per session (index.ts:793)
Consumer app or developer framework? | Python SDKs, Docker images, dev-only demos (OpenAdapt, OS-Atlas) | Shipping Mac app (Fazm), one install, speaks to the user

See the AX-tree filter run live on your own Mac

Book a 20-minute demo. We will open Mail, Finder, Slack and Figma in one session, watch the macos-use traversals stream through lines 2271-2307, and show a side-by-side token-cost counter against the screenshot-first alternative.

Book a call

FAQ

Frequently asked questions

What does 'Fazm prefers the accessibility tree over screenshots' actually mean in code?

It means acp-bridge/src/index.ts at lines 2271-2307 has zero branches that handle image content from MCP tool results. There are exactly two branches: item.type === 'text' at line 2282 and inner.type === 'text' at line 2287. Everything else, including {type:'image', data:<base64>}, is dropped by omission on its way into the model's context. The rawOutput fallback at lines 2293-2307 is also strictly text-only. This is reinforced by --image-responses omit at line 1033 and MAX_IMAGE_TURNS = 20 at line 793. A screenshot never even reaches the context window unless a specific capture_screenshot tool explicitly returned base64 and bypassed the filter.

Why choose the accessibility tree over screenshots for a Mac agent?

Three reasons tied to the shipping code. One, size: a 1920x1200 PNG base64 is roughly 350K tokens, a Playwright browser_snapshot YAML is ~691 chars or roughly 170 tokens, an AX tree from a single Mail window is typically 500-2000 tokens. At a 200K-token context that is 2500x the density. Two, structure: kAXRole, kAXTitle, kAXValue, kAXPosition, kAXSize arrive as named fields from AXUIElementCopyAttributeValue, so 'click Send in Mail' is a role + title match, not a pixel regression. Three, coverage: AXUIElementCreateApplication works on any AX-compliant macOS app (Mail, Calendar, Finder, Slack Catalyst, Discord, Figma, VS Code), while a screenshot-first agent has to re-run OCR every turn for every app.

How is this different from Claude Computer Use, OpenAI Operator, OpenAdapt, and OS-Atlas?

Claude Computer Use ships a tool that returns a screenshot plus mouse/keyboard coordinates. OpenAI Operator drives a cloud browser by streaming screenshots to a vision model. OpenAdapt and OS-Atlas both record and replay screen pixels. Every one of those approaches pays the 350K-ish-token bill on every turn for a 1920x1200 screen. Fazm's bridge does not. It registers mcp-server-macos-use as a native binary at index.ts lines 1056-1063 so AX tree probes are an MCP tool call, and the tool-result handler strips image payloads the moment they arrive. A Mac user is getting text-first AX data by default, not as an opt-in.

Does Fazm ever use screenshots?

Yes, for a narrow set of cases. The Fazm chat prompt at ChatPrompts.swift line 56 routes capture_screenshot for visual verification, and line 61 tells the model to only call browser_take_screenshot 'when you need visual confirmation, it costs extra tokens.' MAX_IMAGE_TURNS = 20 at index.ts line 793 is a per-session cap that keeps the Anthropic 2000px-per-image limit from firing. The posture is: accessibility tree first, screenshot only when the text is insufficient. In practice on a typical session the cap never gets close to 20 because the AX tree is already enough.

What exactly flows through the filter instead of a screenshot?

For Playwright browser_snapshot, a YAML document with the page's accessibility tree and [ref=e1], [ref=e2] handles per element; the bridge stores it under /tmp/playwright-mcp/ via --output-mode file at line 1033 and returns a small JSON reference that the LLM reads with Read. For macos-use traversal tools, a text accessibility tree with [Role] 'title' x:N y:N w:W h:H visible lines per element. For whatsapp, google-workspace, and fazm_tools, plain text results. Every one of these is what the filter at lines 2271-2307 lets through.

Does the filter work in every Claude Agent SDK wire format?

Yes, it handles both shapes. Direct MCP format {type:'text', text:'...'} is caught at line 2282. ACP-wrapped format {type:'content', content:{type:'text', text:'...'}} is caught at the inner branch on line 2287. If the MCP server or ACP SDK changes envelope shape across versions, the bridge still only emits text. The comment above line 2273 states this intent: 'We extract only text items and skip images to keep context small.'

How does Fazm probe macOS accessibility to confirm the path even works?

AppState.swift at lines 439-441 calls AXUIElementCreateApplication(frontApp.processIdentifier) and AXUIElementCopyAttributeValue(appElement, kAXFocusedWindowAttribute as CFString, &focusedWindow). If the result is .success, .noValue, .notImplemented, or .attributeUnsupported the AX API is working. If it is .apiDisabled or .cannotComplete, Fazm re-probes against Finder (lines 468-485) and falls back to a CGEvent.tapCreate probe (lines 487-505) to distinguish a truly broken permission from a per-app AX incompatibility. This is the pre-flight every Mac user sees on first run; no screenshot path substitutes for it.

Which macOS apps are reachable through the accessibility tree path?

Any AX-compliant app. ChatPrompts.swift line 59 explicitly lists the macos-use tools for 'Finder, Settings, Mail, etc.' In practice: Apple Mail, Calendar, Notes, Reminders, Finder, System Settings, Safari (as an AX target, not just DOM), Slack Catalyst, Discord Catalyst, WhatsApp Catalyst (which additionally has the dedicated whatsapp MCP at index.ts:1066-1073), Figma desktop, VS Code, Cursor, Obsidian, iA Writer. Apps that render their own UI outside AX (some Qt or pure OpenGL apps) fall through, which is exactly why AppState.swift:454-463 treats AXError.cannotComplete as ambiguous and re-probes.

What is the token-count math on a typical agent turn?

A 1920x1200 PNG compresses to roughly 500 KB, which base64 expands to roughly 666 KB. At the modern Anthropic tokenizer that is around 350K input tokens for a single screen. A Playwright browser_snapshot YAML for an average e-commerce page is under 1 KB, under 250 tokens. A macOS AX tree for a single focused Mail window is typically 1-2 KB, well under 500 tokens. Running ten turns on screenshots alone would burn ~3.5M tokens of image content before the model even starts thinking; running ten turns on AX trees costs ~5K tokens and fits in a context budget many times over. That is the difference the filter at lines 2271-2307 realises per turn.
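The base64 step of that math can be checked mechanically (every 3 raw bytes become 4 base64 characters); the bytes-to-token ratio is this article's own estimate and is not re-derived here:

```typescript
// Base64 expands binary data by 4/3: each 3-byte group becomes
// 4 output characters (with padding on the final group).
function base64Length(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}

// 500 KB of PNG → ~666 KB of base64, matching the figure quoted above.
const screenshotBase64Bytes = base64Length(500_000);
```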

What is different about a Mac-native accessibility tree versus a Chrome DevTools accessibility tree?

Chrome's AX tree comes from the web page DOM and AOM (Accessibility Object Model). It ends at the browser window. Mail's AX tree is the entire NSWindow, every toolbar button, every sidebar row, every message cell. Fazm uses both: playwright-mcp's browser_snapshot for web pages inside Chrome, macos-use's traversal for any other app. Lines 1027-1054 register the Playwright MCP server; lines 1056-1063 register the macos-use native binary. Same filter, two tree sources.

How do I verify this in the Fazm source tree?

Six anchors. 1) The image-drop filter: /Users/matthewdi/fazm/acp-bridge/src/index.ts lines 2271-2307. 2) The Playwright omit flag: index.ts line 1033. 3) The image-turn cap: index.ts line 793. 4) The macos-use binary registration: index.ts lines 1056-1063. 5) The AX probe: /Users/matthewdi/fazm/Desktop/Sources/AppState.swift lines 433-463. 6) The routing rule in the chat prompt: /Users/matthewdi/fazm/Desktop/Sources/Chat/ChatPrompts.swift lines 56-61. Every one of these is a direct file:line anchor in the shipping codebase.

fazm: AI Computer Agent for macOS
© 2026 fazm. All rights reserved.