
Claude Computer Use Agent: the tool-schema swap that runs on a real Mac

Anthropic's reference computer use exposes one tool called computer that takes screenshots and coordinate actions. Fazm keeps the same Claude, swaps the tool schema for six MCP tools, and runs observe-act-observe in one round trip instead of two.

Matthew Diakonov
11 min read
Uses real accessibility APIs, not screenshots
Works on any Mac app, not just the browser
Consumer app, no MCP config to touch

Why the tool schema is the whole fight

Every article about the Claude computer use agent ends up describing the same loop. Claude takes a screenshot, reads the pixels, returns a coordinate action, takes another screenshot, reads the new pixels, returns the next action. That loop is not a property of Claude. It is a property of the tool schema the agent runtime hands to the model.

Anthropic's reference schema is a single built-in tool named computer. Its observation action is screenshot. Its action primitives work at the level of coordinates and keystrokes: left_click, type, key. The agent loop has to follow that shape because that is what the API lets the model call.

Fazm runs Claude Sonnet 4.6 through an ACP bridge that hides the reference tool entirely and registers six MCP tools in its place. The rest of this page walks through exactly which tools, what they return, and where in the Fazm source you can verify it yourself.

The two tool schemas, side by side

This is the exact shape of the tools the Claude API is offered in each case. The tool list is what the model reasons against, so the list is the agent.

Tool schema exposed to Claude

{
  "type": "computer_20250124",
  "name": "computer",
  "display_width_px": 1024,
  "display_height_px": 768,
  "display_number": 1
}

// Actions the model issues against this single tool:
//   screenshot         — grab the full display
//   mouse_move         — move cursor to (x, y)
//   left_click         — click at (x, y)
//   left_click_drag    — drag from (x1, y1) to (x2, y2)
//   right_click        — right click at (x, y)
//   middle_click       — middle click at (x, y)
//   double_click       — double click at (x, y)
//   triple_click       — triple click at (x, y)
//   type               — type a string
//   key                — press a named key (Return, Tab, cmd+a)
//   scroll             — scroll at (x, y) by dx, dy
//   wait               — pause for n seconds
//   cursor_position    — return current (x, y)
//
// Every step that needs observation calls screenshot as a
// SEPARATE tool_use. Observe and act are two round trips.

Anchor fact: what the bridge actually registers

The swap lives in one file inside Fazm: acp-bridge/src/index.ts. The path resolution is on line 63, the registration block is on lines 1056 through 1064, the default model is on line 1245, and the authoritative built-in MCP list is on line 1266. Open the file, grep for any of these, and it is there.

Uncopyable part: the macOS binary is shipped inside the app bundle at Fazm.app/Contents/MacOS/mcp-server-macos-use, a 21 MB Mach-O 64-bit arm64 executable, version 1.6.0. Right-click the app, choose Show Package Contents, and verify it with the file command. No other Mac Claude-driven computer use agent ships its own native AX walker binary inside the app bundle and hands the six tool names directly to Claude.

Ask the binary what tools it offers

The binary speaks MCP over stdio. Pipe a tools/list request at it and read back the six tool names the bridge registers verbatim.
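For readers who want to try this, here is a minimal sketch of the messages involved. MCP's stdio transport carries JSON-RPC 2.0, one message per line, and a client is expected to initialize before asking for tools. The protocolVersion value, the client name tools-list-probe, and the sample response are placeholders chosen for illustration. They are not taken from Fazm's source or from captured output of the real binary.

```typescript
// Sketch: JSON-RPC 2.0 messages for an MCP tools/list exchange over stdio.
// Assumptions: protocolVersion and clientInfo are placeholder values.

interface JsonRpcRequest {
  jsonrpc: "2.0";
  id?: number; // omitted for notifications
  method: string;
  params?: Record<string, unknown>;
}

// The stdio transport sends one JSON message per line.
function frame(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// Lines a client would write to the server's stdin, in order.
function buildToolsListHandshake(): string[] {
  return [
    frame({
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2024-11-05",
        capabilities: {},
        clientInfo: { name: "tools-list-probe", version: "0.0.1" },
      },
    }),
    frame({ jsonrpc: "2.0", method: "notifications/initialized" }),
    frame({ jsonrpc: "2.0", id: 2, method: "tools/list" }),
  ];
}

// Pull the tool names out of one tools/list response line.
// Returns an empty list if the response carries an error instead of a result.
function parseToolNames(responseLine: string): string[] {
  const parsed = JSON.parse(responseLine) as {
    result?: { tools?: Array<{ name: string }> };
  };
  return (parsed.result?.tools ?? []).map((tool) => tool.name);
}

// Illustrative response only, shortened to two of the six names.
const sampleResponse = JSON.stringify({
  jsonrpc: "2.0",
  id: 2,
  result: { tools: [{ name: "click_and_traverse" }, { name: "refresh_traversal" }] },
});

console.log(buildToolsListHandshake().join(""));
console.log(parseToolNames(sampleResponse));
```

In practice you would write the three lines to the binary's stdin (for example from a Node child process) and pass the line answering id 2 to parseToolNames.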

What each of the six tools actually does

Claude never sees a bare click or type action. It sees six composite verbs, each fused with a tree re-walk.

open_application_and_traverse

Launches a Mac app by name or bundle id, waits for it to come forward, then returns the freshly walked AX tree. Replaces "open a terminal and run `open -a <app>`" plus a screenshot.

click_and_traverse

Synthesizes a CGEvent mouse click by element (title substring match) or by explicit x/y/w/h. Returns the post-click tree in the same MCP response, no separate screenshot call needed.

type_and_traverse

Types a UTF-8 string as keystrokes into the focused element, optionally preceded by a click to set focus and followed by a pressKey (Return, Tab). All three actions collapse into one MCP round trip.

press_key_and_traverse

Sends a named key or chord (Return, Tab, Escape, cmd+a, cmd+shift+t). Returns the tree after the key, so the agent sees immediately whether the shortcut opened a new window or dismissed a dialog.

scroll_and_traverse

Scrolls the target element by a delta in points and re-walks the tree. Reveals elements that were attached but had visible set to false because they sat below the scroll clip.

refresh_traversal

The observation-only variant. No action, just re-walk the tree of the frontmost app. The agent uses it at the start of a conversation and after anything asynchronous (page load, file save, remote response).

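Five of the six tools end in _and_traverse

The suffix is not cosmetic. It is the API contract: each action tool returns the post-action accessibility tree as part of its reply. The sixth tool, refresh_traversal, is the observation-only variant and returns the tree with no action attached.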

macos-use_open_application_and_traverse
macos-use_click_and_traverse
macos-use_type_and_traverse
macos-use_press_key_and_traverse
macos-use_scroll_and_traverse
macos-use_refresh_traversal
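One way to picture that contract is as a type. The sketch below is hypothetical: the TraverseReply field names and the toToolResult helper are invented for illustration and are not Fazm's source. Only the tool names come from the list above (shown without the macos-use_ prefix), and the tool_result block follows the general shape the Claude Messages API uses for tool results.

```typescript
// Hypothetical sketch of the "action plus tree in one reply" contract.
type MacosUseTool =
  | "open_application_and_traverse"
  | "click_and_traverse"
  | "type_and_traverse"
  | "press_key_and_traverse"
  | "scroll_and_traverse"
  | "refresh_traversal";

// Illustrative field names, not the binary's actual wire format.
interface TraverseReply {
  actionResult: string; // e.g. "clicked Send"; empty for refresh_traversal
  tree: string; // post-action accessibility tree, one element per line
}

interface ToolResultBlock {
  type: "tool_result";
  tool_use_id: string;
  content: Array<{ type: "text"; text: string }>;
}

// Action tools carry the suffix; refresh_traversal only observes.
function isActionTool(name: MacosUseTool): boolean {
  return name.endsWith("_and_traverse");
}

// Fold the action result and the new tree into ONE tool_result block,
// so no second tool_use turn is needed to observe the outcome.
function toToolResult(toolUseId: string, reply: TraverseReply): ToolResultBlock {
  const parts = [reply.actionResult, reply.tree].filter((part) => part.length > 0);
  return {
    type: "tool_result",
    tool_use_id: toolUseId,
    content: [{ type: "text", text: parts.join("\n") }],
  };
}

const exampleBlock = toToolResult("toolu_example", {
  actionResult: "clicked Send",
  tree: '[AXButton (button)] "Send" x:6272 y:-1754 w:56 h:28 visible',
});
console.log(exampleBlock.content.length); // 1
```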

How a Claude tool_use reaches the accessibility tree

The ACP bridge is a thin Node process between the Claude API and the bundled Swift binary. Nothing in the path requires a screenshot or a vision pass.

[Diagram: tool_use pipeline. Nodes: user intent, Claude Sonnet 4.6, ACP over stdio, acp-bridge, mcp-server-macos-use, AXUIElement C API, tree + action result.]

The round-trip collapse

Same user intent, two different tool_use traces. The trace below shows the exact shape of what gets sent across the Claude API with the reference computer tool.

Observe-act-observe, one intent

// Anthropic reference computer use
// Round trip 1: action
tool_use:
  name: "computer"
  input: { action: "left_click", coordinate: [640, 320] }
tool_result:
  content: [{ type: "text", text: "clicked at (640, 320)" }]

// Round trip 2: observation (separate tool_use turn)
tool_use:
  name: "computer"
  input: { action: "screenshot" }
tool_result:
  content: [{
    type: "image",
    source: {
      type: "base64",
      media_type: "image/png",
      data: "iVBORw0KGgo..."  // 500 KB - 5 MB
    }
  }]

// Model must now re-interpret pixels to find the next target.

  • Screenshot is a separate tool_use turn
  • Observation payload is 500 KB to 5 MB of base64
  • Model re-interprets pixels every step
  • Vision-token cost per observation

"Reply to Marwan and say I will call him today" end to end

1. User says what they want in plain English

"Reply to the email from Marwan and say I will call him today."

No coordinates, no element IDs, no AppleScript. The model owns the intent-to-action translation.

2. Claude Sonnet 4.6 issues a tool_use block

It picks click_and_traverse and fills in arguments from the last tree dump in context.

The bridge routes the tool_use over stdio to the macos-use MCP server. Model ID: claude-sonnet-4-6 (acp-bridge/src/index.ts:1245).

3. The Swift binary acts and re-walks the tree

mcp-server-macos-use v1.6.0 synthesizes the CGEvent click, waits for the frontmost app to settle, then walks the AX tree depth-first.

441 elements in 0.72 seconds on a real Fazm Dev window. Output is written to /tmp/macos-use/<ts>_click_and_traverse.txt as ground truth you can grep.

4. MCP response carries action result plus the new tree

Both parts come back in the same tool_result block, not as two separate tool_use turns.

This is the single-round-trip collapse. Anthropic's reference computer tool requires a second tool_use for screenshot; _and_traverse bakes observation into action.

5. Claude reads the new tree and picks the next tool

Substring search on a few kilobytes of structured text. No vision pass, no OCR, no pixel math.

For a typical workflow (open app, click button, type text, submit) that is four round trips, four fresh trees, no separate screenshot calls.
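The arithmetic behind that count can be written down directly. This is a sketch of the accounting only, under the article's own assumptions: one round trip per action when the tree rides along with the reply, two when every action needs a separate screenshot call. The names roundTrips and LoopShape are made up for the example, and any initial observation before the first action is ignored.

```typescript
// Round-trip accounting for the two loop shapes described above.
type LoopShape = "separate_screenshot" | "fused_traverse";

function roundTrips(actions: number, shape: LoopShape): number {
  if (!Number.isInteger(actions) || actions < 0) {
    throw new RangeError("actions must be a non-negative integer");
  }
  // A separate screenshot doubles the count: one call to act, one to observe.
  return shape === "separate_screenshot" ? actions * 2 : actions;
}

const workflow = ["open app", "click button", "type text", "submit"];
console.log(roundTrips(workflow.length, "fused_traverse")); // 4
console.log(roundTrips(workflow.length, "separate_screenshot")); // 8
```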

The numbers behind the swap

These are not benchmarks. They come from the file system, the bridge source, and a real traversal header on a Fazm Dev window.

6 MCP tools, five suffixed _and_traverse plus refresh_traversal
21 MB bundled arm64 binary size
441 elements in a real traversal
0.72 s walk + serialize time

Compare that to a screenshot observation in Anthropic's reference loop, which has to be captured, encoded, and uploaded before the model has even decided what to do next.

One line of what Claude actually reads

The element format the binary emits is deterministic. Claude sees text like this and picks a target by substring search, then pulls x/y/w/h from the same line:

[AXButton (button)] "Send" x:6272 y:-1754 w:56 h:28 visible

After any macos-use tool call, Fazm writes the full tree to /tmp/macos-use/<ts>_<tool>.txt. That file is the exact string the Claude API received in the tool_result block.
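To make the substring-then-frame idea concrete, here is a rough sketch of a parser for lines of that shape. It is inferred from the single example line above, so treat the regular expression as an assumption rather than the binary's grammar. In particular, the meaning of the parenthesized token (called roleDetail here) and the idea that hidden elements simply lack the trailing visible word are guesses. All function names are hypothetical.

```typescript
// Sketch: parse one element line, find a visible target, compute a click point.
interface AxElement {
  role: string;
  roleDetail: string; // the parenthesized token; its exact meaning is assumed
  title: string;
  x: number;
  y: number;
  w: number;
  h: number;
  visible: boolean;
}

// Inferred from one sample line; the real format may have more variants.
const ELEMENT_LINE =
  /^\[(\S+) \(([^)]*)\)\] "(.*)" x:(-?[\d.]+) y:(-?[\d.]+) w:([\d.]+) h:([\d.]+)( visible)?$/;

function parseElementLine(line: string): AxElement | null {
  const match = ELEMENT_LINE.exec(line.trim());
  if (match === null) return null;
  const [, role = "", roleDetail = "", title = "", x = "", y = "", w = "", h = "", flag] = match;
  return {
    role,
    roleDetail,
    title,
    x: Number(x),
    y: Number(y),
    w: Number(w),
    h: Number(h),
    visible: flag !== undefined,
  };
}

// First visible element whose title contains the fragment, or null.
function findVisibleTarget(tree: string, titleFragment: string): AxElement | null {
  for (const line of tree.split("\n")) {
    const element = parseElementLine(line);
    if (element !== null && element.visible && element.title.includes(titleFragment)) {
      return element;
    }
  }
  return null;
}

// Center of the element's frame, in the same point coordinates as the tree.
function midpoint(element: AxElement): { x: number; y: number } {
  return { x: element.x + element.w / 2, y: element.y + element.h / 2 };
}

// Illustrative tree: the first line is hidden, so the second one is chosen.
const sampleTree = [
  '[AXButton (button)] "Send Later" x:6200 y:-1754 w:56 h:28',
  '[AXButton (button)] "Send" x:6272 y:-1754 w:56 h:28 visible',
].join("\n");

const target = findVisibleTarget(sampleTree, "Send");
if (target !== null) {
  console.log(target.title, midpoint(target)); // Send { x: 6300, y: -1740 }
}
```

The negative y in the sample is preserved as-is; the parser does not clamp coordinates, since points above the main display are legal.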

One tool vs six tools, step by step

A line-by-line accounting of what each tool schema gives Claude for the same Mac task.

Feature | Anthropic reference (one computer tool) | Fazm (six macos-use tools)
Number of tools exposed to the model | 1 (computer) | 6 (five suffixed _and_traverse, plus refresh_traversal)
How the model observes the screen | screenshot action returns base64 PNG | every tool call returns the re-walked AX tree text
Round trips per observe-act-observe | 2 (action + screenshot) | 1 (action + tree, same MCP response)
Click target derivation | model infers coordinates from pixels | exact x/y/w/h pulled from the matched tree line
Typical per-observation payload | 500 KB to 5 MB base64 | a few kilobytes of UTF-8
Icon-only buttons | OCR fails, model guesses from shape | kAXTitleAttribute (or kAXDescriptionAttribute) gives the label when the app sets one
Off-screen but attached elements | invisible in the screenshot | present in tree with visible flag = false
Retina scaling and multi-monitor | pixel coordinates shift with DPR | CGFloat points, negative y legal on monitors above the main
Setup for a non-developer | Docker container, API keys, scripts | install a Mac app, grant Accessibility once

Where the swap falls back to pixels

The six-tool swap wins on any Mac app that implements the accessibility API cleanly, which covers most native apps (Mail, Calendar, Messages, Safari, Finder, Settings, Notes) and well-behaved Electron apps such as Slack. It also wins on Catalyst apps and Chrome (the DOM is reflected through AX).

It loses on apps that render to a raw Metal or OpenGL canvas and skip AX entirely. Some Qt apps expose only a window shell. A few Electron builds ship a half-broken bridge. In those cases the tree comes back nearly empty and the agent would have nothing to select on.

Fazm handles that the way every production computer-use agent ends up doing: the system prompt keeps a capture_screenshot escape hatch (Desktop/Sources/Chat/ChatPrompts.swift line 56), and when the tree is empty the model falls back to pixels and coordinate clicks. That is hybrid mode. Tree-first when it works, pixels only when they actually buy information.
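In Fazm the choice is made by the model under the system prompt, not by a hard-coded rule. Still, the decision is easy to state as code. The sketch below makes it explicit under loud assumptions: the threshold of three element lines, the Observation type, and the chooseObservation name are all invented for illustration, and an element line is taken to be any line starting with an opening square bracket.

```typescript
// Hypothetical tree-first, pixels-as-fallback rule.
type Observation =
  | { mode: "tree"; tree: string }
  | { mode: "screenshot"; reason: string };

// Assumed threshold: fewer element lines than this looks like a bare window shell.
const MIN_USEFUL_ELEMENTS = 3;

function chooseObservation(
  tree: string,
  minElements: number = MIN_USEFUL_ELEMENTS
): Observation {
  // Count lines that look like element entries, e.g. [AXButton (button)] ...
  const elementLines = tree
    .split("\n")
    .filter((line) => line.trim().startsWith("["));

  if (elementLines.length >= minElements) {
    return { mode: "tree", tree };
  }
  return {
    mode: "screenshot",
    reason: `only ${elementLines.length} element line(s) in tree`,
  };
}

// A canvas-rendered app that exposes just a window shell.
console.log(chooseObservation('[AXWindow (window)] "Game" x:0 y:0 w:800 h:600 visible'));
```

A count threshold is the simplest possible signal. A real agent might also look at whether any element matches the current goal before giving up on the tree.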


Frequently asked questions

How is Fazm's Claude computer use agent different from Anthropic's reference implementation?

Anthropic's reference computer use exposes one tool called 'computer' to the Claude API. That tool accepts actions like screenshot, mouse_move, left_click, type, key, and scroll. The model sees pixels and issues coordinates. Fazm keeps the same Claude (claude-sonnet-4-6 is the DEFAULT_MODEL at acp-bridge/src/index.ts line 1245), but replaces that single 'computer' tool with six MCP tools from a bundled binary: open_application_and_traverse, click_and_traverse, type_and_traverse, press_key_and_traverse, scroll_and_traverse, and refresh_traversal. Every one of those tools returns the action result AND the post-action accessibility tree in a single MCP response. Same model, different tool schema, completely different loop shape.

What does the '_and_traverse' suffix actually buy the agent?

It collapses the classic observe-act-observe cycle into one MCP round trip. In Anthropic's reference loop, the agent calls computer(action='left_click', coordinate=...), gets back an ack, then has to call computer(action='screenshot') to see what happened. That is two round trips, two tool-use turns, two sets of token accounting. Fazm's click_and_traverse synthesizes the click via CGEvent, then immediately re-walks the frontmost app's accessibility tree and returns both the action result and the new tree text in the same MCP reply. The next tool call always starts with fresh ground truth in context without a separate observation step.

Where does the screen context actually come from if there are no screenshots?

From the native macOS AXUIElement C API. Fazm's testAccessibilityPermission at Desktop/Sources/AppState.swift line 439 calls AXUIElementCreateApplication(pid) and AXUIElementCopyAttributeValue, the same primitives VoiceOver uses. The heavy lifting happens inside mcp-server-macos-use, a 21 MB Mach-O 64-bit arm64 binary bundled at Fazm.app/Contents/MacOS/mcp-server-macos-use and registered by the ACP bridge at acp-bridge/src/index.ts lines 1056 through 1064. It walks the frontmost app depth-first and emits one text line per element, with role, subrole, accessible title, and a CGFloat point frame. That text is what the Claude API sees.

Which Claude model runs the agent and why?

claude-sonnet-4-6. It is hardcoded as DEFAULT_MODEL at acp-bridge/src/index.ts line 1245, with SONNET_MODEL pointing at the same string on line 1246. Fazm does not need Opus for the typical workflow because the accessibility tree is already structured text; the model is substring-searching a few kilobytes of UTF-8, not interpreting a 500 KB screenshot. Users can switch to Haiku or Opus through the model picker (the bridge calls session/set_model on the existing ACP session to preserve conversation history), but Sonnet is the default.

Can I see what the model read on my own disk?

Yes. After any macos-use tool call, Fazm writes a pair of files to /tmp/macos-use/: <timestamp>_<tool>.txt (the accessibility tree as text) and <timestamp>_<tool>.png (a companion screenshot). The .txt file is the exact string the Claude API received in the MCP tool response. If you want to verify the agent clicked what it claimed to click, grep the .txt file for the element. If the element is not there but is visible in the .png, the agent was in fallback mode and reasoned from pixels instead.

How does Fazm decide between the accessibility tree and a screenshot fallback?

The system prompt at Desktop/Sources/Chat/ChatPrompts.swift (line 59) routes macos-use tools for Finder, Settings, Mail and native desktop apps, and Playwright for pages inside Chrome. The capture_screenshot tool is the explicit escape hatch for visual context. In practice, Claude picks the tree-first path on any app that exposes AX cleanly (Mail, Calendar, Slack, Safari, Finder, Settings, Messages). Apps that return an empty tree (some Qt builds, SDL games, Metal canvases) trigger screenshot fallback. This is the hybrid mode production computer-use agents converge on: cheap and rich when the tree works, pixels only when it does not.

Is this a developer framework I have to wire up myself?

No. Fazm is a consumer Mac app. Install it, grant Accessibility permission once through the standard TCC dialog (the same one VoiceOver uses), and the ACP bridge auto-registers the macos-use MCP server alongside Playwright, WhatsApp, and Google Workspace. BUILTIN_MCP_NAMES at acp-bridge/src/index.ts line 1266 is the authoritative list: fazm_tools, playwright, macos-use, whatsapp, google-workspace. You never write a tool definition, never manage an MCP config file, never touch a Python runtime. You open the app and type what you want.

What does a single round trip of this agent actually look like in practice?

User says 'click send.' The Claude API call includes the system prompt and the conversation history. The model emits a tool_use block for click_and_traverse with arguments like { text: 'Send' }. The ACP bridge forwards that to the macos-use binary over stdio. The binary walks the current tree, finds the visible AXButton whose title contains 'Send', synthesizes a CGEvent click at the midpoint of its frame, then re-walks the tree and returns the whole new tree text plus the click result. Claude receives both in one tool_result block and decides the next action. Four of these round trips make up a typical workflow. Each one is a single MCP call, not two.

Does the agent work across any Mac app or only specific ones?

Any Mac app that implements the macOS accessibility API well. The system-wide AXUIElement API is app-agnostic; Finder, Mail, Calendar, Messages, Safari, Slack, the System Settings panel, and most native Mac apps expose a full tree. Catalyst apps (including WhatsApp on Mac) also work, which is why Fazm bundles a separate whatsapp MCP that goes deeper into that specific app. The edge cases are apps that render to a raw Metal/OpenGL canvas and skip AX entirely; for those the agent drops to the screenshot fallback path.

Why does the response size matter for a Claude computer use agent?

A base64-encoded 4K screenshot is typically 500 KB to 5 MB of text in the Claude API request, which inflates the request body and slows the round trip before any vision tokens are counted. A real Fazm Dev traversal is 441 elements in 0.72 seconds, emitted as a few kilobytes of plain text. That is roughly two orders of magnitude less data to ship over the wire and to prefill through the model. Over a 20-step workflow, screenshot-first agents spend tens of seconds on image encoding and vision-token prefill alone. Tree-first agents spend that time on actual actions.
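The size gap is simple arithmetic. Base64 turns every 3 raw bytes into 4 characters, so the encoded length is 4 * ceil(n / 3). The sketch below applies that to an illustrative case. The 375,000-byte image and the 5,000-byte tree are assumed round numbers picked to land on the article's 500 KB and "a few kilobytes" figures; they are not measurements.

```typescript
// Payload arithmetic: base64 image versus plain-text tree.

// Padded base64 emits 4 characters for every 3 input bytes, rounding up.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

// How many times larger the encoded screenshot is than the tree text.
function payloadRatio(screenshotRawBytes: number, treeBytes: number): number {
  return base64Length(screenshotRawBytes) / treeBytes;
}

const assumedImageBytes = 375000; // illustrative raw PNG size
const assumedTreeBytes = 5000; // illustrative "few kilobytes" tree

console.log(base64Length(assumedImageBytes)); // 500000
console.log(payloadRatio(assumedImageBytes, assumedTreeBytes)); // 100
```

This counts bytes on the wire only. How many input tokens an image costs depends on how the API processes it, which this sketch does not model.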

fazm. AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
