Computer Use AI Agent: the six-tool local protocol no SERP result shows
Every page on the first SERP for this keyword describes a model: OpenAI's CUA, Anthropic Computer Use, Agent S2, Microsoft Copilot Studio. Same pattern: cloud-hosted, screenshot in, coordinates out. None of them shows the local wire protocol a consumer Mac agent actually uses. Fazm ships one: a 1917-line open-source Swift binary, version 1.6.0, with six tool declarations at file-anchored line numbers. This page walks the whole path.
What every SERP result covers, and what none of them do
Search the keyword and the first page is almost entirely model announcements and framework lists. OpenAI's Computer-Using Agent, Anthropic's Computer Use, Microsoft Copilot Studio computer use preview, Simular's Agent S2 paper beating OpenAI CUA on OSWorld, Cua's cloud desktop, AWS prescriptive guidance, labellerr's CUA explainer, an IEEE Spectrum feature.
They all answer the same questions. What is a computer-using agent, what model is behind it, what benchmark score did it get, what applications does it enable. Useful, but they stop at the API boundary. The reader never sees the actual tool list, the actual response shape, or the actual local file the agent reads from.
That is the gap. Below is the local protocol, byte by byte, with file-anchored line numbers.
The six tools, with line numbers
Clone github.com/mediar-ai/mcp-server-macos-use and open Sources/MCPServer/main.swift. It is 1917 lines in one file. Every tool Fazm uses to drive your Mac is declared in this block, at the line numbers shown below.
Notice the naming convention: every mutating tool ends in _and_traverse. The server performs the action and immediately rewalks the AX tree, returning both in the same MCP response. For the agent that means one round trip per step instead of two.
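To make the one-round-trip point concrete, here is a sketch of a single call in that convention. The tool name is from the file; the exact JSON-RPC envelope and response shape are assumptions based on standard MCP framing, not a capture from the binary.

```typescript
// Hypothetical shapes; the real wire format is JSON-RPC 2.0 over stdio.
interface ToolCall {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// One request carries the action...
const request: ToolCall = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "macos-use_click_and_traverse",
    arguments: { element: "Send" },
  },
};

// ...and the single response carries both the action result and the
// fresh accessibility tree, so no second "observe" call is needed.
const response = {
  jsonrpc: "2.0",
  id: 1,
  result: {
    content: [
      { type: "text", text: 'Clicked "Send"' },
      { type: "text", text: '[AXWindow] "Slack" ... (fresh AX tree follows)' },
    ],
  },
};
```

A screenshot agent needs two such exchanges per step: one to act, one to re-observe.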
What each tool does, one line each
This is the full surface area of the computer use agent. Six tools. Every one of them lands at a file-anchored line number in the open source binary.
open_application_and_traverse
Activate a Mac app by name and return its AX tree in one call. Declared at main.swift line 1301.
click_and_traverse
Synthesize a CGEvent click at a point or a named element, then rewalk the tree. Declared at line 1329.
type_and_traverse
Type a string into the focused field with optional pressKey chain, then rewalk. Declared at line 1349.
refresh_traversal
Pure observe: rewalk without acting. The only tool without a side effect. Declared at line 1363.
press_key_and_traverse
Press a named key or key combo (Return, Command+T), then rewalk. Declared at line 1384.
scroll_and_traverse
Scroll in a direction by a line count, then rewalk so occluded elements enter the tree. Line 1402.
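The six declarations above can be summarized as data. The tool names are exactly as the FAQ on this page lists them; the argument sketches are illustrative shorthand, not the JSON Schemas from the file.

```typescript
// Tool names as declared in main.swift (lines 1301-1402); the arg lists
// are a sketch of each tool's surface, not the exact schemas.
const tools = [
  { name: "macos-use_open_application_and_traverse", args: ["app"] },
  { name: "macos-use_click_and_traverse", args: ["element", "x", "y"] },
  { name: "macos-use_type_and_traverse", args: ["text", "pressKey"] },
  { name: "macos-use_refresh_traversal", args: [] },
  { name: "macos-use_press_key_and_traverse", args: ["key"] },
  { name: "macos-use_scroll_and_traverse", args: ["direction", "lines"] },
];

// The naming convention is machine-checkable: every mutating tool ends
// in _and_traverse; the one pure observer does not.
const mutating = tools.filter((t) => t.name.endsWith("_and_traverse"));
console.log(mutating.length); // → 5 (refresh_traversal is the pure observer)
```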
“Every mutating tool returns the new AX tree in the same call. Observe-act-observe collapses to one round trip.”
main.swift lines 1301-1402, naming convention "_and_traverse"
Eight lines that put the agent on every Mac
The Node bridge inside Fazm (acp-bridge) is what sits between the Claude Agent SDK session and every MCP server the model can call. When a new session starts, the bridge walks a registration block and pushes every bundled server onto the list the model sees.
Here is the entry for the macOS accessibility binary. Eight lines, no conditional apart from the existsSync check, no configuration file:
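A sketch of what those eight lines amount to, reconstructed from the description on this page. The function name, parameter shape, and install path are illustrative, not the exact acp-bridge source; the behaviour (existsSync guard, register under "macos-use") is as described.

```typescript
import { existsSync } from "node:fs";

// Illustrative reconstruction of the registration block, not the exact
// acp-bridge/src/index.ts source. The install path is an assumption.
function registerMacosUse(
  mcpServers: { name: string; command: string }[],
  binary = "/Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use",
) {
  // If the bundled binary is present, every session sees it as "macos-use".
  if (existsSync(binary)) {
    mcpServers.push({ name: "macos-use", command: binary });
  }
}
```

No config file, no environment variable: the guard is the entire setup logic.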
That is the whole setup. The user installs the DMG, grants Accessibility permission once (same TCC dialog VoiceOver uses), and every conversation from that point on has the six tools in the toolbox.
Where every tool call lands when it runs
Two filter functions the model never sees
Before the tree is serialized and written to disk, two functions drop elements the agent has no business reasoning about: scrollbars and their arrow buttons, and empty layout rows, cells, columns, and menus. The filters run on lowercased role strings with zero app-specific code, so the behaviour is identical across every Mac app.
This is free token savings the screenshot path simply cannot replicate. A vision model has to decide, on every frame, whether a scrollbar is meaningful; a tree agent never sees it.
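A sketch of the two predicates. The names mirror the file (isScrollBarNoise at line 592, isStructuralNoise at line 600), but the exact role lists here are illustrative, assembled from this page's description rather than copied from the Swift source.

```typescript
// Role lists are illustrative; the page says the real filters also cover
// value indicators and page buttons on the scrollbar side.
const SCROLLBAR_ROLES = ["axscrollbar", "axvalueindicator", "axincrementarrow", "axdecrementarrow"];
const STRUCTURAL_ROLES = ["axrow", "axcell", "axcolumn", "axmenu"];

interface AXNode { role: string; title: string; children: AXNode[] }

// Scrollbar chrome is noise regardless of content.
const isScrollBarNoise = (n: AXNode) =>
  SCROLLBAR_ROLES.includes(n.role.toLowerCase());

// Layout containers are noise only when they carry no text and no children.
const isStructuralNoise = (n: AXNode) =>
  STRUCTURAL_ROLES.includes(n.role.toLowerCase()) &&
  n.title === "" && n.children.length === 0;

const keep = (n: AXNode) => !isScrollBarNoise(n) && !isStructuralNoise(n);
```

Lowercased role-string matching is what keeps this app-agnostic: the same two predicates run unchanged against Slack, Finder, or Xcode.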
One full task, step by step
Here is what happens when a user types "reply to Marwan on Slack with on it" into Fazm. Six steps, four tool calls, zero screenshots unless the tree goes empty.
User asks in plain English
They type: "Reply to Marwan on Slack with `on it`."
Agent calls open_application_and_traverse
Arg: { app: "Slack" }
Agent substring-searches the tree
Looks for an AXCell or AXStaticText containing "Marwan"
Agent calls click_and_traverse
Arg: { element: "Marwan <lastname>" } or x,y from the matched line
Agent calls type_and_traverse
Arg: { text: "on it", pressKey: "Return" }
Agent confirms and returns
Calls refresh_traversal, reads the fresh tree, and sees the message in the scrollback
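The steps above can be written down as an ordered trace. The arguments echo the steps; the closing refresh_traversal is one plausible reading of the confirmation step, an assumption rather than a captured log.

```typescript
// The Slack task as a four-call trace; shapes are illustrative.
const trace = [
  { tool: "macos-use_open_application_and_traverse", args: { app: "Slack" } },
  { tool: "macos-use_click_and_traverse", args: { element: "Marwan" } },
  { tool: "macos-use_type_and_traverse", args: { text: "on it", pressKey: "Return" } },
  { tool: "macos-use_refresh_traversal", args: {} }, // assumed confirmation pass
];
```

Note what is absent: no screenshot upload, no coordinate guessing, and only one observe per action because each mutating call already returns the new tree.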
full message trace, model to Mac
What the model actually reads
Two ways to answer the same question: where does the Send button live on screen? First, what a screenshot-based computer use agent does; then, what this one does.
Same task. Different primitive.
A vision model receives a base64 PNG of the entire screen (500 KB to 5 MB). It runs OCR, detects rectangles, guesses which one is a button, guesses the label, guesses the coordinates. Returns click(1834, 972).
- Pixel input, 500 KB to 5 MB per turn
- OCR + shape detection + label inference
- Coordinates are guessed, not read
- Off-screen or attached-but-hidden elements cannot be represented
The tree agent substring-searches a UTF-8 traversal dump for "Send", finds one line carrying the role, the label, and the frame, and passes the named element straight to click_and_traverse.
- Text input, kilobytes rather than megabytes per turn
- Substring search, no OCR, no shape inference
- Coordinates are read from the matched line, not guessed
- Attached-but-hidden elements still appear in the tree
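The "read, not guessed" half can be made concrete. This sketch parses one line in the traversal format the FAQ on this page describes, `[Role] "text" x:N y:N w:W h:H visible`, and derives a click point from the frame; the regex is an assumption about the format's exact spelling.

```typescript
// Parse one assumed traversal line and return the element's center point.
const LINE = /^\[(\w+)\] "([^"]*)" x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+)/;

function clickPoint(line: string): { x: number; y: number } | null {
  const m = LINE.exec(line);
  if (!m) return null;
  const [, , , x, y, w, h] = m;
  // Center of the frame — arithmetic on values read from the line,
  // not inferred from pixels.
  return { x: Number(x) + Number(w) / 2, y: Number(y) + Number(h) / 2 };
}

console.log(clickPoint('[AXButton] "Send" x:1800 y:950 w:68 h:44 visible'));
// → { x: 1834, y: 972 }
```

Twelve lines of string handling replace an entire vision stack for this one question.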
Local protocol vs SERP consensus
Nine rows, every one tied to a file:line anchor in the open source repo.
| Feature | SERP consensus (cloud, pixel-first) | Fazm (local, tree-first) |
|---|---|---|
| Primitive input | base64 PNG of the screen | UTF-8 text from AXUIElement walk |
| Primitive output | coordinate clicks and typed strings | six MCP tool calls with named elements |
| Where the model runs | cloud (OpenAI, Anthropic, Microsoft, AWS) | cloud model, local bridge, local binary |
| Scope | a virtualized browser or remote desktop | any accessible Mac app on your own machine |
| Wire protocol you can inspect | closed API | MIT-licensed main.swift, lines 1301-1413 |
| Observe-act-observe | screenshot + action = 2 calls | _and_traverse suffix = 1 call |
| Bundled install | none (cloud login) | signed DMG, one TCC prompt |
| Noise filtering | model must ignore scrollbars in pixels | isScrollBarNoise at main.swift line 592 drops them before serialization |
| Version introspection | none | server returns "SwiftMacOSServerDirect" version "1.6.0" in handshake |
Run the greps yourself
Every line number on this page is reproducible. Clone the repo, cd into it, and run:
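A sketch of the verification commands. The repo URL and file path are from this page; the exact grep patterns are assumptions about how the strings appear in the source, so adjust them if a pattern misses.

```shell
git clone https://github.com/mediar-ai/mcp-server-macos-use
cd mcp-server-macos-use

# Line count of the single source file — expect 1917
wc -l Sources/MCPServer/main.swift

# The six tool declarations and their line numbers
grep -n '_and_traverse\|refresh_traversal' Sources/MCPServer/main.swift

# Server identity and version returned in the handshake
grep -n 'SwiftMacOSServerDirect\|1\.6\.0' Sources/MCPServer/main.swift
```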
GREP-VERIFIABLE CLAIMS ON THIS PAGE
- main.swift is exactly 1917 lines (wc -l)
- open_application_and_traverse declared at line 1301
- click_and_traverse at line 1329
- type_and_traverse at line 1349
- refresh_traversal at line 1363
- press_key_and_traverse at line 1384
- scroll_and_traverse at line 1402
- Server name SwiftMacOSServerDirect at line 1412
- Version 1.6.0 at line 1413
- Bridge registration at acp-bridge/src/index.ts lines 1057-1064
Want a computer use AI agent on your Mac, today?
Book 20 minutes. I will walk you through the six tools running against a real app, from the same binary this page cites.
Book a call →
Keep reading: adjacent pages on the same binary
Accessibility Tree Computer Use: Six Signals Pixels Cannot Carry
The data side. A real 441-element dump from the same binary, field by field.
Accessibility API AI Agents vs Screenshots
Latency, cost, and fidelity head-to-head between tree-first and pixel-first agents.
Claude Computer Use Agent
How the specific Claude Agent SDK session inside Fazm reaches the Mac via these six tools.
Frequently asked questions
What is a computer use AI agent?
A computer use AI agent is a program that takes a plain-English task, observes the state of a computer, and drives it with clicks, keystrokes, and scrolls until the task is done. The dominant public examples (OpenAI's CUA behind Operator, Anthropic Computer Use, Microsoft Copilot Studio computer use preview, Agent S2 on the OSWorld benchmark) all share one primitive: the model receives a screenshot of the screen and returns pixel coordinates. Fazm is the same category of product, but the primitive is different. Instead of pixels it sends the model a structured text dump of the macOS accessibility tree and a six-tool MCP protocol for acting on it.
How is the Fazm agent different from OpenAI's Computer-Using Agent?
OpenAI's CUA is a model. It lives on OpenAI's servers. The product (Operator) is a browser-embedded remote desktop that streams screenshots back and forth. Fazm is not a model, it is a signed consumer Mac app that plugs any frontier model (Claude, GPT, Gemini) into your actual desktop via a local bridge. The wire format between the model and your Mac is not JPEG + (x,y) clicks. It is a short JSON-RPC call from the six tools declared at Sources/MCPServer/main.swift lines 1301, 1329, 1349, 1363, 1384, and 1402. You can read the file. It is MIT-licensed.
What exactly are the six tools?
They are, in order of declaration: macos-use_open_application_and_traverse (line 1301), macos-use_click_and_traverse (line 1329), macos-use_type_and_traverse (line 1349), macos-use_refresh_traversal (line 1363), macos-use_press_key_and_traverse (line 1384), and macos-use_scroll_and_traverse (line 1402). Every tool that changes UI state ends in _and_traverse, meaning the server performs the action and immediately rewalks the accessibility tree, returning both in the same response. The agent gets fresh ground truth in one round trip instead of two.
Where is the server version set, and why does that matter?
Inside the same main.swift, at lines 1412 and 1413: name "SwiftMacOSServerDirect", version "1.6.0". This is the string the MCP handshake returns to the Claude Agent SDK when Fazm boots a session. It is how the bridge, the model, and any external diagnostic tool can identify the exact binary that is running. A screenshot-based cloud agent has nothing analogous: there is no file you can grep to learn what version of the click tool the model is calling.
How is the binary wired into a session?
At acp-bridge/src/index.ts lines 1057 through 1064, an eight-line existsSync check pushes the bundled binary into the MCP server list for every session the Fazm app spawns: if the file exists at Fazm.app/Contents/MacOS/mcp-server-macos-use, register it under the name "macos-use". No config file, no environment variable, no user setup. Once you install Fazm, the six accessibility tools are in the model's toolbox on the next prompt.
How big is the binary and how can I inspect it?
The Swift source is 1917 lines in a single file, main.swift, which compiles to a Mach-O 64-bit arm64 binary around 21 MB inside Fazm.app at Contents/MacOS/mcp-server-macos-use. You can invoke it directly from a terminal: pipe a JSON-RPC tools/list request into its stdin and it prints the six tool names plus the SwiftMacOSServerDirect 1.6.0 identity. That is exactly the same code path Fazm uses when the agent takes an action.
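The hand-driven inspection described above can be sketched as the exact bytes you would pipe in. The tools/list method name is standard MCP; the newline-delimited framing is an assumption about how this server reads stdin.

```typescript
// The request this page says you can pipe into the binary's stdin.
const listRequest = JSON.stringify({
  jsonrpc: "2.0",
  id: 0,
  method: "tools/list",
});

// From a terminal (framing assumed to be one JSON-RPC message per line):
//   echo '{"jsonrpc":"2.0","id":0,"method":"tools/list"}' | \
//     ./Fazm.app/Contents/MacOS/mcp-server-macos-use
console.log(listRequest);
```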
Why accessibility APIs instead of screenshots?
A screenshot is a raster. The model has to infer role, title, clickability, and bounding box from pixel values, often via OCR. The accessibility tree already has all of that as structured data, because the OS uses it for VoiceOver, Switch Control, and every built-in assistive feature. Every line in a Fazm traversal file has the form [Role] "text" x:N y:N w:W h:H visible. The model substring-searches that text, reads the frame directly from the matched line, and passes it to click_and_traverse. No vision tokens, no OCR, no inference.
What noise does the server filter before the agent sees the tree?
Two filter functions run before the tree hits disk. isScrollBarNoise at main.swift line 592 drops scrollbars, value indicators, page buttons, and arrow buttons. isStructuralNoise at line 600 drops empty AXRow, AXCell, AXColumn, and AXMenu containers that exist only for layout. Both work off case-insensitive role-string matches with zero app-specific code, so an agent reading the dump never wastes a token on padding.
Does it work with any app or just browsers?
Any Mac app that implements the macOS accessibility API, which is most of them because the OS requires it for assistive tech compliance. That includes native Cocoa apps (Mail, Notes, Xcode, Finder), Catalyst apps (Messages, WhatsApp), most Electron apps (Slack, Notion, VSCode, Linear), and Mac Office. For apps with broken or empty trees (certain SDL games, some OpenGL canvases), Fazm falls back to the screenshot sibling file the same traversal writes next to the .txt. Start with the tree, fall back to pixels.
What file formats does a tool call produce on disk?
Every traversal writes a timestamped (.txt, .png) pair to /tmp/macos-use/. The .txt is the full tree the model is reasoning against. The .png is a screenshot of the same window. The agent's context only receives a short stdio summary pointing at those files plus a sample of on-screen elements. If you ever want to audit what the model saw when it clicked Send, open the matching .txt and grep for Send: if the element is in the file, the model read it from the tree.
Do I need to write code to use this?
No. Fazm is a consumer app, not a developer framework. The six tools, the bridge, the bundled Claude Agent SDK, and the MCP registry are all packaged into one signed, notarized DMG. You install it, grant Accessibility permission once (same TCC dialog VoiceOver uses), and talk to it in English. The model picks the tools, the bridge routes them, the binary walks the tree. You never edit a config file.
Is it open source?
The macOS-use MCP server is MIT-licensed at github.com/mediar-ai/mcp-server-macos-use. The Fazm app itself is open source at github.com/mediar-ai/fazm. Both ship with all the line numbers, function names, and filter lists referenced here. You can verify every claim on this page with grep.