Computer Use AI Agent: the six-tool local protocol no SERP result shows
Every page on the first SERP for this keyword describes a model: OpenAI's CUA, Anthropic Computer Use, Agent S2, Microsoft Copilot Studio. Same pattern: cloud-hosted, screenshot in, coordinates out. None of them shows the local wire protocol a consumer Mac agent actually uses. Fazm ships one: a 1917-line open-source Swift binary, version 1.6.0, with six tool declarations at file-anchored line numbers. This page walks the whole path.
What every SERP result covers, and what none of them do
Search the keyword and the first page is almost entirely model announcements and framework lists. OpenAI's Computer-Using Agent, Anthropic's Computer Use, Microsoft Copilot Studio computer use preview, Simular's Agent S2 paper beating OpenAI CUA on OSWorld, Cua's cloud desktop, AWS prescriptive guidance, labellerr's CUA explainer, an IEEE Spectrum feature.
They all answer the same questions. What is a computer-using agent, what model is behind it, what benchmark score did it get, what applications does it enable. Useful, but they stop at the API boundary. The reader never sees the actual tool list, the actual response shape, or the actual local file the agent reads from.
That is the gap. Below is the local protocol, byte by byte, with file-anchored line numbers.
The six tools, with line numbers
Clone github.com/mediar-ai/mcp-server-macos-use and open Sources/MCPServer/main.swift. It is 1917 lines in one file. Every tool Fazm uses to drive your Mac is declared in this block, at the line numbers shown below.
Notice the naming convention: every mutating tool ends in _and_traverse. The server performs the action and immediately rewalks the AX tree, returning both in the same MCP response. For the agent that means one round trip per step instead of two.
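To make the one-round-trip point concrete, here is a sketch of a single call in that convention. The tool name is from the file; the exact JSON-RPC envelope and response shape are assumptions based on standard MCP framing, not a capture from the binary.

```typescript
// Hypothetical shapes; the real wire format is JSON-RPC 2.0 over stdio.
interface ToolCall {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// One request carries the action...
const request: ToolCall = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "macos-use_click_and_traverse",
    arguments: { element: "Send" },
  },
};

// ...and the single response carries both the action result and the
// fresh accessibility tree, so no second "observe" call is needed.
const response = {
  jsonrpc: "2.0",
  id: 1,
  result: {
    content: [
      { type: "text", text: 'Clicked "Send"' },
      { type: "text", text: '[AXWindow] "Slack" ... (fresh AX tree follows)' },
    ],
  },
};
```

A screenshot agent needs two such exchanges per step: one to act, one to re-observe.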
What each tool does, one line each
This is the full surface area of the computer use agent. Six tools. Every one of them lands at a file-anchored line number in the open source binary.
open_application_and_traverse
Activate a Mac app by name and return its AX tree in one call. Declared at main.swift line 1301.
click_and_traverse
Synthesize a CGEvent click at a point or a named element, then rewalk the tree. Declared at line 1329.
type_and_traverse
Type a string into the focused field with optional pressKey chain, then rewalk. Declared at line 1349.
refresh_traversal
Pure observe: rewalk without acting. The only tool without a side effect. Declared at line 1363.
press_key_and_traverse
Press a named key or key combo (Return, Command+T), then rewalk. Declared at line 1384.
scroll_and_traverse
Scroll in a direction by a line count, then rewalk so occluded elements enter the tree. Line 1402.
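The six declarations above can be summarized as data. The tool names are exactly as the FAQ on this page lists them; the argument sketches are illustrative shorthand, not the JSON Schemas from the file.

```typescript
// Tool names as declared in main.swift (lines 1301-1402); the arg lists
// are a sketch of each tool's surface, not the exact schemas.
const tools = [
  { name: "macos-use_open_application_and_traverse", args: ["app"] },
  { name: "macos-use_click_and_traverse", args: ["element", "x", "y"] },
  { name: "macos-use_type_and_traverse", args: ["text", "pressKey"] },
  { name: "macos-use_refresh_traversal", args: [] },
  { name: "macos-use_press_key_and_traverse", args: ["key"] },
  { name: "macos-use_scroll_and_traverse", args: ["direction", "lines"] },
];

// The naming convention is machine-checkable: every mutating tool ends
// in _and_traverse; the one pure observer does not.
const mutating = tools.filter((t) => t.name.endsWith("_and_traverse"));
console.log(mutating.length); // → 5 (refresh_traversal is the pure observer)
```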
“Every mutating tool returns the new AX tree in the same call. Observe-act-observe collapses to one round trip.”
main.swift lines 1301-1402, naming convention "_and_traverse"
Eight lines that put the agent on every Mac
The Node bridge inside Fazm (acp-bridge) is what sits between the Claude Agent SDK session and every MCP server the model can call. When a new session starts, the bridge walks a registration block and pushes every bundled server onto the list the model sees.
Here is the entry for the macOS accessibility binary. Eight lines, no conditional apart from the existsSync check, no configuration file:
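A sketch of what those eight lines amount to, reconstructed from the description on this page. The function name, parameter shape, and install path are illustrative, not the exact acp-bridge source; the behaviour (existsSync guard, register under "macos-use") is as described.

```typescript
import { existsSync } from "node:fs";

// Illustrative reconstruction of the registration block, not the exact
// acp-bridge/src/index.ts source. The install path is an assumption.
function registerMacosUse(
  mcpServers: { name: string; command: string }[],
  binary = "/Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use",
) {
  // If the bundled binary is present, every session sees it as "macos-use".
  if (existsSync(binary)) {
    mcpServers.push({ name: "macos-use", command: binary });
  }
}
```

No config file, no environment variable: the guard is the entire setup logic.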
That is the whole setup. The user installs the DMG, grants Accessibility permission once (same TCC dialog VoiceOver uses), and every conversation from that point on has the six tools in the toolbox.
Where every tool call lands when it runs
Two filter functions the model never sees
Before the tree is serialized and written to disk, two functions drop elements the agent has no business reasoning about: scrollbars and their arrow buttons, and empty layout rows, cells, columns, and menus. The filters run on lowercased role strings with zero app-specific code, so the behaviour is identical across every Mac app.
This is free token savings the screenshot path simply cannot replicate. A vision model has to decide, on every frame, whether a scrollbar is meaningful; a tree agent never sees it.
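A sketch of the two predicates. The names mirror the file (isScrollBarNoise at line 592, isStructuralNoise at line 600), but the exact role lists here are illustrative, assembled from this page's description rather than copied from the Swift source.

```typescript
// Role lists are illustrative; the page says the real filters also cover
// value indicators and page buttons on the scrollbar side.
const SCROLLBAR_ROLES = ["axscrollbar", "axvalueindicator", "axincrementarrow", "axdecrementarrow"];
const STRUCTURAL_ROLES = ["axrow", "axcell", "axcolumn", "axmenu"];

interface AXNode { role: string; title: string; children: AXNode[] }

// Scrollbar chrome is noise regardless of content.
const isScrollBarNoise = (n: AXNode) =>
  SCROLLBAR_ROLES.includes(n.role.toLowerCase());

// Layout containers are noise only when they carry no text and no children.
const isStructuralNoise = (n: AXNode) =>
  STRUCTURAL_ROLES.includes(n.role.toLowerCase()) &&
  n.title === "" && n.children.length === 0;

const keep = (n: AXNode) => !isScrollBarNoise(n) && !isStructuralNoise(n);
```

Lowercased role-string matching is what keeps this app-agnostic: the same two predicates run unchanged against Slack, Finder, or Xcode.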
One full task, step by step
Here is what happens when a user types "reply to Marwan on Slack with on it" into Fazm. Six steps, four tool calls, zero screenshots unless the tree goes empty.
User asks in plain English
They type: "Reply to Marwan on Slack with `on it`."
Agent calls open_application_and_traverse
Arg: { app: "Slack" }
Agent substring-searches the tree
Looks for an AXCell or AXStaticText containing "Marwan"
Agent calls click_and_traverse
Arg: { element: "Marwan <lastname>" } or x,y from the matched line
Agent calls type_and_traverse
Arg: { text: "on it", pressKey: "Return" }
Agent confirms and returns
Calls refresh_traversal, reads the fresh tree, and sees the message in the scrollback
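The steps above can be written down as an ordered trace. The arguments echo the steps; the closing refresh_traversal is one plausible reading of the confirmation step, an assumption rather than a captured log.

```typescript
// The Slack task as a four-call trace; shapes are illustrative.
const trace = [
  { tool: "macos-use_open_application_and_traverse", args: { app: "Slack" } },
  { tool: "macos-use_click_and_traverse", args: { element: "Marwan" } },
  { tool: "macos-use_type_and_traverse", args: { text: "on it", pressKey: "Return" } },
  { tool: "macos-use_refresh_traversal", args: {} }, // assumed confirmation pass
];
```

Note what is absent: no screenshot upload, no coordinate guessing, and only one observe per action because each mutating call already returns the new tree.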
full message trace, model to Mac
What the model actually reads
Two ways to answer the same question: where does the Send button live on screen? First, what a screenshot-based computer use agent does; then, what this one does.
Same task. Different primitive.
A vision model receives a base64 PNG of the entire screen (500 KB to 5 MB). It runs OCR, detects rectangles, guesses which one is a button, guesses the label, guesses the coordinates. Returns click(1834, 972).
- Pixel input, 500 KB to 5 MB per turn
- OCR + shape detection + label inference
- Coordinates are guessed, not read
- Off-screen or attached-but-hidden elements cannot be represented
The tree agent substring-searches a UTF-8 traversal dump for "Send", finds one line carrying the role, the label, and the frame, and passes the named element straight to click_and_traverse.
- Text input, kilobytes rather than megabytes per turn
- Substring search, no OCR, no shape inference
- Coordinates are read from the matched line, not guessed
- Attached-but-hidden elements still appear in the tree
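The "read, not guessed" half can be made concrete. This sketch parses one line in the traversal format the FAQ on this page describes, `[Role] "text" x:N y:N w:W h:H visible`, and derives a click point from the frame; the regex is an assumption about the format's exact spelling.

```typescript
// Parse one assumed traversal line and return the element's center point.
const LINE = /^\[(\w+)\] "([^"]*)" x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+)/;

function clickPoint(line: string): { x: number; y: number } | null {
  const m = LINE.exec(line);
  if (!m) return null;
  const [, , , x, y, w, h] = m;
  // Center of the frame — arithmetic on values read from the line,
  // not inferred from pixels.
  return { x: Number(x) + Number(w) / 2, y: Number(y) + Number(h) / 2 };
}

console.log(clickPoint('[AXButton] "Send" x:1800 y:950 w:68 h:44 visible'));
// → { x: 1834, y: 972 }
```

Twelve lines of string handling replace an entire vision stack for this one question.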
Local protocol vs SERP consensus
Nine rows, every one tied to a file:line anchor in the open source repo.
| Feature | SERP consensus (cloud, pixel-first) | Fazm (local, tree-first) |
|---|---|---|
| Primitive input | base64 PNG of the screen | UTF-8 text from AXUIElement walk |
| Primitive output | coordinate clicks and typed strings | six MCP tool calls with named elements |
| Where the model runs | cloud (OpenAI, Anthropic, Microsoft, AWS) | cloud model, local bridge, local binary |
| Scope | a virtualized browser or remote desktop | any accessible Mac app on your own machine |
| Wire protocol you can inspect | closed API | MIT-licensed main.swift, lines 1301-1413 |
| Observe-act-observe | screenshot + action = 2 calls | _and_traverse suffix = 1 call |
| Bundled install | none (cloud login) | signed DMG, one TCC prompt |
| Noise filtering | model must ignore scrollbars in pixels | isScrollBarNoise at main.swift line 592 drops them before serialization |
| Version introspection | none | server returns "SwiftMacOSServerDirect" version "1.6.0" in handshake |
Run the greps yourself
Every line number on this page is reproducible. Clone the repo, cd into it, and run:
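A sketch of the verification commands. The repo URL and file path are from this page; the exact grep patterns are assumptions about how the strings appear in the source, so adjust them if a pattern misses.

```shell
git clone https://github.com/mediar-ai/mcp-server-macos-use
cd mcp-server-macos-use

# Line count of the single source file — expect 1917
wc -l Sources/MCPServer/main.swift

# The six tool declarations and their line numbers
grep -n '_and_traverse\|refresh_traversal' Sources/MCPServer/main.swift

# Server identity and version returned in the handshake
grep -n 'SwiftMacOSServerDirect\|1\.6\.0' Sources/MCPServer/main.swift
```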
GREP-VERIFIABLE CLAIMS ON THIS PAGE
- main.swift is exactly 1917 lines (wc -l)
- open_application_and_traverse declared at line 1301
- click_and_traverse at line 1329
- type_and_traverse at line 1349
- refresh_traversal at line 1363
- press_key_and_traverse at line 1384
- scroll_and_traverse at line 1402
- Server name SwiftMacOSServerDirect at line 1412
- Version 1.6.0 at line 1413
- Bridge registration at acp-bridge/src/index.ts lines 1057-1064
Want a computer use AI agent on your Mac, today?
Book 20 minutes. I will walk you through the six tools running against a real app, from the same binary this page cites.
Book a call →
Keep reading: adjacent pages on the same binary
Accessibility Tree Computer Use: Six Signals Pixels Cannot Carry
The data side. A real 441-element dump from the same binary, field by field.
Accessibility API AI Agents vs Screenshots
Latency, cost, and fidelity head-to-head between tree-first and pixel-first agents.
Claude Computer Use Agent
How the specific Claude Agent SDK session inside Fazm reaches the Mac via these six tools.
Frequently asked questions
What is a computer use AI agent?
A computer use AI agent is a program that takes a plain-English task, observes the state of a computer, and drives it with clicks, keystrokes, and scrolls until the task is done. The dominant public examples (OpenAI's CUA behind Operator, Anthropic Computer Use, Microsoft Copilot Studio computer use preview, Agent S2 on the OSWorld benchmark) all share one primitive: the model receives a screenshot of the screen and returns pixel coordinates. Fazm is the same category of product, but the primitive is different. Instead of pixels it sends the model a structured text dump of the macOS accessibility tree and a six-tool MCP protocol for acting on it.
How is the Fazm agent different from OpenAI's Computer-Using Agent?
OpenAI's CUA is a model. It lives on OpenAI's servers. The product (Operator) is a browser-embedded remote desktop that streams screenshots back and forth. Fazm is not a model, it is a signed consumer Mac app that plugs any frontier model (Claude, GPT, Gemini) into your actual desktop via a local bridge. The wire format between the model and your Mac is not JPEG + (x,y) clicks. It is a short JSON-RPC call from the six tools declared at Sources/MCPServer/main.swift lines 1301, 1329, 1349, 1363, 1384, and 1402. You can read the file. It is MIT-licensed.
What exactly are the six tools?
They are, in order of declaration: macos-use_open_application_and_traverse (line 1301), macos-use_click_and_traverse (line 1329), macos-use_type_and_traverse (line 1349), macos-use_refresh_traversal (line 1363), macos-use_press_key_and_traverse (line 1384), and macos-use_scroll_and_traverse (line 1402). Every tool that changes UI state ends in _and_traverse, meaning the server performs the action and immediately rewalks the accessibility tree, returning both in the same response. The agent gets fresh ground truth in one round trip instead of two.
Where is the server version set, and why does that matter?
Inside the same main.swift, at lines 1412 and 1413: name "SwiftMacOSServerDirect", version "1.6.0". This is the string the MCP handshake returns to the Claude Agent SDK when Fazm boots a session. It is how the bridge, the model, and any external diagnostic tool can identify the exact binary that is running. A screenshot-based cloud agent has nothing analogous: there is no file you can grep to learn what version of the click tool the model is calling.
How is the binary wired into a session?
At acp-bridge/src/index.ts lines 1057 through 1064, an eight-line existsSync check pushes the bundled binary into the MCP server list for every session the Fazm app spawns: if the file exists at Fazm.app/Contents/MacOS/mcp-server-macos-use, register it under the name "macos-use". No config file, no environment variable, no user setup. Once you install Fazm, the six accessibility tools are in the model's toolbox on the next prompt.
How big is the binary and how can I inspect it?
The Swift source is 1917 lines in a single file, main.swift, which compiles to a Mach-O 64-bit arm64 binary around 21 MB inside Fazm.app at Contents/MacOS/mcp-server-macos-use. You can invoke it directly from a terminal: pipe a JSON-RPC tools/list request into its stdin and it prints the six tool names plus the SwiftMacOSServerDirect 1.6.0 identity. That is exactly the same code path Fazm uses when the agent takes an action.
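The hand-driven inspection described above can be sketched as the exact bytes you would pipe in. The tools/list method name is standard MCP; the newline-delimited framing is an assumption about how this server reads stdin.

```typescript
// The request this page says you can pipe into the binary's stdin.
const listRequest = JSON.stringify({
  jsonrpc: "2.0",
  id: 0,
  method: "tools/list",
});

// From a terminal (framing assumed to be one JSON-RPC message per line):
//   echo '{"jsonrpc":"2.0","id":0,"method":"tools/list"}' | \
//     ./Fazm.app/Contents/MacOS/mcp-server-macos-use
console.log(listRequest);
```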
Why accessibility APIs instead of screenshots?
A screenshot is a raster. The model has to infer role, title, clickability, and bounding box from pixel values, often via OCR. The accessibility tree already has all of that as structured data, because the OS uses it for VoiceOver, Switch Control, and every built-in assistive feature. Every line in a Fazm traversal file has the form [Role] "text" x:N y:N w:W h:H visible. The model substring-searches that text, reads the frame directly from the matched line, and passes it to click_and_traverse. No vision tokens, no OCR, no inference.
What noise does the server filter before the agent sees the tree?
Two filter functions run before the tree hits disk. isScrollBarNoise at main.swift line 592 drops scrollbars, value indicators, page buttons, and arrow buttons. isStructuralNoise at line 600 drops empty AXRow, AXCell, AXColumn, and AXMenu containers that exist only for layout. Both work off case-insensitive role-string matches with zero app-specific code, so an agent reading the dump never wastes a token on padding.
Does it work with any app or just browsers?
Any Mac app that implements the macOS accessibility API, which is most of them because the OS requires it for assistive tech compliance. That includes native Cocoa apps (Mail, Notes, Xcode, Finder), Catalyst apps (Messages, WhatsApp), most Electron apps (Slack, Notion, VSCode, Linear), and Mac Office. For apps with broken or empty trees (certain SDL games, some OpenGL canvases), Fazm falls back to the screenshot sibling file the same traversal writes next to the .txt. Start with the tree, fall back to pixels.
What file formats does a tool call produce on disk?
Every traversal writes a timestamped (.txt, .png) pair to /tmp/macos-use/. The .txt is the full tree the model is reasoning against. The .png is a screenshot of the same window. The agent's context only receives a short stdio summary pointing at those files plus a sample of on-screen elements. If you ever want to audit what the model saw when it clicked Send, open the matching .txt and grep for Send: if the element is in the file, the model read it from the tree.
Do I need to write code to use this?
No. Fazm is a consumer app, not a developer framework. The six tools, the bridge, the bundled Claude Agent SDK, and the MCP registry are all packaged into one signed, notarized DMG. You install it, grant Accessibility permission once (same TCC dialog VoiceOver uses), and talk to it in English. The model picks the tools, the bridge routes them, the binary walks the tree. You never edit a config file.
Is it open source?
The macOS-use MCP server is MIT-licensed at github.com/mediar-ai/mcp-server-macos-use. The Fazm app itself is open source at github.com/mediar-ai/fazm. Both ship with all the line numbers, function names, and filter lists referenced here. You can verify every claim on this page with grep.