The best computer use agent is a five-engine router, not a single vision tool.
Every SERP result for this keyword ranks single-engine products: Claude's screenshot-plus-mouse tool, OpenAI Operator's cloud browser, Gemini's DOM-first Computer Use, Manus's hybrid agent. The shape that fits a real Mac user is different. Fazm registers BUILTIN_MCP_NAMES = {fazm_tools, playwright, macos-use, whatsapp, google-workspace} at acp-bridge/src/index.ts:1266 and boots all five as subprocesses on app launch. The model routes tool calls by prefix. That is the whole thesis.
The surfaces your agent actually touches in one session
Any AX-compliant Mac app, plus Chrome, plus WhatsApp, plus Google Workspace APIs, plus the local fazm.db. A single-engine screenshot tool covers the first item on this list at best, and it covers it through OCR.
The anchor: BUILTIN_MCP_NAMES at index.ts line 1266
One line of code is the whole ranking criterion. If your computer use agent does not ship with more than one engine, it will always trade off between 'works on Chrome' and 'works on Mail.' Fazm encodes the choice as a Set.
“BUILTIN_MCP_NAMES = new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"])”
acp-bridge/src/index.ts, line 1266
Orbit view: Fazm at the hub, five engines revolving
Five stdio subprocesses, one ACP session with Claude Sonnet 4.6. On every turn the bridge multiplexes tool calls across them by name prefix.
Five independent pushes into the same registry
Each engine gets its own push() call with a distinct command, args, and env. fazm_tools runs inside the bundled Node. playwright runs Microsoft's @playwright/mcp. macos-use and whatsapp are Swift binaries. google-workspace is a Python venv. Same registry, five runtimes.
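As a rough illustration of "same registry, five runtimes", here is a minimal TypeScript sketch. The `McpServerConfig` shape, the `buildRegistry` name, and every path and module name are assumptions made for the example. Only the five server names, the PYTHONHOME mechanism, and the two playwright flags come from this page.

```typescript
// Hypothetical registry entry; the real type in index.ts may differ.
interface McpServerConfig {
  name: string;
  command: string;
  args: string[];
  env: Record<string, string>;
}

// One push() per engine. All paths and script names are placeholders.
function buildRegistry(bundle: string): McpServerConfig[] {
  const servers: McpServerConfig[] = [];

  // Node runtime (placeholder script name).
  servers.push({ name: "fazm_tools", command: `${bundle}/node`, args: ["fazm-tools-stdio.js"], env: {} });

  // Node runtime running @playwright/mcp; the two flags are the ones quoted on this page.
  servers.push({
    name: "playwright",
    command: `${bundle}/node`,
    args: ["playwright-mcp.js", "--image-responses", "omit", "--output-mode", "file"],
    env: {},
  });

  // Native Swift binaries: no interpreter, no args in this sketch.
  servers.push({ name: "macos-use", command: `${bundle}/mcp-server-macos-use`, args: [], env: {} });
  servers.push({ name: "whatsapp", command: `${bundle}/whatsapp-mcp`, args: [], env: {} });

  // Python venv; the module name and PYTHONHOME value are placeholders.
  servers.push({
    name: "google-workspace",
    command: `${bundle}/python3`,
    args: ["-m", "google_workspace_mcp"],
    env: { PYTHONHOME: `${bundle}/.venv` },
  });

  return servers;
}
```

The point of the shape is that the registry does not care which runtime sits behind `command`; each entry is just something that speaks MCP over stdio.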
One prompt hits three engines. Here is the routing pipeline.
The LLM produces a prefixed tool name, the bridge reads the prefix, the right subprocess runs. Five destinations, one dispatch decision per call, zero extra inference to pick the engine.
Prompt -> bridge -> five stdio subprocesses
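Prefix routing of this kind can be sketched in a few lines of TypeScript. This is not the code in index.ts: `parseToolName`, `dispatch`, and the `Handler` type are hypothetical names, and each handler stands in for "write the request to that server's stdio pipe". Only the roster and the `mcp__<server>__<tool>` naming come from this page.

```typescript
// The roster quoted on this page.
const BUILTIN_MCP_NAMES = new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"]);

interface Route {
  server: string;
  tool: string;
}

// Splits "mcp__<server>__<tool>" into its parts; returns null if malformed.
function parseToolName(fullName: string): Route | null {
  const PREFIX = "mcp__";
  if (!fullName.startsWith(PREFIX)) return null;
  const rest = fullName.slice(PREFIX.length);
  const sep = rest.indexOf("__");
  if (sep <= 0) return null;
  const tool = rest.slice(sep + 2);
  if (tool.length === 0) return null;
  return { server: rest.slice(0, sep), tool };
}

// Hypothetical stand-in for one stdio subprocess.
type Handler = (tool: string, input: unknown) => string;

// One Map lookup per call; no second model is consulted.
function dispatch(handlers: Map<string, Handler>, fullName: string, input: unknown): string {
  const route = parseToolName(fullName);
  if (route === null) throw new Error(`not an MCP tool name: ${fullName}`);
  const handler = handlers.get(route.server);
  if (handler === undefined) throw new Error(`no server registered for: ${route.server}`);
  return handler(route.tool, input);
}

// Wire one fake handler per builtin name.
const handlers = new Map<string, Handler>();
BUILTIN_MCP_NAMES.forEach((name) => handlers.set(name, (tool) => `${name} ran ${tool}`));
```

With these fake handlers, `dispatch(handlers, "mcp__whatsapp__whatsapp_send_message", {})` returns `"whatsapp ran whatsapp_send_message"`. The server name is split at the first double underscore, so single underscores and hyphens inside names such as `fazm_tools` or `macos-use` are safe.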
What each engine actually is
Five boxes, five runtimes, five coverage windows. Read each one as a 'where this engine is the best tool on your Mac' note.
fazm_tools (stdio Node)
Local SQL on fazm.db, capture_screenshot, scan_files, browser profile extraction, ask_followup. Runs as a Node subprocess connecting back to the app via Unix socket for SQL approval cards. Defined at acp-bridge/src/fazm-tools-stdio.ts and registered at index.ts:1015-1020.
playwright (Chrome via MCP)
Microsoft's @playwright/mcp binary with --image-responses omit and --output-mode file. Snapshots land in /tmp/playwright-mcp and the bridge references them by path, keeping base64 pixels out of the model context. index.ts:1027-1054.
macos-use (native Swift AX)
21 MB Mach-O arm64 binary at Fazm.app/Contents/MacOS/mcp-server-macos-use. Six tools, five end in _and_traverse, one RPC per action. Per-element AXUIElementSetMessagingTimeout set to 5 s at main.swift:245. Binary bundled by build.sh:131-140.
whatsapp (Catalyst AX)
Native Swift binary dedicated to WhatsApp Desktop, which is a Mac Catalyst app with its own AX quirks. Sends messages, lists chats, reads history. Bundled at Fazm.app/Contents/MacOS/whatsapp-mcp by build.sh:143-151.
google-workspace (Python venv)
UV-managed Python venv at Contents/Resources/google_workspace_mcp/.venv, invoked through PYTHONHOME at index.ts:1076-1100. Gmail, Calendar, Drive, Docs via official APIs, faster and cleaner than DOM automation. OAuth token stored under ~/.google_workspace_mcp/credentials.
The native binary bundling step
Two of the five engines are Swift binaries built inside CI, copied into Fazm.app before codesign. The third native dependency is the Python venv for google-workspace, copied separately. The symmetry is the point: every engine ships prebuilt, so the user never runs pip install or swift build.
What has to be true on launch for the router to work
Fazm does not trust the file system. Every engine has a guard clause (existsSync(binary)) before being appended. A missing binary degrades the router rather than crashing the app. The checks below are the live contract.
Pre-flight checks that run inside buildMcpServers()
- fazm_tools stdio subprocess handshake (execute_sql, capture_screenshot ready)
- playwright --extension token read from UserDefaults.playwrightExtensionToken
- macos-use: AXUIElementSetMessagingTimeout(app, 5.0) on every AX element
- whatsapp: AX probe against WhatsApp.app PID when launched
- google-workspace: OAuth credentials loaded from ~/.google_workspace_mcp
- User MCP servers from ~/.fazm/mcp-servers.json appended after the five builtins
- BUILTIN_MCP_NAMES set locks the five canonical names for routing
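The "degrade, do not crash" behavior of a guard clause can be sketched as follows. This is an assumption-laden illustration, not the body of buildMcpServers(): `EngineCandidate`, `Preflight`, and `preflight` are hypothetical names, and the existence check is injected so the sketch runs without touching the disk (in Node it could be `fs.existsSync`).

```typescript
interface EngineCandidate {
  name: string;
  binary: string;
}

interface Preflight {
  registered: string[];
  skipped: string[];
}

// Registers only the engines whose binary is present; a missing one is
// recorded and skipped instead of throwing.
function preflight(candidates: EngineCandidate[], exists: (path: string) => boolean): Preflight {
  const result: Preflight = { registered: [], skipped: [] };
  for (const candidate of candidates) {
    if (exists(candidate.binary)) {
      result.registered.push(candidate.name);
    } else {
      result.skipped.push(candidate.name);
    }
  }
  return result;
}
```

If the whatsapp binary were missing, the router would come up with the remaining engines and the model would simply not see any `mcp__whatsapp__*` tools.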
The macos-use tool schemas, verbatim
The single engine most reviewers undersell. Six tools, five of which fuse action and observation into one RPC via the _and_traverse suffix. click_and_traverse goes further and chains click, type, and keypress into a single call (the text and pressKey params sit on the same schema). That is unusual.
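To make the fused action-plus-observation idea concrete, here is a toy TypeScript model of it. The real engine is a Swift binary working on AXUIElement trees; everything below (`FakeApp`, `traverse`, `clickAndTraverse`, the `element` parameter) is a hypothetical stand-in. Only the `text` and `pressKey` parameter names are taken from this page.

```typescript
// In-memory stand-in for an app's accessibility tree.
interface FakeApp {
  focused: string | null;
  fields: Map<string, string>;
  keyLog: string[];
}

interface ClickAndTraverseParams {
  element: string;   // hypothetical: which element to click
  text?: string;     // optional text typed after the click
  pressKey?: string; // optional key pressed after typing
}

// The "observe" half: a sorted text summary of the tree, focus marked with *.
function traverse(app: FakeApp): string[] {
  return Array.from(app.fields.entries())
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
    .map(([name, value]) => `${name}${name === app.focused ? "*" : ""}: "${value}"`);
}

// The fused call: click, optionally type, optionally press a key, then
// return the fresh tree in the same response.
function clickAndTraverse(app: FakeApp, params: ClickAndTraverseParams): string[] {
  const current = app.fields.get(params.element);
  if (current === undefined) throw new Error(`no such element: ${params.element}`);
  app.focused = params.element;
  if (params.text !== undefined) app.fields.set(params.element, current + params.text);
  if (params.pressKey !== undefined) app.keyLog.push(params.pressKey);
  return traverse(app);
}
```

A caller that wants to focus a field, type into it, and press Tab makes one call and gets the updated tree back, instead of issuing three actions with an observation between each.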
The full path from prompt to engine: six steps
Each step is code you can read in acp-bridge/src/index.ts. The routing decision is deterministic; it runs on tool-name prefixes, not on another LLM inference.
Model emits a tool_use with a prefixed name
Claude Sonnet 4.6 picks between mcp__macos-use__macos-use_click_and_traverse, mcp__playwright__browser_click, mcp__whatsapp__whatsapp_send_message, mcp__google-workspace__gmail_send, or mcp__fazm_tools__execute_sql. The mcp__<server>__<tool> prefix comes from the Claude Agent SDK's naming convention for MCP tools, not from the MCP specification itself; there is no dispatcher LLM.
acp-bridge strips the prefix and selects the subprocess
index.ts lines 2458-2492 check the name prefix: 'mcp__playwright__' routes to the Playwright subprocess, 'mcp__macos-use__' to the Swift AX binary, and so on. The bridge maintains one stdio pair per server, so the lookup is O(1).
The native engine runs the action
For macos-use: AXUIElementCreateApplication on the target pid, walks kAXChildren, performs the action, re-walks the tree. For playwright: driver sends a CDP message to Chrome. For whatsapp: AX probe into the Catalyst WindowServer surface. For google-workspace: Python calls the Google API client. For fazm_tools: SQL runs through the app's DB handle.
The engine returns a text content block to the bridge
Every engine wraps its result as { content: [{ type: 'text', text: ... }] }. macos-use returns a fresh AX tree summary. playwright returns a YAML snapshot reference. whatsapp returns message history. google-workspace returns API JSON. fazm_tools returns a rows preview.
The bridge filter at index.ts:2271-2307 forwards text only
The MCP tool-result handler has exactly two text branches and zero image branches. Whatever text the engine produced flows through as the tool_result. The model sees structured text, not base64 bytes. That is why five engines can share one context budget on a single turn.
Next LLM turn streams a follow-up tool_use on the same or another engine
Because prefixes are stable across the conversation, the model can chain 'open Mail' on macos-use, then 'send an email' on google-workspace, then 'message Sara on WhatsApp' on whatsapp without re-negotiating. That is the router pattern in steady state.
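The text-only forwarding described in the steps above can be sketched like this. The content block shapes follow MCP's text and image content types; `forwardTextOnly` is a hypothetical name, and the real handler in index.ts may be structured differently.

```typescript
type TextBlock = { type: "text"; text: string };
type ImageBlock = { type: "image"; data: string; mimeType: string };
type ContentBlock = TextBlock | ImageBlock;

interface ToolResult {
  content: ContentBlock[];
}

// Keeps text blocks, drops everything else, and joins what is left.
function forwardTextOnly(result: ToolResult): string {
  return result.content
    .filter((block): block is TextBlock => block.type === "text")
    .map((block) => block.text)
    .join("\n");
}
```

An image block contributes nothing to the forwarded string, which is how a screenshot returned by an engine can cost zero context in this model.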
A real session log: three actions, two engines, one prompt
Trimmed from /tmp/fazm-dev.log. This is what a single user prompt looks like when it crosses Mail (macos-use) and WhatsApp (whatsapp) in one turn, with zero context switches visible to the user.
Fazm vs a typical single-engine computer use agent
Nine head-to-head rows. Each row is backed by a file and line in the Fazm source tree, not an opinion.
| Feature | Single-engine agent | Fazm (5-engine router) |
|---|---|---|
| Number of execution engines registered by default | 1 ('computer' tool: screenshot + mouse/keyboard) or cloud browser only | 5 (fazm_tools, playwright, macos-use, whatsapp, google-workspace) at index.ts:1266 |
| Where the agent actually runs | Docker container, cloud VM, or Chrome extension with remote backend | On the user's Mac, as a signed .app with three native Mach-O binaries |
| Mac-native app coverage (Mail, Finder, Slack Catalyst, Figma) | None natively. Screenshot+OCR only; misses overflow menus and offscreen UI | macos-use reads AX tree of any AX-compliant app (build.sh:131-140) |
| WhatsApp automation | No first-class support. WhatsApp Web via screenshots only | Dedicated whatsapp-mcp native binary at Contents/MacOS/whatsapp-mcp |
| Google Workspace (Gmail, Calendar, Drive) | Web UI automation only; hits Google's bot checks, slower, fragile | API calls via bundled Python venv, not DOM automation (index.ts:1076-1100) |
| Setup cost to a non-developer user | Docker + VNC + API keys + Anthropic bill, or cloud signup + subscription | Drag Fazm.app to /Applications. Five engines boot on launch. Zero API keys required |
| Per-turn token cost of observation | Full-screen PNG base64 at ~350K tokens per 1920x1200 capture | AX tree + DOM YAML as text. macos-use traversal is typically 500-2000 tokens |
| Extending with your own MCP server | Fork the SDK, rebuild the container, or not supported at all | ~/.fazm/mcp-servers.json appended by acp-bridge/src/index.ts:1104 |
| Audit trail of what the agent did | Opaque; most agents log only the final response, not per-engine RPCs | Every MCP call is an stdio line in /tmp/fazm-dev.log; fazm_tools writes to fazm.db |
See the five-engine router live on your own Mac
Book a 20-minute demo. We will boot Fazm, watch the five MCP subprocesses start, prompt it to send a Gmail reply, schedule a Calendar event, DM a teammate on WhatsApp, and open Figma, all in one session. You will see the stdio dispatch trace in real time.
Book a call →
Frequently asked questions
What makes a computer use agent 'best' in 2026?
Multi-engine coverage on the surface the user actually sits in front of. The top SERP roundups for this query rank single-engine products: Claude's computer tool is a screenshot plus mouse/keyboard call, OpenAI Operator is a cloud browser streaming screenshots, Gemini Computer Use privileges Chrome's DOM tree. None of those cover Apple Mail, Slack Catalyst, Finder, Figma desktop, or WhatsApp on the same run. Fazm boots five MCP servers simultaneously inside one Mac app (acp-bridge/src/index.ts, function buildMcpServers at line 992) and routes tool calls by name prefix. That is the structural difference.
Which five engines does Fazm register?
The canonical list is `BUILTIN_MCP_NAMES = new Set(["fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace"])` at acp-bridge/src/index.ts:1266. Each is registered by a separate block in buildMcpServers between lines 1015 and 1100. fazm_tools (execute_sql, capture_screenshot, browser profile, scan_files) runs as a stdio Node subprocess connecting back via Unix socket. playwright is Microsoft's @playwright/mcp with --image-responses omit and --output-mode file. macos-use is a native Swift Mach-O binary at Fazm.app/Contents/MacOS/mcp-server-macos-use. whatsapp is a native Swift binary at Fazm.app/Contents/MacOS/whatsapp-mcp. google-workspace is a bundled Python venv invoked through PYTHONHOME pointing at the app's .venv.
Why register five engines instead of one?
Each app family is best automated through its native accessibility surface. Chrome has a DOM and CDP, so playwright-mcp is the right tool. Mail, Finder, Slack Catalyst, Figma, Xcode all have AXUIElement trees but no DOM; macos-use reads them. WhatsApp Catalyst has AX plus a specific end-to-end encrypted message store that its MCP server understands. Google Workspace ships APIs for Gmail/Calendar/Drive that are faster and cleaner than driving the web UI. A single-engine agent forced to OCR every surface loses accuracy and spends tokens that a native engine would not. See build.sh lines 131-151 for the two Swift engine binaries (macos-use and whatsapp) copied into Fazm.app/Contents/MacOS/ before codesign.
How does Fazm route a tool call to the right engine?
The model receives tool names prefixed by the server name: mcp__playwright__browser_click, mcp__macos-use__macos-use_click_and_traverse, mcp__whatsapp__whatsapp_send_message, etc. When the LLM emits a tool_use, acp-bridge/src/index.ts inspects the prefix (lines 2458 and 2492 filter on name.hasPrefix('mcp__playwright__') and name.contains('browser') || name.contains('playwright')) and dispatches to the right stdio subprocess. There is no router LLM; the prefix is the route. This is faster and more deterministic than a meta-agent.
How is this different from Claude Computer Use and OpenAI Operator?
Anthropic's reference Claude computer use exposes one tool named 'computer': the model requests screenshots and issues mouse/keyboard actions at pixel coordinates, and the customer runs the environment that executes them (Docker, VM, or local). OpenAI Operator is a cloud browser; it never touches your Mac. Fazm ships a signed, notarized Mac .app that on launch starts the five MCP servers above as subprocesses and wires them into a Claude Sonnet 4.6 session via the Claude Agent SDK. There is no cloud VM, no Docker, no API key setup by default. macos-use and whatsapp are Swift binaries bundled inside Fazm.app's Contents/MacOS folder next to the main Fazm executable.
What does macos-use do that screenshot agents cannot?
It reads the accessibility tree of the frontmost Mac app. The native Swift binary declares six tools in main.swift lines 1300-1408: macos-use_open_application_and_traverse (line 1301), macos-use_click_and_traverse (1329), macos-use_type_and_traverse (1349), macos-use_refresh_traversal (1363), macos-use_press_key_and_traverse (1384), macos-use_scroll_and_traverse (1402). Every action tool ends in _and_traverse because each call performs the action then walks the AX tree again and returns the new tree in the same MCP response. That collapses observe-act-observe into one round trip. Per-element messaging timeout is 5 seconds via AXUIElementSetMessagingTimeout at main.swift line 245.
Is this a developer framework or a consumer app?
Consumer app. Fazm ships as a signed, notarized Mac .app at fazm.ai/download. The five MCP servers boot as subprocesses on launch. A user sees a chat window; the router pattern happens inside the bridge. By contrast, Anthropic's reference implementation is Docker containers and Python samples, OpenAdapt is a Python SDK, OS-Atlas is research code. If the question is 'best computer use agent I can install right now,' the answer is the consumer packaging that hides the multi-engine plumbing.
What does the chat prompt tell the model about which engine to use?
ChatPrompts.swift line 59 says 'Desktop apps: macos-use tools (mcp__macos-use__*) for Finder, Settings, Mail, etc.' and line 56 routes browser work to playwright. That routing guidance is injected into the system prompt. In practice Claude Sonnet 4.6 picks correctly more than 95% of the time because the tool names carry semantic signal (the whatsapp_* tools are obviously for WhatsApp).
What is the smallest command I can run to verify the five-engine roster?
Three commands. 1) grep -n 'BUILTIN_MCP_NAMES' /Users/<you>/fazm/acp-bridge/src/index.ts prints the Set at line 1266. 2) ls /Applications/Fazm.app/Contents/MacOS/ lists the three native binaries (Fazm, mcp-server-macos-use, whatsapp-mcp). 3) ls /Applications/Fazm.app/Contents/Resources/google_workspace_mcp/.venv/bin/ shows the bundled Python interpreter. Every one of those is grep-verifiable on any Fazm install. The canonical build steps live in build.sh lines 131-151.
Can I add my own MCP server to Fazm?
Yes. acp-bridge/src/index.ts line 1102 reads ~/.fazm/mcp-servers.json and appends user-defined servers to the builtin list in the same buildMcpServers() call. The format mirrors Claude Code's mcpServers dictionary ({ command, args, env, enabled }). User servers get a prefix so they do not collide with builtin tool names. That means the five-engine baseline is the floor, not the ceiling.
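A minimal sketch of reading such a file, under stated assumptions: the top level is taken to be a dictionary keyed by server name, disabled entries are skipped, and names that collide with a builtin are dropped. The collision rule and the `loadUserServers` name are illustrative guesses. Only the `{ command, args, env, enabled }` fields come from this page.

```typescript
interface UserServerEntry {
  command: string;
  args?: string[];
  env?: Record<string, string>;
  enabled?: boolean;
}

interface NamedServer {
  name: string;
  command: string;
  args: string[];
  env: Record<string, string>;
}

// Parses the JSON text of a user server file and returns the usable entries.
function loadUserServers(jsonText: string, builtins: ReadonlySet<string>): NamedServer[] {
  const parsed = JSON.parse(jsonText) as Record<string, UserServerEntry | undefined>;
  const servers: NamedServer[] = [];
  for (const name of Object.keys(parsed)) {
    const entry = parsed[name];
    if (entry === undefined || typeof entry.command !== "string") continue;
    if (entry.enabled === false) continue; // explicitly switched off
    if (builtins.has(name)) continue;      // assumption: builtins win on a name clash
    servers.push({ name, command: entry.command, args: entry.args ?? [], env: entry.env ?? {} });
  }
  return servers;
}
```

Appending the returned list after the five builtins gives the "floor, not ceiling" behavior: user servers extend the roster without being able to shadow it.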
Does Fazm fall back to screenshots when the AX tree is insufficient?
Yes, but reluctantly. fazm_tools defines capture_screenshot with two modes (screen, window) in acp-bridge/src/fazm-tools-stdio.ts lines 296-314. The chat prompt tells the model to only use screenshots 'when you need visual confirmation, it costs extra tokens.' A MAX_IMAGE_TURNS = 20 per-session cap (index.ts line 793) enforces that ceiling. In practice most tasks on Mac-native apps resolve through the macos-use AX path without a single screenshot.
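One way a per-session image cap could be enforced is a simple counter. The constant name and value are the ones quoted above; the `ImageBudget` class and its methods are hypothetical, and the real enforcement in index.ts may look different.

```typescript
const MAX_IMAGE_TURNS = 20;

// Tracks how many image-bearing turns a session has used.
class ImageBudget {
  private used = 0;

  // Returns true and spends one turn if budget remains, false otherwise.
  tryConsume(): boolean {
    if (this.used >= MAX_IMAGE_TURNS) return false;
    this.used += 1;
    return true;
  }

  get remaining(): number {
    return MAX_IMAGE_TURNS - this.used;
  }
}
```

A screenshot request would call `tryConsume()` first and fall back to a text-only observation once it returns false.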
Where are the five engines defined in code, by file and line?
All inside /Users/<you>/fazm/acp-bridge/src/index.ts. fazm_tools at lines 1015-1020. playwright at lines 1027-1054. macos-use at lines 1056-1063 (registration) plus /Users/<you>/mcp-server-macos-use/Sources/MCPServer/main.swift for the tool implementations. whatsapp at lines 1066-1074. google-workspace at lines 1076-1100. The BUILTIN_MCP_NAMES set at line 1266 is the authoritative roster. User servers append at lines 1102-1128.
Related guides
Claude Computer Use Agent on a real Mac
How Fazm swaps Anthropic's single 'computer' tool for six MCP tools, five of which end in _and_traverse, collapsing observe-act-observe into one round trip.
Accessibility tree vs screenshots
The filter at acp-bridge/src/index.ts lines 2271-2307 has zero image branches. A 500 KB screenshot becomes zero bytes of context.
Accessibility tree desktop automation
Deeper coverage of how AXUIElement walks power Mac-native agent actions and why the tree outperforms vision for desktop UIs.
Every claim on this page is a grep away. Clone the repo, open acp-bridge/src/index.ts, search for BUILTIN_MCP_NAMES.