A Claude computer-using agent on your Mac is five processes, not one API
Most of the writing on this topic describes Anthropic's research beta: Claude looks at a screenshot and returns click coordinates. That works inside a cloud sandbox and impresses in a demo. It is not what lives on your Mac when you say “open Finder and find last month’s invoices.” That sentence rides on a Node subprocess, five bundled MCP servers, the macOS accessibility tree, and an OAuth flow that quietly shares a refresh token with the Claude Code CLI you already installed. Here is the whole stack, named and wired up, as it ships in Fazm.
Two very different things share this name
The phrase covers two products that look alike on a slide and behave nothing alike in practice. The first is Anthropic's computer use beta, an API endpoint where you post a screenshot and a task to Claude and get back mouse and keyboard events. It runs in a cloud sandbox, it is a developer tool, and it was built to prototype what the model is capable of. It is not a consumer product, it does not touch your Mac, and it cannot see any app you have open unless you manually screenshot and upload.
The second is a local app where Claude actually drives your machine. That is what people want when they type this search. The Mac has to give the agent permission to read the accessibility tree, control Chrome, and capture the screen. The app has to spawn a local model runtime, wire up tools, handle auth, and recover gracefully when something hangs. None of that is in Anthropic's beta. It is all in the shell that wraps it.
Fazm is the second thing. The next few sections are a guided tour of what is actually inside that shell, read from the source, so you can tell whether it is the right approach if you build your own, or tell what to look at when you pick one off the shelf.
One chat turn, five local processes
The official Anthropic package most articles skip
The engine that runs Claude inside Fazm is not the raw Anthropic API. It is @agentclientprotocol/claude-agent-acp, Anthropic's Agent Client Protocol host for Claude Code. Fazm pins it to version 0.29.2 in acp-bridge/package.json. That matters because ACP is the same protocol Zed uses for its agent panel and the same protocol Neovim plugins use to host Claude. When you pick ACP you inherit a maintained tool loop, a maintained MCP client, a maintained streaming format, and a maintained handshake. You stop reinventing the agent loop and start spending your time on what actually makes a product different: what tools the agent has, how they behave, and how the whole thing feels in the hand.
ACP is how Claude Code gets launched as a subprocess; MCP is how tools plug in. Two protocols, one stack. ACP on the outside, speaking to the host app. MCP on the inside, speaking to tools.
The five servers, named
The honest answer to “what can a Claude computer-using agent do” is “whatever tools you give it.” Fazm gives it five local stdio subprocesses, listed below. Anything you add to ~/.fazm/mcp-servers.json gets merged in automatically at bridge startup. There is no prompt rewrite, no tool registry, no schema file; MCP does the discovery.
fazm_tools
The Swift side. Exposes execute_sql against the local SQLite database (memory, chat history, observer cards), capture_screenshot, request_permission, and other Swift-backed tools. Lives inside the Fazm.app Contents/MacOS/ bundle.
playwright
The @playwright/mcp package. Attaches to Chrome via the Fazm Chrome extension so the agent sees your logged-in tabs instead of a fresh browser.
macos-use
Native Swift MCP server. Drives any app on the Mac by walking its accessibility tree. Tools end in _and_traverse so every action comes back with a structured diff of what changed.
whatsapp
Dedicated binary for the WhatsApp Catalyst app. Send, search, open a chat, verify the current chat, read messages, all via accessibility APIs on the native app (not the web client).
google-workspace
Python MCP server bundled inside Contents/Resources/google-workspace-mcp/. Gmail search and send, Calendar events, Drive, Docs, Sheets. Credentials live under ~/.google_workspace_mcp/.
User-defined
Anything you add to ~/.fazm/mcp-servers.json gets merged in at bridge startup. Same shape as Claude Desktop's mcpServers block.
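The merge is simple in spirit: bundled servers plus whatever the user's config file declares, with user entries winning on name collisions. A minimal TypeScript sketch of that startup step (function names, paths, and the builtin entries here are illustrative, not Fazm's actual code):

```typescript
import { readFileSync, existsSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// Shape shared with Claude Desktop's mcpServers block.
interface McpServerConfig {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

// Bundled servers, always present. Names are from this article; paths are illustrative.
const BUILTIN: Record<string, McpServerConfig> = {
  fazm_tools: { command: "/Applications/Fazm.app/Contents/MacOS/fazm-tools" },
  playwright: { command: "npx", args: ["@playwright/mcp"] },
  "macos-use": { command: "/Applications/Fazm.app/Contents/MacOS/macos-use" },
};

// Read ~/.fazm/mcp-servers.json if present and merge it over the builtins.
function loadServers(
  configPath = join(homedir(), ".fazm", "mcp-servers.json"),
): Record<string, McpServerConfig> {
  let user: Record<string, McpServerConfig> = {};
  if (existsSync(configPath)) {
    user = JSON.parse(readFileSync(configPath, "utf8")).mcpServers ?? {};
  }
  return { ...BUILTIN, ...user }; // user entries override builtins by name
}
```

Because discovery happens over MCP at handshake time, this merge is the only registration step; no prompt or schema file has to change.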
Anchor fact: your Claude Code login already works
This is the detail no other page on this topic mentions, and it tells you more about the architecture than any diagram. Fazm's OAuth flow against Anthropic uses exactly the same CLIENT_ID and the same macOS keychain service name that Anthropic's Claude Code CLI uses. When you click Connect Claude Account in Settings, the PKCE flow against claude.ai/oauth/authorize writes a refresh token to the Keychain service Claude Code-credentials. If you already ran claude login on this Mac, the token is already there, so Fazm picks it up without asking you to sign in again. This is intentional. Fazm treats Claude Code as an upstream agent engine, not a competitor, and shares the credential store so your Claude Pro or Max subscription works in both.
What that credential share buys you
The practical consequence: you do not pay twice. If you already have a Claude Max plan for development use with Claude Code, the same plan powers Fazm's floating bar, Fazm's observer, Fazm's skills, and every tool call the agent makes against your Mac. The only billing relationship is still directly between you and Anthropic. Fazm does not sit in the token path, does not proxy, and cannot see your Claude requests.
Users who do not have a paid Claude subscription get a bundled API key that Fazm covers up to the plan's monthly usage ceiling. Two modes, one switch in Settings.
Why accessibility beats screenshots on a Mac
Anthropic's computer use beta is the screenshot path: every decision the model makes is grounded in pixels. That is the right call for a cloud-hosted VM where the agent has no other way to know what is on the screen. On a real Mac, there is a better substrate: the accessibility tree the OS already builds and keeps updated for screen readers. Fazm's macos-use server reads that tree directly via AXUIElement APIs, gets back a labeled set of roles, values, and frames, and hands the agent the single set of facts it needs for the next step. No OCR, no vision-model round-trip, no wondering whether a button is enabled.
| Feature | Anthropic computer use (screenshots) | Fazm (accessibility tree) |
|---|---|---|
| Primary observation surface | Full-screen bitmap | Accessibility tree via AXUIElement |
| What the agent sees after an action | New screenshot; agent re-OCRs | Structured added / removed / modified diff |
| Works on native Mac apps (Finder, Calendar, Mail) | Cloud sandbox only | Yes, directly |
| Latency per action | Seconds (image round-trip) | Hundreds of milliseconds |
| Survives dark mode, DPI, theme swaps | OCR accuracy can shift | Yes, semantic not visual |
| Fallback when surface is opaque | Already the fallback | capture_screenshot (Canvas, games, Electron) |
| Uses your existing Chrome session | No (fresh browser in a sandbox) | Yes (Playwright attaches to live Chrome) |
| Ships as a consumer app | API endpoint, developer only | Yes, signed .app |
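The "structured diff" row is the heart of the accessibility advantage: every _and_traverse tool snapshots the tree before the action, acts, snapshots again, and reports what changed. A simplified TypeScript model of that contract (the real server is a Swift binary; the field names and the `id` scheme here are illustrative):

```typescript
// One element from an accessibility-tree snapshot.
interface AXElement {
  id: string;        // stable identity within a window (illustrative scheme)
  role: string;      // e.g. "AXButton"
  title?: string;
  value?: string;
  enabled: boolean;
}

interface TraverseDiff {
  added: AXElement[];
  removed: AXElement[];
  modified: AXElement[];
}

// Compare the snapshots taken before and after an action and report the change.
function diffSnapshots(before: AXElement[], after: AXElement[]): TraverseDiff {
  const beforeById = new Map(before.map(e => [e.id, e]));
  const afterById = new Map(after.map(e => [e.id, e]));
  return {
    added: after.filter(e => !beforeById.has(e.id)),
    removed: before.filter(e => !afterById.has(e.id)),
    modified: after.filter(e => {
      const prev = beforeById.get(e.id);
      return prev !== undefined && JSON.stringify(prev) !== JSON.stringify(e);
    }),
  };
}
```

The agent's next decision is grounded in that diff rather than in a fresh screenshot, which is where the latency and robustness wins in the table come from.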
A real chat turn, from prompt to action
Below is a condensed tool transcript for a single user prompt. Every line after the command comes from the Claude agent's tool stream. You can see the agent picking which MCP server to use for each sub-step: google-workspace for Gmail, macos-use for the Calendar app, with memory reads in between.
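An illustrative, condensed version of such a transcript (the tool names come from the bundled servers described in this article; the prompt, arguments, and outputs are hypothetical):

```
user> add the dentist appointment from Marwan's last email to Calendar

[read]  MEMORY.md                               (who Marwan is)
[tool]  mcp__google-workspace__search_messages  from:marwan dentist
[tool]  mcp__google-workspace__get_thread       → "Tuesday 3pm confirmed"
[tool]  mcp__macos-use__click_and_traverse      Calendar: AXButton "New Event"
[tool]  mcp__macos-use__type_and_traverse       title and time fields
[tool]  mcp__macos-use__click_and_traverse      AXButton "Add"

assistant> Added the dentist appointment to Calendar for Tuesday at 3pm.
```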
What happens behind one click
The agent calling click_and_traverse triggers five crossings: the floating bar ships the prompt into the ACP bridge over a Unix socket; the bridge spawns a Claude Code child and speaks ACP JSON-RPC over stdio; Claude picks the macos-use MCP server and speaks MCP JSON-RPC over its stdio; macos-use calls into the macOS accessibility API; the target app receives a synthesized event and redraws. Every arrow is local. Nothing in this path touches the network except the outbound model call.
One tool call, five local stops
Floating bar
Swift UI captures the user prompt and pushes it to the ACP bridge.
ACP bridge (Node)
Spawns Claude Code via @agentclientprotocol/claude-agent-acp, routes JSON-RPC between Swift and the model subprocess.
Claude Code
Picks a tool from the MCP registry. Here, macos-use click_and_traverse.
macos-use MCP server
Walks the accessibility tree before the action, clicks, walks again, diffs.
macOS AX API
AXUIElement calls into the target app. The app reacts; the diff reports what changed.
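Every hop in that chain is JSON-RPC over a pipe, typically framed as one JSON object per line on stdio. A minimal sketch of that framing, the kind of plumbing a bridge like this routes all day (this shows the transport shape, not Fazm's actual bridge code):

```typescript
interface JsonRpcMessage {
  jsonrpc: "2.0";
  id?: number;
  method?: string;
  params?: unknown;
  result?: unknown;
}

// Frame one message for an stdio transport: one JSON object per line.
function frame(msg: JsonRpcMessage): string {
  return JSON.stringify(msg) + "\n";
}

// Split a raw stdout chunk into parsed messages, tolerating a partial last line
// (stdio delivers arbitrary chunk boundaries, so the remainder is carried over).
function parseChunk(chunk: string): { messages: JsonRpcMessage[]; rest: string } {
  const lines = chunk.split("\n");
  const rest = lines.pop() ?? ""; // last piece may be an incomplete line
  const messages = lines.filter(l => l.trim()).map(l => JSON.parse(l));
  return { messages, rest };
}
```

The bridge's real job is exactly this in both directions: frame what Swift sends toward the Claude Code child, and reassemble what streams back.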
What it takes to ship one of these
If you are planning to build your own, here are the milestones in order, with the ones that always take longer than you expect called out.
Pick an agent host protocol
ACP is the host protocol for the agent loop; MCP is the tool protocol. You want ACP for the outer loop and MCP for tools. Writing either yourself is months of rework every time the spec moves.
Spawn Claude Code as a subprocess
Not the raw API. Claude Code handles tool use, conversation history, streaming, and retries. In Fazm this is @agentclientprotocol/claude-agent-acp 0.29.2, wired via acp-bridge/src/index.ts.
Pick your first MCP server
Do the one that makes the product feel possible. For a Mac agent that is macos-use: accessibility-tree reads plus click/type/scroll/press against the frontmost app.
Add the others, one target surface at a time
Chrome via Playwright, WhatsApp via its dedicated MCP, Gmail and Calendar via google-workspace. Each server is a separate subprocess with its own timeout budget.
Solve the accessibility permission UX
macOS caches the grant and the cache can go stale after app renames or crashes. Probe the permission with a real AXUIElement call and surface a reset prompt when it fails. Desktop/Sources/AppState.swift handles this in Fazm.
OAuth, and reuse what is already there
Offer a path for users who have a Claude Pro/Max subscription. If you use the same CLIENT_ID and keychain service as Claude Code CLI, existing Claude Code users sign in with zero friction.
Bundle skills for vertical expertise
Agents without skills are generalists. Fazm bundles 17 skills in Desktop/Sources/BundledSkills (pdf, xlsx, docx, pptx, video-edit, social-autoposter, deep-research, travel-planner). Claude loads them on demand.
Ship it as a signed .app
Everything above, inside one notarized bundle. No npm install, no pip install, no terminal. The user drags Fazm to Applications and grants three permissions during onboarding.
Six places a consumer stack is different from Anthropic's demo
What the product has to solve that the demo does not
- Permission UX: Three TCC prompts, in the right order, with recovery when macOS caches a stale grant. The demo assumes root in a sandbox.
- Credential sharing: Reusing the user's Claude Code login or Claude Pro subscription without re-auth, so they do not pay twice.
- Target-specific MCP servers: A generic click-anywhere agent is unreliable. Named tools per target (macos-use, playwright, whatsapp, google-workspace) are the difference between a demo and a product.
- Observability of tool failure: A two-minute timeout, a synthetic error, and the agent recovers. Without that, one hung Playwright session freezes the UI.
- Local memory that survives across sessions: MEMORY.md files, a SQLite observer that watches conversations, a browser-profile importer. The agent knows who you are on day two.
- A voice and a UI that fit on a Mac: a floating bar, push-to-talk, remote control from your phone. The demo is a chat window; the product lives in the corner of your screen.
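The tool-failure bullet above is, mechanically, one Promise.race away. A hedged sketch of the pattern (the two-minute figure is quoted from this article; the function and result shape are illustrative):

```typescript
// Wrap one MCP tool call in a wall-clock budget. On expiry, resolve with a
// synthetic error result the agent can see and recover from, instead of
// letting a hung subprocess hold the UI hostage.
async function callWithBudget<T>(
  call: () => Promise<T>,
  budgetMs = 120_000, // 2-minute default per MCP tool call
): Promise<T | { isError: true; message: string }> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<{ isError: true; message: string }>(resolve => {
    timer = setTimeout(
      () => resolve({ isError: true, message: `tool call exceeded ${budgetMs}ms` }),
      budgetMs,
    );
  });
  try {
    return await Promise.race([call(), timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // never leak the timer
  }
}
```

Note that the budget resolves rather than rejects: the agent receives an ordinary (if unhappy) tool result and can choose a different tool on the next turn.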
Everything Claude can reach inside Fazm
Not every tool. The ones you will notice within the first hour.
No extra install, no config file
MCP servers bundled in the signed .app
fazm_tools
execute_sql against local SQLite, capture_screenshot, permissions probe, Swift-side callbacks.
playwright
@playwright/mcp attached to your running Chrome with the Fazm extension. Sees your logged-in tabs.
macos-use
Native Swift binary. AXUIElement traversal and _and_traverse tools against any Mac app.
whatsapp
Drives the WhatsApp Catalyst app. Search, open chat, send, read, get active chat.
google-workspace
Python stdio server. Gmail, Calendar, Drive, Docs, Sheets. Credentials in ~/.google_workspace_mcp.
+ user-defined
~/.fazm/mcp-servers.json. Same shape as Claude Desktop. Merged on bridge startup.
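For reference, the shape of that file (the server name, package, and environment variable here are purely illustrative):

```json
{
  "mcpServers": {
    "my-notes": {
      "command": "npx",
      "args": ["-y", "@example/notes-mcp"],
      "env": { "NOTES_DIR": "/Users/me/notes" }
    }
  }
}
```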
Getting started on your own Mac
Four steps, in order. Onboarding walks you through each.
1. Install
Drag Fazm to Applications. No terminal. No package manager.
2. Grant permissions
Accessibility, screen recording, microphone. The app probes each and retries if the grant is stuck.
3. Sign in
Use the bundled API key or connect your Claude Pro/Max. If Claude Code CLI is already logged in, the token is reused.
4. Talk
Hold Left Control to speak, or type. Ask it to do something on your Mac and watch the tool stream.
“Not one API. Claude becomes a computer-using agent only when the host wires it to named, target-specific tools: Finder via macos-use, Chrome via Playwright, WhatsApp via its own binary, Gmail via google-workspace, SQL via fazm_tools.”
acp-bridge/src/index.ts, the BUILTIN_MCP_NAMES set
The short version
A Claude computer-using agent on a Mac is Claude Code running locally via Anthropic's Agent Client Protocol, plugged into five MCP servers that each own a specific target surface, using a credential the user probably already has.
Fazm ships this assembled. The hardest part was never the model; it was the permission UX, the OAuth share, and picking which MCP server owns which corner of the Mac.
Want to see the five-process stack running on your own Mac?
Fifteen minutes. We open Fazm on your machine, ask Claude to do something across two apps, and you watch the MCP tool stream in the floating bar.
Frequently asked questions
What exactly is a Claude computer-using agent?
Two very different things wear this label. Anthropic's research beta is an API endpoint where you send Claude a screenshot and it returns a sequence of pixel-level actions (move the mouse to (412, 318), click, type 'hello'). That runs on Anthropic's servers against a sandbox. The other thing, the one regular users actually want, is a local app on their own Mac where they can ask Claude to open Finder, fill in a Calendar event, or find a message in WhatsApp, and it happens. Fazm is the second kind. It runs a Claude agent as a local Node subprocess using Anthropic's official @agentclientprotocol/claude-agent-acp package, and wires that agent to five MCP servers that let it actually touch the machine: fazm_tools (SQL and the app database), playwright (Chrome), macos-use (native Mac apps via the accessibility tree), whatsapp (the WhatsApp Catalyst app), and google-workspace (Gmail, Calendar, Drive).
Why does it matter that Fazm uses accessibility APIs instead of screenshots?
Anthropic's computer use beta is screenshot-driven: the model sees pixels, reasons about where the Submit button is, and outputs a click coordinate. That works in a browser sandbox, but on a real Mac it is slow, brittle to dark mode and DPI changes, and blind to state the OS already knows (is this button enabled, is this tab focused, what is the value of this text field). Fazm's macos-use MCP server reads the accessibility tree instead. An accessibility tree walk returns the role, title, value, enabled state, and frame of every element in the frontmost window in a few hundred milliseconds, directly from the OS, with zero image I/O. The agent gets 'AXButton titled Send, enabled, at (1186, 56)' instead of a PNG it has to interpret. It also works with any app on your Mac, not only things inside a browser.
Does it use the same Claude subscription I already pay for?
Yes, if you want it to. Fazm runs two auth modes. The default mode uses a bundled Anthropic API key (the app picks up the usage cost). The opt-in mode is OAuth with your personal Claude Pro or Claude Max subscription. Settings > Claude Account kicks off a PKCE OAuth flow against claude.ai/oauth/authorize with CLIENT_ID 9d1c250a-e61b-44d9-88ed-5944d1962f5e and scope user:inference, and stores the refresh token in the macOS Keychain under the service name Claude Code-credentials. That CLIENT_ID and that keychain service name are the exact same ones Anthropic's Claude Code CLI uses. If you have already run claude login on this machine, Fazm reuses that credential with zero re-auth. All the plumbing lives in acp-bridge/src/oauth-flow.ts.
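PKCE itself is small. A sketch of the verifier/challenge pair such a flow generates and the authorize URL it opens (the endpoint and scope are quoted from this answer; the function names and redirect URI are illustrative, and this is a sketch of the standard RFC 7636 derivation, not Fazm's oauth-flow.ts):

```typescript
import { randomBytes, createHash } from "node:crypto";

// RFC 7636: code_verifier is high-entropy random;
// code_challenge = BASE64URL(SHA-256(code_verifier)).
function makePkcePair(): { verifier: string; challenge: string } {
  const verifier = randomBytes(32).toString("base64url");
  const challenge = createHash("sha256").update(verifier).digest("base64url");
  return { verifier, challenge };
}

// Build the browser URL for the authorization step.
function authorizeUrl(clientId: string, redirectUri: string, challenge: string): string {
  const params = new URLSearchParams({
    client_id: clientId,
    response_type: "code",
    redirect_uri: redirectUri,
    scope: "user:inference",
    code_challenge: challenge,
    code_challenge_method: "S256",
  });
  return `https://claude.ai/oauth/authorize?${params}`;
}
```

The verifier stays local and is sent only on the token exchange, which is what lets a public desktop client do OAuth without a client secret.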
What can the agent actually do that Anthropic's demo cannot?
Five concrete things, and they all come from the bundled MCP servers rather than the model. One, operate native Mac apps: Finder, Calendar, Mail, Messages, Notes, System Settings, because macos-use exposes click_and_traverse, type_and_traverse, press_key_and_traverse, and scroll_and_traverse against any app's accessibility tree. Two, drive Chrome on real pages you are already logged into, because the bundled Playwright MCP attaches to your running Chrome. Three, send WhatsApp messages, because the separate whatsapp-mcp binary automates the WhatsApp Catalyst app directly. Four, read your Gmail, Calendar, Drive, Docs, Sheets, because the google-workspace MCP server ships inside the app bundle. Five, read and write the local SQLite database that stores your conversation history and memory, because fazm_tools exposes execute_sql. Anthropic's browser-sandboxed demo does exactly one of these.
Is this just Claude Code with a GUI wrapper?
The engine is Claude Code via ACP, yes. Fazm's acp-bridge spawns the @agentclientprotocol/claude-agent-acp subprocess (version 0.29.2 as of this writing), which is the same Agent Client Protocol that Zed, Neovim, and other ACP hosts speak. What Fazm adds on top is the five bundled MCP servers, seventeen bundled skills at Desktop/Sources/BundledSkills (browser-scraping, pdf, xlsx, docx, pptx, video-edit, social-autoposter, deep-research, travel-planner, and more), a floating bar UI tuned for everyday use, push-to-talk voice input, remote control from your phone via chat.fazm.ai, and an observer that watches conversations and saves memory. Claude Code gives you a chat that can read and edit files. Fazm gives you a chat that can read and edit files, drive your desktop, and know who you are.
Why five MCP servers and not one?
Because every target surface is different and mixing them breaks reliability. acp-bridge/src/index.ts lines 1035 to 1100 hold the canonical list: fazm_tools for the local database and Swift-side callbacks, playwright for Chrome pages, macos-use for native apps, whatsapp for the WhatsApp Catalyst app, and google-workspace for Gmail, Calendar, and Drive. Each MCP server is a separate stdio subprocess with its own timeout budget (120 seconds per MCP tool, 10 seconds for internal tools, 5 minutes default, set near line 77 of the same file). If a Playwright session hangs, macos-use still works. If the WhatsApp app crashes, Google Calendar still works. A single monolithic tool surface would fail at the weakest link.
Where is the tool schema the agent actually sees?
Claude never sees a pre-written schema. Each MCP server advertises its own tool catalog at handshake time, and Claude Code pulls those into its tool menu automatically. Fazm's Desktop/Sources/Chat/ChatPrompts.swift adds routing guidance (use macos-use for desktop apps, use playwright for Chrome, never type your reasoning into a document), but it never hard-codes tool names in the prompt. This is why adding a new MCP server to ~/.fazm/mcp-servers.json just works: the bridge reads it on startup (acp-bridge/src/index.ts line 1104), merges it into the server list, and the agent discovers the new tools automatically without prompt edits.
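The discovery step amounts to namespacing each server's advertised catalog into one flat menu; the mcp__&lt;server&gt;__&lt;tool&gt; naming matches the tool names quoted elsewhere in this article. A sketch of that merge (the types and function are illustrative, not the bridge's actual code):

```typescript
// One tool declaration as advertised by a server's tools/list response.
interface ToolDecl {
  name: string;
  description: string;
}

// Flatten per-server catalogs into one menu, prefixing each tool name with
// its server so the host can route a call back to the right subprocess.
function mergeCatalogs(catalogs: Record<string, ToolDecl[]>): ToolDecl[] {
  return Object.entries(catalogs).flatMap(([server, tools]) =>
    tools.map(t => ({ ...t, name: `mcp__${server}__${t.name}` })),
  );
}
```

Because the prefix encodes the server, a brand-new entry in ~/.fazm/mcp-servers.json shows up in the agent's menu with no prompt edits at all.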
What does one action actually look like end-to-end?
Say you ask Fazm 'draft a reply to the last email from Marwan.' The floating bar sends your prompt via the Swift ACP bridge to the Node subprocess. The Node subprocess passes it to Claude Code. Claude reads its memory files and decides the next step is to read Gmail. It calls mcp__google-workspace__search_messages. The Python Google Workspace MCP server runs the query, returns a thread ID. Claude calls get_thread. The reply comes back. Claude drafts a response and calls create_draft. The MCP server returns a draft ID. Claude hands control back to Fazm, which streams the text 'Drafted a reply, ready in Gmail' back to the floating bar in your peripheral vision. Total: one model round, three tool calls, four process boundaries, all local except the Gmail API itself.
What are the limits and what breaks?
Three real ones. One, accessibility apps only: games, Canvas-rendered Electron apps, and native OpenGL apps expose a single opaque window to the accessibility tree. For those, the agent falls back to capture_screenshot (Swift-side tool defined in ChatToolExecutor.swift line 896) and reasons from pixels. Two, permission prompts: Fazm needs accessibility, screen recording, microphone, and automation permissions from macOS. AppState.swift walks the user through the TCC prompts during onboarding. Three, long agent loops: MCP tools default to a two-minute wall clock per call, after which a synthetic error is emitted and the agent recovers. That is configurable via Settings > Advanced > Tool Timeout but exists because a hung subprocess should never hold the UI hostage.
Is this open source?
Yes. The Fazm desktop app source is at github.com/mediar-ai/fazm. The macos-use MCP server that does the accessibility work is at github.com/mediar-ai/mcp-server-macos-use, a single Swift binary. The ACP bridge that connects Swift to Claude Code is also in the repo at acp-bridge. Anthropic's Agent Client Protocol package (@agentclientprotocol/claude-agent-acp) is on npm. You can run any of the MCP servers against Claude Desktop, Cline, Zed's ACP client, or anything else that speaks MCP. Fazm is the consumer packaging: signed .app, five MCP servers bundled, onboarding that handles the permissions and OAuth for you.
Related reading
Computer use agent reliability is the verification loop, not the benchmark
The other half of the stack: what macos-use actually returns after every click. Struct names, noise filters, viewport flags, from the open-source Swift binary.
Claude AI for macOS
How a Mac-native Claude assistant differs from claude.ai in a browser tab. Permissions, local memory, voice.
AI agent framework, open source
What ACP, MCP, and the surrounding open-source stack look like when you piece an agent together yourself.