An open source computer use agent on Mac is three repos, not one
The pages that currently rank for this topic frame the answer as a list of products. The actual answer, once you open the source, is three composable MIT-licensed repositories that compile into one signed app. This is a tour of each seam with the file paths that make it real.
The three-repo supply chain
When the Fazm app launches it spawns five child processes that speak MCP over stdin and stdout. The interesting one for this topic is the one named "macos-use". Following its provenance walks you across three GitHub repos, each owned by mediar-ai, each MIT, each installable on its own. The diagram below is the actual call shape.
From AX call to a Mac that does the thing
The numbers that govern every read
A breadth-first walk of an accessibility tree could in principle run forever; macOS apps that embed Chromium views routinely have tens of thousands of nodes. AccessibilityTraversal.swift hard-caps the walk on three axes so the language model never sees an unbounded payload. A sketch of the bounded walk follows the list.
- maxElements = 2000: the hard cap on collected elements per traversal. The check hits at line 104.
- maxDepth = 100: the recursion depth limit. Stops the walk before pathological Electron trees blow the stack.
- maxTraversalSeconds = 5.0: the wall-clock budget. After 5.0 seconds the walk truncates and returns whatever it has.
- 14 non-interactable roles dropped by default. Layout scaffolding never reaches the model.
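For orientation, here is a minimal sketch of a walk bounded the same way, using the cap values quoted above. The type and function names are illustrative, not the SDK's source:

```swift
import Foundation
import ApplicationServices

// Illustrative bounded BFS over AXUIElement nodes; names are not the SDK's.
struct TraversalCaps {
    let maxDepth = 100                    // stop before deep Electron trees blow the stack
    let maxElements = 2000                // hard cap on collected elements
    let maxTraversalSeconds: Double = 5.0 // wall-clock budget
}

func boundedWalk(root: AXUIElement, caps: TraversalCaps = TraversalCaps()) -> [AXUIElement] {
    let deadline = Date().addingTimeInterval(caps.maxTraversalSeconds)
    var collected: [AXUIElement] = []
    var queue: [(node: AXUIElement, depth: Int)] = [(root, 0)]
    while !queue.isEmpty {
        // Any one of the three caps ends the walk; return whatever was collected.
        if collected.count >= caps.maxElements || Date() > deadline { break }
        let (node, depth) = queue.removeFirst()
        collected.append(node)
        guard depth < caps.maxDepth else { continue }
        var children: CFTypeRef?
        if AXUIElementCopyAttributeValue(node, kAXChildrenAttribute as CFString, &children) == .success,
           let kids = children as? [AXUIElement] {
            queue.append(contentsOf: kids.map { ($0, depth + 1) })
        }
    }
    return collected
}
```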
Seam one: the SDK
MacosUseSDK is a small Swift package with no dependencies beyond AppKit and ApplicationServices. The struct below is what one element becomes after traversal, and the caps listed above bound every walk. Anyone building an alternate Mac agent can add the package to a Package.swift and skip writing accessibility code from scratch.
Six fields per element. No color, no opacity, no pixel data. The text field concatenates five accessibility attributes in the order AXValue, AXTitle, AXDescription, AXLabel, AXHelp, so the language model sees the same string a screen reader would. Spatial order is y first then x, which gives you reading-order without a layout pass.
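Reconstructed from the fields this section describes (the real declaration is the ElementData struct in AccessibilityTraversal.swift), the shape is roughly:

```swift
// Sketch reconstructed from the described fields, not the SDK's source.
struct ElementData: Codable {
    let role: String    // e.g. "AXButton", "AXTextField"
    let text: String    // AXValue + AXTitle + AXDescription + AXLabel + AXHelp, concatenated
    let x: Double       // screen coordinates, not view-local
    let y: Double
    let width: Double
    let height: Double
}

// Spatial sort: y first, then x, so elements arrive in reading order.
func readingOrder(_ elements: [ElementData]) -> [ElementData] {
    elements.sorted { ($0.y, $0.x) < ($1.y, $1.x) }
}
```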
Seam two: the MCP server
mcp-server-macos-use is the wrapper that turns the SDK into a tool any MCP-aware client can call. It exposes five tool calls, four of them suffixed "_and_traverse" so every action returns the new accessibility tree along with the result of the action; the fifth reads the tree without acting. The diff between before and after lets the agent verify its own click did what it expected without taking a screenshot. A raw transport sketch follows the tool list.
- macos-use_open_application_and_traverse: activate or launch an app, then return its tree.
- macos-use_click_and_traverse: click at (x, y) within a target PID, return the new tree.
- macos-use_type_and_traverse: type a string, return the new tree.
- macos-use_press_key_and_traverse: press a named key with optional modifier flags.
- macos-use_refresh_traversal: read the tree without acting.
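To watch the transport with no client in the way, you can spawn the built binary and write a JSON-RPC tools/call line to its stdin, which is how the MCP stdio transport frames messages. The sketch below skips the initialize handshake a real client performs first, and the path, target PID, and argument shape for the refresh tool are assumptions:

```swift
import Foundation

// Sketch: drive the server by hand over stdio. The path assumes
// `swift build -c release`; the PID and argument shape are illustrative.
let server = Process()
server.executableURL = URL(fileURLWithPath: ".build/release/mcp-server-macos-use")
let toServer = Pipe()
let fromServer = Pipe()
server.standardInput = toServer
server.standardOutput = fromServer
try server.run()

let request: [String: Any] = [
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": [
        "name": "macos-use_refresh_traversal",
        "arguments": ["pid": 1234]   // hypothetical target PID
    ]
]
var line = try JSONSerialization.data(withJSONObject: request)
line.append(0x0A)   // newline-delimited JSON-RPC over stdin
toServer.fileHandleForWriting.write(line)

// Blocks until the server writes something back; fine for a manual poke.
let reply = fromServer.fileHandleForReading.availableData
print(String(data: reply, encoding: .utf8) ?? "")
```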
Seam three: the bundled binary
The Fazm app itself is the third repo. The most interesting fifteen lines of source for this topic are in acp-bridge, which is the TypeScript shim that turns the ACP protocol into MCP server spawns. Line 63 is where the open source piece lands inside the signed consumer app.
contentsDir resolves to the Contents directory of the running .app bundle, so the path becomes Fazm.app/Contents/MacOS/mcp-server-macos-use. The binary is signed and notarized as part of the parent app, which is why the consumer install path skips the Gatekeeper friction that loose helper binaries usually trigger. BUILTIN_MCP_NAMES on line 1266 is the allow-list that tells the rest of the bridge which servers are first-party.
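The shim itself is TypeScript, but the path arithmetic is easy to mirror in Swift for illustration; this is not Fazm's code:

```swift
import Foundation

// Swift mirror of the TypeScript resolution at acp-bridge line 63, for
// illustration only; resolved from whatever bundle this code runs inside.
let contentsDir = Bundle.main.bundleURL.appendingPathComponent("Contents")
let macosUseBinary = contentsDir
    .appendingPathComponent("MacOS")
    .appendingPathComponent("mcp-server-macos-use")
print(FileManager.default.isExecutableFile(atPath: macosUseBinary.path))
```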
Two architectures, side by side
What the language model actually receives when an open source Mac agent perceives a window.
| Feature | Screenshot agent | Fazm (AX tree) |
|---|---|---|
| Perception primitive | PNG bitmap of the window or screen | JSON ElementData list (role, text, x, y, w, h) |
| Typical payload size | 1 to 3 MB before base64 encoding | 10 to 40 KB for a normal app window |
| Click target format | Pixel coordinates from a vision model | Element role and text, by index |
| Drift on Retina vs non-Retina | Vision model has to reason about scale | None, AX coordinates are screen-stable |
| Cost per perception step | Image tokens on every read | Text tokens only |
| Works without screen recording permission | No, screen recording is mandatory | Yes, AX is the only required TCC service |
| Best for PDFs and figures | This is the native fit | Falls back to screenshots on demand |
What you can fork at each seam
The point of three repos is that you can opt in at the layer that matches your project. A Swift app that needs accessibility input controls but no LLM only takes the SDK. A custom MCP-aware client takes the server and skips the consumer chrome. A non-developer takes the signed app and never thinks about Swift.
Fork MacosUseSDK if you write Swift
github.com/mediar-ai/MacosUseSDK. Add it to a Package.swift, request Accessibility, call traverseAccessibilityTree(pid:onlyVisibleElements:). No LLM, no MCP, no analytics. macOS 13.0 minimum. A manifest sketch follows the list below.
- Public API: traverseAccessibilityTree, click, type, key
- Pure Swift, AppKit and ApplicationServices only
- Returns Codable ResponseData
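A minimal manifest sketch, assuming the product name matches the repo and pinning main for brevity (prefer a release tag in practice):

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyAgent",                        // hypothetical package name
    platforms: [.macOS(.v13)],              // the SDK pins macOS 13.0 minimum
    dependencies: [
        // Pinning main is for brevity here; prefer a release tag.
        .package(url: "https://github.com/mediar-ai/MacosUseSDK", branch: "main")
    ],
    targets: [
        .executableTarget(name: "MyAgent", dependencies: ["MacosUseSDK"])
    ]
)
```

From there, `let result = try traverseAccessibilityTree(pid: pid, onlyVisibleElements: true)` is the whole perception call; check the current source for whether the function is marked async before copying the line.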
Fork mcp-server-macos-use if you have an MCP-aware client
github.com/mediar-ai/mcp-server-macos-use. Build with swift build, point Claude Desktop, Cursor, or any other MCP client at the binary, and you have Mac control inside that client. Five tools, stdio transport.
- Stdio MCP server, no network surface
- Wraps SDK; depends on it via Package.resolved
- Diff output between before and after traversal (sketched below)
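The before/after diff in the last bullet is easy to picture with the six-field ElementData shape sketched earlier. The snippet below illustrates the idea; it is not the server's implementation:

```swift
// Reuses the six-field ElementData shape sketched earlier in this article.
struct ElementData: Hashable {
    let role: String, text: String
    let x: Double, y: Double, width: Double, height: Double
}

struct TraversalDiff { let added: [ElementData]; let removed: [ElementData] }

// Illustrative diff between two traversals; not the server's code.
func diff(before: [ElementData], after: [ElementData]) -> TraversalDiff {
    let beforeSet = Set(before)
    let afterSet = Set(after)
    return TraversalDiff(added: after.filter { !beforeSet.contains($0) },
                         removed: before.filter { !afterSet.contains($0) })
}
```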
Take the Fazm app if you do not want to compile
github.com/mediar-ai/Fazm. Download the signed DMG, drag to Applications, grant Accessibility, and the same binary is already wired into the chat surface. The consumer-friendly path is the entire reason the .app exists.
- Signed and notarized DMG
- Bundles five MCP servers, no manual config
- SwiftUI chat surface, push-to-talk, Sparkle updates
Watching the binary register on a fresh launch
The acp-bridge logs every MCP server it spawns at startup, including which are bundled and which were resolved from disk. When Fazm.app comes up, each spawn is recorded in the dev log at /private/tmp/fazm-dev.log; the macos-use entry confirming the binary is "bundled" is the one to watch.
What the consumer app contributes that the SDK alone does not
A lot of open source agent stacks stop at "here is the binary, good luck." That is fine for developers, but it is the reason most non-developer Mac users never run an agent at all. The Fazm app shell is the part that earns the "consumer-friendly" label on this topic, and most of it is invisible until you skip it.
Bundled in the .app, not asked of the user
- Signed and notarized binary, no Gatekeeper override
- Five MCP servers wired up at first launch
- Multi-stage Accessibility permission probe (Tahoe-safe)
- Bundled Node, Python, ffmpeg, cloudflared
- Sparkle auto-update, no terminal commands
- Push-to-talk voice on hold-Left-Control
- Phone control surface at chat.fazm.ai
The other open source Mac agents in the same neighborhood
A few other projects exist in this space and are worth knowing about. They take different architectural bets, which is the part of the comparison that usually gets flattened.
Fazm is the only one in this list that ships the SDK, the MCP server, and the consumer .app as three separately forkable repos. UI-TARS and OpenCUA are foundation models and need a host. Open Interpreter is terminal-first and reaches the GUI through a Python harness. The browser-use macOS-use project is a Python wrapper over the macOS AX APIs without an end-user app on top.
Want to see the AX path beat a screenshot agent live?
20 minutes, your Mac, the same three repos you just read about. We will run a few tasks side by side and answer the implementation questions a doc cannot.
Frequently asked questions
What does 'open source computer use agent on a Mac' actually mean once you read the source?
On most Mac stacks it means a process that perceives the screen, reasons about it with a language model, and emits clicks or keystrokes. Fazm splits that into three open repos. github.com/mediar-ai/MacosUseSDK is the Swift package that walks the macOS accessibility tree and produces structured ElementData. github.com/mediar-ai/mcp-server-macos-use wraps that SDK as a stdio MCP server with five tools. github.com/mediar-ai/Fazm is the consumer macOS app that embeds the compiled binary inside Contents/MacOS/ and routes Claude through it. Each of the three is MIT-licensed and forkable on its own.
Where exactly is the macos-use binary inside the signed app?
acp-bridge/src/index.ts at line 63 resolves it with `const macosUseBinary = join(contentsDir, "MacOS", "mcp-server-macos-use");`. contentsDir is the Contents directory of the running .app bundle, so the file lives at Fazm.app/Contents/MacOS/mcp-server-macos-use. At startup the same file registers it as an MCP server named 'macos-use' (index.ts lines 1057 to 1064). It is signed and notarized as part of the parent app, which is why the consumer install path skips the Gatekeeper friction that loose helper binaries usually trigger.
What caps does the accessibility traversal actually run under?
AccessibilityTraversal.swift in MacosUseSDK/Sources/MacosUseSDK at lines 103 to 105 sets `maxDepth = 100`, `maxElements = 2000`, and `maxTraversalSeconds: Double = 5.0`. The breadth-first walk stops when any of the three is hit. A 14-role prune list at lines 109 to 114 (AXGroup, AXStaticText, AXUnknown, AXSeparator, AXHeading, AXLayoutArea, AXHelpTag, AXGrowArea, AXOutline, AXScrollArea, AXSplitGroup, AXSplitter, AXToolbar, AXDisclosureTriangle) drops non-interactable containers so the language model sees buttons, fields, and menu items, not layout scaffolding. The output is a JSON ResponseData struct with element list, statistics, and processing_time_seconds.
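Mirrored as Swift for illustration (the real list sits at the lines cited above), the prune is a set-membership check over the ElementData shape sketched earlier:

```swift
// The 14 roles the answer lists, mirrored as a Swift set for illustration.
let prunedRoles: Set<String> = [
    "AXGroup", "AXStaticText", "AXUnknown", "AXSeparator", "AXHeading",
    "AXLayoutArea", "AXHelpTag", "AXGrowArea", "AXOutline", "AXScrollArea",
    "AXSplitGroup", "AXSplitter", "AXToolbar", "AXDisclosureTriangle"
]

// Applied after traversal collects raw nodes; ElementData as sketched earlier.
func interactableOnly(_ elements: [ElementData]) -> [ElementData] {
    elements.filter { !prunedRoles.contains($0.role) }
}
```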
Why three repos instead of one monolithic agent?
Three different fork audiences. MacosUseSDK is for anyone who wants the raw accessibility primitives (traverse, click, type, key) in a Swift project, with no MCP, no agent, no LLM dependency. mcp-server-macos-use is for anyone who already has an MCP-aware client (Claude Desktop, Cursor, Cline, custom) and wants to drop in Mac control without writing Swift. The Fazm app is for end users who do not want to install a developer toolchain at all. Splitting the seams means the SDK can ship without the consumer chrome, and the consumer app can ship without forcing every user to compile.
How does this differ from a screenshot-based open source agent?
A screenshot agent renders the full window to a PNG, hands it to a vision model, and waits for pixel coordinates back. Fazm calls AXUIElementCreateApplication, walks the tree, and hands a list of typed elements with role, text, and frame back to the language model. A typical Finder window with 40 files traverses to roughly 60 to 120 ElementData entries, which serializes to about 10 to 20 kilobytes of JSON. The screenshot for the same window is 1 to 3 megabytes of PNG, which has to be base64-encoded into the prompt and decoded by the vision model before any reasoning happens. Both approaches work; the AX path is faster and cheaper for native UI, the screenshot path wins on PDFs and visual-only content, and Fazm keeps both available.
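The back-of-envelope version of that comparison, where every constant is a rough assumption anchored to the ranges above:

```swift
// Rough arithmetic only; constants are midpoints of the ranges in the text.
let elementCount = 100                          // a typical window: 60 to 120 entries
let bytesPerElement = 150                       // six short JSON fields per element
let axPayload = elementCount * bytesPerElement  // ~15 KB of JSON
let pngBytes = 2_000_000                        // midpoint of 1 to 3 MB
let base64Png = pngBytes * 4 / 3                // base64 inflates by 4/3
print(base64Png / axPayload)                    // roughly 175x more bytes per read
```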
What is bundled inside the Fazm app at install time?
Five MCP servers. acp-bridge/src/index.ts line 1266 declares the BUILTIN_MCP_NAMES set: fazm_tools, playwright, macos-use, whatsapp, google-workspace. Each runs as a child process over stdio. The macos-use binary is the open-source Swift Mach-O described above. The others ship from their own bundled paths under Contents. None require manual MCP configuration on first launch, which is the practical cost most open source agent stacks force the user to pay before the first prompt works.
What does the language model actually receive when Fazm reads a window?
A JSON object whose `elements` array contains six fields per UI item: role (string like AXButton or AXTextField), text (concatenated kAXValue, kAXTitle, kAXDescription, AXLabel, AXHelp), x, y, width, height (doubles in screen coordinates). The struct is ElementData in MacosUseSDK/Sources/MacosUseSDK/AccessibilityTraversal.swift lines 32 to 57. The wrapping ResponseData adds app_name, a Statistics struct, and processing_time_seconds. No color, no font, no pixel data. Spatial sort is y first then x, so elements arrive in reading order.
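A decode sketch of that shape, with the Statistics struct omitted and any key spelling not named in the answer treated as an assumption:

```swift
import Foundation

// Shapes follow the answer above; Statistics is omitted from this sketch.
struct ElementData: Codable {
    let role: String
    let text: String
    let x, y, width, height: Double
}

struct ResponseData: Codable {
    let app_name: String
    let elements: [ElementData]
    let processing_time_seconds: Double
}

let sample = #"""
{"app_name": "Finder",
 "elements": [{"role": "AXButton", "text": "New Folder",
               "x": 412.0, "y": 86.0, "width": 96.0, "height": 24.0}],
 "processing_time_seconds": 0.31}
"""#
let decoded = try JSONDecoder().decode(ResponseData.self, from: Data(sample.utf8))
print(decoded.elements[0].role)   // AXButton
```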
Which macOS permission is required, and how does Fazm handle the Tahoe permission cache bug?
Accessibility (TCC service kTCCServiceAccessibility) is required for the AX read path; Screen Recording is requested separately for the screenshot fallback. AppState.swift wraps the permission probe in a multi-stage check because AXIsProcessTrusted() caches per process on macOS 26 Tahoe and can keep returning true after a permission reset. The probe calls AXUIElementCopyAttributeValue against the frontmost app, retries against Finder when the result is AXError.cannotComplete, and falls back to a CGEvent.tapCreate live TCC probe when Finder is not running. That is in AppState.swift around lines 432 to 504.
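A compressed sketch of the multi-stage idea; the Finder retry stage is elided here, and the real logic in AppState.swift is more involved:

```swift
import AppKit
import ApplicationServices

// Sketch of the probe's shape, not AppState.swift's code.
func accessibilityLooksLive() -> Bool {
    guard AXIsProcessTrusted() else { return false }   // can cache a stale true on Tahoe
    // Stage two: a real AX read against the frontmost app catches the stale case.
    guard let front = NSWorkspace.shared.frontmostApplication else { return false }
    let app = AXUIElementCreateApplication(front.processIdentifier)
    var value: CFTypeRef?
    let err = AXUIElementCopyAttributeValue(app, kAXWindowsAttribute as CFString, &value)
    if err != .cannotComplete { return err == .success }
    // Stage three: live TCC probe via an event tap; nil means no permission.
    let tap = CGEvent.tapCreate(tap: .cgSessionEventTap, place: .headInsertEventTap,
                                options: .listenOnly,
                                eventsOfInterest: CGEventMask(1 << CGEventType.keyDown.rawValue),
                                callback: { _, _, event, _ in Unmanaged.passRetained(event) },
                                userInfo: nil)
    if let tap { CFMachPortInvalidate(tap); return true }
    return false
}
```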
Can I take just MacosUseSDK and build my own agent?
Yes. The package exposes traverseAccessibilityTree(pid:onlyVisibleElements:) plus separate input controllers for click, type, and key press. There is no LLM dependency in the SDK, no MCP, no analytics. Add it to a Package.swift, request Accessibility, and you have the same primitives Fazm uses in three lines of Swift. The SDK lives at github.com/mediar-ai/MacosUseSDK and pins macOS 13.0 minimum.
How does Fazm compare to the Anthropic Computer Use tool, OpenAI Codex on Mac, and OpenCUA?
Anthropic Computer Use ships through Claude 3.5 Sonnet and later models as a screenshot plus mouse and keyboard surface. OS-agnostic, vision-driven, image token cost on every read. OpenAI Codex on Mac (released April 16, 2026) is cloud-first screen reading and acting. OpenCUA is an open foundation model trained on screen-action trajectories for vision-based agents. All three lean on screenshots as the primary perception channel. Fazm goes the other way for native Mac apps, defaulting to AX tree reads for cost and latency, and keeps screenshots as a tool the agent can request when it actually needs them.