An open source computer use agent on Mac is three repos, not one
The pages that currently rank for this topic frame the answer as a list of products. The actual answer, once you open the source, is three composable MIT-licensed repositories that compile into one signed app. This is a tour of each seam with the file paths that make it real.
The three-repo supply chain
When the Fazm app launches it spawns five child processes that speak MCP over stdin and stdout. The interesting one for this topic is the one named "macos-use". Following its provenance walks you across three GitHub repos, each owned by mediar-ai, each MIT, each installable on its own. The diagram below is the actual call shape.
From AX call to a Mac that does the thing
The numbers that govern every read
A breadth-first walk of an accessibility tree could in principle run forever; macOS apps that embed Chromium views routinely have tens of thousands of nodes. AccessibilityTraversal.swift hard-caps the walk on three axes so the language model never sees an unbounded payload. A sketch of the bounded walk follows the list.
- maxElements = 2000: the hard cap on collected elements per traversal. The check hits at line 104.
- maxDepth = 100: the recursion depth limit. Stops the walk before pathological Electron trees blow the stack.
- maxTraversalSeconds = 5.0: the wall-clock budget. After 5.0 seconds the walk truncates and returns whatever it has.
- 14 non-interactable roles dropped by default. Layout scaffolding never reaches the model.
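For orientation, here is a minimal sketch of a walk bounded the same way, using the cap values quoted above. The type and function names are illustrative, not the SDK's source:

```swift
import Foundation
import ApplicationServices

// Illustrative bounded BFS over AXUIElement nodes; names are not the SDK's.
struct TraversalCaps {
    let maxDepth = 100                    // stop before deep Electron trees blow the stack
    let maxElements = 2000                // hard cap on collected elements
    let maxTraversalSeconds: Double = 5.0 // wall-clock budget
}

func boundedWalk(root: AXUIElement, caps: TraversalCaps = TraversalCaps()) -> [AXUIElement] {
    let deadline = Date().addingTimeInterval(caps.maxTraversalSeconds)
    var collected: [AXUIElement] = []
    var queue: [(node: AXUIElement, depth: Int)] = [(root, 0)]
    while !queue.isEmpty {
        // Any one of the three caps ends the walk; return whatever was collected.
        if collected.count >= caps.maxElements || Date() > deadline { break }
        let (node, depth) = queue.removeFirst()
        collected.append(node)
        guard depth < caps.maxDepth else { continue }
        var children: CFTypeRef?
        if AXUIElementCopyAttributeValue(node, kAXChildrenAttribute as CFString, &children) == .success,
           let kids = children as? [AXUIElement] {
            queue.append(contentsOf: kids.map { ($0, depth + 1) })
        }
    }
    return collected
}
```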
Seam one: the SDK
MacosUseSDK is a small Swift package with no dependencies beyond AppKit and ApplicationServices. The struct below is what one element becomes after traversal, and the caps listed above bound every walk. Anyone building an alternate Mac agent can add the package to a Package.swift and skip writing accessibility code from scratch.
Six fields per element. No color, no opacity, no pixel data. The text field concatenates five accessibility attributes in the order AXValue, AXTitle, AXDescription, AXLabel, AXHelp, so the language model sees the same string a screen reader would. Spatial order is y first then x, which gives you reading-order without a layout pass.
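Reconstructed from the fields this section describes (the real declaration is the ElementData struct in AccessibilityTraversal.swift), the shape is roughly:

```swift
// Sketch reconstructed from the described fields, not the SDK's source.
struct ElementData: Codable {
    let role: String    // e.g. "AXButton", "AXTextField"
    let text: String    // AXValue + AXTitle + AXDescription + AXLabel + AXHelp, concatenated
    let x: Double       // screen coordinates, not view-local
    let y: Double
    let width: Double
    let height: Double
}

// Spatial sort: y first, then x, so elements arrive in reading order.
func readingOrder(_ elements: [ElementData]) -> [ElementData] {
    elements.sorted { ($0.y, $0.x) < ($1.y, $1.x) }
}
```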
Seam two: the MCP server
mcp-server-macos-use is the wrapper that turns the SDK into a tool any MCP-aware client can call. It exposes five tool calls, four of them suffixed "_and_traverse" so every action returns the new accessibility tree along with the result of the action; the fifth reads the tree without acting. The diff between before and after lets the agent verify its own click did what it expected without taking a screenshot. A raw transport sketch follows the tool list.
- macos-use_open_application_and_traverse: activate or launch an app, then return its tree.
- macos-use_click_and_traverse: click at (x, y) within a target PID, return the new tree.
- macos-use_type_and_traverse: type a string, return the new tree.
- macos-use_press_key_and_traverse: press a named key with optional modifier flags.
- macos-use_refresh_traversal: read the tree without acting.
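To watch the transport with no client in the way, you can spawn the built binary and write a JSON-RPC tools/call line to its stdin, which is how the MCP stdio transport frames messages. The sketch below skips the initialize handshake a real client performs first, and the path, target PID, and argument shape for the refresh tool are assumptions:

```swift
import Foundation

// Sketch: drive the server by hand over stdio. The path assumes
// `swift build -c release`; the PID and argument shape are illustrative.
let server = Process()
server.executableURL = URL(fileURLWithPath: ".build/release/mcp-server-macos-use")
let toServer = Pipe()
let fromServer = Pipe()
server.standardInput = toServer
server.standardOutput = fromServer
try server.run()

let request: [String: Any] = [
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": [
        "name": "macos-use_refresh_traversal",
        "arguments": ["pid": 1234]   // hypothetical target PID
    ]
]
var line = try JSONSerialization.data(withJSONObject: request)
line.append(0x0A)   // newline-delimited JSON-RPC over stdin
toServer.fileHandleForWriting.write(line)

// Blocks until the server writes something back; fine for a manual poke.
let reply = fromServer.fileHandleForReading.availableData
print(String(data: reply, encoding: .utf8) ?? "")
```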
Seam three: the bundled binary
The Fazm app itself is the third repo. The most interesting fifteen lines of source for this topic are in acp-bridge, which is the TypeScript shim that turns the ACP protocol into MCP server spawns. Line 63 is where the open source piece lands inside the signed consumer app.
contentsDir resolves to the Contents directory of the running .app bundle, so the path becomes Fazm.app/Contents/MacOS/mcp-server-macos-use. The binary is signed and notarized as part of the parent app, which is why the consumer install path skips the Gatekeeper friction that loose helper binaries usually trigger. BUILTIN_MCP_NAMES on line 1266 is the allow-list that tells the rest of the bridge which servers are first-party.
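The shim itself is TypeScript, but the path arithmetic is easy to mirror in Swift for illustration; this is not Fazm's code:

```swift
import Foundation

// Swift mirror of the TypeScript resolution at acp-bridge line 63, for
// illustration only; resolved from whatever bundle this code runs inside.
let contentsDir = Bundle.main.bundleURL.appendingPathComponent("Contents")
let macosUseBinary = contentsDir
    .appendingPathComponent("MacOS")
    .appendingPathComponent("mcp-server-macos-use")
print(FileManager.default.isExecutableFile(atPath: macosUseBinary.path))
```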
Two architectures, side by side
What the language model actually receives when an open source Mac agent perceives a window.
| Feature | Screenshot agent | Fazm (AX tree) |
|---|---|---|
| Perception primitive | PNG bitmap of the window or screen | JSON ElementData list (role, text, x, y, w, h) |
| Typical payload size | 1 to 3 MB before base64 encoding | 10 to 40 KB for a normal app window |
| Click target format | Pixel coordinates from a vision model | Element role and text, by index |
| Drift on Retina vs non-Retina | Vision model has to reason about scale | None, AX coordinates are screen-stable |
| Cost per perception step | Image tokens on every read | Text tokens only |
| Works without screen recording permission | No, screen recording is mandatory | Yes, AX is the only required TCC service |
| Best for PDFs and figures | This is the native fit | Falls back to screenshots on demand |
What you can fork at each seam
The point of three repos is that you can opt in at the layer that matches your project. A Swift app that needs accessibility input controls but no LLM only takes the SDK. A custom MCP-aware client takes the server and skips the consumer chrome. A non-developer takes the signed app and never thinks about Swift.
Fork MacosUseSDK if you write Swift
github.com/mediar-ai/MacosUseSDK. Add it to a Package.swift, request Accessibility, call traverseAccessibilityTree(pid:onlyVisibleElements:). No LLM, no MCP, no analytics. macOS 13.0 minimum. A manifest sketch follows the list below.
- Public API: traverseAccessibilityTree, click, type, key
- Pure Swift, AppKit and ApplicationServices only
- Returns Codable ResponseData
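A minimal manifest sketch, assuming the product name matches the repo and pinning main for brevity (prefer a release tag in practice):

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "MyAgent",                        // hypothetical package name
    platforms: [.macOS(.v13)],              // the SDK pins macOS 13.0 minimum
    dependencies: [
        // Pinning main is for brevity here; prefer a release tag.
        .package(url: "https://github.com/mediar-ai/MacosUseSDK", branch: "main")
    ],
    targets: [
        .executableTarget(name: "MyAgent", dependencies: ["MacosUseSDK"])
    ]
)
```

From there, `let result = try traverseAccessibilityTree(pid: pid, onlyVisibleElements: true)` is the whole perception call; check the current source for whether the function is marked async before copying the line.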
Fork mcp-server-macos-use if you have an MCP-aware client
github.com/mediar-ai/mcp-server-macos-use. Build with swift build, point Claude Desktop, Cursor, or any other MCP client at the binary, and you have Mac control inside that client. Five tools, stdio transport.
- Stdio MCP server, no network surface
- Wraps SDK; depends on it via Package.resolved
- Diff output between before and after traversal (sketched below)
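The before/after diff in the last bullet is easy to picture with the six-field ElementData shape sketched earlier. The snippet below illustrates the idea; it is not the server's implementation:

```swift
// Reuses the six-field ElementData shape sketched earlier in this article.
struct ElementData: Hashable {
    let role: String, text: String
    let x: Double, y: Double, width: Double, height: Double
}

struct TraversalDiff { let added: [ElementData]; let removed: [ElementData] }

// Illustrative diff between two traversals; not the server's code.
func diff(before: [ElementData], after: [ElementData]) -> TraversalDiff {
    let beforeSet = Set(before)
    let afterSet = Set(after)
    return TraversalDiff(added: after.filter { !beforeSet.contains($0) },
                         removed: before.filter { !afterSet.contains($0) })
}
```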
Take the Fazm app if you do not want to compile
github.com/mediar-ai/Fazm. Download the signed DMG, drag to Applications, grant Accessibility, and the same binary is already wired into the chat surface. The consumer-friendly path is the entire reason the .app exists.
- Signed and notarized DMG
- Bundles five MCP servers, no manual config
- SwiftUI chat surface, push-to-talk, Sparkle updates
Watching the binary register on a fresh launch
The acp-bridge logs every MCP server it spawns at startup, including which are bundled and which were resolved from disk. When Fazm.app comes up, each spawn is recorded in the dev log at /private/tmp/fazm-dev.log; the macos-use entry confirming the binary is "bundled" is the one to watch.
What the consumer app contributes that the SDK alone does not
A lot of open source agent stacks stop at "here is the binary, good luck." That is fine for developers, but it is the reason most non-developer Mac users never run an agent at all. The Fazm app shell is the part that earns the "consumer-friendly" label on this topic, and most of it is invisible until you skip it.
Bundled in the .app, not asked of the user
- Signed and notarized binary, no Gatekeeper override
- Five MCP servers wired up at first launch
- Multi-stage Accessibility permission probe (Tahoe-safe)
- Bundled Node, Python, ffmpeg, cloudflared
- Sparkle auto-update, no terminal commands
- Push-to-talk voice on hold-Left-Control
- Phone control surface at chat.fazm.ai
The other open source Mac agents in the same neighborhood
A few other projects exist in this space and are worth knowing about. They take different architectural bets, which is the part of the comparison that usually gets flattened.
Fazm is the only one in this list that ships the SDK, the MCP server, and the consumer .app as three separately forkable repos. UI-TARS and OpenCUA are foundation models and need a host. Open Interpreter is terminal-first and reaches the GUI through a Python harness. The browser-use macOS-use project is a Python wrapper over the macOS AX APIs without an end-user app on top.
Want to see the AX path beat a screenshot agent live?
20 minutes, your Mac, the same three repos you just read about. We will run a few tasks side by side and answer the implementation questions a doc cannot.
Frequently asked questions
What does 'open source computer use agent on a Mac' actually mean once you read the source?
On most Mac stacks it means a process that perceives the screen, reasons about it with a language model, and emits clicks or keystrokes. Fazm splits that into three open repos. github.com/mediar-ai/MacosUseSDK is the Swift package that walks the macOS accessibility tree and produces structured ElementData. github.com/mediar-ai/mcp-server-macos-use wraps that SDK as a stdio MCP server with five tools. github.com/mediar-ai/Fazm is the consumer macOS app that embeds the compiled binary inside Contents/MacOS/ and routes Claude through it. Each of the three is MIT-licensed and forkable on its own.
Where exactly is the macos-use binary inside the signed app?
acp-bridge/src/index.ts at line 63 resolves it with `const macosUseBinary = join(contentsDir, "MacOS", "mcp-server-macos-use");`. contentsDir is the Contents directory of the running .app bundle, so the file lives at Fazm.app/Contents/MacOS/mcp-server-macos-use. At startup the same file registers it as an MCP server named 'macos-use' (index.ts lines 1057 to 1064). It is signed and notarized as part of the parent app, which is why the consumer install path skips the Gatekeeper friction that loose helper binaries usually trigger.
What caps does the accessibility traversal actually run under?
AccessibilityTraversal.swift in MacosUseSDK/Sources/MacosUseSDK at lines 103 to 105 sets `maxDepth = 100`, `maxElements = 2000`, and `maxTraversalSeconds: Double = 5.0`. The breadth-first walk stops when any of the three is hit. A 14-role prune list at lines 109 to 114 (AXGroup, AXStaticText, AXUnknown, AXSeparator, AXHeading, AXLayoutArea, AXHelpTag, AXGrowArea, AXOutline, AXScrollArea, AXSplitGroup, AXSplitter, AXToolbar, AXDisclosureTriangle) drops non-interactable containers so the language model sees buttons, fields, and menu items, not layout scaffolding. The output is a JSON ResponseData struct with element list, statistics, and processing_time_seconds.
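Mirrored as Swift for illustration (the real list sits at the lines cited above), the prune is a set-membership check over the ElementData shape sketched earlier:

```swift
// The 14 roles the answer lists, mirrored as a Swift set for illustration.
let prunedRoles: Set<String> = [
    "AXGroup", "AXStaticText", "AXUnknown", "AXSeparator", "AXHeading",
    "AXLayoutArea", "AXHelpTag", "AXGrowArea", "AXOutline", "AXScrollArea",
    "AXSplitGroup", "AXSplitter", "AXToolbar", "AXDisclosureTriangle"
]

// Applied after traversal collects raw nodes; ElementData as sketched earlier.
func interactableOnly(_ elements: [ElementData]) -> [ElementData] {
    elements.filter { !prunedRoles.contains($0.role) }
}
```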
Why three repos instead of one monolithic agent?
Three different fork audiences. MacosUseSDK is for anyone who wants the raw accessibility primitives (traverse, click, type, key) in a Swift project, with no MCP, no agent, no LLM dependency. mcp-server-macos-use is for anyone who already has an MCP-aware client (Claude Desktop, Cursor, Cline, custom) and wants to drop in Mac control without writing Swift. The Fazm app is for end users who do not want to install a developer toolchain at all. Splitting the seams means the SDK can ship without the consumer chrome, and the consumer app can ship without forcing every user to compile.
How does this differ from a screenshot-based open source agent?
A screenshot agent renders the full window to a PNG, hands it to a vision model, and waits for pixel coordinates back. Fazm calls AXUIElementCreateApplication, walks the tree, and hands a list of typed elements with role, text, and frame back to the language model. A typical Finder window with 40 files traverses to roughly 60 to 120 ElementData entries, which serializes to about 10 to 20 kilobytes of JSON. The screenshot for the same window is 1 to 3 megabytes of PNG, which has to be base64-encoded into the prompt and decoded by the vision model before any reasoning happens. Both approaches work; the AX path is faster and cheaper for native UI, the screenshot path wins on PDFs and visual-only content, and Fazm keeps both available.
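The back-of-envelope version of that comparison, where every constant is a rough assumption anchored to the ranges above:

```swift
// Rough arithmetic only; constants are midpoints of the ranges in the text.
let elementCount = 100                          // a typical window: 60 to 120 entries
let bytesPerElement = 150                       // six short JSON fields per element
let axPayload = elementCount * bytesPerElement  // ~15 KB of JSON
let pngBytes = 2_000_000                        // midpoint of 1 to 3 MB
let base64Png = pngBytes * 4 / 3                // base64 inflates by 4/3
print(base64Png / axPayload)                    // roughly 175x more bytes per read
```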
What is bundled inside the Fazm app at install time?
Five MCP servers. acp-bridge/src/index.ts line 1266 declares the BUILTIN_MCP_NAMES set: fazm_tools, playwright, macos-use, whatsapp, google-workspace. Each runs as a child process over stdio. The macos-use binary is the open-source Swift Mach-O described above. The others ship from their own bundled paths under Contents. None require manual MCP configuration on first launch, which is the practical cost most open source agent stacks force the user to pay before the first prompt works.
What does the language model actually receive when Fazm reads a window?
A JSON object whose `elements` array contains six fields per UI item: role (string like AXButton or AXTextField), text (concatenated kAXValue, kAXTitle, kAXDescription, AXLabel, AXHelp), x, y, width, height (doubles in screen coordinates). The struct is ElementData in MacosUseSDK/Sources/MacosUseSDK/AccessibilityTraversal.swift lines 32 to 57. The wrapping ResponseData adds app_name, a Statistics struct, and processing_time_seconds. No color, no font, no pixel data. Spatial sort is y first then x, so elements arrive in reading order.
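A decode sketch of that shape, with the Statistics struct omitted and any key spelling not named in the answer treated as an assumption:

```swift
import Foundation

// Shapes follow the answer above; Statistics is omitted from this sketch.
struct ElementData: Codable {
    let role: String
    let text: String
    let x, y, width, height: Double
}

struct ResponseData: Codable {
    let app_name: String
    let elements: [ElementData]
    let processing_time_seconds: Double
}

let sample = #"""
{"app_name": "Finder",
 "elements": [{"role": "AXButton", "text": "New Folder",
               "x": 412.0, "y": 86.0, "width": 96.0, "height": 24.0}],
 "processing_time_seconds": 0.31}
"""#
let decoded = try JSONDecoder().decode(ResponseData.self, from: Data(sample.utf8))
print(decoded.elements[0].role)   // AXButton
```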
Which macOS permission is required, and how does Fazm handle the Tahoe permission cache bug?
Accessibility (TCC service kTCCServiceAccessibility) is required for the AX read path; Screen Recording is requested separately for the screenshot fallback. AppState.swift wraps the permission probe in a multi-stage check because AXIsProcessTrusted() caches per process on macOS 26 Tahoe and can keep returning true after a permission reset. The probe calls AXUIElementCopyAttributeValue against the frontmost app, retries against Finder when the result is AXError.cannotComplete, and falls back to a CGEvent.tapCreate live TCC probe when Finder is not running. That is in AppState.swift around lines 432 to 504.
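A compressed sketch of the multi-stage idea; the Finder retry stage is elided here, and the real logic in AppState.swift is more involved:

```swift
import AppKit
import ApplicationServices

// Sketch of the probe's shape, not AppState.swift's code.
func accessibilityLooksLive() -> Bool {
    guard AXIsProcessTrusted() else { return false }   // can cache a stale true on Tahoe
    // Stage two: a real AX read against the frontmost app catches the stale case.
    guard let front = NSWorkspace.shared.frontmostApplication else { return false }
    let app = AXUIElementCreateApplication(front.processIdentifier)
    var value: CFTypeRef?
    let err = AXUIElementCopyAttributeValue(app, kAXWindowsAttribute as CFString, &value)
    if err != .cannotComplete { return err == .success }
    // Stage three: live TCC probe via an event tap; nil means no permission.
    let tap = CGEvent.tapCreate(tap: .cgSessionEventTap, place: .headInsertEventTap,
                                options: .listenOnly,
                                eventsOfInterest: CGEventMask(1 << CGEventType.keyDown.rawValue),
                                callback: { _, _, event, _ in Unmanaged.passRetained(event) },
                                userInfo: nil)
    if let tap { CFMachPortInvalidate(tap); return true }
    return false
}
```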
Can I take just MacosUseSDK and build my own agent?
Yes. The package exposes traverseAccessibilityTree(pid:onlyVisibleElements:) plus separate input controllers for click, type, and key press. There is no LLM dependency in the SDK, no MCP, no analytics. Add it to a Package.swift, request Accessibility, and you have the same primitives Fazm uses in three lines of Swift. The SDK lives at github.com/mediar-ai/MacosUseSDK and pins macOS 13.0 minimum.
How does Fazm compare to the Anthropic Computer Use tool, OpenAI Codex on Mac, and OpenCUA?
Anthropic Computer Use ships through Claude 3.5 Sonnet and later models as a screenshot plus mouse and keyboard surface. OS-agnostic, vision-driven, image token cost on every read. OpenAI Codex on Mac (released April 16, 2026) is cloud-first screen reading and acting. OpenCUA is an open foundation model trained on screen-action trajectories for vision-based agents. All three lean on screenshots as the primary perception channel. Fazm goes the other way for native Mac apps, defaulting to AX tree reads for cost and latency, and keeps screenshots as a tool the agent can request when it actually needs them.