A walk through one specific file

The open source local desktop agent question is mostly a permission-probe question

Every guide to picking an open source local desktop agent ends at the same shallow conclusion: use the accessibility tree, not screenshots. That is the easy half. The hard half, the one that decides whether the agent survives a macOS upgrade and a re-sign, is what the code does when the operating system lies to it about whether the permission still works. This is a walk through one file in the Fazm source tree that handles exactly that, in the order it runs.

Matthew Diakonov

The lie the operating system tells

The macOS accessibility API has one entry point everyone reaches for first: AXIsProcessTrusted(). It returns a Bool. The Bool is supposed to mean "this process can call accessibility APIs against other apps." In practice, on macOS 26 (Tahoe), and on Sequoia after an OS update or an app re-sign, that Bool is cached per process and the cache goes stale. The function returns true. The next AX call fails. The user grants permission again. The function still returns true. The next AX call still fails. The agent looks broken. The agent is not broken. The cache is.

A second-order failure is layered on top of the first. The error code that comes back from a failing AX call is AXError.cannotComplete, and it is ambiguous on purpose. Sometimes it means the permission is broken. Sometimes it means the frontmost app does not implement accessibility at all, which is true of most Qt apps, most OpenGL apps, some Electron builds, and PyMOL. An agent that treats every cannotComplete as a permission failure will nag the user for permission every time they tab to PyMOL. An agent that treats every cannotComplete as a foreign-app problem will silently stop working the day TCC drifts.

What the API says vs what the user sees

AXIsProcessTrusted() returns true. The agent assumes everything is fine and starts driving the browser. Every click silently fails. The user sees no feedback. The agent looks dead.

  • Single API call, single Bool
  • No re-check after the first turn
  • TCC cache treated as ground truth
  • Failure indistinguishable from a slow agent

The four probes, in order

The probe lives in Desktop/Sources/AppState.swift between line 310 and line 540. The names of the methods are unglamorous: checkAccessibilityPermission, testAccessibilityPermission, confirmAccessibilityBrokenViaFinder, and probeAccessibilityViaEventTap. They run as a sequence, with short-circuit exits at each step.

Permission probe sequence

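  • Participants: Fazm, TCC, FrontApp, Finder, EventTap
  • Fazm → TCC: AXIsProcessTrusted(). Reply: Bool (possibly stale)
  • Fazm → FrontApp: AXUIElementCopyAttributeValue(focusedWindow). Reply: success | cannotComplete
  • Fazm → Finder: same call, against the Finder pid. Reply: success | cannotComplete
  • Fazm → EventTap: CGEvent.tapCreate(.listenOnly, mouseMoved). Reply: tap | nil (live TCC truth)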
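The short-circuit order can be sketched as a pure decision function with the four probes injected as closures. This is a minimal sketch, not the code in AppState.swift: the type names, the three verdicts, and the mapping from probe outcomes to verdicts (in particular, reading a Finder failure as "granted but stale") are assumptions made for illustration.

```swift
// Hypothetical types: none of these names come from the Fazm source.
enum AXCallResult {
    case ok               // the AX call answered
    case cannotComplete   // ambiguous: broken permission or a non-AX app
    case apiDisabled      // system-wide AX is off
}

enum Verdict {
    case working          // AX is alive, keep going
    case grantedButStale  // assumption: the grant exists, this process needs a restart
    case notGranted       // the user has to grant the permission
}

struct Probes {
    var isProcessTrusted: () -> Bool       // probe 1: the cached Bool
    var frontAppCall: () -> AXCallResult   // probe 2: real call against the front app
    var finderCall: () -> AXCallResult?    // probe 3: nil when Finder is not running
    var eventTapCreated: () -> Bool        // probe 4: did tap creation return a port?
    var previouslyTrusted: Bool            // has this process ever held the permission?
}

func runProbeSequence(_ p: Probes) -> Verdict {
    // Probe 1: neither answer is trusted on its own.
    if !p.isProcessTrusted() {
        // Fresh installs skip the tap probe to avoid the system notification.
        guard p.previouslyTrusted else { return .notGranted }
        return p.eventTapCreated() ? .grantedButStale : .notGranted
    }
    // Probe 2: a real AX call against the frontmost app.
    switch p.frontAppCall() {
    case .ok: return .working
    case .apiDisabled: return .notGranted
    case .cannotComplete: break  // ambiguous, keep probing
    }
    // Probe 3: the same call against Finder.
    switch p.finderCall() {
    case .some(.ok): return .working              // the failure was app-specific
    case .some(.apiDisabled): return .notGranted
    case .some(.cannotComplete): return .grantedButStale
    case .none: break                             // Finder not running
    }
    // Probe 4: assumption: a live tap here means the grant is fine.
    return p.eventTapCreated() ? .working : .notGranted
}
```

Injecting the probes keeps the ordering testable without touching TCC: a stub that answers "trusted, cannotComplete, cannotComplete" exercises the stale-cache path on any machine.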

Probe 1, the cheap one

Call AXIsProcessTrusted(). If it returns false on a process that previously had permission, do not believe it. The cache is the most likely culprit, and there is a cheaper way to confirm than re-prompting the user. Skip ahead to probe 4. If it returns true, do not believe that either. Run probe 2.

Probe 2, the real AX call

Get the frontmost app from NSWorkspace.shared.frontmostApplication, build an AXUIElement for it, and try to read kAXFocusedWindowAttribute. The four meaningful return codes are: success (everything is fine), noValue (the app has no windows, also fine), apiDisabled (system-wide AX is off, unambiguous failure), and cannotComplete (ambiguous, run probe 3). notImplemented and attributeUnsupported are also treated as fine because they come from healthy AX calls against apps that just do not expose a focused window attribute.
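A functional test along these lines might look like the sketch below. It uses the public AX calls named above; the FrontAppProbe type and the function names are hypothetical, and treating every unlisted error code as ambiguous is an assumption rather than something confirmed against the Fazm source.

```swift
import AppKit
import ApplicationServices

// Hypothetical result type; the case names are not taken from the Fazm source.
enum FrontAppProbe {
    case healthy    // AX answered, even if only to say "no such attribute"
    case disabled   // system-wide AX is off
    case ambiguous  // cannotComplete: needs a cross-check against another app
}

// Maps the AXError from the focused-window read onto the three outcomes.
func classify(_ error: AXError) -> FrontAppProbe {
    switch error {
    case .success, .noValue, .notImplemented, .attributeUnsupported:
        return .healthy
    case .apiDisabled:
        return .disabled
    case .cannotComplete:
        return .ambiguous
    default:
        // Assumption: codes the article does not mention are treated as ambiguous.
        return .ambiguous
    }
}

// Reads kAXFocusedWindowAttribute from the app with the given pid.
func probeFocusedWindow(pid: pid_t) -> FrontAppProbe {
    let app = AXUIElementCreateApplication(pid)
    var value: CFTypeRef?
    let error = AXUIElementCopyAttributeValue(app, kAXFocusedWindowAttribute as CFString, &value)
    return classify(error)
}

// Returns nil when there is no frontmost application to test against.
func probeFrontmostApp() -> FrontAppProbe? {
    guard let front = NSWorkspace.shared.frontmostApplication else { return nil }
    return probeFocusedWindow(pid: front.processIdentifier)
}
```

Keeping classify separate from the call itself means the error mapping can be checked without holding the accessibility permission.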

Probe 3, the Finder sanity check

When probe 2 returned cannotComplete, the question is whether the failure was caused by the permission or by the foreign app. The cheapest way to find out is to repeat the same call against an app that is known to implement accessibility correctly. Apple ships one: Finder. If Finder succeeds, the original failure was app-specific (probably PyMOL or a Qt build), the permission is fine, and the agent should keep going. If Finder also fails, the permission is the problem. If Finder is not even running, fall through to probe 4.
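The cross-check could be sketched as follows. The FinderCheck type and both function names are hypothetical, and counting noValue as a healthy answer from Finder (no focused window) is an assumption carried over from how probe 2 treats that code.

```swift
import AppKit
import ApplicationServices

// Hypothetical outcome type for illustration only.
enum FinderCheck {
    case permissionFine    // Finder answered: the earlier failure was app-specific
    case permissionBroken  // Finder failed too
    case finderNotRunning  // no reference app: fall through to the event-tap probe
}

// Pure interpretation step, separated so it can be exercised without AX access.
func interpret(finderError: AXError?) -> FinderCheck {
    guard let error = finderError else { return .finderNotRunning }
    switch error {
    case .success, .noValue:
        return .permissionFine
    default:
        return .permissionBroken
    }
}

// Repeats the focused-window read against Finder; nil means Finder is not running.
func finderFocusedWindowError() -> AXError? {
    let finders = NSRunningApplication.runningApplications(withBundleIdentifier: "com.apple.finder")
    guard let finder = finders.first else { return nil }
    let element = AXUIElementCreateApplication(finder.processIdentifier)
    var value: CFTypeRef?
    return AXUIElementCopyAttributeValue(element, kAXFocusedWindowAttribute as CFString, &value)
}
```

A caller would combine the two as interpret(finderError: finderFocusedWindowError()) and only move on to the event-tap probe when the result is finderNotRunning.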

Probe 4, the CGEvent tap escape hatch

This is the part the existing playbooks do not mention. CGEvent tap creation is governed by the same accessibility entitlement as AX calls, but unlike AXIsProcessTrusted it does not consult the per-process cache. It hits the live TCC database. Calling CGEvent.tapCreate with .listenOnly on mouseMoved is a probe, not a real subscription: the tap is invalidated immediately. If it returns a non-nil port, the entitlement is live and AX will recover after a process restart. If it returns nil, the permission is genuinely missing and the user has to grant it. Fazm runs this probe only when an earlier step has established that the process previously held the permission, which avoids triggering the "app was prevented from modifying other apps" system notification on every polling cycle for a freshly installed process.
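A create-and-discard tap probe might be written as below. This is a sketch under assumptions: the function names are hypothetical, and the tap location (.cgSessionEventTap) and placement (.tailAppendEventTap) are guesses, since the article only specifies .listenOnly and mouseMoved.

```swift
import Foundation
import CoreGraphics

// Tries to create a listen-only mouseMoved tap and discards it immediately.
// Returns true when the system handed back a port.
func eventTapCanBeCreated() -> Bool {
    let mask = CGEventMask(1) << CGEventMask(CGEventType.mouseMoved.rawValue)
    // The callback passes events through untouched; the tap is never
    // attached to a run loop, so it is not expected to fire.
    let callback: CGEventTapCallBack = { _, _, event, _ in
        Unmanaged.passUnretained(event)
    }
    guard let port = CGEvent.tapCreate(
        tap: .cgSessionEventTap,        // assumption: location not stated in the article
        place: .tailAppendEventTap,     // assumption: placement not stated in the article
        options: .listenOnly,
        eventsOfInterest: mask,
        callback: callback,
        userInfo: nil
    ) else {
        return false
    }
    CFMachPortInvalidate(port)  // a probe, not a subscription
    return true
}

// Gate: skip the probe for processes that never held the permission,
// so a fresh install does not trigger the system notification.
// Returns nil when the probe was skipped.
func tapProbe(previouslyTrusted: Bool) -> Bool? {
    guard previouslyTrusted else { return nil }
    return eventTapCanBeCreated()
}
```

Returning an optional keeps "probe skipped" distinct from "probe failed", which matters because only the second one says anything about TCC.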

What the agent does after the probe agrees

A clean four-step probe is necessary, not sufficient. The probe gives the agent a trustworthy yes-or-no on whether AX is alive. Once that is settled, the agent still needs to actually drive applications. Fazm hands that work to a small set of MCP-style tool families: native macOS apps go through mcp__macos-use__* (Finder, Settings, Mail), the browser goes through Playwright attached to the user's real running Chrome (not a headless instance), WhatsApp uses mcp__whatsapp__* on the native macOS Catalyst app, and Telegram is driven by Python telethon scripts because that is faster and steadier than browser automation.

The routing rules live in Desktop/Sources/Chat/ChatPrompts.swift and they are part of the system prompt the agent receives every turn. The interesting consequence: when the agent picks up a new app, it does not have to be retrained, it has to be added to the routing block. New tool families plug in via ~/.fazm/mcp-servers.json, the user-side MCP server registry that landed in v2.4.0 on 2026-04-20 and is read by Desktop/Sources/MCPServerManager.swift.

The shape of an open source local desktop agent on macOS, then, is two layers: a probe that decides whether the OS will let the agent see the screen at all, and a router that decides which tool family talks to which surface. Most of the comparison energy in this category is spent on the second layer. Most of the failure is in the first.

What to grep for when picking an agent

You can audit any open source local desktop agent on macOS in fifteen minutes if you know what to look for. Clone the repo. Grep for these strings:

  • AXIsProcessTrusted: the entry point. One hit is the bare minimum. Zero hits means the project does not even ask whether it has permission.
  • kAXFocusedWindowAttribute or AXUIElementCopyAttributeValue: the real call. If grep returns hits inside a permission-check function, the project is doing a functional test, not just trusting the cache.
  • CGEvent.tapCreate or CGEventTapCreate: the escape hatch. A repo with this in a permission probe knows about the macOS 26 cache problem and has handled it.
  • cannotComplete: the ambiguous error. A repo that branches on this and has a Finder fallback or equivalent cross-app sanity check is built by someone who has been bitten in production.

The hit count for the strings in those four bullets, plus a quick read of the function that contains them, is a faster signal of whether the agent will survive on a real Mac than any feature comparison.
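If you would rather script the audit than run grep by hand, a rough equivalent is sketched below. The function names and the choice to scan only .swift, .m, and .mm files are assumptions made for the example. It counts plain substring matches, so AXIsProcessTrusted also matches AXIsProcessTrustedWithOptions.

```swift
import Foundation

// The identifiers from the checklist above.
let needles = [
    "AXIsProcessTrusted",
    "kAXFocusedWindowAttribute",
    "AXUIElementCopyAttributeValue",
    "CGEvent.tapCreate",
    "CGEventTapCreate",
    "cannotComplete",
]

// Counts non-overlapping occurrences of each needle in one source string.
func countHits(in source: String, needles: [String]) -> [String: Int] {
    var hits: [String: Int] = [:]
    for needle in needles {
        hits[needle] = source.components(separatedBy: needle).count - 1
    }
    return hits
}

// Walks a checkout and sums hits across Swift and Objective-C sources.
func auditRepo(at root: String, needles: [String]) -> [String: Int] {
    var totals = Dictionary(uniqueKeysWithValues: needles.map { ($0, 0) })
    let sourceExtensions: Set<String> = ["swift", "m", "mm"]
    guard let walker = FileManager.default.enumerator(atPath: root) else { return totals }
    for case let relative as String in walker {
        let ext = URL(fileURLWithPath: relative).pathExtension
        guard sourceExtensions.contains(ext),
              let text = try? String(contentsOfFile: root + "/" + relative, encoding: .utf8)
        else { continue }
        for (needle, count) in countHits(in: text, needles: needles) {
            totals[needle, default: 0] += count
        }
    }
    return totals
}
```

Calling auditRepo(at:needles:) on a cloned repo gives the totals. The counts only say where to look; reading the function that contains the hits is still the step that tells you how the probes are wired together.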

Frequently asked questions

What is an open source local desktop agent in one sentence?

A program you install on your own machine that, given a natural-language instruction, can read what is on your screen and drive your real applications (browser, Mail, Finder, Excel, your IDE) without sending the screen to a remote service. Open source means the source is on GitHub and you can audit, fork, or self-build it. Local means the agent process runs on your laptop, not in someone else's data center, even if the underlying language model lives elsewhere.

Why does the choice between accessibility API and screenshots matter so much?

Screenshots compress every UI element into a 2D bitmap and ask a vision model to find the OK button. The accessibility tree gives you that button as a structured object with a role, a title, a screen-space frame, and a set of supported actions. The first approach costs about two seconds per click and frequently misjudges pixel coordinates. The second takes milliseconds per call and addresses elements by identity, not coordinates. Every credible local agent on macOS in 2026 uses the accessibility tree as the primary signal. The interesting question is what they do when the tree is empty or the permission is broken.

What specifically goes wrong with the macOS accessibility permission?

Two things. First, AXIsProcessTrusted() returns a cached answer that does not always match the live TCC database, especially after a macOS upgrade or an app re-sign. The agent thinks it has permission and fails on every AX call. Second, AXUIElementCopyAttributeValue can return AXError.cannotComplete for two completely different reasons: your permission is genuinely broken, or the frontmost app simply does not implement accessibility. Qt apps, OpenGL apps, PyMOL, some Electron builds, all return cannotComplete from a healthy AX call against the wrong target. An agent that treats both cases the same will either be paranoid (false alarms) or blind (silent failures).

How does Fazm tell those two cases apart?

It runs a four-step probe in Desktop/Sources/AppState.swift. Step one calls AXIsProcessTrusted(). Step two calls AXUIElementCreateApplication on the frontmost app and tries to read its focused window. Step three, only when step two returns cannotComplete, repeats the call against Finder. Finder is a known accessibility-compliant Apple app, so if it also fails the permission is truly broken; if Finder succeeds the original failure was app-specific. Step four, only when previous steps disagree or Finder is not running, calls CGEvent.tapCreate with .listenOnly on mouseMoved. Tap creation hits the live TCC database and bypasses the per-process cache that goes stale on macOS 26. The whole probe is about 70 lines of Swift and lives between line 433 and line 504 of AppState.swift.

Is this only a Fazm problem or does every local desktop agent face it?

Every local AX-driven agent on macOS faces it. What differs is whether the agent's source code includes the four-step probe, a one-step AXIsProcessTrusted check, or no check at all. You can grep any candidate's GitHub repo for AXIsProcessTrusted, kAXFocusedWindowAttribute, and CGEvent.tapCreate. The number of hits and the way they are wired together tell you whether the maintainer has shipped on macOS Sequoia or Tahoe in production with paying users.

What does this code mean in practice for someone picking an agent?

On a Mac, you want the agent's onboarding to include a recoverable accessibility check, not a one-line 'grant permission to continue.' If the agent never re-checks after the permission is granted, it will eventually drift on a system update and the user will assume the agent is broken. If it re-checks but only via AXIsProcessTrusted, it will accept stale yes-answers from the cache. The robustness signal is whether the agent shows you something like 'macOS granted accessibility permission but it isn't working yet, please quit and reopen' instead of pretending nothing is wrong. Fazm has that exact dialog wired to the probe.

Where does the agent's actual click happen, then?

Once the probe says AX is alive, Fazm hands the actual UI work to a set of MCP-style tools the agent calls. Desktop apps go through the macos-use tools (mcp__macos-use__*). The browser goes through the playwright tools attached to the user's real Chrome session, not a headless instance. WhatsApp uses its native macOS app via mcp__whatsapp__*. Telegram is driven by Python telethon scripts because that is faster and steadier than browser automation. The routing rules live in Desktop/Sources/Chat/ChatPrompts.swift and they are part of the system prompt the agent receives every turn.

Why is 'fully local' qualified by the language model?

Honest answer: the agent process, your screen contents, your file scan, your knowledge graph, your browser profile, your chat history, all of it stays on your Mac. The language model is the part that can either be local (Ollama, an LM Studio bridge, or any Anthropic-compatible gateway you point Fazm at via the Custom API Endpoint setting) or cloud (Anthropic API). What never happens is a forced trip through a vendor's ingestion pipeline. The screen capture tool, when used, takes a still image and sends it to the configured model, so it is the same trust boundary as the model itself, not a parallel one.

Where is the source you can verify all of this against?

github.com/m13v/fazm. The Swift app is under Desktop/. The accessibility probe is in Desktop/Sources/AppState.swift between lines 310 and 540. The tool routing prompt is in Desktop/Sources/Chat/ChatPrompts.swift starting around line 65. The MCP server registry that lets you add your own tools is at Desktop/Sources/MCPServerManager.swift and the user-side config it reads lives at ~/.fazm/mcp-servers.json. CHANGELOG.json at the repo root has every dated release; v2.4.0 on 2026-04-20 is the one that opened the MCP server registry to user-supplied servers.

What is the shortest version?

An open source local desktop agent on macOS is mostly an exercise in failing gracefully when the operating system says one thing and means another. The screenshot-vs-accessibility comparison is the part everyone writes about. The TCC cache, the cannotComplete ambiguity, the Finder fallback, and the CGEvent tap probe are the parts that decide whether the agent is a daily driver or a demo. Read AppState.swift before you read the marketing copy.