Field notes from one shipping launcher

Mac launcher AI agent accessibility: the part nobody else wires up

Spotlight launches apps. Raycast and Alfred launch apps and run small scripts. The newer launchers added a chat field that calls an LLM. None of those does the thing this post is about: hit a global hotkey, type one sentence, and have an agent loop reach into the frontmost app through the macOS accessibility API and actually do the thing. This is how that wiring works in one open-source implementation, with the file paths and the FourCharCode.

Matthew Diakonov
8 min read

Direct answer (verified 2026-05-14)

A Mac launcher AI agent that uses accessibility is a global-hotkey input bar (Spotlight-style) whose enter key hands the prompt to an agent loop, not to a chat completion. The agent loop drives the frontmost app through the macOS accessibility API (AXUIElement, kAXFocusedWindowAttribute, AXUIElementCopyAttributeValue) instead of screenshot vision. The launcher is the input; the AX tree is the substrate; the agent is the executor.

Reference implementation: github.com/m13v/fazm. Hotkey at Desktop/Sources/FloatingControlBar/GlobalShortcutManager.swift, AX probe at Desktop/Sources/AppState.swift:482-534, bundled MCP at Contents/MacOS/mcp-server-macos-use.

What the launcher is, and what it is not

The Mac launcher genre has had one job for twenty years: type a few characters, hit enter, an app opens. Spotlight added file results and definitions, Alfred added workflows, Raycast added extensions. The consistent shape is a global hotkey that drops a fuzzy-matched picker in front of you and returns control to whatever you were doing.

The chat-bar versions of the same UI (Raycast AI, MacGPT, the various menubar wrappers) added a second job: type a sentence, hit enter, an LLM replies with text. The hotkey is the same; the result is text in a box. That is a useful surface for definitions, drafting, and quick translations. It is not what this post is about.

The thing this post is about is the third generation: the launcher where enter does not call a chat completion, it calls an agent loop. The agent reaches into whatever app was in front of you a moment ago, reads its accessibility tree, and clicks, types, or scrolls there. Voice input is the same surface with a microphone. The launcher is still the entry point. The accessibility API is what the launcher is standing on top of.

One input, many surfaces

The launcher does not care which app is in front of you. The agent picks the right reach based on the frontmost window: a native AX tree for Mail or Slack, a browser-extension path for Chrome page DOM, a Google Workspace tool for Docs/Sheets/Calendar, a screenshot fallback for the Qt or OpenGL apps that do not expose a tree at all.

Launcher to surfaces, via one agent loop

[Diagram: three entry points (Cmd + \, the Ask shortcut, hold-to-talk voice) feed one agent loop, which fans out to four surfaces: native AX, browser DOM, Google Workspace, and screenshot fallback.]

The single hub in the middle is the part that makes this design hold together. The launcher does not branch into N different code paths based on what app you have in front of you. It sends one prompt to one agent, and the agent calls the right tool based on what it sees. The AX tree is the default, because it is two orders of magnitude faster than vision, has labels for free, and never needs OCR.
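In fazm that choice happens inside the model's tool selection at runtime, so there is no Swift switch to quote. A minimal sketch of the decision space, with invented names, looks like this:

    // Invented names; the real routing is the agent's tool choice, not a switch.
    enum Surface {
        case nativeAX        // default: fast, labelled, no OCR
        case browserDOM      // page content via the browser extension
        case googleWorkspace // Docs/Sheets/Calendar via API tools
        case screenshot      // Qt/OpenGL apps with no AX tree
    }

    func likelySurface(bundleID: String, axTreeEmpty: Bool) -> Surface {
        // A browser's AX tree exposes only the chrome, never the page DOM.
        if bundleID == "com.google.Chrome" || bundleID == "com.apple.Safari" {
            return .browserDOM
        }
        // AX first; vision only when there is no tree to read.
        return axTreeEmpty ? .screenshot : .nativeAX
    }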

The hotkey wiring, in the actual code

This is the bit that surprised me when I started writing it. The launcher uses Carbon's RegisterEventHotKey, not the newer NSEvent.addGlobalMonitor path. The Carbon hot-key API has been deprecated-but-still-supported for fifteen years and is what Alfred, Raycast, and most launchers actually use under the hood. The reason: it works without the Accessibility TCC permission, so the launcher can open before the first permission prompt ever fires.

Three hotkeys get registered. Cmd + \ (keycode 42) is the toggle for the floating bar. The Ask shortcut is user-configurable: Cmd Enter, Cmd Shift Enter, Cmd J, or Cmd O, depending on what the user has set in Settings. The third pops a new chat out into a standalone window. The FourCharCode signature passed to EventHotKeyID is 0x46415A4D, the literal bytes of ASCII "FAZM".

GlobalShortcutManager.swift

The handler on the other end of those registrations posts a DistributedNotification. The bar shows up, focused. If the user picked the Ask shortcut, the cursor lands in the agent input field; if they picked the toggle, the bar shows whatever conversation they last had. A separate observer on ShortcutSettings.askFazmShortcutChanged re-registers the Ask hotkey live, so changing it in Settings does not need a restart.
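A minimal sketch of that wiring, assuming an invented class and notification name (the FourCharCode, keycode, and modifier are the real ones described above):

    import Carbon.HIToolbox
    import Foundation

    final class HotKeySketch {
        private var hotKeyRef: EventHotKeyRef?

        func register() {
            // One handler for kEventHotKeyPressed; hotKeyID.id says which key fired.
            var spec = EventTypeSpec(eventClass: OSType(kEventClassKeyboard),
                                     eventKind: UInt32(kEventHotKeyPressed))
            InstallEventHandler(GetEventDispatcherTarget(), { _, event, _ in
                guard let event else { return noErr }
                var hkID = EventHotKeyID()
                GetEventParameter(event, EventParamName(kEventParamDirectObject),
                                  EventParamType(typeEventHotKeyID), nil,
                                  MemoryLayout<EventHotKeyID>.size, nil, &hkID)
                // The bar observes this and shows itself, focused.
                // Notification name is invented for the sketch.
                DistributedNotificationCenter.default().post(
                    name: Notification.Name("dev.fazm.toggleBar"),
                    object: String(hkID.id))
                return noErr
            }, 1, &spec, nil, nil)

            // 0x46415A4D is ASCII "FAZM"; keycode 42 is backslash, cmdKey is ⌘.
            let hotKeyID = EventHotKeyID(signature: OSType(0x46415A4D), id: 1)
            RegisterEventHotKey(42, UInt32(cmdKey), hotKeyID,
                                GetEventDispatcherTarget(), 0, &hotKeyRef)
        }
    }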

What the agent calls, once the bar opens

The launcher hands the prompt to the agent loop over ACP (Agent Client Protocol). The bundled agent is Claude Code by default, with Codex available as a swappable backend per chat. Both run with one extra MCP server: a native macOS accessibility automation server bundled inside the app at Contents/MacOS/mcp-server-macos-use. The bridge code that discovers and starts it is at acp-bridge/src/index.ts, line 100, where it resolves the path; the server gets registered in buildMcpServers around line 2070 under the name macos-use.
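The discovery code itself is TypeScript, so nothing here is a quote from the repo; a rough Swift rendering of the same stdio-spawn shape, using the bundled path from the post, looks like this:

    import Foundation

    // The real discovery lives in acp-bridge/src/index.ts; this only shows
    // the shape: resolve the bundled binary, spawn it, talk over stdio.
    let serverURL = Bundle.main.bundleURL
        .appendingPathComponent("Contents/MacOS/mcp-server-macos-use")

    let server = Process()
    server.executableURL = serverURL
    server.standardInput = Pipe()   // MCP speaks JSON-RPC over stdin/stdout
    server.standardOutput = Pipe()
    try server.run()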

The tools the agent sees are named things like macos-use_open_application_and_traverse, macos-use_click_and_traverse, macos-use_type_and_traverse, macos-use_press_key_and_traverse. Each one performs the action and returns a flat text dump of the resulting AX tree, one line per element in the format [Role] "text" x:N y:N w:W h:H visible, plus a screenshot path on disk that the agent can read separately if it needs to disambiguate.

The chain reaches into the AX C API: AXUIElementCreateApplication(pid) to get a root handle on a target process, AXUIElementCopyAttributeValue with kAXChildrenAttribute, kAXRoleAttribute, kAXTitleAttribute, kAXValueAttribute to walk the tree, AXUIElementPerformAction(kAXPressAction) to click, CGEventCreateKeyboardEvent to type. No pixel sampling unless the AX tree is empty.
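A minimal sketch of that walk, emitting the flat line format from above; the helper names are invented, the AX calls are the real C API:

    import ApplicationServices

    func copyAttr(_ el: AXUIElement, _ attr: String) -> CFTypeRef? {
        var value: CFTypeRef?
        return AXUIElementCopyAttributeValue(el, attr as CFString, &value) == .success
            ? value : nil
    }

    func dumpTree(pid: pid_t) -> [String] {
        var lines: [String] = []
        var queue: [AXUIElement] = [AXUIElementCreateApplication(pid)]
        while let el = queue.first {
            queue.removeFirst()
            let role  = copyAttr(el, kAXRoleAttribute)  as? String ?? "?"
            let text  = copyAttr(el, kAXTitleAttribute) as? String
                     ?? copyAttr(el, kAXValueAttribute) as? String ?? ""
            var pt = CGPoint.zero; var sz = CGSize.zero
            if let v = copyAttr(el, kAXPositionAttribute) {
                AXValueGetValue(v as! AXValue, .cgPoint, &pt)
            }
            if let v = copyAttr(el, kAXSizeAttribute) {
                AXValueGetValue(v as! AXValue, .cgSize, &sz)
            }
            // One element per line: [Role] "text" x:N y:N w:W h:H
            lines.append("[\(role)] \"\(text)\" x:\(Int(pt.x)) y:\(Int(pt.y)) w:\(Int(sz.width)) h:\(Int(sz.height))")
            if let kids = copyAttr(el, kAXChildrenAttribute) as? [AXUIElement] {
                queue.append(contentsOf: kids)   // BFS over the whole window
            }
        }
        return lines
    }

    // Clicking is one more call per element:
    // AXUIElementPerformAction(el, kAXPressAction as CFString)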

The permission probe (because cannotComplete lies)

A launcher that drives the AX tree has to handle one quirk that the documentation does not warn you about: the error kAXErrorCannotComplete is ambiguous. It can mean the system has revoked accessibility permission, or it can mean the frontmost app simply does not expose an AX tree at all. Qt apps, Python apps that draw to a single Cocoa view, OpenGL canvases, and a few Electron builds with accessibility disabled all return the same error code. A launcher that reads it as "permission broken" shows the Settings nag every time the user happens to have PyMOL frontmost.

AppState.swift

The probe lives at lines 482 to 534 of Desktop/Sources/AppState.swift. If the AX call against the frontmost app returns cannotComplete, the probe retries against Finder, which is canonical and always exposes a tree. If Finder also fails, the permission is actually broken and the launcher prompts the user to open System Settings. If Finder succeeds, the launcher knows the frontmost app is at fault and quietly falls back to coordinates plus a screenshot. The launcher does not lie about permission state to make its own error path easier.
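A minimal sketch of the probe logic, assuming invented names (the Finder-retry pattern is the one described above, not the repo's literal code):

    import AppKit
    import ApplicationServices

    enum AXProbeResult { case ok, permissionBroken, appHasNoTree }

    func probeAX(frontmostPID: pid_t) -> AXProbeResult {
        func focusedWindowError(_ pid: pid_t) -> AXError {
            var value: CFTypeRef?
            return AXUIElementCopyAttributeValue(
                AXUIElementCreateApplication(pid),
                kAXFocusedWindowAttribute as CFString, &value)
        }
        let front = focusedWindowError(frontmostPID)
        guard front == .cannotComplete else {
            // Success means AX works; any other error we treat as broken here.
            return front == .success ? .ok : .permissionBroken
        }
        // cannotComplete is ambiguous: retry against Finder, which always
        // exposes a tree. If Finder also fails, permission is the problem.
        let finderPID = NSRunningApplication
            .runningApplications(withBundleIdentifier: "com.apple.finder")
            .first?.processIdentifier ?? 0
        return focusedWindowError(finderPID) == .success ? .appHasNoTree
                                                         : .permissionBroken
    }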

Why the accessibility path, not screenshots

A screenshot agent and an AX agent solve the same problem with two very different cost profiles. A full BFS walk of a focused window in Slack, Mail, or Chrome runs in 30 to 80 milliseconds end to end, measured on the same machine that is running the launcher. A screenshot pipeline takes 200 to 400 ms to capture, sends a 1 to 5 MB PNG to a vision model, waits 1 to 4 seconds for inference, and parses pixel coordinates back. The AX path is roughly two orders of magnitude faster on a typical action.

The cost split also changes what travels off the machine. A screenshot agent sends pixels for every action: the contents of your inbox, the message you are about to reply to, the document on screen. An AX agent sends labels and roles for the elements it actually touches: the title of the button, the text of the field, the value of the focused control. A launcher you invoke fifteen times a day with a screenshot pipeline produces a lot more leakage than the same launcher with an AX pipeline.

The flip side: the AX tree is empty for the apps that need a vision pass the most (creative tools, scientific software, custom canvases). A serious launcher does both. AX first because it is fast and cheap; screenshot fallback because some surfaces have no tree to read. Defaulting to vision is the lazy choice. Defaulting to AX is the design choice.

Voice input is the same launcher

The voice path is not a separate product surface. It is the Ask shortcut, with a microphone capture in front of the text path. Hold the key, talk, release; transcription runs on-device through WhisperKit, the transcribed text reaches the same agent loop, and the same macos-use tools resolve against the same frontmost-app pivot.

The reason this works at all is that the agent is the executor. "Reply to that one with a Tuesday slot" is not a fixed intent matched against a skill list; it is a sentence the model reads and resolves against the AX tree of whatever Mail thread is in front of you. The launcher does not need to know what "that one" is. The model and the AX tree do.

The three things the launcher itself never does

A clean separation between the launcher and the agent makes the surface smaller and easier to reason about. The launcher specifically does not:

  • Call any model. Hitting enter posts a notification with the prompt; the agent loop in a separate process is what handles inference.
  • Read the screen. The launcher does not own a screenshot path. The macos-use MCP does, and only when the agent decides to call it.
  • Track context. Persistent session state lives in the agent layer, with an explicit chain of upstream session IDs. The launcher is stateless between keystrokes.

Keeping those out of the launcher is what lets the launcher feel instant. A 4 millisecond Carbon hot-key handler is the only thing between Cmd+\ and a visible input field. Everything else happens after the user types.

Want a launcher that drives your Mac, not just your inbox?

Walk through the AX-first agent loop on your own machine with us on a call, and we'll see if it fits your workflow.

Frequently asked questions

What is a Mac launcher AI agent that uses accessibility?

A macOS launcher (Spotlight-style global hotkey, e.g. Cmd backslash or Cmd Enter) whose enter key does not call a chat model directly. Instead, it hands the prompt to an agent loop that drives the frontmost app through the macOS accessibility API (AXUIElement, AXUIElementCopyAttributeValue, kAXFocusedWindowAttribute) rather than through screenshot vision. The launcher provides the entry point and the focus pivot. The agent does the work. Fazm is one open-source implementation; the relevant call site is Desktop/Sources/FloatingControlBar/GlobalShortcutManager.swift on GitHub.

Why does the accessibility API matter for a launcher specifically?

Because the moment a launcher tries to do something instead of just open something, it is reading and writing UI state in other apps. A launcher that calls a chat model only shows you text. A launcher that runs an agent has to know what is on screen, which field is focused, what menu just opened, and where the mouse cursor needs to go. Screenshots give you an image. The accessibility API gives you a labelled tree with role, title, value, and position for every element. A 30 to 80 millisecond AX walk replaces a 2 to 4 second vision inference. The launcher feels native because the agent never leaves the keyboard.

Is this different from Raycast AI, Alfred, or MacGPT?

Yes. Raycast AI, MacGPT, and Alfred's OpenAI workflows wire the launcher to a chat model. You type, the model replies, you copy the answer back. None of them currently expose an agent loop that drives other apps through the AX tree. Fazm wraps Claude Code (and Codex) as the agent loop, registers Cmd backslash and a user-configurable Ask shortcut as global hotkeys, and bundles a macOS accessibility MCP server the agent calls. The launcher is the input. The agent is the output. The AX tree is how the output reaches the rest of your Mac.

How does Fazm register the launcher hotkey?

Through Carbon's RegisterEventHotKey, not the newer NSEvent.addGlobalMonitor path. The full code is in Desktop/Sources/FloatingControlBar/GlobalShortcutManager.swift. The signature passed to EventHotKeyID is FourCharCode 0x46415A4D, which is ASCII for FAZM. Three hotkeys are registered: keycode 42 (Cmd backslash) toggles the floating bar, a configurable Ask shortcut opens the agent input, and a third shortcut pops out a new chat window. RegisterEventHotKey works without the Accessibility TCC permission, so the launcher itself opens before the agent loop demands any permission grants.

What does the agent actually call to read the screen?

An MCP server named macos-use, bundled inside the app at Contents/MacOS/mcp-server-macos-use. The agent loop (Claude Code, by default, via the ACP bridge) calls tool names like macos-use_open_application_and_traverse, _click_and_traverse, _type_and_traverse. Each one returns a flat text representation of the AX tree for the target window, plus a screenshot path as a tie-breaker for elements the tree does not expose. Under the hood the server uses AXUIElementCreateApplication(pid) and AXUIElementCopyAttributeValue with kAXChildrenAttribute, kAXRoleAttribute, kAXTitleAttribute, kAXValueAttribute. The agent loop never sees raw pixels for the elements it cares about.

Does the launcher need Accessibility permission to open at all?

No. The launcher itself is a global Carbon hotkey, which lives outside TCC. You can launch the bar before granting any permissions, type a prompt, and see a sensible error from the agent layer saying it cannot reach other apps yet. Accessibility permission is gated at the moment the agent calls into the macos-use MCP. That separation matters for first-run UX: you do not get a Settings prompt before you have seen what the tool does.

What happens when AX permission is granted but a target app does not expose a tree?

Fazm distinguishes the two cases. AppState.swift around lines 482 to 534 runs an AX probe on the frontmost app; if it returns kAXErrorCannotComplete (the ambiguous one), it retries against Finder, which is the canonical AX-compliant reference app. If Finder also fails, the permission is actually broken. If Finder succeeds, the original app is at fault (often a Qt, OpenGL, or Python-based app that draws to a single Cocoa view). For app-specific gaps the agent falls back to coordinates plus a screenshot. The launcher does not pretend the AX tree exists when it does not.

Is the launcher voice-enabled too?

Yes. The Ask shortcut accepts text or voice in the same input field. Audio runs through an on-device transcription path (WhisperKit), the text version is fed to the same agent loop, the AX tools execute against the same frontmost-app pivot. The voice path adds no extra cloud hop. The reason it works at all is that the agent is the executor: a transcribed sentence like 'reply to that one with a Tuesday slot' resolves against the AX tree of whatever Mail or Messages window is in front, not against a generic skill list.

Can the same launcher drive Chrome or Safari, not just native apps?

Yes, but through a different path. The AX tree of a browser window only exposes the chrome (URL bar, tabs, buttons) and not page DOM. For in-page work the agent uses a Playwright MCP that talks to the user's actual browser via an extension. The launcher is identical; the agent picks the right tool depending on whether the frontmost window is a native AX target or a browser tab. The runtime decision lives in the agent loop, not in the launcher.

Does sending the user prompt to a remote model defeat the local AX advantage?

It splits the data path. The screen state, the AX tree, and the audio stay local; what goes over the wire to Anthropic (or OpenAI, for Codex) is the textual prompt, the textual tool responses (so role and title strings of focused UI elements), and the model's response. No screenshot is uploaded unless the agent decides a screenshot is the only way to disambiguate a click, which is rare. The launcher does not have a hidden screen-broadcast loop; the AX-first design is what keeps the data footprint small.

Where in the source can I read the launcher code?

Three files cover most of it, all in github.com/m13v/fazm. Desktop/Sources/FloatingControlBar/GlobalShortcutManager.swift is the Carbon RegisterEventHotKey wiring, including the FAZM FourCharCode signature. Desktop/Sources/FloatingControlBar/ShortcutSettings.swift defines the configurable Ask shortcut (Cmd Enter, Cmd Shift Enter, Cmd J, Cmd O). Desktop/Sources/AppState.swift around lines 482 to 534 is the AX permission probe with the Finder fallback. The agent-side macos-use MCP binary is bundled at Contents/MacOS/mcp-server-macos-use and discovered in acp-bridge/src/index.ts at line 100.
