Voice controlled macOS agent
AX tree, not screenshots. Open source, signed, notarized.

A voice controlled macOS agent that actually clicks buttons in Slack, Linear, and Notion.

Most "voice controlled macOS agent" results stop at dictation or ship AppleScript demos that break the moment you open Slack. Fazm pipes push-to-talk audio from CoreAudio into Deepgram nova-3, hands the transcript to a language model, and drives any AX-compliant Mac app through a bundled Rust binary that walks the native accessibility tree and clicks by exact coordinate. Every element the agent sees is a literal line of text: `[Role] "text" x:N y:N w:W h:H visible`.

Fazm · 11 min read · 4.9 from 200+
Push-to-talk via Option or Left Control, 400 ms double-tap lock
Deepgram nova-3 at 16 kHz, 100 ms frames, interim results ~200 ms
Actions routed through mcp-server-macos-use, the bundled Rust AX binary

The architecture is not marketing. It is a single binary path in the bridge source: `const macosUseBinary = join(contentsDir, "MacOS", "mcp-server-macos-use");` at /Users/matthewdi/fazm/acp-bridge/src/index.ts:63. The binary is registered as the `macos-use` MCP server at line 1059 alongside the browser runtime. It ships inside the signed, notarized .app bundle at Contents/MacOS/mcp-server-macos-use. Every traversal writes a text file where each line is formatted exactly `[Role] "text" x:N y:N w:W h:H visible`, and the tool clicks at (x + w/2, y + h/2). Run `rg -n mcp-server-macos-use acp-bridge/src/index.ts` on a fresh clone to see it for yourself.

acp-bridge/src/index.ts:63
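The traversal format above is regular enough to parse mechanically. Here is an illustrative TypeScript sketch (not Fazm source) that reads one traversal line and computes the center-point click the article describes:

```typescript
// Parse one line of the macos-use traversal output and compute the click
// point at the element's center, (x + w/2, y + h/2). The line format is
// taken from the article; the parser itself is an illustrative sketch.
interface AXElement {
  role: string;
  text: string;
  x: number;
  y: number;
  w: number;
  h: number;
  visible: boolean;
}

const LINE = /^\[(\w+)\] "(.*)" x:(-?\d+) y:(-?\d+) w:(\d+) h:(\d+)( visible)?$/;

function parseAXLine(line: string): AXElement | null {
  const m = LINE.exec(line.trim());
  if (!m) return null;
  return {
    role: m[1],
    text: m[2],
    x: Number(m[3]),
    y: Number(m[4]),
    w: Number(m[5]),
    h: Number(m[6]),
    visible: Boolean(m[7]),
  };
}

// The tool clicks the element's center, regardless of DPI or theme.
function clickPoint(el: AXElement): { cx: number; cy: number } {
  return { cx: el.x + el.w / 2, cy: el.y + el.h / 2 };
}
```

For a line like `[AXButton] "Mark as read" x:912 y:233 w:28 h:28 visible`, the click lands at (926, 247).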

The numbers the SERP never quotes

Every voice agent demo video shows a five-second clip. None of them put numbers on the latency budget that determines whether you actually use it all day. Here is the Fazm stack on an M1 MacBook Air, measured from Option-release to first AX-click.

16 kHz, audio capture via CoreAudio IOProc
100 ms per Deepgram frame (no_delay=true)
~200 ms to first Deepgram interim transcript
Sub-100 ms ceiling for an AXUIElement traversal
500 tokens
Deepgram nova-3 custom vocabulary slots, tunable per user for company names, tools, and jargon.
400 ms
Double-tap threshold in PushToTalkManager.swift line 40. Hold-to-talk below that, lock mode above it.
300 s
maxPTTDuration safety cap at line 65, auto-finalizes a stuck Option modifier after 5 minutes.

The path a voice command actually takes

Left side is what enters the agent: a modifier-keyed audio stream, a Deepgram transcript, a chat thread. The ACP bridge in the middle multiplexes five MCP runtimes. Right side is where actions go. For a command like "mark the top three threads in Slack as read," the model fans out to macos-use. For "open the Stripe dashboard and export yesterday's payouts," it picks playwright. Same chat session, same bridge.

Voice → transcript → bridge → five parallel MCP runtimes

Option / Left Control / Fn keyDown
CoreAudio IOProc, 16 kHz mono
Deepgram nova-3 WebSocket
Transcript → Fazm chat
ACP bridge (Node)
macos-use (AXUIElement, Rust)
playwright (real Chrome, CDP)
whatsapp (Catalyst AX)
google-workspace (OAuth)
fazm_tools (in-process)
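A bridge multiplexing these five runtimes needs little more than a routing table. The sketch below is hypothetical: the runtime names and the `mcp__<server>__<tool>` naming convention come from this article, but the registry shape and the spawn commands are invented for illustration, not lifted from Fazm's source.

```typescript
// Hypothetical routing table for the five MCP runtimes in the diagram.
// Commands marked "hypothetical" are placeholders, not Fazm's real spawn args.
type RuntimeName =
  | "fazm_tools"
  | "playwright"
  | "macos-use"
  | "whatsapp"
  | "google-workspace";

interface RuntimeSpec {
  transport: "stdio" | "in-process";
  command?: string; // binary the bridge spawns, for stdio runtimes
}

const runtimes: Record<RuntimeName, RuntimeSpec> = {
  "fazm_tools":       { transport: "in-process" },
  "playwright":       { transport: "stdio", command: "playwright-mcp" },        // hypothetical
  "macos-use":        { transport: "stdio", command: "Contents/MacOS/mcp-server-macos-use" },
  "whatsapp":         { transport: "stdio", command: "mcp-whatsapp" },          // hypothetical
  "google-workspace": { transport: "stdio", command: "mcp-google-workspace" },  // hypothetical
};

// A tool name like "mcp__macos-use__macos-use_click_and_traverse"
// routes on its middle segment.
function routeTool(toolName: string): RuntimeName {
  return toolName.split("__")[1] as RuntimeName;
}
```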

One voice command, frame by frame

A real walkthrough of the "mark the top three threads in Slack as read" turn, from the moment you press Option to the moment the tool call log closes.

Option held. Three threads marked read. Under two seconds.


You hold Option and speak

"mark the top three threads in Slack as read." CoreAudio IOProc on your built-in mic pushes 16 kHz frames into a WebSocket. First interim transcript lands from Deepgram nova-3 in ~200 ms.

What the model actually reads when it "looks" at Slack

Not a PNG. Not an OCR pass. Not a base64 blob that eats your context window. A text file, one line per element, with the exact coordinates the OS uses for hit-testing. Below is an abbreviated real traversal from a running Slack workspace.

/tmp/macos-use/trv-slack-01.txt

The model picks the three rows tagged "unread" and emits three click_and_traverse calls targeting each row's "Mark as read" affordance. Each call resolves to an OS click at (x + w/2, y + h/2). That's it. There is no screenshot in this loop, no vision inference, no OCR, no template match.
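The selection step the model performs over that traversal can be sketched directly: given parsed rows, keep the ones tagged unread and take the topmost three. This is illustrative logic under assumed row roles and labels, not Fazm's actual prompt or code:

```typescript
// Illustrative selection over parsed traversal rows. The data below is a
// made-up miniature of a Slack sidebar; roles and labels are assumptions.
interface Row { role: string; text: string; x: number; y: number; w: number; h: number }

function topUnreadRows(rows: Row[], n: number): Row[] {
  return rows
    .filter((r) => r.role === "AXRow" && /unread/i.test(r.text))
    .sort((a, b) => a.y - b.y) // smaller y = higher on screen, so topmost first
    .slice(0, n);
}

const rows: Row[] = [
  { role: "AXRow", text: "design, unread",  x: 0, y: 120, w: 400, h: 40 },
  { role: "AXRow", text: "infra",           x: 0, y: 160, w: 400, h: 40 },
  { role: "AXRow", text: "support, unread", x: 0, y: 200, w: 400, h: 40 },
];
```

Each selected row then resolves to one click_and_traverse call at its center coordinate.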

The tool call log for one voice turn

Every tool call Fazm makes writes to the chat thread with its ID, name, and input. Here is the abbreviated log for the Slack command. Notice the absence of screenshot tool calls; the model does not need them because the AX tree is strictly more structured than a PNG.

tool-call-log.txt, one user turn

Fazm vs the rest of the "voice controlled macOS agent" SERP

Not a feature grid of who supports what. The structural choices that determine whether you can actually run this on Slack, Linear, Notion, and Cursor, or whether you are back to demo apps.

Feature | Apple Voice Control, Siri, Spokenly, GPT-Automator, OpenInterpreter | Fazm
Interprets natural language intent | Voice Control: no, rule grammar. Spokenly: no, dictation only. Siri: limited to SiriKit intents. | Yes, LLM in the loop
Works in Slack, Linear, Notion, Cursor | GPT-Automator / AppleScript: no, Electron apps are opaque to AppleScript | Yes, via AX tree on Chromium's accessibility bridge
Per-step latency | OpenInterpreter / Computer Use: 500 ms to 2 s per vision inference | Sub-100 ms per AX lookup, no vision round-trip
Reliable under dark mode, DPI, Retina scaling | Screenshot-vision: fragile under theme and scaling shifts | Coordinates come from the OS itself
Off-screen / scrolled-out elements | Screenshot-vision: invisible until you scroll | Yes, AX tree includes them
Canvas apps (Figma document, games) | Screenshot-vision: works here, this is their strong case | Falls back to vision; not the strong case
Push-to-talk ergonomics | Spokenly / Superwhisper: hotkey dictation. Voice Control: always-on grammar. | Option / Left Control / Fn, 400 ms double-tap lock
Consumer-friendly install | Talon Voice: developer tool, Python grammars. OpenInterpreter: CLI. | Signed, notarized .app, auto-update, permissions wizard
Open source | Voice Control / Siri: no. Spokenly / Superwhisper: no. | Yes, github.com/mediar-ai/fazm

The push-to-talk state machine, in source

PTT ergonomics is where most voice agents fail the workday-usage test. A 5-second demo survives a single hotkey. Eight hours of use needs a proper state machine. Fazm's is four states, lifted directly from the source.

Desktop/Sources/FloatingControlBar/PushToTalkManager.swift

Hold Option to talk live. Double-tap within 400 ms to lock the mic for a longer dictation. The 5-minute safety cap prevents a stuck Option key from recording indefinitely. The 500 ms debounce means you cannot accidentally machine-gun Option and crash CoreAudio's aggregate-device graph.

Five steps: spoken sentence to OS click

Every stage in the pipeline is either a native macOS API, a bundled binary signed with Fazm's developer ID, or an MCP server you can swap out. There is no hidden cloud middleware.


1. Hold Option (or Left Control, or Fn)

PushToTalkManager.swift transitions from idle to listening on the keyDown event. Left Control uses a short delayed activation so Ctrl+C and Ctrl+V do not accidentally start a voice session. Microphone permission is checked once at setup via AudioCaptureService.checkPermission().


2. Audio streams to Deepgram in 100 ms frames

CoreAudio IOProc on the selected physical input device captures 16 kHz mono frames of 3200 bytes each. The WebSocket to Deepgram is held open with keepalives every 8 seconds. Interim transcripts start arriving within about 200 ms and update as you speak.


3. Release Option

State goes listening → finalizing. The WebSocket flushes a CloseStream, Deepgram returns the final transcript, PushToTalkManager posts it to the floating control bar, and the chat session picks it up. Or double-tap Option within 400 ms to stay in lockedListening; a third tap finalizes.


4. The model picks a runtime and a tool call

The ACP bridge multiplexes five MCP runtimes (fazm_tools, playwright, macos-use, whatsapp, google-workspace). For a command like 'open Messages and send Sarah: on my way,' the model emits mcp__macos-use__macos-use_open_application_and_traverse, then mcp__macos-use__macos-use_type_and_traverse.


5. macos-use walks the AX tree and clicks by coordinate

The Rust binary at Contents/MacOS/mcp-server-macos-use queries AXUIElement for the target app, returns a text file with one element per line in the format `[Role] "text" x:N y:N w:W h:H visible`, picks the element by role plus text, and issues a click at (x+w/2, y+h/2). The tool call log in the chat is the audit trail.

What "works in any AX-compliant Mac app" means in practice

The reason this category of agent matters is that the SERP's top results are either dictation (no action) or AppleScript (dead on Electron). Here are the apps that actually matter in a knowledge-worker workday, with the kind of voice command Fazm's AX path handles by coordinate click.

Slack

Hold Option: 'mark the unread thread from design as read and star it.' The AX tree surfaces each thread row with its title and state; the tool clicks the star and the 'Mark as read' affordance by coordinate, not by screenshot. Works on the Electron build because Chromium bridges the DOM to NSAccessibility.

Linear

'Create a P1 issue titled 'Payout export timing out' and assign to Matt.' The macos-use MCP traverses Linear's command palette, types the title, selects the priority, picks the assignee from the suggestion list. Each action is a coordinate click resolved from the AX tree, not a pixel match.

Notion

'Open the April retro and mark the last three action items done.' AX finds the to-do list items by role='checkbox', reads the checked state, toggles only the unchecked ones. Off-screen items are reachable because the AX tree includes scrolled-out elements; no need to scroll-then-screenshot.

Finder

'Move every invoice PDF from Downloads into ~/Documents/Finance/2026.' Native Finder exposes a deep AX tree with row coordinates and filenames; the MCP selects, drags, and drops using OS-native accessibility actions, which is why this works with file icons, list view, column view, and gallery view.

Mail

'Reply to the latest message from Sarah saying I'll send the deck by Friday.' The AX tree exposes the reply button, the body text view, and the send control. No AppleScript needed, no 'tell application Mail' grammar, no brittle menu-bar traversal.

System Settings

'Turn on Do Not Disturb until tomorrow at 9 AM.' System Settings in macOS 14+ is notoriously hard for AppleScript because Apple rewrote its UI in SwiftUI, but the AX tree works fine. Fazm walks the Focus pane, picks the schedule, sets the time, and hits Apply.

When accessibility-first is the wrong answer

This page argues that the AX tree plus push-to-talk plus an LLM beats dictation, screenshot-vision, and AppleScript for the apps knowledge workers spend their day in. It does not argue it wins everywhere. Canvas-based apps (Figma documents, Photoshop canvas, Blender, native games) expose essentially empty AX trees for the document area. Remote desktops, Citrix sessions, VMs, and embedded browser games are pixels from the OS perspective. In those cases, you want a screenshot-vision agent instead, and the honest answer is that Anthropic Computer Use or OpenInterpreter are strictly better fits for those workflows.

Fazm falls back to vision where the AX tree is empty, but the primary design center is the 95 percent of daily apps that do expose a real AX tree: Slack, Linear, Notion, Cursor, Mail, Messages, Finder, Settings, Numbers, Reminders, Calendar, 1Password, Arc, Chrome, Safari, Tower, GitHub Desktop, Zoom, Discord. If your day is mostly those, the AX-first model is the right tradeoff. If your day is mostly Figma canvases, it is not.

Frequently asked questions

What is a voice controlled macOS agent in April 2026?

A voice controlled macOS agent is software that takes a spoken instruction, understands it with a language model, and then performs the resulting actions against your Mac's real applications, not just a text field. The category splits into four types. Pure dictation apps (Spokenly, Superwhisper, Wispr Flow) stop at transcribed text into whatever window is focused. Apple's built-in Voice Control is a literal command grammar, not an agent; it cannot interpret intent like 'mark the top three threads in Slack as read.' AppleScript-based agents (GPT-Automator) die on Electron apps like Slack, Linear, Notion, Cursor, Figma. Screenshot-vision agents (OpenInterpreter, Anthropic Computer Use in screenshot mode) spend 500ms to 2s per step on vision inference and occasionally click the wrong pixel. Fazm is in a fifth bucket: voice plus a language model plus native macOS accessibility APIs.

How does Fazm actually perform an action from a spoken instruction?

Five stages. (1) Audio is captured by CoreAudio HAL IOProc at 16 kHz mono in AudioCaptureService.swift, avoiding AVAudioEngine's aggregate-device trick that degrades Bluetooth output quality. (2) Push-to-talk is gated on the Option or Left Control modifier key in PushToTalkManager.swift, with a 400 ms double-tap threshold to switch between hold-to-talk and locked-listening. (3) Audio chunks stream to Deepgram nova-3 over a WebSocket at 100 ms cadence; interim results appear within ~200 ms. (4) The transcript lands in the same ACP bridge that Fazm's chat uses, and the model picks a tool call. (5) When the task is a macOS app action, the model emits an mcp__macos-use__* call, which is handled by the bundled Rust binary mcp-server-macos-use at Contents/MacOS/mcp-server-macos-use. That binary walks the AXUIElement tree of the target app and returns a text file where each element is `[Role] "text" x:N y:N w:W h:H visible`, then clicks at (x+w/2, y+h/2).

Why accessibility APIs instead of screenshots?

Three reasons. Latency: a full AX traversal of a running app is sub-100 ms; a vision round-trip through a VLM is 500 ms to 2 s per step. Ten steps in a workflow is the difference between one second and twenty. Precision: AXUIElement returns exact coordinates the OS itself uses for hit-testing, which means the click lands on the real element regardless of DPI, Retina scaling, dark mode, or theme. Screenshot-vision can click the wrong pixel when icons look similar or when a modal half-covers a target. Off-screen access: the AX tree includes elements that are scrolled out of view. A screenshot doesn't. On the other hand, the AX tree is empty on canvas-based apps like Figma or games, which is where screenshot-vision is still the right choice. Fazm uses AX first and screenshots as a targeted fallback.
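The ten-step arithmetic in the latency point, spelled out using the article's own figures (no new measurements):

```typescript
// Latency budget for an n-step workflow, AX vs. screenshot-vision.
const steps = 10;
const axMsPerStep = 100;      // sub-100 ms AX traversal, taken as the ceiling
const visionMsPerStep = 2000; // upper end of the quoted 500 ms to 2 s range

const axTotalS = (steps * axMsPerStep) / 1000;         // 1 second total
const visionTotalS = (steps * visionMsPerStep) / 1000; // 20 seconds total
```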

Where in the Fazm source can I verify the accessibility-first architecture?

Two files tell the whole story. `/Users/matthewdi/fazm/acp-bridge/src/index.ts` line 63 sets `const macosUseBinary = join(contentsDir, "MacOS", "mcp-server-macos-use");` and line 1059 registers it as the `macos-use` MCP server alongside the browser runtime. The binary itself is shipped inside the signed and notarized .app bundle, so `codesign -dv /Applications/Fazm.app/Contents/MacOS/mcp-server-macos-use` returns the same developer ID as Fazm itself. The MCP server returns its traversal as a text file saved under /tmp/macos-use/, one line per element, format `[Role] "text" x:N y:N w:W h:H visible`. Clone the public repo and run `rg -n mcp-server-macos-use acp-bridge/src/index.ts` to see the registration line for yourself.

What does the voice input side of the pipeline look like in code?

`Desktop/Sources/TranscriptionService.swift:94` sets `private let model = "nova-3"`. Line 97 sets `private let sampleRate = 16000`. The WebSocket connects to Deepgram with `interim_results=true`, `diarize=true`, smart formatting on, and a 500-token custom vocabulary you can tune for the apps and names you use often. Audio frames are 3200 bytes each (~100 ms) and are flushed immediately via `no_delay=true`. On the input side, `AudioCaptureService.swift` installs a CoreAudio IOProc directly on the selected input device, preferring physical mics in Built-in > USB > Bluetooth order and avoiding virtual/aggregate devices. On the trigger side, `PushToTalkManager.swift` runs a four-state machine (idle → listening → lockedListening → finalizing) keyed off Option, Left Control, or Fn, with a 0.4 s double-tap threshold and a 5-minute safety cap.
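Those parameters map onto Deepgram's streaming query options. The sketch below assembles the WebSocket URL they imply; the parameter names are Deepgram's documented options, but whether Fazm constructs the URL exactly this way is an assumption:

```typescript
// Deepgram streaming URL implied by the parameters above (sketch).
// encoding=linear16 is inferred from the 3200-byte/100 ms frame math,
// not stated directly in the source file quoted above.
const params = new URLSearchParams({
  model: "nova-3",
  sample_rate: "16000",
  encoding: "linear16",
  interim_results: "true",
  no_delay: "true",
  diarize: "true",
  smart_format: "true",
});

const url = `wss://api.deepgram.com/v1/listen?${params.toString()}`;
```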

Does this work in Electron apps like Slack, Linear, Cursor, and Notion?

Yes, and this is the entire reason the approach matters. Electron apps are Chromium in a native shell, and Chromium exposes a rich accessibility tree by default. A voice command like 'mark the top three threads in Slack as read' walks the Slack AX tree, finds the thread rows by role and label, and clicks each 'Mark as read' affordance by coordinate. Linear, Notion, Cursor, VS Code, Discord, and 1Password all behave the same way. AppleScript is dead in these apps; a standard AppleScript 'tell application Slack' call cannot reach a message thread. The AX approach works because Chromium bridges the DOM to NSAccessibility. This is the same reason the macos-use MCP server works on native apps like Finder, Mail, Settings, Numbers, Messages, and Reminders.

Where does voice-controlled Fazm not work, honestly?

Canvas-based applications like Figma, raw Photoshop canvases, and most games. These apps draw their UI without exposing elements to NSAccessibility, so the AX tree for the document area is effectively empty. For those, the agent has to fall back to screenshot-vision. Remote desktops and VMs also lack a local AX tree for the remote UI; a Citrix or Parsec session is pixels from Fazm's perspective. Fullscreen video and browser-embedded games are the same. Practically, this means Fazm is strongest on the exact apps knowledge workers spend most of their day in (chat, docs, mail, CRM, Linear, IDEs, Settings, Finder) and weakest on creative canvas work. For the latter, a screenshot-vision tool is the better fit.

What's the push-to-talk ergonomics like for a full workday, not just a demo?

The PTT manager supports two usage patterns so voice is usable beyond a 5-second demo. Hold-to-talk means you hold Option while you speak and it stops the moment you release it; the 5-minute safety cap in PushToTalkManager.swift (line 65, `maxPTTDuration: TimeInterval = 300`) prevents a stuck modifier from recording forever. Double-tap-to-lock means you tap Option twice within 0.4 s and it enters lockedListening; a third tap finalizes. This is the mode for 'dictate this email' or 'keep transcribing while I think out loud.' The Left Control key has a delayed activation (controlDelayWorkItem) so Ctrl+C and Ctrl+V don't trigger a voice session accidentally. Audio input avoids AVAudioEngine's implicit aggregate device creation, which means switching between Bluetooth and Built-in mic mid-day does not audibly degrade system audio output.

Does Fazm run the voice model locally or in the cloud?

Deepgram's nova-3 runs in the cloud. The audio stream leaves your Mac over a WebSocket to Deepgram's API. This tradeoff buys two things: quality (Nova-3 matches Whisper-large-v3 on general speech and is noticeably better on code, company names, and tools vocab via the 500-token custom dictionary) and low latency (partial results in roughly 200 ms against the 600 to 1200 ms typical for local Whisper.cpp on an M1 Air). If you want fully-local transcription, the source is open (github.com/mediar-ai/fazm) and the transcription layer is swappable; a community branch has already wired Whisper.cpp as a drop-in. For most users, the Deepgram path is strictly faster and more accurate, which matters more than local purity when you are talking to your Mac all day.

How does Fazm compare to Apple Voice Control, Siri, and Talon Voice specifically?

Apple Voice Control is a rule-based grammar on top of the AX tree; it knows how to do 'click Submit' but cannot interpret 'approve the second pending PR' because it has no language model. Siri is conversational but cannot actually drive arbitrary third-party Mac UIs; its action surface is the SiriKit intents Apple approves. Talon Voice is powerful and fully local, but it is a developer tool; you write grammars in Python, and it does not reason about intent the way an LLM does. Fazm is the consumer-friendly point in the space: a signed, notarized, auto-updating .app with an LLM deciding what to do and native AX APIs doing it. You install it, grant Accessibility and Microphone permission, hold Option, and talk.

Hold Option, speak, watch your Mac do it

Fazm is a free, open-source voice-controlled macOS agent. It ships with five MCP runtimes in one chat: accessibility-tree clicking on native and Electron apps, your real Chrome via Playwright, WhatsApp, Google Workspace, and a user-extensible slot. Verify the accessibility-first architecture yourself: rg -n mcp-server-macos-use acp-bridge/src/index.ts.

Download Fazm for Mac
fazm: AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
