Voice and control, the way a sentence becomes a real Mac action.
Apple's Voice Control accepts a fixed grammar of phrases like “click Send” and “open Mail”. Fazm does the inverse: hold the Option key, speak freely, and Fazm streams your voice in 3200-byte chunks to DeepGram Nova-3; the transcript then goes to an LLM that picks a real accessibility-API tool to act on any Mac app. This page is the on-disk tour of how that works, line by line, in TranscriptionService.swift, PushToTalkManager.swift, and acp-bridge/src/index.ts.
The four numbers that define how Fazm listens
The anchor facts no competitor page mentions: Fazm's voice pipeline is not a Whisper wrapper and not a Voice Control clone. PushToTalkManager.swift line 71 hard-codes pttDebounceInterval = 0.5 seconds so rapid Option-key cycling cannot crash CoreAudio. Line 40 sets doubleTapThreshold = 0.4 s as the lock window. Line 65 caps a single session at 300 s. TranscriptionService.swift line 94 pins model = nova-3, line 119 sets audioBufferSize = 3200 bytes (~100 ms of 16 kHz linear16), line 42 labels channel 0 as your mic and channel 1 as system audio, and lines 9-31 ship 20 'spoken form to written form' replacements to DeepGram so 'dot com' arrives as '.com' before Claude ever sees the string. Everything downstream routes into the five MCP servers at acp-bridge/src/index.ts line 1266.
/Users/matthewdi/fazm/Desktop/Sources/TranscriptionService.swift:94 + PushToTalkManager.swift:71
The two products that share the phrase
Search “voice and control” and the first page is Apple's built-in Voice Control: an Accessibility feature that lets you say “click Send” or “open Mail” and fires the matching UI event. The dictation is excellent and the grammar is fixed. You cannot say “summarise the last email from Nathan and draft a reply,” because the command table has no entry for that.
Fazm is built the other way round. Voice is not a command grammar. It is a way to enter a sentence. The sentence becomes a query to an agent, and the agent picks a tool, and the tool runs through the same accessibility APIs Voice Control uses. The difference is where the intelligence lives. In Voice Control, the intelligence is in the command table. In Fazm, the intelligence is in the LLM between the transcript and the action.
That is not a branding choice. It is a concrete architectural choice visible in four files on disk.
Fixed-grammar voice control vs. freeform voice-to-action
The axes SERP results for this keyword never compare: grammar, routing, app reach, symbol handling, audit trail, visibility.
| Feature | macOS Voice Control | Fazm |
|---|---|---|
| What can you say? | Only commands from a fixed vocabulary. Apple's Voice Control expects phrases like 'click Send', 'open Mail', 'move up three lines'. Anything outside the grammar is dropped or inserted as dictation. You cannot say 'summarise the last email from Nathan and draft a reply'. | Anything. The Option key opens a live stream to DeepGram Nova-3 (TranscriptionService.swift line 94) and whatever you say becomes a sentence the LLM reads. You can ramble, correct yourself, change your mind mid-sentence. The grammar is whatever English (or Ukrainian, or Russian, or 'multi' auto-detect) grammar you use. |
| How does it turn voice into an action? | The phrase is matched against a command table, and the matched command fires a synthetic UI event (click at point, press key, move cursor). There is no reasoning step between the sentence and the action. | Transcript travels to Claude via the ACP bridge. Claude reads the sentence together with a fresh screen snapshot, picks one of five bundled MCP servers (fazm_tools, playwright, macos-use, whatsapp, google-workspace, enumerated at acp-bridge/src/index.ts line 1266), and calls the right tool. The action runs through macOS Accessibility APIs, not simulated key presses. |
| Does it work on any Mac app? | Works everywhere, but only with the fixed command grammar. You cannot chain 'read the top of this window and send it to my boss' in one sentence because there is no tool selector, only command matching. | Yes. macos-use is registered at acp-bridge/src/index.ts line 1056 and exposes the AXUIElement tree for every running process. If an app has an accessibility label, Fazm can read it and act on it: Messages, Notes, Calendar, Slack, Finder, third-party apps alike. |
| How does it handle 'dot com' and 'at sign'? | Built-in dictation tricks plus a commands dictionary, but the rewrites happen at the Apple speech layer, and there is no way to feed custom vocabulary to the upstream model for a command-plus-dictation blend. | With 20 hardcoded find-and-replace rules sent to DeepGram as the 'replace' parameter (TranscriptionService.swift lines 9-31): 'dot com' -> '.com', 'dot org' -> '.org', 'dot io' -> '.io', 'at sign' -> '@', 'dot json' -> '.json', 'dot swift' -> '.swift'. If you say 'email Nathan at fazm dot ai' you get 'email Nathan at fazm.ai' before Claude ever sees the string. |
| Can you audit what the voice did? | The command executed is the command matched. There is no record of alternatives considered, no audit of why a specific action fired, no 'show me the tool call' view. | Yes. The chat panel logs every tool call Claude made from the transcript: macos-use clicks, playwright snapshots, file writes, shell commands. You can expand each card, read the text-form arguments, and diff what was said against what ran. |
| Do you know when voice is live? | A small mic indicator in the menu bar, matching the OS convention. No per-tab chrome that announces that the agent is in control of a specific surface. | The floating bar pulses while capture is live, and once Fazm starts driving the browser every page gets a halo + centered pill reading 'Browser controlled by Fazm' at z-index 2147483647 (acp-bridge/browser-overlay-init.js line 58). Voice on does not mean voice silently running; you can always see it. |
Four design choices inside the capture pipeline
The pieces of the voice stack were picked one at a time, for reasons that would not fit in a marketing line. Each card is the shortest truthful answer to the question “why this, and not the obvious alternative?”
CoreAudio IOProc, not AVAudioEngine
AudioCaptureService.swift uses a raw CoreAudio IOProc on the input device instead of AVAudioEngine. The reason is a Bluetooth gotcha: AVAudioEngine implicitly creates an aggregate device, which degrades A2DP → SCO handoff on AirPods. Raw IOProc captures 16-bit PCM at 16 kHz straight from the device, no aggregation, no handoff thrash.
Two audio channels, not one
TranscriptionService.swift line 42: channel 0 is the mic (your voice), channel 1 is system audio (what the Mac is playing). DeepGram receives stereo and labels each final segment with channelIndex, so Fazm can tell 'you said X' apart from 'the video you are watching said Y'. Same socket, no speaker ID guesswork.
DeepGram Nova-3 over WebSocket
model = 'nova-3' (line 94). The socket stays open, buffer size 3200 bytes (~100 ms at 16 kHz linear16) flushes per chunk. Keepalive ping every 8 s. A watchdog at 30 s intervals compares lastDataReceivedAt and lastKeepaliveSuccessAt so a silent room does not get falsely reconnected while a dead socket does.
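The silent-room-vs-dead-socket distinction is the whole point of tracking two timestamps. A minimal TypeScript sketch of that decision, using the two field names the prose mentions (the function shape and threshold parameter are ours, not the Swift source's):

```typescript
// Sketch of the watchdog decision. A silent room produces no data frames but
// keepalive pings still succeed; a dead socket fails both. Only the latter
// should trigger a reconnect.
interface SocketHealth {
  lastDataReceivedAt: number;     // ms epoch of the last data frame
  lastKeepaliveSuccessAt: number; // ms epoch of the last acknowledged ping
}

function shouldReconnect(h: SocketHealth, now: number, staleMs = 60_000): boolean {
  const dataStale = now - h.lastDataReceivedAt > staleMs;
  const keepaliveStale = now - h.lastKeepaliveSuccessAt > staleMs;
  return dataStale && keepaliveStale;
}
```

With this shape, a user who simply stops talking never loses the socket, while a network drop is caught within one staleness window.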
The Option key as the entire UX
PushToTalkManager.swift installs two NSEvent flagsChanged monitors - one global, one local - to watch for Option-down / Option-up across every app on the Mac. Hold Option to talk. Double-tap Option within 400 ms to lock the mic open. Tap again to stop. No click target required, no menu to open.
From the Option key to an AX-tree click
The whole pipeline is text from the moment DeepGram returns a final segment. No screenshot-to-pixels round trip, no vision model guessing coordinates. The hub is Claude; the destinations are the five MCP servers Claude is allowed to call.
Voice in, accessibility action out
The Option key is the whole voice UX
There is no voice activation wake word, no UI button to start a capture, no push-to-talk pedal to wire up. The decision to keep it to a single key falls out of a simple rule: the voice modality should not compete for real estate with the app you are voice-controlling.
The STT config is not a Whisper wrapper
Fazm opens a live WebSocket to DeepGram's Nova-3 model with two channels and a small buffer so interim transcripts appear as you speak. Twenty “spoken form to written form” replacements travel in the request, which is why you can say “dot swift” and see “.swift” land in the transcript before any LLM post-processing.
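A sketch of how those rewrites can travel with the connection request, assuming DeepGram's `replace=find:replacement` query-parameter format for its streaming API. The rule pairs and the model/channel/encoding settings are the ones the page cites; the helper function and the subset of rules shown are ours:

```typescript
// A handful of the 20 spoken-form -> written-form rules from
// TranscriptionService.swift lines 9-31, attached as `replace` params.
const replacements: Record<string, string> = {
  "dot com": ".com",
  "dot org": ".org",
  "dot io": ".io",
  "dot swift": ".swift",
  "dot json": ".json",
  "at sign": "@",
};

function deepgramUrl(base = "wss://api.deepgram.com/v1/listen"): string {
  const params = new URLSearchParams({
    model: "nova-3",
    channels: "2",
    sample_rate: "16000",
    encoding: "linear16",
  });
  for (const [spoken, written] of Object.entries(replacements)) {
    params.append("replace", `${spoken}:${written}`);
  }
  return `${base}?${params.toString()}`;
}
```

Because the rules ride in the request, the rewrite happens upstream in the STT model's output, not in a client-side post-processing pass.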
What actually happens when you press and hold Option
You press and hold the Option key, anywhere on the Mac
A global NSEvent monitor (PushToTalkManager.swift line 95) fires regardless of which app has focus. Before starting a session, Fazm checks the 0.5 s debounce (line 71). If you rapidly tap-tap-tap Option, the audio subsystem does not thrash: the extra taps bounce off the debounce until 500 ms have passed since the last start.
Audio capture opens. Two channels, 16 kHz, 16-bit, linear PCM
AudioCaptureService installs a raw CoreAudio IOProc on the input device. Channel 0 is the mic (you). Channel 1 is system audio (anything the Mac is playing). Both are interleaved into a single stereo stream so DeepGram can diarise them without a separate socket. Every 3200 bytes (~100 ms) the buffer flushes.
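The 3200-byte figure is just sample-rate arithmetic: 16 000 samples/s at 2 bytes per 16-bit sample is 32 000 bytes/s per channel, so one flush covers 100 ms of one channel's audio. A one-liner to make the arithmetic explicit (the helper name is ours):

```typescript
// 3200 bytes / (16_000 samples/s * 2 bytes/sample) = 0.1 s per channel.
function chunkDurationMs(bytes: number, sampleRate = 16_000, bytesPerSample = 2): number {
  return (bytes / (sampleRate * bytesPerSample)) * 1000;
}
```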
The buffer flushes to DeepGram Nova-3 over a live WebSocket
TranscriptionService.swift opens the socket with model=nova-3, channels=2, sample_rate=16000, encoding=linear16. Twenty 'spoken form to written form' replacement rules travel in the request ('dot com' -> '.com', 'at sign' -> '@', 'dot swift' -> '.swift'), so domain names and symbols arrive already normalised. Keepalive pings every 8 s keep the socket alive; a 60 s data-gone watchdog reconnects only when keepalives themselves go stale.
Interim transcripts stream back and land in the floating bar
Each TranscriptSegment has text, isFinal, speechFinal, confidence, words, and channelIndex (0 for your mic, 1 for system audio). Interim segments update the floating bar in real time so you can see what the model heard and correct yourself mid-sentence. Final segments are accumulated into transcriptSegments.
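The segment shape described above can be sketched as a TypeScript interface. The field names are the ones the page lists; the exact shape of `words` and the concatenation helper are assumptions for illustration:

```typescript
interface TranscriptSegment {
  text: string;
  isFinal: boolean;
  speechFinal: boolean;
  confidence: number;
  words: string[];     // word-level detail; exact shape is an assumption
  channelIndex: 0 | 1; // 0 = your mic, 1 = system audio
}

// Build the outgoing query from finalized mic-channel segments only,
// skipping interims and anything the Mac itself was playing.
function micTranscript(segments: TranscriptSegment[]): string {
  return segments
    .filter(s => s.isFinal && s.channelIndex === 0)
    .map(s => s.text)
    .join(" ");
}
```

The channelIndex filter is what keeps a YouTube video's audio from leaking into your query.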
You release Option. The transcript becomes a query to Claude
If the PTT was in 'locked listening' mode (double-tap), a single Option tap ends it; otherwise Option-up ends it. The concatenated final transcript is pushed into the chat input and sent to Claude over the ACP bridge. At this point voice stops being a voice concern and becomes an agent concern.
Claude picks a tool from five MCP servers and executes it
The system prompt exposes fazm_tools, playwright, macos-use, whatsapp, and google-workspace (acp-bridge/src/index.ts line 1266). Claude reads the transcript, usually together with a screen snapshot, and picks the right tool. A 'click Send on this email' query routes to macos-use's click_and_traverse, which reads the AXUIElement tree, not a screenshot, and presses the real Send button through the accessibility API.
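The routing itself is Claude's decision, not a keyword match, but the namespaced tool names in the chat log resolve mechanically back to one of the five servers. A sketch of that resolution, assuming the `mcp__<server>__<tool>` naming convention the tool cards show; the helper is ours:

```typescript
// The five servers enumerated at acp-bridge/src/index.ts line 1266.
const mcpServers: readonly string[] = [
  "fazm_tools", "playwright", "macos-use", "whatsapp", "google-workspace",
];

// Resolve a namespaced tool call back to its registered server, or
// undefined for non-MCP tools like plain bash.
function serverForToolCall(toolName: string): string | undefined {
  const m = toolName.match(/^mcp__(.+?)__/);
  return mcpServers.find(s => s === m?.[1]);
}
```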
The action runs. The chat logs every tool call as text
Every macos-use call, every playwright snapshot, every bash invocation appears as an inline tool card in the chat. You can expand the card and read the text-form arguments. A voice command is not a black box that fires an opaque event; it is a transcript -> a tool call -> a traceable action chain.
Where the transcript becomes an action
The transcript is just English in Unicode. The component that turns it into a real Mac click is the macos-use MCP server, bundled inside the app and registered at a single line in the ACP bridge entry point. Its “click_and_traverse” tool is the one a voice command most often resolves to.
Voice as grammar vs. voice as query
Two states, side by side. This is the axis the top SERP results for “voice and control” ignore, because they review dictation apps or describe Voice Control's command table.
A fixed command grammar vs. a freeform sentence
You must speak inside a predeclared command table. 'Click Send' works. 'Send this email and then open the calendar for next Tuesday' does not, because it is two intents in one utterance and no table row matches.
- Fixed vocabulary
- One command per utterance
- Dictation and commands in different modes
- No LLM between the phrase and the action
- No audit log of alternatives
- Domain names need character-by-character dictation
Every fact on this page is grep-able
You do not need to trust the prose. The pipeline is plain text on disk, and four grep calls are enough to verify the voice stack end to end.
The contract, in one paragraph
Hold the Option key. Speak freely. The transcript reaches Claude via DeepGram Nova-3 (model = "nova-3", TranscriptionService.swift line 94), and Claude picks one of five bundled MCP tools to execute the intent. macos-use reads the AXUIElement tree; playwright reads the labelled DOM. Every action is a tool call you can expand, read, and audit. You can stop the voice session with a single Option tap, or by quitting the menu bar app. There is no wake word, no command grammar, no screenshot-to-pixels guess loop.
That is what “voice and control” looks like when voice is a sentence and control is an accessibility-API call.
What to look for in any voice + control product on the Mac
- A push-to-talk key you can hold anywhere on the Mac
- A streaming STT model so interim text appears as you speak
- Hardcoded spoken-to-written rewrites for domains and symbols
- Two-channel capture so mic and system audio can be separated
- A freeform transcript that an LLM reads, not a fixed command table
- Action execution through the accessibility APIs, not screenshots
- A tool-call audit log you can open per transcript
- A visible indicator when the voice turned into an action
Want to hold the Option key on a real Mac and watch the tool call fire?
Twenty minutes, screen-shared. We open PushToTalkManager.swift, hold Option, speak, and watch macos-use click_and_traverse resolve on a live accessibility tree.
Book a call →

Frequently asked questions
What does 'voice and control' mean inside Fazm, versus inside macOS?
Two very different products share the phrase. macOS Voice Control is an Accessibility feature you enable in System Settings; it listens for a fixed grammar of spoken commands ('click Send', 'open Mail', 'move up three lines') and fires the matching UI event. Fazm is the opposite architecture: it opens a live DeepGram Nova-3 WebSocket while you hold the Option key (PushToTalkManager.swift line 71 keeps two activations at least 0.5 s apart), streams your raw voice as freeform language, and hands the final transcript to Claude. Claude then picks a tool from five bundled MCP servers (acp-bridge/src/index.ts line 1266) and runs the action through the macOS Accessibility APIs. Voice is a query modality, not a command table. You can say 'summarise the last email from Nathan and draft a reply that matches my usual tone' in one sentence, which is outside any fixed-grammar system.
Why DeepGram Nova-3 and not on-device Whisper?
Latency and accuracy at streaming lengths under a minute. DeepGram Nova-3 over WebSocket (TranscriptionService.swift line 94) returns interim transcripts per 100 ms chunk, so the floating bar shows words as you speak and you can correct yourself mid-sentence. Whisper on-device is excellent for batch transcription but slower to first interim token on a typical laptop. The choice also lets Fazm send the 20 baked-in replacements ('dot com' -> '.com', 'at sign' -> '@', 'dot swift' -> '.swift', lines 9-31) to the upstream model rather than post-processing client-side. If you do want fully local capture, the audio path (16 kHz linear16 PCM, 3200-byte flushes, stereo channels) is compatible with any STT backend you wire in.
What exactly happens during the 0.5 s PTT debounce?
PushToTalkManager.swift line 71 sets pttDebounceInterval = 0.5. Before a new session can start, Fazm checks lastPTTStartTime against Date.now() and rejects the request if less than 500 ms has passed. The reason is not politeness; CoreAudio's IOProc can crash on rapid start/stop cycles if another input stream has not fully torn down. 500 ms is the measured safe floor for CoreAudio device handoff on modern Apple Silicon. If you rapidly tap-tap-tap the Option key, the second and third taps bounce off the debounce. The first always wins.
Can Fazm drive a Mac app I didn't preconfigure?
Yes. That is the whole point of routing the voice transcript through the macos-use MCP server (acp-bridge/src/index.ts line 1056). macos-use exposes any running app's AXUIElement tree: roles, labels, frames, editable values. If the app is accessible to VoiceOver, it is accessible to Fazm. You do not declare integrations, write plugins, or connect a specific Slack workspace to voice; you say what you want, and Claude inspects the AX tree, finds the button by label, and fires click_and_traverse. That includes third-party apps, native Catalyst apps, and Electron apps that implement the accessibility bridge.
What are the two audio channels doing?
TranscriptionService.swift line 42 says channel 0 is the mic (you) and channel 1 is system audio (anything the Mac is playing). The stream travels to DeepGram as 16 kHz 16-bit stereo linear PCM, and each TranscriptSegment that comes back carries a channelIndex. This means Fazm can distinguish 'you said X' from 'the Zoom call said Y' on the same socket without a second STT session or a speaker-ID model. Practically, a meeting-mode feature can diarise with one channel of cost instead of two.
Does the Option key work in every app on the Mac?
Yes. PushToTalkManager.swift installs two NSEvent flagsChanged monitors: a global monitor (line 95) that fires when other apps are focused, and a local monitor (line 103) that fires when Fazm is focused. Between the two, the Option key is a Mac-wide PTT button. You can be typing in Messages, browsing in Chrome, or editing in Xcode, hold Option, say something, release Option, and the transcript lands in Fazm's chat. The only caveat is the Accessibility permission, which macOS requires for any app that observes global events.
How are domain names and symbols handled when I speak?
TranscriptionService.swift lines 9-31 ship 20 default replacements to DeepGram as the 'replace' parameter. Among them: 'dot com' -> '.com', 'dot org' -> '.org', 'dot io' -> '.io', 'dot ai' -> '.ai', 'dot dev' -> '.dev', 'dot swift' -> '.swift', 'dot json' -> '.json', 'dot py' -> '.py', 'at sign' -> '@'. So if you say 'email Nathan at fazm dot ai about the dot json schema', the transcript that reaches Claude is already 'email Nathan at fazm.ai about the .json schema'. No LLM post-processing pass needed.
If I lock the mic open with a double-tap, how do I stop it?
Tap Option once. PushToTalkManager.swift line 40 sets doubleTapThreshold to 0.4 s: two Option-presses inside 400 ms enter the lockedListening state, and a single tap outside that window exits it. The state machine is documented at the top of the file: idle -> listening -> finalizing -> idle for press-and-hold, or idle -> lockedListening -> finalizing -> idle for double-tap + single-tap. There is also a hard 300 s (5 minute) ceiling on line 65 that auto-ends the session no matter what, in case the Option key physically sticks.
Can I see what the voice actually caused the agent to do?
Yes. After the transcript reaches Claude, every tool call Claude makes appears as an expandable card in the chat. For voice-driven control, the most common card is an invocation of mcp__macos-use__macos-use_click_and_traverse with the target element as a text label and, optionally, a keystroke to follow. You can read the arguments, which are text, not pixels, because the macos-use server reads the AXUIElement tree. If Fazm drives a browser tab, the tab also wears a halo and a pill that reads 'Browser controlled by Fazm' (acp-bridge/browser-overlay-init.js line 58) at z-index 2147483647, so the result of your voice is visible on the surface being acted on.
Is Fazm an accessibility product or a productivity product?
It happens to be both, because of the architecture, not by marketing. The same AXUIElement tree that lets a screen reader speak the screen also lets Fazm read and act on the screen. The same Option-key PTT that powers voice-to-action is a hands-free input modality for anyone who prefers not to type. Fazm does not ship 'Voice Control mode' the way macOS does; the feature is always on, because voice is just a way to enter a sentence, and a sentence is just the input to the agent.
More on how voice becomes a real action
Why accessibility APIs beat screenshots for AI desktop agents
The technical comparison behind the 'voice to real action' pipeline: AXUIElement tree reads at 50 ms, screenshot models at 2500 ms.
Your FBI agent is fired. Meet the AI agent you actually hired.
The consent architecture behind every voice-driven action: halo on every controlled tab, readable AX tree, one-click stop.
macOS accessibility API vs screenshot agents: performance deep dive
Why a voice transcript that resolves to a text-tree lookup runs 50x faster than a voice transcript that resolves to a vision model.