An open source voice agent that acts on apps you already have open
Most open source voice agents live on a phone line or inside a web widget. Fazm lives on the hotkey a developer is already pressing twenty times a day. Hold Option, speak, and the agent moves a window, answers an email from your real Gmail, replies to a WhatsApp thread, or edits the file in the editor you were looking at a second ago. The whole pipeline is MIT-licensed and sitting in public at github.com/mediar-ai/fazm.
What the phrase usually means, and what it means here
Nine out of ten open source projects that describe themselves this way are toolkits for building voice agents that pick up phone calls or embed a chat bubble on a support site. Pipecat, LiveKit Agents, TEN, Bolna, Intervo, and Vapi all occupy that shape. You install a Python package or a Node SDK, you wire up a telephony provider, you host a WebRTC room, and you ship an agent that talks to your customers.
Fazm is trying to fill a different slot. The agent is not on the other end of a phone line. The agent is sitting on your Mac, listening for the Option key, reading the frontmost app through accessibility, and pressing buttons you would have pressed. The voice track is how you talk to your own computer, not how your computer talks to somebody else.
Two different shapes of 'open source ai voice agent'

Shape one, the framework: you write backend code that instantiates a voice pipeline, hooks it to a telephony or WebRTC provider, and exposes it to callers or web visitors.

- Agent lives on a server you operate
- Audio path is a SIP trunk or WebRTC room
- Tools are whatever you code against your own APIs
- The end user is somebody calling in, not you
- No native OS integration, no desktop presence

Shape two, the desktop agent: you install an app that listens for a hotkey, reads the frontmost window, and acts inside your own session.

- Agent lives on your own Mac
- Audio path is your mic, streamed straight to the STT WebSocket
- Tools are accessibility actions plus bundled MCP servers
- The end user is you, at your own keyboard
- Deep native OS integration is the whole point
The voice loop, with exact file paths
Seven stages, all in public Swift / TypeScript / Rust in the repo. Each boundary is a file, not a diagram, so each one is a place you can fork or rewire.
What happens between Option-down and the spoken reply
The anchor: Option-key push-to-talk, in the actual file
This is the block that no other open source voice agent tutorial will show you, because none of them are hotkey-driven desktop agents. The docblock is the first ten lines of PushToTalkManager.swift, which describes the state machine the whole voice interaction runs on.
The choice of Option as the default trigger is not cosmetic. Option is one of the few keys you can hold solo on a Mac without firing a shortcut in the frontmost app, so the manager can install a global event monitor and read its state without stealing input. The ShortcutSettings.swift file lets you move it to Command, Control, or Fn if Option clashes with your own muscle memory.
The audio path: Core Audio HAL, not AVAudioEngine
This is one of the small decisions that makes the voice loop feel native instead of laggy. AudioCaptureService.swift opens a Core Audio HAL aggregate device directly, pulls 16-bit PCM at 16 kHz mono, and streams those frames straight onto the DeepGram WebSocket. It prefers physical microphones over virtual devices like Wispr Flow, BlackHole, or Loopback, and it watches for Core Audio property changes so a mic swap mid-session auto-restarts capture without losing the transcript.
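That format choice also fixes the arithmetic of the stream: 16 kHz times 2 bytes times one channel is 32,000 bytes per second, which makes buffer timing trivial to reason about. A quick sketch, with helper names that are ours rather than the repo's:

```typescript
// Back-of-envelope for the capture format described above:
// 16 kHz, 16-bit (2 bytes per sample), mono PCM.
const SAMPLE_RATE = 16_000;  // samples per second
const BYTES_PER_SAMPLE = 2;  // 16-bit linear PCM
const CHANNELS = 1;          // mono

// Raw bandwidth of the mic stream onto the WebSocket.
const bytesPerSecond = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS; // 32,000 B/s

// Duration of one captured buffer, given its byte length.
function chunkDurationMs(byteLength: number): number {
  return (byteLength / bytesPerSecond) * 1000;
}
```

At this rate a 3,200-byte buffer is exactly 100 ms of audio, which is why per-chunk latency stays easy to audit.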
DeepGram is the seam, not the lock-in
TranscriptionService.swift talks to a single URL. Line 276 opens the WebSocket, line 94 holds the model name. That is the entire dependency on a hosted speech provider. Because the audio format going in is already a well-known PCM shape, any API that accepts 16 kHz 16-bit mono on a stream is a drop-in replacement. The README even names the swap as a plausible fork.
The DeepGram API key is resolved at runtime via TranscriptionService.resolveDeepgramKey(), which checks the environment first, then calls the bundled Rust KeyService (Backend/src/routes/keys.rs). The backend only returns the key; it does not proxy the audio, so your PCM travels only over the WebSocket between your Mac and api.deepgram.com.
Numbers read directly from AudioCaptureService.swift:55, PushToTalkManager.swift:40, TranscriptionService.swift:128, and Providers/ChatProvider.swift:1573.
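To make the seam concrete: the streaming endpoint is one URL plus query parameters, so swapping providers means swapping one function. The parameter names below follow DeepGram's documented listen API; the helper itself is an illustrative sketch, not code from TranscriptionService.swift.

```typescript
// Sketch of the single hosted-STT dependency: a WebSocket URL whose
// query parameters describe the PCM already coming off the mic.
function deepgramListenURL(model = "nova-3"): string {
  const params = new URLSearchParams({
    model,
    encoding: "linear16",  // 16-bit PCM, matching the capture format
    sample_rate: "16000",
    channels: "1",
  });
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}
```

Any provider that accepts the same 16 kHz linear16 mono stream slots in by changing the host and parameters here.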
Why accessibility beats screenshots for a voice agent
When you ask a voice agent “reply yes to the last email from Marwan,” the model needs two things: the exact text of the thread, and a way to click the reply button. A screenshot-based agent has to OCR the pixels, guess where the button is in screen coordinates, and re-plan if the window moves. An accessibility-based agent reads the focused window's AXChildren, finds the row labelled with Marwan's name, reads the raw body text, and calls AXPress on the button element that reports kAXRoleButton with the title “Reply”.
Every second the screenshot agent spends on OCR is a second the voice reply is not playing. That is the entire reason AppState.swift preflight-tests AXUIElementCopyAttributeValue before the agent even sees the transcript: if accessibility is not granted, the voice loop falls back to asking you to open System Settings instead of guessing.
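To make the contrast concrete, here is a toy TypeScript sketch of the structured lookup an accessibility-based agent performs, over a mock node shape. The real path goes through AXUIElement calls in Swift; everything here, including the node interface, is illustrative.

```typescript
// A mock of the accessibility tree shape: structured nodes with a
// role, an optional title, and children (the AXChildren equivalent).
interface AXNodeSketch {
  role: string;               // e.g. "AXButton"
  title?: string;             // e.g. "Reply"
  children?: AXNodeSketch[];
}

// Depth-first search by role and title: no OCR, no pixel
// coordinates, no re-planning when the window moves.
function findElement(
  root: AXNodeSketch,
  role: string,
  title: string
): AXNodeSketch | null {
  if (root.role === role && root.title === title) return root;
  for (const child of root.children ?? []) {
    const hit = findElement(child, role, title);
    if (hit) return hit;
  }
  return null;
}
```

The lookup keys on semantics (role and title) rather than rendering, which is why it survives dark mode, Retina scaling, and resizes.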
> "Control browser, write code, handle documents, operate Google Apps, and learn your workflow - all from your voice. Free to start. Fully open source. Fully local."
>
> - Fazm README.md, github.com/mediar-ai/fazm
Four things you can say that no phone-call agent can do
These are not hypothetical. They are the voice queries the prompt in ChatPrompts.swift is explicitly steered to handle, because the tools the ACP bridge exposes all exist on your local machine.
Reply to the last email from X with Y
The agent reads Gmail through the bundled google-workspace MCP, composes a reply inside your real session, and sends. No second login, no OAuth handshake at call time.
Open the doc I was editing yesterday and change the title to Q2
The agent drives your real Chrome session through the Playwright MCP Bridge extension, so Google Docs opens already logged in as you.
Send a thumbs up to the last WhatsApp thread with Nathan
A native whatsapp-mcp binary drives the macOS Catalyst WhatsApp app through accessibility. No QR codes, no web.whatsapp.com scraping.
Summarise the PDF that is open on screen, then email it to Scott
Accessibility reads the visible text of the PDF viewer, Claude summarises, google-workspace MCP drafts and sends. Two apps, one voice turn.
Getting the voice agent running locally in four steps
1. Clone - git clone https://github.com/mediar-ai/fazm
2. Build - ./run.sh builds the SwiftUI app, compiles the acp-bridge, and launches it
3. Permit - grant mic, accessibility, and automation permissions once, prompted inline
4. Hold Option - speak, release, and watch the transcript stream in, then the agent acts
The voice chain files you actually fork
If you want to change one layer without changing the others, these are the files that own the seam. Every item below is a real path in the MIT repo today.
Files that own each seam in the voice loop
- Desktop/Sources/AudioCaptureService.swift - Core Audio HAL capture, 16 kHz PCM, device change listener, virtual mic blocklist.
- Desktop/Sources/TranscriptionService.swift - DeepGram WebSocket, Nova-3 streaming, batch fallback, keyterm/vocabulary injection.
- Desktop/Sources/FloatingControlBar/PushToTalkManager.swift - Option-key state machine, double-tap lock, 5-minute safety cap, 0.5 s debounce.
- Desktop/Sources/FloatingControlBar/ShortcutSettings.swift - Lets users move PTT to Command, Control, or Fn; toggles double-tap lock and PTT sounds.
- Desktop/Sources/AppState.swift - Accessibility preflight (AXUIElementCreateApplication, kAXFocusedWindowAttribute) so the voice turn knows whether to bother reading screen state.
- Desktop/Sources/Chat/ACPBridge.swift + acp-bridge/src/index.ts - JSON-RPC between the Swift app and the Claude Code agent, launches per session.
- Desktop/Sources/Providers/ChatToolExecutor.swift - speak_response tool that synthesises the reply via DeepGram Aura (aura-luna-en, 24 kHz linear16).
- Desktop/Sources/Chat/ChatPrompts.swift - Contains the <voice_response> block that tells the agent when to speak vs. stay silent, and which 7 languages TTS supports.
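The ACP seam in that list is plain JSON-RPC. As a minimal sketch of the request envelope, assuming nothing beyond the JSON-RPC 2.0 spec (the method name and params below are invented for illustration, not taken from ACPBridge.swift or acp-bridge/src/index.ts):

```typescript
// The standard JSON-RPC 2.0 request envelope that frames traffic
// between the Swift app and the per-session agent process.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

let nextId = 0;
function makeRequest(method: string, params?: unknown): JsonRpcRequest {
  return { jsonrpc: "2.0", id: ++nextId, method, params };
}

// Hypothetical example: hand a finished transcript to the agent.
const req = makeRequest("session/prompt", {
  text: "reply yes to the last email from Marwan",
});
```

Because the framing is a standard, you can intercept, log, or replay the traffic between the Swift app and the agent without touching either side.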
Fazm vs. voice-agent frameworks you will find when shopping this topic
Different shapes solve different problems. This is not 'Fazm is better'; it is 'Fazm is a different thing behind the same search phrase.'
| Feature | Pipecat / LiveKit / TEN / Bolna / Intervo / Vapi | Fazm |
|---|---|---|
| Primary user | Developers building call-centre or widget agents | End users sitting at their own Mac |
| What you install | pip / npm / container + telephony + hosting | Signed, notarized macOS .app |
| How you talk to the agent | Phone call or embedded web widget | Hold the Option key on your keyboard |
| Where the agent lives | A server you operate | Your own laptop |
| Tools available to the agent | Whatever APIs you wrap yourself | Chrome (real session), Gmail/Drive/Docs/Sheets/Calendar, WhatsApp, Finder, any AX-exposing macOS app |
| Reads the user's actual screen | No | Yes, through AXUIElement calls |
| STT provider | Configurable (Whisper, DeepGram, AssemblyAI, etc.) | DeepGram Nova-3 by default, swappable in TranscriptionService.swift |
| TTS provider | Configurable (ElevenLabs, Cartesia, DeepGram, etc.) | DeepGram Aura (aura-luna-en), 7-language allowlist in prompt |
| License | MIT / Apache 2.0 (varies) | MIT, single repo, no carve-outs |
| Forkability | High, it is a library | High, Swift + Rust + TS all in the same MIT repo |
What you see in the terminal while it is running
The dev build writes progress to /tmp/fazm-dev.log; tail it and you can watch a single voice turn move through every stage, from key-down to spoken reply.
Where this voice agent is the wrong fit
If you need an agent that answers phone calls, qualifies leads on a SIP trunk, or lives in a web widget embedded on a marketing page, do not use Fazm. Pipecat, LiveKit Agents, Vapi, TEN, Bolna, and Intervo are purpose-built for that job and will save you months.
Fazm is the right fit when the caller and the operator are the same person: you, at your own Mac, with apps open and a mic available. It is also the right fit if you want an open source code path where voice, screen context, LLM turn, and native OS action are all in one MIT repo you can read top-to-bottom in an afternoon, instead of stitched together from six providers.
The short version
Nearly every well-known open source project that calls itself a voice agent today is a library for building a voice product for somebody else. Fazm is a voice product for yourself, open source in the same way. The spine of the loop is four files: AudioCaptureService.swift captures the PCM, TranscriptionService.swift streams it to Nova-3, PushToTalkManager.swift owns the Option-key state machine, and ChatToolExecutor.swift speaks the reply back through Aura. Every other piece hangs off those four.
If that shape matches what you meant when you typed this topic in, clone the repo and run ./run.sh. If it does not, the library you probably want is one tab over.
Want to hear the voice loop driving your real Mac?
Fifteen minutes, shared screen. You hold Option, we watch Claude drive your Chrome, Gmail, and WhatsApp from one spoken sentence.
Frequently asked questions
Is Fazm actually open source, and is the voice path open too?
Yes. The whole repo is MIT-licensed at github.com/mediar-ai/fazm, and every file in the voice chain is in the public tree. Audio capture lives at Desktop/Sources/AudioCaptureService.swift. Push-to-talk lives at Desktop/Sources/FloatingControlBar/PushToTalkManager.swift. Streaming speech-to-text lives at Desktop/Sources/TranscriptionService.swift. Voice output (the speak_response tool) lives at Desktop/Sources/Providers/ChatToolExecutor.swift. There is no closed-source voice service behind the curtain, only DeepGram's hosted API, which you can swap by editing a single file.
What speech-to-text engine does Fazm use, and can I change it?
DeepGram Nova-3 over a streaming WebSocket. The endpoint is wss://api.deepgram.com/v1/listen and the model name is set as a private constant near the top of TranscriptionService.swift (search for `private let model = "nova-3"`). The audio going in is 16-bit PCM, 16 kHz, mono, captured through Core Audio HAL in AudioCaptureService.swift, so the interface with DeepGram is a small seam. Replacing it with Whisper, AssemblyAI, Fal Wizper, or a local whisper.cpp server is a fork-level change, not a multi-file rewrite.
What makes this different from Pipecat, LiveKit, TEN, Bolna, Intervo, or Vapi?
All of those are developer frameworks for building voice agents that answer phone calls, run inside a web widget, or plug into a call center stack. You write Python or TypeScript against them, you host a server, you wire up a telephony provider, and you ship something to your end users. Fazm is not trying to be that. Fazm is the end-user thing. You download a signed macOS .app, grant accessibility and microphone permission, hold the Option key, speak, and the agent acts on whatever Mac app is frontmost. If you want to fork it, the same MIT repo contains the Swift app, the Rust backend, the TypeScript ACP bridge, and the prompts.
Why do I want a voice agent that uses accessibility APIs instead of screenshots?
Because a screenshot is a bag of pixels and the accessibility tree is structured data. When you say "reply to the last email from Marwan with a yes and a thumbs up," an accessibility-aware agent reads the focused window's AXChildren, finds the message row labeled with Marwan's name, reads the raw text of the thread, and calls AXPress on the reply button. A screenshot-based agent has to OCR the rendered pixels, guess where the reply button is in screen coordinates, and pray the theme did not change. Accessibility survives dark mode, Retina scaling, window resizing, and localisation. It also costs far fewer tokens to send to the LLM.
Is audio sent anywhere besides DeepGram?
No. The Swift app opens a direct WebSocket from your Mac to api.deepgram.com and streams PCM chunks. The only places that audio buffer is written are your RAM and DeepGram's transport. There is no Fazm-operated audio proxy. The DeepGram API key is resolved at runtime either from a local environment variable or from the bundled Rust backend's key service (Backend/src/routes/keys.rs); the backend hands a key back, it never routes audio.
Does Fazm also speak the reply back to me?
Yes, when the voice response toggle is on in Settings. The agent is given an instruction (embedded in ChatPrompts.swift) to call a speak_response tool after every final answer. That tool hits DeepGram Aura at https://api.deepgram.com/v1/speak using the aura-luna-en voice at 24 kHz linear16 PCM, and the audio plays through an AVAudioPlayer the Swift app keeps alive. Voice output supports English, Spanish, French, German, Italian, Dutch, and Japanese. Other chat languages fall back to text only.
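As a sketch of what that call looks like on the wire, assuming DeepGram's documented speak parameters (model, encoding, sample_rate) and with a helper name invented here rather than taken from ChatToolExecutor.swift:

```typescript
// Sketch of the TTS request: voice, encoding, and sample rate ride
// as query parameters on the speak endpoint; the text goes in the
// JSON body. Illustrative only.
function auraSpeakRequest(text: string): { url: string; body: string } {
  const params = new URLSearchParams({
    model: "aura-luna-en",
    encoding: "linear16",   // raw PCM back, ready for AVAudioPlayer
    sample_rate: "24000",
  });
  return {
    url: `https://api.deepgram.com/v1/speak?${params.toString()}`,
    body: JSON.stringify({ text }),
  };
}
```

Like the STT seam, this is one URL and a handful of parameters, so a different TTS voice or provider is a one-function change.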
How does double-tap lock mode work, and why does it exist?
Hold-to-talk is the default: press Option, speak, release Option, the transcript goes to the agent. For longer queries that would be uncomfortable to hold through, Fazm watches for a double-tap on Option within 400 ms (the doubleTapThreshold constant in PushToTalkManager.swift line 40). A double-tap puts the manager into lockedListening state, so you can let go and keep talking. Another single tap ends the session. A 5-minute ceiling (maxPTTDuration) auto-finalises if you wander off.
What does the voice agent do during a call on a phone number?
Nothing. Fazm is not a phone-call agent. If you need an AI voice agent that picks up calls, books appointments over the PSTN, or runs a call centre, use LiveKit, Pipecat, or Vapi. Fazm's voice path is about you speaking to your own Mac and the Mac acting on your behalf inside the apps you already use. The two slots do not overlap, and you can absolutely run both for different jobs.
Can I run the voice agent without a paid subscription?
The repo and every file in the voice chain are free and MIT-licensed, so a self-built local copy only costs you whatever DeepGram charges for STT and TTS, plus whichever Claude model the ACP bridge is pointing at. The hosted build at fazm.ai bundles keys and a managed Claude agent for convenience, but it is a wrapper around exactly the same source code.
Which apps does the voice agent know how to drive today?
Any app on macOS that exposes an accessibility tree (which covers almost every native Cocoa, Catalyst, or Electron app), plus specialised integrations bundled as MCP servers: your real Chrome session via a Playwright MCP extension, WhatsApp's macOS Catalyst app via a native whatsapp-mcp binary, Gmail, Drive, Docs, Sheets, and Calendar via a bundled google-workspace MCP. A fresh ~/.fazm/mcp-servers.json added in release 2.4.0 lets you plug in any other MCP server you want.
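The article does not reproduce the schema of ~/.fazm/mcp-servers.json, so treat the following as a hypothetical entry assuming Fazm follows the mcpServers convention common to MCP clients; the server name, command, and package below are placeholders, and the exact schema is defined in the repo.

```json
{
  "mcpServers": {
    "my-notes": {
      "command": "npx",
      "args": ["-y", "@example/notes-mcp-server"]
    }
  }
}
```

In that convention, each key names a server and each value tells the client how to launch it; the agent then discovers the server's tools at session start.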