An open source voice agent that acts on apps you already have open
Most open source voice agents live on a phone line or inside a web widget. Fazm lives on the hotkey a developer is already pressing twenty times a day. Hold Option, speak, and the agent moves a window, answers an email from your real Gmail, replies to a WhatsApp thread, or edits the file in the editor you were looking at a second ago. The whole pipeline is MIT-licensed and sitting in public at github.com/mediar-ai/fazm.
What the phrase usually means, and what it means here
Nine out of ten open source projects that describe themselves this way are toolkits for building voice agents that pick up phone calls or embed a chat bubble on a support site. Pipecat, LiveKit Agents, TEN, Bolna, Intervo, and Vapi all occupy that shape. You install a Python package or a Node SDK, you wire up a telephony provider, you host a WebRTC room, and you ship an agent that talks to your customers.
Fazm is trying to fill a different slot. The agent is not on the other end of a phone line. The agent is sitting on your Mac, listening for the Option key, reading the frontmost app through accessibility, and pressing buttons you would have pressed. The voice track is how you talk to your own computer, not how your computer talks to somebody else.
Two different shapes of 'open source ai voice agent'

Shape one, the framework: you write backend code that instantiates a voice pipeline, hooks it to a telephony or WebRTC provider, and exposes it to callers or web visitors.

- Agent lives on a server you operate
- Audio path is a SIP trunk or WebRTC room
- Tools are whatever you code against your own APIs
- The end user is somebody calling in, not you
- No native OS integration, no desktop presence

Shape two, the desktop agent: you install an app that listens for a hotkey, reads the frontmost window, and acts inside your own session.

- Agent lives on your own Mac
- Audio path is your mic, streamed straight to the STT WebSocket
- Tools are accessibility actions plus bundled MCP servers
- The end user is you, at your own keyboard
- Deep native OS integration is the whole point
The voice loop, with exact file paths
Seven stages, all in public Swift / TypeScript / Rust in the repo. Each boundary is a file, not a diagram, so each one is a place you can fork or rewire.
What happens between Option-down and the spoken reply
The anchor: Option-key push-to-talk, in the actual file
This is the block that no other open source voice agent tutorial will show you, because none of them are hotkey-driven desktop agents. The docblock is the first ten lines of PushToTalkManager.swift, which describes the state machine the whole voice interaction runs on.
The choice of Option as the default trigger is not cosmetic. Option is one of the few keys you can hold solo on a Mac without firing a shortcut in the frontmost app, so the manager can install a global event monitor and read its state without stealing input. The ShortcutSettings.swift file lets you move it to Command, Control, or Fn if Option clashes with your own muscle memory.
The audio path: Core Audio HAL, not AVAudioEngine
This is one of the small decisions that makes the voice loop feel native instead of laggy. AudioCaptureService.swift opens a Core Audio HAL aggregate device directly, pulls 16-bit PCM at 16 kHz mono, and streams those frames straight onto the DeepGram WebSocket. It prefers physical microphones over virtual devices like Wispr Flow, BlackHole, or Loopback, and it watches for Core Audio property changes so a mic swap mid-session auto-restarts capture without losing the transcript.
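That format choice also fixes the arithmetic of the stream: 16 kHz times 2 bytes times one channel is 32,000 bytes per second, which makes buffer timing trivial to reason about. A quick sketch, with helper names that are ours rather than the repo's:

```typescript
// Back-of-envelope for the capture format described above:
// 16 kHz, 16-bit (2 bytes per sample), mono PCM.
const SAMPLE_RATE = 16_000;  // samples per second
const BYTES_PER_SAMPLE = 2;  // 16-bit linear PCM
const CHANNELS = 1;          // mono

// Raw bandwidth of the mic stream onto the WebSocket.
const bytesPerSecond = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS; // 32,000 B/s

// Duration of one captured buffer, given its byte length.
function chunkDurationMs(byteLength: number): number {
  return (byteLength / bytesPerSecond) * 1000;
}
```

At this rate a 3,200-byte buffer is exactly 100 ms of audio, which is why per-chunk latency stays easy to audit.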
DeepGram is the seam, not the lock-in
TranscriptionService.swift talks to a single URL. Line 276 opens the WebSocket, line 94 holds the model name. That is the entire dependency on a hosted speech provider. Because the audio format going in is already a well-known PCM shape, any API that accepts 16 kHz 16-bit mono on a stream is a drop-in replacement. The README even names the swap as a plausible fork.
The DeepGram API key is resolved at runtime via TranscriptionService.resolveDeepgramKey(), which checks the environment first, then calls the bundled Rust KeyService (Backend/src/routes/keys.rs). The backend only returns the key; it does not proxy the audio, so your PCM travels only over the WebSocket between your Mac and api.deepgram.com.
Numbers read directly from AudioCaptureService.swift:55, PushToTalkManager.swift:40, TranscriptionService.swift:128, and Providers/ChatProvider.swift:1573.
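To make the seam concrete: the streaming endpoint is one URL plus query parameters, so swapping providers means swapping one function. The parameter names below follow DeepGram's documented listen API; the helper itself is an illustrative sketch, not code from TranscriptionService.swift.

```typescript
// Sketch of the single hosted-STT dependency: a WebSocket URL whose
// query parameters describe the PCM already coming off the mic.
function deepgramListenURL(model = "nova-3"): string {
  const params = new URLSearchParams({
    model,
    encoding: "linear16",  // 16-bit PCM, matching the capture format
    sample_rate: "16000",
    channels: "1",
  });
  return `wss://api.deepgram.com/v1/listen?${params.toString()}`;
}
```

Any provider that accepts the same 16 kHz linear16 mono stream slots in by changing the host and parameters here.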
Why accessibility beats screenshots for a voice agent
When you ask a voice agent “reply yes to the last email from Marwan,” the model needs two things: the exact text of the thread, and a way to click the reply button. A screenshot-based agent has to OCR the pixels, guess where the button is in screen coordinates, and re-plan if the window moves. An accessibility-based agent reads the focused window's AXChildren, finds the row labelled with Marwan's name, reads the raw body text, and calls AXPress on the button element that reports kAXRoleButton with the title “Reply”.
Every second the screenshot agent spends on OCR is a second the voice reply is not playing. That is the entire reason AppState.swift preflight-tests AXUIElementCopyAttributeValue before the agent even sees the transcript: if accessibility is not granted, the voice loop falls back to asking you to open System Settings instead of guessing.
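To make the contrast concrete, here is a toy TypeScript sketch of the structured lookup an accessibility-based agent performs, over a mock node shape. The real path goes through AXUIElement calls in Swift; everything here, including the node interface, is illustrative.

```typescript
// A mock of the accessibility tree shape: structured nodes with a
// role, an optional title, and children (the AXChildren equivalent).
interface AXNodeSketch {
  role: string;               // e.g. "AXButton"
  title?: string;             // e.g. "Reply"
  children?: AXNodeSketch[];
}

// Depth-first search by role and title: no OCR, no pixel
// coordinates, no re-planning when the window moves.
function findElement(
  root: AXNodeSketch,
  role: string,
  title: string
): AXNodeSketch | null {
  if (root.role === role && root.title === title) return root;
  for (const child of root.children ?? []) {
    const hit = findElement(child, role, title);
    if (hit) return hit;
  }
  return null;
}
```

The lookup keys on semantics (role and title) rather than rendering, which is why it survives dark mode, Retina scaling, and resizes.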
> "Control browser, write code, handle documents, operate Google Apps, and learn your workflow - all from your voice. Free to start. Fully open source. Fully local."
>
> - Fazm README.md, github.com/mediar-ai/fazm
Four things you can say that no phone-call agent can do
These are not hypothetical. They are the voice queries the prompt in ChatPrompts.swift is explicitly steered to handle, because the tools the ACP bridge exposes all exist on your local machine.
Reply to the last email from X with Y
The agent reads Gmail through the bundled google-workspace MCP, composes a reply inside your real session, and sends. No second login, no OAuth handshake at call time.
Open the doc I was editing yesterday and change the title to Q2
The agent drives your real Chrome session through the Playwright MCP Bridge extension, so Google Docs opens already logged in as you.
Send a thumbs up to the last WhatsApp thread with Nathan
A native whatsapp-mcp binary drives the macOS Catalyst WhatsApp app through accessibility. No QR codes, no web.whatsapp.com scraping.
Summarise the PDF that is open on screen, then email it to Scott
Accessibility reads the visible text of the PDF viewer, Claude summarises, google-workspace MCP drafts and sends. Two apps, one voice turn.
Getting the voice agent running locally in four steps
1. Clone - git clone https://github.com/mediar-ai/fazm
2. Build - ./run.sh builds the SwiftUI app, compiles the acp-bridge, and launches it
3. Permit - grant mic, accessibility, and automation permissions once, prompted inline
4. Hold Option - speak, release, and watch the transcript stream in, then the agent acts
The voice chain files you actually fork
If you want to change one layer without changing the others, these are the files that own the seam. Every item below is a real path in the MIT repo today.
Files that own each seam in the voice loop
- Desktop/Sources/AudioCaptureService.swift - Core Audio HAL capture, 16 kHz PCM, device change listener, virtual mic blocklist.
- Desktop/Sources/TranscriptionService.swift - DeepGram WebSocket, Nova-3 streaming, batch fallback, keyterm/vocabulary injection.
- Desktop/Sources/FloatingControlBar/PushToTalkManager.swift - Option-key state machine, double-tap lock, 5-minute safety cap, 0.5 s debounce.
- Desktop/Sources/FloatingControlBar/ShortcutSettings.swift - Lets users move PTT to Command, Control, or Fn; toggles double-tap lock and PTT sounds.
- Desktop/Sources/AppState.swift - Accessibility preflight (AXUIElementCreateApplication, kAXFocusedWindowAttribute) so the voice turn knows whether to bother reading screen state.
- Desktop/Sources/Chat/ACPBridge.swift + acp-bridge/src/index.ts - JSON-RPC between the Swift app and the Claude Code agent, launches per session.
- Desktop/Sources/Providers/ChatToolExecutor.swift - speak_response tool that synthesises the reply via DeepGram Aura (aura-luna-en, 24 kHz linear16).
- Desktop/Sources/Chat/ChatPrompts.swift - Contains the <voice_response> block that tells the agent when to speak vs. stay silent, and which 7 languages TTS supports.
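The ACP seam in that list is plain JSON-RPC. As a minimal sketch of the request envelope, assuming nothing beyond the JSON-RPC 2.0 spec (the method name and params below are invented for illustration, not taken from ACPBridge.swift or acp-bridge/src/index.ts):

```typescript
// The standard JSON-RPC 2.0 request envelope that frames traffic
// between the Swift app and the per-session agent process.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: unknown;
}

let nextId = 0;
function makeRequest(method: string, params?: unknown): JsonRpcRequest {
  return { jsonrpc: "2.0", id: ++nextId, method, params };
}

// Hypothetical example: hand a finished transcript to the agent.
const req = makeRequest("session/prompt", {
  text: "reply yes to the last email from Marwan",
});
```

Because the framing is a standard, you can intercept, log, or replay the traffic between the Swift app and the agent without touching either side.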
Fazm vs. voice-agent frameworks you will find when shopping this topic
Different shapes solve different problems. This is not 'Fazm is better'; it is 'Fazm is a different thing behind the same search phrase.'
| Feature | Pipecat / LiveKit / TEN / Bolna / Intervo / Vapi | Fazm |
|---|---|---|
| Primary user | Developers building call-centre or widget agents | End users sitting at their own Mac |
| What you install | pip / npm / container + telephony + hosting | Signed, notarized macOS .app |
| How you talk to the agent | Phone call or embedded web widget | Hold the Option key on your keyboard |
| Where the agent lives | A server you operate | Your own laptop |
| Tools available to the agent | Whatever APIs you wrap yourself | Chrome (real session), Gmail/Drive/Docs/Sheets/Calendar, WhatsApp, Finder, any AX-exposing macOS app |
| Reads the user's actual screen | No | Yes, through AXUIElement calls |
| STT provider | Configurable (Whisper, DeepGram, AssemblyAI, etc.) | DeepGram Nova-3 by default, swappable in TranscriptionService.swift |
| TTS provider | Configurable (ElevenLabs, Cartesia, DeepGram, etc.) | DeepGram Aura (aura-luna-en), 7-language allowlist in prompt |
| License | MIT / Apache 2.0 (varies) | MIT, single repo, no carve-outs |
| Forkability | High, it is a library | High, Swift + Rust + TS all in the same MIT repo |
What you see in the terminal while it is running
The dev build writes progress to /tmp/fazm-dev.log; tail it and you can watch a single voice turn move through every stage, from key-down to spoken reply.
Where this voice agent is the wrong fit
If you need an agent that answers phone calls, qualifies leads on a SIP trunk, or lives in a web widget embedded on a marketing page, do not use Fazm. Pipecat, LiveKit Agents, Vapi, TEN, Bolna, and Intervo are purpose-built for that job and will save you months.
Fazm is the right fit when the caller and the operator are the same person: you, at your own Mac, with apps open and a mic available. It is also the right fit if you want an open source code path where voice, screen context, LLM turn, and native OS action are all in one MIT repo you can read top-to-bottom in an afternoon, instead of stitched together from six providers.
The short version
Nearly every well-known open source project that calls itself a voice agent today is a library for building a voice product for somebody else. Fazm is a voice product for yourself, open source in the same way. The spine of the loop is four files: AudioCaptureService.swift captures the PCM, TranscriptionService.swift streams it to Nova-3, PushToTalkManager.swift owns the Option-key state machine, and ChatToolExecutor.swift speaks the reply back through Aura. Every other piece hangs off those four.
If that shape matches what you meant when you typed this topic in, clone the repo and run ./run.sh. If it does not, the library you probably want is one tab over.
Want to hear the voice loop driving your real Mac?
Fifteen minutes, shared screen. You hold Option, we watch Claude drive your Chrome, Gmail, and WhatsApp from one spoken sentence.
Frequently asked questions
Is Fazm actually open source, and is the voice path open too?
Yes. The whole repo is MIT-licensed at github.com/mediar-ai/fazm, and every file in the voice chain is in the public tree. Audio capture lives at Desktop/Sources/AudioCaptureService.swift. Push-to-talk lives at Desktop/Sources/FloatingControlBar/PushToTalkManager.swift. Streaming speech-to-text lives at Desktop/Sources/TranscriptionService.swift. Voice output (the speak_response tool) lives at Desktop/Sources/Providers/ChatToolExecutor.swift. There is no closed-source voice service behind the curtain, only DeepGram's hosted API, which you can swap by editing a single file.
What speech-to-text engine does Fazm use, and can I change it?
DeepGram Nova-3 over a streaming WebSocket. The endpoint is wss://api.deepgram.com/v1/listen and the model name is set as a private constant near the top of TranscriptionService.swift (search for `private let model = "nova-3"`). The audio going in is 16-bit PCM, 16 kHz, mono, captured through Core Audio HAL in AudioCaptureService.swift, so the interface with DeepGram is a small seam. Replacing it with Whisper, AssemblyAI, Fal Wizper, or a local whisper.cpp server is a fork-level change, not a multi-file rewrite.
What makes this different from Pipecat, LiveKit, TEN, Bolna, Intervo, or Vapi?
All of those are developer frameworks for building voice agents that answer phone calls, run inside a web widget, or plug into a call center stack. You write Python or TypeScript against them, you host a server, you wire up a telephony provider, and you ship something to your end users. Fazm is not trying to be that. Fazm is the end-user thing. You download a signed macOS .app, grant accessibility and microphone permission, hold the Option key, speak, and the agent acts on whatever Mac app is frontmost. If you want to fork it, the same MIT repo contains the Swift app, the Rust backend, the TypeScript ACP bridge, and the prompts.
Why do I want a voice agent that uses accessibility APIs instead of screenshots?
Because a screenshot is a bag of pixels and the accessibility tree is structured data. When you say "reply to the last email from Marwan with a yes and a thumbs up," an accessibility-aware agent reads the focused window's AXChildren, finds the message row labeled with Marwan's name, reads the raw text of the thread, and calls AXPress on the reply button. A screenshot-based agent has to OCR the rendered pixels, guess where the reply button is in screen coordinates, and pray the theme did not change. Accessibility survives dark mode, Retina scaling, window resizing, and localisation. It also costs far fewer tokens to send to the LLM.
Is audio sent anywhere besides DeepGram?
No. The Swift app opens a direct WebSocket from your Mac to api.deepgram.com and streams PCM chunks. The only places that audio buffer is written are your RAM and DeepGram's transport. There is no Fazm-operated audio proxy. The DeepGram API key is resolved at runtime either from a local environment variable or from the bundled Rust backend's key service (Backend/src/routes/keys.rs); the backend hands a key back, it never routes audio.
Does Fazm also speak the reply back to me?
Yes, when the voice response toggle is on in Settings. The agent is given an instruction (embedded in ChatPrompts.swift) to call a speak_response tool after every final answer. That tool hits DeepGram Aura at https://api.deepgram.com/v1/speak using the aura-luna-en voice at 24 kHz linear16 PCM, and the audio plays through an AVAudioPlayer the Swift app keeps alive. Voice output supports English, Spanish, French, German, Italian, Dutch, and Japanese. Other chat languages fall back to text only.
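As a sketch of what that call looks like on the wire, assuming DeepGram's documented speak parameters (model, encoding, sample_rate) and with a helper name invented here rather than taken from ChatToolExecutor.swift:

```typescript
// Sketch of the TTS request: voice, encoding, and sample rate ride
// as query parameters on the speak endpoint; the text goes in the
// JSON body. Illustrative only.
function auraSpeakRequest(text: string): { url: string; body: string } {
  const params = new URLSearchParams({
    model: "aura-luna-en",
    encoding: "linear16",   // raw PCM back, ready for AVAudioPlayer
    sample_rate: "24000",
  });
  return {
    url: `https://api.deepgram.com/v1/speak?${params.toString()}`,
    body: JSON.stringify({ text }),
  };
}
```

Like the STT seam, this is one URL and a handful of parameters, so a different TTS voice or provider is a one-function change.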
How does double-tap lock mode work, and why does it exist?
Hold-to-talk is the default: press Option, speak, release Option, the transcript goes to the agent. For longer queries that would be uncomfortable to hold through, Fazm watches for a double-tap on Option within 400 ms (the doubleTapThreshold constant in PushToTalkManager.swift line 40). A double-tap puts the manager into lockedListening state, so you can let go and keep talking. Another single tap ends the session. A 5-minute ceiling (maxPTTDuration) auto-finalises if you wander off.
What does the voice agent do during a call on a phone number?
Nothing. Fazm is not a phone-call agent. If you need an AI voice agent that picks up calls, books appointments over the PSTN, or runs a call centre, use LiveKit, Pipecat, or Vapi. Fazm's voice path is about you speaking to your own Mac and the Mac acting on your behalf inside the apps you already use. The two slots do not overlap, and you can absolutely run both for different jobs.
Can I run the voice agent without a paid subscription?
The repo and every file in the voice chain are free and MIT-licensed, so a self-built local copy only costs you whatever DeepGram charges for STT and TTS, plus whichever Claude model the ACP bridge is pointing at. The hosted build at fazm.ai bundles keys and a managed Claude agent for convenience, but it is a wrapper around exactly the same source code.
Which apps does the voice agent know how to drive today?
Any app on macOS that exposes an accessibility tree (which covers almost every native Cocoa, Catalyst, or Electron app), plus specialised integrations bundled as MCP servers: your real Chrome session via a Playwright MCP extension, WhatsApp's macOS Catalyst app via a native whatsapp-mcp binary, Gmail, Drive, Docs, Sheets, and Calendar via a bundled google-workspace MCP. A fresh ~/.fazm/mcp-servers.json added in release 2.4.0 lets you plug in any other MCP server you want.
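The article does not reproduce the schema of ~/.fazm/mcp-servers.json, so treat the following as a hypothetical entry assuming Fazm follows the mcpServers convention common to MCP clients; the server name, command, and package below are placeholders, and the exact schema is defined in the repo.

```json
{
  "mcpServers": {
    "my-notes": {
      "command": "npx",
      "args": ["-y", "@example/notes-mcp-server"]
    }
  }
}
```

In that convention, each key names a server and each value tells the client how to launch it; the agent then discovers the server's tools at session start.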