Open source voice AI agent · Consumer app, not a framework · Source-verified, April 19 2026

The open source voice AI agent you install, not the one you assemble

Search results for this keyword all point to the same shape: a framework. Pipecat, LiveKit Agents, TEN, Bolna, Intervo. Repos that hand a developer the parts and say good luck. Fazm is the other artifact in the category: a signed Mac app where you hold Option, speak, and an agent drives any running app through the macOS Accessibility tree. Same license (MIT), totally different product.

Fazm · 12 min read
Every file path and line number on this page points to the public MIT-licensed Fazm repo.
Four-state PTT machine verified in Desktop/Sources/FloatingControlBar/PushToTalkManager.swift.
Deepgram nova-3 WebSocket and 16kHz CoreAudio pipeline verified 2026-04-19.

PushToTalkManager.swift uses a four-state machine (idle, listening, lockedListening, finalizing) driven by two NSEvent flagsChanged monitors on the Option key with a 400ms double-tap threshold. AudioCaptureService.swift runs a CoreAudio IOProc at 16kHz linear16 on the default input device. TranscriptionService.swift streams each frame to wss://api.deepgram.com/v1/listen with model 'nova-3'. The final transcript is handed to ChatToolExecutor, which dispatches 15+ tools including capture_screenshot, execute_sql, speak_response, and mcp-server-macos-use for Accessibility-tree click and type.

Fazm open source, github.com/mediar-ai/fazm, MIT

What the top SERP results actually ship

Pipecat, LiveKit Agents, TEN, Bolna, Intervo, Chatterbox, Whisper. Every one of them is a brick. The category is missing the thing a non-engineer can use directly.

  • Pipecat (Python framework)
  • LiveKit Agents (WebRTC SDK)
  • TEN Framework (C/Go/TS)
  • Bolna (Python backend)
  • Intervo (server you deploy)
  • Chatterbox (TTS model)
  • Whisper (STT model)
  • Fazm (a signed Mac app you run)

The two shapes of an open source voice AI agent

Both are legitimate. Both have their place. But they are not interchangeable, and a user searching for "open source voice AI agent" probably wants the app you run, not the kit you assemble.

The category split


Search: open source voice AI agent

Top 5 results are frameworks. Pipecat, LiveKit Agents, TEN, Bolna, Intervo. Each one ships a repo you assemble into a product.
4 — states in the PTT machine
400 ms — double-tap window for locked listening
16 kHz — audio capture rate
15+ — agent tools exposed to the transcript

The Swift enum that defines the voice session

This is the entire concept. Four states, one modifier key, a debounce for hands-free lock. No turn-taking policy, no WebRTC negotiation, no session manager. The operating system already owns the hotkey; the manager just watches it.

Desktop/Sources/FloatingControlBar/PushToTalkManager.swift, lines 16-21
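The page cites the enum without reproducing it. A minimal sketch of those six lines, assuming the article's four state names map directly to Swift cases (the exact spelling in the repo may differ):

```
// Sketch of the four-state machine described above; the state names
// come from the article, the Swift layout is an assumption.
enum PTTState {
    case idle             // mic closed, no WebSocket live
    case listening        // Option held, audio streaming
    case lockedListening  // double-tap lock, hands-free
    case finalizing       // Option released, socket draining
}
```

Four cases is the entire session model: no turn-taking policy fits because the key itself is the turn.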

What the voice pipeline actually connects

A voice agent is only useful if the transcript lands somewhere it can act. Most frameworks stop at "here is your string." Fazm's transcript is fuel for fifteen tools that already know how to read your widget tree.

Option key to Accessibility call

Option hold
Option double-tap
Floating bar click
Fazm PTT + ASR + agent
AXButton AXPress
AXTextField write
AXValue read
capture_screenshot
execute_sql
speak_response

End to end, hotkey to widget

This is the path every voice utterance takes. Five concrete hops, all in the open repo.

Option key to Accessibility action

1

Hold Option

PTTState flips from idle to listening in PushToTalkManager.swift. Global NSEvent monitor observes the flagsChanged event.

2

Capture audio

AudioCaptureService.swift starts a CoreAudio IOProc at 16kHz linear16 on the default input device.

3

Stream to Deepgram

TranscriptionService.swift opens wss://api.deepgram.com/v1/listen with model nova-3 and sends each buffer as a WebSocket binary message.

4

Release Option

State moves to finalizing. The socket drains and the final transcript string is handed to ChatProvider.

5

Agent acts

ChatToolExecutor dispatches tools. macos-use MCP walks the AX tree. capture_screenshot only fires when AX is insufficient. speak_response returns a voice answer.
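Steps 1 and 4 hinge on a single event monitor. A hedged sketch of how a flagsChanged monitor can drive both edges of the turn — NSEvent.addGlobalMonitorForEvents is real AppKit API, but the transition helpers are illustrative and the double-tap promotion is omitted:

```
import AppKit

// Illustrative stubs: in Fazm these live in PushToTalkManager and
// the audio/transcription services.
func beginListening()  { /* step 1-3: open mic, open Deepgram socket */ }
func beginFinalizing() { /* step 4: stop capture, drain the socket   */ }

// One global monitor observes every Option edge, regardless of focus.
let monitor = NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { event in
    if event.modifierFlags.contains(.option) {
        beginListening()
    } else {
        beginFinalizing()
    }
}
```

Because the monitor is global, the trigger works from any frontmost app, which is the whole point of the floating-bar design.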

The four-state PTT machine, unpacked

Every voice agent framework has a turn-taking model. Fazm does not need one because the human owns the turn, with a key. Holding Option is the entire VAD. The state machine around it is small enough to read in ten minutes.

1

idle. No audio. Mic is not open.

Two NSEvent monitors (global + local) watch flagsChanged looking for the Option (⌥) bit. The mic is not allocated. No WebSocket is live. PushToTalkManager.swift lines 95-108.

2

listening. User is holding Option.

On the first flagsChanged tick with Option down, the state flips to listening. AudioCaptureService spins up a CoreAudio IOProc at 16kHz linear16 on the default input device. The WebSocket to wss://api.deepgram.com/v1/listen?model=nova-3 opens and starts receiving frames. PushToTalkManager.swift, AudioCaptureService.swift lines 46-55.

3

lockedListening. User double-tapped Option.

If two Option down events land inside 400ms, the manager promotes listening into lockedListening. The user can release the key, walk to the coffee machine, keep talking. The state only clears on an explicit escape or on a second double-tap. PushToTalkManager.swift line 40.

4

finalizing. User released Option.

On flagsChanged with Option up, the state moves to finalizing. The IOProc stops, the WebSocket drains the remaining audio and awaits the final is_final=true transcript from Deepgram. The pipe closes, state returns to idle.

5

dispatch. Transcript enters the agent.

The final string is not a notification. It is a prompt fed to ChatProvider, which calls the ACP bridge, which hands it to a Claude agent loop. The agent picks tools from ChatToolExecutor: capture_screenshot, execute_sql, speak_response, and the macos-use MCP binary that drives the Accessibility tree. ChatToolExecutor.swift lines 61-120.

Why CoreAudio IOProc instead of AVAudioEngine

AVAudioEngine is convenient. It is also the reason macOS voice apps sometimes silently create a kernel-level aggregate device and steal the default input from Zoom, Discord, and everything else on the machine. Fazm uses the lower-level CoreAudio IOProc directly so the input device stays untouched.

Desktop/Sources/AudioCaptureService.swift, lines 46-60
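A hedged sketch of the lower-level path, assuming the standard CoreAudio HAL calls; error handling is omitted and Fazm's actual code in AudioCaptureService.swift will differ in detail:

```
import CoreAudio

// Resolve the default input device without creating an aggregate.
var device = AudioObjectID(0)
var size = UInt32(MemoryLayout<AudioObjectID>.size)
var addr = AudioObjectPropertyAddress(
    mSelector: kAudioHardwarePropertyDefaultInputDevice,
    mScope: kAudioObjectPropertyScopeGlobal,
    mElement: kAudioObjectPropertyElementMain)
AudioObjectGetPropertyData(AudioObjectID(kAudioObjectSystemObject),
                           &addr, 0, nil, &size, &device)

// Attach an IOProc directly to that device; other apps keep it too.
var procID: AudioDeviceIOProcID?
AudioDeviceCreateIOProcID(device, { _, _, inInputData, _, _, _, _ in
    // inInputData carries the captured frames; convert to 16 kHz
    // linear16 and hand them to the transcription stream here.
    return noErr
}, nil, &procID)
AudioDeviceStart(device, procID)
```

The design point: nothing here changes system audio routing, so Zoom and Discord never lose their input device.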

The tools your voice can reach

These are the tools ChatToolExecutor.swift exposes to the agent loop at lines 61-120. Every one is a thing a voice command can trigger. Compare with any framework repo where this surface is empty and your job is to define it.

capture_screenshot, execute_sql, request_permission, check_permission_status, extract_browser_profile, query_browser_profile, edit_browser_profile, scan_files, start_file_scan, set_user_preferences, ask_followup, complete_onboarding, speak_response, save_knowledge_graph, mcp-server-macos-use (AX click/type/read)

What makes this a product, not a framework

Each of these is a choice a framework cannot make on your behalf. They sum up to the difference between "here are the pieces" and "here is the app."

A hotkey that works everywhere

Option is a modifier key the OS never uses alone. The PTT monitor is a global NSEvent.addGlobalMonitorForEvents on .flagsChanged, so it fires from Keynote, Figma, Xcode, the Finder. No focus. No menu bar. No browser tab.

A double-tap lock for long commands

A 400ms debounce in PushToTalkManager.swift turns a second Option tap into a hands-free lock. Useful when you need to dictate five sentences or read a query aloud and your hands are full.

Real widget control, not pixel control

Transcripts feed an agent that targets AXButton and AXTextField nodes by role and description. That is why voice commands like 'reply to the last email with got it, thanks' land in the right NSTextField, not a guess at a pixel.

Audible responses, same pipe as input

speak_response is an agent tool, not a framework feature. The same app that opens the mic also owns the speech synthesis, so the loop is voice-in, act-on-apps, voice-out, all inside one signed binary.

A transcript log you can grep

Every session writes to /tmp/fazm.log with PTT_STATE lines. You can tail the file and watch the state machine flip idle → listening → finalizing in real time. Audit trail by default.

Readable Swift, not a toy repo

PushToTalkManager.swift is 200 lines. TranscriptionService.swift is a single URLSessionWebSocketTask. AudioCaptureService.swift calls CoreAudio directly instead of hiding AVAudioEngine. Fork it, replace nova-3 with a local Whisper, ship your own version.

What a single spoken command looks like in the log

Fazm writes one line per PTT state transition to /tmp/fazm.log. Tail it while you hold Option and speak, and you can watch the entire loop close.

/tmp/fazm.log

Fazm versus the frameworks that dominate the SERP

This is not a better-or-worse comparison. Frameworks win if you are building a custom voice product. Fazm wins if you want a voice AI agent to use today.

Feature | Pipecat / LiveKit / TEN / Bolna | Fazm
Shape of the deliverable | pip install, npm install, docker compose up, README | Signed, notarized .dmg you download and launch
Where the voice trigger lives | A button in a web demo, or whatever UI you build yourself | Global Option-key PTT handled by an always-on-top floating bar
Who holds the microphone | Whatever your framework wraps: WebRTC tracks, PyAudio, browser getUserMedia | CoreAudio IOProc on the default input device at 16kHz linear16
Where the transcript goes after ASR | Wherever you wire it. The SDK stops at the transcript | Straight into a Claude agent loop with 15+ real tools
What the agent can actually do | Say words back at you. Control of other apps is your homework | Read and control any running Mac app via the Accessibility tree
Audience | Developers building their own voice product | End users. Your friend downloads it, Option, speak, it works
License | Usually Apache 2 or MIT. Same spirit, totally different artifact | MIT at github.com/mediar-ai/fazm (full desktop Swift source)

Two voice behaviors frameworks leave to you

A shipped consumer voice agent has to make opinionated calls a framework never would. Two of them are worth calling out.

Hands-free lock with a 400ms double-tap

How Fazm ships it

  • Track the timestamp of every Option down event
  • If two events land inside 400ms, promote the session
  • State goes listening → lockedListening, user can release the key
  • Escape or a second double-tap exits locked mode
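The steps above reduce to a few lines of timestamp comparison. A sketch under stated assumptions — the four state names and the 400 ms figure come from the article, everything else (variable and function names) is illustrative:

```
import Foundation

enum PTTState { case idle, listening, lockedListening, finalizing }

var state = PTTState.idle
var lastOptionDown = Date.distantPast
let doubleTapWindow: TimeInterval = 0.4   // the 400 ms threshold

// Called on every Option key-down edge from the flagsChanged monitor.
func optionKeyDown() {
    let now = Date()
    if now.timeIntervalSince(lastOptionDown) < doubleTapWindow {
        state = .lockedListening   // second tap inside the window: lock
    } else {
        state = .listening         // first tap: ordinary push-to-talk
    }
    lastOptionDown = now
}
```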

Accessibility-first control, vision fallback

How Fazm ships it

  • Every voice action first looks for an AX target by role
  • If AX returns a valid AXUIElement, click or type against it
  • Only if AX fails does capture_screenshot fire for that step
  • Screenshots are the exception, not the default pipeline
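A hedged sketch of that AX-first policy. The Accessibility calls (AXUIElementCreateApplication, AXUIElementCopyAttributeValue, AXUIElementPerformAction) are real ApplicationServices API; the one-level tree walk and the fallback comment are illustrative, not Fazm's actual code:

```
import ApplicationServices

// Try to press the first AXButton in an app's top-level children.
// Returns false when no AX target resolves — only then would the
// agent fall back to capture_screenshot for that step.
func pressFirstButton(inAppWithPID pid: pid_t) -> Bool {
    let app = AXUIElementCreateApplication(pid)
    var value: CFTypeRef?
    AXUIElementCopyAttributeValue(app, kAXChildrenAttribute as CFString, &value)
    guard let children = value as? [AXUIElement] else { return false }
    for child in children {
        var role: CFTypeRef?
        AXUIElementCopyAttributeValue(child, kAXRoleAttribute as CFString, &role)
        if (role as? String) == kAXButtonRole {
            // Valid AX target found: act on the widget directly.
            return AXUIElementPerformAction(child, kAXPressAction as CFString) == .success
        }
    }
    return false
}
```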

The whole voice pipeline in numbers

4 — PTT states in the enum
400 ms — double-tap window
16 kHz — CoreAudio capture rate
15+ — tools the transcript can call

Try the shipped voice agent

Hold Option, speak, and an MIT-licensed Mac app drives your other apps through the macOS Accessibility tree. Free to start, full Swift source on GitHub, same category as the frameworks but a product shape your non-engineer friends can actually use.

Download Fazm

Open source voice AI agent, answered against the source

What does 'open source voice AI agent' usually mean, and why is Fazm different?

In the top SERP results it means a framework: Pipecat, LiveKit Agents, TEN, Bolna, Intervo. Those are Python or TypeScript SDKs that developers wire together to ship a voice product. Fazm is the shipped product. You download a .dmg, you launch it, you hold Option and speak. The source is still public and MIT licensed at github.com/mediar-ai/fazm, so you can read the Swift that implements the PTT state machine, fork it, and swap nova-3 for a local Whisper if you want. Both shapes are valid. They are not the same artifact.

How is the voice input triggered?

Hold Option. Desktop/Sources/FloatingControlBar/PushToTalkManager.swift registers two NSEvent monitors (global + local) for the flagsChanged event, watching for the Option bit. On key-down the state flips to listening and the mic opens. On key-up it moves to finalizing. A double-tap of Option within 400ms promotes the session to lockedListening so you can let go of the key and keep talking. The four states are idle, listening, lockedListening, finalizing.

Where exactly does the audio go?

AudioCaptureService.swift uses CoreAudio IOProc directly, not AVAudioEngine. Sample rate is 16000 Hz, encoding is linear16 (targetSampleRate on line 55, encoding on line 98). Each buffer is framed into a URLSessionWebSocketTask.Message.data and sent to wss://api.deepgram.com/v1/listen. The WebSocket URL is built at TranscriptionService.swift line 276, the model is set to 'nova-3' at line 94.
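The streaming leg of that answer fits in a few lines. A sketch assuming the standard URLSessionWebSocketTask API — the URL and model mirror the article, while the query-parameter spelling and the "Token" authorization format are assumptions based on Deepgram's public documentation:

```
import Foundation

// Open one WebSocket per voice session, announcing the audio format.
let url = URL(string:
    "wss://api.deepgram.com/v1/listen?model=nova-3&encoding=linear16&sample_rate=16000")!
var request = URLRequest(url: url)
request.setValue("Token YOUR_DEEPGRAM_KEY", forHTTPHeaderField: "Authorization")

let socket = URLSession.shared.webSocketTask(with: request)
socket.resume()

// Each captured buffer goes out as one binary message.
func send(buffer: Data) {
    socket.send(.data(buffer)) { error in
        if let error { print("ws send failed: \(error)") }
    }
}
```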

Does the voice agent use a local model, or is it cloud only?

Today the shipped app sends audio to Deepgram for ASR and routes the transcript into a Claude agent via the ACP bridge (acp-bridge/src/index.ts). The agent loop calls tools on the local Mac. If you fork the repo you can swap TranscriptionService.swift for a local Whisper or whisper.cpp and point ChatProvider at Ollama, LM Studio, or MLX-LM. That is the benefit of the MIT license on the full desktop source. The consumer build exists so regular users do not have to.

What does the agent actually do after it transcribes my voice?

It dispatches into ChatToolExecutor.swift, which exposes 15+ tools at lines 61-120 including capture_screenshot, execute_sql, speak_response, extract_browser_profile, scan_files, set_user_preferences, ask_followup, and an MCP binary at /Contents/MacOS/mcp-server-macos-use that reads the macOS Accessibility tree for click and type operations. So voice becomes a real action: the agent clicks an AXButton, fills an AXTextField, reads an AXValue, or runs a SQL query against the local fazm.db.

How does Fazm reach apps outside the browser? Every other voice agent stops at Chrome.

Through the Accessibility API. AppState.swift lines 431-504 implement a three-stage permission probe (testAccessibilityPermission, confirmAccessibilityBrokenViaFinder, probeAccessibilityViaEventTap) that validates AX is live. Once the boolean is true, the macos-use MCP can call AXUIElementCreateApplication on any running app's PID and walk the widget tree. That means the voice command 'mark this task done in Reminders' lands on the real checkbox in Reminders, not on a screenshot of it.

Is the repo actually open, or is it open-core with hidden pieces?

Fully open on the desktop side. The repository contains the complete SwiftUI app under an MIT license (README.md line 52). The ACP bridge at /acp-bridge is TypeScript and also public. The one cloud piece is the Deepgram ASR endpoint, which is a network call you can replace. Every file and line number referenced on this page exists in the github.com/mediar-ai/fazm tree as of 2026-04-19.

Can I run it without an internet connection?

Not in the shipped build. Deepgram requires a live WebSocket and Claude via ACP requires an API round trip. But the architecture is already split in a way that makes offline easy: replace TranscriptionService.swift with a local Whisper wrapper and point acp-bridge at a local LLM endpoint. The PTT state machine, the CoreAudio capture, the Accessibility plumbing, and the tool dispatcher all run on-device regardless.

Does the floating UI stay on top while I am in other apps?

Yes. FloatingControlBarWindow.swift sets window level to .floating so the pill-shaped bar stays above normal app windows. It is resizable, expands on hover, and holds the microphone indicator during PTT. Because the Option-key monitor is global (NSEvent.addGlobalMonitorForEvents), you do not need to click the bar before speaking. Focus can be anywhere.

What licenses apply to the audio stack?

The Swift source is MIT. CoreAudio and the Accessibility framework are Apple system libraries. The Deepgram API is a commercial service with its own ToS, and nova-3 is their proprietary model. If you want a fully self-hosted variant, the fork path is described above. No GPL or AGPL pieces in the desktop tree.

fazm.AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
