macOS AI assistant · Deepgram nova-3 · Stereo ASR

The macOS AI assistant that hears both sides of the call

Every other macOS AI assistant sends a single mic track to its speech API. Fazm opens a stereo Deepgram WebSocket with channels=2 and multichannel=true, so channel 0 carries your microphone and channel 1 carries system audio from the other person. The model gets a two-speaker transcript, labeled at the wire. This page walks through the public Swift that makes that work (710 lines of TranscriptionService.swift, 933 lines of AudioCaptureService.swift), with file:line anchors you can grep.

Fazm · 11 min read · 4.9 from 200+ Mac users
Every parameter anchored to a specific line in TranscriptionService.swift
Grounded in Deepgram nova-3 wire params: channels=2, multichannel=true, diarize=true
CoreAudio IOProc path explained with AudioCaptureService.swift line references

The WebSocket URL, one chip at a time

Every chip below is a literal token that Fazm appends to the Deepgram URL or a constant defined near it in TranscriptionService.swift and AudioCaptureService.swift. Nothing invented. If your macOS AI assistant does not send a URL with most of these chips, it is not doing what Fazm does.

  • wss://api.deepgram.com/v1/listen
  • model = nova-3
  • channels = 2
  • multichannel = true
  • diarize = true
  • encoding = linear16
  • sample_rate = 16000
  • endpointing = 300
  • utterance_end_ms = 1000
  • vad_events = true
  • smart_format = true
  • punctuate = true
  • interim_results = true
  • no_delay = true
  • audioBufferSize = 3200
  • keepaliveInterval = 8.0
  • staleThreshold = 60.0
  • targetSampleRate = 16000
  • noiseFloor = 0.005
  • decayRate = 0.85
  • channelIndex: 0 = mic
  • channelIndex: 1 = system audio

The numbers that define the anchor

Four structural anchors, all greppable in the public Fazm source. 710 is the length of TranscriptionService.swift. 933 is AudioCaptureService.swift. 3200 is the audio frame size in bytes, exactly 100ms at 16kHz. 16000 is the sample rate forced on the capture side.

710 lines in TranscriptionService.swift
933 lines in AudioCaptureService.swift
3,200 bytes per audio frame (100ms at 16kHz)
16,000 Hz sample rate forced by targetSampleRate

Two channels, one transcript

2 channels on the wire, 1 transcript with labeled speakers

Every other macOS AI assistant ships a mono stream and hopes diarization figures out who is talking. Fazm guarantees the split at the wire: channel 0 is the user, channel 1 is the other person, diarize layers in extra speakers inside each channel if there are more than two people in the room.

What most macOS AI assistants do vs what Fazm does

The before tab is the default path almost every macOS AI assistant takes when you ask it to listen. The after tab is the path Fazm takes because the TranscriptionService initializer defaults to channels=2.

What arrives at the agent

Mic only. The other side of the call is invisible. One channel, one blended transcript, no speaker labels unless the model tries to guess. In a meeting, the assistant can summarize what you said but knows nothing about what the client replied.

  • Single audio track, channels is unset or 1
  • No system-audio capture, the other person is silent to the model
  • Diarization off, or applied to one mixed channel
  • Assistant cannot answer 'what did the other person just agree to'

Anchor fact: the 42-line comment that names the two channels

TranscriptionService.swift line 42 is one of the most specific wire-level commitments in the Fazm source. The channelIndex property on TranscriptSegment carries a comment that locks the semantic: channel 0 is the user's microphone, channel 1 is system audio, which means everyone else the Mac can hear.

line 42

let channelIndex: Int // 0 = mic (user), 1 = system audio (others)

TranscriptionService.swift:42 (TranscriptSegment struct)

Desktop/Sources/TranscriptionService.swift
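A minimal sketch of a segment type built around that line-42 declaration, and of how an agent loop could render it into labeled prompt rows. Only the channelIndex declaration and its comment are verbatim from the Fazm source; the other fields, the speakerLabel helper, and renderHistory are illustrative assumptions, not Fazm's actual code.

```swift
import Foundation

// Sketch only: `channelIndex` and its comment are the line-42 original;
// `text`, `isFinal`, and the helpers are hypothetical additions.
struct TranscriptSegment {
    let channelIndex: Int // 0 = mic (user), 1 = system audio (others)
    let text: String
    let isFinal: Bool

    // Hypothetical helper: map the wire-level channel to a prompt label.
    var speakerLabel: String {
        channelIndex == 0 ? "You" : "Other"
    }
}

// Hypothetical rendering step for the agent's prompt history.
func renderHistory(_ segments: [TranscriptSegment]) -> String {
    segments
        .filter { $0.isFinal }
        .map { "\($0.speakerLabel): \($0.text)" }
        .joined(separator: "\n")
}
```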

Anchor fact 2: the connect() function that opens the stereo WebSocket

TranscriptionService.swift connect() starts at line 274 and the URL it builds is the commitment the rest of the system rests on. Line 276 is wss://api.deepgram.com/v1/listen. Line 283 adds diarize=true. Lines 290 and 291 add channels and multichannel. Line 94 sets the model to nova-3. Every parameter on this URL shows up below verbatim.

lines 290-291

URLQueryItem(name: "channels", value: String(channels)),
URLQueryItem(name: "multichannel", value: channels > 1 ? "true" : "false"),

TranscriptionService.swift:290-291

Desktop/Sources/TranscriptionService.swift
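The URL assembly described above can be sketched with Foundation's URLComponents. The parameter names and values mirror the chips on this page; the function name, the parameter list, and the exact ordering are illustrative assumptions, not a copy of Fazm's connect().

```swift
import Foundation

// Sketch, not Fazm's connect(): builds the Deepgram listen URL with the
// wire parameters this page documents. The `channels > 1` ternary mirrors
// the lines 290-291 behavior quoted above.
func deepgramURL(channels: Int) -> URL {
    var components = URLComponents(string: "wss://api.deepgram.com/v1/listen")!
    components.queryItems = [
        URLQueryItem(name: "model", value: "nova-3"),
        URLQueryItem(name: "encoding", value: "linear16"),
        URLQueryItem(name: "sample_rate", value: "16000"),
        URLQueryItem(name: "channels", value: String(channels)),
        URLQueryItem(name: "multichannel", value: channels > 1 ? "true" : "false"),
        URLQueryItem(name: "diarize", value: "true"),
    ]
    return components.url!
}
```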

How the audio gets from your Mac to the model

The pipeline is six stages, each implemented in one of the two files this page anchors to. Each step is a concrete constant, function, or IOProc call, not a marketing abstraction.

1

Channel 0 (microphone) is captured with a CoreAudio IOProc

AudioCaptureService does not use AVAudioEngine because AVAudioEngine creates an aggregate device behind the scenes, which forces Bluetooth headsets from A2DP listening mode into SCO call mode and tanks the output quality. Instead it calls AudioDeviceCreateIOProcID on the default input device directly, which leaves the output format untouched. See AudioCaptureService.swift lines 5 to 8.

2

The raw input is resampled to 16000 Hz linear16 PCM

The input device might be 44.1kHz or 48kHz. AudioCaptureService uses an AVAudioConverter to emit 16-bit little-endian PCM at exactly 16000 Hz, because that is what Deepgram nova-3 consumes most efficiently. See AudioCaptureService.swift line 55: targetSampleRate = 16000.
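Fazm performs this step with AVAudioConverter, as stated above; the naive 3:1 decimator below only illustrates the rate arithmetic for the common 48 kHz to 16 kHz case. It is not what ships: a production resampler must low-pass filter before decimating to avoid aliasing, which AVAudioConverter handles internally.

```swift
// Illustration of the 48kHz -> 16kHz rate math only. Real code should use
// AVAudioConverter (or another filtering resampler); dropping samples
// without a low-pass filter aliases high frequencies.
func decimate48kTo16k(_ samples: [Int16]) -> [Int16] {
    // Keep every third sample: 48,000 samples/sec becomes 16,000.
    stride(from: 0, to: samples.count, by: 3).map { samples[$0] }
}
```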

3

Channel 1 (system audio) is mixed in on the second channel

When the user is on a call, the other person's voice is playing back through the system output. Fazm taps system audio separately and hands it to TranscriptionService as channel 1. The two channels are interleaved into a stereo linear16 buffer before being sent to the WebSocket.
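The interleaving step can be sketched in a few lines: frame i of the stereo buffer holds the mic sample followed by the system-audio sample, which is what a linear16 stereo stream expects. The function and parameter names are illustrative; the channel assignment matches the page's channel semantics.

```swift
// Sketch of interleaving two mono 16-bit streams into one stereo buffer.
// Each output frame is [mic sample, system sample], matching
// channel 0 = mic, channel 1 = system audio.
func interleaveStereo(mic: [Int16], system: [Int16]) -> [Int16] {
    let frames = min(mic.count, system.count)
    var out = [Int16]()
    out.reserveCapacity(frames * 2)
    for i in 0..<frames {
        out.append(mic[i])     // channel 0: microphone
        out.append(system[i])  // channel 1: system audio
    }
    return out
}
```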

4

Audio is buffered in 3,200-byte frames (~100ms)

TranscriptionService.swift line 119 sets audioBufferSize = 3200, which is exactly 100 milliseconds of 16 kHz 16-bit audio (16000 * 2 * 0.1). The service flushes the buffer to Deepgram every 100ms, which is the sweet spot between streaming latency and WebSocket overhead.
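The flush policy can be sketched as a small accumulator: append raw PCM bytes, emit a chunk each time the 3,200-byte threshold is crossed. The constant matches the page; the type and method names are illustrative assumptions, not Fazm's.

```swift
import Foundation

// Sketch of the 100ms flush policy. `audioBufferSize` matches the
// documented constant; `FrameBuffer` itself is a hypothetical name.
final class FrameBuffer {
    static let audioBufferSize = 3200 // ~100ms of 16kHz 16-bit audio

    private var pending = Data()

    // Append raw PCM and return zero or more full frames ready to send
    // as WebSocket data frames.
    func append(_ data: Data) -> [Data] {
        pending.append(data)
        var frames: [Data] = []
        while pending.count >= Self.audioBufferSize {
            frames.append(pending.prefix(Self.audioBufferSize))
            pending.removeFirst(Self.audioBufferSize)
        }
        return frames
    }
}
```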

5

The WebSocket is opened with channels=2, multichannel=true, diarize=true

TranscriptionService.swift line 276 builds wss://api.deepgram.com/v1/listen and lines 290 to 291 append channels=2 and multichannel=true. Line 283 adds diarize=true. The combination tells Deepgram to return per-channel transcript events, plus a per-word speaker label inside each channel.

6

Every segment comes back tagged with channelIndex

TranscriptionService.swift line 42 defines `let channelIndex: Int // 0 = mic (user), 1 = system audio (others)`. When the agent loop reads the transcript history, it already knows which speaker said which words without having to run voice-embedding clustering over the audio itself.

Anchor fact 3: the CoreAudio IOProc choice

AudioCaptureService has a header comment that is worth reading in full because it explains a decision most AI assistants get wrong. If the capture path creates an aggregate device, Bluetooth headsets drop from A2DP to SCO and the output quality tanks. Fazm uses CoreAudio IOProc directly to avoid that.

lines 5 to 8

Uses CoreAudio IOProc directly on the default input device to avoid AVAudioEngine's implicit aggregate device creation, which degrades system audio output quality (especially Bluetooth A2DP to SCO switch).

AudioCaptureService.swift:5-8 (class header comment)

Desktop/Sources/AudioCaptureService.swift

Input channels, pipeline hub, model outputs

Two audio sources enter the pipeline, one WebSocket carries them to Deepgram, two per-channel transcript streams come back, and the agent loop sees labeled rows. The hub is TranscriptionService with audioBufferSize = 3200 and channels = 2 as the default.

Stereo capture through TranscriptionService to the agent

  • Microphone (channel 0)
  • System audio (channel 1)
  • Custom vocabulary
  • TranscriptionService (hub)
  • Channel 0 TranscriptSegment
  • Channel 1 TranscriptSegment
  • Agent prompt history

One meeting call, traced end to end

The diagram below follows one exchange during a meeting. The user asks a question, the other person answers, and by the time the agent loop composes a reply, it already has two labeled transcript rows with the right speaker on each.

User and other person, one labeled transcript

Participants: User, Other person, macOS CoreAudio, Fazm pipeline, Deepgram nova-3, Agent loop.

  • User speaks into mic
  • Other person speaks, played through system output
  • macOS CoreAudio: IOProc delivers 48kHz frames (no aggregate device)
  • Fazm pipeline: resample to 16kHz linear16, interleave 2 channels
  • WebSocket frame: stereo 3,200-byte chunk
  • Deepgram nova-3: ASR per channel + diarize=true speaker labels
  • TranscriptSegment{channelIndex: 0, text: ...}
  • TranscriptSegment{channelIndex: 1, text: ...}
  • Agent loop: labeled two-speaker history, ready for prompt
  • Agent response with accurate speaker attribution

The six structural flags, in one grid

Six decisions define whether a macOS AI assistant can hear both sides of a conversation. Each card below maps to a specific constant or URL parameter in TranscriptionService.swift or AudioCaptureService.swift.

channels = 2

Stereo WebSocket so the mic and the system audio travel on separate tracks, not mixed into one blended signal.

multichannel = true

Tells Deepgram to run ASR independently on each channel and return separate transcript streams with channel indices.

diarize = true

Enables per-word speaker labels inside each channel. Layered on top of the channel split, this handles multi-speaker rooms.

channelIndex

Every TranscriptSegment arrives with channelIndex set to 0 (user) or 1 (other). The agent never has to guess who said what.

CoreAudio IOProc

Direct IOProc on the default input device instead of AVAudioEngine. Preserves Bluetooth A2DP output during capture.

audioBufferSize = 3200

Exactly 100ms of 16kHz linear16 audio. Flushed every 100ms for low-latency streaming with sensible WebSocket overhead.

Verify the claims without installing Fazm

Every file and line this page references is in the public Fazm source tree. The grep commands below prove the wire-level commitment from the outside.

grep the public Fazm source

Grep-verifiable anchor checklist

Every item below is independently checkable in the public Fazm source. If any item fails, the guide is wrong and should be corrected. If all pass, the page is a code tour.

Twelve grep-verifiable claims

  • TranscriptionService.swift exists at Desktop/Sources/ and is 710 lines long (wc -l)
  • AudioCaptureService.swift is 933 lines at the same path (wc -l)
  • Line 42: TranscriptSegment.channelIndex comment "0 = mic (user), 1 = system audio (others)"
  • Line 94: private let model = "nova-3" (Deepgram nova-3 ASR)
  • Line 99: private let channels: Int comment "2 = stereo (mic + system), 1 = mono (mic only for PTT)"
  • Line 119: audioBufferSize = 3200 (~100ms of 16kHz 16-bit audio)
  • Line 141: init default channels: Int = 2 (stereo is the default, not mono)
  • Line 276: wss://api.deepgram.com/v1/listen WebSocket URL
  • Line 283: URLQueryItem(name: "diarize", value: "true")
  • Lines 290-291: channels + multichannel parameters appended to the URL
  • AudioCaptureService.swift lines 5-8: CoreAudio IOProc comment explaining aggregate-device avoidance
  • AudioCaptureService.swift line 55: targetSampleRate = 16000

Side by side

Nine rows, each one anchored to a specific Swift symbol or URL parameter. The left column is what most macOS AI assistants ship; the right column is what Fazm commits to at the wire.

Feature | Most macOS AI assistants | Fazm
--- | --- | ---
Captures the other person's voice on a call | No. Mic only, the other side is lost. | Yes. System audio is captured as channel 1.
Deepgram WebSocket parameter channels=2 | channels is unset or 1. | TranscriptionService.swift lines 290 and 141 default channels=2.
Deepgram multichannel=true | Not used, assistant receives one blended channel. | Line 291 sets multichannel when channels > 1.
Deepgram diarize=true (per-word speaker label) | Typically off, single speaker assumed. | Line 283 turns diarization on for every session.
Per-segment channelIndex tag | Transcripts arrive unlabeled. | TranscriptSegment.channelIndex at line 42, 0 = user, 1 = other.
Avoids AVAudioEngine aggregate device on capture | Uses AVAudioEngine, which forces Bluetooth to SCO. | CoreAudio IOProc direct on default input (lines 5 to 8).
Frame size tuned for streaming latency | Buffer size is undocumented or whole-utterance. | Line 119: audioBufferSize = 3200, exactly 100ms at 16kHz.
ASR model | Varies, often a legacy speech engine. | Line 94: private let model = "nova-3".
Works with any app on the Mac, not just one vendor's chat window | Scoped to the vendor's app or browser. | Floating NSWindow plus stereo ASR, usable during any call.

Want to see the stereo pipeline on your own Mac?

Book a live walkthrough. We open the Deepgram WebSocket, show the two channels, and run a real meeting with both speakers labeled in the transcript.

Book a call

Frequently asked questions

What makes Fazm different from Siri, Apple Intelligence, Raycast AI, ChatGPT desktop, and Claude desktop as a macOS AI assistant?

Those assistants all send a single audio track (your microphone) to their speech backend. If the other person is talking on a Zoom call, a FaceTime call, or even a YouTube video the assistant has no idea. Fazm opens a stereo WebSocket to Deepgram with channels=2 and multichannel=true so the microphone goes on channel 0 and system audio goes on channel 1, and every transcript segment comes back with a channelIndex telling the model which speaker said it. The parameters that make this work are at TranscriptionService.swift lines 276, 283, 290, and 291, and the per-segment label is at line 42.

Where exactly does the file say channel 0 is the user and channel 1 is everyone else?

TranscriptionService.swift line 42, inside the TranscriptSegment struct. The declaration is `let channelIndex: Int // 0 = mic (user), 1 = system audio (others)`. The paired constructor argument is declared at line 99: `private let channels: Int // 2 = stereo (mic + system), 1 = mono (mic only for PTT)`. The default value is set at line 141: `init(apiKey: String, language: String = "en", vocabulary: [String] = [], channels: Int = 2)`. The default is 2, meaning a freshly constructed session is stereo unless the caller explicitly switches to mono for push-to-talk.

How does this differ from using Deepgram's diarize flag on its own?

diarize=true alone is a speaker-embedding model that runs over a single audio stream and tries to split it into Speaker 0, Speaker 1, Speaker 2, and so on by clustering voiceprints. That is fragile when two people have similar voices or when one is talking through a compressed codec. Fazm combines diarize=true with channels=2 and multichannel=true, so Deepgram runs ASR independently on each physical channel first and then applies diarization inside each channel. The mic channel is guaranteed to be you because it came from your input device. The system-audio channel is guaranteed to be everyone else because it came from the output device. The wire-level guarantee is stronger than voice-embedding clustering alone.

Why does AudioCaptureService avoid AVAudioEngine?

Because AVAudioEngine creates an aggregate device behind the scenes when you pin it to an input, and an aggregate device changes the routing of the default output. On a Mac with AirPods or a Bluetooth headset, that switches the output from A2DP high-quality music mode to SCO phone-call mode, which cuts output bandwidth to roughly 8 kHz. For an AI assistant that is listening to a meeting, that is the wrong direction. AudioCaptureService.swift lines 5 to 8 document this explicitly. The service uses AudioDeviceCreateIOProcID on the default input device directly, which is a lower-level CoreAudio API that does not create an aggregate device and leaves the output path alone.

What model is used and why?

Deepgram nova-3. TranscriptionService.swift line 94: `private let model = "nova-3"`. nova-3 is Deepgram's current production real-time ASR model, notable for two things Fazm relies on. First, it supports the `keyterm` parameter, which lets Fazm inject custom vocabulary (for example a user's company name or a product SKU) directly into the acoustic model rather than applying replace rules after the fact. The vocabulary is appended to the URL at lines 295 to 297. Second, nova-3 handles the stereo multichannel path with low tail latency, which is what the 3,200-byte 100ms buffer frames assume.
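The vocabulary injection described above can be sketched as one query item per term. Deepgram's keyterm parameter is real; the helper name below and the one-item-per-term shape are assumptions about how the lines 295 to 297 append might look, not a copy of them.

```swift
import Foundation

// Sketch: turn a custom vocabulary into `keyterm` query items, one per
// term, to be appended alongside the other URL parameters.
func keytermItems(for vocabulary: [String]) -> [URLQueryItem] {
    vocabulary.map { URLQueryItem(name: "keyterm", value: $0) }
}
```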

What is the 3,200-byte buffer for?

It is the frame size TranscriptionService ships to Deepgram. Line 119: `private let audioBufferSize = 3200 // ~100ms of 16kHz 16-bit audio (16000 * 2 * 0.1)`. At 16 kHz sample rate and 2 bytes per sample, 100 milliseconds of audio is 3,200 bytes. The service buffers raw PCM until the buffer reaches that threshold, then sends the chunk via webSocketTask.send in a single WebSocket data frame (sendAudio at lines 209 to 224, sendAudioChunk at lines 239 to 249). 100 ms is short enough for live transcription to feel real-time and long enough to amortize the WebSocket frame overhead.

How does the service survive a dropped WebSocket?

Three mechanisms. First, a keepalive task pings every 8 seconds (keepaliveInterval at line 108) so intermediate proxies do not idle-close the connection. Second, a watchdog task checks every 30 seconds that data or keepalive successes have arrived within the last 60 seconds (watchdogInterval and staleThreshold at lines 114 and 115) and forces a reconnect if the socket has gone silent. Third, the service auto-reconnects up to maxReconnectAttempts = 10 with backoff (line 103). This is why Fazm's voice loop can run through a tunnel or a flaky hotel Wi-Fi and recover automatically.
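The watchdog's core decision is a staleness test: the socket counts as dead when neither audio data nor a keepalive success has been seen within staleThreshold seconds. The constants below match the documented values; the function name and signature are illustrative.

```swift
import Foundation

// Constants as documented on this page; the check itself is a sketch.
let keepaliveInterval: TimeInterval = 8.0
let staleThreshold: TimeInterval = 60.0

// True when the last observed activity (data or keepalive success) is
// older than the stale threshold, i.e. the watchdog should reconnect.
func isStale(lastActivity: Date, now: Date) -> Bool {
    now.timeIntervalSince(lastActivity) > staleThreshold
}
```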

How are domains and emails transcribed correctly?

Deepgram's `replace` parameter applies find-and-replace rules on the server. TranscriptionService.swift lines 9 to 31 define defaultReplacements: 'dot com' to '.com', 'dot ai' to '.ai', 'at sign' to '@', plus file extensions like 'dot json', 'dot ts', 'dot swift'. The rules are only appended when the language is English or 'multi' (lines 301 to 305), because spoken forms like 'dot com' are English-specific. This makes URLs and emails render as typeable text instead of words, so a follow-on agent tool can click a link without a regex post-processing step.
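Deepgram applies these rules server-side, so nothing like the snippet below ships in Fazm; it only illustrates the mapping the answer describes, applied locally for demonstration. The dictionary echoes a few of the described defaultReplacements; the function name is hypothetical.

```swift
import Foundation

// Demo-only: the real `replace` rules run on Deepgram's servers.
// These three entries echo rules named in the FAQ above.
let spokenReplacements: [String: String] = [
    "dot com": ".com",
    "dot ai": ".ai",
    "at sign": "@",
]

// Apply each rule as a plain find-and-replace over the transcript text.
func applyReplacements(_ text: String) -> String {
    spokenReplacements.reduce(text) { acc, rule in
        acc.replacingOccurrences(of: rule.key, with: rule.value)
    }
}
```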

Can I use Fazm as a macOS AI assistant without meetings, just push-to-talk?

Yes. The init signature takes channels as an argument, and the code explicitly notes at line 99 that 1 channel is the mode used for push-to-talk. The floating bar can open a mono session on demand when the user holds a PTT hotkey, and the same Deepgram URL is generated with channels=1 and multichannel=false (line 291 flips based on the channel count). You get the same nova-3 model and the same replace rules, just without the system-audio split. The default stays stereo so that ambient capture during a call works without extra configuration.

How can I verify all of this without installing Fazm?

The files are in the public Fazm source tree at Desktop/Sources/. Run wc -l TranscriptionService.swift AudioCaptureService.swift and you should see 710 and 933. Grep for 'channelIndex' inside TranscriptionService.swift and the comment at line 42 appears. Grep for 'diarize' and the URLQueryItem at line 283 appears. Grep for 'multichannel' and lines 290 to 291 appear. Grep for 'nova-3' and line 94 appears. Grep for 'IOProc' inside AudioCaptureService.swift and the header comment plus the ioProcID property appear. Every claim on this page is a direct grep away.

Does the agent really receive a labeled two-speaker transcript, or does it have to reconstruct it?

It receives it labeled. Deepgram emits separate messages per channel on a multichannel stream, and TranscriptionService parses each message into a TranscriptSegment with channelIndex set from the channel_index field. The segment is handed to the onTranscript callback verbatim, so by the time the agent loop composes a prompt, it already has rows like 'channel=0 text="let me share my screen"' and 'channel=1 text="sounds good, go ahead"'. No voice-embedding clustering is required on the client.
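The parse step can be sketched against Deepgram's streaming result shape, where channel_index arrives as an array whose first element is the channel number and the text sits under channel.alternatives[0].transcript. The helper below is an illustrative reconstruction, not Fazm's parser.

```swift
import Foundation

// Sketch of extracting (channelIndex, text) from one Deepgram
// multichannel result message. Assumes the documented streaming shape:
// {"channel_index":[i, total], "channel":{"alternatives":[{"transcript": ...}]}}
func parseChannelAndText(_ json: Data) -> (channelIndex: Int, text: String)? {
    guard
        let obj = try? JSONSerialization.jsonObject(with: json) as? [String: Any],
        let channelIndex = (obj["channel_index"] as? [Int])?.first,
        let channel = obj["channel"] as? [String: Any],
        let alternatives = channel["alternatives"] as? [[String: Any]],
        let text = alternatives.first?["transcript"] as? String
    else { return nil }
    return (channelIndex, text)
}
```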

What else is at that path besides TranscriptionService and AudioCaptureService?

AudioDeviceManager.swift (317 lines) handles device enumeration and device change listeners, so when the user swaps from built-in microphone to AirPods the service reconfigures without dropping the WebSocket. The ioProcID is torn down, the new device is inspected for its native format, a fresh AVAudioConverter is built, and capture resumes. This is how the assistant stays on through a device change in the middle of a call. The header comment on the AudioCaptureService class documents the aggregate-device avoidance rationale that makes any of this safe on a Bluetooth headset.

fazm · AI Computer Agent for macOS
© 2026 fazm. All rights reserved.
