Notes from the agent that picked neither

Parakeet vs Whisper on a Mac, and the streaming controls a voice agent actually needs.

For dictation on Apple Silicon, Parakeet TDT is a few times faster than Whisper Large, and Whisper covers more languages; both run fully on-device and never leave the Mac. For a voice agent that drives real apps, though, the model is rarely the bottleneck. What the loop actually needs is interim partials, server-side endpointing, runtime vocabulary, and a two-channel mic+system audio stream. Below is what we shipped in Fazm, the exact WebSocket parameters from the source, and the cases where the local choice still wins.

Matthew Diakonov
9 min read
Direct answer · verified 2026-05-08

Which one for a voice agent on a Mac?

For pure dictation: Parakeet on Apple Silicon is the speed pick (TDT architecture, Neural Engine, no silence hallucination); Whisper Large v3 Turbo is the multilingual pick. Both run on-device.

For a voice agent that has to drive apps, neither model's on-device stack today exposes the four streaming controls the loop relies on: per-utterance keyterm vocabulary, multichannel mic+system audio, partial-and-final transcript pairs, and tunable endpointing. Fazm streams to Deepgram nova-3 over WebSocket for that reason. The exact connection lives at TranscriptionService.swift:274-307 in github.com/m13v/fazm.

From the Fazm source tree:
Model fixed at nova-3 in TranscriptionService.swift line 94
Two audio channels: mic on 0, system audio on 1
endpointing=300, utterance_end_ms=1000, interim_results=true
Per-connection keyterm vocabulary, no retraining

The two on-device picks, and what we run instead

Nine feature-by-feature comparisons of the local stack you would build with Parakeet or Whisper against the streaming WebSocket we shipped. Both sides have honest wins; the page is about which surface area maps to the work an agent loop has to do.

On-device below means Parakeet or Whisper running locally on a Mac; Fazm streaming means the Deepgram nova-3 WebSocket voice path we shipped.

Architecture
On-device: Parakeet TDT is a token-and-duration transducer compiled to CoreML that runs on the Apple Neural Engine. Whisper is an autoregressive encoder-decoder that runs on GPU/ANE via WhisperKit or whisper.cpp.
Fazm streaming: a streaming WebSocket to a hosted recurrent transducer (Deepgram nova-3). Audio in, JSON transcripts out, frame by frame.

Where it runs
On-device: fully on-device; audio never leaves the Mac. Both run cleanly on Apple Silicon, with Parakeet typically a few times faster than Whisper Large for the same input.
Fazm streaming: cloud over WebSocket. Network round-trip in the loop; audio leaves the device.

Interim vs final results
On-device: most local Whisper integrations are file-shaped: feed a buffer, get a transcript at the end. Streaming forks exist (whisper.cpp streaming, WhisperKit streaming) but the partial/final split is something each integration glues together. Parakeet TDT is genuinely streaming-friendly thanks to its duration head.
Fazm streaming: interim_results=true. Partial hypotheses stream as you speak, plus is_final=true segments and a speech_final marker when the user stops.

Endpointing (knowing when the user stopped)
On-device: not part of the model. You bring your own VAD (Silero, WebRTC VAD) and your own silence threshold. Tuning is on you.
Fazm streaming: endpointing=300 + utterance_end_ms=1000 + vad_events=true. Three parameters that fire UtteranceEnd events the agent loop binds to.

Custom vocabulary at runtime
On-device: Whisper accepts an initial_prompt that biases output but does not boost specific terms. Parakeet has no public runtime vocabulary-boost surface in the open-source release.
Fazm streaming: keyterm query parameters per connection. Per-utterance terms (proper nouns, app names, file names) without retraining.

Multichannel mic + system audio
On-device: single audio stream in. If you want mic and speakers separated, you mux them yourself or run two model instances and merge transcripts.
Fazm streaming: channels=2, multichannel=true. Channel 0 is the mic (you), channel 1 is system audio (the call). channel_index in every result tells the agent who said what.

Diarization
On-device: not in either model. Bolt on pyannote, Resemblyzer, or a separate diarization pass.
Fazm streaming: diarize=true. Speaker labels in every word object.

Hallucination on silence
On-device: Whisper's autoregressive decoder can repeat phrases or invent text on silent input, a known failure mode in long real-time sessions. Parakeet's transducer head does not have this failure mode.
Fazm streaming: server-side VAD plus explicit endpointing keep us from inventing words during silence.

What it costs you
On-device: RAM and ANE time. Free per minute once the model is downloaded, but the cost is silicon and battery.
Fazm streaming: an API key, a usage line item, and audio leaving the box. Real money, but the agent loop is forgiving on tokens.

The four streaming controls a voice agent loop binds to

These are the parameters that matter for an interactive loop, regardless of who wins the WER chart this quarter. If your local pipeline does not surface these, you end up writing them yourself, badly, twice.

interim_results

Partial hypotheses every ~100ms while you are still speaking. The agent UI renders them live; the bridge holds the last interim until the matching is_final lands.

endpointing

300ms silence to mark a turn boundary, 1000ms backup utterance_end. The push-to-talk manager waits for speech_final before submitting the prompt to the model.

vad_events

SpeechStarted and UtteranceEnd events with timestamps. Used to drive the floating-bar pulse and to flush the audio buffer at exactly the right moment.

keyterm

Per-connection vocabulary boost. Fazm passes app names, file names, and the user's contact list at connection time; nova-3 weights them at the decoder without retraining.

multichannel

Two raw PCM channels in one frame. Channel 0 is mic, channel 1 is the WhatsApp/Zoom call audio. Result objects carry channel_index so the agent knows whose turn it was.

diarize

Speaker labels per word. Useful when channel 1 has multiple voices, like a speakerphone call routed through system audio capture.
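
For concreteness, here is roughly the shape the loop consumes per message, as a Swift sketch. The field names (channel_index, is_final, speech_final, per-word speaker) follow Deepgram's documented streaming response; the routing function and the UI stubs are this post's own naming, not the Fazm sources.

```swift
import Foundation

// Sketch of the fields an agent loop cares about in each streaming result.
struct StreamingResult: Decodable {
    struct Word: Decodable {
        let word: String
        let start: Double
        let end: Double
        let speaker: Int?              // present when diarize=true
    }
    struct Alternative: Decodable {
        let transcript: String
        let words: [Word]
    }
    struct Channel: Decodable { let alternatives: [Alternative] }

    let channelIndex: [Int]            // [channel, total_channels]; 0 = mic, 1 = system audio
    let isFinal: Bool?                 // interim vs final hypothesis
    let speechFinal: Bool?             // endpointing says the speaker stopped
    let channel: Channel

    enum CodingKeys: String, CodingKey {
        case channelIndex = "channel_index"
        case isFinal = "is_final"
        case speechFinal = "speech_final"
        case channel
    }
}

func showPartial(_ text: String) { /* render in the floating bar */ }
func appendFinal(_ text: String, fromMic: Bool) { /* accumulate the utterance */ }
func submitUtteranceToAgent() { /* hand the finished turn to the agent loop */ }

// How the controls map onto behaviour: interims feed the UI, finals feed the
// transcript, speech_final on the mic channel closes the turn.
func route(_ result: StreamingResult) {
    guard let alt = result.channel.alternatives.first, !alt.transcript.isEmpty else { return }
    let fromMic = result.channelIndex.first == 0

    if result.isFinal != true {
        if fromMic { showPartial(alt.transcript) }
    } else {
        appendFinal(alt.transcript, fromMic: fromMic)
        if result.speechFinal == true && fromMic {
            submitUtteranceToAgent()
        }
    }
}
```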

What the connection actually looks like

This is the URL Fazm opens for every voice session. The query string is built in TranscriptionService.swift, lines 274 to 307. Nothing here is decorative; each parameter ties to a specific behavior the agent loop depends on.

TranscriptionService.swift · WebSocket URL
Audio buffer chunk: 100 ms · Endpointing window: 300 ms · Utterance-end backup: 1000 ms · Audio channels in (mic + system): 2

channels=2 & multichannel=true. Channel 0 is the user's mic, channel 1 is system audio. Each transcript carries channel_index so the agent loop knows whose turn it was.

TranscriptionService.swift, lines 290-291
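
As a sketch of what those lines assemble: the parameter names below are Deepgram's documented streaming options and the values match the knobs described above, but the Swift itself is an illustration, not a copy of TranscriptionService.swift. The encoding and sample_rate values are our assumption, taken from the 16 kHz, 16-bit PCM chunks described in the FAQ.

```swift
import Foundation

// Illustrative assembly of the streaming URL; not lifted from the Fazm sources.
func makeListenURL(keyterms: [String]) -> URL {
    var components = URLComponents(string: "wss://api.deepgram.com/v1/listen")!
    var items: [URLQueryItem] = [
        .init(name: "model", value: "nova-3"),           // fixed at line 94
        .init(name: "encoding", value: "linear16"),      // raw 16-bit PCM (assumption)
        .init(name: "sample_rate", value: "16000"),      // 16 kHz capture (assumption)
        .init(name: "channels", value: "2"),             // mic on 0, system audio on 1
        .init(name: "multichannel", value: "true"),
        .init(name: "interim_results", value: "true"),   // partial hypotheses while speaking
        .init(name: "endpointing", value: "300"),        // 300 ms of silence closes the turn
        .init(name: "utterance_end_ms", value: "1000"),  // backup utterance-end marker
        .init(name: "vad_events", value: "true"),        // SpeechStarted / UtteranceEnd events
        .init(name: "diarize", value: "true"),           // speaker labels per word
    ]
    // Per-connection vocabulary boost: one keyterm entry per app name, file name, contact.
    items += keyterms.map { URLQueryItem(name: "keyterm", value: $0) }
    components.queryItems = items
    return components.url!
}

// Example: makeListenURL(keyterms: ["Stripe", "Fazm", "TranscriptionService.swift"])
```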

Why Parakeet's speed is not the deciding factor here

Parakeet TDT is a beautiful piece of engineering. The token-and-duration transducer compresses the audio signal by a factor of eight before processing and predicts both the next token and how long the current one lasts, which is why a Parakeet inference step can complete inside a window the human ear treats as instantaneous. On Apple's Neural Engine, that effect is even more pronounced. For a dictation app this is a real win: the transcript appears the instant you stop speaking, and there is no autoregressive decoder waiting around to hallucinate during silence.

For an agent loop, the wall-clock between your last word and the model receiving the prompt breaks down differently. Roughly: endpointing detection + last partial → final conversion + tool-call planning + tool-call execution. The first term is a few hundred milliseconds and is set by your endpointing knob, not by your ASR. The second is small for any streaming model. The third and fourth are seconds, not milliseconds, because the model has to think and a real app has to open. A 50ms ASR win disappears inside a 3-second agent step. The choice that matters is which ASR exposes the shape of the protocol the agent step needs.
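
To make the proportions concrete, here is the same breakdown with illustrative numbers plugged in. None of these are measured Fazm figures; they are order-of-magnitude placeholders.

```swift
// Illustrative latency budget for one voice-driven agent step (seconds).
let endpointingWindow   = 0.30   // silence before the turn is closed
let finalizeLastPartial = 0.10   // last interim -> is_final, roughly one round-trip
let toolCallPlanning    = 2.0    // the model decides which tool to call
let toolCallExecution   = 1.5    // the app actually opens and responds

let timeToFirstAction = endpointingWindow + finalizeLastPartial +
                        toolCallPlanning + toolCallExecution      // ≈ 3.9 s

let asrSpeedDelta = 0.05         // a generous 50 ms ASR win
// asrSpeedDelta / timeToFirstAction ≈ 0.013 — about 1% of the loop
```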

In practice, Parakeet's speed advantage shows up brilliantly in a dictation field. It does not show up much in the time-to-first-tool-call of an agent. We picked the protocol that minimises every other piece of glue between the user's mouth and the agent's first action.

The honest local-only case

If a fully on-device voice path is a non-negotiable, both Parakeet and Whisper get you 90% of the way there. Pair Parakeet TDT (via FluidAudio CoreML or parakeet-rs) with a Silero VAD for endpointing, write a simple state machine that emits an UtteranceEnd event after 300ms of silence, and feed the resulting transcript into a local agent harness. You give up runtime vocabulary boosts, multichannel mic+system audio without muxing, and server-side diarization, but you keep your audio on your machine.
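
The endpointing state machine in that recipe is small. A minimal sketch, assuming a per-frame speech/no-speech signal from something like Silero; every name here is ours, not from any shipped project:

```swift
import Foundation

enum EndpointEvent { case speechStarted, utteranceEnd }

// Emits utteranceEnd after a fixed stretch of silence, mirroring endpointing=300.
final class SilenceEndpointer {
    private let silenceThreshold: TimeInterval = 0.300
    private var speaking = false
    private var lastSpeechAt: Date?

    func process(isSpeech: Bool, at now: Date = Date()) -> EndpointEvent? {
        if isSpeech {
            lastSpeechAt = now
            if !speaking { speaking = true; return .speechStarted }
            return nil
        }
        if speaking, let last = lastSpeechAt,
           now.timeIntervalSince(last) >= silenceThreshold {
            speaking = false
            return .utteranceEnd   // flush the buffer to the local model here
        }
        return nil
    }
}
```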

We are not religious about cloud here; the day on-device stacks ship runtime keyterm and a multichannel transducer, the WebSocket goes away. The TranscriptionService class is one interface that hands TranscriptSegments to the rest of the app. Anything matching that shape can replace it.
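
Roughly the contract that has to hold for a swap, sketched in Swift. The real TranscriptSegment and handler types live in the Fazm sources and carry more detail; this is only the shape as described here.

```swift
import Foundation

// Hypothetical mirror of the interface described above, not the shipped types.
struct TranscriptSegment {
    let text: String
    let isFinal: Bool
    let channelIndex: Int      // 0 = mic, 1 = system audio
    let speaker: Int?
}

typealias TranscriptHandler = (TranscriptSegment) -> Void

protocol StreamingTranscriber {
    func start(handler: @escaping TranscriptHandler) async throws
    func sendAudio(_ pcm: Data)            // 100 ms linear16 chunks
    func finishStream() async -> String    // close the stream, return the utterance
}
```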

Tools people are reaching for in this category

Mostly dictation-shaped, mostly excellent at what they do. Different shape from a voice agent that has to act, but a useful map of the territory.

WhisperKit · whisper.cpp · Parakeet TDT (FluidAudio) · parakeet-rs · Superwhisper · MacWhisper · Wispr Flow · Apple Speech (SpeechAnalyzer) · Wispr · Deepgram nova-3

One concrete utterance, end to end

User holds the Option key and says: "open the latest Stripe email and reply, looks good, ship tonight." Twelve seconds of audio. Here is what happens on each path.

Local Whisper or Parakeet, single-channel

Audio is buffered into RAM until you release Option. The buffer is fed to the model. Parakeet returns a transcript quickly on the Neural Engine; Whisper takes a few hundred milliseconds for the chunk. You get a single string out. You write your own VAD to detect that the user stopped, your own initial_prompt to bias toward "Stripe" instead of "strip", and a separate path for system audio if you ever care about it.

Audio leaves device: never. Glue you write yourself: VAD, endpointing, vocab boost, channel routing.

Fazm: streaming nova-3

A WebSocket is already open. Each 100ms PCM chunk is sent in real time. Interim partials stream back; the floating bar shows them at ~100ms latency. When you release Option, the bridge sends CloseStream and waits for the last is_final. keyterm has "Stripe" already, so it never lands as "strip". The transcript drops into the agent loop and Mail.app opens via the macos-use MCP.

Audio leaves device: yes, that is the trade. Glue: zero, the protocol does it.

When each pick is the right pick

Pick Parakeet for a Mac dictation app, a meeting transcriber that runs alongside Zoom, a podcast or video pipeline where the audio lives on disk. The TDT architecture is unbeatable for fast English transcription on Apple Silicon, and there is no silence hallucination to manage.

Pick Whisper when language coverage matters more than raw speed: meetings in twenty languages, file-shaped batch transcription, anything where you want a single model to handle the long tail of accents and code-switching. Large v3 Turbo is the practical default.

Pick a hosted streaming protocol (what Fazm does) when the audio is one half of a feedback loop: agent acts, user speaks, agent acts. The job is no longer transcription; it is keeping a tight loop alive between the user's voice and the next click. Per-utterance vocabulary, multichannel mic+system audio, and tunable endpointing become load-bearing.

See the WebSocket, the floating bar, and a tool call land

Twenty minutes, on a real Mac. We open the source at TranscriptionService.swift, hold the Option key, watch the partials stream in, and let the agent open Mail. Bring questions about the on-device path; we are happy to argue both sides.

Frequently asked questions

Short version: Parakeet or Whisper for a Mac voice agent?

Neither, if the agent is something like Fazm that drives real apps and tools. For pure dictation, Parakeet wins on speed on Apple Silicon (a few times faster than Whisper Large for the same input) and Whisper wins on language coverage. For a voice agent loop, the model's WER and latency are not the bottleneck. The bottleneck is the streaming protocol around the model: interim results, endpointing, runtime vocabulary, multichannel mic+system audio, diarization. Fazm streams to Deepgram nova-3 over WebSocket because that surface is exposed and tuned. The exact connection params live in TranscriptionService.swift at lines 274-307.

Why not just run Parakeet locally and bolt the rest on yourself?

You can. The cost is the glue: a Silero or WebRTC VAD, a custom endpointing state machine, a way to inject per-utterance vocabulary, a CoreAudio tap for system audio on a separate channel, and a way to align partials and finals so your UI does not flash. We tried that path early, and the failure mode was that every workflow had a different idea of when the user had stopped speaking. Server-side endpointing with three knobs (endpointing=300, utterance_end_ms=1000, vad_events=true) gave us one place to tune it for everyone. Parakeet TDT is excellent at the model job; the surrounding plumbing is what burns weeks.

But audio leaves the device with Deepgram. Isn't local-only better?

Local-only is better when the threat model includes the network. The voice path is the place we made an explicit non-local choice, and the README and onboarding say so. Everything else in Fazm runs on the Mac: the agent loop, accessibility-tree reads, MCP servers, the chat UI, file work. If a fully local voice path is a hard requirement, Fazm will not be the right tool for that path until on-device transducers expose runtime keyterm and multichannel. Today, that is not yet the case in the open-source releases.

Where is this in the source?

TranscriptionService.swift in the Fazm Desktop sources. Line 94 fixes the model to nova-3. Lines 274 to 307 build the WebSocket URL with all of the streaming knobs. Lines 348 to 374 run an 8-second keepalive ping. Lines 376 to 399 run a 30-second watchdog that reconnects when keepalives stop succeeding for 60 seconds. PushToTalkManager.swift handles the Option-key state machine and feeds 100ms PCM chunks (3200 bytes at 16 kHz, 16-bit) into the WebSocket via TranscriptionService.sendAudio.
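
The chunk size is plain arithmetic: 16,000 samples per second × 2 bytes per sample × 0.1 s = 3,200 bytes. A hypothetical buffering loop that feeds sendAudio-shaped calls could look like this; apart from the sendAudio name, everything here is illustration rather than the PushToTalkManager code.

```swift
import Foundation

let sampleRate = 16_000        // samples per second
let bytesPerSample = 2         // 16-bit linear PCM
let chunkDuration = 0.100      // seconds per WebSocket frame
let bytesPerChunk = Int(Double(sampleRate * bytesPerSample) * chunkDuration)  // 3200

// Accumulate captured PCM and hand out exact 100 ms chunks.
final class ChunkBuffer {
    private var pending = Data()
    private let send: (Data) -> Void       // e.g. TranscriptionService.sendAudio

    init(send: @escaping (Data) -> Void) { self.send = send }

    func append(_ pcm: Data) {
        pending.append(pcm)
        while pending.count >= bytesPerChunk {
            send(pending.prefix(bytesPerChunk))
            pending.removeFirst(bytesPerChunk)
        }
    }
}
```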

What does interim_results actually buy a voice agent over a file-shaped Whisper call?

Three things. First, the floating control bar can show partial text within ~100ms of you starting to talk, which keeps the interaction feeling alive. Second, the agent can start prefetching context for likely tool calls before you finish speaking (open browser tabs, load mailboxes). Third, when you stop, the gap between your last word and the prompt going to the model is bounded by the endpointing window plus one network round-trip, not by 'transcribe the whole buffer at the end'. For a 12-second utterance that matters; for a 1-second utterance the experience is what makes voice feel like a real input modality instead of a clunky dictation form.

Why two channels of audio at the WebSocket layer?

Because a voice agent on a Mac is also useful when there is sound coming out of the speakers. The mic is the user. System audio is the call, the meeting, the video. Keeping them on separate channels at the source means the model never has to decode 'who is talking' from a mixed mono stream, and result objects carry channel_index so the agent loop knows which speaker to attribute. On the local side, both Whisper and Parakeet take a single audio buffer; if you want this you mux it yourself, or run two model instances. Two model instances doubles the ANE time. We picked the single-stream protocol option.
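
For a picture of what "separate channels at the source" means mechanically: interleave two mono 16-bit buffers, mic and system audio, into one stereo frame before it goes over the socket. A sketch, assuming both buffers cover the same window at the same sample rate; this is not lifted from the Fazm capture code.

```swift
import Foundation

// Channel 0 = mic, channel 1 = system audio, sample by sample.
func interleave(mic: [Int16], system: [Int16]) -> Data {
    let frames = min(mic.count, system.count)
    var stereo = [Int16]()
    stereo.reserveCapacity(frames * 2)
    for i in 0..<frames {
        stereo.append(mic[i])        // the user
        stereo.append(system[i])     // the call / meeting / video
    }
    return stereo.withUnsafeBufferPointer { Data(buffer: $0) }
}
```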

If Parakeet adds streaming vocabulary boost in the future, does this change?

Yes, and we want it to. The architecture is decoupled: TranscriptionService is a class that hands frame-by-frame transcript segments to a TranscriptHandler closure. Swap the WebSocket out for a local Parakeet streaming session that emits the same segments and the rest of Fazm does not notice. The reason this is not the default today is exactly the surface area of the protocol: server-side endpointing, runtime keyterms, multichannel diarization, all in one connection. When the local stack matches that, the choice flips.

Is Whisper just bad for this then?

Whisper is excellent at what it is good at: high-accuracy multilingual transcription of fixed audio, with mature on-device tooling like WhisperKit and whisper.cpp. The places it struggles for an agent loop are the ones the architecture causes: the autoregressive decoder can hallucinate on silence, file-shaped calls do not give you partials, and there is no runtime keyterm boost. Some of those are addressable with engineering; none of them are wrong, they are just the wrong shape for an interactive agent. For a meeting recorder or a podcast pipeline, Whisper Large v3 Turbo is hard to beat.

What does the loop actually do once a final transcript lands?

PushToTalkManager.handleFlagsChanged sees the Option key go up, calls TranscriptionService.finishStream, which sends Deepgram a CloseStream message and waits for the last is_final segment. The accumulated transcript is normalized, find-and-replace rules are applied (dot com to .com, at sign to @, dot js to .js, the full list lives in TranscriptionService.defaultReplacements), and the result is dropped into either the chat input field or directly into a tool-call prompt. From there the agent loop runs the same way it would if you had typed.
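
A sketch of that replacement step, using only the three rules quoted above; the full list lives in TranscriptionService.defaultReplacements, and the real implementation may be smarter about word boundaries than this is.

```swift
import Foundation

// Spoken-form fixups applied to the accumulated transcript before it reaches
// the chat input or a tool-call prompt. The shipped list is longer and
// presumably also tidies the surrounding whitespace.
let replacements: [(spoken: String, written: String)] = [
    ("dot com", ".com"),
    ("at sign", "@"),
    ("dot js", ".js"),
]

func normalize(_ transcript: String) -> String {
    replacements.reduce(transcript) { text, rule in
        text.replacingOccurrences(of: rule.spoken, with: rule.written, options: .caseInsensitive)
    }
}
```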
