Field notes, fazm desktop

Voice-first control of a laptop without sending audio to a third party: where 2026 actually lands.

The question gets asked as if it has one answer. It has four, one per layer of what voice control of a laptop actually is. Three of the four layers are fully local-viable on a Mac in May 2026. One is not, and not for the reason most people assume. The model is not the bottleneck. The four streaming controls around the model are.

Matthew Diakonov
9 min read

Direct answer (verified 2026-05-14)

For dictation, yes. Parakeet TDT v3 (FluidAudio, MIT) and WhisperKit (argmaxinc, MIT) both run on the Apple Neural Engine with zero outbound network. Several shipping apps (MacParakeet, Ghost Pepper, TypeWhisper, Wispr) deliver this today.

For a voice-agent loop, partly. The model layer is local-ready. The surrounding streaming protocol (interim partials, hard endpointing, runtime keyterm vocabulary, multichannel mic + system audio on one stream) is not yet exposed by any open-source on-device stack we have shipped against. Apple's macOS 26 SpeechAnalyzer is the closest, but the public API does not (yet) cover the agent-shape knobs.

Sources we re-checked today: FluidAudio, WhisperKit, MacParakeet, Fazm.

The four layers, ranked by how local they already are

Three are answered. One is the bottleneck. Most arguments about "is local voice control viable" collapse all four into one and end up arguing about the wrong layer.

1. Audio capture

Mic + system audio on separate channels via AVAudioEngine + a CoreAudio system-audio tap. Sample to 16 kHz linear16 PCM. 100 ms frames, 3200 bytes each. Fully local in 2026, has been for a decade. No question here.
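
A minimal sketch of the mic half of that path, assuming AVAudioEngine plus AVAudioConverter for the resample; the system-audio tap is a separate CoreAudio exercise, and the exact framing in any shipping app will differ:

import AVFoundation

let engine = AVAudioEngine()
let input = engine.inputNode
let micFormat = input.outputFormat(forBus: 0)

// Target: 16 kHz, 16-bit mono linear PCM -> 3200 bytes per 100 ms frame.
let target = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: 16_000,
                           channels: 1, interleaved: true)!
let converter = AVAudioConverter(from: micFormat, to: target)!

input.installTap(onBus: 0, bufferSize: 4096, format: micFormat) { buffer, _ in
    let out = AVAudioPCMBuffer(pcmFormat: target,
                               frameCapacity: AVAudioFrameCount(target.sampleRate * 0.1))!
    var served = false
    _ = converter.convert(to: out, error: nil) { _, status in
        if served { status.pointee = .noDataNow; return nil }
        served = true
        status.pointee = .haveData
        return buffer
    }
    // `out` now holds up to 100 ms of 16 kHz linear16 samples; accumulate into
    // exact 3200-byte frames before handing them to the recognition layer.
}

try engine.start()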

2. Recognition (STT)

The model that turns those 100 ms frames into words. Parakeet TDT v3 on the Apple Neural Engine, WhisperKit on the same hardware. Both ship as open Swift SDKs, both run with zero outbound network. The model is local-ready.

3. Loop protocol

The four controls around the model that make voice feel like a real input modality: interim partials every ~100 ms, hard endpointing on a 300 ms silence window, runtime keyterm vocabulary per utterance, multichannel mic + system on one stream with channel_index on every result. No open-source on-device stack ships all four in May 2026. This is the gap.

4. Action

Once the transcript lands, the agent has to click buttons, type into fields, open windows. Fazm does this through macOS accessibility APIs (AXUIElement tree walking) instead of pixel-scraping, and through the user's chosen LLM provider for the planning step. Both knobs are configurable. The action layer is the answered question.
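
A flavor of that layer in code, with illustrative helper names (this is not the Fazm source; both calls require the Accessibility permission):

import ApplicationServices
import CoreGraphics

// Read the focused window's title from a running app's AX tree.
func focusedWindowTitle(pid: pid_t) -> String? {
    let app = AXUIElementCreateApplication(pid)
    var window: CFTypeRef?
    guard AXUIElementCopyAttributeValue(app, kAXFocusedWindowAttribute as CFString, &window) == .success
    else { return nil }
    var title: CFTypeRef?
    AXUIElementCopyAttributeValue(window as! AXUIElement, kAXTitleAttribute as CFString, &title)
    return title as? String
}

// Synthesize a left click at a screen coordinate, no pixel-scraping involved.
func click(at point: CGPoint) {
    for type in [CGEventType.leftMouseDown, .leftMouseUp] {
        CGEvent(mouseEventSource: nil, mouseType: type,
                mouseCursorPosition: point, mouseButton: .left)?
            .post(tap: .cghidEventTap)
    }
}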

The four-knob gap, named precisely

The protocol layer is the only one where the local answer is "not yet". It is four specific controls. None of them is a model-quality question. All four are surface-area questions about what the streaming API exposes; a sketch of the session surface they imply follows the list.

  • interim_results. Partial hypotheses every ~100 ms while you are still talking. Local transducers can emit them. Public on-device SDKs (FluidAudio, WhisperKit) do not yet consistently expose a stable streaming-partials API that matches what an agent UI needs.
  • endpointing. A hard silence window that produces a speech_final event. Most local stacks ship voice activity detection (Silero, WebRTC VAD) as a separate library. Wiring it to the model so a single "turn-over" event lands at the right moment is the part every voice agent re-invents.
  • runtime keyterm. A way to bias the decoder toward your contact list, app names, and file names this minute, without retraining. Server-side transducers expose this as a per-connection parameter. Parakeet TDT has no public runtime vocabulary boost in the open-source release. WhisperKit accepts an initial_prompt that biases output but does not boost specific terms.
  • multichannel. Two raw PCM channels in one frame: channel 0 is your mic, channel 1 is the sound coming out of your speakers (the Zoom call, the meeting, the video). The result objects carry a channel_index so the agent knows whose words those were. Local SDKs accept a single audio buffer. If you want this split, you mux it yourself or run two model instances.
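
A sketch of the session surface those four knobs imply. Every name below is invented; no shipping on-device SDK exposes this today, which is the point:

import Foundation

// Hypothetical local streaming-session surface covering the four knobs.
struct StreamingConfig {
    var interimResults = true           // partials every ~100 ms
    var endpointingMillis = 300         // trailing silence -> speechFinal
    var keyterms: [String] = []         // per-utterance vocabulary bias
    var channels = 2                    // 0 = mic, 1 = system audio
}

enum StreamingEvent {
    case partial(text: String, channelIndex: Int)
    case speechFinal(text: String, channelIndex: Int)
    case utteranceEnd(channelIndex: Int)
}

protocol LocalStreamingSession {
    init(config: StreamingConfig) throws
    var events: AsyncStream<StreamingEvent> { get }
    func send(pcmFrame: Data) async         // interleaved 100 ms linear16 frames
    func update(keyterms: [String])         // runtime vocabulary swap, no retrain
    func finish() async
}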

Voice loop, layer by layer

Mic + system audio → local STT (model) → loop protocol → agent / action. Along the way: a 100 ms PCM frame (16 kHz linear16, 3200 B), an interim partial (~100 ms after speech start), speech_final = true (after 300 ms silence), per-utterance keyterm bias (app names, file names), a finalized transcript + channel_index, and finally a tool call (AXUIElement click / keystroke / window). Missing locally: runtime keyterm + multichannel on one stream.

How Fazm actually splits this (the one cloud hop, named)

The Fazm Mac app runs locally for everything except the audio bytes. Screen reading uses AXUIElementCreateApplication against the focused window, in-process, no network. Clicks and keystrokes are CGEvent calls. The MCP servers run as child processes on your machine. The agent loop and the chat UI render locally. The only thing that leaves the laptop is the 100 ms PCM frame on the way to a streaming transducer, today Deepgram, because no open-source on-device stack reproduces the four-knob surface in May 2026.

The exact connection lives in TranscriptionService.swift in the Fazm Desktop sources:

// TranscriptionService.swift, lines 274-307
let url = "wss://api.deepgram.com/v1/listen"
  ?model=nova-3                  // line 94
  &language=en
  &encoding=linear16
  &sample_rate=16000
  &channels=2                    // mic (0) + system audio (1)
  &multichannel=true
  &interim_results=true
  &endpointing=300               // ms silence -> speech_final
  &utterance_end_ms=1000         // backup endpoint
  &vad_events=true
  &diarize=true
  &keyterm=<per-utterance term>  // app names, file names
  &replace=dot+com:.com  &replace=at+sign:@  ...

// Audio chunk size: 3200 bytes (100 ms at 16 kHz, 16-bit). Line 119.
// Keepalive: every 8 s, lines 348-374.
// Watchdog: every 30 s, reconnect after 60 s of stale data, lines 376-399.

The point of pasting it: this is the surface a local stack has to match before the cloud hop disappears. Not the model. The connection object around the model. The day FluidAudio or WhisperKit (or a peer) exposes those four query parameters as part of a stable streaming session API, the Fazm TranscriptionService gains a local backend and the audio frame stops leaving the box. The architecture is already decoupled: the class hands frame-by-frame transcript segments to a closure; swap the WebSocket out for a local session emitting the same shape and the rest of the app does not notice.

The local voice stack on a Mac in May 2026 (what already ships)

For dictation specifically, this is solved. The list below is every shipping option we used or read the source of while writing this. All run on-device. None are voice agents in the full sense. That is the gap to fill.

Parakeet TDT v3 (FluidAudio, MIT)
WhisperKit (argmaxinc, MIT)
whisper.cpp (ggerganov, MIT)
Ghost Pepper (WhisperKit + Qwen)
MacParakeet (Parakeet default)
TypeWhisper (ten engines, local)
Wispr (Whisper on-device)
Apple SpeechAnalyzer (macOS 26)

What flips the answer to "fully yes"

Two paths to the local-everywhere voice agent loop on a Mac, ordered by how soon I think each lands.

Path A. An open-source on-device transducer (Parakeet TDT class, or a NeMo-family peer) exposes a streaming session API that covers runtime keyterm and multichannel-on-one-stream. FluidAudio or WhisperKit wires it up behind a Swift surface that an agent can target. The day that ships, every voice agent on a Mac that is willing to drop a cloud transcription dependency does so in one afternoon. The model is the part that is already strong enough.

Path B. Apple's macOS 26 SpeechAnalyzer ships in a point release with a public streaming API that covers per-utterance vocabulary boost and multichannel input. The first-token latency is already there. The licensing story is simpler than rolling a third-party model. The remaining unknown is which knobs the public API exposes versus which stay private.

Either path, the day it lands, the audio frame stops leaving the machine and the answer to the question in the title becomes a flat yes. Until then, the honest answer is the one above: three layers out of four, with the fourth in motion.

If you can't wait for path A or B

Two reasonable shapes today.

Local dictation, cloud planning. Run Parakeet or WhisperKit on the Mac for the words. Send only the transcribed text plus a structured representation of the screen to your LLM provider. Audio never leaves the laptop. This is what most privacy-first dictation apps converge on. The tradeoff is the loop feels like dictation, not like voice control: no interim partials, no hard endpointing, no agent-driven barge-in.
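
What leaves the laptop in that shape is small. A sketch of the payload, with made-up field names; any Anthropic-compatible endpoint works the same way:

import CoreGraphics
import Foundation

// Illustrative planning payload: text and UI structure only, never audio.
struct ScreenElement: Codable {
    let role: String        // e.g. "AXButton"
    let title: String?
    let frame: CGRect
}

struct PlanningRequest: Codable {
    let transcript: String          // from Parakeet / WhisperKit, on-device
    let screen: [ScreenElement]     // from the AX tree, no pixels
}

// let body = try JSONEncoder().encode(PlanningRequest(transcript: text, screen: elements))
// POST body to the configured LLM endpoint; the PCM buffers stay on the machine.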

Self-hosted streaming transducer. Run Deepgram on-prem, or stand up whisper-streaming or a faster-whisper server inside your network. Point the agent's transcription URL at that machine. The audio leaves the laptop but not your hardware. Closest existing path to a fully audited voice agent loop with the four streaming knobs intact.
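
Mechanically that is just a different WebSocket host. A sketch with a made-up internal hostname; your self-hosted transducer defines its own query parameters:

import Foundation

// Same frame-by-frame shape, different host. Hostname and path are made up.
let url = URL(string: "wss://stt.internal.example:8080/v1/listen?encoding=linear16&sample_rate=16000")!
let socket = URLSession.shared.webSocketTask(with: url)
socket.resume()

// One 100 ms PCM frame (3200 bytes per channel), exactly as it would go to the cloud host.
func send(frame: Data) {
    socket.send(.data(frame)) { error in
        if let error { print("frame send failed: \(error)") }
    }
}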

Talk to me about the audio path on your stack

If the four-knob gap is the only thing keeping you from voice control, I want to know which knob bites first. Half-hour call, no pitch.

Frequently asked questions

One-line answer: viable yet, or not?

For dictation, yes. Parakeet TDT compiled to CoreML runs on the Apple Neural Engine at roughly 80 ms first-token on M-series Macs (FluidAudio benchmark, 2026). WhisperKit runs on the same hardware and covers 99+ languages. Neither needs the network. For a voice-AGENT loop on the same machine (push-to-talk, interim transcripts, hard endpointing, runtime vocabulary, mic plus system audio on separate channels) the model layer is local-ready but the surrounding streaming protocol is not yet exposed by any open-source on-device stack we have shipped against. The model is the answered question. The four streaming controls around it are the unanswered ones.

What is the difference between dictation and a voice-agent loop, concretely?

Dictation is one direction: speech in, text out, you read it. The work is the model. A voice-agent loop is bidirectional and time-sensitive. The UI has to show partial text within 100 ms of you starting to talk, the system has to know when you have stopped, the model needs to be biased toward your app names and file names this minute, and on a Mac the mic and the meeting on your speakers need to stay on separate channels so the agent does not attribute the other side's words to you. The model is the same. The protocol around it is the work.

Why does Fazm use Deepgram for the voice path if everything else is local?

Because in May 2026 no open-source on-device stack exposes interim_results, endpointing, runtime keyterm, multichannel, and diarization in one connection. We tried gluing Parakeet plus a Silero VAD plus a custom endpointing state machine plus a CoreAudio mic-and-system tap, and the failure mode was that every workflow ended up with a different idea of when the user had stopped speaking. Server-side endpointing with three knobs (endpointing=300 ms, utterance_end_ms=1000 ms, vad_events=true) gave us one place to tune that for everyone. The exact connection lives in TranscriptionService.swift, lines 274 to 307. Everything else in the app runs locally: screen reading, click and keystroke control, MCP servers, the chat UI, the workflow store.
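
For scale, the state machine in question is small; the hard part is that it has to have exactly one owner. An illustrative version:

// Illustrative only, not the Fazm source: the endpointing state machine a
// local stack has to own. One VAD verdict per 100 ms frame in, one
// speech_final-style event out after 300 ms of trailing silence.
struct Endpointer {
    let silenceWindowMillis = 300
    let frameMillis = 100
    private var inSpeech = false
    private var trailingSilenceMillis = 0

    // Feed one VAD verdict per frame; returns true exactly once per turn.
    mutating func ingest(frameIsSpeech: Bool) -> Bool {
        if frameIsSpeech {
            inSpeech = true
            trailingSilenceMillis = 0
            return false
        }
        guard inSpeech else { return false }
        trailingSilenceMillis += frameMillis
        if trailingSilenceMillis >= silenceWindowMillis {
            inSpeech = false
            trailingSilenceMillis = 0
            return true     // turn over: the local equivalent of speech_final
        }
        return false
    }
}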

What changes the answer to 'fully yes, local everywhere'?

One of two things. A: An open-source on-device transducer (Parakeet TDT class, NVIDIA NeMo lineage, or a peer) ships a public runtime-keyterm and multichannel streaming surface, and FluidAudio or WhisperKit wires it up. B: Apple's SpeechAnalyzer in macOS 26 exposes a stable enough streaming and vocabulary API that a third-party voice agent can target it. Either flips the choice. The Fazm architecture is already decoupled: TranscriptionService.swift is a class that hands frame-by-frame transcript segments to a closure; swap the WebSocket out for a local session emitting the same segments and the rest of the app does not notice.

Which apps in 2026 actually deliver voice control on a Mac with zero audio leaving the box?

For dictation specifically, several. MacParakeet (Parakeet TDT default, WhisperKit optional for 73 more languages, no cloud), Ghost Pepper (WhisperKit plus a local Qwen LLM for cleanup), TypeWhisper (ten engines, all local by default), Wispr (WhisperKit). All of these write text into the focused field. None of them are voice agents in the sense of driving Slack, Linear, the browser, and Google Workspace as a tool-using loop. That is the missing rung.

Could the agent run cloud transcription and stay private if I self-host?

Yes. Deepgram has an on-prem deployment, and several open transducer servers (whisper-streaming, faster-whisper-server) speak similar shapes. If you point TranscriptionService at a server inside your network, the audio leaves the laptop but not your machines. The Fazm app already accepts a custom Anthropic-compatible endpoint for the model. The transcription endpoint is the next override, and it lands the same way: a URL on the box you control.

What does Apple's on-device SpeechAnalyzer in macOS 26 buy us here?

Streaming dictation with first-class on-device support, exposed to third-party apps, with reasonably tight first-partial latency on Apple Silicon. What it does not (yet, publicly) give a voice agent is runtime per-utterance vocabulary boost, multichannel mic-plus-system on one stream, and the same speaker-diarization surface a server-side transducer exposes. SpeechAnalyzer is the cleanest existing on-device dictation path on the Mac. Whether it becomes the agent path depends on which knobs Apple exposes through the public API in the next point release.

Where do I read the actual code?

TranscriptionService.swift in the Fazm Desktop sources at github.com/m13v/fazm. Line 94 fixes the model to nova-3. Lines 274 to 307 build the WebSocket URL with model, language, channels, multichannel, smart_format, no_delay, diarize, interim_results, endpointing, utterance_end_ms, vad_events, encoding linear16 at 16 kHz, plus a keyterm parameter per vocabulary term. Lines 348 to 374 run an 8-second keepalive ping. Lines 376 to 399 run a 30-second watchdog that reconnects when keepalives stop succeeding for 60 seconds. The audio chunk size is 3200 bytes (100 ms at 16 kHz, 16-bit), see line 119.

