macOS Dictation with Local Whisper - Sub-Second Latency on Apple Silicon

Fazm Team··2 min read

Cloud-based speech-to-text adds 500ms to 2 seconds of latency per utterance. When you are driving an AI agent with voice commands, that delay kills the workflow. Local Whisper on M-series chips changes the equation entirely.

Why Latency Matters for Agent Interaction

When you speak a command to an AI agent, there is a chain: speech capture, transcription, LLM processing, action execution. If transcription alone takes 1.5 seconds over the network, the full loop feels sluggish. You start waiting instead of working.
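To make that sluggishness concrete, here is a rough latency budget for one voice-command loop. The per-stage numbers are illustrative assumptions for a typical setup, not measurements:

```python
# Rough latency budget for one voice-command loop (illustrative numbers, not benchmarks).
capture_ms = 50        # microphone buffering + endpointing
llm_ms = 800           # agent/LLM processing
action_ms = 100        # executing the resulting action

cloud_stt_ms = 1500    # network round trip + hosted inference
local_stt_ms = 200     # local Whisper on Apple Silicon

cloud_total = capture_ms + cloud_stt_ms + llm_ms + action_ms   # 2450 ms
local_total = capture_ms + local_stt_ms + llm_ms + action_ms   # 1150 ms

print(f"cloud loop: {cloud_total} ms")
print(f"local loop: {local_total} ms")
print(f"saved per utterance: {cloud_total - local_total} ms")  # 1300 ms
```

Even with everything else held constant, swapping the transcription stage cuts the loop by more than half.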

With Whisper running locally on an M1/M2/M3/M4 chip, transcription drops to under 200 milliseconds for typical utterances. With a Core ML build of whisper.cpp, the encoder runs on the Neural Engine, so most of the inference stays off your CPU and GPU budget and the agent can keep running other tasks simultaneously.

The Technical Setup

Running Whisper locally on macOS involves a few key pieces:

  • whisper.cpp - A C/C++ port optimized for Apple Silicon, using Core ML and the Neural Engine for acceleration.
  • Model selection - The base or small model offers the best latency-to-accuracy tradeoff. The medium model is more accurate but adds 100-300ms.
  • Audio capture - macOS AVAudioEngine provides low-latency microphone access; pair it with a simple energy-based voice activity detector to find utterance boundaries (AVAudioEngine does not ship one built in).
  • Streaming chunks - Process audio in 2-3 second chunks rather than waiting for silence. This gives you partial results faster.
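The chunking step above can be sketched in a few lines. This is a minimal sketch assuming 16 kHz mono PCM (the input format Whisper expects) and a hypothetical `transcribe()` binding to whisper.cpp; a small overlap between chunks keeps words at the boundary from being cut:

```python
# Split a mono 16 kHz PCM stream into overlapping chunks for streaming
# transcription. `transcribe` below is a stand-in for your whisper.cpp binding.
SAMPLE_RATE = 16_000     # Whisper models expect 16 kHz mono input
CHUNK_SECONDS = 2.5      # 2-3 s chunks give fast partial results
OVERLAP_SECONDS = 0.4    # overlap so boundary words appear in both chunks

def iter_chunks(samples, chunk_s=CHUNK_SECONDS, overlap_s=OVERLAP_SECONDS):
    """Yield fixed-size windows over `samples`, stepping by chunk - overlap."""
    chunk = int(chunk_s * SAMPLE_RATE)
    step = chunk - int(overlap_s * SAMPLE_RATE)
    for start in range(0, max(len(samples) - 1, 1), step):
        yield samples[start:start + chunk]
        if start + chunk >= len(samples):
            break

# Usage: feed each chunk to the model as it arrives for partial results.
# for chunk in iter_chunks(pcm_buffer):
#     partial_text = transcribe(chunk)
```

Deduplicating the overlapping words between consecutive partial results is left to the caller; whisper.cpp's own streaming example handles this with token-level matching.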

Privacy as a Bonus

Every voice command stays on your machine. No audio leaves the device. For agents that handle sensitive workflows - financial data, internal docs, customer information - this is not just a nice-to-have. It is a requirement for many teams.

Real-World Performance

On an M2 MacBook Pro, the small model transcribes a 3-second audio clip in roughly 150ms. That is faster than the time it takes you to finish speaking and mentally context-switch to the next thought. The agent feels like it is reading your mind.
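In speech-recognition terms that is a real-time factor (RTF) of about 0.05 - anything well under 1.0 means the model finishes before the audio does. A quick back-of-the-envelope check using the numbers above:

```python
# Real-time factor: processing time divided by audio duration.
audio_seconds = 3.0
transcribe_seconds = 0.150   # small model on an M2, per the figure above

rtf = transcribe_seconds / audio_seconds
print(f"RTF = {rtf:.2f}")    # 0.05, i.e. 20x faster than real time
```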

The key insight: voice input for AI agents only works when transcription is invisible. Local Whisper on Apple Silicon makes it invisible.

Fazm is an open source macOS AI agent, available on GitHub.
