Running whisper.cpp on Apple Silicon for Local Voice Recognition
If you want fast, private voice recognition on a Mac, whisper.cpp with the large-v3-turbo model on Apple Silicon is the best option right now. No cloud API calls, no subscription fees, and your audio never leaves your machine.
Why large-v3-turbo
The large-v3-turbo model hits the sweet spot between accuracy and speed on Apple Silicon. It is significantly faster than the full large-v3 while maintaining nearly identical accuracy for English transcription. On an M2 Pro, it processes audio at roughly 10x real-time speed, meaning a 60-second clip transcribes in about 6 seconds.
The smaller models (tiny, base, small) are faster but make noticeably more errors, especially with technical vocabulary, accents, or background noise. For a desktop agent that needs to understand voice commands reliably, the accuracy tradeoff is not worth the speed gain.
The Optimal Pipeline
The recommended architecture is a two-stage pipeline:
- whisper.cpp transcribes: converts raw audio to text locally
- LLM processes: interprets the transcription and executes the intent
This separation matters. Whisper handles the hard problem of speech-to-text. The LLM handles the hard problem of understanding intent. Trying to do both in one step - feeding audio directly to a multimodal model - is slower and less reliable for command execution.
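The two stages can be sketched as a plain shell pipeline. This is a minimal illustration, not the only way to wire it up: the file names and the `llama-cli` invocation (from llama.cpp, one possible local LLM runner) are assumptions, and the whisper binary may be named `whisper-cli` in newer builds.

```shell
# Stage 1: local speech-to-text with whisper.cpp
# (assumes a built binary and a downloaded model; paths are illustrative)
TRANSCRIPT=$(./main -m models/ggml-large-v3-turbo.bin -f command.wav --no-timestamps 2>/dev/null)

# Stage 2: hand the plain-text transcription to a local LLM for intent parsing
# (llama.cpp's llama-cli shown as one option; any local LLM runner works here)
./llama-cli -m models/llm.gguf -p "Extract the user's intent as JSON: $TRANSCRIPT"
```

Because the interface between the stages is plain text, either side can be swapped out independently: a different ASR model, a different LLM, or a structured-output parser in between.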
Setup on macOS
```shell
# Clone and build whisper.cpp (Metal support is detected automatically)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make -j

# Download the large-v3-turbo model in ggml format
./models/download-ggml-model.sh large-v3-turbo
```
whisper.cpp compiles with Metal support automatically on Apple Silicon, so GPU acceleration works out of the box. No CUDA setup, no driver issues.
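To sanity-check the build, you can transcribe the sample clip that ships with the repository. A minimal sketch, assuming the legacy `make` build (which produces `./main`; newer CMake builds name the binary `build/bin/whisper-cli` instead):

```shell
# Transcribe a bundled 16 kHz mono WAV sample; whisper.cpp expects
# 16 kHz audio, so resample other inputs first (e.g. with ffmpeg)
./main -m models/ggml-large-v3-turbo.bin -f samples/jfk.wav --no-timestamps
```

The transcription is printed to stdout, which is what makes piping it into a second stage straightforward.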
Real-World Performance Tips
- Use Voice Activity Detection (VAD) to avoid processing silence
- Stream in chunks rather than waiting for complete utterances for lower latency
- Pre-load the model at app startup so the first transcription is not slow
- Quantized models (Q5 or Q8) reduce memory usage with minimal accuracy loss
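The quantization step in the last tip uses the `quantize` tool built alongside whisper.cpp. A sketch, assuming the model was downloaded as above (the output file name is a convention, not a requirement):

```shell
# Produce a Q5_0-quantized copy of the model; the smaller file loads
# faster and uses less memory, at a small accuracy cost
./quantize models/ggml-large-v3-turbo.bin models/ggml-large-v3-turbo-q5_0.bin q5_0
```

Point your transcription command at the quantized file to use it; the original can be kept around for accuracy comparisons.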
For desktop agents that need always-on voice input, this pipeline gives you sub-second response times with high accuracy, all running locally on your Mac.
Fazm is an open-source macOS AI agent, available on GitHub.