Self-Hosted Voice Typing with Whisper for AI Agent Input
Cloud voice typing services work fine until you care about privacy, latency, or monthly costs. Running Whisper on your own hardware gives you a voice input pipeline that is private, fast, and free after the initial setup.
The Basic Architecture
The setup is straightforward:
- Microphone input on your Mac captures audio
- Audio streams to a local Whisper instance (on the same machine or a homelab server)
- Transcribed text gets injected as keyboard input or piped directly to an AI agent
The result is system-wide voice typing that works in any app and any text field, with no internet connection required.
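The three stages above can be sketched in a few lines. This is a minimal sketch, assuming whisper.cpp's `whisper-cli` binary is on your PATH with a downloaded model; the binary name, flags, and model path are assumptions that vary by whisper.cpp version, and the keystroke injection requires macOS Accessibility permission.

```python
# Minimal sketch of the capture -> transcribe -> inject pipeline.
import json
import subprocess

MODEL = "models/ggml-small.bin"  # hypothetical model path

def whisper_cmd(wav_path: str, model: str = MODEL) -> list[str]:
    """Build the whisper.cpp invocation (plain text, no timestamps)."""
    return ["whisper-cli", "-m", model, "-f", wav_path, "--no-timestamps"]

def transcribe(wav_path: str) -> str:
    """Run local transcription on a recorded WAV file."""
    result = subprocess.run(whisper_cmd(wav_path),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def inject_text(text: str) -> None:
    """Type the transcript into the focused app via macOS System Events.

    json.dumps produces a double-quoted string, which is what
    AppleScript's `keystroke` expects.
    """
    script = f'tell application "System Events" to keystroke {json.dumps(text)}'
    subprocess.run(["osascript", "-e", script], check=True)
```

The capture step itself (recording microphone audio to a WAV file) is left to whatever audio tool you prefer; the pipeline only cares about the resulting file.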
Whisper Model Selection
- whisper-tiny - Fastest, lowest accuracy. Good enough for short commands
- whisper-base/small - Best balance of speed and accuracy for voice typing
- whisper-medium - Noticeably better accuracy, but 2-3x slower
- whisper-large-v3 - Best accuracy, but requires significant compute and adds latency
For real-time voice typing, whisper-small or whisper-base running on Apple Silicon with whisper.cpp gives sub-second latency with good accuracy.
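As a rough rule of thumb, the guidance above can be encoded in a small helper. The file names follow whisper.cpp's ggml naming convention, but the mapping from use case to model is an assumption to tune against your own hardware, not a benchmark.

```python
# Map use cases to whisper.cpp model files. The trade-offs encoded
# here restate the rough guidance above, not measured numbers.
MODEL_FILES = {
    "tiny": "ggml-tiny.bin",
    "base": "ggml-base.bin",
    "small": "ggml-small.bin",
    "medium": "ggml-medium.bin",
    "large": "ggml-large-v3.bin",
}

def pick_model(realtime: bool, accuracy_first: bool) -> str:
    """Favor base/small for live typing; medium or large-v3 when
    accuracy matters more than latency."""
    if realtime:
        return MODEL_FILES["small" if accuracy_first else "base"]
    return MODEL_FILES["large" if accuracy_first else "medium"]
```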
Homelab vs Local Machine
Running Whisper on a dedicated homelab server has advantages:
- Frees your Mac's compute for other tasks
- A machine with a decent GPU can run whisper-medium with low latency
- Multiple devices can share the same transcription server
But for most users, running whisper.cpp directly on an M-series Mac is simpler and fast enough.
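If you do go the homelab route, whisper.cpp ships an HTTP server example, and the client side is a simple upload. The sketch below assumes that server is listening on a host called `homelab.local` and exposes an `/inference` endpoint that accepts a multipart `file` field; endpoint and field names may differ across whisper.cpp versions, so check the version you deploy.

```python
# Send a WAV file to a remote whisper.cpp server for transcription.
# Uses only the standard library (hand-rolled multipart body).
import urllib.request
import uuid

def encode_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.
    Returns (body, content_type_header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"

def transcribe_remote(wav_path: str,
                      host: str = "http://homelab.local:8080") -> str:
    """POST audio to the whisper.cpp server and return its response."""
    with open(wav_path, "rb") as f:
        body, ctype = encode_multipart("file", "audio.wav", f.read())
    req = urllib.request.Request(f"{host}/inference", data=body,
                                 headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```

Because the client is plain HTTP, any device on your network (Mac, phone via Shortcuts, another server) can share the same transcription endpoint.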
Connecting to AI Agents
The real power comes from connecting voice transcription to AI agent workflows. Instead of typing commands to your AI agent, speak them:
- "Open the last three emails and summarize them"
- "Run the test suite and fix any failures"
- "Schedule a meeting with the design team for Thursday"
Your voice gets transcribed locally by Whisper, then fed to the agent as a text command. The entire pipeline stays on your hardware - no cloud service sees your audio or your commands.
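Bridging transcription to an agent is mostly plumbing: clean up the transcript (Whisper sometimes emits bracketed non-speech markers like `[BLANK_AUDIO]`), then hand it to whatever text interface the agent exposes. The `Agent.send` interface below is hypothetical, a stand-in for your agent's actual input channel.

```python
import re

def clean_transcript(raw: str) -> str:
    """Drop bracketed non-speech markers (e.g. [BLANK_AUDIO]) and
    collapse whitespace before handing text to the agent."""
    text = re.sub(r"\[[A-Z_ ]+\]", "", raw)
    return " ".join(text.split())

class Agent:
    """Hypothetical stand-in for an AI agent's text input interface."""
    def send(self, command: str) -> None:
        print(f"agent <- {command!r}")

def dispatch(raw_transcript: str, agent: Agent) -> None:
    """Forward a cleaned transcript to the agent, skipping silence."""
    command = clean_transcript(raw_transcript)
    if command:  # ignore empty / silence-only recordings
        agent.send(command)
```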
Getting Started
Install whisper.cpp, grab a model, and test with a simple audio file. Once transcription works, add a keyboard shortcut to start and stop recording. The whole setup takes an afternoon.
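The start/stop shortcut amounts to a tiny state machine. A sketch, assuming `toggle()` is bound to a global hotkey and `feed()` is called from your audio capture callback; the hotkey binding and capture mechanism themselves are left to your tool of choice.

```python
class PushToTalk:
    """Toggle-based recording state: first hotkey press starts
    accumulating audio chunks, second press stops and returns them."""

    def __init__(self) -> None:
        self.recording = False
        self.chunks: list[bytes] = []

    def toggle(self) -> bool:
        """Flip recording state; returns True while recording."""
        self.recording = not self.recording
        return self.recording

    def feed(self, chunk: bytes) -> None:
        """Called from the capture callback; buffers only while active."""
        if self.recording:
            self.chunks.append(chunk)

    def take(self) -> bytes:
        """Drain buffered audio (write it to a WAV and transcribe)."""
        audio, self.chunks = b"".join(self.chunks), []
        return audio
```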
Fazm is an open source macOS AI agent, available on GitHub.