Building a Live Streaming Voice Flow with Push-to-Talk on macOS
The best AI interface is one that stays invisible until you need it. A floating control bar on macOS with push-to-talk AI chat achieves exactly that - hold a key, speak your command, release, and the agent executes. No window switching. No typing. No breaking your flow.
Why Push-to-Talk Beats Always-Listening
Always-listening voice assistants have a fundamental problem - they activate when you do not want them to and miss activation when you do. Push-to-talk eliminates both issues. You control exactly when the agent is listening, which means zero false activations and zero missed commands.
For live streaming and creative work, this is critical. You do not want your AI agent responding to something you said to your audience. You want it responding only when you deliberately hold the hotkey and address it directly.
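One way to implement the hold-to-talk gesture is to watch global modifier-flag events and treat a single modifier (here, right Option) as the push-to-talk key. This is a minimal sketch, not Fazm's actual implementation: the key choice, the `startListening`/`stopListening` hooks, and the class name are all assumptions, and monitoring events in other applications requires the Accessibility permission.

```swift
import AppKit

// Sketch: treat the right Option key as the hold-to-talk key by watching
// modifier-flag changes globally. Requires the Accessibility permission.
// startListening/stopListening are hypothetical hooks into the capture pipeline.
final class PushToTalkMonitor {
    private var monitor: Any?
    private var isHeld = false

    var startListening: () -> Void = {}
    var stopListening: () -> Void = {}

    func activate() {
        monitor = NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { [weak self] event in
            guard let self, event.keyCode == 61 else { return } // 61 = right Option
            let held = event.modifierFlags.contains(.option)
            if held != self.isHeld {
                self.isHeld = held
                held ? self.startListening() : self.stopListening()
            }
        }
    }

    func deactivate() {
        if let monitor { NSEvent.removeMonitor(monitor) }
        monitor = nil
    }
}
```

Using a lone modifier key avoids clashing with application shortcuts, which is why it is a common choice for push-to-talk.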
The Floating Control Bar
A minimal floating toolbar sits at the edge of your screen. It shows the current state - idle, listening, processing, executing. When you hold the push-to-talk key, the bar expands slightly to show a waveform visualization. Release the key and it shrinks back to minimal.
The bar persists across all spaces and desktops using NSPanel with the appropriate window level. It never steals focus from your current application. It never appears in the app switcher. It is always there when you need it and invisible when you do not.
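The window behavior described above maps onto a handful of NSPanel properties. The following sketch shows one plausible configuration - the dimensions and styling are placeholders, and hiding from the Dock and app switcher additionally relies on the app running as an agent (LSUIElement):

```swift
import AppKit

// Sketch of the floating control bar: a non-activating, borderless panel that
// floats above normal windows, joins every Space, and never steals focus.
func makeControlBar() -> NSPanel {
    let panel = NSPanel(
        contentRect: NSRect(x: 0, y: 0, width: 280, height: 36),
        styleMask: [.nonactivatingPanel, .borderless],
        backing: .buffered,
        defer: false
    )
    panel.level = .floating                      // stay above normal windows
    panel.collectionBehavior = [.canJoinAllSpaces, .fullScreenAuxiliary]
    panel.hidesOnDeactivate = false              // persist while other apps are active
    panel.isFloatingPanel = true
    panel.isOpaque = false
    panel.backgroundColor = .clear
    return panel
}
```

The `.nonactivatingPanel` style mask is what keeps clicks on the bar from pulling focus away from the frontmost application.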
Streaming Voice Processing
The voice data streams to a local Whisper model as you speak, providing near-real-time transcription. By the time you release the push-to-talk key, the transcription is already complete. The agent processes the command immediately - no upload round trip, no post-release transcription delay.
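Streaming capture like this is typically built on an AVAudioEngine input tap that converts microphone buffers to the 16 kHz mono float format Whisper expects and hands chunks onward as they arrive. This is a sketch under assumptions: the `feed` callback stands in for a hypothetical wrapper around a local Whisper implementation (e.g. whisper.cpp bindings), and buffer sizes are illustrative.

```swift
import AVFoundation

// Sketch: tap the microphone, resample to 16 kHz mono Float32 (Whisper's
// expected input), and stream chunks to a caller-supplied transcriber hook.
final class StreamingCapture {
    private let engine = AVAudioEngine()

    func start(feed: @escaping ([Float]) -> Void) throws {
        let input = engine.inputNode
        let inputFormat = input.outputFormat(forBus: 0)
        let whisperFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                          sampleRate: 16_000,
                                          channels: 1,
                                          interleaved: false)!
        let converter = AVAudioConverter(from: inputFormat, to: whisperFormat)!

        input.installTap(onBus: 0, bufferSize: 4096, format: inputFormat) { buffer, _ in
            let ratio = whisperFormat.sampleRate / inputFormat.sampleRate
            let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio)
            guard let out = AVAudioPCMBuffer(pcmFormat: whisperFormat,
                                             frameCapacity: capacity) else { return }
            var consumed = false
            converter.convert(to: out, error: nil) { _, status in
                if consumed { status.pointee = .noDataNow; return nil }
                consumed = true
                status.pointee = .haveData
                return buffer
            }
            if let data = out.floatChannelData {
                feed(Array(UnsafeBufferPointer(start: data[0], count: Int(out.frameLength))))
            }
        }
        engine.prepare()
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```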
Local processing also means your voice data never leaves your machine. For live streamers who discuss sensitive topics or handle private information, this privacy guarantee matters.
Connecting to Agent Actions
The transcribed command feeds directly into the desktop agent's action pipeline. "Open the project in Xcode" triggers accessibility API interactions. "Send that screenshot to the design channel" chains together screenshot capture, Slack navigation, and file upload. The voice interface is just the input method - the agent handles the rest.
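The handoff from transcript to action can be pictured as a small routing layer. The sketch below is purely illustrative - a real agent would use the model for intent parsing rather than keyword matching, and the `AgentAction` cases and `route` function are hypothetical names:

```swift
import Foundation

// Hypothetical sketch of routing a transcribed command to an agent action.
// A keyword router stands in for real intent parsing to show the shape of
// the pipeline between transcription and execution.
enum AgentAction: Equatable {
    case openApp(String)
    case screenshotToChannel(String)
    case unrecognized
}

func route(_ transcript: String) -> AgentAction {
    let text = transcript.lowercased()
    if text.contains("open"), text.contains("xcode") {
        return .openApp("Xcode")
    }
    if text.contains("screenshot"), let range = text.range(of: "to the ") {
        // Take the first word after "to the" as the channel name.
        let rest = text[range.upperBound...]
        let channel = rest.split(separator: " ").first.map(String.init) ?? ""
        return .screenshotToChannel(channel)
    }
    return .unrecognized
}
```

Each resolved action then dispatches into the accessibility-API layer that actually drives the target application.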
Fazm is an open source macOS AI agent, available on GitHub.