Voice AI Latency Matters More Than Accuracy: On-Device WhisperKit Benchmarks
The biggest mistake in voice AI is treating it like a model selection problem. Most teams spend weeks A/B testing transcription accuracy between providers when the real killer is latency. Users do not notice 2-3% WER differences. They absolutely notice a 500ms delay.
Cloud STT vs On-Device: Real Numbers
When we switched from cloud-based speech-to-text to on-device WhisperKit for the voice-controlled desktop agent in Fazm, the accuracy improvement was marginal. What changed dramatically was the feel of every interaction.
With cloud STT, every request had a network round trip, queue time, and variable processing delay. On a good connection: 800-1200ms from end of speech to transcription. On a congested network or API under load: 2+ seconds. Users spoke, then waited, then spoke again. The interaction felt like a turn-based game with unpredictable response times.
With on-device processing on Apple Silicon, WhisperKit runs on the Neural Engine and achieves a mean latency of approximately 0.46 seconds - matching the fastest cloud providers at their best while also being the most accurate option (2.2% WER on standard benchmarks). That number comes from the WhisperKit paper presented at ICML 2025, which benchmarked it against OpenAI's gpt-4o-transcribe, Deepgram nova-3, and Fireworks large-v3-turbo. WhisperKit and Fireworks tied on latency. WhisperKit had the best accuracy of the group.
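For reference, the integration surface is small. The sketch below follows the pattern in WhisperKit's README; exact signatures vary across versions, and the model name and audio path here are placeholders:

import WhisperKit

// Minimal on-device transcription sketch. API shape per WhisperKit's
// README; exact signatures vary by version, and "command.wav" is a
// placeholder path.
Task {
    // First use downloads and compiles the model; it stays warm afterward.
    let pipe = try await WhisperKit(model: "base")
    let results = try await pipe.transcribe(audioPath: "command.wav")
    print(results.map(\.text).joined(separator: " "))
}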
For comparison, Deepgram achieves 0.83 seconds - still fast, but nearly double the latency, and it requires a network call.
Why Sub-500ms Changes User Behavior
The perceptual threshold for "instant" response is around 100-200ms. Voice interactions below 500ms feel natural. Above 500ms they feel like a tool you are waiting on.
When we dropped from ~1000ms average to ~460ms, users stopped pausing after they finished speaking. They started finishing sentences and immediately receiving the response, which changed the rhythm of interaction from "speak, wait, receive" to something closer to a real conversation.
That is not a 2x improvement - it is a categorical shift. The same agent with the same model produced completely different user behavior just from the latency change.
The Technical Tradeoff: What You Give Up
On-device processing is not free. WhisperKit requires downloading model weights (the base model is about 145MB, large-v3 is 1.5GB) and consumes memory on device. On older Apple Silicon (M1, M2), the large model takes 1-2 seconds to load from cold start, though it stays warm in memory between sessions.
The tradeoff between model size and accuracy depends on the workload. For desktop agent commands - "open this file", "search for X", "close this window" - the base model is accurate enough. For long-form dictation or domain-specific vocabulary, large-v3 is noticeably better.
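In practice that can be as simple as picking the model name by workload. A hypothetical sketch - the enum and helper are illustrative, the model names and sizes come from above:

// Hypothetical workload-based model selection; names are illustrative.
enum VoiceWorkload {
    case command    // short imperative phrases, small vocabulary
    case dictation  // long-form text, domain-specific vocabulary
}

func whisperModelName(for workload: VoiceWorkload) -> String {
    switch workload {
    case .command:   return "base"     // ~145MB, fastest to load and run
    case .dictation: return "large-v3" // ~1.5GB, noticeably better accuracy
    }
}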
Interruption Handling: The Next Hard Problem
Cutting latency revealed the next challenge - interruption handling. When someone starts talking mid-response, the agent needs to distinguish intentional speech from background noise, cut off the text-to-speech output, and resume without losing context.
This is harder than it sounds:
// Simplified voice activity detection for interruption
class InterruptionDetector {
    private let energyThreshold: Float = 0.02
    private let minSpeechFrames = 5 // ~100ms of continuous speech at 20ms frames
    private var consecutiveSpeechFrames = 0

    func shouldInterrupt(audioBuffer: [Float], isTTSActive: Bool) -> Bool {
        // Only consider interruptions while TTS is actually speaking.
        guard isTTSActive else {
            consecutiveSpeechFrames = 0
            return false
        }
        // Mean energy of the frame: sum of squared samples / frame length.
        let energy = audioBuffer.map { $0 * $0 }.reduce(0, +) / Float(audioBuffer.count)
        consecutiveSpeechFrames = energy > energyThreshold ? consecutiveSpeechFrames + 1 : 0
        return consecutiveSpeechFrames >= minSpeechFrames
    }
}
The threshold matters a lot here. Set it too low and background noise triggers constant interruptions. Set it too high and intentional speech gets ignored. We ended up with a two-stage approach: a fast energy-based check for initial detection, followed by a short WhisperKit pass to confirm actual speech content before cutting the TTS stream.
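A minimal sketch of that two-stage gate, reusing the InterruptionDetector above. The confirmSpeech closure is an assumed hook, not Fazm's actual implementation - in practice it would run a short WhisperKit pass over recently buffered audio and report whether real words came back:

// Two-stage interruption gate: cheap energy check first, recognizer second.
// `confirmSpeech` is an assumed hook wrapping a short transcription pass.
final class TwoStageInterruptionGate {
    private let detector = InterruptionDetector()
    private let confirmSpeech: ([Float]) async -> Bool

    init(confirmSpeech: @escaping ([Float]) async -> Bool) {
        self.confirmSpeech = confirmSpeech
    }

    func shouldCutTTS(frame: [Float], recentAudio: [Float], isTTSActive: Bool) async -> Bool {
        // Stage 1: fast energy check, runs on every audio frame.
        guard detector.shouldInterrupt(audioBuffer: frame, isTTSActive: isTTSActive) else {
            return false
        }
        // Stage 2: confirm actual speech content before cutting the stream.
        return await confirmSpeech(recentAudio)
    }
}

The key property is that the expensive check only runs after the cheap one fires, so the per-frame cost stays near zero while false positives from background noise still get filtered out.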
Why This Matters for Desktop Agents Specifically
Voice-controlled desktop agents have a different latency budget than voice assistants or transcription tools. When you are navigating a file manager or controlling a video editor with voice, you need to see the result before issuing the next command. The feedback loop is visual - you watch the action happen, then speak the next command.
At 1000ms latency, that loop feels sluggish. At 460ms, it feels native. The agent stops feeling like an assistant you are directing and starts feeling like an extension of your intent.
For any voice interface where the user is watching the result in real time, optimize latency first. Accuracy second. Network-free third. The WhisperKit benchmark numbers make that choice straightforward on Apple Silicon.
Fazm is an open source macOS AI agent, available on GitHub.