The Biggest Problem Nobody Talks About in Voice AI - Latency
Everyone obsesses over model accuracy in voice AI. Better transcription. Smarter responses. More natural-sounding speech. But the thing that actually kills voice interactions is none of that - it is latency.
Why Latency Breaks Everything
Humans are incredibly sensitive to conversational timing. A 200ms pause feels natural. A 2-second pause feels broken. Most voice AI pipelines chain together speech-to-text, LLM inference, and text-to-speech - each adding hundreds of milliseconds. By the time the system responds, the user has already checked out.
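To see how those stages stack up, here is a minimal sketch of a sequential latency budget. The per-stage numbers are illustrative assumptions, not measurements - real values depend on the models and hardware involved.

```python
# Illustrative per-stage latencies (ms) for a chained voice pipeline.
# These are assumed example values, not benchmarks.
STAGES_MS = {
    "speech_to_text": 300,
    "llm_inference": 1500,
    "text_to_speech": 400,
}

def total_wait_ms(stages: dict) -> int:
    """In a strictly sequential pipeline, the user waits for every stage."""
    return sum(stages.values())

print(total_wait_ms(STAGES_MS))  # 2200 ms -- well past the ~2-second "feels broken" mark
```

Even with generous per-stage numbers, a strictly sequential chain lands on the wrong side of the threshold where a pause stops feeling natural.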
The irony is that a slightly less accurate response delivered instantly feels far better than a perfect response delivered late. Users will forgive small mistakes. They will not forgive awkward silence.
Filler Responses Are Not a Hack
Some teams treat filler responses - "let me think about that" or "great question" - as a band-aid. They are actually a critical UX pattern. Humans use fillers constantly in conversation. An AI that says "hmm, one moment" while processing feels more natural than one that goes silent for three seconds.
The best implementations layer fillers dynamically based on expected processing time. Short queries get instant responses. Complex ones get a brief acknowledgment followed by the real answer.
Streaming TTS Changes the Game
Streaming text-to-speech is the single biggest improvement you can make to voice AI latency. Instead of waiting for the full response to generate before speaking, streaming TTS begins audio output as soon as the first tokens arrive from the LLM.
This can cut perceived latency by 70-80%. The user hears the beginning of the answer while the rest is still being generated. Combined with speculative generation and smart chunking, you can get response times under 500ms consistently.
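One simple form of the chunking mentioned above is sentence-boundary chunking: hand each complete sentence to TTS the moment it closes, rather than waiting for the whole response. This sketch assumes a token iterator from the LLM; `print` stands in for the actual TTS call.

```python
def chunk_sentences(token_stream):
    """Yield speakable chunks as soon as a sentence boundary appears,
    instead of waiting for the full LLM response."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Naive boundary check; production systems use smarter segmentation.
        if buffer.rstrip().endswith((".", "?", "!")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Simulated token stream from an LLM.
tokens = ["Sure", ",", " streaming", " starts", " now", ".", " More", " follows", "."]
for chunk in chunk_sentences(tokens):
    print(chunk)  # each chunk would be handed to streaming TTS immediately
```

Because the first chunk is ready after only a handful of tokens, audio playback can begin while the remainder of the response is still streaming in.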
The Real Bottleneck
Model accuracy improvements have diminishing returns for most voice applications. Latency improvements have compounding returns - faster responses lead to more natural conversations, which lead to higher engagement, which leads to more data, which leads to better models.
If you are building voice AI and spending 90% of your effort on model quality, you are probably optimizing the wrong thing.
Fazm is an open-source macOS AI agent, available on GitHub.