Voice AI Engineering Guide

Voice AI Latency: How to Get Conversational Agents Under 500ms

The difference between a useful voice AI and an awkward one is latency. Anything over 500ms of dead air between a user finishing their sentence and the AI responding triggers a feedback loop - people start talking over the AI, which creates overlapping audio, which breaks transcription. This guide covers the engineering techniques that get response times under 500ms.

1. Why 500ms Is the Threshold

In natural conversation, the average gap between speakers is 200-300ms. People are remarkably sensitive to delays. At 300ms, a voice AI feels responsive. At 500ms, it feels like a phone call with slight lag. At 1 second, users start checking if the connection dropped.

The critical failure mode is the overlap loop. When the AI takes too long to respond, users interpret silence as the AI not hearing them and repeat their question. The AI then receives two overlapping inputs, produces confused output, and the conversation degrades rapidly.

User perception by latency:

- Under 300ms: feels natural, user unaware of AI processing
- 300-500ms: acceptable, slight awareness of delay
- 500ms-1s: noticeable lag, users start to compensate
- Over 1s: users talk over the AI, conversation breaks down

2. The Voice AI Pipeline (And Where Latency Hides)

A voice AI interaction has four stages: voice activity detection (noticing the user has finished speaking), speech-to-text transcription, LLM inference, and text-to-speech synthesis. A naive implementation runs them strictly back to back:

| Stage | Naive Latency | Optimized |
|---|---|---|
| Voice Activity Detection | 200-500ms | 50-100ms |
| Speech-to-Text | 500-1500ms | 100-300ms |
| LLM Inference | 500-2000ms | 200-500ms |
| Text-to-Speech | 300-800ms | 50-150ms |
| **Total** | 1.5-4.8s | 400-1050ms |

The key to hitting sub-500ms is not making each step faster in isolation - it is overlapping them so they run in parallel.
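To make the effect of overlapping concrete, here is a small time-to-first-audio model. The stage timings are illustrative midpoints, not measurements, and the stage names are ours:

```python
# Illustrative time-to-first-audio model: the same stages run back to
# back (naive) vs. overlapped (pipelined). Numbers are representative.
STAGES_MS = {
    "vad": 50,               # end-of-speech detection
    "asr_full": 200,         # full-utterance transcription
    "asr_tail": 100,         # transcription left over after end of speech
    "llm_first_token": 200,  # time to first LLM token
    "tts_first_audio": 100,  # time to first TTS audio chunk
}

def sequential_ms(s: dict) -> int:
    """Naive: each stage waits for the previous one to finish completely."""
    return s["vad"] + s["asr_full"] + s["llm_first_token"] + s["tts_first_audio"]

def pipelined_ms(s: dict) -> int:
    """Overlapped: streaming ASR runs while the user speaks, so only its
    tail after end-of-speech sits on the critical path; TTS starts on
    the first tokens rather than on the full response."""
    return s["vad"] + s["asr_tail"] + s["llm_first_token"] + s["tts_first_audio"]
```

With these numbers, the sequential path lands at 550ms and the pipelined path at 450ms; with the naive-column latencies from the table the gap is far larger.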

3. Streaming TTS: Start Speaking Before Thinking Finishes

Streaming TTS is the single biggest latency win. Instead of waiting for the LLM to generate the complete response, you start converting text to speech as soon as the first tokens arrive.

Most modern TTS APIs support streaming input. You feed them tokens as they arrive from the LLM and they produce audio chunks that start playing immediately. The user hears the AI start speaking within 200-300ms of the LLM starting to generate, even if the full response takes 2 seconds to complete.

The tradeoff is that you cannot post-process the full response before speaking. If the LLM starts with something you want to filter, it has already been spoken. Design your prompts to front-load the useful content.
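A common way to implement this is sentence-chunked streaming: buffer LLM tokens and flush each complete sentence to the TTS engine the moment it forms. This is a minimal sketch; `speak_chunk` stands in for whatever streaming TTS client you use:

```python
import re

# A sentence ends at ., !, or ? followed by whitespace.
SENTENCE_END = re.compile(r"([.!?])\s")

def stream_to_tts(token_iter, speak_chunk):
    """Feed LLM tokens to TTS sentence by sentence, instead of waiting
    for the full response to finish generating."""
    buf = ""
    for token in token_iter:
        buf += token
        # Flush every complete sentence currently in the buffer.
        while (m := SENTENCE_END.search(buf)):
            sentence, buf = buf[:m.end(1)], buf[m.end():]
            speak_chunk(sentence.strip())
    if buf.strip():  # flush whatever remains when generation ends
        speak_chunk(buf.strip())
```

Sentence boundaries are a reasonable chunking unit because most TTS voices need a full clause to get prosody right; chunking on every token produces choppy audio.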

4. Overlapping Transcription with LLM Inference

The second major optimization is starting the LLM call before transcription is fully complete. As the speech-to-text model produces partial transcripts, you can speculatively begin LLM inference on the partial text, cancelling and restarting the call if later words change the meaning.

In practice, this means using streaming ASR (automatic speech recognition) that emits words as they are recognized rather than waiting for the full utterance. The LLM receives the partial transcript and can start generating a response based on the first few words, refining as more context arrives.

This technique is more complex to implement correctly. You need to handle transcript corrections (when early words get revised as more audio arrives) and avoid committing to a response too early. Most production systems use a confidence threshold before triggering the LLM.
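One simple stand-in for a confidence gate is transcript stability: only trust the word prefix that the last few partial transcripts agree on, since streaming ASR tends to revise recent words but leave earlier ones alone. This is a sketch under that assumption, not a specific ASR vendor's API:

```python
def stable_prefix(partials: list, min_agree: int = 2) -> str:
    """Return the longest word prefix that the last `min_agree` partial
    transcripts agree on -- a crude proxy for ASR confidence."""
    if len(partials) < min_agree:
        return ""
    recent = [p.split() for p in partials[-min_agree:]]
    prefix = []
    # zip stops at the shortest transcript, so we only compare
    # positions present in every recent partial.
    for words in zip(*recent):
        if len(set(words)) == 1:
            prefix.append(words[0])
        else:
            break  # first disagreement ends the stable region
    return " ".join(prefix)
```

Raising `min_agree` trades latency for stability: the LLM is triggered later, but fewer speculative calls have to be thrown away when early words get revised.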

5. Adaptive Follow-Ups for Variable Session Lengths

Session length variance is a major challenge for voice AI. Some users give three-word answers while others talk for 20 minutes straight. A fixed conversation flow breaks for both extremes.

The solution is adaptive follow-ups. For short responses, the AI probes harder with specific questions to extract useful information. For long responses, it gently redirects back to the topic and summarizes what it heard to confirm understanding.

Adaptive strategies by response type:

- Short (under 10 words): ask a more specific follow-up question
- Medium (10-50 words): acknowledge and continue flow
- Long (over 50 words): summarize, confirm, then redirect
- Off-topic: acknowledge the tangent, gently bring back
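The strategy table above reduces to a small dispatcher. The strategy names and thresholds here are illustrative, and off-topic detection (a classifier or LLM judgment in practice) is taken as an input:

```python
def followup_strategy(reply: str, on_topic: bool = True) -> str:
    """Pick a follow-up strategy from response length and topicality,
    mirroring the short/medium/long/off-topic table."""
    if not on_topic:
        return "acknowledge_and_redirect"
    words = len(reply.split())
    if words < 10:
        return "ask_specific_followup"     # probe harder
    if words <= 50:
        return "acknowledge_and_continue"  # keep the flow going
    return "summarize_confirm_redirect"    # rein in long answers
```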

6. Latency Budget Breakdown

With all optimizations applied, here is a realistic latency budget for a production voice AI system:

| Component | Target | Technique |
|---|---|---|
| End-of-speech detection | 50ms | Local VAD model |
| Transcription tail | 100ms | Streaming ASR, overlap with LLM |
| LLM first token | 200ms | Fast model, warm connection |
| TTS first audio | 100ms | Streaming TTS, pre-warmed |
| **Total to first audio** | ~450ms | Fully pipelined |

7. Tools and Architecture

Building a low-latency voice pipeline requires careful tool selection. The ASR model, LLM, and TTS engine all need to support streaming, and the orchestration layer needs to handle the overlap correctly.

Production stack options:

- ASR: Deepgram (streaming), WhisperKit (on-device), Whisper API (batch)
- LLM: Claude (streaming), GPT-4o (streaming), local models via Ollama
- TTS: ElevenLabs (streaming), OpenAI TTS, on-device synthesis
- VAD: Silero VAD (local, sub-10ms), WebRTC VAD
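Whichever VAD you pick, the orchestration layer wraps it in the same hang-over logic: declare end-of-speech only after several consecutive silent frames, so mid-sentence pauses do not cut the user off. As an illustration only, here is that logic with a naive RMS energy check over 16-bit PCM frames; a production system would replace `rms` with a trained model such as Silero VAD, and the threshold and frame counts are made-up values:

```python
import math
import struct

def rms(frame: bytes) -> float:
    """Root-mean-square energy of a frame of 16-bit little-endian PCM."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    return math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))

def end_of_speech(frames, threshold: float = 500.0, silence_frames: int = 3):
    """Return the index of the frame where end-of-speech is declared:
    the first frame completing `silence_frames` consecutive frames
    below `threshold`. Returns None if speech never ends."""
    quiet = 0
    for i, frame in enumerate(frames):
        quiet = quiet + 1 if rms(frame) < threshold else 0
        if quiet >= silence_frames:
            return i
    return None
```

With 20ms frames, `silence_frames=3` declares end-of-speech after 60ms of silence; tuning this value is exactly the 50-100ms VAD line item in the budget table.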

For desktop voice agents that need to control your computer while conversing, tools like Fazm combine voice input with desktop automation - you speak a command and the agent executes it by interacting with applications via accessibility APIs. This adds the action layer on top of the conversation pipeline.

Want a voice AI agent that actually does things on your computer? Try a desktop agent with voice-first control.

Try Fazm Free