The Latency Budget: How to Build a Voice AI Agent That Replies in Under a Second

Most voice AI demos sound great. Then they hit a real phone line, and the magic evaporates — the caller says something, and there’s a beat of silence that’s just a touch too long. They start talking again. The agent talks over them. The conversation falls apart.

That beat of silence is a latency problem, and it’s the single thing that separates a voice agent that feels human from one that feels like a frustrating phone tree. Research is blunt about the threshold: pauses as short as 300 milliseconds start to feel unnatural, and anything past ~1.5 seconds degrades the experience fast. The target you’re actually shooting for — end of the caller’s speech to the first audio of your reply — is under one second, every time, not on average.

Here’s how that second gets spent, and where you claw it back.

The latency budget

Total response latency is a sum, not a single number:

total = ASR + LLM + TTS + network + processing

A typical breakdown, with each stage ranging from its optimized good case to its unoptimized bad case, looks roughly like this:

Speech-to-text (ASR): 100–500 ms
LLM (first token): 350 ms for an optimized small model, 1000 ms+ for a frontier model
Text-to-speech (time to first audio): 75–200 ms with modern streaming TTS
Network + processing: 100–150 ms

Add the bad-case numbers and you’re at roughly 1.85 seconds, close enough to two to break a conversation. Add the good-case numbers (about 625 ms), stream every stage, and you’re comfortably under one. The entire engineering problem is moving each component from its bad case to its good case.
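To make the arithmetic concrete, here is a minimal sketch that sums the good-case and bad-case columns of the breakdown above. The stage names and the 1000 ms target constant are just labels chosen for this example; the numbers are the ones listed in the breakdown.

```python
# Per-stage latency in milliseconds, copied from the breakdown above.
# Each entry is (good_case_ms, bad_case_ms).
BUDGET_MS = {
    "asr": (100, 500),
    "llm_first_token": (350, 1000),
    "tts_first_audio": (75, 200),
    "network_processing": (100, 150),
}

TARGET_MS = 1000  # end of caller speech to first audio of the reply


def total_latency(budget: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the good-case and bad-case columns of the budget."""
    good = sum(lo for lo, _ in budget.values())
    bad = sum(hi for _, hi in budget.values())
    return good, bad


good_ms, bad_ms = total_latency(BUDGET_MS)
print(f"good case: {good_ms} ms, headroom {TARGET_MS - good_ms} ms")
print(f"bad case:  {bad_ms} ms, over budget by {bad_ms - TARGET_MS} ms")
```

Keeping the budget as data like this makes it easy to swap in your own measured numbers per stage and see which one is eating the headroom.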

Where the milliseconds actually go

1. Don’t wait for end-of-speech to start thinking

The biggest mistake is treating the pipeline as sequential: wait for the caller to finish, then transcribe, then prompt the LLM, then synthesize. Every stage should be streaming and overlapping. Your ASR should emit partial transcripts as the caller is still speaking, so by the time they stop, you already have a near-final transcript and can fire the LLM almost immediately.
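As a rough sketch of that overlap, the example below simulates a streaming ASR that emits a growing transcript while the caller is still talking, then prompts the LLM the moment the stream ends. Both stream_partials and fake_llm are hypothetical stand-ins, not any real provider's API, and the sleeps only mark where real I/O would happen.

```python
import asyncio
from typing import AsyncIterator


async def stream_partials(words: list[str]) -> AsyncIterator[str]:
    """Hypothetical streaming ASR: yields the transcript so far after each word."""
    heard: list[str] = []
    for word in words:
        await asyncio.sleep(0)  # stand-in for waiting on the next audio frame
        heard.append(word)
        yield " ".join(heard)


async def fake_llm(prompt: str) -> str:
    """Hypothetical LLM call; a real one would stream tokens back."""
    await asyncio.sleep(0)
    return f"reply to: {prompt}"


async def handle_turn(words: list[str]) -> tuple[str, list[str]]:
    events: list[str] = []
    transcript = ""
    # Transcription happens while the caller is still talking.
    async for partial in stream_partials(words):
        transcript = partial
        events.append(f"partial: {partial}")
    # Caller stopped: the transcript already exists, so prompt the LLM
    # right away instead of starting a separate transcription pass.
    events.append("llm_start")
    reply = await fake_llm(transcript)
    return reply, events


reply, events = asyncio.run(handle_turn(["cancel", "my", "order"]))
print(events)
print(reply)
```

The event log shows the point: every partial arrives before llm_start, so nothing is left to transcribe once the caller goes quiet.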

2. Endpointing is the hidden tax

“End of speech” is a guess, not a fact. Voice-activity detection has to decide the caller is done, not just pausing for breath. Tune it too aggressive and you cut people off mid-sentence; too conservative and you add 500 ms of dead air to every single turn. This is where a lot of “the LLM is slow” complaints actually live — it’s not the model, it’s the endpointer waiting too long before the model is even allowed to start.
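Here is a minimal sketch of that trade-off, assuming the simplest possible endpointer: declare the turn over after a fixed run of silent frames. The 20 ms frame size and the two silence thresholds are assumptions for illustration, and production endpointers typically use more signal than a plain silence timer.

```python
from typing import Iterable, Optional

FRAME_MS = 20  # assumed VAD frame size


def end_of_speech_ms(vad_frames: Iterable[bool], silence_ms: int) -> Optional[int]:
    """Return the time (ms) at which the turn is declared over, or None.

    vad_frames holds one boolean per frame: True means speech was detected.
    The turn ends once the caller has spoken at least once and then stayed
    silent for silence_ms in a row.
    """
    needed = silence_ms // FRAME_MS
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (i + 1) * FRAME_MS
    return None


# 400 ms of speech, a 300 ms breath, 400 ms more speech, then silence.
# The caller really finishes at 1100 ms.
frames = [True] * 20 + [False] * 15 + [True] * 20 + [False] * 50

print(end_of_speech_ms(frames, silence_ms=200))  # aggressive: fires during the breath
print(end_of_speech_ms(frames, silence_ms=700))  # conservative: long wait after the real end
```

With the 200 ms threshold the endpointer fires at 600 ms, in the middle of the breath, and cuts the caller off. With 700 ms it waits until 1800 ms, which is 700 ms of dead air after the caller actually finished at 1100 ms.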

3. Match the model to the turn

A frontier model on every turn is the easy way to blow your budget. Most conversational turns — confirmations, simple questions, routing — don’t need it. Route the easy turns to a fast, small model (350 ms to first token) and reserve the heavyweight model for the turns that genuinely require reasoning. The caller can’t tell which model answered; they can absolutely tell when the answer took two seconds.
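One way to sketch that routing is a cheap heuristic that runs before the LLM call. Everything here is an assumption for illustration: the model names, the cue words, and the 12-word cutoff are made up, and a real router might use an intent classifier instead.

```python
import re

FAST_MODEL = "small-fast"   # hypothetical model name
HEAVY_MODEL = "frontier"    # hypothetical model name
REASONING_CUES = {"why", "explain", "compare", "difference"}  # assumed cue words
MAX_SIMPLE_WORDS = 12       # assumed cutoff for a "short" turn


def pick_model(transcript: str) -> str:
    """Send short, simple turns to the fast model and the rest to the heavy one."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if REASONING_CUES.intersection(words):
        return HEAVY_MODEL
    if len(words) <= MAX_SIMPLE_WORDS:
        return FAST_MODEL
    return HEAVY_MODEL


for turn in [
    "Yes, that's right.",
    "What time do you close on Saturday?",
    "Can you explain why my bill doubled this month?",
]:
    print(f"{pick_model(turn):>10}  <-  {turn}")
```

The router itself has to be nearly free. If deciding which model to use costs another 200 ms, you have spent the savings before the fast model even starts.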

4. Speak the first sentence before the last is written

Streaming TTS means you start synthesizing audio from the first chunk of the LLM’s output instead of waiting for the full response. Better still, generate the next utterance while the current one is still playing. On longer replies this cuts perceived latency by 30–50%, because the caller hears you begin almost immediately while the rest of the reply is still being generated.
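A minimal sketch of the chunking step, assuming the LLM streams text tokens and the TTS accepts one sentence at a time. The token list is invented, and splitting on ".", "!" and "?" is a deliberately naive rule that would mis-split abbreviations and decimals.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentences as soon as each one completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment with no end punctuation


llm_tokens = ["Sure", ",", " I", " can", " help", ".",
              " Your", " order", " ships", " Monday", "."]

for sentence in sentences_from_tokens(llm_tokens):
    print("send to TTS:", sentence)
```

Because this is a generator, the first sentence is handed to the TTS as soon as its closing period arrives, before the LLM has produced the second one.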

The part nobody demos: barge-in

Real humans interrupt. A production voice agent has to detect a genuine interruption — the caller starting to talk over you — distinguish it from background noise or a cough, stop its own speech mid-word, and pivot to listening. Get this wrong and the agent steamrolls the caller, which is worse than being slow.

This is also where the multi-vendor “Frankenstack” — one provider for ASR, another for the LLM, another for TTS, a carrier for the telephony — blows up. Each boundary adds network hops and, more importantly, makes coordinated barge-in genuinely hard, because the component that needs to stop (TTS) is three vendors away from the component that detected the interruption (ASR). The fewer boundaries the audio has to cross, the tighter you can close that loop.
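The interruption logic itself can be sketched in a few lines, assuming the agent sees one VAD decision per 20 ms frame while it plays one chunk of reply audio per frame. The frame size and the 200 ms "sustained speech" threshold are assumptions, and the sketch ignores echo cancellation, which a real system needs so the agent does not hear itself.

```python
FRAME_MS = 20        # assumed VAD frame size; one reply chunk is played per frame
MIN_SPEECH_MS = 200  # assumed: unbroken caller speech needed to count as barge-in


def play_with_barge_in(reply_chunks: list[str],
                       vad_frames: list[bool]) -> tuple[list[str], bool]:
    """Play reply chunks until a barge-in is confirmed.

    Returns the chunks actually played and whether the caller interrupted.
    """
    needed = MIN_SPEECH_MS // FRAME_MS
    run = 0
    played: list[str] = []
    for chunk, is_speech in zip(reply_chunks, vad_frames):
        run = run + 1 if is_speech else 0
        if run >= needed:
            return played, True  # stop speaking and switch to listening
        played.append(chunk)
    return played, False


reply = [f"chunk{i}" for i in range(30)]               # 600 ms of reply audio
cough = [False] * 5 + [True] * 3 + [False] * 22        # 60 ms burst: ignored
interruption = [False] * 5 + [True] * 25               # caller keeps talking

print(play_with_barge_in(reply, cough)[1])
played, interrupted = play_with_barge_in(reply, interruption)
print(interrupted, len(played) * FRAME_MS, "ms of reply played before stopping")
```

Notice that the function needs both the VAD stream and control of playback in the same loop. That is the coupling that gets hard when detection and speech live at different vendors.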

Why this matters now

Gartner expects 40% of enterprise applications to integrate task-specific AI agents by the end of 2026, up from under 5% in 2025, and many of those agents will be voice-first. The voice agents that fail will fail for exactly one reason: they couldn’t hold a conversation at human cadence. The latency budget isn’t a nice-to-have optimization. It is the product.

This is the kind of problem I work on day to day — building phone-call AI pipelines that stay under a second in production, not just in the demo. If you’re putting a voice agent in front of real callers and want it to actually feel human, book a free 15-minute call or drop me a line.