Voice agents

LiveKit, Vapi, Pipecat: cold-start latency in production

A patient calls a Dutch dental practice at 09:14. Whether she hangs up depends on the gap between SIP connect and the first syllable of agent voice.

Jacob Molkenboer · Founder · A Brand New Company · 24 Sept 2024 · 6 min

[Image: black bakelite phone receiver off-hook on a leather blotter, green ribbon in a closed ledger, brass stopwatch, red wax seal.]

It is Tuesday, 09:14 in Amersfoort. The dental practice phone rings for the fourteenth time that morning. A patient wants to move a root canal from next Wednesday because her son's school called about a fever. She has about thirty seconds of patience before she puts the phone down and tries the next clinic on her list.

That phone is not picked up by a receptionist. It is picked up by a voice agent we shipped four months ago. The thing we obsessed over for two weeks was a single number: the gap between the moment the SIP trunk answers and the moment the agent's first syllable of Dutch reaches the caller's ear. We call it cold-start. The demos never show it.

The setup we had to land

Eleven hundred inbound calls a week, three clinics under one owner, calls peaking between 08:30 and 10:00 when parents drop kids at school and reach for their phones. Average call length around two minutes. Roughly 60% of calls are reschedules, 25% are new bookings, 10% are price questions, 5% are everything else. The agent has to handle the first three categories end-to-end and route the rest to a human voicemail with context preserved.

The language is Dutch, with a non-trivial slice of calls in English and the occasional Turkish or Arabic patient who switches mid-sentence. The phone trunk is a Belgian VoIP provider over SIP. The clinic's calendar lives in a UK-built dental PMS with a brittle but workable JSON API.

We benchmarked three orchestration layers: LiveKit Agents, Vapi, and Pipecat. Same speech-to-text (Deepgram Nova-2), same model for tool-use turns, same TTS (ElevenLabs Flash v2.5 in Dutch). Only the orchestration layer changed.

Cold-start is not first-token latency

The number every demo flexes is "time to first token" or "first audio frame after the user stops speaking." That is barge-in latency, and it matters. But cold-start is different. It is the time between the SIP INVITE being answered and the agent saying its first word. The patient is already on the line. They are waiting for someone to say something.

If that gap is more than 1.8 seconds, a meaningful fraction of patients say "hallo?" first, which derails a script that assumes the agent leads. If it is under 900ms, the agent feels like a person who picked up promptly. The band between those two numbers is where the system feels broken.
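As a rough illustration, the two thresholds above can be turned into a per-call classifier. This is a minimal sketch, not code from the project: the names `Band` and `classify_cold_start` are hypothetical, and the handling of the exact boundary values (900ms and 1.8s) is an assumption, since the text only says "under" and "more than".

```python
from enum import Enum

PROMPT_MAX_S = 0.9  # under 900 ms: feels like a prompt pickup
DERAIL_MIN_S = 1.8  # over 1.8 s: callers tend to say "hallo?" first


class Band(Enum):
    PROMPT = "prompt"               # agent feels like a person who picked up
    FEELS_BROKEN = "feels_broken"   # the band between the two thresholds
    DERAILED = "derailed"           # caller speaks first, script assumption breaks


def classify_cold_start(gap_s: float) -> Band:
    """Map a pickup-to-first-syllable gap (seconds) to one of three bands."""
    if gap_s < 0:
        raise ValueError("gap must be non-negative")
    if gap_s < PROMPT_MAX_S:
        return Band.PROMPT
    if gap_s <= DERAIL_MIN_S:  # assumption: boundary values fall in the middle band
        return Band.FEELS_BROKEN
    return Band.DERAILED
```

Tagging every call this way makes it possible to count how many callers landed in each band instead of looking at a single average.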

Warning

Latency dashboards on voice platforms almost always start the timer at the first user utterance. That hides the worst latency your callers actually feel: the dead air right after pickup.

LiveKit Agents

LiveKit is the closest thing to a real production stack we found. It ships as a worker model: you run N agent workers, each one holds a warmed pipeline, and the dispatcher hands incoming calls to a free worker. SIP comes in through LiveKit Cloud's SIP bridge or through a self-hosted Kamailio in front of it.

What we measured on the dental account, across 200 calls in a single week:

Cold-start with a warm worker pool of 4: 740ms median, 1.1s p95.
Cold-start when the pool had just scaled from 0 to 1 on a Sunday morning: 4.2s. Unusable.
Barge-in latency: 380ms median.

The fix for the Sunday case was to keep two workers permanently warm even at zero traffic. That costs roughly the price of a beer per day per worker on a small VM, which is fine. The thing nobody warns you about: the SIP bridge adds around 200ms of its own latency on top of whatever your agent does, and you only see it if you tcpdump.
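The "never scale to zero" rule can be written down as a small sizing function. This is a hypothetical sketch and not LiveKit's autoscaling API: `desired_workers`, `min_warm`, and `headroom` are illustrative names, and it assumes one call per worker to keep the arithmetic simple.

```python
def desired_workers(active_calls: int, min_warm: int = 2, headroom: int = 1) -> int:
    """How many workers to keep running right now.

    active_calls: calls currently in progress (assumed one call per worker).
    min_warm:     floor that holds even at zero traffic, e.g. a Sunday morning.
    headroom:     spare warm workers kept ready for the next incoming call.
    """
    if active_calls < 0:
        raise ValueError("active_calls must be non-negative")
    return max(min_warm, active_calls + headroom)


# At zero traffic the pool stays at the floor instead of scaling to 0.
print(desired_workers(0))  # 2
# Under load, the pool keeps one spare worker warm.
print(desired_workers(3))  # 4
```

The point of the floor is that the first caller after a quiet period gets a worker that is already warm, rather than paying for a scale-up from zero.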

Vapi

Vapi is the fastest route to a working demo. We had a Dutch voice agent answering a test number within ninety minutes of signing up. They handle SIP, STT, model orchestration, TTS, and barge-in for you. The price you pay is per-minute, and the abstraction is opinionated in ways you do not always want.

Cold-start measured the same way:

Median: 920ms. Good.
p95: 2.6s. Not good. The long tail comes from cold model containers on their side, which you cannot control or warm.
Barge-in: 320ms median. Excellent.

The other thing that bit us: Vapi's default Dutch TTS voice mispronounces common dental terms ("vulling", "wortelkanaal") in ways that made one elderly patient ask "wat zegt u?" three times in a row. Switching to a custom ElevenLabs voice via their bring-your-own setup fixed it but added 90ms to cold-start.

Pipecat

Pipecat is the most flexible and the most work. It is a Python framework from Daily. You write the pipeline as a graph of frame processors: VAD, STT, LLM, TTS, output. You self-host. You handle SIP yourself, either with Daily's PSTN add-on or with something like Twilio in front of it.

We ran it on a single 2-vCPU VM in Frankfurt because Dutch calls should never cross the Atlantic. Numbers:

Cold-start median: 1.4s. Slower than LiveKit and Vapi.
p95: 1.9s. Tight distribution, which we liked.
Barge-in: 290ms median. Best of the three, because we tuned the VAD frame size down to 10ms.

The reason the median is slower: the Python event loop spends real time on first-session pipeline initialization. You can pre-warm a pipeline per worker, but Pipecat's worker model is less batteries-included than LiveKit's. We wrote about 200 lines of glue to get a warm pool, and we still do not love it.
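For readers wondering what that kind of glue looks like, here is a generic warm-pool sketch in plain asyncio. It is not the code we run and it does not use Pipecat's API: `WarmPool` is a hypothetical class, and `factory` stands in for whatever coroutine builds and initializes one pipeline.

```python
import asyncio
from typing import Any, Awaitable, Callable


class WarmPool:
    """Keeps `size` pre-built pipelines ready so a new call skips initialization."""

    def __init__(self, factory: Callable[[], Awaitable[Any]], size: int) -> None:
        self._factory = factory          # hypothetical: builds one warmed pipeline
        self._size = size
        self._ready = asyncio.Queue()    # pipelines waiting for a call
        self._refills = set()            # keep references so tasks are not garbage-collected

    async def start(self) -> None:
        # Build the whole pool concurrently before accepting any calls.
        pipelines = await asyncio.gather(*(self._factory() for _ in range(self._size)))
        for pipeline in pipelines:
            self._ready.put_nowait(pipeline)

    async def acquire(self) -> Any:
        # Hand out a warm pipeline, then rebuild a replacement in the background.
        pipeline = await self._ready.get()
        task = asyncio.create_task(self._refill())
        self._refills.add(task)
        task.add_done_callback(self._refills.discard)
        return pipeline

    async def _refill(self) -> None:
        self._ready.put_nowait(await self._factory())

    def ready_count(self) -> int:
        return self._ready.qsize()
```

In this sketch a pipeline is used for one call and then discarded, with a fresh one built off the hot path. Reusing pipelines across calls would save work but needs careful state reset, which is a design choice the sketch deliberately avoids.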

What we shipped with

LiveKit, with a permanently-warm pool of three workers in Frankfurt and a fallback worker in Amsterdam. The reasoning:

Cold-start beats Vapi at p95, which is the number patients actually feel.
Self-hosted cost per minute is roughly a third of Vapi at our call volume.
The SIP bridge lets us drop call summaries straight into the clinic's PMS via a small Go service.

Pipecat is genuinely good, and we would ship it for any client who wants tight control over the audio pipeline (custom VAD, in-house TTS, real-time DSP). For a clinic that wants the agent to feel like a fast receptionist and nothing more, LiveKit was the right call.

Numbers we still owe you

There are two numbers we wanted to publish here but will not, because we do not have enough data yet. First, language-switching latency when a patient flips from Dutch to English mid-call. Second, the cost of running a long-context call summary back into the clinic's PMS at the end of each call. We will write both up once we have a full quarter of production data.

When we built the voice agent for that dental group, the hardest thing we ran into was the gap between "demo-fast" and "production-fast." We solved it by paying for warm workers we technically did not need and by benchmarking the p95 of every layer end to end. If you are scoping a voice agent and you want the real numbers, that is the kind of work we do inside our AI agents practice.

The five-minute audit you can run today

Call your own voice agent on its production number. Start a stopwatch the moment you stop hearing the ring. Stop it when you hear the first syllable. Do it ten times across a day, including a Sunday morning. If the p95 of that ten-call sample (with only ten calls, that is effectively your slowest call) is over 1.8 seconds, you have the same problem we did. It is fixable.
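If you want to turn the stopwatch readings into numbers, a nearest-rank percentile is enough for a sample this small. The sketch below is an illustration only: `percentile_nearest_rank` is a hypothetical helper, and the ten readings are invented example values, not measurements from our calls.

```python
import math
import statistics


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Invented stopwatch readings in seconds, including one slow Sunday-morning call.
readings = [0.8, 0.9, 0.7, 1.0, 0.85, 0.95, 0.75, 4.2, 0.9, 0.8]

print("median:", statistics.median(readings))
print("p95:", percentile_nearest_rank(readings, 95))  # 4.2, the single slow call
print("over 1.8s at p95:", percentile_nearest_rank(readings, 95) > 1.8)
```

With ten samples, nearest-rank p95 always lands on the slowest reading, which is why a single bad pickup is enough to fail the audit.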

Key takeaway

Cold-start, not first-token latency, is the number your callers feel. Benchmark pickup-to-first-syllable at p95 and warm your worker pool to fit it.

FAQ

Which stack had the lowest cold-start at p95?

LiveKit Agents, at 1.1s with a warm pool of four workers. Vapi was 2.6s at p95 because of cold containers on their side that we could not pre-warm.

Is Vapi a bad choice for Dutch voice agents?

No, it is the fastest path to a working demo. But you will want bring-your-own TTS for dental or medical vocabulary, and the per-minute cost adds up at over 1,000 calls a week.

When would you ship Pipecat over LiveKit?

When you need control over the audio pipeline itself: custom VAD, in-house TTS, real-time DSP, or non-standard frame processing. For a standard receptionist agent, LiveKit was easier.

What is cold-start latency for a voice agent?

The time between the SIP trunk answering the call and the agent speaking its first syllable. It is different from barge-in latency, which is measured mid-conversation, relative to the caller's speech.

How do you measure it in production?

Tap the SIP layer for the INVITE-accepted timestamp and the audio output for the first non-silent frame. The difference, measured per call and aggregated at p95, is what your callers actually feel.
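A minimal sketch of that calculation, under stated assumptions: the outbound audio is available as timestamped frames of 16-bit PCM in the machine's native byte order, both timestamps share one clock, and "non-silent" means RMS above a fixed threshold. The function names and the threshold of 500 are illustrative, not values from our deployment.

```python
import math
from array import array
from typing import Iterable, Optional, Tuple


def frame_rms(pcm16: bytes) -> float:
    """RMS amplitude of one frame of 16-bit PCM (native byte order assumed)."""
    samples = array("h")
    samples.frombytes(pcm16)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def cold_start_seconds(
    answered_at: float,
    frames: Iterable[Tuple[float, bytes]],
    silence_rms: float = 500.0,  # assumption: tune per codec and TTS voice
) -> Optional[float]:
    """Seconds from INVITE-accepted to the first non-silent outbound frame.

    frames: (timestamp_s, pcm16_bytes) pairs for agent audio, in send order.
    Returns None if the agent never produced audible audio.
    """
    for timestamp, pcm in frames:
        if timestamp >= answered_at and frame_rms(pcm) > silence_rms:
            return timestamp - answered_at
    return None
```

Collect one value per call, then aggregate at p95. Calls that return None are worth counting separately, because they mean the caller heard nothing at all.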
