A voice agent's hard part isn't the model — it's the 200ms turn-taking game and barge-in.
// WHY THIS MATTERS
By 2026 voice AI has crossed from "can transcribe" to "can converse" — but most people treat it as "STT + ChatGPT + TTS bolted together," and the result is an agent that's either sluggish (two seconds of silence before it speaks) or rude (it cuts you off the moment you pause for breath). The engineering difficulty in voice has almost nothing to do with model quality and everything to do with real-time behavior and the turn-taking game: the gap between human conversational turns is only ~200ms, and once end-to-end latency exceeds 800ms people feel "this AI is kind of dumb." This issue covers four engineering decisions that determine how a voice agent feels: choosing cascaded vs end-to-end speech models, why turn detection and barge-in are the hardest part, how to cut the voice-to-voice latency budget, and the watermarking and ethical responsibility that voice cloning brings. We assume you can already drive Whisper and use ElevenLabs — here we only cover how to engineer them into a real-time agent that isn't dumb.
// 01
Cascaded Pipeline vs End-to-End Speech Models: Decide the Route First
Claim: Cascaded (STT→LLM→TTS) keeps full control at the text layer; an end-to-end speech model cuts latency but also cuts your tools, grounding, and auditability.
Background & Principles
A voice agent has two architectural routes, and this is the first — and costliest — decision you make before writing any code:
Cascaded: speech →(STT)→ text →(the LLM you know, with RAG / tool calling / content moderation)→ text →(TTS)→ speech. Every stage is swappable, loggable, evaluable — text is an intermediate representation you can grep. The cost is stacked latency across three serial stages, plus transcription drops tone, emotion, and emphasis.
End-to-End (Speech-to-Speech): the model goes audio-in / audio-out directly. OpenAI's Realtime API and Kyutai's Moshi are representative — Moshi (arXiv 2410.00037) is built on a 7B Temporal Transformer plus the Mimi streaming audio codec, with a theoretical latency of 160ms (~200ms in practice), full-duplex (it can listen and speak at once), and it preserves tone. The cost is that you lose the text intermediate layer: RAG injection, tool calling, compliance review, and conversation auditing all become hard, and when the model says something wrong you can't even reconstruct a text log to review.
The pragmatic conclusion: production systems that need tools, knowledge bases, or auditability default to cascaded; end-to-end pays off only for pure companionship / chitchat / cases that demand high tonal fidelity and need no external knowledge. In 2026 the vast majority of commercial voice agents (support, booking, outbound calling) are still cascaded — precisely because they can't live without tool use and auditing.
Failure mode: getting swept up by "end-to-end has lower latency" and building a support agent that needs to look up orders, query a knowledge base, and log tickets as pure speech-to-speech — only to find you can't insert tool calling, can't do content moderation, and have no conversation transcript when a customer complains. Saving 300ms doesn't buy any of that back.
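To make "text is an intermediate representation you can grep" concrete, here is a minimal sketch of one cascaded turn. The class name `CascadedPipeline` and the three stub stages are hypothetical stand-ins, not any provider's API. The only point is that both text hops are captured before audio goes back out.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CascadedPipeline:
    """STT -> LLM -> TTS, keeping every text hop for audit and evals."""
    stt: Callable[[bytes], str]   # audio -> transcript
    llm: Callable[[str], str]     # transcript -> reply text (tools / RAG / moderation live here)
    tts: Callable[[str], bytes]   # reply text -> audio
    log: List[dict] = field(default_factory=list)

    def run_turn(self, audio_in: bytes) -> bytes:
        transcript = self.stt(audio_in)
        reply = self.llm(transcript)
        # The text layer is what makes the turn loggable and reviewable.
        self.log.append({"user": transcript, "agent": reply})
        return self.tts(reply)


# Stub stages (hypothetical) so the sketch runs without any provider.
pipeline = CascadedPipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: f"Looking up: {text}",
    tts=lambda text: text.encode("utf-8"),
)
audio_out = pipeline.run_turn(b"order 1234")
```

An end-to-end speech model has no equivalent of `pipeline.log` unless the provider exposes a transcript as a side channel.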
// 02
Turn Detection & Barge-in: The Hardest Part of a Voice Agent
Claim: What decides whether a voice agent feels human is "when do you judge the user is done" and "how fast do you shut up when interrupted" — and VAD is bad at both.
Background & Principles
VAD (voice activity detection, e.g. Silero VAD) only judges "is there sound right now," using silence duration to decide turn end (endpointing). The problem: people pause mid-thought ("let me think... how do I put this...") and a VAD-only system jumps in during the pause. The right fix is semantic turn detection: LiveKit's open-source turn-detector uses a small model built on Qwen2.5-0.5B (it runs at low latency on CPU) that judges, from the transcribed text's semantics, "is this utterance grammatically/semantically complete?" It runs in parallel with VAD. VAD handles "is there sound" and triggers barge-in; the semantic model handles "should I take the turn."
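As a rough illustration of where the "silence duration" signal comes from, the sketch below turns per-frame VAD decisions into a trailing-silence counter. `SilenceTracker` and the 30ms frame size are assumptions for illustration. A real VAD such as Silero would supply the `is_speech` booleans.

```python
class SilenceTracker:
    """Tracks trailing silence (ms) from per-frame speech/no-speech decisions."""

    def __init__(self, frame_ms: int = 30):
        self.frame_ms = frame_ms
        self.silence_ms = 0
        self.heard_speech = False

    def update(self, is_speech: bool) -> int:
        if is_speech:
            self.heard_speech = True
            self.silence_ms = 0            # any speech resets the counter
        elif self.heard_speech:
            self.silence_ms += self.frame_ms  # only count silence after speech began
        return self.silence_ms


tracker = SilenceTracker(frame_ms=30)
frames = [False, True, True, False, False, True, False, False, False]
trace = [tracker.update(f) for f in frames]
# trace == [0, 0, 0, 30, 60, 0, 30, 60, 90]
```

The counter alone cannot tell a thinking pause from a finished turn, which is why the semantic signal is needed on top.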
Barge-in is a separate piece of engineering: the moment the user speaks, you must do three things at once — stop TTS playback, flush the already-synthesized-but-unplayed audio buffer, and cancel the reply the LLM is still streaming. Miss any one of them and the user hears the AI finish its previous sentence anyway, and the "machine feel" lands immediately.
Hands-on
# barge-in: the instant the user speaks, do all three together
async def on_user_speech_start(session):
    # 1) stop the TTS audio currently playing (most visible, do first)
    await session.tts.stop()
    # 2) discard synthesized-but-unplayed audio, else it "replays" later
    session.audio_buffer.clear()
    # 3) cancel the LLM round still streaming (saves tokens + stops it talking)
    session.llm_task.cancel()
    # 4) record "interrupted" in history so the model knows it didn't finish
    session.history.append_interrupted(spoken_so_far=session.played_text)

# turn-end decision: VAD silence + semantic completeness, need both signals
def should_commit_turn(vad_silence_ms, transcript):
    if vad_silence_ms < 200:  # still talking, definitely not done
        return False
    eou = turn_detector.predict(transcript)  # Qwen-0.5B: prob the utterance is complete
    # semantically complete -> reply on short silence; not -> wait longer (thinking)
    return eou > 0.7 or vad_silence_ms > 1200
The key is the last line of should_commit_turn: when semantically complete use a short threshold (reply fast); when incomplete use a long threshold (give thinking room). That one dynamic threshold is the core of "neither cutting people off nor being sluggish."
Failure mode: (1) endpointing with a fixed silence threshold only — too short and it interrupts, too long and it's sluggish, no sweet spot. You must layer in a semantic signal. (2) barge-in stops TTS playback but forgets to flush the audio buffer, so after the user finishes the AI "replays" half of an old reply — like a broken record. (3) not recording "interrupted" in history, so the model thinks it finished its previous turn and its logic goes off the rails.
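The barge-in sequence can be exercised end to end with plain `asyncio`. This is a toy sketch under stated assumptions: `ToySession` is hypothetical, playback and generation are collapsed into a single task, and `asyncio.sleep(0)` stands in for real playback time so the run is deterministic.

```python
import asyncio


class ToySession:
    def __init__(self):
        self.audio_buffer = []   # synthesized but not yet played
        self.played = []         # what the user actually heard
        self.history = []
        self.speak_task = None

    async def speak(self, sentences):
        self.audio_buffer = list(sentences)
        while self.audio_buffer:
            chunk = self.audio_buffer.pop(0)
            await asyncio.sleep(0)          # stand-in for playing one chunk
            self.played.append(chunk)
        self.history.append({"text": " ".join(self.played), "interrupted": False})

    async def on_user_speech_start(self):
        self.speak_task.cancel()            # stop playback / generation
        self.audio_buffer.clear()           # flush audio that was never played
        try:
            await self.speak_task
        except asyncio.CancelledError:
            pass
        # Record only what was actually heard, flagged as interrupted.
        self.history.append({"text": " ".join(self.played), "interrupted": True})


async def demo():
    s = ToySession()
    s.speak_task = asyncio.create_task(s.speak(["One.", "Two.", "Three.", "Four."]))
    while len(s.played) < 2:                # let two chunks play, then barge in
        await asyncio.sleep(0)
    await s.on_user_speech_start()
    return s


session = asyncio.run(demo())
```

After the run, `session.history[-1]` holds only the sentences that were heard, which is what the model should see on its next turn.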
// 03
Voice-to-Voice Latency Budget: The Sentence-Level Pipeline
Claim: Voice latency isn't "wait for the LLM to finish, then synthesize" — it's a sentence-level pipeline: TTS starts on the first sentence while the LLM is still writing the rest.
Background & Principles
The gap between natural conversational turns is ~200ms; once end-to-end voice-to-voice exceeds 800ms users feel it's sluggish, and <500ms is the gold standard. Where the budget goes: STT final transcript + LLM time-to-first-token (TTFT) + TTS time-to-first-byte (TTFB) + network round-trips, which strung together easily breaks a second.
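A quick way to see how the budget adds up is to sum the stages explicitly. The per-stage numbers below are hypothetical placeholders, not measurements. The 500ms and 800ms cutoffs come from the text, and the "acceptable" label for the band between them is an assumption.

```python
# Hypothetical per-stage latencies in milliseconds; measure your own stack.
budget_ms = {
    "stt_final_transcript": 300,
    "llm_ttft": 400,
    "tts_ttfb": 250,
    "network_round_trips": 150,
}
total_ms = sum(budget_ms.values())


def rating(ms: float) -> str:
    if ms < 500:
        return "gold standard"
    if ms <= 800:
        return "acceptable"
    return "sluggish"


verdict = rating(total_ms)   # 1100 ms -> "sluggish"
```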
The most important trick: don't wait for the LLM to produce a complete answer. While the LLM streams, hang a sentence buffer off it; accumulate tokens until you hit sentence-ending punctuation, then ship that whole sentence into TTS, so TTS is already playing the first sentence while the LLM is still generating the second. On the LLM side, the metric that matters for voice is therefore TTFT plus the time to finish the first sentence, not total generation time: once the first sentence clears TTS, audio can start, and the user never perceives that more is still being generated. Then stack a few more layers: cut audio chunks to 100–250ms, use streaming STT partials instead of waiting for the full segment, and pick a TTS provider that supports streaming.
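For the 100–250ms chunking advice, the buffer size follows directly from the audio format. The sketch assumes 16 kHz, 16-bit, mono PCM, which is a common choice but not something the text specifies.

```python
def chunk_bytes(duration_ms: int, sample_rate: int = 16_000,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM chunk of the given duration."""
    samples = sample_rate * duration_ms // 1000
    return samples * bytes_per_sample * channels


small = chunk_bytes(100)   # 1600 samples -> 3200 bytes
large = chunk_bytes(250)   # 4000 samples -> 8000 bytes
```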
Hands-on
# sentence-level pipeline: LLM generates while TTS plays. Don't wait for the LLM to finish
import re

# sentence-ending punctuation followed by whitespace, so "3.5" is not split mid-number
SENT_END = re.compile(r'[.!?]+\s+')

def _split_first_sentence(buf):
    # return (first complete sentence, remainder); the caller has checked a match exists
    end = SENT_END.search(buf).end()
    return buf[:end].strip(), buf[end:]

async def stream_speak(llm_stream, tts):
    buf = ""
    async for token in llm_stream:  # LLM streaming output
        buf += token
        # as soon as a full sentence forms, ship to TTS; don't wait for the whole reply
        while SENT_END.search(buf):  # one token may complete more than one sentence
            sent, buf = _split_first_sentence(buf)
            await tts.synthesize(sent)  # sentence 1 already playing, LLM writing sentence 2
    if buf.strip():
        await tts.synthesize(buf)  # flush trailing partial

# -> perceived latency ~= STT + LLM first-token + first-sentence gen + TTS first-byte
# not STT + whole LLM + whole TTS (the latter easily breaks 2 seconds)
In practice this usually cuts voice-to-voice from 1.5s+ down to under 700ms, because you've parallelized the longest wait — "LLM finishes the whole reply" — with TTS and playback.
Failure mode: (1) sending the whole text to TTS only after the LLM stops generating: the most common and most fatal mistake, since long replies break 2 seconds outright. (2) chopping sentences too finely (by comma, by token): each TTS segment has fixed overhead, so it's slower and makes prosody stutter. Splitting on sentence-ending punctuation is the sweet spot. (3) optimizing only model latency and ignoring the network: RTT from the client to your STT/TTS provider is often the hidden bulk; move to edge or a nearby region.
// 04
Voice Cloning, Watermarking & Ethics: Your Liability Boundary
Claim: Being able to clone anyone's voice in seconds means your voice agent can also be weaponized for fraud — consent gates and watermarks aren't compliance decoration, they're your liability boundary.
Background & Principles
Modern TTS (ElevenLabs and others) can clone a person's timbre from a few seconds of sample. That hands voice-agent engineering two classes of responsibility you must handle:
Input side · consent & provenance: did the voice you cloned have the owner's consent? Production systems need a consent gate (verify authorization before cloning) plus voice provenance (each voice ID records its source and authorization proof) — otherwise, once it's used to impersonate, the legal liability is yours.
Output side · identifiable as synthetic: the speech you generate should be detectable as AI-synthesized. Passive detection (training a classifier to judge real vs fake) is brittle and easy to evade; proactive watermarking is more reliable: Meta's AudioSeal (arXiv 2401.17264) uses a generator/detector architecture to embed an inaudible watermark into the waveform — localizable down to the sample level (1/16000s granularity), single-pass and extremely fast to detect, robust to compression and editing, and already deployed in Audiobox / Seamless serving ~100k daily users.
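For the input side, one possible shape of a consent gate plus provenance record is sketched below. Everything here is an assumption for illustration: the HMAC-signed token scheme, the `SECRET` key, and the names `issue_consent`, `verify_consent`, and `VoiceRecord`. A production system would more likely tie consent to a signed, time-limited record of the owner's explicit authorization.

```python
import hashlib
import hmac
from dataclasses import dataclass

SECRET = b"replace-with-a-real-secret"   # hypothetical signing key


def issue_consent(owner_id: str) -> str:
    """Token handed to the owner after they explicitly authorize cloning."""
    return hmac.new(SECRET, owner_id.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_consent(owner_id: str, token: str) -> bool:
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(issue_consent(owner_id), token)


@dataclass(frozen=True)
class VoiceRecord:
    """Provenance: who owns the voice and which consent authorized it."""
    voice_id: str
    owner_id: str
    consent_token: str
    source: str


token = issue_consent("alice")
record = VoiceRecord("voice-001", "alice", token, source="user_upload")
```

A token issued for one owner does not verify for another, so the gate fails closed when the sample and the consent do not match.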
One eval misconception worth correcting: don't evaluate a voice agent by STT WER alone. The real metrics are task success rate + interruption accuracy + turn false-trigger rate (the fraction where noise or a normal pause is misjudged as "done"). An agent with a low WER that keeps cutting people off still tanks the user experience.
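The three metrics can be computed from hand-labeled conversations. The label schema below (`ConversationLabel` and its fields) is a hypothetical one. In particular, the false-trigger rate is assumed to be false triggers divided by all mid-utterance pauses and noise events, which is one reasonable reading of the definition above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ConversationLabel:
    task_succeeded: bool
    interruptions: int           # times the user barged in
    interruptions_handled: int   # agent stopped and picked up correctly
    pauses: int                  # mid-utterance pauses or noise events
    false_triggers: int          # of those, how many the agent misjudged as "done"


def ratio(num: int, den: int) -> Optional[float]:
    return num / den if den else None   # None when the metric is undefined


def voice_metrics(convs: List[ConversationLabel]) -> dict:
    return {
        "task_success_rate": ratio(sum(c.task_succeeded for c in convs), len(convs)),
        "interruption_accuracy": ratio(sum(c.interruptions_handled for c in convs),
                                       sum(c.interruptions for c in convs)),
        "turn_false_trigger_rate": ratio(sum(c.false_triggers for c in convs),
                                         sum(c.pauses for c in convs)),
    }


metrics = voice_metrics([
    ConversationLabel(True, 2, 2, 5, 1),
    ConversationLabel(False, 1, 0, 3, 2),
    ConversationLabel(True, 1, 1, 2, 0),
])
```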
Hands-on
# minimal responsibility stack for cloning: consent gate + provenance + output watermark
# (verify_consent, tts, registry and audioseal.embed/detect are placeholder wrappers,
#  not the actual function names of any library)
def clone_voice(sample_audio, owner_id, consent_token):
    # 1) consent gate: no valid authorization, no cloning; physically blocked
    if not verify_consent(owner_id, consent_token):
        raise PermissionError("no valid consent for this voice")
    voice_id = tts.create_clone(sample_audio)
    # 2) provenance: pin source/authorization into the voice record, auditable
    registry.record(voice_id, owner=owner_id, consent=consent_token, source="user_upload")
    return voice_id

def synthesize_safe(text, voice_id):
    audio = tts.synthesize(text, voice_id)
    # 3) embed an inaudible watermark so generated speech is traceable
    return audioseal.embed(audio)  # detection side: audioseal.detect(audio) -> generated by us?
Failure mode: (1) only a "do not misuse" clause in the ToS, with no technical consent gate — that's no defense at all. (2) not watermarking generated speech, so after an incident you can't prove "this fraud recording wasn't produced by my system" and the burden of proof falls entirely on you. (3) using WER as the only launch gate while ignoring barge-in/turn metrics — ship and crash.
// Capstone · Build a Non-Dumb Local Voice Agent with Pipecat
String the four points into a weekend project: build a naturally-conversing voice agent with Pipecat. The goal isn't to show off but to put your hands on all four of voice engineering's pain points.
Pick the architecture (§1): go cascaded first — streaming STT (Deepgram / Whisper) + your LLM (with one query tool) + streaming TTS. Keep the text intermediate layer so you can print every step and run evals.
Turns & barge-in (§2): wire up Silero VAD + the LiveKit turn-detector, implement the dynamic-threshold should_commit_turn, and write the full barge-in triad: stop TTS + flush buffer + cancel LLM. This is the single step with the biggest felt difference.
Latency (§3): hang a sentence-level buffer and ship to TTS on sentence-ending punctuation; use a stopwatch to print voice-to-voice latency and push it under 800ms.
Safety (§4): if you use a cloned voice, add a consent gate and pass the output through AudioSeal watermarking.
Eval: record 20 real conversations (with mid-sentence pauses, background noise, user interruptions) and measure three numbers — task success rate, interruption accuracy, turn false-trigger rate. You'll find WER barely explains the experience gap, while the turn metrics explain most of it.
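For the latency step, a stopwatch can be as small as the sketch below. `VoiceToVoiceTimer` is a hypothetical helper. It measures from "user stopped speaking" to "first agent audio played" and reports a nearest-rank p95. The optional `now` argument exists only so the example can inject timestamps.

```python
import math
import time


class VoiceToVoiceTimer:
    def __init__(self):
        self.samples_ms = []
        self._t0 = None

    def user_stopped(self, now=None):
        self._t0 = time.perf_counter() if now is None else now

    def agent_audio_started(self, now=None):
        t1 = time.perf_counter() if now is None else now
        self.samples_ms.append((t1 - self._t0) * 1000.0)
        self._t0 = None

    def p95(self) -> float:
        ordered = sorted(self.samples_ms)
        rank = math.ceil(0.95 * len(ordered))   # nearest-rank percentile
        return ordered[rank - 1]


timer = VoiceToVoiceTimer()
for latency_ms in [400, 500, 600, 700, 1500]:   # injected, hypothetical turns
    timer.user_stopped(now=0.0)
    timer.agent_audio_started(now=latency_ms / 1000)
```

With only five samples the p95 is simply the worst turn, which is a useful reminder that one slow reply can dominate how the agent feels.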
Once you've done this, you'll instinctively dissect any voice product — does it cut people off? how fast does it shut up when interrupted? how many milliseconds voice-to-voice? — instead of just judging whether the timbre sounds human.
// GLOSSARY
STT / ASR
Speech-to-Text / Automatic Speech Recognition. Streaming partial output beats waiting for a full segment for low latency.
TTS
Text-to-Speech. Only a streaming-capable provider lets you do a sentence-level pipeline.
Cascaded vs End-to-End
Cascaded (STT→LLM→TTS, keeps text control) vs end-to-end speech model (audio in/out, low latency but loses the text middle layer).
VAD
Voice Activity Detection — is there speech right now (e.g. Silero VAD). Solves "is there sound," not "are they done."
Endpointing
Deciding a user's turn has ended. A pure silence threshold isn't enough; layer in a semantic signal.
Turn Detection
Semantic turn detection: is the utterance semantically complete (e.g. LiveKit turn-detector, built on Qwen2.5-0.5B).
Barge-in
User interrupts the agent. Requires stopping TTS, flushing the audio buffer, and cancelling LLM generation at once.
Full-Duplex
Listening and speaking at the same time (e.g. Moshi) — the natural form of human dialogue, an end-to-end model's advantage.
TTFT / TTFB
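Time-to-first-token / time-to-first-byte. In voice these dominate perceived latency: audio starts as soon as the first sentence has been generated and TTS returns its first bytes, regardless of how long the full reply takes.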
Voice Cloning
Cloning timbre from a few seconds of sample. Needs a consent gate + provenance to manage misuse risk.
Watermarking
Proactively embedding an inaudible mark in generated speech (e.g. AudioSeal) for traceable detection — more robust than a passive classifier.
// DEEP THINKING
End-to-end speech models (Moshi/Realtime) hit 200ms latency — why do commercial voice agents in 2026 still mostly run cascaded?
Because commercial use cases almost all need capabilities only cascaded provides: tool calling (look up orders / place orders / transfer to a human), RAG grounding (knowledge base), content moderation, and an auditable conversation transcript. End-to-end folds all of that into a single audio→audio black box — you can't insert tools, and you can't produce a transcript to review when things go wrong. Saving 300ms doesn't buy that back. End-to-end truly fits pure companionship / chitchat / language practice — tone-sensitive cases that don't depend on external knowledge or auditing. This is fundamentally the same controllability-vs-end-to-end-performance trade-off as Day 3's "workflow vs agent."
Turn detection runs VAD and a semantic small model at once — doesn't that add latency? Why not just raise the silence threshold?
Raising the silence threshold makes everything "globally sluggish" — every turn waits longer, including short utterances the user clearly finished. The semantic model implements conditional waiting: short threshold when the sentence is semantically complete (reply fast), long wait only when incomplete (give thinking room). That separates "be fast when you should, wait when you should," and the average felt experience improves. The semantic model (0.5B, CPU-capable) infers in tens of milliseconds, runs in parallel with STT's final transcript, and barely enters the critical path. The cost is engineering complexity — but it's the only solution to "neither interrupting nor sluggish."
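The "globally sluggish" argument is easy to check with arithmetic. The sketch reuses the 200ms, 1200ms, and 0.7 values from the hands-on code in §2. The end-of-utterance probabilities are hypothetical and describe five turns in which the user had in fact finished.

```python
def wait_fixed(threshold_ms: int = 1200) -> int:
    return threshold_ms                      # every turn pays the full threshold


def wait_dynamic(eou_prob: float, short_ms: int = 200,
                 long_ms: int = 1200, cutoff: float = 0.7) -> int:
    # Semantically complete -> short wait; otherwise give thinking room.
    return short_ms if eou_prob > cutoff else long_ms


finished_turn_probs = [0.95, 0.9, 0.85, 0.8, 0.4]   # hypothetical model outputs
dynamic_waits = [wait_dynamic(p) for p in finished_turn_probs]
mean_dynamic = sum(dynamic_waits) / len(dynamic_waits)   # (4*200 + 1200) / 5
mean_fixed = wait_fixed()
```

Under these assumptions the dynamic policy averages 400ms of waiting against 1200ms for the raised fixed threshold, and only the turn the model was unsure about pays the long wait.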
The sentence pipeline has TTS play while the LLM writes the rest — if the LLM then changes course or self-corrects, what about the half-sentence already spoken?
This is the inherent risk of streaming TTS: audio already spoken can't be recalled. Engineering has three lines of defense: 1) prompt the model to state the conclusion first then elaborate, lowering the odds of changing course; 2) split at the sentence (not paragraph) level, so the most that can be "sunk" at once is one sentence; 3) for high-risk replies (numbers, amounts, confirmations) turn off early streaming and synthesize only after full generation. It's fundamentally a trade — "tiny chance of saying half a wrong sentence" for "cutting a big chunk of latency." Worth it for chitchat, handle with care for order confirmation. This echoes §3's failure mode: don't blindly stream everywhere just for latency.
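The third line of defense can be expressed as a batching policy decided before playback. This is a sketch under assumptions: the intent labels in `HIGH_RISK_INTENTS` and the function `tts_batches` are hypothetical, and it presumes the agent knows the reply's intent (for example from the tool it just called) before it starts speaking.

```python
from typing import List

HIGH_RISK_INTENTS = {"order_confirmation", "payment", "account_change"}  # hypothetical labels


def tts_batches(intent: str, sentences: List[str]) -> List[str]:
    """Return the units handed to TTS, in order."""
    if intent in HIGH_RISK_INTENTS:
        # No early streaming: synthesize once, after the full reply exists.
        return [" ".join(sentences)]
    # Low risk: ship sentence by sentence for minimum latency.
    return list(sentences)


chat = tts_batches("chitchat", ["Hi.", "How are you?"])
pay = tts_batches("payment", ["Your total is 42 dollars.", "Shall I confirm?"])
```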
Watermarking all generated speech with AudioSeal sounds responsible — but can the watermark be removed? Where's the real boundary of this defense?
It can be weakened. Heavy compression, resampling, added noise, and time-stretching all erode it, and a determined adversary can even remove it deliberately — AudioSeal is designed to be robust to common audio processing, not to defeat a professional attacker. So its real value isn't "100% block malice" but: 1) a low-cost batch traceability tool for the platform (distinguish our-system vs other-system output); 2) raising the bar on casual misuse; 3) a technical basis for compliance and evidence. It's one layer of defense in depth and must stack with consent gates, usage monitoring, and anomaly detection — no single-layer watermark should be treated as a complete solution.
Why is evaluating a voice agent by WER misleading? Which metrics actually predict the user experience?
WER only measures STT transcription accuracy, but a voice agent's failure points are almost never in transcription: what users least tolerate is "cutting in" (turn false triggers) and "sluggishness" (latency), then failing to pick up after a barge-in. An agent with WER=2% that interrupts once every three sentences will feel dumb. What truly predicts experience is task success rate, barge-in/turn accuracy, false-trigger rate, and voice-to-voice p95 latency. This is Day 29's "Eval Beyond Benchmark" made concrete for the voice domain: benchmark metrics (WER) measure components, but experience is determined by interaction dynamics, which need their own eval built for real interaction.