Voice Agents 2026: Builders & Buyers Guide

February 2, 2026

Voice agents are moving from demos into production—but the gap between what works in a prototype and what holds up with real users is wide. This guide distills what practitioners, vendors, and researchers are learning into practical guidance for two audiences.

If you build voice agents, Part I is for you—it covers how to design, ship, and run reliable voice agents in production. If you evaluate or buy voice solutions, Part II is for you—it covers how to choose vendors, govern deployments, and get value from voice AI without getting burned.

You can use either part on its own or read the full guide for a complete view of the ecosystem.

How to read this guide: Each major section opens with a quote and a short takeaway, then goes deeper. Builders and Buyers paragraphs are marked so you can skim to your role. Bottom line sentences at the end of sections summarize the main action. Tables and diagrams are used where they make comparison or flow easier than prose.


Top Companies

Companies and closed-source products from the voice AI landscape (source: projects.yaml).

Company | Description
ElevenLabs | AI voice generator and text-to-speech platform
Cartesia | Streaming TTS for live agents and multimodal experiences; ultra-low latency voices
Resemble AI | Voice cloning and TTS platform with real-time Speech-to-Speech API
Murf | AI-powered voice generator for professional content
PlayHT | AI voice generator and text-to-speech platform
Speechify | Text-to-speech and voice cloning platform
LOVO | AI voice generator and voiceover platform
Synthesys | AI voice generator and video synthesis platform
WellSaid Labs | AI voice generator for enterprise content
Descript | AI-powered audio and video editing with voice cloning
Replica Studios | AI voice actors for games and interactive media (shut down in 2025)
Sonantic | AI voice platform for games and entertainment (acquired by Spotify)
Microsoft | Cloud-based speech recognition and synthesis
Amazon AWS | Cloud-based text-to-speech service
Google Cloud | Neural text-to-speech API
IBM | Enterprise text-to-speech API
AssemblyAI | AI-powered speech recognition API
Deepgram | AI speech recognition and transcription API
Rev | Automatic speech recognition API
Otter.ai | AI-powered meeting transcription and notes
Fireflies | AI meeting assistant with transcription
Zoom | AI assistant for Zoom meetings
Apple | Apple’s voice assistant
Google | Google’s AI-powered voice assistant
Amazon | Amazon’s voice assistant platform
HeyGen | AI video generation with voice cloning
D-ID | AI video generation with talking avatars
Synthesia | AI video generation platform
Podcastle | AI-powered podcast creation and editing
Cleanvoice | AI audio editing and enhancement
Adobe | AI-powered podcast recording and editing
Modulate | Voice intelligence (Velma) for sentiment, emotion, moderation; ToxMod for real-time voice chat safety
Coval | Voice AI testing and evaluation platform; simulate and monitor conversational agents
SLNG | Voice AI infrastructure (Unmuted); gateway for STT, TTS, LLMs; real-time voice apps
Daily | Realtime voice/video infra; embed telephony/WebRTC for agents and apps
Genesys | Contact center platform with voice AI
NICE | Contact center and workforce optimization platform
Five9 | Cloud contact center platform
Talkdesk | Enterprise contact center platform
Cisco | Enterprise contact center solutions
RingCentral | Business communications and contact center platform
Nextiva | Business phone and contact center platform
Verint | Customer engagement and workforce optimization
Uniphore | Conversational AI and automation platform
Cognigy | Enterprise conversational AI platform
Kore.ai | Enterprise conversational AI and automation platform
Yellow.ai | Conversational AI platform for customer engagement
LivePerson | Conversational AI and messaging platform
Sprinklr | Customer experience management platform
OneReach.ai | Enterprise conversational AI platform
boost.ai | Conversational AI platform for customer service
Gridspace | Conversational AI and speech analytics platform
Aisera | AI-powered service desk and conversational AI
Avaamo | Enterprise conversational AI platform
Omilia | Conversational AI platform for contact centers
PolyAI | Voice AI platform for customer service
Parloa | Conversational AI platform for customer service
Replicant | AI voice agents for customer service
Vapi | Voice AI platform for building phone agents
Retell AI | Conversational AI platform for voice agents
Bland AI | Voice AI platform for phone calls
Synthflow | No-code voice AI platform
Sierra | Conversational AI platform
Decagon | Vertical CX agents (chat, email, phone); branded voice agents for support
Dialora | Voice AI platform for customer service
Aircall | Business phone system and call center platform
8x8 | Business communications and contact center platform
Sonix | Automated transcription and speech-to-text platform
Twilio | Cloud communications platform with voice APIs
Vonage | Cloud communications platform with voice APIs
Telnyx | Voice and messaging API platform
Plivo | Voice and SMS API platform
Speechmatics | Speech recognition and transcription API
Gladia | AI audio processing and transcription platform
OpenAI | STT, TTS, and low-latency speech-to-speech via Realtime API for voice agents
Hume AI | Expressive speech model (Octave) and Empathic Voice Interface for emotion-aware S2S agents
Voiceflow | Platform to design, test, and deploy chat + voice agents with collaboration and observability
NVIDIA | GPU-accelerated speech AI microservices/SDK for real-time ASR and TTS (cloud/on-prem/edge)

Top Projects

Open-source voice AI projects from the landscape (source: projects.yaml).

Project | Description
Coqui | Deep learning toolkits for TTS and STT (Coqui TTS, Coqui STT)
OpenVoice | Versatile instant voice cloning
Bark | Transformer-based text-to-audio model
Chatterbox | State-of-the-art open-source TTS
VibeVoice | Open-source frontier voice AI
LiveKit | Realtime voice and video infrastructure; agent frameworks (Python, JS)
Pipecat | Framework for voice and multimodal conversational AI
Silero | TTS and VAD models (silero-models, silero-vad)
Whisper | Robust speech recognition via large-scale weak supervision
whisper.cpp | C/C++ port of Whisper for local inference
Faster Whisper | CTranslate2-powered fast Whisper inference
WhisperX | Whisper with word-level timestamps and diarization
SpeechBrain | Open-source conversational AI toolkit for speech
ESPnet | End-to-end speech processing toolkit
Vosk API | Offline speech recognition toolkit
Mozilla | STT (DeepSpeech) and TTS (Mozilla TTS) engines
NVIDIA NeMo | Conversational AI toolkit (NeMo); TTS components (Tacotron2, WaveGlow)
TensorFlowTTS | Real-time state-of-the-art text-to-speech in TensorFlow
gTTS | Google Text-to-Speech Python library and CLI
Mycroft Core | Open source voice assistant framework
Rhasspy | Offline voice assistant; Piper TTS for local neural speech
Kaldi | Toolkit for speech recognition research
wav2letter | End-to-end automatic speech recognition toolkit
Voice Assistant | Customizable open source voice assistant
Jasper | Always-on, voice-controlled applications on Raspberry Pi
Awesome Speech Recognition | Curated list of speech recognition resources
Awesome Whisper | Curated list of Whisper resources
Awesome Audio Speech | Awesome list of audio, speech and DSP resources
Awesome End2End ASR | Curated list of end-to-end ASR resources
pyannote-audio | Neural building blocks for speaker diarization
PaddlePaddle DeepSpeech | End-to-end ASR system based on Baidu DeepSpeech2
VITS | End-to-end TTS with variational inference
Tacotron | Neural network for speech synthesis
DeepSpeech PyTorch | PyTorch implementation of DeepSpeech2
Real-Time Voice Cloning | Voice cloning in 5 seconds to generate speech in real time
VALL-E | Zero-shot TTS (VALL-E, VALL-E-X multilingual)
Mozilla.ai Speech-to-Text | Blueprint repository for speech-to-text transcription
OA-Arch | Offline voice assistant for Arch Linux
Cmdr | Offline voice-activated assistant using Picovoice
Lexi Voice Assistant | Offline voice assistant with multi-language support
Rune Voice Assistant | Offline voice assistant project
Piper | Fast, local neural text-to-speech (offline); popular in Home Assistant/Rhasspy
Tortoise TTS | Multi-voice, high-quality TTS with realistic prosody and intonation
StyleTTS2 | Research-grade TTS with style diffusion and adversarial training
RVC WebUI | Voice conversion (S2S) web UI; train models with small datasets; real-time voice changing
WeNet | Production-oriented end-to-end ASR toolkit (streaming + non-streaming)
FunASR | Speech recognition toolkit; ASR, VAD, punctuation, diarization
Vocode | Open-source framework for real-time voice agents (streaming STT/LLM/TTS); phone and meetings
DeepFilterNet | Real-time noise suppression for speech (denoising); pretrained models and CLI

Top Models

Open-source models commonly used for voice agents.

Model | Task | Why it’s popular / notes
Whisper (OpenAI) | ASR | De-facto open ASR baseline; multilingual; permissive MIT license; tons of tooling.
Kokoro TTS | TTS | Small (≈82M params) yet high quality; Apache-2.0 open weights; fast + efficient.
Kimi-Audio (Moonshot) | Audio foundation model | Unified audio model (understanding + generation) released open-source. Good jumping-off point for research.
F5-TTS | TTS | Modern diffusion/flow-matching TTS; strong quality, vibrant OSS ecosystem.
Bark (Suno) | TTS / text-to-audio | Early open text-to-audio that can do expressive speech and simple SFX.
StyleTTS 2 | TTS | High-naturalness, style-controllable TTS; widely used in academic/OSS demos.
CosyVoice / CosyVoice 2 | TTS | Multilingual, zero-shot cloning; active project with streaming focus.
MMS (Massively Multilingual Speech) | ASR + TTS (many langs) | Huge language coverage; open models on HF.
SeamlessM4T (Meta) | S2ST / S2TT / T2ST / ASR | All-in-one speech↔text / speech↔speech translation with open checkpoints.

Closed-source models

Provider / Model | Task | Why teams pick it
OpenAI GPT-4o Realtime & 4o-mini TTS | Speech-in ↔ speech-out, TTS | Full realtime stack (WebRTC/WebSocket), low-latency speech; easy path if you already use OpenAI.
ElevenLabs TTS | TTS | Very natural voices, cloning, realtime streaming; large voice library.
Deepgram Aura (TTS) | TTS | Sub-200 ms latency TTS tuned for agents; enterprise deployment options.
Deepgram Flux / Nova-3 (ASR) | ASR | Real-time (Flux) + high-accuracy (Nova-3) models; strong turn detection.
Cartesia Sonic | TTS | Ultra-low latency streaming (≈90–100 ms TTFA) with rich emotion control.
Google Cloud TTS (WaveNet/Neural2/Studio) | TTS | Massive voice/language coverage; easy cloud integration.
Google Cloud STT (Chirp family) | ASR | Modern multilingual ASR (Chirp/Chirp-2/Chirp-3) with diarization and ALD.
Microsoft Azure AI Speech (Neural TTS / STT) | TTS + ASR | Mature enterprise features (custom voices, diarization, HD voices).
Amazon Polly (Neural) | TTS | Reliable, globally available TTS; fits AWS-centric stacks.
PlayHT (Play) 2.0 / Turbo | TTS | Real-time streaming TTS with SDKs and cloning; popular with indie devs and startups.
AssemblyAI (Conformer-2 / Universal-2) | ASR | Commercial ASR with strong accuracy; common alternative to cloud vendors.

Top Architectures

Placeholder: cascaded, S2S, and hybrid architecture patterns.


The Builders Guide to Voice Agents

Goal: Help you find your ultimate stack—the architecture, components, and practices that let you ship reliable voice agents and keep evolving them.

The sections below are the dimensions that define that stack: what kind of system to build, which capabilities to treat as first-class, and which practices get you from prototype to production and keep the system improving.

Voice and text are fundamentally different

“You can’t treat voice like chat. A few extra seconds that are fine in text completely destroy a voice conversation.”

— Anna Baidina, Revolut

Teams that bolt audio onto a chatbot discover the hard way: a delay that feels fine in text destroys a voice conversation, and prompts that work in chat behave unpredictably in real time. The constraints aren’t incremental—they’re fundamental.

Treating voice like text is the most common mistake

The most common mistake is assuming chatbot or text-LLM strategies will translate directly to voice. They don’t. Voice has entirely different constraints. Latency that feels fine in chat can derail a voice conversation. The real-time, continuous nature of audio demands architectural decisions that text systems never face.

In practice, you can’t treat voice like chat; a few extra seconds destroy the conversation. This isn’t a minor optimization—it’s a first-order design constraint. Reusing text prompts backfires: what works in chat doesn’t behave the same in real-time audio. Voice models have different failure modes, including more hallucination, and need flows redesigned for turn-taking, interruptions, and audio. User research ranks “having to repeat themselves” as the top frustration (55%) and “frequently misheard words” (45%)—a problem that doesn’t exist in text, where users can see what was transcribed (AssemblyAI).

Prosody, streaming, and state make voice technically distinct from text

The technical differences run deeper than surface-level latency. Voice carries prosody and emotion—far more information than text—which makes it both more powerful and much harder to engineer.

“Voice is fundamentally higher bandwidth than text. It carries more data—prosody and intonation and cadence. It simultaneously makes it an amazing medium for interaction and a very natural medium… but it also makes it incredibly difficult to build a synthetic human-like experience in code.”

— Russ d’Sa, LiveKit

In production, text agents stream tokens; voice agents must stream audio continuously or the call feels broken. That demands continuous GPU use and stateful session management—nothing like stateless web apps. Once you move from text to audio in/out, turn-taking and audio understanding become different problems. You can’t fix that by just adding a TTS layer to a chatbot.
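
A minimal sketch of what “continuous and stateful” looks like in code, assuming a stand-in audio stream and a hypothetical per-call session object (nothing here is a specific framework’s API):

# Continuous, stateful call loop (conceptual; audio_chunks and CallSession are stand-ins)
import asyncio
from dataclasses import dataclass, field

@dataclass
class CallSession:
    call_id: str
    partial_transcript: list = field(default_factory=list)  # state lives for the whole call

async def audio_chunks():
    for i in range(5):                # stand-in for 20 ms frames from telephony/WebRTC
        await asyncio.sleep(0.02)
        yield f"frame-{i}"

async def handle_call(session: CallSession):
    async for frame in audio_chunks():             # audio keeps arriving; no request/response boundary
        session.partial_transcript.append(frame)   # feed STT incrementally here
    return session.partial_transcript

print(asyncio.run(handle_call(CallSession("demo-call"))))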

Voice agents are often called an “industry” rather than a “market”: they need low latency, local compliance, and language support at once. That creates regulatory, infrastructure, and brand challenges that text AI doesn’t face.

The architectural implications are profound. Coval’s experience: 10% failure at every step (STT→LLM→TTS) can compound into much worse. Voice agents need streaming architectures—STT in chunks, LLM generating as soon as it gets the first words, TTS before the LLM finishes (AssemblyAI). That’s a parallel pipeline that text chat doesn’t need. Self-hosted voice stacks need different orchestration (streaming LLM and TTS, no heavyweight components in the hot path) than text-based LLM apps (r/LocalLLaMA).

Compound failure (conceptual): if each step has 10% failure rate, end-to-end success is the product, not the average:

# 10% fail per step (STT, LLM, TTS) -> ~73% success overall (conceptual)
p_success_per_step = 0.9
p_e2e_success = p_success_per_step ** 3  # 0.729

  Streaming voice pipeline (parallel, not sequential)
  ==================================================

  [User] ----speech---->
       \
        +-> [STT chunks] ----> [LLM] ----> [TTS] ----> [User]
             (while talking)     ^           ^
                                 |           |
                    "first words"|           | stream before
                    trigger LLM   |           | LLM done
                                 +-----------+
  (unlike chat: no "wait for full input then full output")

Figure 1: How a voice agent processes speech in real time

The diagram shows how your words move through the system: speech is turned into text in small chunks while you’re still talking, the AI starts thinking as soon as it hears the first words, and the reply is turned back into speech before the AI has finished its full answer. That keeps the conversation feeling fast and natural—unlike a chatbot where you wait for a complete response.

Builders must design for voice as a distinct domain

Builders: Your stack must be designed for voice as a distinct domain—not retrofitted from chat. Voice agents need their own methodology, architecture patterns, and evaluation. Reuse chatbot flows and what worked in text will fail in voice—often when real customers are on the line. Treat voice as its own domain: its own constraints, failure modes, and practices. Those who treat it as “text with audio” learn the hard way that voice needs specialized infra, testing, and deployment. Bottom line: treat voice as its own domain from day one—architecture, methodology, and evaluation built for voice, not chat.


Latency is a first-order requirement for voice

“If your system can’t react on roughly the same 200–250 millisecond timescale as a human, it won’t feel natural in conversation.”

— Anoop D, Deepgram

The human brain responds in 200–250 ms; miss that window and the conversation feels off. In production, 6 seconds is an eternity and 30 seconds is game over. Latency budgets and the interrupt/wait trade-off define whether your agent feels alive or robotic.

The human baseline is 200–250 ms; missing it feels robotic

The human brain processes audio and formulates responses in roughly 200–250 milliseconds; voice systems must meet that timescale or the interaction feels robotic. If your system can’t react on roughly the same 200–250 millisecond timescale as a human, it won’t feel natural in conversation. This isn’t a soft preference or optimization opportunity; it’s a first-order design constraint that shapes every architectural decision. Stark thresholds: 6 seconds feels like an eternity; 30 seconds and they hang up.

Text and voice latency tolerance are not transferable

The difference between text and voice latency tolerance is stark. In text, a few seconds of delay is fine; in voice it ruins the interaction. Organizations consistently rank real-time response speed as “important” or “very important” for human-like agents—low latency is central to perceived quality (Deepgram 2025). Typical first-response latency bands (AssemblyAI, Deepgram):

First-response latency | Perceived quality
< 100 ms | Optimal
200–500 ms | Acceptable
> 1 s | Too slow

Voice agents need streaming STT with sub-500ms latency (AssemblyAI). Getting there is a balancing act: wait too long and you get awkward pauses; respond too fast and you cut users off. Small timing differences can break the whole experience.

Example latency targets (conceptual—tune per component):

# First-response and component targets (conceptual)
latency:
  human_baseline_ms: 250      # aim for this timescale
  first_response_optimal_ms: 100
  first_response_acceptable_ms: 500
  stt_streaming_max_ms: 500
  e2e_target_p99_ms: 600      # cascaded with streaming

The latency landscape is evolving but thresholds still matter

The latency landscape is evolving. The real cascaded vs. S2S delta has narrowed to about 200–300 ms (not the 2s vs. 300ms sometimes claimed), but the gap still matters in latency-critical flows. Real-time streaming STT and semantic VAD help: they let you process while the user is still speaking and detect end-of-thought better than silence alone. Think about latency budgets at the 99th percentile across all components, not just average. Sub-600ms end-to-end is achievable even self-hosted with careful choices—small ASR, ≤8B LLM, streaming TTS. Vendors like Cartesia specialize in ultra-low-latency streaming TTS for live agents. Latency is about architecture, not just better models.
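
A rough way to make that p99 budget explicit, with illustrative numbers rather than measured ones:

# p99 latency budget per component (conceptual; numbers are illustrative, tune to measurements)
BUDGET_MS = {
    "endpointing": 100,      # semantic VAD / end-of-turn decision
    "stt_final": 150,        # finalize transcript of the last chunk
    "llm_first_token": 200,  # time to first token
    "tts_first_audio": 150,  # time to first audio byte
}
E2E_TARGET_MS = 600

total = sum(BUDGET_MS.values())
assert total <= E2E_TARGET_MS, f"allocated {total} ms exceeds the {E2E_TARGET_MS} ms target"
print(f"allocated {total} ms of a {E2E_TARGET_MS} ms p99 end-to-end budget")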

Fast responses don’t automatically create trust. People still trust humans more when something is complicated or high-stakes. Latency is necessary but not sufficient. Bottlenecks are mostly STT/TTS choice and non-streaming design; low latency needs streaming architectures so you hear the first tokens sooner.

Builders must architect for low latency from the ground up

Builders: Your stack must be architected for low latency from the ground up. Stack choices (STT, LLM, TTS, streaming design) determine whether you hit latency targets. That means real-time streaming, continuous processing, optimized model selection, and careful tuning of the latency-interruption trade-off. Treat latency as an afterthought and even perfect accuracy and natural-sounding voices won’t overcome slow response times. Voice agents succeed or fail on the millisecond scale. You can also reduce perceived latency through predictive loading (e.g. preloading based on where the conversation is going). Bottom line: make latency a first-class constraint in every component choice and pipeline design.


Turn detection / end-of-turn / semantic VAD is critical and hard

“Turn detection is still one of the hardest parts of this architecture. We do semantic VAD—it classifies if it thinks you’re done. If you end with emphasis or a question, that’s a strong signal; if you trail off with an ‘um,’ it gives a very low probability.”

— Peter Bakkum, OpenAI

Assume 400 ms of silence means the user is done, and you’ll interrupt people who are just pausing to think. Wait too long, and you stall. Getting turn detection right is one of the hardest parts of real-time voice—and one of the biggest differentiators between agents that feel natural and agents that don’t.

  Turn-taking trade-off
  =====================

  Respond too soon  ------------  Sweet spot  ------------  Wait too long
  (interrupt user)                (natural)                 (awkward pause)

  Semantic VAD cues:
    trailing off with “um”             -> low probability done, keep waiting
    question, emphasis, end-of-thought -> high probability done, respond now

  Silence-only VAD fails: 400ms quiet = “done” breaks on pauses, thinking.

Figure 2: The timing tightrope—when to jump in

The diagram illustrates the trade-off voice agents face: respond too soon and you cut the user off (the “robot” feel); wait too long and the pause feels awkward. It shows how the system uses clues from your voice—like ending on a question or trailing off with “um”—to guess when you’re done speaking, so it can respond at the right moment.

Getting turn detection wrong drives the classic voice bot vibe

Perhaps the most deceptively difficult problem in voice agents is knowing when a user has finished speaking. This challenge sits at the intersection of latency, user experience, and technical complexity. Get it wrong, and you either interrupt users mid-sentence—creating the “classic voice bot vibe” that makes customers beg for a human agent—or wait too long, introducing awkward pauses that destroy the illusion of natural conversation.

Figuring out when a person is actually done talking is one of the hardest parts of real-time voice. It’s often positioned as one of three critical production components beyond basic STT/LLM/TTS—the “iPhone moment” when streaming STT, semantic VAD, and end-of-turn detection work together. Simple silence-based VAD fails catastrophically in production: a naïve rule like “400 ms of silence means you’re done” breaks constantly once you listen to real conversations, where people pause to think, emphasize points, or simply breathe. The trade-off is waiting long enough to avoid interrupting while responding quickly enough to feel natural.

In production, slow speakers trigger random end-of-turn events, “wrecking experiments” and demonstrating that simple timing thresholds are insufficient. “Intelligent endpointing” is crucial—voice agents need to detect natural speech boundaries (when users have finished speaking versus when they’re simply pausing to think). Poor endpointing leads to agents interrupting users or waiting awkwardly long.

Silence alone fails; semantic and linguistic cues work

The solution is semantic understanding: when has the user completed a thought? Emphasis, questions, trailing off, or other linguistic signals—not just acoustic silence. Semantic VAD can “classify if it thinks that you’re done speaking” by looking at whether users “end with an emphasis or a question” or “trail off with an um,” giving probability signals that guide how long the system should wait.

Example: use probability, not a fixed timeout (conceptual):

# Semantic VAD returns p_done; threshold + min_wait avoids premature cut-off (conceptual)
def should_respond(p_done: float, silence_ms: float, min_wait_ms: int = 200) -> bool:
    return p_done > 0.8 and silence_ms >= min_wait_ms

When turn detection works correctly, teams have implemented an event-driven state machine with spectrogram-based VAD that detects speaking and pauses, enabling natural conversation flow. Without this, you get premature responses and a classic voice bot vibe.

# Minimal state machine (conceptual): react to VAD/semantic events, not just timers
states = {"listening", "user_speaking", "turn_ended", "agent_speaking"}
# On VAD: user_speaking -> (silence + semantic "done") -> turn_ended -> trigger LLM/TTS
# On agent done: agent_speaking -> listening

Even sophisticated approaches face a fundamental challenge: turn detection is highly personal. People who pause or speak slowly blow up any one-size-fits-all timing—requiring per-user or adaptive tuning that learns individual speaking patterns.
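
A hedged sketch of per-user adaptation, assuming you log each speaker’s observed mid-sentence pauses (the helper and thresholds are hypothetical):

# Adapt the wait threshold to the speaker's own pause pattern (conceptual)
def adaptive_min_wait_ms(observed_pauses_ms, default_ms=200.0, floor_ms=150.0, ceil_ms=900.0):
    if not observed_pauses_ms:
        return default_ms
    typical = sorted(observed_pauses_ms)[len(observed_pauses_ms) // 2]  # median pause
    return min(max(1.5 * typical, floor_ms), ceil_ms)  # wait a bit longer than this speaker's norm

print(adaptive_min_wait_ms([400, 550, 620]))  # slow speaker -> 825.0 ms before responding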

Turn detection must be tested with simulations and statistics

Voice simulations must test “emotions, background noise, accents, speech speed” and interruptions as primary behaviors rather than edge cases. This matters because turn detection failures cascade: if an agent interrupts mid-sentence, the user must repeat themselves (the #1 frustration in user surveys), creating a compounding negative experience. Emerging open-source solutions include Kyutai’s “unmute” STT with semantic VAD, showing that the industry recognizes this as a critical capability requiring specialized solutions beyond simple silence detection.
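
One way to treat those variations as primary test dimensions rather than edge cases is a scenario matrix. A sketch with illustrative fields, not a specific testing tool’s schema:

# Simulation matrix over speaker and channel variations, each run repeatedly for statistics (conceptual)
from itertools import product

accents = ["us", "indian_english", "scottish"]
noise = ["quiet", "street", "call_center"]
pace = ["fast", "slow_with_pauses"]
behaviors = ["cooperative", "interrupts_agent", "trails_off_mid_sentence"]

scenarios = [
    {"accent": a, "noise": n, "pace": p, "behavior": b, "runs": 10}
    for a, n, p, b in product(accents, noise, pace, behaviors)
]
print(len(scenarios), "scenario variants")  # 54 variants, each run 10x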

Builders: Turn detection belongs in your stack as a core capability, not an add-on. Semantic VAD and turn detection separate agents that feel natural from those that feel robotic. Treat them as first-class: test interruptions, pauses, and overlapping speech as primary behaviors, and test statistically across multiple runs. Transcription accuracy alone means nothing if your agent interrupts mid-thought—turn detection drives whether users feel the repetition and frustration that make them leave. Bottom line: include semantic VAD and turn detection as a core capability, and test them rigorously.


Cascaded vs. S2S: control vs. naturalism; hybrid is the future

“Enterprises don’t buy excitement. They buy control. The future isn’t S2S replacing cascaded—it’s hybrid routing, choosing the right architecture per conversation.”

— Brooke Hopkins, CEO of Coval

Cascaded gives you control and compliance checkpoints; S2S gives you lower latency and more naturalism. In production, speech-to-speech still hallucinates too much for many enterprise use cases—so the practical answer isn’t one architecture, it’s hybrid routing.

Cascaded prioritizes control; S2S prioritizes latency and naturalism

The voice agent architecture debate centers on a fundamental trade-off: cascaded pipelines (STT→text→LLM→text→TTS) prioritize control, debuggability, and compliance, while speech-to-speech (S2S) models prioritize latency and naturalism.

Aspect | Cascaded | S2S (speech-to-speech)
Prioritizes | Control, compliance, debuggability | Latency, naturalism
Text in pipeline | Yes (inspect, log, gate at each step) | No (audio in, audio out)
Best for | Complex workflows, regulated (healthcare, finance, legal) | Emotional, low-risk (therapy, coaching, premium consumer)
Latency | Streaming can reach sub-500 ms; delta vs S2S ~200–300 ms | Lower; gap narrowing

  Cascaded                              S2S (speech-to-speech)
  ========                              =====================

  [Audio] --> [STT] --> text --> [LLM] --> text --> [TTS] --> [Audio]
              inspect   ^         inspect   ^
              log       |         before    |
              comply    |         speak     |
                        +------------------+

  [Audio] ==========> [ model: audio in, audio out ] ==========> [Audio]
                      (no text in the middle; lower latency, less control)

Figure 3: Two ways to build a voice agent

The diagram compares two designs. In the first (cascaded), your speech is fully turned into text, then the AI replies in text, then that text is turned into speech—so you can inspect and control each step. In the second (speech-to-speech), audio goes in and audio comes out with no visible text in between, which can feel more natural but is harder to control and debug.

For most enterprises, this isn’t a technical preference—it’s a business requirement. For most enterprise use cases, cascaded pipelines win because they give you compliance checkpoints, fallbacks, and debuggability that S2S simply doesn’t. The text intermediaries in cascaded systems enable enterprises to inspect what the AI will say before customers hear it, which is non-negotiable in healthcare, finance, legal, and other regulated industries. AssemblyAI notes cascaded, turn-based pipelines are “unsuitable for real-time conversations” due to cumulative latency—but cascaded systems still provide the control and debuggability that enterprises require.

S2S fits emotional, low-risk use cases; cascaded fits complex workflows

However, S2S isn’t without merit. The market segments clearly: use real-time speech-to-speech when latency and naturalness matter most, and cascaded flows when you absolutely must get complex workflows right. S2S excels in emotional, premium, low-risk experiences like therapy, coaching, and certain consumer flows where the naturalness advantage outweighs control concerns—platforms like Hume AI (EVI, Octave) focus on emotion-aware, expressive S2S for these use cases.

“Something that just doesn’t really work with the chain or cascading approach—it really is native audio in and native audio out.”

— Peter Bakkum, OpenAI

Yet production reality tempers enthusiasm. Sierra reports: in tests, speech-to-speech still hallucinates too much to trust for big customers, so text-to-speech remains the safer production path. The hallucination risk in S2S models creates unacceptable reliability gaps for enterprise deployments where accuracy is paramount. End-to-end speech-to-speech models are “still maturing” and “often lack the flexibility and specialized performance of a modular, streaming architecture,” suggesting that S2S isn’t yet ready for broad production deployment despite vendor positioning.

Hybrid routing is the practical answer

The future isn’t choosing one architecture over the other—it’s hybrid routing. Speech-to-speech is great for emotional, low-risk experiences, but making it your default architecture is a mistake. Organizations will increasingly route conversations to the appropriate architecture based on intent, risk profile, and use case requirements.

Example routing logic (conceptual):

def choose_architecture(intent, risk_profile, use_case):
    if risk_profile in ("regulated", "high_stakes") or use_case in ("support", "finance", "healthcare"):
        return "cascaded"  # control, compliance, debuggability
    if intent in ("emotional", "coaching", "therapy") and risk_profile == "low":
        return "s2s"       # latency, naturalism
    return "cascaded"      # default: control

Vendor positioning of S2S as “the future” reflects optimism; builders must navigate conflicting narratives and choose architecture based on their specific use case and stack goals. For most production use cases today, a well-orchestrated streaming stack offers the best balance of performance, control, and quality.

The latency gap between architectures is narrowing but still matters. The real cascaded vs. S2S latency delta has narrowed to roughly 200–300 ms (not the 2s vs. 300ms sometimes claimed), but that gap still matters in latency-critical experiences. Streaming cascaded architectures can achieve sub-500ms latency through parallel processing—STT in chunks, LLM generating as soon as it receives first words, TTS before LLM finishes. That reduces the latency advantage of S2S while maintaining control and debuggability. Even self-hosted cascaded stacks can achieve sub-600ms end-to-end with careful model selection and streaming.

Builders must choose architecture by use case, not vendor

Builders: Architecture selection is a core stack decision—use-case driven, not vendor-driven. Default to S2S everywhere and you’ll hit compliance gaps, debuggability challenges, and reliability issues that cascaded systems avoid. Choose cascaded for everything and you’ll miss opportunities in latency-critical, low-risk use cases where S2S gives a better experience. The winning strategy is a hybrid stack that can route intelligently—choosing the right architecture per conversation while keeping control and reliability. Bottom line: choose architecture by use case (and by conversation), not by vendor—build for hybrid routing from the start.


Multi-model / modular architecture is the winning pattern

“Physics and economics mean no single model gives you maximum speed, depth of reasoning, and low cost all at once. Real systems orchestrate multiple models.”

— Brooke Hopkins, CEO of Coval

Only 19% of organizations use a single model across use cases; 81% mix and match, and 99% expect to change their strategy (Twilio, Conversational AI Report, p.64, p.75). No one model delivers max speed, depth, and low cost at once—so the winning pattern is orchestration and swapability, not lock-in.

Single-model voice agents are giving way to multi-model

The era of single-model, monolithic voice agents is ending. Twilio finds only 19% rely on a single model; 81% mix and match, and 99% expect to evolve their strategy. That’s strategic design, not buyer’s remorse. No single model can optimize for speed, reasoning depth, and cost at once. Real systems orchestrate multiple models.
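
A minimal sketch of that orchestration idea, with hypothetical model names standing in for fast, deep, and balanced options:

# Route each turn to the model that fits it (conceptual; model names are placeholders)
def pick_model(turn: dict) -> str:
    if turn.get("needs_tool_use") or turn.get("multi_step_reasoning"):
        return "deep-reasoning-model"    # slower, smarter, costlier
    if turn.get("is_smalltalk") or turn.get("short_confirmation"):
        return "small-fast-model"        # lowest latency and cost
    return "balanced-default-model"

print(pick_model({"is_smalltalk": True}))          # -> small-fast-model
print(pick_model({"multi_step_reasoning": True}))  # -> deep-reasoning-model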

The modular approach extends beyond the LLM. ASR, speech-to-speech, TTS, and language models are best treated as separate, swappable components. In practice, production systems often use different providers per capability (e.g. one for streaming STT, another for accuracy, another for language coverage, another for TTS). Two architectural families exist—cascaded and S2S—and real systems select or combine them by use case. Modularity lets you optimize each component and swap as technology evolves.

Modularity extends beyond the LLM and reduces vendor lock-in

Modular architecture pays off when facing vendor lock-in and technological change. Three gaps slow teams after the demo: price/scalability at scale, lack of standardized config across vendors, and regional compliance/compute. Monolithic stacks around one vendor get obsolete fast; lock-in blocks adaptation. Modular design lets you replace components without rebuilding. Ninety-nine percent of organizations plan to change their strategy—they’re designing for change, not permanence (Twilio).

Industry data shows 44% of teams on hybrid builds (custom + vendor), 30% on third-party platforms, 22.5% fully custom (AssemblyAI):

Build approach | Share | Typical use
Hybrid (custom + vendor) | 44% | Vendor infra for STT/noise/accents; custom logic for differentiation
Third-party platform | 30% | Speed to market
Fully custom | 22.5% | Maximum control

The dominant pattern is modular: vendor infra for hard problems (speech-to-text, noise, accents) plus custom logic for differentiation. Even self-hosted stacks benefit—e.g. OpenAI-compatible APIs for STT, LLM, TTS so you can swap without rewriting client logic.

# Same client call; swap provider via config (conceptual)
transcript = stt_client.transcribe(audio, **OPENAI_COMPATIBLE_OPTS)  # Deepgram, AssemblyAI, etc.
response = llm_client.chat(messages, **OPENAI_COMPATIBLE_OPTS)      # OpenAI, Anthropic, local, etc.
audio_out = tts_client.synthesize(text, **OPENAI_COMPATIBLE_OPTS)   # Cartesia, ElevenLabs, etc.

Speech-to-text accuracy is “mostly solved” with the right infra; building it in-house needs massive data and distracts from product.

Builders must build modular, multi-model stacks from the start

Builders: Your ultimate stack should be modular and multi-model from the start. Select components that can be swapped independently. Design interfaces that enable component replacement. Avoid architectures that create vendor lock-in. Bet everything on a single vendor or model and you’ll be unable to adapt as the landscape shifts; build modular and you can evolve component-by-component as new capabilities emerge. Graph-based architectures enable dynamic context injection; multi-agent architectures are emerging for complex deployments. Modularity extends beyond component selection to patterns that enable intelligent orchestration. Bottom line: build for swapability and evolution from day one—no single vendor or model for everything.


Dialogue management / conversation control is the "secret ingredient"

“Dialogue management is the secret ingredient that turns great STT, LLM, and TTS into a natural conversation. Without it, users get the classic voice bot vibe and beg for a live agent.”

— Anna Baidina, Revolut

Perfect STT, LLM, and TTS can still produce the classic voice-bot vibe: wrong turn-taking, no handling of interruptions, flows that work in chat but fail in real time. The missing layer is dialogue management—when to listen, when to speak, and how to recover. Without it, users beg for a live agent.

Perfect STT, LLM, and TTS still need dialogue management

Perfect STT, LLM, and TTS aren’t enough for natural voice. The missing piece is dialogue management—the layer that turns great components into a natural conversation. Without it, users get the classic voice bot vibe and beg for a human. Dialogue management decides when to listen, when to think, when to speak, and how to handle interruptions. It defines responsiveness and perceived personality more than any single component. User surveys put “having to repeat themselves” (55%) and “frequently misheard words” (45%) as top frustrations (AssemblyAI)—problems dialogue management can ease through better turn-taking and flow, even when STT isn’t perfect.
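
Interruption (barge-in) handling is one example of a decision that belongs in this layer rather than in STT or TTS. A minimal sketch, assuming VAD and playback events with hypothetical names:

# Barge-in: if the user starts talking while the agent speaks, stop TTS and listen (conceptual)
def stop_tts_playback():
    pass  # integration-specific: cancel the audio stream to the caller

def on_event(state: str, event: str) -> str:
    if state == "agent_speaking" and event == "user_started_speaking":
        stop_tts_playback()        # yield the floor immediately
        return "user_speaking"
    if state == "user_speaking" and event == "turn_ended":
        return "agent_speaking"    # semantic VAD decided the user is done
    return state

print(on_event("agent_speaking", "user_started_speaking"))  # -> user_speaking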

The challenge goes beyond turn-taking to flow redesign. Voice needs different flow patterns than text: agents should announce tool calls (“I’m looking that up for you”), avoid long-running ops during real-time conversation, and design around voice limits (e.g. spelling ambiguity). These are core dialogue decisions, not tweaks.

# Before calling a tool, say it in voice (conceptual)
def handle_tool_call(tool_name, args):
    send_to_tts("I'm looking that up for you.")  # user hears this first
    result = call_tool(tool_name, args)
    send_to_tts(format_result(result))

Without dialogue management, even perfect transcription and generation feel robotic and interruptive. Voice agents need orchestration—a “conductor” for flow, turn-taking, state, and APIs—that coordinates STT, LLM, and TTS into one coherent conversation.

  Dialogue management = conductor
  ===============================

           +------------------+
           | Dialogue         |
           | management       |
           | (when listen,    |
           |  speak, handoff, |
           |  recover)        |
           +--------+---------+
                    |
     +--------------+--------------+
     |              |              |
     v              v              v
  +------+      +------+      +------+
  | STT  |      | LLM  |      | TTS  |
  | hear | ---> | think| ---> | speak|
  +------+      +------+      +------+
     ^              |              |
     +--------------+--------------+
              (orchestration coordinates all three)

Figure 4: The voice agent team—who does what

The diagram shows the main parts of a voice agent working together: the part that hears you (STT), the part that thinks and replies (LLM), and the part that speaks (TTS). The “conductor” in the middle—dialogue management—decides when to listen, when to speak, when to hand off to a human, and how to recover when something goes wrong.

Dialogue management includes routing, fallbacks, and redundancy

At a macro level, dialogue management includes routing and evaluation: which architecture (cascaded vs. S2S), which models, what fallbacks, how to handle failures. This conversation-control layer is essential—voice systems need orchestration that adapts to context, user, and system. Teams are building graph-based dialogue management (e.g. injecting context as you move through the graph, preloading availability when the user reaches scheduling). That’s context engineering and dynamic flow, not just plumbing. It’s what makes voice agents feel human.

Example: on entering a graph node, inject only that node’s context (conceptual):

# When user reaches "schedule" node, inject schedule rules + preload availability (conceptual)
def on_enter_node(node_id: str, conversation_state: dict) -> dict:
    context = get_rules_for_node(node_id) + get_recent_history(conversation_state, turns=3)
    if node_id == "schedule":
        context += fetch_availability()  # preload while user still talking
    return context

Dialogue management also needs redundancy and fallbacks. Reliability comes from redundancy. Dialogue systems must handle component failures, route to fallbacks when primaries fail, and keep conversation state when parts break. Orchestration libraries like Pipecat, LiveKit, and Vocode (open-source streaming voice-agent framework for phone and meetings) handle this fallback logic in practice (Coval). That demands stateful session management, unlike stateless web apps. Dialogue management is a real engineering challenge, not simple request-response.

Example fallback (conceptual):

# Route to fallback on timeout or error; keep session state (conceptual)
def transcribe(audio, primary_stt, fallback_stt, timeout_ms=2000):
    try:
        return primary_stt.transcribe(audio, timeout=timeout_ms)
    except (TimeoutError, ServiceError):
        return fallback_stt.transcribe(audio)  # same session, no user re-prompt

Builders must treat dialogue management as a first-class component

Builders: Your stack isn’t complete without a dialogue management layer—the component that orchestrates STT, LLM, and TTS into a coherent conversation. Treat it as first-class, not an afterthought. Focus only on STT or LLM and neglect dialogue management, and perfect components still won’t guarantee a natural experience. The difference between agents that feel natural and those that feel robotic often comes down to dialogue management quality. Bottom line: add a dialogue management layer that owns when to listen, when to speak, and how to recover—treat it as part of the core stack, not plumbing.


Prototyping is easy; production is hard

“Where’s the Datadog for voice AI? Voice agents are stateful, always on—they’re not web apps. We still don’t have that observability.”

— Russ d’Sa, LiveKit

Success in demos often drops sharply in week-one production—Coval’s research shows success rates can fall from 95% in controlled demos to 62% when real customers use the system. The gap isn’t the models—it’s compounding failures across the stack, missing observability (standard voice-agent monitoring tooling doesn’t exist yet; LiveKit: “Where’s the Datadog for voice AI?”), and investigation work that can cost more than the rollout. This section covers what to budget and design for so the gap doesn’t catch you.

  Demo vs production
  ==================

  Controlled demo          Week-one production (real users)
  ---------------          ---------------------------------
  Happy path                 Accents, pauses, slow cadence
  Same speakers              Compounding: 10% fail × 10% × 10%
  ~95% success       -->     ~62% success
                             Missing: observability, evaluation,
                             post-deployment investigation

Figure 5: Why demos look great but real calls often don’t

The diagram shows how success rates typically drop when a voice agent moves from a controlled demo (e.g. 95% success) to real customers in the first week (e.g. 62%). It also shows how small failures at each step—hearing, thinking, speaking—add up to a much bigger chance that the call fails overall.

The demo–production gap is wider than most expect

The gap between a working prototype and a production-ready system is wider than most anticipate. The gap isn’t inferior models or training data—it’s deployment methodology. Teams that focus only on model performance and neglect orchestration, observability, evaluation, and post-deployment investigation see demos fail in production. Most builders feel confident (82.5%), yet many still struggle with reliability—accuracy/misunderstandings (52.5%), integration difficulty (45%), high costs (42.5%) (AssemblyAI). Confidence and capability are disconnected.

Prototypes lie because of compounding failures and missing observability

Building a prototype is deceptively simple.

“You can wire up a voice agent in a day, but without orchestration and observability you have no idea why real calls go wrong.”

— Anoop D, Deepgram

In practice, prototypes that looked solid fail when real users have accents, pauses, or slow cadence—failure modes that never show up in demos (transcript-voice / Process). Failures compound: 10% failure at every step (STT→LLM→TTS) can compound into much worse (Coval). LLM agents are different from traditional software: expensive, slow, non-deterministic. They need a different plan→build→test→release loop—simulations and statistical evaluation, not one-and-done unit tests.

Why compounding hurts (conceptual):

# 10% fail per step -> ~73% call success; 5% per step -> ~86% (conceptual)
def e2e_success_rate(fail_rate_per_step: float, steps: int = 3) -> float:
    return (1 - fail_rate_per_step) ** steps
# e2e_success_rate(0.10)  # 0.729
# e2e_success_rate(0.05)  # 0.857

The infrastructure gaps are significant. Tooling that makes voice-agent behavior obvious at scale still doesn’t exist. Platforms like Voiceflow (design, test, deploy with collaboration and observability) and Coval (simulation and monitoring) are filling part of the gap; voice agents remain stateful and always-on, unlike stateless web apps. Think about latency at the 99th percentile across all components. If TTS, LLM, and STT all have variance, you must be ready for all three to stall—that can add up to 6 seconds, and 6 seconds is an eternity; 30 seconds is game over.
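
Why tail math matters: even when each component is usually fast, the chance that at least one of them stalls on a given turn is higher than any single p99 suggests. A rough sketch with made-up stall probabilities:

# Probability that at least one of STT/LLM/TTS blows its budget on a turn (conceptual)
p_stall = {"stt": 0.01, "llm": 0.01, "tts": 0.01}   # 1% each, illustrative

p_all_ok = 1.0
for p in p_stall.values():
    p_all_ok *= (1 - p)

print(f"P(at least one stall per turn) = {1 - p_all_ok:.3f}")               # ~0.030
print(f"P(at least one stall in a 20-turn call) = {1 - p_all_ok**20:.2f}")  # ~0.45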

Teams also underestimate post-deployment investigation. Budget for the agent, but investigation (reviewing calls, analyzing failures, building dashboards) often costs more than the rollout. Plan for it upfront.

Accuracy, integration, and cost form a vicious cycle

A vicious cycle links accuracy, integration, and cost. Accuracy failures drive frustration and escalations; integration difficulty extends timelines; high costs block fixes. Fix one and the other two pull you back. Successful teams tackle all three.

  Vicious cycle (fix one, the other two pull back)
  ================================================

       accuracy failures --> frustration, escalations
              ^                        |
              |                        v
         high costs <-- integration difficulty
              |         (extends timelines)
              v
       blocks fixes
  (Successful teams tackle all three at once.)

Even self-hosted stacks face production challenges—concurrency and multi-call scaling are under-discussed but real. Prototypes don’t reveal them; orchestration and infra planning do.

Builders must budget and design for production infrastructure from day one

Builders: Your stack only delivers when production practices are built in from the start. Observability, evaluation, and investigation are prerequisites, not afterthoughts. Budget and design for production infra from day one. Teams that invest meaningfully in evaluation (e.g. 20–30% of budget) reach 90%+ production success in months; those that skip testing start at ~62% and take 6–9 months to reach 85% (Coval, AssemblyAI). The difference is methodology, not models.

Evaluation investment | Week-one production success | Time to ~85% reliability
20–30% of budget | 90%+ | Months
Skip / minimal | ~62% | 6–9 months

Bottom line: treat production infra, observability, and post-deployment investigation as part of the stack from day one—otherwise your stack won’t ship reliably.


Evaluation and testing infrastructure drive production success

“The difference isn’t better models—it’s systematic testing and the ability to understand what’s actually happening in production.”

— Zack Reneau-Wedeen, Sierra

The difference between teams that ship reliably and those that don’t usually isn’t better models—it’s systematic testing and understanding what’s actually happening in production.

Evaluation infrastructure delivers measurable ROI

Teams that invest in evaluation reach strong production success in months; those that skip testing start behind and take longer to catch up (Coval, Sierra). Failures compound across the stack—so systematic testing is essential before production.

Traditional testing fails; agents need goal-achievement evaluation

Traditional software testing fails with voice agents. Leading teams run simulations multiple times per scenario and use statistical analysis to judge behavior. Agents are non-deterministic—test for goal achievement, not specific outputs. You need to know whether the agent helped the user achieve their goal, not whether it said exact words. Evaluation should measure distributions and probabilities, not binary pass/fail. Voice testing must account for interruptions, pauses, accents, noise, and the variability of human conversation. Traditional test suites aren’t enough.
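
A sketch of gating on goal achievement statistically rather than asserting exact outputs; run_scenario and the judge are placeholders you would wire to your own stack:

# Run a scenario N times and gate on goal-achievement rate, not exact strings (conceptual)
def evaluate_scenario(run_scenario, judge_goal_achieved, n_runs=10, min_pass_rate=0.9):
    passes = sum(bool(judge_goal_achieved(run_scenario())) for _ in range(n_runs))
    rate = passes / n_runs
    print(f"goal achieved in {passes}/{n_runs} runs ({rate:.0%})")
    return rate >= min_pass_rate

# Stubbed example; in practice run_scenario drives a real call and the judge is a reasoning model
print(evaluate_scenario(lambda: "call transcript...", lambda transcript: "transcript" in transcript))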

The industry still lacks standard observability for voice

The ecosystem gap is significant. Tooling that makes voice-agent behavior obvious at scale doesn’t exist yet. Voice agents are stateful, always-on; they don’t behave like stateless web apps. Monitoring and debugging need different approaches. Teams often build custom or operate blind. Voice agents need instruction-following evaluation, not just performance metrics. Quality is non-deterministic; you need to assess whether agents “did the right thing at the right time,” not just accuracy or latency.

Simulations, CI/CD, and post-deployment evaluation are all required

Voice simulations require testing “emotions, background noise, accents, speech speed” and the entire stack (STT, reasoning, TTS/S2S), not just conversation logic (Sierra). Voice adds unique complexity—you must test the entire audio pipeline. Critical simulations run in CI/CD pipelines, treating evaluation as a continuous process rather than a one-time check. Organizations need a three-layer framework: regression (core scenarios), adversarial (edge cases), and production-derived (scenarios from real user failures). That’s comprehensive coverage that prototypes never achieve.

Example CI job (conceptual—run voice sims on every change):

# Example: run voice evaluation on push
voice-eval:
  runs-on: ubuntu-latest
  steps:
    - run: ./scripts/run_voice_sims --scenarios regression adversarial --runs 10
    - run: ./scripts/judge_calls --model reasoning-judge

Layer | Purpose | When to run
Regression | Core scenarios, happy path | Every change; CI/CD
Adversarial | Edge cases (noise, accents, interruptions) | Every change; CI/CD
Production-derived | Scenarios from real user failures | Continuously; feed back into regression/adversarial

Practical pipeline (conceptual): text scenarios → TTS/record → full stack → judge. Example shape:

scenarios.yaml (intents, flows, edge cases)
    → generate_speech (TTS or recorded)
    → run_agent (STT + LLM + TTS/S2S)
    → judge_call(goal, steps, behavior)  # reasoning model
    → aggregate(5–15 runs per scenario)

Post-deployment evaluation is part of the same loop

Post-deployment evaluation is equally critical. Most of the painful work happens after launch, when you’re trying to understand what actually happened on calls. Teams consistently underestimate the cost of investigation—reviewing calls, analyzing failures, building dashboards. This work can cost more than the initial deployment if not planned upfront. Evaluation isn’t just pre-production testing; it’s continuous monitoring, analysis, and improvement throughout the agent’s lifecycle. Build evaluation infrastructure that can test fallback mechanisms and redundancy paths, not just primary flows.
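
A sketch of that continuous loop, with hypothetical helpers for sampling calls, judging them, and registering new scenarios:

# Sample production calls, judge them, and turn failures into production-derived scenarios (conceptual)
def post_deploy_loop(sample_recent_calls, judge_call, add_scenario, sample_size=50):
    failures = []
    for call in sample_recent_calls(sample_size):
        verdict = judge_call(call)              # e.g. a reasoning-model judge over the transcript
        if not verdict.get("goal_achieved"):
            failures.append(call)
            add_scenario(call)                  # feeds the regression/adversarial suites
    return len(failures) / max(sample_size, 1)  # failure rate to trend on a dashboard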

Builders must allocate budget for evaluation and testing from day one

Builders: Evaluation and testing are part of the stack that lets you ship reliably and keep evolving. Allocate budget and design for evaluation, testing, and observability as first-class requirements, not afterthoughts. Invest in evaluation infrastructure from day one. Build simulation frameworks that test non-deterministic behavior statistically. Plan for post-deployment investigation as a core capability. Treat evaluation as optional and you’ll discover that production success requires understanding what’s happening—and that understanding requires infrastructure you should have built from the start. Bottom line: evaluation and testing are core stack capabilities—budget for them from day one so you can ship and evolve with confidence.


Satisfaction gap: adoption is high, satisfaction is low

“Eighty percent use voice agents. Twenty-one percent are very satisfied. The market is voting with its wallet—and it’s not voting for the status quo.”

— Anoop D, Deepgram

Adoption is high; satisfaction is low—80% use voice agents, only 21% very satisfied (Deepgram 2025). Buyers often think customers are happy (90% of orgs believe it); only 59% of consumers agree (Twilio). The gap is accuracy and experience. Teams that fix reliability instead of chasing features have a clear opening.

  Satisfaction gap
  ================

  Use vs satisfaction          Org belief vs consumer reality
  ------------------          ---------------------------------
  80% use voice agents         90% of orgs believe
  21% very satisfied            customers are satisfied
  -------                      59% of consumers agree
  gap = opportunity            gap = blind spot

Figure 6: Adoption is high; satisfaction is low

The diagram illustrates the gap between how many people use voice agents (e.g. 80%) and how many are very satisfied (e.g. 21%). It also shows the mismatch between what organizations believe (e.g. 90% think customers are satisfied) and what consumers actually say (e.g. 59% agree).

Adoption is high; satisfaction is low

The gap is both problem and opportunity: strong demand and widespread dissatisfaction. The market is voting with its wallet. Yet user satisfaction remains stubbornly low—a disconnect between effort and outcomes.

Buyer perception and user reality diverge

The satisfaction gap extends to customer perception. Organizations track automation and cost; customers feel frustration and unmet needs. User frustration is well documented: 55% cite “having to repeat themselves,” 45% “frequently misheard words” (AssemblyAI). Buyer perception and user reality diverge—a blind spot that blocks fixing real problems.

Contact center data adds another dimension: 98% of contact centers use AI, but leaders often optimize for automation and bot scores while underinvesting in agent experience, emotional intelligence, and human–AI collaboration (Calabrio). Automation handles routine work; human agents get harder, more emotional interactions without enough support. Result: agent burnout, turnover, and degraded experience despite strong automation stats. Nearly a third (31%) of users prefer human over AI—a preference that costs through churn, escalations, and reputation (AssemblyAI, Twilio).

The root cause is accuracy. Repetition and misheard words are one problem with many symptoms. When STT misses words, everything downstream fails—the LLM never got the input, users repeat themselves, frustration compounds. The industry is still moving from “does it respond?” to “can it finish the conversation?” to “can it do complex actions?” Many deployments are stuck at early stages where satisfaction stays low.

Accuracy first, then iteration

Builders: Getting your stack right—architecture, components, and practices—closes the satisfaction gap. There’s massive opportunity for teams that focus on reliability, evaluation, and UX rather than feature completeness. Winners solve reliability, transparency, and value delivery—not the longest feature list.

Aspect | Successful teams | Struggling teams
Focus | Accuracy first, then iteration | Deploy first, fix later
Metrics | Multiple (accuracy, containment, UX, cost) | Cost only, or vanity metrics
Cost vs UX | Balance both | Optimize cost only
Improvement | Visible in 60–90 days | No clear success criteria

Bottom line: the satisfaction gap is closed by the stack you choose—architecture, components, and practices that prioritize reliability and UX over demos and feature lists.


The Buyers Guide to Voice Agents

Goal: Help you get the best return on investment—how to evaluate vendors, govern deployments, and avoid the pitfalls that burn budget without delivering value.

The sections below are the levers that determine ROI: what to require from vendors and deployments, and which pitfalls burn budget or destroy value if you miss them.

Hybrid human–AI is required, not optional

“It’s critical to be able to switch from an AI agent to a human when needed. The hybrid model isn’t a transitional phase—it’s the end state.”

— Twilio

Full-AI voice isn’t the goal—78% of consumers say it’s critical to be able to switch from an AI agent to a human (Twilio). The hybrid model is the end state. Only 15% of consumers report seamless AI-to-human handoffs; the failures cluster around context transfer, and the rest get dropped context and frustration (Twilio, p.294, p.298). To get the best ROI, require and govern for: context transfer, escalation quality, and a clear division of labor (AI for routine, humans for the rest).

  Hybrid: AI + human (78% say "switch to human" is critical)
  ==========================================================

  [User] <-----> [AI agent]  routine questions, simple tasks
                  |
                  | handoff (only 15% get seamless + context today)
                  v
              [Human agent]  complex, emotional, escalation
                  |
                  +-- needs: full context from AI, no "repeat yourself"

Figure 7: AI and humans working together

The diagram shows how a typical call is meant to work: the AI handles routine questions and tasks, but when the user needs a person—or when the AI gets stuck—the call is handed to a human with full context so the user doesn’t have to repeat themselves. It highlights that only a small share of users today experience that handoff as seamless.

Customers expect human access; few get seamless handoff

Buyers: Fully autonomous voice agents are appealing but unrealistic. Handoff is a customer expectation, not a nice-to-have—78% say switching to human is critical (Twilio). Only 15% report seamless AI-to-human handoffs, especially around context transfer. The gap between what customers need and what most organizations deliver is large; the hybrid model isn’t transitional, it’s the end state. Nearly a third (31%) prefer human over AI—a preference that costs through churn, escalations, reputation, and lost revenue (AssemblyAI). Human availability is a business necessity.

AI handles routine; humans handle the rest

The division of labor is clearer: AI excels at routine, predictable interactions; humans remain essential for complex problem-solving, emotional support, and situations needing judgment or empathy. As automation takes routine work, the remaining interactions get harder and more emotional. Human agents need better support and training, not replacement. Empathy is the most lacking agent skill; many organizations underinvest in emotional intelligence and ongoing coaching for AI-driven workflows. Organizations must invest in human agent development even as AI handles routine tasks.

Pure-AI delivery is unrealistic without governance and knowledge management

Customer service leaders are now primary decision-makers for AI initiatives in many organizations. Knowledge management gaps, process inconsistencies, and governance challenges make pure-AI service delivery unrealistic. Human expertise and oversight remain central—not because AI is insufficient, but because complex customer service requires judgment, context, and relationship-building that current AI cannot fully replicate. The hybrid model acknowledges these limitations while maximizing the value of both. Learning alone won’t solve the hybrid challenge; organizations need systems that enable seamless collaboration between AI and human agents.

Handoff requires context transfer and escalation quality

Only 15% of consumers report seamless AI-to-human handoffs, and the failures cluster around context transfer (Twilio, p.294, p.298). The gap between customer expectations and organizational capability is large.

“If you’re making someone wait, there has to be that feedback. Because otherwise they think that it just went offline.”

— Russ Dsa, LiveKit

Poor accuracy drives human handoffs, which eliminates cost savings: organizations deploy to cut costs, but misrecognition drives escalations that erase those savings. Hybrid handoff quality is a critical success factor. Organizations need context preservation (human agents have full visibility into AI interactions), handoff mechanisms that transfer context without requiring users to repeat themselves, and training so human agents can collaborate effectively with AI.
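
As a concrete illustration of “context preservation,” here is a minimal sketch of a handoff payload the AI could pass to the human agent’s console so callers never repeat themselves. The field names are illustrative assumptions, not a vendor schema:

  # Minimal sketch of a handoff payload; field names are illustrative, not a vendor schema.
  from dataclasses import dataclass, asdict
  from typing import Dict, List
  import json

  @dataclass
  class HandoffContext:
      session_id: str                  # ties back to the objective call logs
      caller_intent: str               # what the AI believes the caller wants
      collected_slots: Dict[str, str]  # details already verified (account, dates, order id)
      actions_attempted: List[str]     # tools and lookups the AI already tried
      transcript_tail: List[str]       # last few turns, verbatim, not an AI summary
      escalation_reason: str           # why the AI is handing off (stuck, user request, policy)

  def to_agent_desktop(ctx: HandoffContext) -> str:
      """Serialize the context so the human agent's console can render it on pickup."""
      return json.dumps(asdict(ctx), indent=2)

  if __name__ == "__main__":
      ctx = HandoffContext(
          session_id="abc-123",
          caller_intent="dispute a duplicate charge",
          collected_slots={"account_last4": "4821", "charge_date": "2026-01-28"},
          actions_attempted=["looked_up_recent_charges"],
          transcript_tail=["User: I was charged twice...", "Agent: I see two charges on Jan 28."],
          escalation_reason="refund above AI authorization limit",
      )
      print(to_agent_desktop(ctx))

The exact fields matter less than the principle: the human sees verbatim transcript and attempted actions, not a summary, so the caller does not start over.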

Buyers: To get the best ROI, hybrid must be designed in from day one—not bolted on later. When evaluating vendors, require seamless handoff and context transfer. When governing deployments, treat hybrid as core to ROI—poor handoff burns budget through escalations, churn, and lost resolution rates. Bottom line: require and govern for hybrid from day one—handoff and context transfer are where value is preserved or burned.


Compliance, control, and auditability matter for enterprises

“The biggest red flag is lack of on-prem or private deployment. When regulators ask where voice data is processed—and it contains biometric identifiers—that becomes a board-level problem.”

— Zohaib, Resemble AI

When regulators ask where voice data is processed—and it contains biometric identifiers—the answer had better not be “someone else’s cloud.” The biggest red flag is lack of on-prem or private deployment. Retrofitting compliance after launch is far costlier than designing for it from day one. To protect ROI, require from vendors and deployments: control over what the AI says, objective logging and audit (not LLM summaries), and data residency, biometrics, and regional requirements.

Enterprises need compliance checkpoints and control over what the AI says

Buyers: Enterprise voice deployments face regulatory scrutiny that consumer apps don’t. Healthcare, finance, legal, and other regulated industries need compliance checkpoints, audit trails, and control over what AI says and does. Text intermediaries in cascaded pipelines give you that: you can inspect what the AI will say before customers hear it. That’s non-negotiable in regulated industries. This choice isn’t about performance—it’s about legal and regulatory requirements. Voice data is sensitive and often needs HIPAA and SOC-2. Compliance is a first-class requirement, not an afterthought.

Logging and audit require objective truth, not LLM summaries

The compliance challenge extends beyond what the AI says to how it’s logged and audited. Without hard, objective logs of what your agent did, you’re flying blind on security and compliance. LLM-generated summaries for incident analysis create dangerous gaps: they can miss critical details, introduce errors, and fail audit requirements. You need definitive logging that captures exactly what happened, when, and what decisions were made—not AI-generated interpretations.

Example shape for an objective log entry (per turn or per call):

{
  "session_id": "...",
  "turn_id": "...",
  "timestamp_utc": "...",
  "transcript": "...",
  "actions": ["listened", "tool_called", "spoke"],
  "tool_calls": [{"name": "...", "args": {...}}],
  "fallbacks_used": [],
  "latency_ms": {"stt": 120, "llm": 340, "tts": 80}
}

Session and turn identifiers let you trace a full call; timestamps and latency per component support audit and 99th-percentile dashboards. Compliance requires systems that can verify and audit behavior without depending on AI-generated explanations, including deterministic access controls that don’t rely on model decisions.
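
To make the 99th-percentile point concrete, here is a minimal sketch (standard-library Python) that reads one JSON object per turn in the shape above and reports tail latency per component. The file name is an assumption; the field names follow the example log entry:

  # Minimal sketch: tail latency per component from per-turn JSON logs (one object per line).
  import json
  from collections import defaultdict

  def p99(values):
      """Nearest-rank 99th percentile; enough for a dashboard sketch."""
      ordered = sorted(values)
      index = max(0, int(round(0.99 * len(ordered))) - 1)
      return ordered[index]

  def tail_latency(log_path="voice_agent_turns.jsonl"):
      per_component = defaultdict(list)
      with open(log_path) as f:
          for line in f:
              turn = json.loads(line)
              for component, ms in turn.get("latency_ms", {}).items():
                  per_component[component].append(ms)
      return {component: p99(values) for component, values in per_component.items()}

  if __name__ == "__main__":
      for component, ms in tail_latency().items():
          print(f"{component}: p99 = {ms} ms")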

Data residency, biometrics, and regional requirements are compliance must-haves

Data residency and deployment architecture become compliance requirements when voice is treated as biometric data: regulators care where and how that audio is processed, not just what the model says. Regional compliance is a “hard stop” for regulated industries:

“I’m from Australia, and if you’re in the medical space in Australia, you’ve got to run the compute in Australia. And the route has to be in Australia, because you’ve got to keep your records local.”

— Luke Miller, Slang Labs

The EU AI Act Article 50 creates transparency obligations requiring synthetic content to have machine-readable marking and detection capabilities, making compliance a technical requirement, not just a policy concern. Organizations must design for on-premise or private deployment options, data residency controls, and ongoing security updates (like deepfake detection) as part of their compliance strategy.

Regional infrastructure availability is severely limited: very few big providers have more than three or four regions. A mismatch between where the control plane runs and where the runtime and data live creates hidden problems. Organizations must architect for regional compliance so that all components (control plane, runtime, data storage) meet regional requirements, not just the primary processing components. In some jurisdictions a voice profile is classified as PII, adding compliance complexity.

Buyers must architect for compliance and observability from day one

Buyers: To protect ROI, compliance and observability must be architected in from day one. When evaluating vendors, require on-prem or private deployment options, objective logging (not LLM summaries), and data residency controls. When governing deployments, treat compliance as non-negotiable. Retrofitting compliance into voice agent systems is far more expensive and risky than designing for it from the start—and it’s a major way budget gets burned without delivering lasting value.

Choose architectures that enable inspection and control (cascaded pipelines with text intermediaries), deterministic access controls that don’t rely on model decisions, and logging and audit systems that capture objective truth rather than AI interpretations. Compliance isn’t just regulatory; it’s a technical challenge that requires upfront planning and ongoing investment. Bottom line: require compliance and observability from vendors and deployments from day one—retrofitting burns budget and destroys value.


Professional services / cross-domain expertise remain necessary

“People mix concepts: am I controlling the behavior of the voice interaction, the nature of the voice, or the business logic? It’s often cross-pollination—you get funky prompts that end up with bizarre behaviors.”

— Luke Miller, Slang Labs

Voice agents need more than engineering—they need conversation design, brand/UX, and ML working together, and most organizations have depth in one or two, not all three. Shipping a serious voice agent feels more like deploying a whole contact center than spinning up a chatbot; projects that don’t plan for professional services or upskilling often stall right after the demo.

Voice agents need three domains: conversation design, brand/UX, and ML

Voice agent success needs expertise in three domains at once: conversation design (how humans communicate and how to structure dialogues), brand/UX (voice agent reflects identity and delivers consistent experience), and ML/LLM tuning (models, prompts, behavior for the use case). Most organizations have depth in one or two; almost none have all three. Real deployments need people who understand conversation design, brand, and ML—not just one. Even among confident teams (82.5% feel confident building voice agents), 25% admit they lack skills in NLP engineering, voice UX, and conversational AI architecture (AssemblyAI, Deepgram). Confidence doesn’t equal capability.

Professional services intensity matches contact-center scale

This expertise gap isn’t temporary.

“Shipping a serious voice agent feels more like deploying a whole contact center than spinning up a chatbot—you need that level of services.”

— Brooke Hopkins, CEO of Coval

The professional services intensity required for voice agent deployments is comparable to full contact center rollouts, and this will remain true for the foreseeable future. Voice agents aren’t yet a “drop-in” product that organizations can deploy successfully without significant consulting, training, or internal capability building. The most common failure is prompt confusion: people mix concepts (am I controlling the behavior of the voice interaction, the nature of the voice, or the business logic?) and end up with funky prompts that produce bizarre behaviors, which shows that even technical teams struggle with voice-specific design patterns.

Ongoing optimization and hidden costs demand expertise

The need extends beyond deployment to ongoing optimization and workforce development. Organizations underinvest in agent training—especially emotional intelligence and coaching—even as AI reshapes workflows. Empathy is the most lacking agent skill; 64% aren’t prioritizing it, and 59% don’t provide ongoing coaching for AI-driven workflows (Calabrio). Professional services fill that gap: redesigning workflows, training human agents to work with AI, building internal capability. The voice agent transformation is organizational change; it needs expertise most companies don’t have.

Hidden costs need expertise to manage. Rough ranges (varies by scope and vendor):

  Cost category      | Typical range (order of magnitude)
  -------------------+-----------------------------------
  Integration        | $1K–50K
  Training           | $500–2K
  Compliance add-ons | Variable
  MVP development    | $40K–100K+ for a basic agent

Optimizing cost too early creates agents users avoid. Successful teams follow a playbook: accuracy first, multiple metrics, balance cost and UX, improvements in 60–90 days. Struggling teams deploy without success criteria, optimize only for cost, chase vanity metrics. Expertise tells you which approach to follow. The space is new; people don’t know what they’re doing or how to put it together. The skill gap is fundamental, not temporary.

Buyers must plan for professional services or internal upskilling

Plan for professional services or internal upskilling as part of your voice agent strategy. Budget for consulting, training, and capability building rather than expecting to deploy with existing internal resources. If you treat voice agents as a simple technology purchase, you’ll discover that success requires cross-domain expertise you may not possess—and that underinvesting here burns budget through stalled projects, rework, and deployments that never reach value.

Teams that invest in professional services or build internal capabilities achieve faster, more successful deployments and better ROI. Voice agents aren’t yet mature enough to be a self-service product for most organizations; the successful ones recognize that voice agents need expertise across conversation design, brand, and ML and invest in building or buying it. Buyers—bottom line: budget for and require expertise (consulting, training, or internal upskilling); skipping it burns budget and delays value.


Knowledge base and content readiness block success

“The best way to get your knowledge base in a good place is to launch an AI agent and find the edges of what it knows.”

— Zack Reneau-Wedeen, Sierra

Voice agents can only be as good as the knowledge they can access—and most organizations discover their knowledge bases aren’t ready. Many customer service leaders have backlogs of articles to edit; many have no formal revision process. The counterintuitive fix: the best way to get your knowledge base in a good place is to launch an agent and find the edges of what it knows, then fix governance and use deployments to improve content instead of waiting for perfection.

Knowledge backlogs and missing revision processes block voice agents

Voice agents can only be as good as the knowledge they can access. Most organizations discover their knowledge bases aren’t ready—backlogs of articles to edit, often no formal revision process. Outdated, incomplete, or inconsistent knowledge produces unreliable, frustrating agents. GenAI deployment fails if knowledge management is ignored—content readiness is a prerequisite, not optional. Customer service leaders now own AI initiatives (more than IT in some areas) but can’t succeed without fixing knowledge gaps first.

Launch to learn surfaces knowledge gaps

However, waiting for perfect knowledge bases before deploying voice agents creates its own problems. A counterintuitive strategy: the fastest way to discover what your knowledge base is missing is to put an agent in front of customers and watch where it fails. Launch an AI agent with appropriate scoping and use the platform to surface and prioritize missing or incorrect knowledge from real conversations. That creates a continuous improvement loop and identifies knowledge gaps faster than traditional content audits. This “launch to learn” approach treats knowledge base improvement as an ongoing process rather than a one-time preparation step. Platforms can create a prioritized list of the knowledge your agent should have but doesn’t have—turning deployment into a knowledge discovery mechanism that accelerates content improvement.
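
One way to operationalize “launch to learn” is to count the moments where the agent could not answer and rank them by frequency. A minimal sketch, assuming per-turn logs in the shape shown earlier plus a hypothetical "topic" field:

  # Minimal sketch: turn "the agent didn't know" events into a prioritized knowledge backlog.
  # Assumes per-turn JSON logs; "fallbacks_used" matches the earlier log shape, "topic" is hypothetical.
  import json
  from collections import Counter

  def knowledge_gaps(log_path="voice_agent_turns.jsonl", top_n=10):
      gaps = Counter()
      with open(log_path) as f:
          for line in f:
              turn = json.loads(line)
              if turn.get("fallbacks_used"):       # the agent fell back instead of answering
                  gaps[turn.get("topic", "unknown")] += 1
      return gaps.most_common(top_n)

  if __name__ == "__main__":
      for topic, count in knowledge_gaps():
          print(f"{count:>4}  {topic}")            # highest-frequency gaps first: fix those articles first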

The tension between these approaches reflects a fundamental challenge: organizations need good knowledge bases to deploy effective voice agents, but they also need voice agent deployments to discover what’s missing from their knowledge bases.

The solution isn’t choosing one approach over the other—it’s fixing knowledge governance and revision processes while simultaneously using voice agent deployments as discovery mechanisms. Organizations that fix governance first will deploy more successfully, while those that deploy carefully scoped agents can accelerate knowledge base improvement through real-world feedback. Sixty-four percent of service leaders plan to spend more time learning about technology in 2025 (Gartner), but learning alone won’t solve knowledge management gaps—organizations need formal revision processes, content governance, and systems that keep knowledge current as deployments surface gaps.

Context engineering and dynamic injection are required

Context engineering is critical because you can’t feed all context into every turn: there’s far too much of it, requiring dynamic injection via graphs or background agents rather than dumping all knowledge into every prompt. Knowledge base readiness isn’t just about having good content; it’s about having content structured for dynamic retrieval and context injection, which requires knowledge engineering expertise that most organizations lack.

Buyers must address knowledge management systematically

Knowledge readiness directly affects ROI—deployments that ignore it burn budget on failed rollouts and frustrated users. When evaluating vendors and deployments, require evidence of knowledge governance and revision processes; use carefully scoped deployments to surface gaps so you get value faster instead of waiting for perfection.

Buyers: Address knowledge management systematically. Voice agents can be part of the solution, not just dependent on it: formal revision processes, knowledge base cleanup, governance that keeps content current, plus voice agents in scoped use cases that surface knowledge gaps through real customer interactions. The teams that succeed treat knowledge management as a continuous discipline, using voice agents both as a consumer of knowledge and as a tool for improving it. Bottom line: require knowledge governance from vendors and deployments, and use scoped rollouts to surface gaps—ignoring readiness burns budget; using deployments to learn accelerates value.


Trust and transparency break first in production

“User trust is easily the first thing that breaks when you roll out voice agents. The easiest way to lose trust is to pretend your AI is something it’s not. Transparency prevents most of the pain.”

— Carter Huffman, Modulate AI

When voice agents go live, the first thing that usually fails isn’t accuracy or latency—it’s user trust. The easiest way to lose it is to pretend your AI is something it’s not. What actually fixes it: transparency about what the agent can and can’t do, human availability when users need it, and objective logging so you can explain what happened. Designing for trust from day one pays off; retrofitting it after users have already been burned does not.

Trust as the first casualty; transparency as the fix

When voice agents go live, the first thing that usually fails isn’t accuracy or latency—it’s user trust. Trust breaks easily even when technical performance is solid. The problem is usually transparency and expectation management, not capability. When users don’t know they’re speaking to AI, or when capabilities are overstated, trust erodes.

The fix is counterintuitive: being explicit about AI limitations builds trust. The easiest way to lose trust is to pretend your AI is something it’s not. Being upfront about what it can and can’t do avoids most of that pain. Modulate’s Carter Huffman estimates that transparency prevents about 80% of trust issues, making it one of the highest-ROI investments.

Speed doesn’t create trust; human availability does

The trust challenge extends beyond transparency to capability perception. While AI response times are fast, trust lags significantly: many consumers feel AI doesn’t understand them as well as humans, and they value being able to escalate to a person when things go wrong. Speed alone doesn’t create trust—users need to feel understood, respected, and able to access human help when needed. Nearly a third (31%) of users prefer human over AI (AssemblyAI), a preference that carries a measurable price in churn, escalations, reputation, and lost revenue. This reinforces the hybrid human-AI model: trust requires both AI efficiency and human availability, not one or the other.

Ethics and authenticity matter for voice as identity

Ethical concerns compound the trust challenge. Many voice buyers still strongly prefer human voices; ethical concerns about AI voice training data and misuse are high. Adoption of AI voices remains cautious, driven by authenticity and ethical concerns. Authenticity and ethics are central to trust, particularly in brand-critical applications where voice represents organizational identity. Organizations using AI voices must address not just technical quality but ethical sourcing, consent, and appropriate use—concerns that don’t apply to human voice actors. Voice data is sensitive and often needs HIPAA and SOC-2, adding compliance complexity that impacts trust.

Objective logging enables explainability

Trust-building needs objective evidence, not promises. Without hard, objective logs of what your agent did, you’re flying blind on security and compliance. LLM-generated summaries for incident analysis create dangerous gaps. You need definitive, objective logging—when users ask “why did the agent do that?” or “what happened in that call?”, you must give clear, factual answers from logs, not AI interpretations. Trust and satisfaction are measurable business outcomes, not soft metrics.

Trust is an ROI lever—lost trust costs you in churn, support escalations, and reputation.

Buyers: When evaluating vendors, require transparency (what the AI can and can’t do) and objective logging that enables explainability. When governing deployments, design for trust from day one. Be explicit that users are speaking to AI. Clearly communicate what the AI can and cannot do. Invest in objective logging. Build trust through consistent performance rather than hiding limitations.

Treat trust as an afterthought and technical success won’t guarantee user acceptance—trust failures burn value. Prioritize transparency and ethical practices and you’ll build sustainable deployments that users actually want to use. Buyers—bottom line: require transparency and objective logging from vendors; govern for trust from day one. Trust failures burn value fast.


Voice agents as learning systems, not one-off deployments

“In 2026 the edge won’t be who has the flashiest day-one agent—it’ll be who built the fastest learning loop around their agent.”

— Zack Reneau-Wedeen, Sierra

In 2026 the edge won’t be who has the flashiest day-one agent—it’ll be who built the fastest learning loop around their agent. Winners treat every conversation as feedback: outcome-based pricing, deploy–measure–learn–improve cycles, and knowledge-base learning from production. The teams that build feedback and iteration into the operating model from the start will pull ahead; those that deploy once and maintain will fall behind.

Advantage comes from learning velocity, not day-one capability

Competitive advantage in voice agents doesn’t come from the “best” day-one agent—it comes from systems that learn from every conversation and improve faster than competitors. In 2026 the edge will be who built the fastest learning loop, not the flashiest launch. Teams that treat voice agents as one-time deployments find that static systems fall behind as technology, use cases, and expectations change. Winners build continuous improvement into their operating model. The industry is moving from “does it respond?” to “can it finish the conversation?” to “can it do complex actions?” Learning systems must evolve capability over time, not launch with fixed functionality.

Learning systems use outcome-based pricing and upward spirals

Sierra and others use outcome-based pricing (only getting paid when the agent successfully completes the job) and combine continuous simulations with live deployment data to drive what they call an “upward spiral” of improvement. This creates aligned incentives: both the vendor and customer benefit from agent improvement, and every conversation becomes data for making the system better. They also treat knowledge-base learning from production as a core loop—using agent failures to automatically surface and prioritize missing or incorrect knowledge, turning deployment into a knowledge discovery mechanism. Simulation frameworks enable “build once, test anywhere, deploy anywhere,” allowing learning systems to improve across modalities (voice, chat, messaging, email) simultaneously, maximizing the value of each conversation for system improvement.

Deploy, measure, learn, improve—not deploy and maintain

The learning system model requires a fundamental shift: from “deploy and maintain” to “deploy, measure, learn, improve.” Successful organizations treat every interaction as feedback. That means evaluation infrastructure that captures what’s happening, analysis that finds improvement opportunities, and iteration that implements changes rapidly.

  Learning loop (not one-off deploy)
  ==================================

      +--------+
      | Deploy |----+
      +--------+    |
           ^        v
      +----+   +---------+
      |        | Measure |  (real calls, logs, outcomes)
      |        +----+----+
      |             |
      |             v
      |        +---------+
      |        | Learn   |  (simulations, production-derived scenarios)
      |        +----+----+
      |             |
      |             v
      +--------+---------+
               | Improve |
               +---------+
  (outcome-based pricing, knowledge-base learning from production)

Figure 8: Voice agents that get better over time

The diagram shows the cycle that makes voice agents improve: deploy, measure what happens on real calls, learn from that data, and improve the agent. It contrasts this with the old approach of deploying once and barely changing it, and shows how outcome-based pricing and learning from production feed back into the loop.

Static deployments assume initial design is sufficient; learning systems recognize that voice agents are never “done”—they’re always improving or falling behind. Voice agents are like self-driving: longer and messier than we expect, but reliable autonomous systems are in reach. Plan for iterative improvement over years, not months.

Successful teams follow a playbook: accuracy first, multiple metrics, balance cost and UX, improvements in 60–90 days. Struggling teams deploy without success criteria, optimize only for cost, chase vanity metrics. Learning systems need clear success criteria, multiple measurement dimensions, and rapid iteration. ROI-positive teams show improvements in 60–90 days when properly instrumented and iterated. You need instruction-following evaluation (“did it do the right thing at the right time?”)—evaluation that assesses goal achievement and behavior quality, not just accuracy or latency.

Buyers and builders must build in feedback and iteration from the start

Buyers: ROI depends on learning velocity—one-off deployments destroy value over time. When evaluating vendors, ask how they support learning loops, outcome-based pricing, and deploy–measure–learn–improve cycles. When governing deployments, budget for ongoing improvement, not just initial launch. Build feedback, monitoring, and iteration into your voice agent operating model from the start. Treat voice agents as a one-time project and you’ll discover that competitive advantage and ROI come from improvement velocity, not initial capability. The market rewards those who learn fastest, not those who start strongest. Bottom line: require learning loops and outcome-based pricing from vendors; govern for deploy–measure–learn–improve. One-off deployments destroy ROI over time.


Annex: Playbooks by Use Case

The annex below is organized by use case. Each use case includes concrete advice for builders (design, implement, operate) and buyers (evaluate vendors, govern deployments, protect ROI). The main guide provides the evidence and narrative; these playbooks are the actionable checklists.


1. Turn-taking

The biggest user-experience killer in voice agents is not dumb reasoning—it is the agent talking over people or waiting so long that the call feels broken. Naive systems assume “silence means the person is done,” but humans pause mid-sentence, think aloud, trail off, or speak slowly. The frontier is detecting the meaning and tone of speech to decide if a turn is truly finished, then balancing that against the need to respond quickly.

                    +------------------+
                    |   User speech    |
                    +--------+---------+
                             |
                             v
              +------------------------------+
              |  Semantic VAD / end-of-turn  |
              |  (emphasis? question? "um"?) |
              +--------+--------------+------+
                       |              |
         "still talking"|              |"turn ended"
                       v              v
              +-------------+   +------------------+
              | keep        |   | streaming STT    |
              | listening   |   | (chunks ready)   |
              +-------------+   +--------+--------+
                                 |                 |
                                 v                 v
                          +-----------+     +------------+
                          | LLM       |     | TTS        |
                          | (warm)    |---->| (stream)   |
                          +-----------+     +------------+
                                 |
                       first response: 200-500ms target
                       (balance: don't interrupt / don't stall)

Builders

  • Do not assume silence = done. Use semantic VAD (meaning and tone—emphasis, questions, trailing off like “um”) to decide when a turn is truly finished, not a fixed silence timeout (e.g. 400 ms). OpenAI’s real-time API and Kyutai’s “unmute” STT offer semantic VAD; AssemblyAI’s Universal-Streaming provides intelligent endpointing.

    Naive vs better (conceptual config):

    # Naive: breaks on pauses and slow speakers
    end_of_turn: { silence_ms: 400 }
    
    # Better: semantic VAD or intelligent endpointing
    end_of_turn: { use_semantic_vad: true }  # or endpointing: "intelligent"
    
  • Balance delay vs. interruption. Tune latency budgets and turn detection together. Aim for sub-500 ms first response and human-scale (200–250 ms) where possible. Wait long enough to avoid interrupting; respond quickly enough to feel natural.

  • Use streaming to warm up. Process and transcribe while the user is still speaking so the LLM and TTS are ready when the turn ends. Semantic VAD + streaming makes the delay–interruption tradeoff tractable.

  • Test turn-taking as a first-class behavior. Run simulations with interruptions, pauses, accents, and speech speed as primary behaviors; test statistically (e.g. 5–15 runs per scenario), not one-off.

  • Tune by use case. Customer support (short, task-focused turns; holds), language learning (longer utterances, corrections), and coaching/therapy (reflection, silence, emotion) need different turn-taking rules; use per-use-case or adaptive tuning where needed.

  Use case           | Turn-taking needs
  -------------------+---------------------------------------------------------
  Customer support   | Short, task-focused turns; handle holds well
  Language learning  | Longer utterances; corrections; patience for rephrasing
  Coaching / therapy | Reflection, silence, emotion; avoid cutting off
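
One way to encode the table above is a small per-use-case profile that the turn-detection layer reads; the parameter names below are illustrative assumptions, not a specific vendor's API:

  # Illustrative per-use-case turn-taking profiles; parameter names are not a vendor API.
  TURN_TAKING_PROFILES = {
      "customer_support": {
          "use_semantic_vad": True,
          "max_silence_ms": 700,            # short, task-focused turns
          "hold_detection": True,           # don't treat "one moment please" as end of call
      },
      "language_learning": {
          "use_semantic_vad": True,
          "max_silence_ms": 1500,           # leave room for rephrasing and self-correction
          "allow_barge_in": False,          # don't cut learners off mid-attempt
      },
      "coaching_therapy": {
          "use_semantic_vad": True,
          "max_silence_ms": 3000,           # reflection and silence are part of the conversation
          "backchannel_only_on_pause": True,
      },
  }

  def profile_for(use_case: str) -> dict:
      """Fall back to the most patient profile when the use case is unknown."""
      return TURN_TAKING_PROFILES.get(use_case, TURN_TAKING_PROFILES["coaching_therapy"])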

Buyers

  • Require from vendors: Semantic VAD or equivalent end-of-turn detection—not silence-only. Evidence of latency and turn-taking metrics (e.g. first-response time, interruption rate). Evidence of testing across accents, speech speeds, and interruptions.
  • Govern in deployment: Monitor interruption rates and user feedback on “cut off” or “long waits.” Include turn-taking quality in success criteria and dashboards. Treat poor turn-taking as a top-priority fix—it drives the #1 user frustration (having to repeat themselves).

2. Context engineering

Long voice calls break simple prompting: you cannot pack thirty minutes of dialogue into a single instruction block and expect reliable behavior. The fix is a graph of states, per-step context injection (only what the current step needs), known checkpoints, and predictive preloading—retrieving data or preparing tools while the user is still talking so the response feels immediate.

  Conversation graph                    Per-node context
  ==================                    ================

   [Greeting]----->[Understand]----->[Schedule]----->[Confirm]
        |                |                |              |
        v                v                v              v
   rules_greeting    intent_rules     calendar API    confirmation
   (minimal)         + history       + policy        (minimal)

  Background roles (parallel to "respond"):
  +----------------+  +----------------+  +----------------+
  | Respond        |  | Watch          |  | Fetch          |
  | (answer user)  |  | (off-track?    |  | (preload next  |
  |                |  |  drift? loop?) |  |  step data)    |
  +----------------+  +--------+-------+  +----------------+
                              |
                              v
                     recovery / escalation

Figure 9: Feeding the right information at the right time

The diagram shows how long conversations are managed without overloading the AI: the conversation is represented as a graph of steps (e.g. greeting → understanding the problem → scheduling). At each step, only the information relevant to that step is given to the AI, and the system can preload the next likely piece of information while the user is still talking, so the reply feels quick.

Builders

  • Graph of states. Model the conversation as nodes (phases, intents) and edges (valid transitions). Move the agent through the graph as the conversation progresses (see the sketch after this list).

  • Inject only what’s needed. At each node, inject only the context (rules, knowledge, history slice) relevant to that step. No single giant prompt for long calls.

    Example: context injected at one node (conceptual):

    node: schedule
    context:
      - rules: [scheduling_policy, business_hours]
      - knowledge: [calendar_api_docs]
      - history: last_3_turns
    preload: [availability_slots]  # fetch while user still talking
    
  • Known checkpoints. Define clear checkpoints (intent clarified, slot selected, confirmed, etc.) so behavior is traceable and recoverable when the conversation drifts.

  • Background helpers. Use separate processes or logical roles: one responds to the user; one watches for off-track (intent drift, repeated misunderstanding, frustration) and triggers recovery or escalation; one fetches likely next-step information (availability, policy, tools) so data is ready when the user commits.

  • Predictive preloading. Start retrieving data and preparing tools while the user is still talking, based on partial transcript and graph position. Preload likely next-step context at each node. Combined with streaming, this is why graph-based architectures outperform a single giant prompt for scheduling and scripted support flows.
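
A minimal sketch of the graph-of-states idea in plain Python (no framework): each node declares only the context it needs and the data to preload, and the edges constrain where the conversation can go next. Node names mirror the conceptual example above; everything else is an illustrative assumption:

  # Minimal sketch: conversation graph with per-node context and predictive preloading.
  from dataclasses import dataclass, field
  from typing import Dict, List

  @dataclass
  class Node:
      name: str
      rules: List[str]                                     # only the rules this step needs
      knowledge: List[str] = field(default_factory=list)
      history_turns: int = 0                               # how much history to inject here
      preload: List[str] = field(default_factory=list)     # fetch while the user is still talking
      edges: List[str] = field(default_factory=list)       # valid next steps

  GRAPH: Dict[str, Node] = {
      "greeting":   Node("greeting", rules=["rules_greeting"], edges=["understand"]),
      "understand": Node("understand", rules=["intent_rules"], history_turns=3, edges=["schedule"]),
      "schedule":   Node("schedule", rules=["scheduling_policy", "business_hours"],
                         knowledge=["calendar_api_docs"], history_turns=3,
                         preload=["availability_slots"], edges=["confirm"]),
      "confirm":    Node("confirm", rules=["confirmation"]),
  }

  def context_for(node_name: str, transcript: List[str]) -> dict:
      """Build the bounded context injected into the LLM for this step only."""
      node = GRAPH[node_name]
      return {
          "rules": node.rules,
          "knowledge": node.knowledge,
          "history": transcript[-node.history_turns:] if node.history_turns else [],
      }

  def advance(current: str, proposed_next: str) -> str:
      """Allow only transitions the graph declares; anything else is drift to recover from."""
      return proposed_next if proposed_next in GRAPH[current].edges else current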

Buyers

  • Require from vendors: Architecture that supports dynamic context injection (graph or equivalent)—not a single giant prompt for long, multi-step flows. Explanation of how context is injected per step and how off-track detection works. For scheduling or scripted flows, evidence that preloading or equivalent reduces latency.
  • Govern in deployment: Verify that long or complex calls (e.g. scheduling, support scripts) use bounded, step-relevant context rather than unbounded dumps. Include drift and loop detection in monitoring; require that vendors surface when conversations go off track.

3. Evaluation

Production voice agents fail in ways teams cannot easily see or measure because there is no standard monitoring and evaluation stack yet. Failures are often slow and subtle: loops, drift, repeated misunderstanding. Voice evaluation must be conversation-level and probabilistic—how often the agent achieves the goal over many turns, how natural the timing feels, how consistently it follows the right steps.

  Pre-production                          Production
  ===============                         ==========
  Text scenarios --> [TTS/record] --> [Full stack] --> Judge model
  (intents, flows)      speech          STT+LLM+TTS    (goal, steps,
                                                        behavior)
  5-15 runs per scenario (statistical, not one-off)

  +------------------+  +------------------+  +------------------+
  | Regression       |  | Adversarial      |  | Production-      |
  | (core scenarios) |  | (edge cases)    |  | derived (real    |
  +------------------+  +------------------+  |  failures)        |
                                             +------------------+
                        run in CI/CD

  Logs (objective, no LLM summaries):
  [session_id][turn_id] transcript | actions | tool_calls | fallbacks | latency
  Dashboards: 99th %ile latency, loops, drift

Builders

  • Conversation-level, probabilistic evaluation. Do not evaluate like traditional software (binary pass/fail on single turns). Use many runs per scenario (e.g. 5–15), statistical analysis, and metrics: goal-achievement rate, step consistency, timing (latency percentiles, turn-taking quality). Ask: “Did it do the right thing at the right time?”

  • Practical pipeline. Generate tests in text (intents, flows, edge cases); convert to speech (TTS or recorded); run end-to-end through the full stack (STT, reasoning, TTS/S2S). Use a strong reasoning model to judge entire calls (goal, steps, behavior). Three layers: regression (core scenarios), adversarial (edge cases), production-derived (scenarios from real failures). Run simulations in CI/CD.

  • Log in production. Objective, per-turn and per-call records: transcript, actions, tool calls, fallbacks, timestamps, latency per component. Session and turn identifiers so you can trace a full call. Tail metrics (99th-percentile latency and errors). Do not rely on LLM-generated summaries for incident analysis. Budget for post-deployment investigation (calls, failures, dashboards)—often costs more than the rollout.

    Example log shape (conceptual): session_id, turn_id, timestamp_utc, transcript, actions, tool_calls, fallbacks_used, latency_ms (stt, llm, tts). One JSON object per turn (or per call) so incidents are explainable from logs.

  • Harness to adopt. Pre-production: scenario simulations, 5–15 runs, judge model. Production: objective logs, session/turn ids, 99th-percentile dashboards that surface loops and drift. Post-deployment: continuous monitoring; regression and adversarial tests fed by production-derived scenarios.
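
A minimal sketch of the statistical evaluation loop: run each scenario several times through the full stack and report goal-achievement rate and tail latency rather than a single pass/fail. Here run_scenario is a stand-in for your own end-to-end harness and judge model:

  # Minimal sketch: conversation-level, probabilistic evaluation (many runs per scenario).
  # run_scenario() is a stand-in for a real harness (synthesize speech, run the stack, judge the call).
  import random
  from typing import Callable, Dict, List

  def run_scenario(name: str) -> Dict:
      """Stand-in: drive one simulated call end-to-end and have a judge model score it."""
      return {"goal_achieved": random.random() > 0.2, "e2e_latency_ms": random.gauss(450, 120)}

  def evaluate(scenarios: List[str], runs_per_scenario: int = 10,
               runner: Callable[[str], Dict] = run_scenario) -> Dict[str, Dict]:
      results = {}
      for scenario in scenarios:
          outcomes = [runner(scenario) for _ in range(runs_per_scenario)]
          latencies = sorted(o["e2e_latency_ms"] for o in outcomes)
          results[scenario] = {
              "goal_rate": sum(o["goal_achieved"] for o in outcomes) / runs_per_scenario,
              "p95_latency_ms": round(latencies[int(0.95 * (len(latencies) - 1))]),
          }
      return results

  if __name__ == "__main__":
      for scenario, stats in evaluate(["cancel_subscription", "reschedule_delivery"]).items():
          print(scenario, stats)    # pass/fail per scenario comes from thresholds on these rates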

Buyers

  • Require from vendors: Conversation-level metrics (goal achievement over many turns, step consistency, timing)—not only single-turn accuracy or latency. Objective logging (transcript, actions, timestamps)—not LLM summaries only for incidents. Visibility into 99th-percentile latency and error rates. Evidence of simulation and production-derived testing (regression, adversarial, real-failure scenarios).
  • Govern in deployment: Review logs and dashboards regularly; require that incidents can be explained from objective logs. Include goal-achievement rate and conversation-level quality in success criteria and SLAs. Treat evaluation and post-deployment investigation as a non-negotiable budget line—teams that invest 20–30% in evaluation reach 90%+ production success in months; those that skip it start at 62% and take 6–9 months to reach 85%.

4. Scaling to production

A voice agent in production is a real-time system of multiple parts (STT, LLM, TTS, turn detection, tools layer) that must work together under tight timing. Users judge the experience by the slowest moments, not the average. Scaling from demo to millions of calls requires strict delay budgets, redundancy and automatic fallbacks when any provider stalls, and orchestration as the glue.

  Budget per component (99th %ile)        Orchestration
  ===============================        ==============

  [User] --> [STT] --> [Turn det] --> [LLM] --> [TTS] --> [User]
              |            |            |         |
              v            v            v         v
         fallback      (semantic    fallback   fallback
         + timeout     VAD)         + timeout  + timeout

  e.g. 2s + 2s + 2s in tail = 6s (unacceptable)
  Target: sub-600ms E2E with streaming + strict budgets

  +----------------------------------------------------------+
  |  Orchestration (Pipecat, LiveKit, Vapi, Vocode, Daily)   |
  |  Coordinates: STT, LLM, TTS, turn detection, tools       |
  |  Handles: failover, state, timeouts, safety (cascaded)   |
  +----------------------------------------------------------+
                              |
  [Tools layer: booking, CRM, APIs]  <- announce in voice, timeout, degrade

Builders

  • Delay budgets. Set latency budgets at the 99th percentile across all components (STT, LLM, TTS, turn detection), not just averages. If each component can stall 2s in the tail, three in sequence can add up to 6s—and 6s is an eternity; 30s is game over. Allocate strict per-component budgets (e.g. sub-600 ms end-to-end for cascaded with streaming). Monitor the tail. Use real-time streaming (STT in chunks, LLM as soon as first words, TTS before LLM finishes) but plan for the worst case.

    Example budget (conceptual):

    latency_budget_ms:
      stt_p99: 200
      llm_p99: 300
      tts_p99: 150
      e2e_target_p99: 600
    
  • Redundancy and fallbacks. Design for “when this provider stalls” as a first-class scenario. Route to fallbacks when primaries fail; keep conversation state when parts break. Timeouts, health checks, automatic failover for STT, LLM, TTS, and upstream APIs so a single blip does not define the call (see the sketch after this list).

  • Orchestration. Use an orchestration layer (e.g. Pipecat, LiveKit Agents, Vapi, Vocode, or Daily for realtime voice/video and telephony) that coordinates STT, LLM, TTS, turn detection, and the tools layer; manages real-time flow, turn-taking, state, and APIs; applies safety checks (e.g. cascaded: inspect before customers hear); and triggers failover when components stall. Without orchestration and observability, you cannot debug why real calls go wrong.

  • Tools layer. Beyond STT/LLM/TTS: booking, refunds, lookups, CRM, APIs. Announce tool use in voice (“I’m looking that up for you”); avoid long-running ops during real-time turns; handle timeouts so a stuck tool does not stall the call. Include tools in delay budget and failover design.

  • Budget for observability and investigation. There is still no “Datadog for voice agents.” Design your own logging and dashboards. Budget for post-deployment investigation—often more than the rollout.
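
A minimal sketch of the timeout-plus-fallback pattern for any single component (STT, LLM, TTS, or an upstream API); the provider functions are stand-ins for real clients:

  # Minimal sketch: per-component timeout with automatic failover to a backup provider.
  import asyncio

  async def with_fallback(primary, backup, request, timeout_s: float = 0.3):
      """Try the primary within its latency budget; fail over on timeout or provider error."""
      try:
          return await asyncio.wait_for(primary(request), timeout=timeout_s)
      except Exception:                # a single blip must not define the call
          return await asyncio.wait_for(backup(request), timeout=timeout_s)

  # Stand-in providers (replace with real STT/LLM/TTS/API clients).
  async def primary_tts(text: str) -> bytes:
      await asyncio.sleep(0.5)         # simulate a provider stalling in the tail
      return b"audio-from-primary"

  async def backup_tts(text: str) -> bytes:
      await asyncio.sleep(0.05)
      return b"audio-from-backup"

  if __name__ == "__main__":
      audio = asyncio.run(with_fallback(primary_tts, backup_tts, "Your order is confirmed."))
      print(audio)                     # backup wins because the primary blew its 300 ms budget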

Buyers

  • Require from vendors: Redundancy and automatic fallbacks (STT, LLM, TTS, APIs); orchestration that handles streaming and failover; latency budgets and tail (99th-percentile) monitoring; tools layer that announces, times out, and degrades gracefully. Evidence of production readiness (e.g. failover tested, tail latency measured).
  • Govern in deployment: SLAs or targets for latency (e.g. 99th percentile) and availability. Incident response and root-cause analysis using objective logs, not summaries. Treat orchestration and observability as non-negotiable—demos that skip them fail in production (Coval: 95% → 62% in week one is common).

5. Implementing in legacy industries

Voice agents can automate work even when the other side never installs software or integrates an API. Because everyone can talk on a phone, voice becomes a universal interface: a company can automate its side while counterparties remain “legacy.” That is why customer support dominates early adoption and why healthcare administration, trucking, and field services are next. Economics favor high-volume routine calls where the counterparty will not adopt a new app; design choices (e.g. clearly artificial vs. human-like voice) and compliance determine trust and ROI.

  Universal interface (no app/API required)     Verticals
  ======================================       =========

  [Company]  ====== voice ======  [Counterparty]
     |         (phone only)            |
     v                                 v
  Voice agent                    Human (or legacy
  (automate)                     process); no install

  +----------------+  +----------------+  +----------------+
  | Design for     |  | Compliance     |  | Hybrid         |
  | phone-only     |  | from day one   |  | handoff        |
  | transparency   |  | cascaded,     |  | context        |
  | handoff        |  | on-prem,      |  | transfer       |
  | turn-taking    |  | data residency |  |                |
  +----------------+  +----------------+  +----------------+

  Prioritize: support | healthcare admin | trucking | field services
  (phone primary, high-volume routine, counterparty won't adopt app)

Builders

  • Design for phone-only users. No assumption that users have an app, account, or API. Design for transparency (what the agent can and cannot do), human handoff with context transfer, and natural turn-taking so the experience feels smooth. Match voice and disclosure to the vertical—high-stakes and regulated (healthcare, finance, legal) favor clarity and control over maximum naturalness.
  • Compliance and logging from day one. Use architectures that allow inspection and control (e.g. cascaded with text intermediaries). Implement objective logging (transcript, actions, timestamps)—not LLM summaries for incidents. Design for on-prem or private deployment and data residency where the vertical requires it. Plan for hybrid human–AI as the end state and for evaluation and learning loops so the system stays within guardrails.
  • Prioritize verticals where voice is the natural channel. Phone already primary, high-volume routine calls, counterparties unlikely to adopt apps—e.g. customer support, healthcare admin, trucking, field services. Compliance and regional requirements will shape architecture and disclosure.

Buyers

  • Verticals to prioritize. Industries where (1) the phone is already the primary or necessary channel, (2) routine calls are high volume and repetitive, and (3) counterparties are unlikely to adopt new apps or APIs—customer support, healthcare administration, trucking, field services. Compliance and regional requirements (e.g. healthcare, finance) will shape what you require (cascaded, on-prem, data residency, disclosure).
  • Economics and expectations. Replacing routine calls reduces cost per contact and frees humans for complex work—but design for handoff and context transfer or savings erode when users repeat themselves or escalate. Require transparency and human handoff; require objective logging and compliance-ready architecture for regulated verticals.
  • Trust and design choices. Require that vendors support design choices that determine trust: clearly artificial (disclosure built in) vs. fully human-like (natural voice). In high-stakes or regulated environments, require cascaded architectures and clear disclosure. Treat trust as an ROI lever—lost trust costs churn, escalations, reputation.
  • Govern for safe deployment. Require and govern for hybrid human–AI, compliance (control, auditability, data residency), and learning loops (deploy–measure–learn–improve). Use the main guide sections on hybrid, compliance, trust, and voice agents as learning systems as the checklist for vendor evaluation and deployment governance.

Glossary

Terms used in this guide:

  • STT (speech-to-text): Transcribes user speech into text for the LLM.
  • TTS (text-to-speech): Converts the LLM’s text response into spoken audio.
  • LLM (large language model): The model that generates responses from text (and sometimes audio).
  • Cascaded: Architecture where audio goes STT→text→LLM→text→TTS. You get text at each step (control, compliance, debuggability).
  • S2S (speech-to-speech): Architecture where audio goes directly in and out; no explicit text in the middle. Lower latency, less control.
  • VAD (voice activity detection): Detects when someone is speaking vs. silent.
  • Semantic VAD: VAD that uses meaning (e.g. end of sentence, question) to guess when the user is done speaking, not just silence.
  • Turn detection / end-of-turn: Deciding when the user has finished speaking so the agent can respond without interrupting.
  • Dialogue management: The layer that decides when to listen, when to speak, how to handle interruptions, and how to recover—orchestrating STT, LLM, and TTS into a coherent conversation.
  • Stack: The full set of architecture, components (STT, LLM, TTS, dialogue management, etc.), and practices you use to build and run voice agents.

Sources & evidence

Evidence in this guide is drawn from distilled analyses in resources/ (reports, transcripts, advice). Below: fact-check (web), reports and surveys (name, company, year, used for), then sources by section.

Fact-check (web-verified as of 2026)

Key claims were checked against public sources.

Verified:

  • Deepgram State of Voice AI 2025 (deepgram.com/2025-state-of-voice-ai-report): 80% use voice agent systems, 21% very satisfied.
  • Twilio Inside the Conversational AI Revolution (Nov 2025): 90% orgs vs 59% consumers (satisfaction), 19%/81% single vs multi-model, 99% plan to evolve strategy, 78% want human handoff, 15% experience seamless handoff.
  • AssemblyAI 2026 Voice Agent Insights Report: 55% cite “having to repeat themselves” as top frustration.
  • Cartesia: Sonic streaming TTS, ultra-low latency for live agents (cartesia.ai, docs).
  • Hume AI: EVI (Empathic Voice Interface) S2S, Octave TTS with emotional intelligence (hume.ai, dev.hume.ai).
  • Vocode: open-source streaming voice agents, STT/LLM/TTS, telephony (docs.vocode.dev).
  • Daily: realtime voice/video, WebRTC, telephony for agents (daily.co).
  • Voiceflow: design, test, deploy voice agents with test platform and observability (docs.voiceflow.com).

Not found in public search: Coval’s 95%→62% demo-to-production stat is from Coval’s research/transcript in resources/, not from a publicly cited report page.

Internal fact-check (vs. resources/): All stats, page refs (Twilio p.64, p.75, p.294, p.298; Coval report p.9, p.25; etc.), and speaker attributions were verified against the transcripts and reports in resources/. One correction made: the quote “You can wire up a voice agent in a day, but without orchestration and observability you have no idea why real calls go wrong” is paraphrased from Deepgram (Anoop D)—“build it really quickly… very little orchestration or observability… I don’t know what happened” (transcript-deepgram)—not from Modulate (Carter). Carter’s transcript covers trust, transparency, and investigation costs; Deepgram’s covers prototyping vs. production and observability gaps.

Reports and surveys

  Report / source                                | Company                     | Year | Used for
  -----------------------------------------------+-----------------------------+------+----------------------------------------------------------
  State of Voice AI 2025                         | Deepgram                    | 2025 | 80%/21% adoption vs satisfaction, latency importance, fine-tuning, three-dimensional expertise
  Conversational AI Report (State of CAI)        | Twilio                      | 2025 | 90%/59% org vs consumer satisfaction, 19%/81%/99% multi-model, 15% seamless handoff, 78% want human handoff
  The Voice AI Stack for Building Agents in 2026 | AssemblyAI                  | 2026 | Streaming stack, STT/TTS/LLM/orchestration, 55%/45% frustrations, 44% hybrid build, evaluation ROI
  Coval report & main transcript                 | Coval                       | 2025 | 95%→62% demo→production, hybrid routing, multi-model orchestration, Pipecat/LiveKit, redundancy
  Coval Brooke snippet                           | Brooke Hopkins, Coval       | 2025 | Cascaded vs S2S, enterprise control, hybrid routing
  Calabrio contact center / AI research          | Calabrio                    | 2025 | 98% use AI, agent experience gap, 64%/59% empathy/coaching
  Gartner (conversational GenAI)                 | Gartner                     | 2025 | Adoption trends, knowledge/learning (64% learning tech)
  Sierra transcripts & simulations               | Sierra (Zack Reneau-Wedeen) | 2025 | S2S hallucination in production, outcome-based pricing, voice simulations, learning loops, launch-to-learn
  Voices report                                  | Voices                      | 2025 | Human preference, ethics, authenticity

Named products and platforms (representative)

Frameworks and infra: Pipecat, LiveKit, Vapi, Vocode (open-source streaming voice agents, phone and meetings), Daily (realtime voice/video, telephony/WebRTC). TTS: Cartesia (ultra-low-latency streaming TTS for live agents), AssemblyAI Universal-Streaming. S2S / emotion-aware: Hume AI (EVI, Octave—expressive, emotion-aware S2S). Design and observability: Voiceflow (design, test, deploy voice agents with collaboration and observability), Coval (simulation and monitoring). See projects.yaml in this repo for a fuller landscape.

Transcripts and advice (speaker / company)

  • Revolut (Anna Baidina): voice ≠ chat, dialogue management as “secret ingredient,” turn-taking.
  • OpenAI (Peter Bakum): real-time API, semantic VAD, turn detection, S2S vs cascaded use cases, Perplexity in production.
  • LiveKit (Russ Dsa): voice bandwidth, “Datadog for voice AI” gap, observability, handoff feedback.
  • Modulate AI (Carter Huffman): prototyping vs observability, trust/transparency, objective logging.
  • Resemble AI (Zohaib): on-prem/private deployment, biometrics, data residency red flags.
  • Slang Labs (Luke Miller): Australia/regional compliance, prompt/behavior confusion, contact-center-scale services.
  • Process / transcript-voice: voice ≠ text, turn-taking, production failure modes.
  • r/LocalLLaMA (advice-1): self-hosted sub-600ms stack, orchestration, Kyutai unmute, Parakeet, Chatterbox.
  • Softcery (advice-2): STT/TTS selection, S2S maturity.

Sources by section

Voice and text are fundamentally different — Revolut, transcript-voice (Process), Deepgram transcript, LiveKit, Coval Brooke snippet, OpenAI transcript, AssemblyAI advice, SLNG.

Latency is a first-order requirement for voice — Deepgram report & transcript, Revolut, Coval Brooke snippet, transcript-voice, Twilio, Coval main transcript, AssemblyAI advice, r/LocalLLaMA.

Turn detection / end-of-turn / semantic VAD is critical and hard — Deepgram transcript, OpenAI (semantic VAD), Revolut, transcript-voice, Coval Brooke snippet, Sierra (voice simulations), AssemblyAI advice, Kyutai unmute.

Cascaded vs. S2S: control vs. naturalism; hybrid is the future — Coval report, Coval Brooke snippet, OpenAI, Sierra, Deepgram report (with different emphasis), AssemblyAI advice, Softcery advice.

Multi-model / modular architecture is the winning pattern — Coval report, Twilio, Deepgram report, transcript-voice (two approaches), Coval main transcript, AssemblyAI advice, SLNG, Softcery advice.

Dialogue management / conversation control is the “secret ingredient” — Revolut, transcript-voice (turn-taking, flow redesign), Coval (Pipecat, LiveKit), Coval main transcript, AssemblyAI advice.

Prototyping is easy; production is hard — Coval report (95%→62%), Deepgram transcript, Sierra, LiveKit, Modulate (Carter), transcript-voice, AssemblyAI advice, Coval main transcript.

Evaluation and testing infrastructure drive production success — Coval report, Sierra, LiveKit, Modulate (Carter), Sierra simulations, Coval main transcript (20–30% evaluation → 90%+ success).

Satisfaction gap: adoption is high, satisfaction is low — Deepgram report (80%/21%), Twilio (90%/59%), Calabrio (contact center / AI), AssemblyAI advice (55%/45% frustrations).

Hybrid human–AI is required, not optional — Twilio (78%, 15% handoff), Calabrio, Gartner, AssemblyAI advice.

Compliance, control, and auditability matter for enterprises — Coval Brooke snippet, Deepgram report, Sierra (deterministic access control), Modulate (Carter), Resemble (Zohaib: on-prem, data residency), Gartner, SLNG, AssemblyAI advice.

Professional services / cross-domain expertise remain necessary — Coval report (Brooke Hopkins: contact-center scale), Calabrio, Deepgram report (82.5%/25% confidence vs skills), SLNG, AssemblyAI advice.

Knowledge base and content readiness block success — Gartner, Sierra (Zack Reneau-Wedeen: launch to learn), Coval main transcript.

Trust and transparency break first in production — Modulate (Carter: 80% trust), Twilio (trust vs. speed), Voices (human preference, ethics), AssemblyAI advice (31% prefer human).

Voice agents as learning systems, not one-off deployments — Coval report, Sierra (outcome-based pricing, knowledge from deployment), Coval main transcript, AssemblyAI advice.
