Real-Time (S2S) vs Cascading (STT/TTS) Voice Agent Architecture

Last updated on April 24, 2026

Three architectural approaches exist for building voice agents:

  • Chained pipelines using separate speech recognition, language processing, and synthesis components
  • Speech-to-Speech (Half-Cascade) that processes native audio input, uses text-based language reasoning, and generates speech output
  • Native Audio models that reason directly in audio space within a single neural network

Each makes different tradeoffs between flexibility, latency, cost, production readiness, and audio quality preservation.


Understanding Speech-to-Speech Voice Agent Architecture

What Is Speech-to-Speech?

Speech-to-speech voice agents process audio as a continuous stream instead of waiting for full utterances. As of April 2026, end-to-end time to first token (TTFT) measured by Artificial Analysis clusters between 0.78 s (xAI Grok Voice Agent) and 2.98 s (Gemini 3.1 Flash Live), with OpenAI gpt-realtime-1.5 at 0.82 s and Amazon Nova 2 Sonic at 1.14 s. Human conversational response averages around 200 ms, so the fastest S2S providers now approach, but do not yet match, human pacing.

Two approaches exist: half-cascade systems use native audio input processing with text-based language model reasoning and speech synthesis output. Native audio models handle everything within a single neural network that reasons directly in audio space.

Both encode incoming sound into vectors capturing linguistic content, tone, and emotion. They begin generating responses while the user is speaking or immediately after. Native audio maintains more audio information throughout processing, while half-cascade systems balance modularity with lower latency than chained architectures.

Chained vs. Speech-to-Speech Voice Agent Architecture

Chained pipelines follow a sequential flow: Voice → STT → LLM → TTS → Voice. Each component waits for the previous one to finish before processing. Speech-to-speech architectures stream input and output concurrently across the stack, reducing perceived delay in scenarios that involve rapid turn-taking or mid-utterance interactions.

| Aspect | Chained Voice Agent | Speech-to-Speech Voice Agent |
| --- | --- | --- |
| STT Processing | Can stream partial transcripts, but waits for end-of-utterance to finalize | Continuously streams partial transcripts as user speaks |
| LLM Behavior | Waits for complete STT output before processing | Begins processing from partial input while user is still speaking |
| TTS Synthesis | Can stream audio chunks, but starts after LLM generates first chunks (TTFT) | Starts speaking immediately as first tokens are generated, fully streaming |
| Latency | Higher due to sequential handoffs between components | Lower – concurrent streaming across all components |
| Flexibility | High – easy to swap out STT, TTS, and LLM independently | Less flexible – components must support tight integration and real-time coordination |
| Risks / Challenges | Requires careful orchestration between components to minimize latency | Cost varies dramatically by provider – OpenAI Realtime runs ~10x a chained pipeline, while Amazon Nova 2 Sonic and Gemini 3.1 Flash Live approach chained pricing; requires stream orchestration to avoid mishearing |
| User Experience | Structured and clear, but less dynamic; noticeable pauses between turns | Agent can begin replying before user finishes speaking; maintains emotional tone through audio processing |
| Best Use Cases | All use cases, especially when cost control and flexibility are priorities | Best when ultra-low latency is critical and budget allows (AI concierges, premium live support) |
| Technical Requirements | Moderate – most providers offer PaaS solutions; focus on linking components and fallback strategy | Moderate – cloud APIs handle infrastructure; high only if self-hosting open-source models |

Core Architectures for Voice AI Agents

Three fundamental architectural approaches exist for building voice AI agents. Each has distinct trade-offs in latency, flexibility, and naturalness:

1. Chained Pipeline (Cascaded STT→LLM→TTS Architecture)

Schema: Voice → STT → LLM → TTS → Voice

How it works: The system converts speech to text, processes it through a language model, and turns it back into audio.

Pros:

  • Easy to build and debug;
  • Works well with existing LLM APIs;
  • Reliable and predictable;
  • High flexibility – easy to swap out STT, TTS, and LLM independently.

Cons:

  • High latency since each component waits for the previous one to complete;
  • Loses tone and emotion when converting to text;
  • Less natural feel, limited interruptibility.

Example implementations: Deepgram STT + GPT-5.4 + Cartesia TTS, Gladia STT + Gemini 3 Flash + ElevenLabs TTS
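The sequential handoff above can be sketched in a few lines. This is a minimal illustration, not a real integration: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for actual provider SDK calls (e.g. Deepgram, GPT-5.4, Cartesia), and the returned "audio" is just encoded text.

```python
import time

# Placeholder stages – swap in real STT/LLM/TTS provider calls.
def transcribe(audio: bytes) -> str:           # STT stage
    return "what are your opening hours"

def generate_reply(transcript: str) -> str:    # LLM stage
    return "We are open nine to five, Monday through Friday."

def synthesize(text: str) -> bytes:            # TTS stage (stand-in audio)
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> tuple[bytes, float]:
    """Chained pipeline: each stage blocks until the previous finishes,
    so total turn latency is the SUM of the three stage latencies."""
    t0 = time.monotonic()
    transcript = transcribe(audio_in)          # stage 1 completes first...
    reply = generate_reply(transcript)         # ...then stage 2 starts...
    audio_out = synthesize(reply)              # ...then stage 3
    return audio_out, time.monotonic() - t0
```

The additive latency in `handle_turn` is exactly what streaming S2S architectures avoid by overlapping the stages.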

2. Speech-to-Speech (Half-Cascade Architecture)

Schema: Voice → Audio Encoder → Text-based LLM → TTS → Voice

How it works: The model processes audio input directly through an encoder, uses a text-based language model to reason and respond, then generates speech via synthesis. This combines native audio input with text-based reasoning and speech output.

OpenAI and xAI use this half-cascade architecture, balancing speed, performance, and reliability; Google's Gemini Live began as a half-cascade before being consolidated into a unified audio-to-audio model. The approach works well for production use and tool integration.

Pros:

  • Lower latency with streaming capability;
  • Retains tone and prosody cues;
  • Natural conversational flow;
  • More interruptible than chained pipeline.

Cons:

  • Still has a separate LLM reasoning layer (text-native);
  • TTS quality is lower than specialized TTS models (e.g., ElevenLabs, Cartesia) – voice sounds less natural and expressive;
  • Less flexible than fully modular approach.

Example systems: Google Gemini 3.1 Flash Live, OpenAI Realtime API (gpt-realtime-1.5), xAI Grok Voice Agent API, Ultravox

3. Native Audio Model (End-to-End Speech-to-Speech AI)

Schema: Voice → Unified Model → Voice

How it works: A single model listens, reasons, and speaks – all within one neural network. It encodes audio into latent vectors that capture meaning, emotion, and acoustic context, then directly generates output audio from those same representations.

Pros:

  • Very low latency (true real-time);
  • Maintains emotional tone and voice consistency;
  • Most natural conversational quality;
  • Supports full-duplex with natural interruptions.

Cons:

  • Hard to train and control;
  • Opaque reasoning (no clear text layer);
  • Needs huge, high-quality audio datasets;
  • Limited flexibility for voice customization.

Example systems: Step-Audio R1.1 (Realtime), Amazon Nova 2 Sonic, Moshi by Kyutai Labs, VITA-Audio, SALMONN-Omni, Kimi-Audio

Step-Audio R1.1 (Realtime), released January 14, 2026 by StepFun under Apache-2.0, currently tops the Big Bench Audio leaderboard at 97%. Its Dual-Brain Architecture splits reasoning and articulation into separate components, enabling complex reasoning while maintaining fluent real-time speech. Artificial Analysis measures TTFT at around 1.51 seconds; hosted inference runs on community providers.

Amazon Nova 2 Sonic (GA December 2, 2025 on AWS Bedrock) offers native audio processing at $3/1M input speech tokens and $12/1M output speech tokens (plus $0.33/$2.75 per 1M text tokens for transcription/tool-calling). Third-party analyses quote roughly $0.02/min for typical turn structure, placing it an order of magnitude cheaper than OpenAI Realtime for a comparable-class model. It supports 7 languages (English, French, Italian, German, Spanish, Portuguese, Hindi), polyglot voices that can switch mid-conversation, and is deployed in US East (N. Virginia), US West (Oregon), and Asia Pacific (Tokyo). It scored ~88% on Big Bench Audio with TTFT around 1.14 seconds on Artificial Analysis.

Google’s previous native-audio model (gemini-2.5-flash-native-audio-preview) has been consolidated into Gemini 3.1 Flash Live (see the half-cascade section above) – a single unified audio-to-audio model that replaced both the half-cascade and native-audio preview paths in March 2026.


Available Speech-to-Speech Models & Platforms

Commercial APIs and open-source projects provide speech-to-speech voice agents in 2026:

Leading Proprietary Platforms

Four production-ready speech-to-speech platforms lead the space in April 2026:

| Feature | OpenAI Realtime API (gpt-realtime-1.5) | Google Gemini 3.1 Flash Live | xAI Grok Voice Agent API | Amazon Nova 2 Sonic |
| --- | --- | --- | --- | --- |
| Architecture Type | Half-Cascade (Speech-to-Speech) | Unified Audio-to-Audio (collapsed half-cascade + native audio) | Half-Cascade (Speech-to-Speech), OpenAI-Realtime-compatible spec | Native Audio (End-to-End) |
| Provider | OpenAI (also via Azure) | Google / DeepMind | xAI | Amazon (AWS Bedrock, Amazon Connect) |
| Released / Status | v1.5 shipped Feb 23, 2026; Realtime API GA (MCP, SIP, image inputs since Aug 2025 GA) | Preview launched Mar 26, 2026; production-ready via Gemini Enterprise for CX | Voice Agent API GA Dec 17, 2025; standalone STT/TTS APIs launched Apr 18, 2026 | GA December 2, 2025 |
| Latency – Time to First Audio | ~0.82 s (Artificial Analysis) | ~2.98 s measured on Artificial Analysis (slower than marketed “real-time”) | ~0.78 s (fastest in class per Artificial Analysis) | ~1.14 s |
| Big Bench Audio | ~81% | ~96% | ~93% | ~88% |
| Audio Input | Streaming audio via WebRTC + WebSocket; SIP dialing supported | Streaming audio via Multimodal Live API | Streaming audio via OpenAI-Realtime-compatible WebSocket API | Streaming audio via Bedrock Runtime API |
| Voices | Cedar and Marin voices (available since Aug 2025 GA) | Polyglot voices via Gemini voice catalog | Multiple voices across 20+ languages with inline emotion tags | Polyglot voices, 7 languages |
| Tool Use / MCP | Tool calls and remote MCP servers (GA) | Tool calls; Google Cloud tool integrations | Tool calls compatible with OpenAI Realtime tool schema | Native tool use on Bedrock |
| Pricing | $32/1M audio-in, $64/1M audio-out (≈$0.30/min baseline) | $3/1M audio-in (≈$0.005/min), $12/1M audio-out (≈$0.018/min) | $3.00/hr input (≈$0.05/min) | $3/1M audio-in, $12/1M audio-out on Bedrock (third-party analyses quote ~$0.02/min for typical turn structure, roughly an order of magnitude cheaper than OpenAI Realtime) |
| Hosting / Access | Cloud only (OpenAI API, Azure OpenAI Service) | Cloud only (Google AI Studio, Vertex AI, Gemini Enterprise) | Cloud only (xAI API) | Cloud only (AWS Bedrock) |
| Context Window | 32k total tokens (~28k effective input, 4,096 max output) | Long (inherits Gemini 3 family) | Inherits Grok 4.1 Fast family | Long (Nova family) |

Open-Source Alternatives

Two open-source projects offer alternatives to proprietary models:

| Feature | Ultravox (by Fixie.ai) | Moshi (by Kyutai Labs) |
| --- | --- | --- |
| Architecture Type | Half-Cascade (Speech-to-Speech) | Native Audio (End-to-End) |
| Model Type | Multimodal LLM (audio + text encoder, outputs text) | Audio-to-audio LLM (integrated STT and TTS – speech in, speech out) |
| Architecture | Voice → LLM → Text (planned speech output in future versions) | Voice → LLM → Voice (fully integrated speech-to-speech pipeline) |
| Streaming Support | Streaming text output with low latency | Full-duplex streaming (supports overlap and interruption) |
| Time to First Token (TTFT) | Model-level latency under 300 ms; end-to-end varies by deployment | ~160 ms model-level |
| Token Generation Speed | Streams at 200+ tokens/sec on typical deployments | Not token-based; generates speech waveform directly |
| Base Models | Built on open LLMs – v0.7 uses GLM-4.6 (355B params, 160 experts/layer), achieving 87.05 on VoiceBench without reasoning / 90.75 with reasoning (#1 among speech models) and 91.80 / 97.00 on Big Bench Audio; LibriSpeech WER 2.28 | Proprietary foundation model trained by Kyutai |
| Audio Processing | Projects audio into same token space as text using custom audio encoder | End-to-end audio encoder and decoder (neural codec pipeline) |
| Output Type | Text tokens paired with downstream TTS (Sonic-3, ElevenLabs, Cartesia); native speech-token output on Ultravox roadmap | Audio (neural codec speech) |
| Hosting / Deployment | Self-hostable on B200/H100-class GPUs for 355B GLM-4.6 backbone; or via Fixie.ai hosted API | Self-hostable (heavy); commercial APIs available via Gradium |
| Open-Source Status | Fully open: model weights, architecture, and code available on GitHub | Fully open: code and demos available; weights provided (early stage) |
| Extensibility | Can plug in any open-weight LLM; attach custom audio projector | Closed model structure for now; focused on turnkey audio-agent use |
| Use Case Fit | Voice-enabled bots with real-time understanding, using custom TTS for output | Full voice agents with natural interruptions and direct speech response |

Integration Frameworks and Tools

Integration frameworks include Pipecat (vendor-agnostic voice agent framework maintained by Daily.co as 100% open source, reached v1.0.0 on April 14, 2026), LiveKit Agents (v1.5.6 as of April 22, 2026, with adaptive interruption handling at 86% precision / 100% recall, dynamic endpointing, and preemptive generation enabled by default), and FastRTC (Python streaming audio). For comprehensive platform comparisons including deployment options and integration approaches, see the voice agent platform guide. Teams can also assemble open-source speech recognition (Vosk, NeMo, Kimi-Audio) and TTS (VITS, FastSpeech, Cartesia Sonic-3) components into speech-to-speech agents without using end-to-end models.

Performance Metrics That Matter

Three metrics determine voice agent performance: speed (time to first token), accuracy (word error rate), and processing efficiency (real-time factor).

Time to First Token (TTFT)

Time to First Token (TTFT) measures latency from end-of-user-speech to start-of-agent-speech. Production speech-to-speech models in April 2026 cluster in a wide range on Artificial Analysis measurements: xAI Grok Voice Agent leads at ~0.78 s, OpenAI gpt-realtime-1.5 at ~0.82 s, Amazon Nova 2 Sonic at ~1.14 s, Step-Audio R1.1 at ~1.51 s, and Gemini 3.1 Flash Live at ~2.98 s (notably slower than Google’s “real-time” marketing). Human response latencies in conversation average around 200 ms.

Network latency affects cloud API measurements, so real-world TTFT runs higher than lab values. Published TTFT may be measured in controlled settings or end-to-end.

Lower TTFT is better, though extremely low values may indicate the model responds before fully processing user intent.
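The metric itself is a simple interval, but it is worth being explicit about the two timestamps it spans. A minimal sketch, using the Artificial Analysis figures quoted above (the dictionary keys are informal labels, not official model IDs):

```python
def ttft_seconds(user_speech_end_ts: float, first_agent_audio_ts: float) -> float:
    """TTFT = end-of-user-speech timestamp to start-of-agent-audio timestamp."""
    return first_agent_audio_ts - user_speech_end_ts

# April 2026 Artificial Analysis measurements, in seconds:
TTFT = {"grok-voice-agent": 0.78, "gpt-realtime-1.5": 0.82,
        "nova-2-sonic": 1.14, "step-audio-r1.1": 1.51,
        "gemini-3.1-flash-live": 2.98}

HUMAN_BASELINE = 0.2                 # ~200 ms conversational response
fastest = min(TTFT, key=TTFT.get)    # fastest provider in this set
gap = TTFT[fastest] - HUMAN_BASELINE # remaining gap to human pacing
```

Measured this way, even the fastest provider still sits several hundred milliseconds behind the human baseline.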

Word Error Rate (WER)

Word Error Rate (WER) measures the percentage of words incorrectly recognized in the transcript. Lower WER means more accurate transcription. Moonshot’s Kimi-Audio set a new SOTA at 1.28% WER on LibriSpeech test-clean in early 2026, with NVIDIA Canary Qwen 2.5B leading the Hugging Face Open ASR Leaderboard at 1.6% test-clean / 3.1% test-other (5.63% averaged across the suite). xAI’s Grok STT also reports 5.0% WER in phone-audio conditions, beating ElevenLabs (12%), Deepgram (13.5%), and AssemblyAI (21.3%) on the same benchmark.

Recognition errors can lead the LLM astray. Cloud providers publish WER on curated benchmarks, but real-world WER over telephony or noisy audio runs higher. Real-time agents may correct some ASR errors via context, though lower baseline WER remains preferable.

Domain adaptation through custom vocabulary or fine-tuning helps with specialized terminology.
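WER is word-level edit distance: the minimum number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A straightforward dynamic-programming sketch (real evaluations also normalize case and punctuation first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

One substitution in a six-word reference yields a WER of 1/6 ≈ 16.7%, which is why single recognition errors matter so much on short voice-agent utterances.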

Real-Time Factor (RTF)

Real-Time Factor (RTF) measures processing speed relative to input duration. RTF < 1.0 means the system processes faster than real time. Each component has its own RTF: STT engines typically process at 0.2× real time, voice-suitable LLMs generate at 100–300+ tokens/sec (Grok 4.1 Fast 135 TPS, Gemini 3.1 Flash-Lite 314 TPS per Artificial Analysis), modern TTS synthesizes at RTF 0.1 or better (10 seconds of speech generated in 1 second).

Systems must maintain RTF < 1 under load to prevent latency accumulation. Smaller models often achieve better RTF at the cost of language quality, making token generation speed a determining factor for ultra-low latency requirements.
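The load rule above can be made concrete. For stages that run one after another over the same audio, per-stage processing times add, so the sum of stage RTFs must stay below 1.0; the stage values below are illustrative, not measurements:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; < 1.0 keeps up with real time."""
    return processing_seconds / audio_seconds

def sequential_pipeline_keeps_up(stage_rtfs: list[float]) -> bool:
    """Sequential stages accumulate processing time, so their RTFs sum.
    (Fully concurrent streaming stages are instead bounded by the slowest one.)"""
    return sum(stage_rtfs) < 1.0

# Illustrative: STT at 0.2x, LLM generation worth ~0.2x, TTS at 0.1x
healthy = sequential_pipeline_keeps_up([0.2, 0.2, 0.1])
```

This is why a chained pipeline can fall behind under load even when each component individually runs faster than real time.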


Cost Analysis and Scalability for Speech-to-Speech Voice Agents

Speech-to-speech voice agent costs break down into five categories: cloud API usage, self-hosting compute, scalability limits, bandwidth, and enterprise overhead.

| Cost Category | Description | Examples / Benchmarks | Key Considerations |
| --- | --- | --- | --- |
| Usage-Based Pricing (Cloud APIs) | Pay-per-token/minute for any architecture (STT, LLM, TTS, or integrated multimodal) | OpenAI Realtime (gpt-realtime-1.5): ~$0.30/min baseline, scales to $1.50+/min with long context; Gemini 3.1 Flash Live: ~$0.023/min baseline ($0.005/min audio-in + $0.018/min audio-out), text reasoning tokens added on top; Amazon Nova 2 Sonic: $0.017/min (≈10× cheaper than OpenAI Realtime for comparable quality); xAI Grok Voice Agent: $3.00/hr input (≈$0.05/min); Chained pipeline: ~$0.15/min (no context accumulation) | Speech-to-speech models with text-token reasoning accumulate context cost across turns; native-audio models like Nova 2 Sonic and Step-Audio price per audio second with no text accumulation; chained pipelines maintain consistent per-minute pricing |
| Compute Costs (Self-Hosting) | Run open-source models like Ultravox/Moshi/Step-Audio on your own infra | Ultravox v0.7 (GLM-4.6 355B backbone) needs B200-class or multiple H100s per concurrent session; April 2026 GPU rentals: H100 $1.49–$2.99/hr specialist, B200 $2.65–$3.79/hr on Lambda/RunPod | Lower marginal cost at scale; requires infra & DevOps team; harder to spin up instantly |
| Scalability / Rate Limits | Limits on concurrent sessions, tokens per minute, request rate | OpenAI gpt-realtime-1.5: ~800K audio tokens/min, ~1K req/min default; enterprise tiers negotiate higher quotas | Watch for WebSocket caps or long-lived session constraints; request enterprise quotas if needed |
| Bandwidth Overhead | Cost of streaming audio data over network | ~8–64 kbps per stream; telephony codecs (e.g. G.711 vs G.729) can affect costs | Minor cost per stream, but adds up at scale; ensure egress limits aren’t exceeded in cloud setups |
| Enterprise Overhead | SLAs, premium support, custom deployments, fallback systems | Regional/on-prem hosting; redundancy systems (e.g. backup STT or fallback bots) | Adds reliability and control; contractual/licensing complexity increases total cost of ownership |

Understanding Speech-to-Speech Pricing

Speech-to-speech models like OpenAI gpt-realtime-1.5, Gemini 3.1 Flash Live, and Amazon Nova 2 Sonic have different cost structures than chained STT/TTS pipelines. For detailed provider comparisons with latency benchmarks and accuracy metrics, see the complete STT and TTS selection guide. Three factors drive higher costs in OpenAI’s lineup specifically:

  1. Proprietary multimodal infrastructure – These models require specialized neural architectures that process audio natively, maintaining acoustic features throughout the pipeline rather than collapsing to text
  2. Cloud-only deployment – No self-hosting option means paying for enterprise-grade streaming infrastructure, low-latency global endpoints, and WebRTC/gRPC orchestration
  3. Advanced real-time capabilities – Support for interruptions, emotional tone preservation, and sub-second latency requires substantial compute resources per session

Real-world cost reports from OpenAI’s developer community:

  • $3 spent on “a few short test conversations” in the playground (simple questions like bedtime stories)
  • $10 consumed during weekend integration testing, leading developers to call the API “unusable at the moment” due to cost
  • Costs increase per minute as conversations get longer – in a 15-minute session, one developer reported $5.28 for audio input vs $0.65 for output. This happens because tokens accumulate in the context window, and the model re-charges for all previous tokens on each turn, making longer conversations disproportionately more expensive

User-reported costs differ from official per-minute estimates because actual costs depend on conversation length (context accumulation), system prompt size (larger prompts = more tokens per turn), and conversation complexity (more back-and-forth = more context to maintain). A 5-minute conversation might cost $0.30/min, while a 30-minute conversation could cost $1.50/min or more due to accumulated context. Two newer pricing patterns reduce this risk: Gemini 3.1 Flash Live charges per audio second with text-token cost layered on top, and Amazon Nova 2 Sonic prices a flat ~$0.017/min – both far below OpenAI’s per-turn accumulation curve.
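The accumulation effect is easy to model. The sketch below assumes an illustrative 800 audio tokens per minute and bills the full accumulated context on every one-minute turn at $32/1M tokens (the audio-in rate quoted above); actual token rates and turn structure vary by deployment:

```python
def accumulating_session_cost(minutes: int,
                              tokens_per_min: int = 800,     # assumed rate
                              usd_per_1m_tokens: float = 32.0) -> float:
    """Rough model of per-turn context re-billing: each turn charges for
    ALL audio tokens accumulated so far, not just the new ones."""
    total_tokens_billed = 0
    context = 0
    for _ in range(minutes):
        context += tokens_per_min        # this turn's new audio
        total_tokens_billed += context   # re-billed full context
    return total_tokens_billed * usd_per_1m_tokens / 1_000_000
```

Because billed tokens grow with the square of conversation length, a 30-minute call under this model costs far more than six times a 5-minute call, matching the developer reports above.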

Native Audio Models matured significantly in 2025–2026. Production-ready or near-production options now include:

  • Step-Audio R1.1 (Realtime) – open-source under Apache-2.0, 97% on Big Bench Audio, TTFT ~1.51 s on Artificial Analysis
  • Amazon Nova 2 Sonic – GA on AWS Bedrock (Dec 2, 2025), ~$0.017/min, 7 languages
  • Moshi via Gradium – Kyutai’s commercial spinoff (December 2025, $70M seed) now ships production STT and TTS APIs in English, French, Spanish, Portuguese, and German. Live deployments include gaming studios (immersive NPCs), language-learning platforms, and healthcare assistants. Core Moshi model: 7B parameters, 2.1T tokens, full-duplex streaming, ~160 ms latency
  • Kimi-Audio – 1.28% WER on LibriSpeech test-clean (current SOTA)

Self-hosting open-source native-audio models still requires significant GPU resources (A100/H100 class) and engineering effort. Commercial APIs from Gradium, AWS, and StepFun reduce that burden but limit voice customization and control compared to modular pipelines.

Prompt Caching Changes the Math for Chained Pipelines

Voice agents replay the same system prompt every turn, and every major chained-pipeline LLM provider now offers prompt caching that cuts that repeated cost to roughly 10% of base price. Anthropic bills cached reads at 0.1× base (5-minute default TTL, 1-hour TTL available at 2× write premium). Gemini applies implicit caching automatically at zero cost and explicit caching at a 75% discount on 2.5+ models. OpenAI discounts cached input to 10% of standard. DeepSeek offers roughly 90% off on cache hits. For a voice agent with a 5K-token system prompt and 30-second turn rate, caching reduces system-prompt cost by roughly 90% across a 10-minute call. Context-accumulating S2S models like OpenAI Realtime do not benefit from classical prompt caching the same way – their cost structure re-bills the full context on each turn. See the LLM selection guide for detailed caching math by provider.
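A rough sketch of that caching math, assuming a $3/1M input-token rate (illustrative, not any specific provider's price) and the 0.1× cached-read multiplier described above; real bills add cache-write premiums and depend on TTL rules:

```python
def system_prompt_cost(turns: int, prompt_tokens: int,
                       usd_per_1m_in: float,
                       cached_read_multiplier: float) -> float:
    """Cost of replaying the same system prompt every turn: the first turn
    pays full price (cache write), later turns pay the cached-read rate."""
    full = prompt_tokens * usd_per_1m_in / 1_000_000
    return full + (turns - 1) * full * cached_read_multiplier

# 10-minute call, one turn every 30 s = 20 turns, 5K-token system prompt:
uncached = system_prompt_cost(20, 5_000, 3.0, 1.0)  # no caching
cached = system_prompt_cost(20, 5_000, 3.0, 0.1)    # 0.1x cached reads
```

Under these assumptions caching cuts system-prompt spend by roughly 85–90% over the call, consistent with the per-provider discounts listed above.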

Tool Use and MCP Now Standard in S2S APIs

All four leading S2S providers ship native tool-calling for voice agents: OpenAI gpt-realtime-1.5 supports remote MCP servers and SIP dialing, Gemini 3.1 Flash Live integrates Google Cloud tools, xAI Grok Voice Agent uses an OpenAI-Realtime-compatible tool schema, and Amazon Nova 2 Sonic supports async tool use and cross-modal switching on Bedrock. Chained pipelines pass tool use to the underlying text LLM (GPT-5.4, Claude Sonnet 4.6, Gemini 3 Flash, Grok 4.1 Fast) and inherit the full function-calling capability set. For voice agents that book appointments, query CRMs, or trigger workflows, tool-use reliability matters at least as much as raw LLM accuracy – benchmarks like BFCL v3, τ-bench / τ²-bench / τ³-bench, and VoiceAgentBench are the current references.

Match Cost Strategy to Deployment Scale

Early-stage projects with low volume benefit from cloud APIs: fast setup, predictable pricing, pay-per-use. As usage grows, self-hosting economics may improve, particularly when requiring tight control, data locality, or custom model tuning.

Enterprise scale depends on reliability, rate limits, support agreements, and long-term flexibility – not just price per minute. Total cost of ownership (TCO) includes processing minutes, bandwidth, DevOps effort, redundancy, and support.

Cost calculation for specific scenarios: average conversation length × conversations per day × per-minute pricing = monthly cost. Compare against self-hosting infrastructure investment. Monitor usage limits and enterprise tier requirements.
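The formula above is straightforward to encode. The example compares the ~$0.017/min and ~$0.30/min figures quoted earlier in this article, with an assumed call volume:

```python
def monthly_cost_usd(avg_minutes_per_call: float,
                     calls_per_day: float,
                     usd_per_minute: float,
                     days_per_month: int = 30) -> float:
    """average conversation length x conversations/day x per-minute price."""
    return avg_minutes_per_call * calls_per_day * usd_per_minute * days_per_month

# Assumed volume: 4-minute calls, 500 calls/day.
nova = monthly_cost_usd(4, 500, 0.017)   # Nova 2 Sonic-class pricing
openai = monthly_cost_usd(4, 500, 0.30)  # OpenAI Realtime baseline pricing
```

At this volume the spread between providers is roughly $1,000 vs $18,000 per month, which is the point where self-hosting economics start to deserve a serious look.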


Technical Implementation Challenges in Speech-to-Speech Voice Agent Deployment

Deploying to production requires integrating streaming, connecting to telephony, handling noise, and orchestrating streams.

Streaming Integration (WebRTC, WebSockets, etc.)

Low latency requires appropriate streaming mechanisms. Three options: WebRTC, WebSockets, and streaming HTTP/gRPC.

WebRTC

Web Real-Time Communication is the standard for low-latency audio/video streaming in browsers and mobile apps. Uses UDP for fast transmission and handles packet loss gracefully. OpenAI, Google, and xAI all expose WebRTC endpoints for client-side audio capture and playback; Amazon Nova 2 Sonic uses Bedrock Runtime API instead.

Browser and mobile app interactions use WebRTC to send microphone audio to the server. Includes Acoustic Echo Cancellation (AEC), noise reduction, and automatic gain control (AGC). Libraries like LiveKit, mediasoup, or Twilio provide WebRTC integration.

WebSockets and gRPC

Server-side connections between application servers and AI services use persistent bidirectional connections. OpenAI’s voice API uses WebSockets – client sends audio chunks and receives tokens continuously. Google’s API uses gRPC streaming over HTTP/2.

Both provide continuous streams rather than discrete HTTP requests. Implementation requires proper binary audio frame handling and maintaining open connections for conversation duration.

Audio Encoding

Audio format choice depends on API requirements. PCM raw audio is simple but bulky. Opus codec (used by WebRTC) provides high quality at low bitrate, though not all APIs accept Opus packets. Some APIs accept WAV or FLAC frames.

Compressed codecs save bandwidth for mobile users. Phone calls use G.711 µ-law 8kHz, requiring transcoding to 16kHz linear PCM for most ASR systems (Whisper, DeepSpeech).
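The G.711 µ-law decode step is a fixed-point algorithm small enough to show inline. This is a sketch of the standard companding expansion plus a deliberately naive 8 kHz→16 kHz linear-interpolation upsampler; production pipelines should use a vetted DSP library with a proper polyphase resampler:

```python
def ulaw_to_pcm16(ulaw_bytes: bytes) -> list[int]:
    """Decode G.711 mu-law samples to 16-bit linear PCM (standard expansion)."""
    out = []
    for b in ulaw_bytes:
        b = ~b & 0xFF                      # mu-law bytes are stored inverted
        sign = b & 0x80
        exponent = (b >> 4) & 0x07
        mantissa = b & 0x0F
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out.append(-sample if sign else sample)
    return out

def upsample_2x(samples: list[int]) -> list[int]:
    """Naive 8 kHz -> 16 kHz via linear interpolation between neighbors."""
    out = []
    for a, b in zip(samples, samples[1:]):
        out += [a, (a + b) // 2]
    if samples:
        out.append(samples[-1])
    return out
```

Feeding 8 kHz telephony audio through a decode-and-resample step like this is the minimum preprocessing before handing it to a 16 kHz ASR model.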

Latency Tuning

Streaming systems use buffers to smooth network variation. WebRTC jitter buffers trade smooth audio for added delay. Default WebRTC parameters suffice for most deployments.

WebSocket implementations send data immediately (20ms audio frame every 20ms) without batching. Most WebSocket libraries disable Nagle’s algorithm by default to avoid delaying small packets.
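The frame sizing behind "a 20ms audio frame every 20ms" is simple arithmetic worth pinning down, since sending mis-sized frames is a common integration bug:

```python
def frame_bytes(frame_ms: int = 20, sample_rate_hz: int = 16_000,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size of one audio frame sent per WebSocket message.
    20 ms of 16 kHz mono 16-bit PCM = 320 samples = 640 bytes."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample * channels
```

The same 20 ms frame shrinks to 320 bytes at telephony's 8 kHz, which is why bandwidth per stream stays in the tens of kbps.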

Handling Network Issues

WebRTC handles packet loss through loss concealment, filling missing audio chunks with plausible noise. WebSocket implementations lack this but ASR systems handle minor gaps reasonably well on decent networks.

Output packet loss can cause audio blips. Some systems use redundant packets or forward error correction on unreliable networks.

Many implementations combine approaches: WebRTC from client to relay server, then WebSocket from server to AI API. OpenAI’s example follows this pattern. WebRTC handles unpredictable client networks while WebSocket simplifies AI model interfacing.

Telephony Integration (8 kHz and PSTN)

Call Quality Challenges

Phone deployments reveal quality issues absent in web-based implementations. Standard PSTN uses 8 kHz audio (G.711 codec), which severely degrades both speech recognition accuracy and TTS naturalness compared to 16 kHz+ web audio.

Most high-quality speech recognition and speech-to-speech models (including gpt-realtime-1.5, Gemini 3.1 Flash Live, Whisper, Kimi-Audio) are trained primarily on 16 kHz audio, so 8 kHz telephony input significantly reduces their accuracy.

Provider Support

Twilio’s standard codecs operate at 8 kHz with limited support for higher-quality audio streaming needed for AI models. Telnyx now ships HD Voice on its LiveKit-integrated platform with G.722 and Opus wideband codecs across all four regions – an explicit AI-voice product, not just an underlying capability – though configuration still requires more expertise than Twilio defaults.

Speech-to-speech models (gpt-realtime-1.5, Gemini 3.1 Flash Live, Grok Voice Agent) optimized for high-quality web audio don’t perform as well over standard PSTN. Their latency and integration benefits disappear over phone while premium pricing remains. Chained STT/LLM/TTS pipelines with telephony-optimized components often deliver more reliable and cost-effective phone-based deployments.

SIP and VoIP Integration

Telephony integration uses services like Twilio, Vonage (formerly Nexmo), Telnyx, Plivo, SignalWire, or on-premises SIP systems. These provide audio via WebSocket (Twilio streams 8k PCM in real time) or media servers. Architecture must ingest these streams and connect to the AI pipeline. See the voice agent platform guide for a detailed provider breakdown.

DTMF and Control

Telephony providers detect DTMF tones (touch-tone input) out-of-band to avoid confusing ASR. Twilio sends webhook events for DTMF. Speech-to-speech voice agents minimize DTMF menus but users may still attempt touch-tone input.

Telephony Latency

Phone networks add 100-200ms fixed latency. Processing pipelines should minimize additional overhead. Hosting AI services near telephony ingress points reduces roundtrip latency.

Human Agent Handoffs

Human agent handoffs benefit from passing conversation context. AI conversations that escalate after collecting information should provide transcribed summaries to avoid user repetition.

Handling Background Noise & Voice Variability

Noise Suppression

Noise suppression algorithms applied before ASR improve recognition accuracy. ML models like RNNoise remove background noise (keyboard sounds, fans) in real time. Picovoice’s Koala demonstrates intelligibility improvements.

Tradeoff: slightly distorts voice and consumes extra CPU.

Microphone Differences

Audio quality varies across headsets, speakerphones, and car Bluetooth systems (differing frequency response and echo). Echo cancellation prevents the agent's voice from being picked up by the microphone. WebRTC's AEC handles most cases.

Telephone scenarios rely on network echo cancellers or require adaptive echo cancellers in the pipeline.

VAD and Barge-In

Voice Activity Detection (VAD) distinguishes speech from noise. Noisy conditions cause false positives/negatives. Combining VAD with ASR confidence improves accuracy. Treat silence as end-of-utterance only when ASR confirms finality.

Continue assuming speech while ASR generates transcribed words. End turn after 500ms silence. Barge-in requires monitoring microphone during agent speech to stop TTS when user interrupts.
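The endpointing rule above (treat silence as end-of-turn only after 500 ms of continuous non-speech) can be sketched as a small state machine. The `speech` flag is assumed to come from combined VAD + ASR activity, and timestamps are in milliseconds:

```python
class Endpointer:
    """End-of-turn detection: silence must persist for SILENCE_MS before
    the turn is considered finished; any speech resets the silence timer."""
    SILENCE_MS = 500

    def __init__(self):
        self.silence_started = None
        self.turn_ended = False

    def update(self, now_ms: int, speech: bool) -> bool:
        if speech:
            self.silence_started = None          # user still talking
        elif self.silence_started is None:
            self.silence_started = now_ms        # silence just began
        elif now_ms - self.silence_started >= self.SILENCE_MS:
            self.turn_ended = True               # sustained silence: end turn
        return self.turn_ended
```

Barge-in handling is the mirror image: while the agent speaks, the same speech flag going true should trigger an immediate stop of TTS playback.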

Accents and Languages

Diverse user bases require testing across accents and dialects. Cloud ASRs support accent/locale specifications for improved accuracy. Open models benefit from fine-tuning on accented data.

Bilingual support requires models supporting multiple languages (Google, OpenAI). Multi-language detection works through auto-detection or routing to language-specific models.

Stream Management and Orchestration

Continuous conversation streams require managing concurrent input/output and conversation state.

Half-Duplex vs Full-Duplex

Most systems use half-duplex with barge-in – users can interrupt agents, but agents don’t interrupt users except for short backchannel utterances (“uh-huh”, “I see”). Backchannel implementation requires detecting pauses and generating quick responses without disrupting ASR.

Prompt Management

Persistent conversation state requires maintaining rolling prompts for the LLM. APIs with persistent sessions handle this up to context limits. Manual implementations append each utterance and reply.

Long conversations require summarizing older content to stay within context windows. Important user-provided facts need re-injection into prompts as needed.
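A minimal sketch of the rolling-prompt trimming described above. Token counting here is a crude word-split stand-in (use the provider's tokenizer in practice), and a real agent would summarize the dropped turns rather than discard them:

```python
def trim_history(messages: list[dict], max_tokens: int,
                 count_tokens=lambda m: len(m["content"].split())) -> list[dict]:
    """Keep the system prompt plus the newest turns that fit the budget."""
    system, turns = messages[0], messages[1:]
    budget = max_tokens - count_tokens(system)
    kept = []
    for msg in reversed(turns):          # walk newest-first
        cost = count_tokens(msg)
        if cost > budget:
            break                        # older turns would be summarized here
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```

Important user-provided facts (names, account numbers) should be re-injected into the system prompt before trimming, so they survive even when the turns that contained them are dropped.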

Ensuring Required Steps

Flows requiring specific actions (identity verification, mandatory questions) benefit from checkpoints. Teams can implement checkpoints through LLM prompt instructions or external state machines.

Some systems prevent sending queries to LLM until prerequisite steps complete, or override LLM responses that skip required actions. This combines rule-based flow with AI – trusting AI for understanding and generation while enforcing action sequences.
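An external state machine for such checkpoints can be very small. The step names below are hypothetical examples of a verification flow, not a prescribed schema:

```python
REQUIRED_STEPS = ["verify_identity", "confirm_consent"]  # hypothetical flow

class Checkpoints:
    """Gate LLM access behind required steps: free-form queries pass
    through only once every prerequisite has been completed."""
    def __init__(self, required=REQUIRED_STEPS):
        self.pending = list(required)

    def complete(self, step: str) -> None:
        if step in self.pending:
            self.pending.remove(step)

    def may_answer(self) -> bool:
        return not self.pending          # all prerequisites done

    def next_prompt(self) -> str:
        return f"Please complete: {self.pending[0]}" if self.pending else "ready"
```

The orchestrator checks `may_answer()` before forwarding a user query to the LLM, and otherwise steers the conversation toward `next_prompt()` – rule-based flow control around AI understanding and generation.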

Speech-to-speech agent orchestration requires managing concurrent input/output streams. Best practices and libraries exist for common patterns. Testing should include scenarios like users interrupting agent speech to verify barge-in logic stops TTS promptly.


Conclusion

Speech-to-speech voice agents now cluster in the 0.8–3 second TTFT range across leading providers on Artificial Analysis measurements, with the fastest options approaching human response times. Proprietary platforms (OpenAI gpt-realtime-1.5, Google Gemini 3.1 Flash Live, xAI Grok Voice Agent, Amazon Nova 2 Sonic) and open-source options (Ultravox v0.7, Moshi via Gradium, Step-Audio R1.1, Kimi-Audio) are production-ready or operating in production with paying customers.

Architecture choice depends on deployment environment and constraints:

  • Chained Pipeline – Voice → STT → LLM → TTS → Voice – provides maximum flexibility and reliability but higher latency
  • Speech-to-Speech (Half-Cascade) – Voice → Audio Encoder → Text-based LLM → TTS → Voice – balances performance with production readiness, with cost ranging from premium (OpenAI ~$0.30/min) to budget (Nova 2 Sonic ~$0.017/min)
  • Native Audio – Voice → Unified Model → Voice – offers the lowest latency and most natural interactions; Step-Audio R1.1 currently leads Big Bench Audio at 97.0%, and Gradium ships Moshi-derived APIs in production

Implementation factors:

  • Performance requirements: TTFT, WER, and RTF targets for the use case
  • Cost structure: Cloud APIs vs. self-hosting economics at expected scale
  • Technical complexity: Streaming integration, telephony connectivity, noise handling
  • Deployment environment: Phone systems (8kHz PSTN) vs. web-based (16kHz+ audio)

System design includes audio streaming, orchestration, testing, and optimization for specific constraints. Cloud APIs enable rapid prototyping. Production deployment requires testing with real user patterns and audio conditions.

About Softcery: We’re the AI engineering team that founders call when other teams say “it’s impossible” or “it’ll take 6+ months.” We specialize in building advanced AI systems that actually work in production, handle real customer complexity, and scale with your business. We work with B2B SaaS founders in marketing automation, legal tech, and e-commerce – solving the gap between prototypes that work in demos and systems that work at scale. Get in touch.


Frequently Asked Questions

What is speech-to-speech voice agent architecture?

Speech-to-speech (S2S) voice agents process audio as a continuous stream, with the fastest providers now delivering sub-second time-to-first-token. Unlike chained architectures that wait for full user input before responding, speech-to-speech agents stream audio in and out simultaneously, approaching human conversational pacing.

How do speech-to-speech voice agents differ from chained pipelines?

Chained pipelines follow a step-by-step flow: Speech-to-Text (STT) → Language Model (LLM) → Text-to-Speech (TTS) – which introduces noticeable latency. Speech-to-speech agents, by contrast, use streaming architectures or multimodal models that process voice continuously, reducing delay and improving conversational flow.

What are the main challenges in building a speech-to-speech AI voice agent?

Speech-to-speech voice agents face several production challenges. Integrated TTS quality is lower than specialized models like ElevenLabs Conversational AI 2.0 or Cartesia Sonic-3. Cost varies dramatically by provider: OpenAI gpt-realtime-1.5 runs roughly 10× a chained pipeline due to context-window accumulation, while Amazon Nova 2 Sonic (~$0.017/min) and Gemini 3.1 Flash Live ($0.023/min baseline) approach or undercut chained-pipeline pricing. Additional technical challenges include WebRTC/telephony integration complexity, 8 kHz PSTN audio quality degradation that reduces the benefits of speech-to-speech models, and handling network latency and packet loss in streaming scenarios.

What models or platforms support speech-to-speech AI today?

Leading proprietary platforms in April 2026 include OpenAI Realtime API (gpt-realtime-1.5, which shipped February 2026 on an API that has been GA with MCP, SIP dialing, and image inputs since August 2025), Google Gemini 3.1 Flash Live (unified audio-to-audio, Preview launched March 2026), xAI Grok Voice Agent API (Voice Agent API GA since December 2025; standalone STT and TTS APIs launched April 2026; OpenAI-Realtime-compatible spec, fastest TTFT in class at 0.78 s on Artificial Analysis), and Amazon Nova 2 Sonic on Bedrock ($0.017/min, roughly an order of magnitude cheaper than OpenAI Realtime). Open-source alternatives include Step-Audio R1.1 (Apache-2.0, currently #1 on Big Bench Audio at ~97%), Ultravox v0.7 (half-cascade with GLM-4.6 backbone; VoiceBench 87.05 without reasoning, 90.75 with reasoning), Moshi via Gradium (Kyutai’s commercial spinoff, $70M seed Dec 2025, now shipping STT and TTS APIs in 5 languages), and Kimi-Audio (1.28% WER on LibriSpeech). Each offers different trade-offs in latency, cost, flexibility, and production readiness.

How can businesses estimate the cost of deploying speech-to-speech AI voice agents?

Cost depends heavily on which provider and architecture. OpenAI gpt-realtime-1.5 uses context accumulation pricing where costs grow as conversations get longer – the model re-charges for all previous tokens on each turn. Baseline $0.30/min for a 5-minute exchange can climb to $1.50+/min in a 30-minute session. Newer entrants price differently: Gemini 3.1 Flash Live charges per audio second ($0.005/min in + $0.018/min out) with text reasoning tokens layered on, and Amazon Nova 2 Sonic prices a flat ~$0.017/min – both far below OpenAI’s per-turn accumulation curve. Chained STT/LLM/TTS pipelines maintain consistent ~$0.15/min pricing regardless of conversation length. Estimate by combining baseline per-minute rate, average conversation length, and system prompt size.
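The accumulation effect can be estimated with a short calculation. The token counts and per-million-token price below are illustrative placeholders, not any provider's actual rates; the point is the shape of the curve, not the absolute numbers.

```python
def flat_cost(minutes: float, rate_per_min: float) -> float:
    """Flat per-minute pricing (chained pipelines, Nova 2 Sonic)."""
    return minutes * rate_per_min


def accumulating_cost(turns: int, tokens_per_turn: int, price_per_mtok: float) -> float:
    """Context-accumulation pricing: each turn re-bills the whole prior
    conversation, so billed input tokens grow quadratically with turns."""
    total_tokens = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn      # new audio/text added this turn
        total_tokens += context         # the full context is billed again
    return total_tokens / 1_000_000 * price_per_mtok
```

With these illustrative inputs, doubling the number of turns roughly quadruples the accumulated bill, which is why a long session's effective per-minute rate climbs well above its opening rate while a flat-priced pipeline stays constant.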

When should a chained pipeline be chosen over speech-to-speech architecture?

Chained pipelines (STT→LLM→TTS) are often the better choice for phone-based deployments where PSTN uses 8 kHz audio – this degrades both speech recognition and TTS quality, eliminating speech-to-speech models’ latency advantages while maintaining their premium pricing. Chained pipelines also win when cost control matters (no context accumulation, though Nova 2 Sonic now narrows this gap), when maximum flexibility to swap components independently is needed (different STT/TTS providers), or when using specialized TTS models like ElevenLabs Conversational AI 2.0 or Cartesia Sonic-3 that sound more natural than integrated multimodal TTS. Despite higher latency, chained pipelines remain the production standard for most phone-based voice agent deployments.

Why is OpenAI Realtime more expensive than chained pipelines, and do all S2S models share that cost?

The 10× premium is specific to OpenAI Realtime and stems from three factors: (1) context accumulation – the model re-charges for all previous conversation tokens on each turn, so costs grow with conversation length; (2) proprietary multimodal infrastructure that processes audio natively rather than converting to text, requiring more compute per token; (3) cloud-only deployment with no self-hosting option, meaning paying for enterprise streaming infrastructure and global low-latency endpoints. Developer reports show simple OpenAI Realtime test conversations costing $3–$10, with 15-minute sessions reaching $5–$6 just for audio input. Other S2S providers price differently: Amazon Nova 2 Sonic charges per audio token (~$0.017/min), Gemini 3.1 Flash Live charges per audio second ($0.023/min baseline), and xAI Grok Voice Agent charges $3/hr (~$0.05/min). Open-source self-hosting (Step-Audio R1.1, Ultravox v0.7, Moshi) replaces per-minute cost with GPU rental.

How does audio quality affect model choice for phone systems?

Standard PSTN phone systems use 8 kHz audio (G.711 codec), which severely degrades both speech recognition accuracy and TTS naturalness compared to 16 kHz+ web audio that most AI models are trained on. This quality gap reduces the effectiveness of speech-to-speech models (gpt-realtime-1.5, Gemini 3.1 Flash Live, Grok Voice Agent) that are optimized for high-quality web audio – their latency and integration benefits disappear over phone while premium pricing remains. Provider choice matters: Twilio uses standard 8 kHz codecs with limited high-quality streaming support, while Telnyx ships HD Voice on its LiveKit-integrated platform with G.722 and Opus wideband codecs across four regions (though requiring more configuration expertise). For phone deployments, chained pipelines with telephony-optimized STT/TTS components often deliver better results at lower cost.

Can teams switch between architectures later if needs change?

Switching architectures involves significant re-engineering. Chained pipelines offer high flexibility – teams can swap STT, LLM, or TTS providers independently with minimal code changes. Speech-to-speech models create tighter coupling: application logic, audio streaming, and conversation management become intertwined with the provider’s specific API (OpenAI WebSocket, Google gRPC). Migration requires rewriting streaming integration, conversation state management, and potentially redesigning interruption handling and voice activity detection. Start with chained pipeline if uncertain – modular architectures are easier to optimize than migrating away from tightly-coupled speech-to-speech systems. Consider vendor lock-in carefully, especially with cloud-only speech-to-speech platforms.
