Real-Time (S2S) vs Cascading (STT/TTS) Voice Agent Architecture


Last updated on October 24, 2025

Three architectural approaches exist for building voice agents:

  • Chained pipelines using separate speech recognition, language processing, and synthesis components
  • Speech-to-Speech (Half-Cascade) that processes native audio input, uses text-based language reasoning, and generates speech output
  • Native Audio models that reason directly in audio space within a single neural network

Each makes different tradeoffs between flexibility, latency, cost, production readiness, and audio quality preservation.


Understanding Speech-to-Speech Voice Agent Architecture

What Is Speech-to-Speech?

Speech-to-speech voice agents process audio with minimal delay – 200-300 milliseconds from user speech to agent response.

Two approaches exist: half-cascade systems use native audio input processing with text-based language model reasoning and speech synthesis output. Native audio models handle everything within a single neural network that reasons directly in audio space.

Both encode incoming sound into vectors capturing linguistic content, tone, and emotion. They begin generating responses while the user is speaking or immediately after. Native audio maintains more audio information throughout processing, while half-cascade systems balance modularity with lower latency than chained architectures.

Chained vs. Speech-to-Speech Voice Agent Architecture

Chained pipelines follow a sequential flow: Voice → STT → LLM → TTS → Voice. Each component waits for the previous one to finish before processing. Speech-to-speech architectures stream input and output concurrently across the stack, reducing perceived delay in scenarios that involve rapid turn-taking or mid-utterance interactions.

| Aspect | Chained Voice Agent | Speech-to-Speech Voice Agent |
| --- | --- | --- |
| STT Processing | Can stream partial transcripts, but waits for end-of-utterance to finalize | Continuously streams partial transcripts as user speaks |
| LLM Behavior | Waits for complete STT output before processing | Begins processing from partial input while user is still speaking |
| TTS Synthesis | Can stream audio chunks, but starts after LLM generates first chunks (TTFT) | Starts speaking immediately as first tokens are generated, fully streaming |
| Latency | Higher due to sequential handoffs between components | Lower – concurrent streaming across all components |
| Flexibility | High – easy to swap out STT, TTS, and LLM independently | Less flexible – components must support tight integration and real-time coordination |
| Risks / Challenges | Requires careful orchestration between components to minimize latency | Significantly higher cost (~10x chained pipeline); requires stream orchestration to avoid mishearing |
| User Experience | Structured and clear, but less dynamic; noticeable pauses between turns | Agent can begin replying before user finishes speaking; maintains emotional tone through audio processing |
| Best Use Cases | All use cases, especially when cost control and flexibility are priorities | Best when ultra-low latency is critical and budget allows (AI concierges, premium live support) |
| Technical Requirements | Moderate – most providers offer PaaS solutions; focus on linking components and fallback strategy | Moderate – cloud APIs handle infrastructure; high only if self-hosting open-source models |

Core Architectures for Voice AI Agents

Three fundamental architectural approaches exist for building voice AI agents. Each has distinct trade-offs in latency, flexibility, and naturalness:

1. Chained Pipeline (Cascaded STT→LLM→TTS Architecture)

Schema: Voice → STT → LLM → TTS → Voice

How it works: The system converts speech to text, processes it through a language model, and turns it back into audio.

Pros:

  • Easy to build and debug;
  • Works well with existing LLM APIs;
  • Reliable and predictable;
  • High flexibility – easy to swap out STT, TTS, and LLM independently.

Cons:

  • High latency since each component waits for the previous one to complete;
  • Loses tone and emotion when converting to text;
  • Less natural feel, limited interruptibility.

Example implementations: Deepgram STT + GPT-4.1 + Cartesia TTS, Gladia STT + Gemini 2.5 Flash + ElevenLabs TTS
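The sequential handoffs above can be sketched as a few composed functions. The three component implementations below are stubs standing in for real provider SDK calls (any actual deployment would swap in an STT, LLM, and TTS client), so only the control flow is meaningful:

```python
# Minimal sketch of a chained (cascaded) pipeline. All three components
# are stand-ins -- replace them with real STT/LLM/TTS provider calls.

def transcribe(audio_bytes: bytes) -> str:
    """Stand-in STT: a real implementation would call a provider API."""
    return "what are your opening hours"

def generate_reply(transcript: str, history: list) -> str:
    """Stand-in LLM: a real implementation would call a chat completion API."""
    history.append({"role": "user", "content": transcript})
    reply = "We are open 9am to 5pm, Monday through Friday."
    history.append({"role": "assistant", "content": reply})
    return reply

def synthesize(text: str) -> bytes:
    """Stand-in TTS: a real implementation would return streamed audio."""
    return text.encode("utf-8")  # placeholder "audio"

def handle_turn(audio_in: bytes, history: list) -> bytes:
    # Sequential handoffs: each stage waits for the previous one to
    # finish, which is exactly where the chained pipeline's latency comes from.
    transcript = transcribe(audio_in)
    reply = generate_reply(transcript, history)
    return synthesize(reply)

history: list = []
audio_out = handle_turn(b"\x00\x01", history)
```

Because the stages only touch each other through plain text, any one of them can be swapped independently – the flexibility advantage noted above.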

2. Speech-to-Speech (Half-Cascade Architecture)

Schema: Voice → Audio Encoder → Text-based LLM → TTS → Voice

How it works: The model processes audio input directly through an encoder, uses a text-based language model to reason and respond, then generates speech via synthesis. This combines native audio input with text-based reasoning and speech output.

Google and OpenAI use this half-cascade architecture, balancing speed, performance, and reliability. This works well for production use and tool integration.


Pros:

  • Lower latency with streaming capability;
  • Retains tone and prosody cues;
  • Natural conversational flow;
  • More interruptible than chained pipeline.

Cons:

  • Still has a separate LLM reasoning layer (text-native);
  • TTS quality is lower than specialized TTS models (e.g., ElevenLabs, Cartesia) – voice sounds less natural and expressive;
  • Less flexible than fully modular approach.

Example systems: Google Gemini Live 2.5 Flash, OpenAI Realtime API (gpt-realtime), Ultravox

3. Native Audio Model (End-to-End Speech-to-Speech AI)

Schema: Voice → Unified Model → Voice

How it works: A single model listens, reasons, and speaks – all within one neural network. It encodes audio into latent vectors that capture meaning, emotion, and acoustic context, then directly generates output audio from those same representations.

Pros:

  • Very low latency (true real-time);
  • Maintains emotional tone and voice consistency;
  • Most natural conversational quality;
  • Supports full-duplex with natural interruptions.

Cons:

  • Hard to train and control;
  • Opaque reasoning (no clear text layer);
  • Needs huge, high-quality audio datasets;
  • Limited flexibility for voice customization.

Example systems: Gemini 2.5 Flash Native Audio, VITA-Audio, SALMONN-Omni, Moshi by Kyutai Labs

Gemini 2.5 Flash Native Audio (gemini-2.5-flash-native-audio-preview) provides true native audio processing – reasoning and generating speech natively in audio space. It includes affective (emotion-aware) dialogue, proactive audio capabilities, and “thinking” features. This represents Google’s experimental approach to end-to-end audio reasoning without text intermediation.


Available Speech-to-Speech Models & Platforms

Commercial APIs and open-source projects provide speech-to-speech voice agents in 2025:

Leading Proprietary Platforms

OpenAI and Google offer three production-ready speech-to-speech voice models:

| Feature | OpenAI Realtime API (gpt-realtime) | Google Gemini Live 2.5 Flash | Google Gemini 2.5 Flash Native Audio |
| --- | --- | --- | --- |
| Architecture Type | Half-Cascade (Speech-to-Speech) | Half-Cascade (Speech-to-Speech) | Native Audio (End-to-End) |
| Provider | OpenAI (also via Azure) | Google / DeepMind | Google / DeepMind |
| Model Type | Multimodal LLM with realtime audio streaming support | Multimodal flash LLM optimized for speed and interactivity | Native audio-to-audio model with affective dialogue and proactive audio capabilities |
| Latency – Time to First Token | ~280 ms | ~280 ms | ~200-250 ms (experimental) |
| Audio Input | Streaming audio via WebRTC + WebSocket API | Streaming audio via Multimodal Live API (likely gRPC-based) | Native audio streaming via Multimodal Live API |
| Token Generation Speed | ~70–100 tokens/second | ~155–160 tokens/second | N/A (generates audio directly, not token-based) |
| Hosting / Access | Cloud only (OpenAI API / Azure OpenAI Service) | Cloud only (Google AI Studio / Vertex AI) | Cloud only (Google AI Studio / Vertex AI) – Preview only |
| Developer Integration | OpenAI Streaming API with WebRTC | Access via Google’s Vertex AI or AI Studio; endpoint: gemini-2.5-flash-live-001 | Access via Google AI Studio; endpoint: gemini-2.5-flash-native-audio-preview |
| Multimodal Capabilities | Yes – audio input, speech output; also supports vision | Yes – audio, video, text input; supports images and rolling context in conversation | Yes – native audio reasoning with emotion awareness, “thinking” mode, proactive audio |
| Throughput Capacity | ~800K tokens/min, ~1,000 req/min (Azure OpenAI, realtime mode) | N/A (not publicly specified, but optimized for high concurrency and streaming) | N/A (experimental preview, not intended for production scale) |
| Production Readiness | Generally Available | Generally Available | Experimental Preview only – not production-ready |

Open-Source Alternatives

Two open-source projects offer alternatives to proprietary models:

| Feature | Ultravox (by Fixie.ai) | Moshi (by Kyutai Labs) |
| --- | --- | --- |
| Architecture Type | Half-Cascade (Speech-to-Speech) | Native Audio (End-to-End) |
| Model Type | Multimodal LLM (audio + text encoder, outputs text) | Audio-to-audio LLM (integrated STT and TTS – speech in, speech out) |
| Architecture | Voice → LLM → Text (planned speech output in future versions) | Voice → LLM → Voice (fully integrated speech-to-speech pipeline) |
| Streaming Support | Streaming text output with low latency | Full-duplex streaming (supports overlap and interruption) |
| Time to First Token (TTFT) | ~190 ms (on smaller variant) | ~160 ms |
| Token Generation Speed | ~200+ tokens/sec | Not token-based; generates speech waveform directly |
| Base Models | Built on open LLMs (e.g., LLaMA 3 – 8B / 70B) | Proprietary foundation model trained by Kyutai |
| Audio Processing | Projects audio into same token space as text using custom audio encoder | End-to-end audio encoder and decoder (neural codec pipeline) |
| Output Type | Text (for now), with plans for speech token output | Audio (neural codec speech) |
| Hosting / Deployment | Self-hostable; requires GPU infra, especially for 70B variant | Self-hostable (heavy); public demo available at moshi.chat |
| Open-Source Status | Fully open: model weights, architecture, and code available on GitHub | Fully open: code and demos available; weights provided (early stage) |
| Extensibility | Can plug in any open-weight LLM; attach custom audio projector | Closed model structure for now; focused on turnkey audio-agent use |
| Use Case Fit | Voice-enabled bots with real-time understanding, using custom TTS for output | Full voice agents with natural interruptions and direct speech response |

Integration Frameworks and Tools

Integration frameworks include Pipecat (vendor-agnostic voice agent framework used by teams like NVIDIA and Cresta, maintained by Daily.co as 100% open source), LiveKit (WebRTC streaming infrastructure), and FastRTC (Python streaming audio). For comprehensive platform comparisons including deployment options and integration approaches, see the voice agent platform guide. Developers can assemble open-source speech recognition (Vosk, NeMo) and TTS (VITS, FastSpeech) components into speech-to-speech agents without using end-to-end models.

Performance Metrics That Matter

Three metrics determine voice agent performance: speed (time to first token), accuracy (word error rate), and processing efficiency (real-time factor).

Time to First Token (TTFT)

Time to First Token (TTFT) measures latency from end-of-user-speech to start-of-agent-speech. Current models achieve TTFT in the 200-300 millisecond range: Google’s Gemini Flash logs ~280 ms, OpenAI’s GPT-4o realtime ~250-300 ms. Human response latencies in conversation average around 200 ms.

Network latency affects cloud API measurements, so real-world TTFT runs higher than lab values; published TTFT figures are often measured in controlled settings rather than end-to-end, which makes direct vendor comparisons unreliable.

Lower TTFT is better, though extremely low values may indicate the model responds before fully processing user intent.
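Measuring TTFT around any streaming API reduces to timestamping the first token. A minimal sketch, where `fake_stream` is a stand-in for a provider's token stream:

```python
# Sketch: measure time-to-first-token (TTFT) around a streaming response.
# `fake_stream` simulates a provider stream with a fixed 50 ms delay.
import time

def fake_stream():
    time.sleep(0.05)          # simulated network + model delay
    yield "Hello"
    yield ", world"

def measure_ttft(stream_factory):
    """Return (ttft_seconds, full_text) for a token stream."""
    start = time.monotonic()
    ttft = None
    parts = []
    for token in stream_factory():
        if ttft is None:
            ttft = time.monotonic() - start   # first token arrived
        parts.append(token)
    return ttft, "".join(parts)

ttft, text = measure_ttft(fake_stream)
```

Running the same measurement from your production region, over the real network path, gives the end-to-end number that matters rather than the lab figure.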

Word Error Rate (WER)

Word Error Rate (WER) measures the percentage of words incorrectly recognized in the transcript. Lower WER means more accurate transcription. Meta AI’s research on streaming LLM-based ASR achieved ~3.0% WER on Librispeech test-clean (~7.4% on test-other) in real-time mode, approaching offline model accuracy.

Recognition errors can lead the LLM astray. Cloud providers publish WER on benchmarks, but real-world WER runs higher. Real-time agents may correct some ASR errors via context, though lower baseline WER remains preferable.

Domain adaptation through custom vocabulary or fine-tuning helps with specialized terminology.
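WER is computed as word-level edit distance (substitutions + deletions + insertions) divided by the reference word count. A small self-contained implementation for spot-checking transcripts:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `wer("the cat sat on the mat", "the cat sat on a mat")` is one substitution over six reference words, about 16.7%.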

Real-Time Factor (RTF)

Real-Time Factor (RTF) measures processing speed relative to input duration. RTF < 1.0 means the system processes faster than real time. Each component has its own RTF: STT engines typically process at 0.2× real time, LLMs generate at 50+ tokens/sec, modern TTS synthesizes at RTF 0.1 or better (10 seconds of speech generated in 1 second).

Systems must maintain RTF < 1 under load to prevent latency accumulation. Smaller models often achieve better RTF at the cost of language quality, making token generation speed a determining factor for ultra-low latency requirements.
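The RTF arithmetic can be made concrete. The numbers below are the illustrative figures from the text, not benchmarks:

```python
# Sketch: real-time factor per component. RTF < 1.0 means the
# component keeps up with real time.

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# Illustrative figures from the text: STT at 0.2x, TTS at 0.1x.
stt_rtf = rtf(processing_seconds=2.0, audio_seconds=10.0)   # 0.2
tts_rtf = rtf(processing_seconds=1.0, audio_seconds=10.0)   # 0.1

# For the LLM stage, convert token throughput into an effective RTF:
# generation time for the reply's tokens vs the reply's playback time.
def llm_rtf(n_tokens: int, tokens_per_sec: float, reply_audio_seconds: float) -> float:
    return (n_tokens / tokens_per_sec) / reply_audio_seconds

pipeline_keeps_up = max(stt_rtf, tts_rtf, llm_rtf(150, 60.0, 10.0)) < 1.0
```

Under load the same check should hold at the target concurrency, since a single slow component above RTF 1.0 makes latency accumulate turn over turn.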


Cost Analysis and Scalability for Speech-to-Speech Voice Agents

Speech-to-speech voice agent costs break down into five categories: cloud API usage, self-hosting compute, scalability limits, bandwidth, and enterprise overhead.

| Cost Category | Description | Examples / Benchmarks | Key Considerations |
| --- | --- | --- | --- |
| Usage-Based Pricing (Cloud APIs) | Pay-per-token/minute for any architecture (STT, LLM, TTS, or integrated multimodal) | OpenAI Realtime: ~$0.30/min (baseline), increases significantly with turns; Gemini Live: ~$0.22/min (baseline), increases with turns; Gemini Native Audio: ~$0.50/min with typical conversation turns (experimental); Chained pipeline: ~$0.15/min (no context accumulation) | Speech-to-speech models have context accumulation that dramatically increases costs with conversation turns; chained pipelines maintain consistent per-minute pricing |
| Compute Costs (Self-Hosting) | Run open-source models like Ultravox/Moshi on your own infra | Hosting Ultravox 70B may need A100/H100 GPU per concurrent session; GPU costs: ~$2–$3/hr (cloud) | Lower marginal cost at scale; requires infra & DevOps team; harder to spin up instantly |
| Scalability / Rate Limits | Limits on concurrent sessions, tokens per minute, request rate | OpenAI GPT-4o: 800K tokens/min, 1K requests/min; Enterprise: up to 30M tokens/min | Watch for WebSocket caps or long-lived session constraints; request enterprise quotas if needed |
| Bandwidth Overhead | Cost of streaming audio data over network | ~8–64 kbps per stream; telephony codecs (e.g. G.711 vs G.729) can affect costs | Minor cost per stream, but adds up at scale; ensure egress limits aren’t exceeded in cloud setups |
| Enterprise Overhead | SLAs, premium support, custom deployments, fallback systems | Regional/on-prem hosting; redundancy systems (e.g. backup STT or fallback bots) | Adds reliability and control; contractual/licensing complexity increases total cost of ownership |

Understanding Speech-to-Speech Pricing

Speech-to-speech models like GPT-4o Realtime and Gemini 2.5 Flash Live have different cost structures than chained STT/TTS pipelines. For detailed provider comparisons with latency benchmarks and accuracy metrics, see the complete STT and TTS selection guide. Three factors drive higher costs:

  1. Proprietary multimodal infrastructure – These models require specialized neural architectures that process audio natively, maintaining acoustic features throughout the pipeline rather than collapsing to text
  2. Cloud-only deployment – No self-hosting option means paying for enterprise-grade streaming infrastructure, low-latency global endpoints, and WebRTC/gRPC orchestration
  3. Advanced real-time capabilities – Support for interruptions, emotional tone preservation, and sub-300ms latency requires substantial compute resources per session

Real-world cost reports from OpenAI’s developer community:

  • $3 spent on “a few short test conversations” in the playground (simple questions like bedtime stories)
  • $10 consumed during weekend integration testing, leading developers to call the API “unusable at the moment” due to cost
  • Costs increase per minute as conversations get longer – in a 15-minute session, one developer reported $5.28 for audio input vs $0.65 for output. This happens because tokens accumulate in the context window, and the model re-charges for all previous tokens on each turn, making longer conversations disproportionately more expensive

User-reported costs differ from official per-minute estimates because actual costs depend on conversation length (context accumulation), system prompt size (larger prompts = more tokens per turn), and conversation complexity (more back-and-forth = more context to maintain). A 5-minute conversation might cost $0.30/min, while a 30-minute conversation could cost $1.50/min or more due to accumulated context.
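A toy model makes the context-accumulation effect concrete. The token counts below are illustrative assumptions, not provider pricing – the point is the growth shape, not the absolute numbers:

```python
# Toy model of context accumulation: each turn re-sends the whole
# conversation history, so billed input tokens grow roughly
# quadratically with turn count. Token figures are illustrative.

def billed_input_tokens(turns: int, tokens_per_turn: int,
                        system_prompt_tokens: int) -> int:
    total = 0
    context = system_prompt_tokens
    for _ in range(turns):
        context += tokens_per_turn   # history grows each turn
        total += context             # entire context billed again this turn
    return total

short = billed_input_tokens(turns=5, tokens_per_turn=200, system_prompt_tokens=500)
long = billed_input_tokens(turns=30, tokens_per_turn=200, system_prompt_tokens=500)
```

With these assumptions the 30-turn conversation bills roughly 20x the input tokens of the 5-turn one despite having only 6x the turns, which is why per-minute cost climbs as sessions run longer.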

Native Audio Models (Moshi, VITA-Audio) are early-stage and experimental. While they promise the lowest latency and most natural interactions, they are:

  • Mostly research projects, not production-ready
  • Require significant GPU resources for self-hosting (A100/H100 class)
  • Lack the ecosystem support, tooling, and reliability of commercial offerings
  • Limited voice customization and control compared to modular approaches

Match Cost Strategy to Deployment Scale

Early-stage projects with low volume benefit from cloud APIs: fast setup, predictable pricing, pay-per-use. As usage grows, self-hosting economics may improve, particularly when requiring tight control, data locality, or custom model tuning.

Enterprise scale depends on reliability, rate limits, support agreements, and long-term flexibility – not just price per minute. Total cost of ownership (TCO) includes processing minutes, bandwidth, DevOps effort, redundancy, and support.

Cost calculation for specific scenarios: average conversation length × conversations per day × per-minute pricing = monthly cost. Compare against self-hosting infrastructure investment. Monitor usage limits and enterprise tier requirements.
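The formula above, sketched with the baseline per-minute figures cited earlier (illustrative baselines, not quotes – real speech-to-speech costs rise with conversation length):

```python
# Monthly cost estimate: avg minutes/call x calls/day x $/min x days.
# Per-minute rates are the baseline figures from the cost table above.

def monthly_cost(avg_minutes_per_call: float, calls_per_day: int,
                 price_per_minute: float, days: int = 30) -> float:
    return avg_minutes_per_call * calls_per_day * price_per_minute * days

chained = monthly_cost(4.0, 500, 0.15)   # ~$0.15/min chained pipeline
s2s = monthly_cost(4.0, 500, 0.30)       # ~$0.30/min S2S baseline
```

At 500 four-minute calls per day, the chained baseline lands around $9,000/month versus $18,000/month for the speech-to-speech baseline – before context accumulation widens the gap further.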


Technical Implementation Challenges in Speech-to-Speech Voice Agent Deployment

Deploying to production requires integrating streaming, connecting to telephony, handling noise, and orchestrating streams.

Streaming Integration (WebRTC, WebSockets, etc.)

Low latency requires appropriate streaming mechanisms. Three options: WebRTC, WebSockets, and streaming HTTP/gRPC.

WebRTC

Web Real-Time Communication is the standard for low-latency audio/video streaming in browsers and mobile apps. Uses UDP for fast transmission and handles packet loss gracefully. Both OpenAI and Google use WebRTC for client-side audio capture and playback.

Browser and mobile app interactions use WebRTC to send microphone audio to the server. Includes Acoustic Echo Cancellation (AEC), noise reduction, and automatic gain control (AGC). Libraries like LiveKit, mediasoup, or Twilio provide WebRTC integration.

WebSockets and gRPC

Server-side connections between application servers and AI services use persistent bidirectional connections. OpenAI’s voice API uses WebSockets – client sends audio chunks and receives tokens continuously. Google’s API uses gRPC streaming over HTTP/2.

Both provide continuous streams rather than discrete HTTP requests. Implementation requires proper binary audio frame handling and maintaining open connections for conversation duration.
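The shape of such a bidirectional streaming loop can be sketched with a stub connection standing in for a real WebSocket client (e.g. the `websockets` package). The message schema below is illustrative, not any provider's actual protocol:

```python
# Structural sketch of a bidirectional streaming turn: send base64 audio
# frames, then read tokens until the server signals completion.
# StubConnection is an echo-style stand-in so this runs without a network.
import asyncio
import base64
import json

class StubConnection:
    def __init__(self):
        self._inbox = asyncio.Queue()

    async def send(self, message: str):
        event = json.loads(message)
        if event["type"] == "audio_chunk":
            await self._inbox.put(json.dumps({"type": "token", "text": "ok "}))
        elif event["type"] == "end_of_turn":
            await self._inbox.put(json.dumps({"type": "done"}))

    async def recv(self) -> str:
        return await self._inbox.get()

async def stream_turn(conn, audio_chunks):
    # Send binary audio as base64 frames, then signal end of turn.
    for chunk in audio_chunks:
        await conn.send(json.dumps({
            "type": "audio_chunk",
            "audio": base64.b64encode(chunk).decode(),
        }))
    await conn.send(json.dumps({"type": "end_of_turn"}))
    # Read streamed tokens until the server marks the turn complete.
    tokens = []
    while True:
        event = json.loads(await conn.recv())
        if event["type"] == "done":
            break
        tokens.append(event["text"])
    return "".join(tokens)

async def main():
    return await stream_turn(StubConnection(), [b"\x00" * 320] * 3)

reply = asyncio.run(main())
```

The key structural points survive any real provider swap: one persistent connection per conversation, binary frames flowing up, and a receive loop that drains tokens concurrently rather than waiting on a single response.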

Audio Encoding

Audio format choice depends on API requirements. PCM raw audio is simple but bulky. Opus codec (used by WebRTC) provides high quality at low bitrate, though not all APIs accept Opus packets. Some APIs accept WAV or FLAC frames.

Compressed codecs save bandwidth for mobile users. Phone calls use G.711 µ-law 8kHz, requiring transcoding to 16kHz linear PCM for most ASR systems (Whisper, DeepSpeech).
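The G.711 transcoding step can be sketched in pure Python: expand µ-law bytes to 16-bit linear PCM, then naively upsample 8 kHz to 16 kHz by interpolation. Production systems should use a proper resampler (the stdlib `audioop` module handled both steps but was removed in Python 3.13); this shows the shape of the operation:

```python
# Sketch: decode G.711 mu-law (8 kHz) to 16-bit linear PCM and upsample
# to 16 kHz via linear interpolation. Use a real resampler in production.

def ulaw_byte_to_pcm16(u: int) -> int:
    """Standard G.711 mu-law expansion for one byte."""
    u = ~u & 0xFF
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
    return (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)

def transcode_ulaw8k_to_pcm16k(ulaw: bytes) -> list:
    pcm8k = [ulaw_byte_to_pcm16(b) for b in ulaw]
    pcm16k = []
    for i, sample in enumerate(pcm8k):
        pcm16k.append(sample)
        nxt = pcm8k[i + 1] if i + 1 < len(pcm8k) else sample
        pcm16k.append((sample + nxt) // 2)   # interpolated midpoint
    return pcm16k

silence = transcode_ulaw8k_to_pcm16k(b"\xff\xff\xff")  # 0xFF is mu-law zero
```

Note that resampling cannot restore frequency content above 4 kHz that the 8 kHz channel already discarded, which is why narrowband telephony audio degrades ASR accuracy regardless of the transcode quality.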

Latency Tuning

Streaming systems use buffers to smooth network variation. WebRTC jitter buffers trade added delay for smooth audio playback. Default WebRTC parameters suffice for most deployments.

WebSocket implementations send data immediately (20ms audio frame every 20ms) without batching. Most WebSocket libraries disable Nagle’s algorithm by default to avoid delaying small packets.

Handling Network Issues

WebRTC handles packet loss through loss concealment, filling missing audio chunks with plausible noise. WebSocket implementations lack this but ASR systems handle minor gaps reasonably well on decent networks.

Output packet loss can cause audio blips. Some systems use redundant packets or forward error correction on unreliable networks.

Many implementations combine approaches: WebRTC from client to relay server, then WebSocket from server to AI API. OpenAI’s example follows this pattern. WebRTC handles unpredictable client networks while WebSocket simplifies AI model interfacing.

Telephony Integration (8 kHz and PSTN)

Call Quality Challenges

Phone deployments reveal quality issues absent in web-based implementations. Standard PSTN uses 8 kHz audio (G.711 codec), which severely degrades both speech recognition accuracy and TTS naturalness compared to 16 kHz+ web audio.

Most high-quality ASR models (including GPT-4o Realtime, Gemini Live, Whisper) train primarily on 16 kHz audio, so 8 kHz telephony input reduces their accuracy significantly.

Provider Support

Twilio’s standard codecs operate at 8 kHz with limited support for higher-quality audio streaming needed for AI models. Telnyx offers native 16 kHz support via G.722 wideband codec through their owned infrastructure, but requires more expertise to configure properly.

Speech-to-speech models (GPT-4o Realtime, Gemini Live) optimized for high-quality web audio don’t perform as well over standard PSTN. Their latency and integration benefits disappear over phone while premium pricing remains. This makes chained STT/LLM/TTS pipelines with telephony-optimized components often more reliable and cost-effective for phone-based deployments.

SIP and VoIP Integration

Telephony integration uses services like Twilio, Nexmo, or on-premises SIP systems. These provide audio via WebSocket (Twilio streams 8k PCM in real time) or media servers. Architecture must ingest these streams and connect to the AI pipeline.

DTMF and Control

Telephony providers detect DTMF tones (touch-tone input) out-of-band to avoid confusing ASR. Twilio sends webhook events for DTMF. Speech-to-speech voice agents minimize DTMF menus but users may still attempt touch-tone input.

Telephony Latency

Phone networks add 100-200ms fixed latency. Processing pipelines should minimize additional overhead. Hosting AI services near telephony ingress points reduces roundtrip latency.

Human Agent Handoffs

Human agent handoffs benefit from passing conversation context. AI conversations that escalate after collecting information should provide transcribed summaries to avoid user repetition.

Handling Background Noise & Voice Variability

Noise Suppression

Noise suppression algorithms applied before ASR improve recognition accuracy. ML models like RNNoise remove background noise (keyboard sounds, fans) in real time. Picovoice’s Koala demonstrates intelligibility improvements.

Tradeoff: slightly distorts voice and consumes extra CPU.

Microphone Differences

Audio quality varies across headsets, speakerphones, and car bluetooth (frequency response, echo). Echo cancellation prevents agent voice from being picked up by microphone. WebRTC’s AEC handles most cases.

Telephone scenarios rely on network echo cancellers or require adaptive echo cancelers in the pipeline.

VAD and Barge-In

Voice Activity Detection (VAD) distinguishes speech from noise. Noisy conditions cause false positives/negatives. Combining VAD with ASR confidence improves accuracy. Treat silence as end-of-utterance only when ASR confirms finality.

Continue assuming speech while ASR generates transcribed words. End turn after 500ms silence. Barge-in requires monitoring microphone during agent speech to stop TTS when user interrupts.
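The 500ms end-of-turn rule and the barge-in check can be sketched as a small per-frame state machine. The 20ms frame size and the threshold are assumptions matching the text:

```python
# Sketch of endpointing + barge-in: end the turn only after 500 ms of
# silence; flag a barge-in if speech starts while the agent is talking.

FRAME_MS = 20
END_SILENCE_MS = 500

class Endpointer:
    def __init__(self):
        self.silence_ms = 0
        self.turn_ended = False

    def on_frame(self, is_speech: bool, agent_speaking: bool):
        """Called once per 20 ms audio frame; returns an event or None."""
        if is_speech:
            self.silence_ms = 0
            if agent_speaking:
                return "barge_in"        # user interrupted: stop TTS now
            return None
        self.silence_ms += FRAME_MS
        if self.silence_ms >= END_SILENCE_MS and not self.turn_ended:
            self.turn_ended = True
            return "end_of_turn"         # hand transcript to the LLM
        return None

ep = Endpointer()
# 10 speech frames followed by 30 silence frames (agent not speaking).
events = [ep.on_frame(is_speech=f < 10, agent_speaking=False) for f in range(40)]
```

In practice the `is_speech` input should combine VAD with ASR confidence, per the guidance above, rather than raw energy alone.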

Accents and Languages

Diverse user bases require testing across accents and dialects. Cloud ASRs support accent/locale specifications for improved accuracy. Open models benefit from fine-tuning on accented data.

Bilingual support requires models supporting multiple languages (Google, OpenAI). Multi-language detection works through auto-detection or routing to language-specific models.

Stream Management and Orchestration

Continuous conversation streams require managing concurrent input/output and conversation state.

Half-Duplex vs Full-Duplex

Most systems use half-duplex with barge-in – users can interrupt agents, but agents don’t interrupt users except for short backchannel utterances (“uh-huh”, “I see”). Backchannel implementation requires detecting pauses and generating quick responses without disrupting ASR.

Prompt Management

Persistent conversation state requires maintaining rolling prompts for the LLM. APIs with persistent sessions handle this up to context limits. Manual implementations append each utterance and reply.

Long conversations require summarizing older content to stay within context windows. Important user-provided facts need re-injection into prompts as needed.
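A sketch of that rolling-prompt compaction: keep the newest turns verbatim and collapse older ones once a budget is exceeded. The word-count tokenizer and the string-truncation "summary" below are placeholders for a real tokenizer and an LLM summarization call:

```python
# Sketch: compact conversation history to stay within a context budget.
# Budget, tokenizer, and summarizer are all placeholder assumptions.

MAX_TOKENS = 50
KEEP_RECENT = 4

def count_tokens(messages):
    # Crude proxy: whitespace word count; use a real tokenizer in production.
    return sum(len(m["content"].split()) for m in messages)

def compact_history(messages):
    if count_tokens(messages) <= MAX_TOKENS:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    # Placeholder summary; a real system would ask an LLM to summarize `old`.
    summary = "Earlier: " + "; ".join(m["content"][:20] for m in old)
    return [{"role": "system", "content": summary}] + recent

history = [
    {"role": "user", "content": f"message number {i} with several extra words"}
    for i in range(10)
]
compacted = compact_history(history)
```

Facts the user supplied earlier (names, account numbers, stated preferences) should be extracted into the summary explicitly rather than truncated away, per the re-injection point above.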

Ensuring Required Steps

Flows requiring specific actions (identity verification, mandatory questions) benefit from checkpoints. Teams can implement checkpoints through LLM prompt instructions or external state machines.

Some systems prevent sending queries to LLM until prerequisite steps complete, or override LLM responses that skip required actions. This combines rule-based flow with AI – trusting AI for understanding and generation while enforcing action sequences.
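Such a checkpoint can live in a tiny state machine outside the LLM. The step names and gating rule below are illustrative:

```python
# Sketch: enforce required steps outside the LLM. An action is allowed
# only once its prerequisite checkpoints have been completed.

REQUIRED_BEFORE = {"account_action": ["identity_verified"]}

class FlowGuard:
    def __init__(self):
        self.completed = set()

    def mark_done(self, step: str):
        self.completed.add(step)

    def allow(self, action: str) -> bool:
        # Actions with no listed prerequisites are always allowed.
        return all(p in self.completed for p in REQUIRED_BEFORE.get(action, []))

guard = FlowGuard()
blocked_first = guard.allow("account_action")   # not verified yet
guard.mark_done("identity_verified")
allowed_after = guard.allow("account_action")   # now permitted
```

The orchestrator consults the guard before executing any tool call the LLM proposes, so the model handles understanding and generation while the action sequence stays deterministic.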

Speech-to-speech agent orchestration requires managing concurrent input/output streams. Best practices and libraries exist for common patterns. Testing should include scenarios like users interrupting agent speech to verify barge-in logic stops TTS promptly.


Conclusion

Speech-to-speech voice agents reduce latency to 200-300ms, approaching human response times. Proprietary platforms (OpenAI gpt-realtime, Google Gemini Live 2.5 Flash) are generally available, while open-source options (Ultravox, Moshi) are still nearing production maturity.

Architecture choice depends on deployment environment and constraints:

  • Chained Pipeline – Voice → STT → LLM → TTS → Voice – provides maximum flexibility and reliability but higher latency
  • Speech-to-Speech (Half-Cascade) – Voice → Audio Encoder → Text-based LLM → TTS → Voice – balances performance with production readiness, but at significantly higher cost
  • Native Audio – Voice → Unified Model → Voice – offers the lowest latency and most natural interactions, but remains experimental and not production-ready

Implementation factors:

  • Performance requirements: TTFT, WER, and RTF targets for the use case
  • Cost structure: Cloud APIs vs. self-hosting economics at expected scale
  • Technical complexity: Streaming integration, telephony connectivity, noise handling
  • Deployment environment: Phone systems (8kHz PSTN) vs. web-based (16kHz+ audio)

System design includes audio streaming, orchestration, testing, and optimization for specific constraints. Cloud APIs enable rapid prototyping. Production deployment requires testing with real user patterns and audio conditions.

About Softcery: We’re the AI engineering team that founders call when other teams say “it’s impossible” or “it’ll take 6+ months.” We specialize in building advanced AI systems that actually work in production, handle real customer complexity, and scale with your business. We work with B2B SaaS founders in marketing automation, legal tech, and e-commerce – solving the gap between prototypes that work in demos and systems that work at scale. Get in touch.


Frequently Asked Questions

What is speech-to-speech voice agent architecture?

Speech-to-speech (S2S) voice agents process audio with minimal delay – 200-300 milliseconds – enabling natural conversation flow. Unlike chained architectures that wait for full user input before responding, speech-to-speech agents stream audio in and out simultaneously, achieving near-human responsiveness.

How do speech-to-speech voice agents differ from chained pipelines?

Chained pipelines follow a step-by-step flow: Speech-to-Text (STT) → Language Model (LLM) → Text-to-Speech (TTS) – which introduces noticeable latency. Speech-to-speech agents, by contrast, use streaming architectures or multimodal models that process voice continuously, reducing delay and improving conversational flow.

What are the main challenges in building a speech-to-speech AI voice agent?

Speech-to-speech voice agents face several production challenges. They’re relatively new technology with limited track record compared to chained pipelines. TTS quality is lower than specialized models like ElevenLabs or Cartesia. Costs run significantly higher (~10x chained pipelines) due to context accumulation where the model re-charges for all previous tokens on each turn. Additional technical challenges include WebRTC/telephony integration complexity, 8kHz PSTN audio quality degradation that reduces the benefits of speech-to-speech models, and handling network latency and packet loss in streaming scenarios.

What models or platforms support speech-to-speech AI today?

Leading platforms include OpenAI Realtime API (gpt-realtime), Google Gemini 2.5 Flash Live (half-cascade architecture), and Google Gemini 2.5 Flash Native Audio (native audio model, experimental). Open-source alternatives include Ultravox (half-cascade built on LLaMA) and Moshi (native audio model by Kyutai Labs). Each offers different trade-offs in latency, cost, flexibility, and production readiness – with proprietary platforms generally available but more expensive, and open-source requiring self-hosting infrastructure.

How can businesses estimate the cost of deploying speech-to-speech AI voice agents?

Speech-to-speech models use context accumulation pricing where costs increase significantly as conversations get longer – the model re-charges for all previous tokens on each turn. While baseline estimates might show $0.22-$0.30 per minute, actual costs for longer conversations can reach $1.50+ per minute due to accumulated context. A 5-minute conversation might cost $0.30/min, while a 30-minute conversation could cost $1.50/min or more. Chained STT/LLM/TTS pipelines maintain consistent ~$0.15/min pricing regardless of conversation length. Factor in conversation length and context accumulation for speech-to-speech models when estimating expenses.

When should chained pipeline be chosen over speech-to-speech architecture?

Chained pipelines (STT→LLM→TTS) are often the better choice for phone-based deployments where PSTN uses 8 kHz audio – this degrades both speech recognition and TTS quality, eliminating speech-to-speech models’ latency advantages while maintaining their premium pricing. Chained pipelines also win when cost control matters (no context accumulation), when maximum flexibility to swap components independently is needed (different STT/TTS providers), or when using specialized TTS models like ElevenLabs or Cartesia that sound more natural than integrated multimodal TTS. Despite higher latency, chained pipelines remain the production standard for most voice agent deployments.

Why are speech-to-speech models significantly more expensive than chained pipelines?

Speech-to-speech models cost ~10x more due to three factors: (1) context accumulation – the model re-charges for all previous conversation tokens on each turn, making costs increase exponentially with conversation length; (2) proprietary multimodal infrastructure that processes audio natively rather than converting to text, requiring more compute per token; (3) cloud-only deployment with no self-hosting option, meaning paying for enterprise streaming infrastructure, WebRTC orchestration, and global low-latency endpoints. Developer reports show simple test conversations costing $3-10, with 15-minute sessions reaching $5-6 just for audio input due to context window accumulation.

How does audio quality affect model choice for phone systems?

Standard PSTN phone systems use 8 kHz audio (G.711 codec), which severely degrades both speech recognition accuracy and TTS naturalness compared to 16 kHz+ web audio that most AI models are trained on. This quality gap reduces the effectiveness of speech-to-speech models (GPT-4o Realtime, Gemini Live) that are optimized for high-quality web audio – their latency and integration benefits disappear over phone while premium pricing remains. Provider choice matters: Twilio uses standard 8 kHz codecs with limited high-quality streaming support, while Telnyx offers native 16 kHz via G.722 wideband codec through owned infrastructure (though requiring more configuration expertise). For phone deployments, chained pipelines with telephony-optimized STT/TTS components often deliver better results at lower cost.

Can teams switch between architectures later if needs change?

Switching architectures involves significant re-engineering. Chained pipelines offer high flexibility – teams can swap STT, LLM, or TTS providers independently with minimal code changes. Speech-to-speech models create tighter coupling: application logic, audio streaming, and conversation management become intertwined with the provider’s specific API (OpenAI WebSocket, Google gRPC). Migration requires rewriting streaming integration, conversation state management, and potentially redesigning interruption handling and voice activity detection. Start with a chained pipeline if uncertain – modular architectures are easier to optimize than migrating away from tightly coupled speech-to-speech systems. Consider vendor lock-in carefully, especially with cloud-only speech-to-speech platforms.
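The modularity that makes chained pipelines easy to migrate comes from depending on narrow interfaces rather than vendor SDKs directly. A minimal sketch of that pattern, with illustrative (not vendor) class names:

```python
# Sketch of a provider abstraction that keeps a chained pipeline swappable:
# the agent depends only on narrow Protocol interfaces, so replacing an STT
# or TTS vendor means writing one adapter, not rewriting the agent. All
# class and method names here are illustrative, not any vendor's API.

from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class EchoSTT:
    """Toy adapter; a real one would wrap a provider SDK."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class EchoTTS:
    """Toy adapter; a real one would wrap a provider SDK."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

class VoiceAgent:
    def __init__(self, stt: STT, tts: TTS):
        self.stt = stt   # swap providers here without touching the logic below
        self.tts = tts

    def respond(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        return self.tts.synthesize(f"You said: {text}")

agent = VoiceAgent(EchoSTT(), EchoTTS())
```

A speech-to-speech integration has no equivalent seam: audio streaming, turn-taking, and state management all run through one provider’s session API, which is what makes later migration expensive.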
