How to Choose STT and TTS for Voice Agents: Latency, Accuracy, Cost

Last updated on April 24, 2026

Speech-to-Text (STT) and Text-to-Speech (TTS) tech, combined with Large Language Models (LLMs), power most AI voice agents today.

Direct Speech-to-Speech solutions exist but remain limited in production deployment. For a detailed comparison of real-time versus turn-based architectures and their cost implications, see our architecture comparison guide. The STT → LLM → TTS pipeline offers independent model selection, adjustable complexity per use case, and straightforward integration with existing systems.

The selection depends on accuracy requirements, latency constraints, language support, and cost. Models vary significantly across these dimensions, and the most popular providers show distinct tradeoffs.


Understanding STT and TTS Technologies

The voice agent interaction cycle:

  1. STT captures voice input and converts it to text
  2. An LLM generates an appropriate response
  3. TTS converts the response back into natural-sounding speech
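The three steps above can be sketched as a small pipeline object. This is a minimal illustration, not any provider's SDK: `VoicePipeline`, `run_turn`, and the three `fake_*` stages are hypothetical names, and the stages are stand-ins so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """One conversational turn: audio in, audio out."""
    transcribe: Callable[[bytes], str]   # STT: audio -> text
    respond: Callable[[str], str]        # LLM: user text -> reply text
    synthesize: Callable[[str], bytes]   # TTS: reply text -> audio

    def run_turn(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        reply_text = self.respond(user_text)
        return self.synthesize(reply_text)


# Stand-in stages (assumptions): they only move text around so the sketch runs.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_llm, fake_tts)
reply_audio = pipeline.run_turn(b"book a table for two")
```

Because each stage is just a callable, swapping one provider for another changes a single argument, which is the independent model selection the cascaded pipeline is valued for.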

Modern STT models rely on deep learning, mostly transformer architectures. They process audio through a few key steps: cleaning it up, pulling out useful features, and modeling the sequence to turn sound into accurate text.

TTS systems reverse this flow. They convert text into a spectrogram (a visual representation of sound frequencies), then generate an audio waveform that produces natural-sounding speech.

Current TTS models achieve near-human naturalness in controlled conditions. Development focuses on cost reduction, cross-device optimization, and stability improvements.

STT faces harder technical constraints. Noisy environments, multi-speaker scenarios, and speaker isolation remain challenging. These limitations drive active development priorities across providers.


Key Criteria for Selecting STT Models

Some Speech-to-Text models shine in quiet call-center setups, others handle noisy real-world audio better. A few core factors determine performance across different environments.

1. Accuracy and Recognition Capabilities

Word Error Rate (WER) measures transcription accuracy. The best streaming STT models score 2–4% on AA-WER v2.0 (Artificial Analysis’s independent benchmark), meaning very high accuracy on production-representative audio. Accuracy varies significantly across different accents, background noise levels, specialized vocabulary domains, and multi-speaker scenarios.
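As a rough illustration of what a WER figure means, the sketch below computes the textbook definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the lowercase-and-split normalization are assumptions; benchmark suites such as AA-WER may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "call me back at five five five one two three"
hypothesis = "call me back at five five nine one two three"
wer = word_error_rate(reference, hypothesis)  # one substitution in ten words
```

A single wrong digit in a ten-word utterance already yields 10% WER, which is why dictated contact details are a demanding test for STT.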

2. Processing Speed and Latency

For voice agents, target a total round-trip latency of around 800 ms (VAD + STT + LLM + TTS + network). STT latency under 300 ms is ideal, and under 150 ms is achievable with the latest models, leaving headroom elsewhere in the pipeline.
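A latency budget makes the 800 ms target concrete. In the sketch below only the 800 ms target comes from the text; every per-stage figure is a hypothetical placeholder that should be replaced with measurements from your own stack.

```python
TARGET_ROUND_TRIP_MS = 800  # target quoted in the text

# Hypothetical per-stage estimates in milliseconds (assumptions, not benchmarks).
budget_ms = {
    "vad": 50,
    "stt": 150,
    "llm": 350,
    "tts": 100,
    "network": 100,
}

total_ms = sum(budget_ms.values())
headroom_ms = TARGET_ROUND_TRIP_MS - total_ms
slowest_stage = max(budget_ms, key=budget_ms.get)

print(f"total={total_ms} ms, headroom={headroom_ms} ms, slowest={slowest_stage}")
```

With these placeholder numbers the LLM dominates the budget, so shaving 100 ms off STT matters mainly because it buys headroom for the slower stages.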

3. Audio Input Requirements

Models must handle different audio qualities, various microphone types, and diverse environmental conditions. The critical capability is filtering and isolating target voices from background noise.


Best Speech-to-Text Models for Voice Agents

Voice agents need real-time streaming STT, ideally with sub-500 ms latency. The table below compares leading STT models, with streaming-capable models ranked by their suitability for production voice agent deployments.

Provider and Model | AA-WER v2.0 | Languages | Cost/Hour | Latency
ElevenLabs Scribe v2 Realtime | 2.3% | 90 | $0.28 | ~150 ms
AssemblyAI Universal-3 Pro Streaming | 3.2% | 99+ | $0.15 | <600 ms
Deepgram Nova-3 / Flux | ~Nova-3 level | 36 (Nova-3) / EN (Flux) | $0.46 | <300 ms (Flux ~260 ms)
OpenAI gpt-4o-transcribe | 4.1% | 100+ | $0.36 | ~320 ms
Mistral Voxtral Small | 2.9% | 8 | $0.24 | ~500 ms
Gladia AI Solaria | N/A | 100 | $0.61 | ~270 ms
Speechmatics Ursa 2 Enhanced | N/A | 50 | $1.35 | <1 s
Google Chirp 3 (Public Preview) | Not scored | 100+ | Varies | Streaming

For voice agents: ElevenLabs Scribe v2 Realtime is the leading choice, balancing top accuracy with the lowest streaming latency. AssemblyAI Universal-3 Pro Streaming offers the best price-performance ratio for production voice agents. Deepgram Flux is purpose-built for voice agents with model-integrated end-of-turn detection.

Note: AA-WER v2.0 (Artificial Analysis Word Error Rate, version 2.0) is measured across diverse real-world datasets including VoxPopuli, Earnings-22, AMI-SDM, and AA-AgentTalk (a dataset specifically focused on voice-agent-directed speech). Provider-reported WER often uses cleaner test data and may show lower scores than independent benchmarks.
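One way to turn the comparison table into a decision is to filter on hard limits first and only then sort by price. The sketch below uses the WER, cost, and latency figures from the table, treating "<" and "~" values as plain numbers. The `SttOption` type and `shortlist` function are illustrative names, and Deepgram and Google are left out because the table gives no numeric score for them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SttOption:
    name: str
    aa_wer_pct: Optional[float]  # None where the table lists no score
    cost_per_hour: float         # USD
    latency_ms: int              # approximate or upper-bound figure


OPTIONS = [
    SttOption("ElevenLabs Scribe v2 Realtime", 2.3, 0.28, 150),
    SttOption("AssemblyAI Universal-3 Pro Streaming", 3.2, 0.15, 600),
    SttOption("OpenAI gpt-4o-transcribe", 4.1, 0.36, 320),
    SttOption("Mistral Voxtral Small", 2.9, 0.24, 500),
    SttOption("Gladia AI Solaria", None, 0.61, 270),
    SttOption("Speechmatics Ursa 2 Enhanced", None, 1.35, 1000),
]


def shortlist(options: List[SttOption], max_latency_ms: int,
              max_wer_pct: float) -> List[SttOption]:
    """Keep scored options within both limits, cheapest first."""
    keep = [
        o for o in options
        if o.aa_wer_pct is not None
        and o.aa_wer_pct <= max_wer_pct
        and o.latency_ms <= max_latency_ms
    ]
    return sorted(keep, key=lambda o: o.cost_per_hour)


picks = shortlist(OPTIONS, max_latency_ms=500, max_wer_pct=4.0)
for option in picks:
    print(option.name, option.cost_per_hour)
```

The thresholds here are arbitrary examples; language coverage and concurrency limits would normally be additional filters.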

#1 ElevenLabs Scribe v2 Realtime

Released January 6, 2026 (v2 Batch followed January 9). Top accuracy on AA-WER v2.0 (2.3%) plus ~150 ms streaming latency, making it the first model to lead on both dimensions simultaneously. Covers 90 languages including Japanese, Hindi, Polish, Swedish, Mandarin, Vietnamese, and French. Uses predictive transcription to anticipate the most probable next words and punctuation. On the FLEURS multilingual benchmark, Scribe v2 Realtime reports 93.5% accuracy vs. Gemini 2.5 Flash (90%), GPT-4o Mini (85%), and Deepgram Nova-3 (80%).

Audio support: PCM 8–48 kHz and μ-law (telephony). WebSocket streaming. Fully integrated into ElevenLabs Agents (the Conversational AI 2.0 platform).

Pricing:

  • ~$0.28 per hour on Creator/Pro plans
  • Lower on annual Business and Enterprise
  • 30+ concurrency for enterprise

#2 AssemblyAI Universal-3 Pro Streaming

Released March 3, 2026. Replaces Universal-2 for voice agent workloads. Ships with prompting, disfluency control, code-switching, real-time speaker diarization, and 99+ languages. Pre-recorded Universal-3 Pro reports 93.3% Word Accuracy Rate; the streaming variant comes within 0.1–0.3% of the batch version on most evaluation sets. Slam-1 is deprecated.

Pricing:

  • $0.15 per hour ($2.50 per 1000 minutes) for both pre-recorded and streaming
  • Speaker identification $0.02/hr add-on
  • Unlimited concurrency, no rate limits

#3 Deepgram Nova-3 / Flux

Nova-3 (released February 2025): sub-300 ms streaming and 36 languages, with real-time switching between 10 of them. Nova-3 Medical reaches 3.45% median WER on medical terminology.

Flux (GA since October 2025): the first Conversational Speech Recognition (CSR) model with model-integrated end-of-turn detection (~260 ms). Eliminates the need for a separate Voice Activity Detection system. Nova-3-level accuracy. English-first. 100+ concurrent streams per GPU.
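For contrast, a pipeline without model-integrated end-of-turn detection typically needs its own endpointing logic. The sketch below is a deliberately naive energy-threshold version, shown only to illustrate what such a separate component does; it is not Deepgram's method, and the function name, threshold, and frame counts are all illustrative assumptions.

```python
from typing import Optional, Sequence


def end_of_turn_index(frame_energies: Sequence[float],
                      energy_threshold: float = 0.01,
                      silence_frames_needed: int = 25) -> Optional[int]:
    """Return the frame index at which the turn is declared over, or None.

    The turn ends once speech has been heard and is then followed by
    `silence_frames_needed` consecutive frames below the threshold.
    """
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return i
    return None


# 5 silent frames, 10 speech frames, then 30 silent frames.
# Assuming 20 ms frames, 25 silent frames is a 500 ms timeout.
frames = [0.0] * 5 + [0.2] * 10 + [0.0] * 30
turn_end = end_of_turn_index(frames)
```

A fixed silence timeout like this adds its full duration to every turn and cuts off speakers who pause mid-sentence, which is the tradeoff that model-integrated turn detection aims to avoid.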

Pricing:

  • Nova-3 Monolingual / Flux: $0.0077/min ($0.46/hr) PAYG, $0.0065/min on Growth
  • Nova-3 Multilingual: $0.0092/min ($0.55/hr) PAYG
  • Nova-3 Medical: custom enterprise
  • Voice Agent API (bundled STT + LLM + TTS): $0.050–$0.163/min depending on tier (Standard / Custom / Advanced) and BYO options
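Providers quote STT prices per minute, per hour, or per 1,000 minutes, so normalizing to one unit helps when comparing. A small sketch, using only rates quoted in this article (the helper names are made up for illustration):

```python
def per_minute_to_per_hour(rate_per_minute: float) -> float:
    """Convert a USD-per-minute rate to USD per hour."""
    return rate_per_minute * 60


def per_1000_minutes_to_per_hour(rate_per_1000_minutes: float) -> float:
    """Convert a USD-per-1,000-minutes rate to USD per hour."""
    return rate_per_1000_minutes / 1000 * 60


nova3_mono = per_minute_to_per_hour(0.0077)         # Nova-3 Monolingual / Flux, PAYG
nova3_multi = per_minute_to_per_hour(0.0092)        # Nova-3 Multilingual, PAYG
nova3_growth = per_minute_to_per_hour(0.0065)       # Nova-3 Monolingual / Flux, Growth
universal3 = per_1000_minutes_to_per_hour(2.50)     # AssemblyAI Universal-3 Pro

for label, rate in [("Nova-3 mono", nova3_mono), ("Nova-3 multi", nova3_multi),
                    ("Nova-3 Growth", nova3_growth), ("Universal-3 Pro", universal3)]:
    print(f"{label}: ${rate:.2f}/hr")
```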

Other Notable STT Providers

  • OpenAI gpt-4o-transcribe and gpt-realtime — gpt-4o-transcribe: 4.1% AA-WER v2.0, 100+ languages, $0.36/hr. gpt-4o-mini-transcribe-2025-12-15 is the updated mini tier with lower WER. OpenAI’s gpt-realtime speech-to-speech model (GA 2026) covers the full conversational loop with MCP, SIP, and image inputs.

  • Mistral Voxtral — Voxtral Small reaches 2.9% AA-WER v2.0 at $0.24/hr (8 languages). Voxtral Mini Transcribe is even cheaper at $1.00 per 1000 min, with 3.7% WER. A strong option if your language coverage needs are narrow.

  • Google Chirp 3 — Public preview as of 2026. Supports StreamingRecognize, Recognize, and BatchRecognize. 100+ languages, speaker diarization, automatic language detection, built-in denoiser. Successor to Chirp 2, which was batch-only.

  • Gladia AI Solaria — 100 languages including 42 underserved, ~270 ms latency. Lacks independent benchmark scores but has strong customer validation in enterprise deployments. $0.61/hr Pro.

  • Speechmatics Ursa 2 — 50 languages, strong Spanish/Polish performance. Real-Time Enhanced: $1.35/hr. Free tier: 8 hrs/month.

  • Azure MAI-Transcribe-1 — Microsoft’s in-house STT flagship. 3.0% AA-WER v2.0, 140+ languages. Batch-focused; real-time variants exist in the Azure Speech SDK.


Text-to-Speech (TTS) Selection Criteria

TTS selection matters as much as STT. The TTS engine determines how natural and human-like the voice agent sounds to end users.

Voice Quality and Naturalness

Voice naturalness is the primary consideration for TTS. Models must avoid robotic qualities, maintain consistency across longer passages, handle partial text fragments, and accurately pronounce specific formats like phone numbers and email addresses.

There’s no universal quality metric for TTS, but platforms like Artificial Analysis use an ELO Score. Top TTS models score 1164–1208 ELO on the Artificial Analysis TTS Leaderboard.
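To get a feel for what a gap in ELO means, the standard Elo formula converts a rating difference into an expected head-to-head preference rate. The sketch below assumes the leaderboard follows the conventional logistic model with a 400-point scale; the article does not confirm the exact method Artificial Analysis uses, so treat the result as an approximation.

```python
def expected_preference(rating_a: float, rating_b: float) -> float:
    """Expected share of comparisons in which A is preferred over B
    under the standard Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# The two ends of the 1164-1208 range quoted above.
top_vs_fourth = expected_preference(1208, 1164)
print(f"{top_vs_fourth:.1%}")
```

Under that assumption, a 44-point gap corresponds to the higher-rated voice being preferred in a little over half of blind comparisons, so differences among the top models are real but modest.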

Voice Customization Options

Providers vary in available voice options and customization capabilities. Basic adjustments include speaking rate, pitch, and emphasis. Advanced systems offer control over voice characteristics like emotional tone (ElevenLabs v3 supports inline tags like [laughs] and [whispers]), real-time emotion control (Cartesia Sonic-3), and voice cloning. The Speech Synthesis Markup Language (SSML) enables fine-tuned voice generation across most platforms.
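As a small illustration of SSML, the sketch below builds a snippet that slows the speaking rate and marks a phone number so the engine reads it digit by digit. `speak`, `prosody`, and `say-as` are standard SSML elements, but supported attributes and `interpret-as` values differ between providers, so check the target platform's documentation. The `build_ssml` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def build_ssml(greeting: str, phone_number: str, rate: str = "medium") -> str:
    """Return an SSML string that reads a greeting followed by a phone number."""
    speak = ET.Element("speak")
    prosody = ET.SubElement(speak, "prosody", rate=rate)
    prosody.text = greeting + " "
    # "telephone" is a commonly supported interpret-as value (provider-dependent).
    say_as = ET.SubElement(prosody, "say-as", {"interpret-as": "telephone"})
    say_as.text = phone_number
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("You can reach us at", "555-0123", rate="slow")
print(ssml)
```

Building the markup with an XML library rather than string concatenation keeps user-supplied text properly escaped.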

Language Support

Language selection, regional accent configuration, and dialect-specific adjustments determine global deployment viability. Some models are limited to one language and may struggle with mid-call language switches.


Best Text-to-Speech Models and Providers

The providers below rank highest on the Artificial Analysis Leaderboard. Voice naturalness significantly affects user perception: voices that fall into the uncanny valley typically perform worse in production deployments.

Provider and Model | ELO Score | Languages | Cost (per 1M characters) | Latency
Inworld TTS-1.5 Max | ~1208 | 15 | $0.025/min (~$25/1M) | P90 <250 ms
Google Gemini 3.1 Flash TTS | ~1204 | 40+ | $15 | ~250 ms
ElevenLabs v3 | ~1176 | 70+ | $100 | ~250 ms
OpenAI Speech 2.8 HD | ~1164 | 30+ | $30 | ~300 ms
ElevenLabs Flash v2.5 | N/A | 32 | $50 | 75 ms
Cartesia Sonic-3 | ~1054 | 40+ | 15 credits/sec | 40–90 ms
Amazon Polly Long-form | N/A | 34 | $100 | 100 ms
Azure AI Speech Dragon HD | N/A | 140+ | $22 | 300 ms
Google Cloud TTS Standard | N/A | 50+ | $4 | 500 ms
PlayHT Dialog | N/A | 32 | $99/mo unlimited | 300 ms

#1 Inworld TTS-1.5 Max

Launched January 21, 2026, and ranked #1 on the Artificial Analysis TTS Leaderboard (ELO 1208). Time-to-first-audio P90 under 250 ms with median under 200 ms. Supports 15 languages with voice cloning. Priced at $0.025/min (~$25 per 1M characters) for the Max tier and $0.01/min for Mini. Available via the Inworld TTS API, on fal.ai, and integrates with Layercode, LiveKit, and Vapi.
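Inworld prices per minute while most of the table is per 1M characters, so comparing them needs a speech-rate assumption. The equivalence quoted above ($0.025/min ≈ $25 per 1M characters) implies roughly 1,000 characters per minute of audio; the sketch below makes that assumption explicit so it can be adjusted. The helper name is illustrative.

```python
# Assumption implied by "$0.025/min (~$25 per 1M characters)":
# about 1,000 characters of text per minute of generated audio.
CHARS_PER_MINUTE = 1_000


def per_minute_to_per_million_chars(rate_per_minute: float,
                                    chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Convert a USD-per-audio-minute rate to USD per 1M characters."""
    return rate_per_minute * 1_000_000 / chars_per_minute


max_tier = per_minute_to_per_million_chars(0.025)   # TTS-1.5 Max
mini_tier = per_minute_to_per_million_chars(0.01)   # TTS-1.5 Mini

print(f"Max: ${max_tier:.2f} per 1M chars, Mini: ${mini_tier:.2f} per 1M chars")
```

Real character density depends on language and speaking rate, so the converted figure is only an estimate.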

#2 ElevenLabs Flash v2.5 + v3

ElevenLabs offers multiple TTS models. Flash v2.5, recommended specifically for voice agents, delivers ~75 ms latency and supports 32 languages at $50 per 1M characters. Eleven v3 (ELO 1176 on the AA leaderboard) is the expressive flagship: it supports inline audio tags ([whispers], [laughs], [excited]) across 70+ languages at $100 per 1M characters.

Beyond TTS, ElevenLabs provides an integrated platform for building customizable interactive voice agents, including Scribe v2 STT, the Conversational AI 2.0 framework, dubbing API for translation, and support for additional audio formats (Opus, A-law for telephony). The Voice Library allows community and company voice uploads categorized by use case.

Pricing:

  • Free tier available
  • Starter: $6/month
  • Creator: $22/month ($11 first month)
  • Pro: $99/month
  • Scale: $299/month
  • Business: $990/month
  • Custom Enterprise tiers

#3 Cartesia Sonic-3

The latest Cartesia TTS model, with 40 ms time-to-first-audio (Sonic-3 Turbo) and 90 ms model latency, among the lowest in production TTS. Supports instant voice cloning with minimal audio input (~10 seconds) and allows customization of voice attributes like pitch, speed, and emotion. Supports 40+ languages. Sonic-3 also ships on AWS SageMaker JumpStart for self-hosted deployments. Cartesia additionally offers Line, a full voice agent platform built on its own stack (Sonic-3 + Ink-Whisper STT + Line orchestration) with SOC 2 Type II, HIPAA, and PCI Level 1 compliance.

Pricing:

  • Free: $0/month
  • Pro: $4/month
  • Startup: $39/month
  • Scale: $239/month (annual)
  • Enterprise: custom

#4 Amazon Polly

Supports 34 languages and dialects, with multiple voices per language. Supports SSML for fine-tuning speech output and allows custom voice creation for branding. Response latency ranges from 100 ms to 1 second. Includes four models: Generative, Long-Form, Neural, and Standard. Integrates with other AWS services and can be accessed through the AWS Console.

Pricing:

  • Long-form: $100 per 1M characters
  • Generative: $30 per 1M characters
  • Neural: $16 per 1M characters
  • Standard: $4 per 1M characters

#5 Microsoft Azure AI Speech

Supports over 140 languages and locales. Offers multiple versions: Standard, Custom, and HD Neural. Allows custom neural voice creation and supports SSML for pronunciation and intonation customization.

The Dragon HD Neural TTS variant (DragonHDLatestNeural) delivers highly expressive, context-aware speech with emotion detection capabilities. Neural HD pricing was reduced from $30 to $22 per 1M characters in March 2026.

Pricing:

  • Standard Neural voices: $15 per 1M characters
  • Neural HD voices: $22 per 1M characters
  • Custom Neural Professional voices: $24/1M characters
  • Additional costs for model training and endpoint hosting

#6 Google Text-to-Speech

Supports 380+ voices across 50+ languages. Can create unique voices by recording samples and supports SSML for controlling pitch, speed, volume, and pronunciation. Chirp 3 HD is the latest tier with Instant Custom Voice (voice cloning) and 28 multilingual voices. Latency around 500ms on Standard tier.

Pricing:

  • Standard voices: $4 per 1M characters
  • WaveNet voices: $16 per 1M characters
  • Neural2 voices: $16 per 1M characters
  • Studio / Chirp 3 HD voices: premium pricing

#7 PlayHT Dialog

Specifically designed for conversational applications. Works with 9 main languages and 23 additional languages with more than 50 voices available. Voice cloning functionality with 300ms latency. Partners with Groq for faster inference and LiveKit for real-time voice AI integration. Offers Play AI Studio for multi-speaker podcast creation and voice agent building.

Pricing:

  • Free tier
  • Creator: $31.20/month (billed yearly)
  • Unlimited: $29/month
  • Professional: $99/month (unlimited voice generation)
  • Enterprise: Custom pricing

The AI voice agent calculator projects runtime performance, cost, and infrastructure load based on selected STT and TTS models.


Choosing the Right STT and TTS Models for Your Project

Large providers like Google and Microsoft prioritize stability and infrastructure reliability. Smaller providers like ElevenLabs, Deepgram, and Cartesia often deliver lower latency and more natural-sounding voices.

Different use cases emphasize different capabilities. Entertainment and gaming applications benefit from emotional range and voice realism in TTS. High-volume commercial deployments require proven stability and uptime guarantees. Appointment booking systems need extremely low WER in STT, since users dictate contact information that must be transcribed accurately.

Production performance differs significantly from benchmark results. Providers showcase ideal conditions, but real deployments face quiet speech, speech impairments, heavy accents, background noise, poor connections, and multi-speaker scenarios. These edge cases reveal model limitations that don’t appear in initial testing.

Scaling introduces additional constraints. Traffic spikes, multi-region deployment, language expansion, and integration complexity all affect provider selection. Infrastructure capabilities matter as much as model performance once the system reaches production scale.

STT and TTS choices determine how voice agents sound and understand users. The complete picture includes LLM selection, observability, error handling, compliance, and scaling infrastructure. For detailed LLM selection guidance covering latency, accuracy, and cost tradeoffs, see our LLM comparison guide.

About Softcery: We’re the AI engineering team that founders call when other teams say “it’s impossible” or “it’ll take 6+ months.” We specialize in building advanced AI systems that actually work in production, handle real customer complexity, and scale with your business. We work with B2B SaaS founders in marketing automation, legal tech, and e-commerce – solving the gap between prototypes that work in demos and systems that work at scale. Get in touch.

Frequently Asked Questions

How do I choose between accuracy and latency for my voice agent?

It depends on your use case. Customer service agents prioritize accuracy to avoid misunderstandings. Gaming or entertainment applications prioritize low latency for natural conversation flow.

For most production voice agents, aim for sub-300ms STT latency with the best accuracy you can afford in that range. ElevenLabs Scribe v2 Realtime (2.3% AA-WER v2.0, ~150 ms) and AssemblyAI Universal-3 Pro Streaming (3.2% AA-WER v2.0) both balance these tradeoffs well.

What's the difference between Word Error Rate (WER) and ELO Score?

They measure different parts of the voice agent stack.

WER applies to Speech-to-Text (STT) and shows transcription accuracy. Lower is better. Independent testing on AA-WER v2.0 shows top streaming models scoring 2–4% on diverse real-world audio.

ELO Score applies to Text-to-Speech (TTS) and reflects how natural the generated voice sounds. Higher is better. Top models score 1164–1208 on the Artificial Analysis Leaderboard.

Why do provider-reported WER scores differ from Artificial Analysis benchmarks?

Providers typically test on clean, curated datasets that show their models in the best light. Artificial Analysis uses diverse real-world audio with accents, background noise, and challenging acoustic conditions — including AA-AgentTalk, a dataset specifically built around voice-agent-directed speech.

For example, OpenAI reports WER below 5% internally, and gpt-4o-transcribe scores 4.1% on AA-WER v2.0. Both numbers are accurate, but they come from different test sets and are not directly comparable. Independent benchmarks give a more realistic picture of production performance.

Can I use batch-only models like ElevenLabs Scribe v2 Batch for voice agents?

No. Batch transcription models process pre-recorded audio files and don’t support real-time streaming required for live conversations.

For voice agents, use streaming models like ElevenLabs Scribe v2 Realtime, AssemblyAI Universal-3 Pro Streaming, or Deepgram Nova-3 / Flux. Batch models still have a role for post-call analytics, compliance transcription, and training data labeling.

How do latency and cost scale in production?

Latency determines conversation naturalness. Target total round-trip around 800 ms — anything over 1 second feels robotic. Total latency includes STT + LLM + TTS + VAD + network, so each component matters.

Cost scales linearly with usage. At 10,000 hours per month, the difference between AssemblyAI Universal-3 Pro ($0.15/hr) and Speechmatics Ursa 2 Enhanced ($1.35/hr) is $1,500 vs $13,500 monthly. The AI voice agent calculator helps project costs at your expected volume.
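The arithmetic behind that comparison is a single multiplication, sketched here so other volumes and rates can be plugged in. The rates are the per-hour figures quoted in this article; the function name is illustrative.

```python
def monthly_stt_cost(hours_per_month: float, rate_per_hour: float) -> float:
    """Linear usage cost in USD: hours transcribed times the hourly rate."""
    return hours_per_month * rate_per_hour


hours = 10_000
assemblyai = monthly_stt_cost(hours, 0.15)    # Universal-3 Pro
speechmatics = monthly_stt_cost(hours, 1.35)  # Ursa 2 Enhanced
difference = speechmatics - assemblyai

print(f"AssemblyAI: ${assemblyai:,.0f}, Speechmatics: ${speechmatics:,.0f}, "
      f"difference: ${difference:,.0f} per month")
```

This covers STT list prices only; LLM, TTS, telephony, and volume discounts would change the total.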

We're building a voice agent but stuck on production readiness. Can you help?

We work with B2B SaaS founders who need voice agents that handle real customer complexity. If your prototype works but production feels risky, or your team hit walls with advanced features, we might be able to help.

If it resonates with your situation, reach out and we can discuss whether we’re a good fit.
