How to Choose STT and TTS for Voice Agents: Latency, Accuracy, Cost
Last updated on April 24, 2026
Speech-to-Text (STT) and Text-to-Speech (TTS) tech, combined with Large Language Models (LLMs), power most AI voice agents today.
Direct Speech-to-Speech solutions exist but remain limited in production deployment. For detailed comparison of real-time versus turn-based architectures and their cost implications, see our architecture comparison guide. The STT → LLM → TTS pipeline offers independent model selection, adjustable complexity per use case, and straightforward integration with existing systems.
The selection depends on accuracy requirements, latency constraints, language support, and cost. Models vary significantly across these dimensions, and the most popular providers show distinct tradeoffs.
Understanding STT and TTS Technologies
The voice agent interaction cycle:
- STT captures voice input and converts it to text
- An LLM generates an appropriate response
- TTS converts the response back into natural-sounding speech
Modern STT models rely on deep learning, mostly transformer architectures. They process audio through a few key steps: cleaning it up, pulling out useful features, and modeling the sequence to turn sound into accurate text.
TTS systems reverse this flow. They convert text into a spectrogram (a visual representation of sound frequencies), then generate an audio waveform that produces natural-sounding speech.
Current TTS models achieve near-human naturalness in controlled conditions. Development focuses on cost reduction, cross-device optimization, and stability improvements.
STT faces harder technical constraints. Noisy environments, multi-speaker scenarios, and speaker isolation remain challenging. These limitations drive active development priorities across providers.
Key Criteria for Selecting STT Models
Some Speech-to-Text models shine in quiet call-center setups, others handle noisy real-world audio better. A few core factors determine performance across different environments.
1. Accuracy and Recognition Capabilities
Word Error Rate (WER) measures transcription accuracy. The best streaming STT models score 2–4% on AA-WER v2.0 (Artificial Analysis’s independent benchmark), meaning very high accuracy on production-representative audio. Accuracy varies significantly across different accents, background noise levels, specialized vocabulary domains, and multi-speaker scenarios.
2. Processing Speed and Latency
For voice agents, target total round-trip latency around 800 ms (VAD + STT + LLM + TTS + network). STT under 300 ms is ideal and under 150 ms is achievable with the latest models, leaving headroom elsewhere in the pipeline.
3. Audio Input Requirements
Models must handle different audio qualities, various microphone types, and diverse environmental conditions. The critical capability is filtering and isolating target voices from background noise.
Best Speech-to-Text Models for Voice Agents
Voice agents require real-time streaming STT with sub-500ms latency. The table below compares leading STT models, with streaming-capable models ranked by their suitability for production voice agent deployments.
| Provider and Model | AA-WER v2.0 | Languages | Cost/Hour | Latency |
|---|---|---|---|---|
| ElevenLabs Scribe v2 Realtime | 2.3% | 90 | $0.28 | ~150 ms |
| AssemblyAI Universal-3 Pro Streaming | 3.2% | 99+ | $0.15 | <600 ms |
| Deepgram Nova-3 / Flux | ~Nova-3 level | 36 (Nova-3) / EN (Flux) | $0.46 | <300 ms (Flux ~260 ms) |
| OpenAI gpt-4o-transcribe | 4.1% | 100+ | $0.36 | ~320 ms |
| Mistral Voxtral Small | 2.9% | 8 | $0.24 | ~500 ms |
| Gladia AI Solaria | N/A | 100 | $0.61 | ~270 ms |
| Speechmatics Ursa 2 Enhanced | N/A | 50 | $1.35 | <1 s |
| Google Chirp 3 (Public Preview) | Not scored | 100+ | Varies | Streaming |
For voice agents: ElevenLabs Scribe v2 Realtime is the leading choice, balancing top accuracy with the lowest streaming latency. AssemblyAI Universal-3 Pro Streaming offers the best price-performance ratio for production voice agents. Deepgram Flux is purpose-built for voice agents with model-integrated end-of-turn detection.
Note: AA-WER v2.0 (Artificial Analysis Word Error Rate, version 2.0) is measured across diverse real-world datasets including VoxPopuli, Earnings-22, AMI-SDM, and AA-AgentTalk (a dataset specifically focused on voice-agent-directed speech). Provider-reported WER often uses cleaner test data and may show lower scores than independent benchmarks.
#1 ElevenLabs Scribe v2 Realtime
Released January 6, 2026 (v2 Batch followed January 9). Top accuracy on AA-WER v2.0 (2.3%) plus ~150 ms streaming latency — the first model to lead on both dimensions simultaneously. Covers 90 languages including Japanese, Hindi, Polish, Swedish, Mandarin, Vietnamese, and French. Uses predictive transcription to anticipate the most probable next words and punctuation. On the FLEURS multilingual benchmark, Scribe v2 Realtime reports 93.5% accuracy vs. Gemini Flash 2.5 (90%), GPT-4o Mini (85%), and Deepgram Nova-3 (80%).
Audio support: PCM 8–48 kHz and μ-law (telephony). WebSocket streaming. Fully integrated into ElevenLabs Agents (the Conversational AI 2.0 platform).
Pricing:
- ~$0.28 per hour on Creator/Pro plans
- Lower on annual Business and Enterprise
- 30+ concurrency for enterprise
#2 AssemblyAI Universal-3 Pro Streaming
Released March 3, 2026. Replaces Universal-2 for voice agent workloads. Ships with prompting, disfluency control, code-switching, real-time speaker diarization, and 99+ languages. Pre-recorded Universal-3 Pro reports 93.3% Word Accuracy Rate; the streaming variant comes within 0.1–0.3% of the batch version on most evaluation sets. Slam-1 is deprecated.
Pricing:
- $0.15 per hour ($2.50 per 1000 minutes) for both pre-recorded and streaming
- Speaker identification $0.02/hr add-on
- Unlimited concurrency, no rate limits
#3 Deepgram Nova-3 / Flux
Nova-3 (released February 2025): sub-300 ms streaming, 36 languages with real-time switching between 10, Nova-3 Medical reaches 3.45% median WER on medical terminology.
Flux (GA since October 2025): the first Conversational Speech Recognition (CSR) model with model-integrated end-of-turn detection (~260 ms). Eliminates the need for a separate Voice Activity Detection system. Nova-3-level accuracy. English-first. 100+ concurrent streams per GPU.
Pricing:
- Nova-3 Monolingual / Flux: $0.0077/min ($0.46/hr) PAYG, $0.0065/min on Growth
- Nova-3 Multilingual: $0.0092/min ($0.55/hr) PAYG
- Nova-3 Medical: custom enterprise
- Voice Agent API (bundled STT + LLM + TTS): $0.050–$0.163/min depending on tier (Standard / Custom / Advanced) and BYO options
Other Notable STT Providers
-
OpenAI gpt-4o-transcribe and gpt-realtime — gpt-4o-transcribe: 4.1% AA-WER v2.0, 100+ languages, $0.36/hr. gpt-4o-mini-transcribe-2025-12-15 is the updated mini tier with lower WER. OpenAI’s
gpt-realtimespeech-to-speech model (GA 2026) covers the full conversational loop with MCP, SIP, and image inputs. -
Mistral Voxtral — Voxtral Small reaches 2.9% AA-WER v2.0 at $0.24/hr (8 languages), Voxtral Mini Transcribe is even cheaper at $1.00 per 1000 min with 3.7% WER. Strong option if your language coverage needs are narrow.
-
Google Chirp 3 — Public preview as of 2026. Supports StreamingRecognize, Recognize, and BatchRecognize. 100+ languages, speaker diarization, automatic language detection, built-in denoiser. Successor to Chirp 2 which was batch-only.
-
Gladia AI Solaria — 100 languages including 42 underserved, ~270 ms latency. Lacks independent benchmark scores but strong customer validation in enterprise deployments. $0.61/hr Pro.
-
Speechmatics Ursa 2 — 50 languages, strong Spanish/Polish performance. Real-Time Enhanced: $1.35/hr. Free tier: 8 hrs/month.
-
Azure MAI-Transcribe-1 — Microsoft’s in-house STT flagship. 3.0% AA-WER v2.0, 140+ languages. Batch-focused; real-time variants exist in the Azure Speech SDK.
Text-to-Speech (TTS) Selection Criteria
TTS selection matters as much as STT. The TTS engine determines how natural and human-like the voice agent sounds to end users.
Voice Quality and Naturalness
Voice naturalness is the primary consideration for TTS. Models must avoid robotic qualities, maintain consistency across longer passages, handle partial text fragments, and accurately pronounce specific formats like phone numbers and email addresses.
There’s no universal quality metric for TTS, but platforms like Artificial Analysis use an ELO Score. Top TTS models score 1164–1208 ELO on the Artificial Analysis TTS Leaderboard.
Voice Customization Options
Providers vary in available voice options and customization capabilities. Basic adjustments include speaking rate, pitch, and emphasis. Advanced systems offer control over voice characteristics like emotional tone (ElevenLabs v3 supports inline tags like [laughs] and [whispers]), real-time emotion control (Cartesia Sonic-3), and voice cloning. The Speech Synthesis Markup Language (SSML) enables fine-tuned voice generation across most platforms.
Language Support
Language selection, regional accent configuration, and dialect-specific adjustments determine global deployment viability. Some models are limited to one language and may struggle with mid-call language switches.
Best Text-to-Speech Models and Providers
The providers below rank highest on the Artificial Analysis Leaderboard. Voice naturalness affects user perception significantly – voices that fall into the Uncanny Valley typically perform worse in production deployments.
| Provider and Model | ELO Score | Languages | Cost (per 1M characters) | Latency |
|---|---|---|---|---|
| Inworld TTS-1.5 Max | ~1208 | 15 | $0.025/min (~$25/1M) | P90 <250 ms |
| Google Gemini 3.1 Flash TTS | ~1204 | 40+ | $15 | ~250 ms |
| ElevenLabs v3 | ~1176 | 70+ | $100 | ~250 ms |
| OpenAI Speech 2.8 HD | ~1164 | 30+ | $30 | ~300 ms |
| ElevenLabs Flash v2.5 | N/A | 32 | $50 | 75 ms |
| Cartesia Sonic-3 | ~1054 | 40+ | 15 credits/sec | 40–90 ms |
| Amazon Polly Long-form | N/A | 34 | $100 | 100 ms |
| Azure AI Speech Dragon HD | N/A | 140+ | $22 | 300 ms |
| Google Cloud TTS Standard | N/A | 50+ | $4 | 500 ms |
| PlayHT Dialog | N/A | 32 | $99/mo unlimited | 300 ms |
#1 Inworld TTS-1.5 Max
Launched January 21, 2026, and ranked #1 on the Artificial Analysis TTS Leaderboard (ELO 1208). Time-to-first-audio P90 under 250 ms with median under 200 ms. Supports 15 languages with voice cloning. Priced at $0.025/min (~$25 per 1M characters) for the Max tier and $0.01/min for Mini. Available via the Inworld TTS API, on fal.ai, and integrates with Layercode, LiveKit, and Vapi.
#2 ElevenLabs Flash v2.5 + v3
Offers multiple TTS models. Flash v2.5, recommended specifically for voice agents, delivers ultra-fast performance with ~75ms delay and supports 32 languages at $50 per 1M characters. Eleven v3 (ELO 1176 on the AA leaderboard) is the expressive flagship — supports inline audio tags ([whispers], [laughs], [excited]) across 70+ languages at $100 per 1M characters.
Beyond TTS, ElevenLabs provides an integrated platform for building customizable interactive voice agents, including Scribe v2 STT, the Conversational AI 2.0 framework, dubbing API for translation, and support for additional audio formats (Opus, A-law for telephony). The Voice Library allows community and company voice uploads categorized by use case.
Pricing:
- Free tier available
- Starter: $6/month
- Creator: $22/month ($11 first month)
- Pro: $99/month
- Scale: $299/month
- Business: $990/month
- Custom Enterprise tiers
#3 Cartesia Sonic-3
The latest Cartesia TTS model, with 40 ms time-to-first-audio (Sonic-3 Turbo) and 90 ms model latency — among the lowest in production TTS. Supports instant voice cloning with minimal audio input (~10 seconds) and allows customization of voice attributes like pitch, speed, and emotion. Supports 40+ languages. Sonic-3 also ships on AWS SageMaker JumpStart for self-hosted deployments. Cartesia additionally offers Line, a full voice agent platform on their owned stack (Sonic-3 + Ink-Whisper STT + Line orchestration) with SOC 2 Type II, HIPAA, and PCI Level 1 compliance.
Pricing:
- Free: $0/month
- Pro: $4/month
- Startup: $39/month
- Scale: $239/month (annual)
- Enterprise: custom
#4 Amazon Polly
Supports 34 languages and dialects with multiple voices across languages. Supports SSML for fine-tuning speech output and allows custom voice creation for branding. Response latency ranges from 100ms to 1 second. Includes four models: Generative, Long-Form, Neural, and Standard. Integrates with other AWS services and can be accessed through AWS Console.
Pricing:
- Long-form: $100 per 1M characters
- Generative: $30 per 1M characters
- Neural: $16 per 1M characters
- Standard: $4 per 1M characters
#5 Microsoft Azure AI Speech
Supports over 140 languages and locales. Offers multiple versions: Standard, Custom, and HD Neural. Allows custom neural voice creation and supports SSML for pronunciation and intonation customization.
The Dragon HD Neural TTS variant (DragonHDLatestNeural) delivers highly expressive, context-aware speech with emotion detection capabilities. Neural HD pricing was reduced from $30 to $22 per 1M characters in March 2026.
Pricing:
- Standard Neural voices: $15 per 1M characters
- Neural HD voices: $22 per 1M characters
- Custom Neural Professional voices: $24/1M characters
- Additional costs for model training and endpoint hosting
#6 Google Text-to-Speech
Supports 380+ voices across 50+ languages. Can create unique voices by recording samples and supports SSML for controlling pitch, speed, volume, and pronunciation. Chirp 3 HD is the latest tier with Instant Custom Voice (voice cloning) and 28 multilingual voices. Latency around 500ms on Standard tier.
Pricing:
- Standard voices: $4 per 1M characters
- WaveNet voices: $16 per 1M characters
- Neural2 voices: $16 per 1M characters
- Studio / Chirp 3 HD voices: premium pricing
#7 PlayHT Dialog
Specifically designed for conversational applications. Works with 9 main languages and 23 additional languages with more than 50 voices available. Voice cloning functionality with 300ms latency. Partners with Groq for faster inference and LiveKit for real-time voice AI integration. Offers Play AI Studio for multi-speaker podcast creation and voice agent building.
Pricing:
- Free tier
- Creator: $31.20/month (billed yearly)
- Unlimited: $29/month
- Professional: $99/month (unlimited voice generation)
- Enterprise: Custom pricing
The AI voice agent calculator projects runtime performance, cost, and infrastructure load based on selected STT and TTS models.
Choosing the Right STT and TTS Models for Your Project
Large providers like Google and Microsoft prioritize stability and infrastructure reliability. Smaller providers like ElevenLabs, Deepgram, and Cartesia often deliver lower latency and more natural-sounding voices.
Different use cases emphasize different capabilities. Entertainment and gaming applications benefit from emotional range and voice realism in TTS. High-volume commercial deployments require proven stability and uptime guarantees. Appointment booking systems need extremely low WER in STT, since users dictate contact information that must be transcribed accurately.
Production performance differs significantly from benchmark results. Providers showcase ideal conditions, but real deployments face quiet speech, speech impairments, heavy accents, background noise, poor connections, and multi-speaker scenarios. These edge cases reveal model limitations that don’t appear in initial testing.
Scaling introduces additional constraints. Traffic spikes, multi-region deployment, language expansion, and integration complexity all affect provider selection. Infrastructure capabilities matter as much as model performance once the system reaches production scale.
STT and TTS choices determine how voice agents sound and understand users. The complete picture includes LLM selection, observability, error handling, compliance, and scaling infrastructure. For detailed LLM selection guidance covering latency, accuracy, and cost tradeoffs, see our LLM comparison guide.
About Softcery: We’re the AI engineering team that founders call when other teams say “it’s impossible” or “it’ll take 6+ months.” We specialize in building advanced AI systems that actually work in production, handle real customer complexity, and scale with your business. We work with B2B SaaS founders in marketing automation, legal tech, and e-commerce – solving the gap between prototypes that work in demos and systems that work at scale. Get in touch.
Frequently Asked Questions
It depends on your use case. Customer service agents prioritize accuracy to avoid misunderstandings. Gaming or entertainment applications prioritize low latency for natural conversation flow.
For most production voice agents, aim for sub-300ms STT latency with the best accuracy you can afford in that range. ElevenLabs Scribe v2 Realtime (2.3% AA-WER v2.0, ~150 ms) and AssemblyAI Universal-3 Pro Streaming (3.2% AA-WER v2.0) both balance these tradeoffs well.
They measure different parts of the voice agent stack.
WER applies to Speech-to-Text (STT) and shows transcription accuracy. Lower is better. Independent testing on AA-WER v2.0 shows top streaming models scoring 2–4% on diverse real-world audio.
ELO Score applies to Text-to-Speech (TTS) and reflects how natural the generated voice sounds. Higher is better. Top models score 1164–1208 on the Artificial Analysis Leaderboard.
Providers typically test on clean, curated datasets that show their models in the best light. Artificial Analysis uses diverse real-world audio with accents, background noise, and challenging acoustic conditions — including AA-AgentTalk, a dataset specifically built around voice-agent-directed speech.
For example, OpenAI reports WER below 5% internally but shows 4.1% on AA-WER v2.0. Both numbers are accurate — they just measure different things. Independent benchmarks give a more realistic picture of production performance.
No. Batch transcription models process pre-recorded audio files and don’t support real-time streaming required for live conversations.
For voice agents, use streaming models like ElevenLabs Scribe v2 Realtime, AssemblyAI Universal-3 Pro Streaming, or Deepgram Nova-3 / Flux. Batch models still have a role for post-call analytics, compliance transcription, and training data labeling.
Latency determines conversation naturalness. Target total round-trip around 800 ms — anything over 1 second feels robotic. Total latency includes STT + LLM + TTS + VAD + network, so each component matters.
Cost scales linearly with usage. At 10,000 hours per month, the difference between AssemblyAI Universal-3 Pro ($0.15/hr) and Speechmatics Ursa 2 Enhanced ($1.35/hr) is $1,500 vs $13,500 monthly. The AI voice agent calculator helps project costs at your expected volume.
We work with B2B SaaS founders who need voice agents that handle real customer complexity. If your prototype works but production feels risky, or your team hit walls with advanced features, we might be able to help.
If it resonates with your situation, reach out and we can discuss whether we’re a good fit.