How to Choose STT and TTS for Voice Agents: Latency, Accuracy, Cost
Last updated on October 26, 2025
Speech-to-Text (STT) and Text-to-Speech (TTS) tech, combined with Large Language Models (LLMs), power most AI voice agents today.
Direct Speech-to-Speech solutions exist but remain limited in production deployment. The STT → LLM → TTS pipeline offers independent model selection, adjustable complexity per use case, and straightforward integration with existing systems.
The selection depends on accuracy requirements, latency constraints, language support, and cost. Models vary significantly across these dimensions, and the most popular providers show distinct tradeoffs.
Understanding STT and TTS Technologies
The voice agent interaction cycle (a minimal code sketch follows this list):
- STT captures voice input and converts it to text
- An LLM generates an appropriate response
- TTS converts the response back into natural-sounding speech
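In code, the loop is straightforward. Here is a minimal sketch with hypothetical transcribe/generate_reply/synthesize wrappers standing in for whichever providers you pick (nothing here is a specific vendor's API):

```python
# Hypothetical wrappers - swap in real provider SDK calls for each stage.
def transcribe(audio_chunk: bytes) -> str: ...       # STT provider call
def generate_reply(user_text: str) -> str: ...       # LLM provider call
def synthesize(reply_text: str) -> bytes: ...        # TTS provider call

def handle_turn(audio_chunk: bytes) -> bytes:
    user_text = transcribe(audio_chunk)       # speech -> text
    reply_text = generate_reply(user_text)    # text -> response text
    return synthesize(reply_text)             # response text -> speech audio
```

Production systems stream and overlap these stages rather than running them strictly in sequence, but the data flow is the same.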
Modern STT models rely on deep learning, mostly transformer architectures. They process audio through a few key steps: cleaning it up, pulling out useful features, and modeling the sequence to turn sound into accurate text.
TTS systems reverse this flow. They convert text into a spectrogram (a visual representation of sound frequencies), then generate an audio waveform that produces natural-sounding speech.
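For intuition, here is a minimal sketch of that intermediate representation using the open-source librosa library (an illustration, not any provider's internal stack; the file name is a placeholder):

```python
import librosa
import numpy as np

# Load audio as a mono waveform at 16 kHz (a common rate for speech models)
y, sr = librosa.load("sample.wav", sr=16000)

# 80-band mel spectrogram: the time-frequency features speech models work with
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(log_mel.shape)  # (80, num_frames): frequency bands over time
```

An STT acoustic model consumes features like these; a TTS model predicts them from text before a vocoder turns them back into a waveform.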
Current TTS models achieve near-human naturalness in controlled conditions. Development focuses on cost reduction, cross-device optimization, and stability improvements.
STT faces harder technical constraints. Noisy environments, multi-speaker scenarios, and speaker isolation remain challenging. These limitations drive active development priorities across providers.
Key Criteria for Selecting STT Models
Some Speech-to-Text models shine in quiet call-center setups; others handle noisy real-world audio better. A few core factors determine performance across different environments.
1. Accuracy and Recognition Capabilities
Word Error Rate (WER) measures transcription accuracy. Good models achieve 5–10% WER, meaning 90–95% accuracy. Accuracy varies significantly across different accents, background noise levels, specialized vocabulary domains, and multi-speaker scenarios.
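WER is simply the word-level edit distance between a reference transcript and the model's output, divided by the reference length. A minimal implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# One dropped word out of six -> ~16.7% WER
print(wer("call me at five five five", "call me at five five"))
```

Note that WER can exceed 100% when a model inserts more words than the reference contains.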
2. Processing Speed and Latency
Real-Time Factor (RTF) evaluates processing speed: an RTF of 0.1 means processing takes 10% of the audio's duration, and lower is better. First response latency under 100 ms is optimal, 200–500 ms remains acceptable, and anything over 1 second feels too slow for natural conversation.
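Both metrics are easy to measure against any provider, as in this sketch (the transcribe_call callable is a placeholder for a real API request):

```python
import time

def measure_rtf(transcribe_call, audio_seconds: float) -> float:
    """Real-Time Factor: processing time / audio duration. Lower is better."""
    start = time.perf_counter()
    transcribe_call()  # placeholder: run the actual STT request here
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds

# e.g. a 30-second clip processed in 3 seconds -> RTF 0.1
```

For streaming STT, first-response latency is measured the same way: the clock runs from sending the first audio frame to receiving the first transcript event.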
3. Audio Input Requirements
Models must handle different audio qualities, various microphone types, and diverse environmental conditions. The critical capability is filtering and isolating target voices from background noise.
Best Speech-to-Text Models for Voice Agents in 2025
Voice agents require real-time streaming STT with sub-500ms latency. The table below compares leading STT models, with streaming-capable models ranked by their suitability for production voice agent deployments.
| Provider and Model | AA-WER | Languages | Cost/Hour | Latency |
|---|---|---|---|---|
| **Recommended** | | | | |
| Deepgram Nova-3 | ~18.3% | 36 | $0.46 | <300 ms |
| AssemblyAI Universal-2 | ~14.5% | 102 | $0.27 | 300–600 ms |
| OpenAI gpt-4o-transcribe | ~21.4% | 100+ | $0.36 | 320 ms |
| **Alternative** | | | | |
| Deepgram Flux | ~18.3%* | N/A | TBD | ~260 ms |
| AssemblyAI Slam-1 | N/A | 1 (EN) | $0.37 | N/A |
| Gladia AI Solaria | N/A | 100 | $0.61 | ~270 ms |
| Speechmatics Ursa 2 Enhanced | N/A | 50 | $1.35 | <1 s |
| **Batch-Only** | | | | |
| Google Cloud Chirp 2 | ~11.6% | 102 | $0.96 | Batch only |
| ElevenLabs Scribe | ~15.1% | 99 | $0.40 | Batch only |
For voice agents: Deepgram Nova-3 is the leading choice, balancing sub-300ms latency with competitive accuracy and cost efficiency. AssemblyAI Universal-2 offers the best accuracy among streaming models (14.5% WER) with strong domain-specific performance in medical and sales contexts. OpenAI gpt-4o-transcribe provides the broadest language support (100+) with consistent 320ms latency.
Note: AA-WER (Artificial Analysis Word Error Rate) is measured across 3 diverse real-world datasets (VoxPopuli, Earnings-22, AMI-SDM). Provider-reported WER often uses cleaner test data and may show lower (better) scores than independent benchmarks. Batch-only models process pre-recorded audio files and do not support real-time streaming. *Deepgram reports Flux at Nova-3-level accuracy; it has not yet been independently benchmarked.
#1 Deepgram Nova-3
Released in February 2025 as an upgraded version of Nova-2. The leading STT choice for voice agents, balancing sub-300ms latency with competitive accuracy and cost. Achieves 18.3% WER on Artificial Analysis benchmarks. Supports 36 languages and dialects and can switch between 10 languages in real-time during conversations.
A specialized version, Nova-3 Medical (launched March 2025), achieves significantly lower WER for healthcare applications with a median of 3.45% on medical terminology.
Pricing:
- Streaming: $0.0077 per minute (about $0.46 per hour)
- Nova-3 Medical: Custom enterprise pricing
#2 AssemblyAI Universal-2
Comes in two tiers: Best and Nano. Nano supports 102 languages, while Best works with 20. Achieves 14.5% WER on Artificial Analysis benchmarks – the best accuracy among real-time capable models. Excels at domain-specific transcription in medical and sales contexts. Uses an all-neural architecture for text formatting and improves transcript readability with context-aware punctuation and casing.
Pricing:
- Pay-as-you-go model
- Universal model: $0.27 per hour ($4.50 per 1000 minutes)
- Nano model has a lower rate of $0.12 per hour ($2.00 per 1000 minutes)
- Additional costs may apply for advanced features like speaker detection or sentiment analysis
#3 OpenAI gpt-4o-transcribe
OpenAI introduced two new ASR models built on the GPT-4o architecture in March 2025. Unlike Whisper, they are not open-source and are accessible only through OpenAI's API. gpt-4o-transcribe achieves 21.4% WER on Artificial Analysis independent benchmarks across diverse real-world datasets and supports 100+ languages for multi-language transcription.
gpt-4o-transcribe is tuned for robustness and multilingual capability, while gpt-4o-mini-transcribe offers a lighter, faster, cheaper option best suited for mobile and edge applications.
Note: OpenAI reports WER below 5% on internal benchmarks, but independent testing shows higher error rates on diverse real-world audio.
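A minimal transcription call through the official OpenAI Python SDK looks like this (a sketch based on the SDK's documented audio.transcriptions interface; the file name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("call_recording.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",   # or "gpt-4o-mini-transcribe"
        file=audio_file,
    )

print(transcript.text)
```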
Pricing:
- gpt-4o-transcribe: $6.00 per 1000 minutes
- gpt-4o-mini-transcribe: $3.00 per 1000 minutes
Other Notable STT Providers
- Deepgram Flux – Announced October 2025 as the first Conversational Speech Recognition (CSR) model purpose-built for voice agents. Features model-integrated end-of-turn detection (~260 ms latency) that eliminates the need for a separate Voice Activity Detection system. Maintains Nova-3-level accuracy while delivering turn-complete transcripts with context-aware turn detection. Too new for independent benchmarking or production validation. Free during an October 2025 promotion (up to 50 concurrent connections); regular pricing not yet disclosed.
- AssemblyAI Slam-1 – Speech Language Model combining LLM architecture with ASR encoders. Introduced in April 2025; supports prompt-based customization and handles up to 1,000 domain-specific terms or phrases (each up to six words), helping recognize specialized terminology by understanding semantic context. English only, in public beta. $0.37 per hour.
- Gladia AI Solaria – Launched April 2025, covers 100 languages including 42 underserved languages not supported by many competitors. Low latency (~270 ms) for real-time scenarios. Lacks independent benchmarking data from Artificial Analysis or other third-party testers, making accuracy difficult to verify against established models. Free tier available; Pro tier: $0.61/hour for batch transcription.
- Speechmatics Ursa 2 – Released October 2024, covering 50 languages with strong performance in Spanish and Polish on domain-specific benchmarks. Real-Time Enhanced: $1.35/hour. Free tier: 8 hours per month.
- Google Cloud Chirp 2 – Part of Google Cloud infrastructure using the Universal Speech Model (USM). Achieves the best accuracy on Artificial Analysis benchmarks with 11.6% WER across diverse real-world datasets. Supports 102 languages. Optimized for batch transcription rather than real-time voice agents. $0.96/hour.
- ElevenLabs Scribe – Launched February 2025, supporting 99 languages with strong accuracy across traditionally underserved languages. Achieves 15.1% WER on Artificial Analysis benchmarks. Features speaker diarization (up to 32 speakers), word-level timestamps, and audio event detection. Currently batch transcription only – not suitable for real-time voice agents. A streaming version has been announced but has no release date. $0.40 per hour.
Text-to-Speech (TTS) Selection Criteria
TTS selection matters as much as STT. The TTS engine determines how natural and human-like the voice agent sounds to end users.
Voice Quality and Naturalness
Voice naturalness is the primary consideration for TTS. Models must avoid robotic qualities, maintain consistency across longer passages, handle partial text fragments, and accurately pronounce specific formats like phone numbers and email addresses.
There's no universal quality metric for TTS, but platforms like Artificial Analysis rank models with an ELO score derived from pairwise listener preferences. As of October 2025, top TTS models have ELO scores around 1000–1100.
Voice Customization Options
Providers vary in available voice options and customization capabilities. Basic adjustments include speaking rate, pitch, and emphasis. Advanced systems offer control over voice characteristics like assertiveness, confidence, smoothness, and relaxedness. The Speech Synthesis Markup Language (SSML) enables fine-tuned voice generation across most platforms.
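A short SSML fragment shows the kind of control involved (illustrative only; tag support varies by provider, so check each platform's SSML reference):

```python
# Illustrative SSML. <say-as> handles formats like digits; <prosody> adjusts rate and pitch.
ssml = """\
<speak>
  Your appointment is confirmed.
  <break time="300ms"/>
  <prosody rate="95%" pitch="-2st">We look forward to seeing you.</prosody>
  Your confirmation code is <say-as interpret-as="digits">48215</say-as>.
</speak>
"""
```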
Language Support
Language selection, regional accent configuration, and dialect-specific adjustments determine global deployment viability. Some models are limited to one language and may struggle with mid-call language switches.
Best Text-to-Speech Models and Providers in 2025
The providers below rank highest on the Artificial Analysis Leaderboard as of October 2025. Voice naturalness affects user perception significantly – voices that fall into the Uncanny Valley typically perform worse in production deployments.
| Provider and Model | ELO Score | Languages | Cost (per 1M characters) | Latency |
|---|---|---|---|---|
| ElevenLabs v3 | ~1114 | 32 | ~$206 | N/A |
| ElevenLabs Flash v2.5 | ~1097 | 32 | ~$103 | 75 ms |
| Amazon Polly Long-form | ~1066 | 34 | $100 | 100 ms |
| Azure AI Speech Neural | ~1048 | 140+ | $15 | 300 ms |
| Google Text-to-Speech Standard | ~1034 | 50+ | $4 | 500 ms |
| PlayHT Dialog | ~1013 | 32 | $99/mo unlimited | 300–320 ms |
#1 ElevenLabs Flash v2.5
Offers two main TTS model families: Multilingual and Flash. The Flash model, recommended specifically for voice agents, delivers ultra-fast synthesis with ~75 ms latency. Supports voice customization including tone and emotional expression, offers voice cloning, and supports 32 languages.
Beyond TTS, ElevenLabs provides an integrated platform for building customizable interactive voice agents, including Scribe STT API, a conversational AI framework, dubbing API for translation, and support for additional audio formats (Opus, A-law for telephony). The Voice Library allows community and company voice uploads categorized by use case.
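The synthesis endpoint itself is a simple HTTP call. A sketch against ElevenLabs' public REST API (endpoint shape as documented at the time of writing; the voice ID and key are placeholders):

```python
import requests

VOICE_ID = "your-voice-id"  # placeholder: pick a voice from the Voice Library

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={"text": "Thanks for calling. How can I help?", "model_id": "eleven_flash_v2_5"},
)
response.raise_for_status()

with open("reply.mp3", "wb") as f:
    f.write(response.content)  # MP3 audio by default
```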
Pricing:
- Free tier available
- Starter: $5/month
- Creator: $22/month ($11 first month)
- Pro: $99/month
- Scale: $330/month
- Business: $1,320/month
- Custom Enterprise tiers
#2 Cartesia Sonic
A single TTS model, Sonic, with ~90 ms latency. Supports instant voice cloning from minimal audio input (as little as 10 seconds) and allows customization of voice attributes like pitch, speed, and emotion. Can run models directly on devices and supports 15 languages. Playground available at https://play.cartesia.ai/text-to-speech (requires login).
Pricing:
- Business subscriptions start at $49/month
- Approximately $46.70 per 1M characters
#3 Amazon Polly
Supports 34 languages and dialects with 96 voices across those languages. Supports SSML for fine-tuning speech output and allows custom voice creation for branding. Response latency ranges from 100 ms to 1 second. Includes four engines: Generative, Long-Form, Neural, and Standard. Integrates with other AWS services and can be accessed through the AWS Console.
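Polly is called through the standard AWS SDK. A minimal sketch with boto3 (engine and voice values per the AWS docs; credentials assumed already configured):

```python
import boto3

polly = boto3.client("polly")

response = polly.synthesize_speech(
    Engine="long-form",        # or "generative", "neural", "standard"
    VoiceId="Danielle",        # one of Polly's long-form voices
    OutputFormat="mp3",
    TextType="ssml",
    Text="<speak>Your total is <say-as interpret-as='cardinal'>42</say-as> dollars.</speak>",
)

with open("speech.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```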
Pricing:
- Long-form: $100 per 1M characters
- Generative: $30 per 1M characters
- Neural: $16 per 1M characters
- Standard: $4 per 1M characters
#4 Microsoft Azure AI Speech
Supports over 140 languages and locales. Offers multiple versions: Standard, Custom, and HD Neural. Allows custom neural voice creation and supports SSML for pronunciation and intonation customization.
The Dragon HD Neural TTS variant, introduced in March 2025, delivers highly expressive, context-aware speech with emotion detection capabilities. 19 total Dragon HD voices available, though precise Dragon HD pricing is not publicly disclosed.
Pricing:
- Standard Neural voices: $15 per 1M characters
- Custom Neural Professional voices: $24/1M characters
- Additional costs for model training and endpoint hosting
#5 Google Text-to-Speech
Supports over 380 voices across 50+ languages. Can create unique voices from recorded samples and supports SSML for controlling pitch, speed, volume, and pronunciation. Voices can be tuned for different playback devices. Latency is around 500 ms. The playground requires a Google account login.
Pricing:
- Standard voices: $4 per 1M characters
- WaveNet voices: $16 per 1M characters
- Neural2 voices: $16 per 1M characters
- Studio voices: Premium pricing varies
#6 PlayHT Dialog
Released in February 2025 and designed specifically for conversational applications. Works with 9 primary languages and 23 additional languages, with more than 50 voices available. Provides voice cloning, with ~300 ms latency. Partners with Groq for faster inference and LiveKit for real-time voice AI integration. Offers Play AI Studio for multi-speaker podcast creation and voice agent building.
Pricing:
- Free tier
- Creator: $31.20/month (billed yearly)
- Unlimited: $29/month
- Professional: $99/month (unlimited voice generation)
- Enterprise: Custom pricing
The AI voice agent calculator projects runtime performance, cost, and infrastructure load based on selected STT and TTS models.
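The underlying arithmetic is simple. A rough sketch of the STT side, using the per-hour rates from the comparison table above (illustrative only, not the calculator's actual logic):

```python
# Per-hour streaming STT rates from the comparison table above.
STT_RATES = {
    "AssemblyAI Universal-2": 0.27,
    "Deepgram Nova-3": 0.46,
    "Speechmatics Ursa 2 Enhanced": 1.35,
}

def monthly_stt_cost(provider: str, hours_per_month: float) -> float:
    return STT_RATES[provider] * hours_per_month

for name in STT_RATES:
    print(f"{name}: ${monthly_stt_cost(name, 10_000):,.0f}/month at 10,000 hours")
```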
Choosing the Right STT and TTS Models for Your Project
Large providers like Google and Microsoft prioritize stability and infrastructure reliability. Smaller providers like ElevenLabs and Deepgram often deliver lower latency and more natural-sounding voices.
Different use cases emphasize different capabilities. Entertainment and gaming applications benefit from emotional range and voice realism in TTS. High-volume commercial deployments require proven stability and uptime guarantees. Appointment booking systems need extremely low WER in STT, since users dictate contact information that must be transcribed accurately.
Production performance differs significantly from benchmark results. Providers showcase ideal conditions, but real deployments face quiet speech, speech impairments, heavy accents, background noise, poor connections, and multi-speaker scenarios. These edge cases reveal model limitations that don’t appear in initial testing.
Scaling introduces additional constraints. Traffic spikes, multi-region deployment, language expansion, and integration complexity all affect provider selection. Infrastructure capabilities matter as much as model performance once the system reaches production scale.
STT and TTS choices determine how voice agents sound and understand users. The complete picture includes LLM selection, observability, error handling, compliance, and scaling infrastructure. The AI launch plan covers all six systems needed to ship voice agents that reliably handle customer conversations.
About Softcery: We’re the AI engineering team that founders call when other teams say “it’s impossible” or “it’ll take 6+ months.” We specialize in building advanced AI systems that actually work in production, handle real customer complexity, and scale with your business. We work with B2B SaaS founders in marketing automation, legal tech, and e-commerce – solving the gap between prototypes that work in demos and systems that work at scale. Get in touch.
Frequently Asked Questions
Should voice agents prioritize accuracy or latency?
It depends on your use case. Customer service agents prioritize accuracy to avoid misunderstandings. Gaming or entertainment applications prioritize low latency for natural conversation flow.
For most production voice agents, aim for sub-300 ms latency with the best accuracy you can afford in that range. Deepgram Nova-3 (18.3% WER, <300 ms) and AssemblyAI Universal-2 (14.5% WER, 300–600 ms) both balance these tradeoffs well.
What's the difference between WER and ELO Score?
They measure different parts of the voice agent stack.
WER applies to Speech-to-Text (STT) and shows transcription accuracy. Lower is better. Independent testing shows roughly 14.5–21.4% WER for top streaming models on diverse real-world audio, with the best batch-only model at 11.6%.
ELO Score applies to Text-to-Speech (TTS) and reflects how natural the generated voice sounds. Higher is better. Top models in 2025 score around 1000–1100.
Why do provider-reported benchmarks differ from independent ones?
Providers typically test on clean, curated datasets that show their models in the best light. Artificial Analysis uses diverse real-world audio with accents, background noise, and challenging acoustic conditions.
For example, OpenAI reports <5% WER internally but shows 21.4% on Artificial Analysis benchmarks. Both numbers are accurate; they just measure different things. Independent benchmarks give a more realistic picture of production performance.
Can I use a batch transcription model for a voice agent?
No. Batch transcription models process pre-recorded audio files and don't support the real-time streaming required for live conversations.
Google Chirp 2 achieves the best accuracy (11.6% WER) but only works for transcribing recordings. For voice agents, you need streaming models like Deepgram Nova-3, AssemblyAI Universal-2, or OpenAI gpt-4o-transcribe.
How much does latency matter?
Latency determines conversation naturalness. Under 300 ms feels natural, 300–600 ms is acceptable, and over 1 second feels robotic. Total latency includes STT + LLM + TTS, so each component matters.
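A quick budget check makes the point (component numbers are assumptions drawn from figures earlier in this article):

```python
# Illustrative voice-to-voice latency budget, in milliseconds.
stt_first_result = 300   # e.g. a sub-300 ms streaming STT model
llm_first_token = 350    # assumption: varies widely with model and prompt size
tts_first_audio = 75     # e.g. a ~75 ms TTS model

total_ms = stt_first_result + llm_first_token + tts_first_audio
print(f"Voice-to-voice: ~{total_ms} ms")  # ~725 ms before network overhead
```

Because the stages add up, shaving latency from any single component improves the whole conversation.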
How do costs scale with volume?
Cost scales linearly with usage. At 10,000 hours per month, the difference between AssemblyAI ($0.27/hr) and Speechmatics ($1.35/hr) is $2,700 vs $13,500 monthly. The AI voice agent calculator helps project costs at your expected volume.
How can Softcery help?
We work with B2B SaaS founders who need voice agents that handle real customer complexity. If your prototype works but production feels risky, or your team has hit walls with advanced features, we might be able to help.
The AI launch plan covers the framework we use for production voice agents. If it resonates with your situation, reach out and we can discuss whether we're a good fit.
Find out what production-ready actually means for your AI system. Your custom launch plan shows the specific reliability, safety, and performance gaps you need to close – with proven solutions.
Get Your AI Launch Plan