How to Choose STT & TTS for AI Voice Agents in 2025: A Comprehensive Guide
Learn how to choose Speech-to-Text and Text-to-Speech technologies for building a voice agent tailored to your use case. We break down key metrics and the most popular models.

Speech-to-Text (STT) and Text-to-Speech (TTS) technologies, combined with Large Language Models (LLM), are the backbone of modern AI voice agents. These technologies work together to transform spoken language into meaningful interactions, making digital communication more natural and intuitive.
While direct Speech-to-Speech solutions are emerging, the current STT→LLM→TTS approach remains the most flexible. This method allows businesses to easily switch language models based on task complexity, providing greater adaptability.
But how do you choose the STT and TTS that best suit your purposes? In this article, we will look at what criteria you should pay attention to, as well as what characteristics the most popular models have today.
Understanding STT and TTS technologies
The voice agent interaction cycle works simply:
- STT captures voice input and converts it to text.
- An LLM generates an appropriate response.
- TTS converts the response back into natural-sounding speech.
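In code, one turn of this cycle can be sketched roughly as follows. This is a minimal sketch, not a production pattern: stt, llm, and tts are hypothetical stand-ins for whichever providers you choose, and real integrations are streaming and asynchronous.

```python
def handle_turn(audio_chunk: bytes, stt, llm, tts) -> bytes:
    user_text = stt.transcribe(audio_chunk)   # 1. STT: speech -> text
    reply_text = llm.generate(user_text)      # 2. LLM: text -> response
    return tts.synthesize(reply_text)         # 3. TTS: response -> speech audio
```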
Modern STT models leverage advanced deep learning architectures, particularly transformer-based models. The conversion process involves several stages: audio preprocessing, feature extraction, and sequence modeling to transform acoustic signals into precise text output.
Similarly, contemporary TTS systems employ neural networks in a two-step process. The system first converts text into spectrograms (visual representations of sound) and then transforms these spectrograms into audio waveforms that closely mimic human speech patterns.
Text-to-Speech (TTS) models have reached a mature stage, requiring only “polishing”: addressing minor bugs, reducing costs, optimizing device compatibility, and enhancing overall stability.
Speech-to-Text (STT) technologies still have room for improvement. Developers focus on critical challenges like maintaining accuracy in challenging environments, reliably recognizing multiple simultaneous speakers, and isolating target voices amid background noise. Encouragingly, these are active development priorities for technology providers, promising significant advancements in the near future.
What to look for when choosing speech technologies
When evaluating speech technologies for your business, it's important to understand that STT and TTS models serve different purposes and therefore require different evaluation approaches.
To help evaluate these criteria objectively, several industry-standard metrics can guide your decision-making process. Let's look at each criterion and its associated measurements.
Speech-to-Text (STT) selection criteria
Accuracy and recognition capabilities
The primary concern for STT models is their ability to accurately transcribe spoken language. This can be measured using Word Error Rate (WER), an industry standard metric showing the percentage of transcription errors. Good models achieve 5–10% WER, meaning the accuracy is 90–95%.
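To make this concrete, here is a small sketch computing WER with the open-source jiwer library (one option among several); the sample strings are invented for illustration:

```python
# WER = (substitutions + deletions + insertions) / number of words in the reference.
from jiwer import wer

reference = "please call me back at five five five one two three four"
hypothesis = "please call me back at five five one two three four"

print(f"WER: {wer(reference, hypothesis):.2%}")  # one deleted word out of 12 -> ~8%
```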
Key considerations also include accuracy rates for different accents, background noise handling, and specialized vocabulary recognition. For instance, a customer service application needs robust handling of various accents and casual speech patterns, while healthcare applications require precise medical terminology recognition.
The ability to distinguish the voices of different people (speaker diarization) can also be useful. This is relevant for scenarios where audio from many speakers arrives on the same audio channel.
Processing speed and latency
Real-time transcription capabilities are crucial for STT models, especially in interactive applications. A key performance metric is the Real-Time Factor (RTF): the ratio of processing time to audio duration, where lower values mean faster processing.
An RTF of 1 means that the processing time exactly equals the duration of the audio. An RTF of 0.1 means the system needs only 10% of the audio's duration to transcribe it, which is considered excellent for real-time applications, as it allows for rapid transcription and immediate feedback.
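Measuring RTF yourself is straightforward; here is a minimal sketch where transcribe is a hypothetical stand-in for any batch transcription function you want to benchmark:

```python
import time

def measure_rtf(transcribe, audio_path: str, audio_duration_s: float) -> float:
    start = time.perf_counter()
    transcribe(audio_path)                                    # run the model
    return (time.perf_counter() - start) / audio_duration_s  # < 1.0 beats real time
```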
Also worth looking at:
- First Response Latency: The time until the first partial transcript is produced (see the measurement sketch after this list). For optimal performance in real-time STT applications, aiming for latencies under 100 ms is ideal, while values up to 200–500 ms can be acceptable depending on the context. Anything above 1 second is generally considered too high for effective interaction.
- Speech Completion Detection: Accuracy in determining when the user has finished speaking, which affects response time and the flow of the conversation.
- Timestamp Accuracy: Accuracy in providing metadata such as the start and end time of each utterance.
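As referenced above, first response latency can be measured with a few lines of timing code; stream_transcripts is a hypothetical stand-in for your provider's streaming API:

```python
import time

def first_response_latency(stream_transcripts, audio_path: str) -> float:
    start = time.perf_counter()
    for _partial in stream_transcripts(audio_path):  # yields partial transcripts
        return time.perf_counter() - start           # seconds until first partial
    raise RuntimeError("stream produced no transcripts")
```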
Audio input requirements
Evaluate the model's ability to handle different audio qualities, microphone types, and background environments. Some models perform better with high-quality audio input, while others are more forgiving of variable conditions.
A critical feature for voice agents in public spaces, particularly for phone-based systems, is the ability to effectively filter and isolate target voices from background noise and other speakers.
Text-to-Speech (TTS) selection criteria
Voice quality and naturalness
The primary consideration for TTS is the naturalness of the generated speech. Consider whether the voices sound robotic or human-like, and how well they maintain consistency across longer passages. Special attention should be paid to the model's ability to maintain consistent sound quality when processing incomplete or partial text fragments.
The model should also handle dictated formats robustly, accurately pronouncing phone numbers, email addresses, and confirmation codes to ensure clear and precise communication of critical information.
There is no common quality metric for Text-to-Speech services comparable to WER for STT models. However, AI model research platforms may have custom metrics, and you can start evaluating models from there.
For example, the Artificial Analysis platform collects user responses about the quality of popular TTS models and calculates an ELO score based on them. At the beginning of 2025, the best-performing TTS models had ELO scores around 1000–1100, while the rest of the models on the platform's leaderboard scored around 850.
Voice customization options
Evaluate the available voice options and customization capabilities. Some businesses need multiple voices for different purposes, while others require brand-specific voice creation.
Consider the model's ability to adjust speaking rate, pitch, and emphasis. For example, higher pitches often convey friendliness and approachability, while lower pitches project authority and seriousness.
Advanced systems may also offer control over specific voice characteristics:
- Assertiveness: Controls the firmness of voice delivery.
- Confidence: Affects how assured the voice sounds.
- Smoothness: Adjusts between smooth and staccato delivery.
- Relaxedness: Modifies tension in the voice.
Fine-grained control over voice generation, allowing developers to adjust intonation and pronunciation for specific words or phrases, is enabled by Speech Synthesis Markup Language (SSML), as in the sketch below.
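Here is a hedged SSML sketch that slows down and spells out a confirmation code; exact tag support varies by provider, so check their SSML reference before relying on it:

```python
ssml = (
    "<speak>"
    "Your confirmation code is "
    '<prosody rate="slow">'
    '<say-as interpret-as="characters">A1B2C3</say-as>'
    "</prosody>."
    "</speak>"
)
print(ssml)  # pass this string to your TTS API as SSML input
```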
Language support
Support for multiple languages and regional variations ensures voice agents can serve diverse audiences. This includes:
- Language selection
- Regional accent configuration
- Dialect-specific adjustments
Some models (e.g. cartesia-english) are configured for a single language, and if you need to switch to another language during the call, problems arise. These problems can be hard to solve, because there is no real-time update of the call configuration.
Common criteria for both technologies
Both technologies need evaluation of their pricing models, including per-usage costs or subscription fees, scaling costs with increased usage, and additional fees for premium features or customization.
Integration and technical requirements are also essential. Check the following points:
- API documentation quality and ease of use
- Development resources required
- Compatibility with existing systems
- Deployment options (cloud, on-premise, hybrid)
- Service stability and guaranteed uptime
- Update frequency and maintenance schedule
In addition, both technologies must meet your security standards, such as data handling and privacy practices, regulatory compliance capabilities, encryption standards, and access control features. Also, consider the quality and responsiveness of technical support.
A common pitfall when selecting STT/TTS models is prioritizing agent response speed or cost over consistent quality across different environments without proper justification. In practice, conversations sometimes require giving users more time to think, and it can actually be necessary to slow down the agent's response rate.
Misplaced priorities in testing approaches can lead to situations where a voice agent performs brilliantly in controlled environments like homes or offices but struggles in noisy locations or with poor audio quality.
Best Speech-to-Text (Automatic Speech Recognition) models & providers in 2025
This list covers the providers and models most commonly used for ASR tasks. To keep the article from becoming endless, it is limited to the most popular options.
The main indicator by which the models below are compared is WER (Word Error Rate). As for speed metrics, not all providers publish these values. Be careful with WER too: it can vary depending on the dataset, language, domain, and so on. Moreover, WER is usually measured on pre-recorded audio rather than a live stream, so quality in a voice agent will usually be somewhat worse.
#1 OpenAI gpt-4o-transcribe & gpt-4o-mini-transcribe
OpenAI introduced two newer ASR (automatic speech recognition) models under the GPT-4o architecture in March 2025: gpt-4o-transcribe and gpt-4o-mini-transcribe, offering a step beyond Whisper in both speed and quality for real-time transcription tasks.
These models are part of the GPT-4o family, OpenAI's first fully multimodal models capable of handling text, vision, and audio input/output. Unlike Whisper, which is audio-only and open-source, the GPT-4o transcription models are not open-sourced and are only accessible through OpenAI's API or integrated products like ChatGPT. The new flagship models available via the API are gpt-4o-transcribe and the lower-cost gpt-4o-mini-transcribe. These models are explicitly positioned as the next generation, demonstrating improved accuracy and reliability compared to the Whisper family (v2 and v3) on various benchmarks.
The Word Error Rate (WER) for English is reported to be below 5%, showing improvements over Whisper, especially in live or overlapping speech. The models are highly optimized for real-time streaming, capable of producing partial transcriptions in milliseconds. Both variants are capable of multi-language transcription, but gpt-4o-transcribe is tuned more for robustness and multilingual capability, while gpt-4o-mini-transcribe is designed to be lighter, faster, and cheaper — best suited for mobile or edge applications.
gpt-4o-transcribe is priced at $0.006 per minute or based on token usage ($2.50 per 1M input tokens, $10.00 per 1M output tokens). gpt-4o-mini-transcribe is cheaper at $0.003 per minute or token-based ($1.25 per 1M input tokens, $5.00 per 1M output tokens).
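A minimal transcription call with the official OpenAI Python SDK looks like this; the file name is a placeholder, and the SDK reads OPENAI_API_KEY from the environment:

```python
from openai import OpenAI

client = OpenAI()

with open("call_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```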
gpt-4o-mini-tts
The gpt-4o-mini-tts model is part of OpenAI’s new generation of Text-to-Speech (TTS) systems designed to complement the GPT-4o ecosystem — enabling ultra-low-latency, high-quality voice synthesis for real-time AI voice agents and assistants.
Unlike Whisper (STT) or traditional modular TTS pipelines like Tacotron + WaveNet, gpt-4o-mini-tts is built with efficiency and speed as core priorities, making it ideal for use cases where fast, responsive voice output is critical — such as AI phone agents, voice-enabled apps, and conversational AI interfaces.
As with gpt-4o-mini-transcribe, the TTS model is not open-source and is only accessible through OpenAI’s API.
It is part of GPT-4o’s audio capabilities, and pricing is tied to audio output usage. OpenAI has not released standalone pricing, but typical voice output rates range around $0.015–$0.03 per minute depending on fidelity and usage tier.
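Synthesis with the same SDK is equally compact; this sketch streams the generated audio straight to an MP3 file (voice and text are illustrative):

```python
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Your appointment is confirmed for Tuesday at 3 PM.",
) as response:
    response.stream_to_file("reply.mp3")
```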
#2 Gladia AI Solaria STT
Gladia AI launched its Solaria STT model in April 2025. Positioned for enterprise use, particularly call centers and voice platforms, Solaria claims industry-leading performance, with broad multilingual support (100 languages, including 42 purportedly unique to it) and high accuracy.
Solaria is engineered to provide native-level transcription accuracy across a vast array of languages, including those less commonly supported by other platforms. It achieves a Word Accuracy Rate (WAR) of 94% in languages such as English, Spanish, and French, while maintaining an ultra-low latency of 270 milliseconds, ensuring natural and responsive conversations.
Their pricing includes a free tier and a pro tier around $0.612/hour for batch transcription.
#3 AssemblyAI Universal-2
AssemblyAI has one of the latest updates on the STT market: their Universal-2 came out in November 2024 as an improved version of the Universal-1 model. It comes in two options: Best and Nano. Nano is a lightweight option that supports over 102 languages, while Best works with 20.
Current AssemblyAI benchmarks claim a 6.6% WER for Universal-2 in English. They emphasize that Universal-2 shows significant improvements in formatted WER (F-WER) and reduced hallucination rates. Reviewers note that Universal-2 copes especially well with the medical and sales domains. Universal-2 employs an all-neural architecture for text formatting, significantly improving the readability of transcripts. This includes context-aware punctuation and casing.
AssemblyAI offers a pay-as-you-go pricing model starting at $0.37 per hour (approximately $0.0062 per minute). The much lower rate of $0.12 per hour ($0.002 per minute) applies to their Nano model, designed for cost-effectiveness and broad language support. Additional costs may apply for advanced features like speaker detection or sentiment analysis.
#4 Deepgram Nova-3
Nova-3 is Deepgram's latest model, released in February 2025 as an upgraded version of their proprietary STT model Nova-2. Deepgram has also expanded its portfolio with specialized versions like Nova-3 Medical, launched in March 2025 and targeting healthcare use cases. The company is also promoting its enterprise runtime platform (DER), the Aura-2 TTS model, and advancements towards a full Speech-to-Speech (STS) architecture, indicating a broader strategic focus.
The model achieves one of the best WERs on the market: 6.84% (average across all domains). This number applies to streaming audio; for batch data (pre-recorded audio) it is even lower: 5.26%. The specialized Nova-3 Medical model reports a median WER of 3.45% and a keyword error rate of 6.79%.
Nova-3 supports 36 languages and dialects and can switch between recognizing 10 different languages in real time. This means that if an English-speaking speaker throws in a couple of Spanish words, the model will interpret them correctly.
Deepgram's STT model costs $0.0077 per minute for streaming audio for users who opt into Deepgram's Model Improvement Program. Lower rates are available for higher volume tiers (Growth plan: $0.0065/min for Nova-3 English). Users who do not opt into the improvement program face higher rates (e.g., $0.0092/min for Nova-3 Multilingual). The specialized Nova-3 Medical model also starts at the $0.0077/min rate.
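For reference, here is a sketch of a pre-recorded transcription request to Deepgram's REST endpoint with Nova-3; the file name is a placeholder, and DEEPGRAM_API_KEY is assumed to be set:

```python
import os
import requests

with open("call_recording.wav", "rb") as f:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "smart_format": "true"},
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/wav",
        },
        data=f,
    )

print(response.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```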
#5 AssemblyAI Slam-1
Slam-1 is AssemblyAI's latest advancement in speech recognition technology, introduced in April 2025. Slam-1 is a new Speech Language Model that combines an LLM architecture with ASR encoders for superior speech-to-text transcription. The model delivers high accuracy through its understanding of context and semantic meaning.
Slam-1 introduces prompt-based customization, allowing users to provide a list of up to 1,000 domain-specific terms or phrases (each up to six words) via the keyterms_prompt parameter. This enables the model to better recognize and transcribe specialized terminology by understanding its semantic context. Slam-1 maintains an average WER of 7% across diverse datasets, matching the industry-leading accuracy of the Universal model.
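A sketch of a job submission using the keyterms_prompt parameter described above, via AssemblyAI's v2 transcript endpoint (field names follow their docs for the Slam-1 beta; the audio URL and key terms are invented):

```python
import os
import requests

headers = {"authorization": os.environ["ASSEMBLYAI_API_KEY"]}

job = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": "https://example.com/clinic_call.mp3",
        "speech_model": "slam-1",
        "keyterms_prompt": ["metformin", "hypertension", "prior authorization"],
    },
).json()

print(job["id"], job["status"])  # poll GET /v2/transcript/{id} until "completed"
```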
Slam-1 is currently in public beta and accessible through AssemblyAI's standard API endpoint. Priced at $0.37 per hour, identical to the Universal model, with volume discounts available for large workloads.
#6 Speechmatics Ursa 2
The latest and most advanced STT model from Speechmatics is called Ursa 2. It was released in October 2024. The update added new languages (e.g. Arabic dialects), expanding the list to 50. The new version also improved the accuracy and speed of the model. In certain languages, such as Spanish and Polish, Ursa 2 is the market leader with 3.3% and 4.4% WER respectively.
Many users also highlight Ursa's superior handling of diverse accents and noisy environments. That said, the average WER for Ursa 2 is 8.6%, and Speechmatics' documentation shows average WER across 21 datasets of 11.96% for Ursa Enhanced and 13.89% for Ursa Standard.
Speechmatics operates on a subscription-based pricing model. The minimum price for Speechmatics' Speech-to-Text services is approximately $0.0133 per minute for the Standard model in batch transcription, with higher rates for enhanced accuracy and real-time transcription:
- Batch Standard: $0.80/hour (~$0.0133/min)
- Batch Enhanced: $1.04/hour (~$0.0173/min)
- Real-Time Standard: $1.04/hour (~$0.0173/min)
- Real-Time Enhanced: $1.35/hour (~$0.0225/min)
A free tier offering 8 hours per month is also available.
#7 Google Speech-to-Text Chirp
The Google Speech-to-Text service is part of the huge Google Cloud infrastructure. Speech recognition in it is handled by USM (Universal Speech Model), which is not a single model but a whole family. The most advanced model in this family is Chirp, which covers more than 125 languages; USM as a whole works with 309.
Being part of Google means regular updates, the ability to train models on large amounts of data, and interconnection of the service with other infrastructure applications. Google Speech-to-Text also has a good WER, but as with Whisper, it is highly language-dependent. The average WER is 8.5%.
The first 60 minutes per month are free. After that, pricing depends on the API version, logging options, and volume (detailed under Chirp 2 below).
Chirp 2
Chirp 2 is positioned as the latest generation, offering significant improvements in accuracy and speed over the original Chirp, along with expanded features like word-level timestamps, enhanced model adaptation, speech translation, and support for streaming recognition (real-time). Google has also released specialized models like chirp_telephony and continues broader AI advancements with Gemini models (which can also process speech), new TPUs, and integrated media capabilities in Vertex AI.
WERs on standard datasets like LibriSpeech or CommonVoice range from around 6% to 11% or higher, while other tests have reported WERs from 16% to over 20%. A benchmark from Artificial Analysis specifically lists Chirp 2 with a WER of 9.8% on Common Voice v16.1. AssemblyAI benchmarks place Google's WER higher than their own Universal-2 model.
- V1 API (Pay-as-you-go, after 60 free mins/month): $0.016/min (with data logging) or $0.024/min (without data logging).
- V2 API (Pay-as-you-go): Uses tiered pricing for standard models (including Chirp/Chirp 2). For the first 500,000 minutes/month, the rate is $0.016/min (non-logged) or $0.012/min (logged). These rates decrease with higher volume, down to $0.004/min (non-logged) or $0.003/min (logged) for usage above 2 million minutes/month.
- V2 API Dynamic Batch: Offers lower rates for non-urgent batch processing: $0.003/min (non-logged) or $0.00225/min (logged).
#8 OpenAI Whisper V3
The STT model from OpenAI, Whisper, was first introduced in September 2022 and has since been updated twice, in December 2022 (V2) and November 2023 (V3). The model is available in five variants: from tiny to large.
Whisper was designed to be as versatile as possible, working with 99 languages worldwide. But because of this, it is difficult to evaluate its effectiveness in each specific case. For instance, the WER for English can be as low as 5–6%, while for Scandinavian languages, it ranges from 8–10%. The average WER reported for Whisper is approximately 10.6%.
The model is effective in real-world scenarios, particularly in environments with background noise or heavy accents. However, there can be limitations regarding speed and operational complexity, especially when dealing with large volumes of audio data.
OpenAI offers Whisper as an open-source model, meaning it is free to use. However, for those utilizing the Whisper API, pricing details may vary based on usage and specific implementation. It starts at $0.006 per minute.
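Since Whisper is open-source, it can run locally with a few lines; a minimal sketch with the openai-whisper package (requires ffmpeg; the file name is a placeholder):

```python
import whisper

model = whisper.load_model("large-v3")  # variants range from "tiny" to "large"
result = model.transcribe("call_recording.mp3")
print(result["text"])
```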
Comparison of top STTs in 2025:
Provider and Model | WER* | Languages | Cost for streaming audio (per minute) | Latency (real-time) |
---|---|---|---|---|
OpenAI gpt-4o-transcribe / mini (API) | ~5% | 50 | $0.006 / $0.003 | 320 ms |
Gladia AI Solaria | N/A (94% WAR) | 100 (incl. 42 underserved) | ~$0.0102 (Batch) / ~$0.0126 (Live) / Free tier (10h/mo) | 270 ms |
AssemblyAI Universal-2 | ~6.6% | 102 | $0.0062 | ~300–600 ms |
Deepgram Nova-3 / Nova-3 Medical | ~6.8% | 36 | $0.0077 (Eng) / $0.0092 (Multi) | <300 ms |
AssemblyAI Slam-1 | ~7% | 1 (EN only) | ~$0.0062 ($0.37/hr) | ~500–800 ms
Speechmatics Ursa 2 | ~12–14% | 50 | $0.0173 | <1s |
Google Cloud Chirp 2 | ~9.8% | 102 | $0.016 (non-logged) / $0.012 (logged) | N/A |
*Average WER values officially declared by providers are indicated. The values may vary depending on the dataset in which the model is tested.
Best Text-to-Speech models & providers in 2025
Similar to the previous selection, this is a limited list. It includes those providers that show the best quality according to the Artificial Analysis Leaderboard in February 2025.
All the services listed below have good voice realism according to user reviews, but we recommend listening to their samples yourself. Emotionality and realism of voice are among the most important criteria for evaluating TTS. Many people are sensitive to the uncanny valley effect, and chances are that the more realistic the voice, the better it affects the overall conversion rate or effectiveness of the voice agent.
#1 ElevenLabs Flash
As of early 2025, ElevenLabs offers two TTS models: Multilingual and Flash. Multilingual is optimized for maximum realism and humanness of the voice, while Flash is an ultra-fast model with ~75 ms latency. For voice agents specifically, Flash is recommended. It is integrated into ElevenLabs' broader platform for building customizable interactive voice agents.
Flash from ElevenLabs works with 32 languages. Users can customize various aspects of the voice output, including tone and emotional expression. They can also clone voices. As of April 2025, API plans range from Free, Starter ($5/mo), Creator ($22/mo, $11 first month), Pro ($99/mo), Scale ($330/mo), and Business ($1320/mo) to custom Enterprise tiers.
Notable additions include the Scribe STT API, a Conversational AI framework for building agents, a Dubbing API for translation, support for additional audio formats (Opus, A-law for telephony), and improvements to the voice cloning workflow.
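A text-to-speech request against ElevenLabs' REST API with the Flash model can be sketched as follows; the voice ID is a placeholder to be copied from your Voice Library, and ELEVENLABS_API_KEY is assumed to be set:

```python
import os
import requests

voice_id = "YOUR_VOICE_ID"  # placeholder: pick one in the Voice Library
response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "Hi! I'm a Text-to-Speech model, and this is what my voice sounds like.",
        "model_id": "eleven_flash_v2_5",
    },
)

with open("sample.mp3", "wb") as f:
    f.write(response.content)
```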
The ElevenLabs Playground is available here, but you need to sign up to try it out. A basic version of the playground is also available on the provider's homepage, but with a limited selection of voices, no customization, and no option to choose a model.
There is also a Voice Library, where you can listen to all voices categorized by use case. Interestingly, not only the company's developers but also community members can upload voices to it.
Listen to how two popular voices (male and female) from the ElevenLabs library sound:
Here and below, the phrase "Hi! I’m a Text-to-Speech model, and this is what my voice sounds like." is used for testing. No additional voice settings have been applied.
#2 Cartesia Sonic
Cartesia offers only one TTS model, Sonic. It is also quite fast, showing 90 ms latency, which is very good for real-time conversations. Developers can customize voice attributes such as pitch, speed, and emotion, allowing for tailored speech outputs that meet specific needs. Cartesia's technology also allows models to run directly on devices.
Sonic supports instant voice cloning with minimal audio input (as little as 10 seconds), enabling users to replicate specific voices accurately. Sonic works with 15 languages. Monthly subscriptions for business start at $49, or about $46.70 per 1M characters.
The Cartesia Playground is available at this link, but it is only accessible to logged-in users. In the Voices section, you can browse the entire voice library.
Here’s how two voices from the Cartesia library sound: Help Desk Woman and Customer Service Man:
#3 Amazon Polly Generative
Amazon Polly is a TTS service from cloud provider Amazon that seamlessly integrates with other AWS services. Polly has four models: Generative, Long-Form, Neural, and Standard. The first is considered the most advanced. While AWS announced many AI updates in late 2024/early 2025, these primarily focused on Amazon Bedrock, Amazon Q, SageMaker, and infrastructure, with little specific news about Polly model evolution apart from a potentially ambiguous reference to "Amazon Nova Sonic".
The service supports 34 languages as well as their dialects. One to 10+ voice options are offered for each language. A total of 96 voices (for all languages) are available to users.
Polly supports SSML, allowing developers to fine-tune speech output with specific instructions regarding pronunciation, volume, pitch, and speed. Users can create custom voices tailored to specific branding needs or preferences. The delay of Amazon Polly responses varies from 100ms to 1 second. The service operates on a pay-as-you-go pricing model. The minimal cost for business users of the Generative model is $30 per 1M characters.
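A synthesis sketch with boto3, using SSML to slow down a phone number; it uses the neural engine, which has broad SSML tag support (the voice, region, and number are illustrative):

```python
import boto3

polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Engine="neural",
    VoiceId="Joanna",
    OutputFormat="mp3",
    TextType="ssml",
    Text='<speak>Call us at <prosody rate="slow">'
         '<say-as interpret-as="telephone">555-0123</say-as></prosody>.</speak>',
)

with open("polly_sample.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```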
You can try Amazon Polly through the AWS Console. To do so, you’ll need to sign up and log in to your account, including entering your credit card details—even if you don’t plan to use Amazon’s paid services. Note that the Generative model may not be available in some regions.
Here’s how it sounds:
#4 Microsoft Azure AI Speech Neural
Microsoft Azure TTS is a part of the Microsoft Azure ecosystem that integrates seamlessly with other Azure AI services. Microsoft's TTS model is called Neural, and it has several versions: Standard, Custom, and HD. The latter is the most advanced.
It supports over 140 languages and locales. Users can create custom neural voices tailored to their brand or application needs. Developers can use SSML to customize pronunciation, intonation, and other speech characteristics.
Standard Neural voices start at $15 per 1M characters on the pay-as-you-go plan. Custom Neural Professional voices are priced higher at $24 per 1M characters, plus additional costs for model training and endpoint hosting.
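Synthesis through the Azure AI Speech Python SDK (pip install azure-cognitiveservices-speech) can be sketched like this; the key, region, and voice name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "Hi! I'm a Text-to-Speech model, and this is what my voice sounds like."
).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesized", len(result.audio_data), "bytes of audio")
```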
Microsoft's Voice Library is open and accessible even to guest users via this link. However, to try out the TTS functionality, you’ll need to sign in with an Azure account.
Examples of Featured voices are below:
Azure's DragonHD
Microsoft introduced Azure AI Speech’s Dragon HD Neural TTS in March 2025. This model is designed to deliver highly expressive, context-aware, and emotionally nuanced synthetic speech, making it particularly well-suited for applications such as conversational agents, podcasts, and multimedia content creation.
Dragon HD Neural TTS integrates large-scale language models (LLMs) to enhance contextual understanding, allowing the system to generate speech that accurately reflects the intended meaning and emotion of the input text.
The model includes advanced emotion detection by analyzing acoustic and linguistic signals, allowing synthesized speech to convey authentic emotional nuances. As of March 2025, 19 Dragon HD voices are available in total across languages and locations.
As of May 2025, Microsoft has not publicly disclosed specific pricing for Azure AI Speech's Dragon HD Neural TTS voices. These high-definition voices are available through Azure's Text-to-Speech service, which typically charges based on the number of characters synthesized; standard neural voices, for instance, are priced at around $15 per 1 million characters. The exact rates for Dragon HD voices may differ and are not specified in the available documentation.
#5 Google Text-to-Speech Studio
The Google TTS service has several models. Among them, the Studio model has the highest rating on the Artificial Analysis platform. Google does not provide specific latency figures for the model, but according to user feedback, it is around 500 ms.
This TTS supports over 380 voices across 50+ languages and variants. Users can create unique voices by recording samples, allowing brands to have a distinct voice across their customer interactions. The API supports SSML, enabling developers to control aspects like pitch, speed, volume, and pronunciation for more tailored speech outputs.
Google TTS offers a flexible pay-as-you-go pricing model. For the Studio model, charges start at $160 per 1M characters.
You can try out Google's TTS service here. To do so, you need to be logged into your Google account. On the playground, you can choose between only two voices—male or female—but you can customize how they sound on different devices, such as a car speaker.
#6 PlayHT Dialog
PlayHT provides two TTS models: PlayHT 3.0 and Dialog. Both can be used for AI voice agents, but PlayHT Dialog is designed specifically for conversational applications. It was released in February 2025.
PlayHT also offers the PlayHT 3.0 Mini model, optimized for lower latency and speech accuracy. Key developments since February include a partnership with Groq to run the Dialog model on Groq's LPUs, significantly boosting inference speed (claimed 215 chars/s vs 80 chars/s on GPU). They also partnered with LiveKit for real-time voice AI integration and launched the Play AI Studio, a unified platform incorporating multi-speaker podcast creation, advanced narration tools, and voice agent building capabilities.
Dialog works with 9 main languages and 23 additional ones. The model's latency is 300 ms, and its main advantage is highly natural, expressive, and fluid voices. There are more than 50 of them, and voice cloning functionality is also available.
Current plans include a Free tier, Creator ($31.20/mo billed yearly), Unlimited ($29/mo), and Enterprise (custom). A Professional plan at $99/month offers unlimited voice generation.
PlayHT has an open playground where even unregistered users can explore its features. However, you’ll need to log in to generate speech.
Here’s how the voices that the platform suggests trying first sound:
Comparison of top TTSs in 2025:
Provider and Model | ELO Score* | Languages | Cost (per 1M characters) | Latency |
---|---|---|---|---|
ElevenLabs Flash v2.5 | ~1108 | 32 | ~$60 (Flash) | 75 ms |
Cartesia Sonic-2, Sonic-Turbo | ~1106 | 15 | $37-40 | 100 ms |
Amazon Polly Generative | ~1063.9 | 34 | $30 | 100 ms
Azure AI Speech Neural Std, Neural HD | ~1057.2 | 140+ | $15 | 300 ms
Google Text-to-Speech Studio | ~1037.5 | 50+ | $160 | 500 ms
PlayHT Dialog, PlayHT 3.0 | ~1013 | 32 | $99/mo unlimited plan | 300–320 ms
*ELO Scores as listed on the Artificial Analysis platform in May 2025.
Choosing the right STT and TTS models for your project
In general, the trade-off looks like this: Google, Microsoft, and other large providers offer more stability, while smaller companies like ElevenLabs or Deepgram can offer more realistic voices in TTS and better speeds in STT.
Taras, CTO at Softcery, also shares his recommendations when choosing a model, depending on the use case:
- For projects within entertainment or interactivity/gaming, emotionality and realism in voice (TTS) are important.
- For commercial agents with expected high call volume, stability is important.
- For agents for appointments at a medical center, hair salon, golf club, etc., low WER in STT is important, as contact information is often dictated there, and it is critical to transcribe it correctly.
Testing your chosen models in real-world conditions is crucial for success. While providers often showcase performance in ideal settings, voice agents frequently encounter challenging scenarios. Users may speak quietly or have speech impairments, heavy accents can affect recognition accuracy, and background noise or poor connections can significantly impact performance. Multiple speakers talking simultaneously or interrupting each other also present unique challenges that may not be apparent during initial testing.
When planning for long-term success, consider how your chosen solution will scale alongside your project. This means evaluating providers' capabilities to handle sudden traffic spikes and support multi-region deployment without compromising performance. Additionally, assess their roadmap for adding new languages and voice options, as well as their ability to integrate with an expanding tech stack.