Choosing Speech to Text (STT/ASR) for AI Voice Agents in 2025: Accuracy. Latency. Cost

A concise guide to STT for AI voice agents in 2025—covering accuracy, latency, language support, and real-time performance.

This guide breaks down what actually matters when choosing an STT engine in 2025: latency under load, handling of overlapping speech, domain adaptation, speaker attribution, and how these factors impact your user experience and system performance. Whether you’re building an in-house stack or evaluating commercial providers, this decision shapes your entire voice interface. Choose accordingly.

Understanding Speech-to-Text Technology

Speech-to-Text (STT), also known as automatic speech recognition (ASR), is the critical first step that allows an AI voice agent to understand a user’s spoken input. In a typical voice AI pipeline, STT transcribes the user’s voice into text, which an LLM then interprets, and finally a TTS system speaks the reply.
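
To make the pipeline concrete, here is a minimal sketch of a single voice-agent turn. The transcribe, generate_reply, and synthesize functions are stand-ins for whatever STT, LLM, and TTS providers you actually use; they are not part of any specific SDK.

```python
# Minimal voice-agent turn: STT -> LLM -> TTS.
# The three helpers below are placeholders for real provider calls.

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder: call your STT provider here and return the transcript.
    return "what's the weather tomorrow?"

def generate_reply(user_text: str) -> str:
    # Placeholder: call your LLM here with the transcript as input.
    return f"You asked: {user_text}. Let me check."

def synthesize(reply_text: str) -> bytes:
    # Placeholder: call your TTS provider here and return audio bytes.
    return reply_text.encode("utf-8")

def handle_voice_turn(audio_bytes: bytes) -> bytes:
    user_text = transcribe(audio_bytes)     # STT: speech -> text
    reply_text = generate_reply(user_text)  # LLM: text -> response text
    return synthesize(reply_text)           # TTS: response text -> speech
```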

Modern STT models leverage advanced deep learning architectures, particularly transformer-based models. The conversion process involves several stages: audio preprocessing, feature extraction, and sequence modeling to transform acoustic signals into precise text output. 

Speech-to-Text technology still has room for improvement. Developers are focused on critical challenges such as maintaining accuracy in difficult acoustic environments, reliably recognizing multiple simultaneous speakers, and isolating target voices amid background noise. These are active development priorities for providers, so significant advances can be expected in the near future.

Key Criteria for Selecting an STT Solution

Accuracy and Recognition Capabilities

The primary concern for STT models is their ability to accurately transcribe spoken language. This is measured using Word Error Rate (WER), an industry-standard metric that expresses the percentage of transcription errors. Good models achieve 5–10% WER, meaning 90–95% of words are transcribed correctly.
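
As a quick illustration, WER is the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. A minimal, self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("book a table for two", "book table for you"))  # 0.4 -> 40% WER
```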

Key considerations also include accuracy rates for different accents, background noise handling, and specialized vocabulary recognition. For instance, a customer service application needs robust handling of various accents and casual speech patterns, while healthcare applications require precise medical terminology recognition.

The ability to distinguish the voices of different people can also be useful. This matters in scenarios where multiple speakers share the same audio channel.

Processing Speed and Latency

Real-time transcription capability is crucial for STT models, especially in interactive applications. A key performance metric is the Real-Time Factor (RTF): the ratio of processing time to audio duration, where values below 1 mean the system runs faster than real time.

An RTF of 1 means processing takes exactly as long as the audio itself. An RTF of 0.1 means the system needs only 10% of the audio's duration to transcribe it, which is considered excellent for real-time applications because it leaves headroom for rapid transcription and immediate feedback.
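
In code, RTF is simply processing time divided by audio duration. A quick timing sketch that works with any transcription call (the transcribe_file argument is a placeholder for your provider's function):

```python
import time

def measure_rtf(transcribe_file, audio_path: str, audio_duration_s: float) -> float:
    """RTF = processing time / audio duration; lower is faster (< 1 is faster than real time)."""
    start = time.perf_counter()
    transcribe_file(audio_path)              # placeholder for any STT call
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s

# Dummy example: a 0.1 s "model" on a 60 s clip gives an RTF of roughly 0.002.
rtf = measure_rtf(lambda path: time.sleep(0.1), "call.wav", audio_duration_s=60.0)
print(f"RTF = {rtf:.4f}")
```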

Also worth looking at:

  • First Response Latency: The time until the first transcription output appears (see the measurement sketch after this list). For optimal performance in real-time STT applications, aiming for latencies under 100 ms is ideal, while values up to 200–500 ms can be acceptable depending on the context. Anything above 1 second is generally considered too high for effective interaction.
  • Speech Completion Detection: Accuracy in determining when the user has finished speaking, which affects response time and the flow of the conversation.
  • Timestamp Accuracy: Accuracy in providing metadata such as the start and end time of each utterance.
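
One way to measure first response latency is to timestamp the moment the first audio chunk is sent and the moment the first non-empty partial transcript arrives. The sketch below assumes a hypothetical streaming client exposing send_chunk() and partial_results(); adapt it to your provider's actual WebSocket or streaming SDK.

```python
import time
from typing import Iterable

def first_response_latency_ms(client, audio_chunks: Iterable[bytes]) -> float:
    """Milliseconds from the first audio chunk sent to the first partial transcript.

    `client` is a hypothetical streaming STT client with send_chunk() and
    partial_results() methods -- substitute your provider's streaming API.
    """
    sent_at = time.perf_counter()
    for chunk in audio_chunks:
        client.send_chunk(chunk)             # stream audio as it is captured
        for partial in client.partial_results():
            if partial.strip():              # first non-empty partial transcript
                return (time.perf_counter() - sent_at) * 1000.0
    return float("inf")                      # no transcript ever arrived
```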

Scalability and Throughput

STT systems need to scale reliably when transcription demand increases - whether it's thousands of concurrent streams or long multi-hour files. Key factors to evaluate include:

  • Concurrent session limits
  • API rate limits
  • Batch vs. streaming capabilities

Additionally, throughput - how much audio (or how many tokens) the system can process per second - can be as important as latency, especially in call center and voice analytics pipelines where large amounts of audio must be processed daily.
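
When load-testing throughput, a common pattern is to cap concurrency at the provider's documented session limit and measure how much audio clears the pipeline per wall-clock minute. A sketch using asyncio; the transcribe_async coroutine is a placeholder for your provider's async call.

```python
import asyncio

async def transcribe_async(path: str) -> str:
    # Placeholder: replace with your provider's async transcription call.
    await asyncio.sleep(0.5)
    return f"transcript of {path}"

async def transcribe_many(paths: list[str], max_concurrent: int = 50) -> list[str]:
    """Process a backlog of files while respecting a concurrent-session limit."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path: str) -> str:
        async with semaphore:                # never exceed the session/rate limit
            return await transcribe_async(path)

    return await asyncio.gather(*(bounded(p) for p in paths))

# asyncio.run(transcribe_many([f"call_{i}.wav" for i in range(200)]))
```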

Audio Input Requirements and Noise Handling

Evaluate the model's ability to handle different audio qualities, microphone types, and background environments. Some models perform better with high-quality audio input, while others are more forgiving of variable conditions.

A critical feature for voice agents in public spaces, particularly for phone-based systems, is the ability to effectively filter and isolate target voices from background noise and other speakers.

Noise Handling Capabilities:

  • Some models include built-in noise suppression or have been trained on noisy, real-world datasets.
  • Others rely entirely on the user to pre-process the audio using denoising tools or signal enhancement.

Also worth looking at:

  • Signal-to-Noise Ratio (SNR) Tolerance: How well the model handles low-volume or noisy recordings. 
  • Background Speech Filtering: The model’s ability to ignore background talk or overlapping voices. Few models handle this well without diarization.
  • Robust Accent Handling: Noise often affects non-native accents more severely. Some multilingual models compensate better than others.
  • Audio Preprocessing Requirements: Open-source models like Whisper or Wav2Vec may require you to normalize, filter, and reformat audio manually, while API providers handle this behind the scenes.

Poor input quality leads to higher Word Error Rates regardless of model. If transcription quality is critical, prioritize clean audio capture and preprocessing steps.
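
A simple, widely used preprocessing step is to downmix to mono and resample to 16 kHz 16-bit PCM before sending audio to the model. The sketch below shells out to ffmpeg (assumes ffmpeg is installed and on PATH); the 16 kHz mono target is a common convention, not a requirement of any particular provider.

```python
import subprocess

def to_16khz_mono_wav(src_path: str, dst_path: str) -> None:
    """Convert any input audio to 16 kHz, mono, 16-bit PCM WAV - a format
    most STT models and APIs accept without further conversion."""
    subprocess.run(
        [
            "ffmpeg", "-y",        # overwrite the output file if it exists
            "-i", src_path,        # input file (mp3, m4a, opus, ...)
            "-ac", "1",            # downmix to mono
            "-ar", "16000",        # resample to 16 kHz
            "-sample_fmt", "s16",  # 16-bit PCM samples
            dst_path,
        ],
        check=True,
    )

# to_16khz_mono_wav("raw_call.m4a", "call_16k.wav")
```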

Features and Integration Capabilities

STT models vary in functionality. Beyond transcription, evaluate what’s built-in and what requires post-processing or external tools.

Key features to consider:

  • Speaker Diarization: Identifies and separates speakers in multi-voice audio.
  • Punctuation and Formatting: Inserts sentence boundaries, capitalization, and paragraph breaks.
  • Timestamps: Provides word- or utterance-level timing data.
  • Custom Vocabulary / Prompts: Helps models recognize brand names, acronyms, or jargon.
  • Language Detection and Translation: Auto-detects spoken language or translates into another.
  • Summarization / Analytics: Available from select vendors for structured insights.

On integration:

  • Most providers offer REST APIs, Python/JS SDKs, and WebSocket streaming.
  • Batch and real-time options vary by vendor.
  • Open-source models require manual setup but offer full control.
  • Look for Docker containers, on-premise deployment, and enterprise SLAs if needed.

Choose models with features that match your use case. Avoid overpaying for extras you won’t use.

Costs

STT model pricing is primarily based on per-minute transcription, but actual costs vary depending on speed, features, and deployment type.

Most providers charge $0.006–$0.02 per minute. Streaming transcription is typically more expensive than batch due to compute demands and low-latency requirements.

Also important:

  • Model Quality: More accurate models with lower WER and better formatting usually cost more. Basic tiers are cheaper but less reliable for complex speech.
  • Real-Time vs. Batch: Streaming costs more. Batch processing is cheaper and better suited for non-interactive use cases.
  • Language and Translation: Transcription in non-English languages or with translation adds cost. Auto language detection may also be priced separately.
  • Deployment: Cloud APIs are easy to use but have recurring costs. Open-source models (e.g. Whisper-Medusa, Wav2Vec 2.0) are free, but require infrastructure and maintenance.
  • Usage Limits: Some providers offer free monthly quotas or discounted volume tiers. Exceeding limits triggers overage charges.

To control costs, match the model tier and feature set to your actual application needs - and monitor usage patterns closely.
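
A rough monthly cost model is often enough to compare providers before running a pilot. Here is a small sketch with illustrative numbers; swap in real per-minute rates from the comparison table further down.

```python
def monthly_stt_cost(minutes_per_day: float, rate_per_minute: float,
                     free_minutes_per_month: float = 0.0, days: int = 30) -> float:
    """Estimate monthly STT spend from daily audio volume and a per-minute rate."""
    billable = max(0.0, minutes_per_day * days - free_minutes_per_month)
    return billable * rate_per_minute

# Example: 2,000 minutes/day of streaming at $0.0077/min is about $462/month.
print(f"${monthly_stt_cost(2000, 0.0077):,.2f}")
```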

Use our AI Voice Agent Cost Calculator to compare models, see per-minute costs, and calculate total cost of ownership based on your expected usage.

Best Speech-to-Text (Automatic Speech Recognition) models & providers in 2025

This list covers the providers and models most commonly used for ASR tasks. To keep the article from becoming endless, it is limited to a selection of the most notable STT engines.

The main indicator used to compare the models below is WER (Word Error Rate). As for speed metrics, not all providers publish these values. Be careful with WER too: it can vary depending on the dataset, language, domain, and so on. Moreover, WER is usually measured on pre-recorded audio rather than a live stream, so quality in voice agents will usually be somewhat worse.

#1 OpenAI gpt-4o-transcribe & gpt-4o-mini-transcribe

OpenAI introduced two newer ASR (automatic speech recognition) models under the GPT-4o architecture in March 2025: gpt-4o-transcribe and gpt-4o-mini-transcribe, offering a step beyond Whisper in both speed and quality for real-time transcription tasks.

These models belong to the GPT-4o family, OpenAI's first fully multimodal models capable of handling text, vision, and audio input/output. Unlike Whisper, which is audio-only and open-source, the GPT-4o transcription models are not open-sourced and are only accessible through OpenAI's API or integrated products like ChatGPT. The new flagship models available via API are gpt-4o-transcribe and the lower-cost gpt-4o-mini-transcribe. They are explicitly positioned as the next generation, with improved accuracy and reliability compared to the Whisper family (v2 and v3) on various benchmarks.

The Word Error Rate (WER) for English is reported to be below 5%, showing improvements over Whisper, especially in live or overlapping speech. The models are highly optimized for real-time streaming, capable of producing partial transcriptions in milliseconds. Both variants are capable of multi-language transcription, but gpt-4o-transcribe is tuned more for robustness and multilingual capability, while gpt-4o-mini-transcribe is designed to be lighter, faster, and cheaper - best suited for mobile or edge applications.

gpt-4o-transcribe is priced at $0.006 per minute or based on token usage ($2.50 per 1M input tokens, $10.00 per 1M output tokens). gpt-4o-mini-transcribe is cheaper at $0.003 per minute or token-based ($1.25 per 1M input tokens, $5.00 per 1M output tokens).
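
Usage is essentially a drop-in replacement for Whisper API calls. The sketch below uses the official OpenAI Python SDK's audio transcription endpoint; model names and response fields are as documented at the time of writing, so verify them against the current API reference.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("support_call.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",   # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```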

Gpt-4o-mini-tts

The gpt-4o-mini-tts model is part of OpenAI’s new generation of Text-to-Speech (TTS) systems designed to complement the GPT-4o ecosystem - enabling ultra-low-latency, high-quality voice synthesis for real-time AI voice agents and assistants.

Unlike Whisper (STT) or traditional modular TTS pipelines like Tacotron + WaveNet, gpt-4o-mini-tts is built with efficiency and speed as core priorities, making it ideal for use cases where fast, responsive voice output is critical - such as AI phone agents, voice-enabled apps, and conversational AI interfaces.

As with gpt-4o-mini-transcribe, the TTS model is not open-source and is only accessible through OpenAI’s API.

It is part of GPT-4o’s audio capabilities, and pricing is tied to audio output usage. OpenAI has not released standalone pricing, but typical voice output rates range around $0.015–$0.03 per minute depending on fidelity and usage tier.

#2 Gladia AI Solaria STT

Gladia AI launched its Solaria STT model in April 2025. Positioned for enterprise use, particularly call centers and voice platforms, Solaria claims industry-leading performance with broad multilingual support (100 languages, including 42 it says are not served by other vendors) and high accuracy.

Solaria is engineered to provide native-level transcription accuracy across a vast array of languages, including those less commonly supported by other platforms. It achieves a Word Accuracy Rate (WAR) of 94% in languages such as English, Spanish, and French, while maintaining an ultra-low latency of 270 milliseconds, ensuring natural and responsive conversations.

Their pricing includes a free tier and a pro tier around $0.612/hour for batch transcription.

#3 AssemblyAI Universal-2

AssemblyAI has one of the most recent updates on the STT market: Universal-2 came out in November 2024 as an improved version of the Universal-1 model. It comes in two options: Best and Nano. Nano is a lightweight option that supports over 102 languages, while Best works with 20.

Current AssemblyAI benchmarks claim a 6.6% WER for Universal-2 in English. The company emphasizes that Universal-2 shows significant improvements in formatted WER (F-WER) and reduces hallucination rates. Reviewers note that Universal-2 copes especially well with medical and sales domains. Universal-2 employs an all-neural architecture for text formatting, significantly improving the readability of transcripts, including context-aware punctuation and casing.

AssemblyAI offers a pay-as-you-go pricing model starting at $0.37 per hour (approximately $0.0062 per minute). The much lower rate of $0.12 per hour ($0.002 per minute) applies to their Nano model, designed for cost-effectiveness and broad language support. Additional costs may apply for advanced features like speaker detection or sentiment analysis.

#4 Deepgram Nova-3

Nova-3 is Deepgram's latest model, released in February 2025 as an upgraded version of its proprietary Nova-2 STT model. Deepgram has also expanded its portfolio with specialized versions like Nova-3 Medical, launched in March 2025 and targeting healthcare use cases. The company is also promoting its enterprise runtime platform (DER), the Aura-2 TTS model, and advancements toward a full Speech-to-Speech (STS) architecture, indicating a broader strategic focus.

The model achieves one of the best WERs on the market: 6.84% (average across all domains). This figure applies to streaming audio; for batch (pre-recorded) audio it is even lower: 5.26%. The specialized Nova-3 Medical model reports a median WER of 3.45% and a keyword error rate of 6.79%.

Nova-3 supports 36 languages and dialects and can switch between recognizing 10 different languages in real time. This means that if an English-speaking speaker throws in a couple of Spanish words, the model will interpret them correctly.

Deepgram's STT model costs $0.0077 per minute for streaming audio for users who opt into Deepgram's Model Improvement Program. Lower rates are available at higher volume tiers (Growth plan: $0.0065/min for Nova-3 English). Users who do not opt into the improvement program pay higher rates (e.g., $0.0092/min for Nova-3 Multilingual). The specialized Nova-3 Medical model also starts at the $0.0077/min rate.
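
As a hedged sketch, Nova-3 can be called through Deepgram's pre-recorded REST endpoint by selecting the model with a query parameter. The endpoint, parameters, and response shape below follow Deepgram's public documentation at the time of writing; double-check them against the current API reference.

```python
import requests

DEEPGRAM_API_KEY = "YOUR_API_KEY"

with open("call.wav", "rb") as f:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "smart_format": "true"},
        headers={
            "Authorization": f"Token {DEEPGRAM_API_KEY}",
            "Content-Type": "audio/wav",
        },
        data=f,
    )

result = response.json()
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```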

#5 AssemblyAI Slam-1

Slam-1 is AssemblyAI's latest advancement in speech recognition technology, introduced in April 2025. Slam-1 is a new Speech Language Model that combines an LLM architecture with ASR encoders for superior speech-to-text transcription. The model delivers high accuracy through its understanding of context and semantic meaning.

Slam-1 introduces prompt-based customization, allowing users to provide a list of up to 1,000 domain-specific terms or phrases (each up to six words) via the keyterms_prompt parameter. This enables the model to better recognize and transcribe specialized terminology by understanding its semantic context. Slam-1 maintains an average WER of 7% across diverse datasets, matching the industry-leading accuracy of the Universal model.
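
Here is a hedged sketch of passing domain terms to Slam-1 through AssemblyAI's standard transcript endpoint. The speech_model and keyterms_prompt fields follow AssemblyAI's published documentation at the time of writing; confirm field names and values against the current API reference.

```python
import time
import requests

API_KEY = "YOUR_ASSEMBLYAI_KEY"
headers = {"authorization": API_KEY}

# Request a transcript with domain-specific key terms.
job = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": "https://example.com/cardiology_consult.mp3",
        "speech_model": "slam-1",
        "keyterms_prompt": ["myocardial infarction", "statin", "ejection fraction"],
    },
).json()

# Poll until the transcript is ready.
while True:
    result = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{job['id']}", headers=headers
    ).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result.get("text"))
```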

Slam-1 is currently in public beta and accessible through AssemblyAI's standard API endpoint. Priced at $0.37 per hour, identical to the Universal model, with volume discounts available for large workloads.

#6 OpenAI Whisper V3

The STT model from OpenAI, Whisper, was first introduced in September 2022 and has since been updated twice, in December 2022 (V2) and November 2023 (V3). The model is available in five variants: from tiny to large.

Whisper was designed to be as versatile as possible, working with 99 languages worldwide. But because of this, it is difficult to evaluate its effectiveness in each specific case. For instance, the WER for English can be as low as 5–6%, while for Scandinavian languages, it ranges from 8–10%. The average WER reported for Whisper is approximately 10.6%.

The model is effective in real-world scenarios, particularly in environments with background noise or heavy accents. However, there can be limitations regarding speed and operational complexity, especially when dealing with large volumes of audio data. 

OpenAI offers Whisper as an open-source model, meaning it is free to run yourself. For those using the hosted Whisper API, pricing varies based on usage and implementation, starting from $0.006 per minute.
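
Running the open-source model locally takes a few lines with the openai-whisper package; model sizes and the transcribe call below follow the project's README.

```python
import whisper

# Model sizes range from "tiny" to "large-v3"; larger models are slower but more accurate.
model = whisper.load_model("large-v3")

result = model.transcribe("meeting.mp3", language="en")
print(result["text"])
```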

Whisper-Medusa (aiOla)

Whisper-Medusa is an optimized version of OpenAI’s Whisper large-v3 model, developed by aiOla and released as an open-source project in early 2025. Medusa introduces a novel multi-token prediction architecture (inspired by transformer decoders like Medusa-Linear), designed to significantly improve inference speed without major sacrifices in accuracy.

The model utilizes a “multi-head” decoder that predicts multiple tokens per step, rather than the standard one-token-at-a-time decoding used in Whisper. This architectural innovation results in up to 50% faster transcription, depending on configuration, while maintaining Whisper-level performance. For example, the 10-head Medusa variant delivers a WER of 4.11% on LibriSpeech Test-Clean, only slightly higher than Whisper’s 4.0%, but with nearly 1.5× speedup. A more balanced configuration—5-head Medusa—achieves an even lower WER of 3.64% and a 1.4× speed gain.

Whisper-Medusa is trained and optimized primarily for English transcription, and while it inherits multilingual capabilities from Whisper, it is not tuned for wide multilingual robustness. The model is well-suited for English-heavy use cases in media, voice analytics, and post-call processing.

Medusa is fully open-source, available under the MIT license, with code and checkpoints hosted on Hugging Face and GitHub. It can be deployed locally or integrated into any pipeline using standard Whisper-compatible frameworks. Latency is not explicitly quantified in milliseconds but is functionally improved through throughput acceleration of 30–50%, making it a viable Whisper alternative for faster offline transcription.

#7 Wav2Vec 2.0 (Meta)

Wav2Vec 2.0, developed by Meta AI (formerly Facebook AI), is a foundational self-supervised ASR model known for its versatility and strong performance in both research and production environments. Unlike traditional supervised models, it learns from raw audio and only requires labeled data for fine-tuning, making it ideal for low-resource languages and domains.

The model achieves 6–8% WER in English, with even better results (~1.8%) possible when fine-tuned on specific datasets like LibriSpeech. It supports 53+ languages, and multilingual training is extended through variants like XLS-R, which was trained on 128 languages.

Wav2Vec 2.0 is entirely open-source, available under a permissive license, and widely supported by the research community. It offers excellent accuracy but is not optimized for real-time transcription. Latency is typically around 700 ms, depending on the deployment setup and audio length, making it more suited to batch use cases.
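
A short batch-inference sketch with the Hugging Face transformers checkpoint facebook/wav2vec2-base-960h (a commonly used English fine-tune); the processor and CTC-decoding flow follow the model card.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Wav2Vec 2.0 expects 16 kHz mono input.
waveform, sample_rate = torchaudio.load("sample.wav")
waveform = waveform.mean(dim=0)  # downmix to mono
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```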

Its flexibility, high accuracy, and adaptability for fine-tuning make Wav2Vec 2.0 a top choice for custom pipelines, academic research, and applications where on-device processing or domain specialization is required—especially where streaming latency is not a primary concern.

#8 Speechmatics Ursa 2

The latest and most advanced STT model from Speechmatics is called Ursa 2. It was released in October 2024. The update added new languages (e.g. Arabic dialects), expanding the list to 50. The new version also improved the accuracy and speed of the model. In certain languages, such as Spanish and Polish, Ursa 2 is the market leader with 3.3% and 4.4% WER respectively.

Many users also highlight Ursa's superior handling of diverse accents and noisy environments. That said, the average WER for Ursa 2 across languages is 8.6%, and Speechmatics' own documentation shows an average WER across 21 datasets of 11.96% for Ursa Enhanced and 13.89% for Ursa Standard.

Speechmatics operates on a subscription-based pricing model. The minimum price for Speechmatics' Speech-to-Text services is approximately $0.0133 per minute for the Standard model in batch transcription, with higher rates for enhanced accuracy and real-time options. Real-time transcription starts at $1.04 per hour, which equals approximately $0.0173 per minute.

  • Batch Standard: $0.80/hour (~$0.0133/min)
  • Batch Enhanced: $1.04/hour (~$0.0173/min)
  • Real-Time Standard: $1.04/hour (~$0.0173/min)
  • Real-Time Enhanced: $1.35/hour (~$0.0225/min)

A free tier offering 8 hours per month is also available.

#9 Google Speech-to-Text Chirp

Google's Speech-to-Text service is part of the vast Google Cloud infrastructure. Speech recognition is handled by USM (Universal Speech Model), which is not a single model but a whole family. The most advanced model in this family is Chirp, which covers more than 125 languages, while USM as a whole works with 309.

Being part of Google means regular updates, the ability to train models on huge amounts of data, and tight integration with the rest of the Google Cloud ecosystem. Google Speech-to-Text also has a good WER, but as with Whisper, it is highly language-dependent. The average WER is 8.5%.

The first 60 minutes per month are free. After that, pricing starts at approximately $0.016 to $0.024 per minute, depending on the API version and data-logging options (see the breakdown below).

Chirp 2

Chirp 2 is positioned as the latest generation, offering significant improvements in accuracy and speed over the original Chirp, along with expanded features like word-level timestamps, enhanced model adaptation, speech translation, and support for streaming recognition (real-time). Google has also released specialized models like chirp_telephony and continues broader AI advancements with Gemini models (which can also process speech), new TPUs, and integrated media capabilities in Vertex AI.

WERs on standard datasets like LibriSpeech or CommonVoice range from around 6% to 11% or higher, while other tests have reported WERs from 16% to over 20%. A benchmark from Artificial Analysis specifically lists Chirp 2 with a WER of 9.8% on Common Voice v16.1. AssemblyAI benchmarks place Google's WER higher than their own Universal-2 model.

  • V1 API (Pay-as-you-go, after 60 free mins/month): $0.016/min (with data logging) or $0.024/min (without data logging).
  • V2 API (Pay-as-you-go): Uses tiered pricing for standard models (including Chirp/Chirp 2). For the first 500,000 minutes/month, the rate is $0.016/min (non-logged) or $0.012/min (logged). These rates decrease with higher volume, down to $0.004/min (non-logged) or $0.003/min (logged) for usage above 2 million minutes/month.
  • V2 API Dynamic Batch: Offers lower rates for non-urgent batch processing: $0.003/min (non-logged) or $0.00225/min (logged).

Comparison of top STTs in 2025

| Provider and Model | WER* | Languages | Cost for streaming audio (per minute) | Latency (real-time) |
| --- | --- | --- | --- | --- |
| OpenAI gpt-4o-transcribe / mini (API) | ~5% | 50 | $0.006 / $0.003 | 320 ms |
| Gladia AI Solaria | N/A (94% WAR) | 100 (incl. 42 underserved) | ~$0.0102 (Batch) / ~$0.0126 (Live) / Free tier (10h/mo) | 270 ms |
| AssemblyAI Universal-2 | ~6.6% | 102 | $0.0062 | ~300–600 ms |
| Deepgram Nova-3 / Nova-3 Medical | ~6.8% | 36 | $0.0077 (Eng) / $0.0092 (Multi) | <300 ms |
| AssemblyAI Slam-1 | ~7% | 1 (EN only) | $0.37/hr | ~500–800 ms |
| Whisper-Medusa (aiOla) | ~5% | 40+ | Free (Open Source) | ~500 ms |
| OpenAI Whisper V3 | ~5–6% EN (avg 10.6%) | 99 | Free (OSS) / ~$0.006 via API | ~700–1000 ms |
| Wav2Vec 2.0 (Meta) | ~6–8% (EN) | 53+ | Free | ~700 ms |
| Speechmatics Ursa 2 | ~12–14% | 50 | $0.0173 | <1 s |
| Google Cloud Chirp 2 | ~9.8% | 102 | $0.016 (non-logged) / $0.012 (logged) | N/A |

*Average WER values officially declared by providers are indicated. The values may vary depending on the dataset in which the model is tested.

Key Takeaways for Choosing STT in 2025

  • Accuracy Still Leads. Choose models with <5% WER for English. Prioritize context-aware or domain-tuned models (e.g., Slam-1, Whisper-Medusa) if your use case involves accents, noise, or industry jargon.
  • Latency Depends on Use Case. For real-time apps, aim for <300 ms latency. GPT-4o-transcribe and Gladia Solaria are top performers here. Batch processing can tolerate higher delays if interactivity isn’t required.
  • Streaming vs. Batch Pricing. Streaming costs more due to compute demand. Use batch transcription when low latency isn’t needed to reduce costs.
  • Evaluate Language Coverage. English is covered well across all models. For multilingual apps, look for wide support (e.g., Gladia: 100 languages, Google Chirp: 125+). For local/offline use, open-source models like Whisper-Medusa still perform well.
  • Match Features to Your Needs. Don’t pay for unused extras. If you don’t need speaker diarization or timestamps, skip models that bundle them into premium pricing. Custom vocabulary, partials, and analytics are only necessary in specific workflows.
  • Scalability and SLAs Matter for Production. If reliability, uptime, and support are critical, go with enterprise-ready APIs like AssemblyAI, Gladia, or Deepgram. For experimental or internal tools, open-source may be enough.
  • Use Free Tiers to Test Before Committing. Most providers offer 5–10 hours/month for free. Use this to validate performance on your actual audio samples - not just benchmarks.