How to Choose STT & TTS for AI Voice Agents in 2025: A Comprehensive Guide
Learn how to choose Speech-to-Text and Text-to-Speech technologies for building a voice agent tailored to your use case. We break down key metrics and the most popular models.

Speech-to-Text (STT) and Text-to-Speech (TTS) technologies, combined with Large Language Models (LLM), are the backbone of modern AI voice agents. These technologies work together to transform spoken language into meaningful interactions, making digital communication more natural and intuitive.
While direct Speech-to-Speech solutions are emerging, the current STT→LLM→TTS approach remains the most flexible. This method allows businesses to easily switch language models based on task complexity, providing greater adaptability.
But how do you choose the STT and TTS that best suit your purposes? In this article, we will look at what criteria you should pay attention to, as well as what characteristics the most popular models have today.
Understanding STT and TTS technologies
The voice agent interaction cycle works simply:
- STT captures voice input and converts it to text.
- An LLM generates an appropriate response.
- TTS converts the response back into natural-sounding speech.
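In code, one turn of this cycle can be sketched roughly as follows. This is a minimal sketch, not a production pattern: stt, llm, and tts are hypothetical stand-ins for whichever providers you choose, and real integrations are streaming and asynchronous.

```python
def handle_turn(audio_chunk: bytes, stt, llm, tts) -> bytes:
    user_text = stt.transcribe(audio_chunk)   # 1. STT: speech -> text
    reply_text = llm.generate(user_text)      # 2. LLM: text -> response
    return tts.synthesize(reply_text)         # 3. TTS: response -> speech audio
```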
Modern STT models leverage advanced deep learning architectures, particularly transformer-based models. The conversion process involves several stages: audio preprocessing, feature extraction, and sequence modeling to transform acoustic signals into precise text output.
Similarly, contemporary TTS systems employ neural networks in a two-step process. The system first converts text into spectrograms (visual representations of sound) and then transforms these spectrograms into audio waveforms that closely mimic human speech patterns.
Text-to-Speech (TTS) models have reached a mature stage, requiring only “polishing”: addressing minor bugs, reducing costs, optimizing device compatibility, and enhancing overall stability.
Speech-to-Text (STT) technologies still have room for improvement. Developers focus on critical challenges like maintaining accuracy in challenging environments, reliably recognizing multiple simultaneous speakers, and isolating target voices amid background noise. Encouragingly, these are active development priorities for technology providers, promising significant advancements in the near future.
What to look for when choosing speech technologies
When evaluating speech technologies for your business, it's important to understand that STT and TTS models serve different purposes and therefore require different evaluation approaches.
To help evaluate these criteria objectively, several industry-standard metrics can guide your decision-making process. Let's look at each criterion and its associated measurements.
Speech-to-Text (STT) selection criteria
Accuracy and recognition capabilities
The primary concern for STT models is their ability to accurately transcribe spoken language. This can be measured using Word Error Rate (WER), an industry standard metric showing the percentage of transcription errors. Good models achieve 5–10% WER, meaning the accuracy is 90–95%.
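To make this concrete, here is a small sketch computing WER with the open-source jiwer library (one option among several); the sample strings are invented for illustration:

```python
# WER = (substitutions + deletions + insertions) / number of words in the reference.
from jiwer import wer

reference = "please call me back at five five five one two three four"
hypothesis = "please call me back at five five one two three four"

print(f"WER: {wer(reference, hypothesis):.2%}")  # one deleted word out of 12 -> ~8%
```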
Key considerations also include accuracy rates for different accents, background noise handling, and specialized vocabulary recognition. For instance, a customer service application needs robust handling of various accents and casual speech patterns, while healthcare applications require precise medical terminology recognition.
The ability to distinguish the voices of different people (speaker diarization) can also be useful. This is relevant for scenarios where audio from many speakers arrives on the same audio channel.
Processing speed and latency
Real-time transcription capabilities are crucial for STT models, especially in interactive applications. A key performance metric is the Real-Time Factor (RTF): the ratio of processing time to audio duration, where lower values mean faster processing.
An RTF of 1 means that the processing time exactly equals the duration of the audio. An RTF of 0.1 means the system needs only 10% of the audio's duration to transcribe it, which is considered excellent for real-time applications, as it allows for rapid transcription and immediate feedback.
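Measuring RTF yourself is straightforward; here is a minimal sketch where transcribe is a hypothetical stand-in for any batch transcription function you want to benchmark:

```python
import time

def measure_rtf(transcribe, audio_path: str, audio_duration_s: float) -> float:
    start = time.perf_counter()
    transcribe(audio_path)                                    # run the model
    return (time.perf_counter() - start) / audio_duration_s  # < 1.0 beats real time
```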
Also worth looking at:
- First Response Latency: The time until the first partial transcript is produced (see the measurement sketch after this list). For optimal performance in real-time STT applications, aiming for latencies under 100 ms is ideal, while values up to 200–500 ms can be acceptable depending on the context. Anything above 1 second is generally considered too high for effective interaction.
- Speech Completion Detection: Accuracy in determining when the user has finished speaking, which affects response time and the flow of the conversation.
- Timestamp Accuracy: Accuracy in providing metadata such as the start and end time of each utterance.
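As referenced above, first response latency can be measured with a few lines of timing code; stream_transcripts is a hypothetical stand-in for your provider's streaming API:

```python
import time

def first_response_latency(stream_transcripts, audio_path: str) -> float:
    start = time.perf_counter()
    for _partial in stream_transcripts(audio_path):  # yields partial transcripts
        return time.perf_counter() - start           # seconds until first partial
    raise RuntimeError("stream produced no transcripts")
```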
Audio input requirements
Evaluate the model's ability to handle different audio qualities, microphone types, and background environments. Some models perform better with high-quality audio input, while others are more forgiving of variable conditions.
A critical feature for voice agents in public spaces, particularly for phone-based systems, is the ability to effectively filter and isolate target voices from background noise and other speakers.
Text-to-Speech (TTS) selection criteria
Voice quality and naturalness
The primary consideration for TTS is the naturalness of the generated speech. Consider whether the voices sound robotic or human-like, and how well they maintain consistency across longer passages. Special attention should be paid to the model's ability to maintain consistent sound quality when processing incomplete or partial text fragments.
The model should also handle dictated formats robustly, accurately pronouncing phone numbers, email addresses, and confirmation codes to ensure clear and precise communication of critical information.
There is no common quality metric for Text-to-Speech services comparable to WER for STT models. However, AI model research platforms may have custom metrics, and you can start evaluating models from there.
For example, the Artificial Analysis platform collects user responses about the quality of popular TTS models and calculates an ELO score based on them. At the beginning of 2025, the best-performing TTS models had ELO scores around 1000–1100, while the rest of the models on the platform's leaderboard scored around 850.
Voice customization options
Evaluate the available voice options and customization capabilities. Some businesses need multiple voices for different purposes, while others require brand-specific voice creation.
Consider the model's ability to adjust speaking rate, pitch, and emphasis. For example, higher pitches often convey friendliness and approachability, while lower pitches project authority and seriousness.
Advanced systems may also offer control over specific voice characteristics:
- Assertiveness: Controls the firmness of voice delivery.
- Confidence: Affects how assured the voice sounds.
- Smoothness: Adjusts between smooth and staccato delivery.
- Relaxedness: Modifies tension in the voice.
Fine-grained control over voice generation, allowing developers to adjust intonation and pronunciation for specific words or phrases, is enabled by Speech Synthesis Markup Language (SSML), as in the sketch below.
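Here is a hedged SSML sketch that slows down and spells out a confirmation code; exact tag support varies by provider, so check their SSML reference before relying on it:

```python
ssml = (
    "<speak>"
    "Your confirmation code is "
    '<prosody rate="slow">'
    '<say-as interpret-as="characters">A1B2C3</say-as>'
    "</prosody>."
    "</speak>"
)
print(ssml)  # pass this string to your TTS API as SSML input
```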
Language support
Support for multiple languages and regional variations ensures voice agents can serve diverse audiences. This includes:
- Language selection
- Regional accent configuration
- Dialect-specific adjustments
Some models (e.g. cartesia-english) are configured for a single language, and if you need to switch to another language during the call, problems arise. These problems can be hard to solve, because there is no real-time update of the call configuration.
Common criteria for both technologies
Both technologies need evaluation of their pricing models, including per-usage costs or subscription fees, scaling costs with increased usage, and additional fees for premium features or customization.
Integration and technical requirements are also essential. Check the following points:
- API documentation quality and ease of use
- Development resources required
- Compatibility with existing systems
- Deployment options (cloud, on-premise, hybrid)
- Service stability and guaranteed uptime
- Update frequency and maintenance schedule
In addition, both technologies must meet your security standards, such as data handling and privacy practices, regulatory compliance capabilities, encryption standards, and access control features. Also, consider the quality and responsiveness of technical support.
A common pitfall when selecting STT/TTS models is prioritizing agent response speed or cost over consistent quality across different environments without proper justification. In practice, conversations sometimes require giving users more time to think, and it can actually be necessary to slow down the agent's response rate.
Misplaced priorities in testing approaches can lead to situations where a voice agent performs brilliantly in controlled environments like homes or offices but struggles in noisy locations or with poor audio quality.
Best Speech-to-Text (Automatic Speech Recognition) models & providers in 2025
This list covers the providers and models most commonly used for ASR tasks. To keep the article from becoming endless, it is limited to the most popular options.
The main indicator by which the models below are compared is WER (Word Error Rate). As for speed metrics, not all providers publish these values. Be careful with WER too: it can vary depending on the dataset, language, domain, and so on. Moreover, WER is usually measured on pre-recorded audio rather than a live stream, so quality in a voice agent will usually be somewhat worse.
#1 OpenAI gpt-4o-transcribe & gpt-4o-mini-transcribe
OpenAI introduced two newer ASR (automatic speech recognition) models under the GPT-4o architecture in March 2025: gpt-4o-transcribe and gpt-4o-mini-transcribe, offering a step beyond Whisper in both speed and quality for real-time transcription tasks.
These models are part of the GPT-4o family, OpenAI's first fully multimodal models capable of handling text, vision, and audio input/output. Unlike Whisper, which is audio-only and open-source, the GPT-4o transcription models are not open-sourced and are only accessible through OpenAI's API or integrated products like ChatGPT. The new flagship models available via the API are gpt-4o-transcribe and the lower-cost gpt-4o-mini-transcribe. These models are explicitly positioned as the next generation, demonstrating improved accuracy and reliability compared to the Whisper family (v2 and v3) on various benchmarks.
The Word Error Rate (WER) for English is reported to be below 5%, showing improvements over Whisper, especially in live or overlapping speech. The models are highly optimized for real-time streaming, capable of producing partial transcriptions in milliseconds. Both variants are capable of multi-language transcription, but gpt-4o-transcribe is tuned more for robustness and multilingual capability, while gpt-4o-mini-transcribe is designed to be lighter, faster, and cheaper — best suited for mobile or edge applications.
gpt-4o-transcribe is priced at $0.006 per minute or based on token usage ($2.50 per 1M input tokens, $10.00 per 1M output tokens). gpt-4o-mini-transcribe is cheaper at $0.003 per minute or token-based ($1.25 per 1M input tokens, $5.00 per 1M output tokens).
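A minimal transcription call with the official OpenAI Python SDK looks like this; the file name is a placeholder, and the SDK reads OPENAI_API_KEY from the environment:

```python
from openai import OpenAI

client = OpenAI()

with open("call_recording.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # or "gpt-4o-mini-transcribe" for lower cost
        file=audio_file,
    )

print(transcript.text)
```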
gpt-4o-mini-tts
The gpt-4o-mini-tts model is part of OpenAI’s new generation of Text-to-Speech (TTS) systems designed to complement the GPT-4o ecosystem — enabling ultra-low-latency, high-quality voice synthesis for real-time AI voice agents and assistants.
Unlike Whisper (STT) or traditional modular TTS pipelines like Tacotron + WaveNet, gpt-4o-mini-tts is built with efficiency and speed as core priorities, making it ideal for use cases where fast, responsive voice output is critical — such as AI phone agents, voice-enabled apps, and conversational AI interfaces.
As with gpt-4o-mini-transcribe, the TTS model is not open-source and is only accessible through OpenAI’s API.
It is part of GPT-4o’s audio capabilities, and pricing is tied to audio output usage. OpenAI has not released standalone pricing, but typical voice output rates range around $0.015–$0.03 per minute depending on fidelity and usage tier.
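Synthesis with the same SDK is equally compact; this sketch streams the generated audio straight to an MP3 file (voice and text are illustrative):

```python
from openai import OpenAI

client = OpenAI()

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input="Your appointment is confirmed for Tuesday at 3 PM.",
) as response:
    response.stream_to_file("reply.mp3")
```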
#2 Gladia AI Solaria STT
Gladia AI launched its Solaria STT model in April 2025. Positioned for enterprise use, particularly call centers and voice platforms, Solaria claims industry-leading performance, with broad multilingual support (100 languages, including 42 purportedly unique to it) and high accuracy.
Solaria is engineered to provide native-level transcription accuracy across a vast array of languages, including those less commonly supported by other platforms. It achieves a Word Accuracy Rate (WAR) of 94% in languages such as English, Spanish, and French, while maintaining an ultra-low latency of 270 milliseconds, ensuring natural and responsive conversations.
Their pricing includes a free tier and a pro tier around $0.612/hour for batch transcription.
#3 AssemblyAI Universal-2
AssemblyAI has one of the latest updates on the STT market: their Universal-2 came out in November 2024 as an improved version of the Universal-1 model. It comes in two options: Best and Nano. Nano is a lightweight option that supports over 102 languages, while Best works with 20.
Current AssemblyAI benchmarks claim a 6.6% WER for Universal-2 in English. They emphasize that Universal-2 shows significant improvements in formatted WER (F-WER) and reduced hallucination rates. Reviewers note that Universal-2 copes especially well with the medical and sales domains. Universal-2 employs an all-neural architecture for text formatting, significantly improving the readability of transcripts. This includes context-aware punctuation and casing.
AssemblyAI offers a pay-as-you-go pricing model starting at $0.37 per hour (approximately $0.0062 per minute). The much lower rate of $0.12 per hour ($0.002 per minute) applies to their Nano model, designed for cost-effectiveness and broad language support. Additional costs may apply for advanced features like speaker detection or sentiment analysis.
#4 Deepgram Nova-3
Nova-3 is Deepgram's latest model, released in February 2025 as an upgraded version of their proprietary STT model Nova-2. Deepgram has also expanded its portfolio with specialized versions like Nova-3 Medical, launched in March 2025 and targeting healthcare use cases. The company is also promoting its enterprise runtime platform (DER), the Aura-2 TTS model, and advancements towards a full Speech-to-Speech (STS) architecture, indicating a broader strategic focus.
The model achieves one of the best WERs on the market: 6.84% (average across all domains). This number applies to streaming audio; for batch data (pre-recorded audio) it is even lower: 5.26%. The specialized Nova-3 Medical model reports a median WER of 3.45% and a keyword error rate of 6.79%.
Nova-3 supports 36 languages and dialects and can switch between recognizing 10 different languages in real time. This means that if an English-speaking speaker throws in a couple of Spanish words, the model will interpret them correctly.
Deepgram's STT model costs $0.0077 per minute for streaming audio for users who opt into Deepgram's Model Improvement Program. Lower rates are available for higher volume tiers (Growth plan: $0.0065/min for Nova-3 English). Users who do not opt into the improvement program face higher rates (e.g., $0.0092/min for Nova-3 Multilingual). The specialized Nova-3 Medical model also starts at the $0.0077/min rate.
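For reference, here is a sketch of a pre-recorded transcription request to Deepgram's REST endpoint with Nova-3; the file name is a placeholder, and DEEPGRAM_API_KEY is assumed to be set:

```python
import os
import requests

with open("call_recording.wav", "rb") as f:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-3", "smart_format": "true"},
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/wav",
        },
        data=f,
    )

print(response.json()["results"]["channels"][0]["alternatives"][0]["transcript"])
```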
#5 AssemblyAI Slam-1
Slam-1 is AssemblyAI's latest advancement in speech recognition technology, introduced in April 2025. Slam-1 is a new Speech Language Model that combines an LLM architecture with ASR encoders for superior speech-to-text transcription. The model delivers high accuracy through its understanding of context and semantic meaning.
Slam-1 introduces prompt-based customization, allowing users to provide a list of up to 1,000 domain-specific terms or phrases (each up to six words) via the keyterms_prompt parameter. This enables the model to better recognize and transcribe specialized terminology by understanding its semantic context. Slam-1 maintains an average WER of 7% across diverse datasets, matching the industry-leading accuracy of the Universal model.
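A sketch of a job submission using the keyterms_prompt parameter described above, via AssemblyAI's v2 transcript endpoint (field names follow their docs for the Slam-1 beta; the audio URL and key terms are invented):

```python
import os
import requests

headers = {"authorization": os.environ["ASSEMBLYAI_API_KEY"]}

job = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": "https://example.com/clinic_call.mp3",
        "speech_model": "slam-1",
        "keyterms_prompt": ["metformin", "hypertension", "prior authorization"],
    },
).json()

print(job["id"], job["status"])  # poll GET /v2/transcript/{id} until "completed"
```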
Slam-1 is currently in public beta and accessible through AssemblyAI's standard API endpoint. Priced at $0.37 per hour, identical to the Universal model, with volume discounts available for large workloads.
#6 Speechmatics Ursa 2
The latest and most advanced STT model from Speechmatics is called Ursa 2. It was released in October 2024. The update added new languages (e.g. Arabic dialects), expanding the list to 50. The new version also improved the accuracy and speed of the model. In certain languages, such as Spanish and Polish, Ursa 2 is the market leader with 3.3% and 4.4% WER respectively.
Many users also highlight Ursa's superior handling of diverse accents and noisy environments. That said, the average WER for Ursa 2 is 8.6%, and Speechmatics' documentation shows average WER across 21 datasets of 11.96% for Ursa Enhanced and 13.89% for Ursa Standard.
Speechmatics operates on a subscription-based pricing model. The minimum price for Speechmatics' Speech-to-Text services is approximately $0.0133 per minute for the Standard model in batch transcription, with higher rates for enhanced accuracy and real-time transcription:
- Batch Standard: $0.80/hour (~$0.0133/min)
- Batch Enhanced: $1.04/hour (~$0.0173/min)
- Real-Time Standard: $1.04/hour (~$0.0173/min)
- Real-Time Enhanced: $1.35/hour (~$0.0225/min)
A free tier offering 8 hours per month is also available.
#7 Google Speech-to-Text Chirp
The Google Speech-to-Text service is part of the huge Google Cloud infrastructure. Speech recognition in it is handled by USM (Universal Speech Model), which is not a single model but a whole family. The most advanced model in this family is Chirp, which covers more than 125 languages; USM as a whole works with 309.
Being part of Google means regular updates, the ability to train models on large amounts of data, and interconnection of the service with other infrastructure applications. Google Speech-to-Text also has a good WER, but as with Whisper, it is highly language-dependent. The average WER is 8.5%.
The first 60 minutes per month are free. After that, pricing depends on the API version, logging options, and volume (detailed under Chirp 2 below).
Chirp 2
Chirp 2 is positioned as the latest generation, offering significant improvements in accuracy and speed over the original Chirp, along with expanded features like word-level timestamps, enhanced model adaptation, speech translation, and support for streaming recognition (real-time). Google has also released specialized models like chirp_telephony and continues broader AI advancements with Gemini models (which can also process speech), new TPUs, and integrated media capabilities in Vertex AI.
WERs on standard datasets like LibriSpeech or CommonVoice range from around 6% to 11% or higher, while other tests have reported WERs from 16% to over 20%. A benchmark from Artificial Analysis specifically lists Chirp 2 with a WER of 9.8% on Common Voice v16.1. AssemblyAI benchmarks place Google's WER higher than their own Universal-2 model.
- V1 API (Pay-as-you-go, after 60 free mins/month): $0.016/min (with data logging) or $0.024/min (without data logging).
- V2 API (Pay-as-you-go): Uses tiered pricing for standard models (including Chirp/Chirp 2). For the first 500,000 minutes/month, the rate is $0.016/min (non-logged) or $0.012/min (logged). These rates decrease with higher volume, down to $0.004/min (non-logged) or $0.003/min (logged) for usage above 2 million minutes/month.
- V2 API Dynamic Batch: Offers lower rates for non-urgent batch processing: $0.003/min (non-logged) or $0.00225/min (logged).
#8 OpenAI Whisper V3
The STT model from OpenAI, Whisper, was first introduced in September 2022 and has since been updated twice, in December 2022 (V2) and November 2023 (V3). The model is available in five variants: from tiny to large.
Whisper was designed to be as versatile as possible, working with 99 languages worldwide. But because of this, it is difficult to evaluate its effectiveness in each specific case. For instance, the WER for English can be as low as 5–6%, while for Scandinavian languages, it ranges from 8–10%. The average WER reported for Whisper is approximately 10.6%.
The model is effective in real-world scenarios, particularly in environments with background noise or heavy accents. However, there can be limitations regarding speed and operational complexity, especially when dealing with large volumes of audio data.
OpenAI offers Whisper as an open-source model, meaning it is free to use. However, for those utilizing the Whisper API, pricing details may vary based on usage and specific implementation. It starts at $0.006 per minute.
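Since Whisper is open-source, it can run locally with a few lines; a minimal sketch with the openai-whisper package (requires ffmpeg; the file name is a placeholder):

```python
import whisper

model = whisper.load_model("large-v3")  # variants range from "tiny" to "large"
result = model.transcribe("call_recording.mp3")
print(result["text"])
```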
Comparison of top STTs in 2025:
Provider and Model | WER* | Languages | Cost for streaming audio (per minute) | Latency (real-time) |
---|---|---|---|---|
OpenAI gpt-4o-transcribe / mini (API) | ~5% | 50 | $0.006 / $0.003 | 320 ms |
Gladia AI Solaria | N/A (94% WAR) | 100 (incl. 42 underserved) | ~$0.0102 (Batch) / ~$0.0126 (Live) / Free tier (10h/mo) | 270 ms |
AssemblyAI Universal-2 | ~6.6% | 102 | $0.0062 | ~300–600 ms |
Deepgram Nova-3 / Nova-3 Medical | ~6.8% | 36 | $0.0077 (Eng) / $0.0092 (Multi) | <300 ms |
AssemblyAI Slam-1 | ~7% | 1 (EN only) | ~$0.0062 ($0.37/hr) | ~500–800 ms
Speechmatics Ursa 2 | ~12–14% | 50 | $0.0173 | <1s |
Google Cloud Chirp 2 | ~9.8% | 102 | $0.016 (non-logged) / $0.012 (logged) | N/A |
*Average WER values officially declared by providers are indicated. The values may vary depending on the dataset in which the model is tested.
Best Text-to-Speech models & providers in 2025
Similar to the previous selection, this is a limited list. It includes those providers that show the best quality according to the Artificial Analysis Leaderboard in February 2025.
All the services listed below have good voice realism according to user reviews, but we recommend listening to their samples yourself. Emotionality and realism of voice are among the most important criteria for evaluating TTS. Many people are sensitive to the uncanny valley effect, and chances are that the more realistic the voice, the better it affects the overall conversion rate or effectiveness of the voice agent.
#1 ElevenLabs Flash
As of early 2025, ElevenLabs offers two TTS models: Multilingual and Flash. Multilingual is optimized for maximum realism and humanness of the voice, while Flash is an ultra-fast model with ~75 ms latency. For voice agents specifically, Flash is recommended. It is integrated into ElevenLabs' broader platform for building customizable interactive voice agents.
Flash from ElevenLabs works with 32 languages. Users can customize various aspects of the voice output, including tone and emotional expression. They can also clone voices. As of April 2025, API plans range from Free, Starter ($5/mo), Creator ($22/mo, $11 first month), Pro ($99/mo), Scale ($330/mo), and Business ($1320/mo) to custom Enterprise tiers.
Notable additions include the Scribe STT API, a Conversational AI framework for building agents, a Dubbing API for translation, support for additional audio formats (Opus, A-law for telephony), and improvements to the voice cloning workflow.
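A text-to-speech request against ElevenLabs' REST API with the Flash model can be sketched as follows; the voice ID is a placeholder to be copied from your Voice Library, and ELEVENLABS_API_KEY is assumed to be set:

```python
import os
import requests

voice_id = "YOUR_VOICE_ID"  # placeholder: pick one in the Voice Library
response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
    json={
        "text": "Hi! I'm a Text-to-Speech model, and this is what my voice sounds like.",
        "model_id": "eleven_flash_v2_5",
    },
)

with open("sample.mp3", "wb") as f:
    f.write(response.content)
```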
The ElevenLabs Playground is available here, but you need to sign up to try it out. A basic version of the playground is also available on the provider's homepage, but with a limited selection of voices, no customization, and no option to choose a model.
There is also a Voice Library, where you can listen to all voices categorized by use case. Interestingly, not only the company's developers but also community members can upload voices to it.
Listen to how two popular voices (male and female) from the ElevenLabs library sound:
Here and below, the phrase "Hi! I’m a Text-to-Speech model, and this is what my voice sounds like." is used for testing. No additional voice settings have been applied.
#2 Cartesia Sonic
Cartesia offers only one TTS model, Sonic. It is also quite fast, showing 90 ms latency, which is very good for real-time conversations. Developers can customize voice attributes such as pitch, speed, and emotion, allowing for tailored speech outputs that meet specific needs. Cartesia's technology also allows models to run directly on devices.
Sonic supports instant voice cloning with minimal audio input (as little as 10 seconds), enabling users to replicate specific voices accurately. Sonic works with 15 languages. Monthly subscriptions for business start at $49, or about $46.70 per 1M characters.
The Cartesia Playground is available at this link, but it is only accessible to logged-in users. In the Voices section, you can browse the entire voice library.
Here’s how two voices from the Cartesia library sound: Help Desk Woman and Customer Service Man:
#3 Amazon Polly Generative
Amazon Polly is a TTS service from cloud provider Amazon that seamlessly integrates with other AWS services. Polly has four models: Generative, Long-Form, Neural, and Standard. The first is considered the most advanced. While AWS announced many AI updates in late 2024/early 2025, these primarily focused on Amazon Bedrock, Amazon Q, SageMaker, and infrastructure, with little specific news about Polly model evolution apart from a potentially ambiguous reference to "Amazon Nova Sonic".
The service supports 34 languages as well as their dialects. One to 10+ voice options are offered for each language. A total of 96 voices (for all languages) are available to users.
Polly supports SSML, allowing developers to fine-tune speech output with specific instructions regarding pronunciation, volume, pitch, and speed. Users can create custom voices tailored to specific branding needs or preferences. The delay of Amazon Polly responses varies from 100ms to 1 second. The service operates on a pay-as-you-go pricing model. The minimal cost for business users of the Generative model is $30 per 1M characters.
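A synthesis sketch with boto3, using SSML to slow down a phone number; it uses the neural engine, which has broad SSML tag support (the voice, region, and number are illustrative):

```python
import boto3

polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Engine="neural",
    VoiceId="Joanna",
    OutputFormat="mp3",
    TextType="ssml",
    Text='<speak>Call us at <prosody rate="slow">'
         '<say-as interpret-as="telephone">555-0123</say-as></prosody>.</speak>',
)

with open("polly_sample.mp3", "wb") as f:
    f.write(response["AudioStream"].read())
```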
You can try Amazon Polly through the AWS Console. To do so, you’ll need to sign up and log in to your account, including entering your credit card details—even if you don’t plan to use Amazon’s paid services. Note that the Generative model may not be available in some regions.
Here’s how it sounds:
#4 Microsoft Azure AI Speech Neural
Microsoft Azure TTS is a part of the Microsoft Azure ecosystem that integrates seamlessly with other Azure AI services. Microsoft's TTS model is called Neural, and it has several versions: Standard, Custom, and HD. The latter is the most advanced.
It supports over 140 languages and locales. Users can create custom neural voices tailored to their brand or application needs. Developers can use SSML to customize pronunciation, intonation, and other speech characteristics.
Standard Neural voices start at $15 per 1M characters on the pay-as-you-go plan. Custom Neural Professional voices are priced higher at $24 per 1M characters, plus additional costs for model training and endpoint hosting.
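Synthesis through the Azure AI Speech Python SDK (pip install azure-cognitiveservices-speech) can be sketched like this; the key, region, and voice name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_text_async(
    "Hi! I'm a Text-to-Speech model, and this is what my voice sounds like."
).get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesized", len(result.audio_data), "bytes of audio")
```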
Microsoft's Voice Library is open and accessible even to guest users via this link. However, to try out the TTS functionality, you’ll need to sign in with an Azure account.
Examples of Featured voices are below:
Azure's DragonHD
Microsoft introduced Azure AI Speech’s Dragon HD Neural TTS in March 2025. This model is designed to deliver highly expressive, context-aware, and emotionally nuanced synthetic speech, making it particularly well-suited for applications such as conversational agents, podcasts, and multimedia content creation.
Dragon HD Neural TTS integrates large-scale language models (LLMs) to enhance contextual understanding, allowing the system to generate speech that accurately reflects the intended meaning and emotion of the input text.
The model includes advanced emotion detection by analyzing acoustic and linguistic signals, allowing synthesized speech to convey authentic emotional nuances. As of March 2025, 19 Dragon HD voices are available in total across languages and locations.
As of May 2025, Microsoft has not publicly disclosed specific pricing for Azure AI Speech's Dragon HD Neural TTS voices. These high-definition voices are available through Azure's Text-to-Speech service, which typically charges based on the number of characters synthesized; standard neural voices, for instance, are priced at around $15 per 1 million characters. The exact rates for Dragon HD voices may differ and are not specified in the available documentation.
#5 Google Text-to-Speech Studio
The Google TTS service has several models. Among them, the Studio model has the highest rating on the Artificial Analysis platform. Google does not provide specific latency figures for the model, but according to user feedback, it is around 500 ms.
This TTS supports over 380 voices across 50+ languages and variants. Users can create unique voices by recording samples, allowing brands to have a distinct voice across their customer interactions. The API supports SSML, enabling developers to control aspects like pitch, speed, volume, and pronunciation for more tailored speech outputs.
Google TTS offers a flexible pay-as-you-go pricing model. For the Studio model, charges start at $160 per 1M characters.
You can try out Google's TTS service here. To do so, you need to be logged into your Google account. On the playground, you can choose between only two voices—male or female—but you can customize how they sound on different devices, such as a car speaker.
#6 PlayHT Dialog
PlayHT provides two TTS models: PlayHT 3.0 and Dialog. Both can be used for AI voice agents, but PlayHT Dialog is designed specifically for conversational applications. It was released in February 2025.
PlayHT also offers the PlayHT 3.0 Mini model, optimized for lower latency and speech accuracy. Key developments since February include a partnership with Groq to run the Dialog model on Groq's LPUs, significantly boosting inference speed (claimed 215 chars/s vs 80 chars/s on GPU). They also partnered with LiveKit for real-time voice AI integration and launched the Play AI Studio, a unified platform incorporating multi-speaker podcast creation, advanced narration tools, and voice agent building capabilities.
Dialog works with 9 main languages and 23 additional ones. The model's latency is 300 ms, and its main advantage is highly natural, expressive, and fluid voices. There are more than 50 of them, and voice cloning functionality is also available.
Current plans include a Free tier, Creator ($31.20/mo billed yearly), Unlimited ($29/mo), and Enterprise (custom). A Professional plan at $99/month offers unlimited voice generation.
PlayHT has an open playground where even unregistered users can explore its features. However, you’ll need to log in to generate speech.
Here’s how the voices that the platform suggests trying first sound:
Comparison of top TTSs in 2025:
Provider and Model | ELO Score* | Languages | Cost (per 1M characters) | Latency |
---|---|---|---|---|
ElevenLabs Flash v2.5 | ~1108 | 32 | ~$60 (Flash) | 75 ms |
Cartesia Sonic-2, Sonic-Turbo | ~1106 | 15 | $37-40 | 100 ms |
Amazon Polly Generative | ~1063.9 | 34 | $30 | 100 ms
Azure AI Speech Neural Std, Neural HD | ~1057.2 | 140+ | $15 | 300 ms
Google Text-to-Speech Studio | ~1037.5 | 50+ | $160 | 500 ms
PlayHT Dialog, PlayHT 3.0 | ~1013 | 32 | $99/mo unlimited plan | 300–320 ms
*ELO Scores as listed on the Artificial Analysis platform in May 2025.
Choosing the right STT and TTS models for your project
In general, the trade-off looks like this: Google, Microsoft, and other large providers offer more stability, while smaller companies like ElevenLabs or Deepgram can offer more realistic voices in TTS and better speeds in STT.
Taras, CTO at Softcery, also shares his recommendations when choosing a model, depending on the use case:
- For projects within entertainment or interactivity/gaming, emotionality and realism in voice (TTS) are important.
- For commercial agents with expected high call volume, stability is important.
- For agents for appointments at a medical center, hair salon, golf club, etc., low WER in STT is important, as contact information is often dictated there, and it is critical to transcribe it correctly.
Testing your chosen models in real-world conditions is crucial for success. While providers often showcase performance in ideal settings, voice agents frequently encounter challenging scenarios. Users may speak quietly or have speech impairments, heavy accents can affect recognition accuracy, and background noise or poor connections can significantly impact performance. Multiple speakers talking simultaneously or interrupting each other also present unique challenges that may not be apparent during initial testing.
When planning for long-term success, consider how your chosen solution will scale alongside your project. This means evaluating providers' capabilities to handle sudden traffic spikes and support multi-region deployment without compromising performance. Additionally, assess their roadmap for adding new languages and voice options, as well as their ability to integrate with an expanding tech stack.