Choosing Text to Speech (TTS) for AI Voice Agents (2025): Voices. Latency. Cost
A practical guide to choosing TTS for AI voice agents in 2025—focused on latency, voice quality, customization, and cost.

For businesses building AI voice agents in 2025, choosing the right TTS solution is critical – it directly shapes user experience and trust. This guide breaks down the key criteria for evaluating TTS systems and compares leading providers, so non-technical and technical leaders alike can make an informed decision.
What Is TTS for AI Voice Agents and Why It's Important
Text-to-Speech (TTS) is the final step in a voice agent pipeline. It takes the AI’s response (text) and turns it into audible speech. If STT is how the agent listens, TTS is how it talks.
It’s not just about “having a voice.” TTS shapes how your users perceive the agent. A natural, well-paced voice builds trust and keeps users engaged. A robotic or laggy voice breaks immersion - and loses users fast. Good TTS reduces cognitive load. It makes your agent feel responsive, credible, and useful. Bad TTS distracts and frustrates.
If you’re building voice interfaces, TTS isn’t optional. And if you're serious about user experience, generic synthetic voices won’t cut it. You need performance, control, and quality that fits the use case.
Key Criteria for Choosing a TTS Solution
Voice Quality
The primary consideration for TTS is the naturalness of the generated speech. Consider whether the voices sound robotic or human-like, and how well they maintain consistency across longer passages. Special attention should be paid to the model's ability to maintain consistent sound quality when processing incomplete or partial text fragments.
The model should also include robust text normalization for accurately pronouncing specific formats such as phone numbers, email addresses, and confirmation codes, ensuring clear and precise communication of critical information.
There is no common quality metric for Text-to-Speech services comparable to WER for STT models. However, AI model research platforms publish their own metrics, and these are a reasonable starting point for evaluation.
For example, the Artificial Analysis platform collects user preferences between popular TTS models and calculates an ELO score from them. At the beginning of 2025, the best-performing TTS models had ELO scores around 1000–1100, while the rest of the models on the platform's leaderboard scored around 850.
Latency and Real-Time Streaming
Assess how quickly the TTS engine begins speaking after receiving text. This is measured by Time to First Audio (TTFA). For real-time voice agents, sub-500 ms TTFA is the minimum. Sub-200 ms is preferred. Below 100 ms is ideal. Look for support for streaming synthesis. This allows the engine to generate and deliver audio incrementally, so playback can start before the full sentence is ready. This is critical for responsive, real-time interaction.
Modern TTS systems built for dialog, such as ElevenLabs' Flash (~75 ms TTFA) and Cartesia's Sonic (~90–100 ms), support low-latency streaming and are suitable for live use. Traditional or high-fidelity models may exceed 500 ms, which can delay responses and disrupt flow.
Use cases with strict timing (e.g. customer support or IVR systems) should prioritize low TTFA and streaming APIs (WebSocket or similar). Some vendors offer both fast and high-quality voices - choose based on whether speed or naturalness matters more in context.
Latency is not just a performance metric. It directly impacts perceived intelligence and responsiveness.
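To make TTFA concrete, here is a minimal Python sketch that times the arrival of the first non-empty chunk from any streaming synthesis response. The `fake_stream` generator is a stand-in for a real vendor stream, not an actual API:

```python
import time
from typing import Iterable, Tuple

def measure_ttfa(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (ttfa_ms, first_chunk) for a streaming synthesis response.

    `chunks` is any iterable of audio byte chunks, e.g. the body of a
    streaming HTTP or WebSocket TTS response.
    """
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # skip keep-alive / empty frames
            ttfa_ms = (time.perf_counter() - start) * 1000
            return ttfa_ms, chunk
    raise RuntimeError("stream ended without producing audio")

# Fake stream that "synthesizes" after a 50 ms delay:
def fake_stream():
    time.sleep(0.05)          # pretend the engine needs 50 ms
    yield b"\x00" * 320       # first audio chunk
    yield b"\x00" * 320

ttfa, first = measure_ttfa(fake_stream())
print(f"TTFA: {ttfa:.0f} ms")
```

Running the same harness against each candidate provider, with your real prompts, gives comparable TTFA numbers instead of relying on marketing figures.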
Voice Customization
Evaluate the available voice options and customization capabilities. Some businesses need multiple voices for different purposes, while others require brand-specific voice creation.
Consider the model's ability to adjust speaking rate, pitch, and emphasis. For example, higher pitches often convey friendliness and approachability, while lower pitches project authority and seriousness.
Advanced systems may also offer control over specific voice characteristics:
- Assertiveness: Controls the firmness of voice delivery.
- Confidence: Affects how assured the voice sounds.
- Smoothness: Adjusts between smooth and staccato delivery.
- Relaxedness: Modifies tension in the voice.
Fine-grained control over voice generation, such as adjusting intonation and pronunciation for specific words or phrases, is typically enabled by Speech Synthesis Markup Language (SSML).
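As a small illustration, the snippet below builds an SSML fragment in Python. The `prosody`, `break`, and `say-as` elements come from the W3C SSML specification, but vendor support varies, so check your provider's SSML documentation before relying on any of them:

```python
def confirmation_ssml(code: str) -> str:
    """SSML that slows delivery slightly and spells a confirmation code
    character by character -- the critical-information scenario above."""
    return (
        "<speak>"
        '<prosody rate="95%">Your confirmation code is</prosody>'
        '<break time="300ms"/>'
        f'<say-as interpret-as="characters">{code}</say-as>'
        "</speak>"
    )

print(confirmation_ssml("X7Q2"))
```

Passing this string instead of plain text usually means setting a flag or content type on the synthesis request; the mechanism differs per vendor.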
Language Support
Support for multiple languages and regional variations ensures voice agents can serve diverse audiences. This includes:
- Language selection
- Regional accent configuration
- Dialect-specific adjustments
Some models (e.g. cartesia-english) are configured for a single language, and switching to another language mid-call causes problems. These can be hard to work around, because call configuration usually cannot be updated in real time.
Cost
Most providers charge based on characters, words, or audio duration. Some charge extra for advanced features such as low-latency streaming, voice cloning, or SSML support.
Break costs down into:
- Standard voices vs premium voices – Premium neural voices are more natural but often priced higher.
- Real-time vs batch processing – Streaming APIs may have higher per-character rates.
- Usage volume – Costs can scale quickly with high call volume. Check for volume discounts or enterprise pricing tiers.
If you use multiple languages, test for hidden surcharges - some vendors price them separately.
Also consider indirect costs:
- Custom voice creation – Voice cloning or branded voices may involve one-time training fees or licensing models.
- Storage and caching – Some platforms charge for storing synthesized audio or reusing clips.
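A back-of-the-envelope estimator helps compare pricing tiers before committing. The function below is a simplified sketch (flat per-character pricing, 30-day month, illustrative traffic numbers), not any vendor's actual billing model:

```python
def monthly_tts_cost(calls_per_day: int, chars_per_call: int,
                     price_per_million_chars: float,
                     cache_hit_rate: float = 0.0) -> float:
    """Rough monthly TTS spend in dollars (30-day month).

    cache_hit_rate models the share of prompts served from pre-rendered
    audio instead of a billable synthesis call.
    """
    billable_chars = calls_per_day * 30 * chars_per_call * (1 - cache_hit_rate)
    return billable_chars / 1_000_000 * price_per_million_chars

# 2,000 calls/day, ~600 spoken characters per call, $30 per 1M characters:
print(monthly_tts_cost(2000, 600, 30.0))                        # 1080.0
print(monthly_tts_cost(2000, 600, 30.0, cache_hit_rate=0.25))   # 810.0
```

Even a modest cache hit rate on static prompts cuts the bill noticeably, which is why caching reappears in the integration section below.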
Scalability and Integration
API Access
Modern TTS providers expose their engines via REST APIs or WebSockets. These endpoints accept plain text (or SSML) and return synthesized audio in various formats (e.g., WAV, MP3, PCM). Key considerations:
- Batch vs. Realtime: Some APIs support batch processing for long-form content (e.g., reading documents), others are optimized for real-time synthesis.
- SSML Support: Speech Synthesis Markup Language (SSML) lets you control pitch, rate, pauses, emphasis, and phoneme pronunciation.
- Customization: Advanced APIs allow voice tuning, custom lexicons, or full voice cloning for brand consistency.
Real-Time Integration
For voice agents, latency is non-negotiable. The TTS system must generate and stream audio as quickly as possible - ideally under 250 ms from input to playback. Real-time TTS integration typically includes:
- Streaming Output: Low-latency audio chunks start playing before the full sentence is synthesized.
- WebSocket Streaming APIs: Maintain persistent connections to reduce overhead and deliver sub-100 ms Time to First Audio (TTFA).
- Caching: For repeated prompts (e.g., greetings or fallback responses), cache pre-rendered audio to eliminate unnecessary TTS calls.
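A minimal caching layer might look like the sketch below; `fake_tts` stands in for a real vendor call and exists only to show how many billable syntheses are avoided:

```python
import hashlib

class PromptAudioCache:
    """Cache synthesized audio for static prompts (greetings, fallbacks).

    `synthesize` is any callable turning (text, voice_id) into audio
    bytes -- e.g. a thin wrapper around your vendor's TTS API.
    """
    def __init__(self, synthesize):
        self._synthesize = synthesize
        self._store: dict = {}

    def get(self, text: str, voice_id: str) -> bytes:
        key = hashlib.sha256(f"{voice_id}:{text}".encode()).hexdigest()
        if key not in self._store:
            self._store[key] = self._synthesize(text, voice_id)
        return self._store[key]

# Fake backend that counts how many billable calls actually happen:
calls = []
def fake_tts(text, voice_id):
    calls.append(text)
    return f"audio:{voice_id}:{text}".encode()

cache = PromptAudioCache(fake_tts)
cache.get("Hello, how can I help you?", "agent-voice")
cache.get("Hello, how can I help you?", "agent-voice")  # served from cache
print(len(calls))  # 1 billable synthesis instead of 2
```

The key includes the voice id so the same text rendered in different voices does not collide; in production you would also key on model, language, and any SSML settings.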
Leading TTS Providers and Solutions in 2025
Similar to the previous selection, this is a limited list. It includes those providers that show the best quality according to the Artificial Analysis Leaderboard in February 2025.
All the services listed below have good voice realism according to user reviews, but we recommend listening to their samples yourself. Emotionality and realism are among the most important criteria for evaluating TTS. Many listeners are sensitive to the uncanny valley effect, and chances are that the more realistic the voice, the better the overall conversion rate or effectiveness of the voice agent.
#1 ElevenLabs Flash
As of early 2025, ElevenLabs offers two TTS models: Multilingual and Flash. Multilingual is optimized for maximum realism and humanness of the voice, while Flash is an ultra-fast model with ~75 ms latency. For voice agents specifically, Flash is recommended. It is integrated into ElevenLabs' broader platform for building customizable interactive voice agents.
Flash from ElevenLabs works with 32 languages. Users can customize various aspects of the voice output, including tone and emotional expression, and can also clone voices. As of April 2025, API plans range from Free, Starter ($5/mo), Creator ($22/mo, $11 first month), Pro ($99/mo), Scale ($330/mo), Business ($1320/mo), to custom Enterprise tiers.
Notable additions include the Scribe STT API, a Conversational AI framework for building agents, a Dubbing API for translation, support for additional audio formats (Opus, A-law for telephony), and improvements to the voice cloning workflow.
This and some other providers on the list do not disclose the price per 1M characters. In such cases, we will indicate the figure based on calculations from the Artificial Analysis platform.
The ElevenLabs Playground is available here, but you need to sign up to try it out. A basic version of the playground is also available on the provider's homepage, but with a limited selection of voices, no customization, and no option to choose a model.
There is also a Voice Library, where you can listen to all voices categorized by use case. Interestingly, not only the company's developers but also community members can upload voices to it.
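For developers, a synthesis request to the Flash model can be sketched as follows. The endpoint path and the `eleven_flash_v2_5` model id match ElevenLabs' public API docs at the time of writing, but the voice-settings values are illustrative; verify everything against the current API reference:

```python
import json

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"

def flash_request(text: str, voice_id: str):
    """Build the URL and JSON body for a streaming Flash synthesis call."""
    body = {
        "text": text,
        "model_id": "eleven_flash_v2_5",
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
    }
    return ELEVENLABS_TTS_URL.format(voice_id=voice_id), body

def stream_first_chunk(api_key: str, voice_id: str, text: str) -> bytes:
    """Live call -- requires a real API key, so it is defined but not run here."""
    import urllib.request
    url, body = flash_request(text, voice_id)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # response streams audio chunks
        return resp.read(4096)
```

In production you would read the response incrementally and hand chunks to the audio pipeline as they arrive, rather than waiting for the full body.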
Listen to how two popular voices (male and female) from the ElevenLabs library sound:
Here and below, the phrase "Hi! I’m a Text-to-Speech model, and this is what my voice sounds like." is used for testing. No additional voice settings have been applied.
#2 Cartesia Sonic
Cartesia offers only one TTS model, the Sonic. It is also quite fast and shows 90 ms latency, which is very good for conversations in real-time. Developers can customize voice attributes such as pitch, speed, and emotion, allowing for tailored speech outputs that meet specific needs. Cartesia's technology also allows models to run directly on devices.
Sonic supports instant voice cloning from minimal audio input (as little as 10 seconds), enabling users to replicate specific voices accurately. Sonic works with 15 languages. Monthly business subscriptions start at $49, or about $46.70 per 1M characters.
The Cartesia Playground is available at this link, but it is only accessible to logged-in users. In the Voices section, you can browse the entire voice library.
Here’s how two voices from the Cartesia library sound: Help Desk Woman and Customer Service Man:
#3 Amazon Polly Generative
Amazon Polly is a TTS service from cloud provider Amazon that integrates seamlessly with other AWS services. Polly has four models: Generative, Long-Form, Neural, and Standard. The first is considered the most advanced. While AWS announced many AI updates in late 2024 and early 2025, these primarily focused on Amazon Bedrock, Amazon Q, SageMaker, and infrastructure; little has been published specifically about Polly model evolution, apart from an ambiguous reference to "Amazon Nova Sonic".
The service supports 34 languages as well as their dialects. One to 10+ voice options are offered for each language. A total of 96 voices (for all languages) are available to users.
Polly supports SSML, allowing developers to fine-tune speech output with specific instructions regarding pronunciation, volume, pitch, and speed. Users can create custom voices tailored to specific branding needs or preferences. The delay of Amazon Polly responses varies from 100ms to 1 second. The service operates on a pay-as-you-go pricing model. The minimal cost for business users of the Generative model is $30 per 1M characters.
You can try Amazon Polly through the AWS Console. To do so, you’ll need to sign up and log in to your account, including entering your credit card details - even if you don’t plan to use Amazon’s paid services. Note that the Generative model may not be available in some regions.
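If you use the AWS SDK, a Generative-engine request can be sketched like this. `Ruth` is one of the voices AWS documents as supporting `Engine="generative"`; treat the parameter set as a starting point, not a definitive configuration:

```python
def polly_params(text: str) -> dict:
    """Keyword arguments for boto3's polly.synthesize_speech call."""
    return {
        "Engine": "generative",
        "VoiceId": "Ruth",
        "OutputFormat": "pcm",   # raw 16 kHz PCM suits telephony pipelines
        "SampleRate": "16000",
        "Text": text,
    }

def synthesize_with_polly(text: str) -> bytes:
    """Live call -- needs AWS credentials, so it is defined but not run here."""
    import boto3  # pip install boto3
    polly = boto3.client("polly")
    response = polly.synthesize_speech(**polly_params(text))
    return response["AudioStream"].read()
```

Switching `OutputFormat` to `mp3` is more convenient for web playback; PCM avoids a decode step when piping audio into a SIP or WebRTC stack.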
Here’s how it sounds:
#4 Microsoft Azure AI Speech Neural
Microsoft Azure TTS is a part of the Microsoft Azure ecosystem that integrates seamlessly with other Azure AI services. Microsoft's TTS model is called Neural, and it has several versions: Standard, Custom, and HD. The latter is the most advanced.
It supports over 140 languages and locales. Users can create custom neural voices tailored to their brand or application needs. Developers can use SSML to customize pronunciation, intonation, and other speech characteristics.
Standard Neural voices start at $15 per 1M characters (pay-as-you-go). Custom Neural Professional voices are priced higher at $24 per 1M characters, plus additional costs for model training and endpoint hosting.
Microsoft's Voice Library is open and accessible even to guest users via this link. However, to try out the TTS functionality, you’ll need to sign in with an Azure account.
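Integration via Microsoft's Speech SDK can be sketched as below. The locale-to-voice mapping is our own illustrative choice; the voice names are examples from Azure's public neural voice list:

```python
# Illustrative locale-to-voice mapping; not an Azure API.
NEURAL_VOICES = {
    "en-US": "en-US-JennyNeural",
    "de-DE": "de-DE-KatjaNeural",
    "es-ES": "es-ES-ElviraNeural",
}

def pick_voice(locale: str) -> str:
    """Fall back to a US English voice for unmapped locales."""
    return NEURAL_VOICES.get(locale, NEURAL_VOICES["en-US"])

def synthesize_with_azure(text: str, locale: str, key: str, region: str) -> bytes:
    """Live call -- needs an Azure Speech resource, so it is not run here."""
    import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    config.speech_synthesis_voice_name = pick_voice(locale)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config,
                                              audio_config=None)
    result = synthesizer.speak_text_async(text).get()
    return result.audio_data
```

Passing `audio_config=None` returns the audio bytes in memory instead of playing them through the default speaker, which is what a voice-agent backend wants.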
Examples of Featured voices are below:
Azure's DragonHD
Microsoft introduced Azure AI Speech’s Dragon HD Neural TTS in March 2025. This model is designed to deliver highly expressive, context-aware, and emotionally nuanced synthetic speech, making it particularly well-suited for applications such as conversational agents, podcasts, and multimedia content creation.
Dragon HD Neural TTS integrates large-scale language models (LLMs) to enhance contextual understanding, allowing the system to generate speech that accurately reflects the intended meaning and emotion of the input text.
The model includes advanced emotion detection by analyzing acoustic and linguistic signals, allowing synthesized speech to convey authentic emotional nuances. As of March 2025, 19 Dragon HD voices are available in total across languages and locations.
As of May 2025, Microsoft has not publicly disclosed specific pricing for Azure AI Speech's Dragon HD Neural TTS voices. These high-definition voices are available through Azure's Text-to-Speech service, which typically charges based on the number of characters synthesized. For instance, standard neural voices are priced at approximately $16 per 1 million characters. However, the exact rates for Dragon HD voices may differ and are not specified in the available documentation.
#5 Google Text-to-Speech Studio
The Google TTS service has several models. Among them, the Studio model has the highest rating on the Artificial Analysis platform. Google does not provide specific latency figures for the model, but according to user feedback, it is around 500 ms.
This TTS supports over 380 voices across 50+ languages and variants. Users can create unique voices by recording samples, allowing brands to have a distinct voice across their customer interactions. The API supports SSML, enabling developers to control aspects like pitch, speed, volume, and pronunciation for more tailored speech outputs.
Google TTS offers a flexible pay-as-you-go pricing model. For the Studio model, charges start at $160 per 1M characters.
You can try out Google's TTS service here. To do so, you need to be logged into your Google account. On the playground, you can choose between only two voices - male or female - but you can customize how they sound on different devices, such as a car speaker.
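A Studio-voice request through Google's official client library might look like the sketch below. `en-US-Studio-O` is one of the Studio voices listed in Google's docs; verify availability for your project and region:

```python
def studio_request(text: str) -> dict:
    """Plain-dict version of the synthesis request, for easy inspection."""
    return {
        "input": {"text": text},
        "voice": {"language_code": "en-US", "name": "en-US-Studio-O"},
        "audio_config": {"audio_encoding": "LINEAR16",
                         "sample_rate_hertz": 16000},
    }

def synthesize_with_google(text: str) -> bytes:
    """Live call -- needs GCP credentials, so it is defined but not run here."""
    from google.cloud import texttospeech  # pip install google-cloud-texttospeech
    client = texttospeech.TextToSpeechClient()
    req = studio_request(text)
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=req["input"]["text"]),
        voice=texttospeech.VoiceSelectionParams(**req["voice"]),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16,
            sample_rate_hertz=req["audio_config"]["sample_rate_hertz"],
        ),
    )
    return response.audio_content
```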
#6 PlayHT Dialog
PlayHT provides two TTS models: PlayHT 3.0 and Dialog. Both can be used for AI voice agents, but PlayHT Dialog is designed specifically for conversational applications. It was released in February 2025.
PlayHT also offers the PlayHT 3.0 Mini model, optimized for lower latency and speech accuracy. Key developments since February include a partnership with Groq to run the Dialog model on Groq's LPUs, significantly boosting inference speed (claimed 215 chars/s vs 80 chars/s on GPU). They also partnered with LiveKit for real-time voice AI integration and launched the Play AI Studio, a unified platform incorporating multi-speaker podcast creation, advanced narration tools, and voice agent building capabilities.
Dialog works with 9 primary languages and 23 additional ones. Latency is around 300 ms, and the model's main advantage is highly natural, expressive, and fluid voices. More than 50 voices are available, and voice cloning is also supported.
Current plans include a Free tier, Creator ($31.20/mo billed yearly), Unlimited ($29/mo), and Enterprise (custom). A Professional plan at $99/month offers unlimited voice generation.
PlayHT has an open playground where even unregistered users can explore its features. However, you’ll need to log in to generate speech.
Here’s how the voices that the platform suggests trying first sound:
#7 OpenAI GPT-4o TTS
OpenAI’s TTS engine is part of the GPT-4o multimodal architecture, designed to handle voice, text, and vision in a single model. While originally built for ChatGPT’s real-time conversations, the TTS component is now gaining attention for standalone voice agent use.
GPT-4o’s TTS supports multiple voices and languages, with highly natural prosody and fast response. Early benchmarks report latency around 200–250 ms, making it suitable for real-time interaction. However, as of mid-2025, OpenAI does not yet offer full SSML support or voice customization.
There is no public pricing yet for standalone GPT-4o TTS, and it’s not available as a separate API product. Access is currently limited to ChatGPT-based applications and controlled environments.
OpenAI’s voice quality is competitive, but the platform lacks fine-grained control and brand voice support. If and when OpenAI launches a dedicated TTS API, it may compete directly with ElevenLabs, Microsoft, and Google for high-volume, low-latency deployments.
For now, it’s a strong option only if you're already embedded in the OpenAI ecosystem and don’t require voice customization.
#8 Sieve TTS
Sieve TTS is part of Sieve’s AI toolkit, aimed at developers building multilingual voice and dubbing solutions. It supports natural-sounding speech generation, voice cloning, emotional nuance control, and word-level timestamps - all accessible via API.
- Voice cloning & emotion control: You can replicate voices (with consent) and apply emotional adjustments like excitement or calmness.
- Timestamps: Useful for highlighting or syncing text with speech.
- Multilingual support: Via underlying TTS engines - ElevenLabs covers ~29 languages, while OpenAI's model extends coverage to over 100 languages.
Sieve uses streaming APIs to generate audio in real time, making it suitable for interactive use cases. Exact latency figures are not public, so benchmark against your requirements. Sieve doesn’t list pricing upfront - you’ll need to contact sales. But it’s clearly aimed at enterprise and high-volume workflows (multilingual dubbing, branded voice deployment).
Sieve TTS is a strong option if you need voice cloning, emotion control, timestamps, and multi-language coverage in one platform. It’s less suitable if you need straightforward, low-cost TTS for simple voice agents - but it excels in complex, media-rich applications.
#9 IBM Watson Text-to-Speech
IBM Watson TTS is part of IBM’s enterprise AI suite, offering both cloud-based and on-premises deployment options. It’s designed for use cases where data control, security, and infrastructure flexibility are priorities.
The system supports over 10 languages with a limited selection of neural voices. Developers can customize output using SSML and phonetic tuning (IPA, SPR), and advanced users can create custom voices using IBM’s “Tune by Example” method. The engine supports expressive styles like “GoodNews” or “Apology” to adjust emotional tone.
IBM provides REST and WebSocket APIs for integration. The platform supports real-time streaming with sub-second latency, though IBM doesn’t publish exact performance metrics. On-prem deployment allows teams to manage latency and availability internally.
Pricing starts at $0.02 per 1,000 characters for standard neural voices. Premium and custom voice options are available at higher tiers, with added costs for training, model hosting, and private infrastructure use.
You can explore Watson TTS via IBM Cloud. To access the playground or deploy in production, an IBM Cloud account is required.
Comparison of top TTSs in 2025
| Provider and Model | ELO Score* | Languages | Cost (per 1M characters) | Latency |
|---|---|---|---|---|
| ElevenLabs Flash v2.5 | ~1108 | 32 | ~$60 (Flash) | 75 ms |
| Cartesia Sonic-2, Sonic-Turbo | ~1106 | 15 | $37–40 | 100 ms |
| Amazon Polly Generative | ~1064 | 34 | $30 | 100 ms |
| Azure AI Speech Neural Std, Neural HD | ~1057 | 140+ | $15 | 300 ms |
| Google Text-to-Speech Studio | ~1038 | 50+ | $160 | 500 ms |
| PlayHT Dialog, PlayHT 3.0 | ~1013 | 32 | $99/mo unlimited plan | 300–320 ms |
| OpenAI GPT-4o TTS | ~1065 | 100+ | Not publicly priced | ~200–250 ms |
| Sieve TTS | ~1045 | 50+ | Contact sales | N/A |
| IBM Watson TTS | ~1025 | 16+ | $20 (neural) | ~400 ms |

*ELO Score as reported on the Artificial Analysis platform, May 2025
Key Takeaways for Choosing TTS in 2025
- Prioritize Latency. Time-to-First-Audio (TTFA) under 250 ms is essential for natural conversations. Streaming TTS with WebSocket support is no longer optional for real-time agents.
- Evaluate Voice Quality. Don’t rely on polished demos. Test voices in your real use case: noisy environments, fast responses, and context switching. Evaluate Mean Opinion Score (MOS) from real user feedback.
- Look for Fine-Tuning Options. Choose providers that let you control intonation, prosody, and pronunciation through SSML or custom voice training. This is key for domain-specific or branded tone of voice.
- Watch for Cost Scaling. TTS pricing is usually per character or per minute. At scale, this adds up fast. Use caching for static prompts and review pricing tiers carefully.
- Ensure API Stability and Scaling. Review concurrency limits, latency SLAs, and regional availability. If you self-host, plan for autoscaling and GPU/CPU resource allocation.
- Multilingual and Accent Coverage Matters. If your audience spans regions or languages, ensure consistent voice quality and proper phonetic handling across locales.
- Test Edge Cases. Mispronunciations, clipped output, robotic cadence, and unnatural inflections show up in production – not in the demo. Push the limits in QA.