AI Voice Agents: Text-to-Speech vs. Voice-to-Voice (Realtime API)
Traditional STT/LLM/TTS infrastructure versus OpenAI's new Realtime API: while one offers cost-effective flexibility for voice AI development, the other delivers premium real-time interactions. Understanding these tradeoffs is crucial for selecting the right voice AI architecture.
At OpenAI Dev Day on October 1, 2024, OpenAI introduced the Realtime API in a public beta, allowing paid developers to create low-latency, multimodal applications.
The Realtime API supports seamless speech-to-speech interactions using six preset voices, similar to ChatGPT’s Advanced Voice Mode. Its key advantages are low latency and robust speech-to-speech capabilities, making it a strong foundation for building voice AI agents.
The API also handles interruptions: it pauses audio streaming when it detects the user speaking over it, a valuable feature for interactive voice applications.
In this article, we’ll compare the Realtime API approach and the more old-school STT/LLM/TTS approach, exploring the strengths and limitations of each. This comparison will help clarify which may be best suited for your voice AI use case.
STT, LLM, TTS
- STT: Speech-to-Text. Converts audio into text. The "ears" of the system. Providers: Deepgram, Amazon Transcribe, Google Speech-to-Text, Microsoft Azure Speech Service.
- LLM: Large Language Model. Processes the transcribed text and generates a response. The "brain" of the system. Providers: OpenAI’s GPT, Anthropic’s Claude, Meta’s Llama, Google’s Gemini.
- TTS: Text-to-Speech. Synthesizes text into spoken audio. The "voice" of the system. Providers: Deepgram, Cartesia, Microsoft Azure Speech Synthesis.
Each service in this setup is accessed through RESTful or WebSocket APIs, creating a multi-step process with potential latency due to network communication and processing.
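To make the three-stage flow concrete, here is a minimal sketch of the pipeline in Python. For brevity it uses OpenAI’s Whisper, Chat Completions, and TTS endpoints for all three stages; in the setup described here, the STT and TTS calls would typically go to providers such as Deepgram or Cartesia instead, but the structure, and the sequential network round trips, are the same. Model names and prompts are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def respond_to_audio(input_path: str, output_path: str) -> str:
    # 1. STT - the "ears": turn the caller's audio into text.
    with open(input_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )

    # 2. LLM - the "brain": generate a reply from the transcribed text.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a concise phone assistant."},
            {"role": "user", "content": transcript.text},
        ],
    )
    reply = completion.choices[0].message.content

    # 3. TTS - the "voice": synthesize the reply as spoken audio.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.write_to_file(output_path)
    return reply
```

All three calls run back to back, which is where the variable response latency discussed below comes from; production pipelines typically stream partial results between stages to hide part of that delay.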
Cons of this setup include:
- Technical complexity: integrating and orchestrating several services is hard, the setup is less stable, and it needs fallback systems for when any provider degrades.
- Sequential processing (audio into STT, STT output into the LLM, the LLM response into TTS) creates delays that only emulate real-time interaction; latency is highly variable, usually 0.7-3 seconds per response.
- The pipeline works on transcribed text alone, so it cannot pick up emotion or tone, limiting the conversational experience to literal interpretations and predefined response patterns.
However, this structure offers clear benefits:
- It is cost-effective and supported by numerous ready-made solutions, reducing the need for organizations to develop each component from scratch.
- Established providers, including LiveKit, Bland.ai, and Vapi, offer frameworks that simplify the integration of STT, LLM, and TTS, speeding up development cycles and decreasing operational costs.
- An LLM-agnostic setup allows the use of fine-tuned models and keeps the system adaptable as LLM technology evolves. The “brain” is the component that most determines the system's effectiveness, so being able to choose or switch models as requirements change keeps the stack relevant as new and improved models emerge (sketched below).
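To illustrate the LLM-agnostic point, here is a rough sketch: if the rest of the pipeline depends only on a text-in/text-out callable, the "brain" can be swapped, or a fine-tuned model dropped in, without touching the STT or TTS stages. The provider SDKs and model names below are examples, not a prescription.

```python
from typing import Protocol

class Brain(Protocol):
    """Anything that turns the user's transcribed text into a reply."""
    def __call__(self, user_text: str) -> str: ...

def openai_brain(user_text: str) -> str:
    from openai import OpenAI
    completion = OpenAI().chat.completions.create(
        model="gpt-4o-mini",  # or a fine-tuned model ID
        messages=[{"role": "user", "content": user_text}],
    )
    return completion.choices[0].message.content

def anthropic_brain(user_text: str) -> str:
    from anthropic import Anthropic
    message = Anthropic().messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=512,
        messages=[{"role": "user", "content": user_text}],
    )
    return message.content[0].text

def handle_turn(transcript: str, brain: Brain = openai_brain) -> str:
    # The STT and TTS stages stay the same no matter which model generates the reply.
    return brain(transcript)
```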
Realtime API Approach
We'll focus on the OpenAI Realtime API preview multimodal model, currently the most production-ready voice-to-voice model available. This model, alongside alternatives like Ultravox (built on top of Llama) and smaller open-source options like Moshi by Kyutai, lets companies integrate these capabilities directly into their infrastructure. These multimodal models simplify interactions by operating as end-to-end systems that process audio input and generate an audio response in real time, and they can interpret user emotions and tone for natural, responsive conversations.
The diagram below illustrates how the Realtime API consolidates voice processing by eliminating the need for separate STT and TTS services, streamlining interactions within a single API call.
You don’t need to be highly technical to recognize the difference in implementation complexity. With OpenAI’s Realtime API, a single, unified process replaces multiple services, streamlining setup and reducing integration points (a minimal connection sketch follows the list below). This leads to:
- Simplified architecture with single-service processing for both audio input and output
- Reduced latency and faster response times
- Improved conversational flow and user experience
- Capable of interpreting user emotions and tone for more natural interactions
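For comparison with the pipeline sketch earlier, here is a minimal connection sketch for the Realtime API over WebSocket. The endpoint, headers, and event names follow OpenAI’s beta documentation at launch and should be treated as assumptions to verify against the current docs; the audio is 16-bit PCM, base64-encoded.

```python
import asyncio, base64, json, os
import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def voice_turn(pcm16_audio: bytes) -> bytes:
    reply = bytearray()
    # On websockets < 14 the keyword is `extra_headers` instead of `additional_headers`.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # One session handles listening, reasoning, and speaking.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "voice": "alloy",
                # In a live streaming setup, enabling server-side VAD
                # ({"type": "server_vad"}) is what powers the interruption handling
                # mentioned earlier; this sketch drives the turn manually instead.
                "turn_detection": None,
            },
        }))
        # Send the caller's audio, then request a spoken response.
        await ws.send(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_audio).decode(),
        }))
        await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        await ws.send(json.dumps({"type": "response.create"}))
        # Collect audio chunks until the model signals the response is complete.
        async for message in ws:
            event = json.loads(message)
            if event["type"] == "response.audio.delta":
                reply.extend(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break
    return bytes(reply)

# usage: audio_out = asyncio.run(voice_turn(audio_in))
```

In a real phone or web integration, audio would be streamed in small chunks in both directions rather than sent as a single buffer, but the point stands: one connection replaces three separate services.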
However, the simplicity comes with notable downsides, primarily cost. OpenAI’s Realtime pricing escalates as conversations grow longer, because each new turn reprocesses the accumulated conversation. Bringing costs below $0.56 per minute without sacrificing quality is difficult, which makes the API’s pricing feasible mainly for non-production projects or simple demos.
At launch, OpenAI’s Realtime API was priced at $0.06 per minute of audio input and $0.24 per minute of audio output; prompt caching later cut effective rates by roughly 30%, but they remain high. In practice, despite the expected savings from prompt caching, the actual cost per minute stays significant because of how prompts are processed: they accumulate as the conversation progresses. This ongoing token processing drives the overall cost, as every word spoken adds to the token count, steadily increasing the expense of longer conversations.
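A rough back-of-the-envelope model shows why the per-minute price climbs. It assumes launch prices and that every turn re-sends all prior audio as context, ignoring prompt caching and text tokens entirely, so cumulative input grows roughly quadratically with call length; the absolute numbers it prints are illustrative only and will not match real measurements.

```python
# Simplified cost model: launch prices, no caching, each turn reprocesses all prior audio.
INPUT_PER_MIN = 0.06   # audio input, $/minute
OUTPUT_PER_MIN = 0.24  # audio output, $/minute

def estimated_cost(call_minutes: float, turn_minutes: float = 0.5) -> float:
    turns = int(call_minutes / turn_minutes)
    total = 0.0
    for k in range(1, turns + 1):
        context_minutes = k * turn_minutes            # all audio so far is fed back in
        total += INPUT_PER_MIN * context_minutes      # growing input context per turn
        total += OUTPUT_PER_MIN * (turn_minutes / 2)  # assume the model speaks half the turn
    return total

for minutes in (1, 5, 10, 20):
    cost = estimated_cost(minutes)
    print(f"{minutes:>2}-min call: ~${cost:.2f} total (~${cost / minutes:.2f}/min)")
```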
Below is a breakdown of projected costs based on real test cases:
- 1-minute call: $0.56 per minute
- 3.5-minute call: $2.80, or $0.80 per minute
- 5-minute call: $4.02, or $0.80 per minute
- 10-minute call: $9.40, or $0.94 per minute
- 20-minute call: $21.20, or $1.06 per minute
This pricing is viable only in cases where users can justify a premium for high-quality, responsive interactions.
For example, building a phone-based voice agent (using Twilio) on top of an STT/LLM/TTS stack, with Deepgram for speech-to-text, OpenAI’s GPT-4o-mini for language processing, and Cartesia for text-to-speech, would cost around $0.07–$0.10 per minute. This rate stays nearly constant, because costs do not increase significantly as the conversation progresses.
Additionally, using OpenAI Realtime involves other cons:
- Lack of Customizability: You can’t use custom language models or fine-tune the existing model for specific needs, limiting adaptability to niche applications.
- Vendor Lock-In: You're tied to OpenAI’s ecosystem, making it challenging to switch providers or integrate custom components.
- Potential Data Privacy Concerns: Relying on OpenAI's infrastructure for sensitive voice interactions may raise data security or privacy issues, especially for industries with strict data handling requirements.
Conclusion
In conclusion, choosing between OpenAI’s Realtime API and a traditional STT/LLM/TTS infrastructure depends largely on your business needs, budget, and user expectations. If your use case calls for high-quality, seamless voice interactions and you can justify the costs by passing them on to users, the Realtime API offers an unmatched, simplified setup that’s ideal for premium, responsiveness-driven applications.
However, if the Realtime API’s pricing doesn’t align with your budget, especially for longer interactions, it’s no surprise that more traditional, cost-effective infrastructures are often preferred. That’s why alternatives like LiveKit, Vapi.ai, and Bland.ai exist: they provide nearly real-time responsiveness without escalating costs, offering flexibility and affordability without building everything from scratch. This balance lets you deliver high-quality voice interactions tailored to your budget and customer needs.