AI Voice Agents: Real-Time vs Turn-Based (TTS/STT) Architecture

Discover how AI voice agents are evolving from turn-based to real-time interactions—leveraging streaming STT/TTS pipelines, hybrid and integrated architectures—to reduce latency, handle partial utterances, and deliver more natural, “alive” conversational experiences.

AI voice agents are becoming more human-like than ever, in large part due to advances in real-time speech processing. Traditionally, voice assistants followed a turn-based pattern: the user speaks, the system listens silently, then after a pause the system responds. Recent breakthroughs by major AI labs and open-source communities are enabling a real-time approach – where the agent can process and even begin responding while the user is still speaking.

Understanding Real-Time Speech-to-Speech Technology

What does “real-time” mean in a voice agent?

In simple terms, a real-time voice agent processes spoken input and generates spoken output with minimal delay, often concurrently. Instead of the classic pattern of wait, then respond, a real-time system can start understanding and formulating a reply as the user is still talking. This reduces awkward silence and makes the interaction feel more natural and interactive. Technically, it relies on streaming pipelines – the audio is processed in small chunks on the fly, and the AI may begin streaming out a response before the user’s utterance is fully finished. This contrasts with traditional voice agents that typically wait for the user to stop speaking and only then send the entire audio for transcription and response generation.
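
To make the contrast concrete, here is a minimal, self-contained Python sketch. The audio chunks, incremental transcription, and "intent detection" are all simulated stand-ins (none of the names map to a real SDK); the point is only that the streaming loop can begin producing output before the final chunk arrives.

```python
# Illustrative comparison of turn-based vs. streaming handling.
# Everything here is simulated so the example runs with no audio or model dependencies.
import time

AUDIO_CHUNKS = ["book me a", " table for", " two at", " seven tonight"]  # pretend 200 ms chunks

def turn_based(chunks):
    # Wait for the full utterance, then "transcribe" and respond in one shot.
    transcript = "".join(chunks)
    return f"Sure, handling: '{transcript}'"

def streaming(chunks):
    # Process each chunk as it arrives; start replying from partial input.
    partial, responded_early = "", False
    for chunk in chunks:
        partial += chunk            # incremental "transcription"
        time.sleep(0.01)            # simulate per-chunk processing latency
        if not responded_early and "table for" in partial:
            responded_early = True  # enough intent detected to respond early
            yield "Got it, checking availability..."
    yield f"Booked: '{partial.strip()}'"

print(turn_based(AUDIO_CHUNKS))
for reply in streaming(AUDIO_CHUNKS):
    print(reply)
```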

Traditional vs. Real-Time Voice Agent Architecture

The classic voice pipeline (Voice → STT → LLM → TTS → Voice) has powered most AI voice agents to date. This traditional modular pipeline is well understood and can be optimized for low latency, especially with streaming components. Real-time architectures, by contrast, stream input and output concurrently across the stack, which can reduce perceived delay and improve responsiveness in scenarios that involve rapid turn-taking or mid-utterance interactions.

Here’s how the two compare:

| Aspect | Traditional Voice Agent | Real-Time Voice Agent |
| --- | --- | --- |
| Architecture | Voice → STT → LLM → TTS → Voice (sequential pipeline) | Fully integrated streaming: Voice → LLM → Voice |
| STT Processing | Waits for the full sentence to transcribe | Sends partial transcripts in real time |
| LLM Behavior | Starts only after full STT output | Begins processing from partial input as the user is speaking |
| TTS Synthesis | Starts synthesizing audio only after the entire text is generated | Starts speaking as soon as the first tokens are generated, in a stream |
| Flexibility | High — easy to swap out STT, TTS, and LLM independently | Less flexible — STT, LLM, and TTS components must support real-time streaming and coordination |
| Risks / Challenges | Requires careful orchestration between components to minimize latency | Requires stream orchestration to avoid mishearing or interrupting users |
| User Experience | Structured and clear, but less dynamic | Feels more "alive" — the agent can begin replying before the user finishes speaking and can express emotion through voice |
| Best Use Cases | Complex interactions requiring high accuracy (e.g., IVR systems, technical support) | AI concierges, live support agents, multilingual assistants for fast-paced environments |
| Technical Requirements | Lower — no need for streaming or session orchestration | Requires low-latency infrastructure and session management for continuous real-time voice processing |

Core Architectures for Speech-to-Speech AI Agents

There are a few different architectural approaches emerging for real-time voice agents:

1. Voice → LLM → Voice (Integrated Speech-to-Speech)

This is the most end-to-end approach: a single model that directly takes audio input and generates audio output. In other words, the AI agent itself is multimodal, understanding spoken language and producing speech without needing separate dedicated STT or TTS modules. 

Early examples include Moshi by Kyutai Labs, a fully open-source audio-to-audio LLM, as well as proprietary systems like OpenAI's GPT-4o Audio and Google's Gemini 2.5 Flash voice capabilities. These models use unified pipelines to minimize latency and preserve conversational flow.

The integrated approach can, in theory, achieve the lowest latency because the model “hears” and “speaks” directly. It also opens up possibilities like the model naturally handling paralinguistic cues – tone, timing, interjections – since it treats speech as a first-class input/output. However, truly end-to-end systems are cutting-edge and complex. They may lack the flexibility of choosing a custom voice or the accuracy of a separately optimized STT engine (at least in current iterations).

2. Voice → (STT) → LLM → (TTS) → Voice (Hybrid)

The Hybrid Integration architecture is a versatile approach that balances the strengths of fully integrated models with the flexibility of modular components. In this configuration, the voice input may be processed directly by a multimodal LLM or first converted into text by a dedicated STT module. Similarly, the output can be either directly synthesized speech from the LLM or a textual response that is passed through a separate TTS engine to generate audio. This dual approach offers significant flexibility, as it allows developers to choose the most suitable components for each stage based on the specific requirements of the application.

Key Configurations:

| Configuration | LLM Input | Final Output | Example Models |
| --- | --- | --- | --- |
| Voice → LLM → Voice (Direct) | Voice | Voice | Moshi, Qwen-Audio |
| Voice → LLM → Text → TTS (Partial) | Voice | Voice | Ultravox (planned) |
| Voice → STT → LLM → Voice (Modular) | Text | Voice | Meta's MMS, OpenAI GPT-4o + codec voice |
| Voice → STT → LLM → TTS → Voice (Traditional) | Text | Voice | Google STT + GPT-4o + Azure TTS |

For example, Ultravox – an open-source multimodal LLM – is designed to ingest human speech without a separate ASR (STT) stage and output text in real time, giving you integrated understanding on the input side (fewer cascaded components to introduce delay). Many current voice agents operate in a similar hybrid fashion: they use a streaming speech-recognition front end, feed text into an LLM, and rely on a standard TTS for output. The benefit is that you can choose a high-quality or custom voice for the TTS, or do things like voice cloning more easily on the output, while still reducing input latency. The downside is that you're not fully end-to-end – the TTS stage can still add some latency and complexity.

Alternatively, the STT transcribes user speech into text, which is passed to an audio-capable LLM that outputs speech tokens or raw audio — no separate TTS module is required. This design is supported by emerging systems like Meta's MMS, Qwen-Audio by Alibaba, and Ultravox.

It strikes a balance between traditional modular pipelines and fully end-to-end models: you retain flexibility in STT, reduce total latency, and simplify the output stack. However, since the LLM handles voice generation, custom voice control is limited, and STT-to-LLM orchestration still needs to be managed.

This architecture is ideal when you want to pair open-source STT (like Whisper) with fast, integrated voice generation — without relying on a full TTS backend.
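
One way to picture the flexibility of the hybrid approach is to treat each stage as an interchangeable interface. The sketch below is an illustrative design pattern under our own naming, not any specific framework's API: any streaming STT, LLM, or TTS backend can be plugged in as long as it satisfies the corresponding protocol.

```python
# Illustrative interfaces for a pluggable STT -> LLM -> TTS pipeline.
from typing import Iterable, Protocol

class StreamingSTT(Protocol):
    def transcribe(self, audio_chunks: Iterable[bytes]) -> Iterable[str]:
        """Yield partial transcripts as audio arrives."""

class LLM(Protocol):
    def generate(self, text: str) -> Iterable[str]:
        """Yield response tokens as they are produced."""

class StreamingTTS(Protocol):
    def synthesize(self, tokens: Iterable[str]) -> Iterable[bytes]:
        """Yield audio frames as soon as tokens are available."""

def run_turn(stt: StreamingSTT, llm: LLM, tts: StreamingTTS,
             audio_chunks: Iterable[bytes]) -> Iterable[bytes]:
    # For brevity this waits for the final STT hypothesis; a real agent would
    # start LLM generation from partial transcripts.
    transcript = ""
    for partial in stt.transcribe(audio_chunks):
        transcript = partial
    return tts.synthesize(llm.generate(transcript))
```

Swapping a Whisper-based STT for a cloud STT, or one TTS voice for another, then becomes a change to how the pipeline is wired rather than to the pipeline itself.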

The Softcery guide on STT/TTS notes that while direct speech-to-speech is emerging, the STT→LLM→TTS approach remains very popular because of its flexibility and the ease of switching out language models or voices as needed.

Available Real-Time Models & Platforms

Real-time AI voice agents have quickly moved from research labs to both commercial products and open-source projects. Here we survey the notable options in 2025, from the tech giants to emerging open-source contenders, as well as some hybrid approaches:

Leading Proprietary Platforms

Two of the most advanced real-time voice AI offerings today come from OpenAI and Google:

| Feature | OpenAI GPT-4o (Realtime Preview) | Google Gemini 2.0 Flash |
| --- | --- | --- |
| Provider | OpenAI (also via Azure) | Google / DeepMind |
| Model Type | Multimodal LLM with realtime audio streaming support | Multimodal "flash" LLM optimized for speed and interactivity |
| Latency – Time to First Token | ~280 ms | ~280 ms |
| Audio Input | Streaming audio via WebRTC + WebSocket API | Streaming audio via Multimodal Live API (likely gRPC-based) |
| Token Generation Speed | ~70–100 tokens/second | ~155–160 tokens/second |
| Hosting / Access | Cloud only (OpenAI API / Azure OpenAI Service) | Cloud only (Google AI Studio / Vertex AI) |
| Developer Integration | Open-source reference stack with LiveKit + WebRTC + OpenAI Streaming API | Access via Google's Vertex AI or AI Studio; endpoint: gemini-2.0-flash-live-001 |
| Multimodal Capabilities | Yes — audio input, speech output; also supports vision | Yes — audio, video, text input; supports images and rolling context in conversation |
| Throughput Capacity | ~800K tokens/min, ~1,000 req/min (Azure OpenAI, realtime mode) | N/A (not publicly specified, but optimized for high concurrency and streaming) |
| Unique Strengths | High-quality GPT-4 responses; humanlike voice | Double the token speed; rolling attention (long interaction memory); multimodal I/O |
| Limitations | Black-box (closed model); no self-hosting; API cost sensitive at scale | Also closed-source; limited model control or customization |


Open-Source Alternatives

Several projects have appeared, showing what’s possible without proprietary models:

| Feature | Ultravox (by Fixie.ai) | Moshi (by Kyutai Labs) |
| --- | --- | --- |
| Model Type | Multimodal LLM (audio + text encoder, outputs text) | Audio-to-audio LLM (integrated STT and TTS — speech in, speech out) |
| Architecture | Voice → LLM → Text (speech output planned in future versions) | Voice → LLM → Voice (fully integrated speech-to-speech pipeline) |
| Streaming Support | Streaming text output with low latency | Full-duplex streaming (supports overlap and interruption) |
| Time to First Token (TTFT) | ~190 ms (on smaller variant) | ~160 ms |
| Token Generation Speed | ~200+ tokens/sec | Not token-based; generates speech waveform directly |
| Base Models | Built on open LLMs (e.g., LLaMA 3 – 8B / 70B) | Proprietary foundation model trained by Kyutai |
| Audio Processing | Projects audio into the same token space as text using a custom audio encoder | End-to-end audio encoder and decoder (neural codec pipeline) |
| Output Type | Text (for now), with plans for speech token output | Audio (neural codec speech) |
| Hosting / Deployment | Self-hostable; requires GPU infra, especially for the 70B variant | Self-hostable (heavy); public demo available at moshi.chat |
| Open-Source Status | Fully open: model weights, architecture, and code available on GitHub | Fully open: code and demos available; weights provided (early stage) |
| Extensibility | Can plug in any open-weight LLM; attach custom audio projector | Closed model structure for now; focused on turnkey audio-agent use |
| Key Strengths | Easily fine-tuned; modular and composable; fast even on smaller models | True speech-to-speech; handles interruptions; real-time feel with minimal latency |
| Limitations | Text output only; less language capability than GPT-4 without fine-tuning; requires infra setup | Early-stage, less stable; no customization or voice control yet; experimental |
| Use Case Fit | Voice-enabled bots with real-time understanding, using custom TTS for output | Full voice agents with natural interruptions and direct speech response |

Apart from core models, there are frameworks like Pipecat (an open-source, vendor-agnostic framework for voice agents), which companies like Daily.co use to integrate these models. There are also toolkit libraries like LiveKit (used by OpenAI for WebRTC streaming) and the FastRTC library, which simplifies building real-time audio apps in Python. These aren't models, but they are important for developers to actually deploy real-time agents. Open-source speech recognition (e.g., Vosk, NeMo) and TTS (e.g., VITS, FastSpeech) components can also be assembled to approximate real-time agents if one doesn't use a single end-to-end model.

Performance Metrics That Matter

When evaluating real-time voice agents, it's critical to look at a few key performance metrics. These metrics determine not just how fast the system is, but also how accurate and intelligible it is, which ultimately define user experience. Below are the core metrics and why they matter:

Time to First Token (TTFT)

Time to First Token (TTFT) – This is a measure of latency. In the context of a voice agent, TTFT usually refers to how long it takes from the user speaking (or finishing a query) to the agent beginning to speak a response. It can be measured from end-of-user-speech to start-of-agent-speech. The latest models we discussed have TTFT in the few-hundred-millisecond range. Google's Gemini Flash, for instance, logs ~0.28 s TTFT, and OpenAI's GPT-4o realtime is in a similar ballpark (~0.25–0.3 s). These numbers are remarkably low – for comparison, human response latencies in conversation (the gap between speakers) are often around 200 ms as well. Keep in mind that TTFT can also be affected by network latency (for cloud APIs), so real-world figures may be a bit higher than lab values. When assessing a solution, check whether TTFT is measured in a controlled setting or end-to-end. Lower TTFT is generally better, but extremely low TTFT might sometimes indicate the model is responding too quickly (perhaps before it's certain of user intent), which could have implications for accuracy.
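
Measuring TTFT in your own stack comes down to timestamping two events: the end of user speech and the first agent audio frame. A minimal sketch, assuming you can hook both events in your pipeline (the hook names here are placeholders):

```python
# Minimal TTFT measurement helper. Call on_end_of_speech() where your pipeline
# detects end-of-utterance and on_first_agent_audio() when the first audio frame
# of the reply is emitted.
import time

class TTFTMeter:
    def __init__(self):
        self.end_of_user_speech = None
        self.samples = []

    def on_end_of_speech(self):
        self.end_of_user_speech = time.monotonic()

    def on_first_agent_audio(self):
        if self.end_of_user_speech is not None:
            self.samples.append(time.monotonic() - self.end_of_user_speech)
            self.end_of_user_speech = None

    def report(self):
        if self.samples:
            avg_ms = 1000 * sum(self.samples) / len(self.samples)
            print(f"avg TTFT: {avg_ms:.0f} ms over {len(self.samples)} turns")
```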

Word Error Rate (WER) 

This metric applies to the speech recognition portion of the system. WER measures the percentage of words incorrectly recognized in the transcript. A lower WER means more accurate transcription of the user's input. For instance, Meta AI's recent research on streaming LLM-based ASR achieved about 3.0% WER on LibriSpeech test-clean (and ~7.4% on test-other) in real-time mode (isca-archive.org) – impressively close to the best offline models. For a voice agent, WER is important because any mistake in understanding the user can lead the LLM astray. Cloud providers often publish WER on benchmarks, but real-world WER can be higher. Also, consider that a real-time agent might correct some ASR errors via context (the LLM might infer what the user meant), but generally you want WER as low as possible. Domain adaptation (custom vocabulary or fine-tuning) can help if your use case has unusual terms.
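
WER itself is a standard word-level edit distance; libraries such as jiwer compute it, but the calculation is short enough to sketch directly:

```python
# Word Error Rate = (substitutions + deletions + insertions) / reference word count,
# computed via Levenshtein distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("book a table for two", "book table for you"))  # 0.4
```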

Real-Time Factor (RTF)

Real-Time Factor (RTF) – RTF is a measure of speed relative to the length of the input. An RTF < 1.0 means the system processes faster than the input duration. Different components have their own RTF: an STT engine might process audio at, say, 0.2× real time (very fast), and an LLM might generate tokens at, e.g., 50 tokens/sec. Sometimes you'll see tokens per second as a proxy for RTF on generation. For TTS, RTF might refer to how quickly audio is synthesized relative to its length (modern TTS often has an RTF of 0.1 or better, meaning it can generate 10 seconds of speech in 1 second of processing). When testing a system, ensure that under load it maintains RTF < 1. If you feed audio too fast and the system can't keep up, you'll get latency build-up. Some independent analyses of LLM speed (like those by Inferless) show how throughput can vary by model – a smaller model might have worse language quality but far better RTF. Thus, token generation speed can be a deciding factor if ultra-low latency is needed.
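
RTF is simply processing time divided by audio duration, so a small timing helper is enough to benchmark each stage in isolation (the transcribe call in the usage comment is a placeholder for your actual component):

```python
# Real-Time Factor: processing_seconds / audio_seconds.
# RTF < 1.0 means the component keeps up with (or outpaces) the incoming audio.
import time
from contextlib import contextmanager

@contextmanager
def rtf_timer(audio_seconds: float, label: str):
    start = time.monotonic()
    yield
    elapsed = time.monotonic() - start
    print(f"{label}: RTF = {elapsed / audio_seconds:.2f} "
          f"({elapsed:.2f}s to process {audio_seconds:.1f}s of audio)")

# Usage (my_stt is a placeholder for whatever engine you are benchmarking):
# with rtf_timer(audio_seconds=10.0, label="STT"):
#     transcript = my_stt.transcribe(audio)
```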

Bottom line: Prioritize based on use case.

  • For real-time speed: go for low TTFT/RTF.
  • For precision: optimize WER and LLM quality.
  • For clarity: test TTS in realistic audio conditions.

Cost Analysis & Scalability

The table below outlines the major cost categories you should evaluate — from per-minute cloud pricing to GPU runtime and enterprise overhead.

| Cost Category | Description | Examples / Benchmarks | Key Considerations |
| --- | --- | --- | --- |
| Usage-Based Pricing (Cloud APIs) | Pay-per-minute for input/output audio via APIs like OpenAI, Google | OpenAI GPT-4o: ~$0.004/min (input), ~$0.008/min (output); Google Gemini Live: ~$0.0015–0.006/min | Simple setup; scales with usage; higher cumulative cost at large volumes |
| Compute Costs (Self-Hosting) | Run open-source models like Ultravox/Moshi on your own infra | Hosting Ultravox 70B may need an A100/H100 GPU per concurrent session; GPU costs: ~$2–$3/hr (cloud) | Lower marginal cost at scale; requires infra & DevOps team; harder to spin up instantly |
| Scalability / Rate Limits | Limits on concurrent sessions, tokens per minute, request rate | OpenAI GPT-4o preview: 800K tokens/min, 1K requests/min; enterprise: up to 30M tokens/min | Watch for WebSocket caps or long-lived session constraints; request enterprise quotas if needed |
| Bandwidth Overhead | Cost of streaming audio data over the network | ~8–64 kbps per stream; telephony codecs (e.g., G.711 vs G.729) can affect costs | Minor cost per stream, but adds up at scale; ensure egress limits aren't exceeded in cloud setups |
| Enterprise Overhead | SLAs, premium support, custom deployments, fallback systems | Regional/on-prem hosting; redundancy systems (e.g., backup STT or fallback bots) | Adds reliability and control; contractual/licensing complexity increases total cost of ownership |

Match Cost Strategy to Deployment Scale

There is no one-size-fits-all pricing model. For early-stage projects or low-volume traffic, cloud APIs offer fast setup and predictable pricing — you only pay for what you use. But as usage grows, infrastructure costs from self-hosting may become more economical, especially if you need tight control, data locality, or custom model tuning.

At enterprise scale, success hinges not just on price per minute, but on reliability, rate limits, support agreements, and long-term flexibility. Whether you choose a hosted or self-managed approach, the total cost of ownership (TCO) should include not only processing minutes, but also bandwidth, DevOps effort, redundancy, and support. The most efficient teams optimize not just for raw cost, but for sustainability, scalability, and user experience over time.

Do the math for your specific scenario: estimate average conversation length, how many per day, and multiply by the per-minute or per-token costs of your chosen platform to get a monthly cost. Weigh that against an investment in infrastructure for open-source. And always keep an eye on usage limits or the need to upgrade to enterprise tiers. Scalability is not just about handling peak load technically, but also doing so within budget.
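
As a back-of-the-envelope example, the sketch below multiplies call volume by per-minute rates; the rates and the even split between user and agent speech are illustrative assumptions, so substitute your provider's actual pricing:

```python
# Rough monthly cost estimator for a cloud voice API priced per audio minute.
# All rates are illustrative; plug in your provider's real pricing.
def monthly_cost(calls_per_day: float, avg_minutes_per_call: float,
                 input_rate_per_min: float, output_rate_per_min: float,
                 agent_talk_share: float = 0.5) -> float:
    minutes_per_month = calls_per_day * avg_minutes_per_call * 30
    input_minutes = minutes_per_month * (1 - agent_talk_share)   # user speaking
    output_minutes = minutes_per_month * agent_talk_share        # agent speaking
    return input_minutes * input_rate_per_min + output_minutes * output_rate_per_min

# e.g., 500 calls/day, 4 minutes each, $0.004/min in, $0.008/min out
print(f"${monthly_cost(500, 4, 0.004, 0.008):,.0f}/month")  # -> $360/month
```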

Need Help Choosing? Let’s Talk.

At Softcery, we help companies navigate this fast-moving ecosystem of AI voice technologies — from selecting the right stack and balancing cost vs. performance, to designing scalable, enterprise-ready solutions.

Whether you're prototyping a proof-of-concept or preparing for thousands of daily conversations, we’ll work with you to:

  • Audit your technical and business goals
  • Match you with the most cost-effective architecture
  • Build a step-by-step roadmap to production

👉 Let’s build your real-time voice strategy — with clarity and confidence.

Contact Us

Technical Implementation Challenges

Building a real-time voice agent isn't just about picking a model – there are a lot of engineering pieces to put together to actually make it work in a production environment. Here are some of the key technical implementation challenges and considerations:

Streaming Integration (WebRTC, WebSockets, etc.)

To achieve low latency, you need the right streaming mechanisms between the user and your system. The main options are WebRTC, WebSockets, and streaming HTTP/gRPC:

  • WebRTC: Web Real-Time Communication is the standard for low-latency audio/video streaming in browsers and mobile apps. It uses UDP under the hood for fast transmission and handles packet loss gracefully. Both OpenAI and Google leverage WebRTC for client-side audio capture and playback. If your users interact via a web browser or mobile app, you'll likely use WebRTC to send microphone audio to your server and play back the agent's voice. WebRTC has features like Acoustic Echo Cancellation (AEC), noise reduction, and automatic gain control (AGC), which are very useful. Libraries like LiveKit, mediasoup, or Twilio can help integrate WebRTC without building it from scratch.
  • WebSockets / gRPC for server communication: On the server side (or between your server and the AI service), you might use a persistent bidirectional connection. OpenAI’s voice API uses WebSockets – the client sends audio chunks and receives tokens back continuously. Google’s API uses gRPC streaming over HTTP/2. These both achieve a similar effect: a continuous stream rather than discrete HTTP requests. When implementing, ensure you handle binary audio frames appropriately. Keep the connection open for the duration of the conversation session to avoid reconnect overhead.
  • Audio encoding: Decide on the audio format. PCM raw audio is simple but bulky. Opus is a popular codec (WebRTC uses it) that gives high quality at low bitrate; however, not all APIs accept Opus packets. Some APIs might accept WAV or FLAC frames. Using a compressed codec can save bandwidth – important for mobile users. For phone calls, G.711 µ-law 8kHz is common. You’ll need to transcode that to whatever your ASR expects (most likely 16kHz linear PCM, since ASRs like Whisper or DeepSpeech expect 16k).
  • Latency tuning: With streaming, you often have buffers. WebRTC, for instance, has jitter buffers to smooth network variation. If they're too large, they add delay. There's a trade-off between smooth audio and ultra-low latency. You can configure many WebRTC parameters, but often the defaults are fine. For WebSocket, send data as soon as you have it (e.g., a 20ms audio frame every 20ms). Avoid batching frames or waiting too long on the client. Also, ensure Nagle's algorithm is disabled if using TCP (most WebSocket libraries do this by default, so small packets aren't delayed).
  • Handling network issues: Real-time audio needs a plan for packet loss. WebRTC handles this with loss concealment (it can fill in missing audio chunks with plausible noise). If you DIY with WebSockets, you might not have that – but if the network is decent, small losses might not break things, or you rely on the ASR being somewhat robust to minor gaps. For output, if using WebRTC, packet loss can cause blips; some systems use redundant packets or forward error correction. This can be overkill unless you are on very unreliable networks.

In practice, many developers use a combination: WebRTC from the client to a relay server (for audio), and then the server calls the AI API via WebSocket. This is exactly what OpenAI's reference example does – the voice agent proxies WebRTC to WebSocket. The reason is that WebRTC is better for the client side (it handles unpredictable networks), while WebSocket is easier to interface with AI models.
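
As a concrete illustration of the relay pattern, here is a hedged sketch of the server-side leg: forwarding 20 ms audio frames over a persistent WebSocket as soon as they arrive, while reading the service's streamed replies concurrently. The URL and message format are placeholders, not any specific vendor's protocol.

```python
# Sketch: stream 20 ms audio frames to an AI service over a persistent WebSocket
# and print whatever the service streams back. Endpoint and framing are assumptions.
import asyncio
import websockets  # pip install websockets

FRAME_MS = 20
SAMPLE_RATE = 16000
BYTES_PER_FRAME = SAMPLE_RATE * 2 * FRAME_MS // 1000  # 16-bit mono PCM

async def relay(mic_frames, url="wss://example.com/realtime"):
    # mic_frames: an async iterator of raw PCM frames, e.g. relayed from WebRTC.
    async with websockets.connect(url) as ws:
        async def send_audio():
            async for frame in mic_frames:
                await ws.send(frame)          # send immediately, no batching
        async def receive_replies():
            async for message in ws:          # tokens or audio chunks from the service
                size = len(message) if isinstance(message, (bytes, bytearray)) else message
                print("received:", size)
        await asyncio.gather(send_audio(), receive_replies())
```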

Telephony Integration (8 kHz and PSTN)

  • Audio format (8 kHz): Telephone audio is low bandwidth (8 kHz, 8-bit µ-law), while most high-quality ASR models are trained on 16 kHz or higher, so feeding raw 8 kHz audio can reduce accuracy. One approach is to use a specialized telephony ASR model or apply bandwidth extension – Twilio, Google, and others offer STT models tuned for phone audio. Alternatively, you can upsample the 8 kHz audio to 16 kHz; this doesn't add new information, but it lets you feed the audio into a model expecting 16 kHz. OpenAI's Whisper, for example, can accept 8 kHz or 16 kHz input, but it was mainly trained on 16 kHz audio. In practice, upsample and perhaps apply a band-pass filter to reduce hiss (see the conversion sketch after this list).
  • SIP/VoIP: You might integrate with a service like Twilio, Nexmo or an on-prem SIP system to get the audio. These often provide the audio via a WebSocket (Twilio has a streaming API that sends 8k PCM in real time) or via a media server. You’ll need to adapt your architecture to ingest that stream and connect it to your AI pipeline.
  • DTMF and control: If the system needs to detect DTMF tones (touch-tone input), typically your telephony provider will detect those out-of-band and signal them separately (because if not, the tone might confuse the ASR). Plan to handle these signals (e.g., Twilio sends a webhook event for DTMF). Real-time voice agents aim to avoid DTMF menus by using voice, but sometimes users will still try pressing or the design might intentionally allow it for backup.
  • Latency in telephony: Phone networks introduce some fixed latency (maybe 100-200ms). Users are somewhat accustomed to that from cell calls, etc. Still, you want your system to add minimal overhead on top. Ensuring your processing pipeline is efficient (and ideally hosted in a data center close to the telephony ingress) will help. If you host your AI in US-East and the call ingress is also US-East, roundtrip is minimized; if your AI is across the globe, the phone call might start feeling laggy.
  • Bridging to agent: If at some point a live agent (human) takes over, you may want to pass along context. For example, if the AI collected info for 3 minutes then escalates, provide the transcribed summary to the human agent so the user isn’t frustrated repeating themselves.
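
Below is a minimal sketch of the 8 kHz conversion mentioned in the audio-format bullet above, using Python's standard-library audioop module (note that audioop is deprecated in recent Python versions and removed in 3.13; scipy or torchaudio resampling are drop-in alternatives):

```python
# Convert telephony audio (8 kHz, 8-bit mu-law) to 16 kHz, 16-bit linear PCM
# before feeding it to an ASR that expects wideband input.
import audioop  # stdlib; deprecated since Python 3.11, removed in 3.13

def phone_to_wideband(mulaw_bytes: bytes) -> bytes:
    pcm_8k = audioop.ulaw2lin(mulaw_bytes, 2)            # mu-law -> 16-bit linear PCM
    pcm_16k, _state = audioop.ratecv(pcm_8k, 2, 1,       # width=2 bytes, mono
                                     8000, 16000, None)  # 8 kHz -> 16 kHz
    return pcm_16k
```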

Handling Background Noise & Voice Variability

  • Noise suppression: It can be valuable to apply a noise suppression algorithm on the input audio before ASR. There are modern ML models (like RNNoise) that can remove background noise (keyboard clacks, fan hum, etc.) in real time. Picovoice's Koala is a commercial example that shows improvements in intelligibility with noise suppression. The trade-off is that these can sometimes slightly distort the voice or consume extra CPU. On output, if the user is in a noisy place, there's not much you can do except speak clearly. But on input, definitely consider noise reduction.
  • Microphone differences: People might talk via a headset, speakerphone, car bluetooth, etc. This affects audio quality (frequency response, presence of echo). Echo cancellation is crucial especially if the agent’s voice might be picked up by the mic (e.g., on speakerphone). WebRTC’s AEC can handle a lot of it. If you’re not using WebRTC (say telephone scenario), the phone network’s echo cancellers usually handle it, but if not, you might need an adaptive echo canceler in your pipeline.
  • VAD and Barge-in: We covered VAD (Voice Activity Detection) earlier. In noisy conditions, a VAD might falsely treat noise as speech or vice versa. You can tune VAD sensitivity or combine it with ASR confidence (e.g., only treat it as end-of-utterance if there is silence and/or the last chunk of ASR was marked final). A robust approach: as long as the ASR is pumping out transcribed words, assume the user is still talking; when it stops yielding new words and VAD indicates silence for, say, 500ms, end the turn. Barge-in: always be ready to stop the TTS if the user starts talking. That means monitoring the mic even while the agent speaks. This is standard in full-duplex setups. The FastRTC library explicitly mentions built-in voice detection for turn-taking – exactly to simplify this (a sketch of this end-of-turn and barge-in logic follows this list).
  • Accents and languages: If your user base is diverse, test the ASR on various accents/dialects. Some cloud ASRs let you specify an accent or locale to improve accuracy. For an open model, you might consider fine-tuning on accented data or using a model known for robustness. Similarly, language: if you need bilingual support, choose models that support it (Google and OpenAI support many languages). Real-time multi-language detection is possible (some ASR auto-detect language or you might have to route to different models per expected language).
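
Here is a sketch of the end-of-turn and barge-in logic described in the VAD bullet above: the turn closes only after the ASR stops yielding new words and roughly 500 ms of silence has passed, and TTS playback is cut the moment speech resumes. The class and callback names are illustrative, not taken from a specific library.

```python
# Illustrative end-of-turn and barge-in tracker; wire the callbacks to your
# actual ASR partial-transcript and VAD events.
import time

SILENCE_END_OF_TURN_S = 0.5  # ~500 ms of silence closes the turn

class TurnTracker:
    def __init__(self):
        self.last_activity = time.monotonic()
        self.agent_speaking = False

    def on_partial_transcript(self, text: str):
        # New words from the ASR mean the user is still talking.
        self.last_activity = time.monotonic()
        if self.agent_speaking:
            self.stop_tts()  # barge-in: user interrupted the agent

    def on_vad(self, speech_detected: bool):
        if speech_detected:
            self.last_activity = time.monotonic()
            if self.agent_speaking:
                self.stop_tts()

    def end_of_turn(self) -> bool:
        return time.monotonic() - self.last_activity > SILENCE_END_OF_TURN_S

    def stop_tts(self):
        self.agent_speaking = False
        print("TTS playback stopped (barge-in)")
```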

Stream Management and Orchestration

Managing a continuous conversation stream as opposed to discrete turns introduces orchestration challenges:

  • Half-duplex vs Full-duplex: Decide if your system will let the agent interrupt or talk simultaneously. Most current systems are effectively half-duplex with barge-in (user can interrupt agent, but agent typically won’t interrupt user except maybe short backchannel utterances). Backchannel (the “uh-huh” and “I see”) is an interesting feature – it makes the agent sound more human if it injects those while user is talking. But implementing that is tricky: you’d have to detect pauses and generate a quick backchannel without derailing the ASR. OpenAI’s demo doesn’t really do that yet; it’s more turn-based. It’s an area of research (some experimental bots do it to seem attentive).
  • Prompt management: Because the conversation state persists, you might be maintaining a rolling prompt for the LLM. If using an API with a persistent session, they handle it (to some limit). If manually, you might append each user utterance and agent reply. Watch out for context window limits (if conversation is long, you may need to summarize older parts). Ensure important facts user provided aren’t lost – you could re-inject them into the prompt as needed (“Recall: User’s name is X and problem is Y”).
  • Ensuring required steps: If your flow requires the agent to do something (like verify identity, or ask a specific question), consider building those as checkpoints. You can either let the LLM handle it via prompt instructions, or implement a simple state machine externally. For example, do not send user query to LLM until you have run a separate identity check step. Or if the LLM tries to skip it, detect that and override the response. This is more of a design/logic layer on top of the AI – basically combining rule-based flow with AI. Many real systems do this: they trust the AI for the heavy lifting of understanding and answer generation, but still enforce certain order of actions.
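
A minimal sketch of that checkpoint idea: a small external state machine that keeps the LLM out of the loop until a required step (here, a placeholder identity check) has been completed. The stages and the verification rule are illustrative assumptions.

```python
# External guard around the LLM: enforce required steps before free-form dialogue.
from enum import Enum, auto

class Stage(Enum):
    VERIFY_IDENTITY = auto()
    OPEN_DIALOGUE = auto()

class ConversationGuard:
    def __init__(self):
        self.stage = Stage.VERIFY_IDENTITY

    def handle(self, user_utterance: str, llm_call) -> str:
        if self.stage is Stage.VERIFY_IDENTITY:
            if self.identity_verified(user_utterance):
                self.stage = Stage.OPEN_DIALOGUE
                return "Thanks, you're verified. How can I help?"
            return "Before we continue, could you confirm your account number?"
        return llm_call(user_utterance)  # only now does the LLM take over

    def identity_verified(self, utterance: str) -> bool:
        return any(ch.isdigit() for ch in utterance)  # placeholder check
```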

Overall, building a real-time agent is an orchestration challenge – you are effectively writing a tiny real-time operating system for a conversation, juggling input and output streams. The good news is that there are emerging best practices and libraries, as mentioned, to handle a lot of this. Still, expect to do significant testing: for example, simulate a user who starts talking while the agent is talking, to ensure your barge-in logic actually stops the TTS promptly.

Having covered the technical nitty-gritty, we should also consider how to test these systems effectively and ensure they meet the desired metrics in practice.

Testing Real-Time Models

Testing a real-time AI voice agent involves more than just unit tests or offline evaluation. You have to evaluate both the AI’s performance and the system’s real-time behavior. Here are some key aspects of testing and tools:

Key Metrics to Measure in Testing

  • Measure end-to-end latency: from the time a user finishes asking something (or even from when they start) to the time the agent begins responding. You can do this by injecting known audio and time-stamping events. For example, play a test audio clip to your system and detect when the response audio starts playing. If you have access to internals, measure TTFT on each turn. Aim for those sub-second times we talked about; if it’s higher, identify bottlenecks (maybe STT took 700ms to finalize, etc.).
  • Measure transcription accuracy (WER) in realistic conditions: You can create a test set of spoken inputs (covering various accents, noise levels, typical phrases for your domain). Run them through the system’s STT (possibly in isolation to remove LLM variance) and compare transcripts to ground truth text. Calculate WER to see how well it does. If WER is high for certain categories (e.g., names, technical terms), that might indicate you need to fine-tune or add custom vocabulary.
  • Measure response accuracy and relevance: This is trickier, as it’s somewhat subjective. You could also have a set of input-output pairs that you expect and see if the model hits those. Benchmarking the LLM part can involve asking standardized questions. Ensure the streaming nature doesn’t degrade quality (it shouldn’t in theory, but if the model is cutting off or something, that’s an issue).
  • User experience metrics: These could include things like turn-taking quality (does the agent ever awkwardly pause or talk over the user), the naturalness of the conversation, and user satisfaction. You might do a small user study where people interact with the agent and give feedback on how it felt. Did it respond quickly enough? Did it feel “alive”? Or were there moments of confusion?
  • Stability metrics: How often does the system glitch? For example, does the audio ever stutter? Do any requests time out? If using cloud APIs, are there any dropped connections? You want to track errors or retries.

Evaluating Real-Time Performance

Beyond measuring, evaluation means interpreting those measurements. For latency, as a rule, anything consistently under ~300 ms for first response is excellent. If you find your TTFT is, say, 800 ms on average, you’ll need to evaluate if that’s acceptable or if you can optimize it. Maybe it’s due to an overly long initial prompt (which takes time for the model to process) – you could shorten it. Or maybe your audio chunk size is too large (if you wait to accumulate 2 seconds of audio before sending, that’s 2 seconds lost – better to send 0.2s chunks).

For accuracy, if the WER or response correctness is not up to par, you might consider adding training data or adjusting prompts. One strategy is to run side-by-side comparisons: have human agents or a baseline system do the same task and compare outcomes. If the AI agent is missing key info or misinterpreting in cases where a human wouldn’t, dig into why – was it an ASR issue or an LLM reasoning issue?

Also evaluate in long sessions: does the performance degrade over time? Sometimes long context can fill up, causing the model to forget earlier content or to slow down. Check memory usage and any increasing delays.

When to Choose Real-Time (Use-Case Suitability and Trade-offs)

Real-time voice agents offer a leap in interactivity, but they are not a fit for every situation. It’s important to assess when the benefits outweigh the costs or complexities.

| Scenario | Why Real-Time? | Key Trade-Off(s) | Recommended Strategy |
| --- | --- | --- | --- |
| Interactive Customer Support | Faster responses reduce frustration, especially during interruptions or long wait times. | Latency vs Accuracy | Use a best-in-class real-time stack for seamless back-and-forth (e.g., GPT-4o, Gemini). |
| Voice Assistants / Companions | Conversational flow is the product. Real-time creates a natural, human-like experience. | Complexity vs Benefit | Invest in a low-latency setup; fine-tune for tone and personality. |
| Real-Time Collaboration (e.g., Driving, Coding) | Users need fast updates to stay focused on the task, e.g., a quick restaurant search while driving. | Latency vs Cost | Optimize latency, but use smaller models where possible to control costs. |
| IVR Replacement (Phone Menus) | Enables natural interaction; avoids rigid scripts and "operator" loops. | Complexity vs Benefit | Use real-time to support barge-ins and speech overlap; simplifies the experience. |
| Education / Language Learning | Immediate feedback and turn-taking simulate real conversation and aid immersion. | Latency vs Accuracy | Accept slightly higher WER if the conversation flows well; reinforce with prompt tuning. |
| Accessibility Tools | Hands-free, responsive control is critical for users with disabilities. | Cost vs Complexity | Use streaming STT and TTS; keep models lightweight but responsive. |
| Sales Chat / Live Engagement | Delay kills engagement; response speed boosts conversions. | Latency vs Infrastructure Cost | Prioritize speed; keep the stack warm; use scalable infra or hybrid options. |
| Research Assistants / Data Retrieval | Depth matters more than speed — users value thoughtful, accurate responses. | Latency vs Completeness | Allow extra generation time; consider slower, more thorough models. |
| Internal Tools / Small Userbase | Response delays are tolerable if quality is high and the system is stable. | Complexity vs ROI | Use a partial streaming setup; optimize for reliability and ease of deployment. |

Customization & Fine-Tuning Options

One of the strengths of modern AI voice agents (especially those based on LLMs) is the ability to customize them – both in how they speak and how they understand/behave. Here are ways you can tailor a voice agent to your brand or use-case:

Voice Customization (TTS Voice Selection & Cloning):

  • If using a separate TTS, you usually have a voice library to choose from (e.g., Google Cloud TTS has many voices, Amazon Polly, etc.). Pick a voice that matches your brand persona – cheerful, authoritative, calm, etc. Ensure the voice supports the language and style you need. Some platforms have expressive voices that can convey different emotions via SSML (Speech Synthesis Markup Language) tags.
  • Voice Cloning: This means creating a custom voice, often by providing a few minutes of a target speaker's recordings. Services like ElevenLabs, Microsoft's Custom Neural Voice, or startups like Resemble AI offer voice cloning, and some systems support cloning from just a few seconds of audio. For example, OpenAI (for ChatGPT) cloned a few specific voices (like one for their assistant persona). If having a unique voice is important (say the voice of your company's spokesperson or a fictional character), cloning is a route. Keep in mind ethical and legal considerations – you should have rights to the voice you clone.
  • With end-to-end models like Moshi, the “voice” is essentially baked into the model’s speech decoder. Currently, they might have a default voice. In future, we might see end-to-end models that can mimic different voices if given a prompt or example (some research does zero-shot voice style transfer). Ultravox plans to eventually output speech tokens that could be fed into a unit vocoder; if you swap the vocoder’s voice profile, you might change the voice without retraining the LLM portion.
  • Multilingual or Accent customization: If your audience speaks multiple languages, you may need voices for each language. Or if you want the agent to speak English with a certain accent (to match users), some TTS allow that. This is part of voice persona design.

Fine-Tuning the Language Model:

  • Domain fine-tuning: If the base LLM doesn't have specific knowledge (say medical terms, or company product details), fine-tuning on domain data can improve both understanding and response accuracy. OpenAI supports fine-tuning GPT-3.5 and has said GPT-4 fine-tuning is coming. Fine-tuning can also help the model adopt a specific style reliably without needing massive prompts. Benefits include higher quality and shorter prompts (which also saves latency and cost).
  • Behavior fine-tuning: You can fine-tune on example dialogues that represent exactly how you want the agent to converse. This can enforce subtle things: maybe you want the agent to always address the user formally as “Sir/Madam”, or you have a certain phrasing it should use for legal reasons (e.g. always saying “this is not financial advice” in certain contexts).
  • With open-source LLMs, fine-tuning is often feasible using low-rank adaptation (LoRA) or other techniques, even on smaller hardware (a minimal setup sketch follows this list). Ultravox's design explicitly allows swapping in a fine-tuned LLM backbone. For instance, if you fine-tune Llama 3 on your dialog data, you can plug it into Ultravox's speech pipeline.
  • Fine-tuning does require dataset preparation: you need representative data, synthetic or real (with permission). Data privacy should be considered – if fine-tuning on real customer data, ensure it's permitted and secure.
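
As referenced above, here is a minimal LoRA setup sketch using Hugging Face transformers and peft; the base model name and target modules are assumptions to adjust for your own stack, and the actual training loop is omitted.

```python
# Minimal LoRA fine-tuning setup sketch (transformers + peft).
# Model name and target_modules are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed; any open-weight causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
# ...then train on your dialogue dataset with the usual Trainer / SFT loop.
```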

Prompt Engineering and Dynamic Prompts:

  • Beyond model training, how you prompt the model each turn is a form of “soft customization”. You’ll likely craft a detailed system prompt describing the agent’s persona, knowledge base, and goals. This is critical for guiding style (whether the agent is friendly, formal, humorous, etc.). For example: “You are CallBot, a polite and helpful banking assistant. You speak in short, clear sentences and never reveal confidential information. You refer to the user as Mr./Ms. followed by their last name.” This sets the tone.
  • Use role-playing in the prompt to inject guidance. Some create example dialogues in the prompt to show the desired style.
  • As discussed, you can inject variables (names, etc.) and updated instructions as the conversation progresses. That’s a form of runtime customization per user/session.

Integrating External Knowledge and Tools:

  • Real-time agents might need the latest info or access to databases (e.g., flight booking details). Rather than try to bake all knowledge into the model (which might be outdated), use tool integration. The agent can call an API (like check inventory, or search docs) and then incorporate that result. This way, you customize the agent’s capabilities without retraining – you give it tools. Many frameworks now allow for such tool use (function calling in OpenAI, plugins, etc.).
  • If you have an internal knowledge base (like product manuals), you could use retrieval augmentation: have the agent automatically retrieve relevant text from a vector database based on the conversation, and include that text in the prompt for factual accuracy. This is a customization to ensure the agent’s answers align with your data.
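
A hedged sketch of that retrieval step, using sentence-transformers for embeddings; the documents, the model name, and the prompt template are illustrative, and any embedding model or vector database could take their place.

```python
# Retrieve the most relevant snippets from an internal knowledge base and
# prepend them to the LLM prompt for factual grounding.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

DOCS = [
    "Refunds are processed within 5 business days.",
    "Premium support is available 24/7 for enterprise plans.",
    "Passwords can be reset from the account settings page.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, widely used embedding model
doc_vecs = model.encode(DOCS, normalize_embeddings=True)

def build_prompt(user_question: str, top_k: int = 2) -> str:
    q_vec = model.encode([user_question], normalize_embeddings=True)[0]
    scores = np.dot(doc_vecs, q_vec)             # cosine similarity (vectors normalized)
    best = np.argsort(scores)[::-1][:top_k]
    context = "\n".join(DOCS[i] for i in best)
    return f"Use the following context to answer.\nContext:\n{context}\n\nUser: {user_question}"

print(build_prompt("How long do refunds take?"))
```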

Testing Customizations: After any fine-tuning or prompt changes, test thoroughly to ensure the agent still behaves under real-time conditions. Fine-tuning might improve certain aspects but could also make the model verbose or too cautious unless done carefully (OpenAI notes fine-tuning can reduce latency because of shorter prompts, which is a plus).

Update and Maintenance: Over time, you may need to update the agent – new product info, new voice, etc. Customization is not a one-shot; plan for a cycle of updates. If you fine-tune a model, and then the base model improves or you get access to a better model, you’ll want to reapply your customization on the new base (e.g., fine-tune GPT-4.1 when it comes, using learnings from GPT-4). Modular approaches (like separating knowledge via retrieval) can ease maintenance.

In summary, you can shape both the voice and the mind of the voice agent:

  • Voice: by picking or creating the right TTS voice.
  • Mind/Personality: by prompt design or fine-tuning on dialogues.
  • Knowledge/Skills: by integrating tools or fine-tuning on domain data.

By leveraging these, a generic AI model becomes your AI agent, aligned with your company’s identity and goals.

Conclusion

AI voice agents with real-time capabilities represent a major advance in how machines interact with humans. They bring us closer to the kind of fluid, responsive conversations we have with people. By understanding the technical underpinnings and carefully considering use-case fit, performance metrics, costs, and customization avenues, you can make an informed decision about adopting this technology. For some, real-time voice AI will be a game-changer that delights users and sets a new bar for interaction; for others, a phased or cautious approach might be prudent until the ecosystem further matures.

We hope this comprehensive guide has given you both a high-level framework and deep technical insight to evaluate real-time AI voice agents for your needs. The field is evolving rapidly, and what’s cutting-edge today (like streaming LLMs) may become standard in a year or two – making it increasingly feasible to deploy voice agents that are fast, smart, and personable. The opportunity to create more natural human-computer interactions is here, and with the right strategy, you can harness it to enhance your products or services.