AI Voice Agent Cost Calculator

Estimate per-minute cost and round-trip latency for voice AI agents in 2026. Compare LLMs, TTS, STT, platforms, and transport (telephony or WebRTC) using verified pricing and latency data from official provider sources.

Technology

Balanced Demo

Production-ready cost and quality


Large Language Model (LLM)

AI engine that processes conversation context and generates responses, with costs based on input and output token usage

Speech Synthesis (Text to Speech / TTS)

Converts AI text responses into spoken audio, with costs scaling based on character count and voice quality

Speech Recognition (Speech to Text / STT / ASR)

Transcribes caller speech into text for AI processing, with costs typically charged per minute of audio

Developer Platform

Infrastructure that orchestrates voice agent components and manages real-time conversations, charged per minute of usage

Transport (Telephony / WebRTC)

How the audio reaches the agent — PSTN telephony provider (Twilio, Telnyx) or WebRTC (Daily, LiveKit, Agora). Telephony providers may also charge per warm transfer.

Note: Provider list rates as of May 2026. LLM cost includes a 1.8× reality factor for context growth, interruptions and tool calls. Bundled platforms embed 15–40% markup on STT/TTS/LLM. Telephony excludes 5–15% USF/regulatory pass-through and DID rental.

Parameters

AI Agent Talk Time (%)

25%

50%

75%

100%

LLM Input Size

Concise (~1000 tokens)

Detailed (~3000 tokens)

Extensive (~6000 tokens)

Average Call Duration (minutes)

< 3 min

< 10 min

< 30 min

< 60 min

Estimated Round-Trip Latency

Measured from the moment the user stops speaking to the moment the agent starts speaking.

Calibrated against real Vapi+casegen production data (2026-05). LLM TTFT scales with prompt size (longer system prompt + history = slower first token). Endpointing reflects each platform's default VAD: Vapi ~1450ms, Retell ~700ms, Pipecat ~300ms. Bundled platforms (Bland, Ultravox) skip inter-component network; BYOK middleware adds ~60ms; full BYOK adds ~120ms. Aim for under 1.5s; 1.5–2.5s is typical production; over 3s feels broken. Reasoning-mode LLMs add seconds, so voice agents skip them.
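As a rough illustration of how these figures combine, the sketch below sums a latency budget and labels it with the thresholds from the note. Only the endpointing defaults, the BYOK overheads and the thresholds come from the note; the STT, LLM TTFT and TTS values in the example call are placeholder assumptions, the zero overhead for bundled platforms is a simplification, and the "slow" label for the 2.5–3 s band is an assumption because the note leaves that band unlabelled.

```python
# Endpointing defaults and inter-component overheads quoted in the note above (ms).
ENDPOINTING_MS = {"vapi": 1450, "retell": 700, "pipecat": 300}
NETWORK_OVERHEAD_MS = {"bundled": 0, "byok_middleware": 60, "full_byok": 120}


def round_trip_ms(platform, integration, stt_ms, llm_ttft_ms, tts_ms):
    """Sum the stages between 'user stops' and 'agent speaks'."""
    return (
        ENDPOINTING_MS[platform]
        + NETWORK_OVERHEAD_MS[integration]
        + stt_ms
        + llm_ttft_ms
        + tts_ms
    )


def classify(latency_ms):
    """Label a latency using the thresholds from the note."""
    if latency_ms < 1500:
        return "target"
    if latency_ms <= 2500:
        return "typical production"
    if latency_ms <= 3000:
        return "slow"  # assumption: this band is not labelled in the text
    return "feels broken"


# Placeholder component latencies (assumptions for illustration only).
example = round_trip_ms("retell", "byok_middleware", stt_ms=150, llm_ttft_ms=500, tts_ms=100)
print(example, classify(example))
```

Because endpointing is a fixed term in the sum, the platform's default VAD setting can outweigh every model choice in the stack.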

Want to know the full cost of your AI voice agent?

API pricing doesn't cover development, testing, infrastructure, or optimization. Share what you're building and we'll map the complete cost.


The Real Cost of Running an AI Voice Agent in 2026

Understanding what drives AI voice pricing is essential for anyone building or scaling a production-grade agent. In 2026, the cost of an AI voice agent is shaped by five core components:

Core Price Components:

Speech Synthesis (Text-to-Speech / TTS) — converts written text into natural-sounding speech.
Speech Recognition (Speech-to-Text / STT / ASR) — converts spoken audio into written text using models that transcribe language in real time.
Large Language Model (LLM) — neural networks trained on vast text data to understand and generate human-like responses.
Voice Agent Platform — provides the infrastructure and tools to build, deploy, and manage AI-powered voice agents.
Transport (Telephony / WebRTC) — connects audio between user and agent, either over PSTN telephony (Twilio, Telnyx) or WebRTC (Daily, LiveKit, Agora) for browser-based agents.

Cost Breakdown for a Production-Grade AI Voice Agent

1. Speech Synthesis (Text-to-Speech / TTS)

What is TTS and why is it needed?

Text-to-Speech allows a voice agent to speak responses in a natural-sounding voice, which is essential for phone-based or voice interactions. Understanding the real-time vs turn-based TTS architecture is crucial for optimizing both costs and performance.

What are the key TTS considerations?

Voice quality & naturalness: Modern neural TTS voices sound more human-like, with attention to tone and prosody. Quality varies by provider and voice type.
Language and voice options: Support for multiple languages and accents is crucial for global use. Custom voices (branded voice personas) may be offered at higher cost.
Latency: The time to synthesize speech should be low to keep conversations flowing. Streaming TTS can output audio in real-time as text is processed.
Integration: TTS is usually accessed via cloud API. Some providers allow on-premise or edge deployment (often at enterprise tiers) for low latency or privacy.

What are the usual TTS billing models?

Per character (or per million characters) of text converted to speech. This is the most common model – you pay based on the length of the text input.
Free tiers are common (e.g. millions of chars per month free) to get started. Beyond that, it's pure usage-based billing.

What are the usual TTS price ranges?

Roughly $4 to $200 per 1 million characters in 2026. Standard cloud voices (Polly Standard, Google Standard) sit at ~$4/M. Mainstream neural voices (Polly Neural, Google Neural2, Azure Neural, OpenAI gpt-4o-mini-tts) cluster around $12–$22/M. Premium real-time models (Cartesia Sonic 3 ~$35/M effective, ElevenLabs Turbo/Flash v2.5 ~$50/M) and ultra-realistic flagships (ElevenLabs v3 ~$100/M, Hume Octave ~$50–$150/M, Google Studio $160/M) dominate the high end. Open-source self-hosted (Fish Speech, Kokoro, Qwen-TTS) costs $0 in licensing plus GPU compute. This translates to roughly $0.005–$0.04 per minute of generated speech for voices priced up to about $50/M; flagship voices at $100–$160/M cost proportionally more per minute.
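To make the per-character to per-minute conversion concrete, here is a minimal sketch. It assumes about 770 characters per minute of generated speech, which is the ratio implied by the Rime figures quoted on this page ($0.030/min is about $39 per 1M characters); actual speaking rates vary by voice and language, so treat the constant as an assumption.

```python
# Assumption: ~770 characters per minute of generated speech, the ratio implied
# by Rime's figures in this article ($0.030/min is about $39 per 1M characters).
CHARS_PER_SPOKEN_MINUTE = 770


def tts_cost_per_minute(price_per_million_chars, chars_per_minute=CHARS_PER_SPOKEN_MINUTE):
    """Convert a per-1M-character list price into dollars per minute of speech."""
    return price_per_million_chars * chars_per_minute / 1_000_000


for name, price in [("Polly Standard", 4), ("Azure Neural", 15), ("ElevenLabs Flash v2.5", 50)]:
    print(f"{name}: ${tts_cost_per_minute(price):.4f} per spoken minute")
```

The result is per minute of agent speech, so a call where the agent talks half the time would incur about half that amount per call minute.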

What are the key TTS providers?

Cartesia (Sonic 3, Sonic Turbo, Sonic 2): High-speed neural TTS optimized for conversational use. Sonic 3 is the 2026 flagship with ~90ms latency and 15 credits/sec audio billing.
ElevenLabs (v3, Turbo v2.5, Flash v2.5): Ultra-realistic voice cloning and ~75ms streaming. v3 ~$100/M chars, Turbo/Flash ~$50/M.
OpenAI (gpt-4o-mini-tts, tts-1, tts-1-hd): Token-based audio output ($12/M effective for mini-tts, $15/M for tts-1, $30/M HD).
Hume Octave: Emotion-aware TTS with fine-grained prosody control. Pro tier $50–$100/M chars.
Inworld (TTS-1.5 Mini, TTS-1.5 Max): Real-time TTS with Mini at $25/M chars and Max at $35/M (drops on Growth tier).
Rime (Mist, Arcana): Per-audio-minute billing ($0.030/min PAYG, ~$39/M chars effective). Strong for English voice agents.
MiniMax Hailuo Speech (2.5 Turbo, 2.6 HD): Multilingual TTS ~$40–$50/M chars. Credit-based subscription.
Microsoft Azure Speech (Neural, Neural HD): SSML support and fine-grained control. Neural $15/M, Neural HD $22/M.
Amazon Polly: Standard $4/M, Neural $16/M, Generative $30/M, Long-form $100/M.
Google Cloud TTS (Standard, Neural2, Chirp 3 HD, Studio): $4/$16/$30/$160 per 1M chars respectively.
Open source self-host: Fish Speech, Kokoro, Qwen-TTS — free model weights, only GPU compute.

2. Speech Recognition (Speech-to-Text / STT / ASR)

What is STT and why is it needed?

Speech-to-Text (STT), also known as Automatic Speech Recognition (ASR), is the component that understands what the user is saying. It transcribes the caller's speech into text so that it can be processed by an LLM. Without accurate STT, the voice agent cannot know the user's request. For an in-depth comparison of STT and TTS providers, check our comprehensive STT/TTS selection guide.

Key considerations:

Accuracy: Quality varies by model (general-purpose vs. phone-call-optimized models, etc.). Good models achieve a word error rate (WER) of 5–10%, i.e. roughly 90–95% word accuracy.
Latency: A live voice agent needs low-latency streaming transcription so the user is not kept waiting. Processing speed is often expressed as the Real-Time Factor (RTF), processing time divided by audio duration; values below 1 mean faster than real time.
Language support: Important if your service needs to handle multiple languages. Some STT engines specialize or have better accuracy in certain languages.
Customization: Some providers allow custom vocabularies or even custom acoustic models (training on specific data) to improve accuracy on proper nouns or industry jargon.
Features: Extras like speaker diarization, punctuation and capitalization, profanity filtering, etc., can add value in voice agent use cases.
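The accuracy figures above are usually derived from word error rate. A minimal sketch of the standard calculation (word-level edit distance divided by the number of reference words) follows; the lowercasing and whitespace tokenization are simplifying assumptions, and the sample transcripts are invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


reference = "i would like to book a table for two at seven tonight"
hypothesis = "i would like to book a table for you at seven tonight"
wer = word_error_rate(reference, hypothesis)
print(f"WER {wer:.1%}, word accuracy {1 - wer:.1%}")
```

Note that a single substituted word ("two" heard as "you") already changes the meaning of the request, which is why a headline WER does not fully describe how an agent will behave on names, numbers and jargon.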

Usual billing models:

Per minute of input audio – typically billed per minute (or per second) of audio processed. Most cloud STT services charge based on the length of the audio input.
Often pay-as-you-go, with automatic volume tier discounts. Some have free monthly minutes.
Streaming vs batch: Some providers price streaming (real-time) and batch (asynchronous file transcription) similarly, others may differ slightly.

Usual price ranges:

Streaming STT spans roughly $0.0015–$0.024 per minute in 2026. The cheapest end is self-hosted or specialized models (NVIDIA Parakeet via Together at $0.0015/min, Cartesia Ink-Whisper at $0.00217/min). Mainstream cloud providers cluster around $0.0025–$0.012/min. Realtime/multimodal models like OpenAI gpt-realtime bundle STT+LLM at ~$0.06/min. Volume commits can drop streaming rates by 30–50%.

Key providers:

Deepgram (Nova-3): Real-time STT engine optimized for low latency and high accuracy. PAYG promo $0.0048/min mono, $0.0058/min multilingual.
AssemblyAI (Universal-Streaming, Universal-3 Pro): API-first STT with prompt-steerable models, speaker diarization, and summarization. Universal-Streaming $0.0025/min, Universal-3 Pro Streaming $0.0075/min.
Cartesia Ink-Whisper: Cheapest real-time streaming STT for voice agents — $0.00217/min on Scale plan.
NVIDIA Parakeet TDT 0.6B v3: Open-weights ASR backed by NVIDIA. Available via Together AI at $0.0015/min, or self-hosted on GPU.
OpenAI gpt-4o-transcribe / Whisper V3: $0.006/min effective. gpt-4o-mini-transcribe at $0.003/min. Whisper open weights for self-host.
Speechmatics (Ursa 2): Strong multilingual support and accent robustness. Standard PAYG ~$0.012/min, Pro tier from $0.004/min on commit.
Gladia (Solaria): Low-latency transcription with speaker detection. Starter $0.0125/min, Growth as low as $0.00417/min.
Google Cloud Speech-to-Text (Chirp 2/3): Enterprise STT with wide language support, streaming and batch modes (~$0.024/min streaming).

3. Large Language Models (LLMs)

What is it and why is it needed?

The LLM is essentially the "brain" of the voice agent. It takes the transcribed text input and processes language – understanding intent and context – then generates a text response. In a modern AI voice agent, an LLM (like GPT-style models) enables natural conversation, dynamic responses, and handling of free-form user input that rule-based systems can't. For detailed guidance on selecting the right model, see our comprehensive LLM selection guide.

Key considerations:

Model capability: Different LLMs have different strengths and weaknesses. Choosing an appropriate model impacts how well your agent can handle complex queries.
Context length: Newer models in 2026 support very long inputs (sometimes millions of tokens), but using long contexts can be costly.
Latency: Large models can be slow. For voice agents, you may favor a slightly smaller or distilled model if it significantly improves response time.
Privacy & Hosting: Sending conversation data to an external API might raise compliance issues if the content is sensitive. For businesses handling sensitive data, review our guides on SOC 2 compliance and legal compliance requirements.
Customization: Some providers allow fine-tuning (which may incur training and hosting costs).

Usual billing models:

Token-based billing – this is standard for most LLM APIs. You pay for input tokens (the text you send in, including conversation history) and output tokens (the text the model generates). A token is roughly 0.75 words, so 1,000 tokens ≈ 750 words.
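A small sketch of the arithmetic behind token-based billing, using the 0.75 words-per-token rule of thumb. The turn shape (a 3000-token prompt and a 45-word reply) is an illustrative assumption; the prices are the Gemini 3 Flash list rates shown in the pricing table on this page.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the paragraph above


def words_to_tokens(words):
    return words / WORDS_PER_TOKEN


def llm_request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request given per-1M-token list prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Illustrative turn: a ~3000-token prompt (the calculator's "Detailed" input size)
# and a 45-word spoken reply, priced at Gemini 3 Flash list rates from this page.
reply_tokens = words_to_tokens(45)
cost = llm_request_cost(3000, reply_tokens, input_price_per_m=0.50, output_price_per_m=3.00)
print(f"{reply_tokens:.0f} output tokens, ${cost:.5f} per turn")
```

In this example the input side accounts for most of the cost even though output tokens are priced six times higher, because voice replies are short and prompts are long.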

Key providers:

OpenAI: GPT-5.5, GPT-5.4 family (mini/nano), GPT-5, plus gpt-realtime for bundled audio-in/audio-out. 2026 lineup spans $0.20/M input (5.4 nano) to $30/M output (5.5 Pro).
Anthropic: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5. 1M-token context window, prompt caching with 90% discount on cache hits.
Google: Gemini 3.1 Pro, Gemini 3 Flash, Gemini 3.1 Flash Lite. Ultra-fast and cheap for voice agents — Gemini 3 Flash $0.50/M input is a common balanced choice.
DeepSeek: V4 Pro and V4 Flash. V4 Flash $0.14/$0.28 per 1M is among the cheapest competent models.
xAI: Grok 4.3 and Grok 4.1 Fast. Grok 4.1 Fast at $0.20/$0.50 punches above its weight for voice.
Meta: Llama 4 Maverick / Scout, Llama 3.3 70B. Open weights — host on Together AI, Fireworks, Groq, or self-hosted GPU.
Mistral: Mistral Large 3 ($0.50/$1.50), Medium 3.5, Small 4. Open-weights European alternative.
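Prompt caching, mentioned above for Anthropic, changes the effective input price. The sketch below blends cached and uncached tokens into one rate. It assumes the 90% discount applies to the list input price and ignores any surcharge for writing to the cache; the 80% hit rate is hypothetical.

```python
def effective_input_price(list_price_per_m, cache_hit_rate, cache_discount=0.90):
    """Blend cached and uncached input tokens into one per-1M-token price."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached_price = list_price_per_m * (1.0 - cache_discount)
    return cache_hit_rate * cached_price + (1.0 - cache_hit_rate) * list_price_per_m


# Claude Sonnet 4.6 input list price from this page ($3.00/M); the 80% hit rate is hypothetical.
print(f"${effective_input_price(3.00, cache_hit_rate=0.8):.2f} per 1M input tokens")
```

Voice agents tend to resend the same system prompt and history prefix on every turn, which is the pattern where cache hits are most likely.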

4. Voice Agent Platform

What is it and why is it needed?

A Voice Agent Platform is an orchestration layer or framework for building the actual voice bot/agent. It typically ties together the telephony, STT, LLM, and TTS components, handling the call flow logic, state management, and integration with any backend systems.

Key considerations:

Integration vs. all-in-one: Some platforms require you to bring your own STT, LLM, etc., while others provide an end-to-end solution.
Features: Look for call recording, DTMF for IVRs, real-time handoff, latency management, and analytics.
Scalability and Reliability: Handling potentially many concurrent calls is non-trivial. Platforms take care of scaling the voice infrastructure.
Quality Assurance: Implementing proper testing and monitoring is essential for production deployments. Learn more about QA metrics and testing tools for voice agents.

Usual billing models:

Usage-based (per minute of call) – The platform might charge you per minute of voice call handled by the AI agent. This often encompasses the underlying costs (STT, TTS, etc.), essentially as a bundled rate.

Usual price ranges:

$0.01–$0.14 per minute in 2026. Pure orchestration platforms (Pipecat Cloud, LiveKit Cloud Agents) charge $0.01/min and pass model costs through at vendor cost. Mid-market bring-your-own-key platforms (Vapi $0.05, Retell $0.055, Synthflow $0.09) layer an explicit per-minute fee on raw component costs. Bundled platforms (Bland AI $0.11–$0.14) embed LLM/STT/TTS/telephony in a single rate, with an estimated 15–40% markup on the underlying components. Speech-to-speech bundles (Ultravox $0.05) include the model itself.
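The difference between a bundled rate and a bring-your-own-key stack can be sketched as follows. The code assumes the markup is applied multiplicatively (bundled rate = component cost × (1 + markup)), which the text does not state explicitly, and the component figures in the BYOK example are hypothetical placeholders.

```python
def implied_component_cost(bundled_rate, markup):
    """Underlying cost if bundled_rate = cost * (1 + markup)."""
    return bundled_rate / (1.0 + markup)


def byok_all_in(platform_fee, stt, llm, tts, transport):
    """All-in per-minute cost when the platform fee sits on top of raw components."""
    return platform_fee + stt + llm + tts + transport


# Bland AI's upper bundled rate and the 15-40% markup range quoted above.
for markup in (0.15, 0.40):
    print(f"markup {markup:.0%}: implied components ${implied_component_cost(0.14, markup):.4f}/min")

# Vapi fee from the text; the component figures are hypothetical placeholders.
print(f"BYOK all-in: ${byok_all_in(0.05, stt=0.005, llm=0.02, tts=0.03, transport=0.01):.3f}/min")
```

Comparing platforms on the orchestration fee alone is misleading; the all-in figure is what should be set against a bundled rate.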

Key providers:

Vapi: $0.05/min orchestration fee. Bring-your-own-key for STT/LLM/TTS, developer-oriented middleware.
Bland AI: $0.11–$0.14/min bundled (STT+LLM+TTS+telephony included). Plus $0.04–$0.05/min on warm transfers.
Retell AI: $0.055/min voice infra, components priced separately. All-in PAYG range $0.07–$0.31/min.
Synthflow: $0.09/min voice engine. All-in $0.15–$0.24/min. Add-ons for performance routing, edge latency, white-label.
Millis AI: $0.02/min base + pass-through components. BYO LLM is free.
LiveKit Cloud Agents: $0.01/min agent session. Build tier free up to 1K min/mo. Open-source self-host option available.
Pipecat Cloud (Daily): $0.01/min active for agent-1x ($0.0005/min reserved). Built-in SIP $0.005/min, PSTN $0.018/min, transfer $0.20/event. Open-source Pipecat framework available for self-host.
Ultravox: Speech-to-speech bundled at $0.05/min (Whisper + GLM + TTS integrated). SIP variant $0.005/min.
Agora Conversational AI: $0.0265/min audio + ASR participant-min. First 300 min/mo free.

5. Transport (Telephony / WebRTC)

What is it and why is it needed?

Transport carries audio between user and agent. Two modes: PSTN telephony (phone calls over SIP trunks) and WebRTC (browser- or app-embedded calls). PSTN providers handle phone numbers, inbound/outbound call routing, and SIP session control. WebRTC providers handle media servers, NAT traversal, and adaptive bitrate. Pick telephony if users dial a number; pick WebRTC if they hit a button on your web or mobile app.

Usual billing models:

Telephony per-minute — billed per call duration, rates differ for inbound vs outbound and by destination. US local typically $0.0035–$0.014/min depending on direction and provider.
Phone number rental — monthly fee per DID. US local DIDs: $0.50/mo (Plivo, SignalWire), ~$1/mo (Telnyx), $1.15/mo (Twilio).
Transfer / dial fees — Telnyx charges $0.10 per Dial verb invocation (warm transfers). Pipecat Cloud SIP Refer transfers $0.20 per event. Twilio has no separate transfer fee.
Regulatory pass-through — US carriers add 5–15% in USF, state telecom, E911 surcharges on top of advertised per-minute rates.
WebRTC per participant-minute — typically $0.0004–$0.004/min for audio-only; minimum-monthly tiers on Cloud plans (LiveKit Ship $50/mo, Scale $500/mo). Self-hosted LiveKit/mediasoup is free OSS plus your own TURN bandwidth.
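The telephony line items above can be folded into a single effective per-minute figure. This is a sketch under stated assumptions: the regulatory surcharge is applied to usage charges only, number rental and transfer fees are spread over the month's minutes, and the traffic volume and 10% surcharge in the example are hypothetical.

```python
def telephony_cost_per_minute(
    per_minute_rate,
    monthly_minutes,
    did_monthly_fee=0.0,
    surcharge_rate=0.0,
    transfer_fee=0.0,
    transfers_per_month=0,
):
    """Blend usage, surcharges, number rental and transfer fees into one $/min figure."""
    if monthly_minutes <= 0:
        raise ValueError("monthly_minutes must be positive")
    usage = per_minute_rate * monthly_minutes * (1.0 + surcharge_rate)
    fixed = did_monthly_fee + transfer_fee * transfers_per_month
    return (usage + fixed) / monthly_minutes


# Twilio's blended rate and DID fee from this page; 10,000 min/mo and a 10%
# surcharge are hypothetical inputs.
effective = telephony_cost_per_minute(
    0.0113, monthly_minutes=10_000, did_monthly_fee=1.15, surcharge_rate=0.10
)
print(f"${effective:.5f} per minute")
```

At this volume the number rental is negligible and the surcharge dominates the gap between the advertised and effective rate; at very low volumes the opposite holds.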

Usual price ranges:

Telephony US: $0.005–$0.02/min for inbound + outbound average, plus $0.50–$2/mo per number, plus regulatory surcharge. WebRTC audio-only: $0.0004–$0.004/min on managed tiers, $0 on self-host (excluding infra).

Key telephony providers:

Twilio: Market leader. Local in $0.0085, out $0.014. Global reach, no per-transfer fee, $1.15/mo DID.
Telnyx: Cheaper baseline ($0.005/min outbound, in from $0.0035) but $0.10 per warm transfer can dominate cost at high transfer rates.
Plivo: Local in $0.0055, out $0.0115; SIP/Browser-SDK flat $0.0033. $0.50/mo DID.
SignalWire: Local in $0.0066, out $0.008; SIP/WebRTC flat $0.003. $0.50/mo DID.
Bandwidth, Vonage, Zadarma: Other CPaaS options with regional or specialized pricing. Vonage offers per-second billing.
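To see when a per-transfer fee outweighs a cheaper per-minute rate, the sketch below computes the break-even transfer rate between two providers. It uses the blended per-minute rates from the transport pricing table on this page and assumes at most one transfer per call; surcharges and number rental are left out.

```python
def breakeven_transfers_per_call(rate_a, rate_b, transfer_fee_b, call_minutes):
    """Transfers per call at which provider B's transfer fee cancels its cheaper minutes."""
    if transfer_fee_b <= 0:
        raise ValueError("transfer_fee_b must be positive")
    return (rate_a - rate_b) * call_minutes / transfer_fee_b


# Blended per-minute rates from the transport pricing table on this page.
TWILIO_RATE, TELNYX_RATE, TELNYX_TRANSFER_FEE = 0.0113, 0.004, 0.10

for minutes in (3, 10):
    rate = breakeven_transfers_per_call(TWILIO_RATE, TELNYX_RATE, TELNYX_TRANSFER_FEE, minutes)
    print(f"{minutes}-minute calls: Telnyx stays cheaper below {rate:.0%} of calls transferred")
```

Short calls with frequent handoffs to humans are the case where the transfer fee matters most.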

Key WebRTC providers:

Daily.co: Audio-only $0.00099/min standard, drops to $0.00036/min at volume. 10K free participant-min/mo.
LiveKit Cloud: $0.0005/min on Ship tier, $0.0004/min on Scale. Open-source self-host option.
Agora: ~$0.001/min audio. 10K free min/mo.
100ms: Audio ~$0.001/min (75% off video).
Self-host (LiveKit / mediasoup): Free OSS; cost is TURN bandwidth and compute (~$0.0001–0.0005/participant-min depending on egress).

Summary: AI Voice Agent Cost per Minute

Component	Typical Cost per Minute	Notes
TTS	$0.005–$0.04	Billed per character. OSS self-host = $0; ElevenLabs v3 / Studio voices top end.
STT	$0.0015–$0.024	Billed per audio minute. NVIDIA Parakeet cheapest, Google Chirp / Amazon Transcribe top end.
LLM	$0.005–$0.05	Per-token billing × ~1.8× reality factor for context growth, interrupts, tool calls.
Platform	$0.01–$0.14	Pipecat / LiveKit $0.01; Vapi $0.05; Bland bundled $0.14 (incl. components).
Transport	$0.0004–$0.02	WebRTC low end; PSTN telephony high end. Add 5–15% USF and DID rental.
Total	$0.13–$0.30	Production-grade voice agents in 2026. Cost-effective stacks land $0.05–$0.10; premium / managed / Realtime-API stacks $0.30+.

Note: LLMs are priced per token, TTS per character, STT per minute, transport per minute or participant-minute. Per-minute estimates above assume typical voice usage (150 WPM, ~4 turns/min) and apply a 1.8× reality factor on LLM cost to capture conversation-history token growth (each turn resends the full history, so cumulative input tokens grow O(n²) with the number of turns), function-calling round-trips, and barge-in handling. Bundled platforms like Bland AI embed an estimated 15–40% markup on the underlying STT/TTS/LLM. Telephony figures exclude US 5–15% USF / regulatory pass-through and per-DID monthly rental.
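The quadratic growth in history tokens can be illustrated with a short sketch. It assumes every turn resends the system prompt plus all earlier turns, with no caching or history truncation; the 1000-token system prompt and 100 tokens added per turn are illustrative values, and the 4 turns per minute comes from the note.

```python
def cumulative_input_tokens(system_tokens, tokens_per_turn, turns):
    """Total input tokens billed over a call when every turn resends the full history."""
    total = 0
    history = 0
    for _ in range(turns):
        total += system_tokens + history  # prompt sent on this turn
        history += tokens_per_turn        # user + agent text appended afterwards
    return total


# ~4 turns/min as in the note; 1000-token system prompt ("Concise") and
# 100 tokens added per turn are illustrative assumptions.
for minutes in (3, 6, 12):
    turns = 4 * minutes
    print(minutes, "min:", cumulative_input_tokens(1000, 100, turns), "input tokens")
```

With these numbers, doubling a call from 6 to 12 minutes roughly triples the billed input tokens, which is why per-minute LLM cost rises with average call duration.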

All Model Pricing

Large Language Models (LLM)

Model	Provider	Input Price (1M tokens)	Output Price (1M tokens)	Link
Claude Opus 4.7	Anthropic	$5.00	$25.00	Pricing
Claude Opus 4.6	Anthropic	$5.00	$25.00	Pricing
Claude Opus 4.5	Anthropic	$5.00	$25.00	Pricing
Claude Opus 4.1	Anthropic	$15.00	$75.00	Pricing
Claude Sonnet 4.6	Anthropic	$3.00	$15.00	Pricing
Claude Sonnet 4.5	Anthropic	$3.00	$15.00	Pricing
Claude Haiku 4.5	Anthropic	$1.00	$5.00	Pricing
Claude Haiku 3.5	Anthropic	$0.800	$4.00	Pricing
GPT-5.5	OpenAI	$5.00	$30.00	Pricing
GPT-5.5 Pro	OpenAI	$30.00	$180.00	Pricing
GPT-5.4	OpenAI	$2.50	$15.00	Pricing
GPT-5.4 mini	OpenAI	$0.750	$4.50	Pricing
GPT-5.4 nano	OpenAI	$0.200	$1.25	Pricing
GPT-5	OpenAI	$1.25	$10.00	Pricing
GPT-5 mini	OpenAI	$0.250	$2.00	Pricing
Gemini 3.1 Pro	Google	$2.00	$12.00	Pricing
Gemini 3 Flash	Google	$0.500	$3.00	Pricing
Gemini 3.1 Flash Lite	Google	$0.250	$1.50	Pricing
Gemini 2.5 Pro	Google	$1.25	$10.00	Pricing
Gemini 2.5 Flash	Google	$0.300	$2.50	Pricing
Gemini 2.5 Flash Lite	Google	$0.100	$0.400	Pricing
DeepSeek V4 Pro	DeepSeek	$0.435	$0.870	Pricing
DeepSeek V4 Flash	DeepSeek	$0.140	$0.280	Pricing
Grok 4.3	xAI	$1.25	$2.50	Pricing
Grok 4.1 Fast	xAI	$0.200	$0.500	Pricing
Grok 4	xAI	$4.25	$21.25	Pricing
Llama 4 Maverick	Meta	$0.350	$0.850	Pricing
Llama 4 Scout	Meta	$0.170	$0.660	Pricing
Llama 3.3 70B	Meta	$0.880	$0.880	Pricing
Llama 3.1 405B	Meta	$3.50	$3.50	Pricing
Mistral Large 3	Mistral	$0.500	$1.50	Pricing
Mistral Medium 3.5	Mistral	$1.50	$7.50	Pricing
Mistral Small 4	Mistral	$0.150	$0.600	Pricing

Text-to-Speech (TTS)

Model	Provider	Price (1K characters)	Link
ElevenLabs v3	ElevenLabs	$0.100	Pricing
ElevenLabs Turbo v2.5	ElevenLabs	$0.050	Pricing
ElevenLabs Flash v2.5	ElevenLabs	$0.050	Pricing
Cartesia Sonic 3	Cartesia	$0.035	Pricing
Cartesia Sonic Turbo	Cartesia	$0.0467	Pricing
Cartesia Sonic 2	Cartesia	$0.0467	Pricing
OpenAI GPT-4o Mini TTS	OpenAI	$0.012	Pricing
OpenAI TTS-1	OpenAI	$0.015	Pricing
OpenAI TTS-1 HD	OpenAI	$0.030	Pricing
Azure AI Speech Neural	Microsoft	$0.015	Pricing
Azure AI Speech Neural HD	Microsoft	$0.022	Pricing
Google TTS Standard	Google	$0.004	Pricing
Google TTS Neural2	Google	$0.016	Pricing
Google TTS Chirp 3 HD	Google	$0.030	Pricing
Google TTS Studio	Google	$0.160	Pricing
Amazon Polly Standard	Amazon	$0.004	Pricing
Amazon Polly Neural	Amazon	$0.016	Pricing
Amazon Polly Generative	Amazon	$0.030	Pricing
Amazon Polly Long-form	Amazon	$0.100	Pricing
PlayAI Dialog	PlayAI	$0.040	Pricing
IBM Watson TTS Neural	IBM	$0.020	Pricing
Hume Octave	Hume	$0.100	Pricing
Inworld TTS-1.5 Mini	Inworld	$0.025	Pricing
Inworld TTS-1.5 Max	Inworld	$0.035	Pricing
Rime Mist	Rime	$0.039	Pricing
Rime Arcana	Rime	$0.039	Pricing
MiniMax Hailuo Speech 2.5 Turbo	MiniMax	$0.040	Pricing
MiniMax Hailuo Speech 2.6 HD	MiniMax	$0.050	Pricing
Smallest AI Lightning V3.1	Smallest AI	$0.025	Pricing
Fish Speech (self-host)	Fish Audio	$0.005
Kokoro (self-host)	Kokoro	$0.001
Qwen-TTS (self-host)	Alibaba	$0.005

Speech-to-Text (STT)

Model	Provider	Price (per minute)	Link
Cartesia Ink-Whisper	Cartesia	$0.0022	Pricing
NVIDIA Parakeet TDT 0.6B v3	Together AI	$0.0015	Pricing
AssemblyAI Universal-3 Pro Streaming	AssemblyAI	$0.0075	Pricing
AssemblyAI Universal-Streaming	AssemblyAI	$0.0025	Pricing
AssemblyAI Whisper-Streaming	AssemblyAI	$0.005	Pricing
AssemblyAI Universal-2 (batch)	AssemblyAI	$0.0025	Pricing
Deepgram Nova-3	Deepgram	$0.0048	Pricing
Deepgram Nova-3 Multilingual	Deepgram	$0.0058	Pricing
OpenAI gpt-4o-transcribe	OpenAI	$0.006	Pricing
OpenAI gpt-4o-mini-transcribe	OpenAI	$0.003	Pricing
OpenAI Whisper V3	OpenAI	$0.006	Pricing
OpenAI GPT Realtime	OpenAI	$0.060	Pricing
OpenAI GPT Realtime Mini	OpenAI	$0.010	Pricing
Google Speech-to-Text Chirp 2	Google	$0.024	Pricing
Speechmatics Ursa 2	Speechmatics	$0.0117	Pricing
Gladia Solaria	Gladia	$0.0125	Pricing
Azure Speech-to-Text Streaming	Microsoft	$0.0167	Pricing
Amazon Transcribe Streaming	Amazon	$0.024	Pricing
Whisper-Medusa ASR (self-host)	aiOla	$0.0015

Developer Platforms

Platform	Provider	Price (per minute)	Link
No Platform (BYOK direct)	None	$0.000
Vapi	Vapi	$0.050	Pricing
Bland AI	Bland AI	$0.140	Pricing
Millis AI	Millis AI	$0.020	Pricing
Retell AI	Retell AI	$0.055	Pricing
Synthflow	Synthflow	$0.090	Pricing
Pipecat Cloud (agent-1x active)	Daily	$0.010	Pricing
LiveKit Cloud Agents	LiveKit	$0.010	Pricing
Agora Conversational AI	Agora	$0.0265	Pricing
Ultravox (S2S bundled)	Ultravox	$0.050	Pricing
Pipecat (self-host)	Pipecat	$0.001	Pricing
LiveKit (self-host)	LiveKit	$0.001	Pricing

Transport (Telephony / WebRTC)

Provider	Mode	Price (per minute)	Transfer Fee	Link
No Transport	web	$0.000	—
Twilio	telephony	$0.0113	—	Pricing
Telnyx	telephony	$0.004	$0.100	Pricing
Plivo	telephony	$0.0085	—	Pricing
SignalWire	telephony	$0.0073	—	Pricing
Bandwidth	telephony	$0.0078	—	Pricing
Zadarma	telephony	$0.012	—	Pricing
Daily.co WebRTC	web	$0.001	—	Pricing
LiveKit Cloud (Scale)	web	$0.0004	—	Pricing
Agora RTC	web	$0.001	—	Pricing
100ms	web	$0.001	—	Pricing
Vonage Video API	web	$0.0041	—	Pricing
Self-host (LiveKit / mediasoup)	web	$0.0003	—	Pricing
