The Core Latency Budget: Every Millisecond Between Microphone and Speaker

Last updated on May 22, 2026

Talk to AI engineers

We build and advise on production AI systems. Bring your questions to a free intro call.

A caller finishes a sentence and waits. The gap before the agent answers is the single number that decides whether the call feels like a conversation or a help-desk queue. Humans hold that gap near 200 ms in natural speech, measured across ten languages by Stivers et al. in PNAS. A voice agent that takes 1.4 seconds to start replying is roughly seven times slower than the person on the other end expects.

“Use streaming” is the answer most teams reach for, and it is not enough. Streaming removes the wait for a full LLM completion, but the voice AI latency budget is the sum of twelve separate components, and streaming touches three of them. The other nine (capture framing, endpointing, jitter buffers, codec transcoding, network hops, playout buffers, tool calls, barge-in recovery) stay exactly as slow as they were. A pipeline can stream every byte and still land a 1,300 ms turn gap because endpointing alone burned 500 ms before the LLM ever saw a token. The deeper architectural choice behind streaming versus cascaded turns sits in our real-time vs turn-based voice agent architectures guide.

This article walks the full voice agent latency budget from the caller’s microphone to the caller’s speaker. Each component gets a millisecond cost, the reason it costs that, and the named techniques that compress it. The numbers are current as of 2026, and every figure links to its source.

One framing note before the walk: the component numbers below are the clean web/WebRTC path. Most production voice agents run over the phone, not the browser, and the phone adds a telephony floor (SIP, the PSTN, and mobile carrier networks) that you cannot co-locate away. Treat the budget below as the controllable part, and the carrier leg as a fixed tax on top, covered in Component 5 and our voice agent telephony stack guide.

The headline budget and what it leaves out

Two published reference budgets are worth knowing before walking the components:

Source	STT	LLM (TTFT)	TTS	Network	Total target
Twilio	350 ms	375 ms	100 ms	–	1,115 ms (1,400 ms limit)
Telnyx	100–300 ms	350–1,000 ms	90–200 ms	50–200 ms	600 ms–1.7 s

Twilio splits its target two ways: the Mouth-to-Ear Turn Gap (caller stops speaking → reply audio reaches the ear, 1,115 ms target) and the Platform Turn Gap (only the slice the platform controls, excluding last-mile network, 885 ms target). Telnyx sells a co-located product and claims under 200 ms p95 for it – treat that as vendor-reported.

Both are plain-English tours of the cascaded STT → LLM → TTS pipeline, and both scope out tool-call spikes, reasoning delays, fillers, and barge-in cost. That is exactly where this article goes deeper: all twelve subcomponents, including the jitter buffer, codec transcoding, speculative execution, and barge-in recovery.

Input pipeline – from microphone to transcript

Component 1: Mic capture and audio framing

Audio enters the pipeline in frames, and the frame size sets the first floor on latency. The packetization interval (ptime) is the lever: 20 ms is the VoIP default and the Opus default frame size. Smaller packets shave sender-side latency but cost bandwidth and encoding efficiency, so 20 ms is the sweet spot for most deployments – drop to 10 ms only when every millisecond matters and bandwidth is free. Browser capture adds a small buffer of its own on top.

Three techniques compress this component. Use the smallest stable capture buffer the runtime allows, and align it with the chunk size the STT stage expects so audio is not re-buffered downstream. Capture natively at the sample rate STT wants, typically 16 kHz, to avoid a resampling pass. Treat 20 ms ptime as the default and justify any deviation.

Realistic budget: 20–60 ms for capture buffer, framing, and media edge ingress combined. No single authoritative benchmark exists for aggregate capture overhead, so this range is composed from runtime documentation and the Twilio waterfall.

Component 2: VAD trigger and endpointing

Endpointing decides when the caller has stopped talking, and it is the single largest controllable contributor to the turn gap. Every millisecond spent here is dead air the caller hears before the agent has done any work.

Voice activity detection runs first, and it is cheap. Silero VAD infers on each frame in around 1 ms; a modern turn-detection model reaches an end-of-turn decision in roughly 50–100 ms. VAD is not where the time goes.

Endpointing is. Fixed-silence endpointing waits for a silence threshold, traditionally around 500 ms, before declaring the turn over – half a second of dead air added to every single turn, before the agent has done anything. Machine-learning and semantic endpointing cut that to roughly 150–200 ms, as covered in the Twilio post and in Cresta’s engineering write-up, by predicting end-of-turn from the content of speech rather than from silence alone.

Deepgram Flux, a conversational STT model with built-in end-of-turn detection, reports a median end-of-turn detection under 300 ms with a p95 around 1.5 s, and claims a 200–600 ms response-latency reduction against a traditional STT-plus-VAD setup. Its default eot_threshold is 0.7; raising it to 0.8–0.9 trades latency for fewer false cut-offs on slow speakers.

Three techniques apply. Replace fixed-silence endpointing with semantic or contextual turn detection, the biggest single endpointing win. Tune the endpoint threshold per use case, lower for snappy chat and higher to avoid cutting off callers reading out numbers or addresses. Use eager end-of-turn to speculatively start the LLM before the endpoint is confirmed, covered in Component 11.

Realistic budget: 150–500+ ms depending on method. Fixed silence sits at the top of that range; semantic detection sits near the bottom.

Component 3: Jitter buffer sizing

The jitter buffer absorbs packet-arrival variation so audio plays smoothly, and it taxes both inbound and outbound paths. The tradeoff is fixed: too small drops packets and stutters; too large adds playout latency every turn. WebRTC’s adaptive buffer (NetEQ) is on by default and resizes continuously from observed jitter – the right default in nearly every case.

Three techniques apply. Choose adaptive over fixed buffering. Cap it low on stable datacenter links and allow more headroom on mobile or variable links. Pair it with forward error correction and packet-loss concealment, such as Opus in-band FEC, so the buffer can stay small without stuttering.

Realistic budget: 30–120 ms, paid as a tax on both the ingress and egress audio paths.

Component 4: Codec selection and transcoding

The codec adds a small algorithmic delay, but the real cost is transcoding. Bridging WebRTC to the PSTN forces an Opus-to-G.711 conversion, which requires a full decode and re-encode – burning CPU, adding delay, and permanently losing the wideband detail G.711 can’t carry. Compressed-to-compressed transcoding compounds the quality loss with every hop.

Three techniques apply. Stay end-to-end Opus where the stack allows it, and avoid PSTN bridges that force G.711 transcoding. When the phone is unavoidable, transcode once at the media edge rather than repeatedly across hops. For the deeper carrier and codec decisions underneath, see our voice agent telephony stack guide.

Realistic budget: codec decode around 25 ms, plus a variable CPU and scheduling penalty when transcoding is in the path.

Component 5: Network RTT across transports

Network round-trip time depends on transport and geography. WebRTC RTT typically runs 60–70 ms with well-placed servers and drops lower when services are co-located. The thresholds that matter: RTT above 300 ms noticeably degrades quality, and above 1,000 ms causes talk-over and dead air. A stitched, non-co-located stack can add a hundred-plus milliseconds of inter-service and SIP hops on top.

This is also where the phone changes everything. The figures above are the web path; most production voice agents take calls over the PSTN or mobile networks, and the carrier leg adds latency you do not control. Mobile and cellular access networks add meaningful one-way media delay (on the order of tens to 100+ ms, worse on congested radio), and carrier interconnect plus SIP signaling stack on top. You can co-locate your STT, LLM, and TTS at the media edge; you cannot co-locate the caller’s cell tower. For most phone-based agents this carrier floor, not the cloud pipeline, is the binding real-world bottleneck – the voice agent telephony stack guide covers the SIP, PSTN, and codec detail underneath it.

Transport choice is not purely a latency decision. WebRTC runs over UDP, so a lost packet does not stall the stream, while WebSocket over TCP can head-of-line block – the choice carries tradeoffs beyond raw delay.

Three techniques apply. Co-locate STT, LLM, TTS, and orchestration at the media edge and deploy regionally near callers – the voice agent platform comparison maps which orchestration layers (LiveKit, Pipecat, Vapi) sit closest to the media edge. Hold persistent HTTP keep-alive connections to LLM and TTS endpoints to eliminate per-turn TCP and TLS handshakes. Use ICE Trickle to start media flow before full negotiation completes.

Realistic budget: 50–200 ms on the web path, controllable down to ~30–80 ms with co-location – but add a telephony floor on top for any phone call (carrier routing, mobile access, SIP), which is fixed and uncontrollable rather than “comparable to web.” No clean head-to-head PSTN-versus-WebRTC RTT benchmark exists in primary sources, so treat that floor as a real but unquantified add.

Component 6: STT first-partial versus final transcript

STT latency splits into two numbers that get conflated, and conflating them inflates the budget. The first partial is an interim transcript hypothesis. The final transcript is the stabilized result after the endpoint. A voice agent should drive downstream work off stabilized partials plus a fast end-of-turn signal, never off the official final.

AssemblyAI Universal-Streaming cites around 90 ms to the first word, while the stabilized final transcript can lag several hundred milliseconds behind – which is exactly why you drive off partials, not the final.

Deepgram Flux is a conversational STT model with built-in end-of-turn detection, reporting sub-300 ms end-of-turn and collapsing STT, VAD, and endpointing into one step. General streaming STT lands 100–300 ms for first partials when fed small audio chunks.

Three techniques apply. Feed STT small audio chunks, at or below 50 ms, for faster partials. Use a conversational STT model with integrated end-of-turn detection, such as Flux, to collapse STT, VAD, and endpointing into one step. Start LLM prompt assembly on stabilized partials rather than waiting for the final. For provider-by-provider comparison across STT and TTS engines, see our STT and TTS selection guide.

STT model	First-partial latency	Notes
AssemblyAI Universal-Streaming	~90 ms first word; p50 ~150 ms, p90 ~240 ms after endpoint	Mixed-dataset time-to-final near 760 ms
Deepgram Flux	sub-300 ms end-of-turn	Built-in EoT; Flux Multilingual GA 2026-04-29, 10 languages
General streaming STT	100–300 ms first partial	Requires ≤50 ms audio chunks

Realistic budget: 100–300 ms to a usable transcript. The final-transcript wait is avoidable.

Reasoning and tool calls

Component 7: LLM time-to-first-token versus tokens-per-second

For voice, time-to-first-token dominates the turn gap. Output tokens per second matters only enough to stay ahead of TTS playback, because TTS consumes text far slower than a fast LLM emits it. Once the first sentence streams into TTS, raw output speed beyond roughly 50–80 tokens per second rarely bottlenecks the conversation.

The reasoning-mode trap is the single most expensive mistake in voice model selection. Reasoning model variants think before emitting the first token, which is catastrophic for a live call. Artificial Analysis data captured 2026-05-22 shows GPT-5.1 in high-reasoning mode at a TTFT near 60.7 s, GPT-5.5 high near 53.7 s, and GPT-5.5 xhigh near 125 s – leave “thinking” on and the caller waits up to a full minute for the first word, so the call is already over. A reasoning mode never belongs in the live voice path. Genuine reasoning work should route to a background or async track with a filler in the live path. See our voice LLM selection guide for TTFT benchmarks by model and the voice prompt engineering playbook for the per-provider reasoning toggles to disable in production.

Fast non-reasoning models are the correct choice – a few 2026 reference points, per Artificial Analysis:

Model	TTFT	Note
Groq-hosted Llama	~0.18 s	Custom inference silicon, fastest tier
Command A+	~0.27 s	Fast non-reasoning
Claude Haiku 4.5	~0.6 s	Smart-but-fast workhorse
GPT-5.1 / GPT-5.5 (high reasoning)	53–125 s	Never use in the live path

A model that feels fast in chat at a one-second first token is already eating two-thirds of a voice turn budget. Pick one that streams the first token under roughly 300 ms where possible, or accept around 600 ms with a Haiku-class model.

Five techniques apply. Choose the model on TTFT and p95-TTFT, not on chat feel or peak intelligence. Stream and start TTS on the first complete clause, never awaiting the full completion. Keep prompts short, since long contexts inflate TTFT. Hold persistent connections and avoid managed conversation-history layers that add a hop, which is Twilio’s explicit advice. Route any genuine reasoning to a background track behind a filler.

Realistic budget: 300–750 ms TTFT for a well-chosen fast model; the Twilio target is 375 ms. The exact non-reasoning TTFT for GPT-5.x variants is not cleanly isolated in current benchmark data, so verify it on the live model page before committing to a specific number.

Production check: Softcery’s Casegen AI intake agent keeps the LLM in non-reasoning mode for every live turn and reserves reasoning work for post-call analysis. That discipline is what keeps TTFT inside the budget the rest of the pipeline depends on.

Component 8: Tool-call latency and pre-fetching

Tool calls are not core latency. They are per-interaction spikes that blow the budget only when they land in the critical path. A round trip (LLM emits a tool call, executes the request, resumes, then TTS runs) can add hundreds of milliseconds to several seconds. Twilio’s piece scopes tool-call spikes out, which is why this article covers them.

The design goal is to keep tool latency off the critical path. Four techniques do that. Pre-fetch predictable context, such as the caller record and account status, when the call connects, before anything requests it, per Cresta. Use speculative tool calling to optimistically execute read-only tools before the model is certain they are needed, running in parallel and keeping results silent until used. Run guardrails and tool calls concurrently with the main LLM generation rather than serially. Cover any unavoidable tool latency with an interstitial filler or acknowledgment, so dead air does not equal perceived latency, and set aggressive timeouts with fallbacks so a slow tool cannot stall a turn.

Realistic budget: highly variable. No authoritative published tool-call latency figure exists, because the design target is to keep tool latency off the critical path entirely rather than to budget for it.

Building voice agents that hold sub-1.5s p95 in production? Softcery has shipped voice systems where every component above was measured, budgeted, and compressed. Schedule a consultation to audit your latency stack.

Output pipeline – from response to speaker

Component 9: TTS time-to-first-audio

TTS latency is measured as time-to-first-audio, and the catch is that vendor inference numbers understate what a caller hears: production time-to-first-audio adds network, connection setup, and the playout buffer on top of the quoted inference figure. Judge on the end-to-end number, not the spec sheet. Sub-100 ms time-to-first-audio is the 2026 baseline for conversational TTS; the leaders and their inference-vs-real gap:

TTS model	Inference claim	End-to-end observed
Cartesia Sonic-3	~40 ms	~90 ms
ElevenLabs Flash v2.5	~75 ms	~150 ms (+ per-turn HTTP overhead)
Deepgram Aura-2	~90 ms	under 200 ms

Four techniques apply. Pick a Flash-class streaming TTS model and reserve expressive models, such as ElevenLabs v3, for non-realtime use. Hold a persistent WebSocket to TTS to avoid the per-turn HTTP connection overhead. Co-locate TTS with orchestration. Feed TTS the first clause from the LLM immediately, synthesizing at sentence-level granularity.

Realistic budget: 60–250 ms end-to-end time-to-first-audio; the Twilio target is 100 ms with a 250 ms limit.

Component 10: Playout buffer and underrun

The playout buffer on the caller’s side is a latency tax paid to prevent stutter. An empty buffer produces underrun, an audible gap; an over-full buffer adds delay to every turn.

An adaptive playout buffer sized to the network (tighter on stable datacenter links, wider on variable mobile ones) is the right approach. Watch three signals in production: packet inter-arrival variance, decode CPU load, and playback underruns. Because the LLM emits text faster than TTS synthesizes it, TTS pacing rather than LLM output speed sets the floor on sustained audio.

Three techniques apply. Size the adaptive playout buffer to the network class. Synthesize at sentence or clause granularity and stream chunks before the full sentence finishes. Watch underrun counters in production, and on repeated underruns raise the base buffer rather than chasing the lowest theoretical latency.

Realistic budget: 40–250 ms, a tax that must be paid to avoid stutter.

Component 11: Speculative execution and barge-in recovery

The last component covers two patterns that bend the budget: speculative execution removes time, and barge-in recovery is the cost of being interrupted.

Barge-in is when a caller interrupts mid-utterance. The system must stop TTS, flush queued audio, and process the new input. The target is user-speech-onset to TTS suppression under 200 ms. The agent should retain context of what it was saying so it can resume or acknowledge gracefully rather than restart.

Speculative execution removes latency by starting work before it is certain. Eager execution starts the LLM call before the endpoint is fully confirmed; Deepgram’s EagerEndOfTurn fires LLM calls speculatively in anticipation of a turn end and discards them if the caller keeps talking. Predictive ASR, an academic technique, generates a complete utterance from a partial one while the caller is still speaking, opening a larger latency-savings window than endpoint-based prefetch. Hedging launches multiple LLM calls in parallel and takes whichever returns first, per Cresta.

Five techniques apply. Run a cheap, fast VAD on the agent’s output path so barge-in registers within one or two frames. Flush TTS output queue and jitter buffer immediately on barge-in – queued audio is the recovery cost, small buffers shrink it. Start LLM speculatively on stabilized partials, discard on a miss. Use hedged LLM calls for tail-latency control. Budget the cost of speculation: wasted compute on discarded calls is a real line item, not a free optimization.

Realistic budget: barge-in recovery targets under 200 ms; speculative execution can remove 200–600 ms from the turn gap, per the Deepgram Flux end-of-turn claim.

The full stack-up and the honest target

Stacking every component gives the end-to-end turn gap. The spread between low and high columns is the difference between an optimized pipeline and a default one.

Component	Budget (ms)	Primary lever
Mic capture + framing	20–60	Small aligned buffer, native 16 kHz
VAD + endpointing	150–400+	Semantic turn detection over fixed silence
Jitter buffer (ingress)	30–80	Adaptive buffer, FEC
STT first partial	100–250	≤50 ms chunks, integrated EoT
LLM TTFT	300–600	Fast non-reasoning model, short prompts
TTS time-to-first-audio	80–200	Flash-class model, persistent WebSocket
Jitter buffer / playout (egress)	40–150	Adaptive buffer sized to network class
Network RTT	50–150	Co-location, persistent connections
End-to-end turn gap	≈750–1,400	Components overlap; not a strict column sum

An uncut pipeline lands roughly 750–1,400 ms, which matches Twilio’s 1,115 ms target and 1,400 ms upper limit. Under about 0.8 s feels like talking to a person; past about 1.5 s feels like a walkie-talkie. The end-to-end gap runs below the arithmetic sum of the column because several components overlap – STT first-partial latency in particular is mostly absorbed while the caller is still speaking and endpointing runs, so it does not stack cleanly on top.

This table is the controllable, in-pipeline budget. A phone call adds a telephony floor on top of it (carrier routing, mobile access, and the Opus→G.711 transcoding hop from Component 4), which is why a clean sub-800 ms target is harder to hit over the PSTN than over the web. Codec transcoding and tool calls sit outside the table because they are conditional: transcoding applies only on phone bridges, and tool calls should be engineered off the critical path.

Production proof: Softcery’s Casegen AI bilingual intake holds a 1.2 s average response latency through this same component discipline. Sub-1.5 s is achievable when every line of the budget above is engineered, not assumed.

The honest target is not the 200 ms human turn gap. That figure is the canonical primary baseline from Stivers et al., whose ten-language sample found median response gaps of 0–300 ms and mostly 0–200 ms, but no cascaded STT-LLM-TTS pipeline reaches it. The practical ceiling is widely cited around 800 ms as the point where an agent stops feeling conversational, with above 1,500 ms feeling like a walkie-talkie. The 800 ms number is an industry-consensus rule of thumb. No primary research paper pins abandonment to exactly 800 ms, so it should be treated as a working target rather than a measured threshold. Producing even a one-word reply takes a human around 600 ms, per Telnyx citing the turn-taking literature, which puts a realistic floor under any agent.

Compressing toward sub-800 ms is a matter of attacking the largest controllable components in order. Semantic endpointing removes 250–350 ms against fixed silence. A fast non-reasoning model with a sub-300 ms TTFT removes another 300 ms against a one-second chat-grade model. Co-location of STT, LLM, and TTS at the media edge removes 70–140 ms of network and inter-service hops. Speculative execution removes a further 200–600 ms by starting the LLM before the endpoint confirms. Streaming everywhere, the technique most teams start with, is necessary but not sufficient; it earns its place only after the other four are in hand.

The pipeline that hits sub-800 ms is not the one running the fastest model. It is the one where every component above was budgeted, measured, and compressed deliberately. That budget sheet is what separates a voice agent that holds a conversation from one that runs a queue.

Need a voice agent that holds the budget instead of breaking through it? Reach out to Softcery.

Frequently Asked Questions

What voice AI agents have the lowest latency?

As of mid-2026, layer leaders are: STT first-partial – AssemblyAI Universal-Streaming (~90 ms first word) and Deepgram Flux (sub-300 ms with integrated end-of-turn detection); LLM TTFT – Groq-hosted Llama (~180 ms p50) and Cohere Command A+ (~270 ms) on non-Groq stacks; TTS TTFA – Cartesia Sonic-3 (~40 ms vendor, ~90 ms p90) and ElevenLabs Flash v2.5. End-to-end voice-to-voice under 1.5 s p95 is achievable only when every component in the table above is budgeted, not assumed.

What are the top programmable voice AI APIs with low latency?

By layer: STT – AssemblyAI Universal-Streaming, Deepgram Flux, OpenAI Whisper-streaming. LLM – Groq, Anthropic Claude Haiku 4.5, Cohere Command A+, Google Gemini 3 Flash (all in non-reasoning mode). TTS – Cartesia Sonic-3, ElevenLabs Flash v2.5, Deepgram Aura-2. Orchestration – LiveKit and Pipecat for streaming media edge. Mix-and-match by layer; no single vendor leads every line.

What is the best real-time streaming voice AI API?

Streaming alone is necessary but not sufficient – the turn gap is twelve components, and streaming touches three. For streaming STT, AssemblyAI Universal-Streaming returns first words around 90 ms. For streaming TTS, Cartesia Sonic-3 hits ~40 ms time-to-first-audio. Pair those with a fast non-reasoning LLM (Groq, Command A+, Haiku 4.5) and co-locate at the media edge to get end-to-end under 1.5 s p95.

Can voice AI agents run on edge devices for low latency?

Yes, in narrow cases. On-device wins when sub-100 ms latency is the requirement, HIPAA blocks cloud audio, or the deployment is an offline kiosk. The on-device stack is Whisper-tiny or NVIDIA Parakeet for STT, a small open LLM (Llama 3.2 1B/3B class), and Pocket-TTS or Piper for output. The tradeoff is accuracy: edge STT/TTS lag cloud benchmarks by several points of WER and a perceptible drop in voice naturalness. For most production B2B SaaS voice agents, cloud with regional co-location beats edge.

Why is streaming not enough to hit a low turn gap?

Streaming removes the wait for a full LLM completion, which touches three of the twelve latency components: LLM time-to-first-token, TTS time-to-first-audio, and the handoff between them. The other nine components – capture framing, endpointing, jitter buffers, codec transcoding, network hops, playout buffers, tool calls, and barge-in recovery – stay exactly as slow as they were. A fully streaming pipeline can still land a 1,300 ms turn gap if fixed-silence endpointing burned 500 ms before the LLM saw a token.

What is the difference between STT first-partial and final-transcript latency?

The first partial is an interim transcript hypothesis, available around 90–150 ms after speech. The final transcript is the stabilized result after the endpoint, which can take 700 ms or more. AssemblyAI Universal-Streaming cites about 90 ms first-word latency, while a mixed-dataset benchmark put its time-to-final near 760 ms. A voice agent should drive downstream work off stabilized partials plus a fast end-of-turn signal, never off the official final.

Why should reasoning models never be used in the live voice path?

Reasoning model variants think before emitting the first token. Artificial Analysis data from May 2026 shows GPT-5.1 in high-reasoning mode at a time-to-first-token near 60.7 s, GPT-5.5 high near 53.7 s, and GPT-5.5 xhigh near 125 s. Those delays make a live call impossible. Genuine reasoning work should route to a background or async track with a filler in the live path, while the live path uses a fast non-reasoning model.

What is a realistic end-to-end turn gap target for a voice agent?

The natural human turn gap is around 200 ms, measured across ten languages by Stivers et al. in PNAS, but no cascaded STT-LLM-TTS pipeline reaches it. The practical ceiling for an agent that still feels conversational is widely cited around 800 ms, which is an industry-consensus rule of thumb rather than a measured abandonment threshold. Above roughly 1,500 ms the conversation feels like a walkie-talkie. An uncut pipeline lands near 750–1,400 ms.

Which components offer the biggest latency wins?

Endpointing is the single largest controllable component: replacing fixed-silence endpointing with semantic turn detection removes 250–350 ms. Choosing a fast non-reasoning model with a sub-300 ms time-to-first-token removes another 300 ms against a chat-grade model. Co-locating STT, LLM, and TTS at the media edge removes 70–140 ms. Speculative execution, starting the LLM before the endpoint confirms, can remove a further 200–600 ms.

Pay-by-Bank and Agentic Commerce: What ACP, AP2, MCP, and UCP Actually Enable

The headline budget and what it leaves out

Input pipeline – from microphone to transcript

Component 1: Mic capture and audio framing

Component 2: VAD trigger and endpointing

Component 3: Jitter buffer sizing

Component 4: Codec selection and transcoding

Component 5: Network RTT across transports

Component 6: STT first-partial versus final transcript

Reasoning and tool calls

Component 7: LLM time-to-first-token versus tokens-per-second

Component 8: Tool-call latency and pre-fetching

Output pipeline – from response to speaker

Component 9: TTS time-to-first-audio

Component 10: Playout buffer and underrun

Component 11: Speculative execution and barge-in recovery

The full stack-up and the honest target

Frequently Asked Questions

Can Pay-by-Bank Work With Agentic Commerce Protocols?

Agentic Commerce 101: How Selling Through AI Assistants Works

How to Make Your Store Visible to AI Shopping Agents

EU & EEA Voice AI Regulations 2026: AI Act, GDPR, ePrivacy, and the Country Mosaic

Middle East Voice AI Regulations 2026: UAE, Saudi Arabia, Israel, and the GCC

UK, Switzerland & Non-EU Europe Voice AI Regulations 2026

The Code-Switching Gap: Where Multilingual Voice AI Loses Callers Mid-Sentence

Voice-Specific Prompt Engineering: Why Text Prompts Break in Real-Time Audio

The Telephony Layer Under Your Voice Agent (And When It Breaks)

AI Voice Agents for Personal Injury Law Firms: How to Automate Intake Calls

Building AI That Understands Legal Documents (Not Just Reads Them)

How AI Legal Research Actually Works (And Why Most Tools Get Citations Wrong)

The Legal AI Roadmap: What Founders Need to Know Before Building or Buying

AI Call Center Automation: Actionable Playbook for 2026

Voice Agents for Travel: What Works at HotelPlanner, What Breaks Most Implementations

Custom AI Voice Agents: The Ultimate Guide (Updated May 2026)

How to Build Production-Ready Legal AI Systems

AI for Law Firms: What Actually Works in Production (Beyond the Demos)

Legal Chatbots: When to Build Custom vs Buy Off-the-Shelf

Choosing an LLM for Voice Agents: Speed, Accuracy, Cost

Real-Time (S2S) vs Cascading (STT/TTS) Voice Agent Architecture

9 AI Observability Platforms Compared: Phoenix, Langfuse, Logfire, & More

We Tested 14 AI Agent Frameworks. Here's How to Choose.

The AI Agent Prompt Engineering Trap: Diminishing Returns and Real Solutions

RAG Systems: The 7 Decisions That Determine The Production Fate

How to Implement E-Commerce AI Support: 4-Phase Deployment Guide

AI Agents Break the Same Six Ways. Here's How to Catch Them Early.

Choosing LLMs for AI Agents in 2026: Cost, Latency, Intelligence Tradeoffs

You Can't Fix What You Can't See: Production AI Agent Observability Guide

E-Commerce AI Support: What Works, What Fails, Real Store Examples

E-Commerce AI Support ROI Calculator: Volume Thresholds and Break-Even Analysis

Why Voice Agents Sound Great in Demos but Fail in Production

Deploying & Scaling Voice Agents: 4-Phase Framework from POC to Production

Agentic Coding: Context, Memory, Workflows, Skills, Subagents

12 Voice Agent Platforms Compared: Vapi, Ultravox, Retell, & More

SOC 2 for Voice AI Agents: Security, Confidentiality, and Quick Wins

US Voice AI Regulations 2026: TCPA, BIPA, COPPA, HIPAA, State AI Laws

Testing Voice Agents: Methods, Metrics, and Tools

How to Choose STT and TTS for Voice Agents: Latency, Accuracy, Cost