The Telephony Layer Under Your Voice Agent (And When It Breaks)

Last updated on May 22, 2026

A voice agent demo sounds clean because the demo runs over a browser, in WebRTC, on a fast connection, with no carrier in the path. Production calls travel a different route. They cross a carrier network, a Session Border Controller, a codec boundary, and a caller-ID attestation chain – and every one of those stages stays invisible until it fails. The telephony layer surfaces three ways: a carrier outage drops live calls, a phone number gets labeled “Spam Likely” and answer rates collapse, or a codec mismatch forces transcoding that quietly degrades transcription accuracy. Teams that ignored this layer during the build discover it during the incident, which is the usual way a polished demo fails in production.

This article maps that layer for the engineers and technical founders deciding how much of it to own. One conclusion up front: carrier minutes are cheap next to speech-to-text, LLM, and text-to-speech, so telephony is rarely the cost lever. The reasons to care are reliability and deliverability.

A common reason teams reach for this layer is replacing an IVR. Voice AI versus IVR is less a rip-and-replace than a swap inside the same plumbing: the SIP trunk and carrier stay put, while the rigid DTMF menu tree gives way to an LLM that handles natural-language turns and branches on intent. The carrier, SBC, and codec path below are identical whether the caller presses 1 or just says what they want. The IVR-replacement decision lives in the agent runtime, not the telephony stack.

What sits beneath the agent

The agent runtime is a pipeline – voice activity detection, speech-to-text, an LLM, text-to-speech – that produces and consumes audio. The telephony layer is everything that moves that audio between the agent and a human on a phone or in a browser, and it has three parts: a carrier or CPaaS that connects to the public phone network, a transport protocol (SIP for phone numbers, WebRTC for browsers) that carries call setup and the audio, and a Session Border Controller that polices the boundary between networks.

How much of this you touch depends on the model. Turnkey platforms hide all three parts. Bring-your-own-carrier (BYOC) lets you plug a chosen carrier into a managed runtime. Fully in-house builds own everything. The rest of the article works through each part so that choice rests on evidence.

Carrier and CPaaS comparison: Twilio, Telnyx, Bandwidth, SignalWire, Plivo

The carrier is the entry point. A Communications Platform as a Service (CPaaS) sells programmable access to the phone network through an API and bills per minute. Five providers dominate the voice-agent conversation: Twilio, Telnyx, Bandwidth, SignalWire, and Plivo.

The per-minute rate is the headline number, and it misleads. A voice agent connects through a SIP trunk, so it pays the SIP rate, not the local-number rate – and some providers stack channel or trunking fees on top. Model the full path before comparing.

US per-minute voice rates below were observed on 2026-05-22; re-verify on each provider’s pricing page before committing.

| Provider | Inbound local | Inbound toll-free | Outbound US | SIP / trunking rate | Notable structure |
|---|---|---|---|---|---|
| Twilio | $0.0085/min | $0.0220/min | $0.0140/min | $0.0040/min | Broadest ecosystem; SIP rate drops at very high volume |
| Telnyx | from $0.0032/min | from $0.015/min | from $0.002/min | from $0.005/min | Total = API minute + trunking fee + per-channel fees |
| Bandwidth | see note | see note | see note | see note | Tier-1 facilities-based carrier; pricing page blocks automated fetch |
| SignalWire | $0.0066/min | $0.0147/min | $0.0080/min | $0.0030/min | Lowest flat SIP rate (both directions); HI/AK add a fee |
| Plivo | $0.0055/min | $0.0180/min | $0.0115/min | $0.0033/min | Simple flat pricing; 60-second billing |

Bandwidth’s rates are omitted on purpose: its pricing page blocks automated fetch, so confirm them on the page directly. What matters more is that Bandwidth is a tier-1, facilities-based carrier that owns its network and numbers – structurally different from the CPaaS resellers, and the usual pick for voice AI vendors that need enterprise telephony support and a carrier-direct relationship.

Positioning matters more than the third decimal place. Twilio has the broadest product surface and the largest ecosystem, the highest list rate, and room to negotiate on committed use. Telnyx runs a private global IP backbone as a licensed carrier and posts the cheapest headline rate, but with the most layered cost (API minute + trunking fee + per-channel fees). SignalWire, from the creators of FreeSWITCH, has the lowest flat SIP rate. Plivo sits in the mid-market with simple flat pricing.
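To make "model the full path" concrete, here is a minimal sketch that blends metered and flat fees into one effective per-minute figure. The `CarrierPricing` fields and every fee value below are hypothetical placeholders, not quotes from any provider's pricing page; substitute the numbers you confirm yourself.

```python
from dataclasses import dataclass


@dataclass
class CarrierPricing:
    """Pricing inputs for one carrier. All values are placeholders to fill in."""
    per_minute: float           # metered API or SIP rate, USD per minute
    trunking_per_minute: float  # extra trunking fee, USD per minute (0 if none)
    channel_fee_monthly: float  # flat fee per concurrent channel, USD per month


def effective_rate(p: CarrierPricing, monthly_minutes: float, channels: int) -> float:
    """Blend metered and flat fees into one all-in USD-per-minute figure."""
    if monthly_minutes <= 0:
        raise ValueError("monthly_minutes must be positive")
    metered = (p.per_minute + p.trunking_per_minute) * monthly_minutes
    flat = p.channel_fee_monthly * channels
    return (metered + flat) / monthly_minutes


# Hypothetical carriers: one flat-priced, one with layered fees.
flat_priced = CarrierPricing(per_minute=0.0030, trunking_per_minute=0.0, channel_fee_monthly=0.0)
layered = CarrierPricing(per_minute=0.0020, trunking_per_minute=0.0010, channel_fee_monthly=10.0)

for minutes in (10_000, 1_000_000):
    print(minutes,
          round(effective_rate(flat_priced, minutes, channels=20), 5),
          round(effective_rate(layered, minutes, channels=20), 5))
```

With these made-up inputs the layered carrier is far more expensive at low volume, because the flat channel fees dominate, and nearly converges with the flat-priced one at high volume. That is why the comparison depends on your own volume and concurrency rather than on the headline rate.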

For a voice agent the SIP rate is what applies, which puts the relevant carrier cost in the $0.003–$0.014/min band – hold that for the build-vs-buy section.

SBC architecture and why concurrency exposes it

A Session Border Controller (SBC) sits at the border between two telephony networks and governs the sessions crossing it – topology hiding, NAT traversal, security, transcoding, call admission control, and more. It also smooths over the fact that SIP is not one consistent protocol: every carrier speaks a slightly different dialect, and the SBC interworks between them.

The capacity spec that matters is concurrent sessions – how many simultaneous calls the device can hold – not calls per second. Call admission control enforces that ceiling so an overload doesn’t take down everything downstream.
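As a rough illustration of what call admission control does, the sketch below caps concurrent sessions and rejects anything above the ceiling. It is a toy model, not how any particular SBC implements it; the class and method names are invented for this example.

```python
class CallAdmissionControl:
    """Toy admission controller that caps concurrent sessions."""

    def __init__(self, max_sessions: int) -> None:
        if max_sessions < 1:
            raise ValueError("max_sessions must be at least 1")
        self.max_sessions = max_sessions
        self.active: set[str] = set()

    def admit(self, call_id: str) -> bool:
        """Track the call and return True if capacity remains, else reject."""
        if call_id in self.active:
            return True  # already counted, e.g. a re-INVITE on a live session
        if len(self.active) >= self.max_sessions:
            return False  # a real SBC might answer with a SIP 503 here
        self.active.add(call_id)
        return True

    def release(self, call_id: str) -> None:
        """Free the slot when the call ends; unknown IDs are ignored."""
        self.active.discard(call_id)


cac = CallAdmissionControl(max_sessions=2)
results = [cac.admit("call-a"), cac.admit("call-b"), cac.admit("call-c")]
cac.release("call-a")
after_release = cac.admit("call-c")
print(results, after_release)
```

The third call is refused while two are active and admitted once a slot frees up. Rejecting early at the border is what keeps an overload from reaching the media plane and the agent runtime behind it.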

Concurrency is what exposes the SBC, because transcoding scales with it: each concurrent call that needs codec conversion burns CPU on the media plane (the next section quantifies how much). Cloud-native SBCs scale media, signaling, and transcoding independently for exactly this reason. An SBC sized only for signaling will fall over under a transcoding-heavy load.

Most builders never touch an SBC directly – the CPaaS is the SBC, run carrier-grade and never configured by the agent. A dedicated SBC, whether a commercial appliance, an open-source build (Kamailio, FreeSWITCH), or SBC-as-a-service, becomes a real decision only when going BYOC or fully in-house.

Audio quality: codecs and the narrowband ceiling

Codec choice: Opus, G.711, G.722, and the transcoding penalty

A codec encodes and decodes the audio. The choice sets quality, bandwidth, and – through transcoding – CPU cost and transcription accuracy. Three codecs cover the voice-agent case.

| Codec | Sample rate | Bitrate | Band | Role |
|---|---|---|---|---|
| G.711 | 8 kHz | 64 kbit/s fixed | Narrowband (~300 Hz–3.4 kHz) | The PSTN baseline; PCM µ-law/A-law; resilient across multiple transcoding hops |
| G.722 | 16 kHz | 48 / 56 / 64 kbit/s | Wideband (“HD voice”) | Noticeably clearer than G.711 where the full path supports it |
| Opus | 8 / 16 / 24 / 48 kHz | 6 kbit/s to 510 kbit/s | Narrowband to fullband, adaptive | The WebRTC default codec; scales from low-bitrate speech to fullband stereo |

The public phone network forces G.711: any call that touches the PSTN is sampled at 8 kHz, which leaves roughly 300 Hz–3.4 kHz of usable audio bandwidth. A voice agent running Opus internally – which it does if WebRTC is anywhere in its media stack – has to convert Opus to G.711 and back at the SBC whenever a call bridges to the PSTN. That conversion is transcoding, and it carries a real penalty: it adds roughly 20–50 ms of latency, can consume 50–80% of a CPU core per concurrent call, and reportedly cuts call-handling capacity by about an order of magnitude. It also loses information for good – once audio crosses into G.711, the wideband detail is gone and transcoding back cannot recover it.
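A back-of-the-envelope sizing sketch shows how quickly that penalty adds up. It takes the 50–80% of a core per call cited in the FAQ below as the working range and the 20–50 ms figure per transcoding step. Both are reported ranges, not measurements of any specific SBC, and the 200-call peak is a hypothetical input.

```python
def transcoding_cores(concurrent_calls: int, core_percent: int) -> int:
    """Whole CPU cores needed if each transcoded call uses core_percent of one core."""
    if concurrent_calls < 0 or not 0 < core_percent <= 100:
        raise ValueError("need concurrent_calls >= 0 and 0 < core_percent <= 100")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (concurrent_calls * core_percent + 99) // 100


def transcoding_latency_ms(steps: int) -> tuple[int, int]:
    """Low and high added latency for a number of transcoding steps in series."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return 20 * steps, 50 * steps


calls = 200  # hypothetical peak concurrency
print(transcoding_cores(calls, 50), transcoding_cores(calls, 80))
print(transcoding_latency_ms(1))
```

Under these assumptions, 200 concurrent transcoded calls need somewhere between 100 and 160 cores on the media plane. Treat the result as a starting point for load testing rather than a capacity plan.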

The takeaway: a call that stays inside WebRTC end-to-end keeps Opus and stays wideband; a call to or from a phone number hits G.711 the moment it touches the PSTN, and no agent-side codec choice changes that. Carriers sell HD-voice (wideband) SIP trunking, but it only holds where the entire path supports it – one hop to a non-HD endpoint collapses the call back to 8 kHz. Transcoding is one line item in the agent’s end-to-end latency budget.

The 8 kHz narrowband ceiling on transcription accuracy

The 8 kHz limit is not just an audio-quality problem – it sets a hard ceiling on speech-to-text accuracy, imposed by the channel rather than the model.

PSTN audio is bandlimited to roughly 300 Hz–3.4 kHz, and the high-frequency cues that distinguish fricatives – the /s/, /f/, and /th/ sounds – live above that range. On a narrowband call that energy is simply gone, so the model can’t tell those phonemes apart as well, and word error rate rises.

Voicegain’s 2025 benchmark on 8 kHz call-center audio shows where that lands:

| STT model | Accuracy on 8 kHz audio |
|---|---|
| Amazon AWS | ~88% |
| Voicegain-Whisper-Large-V3 | ~86% |
| Voicegain Omega | ~85% |
| Google Video model | ~68% |

Best-in-class STT tops out around 86–88% on 8 kHz telephony audio. A better model moves it a point or two but can’t break the ceiling, because the limit is missing acoustic information, not model capability. The practical lesson: swapping STT models to chase the last few points of accuracy on a PSTN call is optimizing above a wall the telephony layer already set.
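To see where your own calls sit relative to that ceiling, you can score transcripts against a human reference using word error rate. The sketch below is a plain word-level edit-distance implementation. It assumes accuracy is read as 1 minus WER, which may not match exactly how the benchmark above computed its figures, and the sample sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "i need to reschedule my appointment for friday the sixth"
hypothesis = "i need to reschedule my appointment for friday the fifth"
print(word_error_rate(reference, hypothesis))
```

One wrong word in ten gives a WER of 0.1, or 90% accuracy under the assumption above. Running this over a sample of real PSTN recordings shows whether a model swap is worth testing or whether you are already near the channel limit.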

The same limit hits the output side – a TTS voice tuned for 24 kHz loses naturalness when downsampled to 8 kHz. Some models ship telephony-grade output natively (Deepgram Aura-2, ElevenLabs Flash, Cartesia Sonic-3), but the safest check is still a listening test on real G.711 output when choosing STT and TTS.

Deliverability: getting calls answered

STIR/SHAKEN attestation and the 2025-2026 caller-ID rules

A voice agent that places outbound calls has a deliverability problem distinct from quality: the call has to be answered. Caller-ID authentication and number reputation decide whether it is, and both sit inside the broader US voice AI regulations an outbound program has to clear.

STIR/SHAKEN is the framework US carriers use to sign calls and assert how much the originating provider knows about the caller. It defines three attestation levels:

  • A, full attestation: the originating provider authenticated the customer and confirmed the customer is authorized to use the calling number.
  • B, partial attestation: the provider authenticated the call origin but cannot confirm the caller owns the number.
  • C, gateway attestation: the provider can verify where it received the call but not its source – typical for international or inbound-gateway calls.

For an outbound voice agent, the target is A-level attestation. Calls signed B or C are far more likely to be spam-labeled or filtered before they reach a handset. Reaching A requires the originating provider to verify number ownership, which is why number provenance – whether the carrier can confirm the agent owns the calling number – belongs in the carrier-selection decision, not just the per-minute rate.
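When debugging deliverability it helps to see which level a call was actually signed with. In SHAKEN the attestation travels as the `attest` claim inside a PASSporT, a JWT carried in the SIP Identity header. The sketch below only decodes that claim for logging; it deliberately does not verify the signature or certificate. The sample token is hypothetical, built with made-up numbers and a placeholder signature, and in a real header the token is followed by parameters such as `info` and `ppt` that you would strip first.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(obj: dict) -> str:
    raw = json.dumps(obj, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def read_attestation(passport: str) -> str:
    """Return the 'attest' claim (A, B, or C) from a SHAKEN PASSporT.

    Reads the claim only. No signature or certificate check is performed,
    so this is for logging and debugging, not for trust decisions.
    """
    parts = passport.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = json.loads(_b64url_decode(parts[1]))
    level = payload.get("attest")
    if level not in {"A", "B", "C"}:
        raise ValueError(f"unexpected attest value: {level!r}")
    return level


# Hypothetical token for illustration only.
sample = ".".join([
    _b64url_encode({"alg": "ES256", "ppt": "shaken", "typ": "passport",
                    "x5u": "https://cert.example.org/shaken.pem"}),
    _b64url_encode({"attest": "B", "dest": {"tn": ["12025550123"]},
                    "iat": 1748000000, "orig": {"tn": "12025550100"},
                    "origid": "00000000-0000-4000-8000-000000000000"}),
    "placeholder-signature",
])
print(read_attestation(sample))
```

Logging this value per outbound call makes it obvious when traffic expected to be A-level is going out as B, which is usually a number-provenance question to raise with the carrier.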

The rules tightened recently. Since September 2025, a provider can outsource the technical act of signing only if it still makes every attestation decision itself and signs with its own certificate – a clampdown on improper attestations by parties that didn’t originate the call. Providers also recertify in the Robocall Mitigation Database annually.

A proposed FCC rule (not yet adopted) would push verified caller-name to the handset on A-level calls – which would make A-level attestation worth even more.

A2P 10DLC and number-reputation management

Attestation handles whether a call is signed. Reputation handles whether the called party’s carrier flags the number as spam. These are separate systems, and a voice agent at volume has to manage both.

A2P 10DLC is the registration regime for application-to-person SMS over standard 10-digit numbers. It’s an SMS rule, not a voice rule, but it matters to voice-agent teams because many also send SMS and the registered brand identity feeds the same caller-reputation systems. Since early 2025, US carriers block – not throttle – unregistered 10DLC traffic outright: unregistered means undelivered.

A few details are worth knowing before launch (confirm against The Campaign Registry directly): a Reseller ID is required when a platform registers campaigns for a client, the brand’s EIN must be aged, opt-in URLs must be live and verifiable, and some states now mandate multi-year opt-out record retention. Registration takes several business days, so start it early.

Number reputation is the voice-side equivalent. Three analytics engines drive carrier spam-labeling: Hiya on AT&T, TNS on Verizon, and First Orion on T-Mobile. A high-volume outbound agent with low answer rates will accumulate spam flags over time, and once a number reads “Spam Likely” on a handset, answer rates collapse.

Remediation is straightforward: Free Caller Registry submits a number to all three engines at no cost and is the standard first step to clear a “Spam Likely” label, with paid monitoring services (Numeracle, Bandwidth) on top. For a high-volume outbound agent the working practice is to rotate a healthy pool of numbers, register them, hold A-level attestation, and watch for flags. Softcery hit exactly this building an outbound calling system for a law firm – holding A-level attestation and rotating a registered pool was what kept answer rates from collapsing.
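The rotation part of that practice can be sketched as a round-robin pool that skips flagged numbers. How a flag gets detected is out of scope here: `flag()` stands in for whatever monitoring signal you use, and the class, its methods, and the phone numbers are all hypothetical.

```python
from collections import deque


class NumberPool:
    """Round-robin over outbound numbers, skipping any currently flagged."""

    def __init__(self, numbers: list[str]) -> None:
        if not numbers:
            raise ValueError("pool needs at least one number")
        self._ring = deque(numbers)
        self._flagged: set[str] = set()

    def flag(self, number: str) -> None:
        """Mark a number as spam-labeled so it is rested until cleared."""
        self._flagged.add(number)

    def clear(self, number: str) -> None:
        """Return a remediated number to rotation."""
        self._flagged.discard(number)

    def next_number(self) -> str:
        """Return the next unflagged number, advancing the rotation."""
        for _ in range(len(self._ring)):
            candidate = self._ring[0]
            self._ring.rotate(-1)
            if candidate not in self._flagged:
                return candidate
        raise RuntimeError("every number in the pool is flagged")


pool = NumberPool(["+12025550100", "+12025550101", "+12025550102"])
pool.flag("+12025550101")
picks = [pool.next_number() for _ in range(4)]
print(picks)
```

The flagged number is skipped until `clear()` is called, so remediation can run while the rest of the pool keeps dialing.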


Architecture and decisions

WebRTC versus SIP: matching protocol to deployment

SIP and WebRTC are the two transport protocols beneath a voice agent, and the choice is not a quality ranking. It follows from where the human is. (The separate real-time vs cascading architecture tradeoff sits one layer up, inside the agent runtime.)

SIP is the protocol for phone-number calls across the PSTN and mobile networks – an outbound AI SDR dialing cell phones has no other path. WebRTC is the protocol for browser and in-app voice (UDP transport, Opus, built-in encryption); an in-product “talk to the agent” widget runs on it.

So the decision is simple: calls to or from phone numbers go SIP, browser and app voice go WebRTC. When the latency target is aggressive, WebRTC’s UDP transport wins; an existing SIP contact center makes SIP the path of least resistance.

Most production agents run both – WebRTC inside the media stack, SIP at the carrier edge – which means every call crosses a SIP-to-WebRTC bridge, and that bridge pays the transcoding cost from the codec section. LiveKit ships such a bridge out of the box. The choice isn’t which is better; it’s matching protocol to where the human is, and accepting the bridge cost where both appear.

Multi-carrier failover under outage

A single carrier is a single point of failure – the 2020 US outage that cut phone service for tens of millions is exactly the exposure a one-carrier agent carries.

The fix is to qualify each phone number across more than one carrier, so if carrier A fails its traffic reroutes to carrier B before the caller notices. A failover design leans on a few building blocks: independent trunks on independent carriers, SBCs doing health-checked routing, SIP OPTIONS probes as the keep-alive, and PSTN fallback. Well configured, it detects failure within seconds and switches over fast enough that users rarely notice.

An outbound agent should hold credentials for at least two trunks and route by health status. The counterpoint: some vendors argue redundancy within one provider simplifies operations – true, but it leaves that provider as the single failure domain.
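Routing by health status can be sketched as a selector that prefers the primary trunk until a run of keep-alive probes fails. The probing itself, for example SIP OPTIONS pings, is left outside the sketch and fed in through `record_probe`. The class names, trunk names, and the threshold of three failures are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Trunk:
    name: str
    priority: int                 # lower number = preferred
    consecutive_failures: int = 0


class TrunkSelector:
    """Pick the preferred trunk whose recent keep-alive probes succeeded."""

    def __init__(self, trunks: list[Trunk], failure_threshold: int = 3) -> None:
        self.trunks = sorted(trunks, key=lambda t: t.priority)
        self.failure_threshold = failure_threshold

    def record_probe(self, name: str, ok: bool) -> None:
        """Feed in the result of one keep-alive probe for the named trunk."""
        for trunk in self.trunks:
            if trunk.name == name:
                trunk.consecutive_failures = 0 if ok else trunk.consecutive_failures + 1
                return
        raise KeyError(name)

    def is_healthy(self, trunk: Trunk) -> bool:
        return trunk.consecutive_failures < self.failure_threshold

    def select(self) -> Trunk:
        """Return the highest-priority healthy trunk."""
        for trunk in self.trunks:
            if self.is_healthy(trunk):
                return trunk
        raise RuntimeError("no healthy trunk; fall back to PSTN forwarding or alert")


selector = TrunkSelector([Trunk("carrier-a", 1), Trunk("carrier-b", 2)])
before = selector.select().name
for _ in range(3):
    selector.record_probe("carrier-a", ok=False)
during = selector.select().name
selector.record_probe("carrier-a", ok=True)
after = selector.select().name
print(before, during, after)
```

Detection time is roughly the failure threshold multiplied by the probe interval, so three failures at a two-second interval would switch traffic in about six seconds. Tune both to trade false failovers against exposure.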

Voice AI with telephony: how agent platforms integrate SIP and PSTN

The telephony layer – carrier, SIP or WebRTC, SBC – sits beneath the agent runtime of VAD, STT, LLM, and TTS. Buyers searching for the leading voice AI platforms with SIP and PSTN integration are really asking one question: how much of this layer does each platform expose? A general comparison of voice agent platforms ranks them by overall feature surface; here the lens is narrower – the SIP and PSTN integration underneath – and the difference maps onto the build-vs-buy decision.

  • Vapi – turnkey; transport fully hidden, an engineer never configures it. Supports BYOC.
  • Retell – managed service; BYOC through custom SIP trunks.
  • Bland AI – turnkey; SIP integration offered as an enterprise feature.
  • LiveKit – open-source SFU; ships a dedicated SIP-to-WebRTC bridge and explicit BYOC (inbound and outbound trunks, with Telnyx/Twilio/Plivo).
  • Pipecat – transport-agnostic Python framework; you choose the transport. Maximum BYOC flexibility, most assembly.
  • Telnyx AI – telephony and agent from one vendor on Telnyx’s own carrier network. The opposite of BYOC: everything in-house at the vendor.

So the platforms form a spectrum: Vapi and Bland hide the telephony layer, LiveKit and Pipecat expose it, Telnyx AI owns it end-to-end. Where a team lands should follow from how much of that layer it needs to control – the next question.

Decision framework: turnkey, BYOC, or in-house

The honest starting point: carrier cost is small next to the rest of the pipeline. The SIP rate sits around $0.003–$0.014/min, while all-in platform costs run roughly ten times that or more (Vapi and Retell both land near $0.13–$0.33/min) because LLM and TTS dominate. Carrier minutes are a rounding error. So the decision to own telephony should turn on control and reliability, not per-minute savings.
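The arithmetic behind that claim is short enough to check directly. The sketch below computes the carrier's share of the all-in per-minute cost at both ends of the ranges quoted above, assuming the all-in figures already include the carrier minute.

```python
def carrier_share(carrier_rate: float, all_in_rate: float) -> float:
    """Fraction of the all-in per-minute cost that goes to the carrier."""
    if carrier_rate < 0 or all_in_rate <= 0:
        raise ValueError("rates must be non-negative and all_in_rate positive")
    return carrier_rate / all_in_rate


# Cheapest carrier rate against the priciest platform, and the reverse.
low = carrier_share(0.003, 0.33)
high = carrier_share(0.014, 0.13)
print(f"{low:.1%} to {high:.1%}")
```

Under these figures the carrier accounts for somewhere between about 1% and about 11% of the per-minute cost, so even eliminating it entirely moves the total far less than a change in LLM or TTS pricing would.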

Turnkey platform. A turnkey platform fits when telephony is not a differentiator and speed to launch is. The platform runs carrier-grade SBCs, handles attestation, and abstracts transport. The tradeoff is less control over routing, codec path, and failover, and a per-minute markup. For most early-stage voice agents, that markup buys time that is worth more than the margin.

BYOC. Bringing a chosen carrier into a managed agent platform fits when the carrier relationship needs to be specific – a particular provider for number provenance and A-level attestation, a negotiated committed-use rate, an existing contact-center trunk, or a multi-carrier failover requirement the platform’s bundled carrier cannot meet. BYOC captures most of the carrier-side control without rebuilding the media stack, and it is how most voice AI vendors with enterprise telephony support plug into a client’s existing carrier. LiveKit, Vapi, Retell, and Pipecat all support it.

Fully in-house. Owning carrier contracts, SBCs, and routing fits a narrow case: generic CPaaS analysis puts the crossover at tens of millions of monthly minutes, and only with a dedicated telecom team. In-house buys margin elimination and full routing and codec control, paid for with carrier contracts, STIR/SHAKEN signing obligations, Robocall Mitigation Database filings, E911, number provenance, and 24/7 operations.

The framework in one line: default to turnkey, move to BYOC when the carrier relationship has to be specific for deliverability or contract reasons, and go fully in-house only at sustained scale with a dedicated telecom team. The reasons that actually move a voice agent off turnkey are A-level attestation, number provenance, and multi-carrier failover – reliability and deliverability concerns – not the per-minute rate.
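The one-line framework can be written down as a small decision function. This is a simplification for discussion, not a substitute for judgment. The 50M-minute default is the generic CPaaS crossover mentioned in the FAQ, not a voice-agent-specific figure, and the parameter names are invented here.

```python
from enum import Enum


class TelephonyModel(Enum):
    TURNKEY = "turnkey"
    BYOC = "byoc"
    IN_HOUSE = "in-house"


def choose_model(
    monthly_minutes: int,
    has_telecom_team: bool,
    needs_specific_carrier: bool,
    in_house_threshold: int = 50_000_000,  # assumption: generic CPaaS crossover
) -> TelephonyModel:
    """Default to turnkey; escalate only when the stated conditions hold."""
    # In-house needs both sustained scale and a dedicated telecom team.
    if has_telecom_team and monthly_minutes >= in_house_threshold:
        return TelephonyModel.IN_HOUSE
    # BYOC when attestation, provenance, contract, or failover needs
    # require a particular carrier relationship.
    if needs_specific_carrier:
        return TelephonyModel.BYOC
    return TelephonyModel.TURNKEY


print(choose_model(200_000, False, False).value)
print(choose_model(5_000_000, False, True).value)
print(choose_model(80_000_000, True, True).value)
```

Note that volume alone never triggers the in-house branch: without a telecom team, a high-volume deployment still lands on BYOC or turnkey.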

Softcery builds production voice agents, and the recurring pattern is that the telephony layer gets attention only after an incident. Choosing carrier, codec path, and failover model deliberately during the build is cheaper than discovering them during an outage or a spam-label investigation.


Frequently Asked Questions

Does owning telephony in-house meaningfully reduce voice agent cost?

Rarely. The carrier SIP rate sits around $0.003–$0.014/min, while all-in platform costs run roughly ten times higher or more because LLM and TTS dominate the pipeline. Owning telephony eliminates only a small fraction of total cost. The real reasons to take more control are A-level caller-ID attestation, number provenance, and multi-carrier failover – reliability and deliverability concerns, not cost ones.

Why does a voice agent's transcription accuracy stop improving around 87%?

Calls that cross the public phone network are sampled at 8 kHz, which keeps only a narrow band of roughly 300 Hz to 3.4 kHz. High-frequency phonetic cues for fricatives like /s/, /f/, and /th/ live above 3.4 kHz and are absent from the audio entirely. Voicegain’s 2025 benchmark on 8 kHz call-center audio showed best-in-class speech-to-text models topping out at 86–88% accuracy. That ceiling is set by the channel, not the model, so swapping STT models moves the number a point or two but does not break the ceiling. A call that stays inside WebRTC end-to-end avoids the narrowband hop and is not subject to this limit.

What is the transcoding penalty and when does a voice agent pay it?

Transcoding is converting media from one codec to another mid-call, most commonly Opus to G.711 when a WebRTC-based agent bridges to the PSTN. Each transcoding step adds roughly 20–50 ms of latency, can consume 50–80% of a CPU core per concurrent call, and reportedly cuts a system’s call-handling capacity by roughly an order of magnitude. It also causes permanent information loss: once audio crosses into 8 kHz G.711, the wideband detail cannot be recovered by transcoding back. A voice agent pays this penalty on any call that touches a phone number, because the PSTN forces G.711.

What attestation level should an outbound voice agent aim for, and why?

A-level, full attestation. Under STIR/SHAKEN, A-level means the originating provider authenticated the customer and confirmed the customer is authorized to use the calling number. Calls signed B (partial) or C (gateway) are far more likely to be spam-labeled or filtered before reaching a handset. Reaching A requires the carrier to verify number ownership, which makes number provenance a real carrier-selection criterion. As of September 18, 2025, FCC rules also restrict third-party call signing: the obligated provider must make all attestation decisions and sign with its own certificate.

When should a voice agent use BYOC instead of a turnkey platform?

Bring-your-own-carrier fits when the carrier relationship has to be specific. Common triggers: needing a particular provider for number provenance and A-level attestation, a negotiated committed-use rate, an existing SIP contact-center trunk to integrate with, or a multi-carrier failover design the platform’s bundled carrier cannot support. BYOC captures carrier-side control without rebuilding the media stack, and LiveKit, Vapi, Retell, and Pipecat all support it. A fully in-house build is a separate, higher bar – generic CPaaS analysis places the crossover above roughly 50M monthly minutes with a dedicated telecom team, and no voice-agent-specific crossover figure is established.

What are the leading voice AI platforms with SIP and PSTN integration?

The leading voice AI platforms with SIP and PSTN integration fall into four groups. Turnkey platforms that hide transport: Vapi (which also supports BYOC) and Bland AI (SIP integration as an enterprise feature). Managed services with custom SIP trunks: Retell. Open frameworks that expose the SIP-to-WebRTC bridge directly: LiveKit (it ships a dedicated bridge and names Telnyx, Twilio, and Plivo as providers) and Pipecat. And the carrier-native option: Telnyx AI, which delivers telephony and the agent from one vendor. Which one fits depends on how much of the SIP and PSTN layer the team needs to control.

How do I find a phone number for my voice AI agent?

Three paths. Buy one through a CPaaS – Twilio, Telnyx, Plivo, SignalWire, or Bandwidth – when going BYOC, then point the SIP trunk at the agent. Buy one inside a turnkey platform like Vapi or Retell, which provisions and wires it for you. Or port an existing business number to keep its reputation. Before purchase, check the carrier on the things that decide deliverability: STIR/SHAKEN attestation level, number provenance, geographic coverage, and toll-free versus local fit. Toll-free reads as business but can attract more filtering; local numbers lift answer rates but need a healthy pool.

How does voice AI replace IVR systems?

Voice AI versus IVR is a runtime swap, not a telephony rebuild. The carrier, SIP trunk, and SBC stay the same; only the logic on top changes. A traditional IVR routes callers through a fixed DTMF menu tree (“press 1 for billing”). Voice AI replaces that tree with an LLM that takes natural-language turns, recognizes intent, and branches dynamically – no menu to memorize. The caller speaks instead of pressing keys, and the agent can ask follow-ups, handle out-of-order requests, and escalate to a human when needed. The migration risk sits in the agent runtime and prompt design, not the carrier layer underneath.

Which voice AI vendors support enterprise telephony?

For voice AI vendors with enterprise telephony support, the carrier-direct options are Bandwidth (a tier-1, facilities-based carrier that owns its network and numbers) and Telnyx (a licensed carrier on a private global IP backbone). For teams that want enterprise control without owning carrier contracts, the BYOC-capable platforms – LiveKit, Pipecat, Vapi, and Retell – integrate with an existing enterprise SIP trunk, which is usually how an enterprise plugs its own carrier relationship and A-level attestation into a managed agent runtime.
