The Code-Switching Gap: Where Multilingual Voice AI Loses Callers Mid-Sentence
Last updated on May 22, 2026
A caller reaches a US support line and says: “Necesito cancelar mi suscripción, but first can you tell me the early-termination fee?” A monolingual English pipeline mangles the Spanish opening; a monolingual Spanish one mangles the English back half. Either way the agent answers a question the caller never asked, and the call degrades from the first turn.
This is code-switching, and for the tens of millions of Spanish/English bilingual callers in US-Hispanic markets it is the default way people talk, not an edge case (the same holds for Hinglish, Singlish, and Taglish speakers elsewhere). A multilingual voice AI agent — a multilingual AI voice assistant — that treats “Spanish” and “English” as separate modes cannot serve a Spanglish speaker. Handling them is an architecture decision across the whole pipeline: how language is detected, which model transcribes the audio, and how one voice speaks both languages back.
This article is about real-time conversation, not asynchronous dubbing or voice translation — those ship a different vendor stack and latency budget, closer to our real-time vs turn-based voice agent architectures guide.
Understanding code-switching
What code-switching is, and why it is the default register
Code-switching is a multilingual speaker alternating languages within the same conversation, utterance, or sentence. The LinCE benchmark paper defines it formally as “the multilingual phenomenon that happens when speakers alternate languages within the same sentence or utterance.” Two granularities matter for engineering.
Inter-sentential switching happens between sentences. A caller finishes one full sentence in Spanish, then starts the next in English. Each sentence is internally monolingual, so a system that detects language at sentence boundaries can route each one correctly.
Intra-sentential switching, also called intra-utterance switching, happens mid-sentence. “Mujhe apna account balance check karna hai” carries Hindi grammar and English nouns inside one breath group. This is the hard case. No sentence boundary separates the languages, so any system that decides language once per sentence will be wrong for part of every switched sentence.
The named language pairs follow predictable patterns. Hinglish is Hindi-English. Spanglish is Spanish-English. Singlish is colloquial Singaporean English mixing English, Malay, Hokkien, Mandarin, and Tamil. Taglish is Filipino-English. Franglais is French-English. These are not accents or dialects. They are stable mixed registers with their own grammar, and speakers move fluidly between matrix and embedded languages.
Prevalence numbers in this space come mostly from vendor marketing rather than census data. Figures like “250 million Hinglish speakers” appear in vendor blogs such as Haptik’s and should be read as order-of-magnitude marketing claims. The defensible framing is qualitative: in India, Singapore, and US-Hispanic regions, code-switching is the normal way bilingual callers speak, so a production agent serving those markets will encounter it on a large share of calls.
The four ways monolingual ASR fails on code-switched speech
Monolingual ASR does not degrade gracefully on code-switched audio. It fails in four distinct ways, catalogued by Shunya Labs and corroborated across the literature:
- Acoustic mismatch — a model tuned to one language mis-maps the sounds of the other, because those sounds were never in its training data.
- Language-model penalty — the model assigns near-zero probability to cross-language word sequences, so a correct hypothesis (an English noun after a Hindi verb) gets pruned during decoding.
- Vocabulary gaps — the decoder has no tokens for the alternate language at all (one IEEE study counted 177 out-of-vocabulary tokens per test set), so it substitutes the nearest word it does have.
- Prosody disruption — different stress and intonation throw off endpointing, so the system cuts or merges utterances before recognition even finishes.
The aggregate cost gets quoted as a 30-50% relative WER increase versus clean monolingual audio — half again as many errors. That range recurs across vendor sources (Gladia, Haptik, Shunya) with no single primary citation, so treat it as a consensus figure, not a measured constant.
The peer-reviewed benchmarks are the hard evidence, and they should anchor any architecture decision:
| Benchmark | What it measures | Result |
|---|---|---|
| MUCS 2021 | Hindi/Bengali–English code-switching WER (~600 hrs) | 32.45% baseline |
| SEAME | Mandarin–English mixed error rate (MER) | 16–23% best, 34% baseline |
| Whisper fine-tuned | Indian-accented English WER | 15.08% (Medium) |
The takeaway: even purpose-built code-switching systems land in the 16–32% error band on hard mixed audio — roughly one word in five wrong. A monolingual pipeline pointed at the same audio does meaningfully worse, and a 30-50% increase on top of an already double-digit WER produces transcripts the language model cannot recover from. (One caveat: LinCE is often cited here but is a text NLP benchmark, not an ASR one — useful for framing language identification, not transcription accuracy.)
Choosing the ASR architecture for multilingual voice AI
Three granularities of language detection
Language detection inside a voice agent happens at one of three granularities, and the choice decides which callers the system can serve.
Per-turn or per-call detection identifies the language once at the start of speech and routes everything that follows. Cheapest option, and it fails any caller who switches mid-turn — which is every code-switching caller.
Per-utterance detection runs at each utterance boundary. It handles inter-sentential switching because each utterance is internally monolingual. It still fails intra-sentential switching because a single utterance with both languages gets one label.
Mid-utterance or token-level detection decides language per word or per frame. Required for true intra-sentential Hinglish and Spanglish. Modern unified multilingual models do this implicitly in one forward pass; Deepgram Nova-3’s multi mode, for example, returns a language field per word and a languages array per channel.
Granularity is not a tuning knob applied after the fact. It is a property of the architecture. A pipeline built around per-turn detection cannot be upgraded to token-level by changing a parameter — it has to be rebuilt around a model that decides language per token natively.
Voice AI APIs with multilingual support: unified model versus language-specialist router
Two architectures can transcribe multilingual call traffic, and they make opposite trade-offs.
A single unified multilingual model runs one forward pass. There is no separate routing step and no detection-latency tax. The model follows the speaker across switches natively and preserves speaker identity across languages because the same model processes the whole utterance. The cost is coverage and peak accuracy: unified models typically support a fixed, often smaller language list, and per-language accuracy can trail a dedicated monolingual specialist on clean monolingual audio.
A language-specialist router detects the language, then sends the audio to the best monolingual engine for that language. It achieves the best clean-monolingual accuracy and the widest aggregate language coverage. The cost is everything code-switching needs: the router adds language-identification latency, breaks on intra-sentential switching because it commits to one engine per segment, and complicates speaker-identity continuity because different engines handle different parts of the same call.
As of mid-2026, frontier real-time vendors have converged on the unified-model approach for code-switching specifically. Routers survive for breadth, covering 90-plus languages, and for batch or offline use where latency does not matter.
The current multilingual ASR landscape:
| Model / Vendor | Languages | Code-switching | Notes |
|---|---|---|---|
| Whisper large-v3 / large-v3-turbo (OpenAI, open) | ~99 trained; strong in ~10 | No explicit CS mode; unreliable on CS, needs testing | Still the current open Whisper family; no official large-v3 successor announced as of mid-2026 |
| Deepgram Nova-3 | 36+ monolingual; Multilingual mode covers 10 languages (English, Spanish, French, German, Hindi, Italian, Japanese, Dutch, Russian, Portuguese) | Yes – language=multi, unified model, no LID step, per-word language tags | Vendor claims ~34% relative batch-WER drop, ~21% streaming [vendor-claimed] |
| AssemblyAI Universal-3 Pro | Multilingual streaming covers 6 languages (English, Spanish, French, German, Italian, Portuguese); 90+ in batch | Yes – built-in intra-utterance CS across the 6, single forward pass, preserves speaker identity | Universal-3 Pro streaming launched in the 2026 cycle |
| ElevenLabs Scribe v2 / v2 Realtime | 90+ languages, auto-detect | Yes – CS handled with no manual config; demoed on Spanglish | Vendor claims 93.5% accuracy on a multilingual benchmark [vendor-claimed] |
| Google Chirp 3 | 85+ languages/locales; trained across 100+ | Auto language detection; transcribes the dominant language – not marketed for intra-sentential CS | GA in Speech-to-Text API v2 |
| Meta Omnilingual ASR | 1,600+ languages (zero-shot extendable to 5,400+) | Not specifically a code-switching system | Released late 2025, Apache-2.0; 4.3M hrs training, CER under 10% in 78% of languages |
| Mistral Voxtral (Mini 3B / Small 24B, open) | 13 languages, auto-detect (EN, ZH, HI, ES, AR, FR, PT, RU, DE, JA, KO, IT, NL) | Yes – handles CS within one audio file, no config | Apache-2.0; 2.3M hrs training; Voxtral Small beats Whisper on FLEURS per Mistral [vendor-claimed] |
Two reading instructions for this table. The vendor “WER drop” figures, Deepgram’s ~34% and ElevenLabs’ 93.5%, are marketing claims with no independent verification. The peer-reviewed benchmarks in the previous section are the figures to plan against. And the language counts are not interchangeable with code-switching support: Omnilingual covers 1,600-plus languages but is not a code-switching system, while AssemblyAI’s streaming code-switching works across only six. Coverage breadth and intra-sentential code-switching are different capabilities. For a wider provider comparison across pricing, latency, and accuracy, see our STT and TTS selection guide.
Speech-native models: feeding audio straight into the LLM
A third architecture is emerging that, in theory, fits code-switching better than either option above. Instead of the usual STT → text → LLM chain, a speech-native model projects audio directly into a multilingual LLM’s token space — no separate transcription layer, no language-identification step. The model “hears” the audio and reasons over it in one pass.
That structure removes the exact place where code-switching breaks. There is no transcription step that commits to one language at the switch, and the LLM’s own multilingual knowledge applies to the sound itself rather than to a transcript that may already be wrong. Ultravox is the clearest example: its v0.7 release builds on a GLM-4.6 backbone and uses a custom audio encoder to feed audio into the LLM’s token space, ranking #1 among speech models on VoiceBench. Mistral Voxtral, Qwen3-Omni, Kyutai Moshi, and Gemini Live belong to the same class.
The honest caveat: this class is newer and heavier than a Deepgram or AssemblyAI endpoint. Ultravox v0.7 still outputs text paired with a downstream TTS rather than pure speech-out, self-hosting needs B200/H100-class GPUs, and few of these models are telephony-proven at scale today. Treat speech-native as the promising frontier of the unified approach, not a settled default — but the one to watch for code-switching specifically.
Why latency now settles the architecture choice
For real-time code-switching, latency is the decisive argument for a unified model, and accuracy is secondary to it.
Human conversation has a natural response window of roughly 300ms, the gap a person leaves before it feels like a delay. The AssemblyAI “300ms rule” frames this as the target for a perceptually instant response. The overall acceptable budget for a first-syllable response sits around 2-3 seconds, but the perceived snappiness lives inside that 300ms window — see our voice agent latency budget guide for the full per-component walkthrough.
Every stage of the pipeline — detecting the end of the turn, transcribing, the LLM, the voice — spends part of that window, and they overlap rather than run in sequence (the voice agent latency budget guide breaks the milliseconds down stage by stage).
The point that matters for code-switching: a separate language-identification step adds roughly 70-200ms of its own — a beat of delay on every reply, spent before the agent has understood anything. That figure is a single vendor’s estimate, but the direction is not in dispute: detecting the language first is pure added latency.
A unified multilingual model eliminates that step. It decides language inside the same forward pass that produces the transcript. Deepgram, AssemblyAI, ElevenLabs, and Voxtral all explicitly market the absence of a separate language-detection step. For a router, code-switching costs both a latency tax and intra-sentential accuracy. For a unified model, it costs neither. That is why the architecture question is now effectively settled for real-time code-switching agents.
Building a multilingual voice agent that holds quality across code-switching? Softcery builds production voice systems where ASR and TTS hold up across languages from day one. Schedule a consultation to scope your stack.
TTS, prosody, and regional variants
Top multilingual TTS voice AI platforms: cross-lingual voice cloning compared
The TTS side has its own code-switching requirement. A voice agent usually has a brand persona, a single cloned voice. That voice must speak every supported language without sounding like a different person when it switches. A clone that is convincing in English but turns into a generic voice in Spanish breaks the persona on every Spanglish turn.
The current TTS landscape for cross-lingual cloning:
| TTS Model | Languages | Cross-lingual cloning | Latency |
|---|---|---|---|
| ElevenLabs Eleven v3 | 70+ languages TTS | Cloned voice usable across 32+ languages; Instant Voice Clone (~1 min audio) or Professional Voice Clone (30+ min). Caveat: PVCs not yet fully optimized for v3, IVC recommended during the v3 preview | – |
| Cartesia Sonic-3 | 42 languages, including 9 Indian languages | A cloned voice works for TTS in any of the 42 languages; clone from 10-30s of audio | Model latency under 100ms (one customer cites 90ms) |
| Deepgram Aura-2 | 7 languages (EN, ES, Dutch, French, German, Italian, Japanese); multiple EN/ES accents | No voice cloning offered | sub-90ms latency; $30 per 1M chars |
| Inworld TTS-1.5 Max | Multilingual | Instant voice cloning from 5-15s audio | sub-250ms P90 (Max); topped the Artificial Analysis Speech Arena at ~1236 ELO in 2026 |
| Gradium | 5 languages (EN, FR, DE, ES, PT) | Voice cloning offered; mid-sentence CS across all 5 | ”no latency penalty” claimed |
| Sarvam Bulbul V3 | 22 Indian languages; 25+ voices | Indian-language specialist; emotion control | – |
The trade-offs are concrete. ElevenLabs has the broadest cloning language coverage, but its highest-fidelity clone, the Professional Voice Clone, lags its newest model, v3, which is why the Instant Voice Clone is recommended during the v3 preview. Cartesia Sonic-3 is the cleanest single-clone-across-42-languages story with sub-100ms latency. Deepgram Aura-2 is fast and accurate but offers no cloning at all, which is a real constraint when voice identity matters to the brand. Sarvam Bulbul V3 is the option to evaluate for Indian-language-heavy traffic.
One caveat applies to all of them. Every cross-lingual speaker-similarity claim in this space is vendor-side. No independent benchmark of how well a cloned voice preserves identity across languages exists in the public literature. Similarity should be verified by listening tests on the actual target languages before committing.
Prosody preservation is a real and unsolved measurement gap
Getting the words right is not the same as getting the delivery right. Code-switching TTS models “may exhibit unnatural prosody and inaccurate tones when mixing words from different languages,” and end-to-end code-switching TTS shows prosody instability, especially over longer paragraph-length text. Stress, rhythm, and intonation can break at the switch boundary even when the transcription is perfect.
The research literature offers techniques rather than solved problems. Dual speaker embeddings separate the controls: one embedding governs accent and prosody, a second governs timbre, and the approach reportedly improves both nativeness and speaker preservation. Text enhancement for paragraph-level code-switching TTS improves prosody stability. CrossVoice is a cascade speech-to-speech system with explicit cross-lingual prosody preservation via transfer learning. Fine-tuning multilingual TTS on code-switching datasets captures conversational prosody and rapid intra-sentence switching.
The honest engineering position is that no production vendor publishes prosody-continuity metrics across a code-switch boundary. There is no standard number for how natural a voice sounds when it crosses from Hindi into English mid-sentence. This is a genuine measurement gap. Teams building code-switching agents should treat prosody as something to evaluate by ear on representative switched utterances, not something a vendor spec sheet will quantify.
Regional accent and variant handling is its own routing dimension
Treating a language as a single monolithic target is itself a failure mode. “Spanish” and “English” are not single acoustic targets, and a system that treats them that way underperforms for whole regions of callers.
Spanish splits clearly. Deepgram Aura-2 exposes Mexican, Peninsular, Colombian, and Latin American Spanish as distinct accents. US and Latin American Spanish differ from Spain Spanish in vocabulary, pronunciation, and formality registers. English splits the same way: Aura-2 exposes American, British, Australian, Irish, and Filipino accents. Portuguese diverges enough between Brazilian and European variants that ASR and TTS vendors treat them as separate locales, splitting pt-BR and pt-PT in their language lists.
Indian English is the most-documented variant, and the research makes the point plainly: accuracy swings widely by language and region, and degrades further for North-East and South India speakers. The Svarah benchmark (9.6 hours, 117 speakers across 65 locations) and LAHAJA exist precisely because the usual English datasets under-represent these speakers. The fixable part is data: adding the Shrutilipi corpus pulled Hindi WER from 18.8% down to 13.5%. The lesson for system design is that “the right accent” is a routing decision, not an afterthought — a single “English” or “Spanish” target leaves whole regions of callers underserved.
The takeaway for system design: accent and variant are routing dimensions just like language. A code-switching agent for the Indian market does not just need Hindi and English. It needs the right Indian-English variant handling, and it will underperform for North-East and South India callers unless that is tested explicitly.
Deployment and a decision framework
Softcery’s Casegen AI deployment shows what these choices look like in production. The agent detects language early in the call and runs the entire bilingual English/Spanish intake — tested across Mexican, Caribbean, and Central American Spanish dialects — holding case qualification consistent as callers switch between the two. It reaches 95% case capture and a 1.2-second average response latency.
A decision framework for building the pipeline
The decisions above resolve into a short sequence for any team building a multilingual voice agent.
- Classify the traffic. If callers switch only between full sentences, per-utterance detection is enough. If callers switch mid-sentence — the norm for Hinglish, Spanglish, Singlish, and Taglish speakers — the system needs token-level detection, and that means a unified multilingual model.
- Choose ASR architecture. For real-time code-switching, choose a unified multilingual model — or, increasingly, a speech-native model that skips transcription entirely: both carry no language-identification latency tax and handle intra-sentential switching in one pass. Reserve a language-specialist router for batch or offline transcription, or for cases needing breadth across 90-plus languages where latency is not a constraint. Confirm the model’s fixed language list actually covers the target pairs, since coverage breadth and code-switching support are separate properties.
- Verify ASR on code-switched test data, not vendor demos. Awaaz.ai, itself a code-switching vendor, argues that most “supports language X” claims reflect only basic capability and that buyers should demand proof on code-mixed test data. Build a test set from real switched utterances and measure WER or MER against the peer-reviewed baselines: 16-32% is the realistic band on hard mixed audio, and a system far above that needs work.
- Choose TTS for voice-identity continuity. If the brand persona depends on a consistent cloned voice, the clone must hold identity across every supported language, and that has to be verified by listening tests because no independent similarity benchmark exists. If the persona does not require cloning, a fast non-cloning model like Aura-2 is viable.
- Treat prosody as an evaluation step, not a spec. No vendor quantifies prosody continuity across a switch boundary, so it has to be judged by ear on representative switched utterances before launch.
The pattern across all five steps is the same: code-switching is a property the architecture either has or lacks, and the costly mistake is building a per-turn or router-based pipeline and discovering the gap only when bilingual callers start phoning in. The decision to make is the first one, and it is an architecture decision.
Need a multilingual voice AI agent that follows callers across the language switch instead of breaking on it? Reach out to Softcery.
Frequently Asked Questions
For real-time code-switching, the strongest options as of mid-2026 are Deepgram Nova-3 in multi mode, AssemblyAI Universal-3 Pro Streaming, ElevenLabs Scribe v2 Realtime, and Mistral Voxtral. All four ship intra-utterance code-switching with no separate language-detection step. Vendor accuracy claims in the multilingual ASR table above are marketing figures; plan against the peer-reviewed MUCS 2021 and SEAME benchmarks instead.
Four production APIs currently cover intra-sentential multilingual transcription with a unified model: Deepgram Nova-3 Multilingual (10 languages), AssemblyAI Universal-3 Pro Streaming (6 languages real-time, 90+ in batch), ElevenLabs Scribe v2 Realtime (90+ languages, auto-detect), and Mistral Voxtral (13 languages, Apache-2.0 open weights). Google Chirp 3 and Meta Omnilingual add wider coverage but are not marketed for code-switching.
Yes, across different vendor stacks. Mistral Voxtral covers Arabic, Hindi, and Japanese inside its 13-language set with auto-detect. Deepgram Nova-3’s multilingual mode covers Hindi and Japanese. Sarvam Bulbul V3 specialises in 22 Indian languages including Hindi with native voices. For TTS, ElevenLabs Eleven v3 and Cartesia Sonic-3 both speak all three languages, and Sonic-3 specifically lists 9 Indian-language voices for Hindi-heavy deployments.
Cartesia Sonic-3 covers all four inside its 42-language TTS catalog with cross-lingual cloning. Mistral Voxtral handles ES, DE, KO, and FR inside its 13-language ASR set with auto-detect. Deepgram Nova-3’s multilingual mode covers Spanish, German, and French (Korean via the monolingual list). ElevenLabs Eleven v3 speaks all four across its 70+ TTS languages. For Spanish specifically, Deepgram Aura-2 exposes Mexican, Peninsular, Colombian, and Latin American variants as distinct accents — treat regional Spanish as its own routing dimension, not a single language.
Voice translation rewrites a finished audio file or stream into another language — useful for dubbing and post-production. A multilingual voice agent holds a live conversation, detecting and responding in the caller’s language turn by turn, including mid-sentence code-switches. The vendor stacks differ: dubbing leans on HeyGen, Synthesia, and Runway; real-time multilingual agents lean on Deepgram, AssemblyAI, ElevenLabs, and Voxtral. Latency budgets differ by an order of magnitude.
Inter-sentential code-switching happens between sentences: a caller finishes one full sentence in one language and starts the next in another. Each sentence is internally monolingual, so a system that detects language at sentence boundaries can route each one correctly. Intra-sentential code-switching happens mid-sentence, with both languages inside one breath group, as in “Mujhe apna account balance check karna hai.” This is the hard case, because no sentence boundary separates the languages and any system that decides language once per sentence will be wrong for part of every switched sentence.
Monolingual ASR fails in four distinct ways on code-switched speech: acoustic mismatch, because phoneme inventories differ between languages; language-model penalty, because the model assigns near-zero probability to cross-language word sequences and prunes correct hypotheses; vocabulary gaps, because the decoder has no tokens for the alternate language; and prosody disruption, because differing stress patterns break endpointing. The aggregate cost is widely cited as a 30-50% relative WER increase versus clean monolingual audio, though that range is a vendor consensus figure rather than a single measured benchmark.
For real-time code-switching, a unified multilingual model is the better choice. It runs one forward pass, carries no separate language-detection latency (an explicit detection step adds roughly 70-200ms per vendor estimates), handles intra-sentential switching natively, and preserves speaker identity across languages. A language-specialist router achieves higher clean-monolingual accuracy and wider language coverage, but it adds detection latency, breaks on mid-sentence switches, and complicates speaker continuity. Routers remain useful for batch transcription and for breadth across 90-plus languages where latency does not matter.
Several TTS vendors offer cross-lingual cloning where a single cloned voice works across many languages: ElevenLabs Eleven v3 supports a cloned voice across 32+ languages, and Cartesia Sonic-3 supports a clone across 42 languages with sub-100ms latency. The caveat is that every cross-lingual speaker-similarity claim in this space is vendor-side, with no independent benchmark. Identity preservation across languages should be verified with listening tests on the actual target languages before committing to a vendor.
In theory, yes — and they are the class to watch. A speech-native model (Ultravox, Mistral Voxtral, Qwen3-Omni, Kyutai Moshi, Gemini Live) feeds audio straight into a multilingual LLM with no separate transcription layer, so there is no step that commits to one language at the switch. The catch is maturity: this class is newer, heavier to self-host, and less telephony-proven than a Deepgram or AssemblyAI endpoint today. For most production deployments a unified streaming STT model is still the safer default, with speech-native as the frontier to pilot.
See how much it would cost to build and launch your AI voice agent — tailored to your business in under a minute.
Try the AI Voice Calculator