Voice-Specific Prompt Engineering: Why Text Prompts Break in Real-Time Audio
Last updated on May 22, 2026
Take a system prompt that works in ChatGPT and connect it to a streaming voice agent. It misfires in ways that never appear on screen. The agent reads markup aloud, narrating “asterisk asterisk” before a bolded word. It says “two thousand five” for the year 2005. It pronounces “$42.50” as “forty-two point five zero.” It jumps in during a natural mid-sentence pause and talks over the caller. None of these are model-quality problems. They are medium problems. A text prompt assumes a turn-complete, visually rendered, retryable channel. Streaming voice is none of those.
This article covers the prompt-engineering playbook specifically for streaming voice agents: why text prompts break, how to format for speech, how to make numbers and names reliable, how to structure a system prompt that re-executes every turn, and how to test prompt changes without shipping regressions. The provider guidance comes from OpenAI, Vapi, Retell, ElevenLabs, and LiveKit. The production examples come from Softcery’s own voice-agent prompts for the Hotel Birger Jarl in-stay concierge.
Why text prompts break in voice: real-time, audio-only, single-pass, interruptible
Text prompts assume four properties that streaming voice removes. The medium is real-time, so there is no pause to retry a bad response. It is audio-only, so there is no screen to render structure. It is single-pass, so the caller hears the first draft and only the first draft. It is interruptible, so the caller can speak over the agent at any moment. Each removed property maps to a concrete failure class.
Visual formatting leaks into speech. A text prompt that produces bullet lists, bold, headers, or numbered lists has no equivalent in audio. The text-to-speech engine either reads the markup aloud or the model dumps an unspeakable wall. OpenAI’s Realtime prompting guide states the rule plainly: “Voice agents must never output formatting that only works visually – no bold, italics, or headers; no numbered or bulleted lists.” (OpenAI Realtime Prompting Guide)
Numbers get mis-normalized by the TTS. “$42.50” can be read “forty-two point five zero.” Dates like “03/04/2025” are ambiguous between US and EU ordering. Year-like numbers misfire, so “2005” comes out “two thousand five” instead of “twenty oh five.” Modern neural TTS handles standard cases but fails on edge cases: ambiguous dates, Roman numerals, industry-specific formats. (SIMBA Voice)
The prompt re-executes every turn under latency pressure. Vapi frames the system prompt as “the agent’s operating system, re-executed on every turn.” (Vapi Prompting Guide) A long text prompt that costs nothing extra in a chatbot becomes a per-turn instruction-attention tax in streaming voice, where every token competes with the latency budget.
The agent confirms what the caller did not say. A text-trained model fills gaps confidently. In audio, that means hallucinated confirmations from unclear input. OpenAI’s Realtime guidance counters this directly: “Do not guess what the user meant from unclear audio. Do not reason when the audio is unclear.” (OpenAI realtime models guide)
The agent talks over the caller. Turn-boundary detection is its own failure surface. LiveKit puts it this way: “the moment between when you finish a sentence and when an agent starts responding determines whether a conversation feels natural or painful.” Voice-activity detection alone triggers on silence, so the agent jumps in during a natural mid-sentence pause. (LiveKit, Turn Detection for Voice Agents)
The demand for voice-specific prompting is visible in the market. Both Vapi and Retell publish free prompting masterclasses on YouTube (Retell, Vapi), and multiple paid third-party Udemy courses now ship dedicated “prompt engineering for voice” modules (example). Generic LLM prompting and voice prompting have split into separate disciplines.
Voice-first formatting: no lists, no markup, one question at a time
The first rewrite of any text prompt for voice strips every visual structure. Lists, bold, headers, and tables exist for eyes. A voice agent that emits them either spells the markup or produces a response no caller can follow.
The replacement for a list is a sentence. When an agent needs to present several items, it verbalizes them as flowing prose. Softcery’s hotel concierge prompt makes this a hard principle: the agent “MUST NOT use lists, bullets, emojis, or stage directions like laughs,” and when it needs to present several items such as restaurant hours, it “MUST verbalize them as a flowing, natural sentence.” This is Softcery production practice for the Hotel Birger Jarl agent, and it matches OpenAI’s no-visual-formatting rule one to one.
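The list-to-sentence rule can also be applied to knowledge-base content before it ever reaches the prompt. The following is a minimal sketch, not Softcery's implementation: the helper name `verbalize_items` and the restaurant hours are hypothetical, chosen only to illustrate the idea.

```python
def verbalize_items(items: list[str], conjunction: str = "and") -> str:
    """Join items into one spoken-style clause instead of a bulleted list."""
    cleaned = [item.strip() for item in items if item.strip()]
    if not cleaned:
        return ""
    if len(cleaned) == 1:
        return cleaned[0]
    if len(cleaned) == 2:
        return f"{cleaned[0]} {conjunction} {cleaned[1]}"
    # Three or more items: commas, then the conjunction before the last item.
    return ", ".join(cleaned[:-1]) + f", {conjunction} {cleaned[-1]}"


# Hypothetical hours, already written the way they should be spoken.
hours = [
    "breakfast from seven to ten",
    "lunch from noon to two",
    "dinner from six to nine",
]
print("The restaurant serves " + verbalize_items(hours) + ".")
```

Writing the items in spoken form first keeps the join step trivial; the helper only handles punctuation and the final conjunction.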
Voice-first formatting also constrains response length. Vapi recommends keeping responses under roughly two sentences and asking one question at a time. (Vapi Prompting Guide) Retell gives the same instruction: “Ask one question at a time: Avoid overwhelming the caller.” (Retell Prompt Engineering Guide) Softcery’s prompt enforces the same boundary, asking “only one clarifying question at a time.” A caller cannot scroll back to re-read a paragraph. Anything longer than a couple of sentences exceeds working memory and forces a “sorry, can you repeat that.”
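These limits are easy to check mechanically in a test harness. The sketch below is a rough heuristic under stated assumptions: the function name, the regular expressions, and the default of two sentences are illustrative choices, and the sentence splitter will miscount abbreviations such as "Mr. Smith".

```python
import re

# Markup that only works visually: bold, headers, bullets, numbered lists.
VISUAL_MARKUP = re.compile(
    r"(\*\*|__|^#{1,6}\s|^\s*[-*•]\s|^\s*\d+[.)]\s)", flags=re.MULTILINE
)


def check_spoken_reply(reply: str, max_sentences: int = 2) -> list[str]:
    """Return the voice-formatting problems found in one agent reply."""
    problems = []
    if VISUAL_MARKUP.search(reply):
        problems.append("visual formatting")
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", reply.strip()) if s]
    if len(sentences) > max_sentences:
        problems.append("too many sentences")
    if reply.count("?") > 1:
        problems.append("more than one question")
    return problems


print(check_spoken_reply("Sure. What time would you like?"))
print(check_spoken_reply("**Great**! What is your name? And your room number?"))
```

A check like this belongs in evaluation, flagging replies for review, rather than in the live path where it would add latency.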
Number reliability: pre-spelling, SSML say-as, and pronunciation dictionaries
Numbers are the single largest source of voice-prompt bugs because every number passes through a normalization step before synthesis, and that step guesses.
Provider guidance converges on one instruction: numbers must be written in the prompt the way they should be spoken. OpenAI’s Realtime conversion table is explicit: $42.50 becomes “forty-two dollars and fifty cents,” 03/04/2025 becomes “March fourth, twenty twenty-five,” and (831) 239-8123 becomes “eight three one, two three nine, eight one two three.” For codes, OpenAI’s rule is verbatim: “When reading numbers or codes, speak each character separately, separated by hyphens (e.g., 4-1-5). Repeat EXACTLY the provided number; do not omit any digits.” (OpenAI Realtime guide, OpenAI realtime models guide) Retell says the same for dates: “Return dates in spoken form: Say ‘January fifteenth’ not ‘1/15’.” (Retell guide)
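Two of the conversions quoted above can be sketched in a few lines. This is an illustration under narrow assumptions, not a general normalizer: the function names are hypothetical, amounts are limited to whole-dollar parts below one thousand, and only US-dollar strings are handled.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def say_under_1000(n: int) -> str:
    """Spell an integer from 0 to 999 in words."""
    if not 0 <= n < 1000:
        raise ValueError("this sketch only covers 0 to 999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    hundreds, rest = divmod(n, 100)
    return f"{ONES[hundreds]} hundred" + (f" {say_under_1000(rest)}" if rest else "")


def say_dollars(amount: str) -> str:
    """Turn a string like '$42.50' into its spoken form."""
    dollars, _, cents = amount.strip().lstrip("$").partition(".")
    d = int(dollars)
    c = int((cents + "00")[:2])  # pad so '.5' is read as fifty cents
    spoken = f"{say_under_1000(d)} dollar{'' if d == 1 else 's'}"
    if c:
        spoken += f" and {say_under_1000(c)} cent{'' if c == 1 else 's'}"
    return spoken


def say_digits(text: str) -> str:
    """Read each digit group one digit at a time, pausing between groups."""
    groups = re.findall(r"\d+", text)
    return ", ".join(" ".join(ONES[int(ch)] for ch in group) for group in groups)


print(say_dollars("$42.50"))
print(say_digits("(831) 239-8123"))
```

Running the conversion in the pipeline, before the text reaches the model or the TTS, means the spoken form is fixed ahead of time instead of being guessed per call.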
Three fixes exist, in increasing order of robustness.
| Fix | Mechanism | Best for |
|---|---|---|
| Text normalization before synthesis | Convert “03/12/2026” to “March 12, 2026” in the prompt or pipeline before it reaches the TTS | Domain content the team controls: knowledge bases, scripted lines, dates |
| SSML <say-as> tags | Wrap values with interpret-as="telephone", interpret-as="date" format="mdy", interpret-as="currency", cardinal, or ordinal | Dynamic values the agent reads back from tools or variables |
| Custom pronunciation dictionaries | Map domain terms, company acronyms, and product names to fixed pronunciations | Specialized vocabulary the built-in TTS dictionary lacks |
Source: SIMBA Voice, with SSML <say-as> documented in the W3C SSML reference.
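For the SSML row, a small helper can wrap dynamic values before they are sent to the TTS. This is a sketch under assumptions: the helper name `say_as` is hypothetical, and which `interpret-as` and `format` values a given engine honors varies by vendor, so the attribute values should be checked against that engine's documentation.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(value: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap a value in an SSML <say-as> element, escaping XML characters."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(value)}</say-as>"


print(say_as("03/12/2026", "date", "mdy"))
print(say_as("(831) 239-8123", "telephone"))
```

Escaping matters here because tool outputs can contain characters such as `&` or `<` that would otherwise produce malformed SSML.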
Acronym handling needs its own rule. “FBI” should be spelled letter by letter as “F-B-I,” but “NASA” should be read as the word “Nassa.” Most TTS engines ship a built-in dictionary for common cases and need custom additions for specialized vocabulary. (SIMBA Voice)
The most reliable pattern goes further than any of the three fixes: pre-spell every number in the prompt and knowledge base so the model never performs a live conversion. Softcery’s prompt guide states it directly: “Spell things out the way they should be spoken… write ‘half past six in the morning’, not ‘06:30’; ‘nineteen seventy-four’, not ‘1974’. This removes a conversion the model can get wrong live.” The Hotel Birger Jarl knowledge base follows this throughout, with entries like “founded in nineteen seventy-four” and “Tulegatan, number eight.” The number never exists in numeric form, so there is no conversion to fail.
A real bug from Softcery’s iteration history shows why this matters. An earlier version of the hotel prompt instructed the agent to handle currency, and the agent read “195 SEK” as “S-E-K” – spelling the currency code letter by letter, because the initialism rule fired on it. The fix in the current prompt is a Pronunciation Guide rule: currency is always spoken as the name, never the code, so “195 SEK” becomes “one hundred ninety-five Swedish kronor.” The rule carries an explicit exception so the general initialism rule (“TV” to “T-V”) does not fight it. The same Pronunciation Guide section spells the emergency number “112” as the individual digits “one one two,” reads the hotel phone number with written-out grouping and natural pauses, expands addresses, and inserts ellipses to pace multi-step explanations.
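The rule conflict is easier to see when the two rules are written as code. Softcery's fix lives in the prompt, not in a preprocessing step, so the following is only an analogy: the function names, the generic all-caps pattern, and the sample sentence are assumptions, and number spelling is left out to keep the precedence visible.

```python
import re

# Only SEK comes from the article; other codes would be added per deployment.
CURRENCY_NAMES = {"SEK": "Swedish kronor"}


def speak_currency(text: str) -> str:
    """Replace currency codes with the spoken currency name."""
    codes = "|".join(map(re.escape, CURRENCY_NAMES))
    return re.sub(rf"\b({codes})\b", lambda m: CURRENCY_NAMES[m.group(1)], text)


def spell_initialisms(text: str) -> str:
    """Spell any remaining all-caps token letter by letter (TV -> T-V)."""
    return re.sub(r"\b[A-Z]{2,4}\b", lambda m: "-".join(m.group(0)), text)


def apply_pronunciation_rules(text: str) -> str:
    # The currency exception runs first, so the general rule never sees "SEK".
    return spell_initialisms(speak_currency(text))


line = "The TV package costs 195 SEK."
print(spell_initialisms(line))          # general rule alone reproduces the bug
print(apply_pronunciation_rules(line))  # exception applied before the general rule
```

The same ordering question applies inside a prompt: the exception has to be stated so that it clearly wins over the general initialism rule.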
Name spelling and spell-back confirmation
Names and high-precision identifiers fail in two directions. The TTS mispronounces them on the way out, and speech-to-text mis-transcribes them on the way in.
The outbound problem is a pronunciation problem. OpenAI’s recommended prompt structure includes a named “Reference Pronunciations” section with examples like “Pronounce ‘SQL’ as ‘sequel,’” “Pronounce ‘Kyiv’ as ‘KEE-iv,’” and “Pronounce ‘Huawei’ as ‘HWAH-way.’” (OpenAI Realtime guide) For any agent operating in a specific domain – hotel names, drug names, legal terms – this section is not optional.
The inbound problem is a confirmation problem. When a caller spells an order code or an email, the agent must read it back before acting on it. OpenAI’s spell-back examples are verbatim:
- Order code: “Just to confirm, I heard O-R-D dash 3-1-2-5-B-2-3. Is that right?”
- Email: “Just to confirm, that is c-h-e-n at example dot com, right?”
- Phone: “You said 0-2-1-5-5-5-1-2-3-4, correct?”
(OpenAI realtime models guide) ElevenLabs reinforces the same point for tool calls: structured identifiers passed to a tool should carry an explicit format and example in the parameter description, because STT delivers spoken-form values into the conversation context. (ElevenLabs prompting guide)
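The read-back format in the order-code example can be generated from the captured value. A minimal sketch, assuming a hypothetical helper named `spell_back` and codes whose only separator is a hyphen:

```python
def spell_back(code: str) -> str:
    """Spell a code character by character, saying 'dash' for each hyphen."""
    chunks = [chunk for chunk in code.strip().split("-") if chunk]
    return " dash ".join("-".join(chunk.upper()) for chunk in chunks)


def confirmation_line(code: str) -> str:
    return f"Just to confirm, I heard {spell_back(code)}. Is that right?"


print(confirmation_line("ORD-3125B23"))
```

Generating the read-back from the stored value, not from the raw transcript, means the caller confirms exactly what the agent is about to act on.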
One pattern that production teams rely on does not appear in the primary platform docs: phonetic-alphabet confirmation, where the agent confirms a spelled name as “A as in Alpha, N as in November.” Softcery uses this as an in-house practice for high-stakes name capture, but it is industry and Softcery practice rather than a cited provider rule. Teams should treat it that way and test it against their own STT accuracy before relying on it.
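A phonetic read-back of this kind might be produced as follows. The sketch uses the common NATO code words with the "Alpha" spelling from the example above; the function name is hypothetical, and letters outside A to Z (for example Swedish å, ä, ö) are read as the bare letter because the table has no entry for them.

```python
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliett",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}


def phonetic_confirm(name: str) -> str:
    """Spell a name as 'A as in Alpha, N as in November, ...'."""
    parts = []
    for ch in name.upper():
        if ch in NATO:
            parts.append(f"{ch} as in {NATO[ch]}")
        elif ch.isalpha():
            parts.append(ch)  # no code word available; say the letter itself
    return ", ".join(parts)


print(phonetic_confirm("An"))
```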
Turn-efficient confirmation: one field per turn, batch confirm at the end
Confirmation in voice trades against patience. Every read-back costs a turn, and a caller who hears their information repeated four times in four turns hangs up. The turn-efficient pattern collects one field per turn and confirms everything once.
OpenAI states the rule directly: “Don’t ask for name, date of birth, and phone number in one turn.” Collect each field individually, then “confirm everything at once.” (OpenAI Realtime guide) The same guide draws the line on what gets a read-back at all: skip read-backs when collecting intent, preference, or soft qualification data, and reserve them for high-precision identifiers such as codes, emails, phone numbers, and amounts. For agents that can perform write actions, OpenAI adds a further rule: “define clear confirmation boundaries before write actions.” (OpenAI realtime models guide)
Softcery’s hotel concierge prompt arrived at the identical pattern in production. Principle 4 of the prompt instructs: “If the guest makes multiple requests in one call, gather all of them, read them back together as one summary, get a single ‘yes,’ and then acknowledge them as noted.” A guest who asks for a taxi, a dinner reservation, and a wake-up call hears one summary and gives one confirmation, not three round-trips. The combined rule for any voice agent: collect one field per turn, confirm in a single batch, and read back only the load-bearing identifiers.
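The collect-then-batch-confirm flow can be pictured as a small state holder. This is an illustrative sketch, not how any platform implements it: the class name, the field names, and the question wording are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FieldCollector:
    """Ask for one missing field per turn, then confirm everything once."""
    required: list[str]
    values: dict[str, str] = field(default_factory=dict)

    def record(self, name: str, value: str) -> None:
        self.values[name] = value

    def next_prompt(self) -> str:
        # One question per turn: ask only for the first missing field.
        for name in self.required:
            if name not in self.values:
                return f"What is your {name}?"
        # Everything collected: a single batch confirmation.
        summary = ", ".join(f"{name} {self.values[name]}" for name in self.required)
        return f"Just to confirm, I have {summary}. Is that right?"


collector = FieldCollector(["name", "room number"])
print(collector.next_prompt())
collector.record("name", "Ana")
print(collector.next_prompt())
collector.record("room number", "four one two")
print(collector.next_prompt())
```

In a real agent the model, not a template, would phrase each question; the sketch only shows the turn accounting, where two fields cost three agent turns instead of four.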
System prompt structure under streaming-context pressure
A voice-agent system prompt re-executes every turn, so its structure decides which instructions survive under latency and attention pressure. Every major provider converges on the same skeleton: identity first, examples last.
| Provider | Section order |
|---|---|
| OpenAI Realtime | Role & Objective, Personality & Tone, Context, Reference Pronunciations, Tools, Instructions/Rules, Conversation Flow, Safety & Escalation |
| Vapi | Identity & Personality, Response Guidelines, Guardrails, Context, Workflow/Use Cases, Examples |
| Retell | Identity, Style Guardrails, Response Guidelines, Task Instructions, Objection Handling |
| ElevenLabs | Personality, Environment, Tone, Goal, Guardrails, Tools (block names approximate) |
| Softcery (hotel concierge) | Persona, Core Operating Principles, System Variables & Tools, Pronunciation Guide, Interaction Flow, Hotel Knowledge Base |
Sources: OpenAI, Vapi, Retell, ElevenLabs. The ElevenLabs “six building blocks” naming is approximate, taken from a published guide summary rather than a re-fetched primary page.
Three properties hold across all five skeletons. Identity and hard guardrails come first because they must survive every turn, even when context pressure forces the model to attend to less of the prompt. Examples come last because they are the most token-expensive and the most expendable under pressure. Conversation flow is split into phases with explicit exit criteria – OpenAI’s example: “Exit to Discovery: Caller states they are a [X] customer” – so the agent knows when to leave one stage for the next.
Softcery’s skeleton adds one section the provider templates do not name: a dedicated Pronunciation Guide. It sits between System Variables & Tools and Interaction Flow, and it holds every number-spelling, currency, acronym, and address rule described earlier. Pulling those rules into a single labeled section keeps them from scattering through the prompt where they get missed in iteration.
Softcery’s prompt guide also enforces a separation the provider skeletons imply but rarely state: behavior and facts live in different sections. Core Operating Principles and Interaction Flow define behavior. The Knowledge Base holds facts. The guide is blunt about why: “When they mix, both rot.” A fact buried in a behavior rule gets edited as if it were a rule; a rule buried in the knowledge base gets treated as reference data the model can ignore.
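One way to keep that separation honest is to store each section as its own data and assemble the prompt in a fixed order. The sketch below uses the Softcery section names from the table above; the `build_prompt` helper, the `# Title` header style, and the sample lines are assumptions made for illustration.

```python
SECTION_ORDER = [
    "Persona",
    "Core Operating Principles",
    "System Variables & Tools",
    "Pronunciation Guide",
    "Interaction Flow",
    "Hotel Knowledge Base",
]


def build_prompt(sections: dict[str, list[str]]) -> str:
    """Assemble labeled sections in a fixed order, one bullet per line."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")
    blocks = []
    for title in SECTION_ORDER:
        lines = sections.get(title)
        if not lines:
            continue  # empty sections are left out entirely
        bullets = "\n".join(f"- {line}" for line in lines)
        blocks.append(f"# {title}\n{bullets}")
    return "\n\n".join(blocks)


prompt = build_prompt({
    # Facts live only in the knowledge base section (hypothetical sample line).
    "Hotel Knowledge Base": ["Breakfast is served from seven to ten."],
    # Behavior lives only in the principles section.
    "Core Operating Principles": ["Ask only one clarifying question at a time."],
    "Persona": ["You are the in-stay concierge."],
})
print(prompt)
```

Because the order is fixed in one place, an edit to a fact cannot accidentally move or reword a behavior rule.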
Two writing-style rules apply under context pressure. OpenAI recommends bullets over paragraphs inside the prompt itself – the prompt is read by the model, not spoken – and capitalized text for the key rules that must hold. OpenAI also notes that small wording changes can make or break behavior, citing a swap from “inaudible” to “unintelligible” that improved noisy-input handling. (OpenAI Realtime guide) Prompt length itself is a latency cost: shorter prompts with fewer tokens produce faster inference, which matters more in voice than anywhere else. (Building Production-Ready Voice Agents)
The strongest structural rule is to start minimal. OpenAI’s guidance: “Begin with a minimal prompt, run evaluations, then add instructions only for behaviors that fail in testing.” (OpenAI Realtime guide) A long prompt assembled up front from imagined failure modes carries dead weight on every turn.
Model family changes the defaults of everything above
The patterns above apply across LLMs, but the defaults each model brings to those patterns differ in ways that change the prompt itself. The same voice prompt that works against Claude Opus 4.7 needs different reasoning toggles, different format conventions, and different scaffolding decisions on Gemini 3.1 Live or Llama 4. Five model-level differences are worth budgeting for before the prompt is written.
Reasoning mode is the operational footgun
The voice agent latency budget established that reasoning-mode variants do not belong in the live path. Each model family lets teams disable reasoning differently, and the safe configuration is not the default on every platform.
Gemini 3.1 Live defaults thinkingLevel to minimal for the express purpose of voice latency, per Google’s Live API docs. Nothing to do. OpenAI Realtime exposes reasoning.effort at minimal, low, medium, high, xhigh, and the official guide directs production voice agents to start at low. Safe if followed. Anthropic Claude ships effort at max, xhigh, high, medium, low on Opus 4.7, and voice deployments need both effort=low and an explicit anti-thinking line in the prompt: “Thinking adds latency and should only be used when it will meaningfully improve answer quality. When in doubt, respond directly.” (Anthropic prompting best practices) Forget the nudge and time-to-first-token inflates without warning. Qwen3-Omni and Qwen3.5-Omni ship separate Instruct and Thinking checkpoints with no runtime toggle, so voice deployments must wire the Instruct checkpoint at deploy time, per the Qwen3-Omni README. Llama 4 has no reasoning toggle at all. Nothing to disable, nothing to misconfigure.
Llama and Gemini Live are the safest by default. Claude and OpenAI both have configurations that quietly destroy time-to-first-token if the wrong setting leaks through. Audit every voice deployment’s reasoning configuration during pre-launch, not after the first slow-turn complaint.
Tool-call leadership is contextual, not absolute
Voice-specific tool benchmarks tell a different story from text aggregates. Sierra’s τ-voice benchmark, the only public voice-specific tool-agent measure, ranks xAI Grok Voice Think Fast 1.0 at 67.3%, Gemini 3.1 Flash Live at 43.8%, and GPT-Realtime 1.5 at 35.3%. The widely-held assumption that GPT leads tool calling holds on text aggregate benchmarks and does not carry over to voice in this measure.
The per-vendor picture for tool-call-heavy voice flows is concrete. Anthropic Opus 4.7 uses tools less often by default than Opus 4.6, and raising the effort setting is the documented lever to push tool use back up. OpenAI Realtime ships per-tool risk policies as a recommended prompt section: read-only lookups fire immediately, account changes require confirmation, payments require explicit amount-and-consequence read-back. Gemini 3.1 Live regressed from Gemini 2.5’s asynchronous, non-blocking function calls to sequential-only, so long tool calls now stall the conversation. The mitigation is to keep calls fast, or to emit a filler (“let me check that”) before a slow call. Llama 4’s tool-call JSON adherence is inconsistent in practitioner reports, so cascade stacks should add schema validation and retry rather than trust the model to emit valid JSON on every call.
One general rule holds across every model: validate and retry tool calls regardless of headline benchmark. Even the BFCL v3 text-leaders sit around 92%, which means roughly one in twelve calls still fails on the strongest models, per practitioner summaries.
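A validate-and-retry wrapper can be sketched without any vendor SDK. Everything here is an assumption for illustration: `generate` stands in for whatever function asks the model for tool arguments, the type map is a deliberately simple substitute for a full JSON Schema validator, and three attempts is an arbitrary default.

```python
import json
from typing import Callable, Optional


def parse_tool_call(raw: str, required: dict[str, type]) -> dict:
    """Parse tool arguments and check required names and types."""
    args = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    for name, expected in required.items():
        if name not in args:
            raise ValueError(f"missing argument: {name}")
        if not isinstance(args[name], expected):
            raise ValueError(f"argument {name} must be {expected.__name__}")
    return args


def call_with_retry(
    generate: Callable[[Optional[str]], str],
    required: dict[str, type],
    max_attempts: int = 3,
) -> dict:
    """Ask for tool arguments, feeding the last error back on each retry."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        raw = generate(error)
        try:
            return parse_tool_call(raw, required)
        except ValueError as exc:
            error = str(exc)
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts: {error}")


# Stand-in for a model that fails twice before emitting valid arguments.
attempts = iter(["not json", '{"room": "12"}', '{"room": 12}'])
print(call_with_retry(lambda err: next(attempts), {"room": int}))
```

In a voice flow each retry costs latency, so the attempt limit and any spoken filler during retries need to be tuned against the turn budget.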
Format preference differs more than the platform skeletons suggest
The system-prompt skeletons compared earlier in this article cover platforms. Underneath the platforms, model families ship their own format conventions.
Anthropic Claude recommends XML tag structuring for prompts that mix instructions, context, examples, and variable inputs, with tag names like <instructions>, <context>, <input>, and <examples>. Anthropic frames XML as a disambiguation aid for mixed-content prompts, not as a measured win over markdown, so the “Claude likes XML” maxim should be treated as a recommendation rather than a benchmark fact. OpenAI Realtime wants bullets over paragraphs in the prompt, capitalized text for the key rules (“IF MORE THAN THREE FAILURES THEN ESCALATE”), and JSON envelopes for structured tool outputs, per the Realtime Prompting Guide. OpenAI does not recommend XML. Gemini Live publishes no format preference in the primary docs, and practitioners commonly reuse the OpenAI structure on Gemini, which is convention, not measurement. Llama 4 uses a chat template with special tokens and supports two tool-call formats, Python-call syntax [func_name(param1=value1)] and JSON arrays, with a baseline system prompt that explicitly warns against templated phrasing like “it’s important to.” Qwen3-Omni’s voice prompt guidance goes further and tells the agent to avoid “formal phrasing, mechanical expressions, bullet points” in spoken output, with a roughly 50-word-per-turn target.
The Qwen and OpenAI difference is the cleanest illustration of why a single voice-prompt template ports badly. OpenAI puts bullets in the prompt to be followed. Qwen bans bullets in the output to be spoken. Both are right within their respective frames, and a team that reuses one prompt across both models without rewriting will get either dropped instructions on one side or bulleted speech on the other.
Anthropic specifically tells teams to remove checklists for frontier Claude
The tension between scripting every step and leaving the model room to figure things out is the most explicit per-model difference, and each vendor takes a clear side.
Anthropic, on Opus 4.7: “If you’ve added scaffolding to force interim status messages (‘After every 3 tool calls, summarize progress’), try removing it.” (Anthropic prompting best practices) Frontier Claude is calibrated to handle the goal, and scripted scaffolding interferes with the model’s own pacing. The same guide tells teams to use numbered steps when running against Sonnet or Haiku tiers, where explicit ordering still helps. OpenAI’s Realtime guide takes the opposite stance with an eight-section prompt skeleton (Role, Personality, Context, Reference Pronunciations, Tools, Instructions, Conversation Flow, Safety and Escalation) and rules typed as explicit policies. The structure is the discipline. Meta documents the Llama 4 baseline system prompt as the steering mechanism, with the model responsive to the prompt and a generic default style without it. Qwen pushes toward concise prose with an explicit clarifying-question rule when references are unclear.
The operational rule that follows: a goal-style prompt that works against Opus 4.7 will under-perform against Sonnet 4.6 or Haiku 4.5 in the same family, and will likely break against Llama 4 or Qwen3-Omni without added scaffolding. Match the prompt’s scaffolding density to the model tier, not to a single house style.
Cookbook depth is uneven
The supporting documentation a team can draw on differs sharply by vendor, which matters more in voice than in text because voice-specific prompt advice is comparatively new.
OpenAI ships the deepest voice-specific cookbook. The Realtime Prompting Guide and the “Using realtime models” API guide together cover prompt skeleton, reasoning control, tool-use policy, unclear-audio handling, entity capture, and persona disfluency. Anthropic’s prompting docs are strong on general principles and ship nothing voice-specific. Google’s Live API docs cover capabilities and configuration without recommending a prompt structure. Meta documents the chat template and provides a baseline system prompt with no voice extension. Alibaba’s Qwen3-Omni README has voice prompt snippets and is the thinnest of the five.
For teams standardizing on Claude, Gemini, Llama, or Qwen for voice, the playbook in this article and the platform docs from Vapi, Retell, LiveKit, and ElevenLabs are the working guidance, and per-model adjustments come from practitioner experience rather than from a vendor cookbook. Budget time for that gap.
Disfluency: two meanings, two decisions
The word “disfluency” covers two different things in voice prompting, and conflating them produces bad prompts. One is the caller’s hesitation. The other is the agent’s filler speech.
Tolerating the caller’s disfluency is a turn-detection problem, not a prompt problem. When a caller pauses mid-sentence to think, the agent must wait, not jump in. The mechanism is turn detection. LiveKit ships a context-aware turn-detector model: “When the model predicts that the user is not done with their turn, the agent will wait for a significantly longer period of silence before responding. This helps to prevent unwanted interruptions during natural pauses in speech.” LiveKit’s recommended pipeline pairs that model with Silero VAD; for speech-to-speech models, it recommends using the provider’s built-in turn detection. (LiveKit Turn Detection) The prompt’s only contribution here is the guard against assuming through a pause: OpenAI’s “Do not guess what the user meant from unclear audio.”
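The waiting behavior LiveKit describes can be reduced to a simple decision: how much silence to require before the agent speaks. The sketch below is not LiveKit's implementation; the function names, the 0.5 probability threshold, and both wait times are hypothetical values chosen to show the shape of the logic.

```python
def required_silence_ms(
    end_of_turn_probability: float,
    *,
    threshold: float = 0.5,
    short_wait_ms: int = 500,
    long_wait_ms: int = 3000,
) -> int:
    """Pick the silence window based on a turn-detector's prediction."""
    if not 0.0 <= end_of_turn_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Likely finished: respond after a short pause.
    # Likely mid-thought: wait much longer before taking the turn.
    if end_of_turn_probability >= threshold:
        return short_wait_ms
    return long_wait_ms


def should_respond(silence_ms: int, end_of_turn_probability: float) -> bool:
    return silence_ms >= required_silence_ms(end_of_turn_probability)


print(should_respond(600, 0.9))   # caller sounds finished
print(should_respond(600, 0.1))   # caller is probably pausing to think
```

VAD alone corresponds to using a single fixed window for every pause, which is why it interrupts callers who stop to think.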
Adding agent disfluency is a persona decision, and it is not a universal best practice. OpenAI’s Realtime guide treats agent disfluency as a deliberate, prompted design element: a defined vocabulary of fillers (“um,” “uh,” “so,” “well”), thinking sounds (“let me see,” “one sec”), stutters, and self-corrections, at a baseline of “2–4 disfluencies per turn.” It also stresses calibration to persona: a clinical triage agent uses “let me see,” not “uh.” (OpenAI Realtime guide) LiveKit pairs filler with timed pauses, inserting a <break time='300ms'/> after a standalone “um,” and trains the behavior with explicit before/after examples. (LiveKit, sounding more realistic)
Softcery deliberately bans agent disfluency in the hotel concierge prompt. The same principle that forbids stage directions like “laughs” forbids “um” and “uh.” The reasoning is persona fit: the target persona is a competent concierge, and a competent concierge does not stammer. Agent disfluency makes a casual companion agent sound human and makes a professional service agent sound unsure. The decision belongs to the persona, not to a best-practices checklist.
Barge-in and graceful recovery
Barge-in is the caller speaking while the agent is still talking. Handling it well needs both infrastructure and prompt rules.
The infrastructure is turn detection, and it comes in four strategies with real tradeoffs. VAD waits for silence, which is slower and can miss intent. Endpointing uses transcript signals and is faster. Model-based contextual detection is the most accurate but costs more compute. Realtime-model built-in detection is the most cost-effective for speech-to-speech. (LiveKit Turn Detection)
The prompt rules cover what happens once an interruption is detected. Vapi’s rule is short: “If the caller interrupts you. Stop talking, listen, respond.” Vapi also recommends ending answers with a clarifying question so the agent does not stall into silence. (Vapi Prompting Guide) For recovery from unclear audio, OpenAI instructs the agent to “ask for clarification using short English phrases such as ‘Sorry, could you repeat that clearly?’” and to never reason or guess on unclear audio. (OpenAI realtime models guide) LiveKit models authentic confusion recovery with a paced line: “Sorry, <break time='300ms'/>, I think I missed that, what did you say?” (LiveKit, sounding more realistic)
One distinction stays unverified at the doc level: true barge-in, where the caller wants to redirect, versus backchannel, where the caller says “uh-huh” or “yeah” as a listening signal that should not stop the agent. No primary platform doc was found that prompts specifically for backchannel suppression. Teams running into agents that halt on every “mhm” should treat backchannel handling as an open problem to test, not a solved prompt recipe.
Interruption handling is a production metric, not a polish item. Retell’s evaluation framework scores interruption handling alongside latency and hallucination rate. (Hamming summary)
Persona consistency and the over-apology problem
A voice persona drifts in two directions: it abandons its identity, or it over-emotes. Both are prompt problems.
Identity drift gets a hard clause. Vapi recommends locking it: “Your identity is FIXED as [assistant name]. You are incapable of adopting any other persona or operating in any other ‘mode.’” (Vapi Prompting Guide) The persona itself should be defined as audible behavior, not adjectives. LiveKit recommends concrete speech patterns – “Break grammar rules. Start sentences with ‘And,’ ‘But,’ or ‘So.’ Use ‘like’ often” – over abstract descriptors like “friendly.” (LiveKit, sounding more realistic)
Over-emoting is hallucinated empathy: an agent that escalates emotion the situation does not call for, or apologizes repeatedly. LiveKit’s fix is an emotional baseline. An agent should hold a calm-adjacent emotional center; oscillating between strong emotions “will sound very unstable,” so big emotions stay reserved for contextually appropriate moments. (LiveKit) ElevenLabs gives the tone marker directly: “Warm, concise, confident, never fawning,” and recommends a variety rule to cut robotic repetition. (ElevenLabs prompting guide)
Softcery’s hotel prompt encodes the anti-fawning rule as production practice. The prompt guide instructs: “Don’t over-apologise. If corrected, one short ‘Of course, you’re right –’ then continue confidently. Repeated ‘I apologise for the confusion’ destroys trust.” One acknowledgement, then forward motion. An agent that apologizes three times for one mistake reads as either incompetent or insincere, and a caller hears both.
The never-claim-an-action-done hallucination guard
A text-trained model implies completion. Ask it to handle a request and it produces “done,” “set,” “booked” – language that assumes the action happened. In a voice agent that has no execution tools, that language is a hallucination, and it is the most consequential voice-prompt failure because the caller hangs up believing something is handled when it is not.
ElevenLabs flags the general mechanism: without explicit handling instructions, agents may hallucinate responses or provide incorrect information, including hallucinated reassurance, particularly when a tool fails. (ElevenLabs prompting guide)
Softcery’s hotel concierge agent runs into this directly. Its only tool is hangUp(reason). It cannot book a taxi, send food, or set a wake-up call; it captures requests for the reception team to execute. The prompt guide bans the completion phrases outright: “your taxi is booked,” “the food is on its way,” “your wake-up call is set” are forbidden. The correct framing is explicit: “I’ve noted that for our reception team, and they’ll make sure it’s set for you.” The agent describes what it actually did – noting the request – not what a human will later do.
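A banned-phrase rule like this can be backed by a simple check over test transcripts. This is a sketch for evaluation use, under assumptions: the pattern list is derived from the three forbidden examples above and is far from exhaustive, and the function name is hypothetical.

```python
import re

# Patterns derived from the forbidden examples; extend per deployment.
COMPLETION_CLAIMS = [
    r"\bis booked\b",
    r"\bis on its way\b",
    r"\bis set\b",
]


def claims_completed_action(agent_line: str) -> bool:
    """True if the line states that an action has already been carried out."""
    return any(
        re.search(pattern, agent_line, flags=re.IGNORECASE)
        for pattern in COMPLETION_CLAIMS
    )


print(claims_completed_action("Your taxi is booked."))
print(claims_completed_action(
    "I've noted that for our reception team, and they'll make sure it's set for you."
))
```

The approved line passes because it describes a future human action ("they'll make sure it's set"), while the forbidden lines assert a present state.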
This pairs with a confidence rule that pushes in the opposite direction, and the tension is deliberate. The Softcery agent is told to answer confidently on non-sensitive topics; banned phrases include “I don’t know” and “I’d have to check,” because a concierge that hedges on the breakfast hours feels broken. But a Priority-3 list – allergies, billing, medical, legal, pet policy, lost and found – marks where improvisation is forbidden and “reception will follow up” is the correct answer. The prompt guide names policing that exact distinction “the single most common fix in iteration.” The agent should seem to know everything, while knowing what it does not get to fake, and never pretending an action is done.
Two more Softcery patterns belong to this class of streaming-voice failure. The first is a timezone failure. The {{currentDateTime}} variable arrives in UTC, and guests ask “is the restaurant open now?” – a question where a one-to-two-hour error gives the wrong answer. Softcery’s recommended fix is backend conversion: one line of code removes a whole error class. The prompt fallback, used only when the backend cannot change, hard-codes the CET/CEST offsets and adds a safety valve near daylight-saving switches, because LLMs are weak at date arithmetic. The second is a language-switch rule with no confidence hedge. Softcery’s prompt guide warns against giving the agent an escape clause like “if you’re not confident in the language,” because “it will abuse it and refuse capable languages.” The instruction is unconditional: “switch and proceed.” An over-cautious hedge becomes a refusal generator.
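The backend conversion for the timezone case is short enough to show in full. A minimal sketch using Python's standard `zoneinfo` module; the `Europe/Stockholm` zone is an assumption based on the CET/CEST offsets mentioned above, and the function name is hypothetical.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the hotel's local zone; it follows CET in winter and CEST in summer.
HOTEL_TZ = ZoneInfo("Europe/Stockholm")


def to_hotel_time(utc_now: datetime) -> datetime:
    """Convert a UTC timestamp to hotel-local time, handling DST."""
    if utc_now.tzinfo is None:
        # Treat naive timestamps as UTC, matching how the variable arrives.
        utc_now = utc_now.replace(tzinfo=timezone.utc)
    return utc_now.astimezone(HOTEL_TZ)


winter = to_hotel_time(datetime(2026, 1, 15, 12, 0, tzinfo=timezone.utc))
summer = to_hotel_time(datetime(2026, 7, 15, 12, 0, tzinfo=timezone.utc))
print(winter.strftime("%H:%M"), summer.strftime("%H:%M"))
```

With the local time injected into the prompt variable, the model never has to do offset arithmetic or know when daylight saving switches.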
Voice-specific evals: test against a representative set
A prompt change that fixes one call can break ten others, and voice regressions are probabilistic – they do not show up in a single test. Vapi states the rule: “Validate prompt changes against a representative test set, not single calls. Probabilistic regressions don’t show up in one-off testing.” (Vapi Prompting Guide) Vapi’s headline metric is success rate: “the percentage of requests your agent handles from start to finish without human intervention.”
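As a rough illustration of why repeated runs matter, the sketch below computes a per-scenario success rate from several runs and flags scenarios that got worse after a prompt change. The data shape, function names, and the 0.1 tolerance are all assumptions made for the example, not part of any vendor's tooling.

```python
from collections import defaultdict


def success_rates(runs):
    """Map scenario id -> fraction of runs handled without human intervention.

    `runs` is an iterable of (scenario_id, succeeded) pairs. Each scenario
    should appear several times, because outcomes are probabilistic.
    """
    totals = defaultdict(lambda: [0, 0])  # scenario -> [successes, attempts]
    for scenario_id, succeeded in runs:
        totals[scenario_id][0] += int(succeeded)
        totals[scenario_id][1] += 1
    return {s: ok / n for s, (ok, n) in totals.items()}


def regressions(baseline, candidate, tolerance=0.1):
    """List scenarios whose success rate dropped by more than `tolerance`."""
    return sorted(
        s for s, rate in baseline.items()
        if candidate.get(s, 0.0) < rate - tolerance
    )
```

Comparing the two rate tables per scenario, rather than one overall average, keeps a large gain in one scenario from hiding a drop in another.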
Platform-native and dedicated tooling both exist. Vapi Test Suites run scripted multi-step simulations as voice or text, graded by LLM rubrics, though a 15-minute call cap and per-minute cost make them impractical for wide repeatable regression testing. (Vapi Test Suites, Cekura analysis) Retell’s framework checks latency thresholds, hallucination rates, and interruption handling. (Hamming summary) Among dedicated platforms, Hamming.ai runs thousands of concurrent test calls with AI-generated personas spanning varied accents, speaking speeds, and patience levels, and covers regression, red-teaming, and production-to-test replay. Coval is simulation-first and CI/CD-oriented: it applies validation techniques “inspired by autonomous vehicle validation” and triggers automated tests against large scenario sets on every prompt change. (Hamming vs. Coval) These feature lists come from Hamming-published material and should be read as vendor descriptions, not independent evaluations.
Softcery runs its own eval discipline on the same principle. The prompt guide includes a failure-mode table that maps symptom to root cause to fix. One row: the symptom “Let me have reception check the time/weather/menu” traces to defer-to-human being used as a catch-all, and the fix is a whitelist of always-answer topics. The guide also defines a 24-scenario test plan with explicit pass/fail signals. The discipline that ties it together is targeted: “When a test fails, capture the exact agent line – the fix is almost always a targeted edit to one rule, not a rewrite.” A failing voice prompt rarely needs rebuilding. It needs one rule changed, and the test set proves the change did not break anything else.
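One way to encode a scenario with explicit pass/fail signals, and to capture the exact agent line when it fails, is sketched below. The Scenario fields and the evaluate function are hypothetical; the guide's actual 24 scenarios are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    name: str
    # Each pass signal must appear in at least one agent line.
    pass_signals: list[str] = field(default_factory=list)
    # No agent line may contain any fail signal.
    fail_signals: list[str] = field(default_factory=list)


def evaluate(scenario, agent_lines):
    """Return (passed, offending_lines, missing_signals) for one transcript."""
    lowered = [line.lower() for line in agent_lines]
    # Keep the original wording of each offending line for the failure report.
    offending = [
        line for line, low in zip(agent_lines, lowered)
        if any(sig.lower() in low for sig in scenario.fail_signals)
    ]
    missing = [
        sig for sig in scenario.pass_signals
        if not any(sig.lower() in low for low in lowered)
    ]
    return (not offending and not missing), offending, missing
```

For example, a scenario whose fail signal is "let me have reception check" would return the full offending sentence, which is the line a targeted rule edit would then address.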
Practical takeaway
Voice prompting is not generic LLM prompting with a few formatting tweaks. The medium is real-time, audio-only, single-pass, and interruptible, and each property removes an assumption the text prompt was built on. OpenAI, Vapi, and Softcery’s production practice all point to the same working method.
Start minimal. Write the identity, the hard guardrails, and a short conversation flow, and nothing more. Add rules only for failures observed in testing, not failures imagined in advance, because every rule is a per-turn tax. Pre-spell numbers in the prompt and knowledge base so no live conversion can fail. Keep behavior and facts in separate sections. Confirm one field per turn and batch the read-back. Decide agent disfluency by persona, not by default. Never let the agent claim an action is done that it did not do. And validate every prompt change against a representative test set, because the regression that matters will not appear in a single call.
Frequently Asked Questions
A text prompt assumes a turn-complete, visually-rendered, retry-able channel. Streaming voice is none of those. It is real-time, so there is no pause to retry a bad response; audio-only, so visual formatting like lists and bold has no equivalent and gets read aloud or dumped as an unspeakable wall; single-pass, so the caller hears only the first draft; and interruptible, so the caller can speak over the agent at any moment. The same prompt also re-executes every turn under a latency budget, so its length becomes a cost. Voice prompting needs its own playbook, not a few formatting tweaks on a chatbot prompt.
Every number passes through a normalization step before synthesis, and that step guesses. The most reliable fix is to pre-spell numbers directly in the prompt and knowledge base, so the model never performs a live conversion: write “nineteen seventy-four,” not “1974,” and “one hundred ninety-five Swedish kronor,” not “195 SEK.” For dynamic values the agent reads back from tools, wrap them in SSML <say-as> tags with the correct interpret-as value. For domain vocabulary the built-in TTS dictionary lacks, add a custom pronunciation dictionary. Provider guidance from OpenAI and Retell converges on the same principle: numbers in the prompt should be written the way they should be spoken.
It depends on the persona, and it is not a universal best practice. The word “disfluency” covers two different things. Tolerating the caller’s hesitation is a turn-detection problem solved by infrastructure, not prompt text. Adding the agent’s own filler is a deliberate prompt choice. OpenAI treats agent disfluency as a designed element at a baseline of two to four per turn, calibrated to persona. But filler that makes a casual companion agent sound human makes a professional service agent sound unsure. Softcery deliberately bans agent disfluency in its hotel concierge prompt because the target persona is a competent concierge, and a competent concierge does not stammer.
Collect one field per turn, then confirm everything in a single batch at the end, and read back only the load-bearing identifiers. A caller who hears their information repeated four times in four turns hangs up. OpenAI’s rule is to avoid asking for name, date of birth, and phone number in one turn, collect each individually, and confirm them all at once. Read-backs should be reserved for high-precision identifiers like codes, emails, phone numbers, and amounts, and skipped for soft data like intent and preference. Softcery’s hotel concierge prompt arrived at the same pattern: gather all of a guest’s requests, read them back as one summary, and get a single “yes.”
Letting the agent claim an action is done that it did not do. Text-trained models imply completion – they produce “done,” “set,” “booked” – even when the agent has no tool to execute the action. In a voice agent, the caller hangs up believing something is handled when it is not. The fix is an explicit prompt rule that bans completion phrases and replaces them with accurate framing about what the agent actually did. Softcery’s hotel concierge, whose only tool is hangUp, bans “your taxi is booked” and requires “I’ve noted that for our reception team, and they’ll make sure it’s set for you.” The agent describes capturing the request, not completing it.
Yes, in concrete ways. Reasoning-mode defaults differ by vendor: Gemini 3.1 Live defaults to safe minimal thinking, Llama 4 has no toggle at all, OpenAI Realtime tells teams to start at low effort, Anthropic Claude requires an explicit anti-thinking instruction in the prompt, and Qwen Omni requires choosing the Instruct checkpoint at deploy time. Format conventions differ: Anthropic recommends XML for mixed-content prompts, OpenAI wants bullets and a labeled section skeleton, and Qwen explicitly bans bullets in spoken output. Scaffolding density should match the model tier: Anthropic tells teams to remove checklist scaffolding for Opus 4.7, while smaller open models like Llama 4 and Qwen3-Omni still need more explicit step lists. Tool-call leadership in voice is currently held by xAI Grok Voice Think Fast 1.0 on Sierra’s τ-voice benchmark at 67.3%, not by GPT, which scored 35.3% on the same benchmark.