Voice-Specific Prompt Engineering: Why Text Prompts Break in Real-Time Audio

Last updated on May 22, 2026

Take a system prompt that works in ChatGPT and connect it to a streaming voice agent. It misfires in ways that never appear on screen. The agent reads markup aloud, narrating “asterisk asterisk” before a bolded word. It says “two thousand five” for the year 2005. It pronounces “$42.50” as “forty-two point five zero.” It jumps in during a natural mid-sentence pause and talks over the caller. None of these are model-quality problems. They are medium problems. A text prompt assumes a turn-complete, visually-rendered, retry-able channel. Streaming voice is none of those.

This article covers the prompt-engineering playbook for streaming voice agents and treats voice prompting as its own discipline: why text prompts break, how to format for speech, how to make numbers and names reliable, how to structure a system prompt that re-executes every turn, and how to test prompt changes without shipping regressions. The provider guidance comes from OpenAI, Vapi, Retell, ElevenLabs, and LiveKit. The production examples come from Softcery’s own voice-agent prompts for the Hotel Birger Jarl in-stay concierge.

Why text prompts break in voice

Text prompts assume four properties that streaming voice removes. The medium is real-time, so there is no pause to retry a bad response. It is audio-only, so there is no screen to render structure. It is single-pass, so the caller hears the first draft and only the first draft. It is interruptible, so the caller can speak over the agent at any moment. Each removed property maps to a concrete failure class.

Visual formatting leaks into speech. A text prompt that produces bullet lists, bold, headers, or numbered lists has no equivalent in audio. The text-to-speech engine either reads the markup aloud or the model dumps an unspeakable wall. OpenAI’s Realtime prompting guide states the rule plainly: “Voice agents must never output formatting that only works visually – no bold, italics, or headers; no numbered or bulleted lists.” (OpenAI Realtime Prompting Guide)

Numbers get mis-normalized by the TTS. “$42.50” can be read “forty-two point five zero.” Dates like “03/04/2025” are ambiguous between US and EU ordering. Year-like numbers misfire, so “2005” comes out “two thousand five” instead of “twenty oh five.” Modern neural TTS handles standard cases but fails on edge cases: ambiguous dates, Roman numerals, industry-specific formats. (SIMBA Voice)

The prompt re-executes every turn under latency pressure. Vapi frames the system prompt as “the agent’s operating system, re-executed on every turn.” (Vapi Prompting Guide) A long text prompt that costs nothing extra in a chatbot becomes a per-turn instruction-attention tax in streaming voice, where every token competes with the latency budget.

The agent confirms what the caller did not say. A text-trained model fills gaps confidently. In audio, that means hallucinated confirmations from unclear input. OpenAI’s Realtime guidance counters this directly: “Do not guess what the user meant from unclear audio. Do not reason when the audio is unclear.” (OpenAI realtime models guide)

The agent talks over the caller. Turn-boundary detection is its own failure surface. LiveKit puts it this way: “the moment between when you finish a sentence and when an agent starts responding determines whether a conversation feels natural or painful.” Voice-activity detection alone triggers on silence, so the agent jumps in during a natural mid-sentence pause. (LiveKit, Turn Detection for Voice Agents)

The demand for voice-specific prompting is visible in the market. Both Vapi and Retell publish free prompting masterclasses on YouTube (Retell, Vapi), and multiple paid third-party Udemy courses now ship dedicated “prompt engineering for voice” modules (example). Generic LLM prompting and voice prompting have split into separate disciplines.

Writing speech the ear can follow

Three rewrites turn a screen-shaped prompt into one the caller can hear: strip visual formatting, make every number speakable, and spell back the names and codes that have to be exact.

Voice-first formatting

The first rewrite of any text prompt for voice strips every visual structure. Lists, bold, headers, and tables exist for eyes. A voice agent that emits them either spells the markup or produces a response no caller can follow.

The replacement for a list is a sentence. When an agent needs to present several items, it verbalizes them as flowing prose. Softcery’s hotel concierge prompt makes this a hard principle: the agent “MUST NOT use lists, bullets, emojis, or stage directions like laughs,” and when it needs to present several items such as restaurant hours, it “MUST verbalize them as a flowing, natural sentence.” This is Softcery production practice for the Hotel Birger Jarl agent, and it matches OpenAI’s no-visual-formatting rule one to one.
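A prompt rule is the first line of defense, but a small post-processing guard can catch markup that slips through before it reaches the TTS. The sketch below is illustrative only: the function names are hypothetical, and the regular expressions cover common markdown markers rather than every possible format.

```python
import re

def strip_visual_markup(text: str) -> str:
    """Remove markdown markers that a TTS engine might read aloud."""
    # Leading header markers such as "## Hours"
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Leading bullets ("-", "*", "•") and numbering ("1." or "1)")
    text = re.sub(r"^[ \t]*(?:[-*•]|\d+[.)])[ \t]+", "", text, flags=re.MULTILINE)
    # Emphasis and inline-code markers (this also strips underscores in identifiers)
    text = re.sub(r"\*\*|__|\*|_|`", "", text)
    return text.strip()

def verbalize_items(items: list[str]) -> str:
    """Join items into one flowing clause instead of a list."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"
```

Calling `verbalize_items` on three restaurant time slots yields a single sentence fragment the agent can speak, which is the behavior the prompt rule asks the model to produce on its own.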

Voice-first formatting also constrains response length. Vapi recommends keeping responses under roughly two sentences and asking one question at a time. (Vapi Prompting Guide) Retell gives the same instruction: “Ask one question at a time: Avoid overwhelming the caller.” (Retell Prompt Engineering Guide) Softcery’s prompt enforces the same boundary, asking “only one clarifying question at a time.” A caller cannot scroll back to re-read a paragraph. Anything longer than a couple of sentences exceeds working memory and forces a “sorry, can you repeat that.”
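The length and one-question rules are easy to check mechanically in a test run. The following is a rough sketch, assuming a naive sentence splitter and a two-sentence limit taken from the Vapi guidance above; `check_turn_shape` is a hypothetical helper, and abbreviations such as “Dr.” will confuse the splitter.

```python
import re

def check_turn_shape(reply: str, max_sentences: int = 2) -> list[str]:
    """Return a list of problems; an empty list means the reply fits the budget."""
    # Naive split on sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", reply.strip()) if s]
    problems = []
    if len(sentences) > max_sentences:
        problems.append(f"{len(sentences)} sentences, limit is {max_sentences}")
    if reply.count("?") > 1:
        problems.append("more than one question in a single turn")
    return problems
```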

Making numbers reliable

Numbers are the single largest source of voice-prompt bugs because every number passes through a normalization step before synthesis, and that step guesses.

Provider guidance converges on one instruction: numbers must be written in the prompt the way they should be spoken. OpenAI’s Realtime conversion table is explicit: $42.50 becomes “forty-two dollars and fifty cents,” 03/04/2025 becomes “March fourth, twenty twenty-five,” and (831) 239-8123 becomes “eight three one, two three nine, eight one two three.” For codes, OpenAI’s rule is verbatim: “When reading numbers or codes, speak each character separately, separated by hyphens (e.g., 4-1-5). Repeat EXACTLY the provided number; do not omit any digits.” (OpenAI Realtime guide, OpenAI realtime models guide) Retell says the same for dates: “Return dates in spoken form: Say ‘January fifteenth’ not ‘1/15’.” (Retell guide)
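Two rows of that conversion table can be sketched in a few lines. This is a minimal illustration, not a production normalizer: it assumes US dollar amounts below one thousand and phone numbers whose digit groups are already separated by punctuation, and all function names are hypothetical.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n: int) -> str:
    """Spell an integer from 0 to 999 in words."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0 to 999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    hundreds, rest = divmod(n, 100)
    spoken = f"{ONES[hundreds]} hundred"
    return f"{spoken} {number_to_words(rest)}" if rest else spoken

def speak_usd(amount: str) -> str:
    """Turn '$42.50' into 'forty-two dollars and fifty cents'."""
    dollars, _, cents = amount.lstrip("$").partition(".")
    cents = cents.ljust(2, "0")[:2]  # '.5' means fifty cents, not five
    d, c = int(dollars), int(cents)
    spoken = f"{number_to_words(d)} dollar{'' if d == 1 else 's'}"
    if c:
        spoken += f" and {number_to_words(c)} cent{'' if c == 1 else 's'}"
    return spoken

def speak_phone(phone: str) -> str:
    """Read each digit separately, pausing (comma) between digit groups."""
    groups = re.findall(r"\d+", phone)
    return ", ".join(" ".join(ONES[int(ch)] for ch in group) for group in groups)
```

Dates are left out on purpose: “03/04/2025” cannot be spoken correctly without first knowing whether the source uses US or EU ordering, and that decision belongs upstream of any formatter.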

Three fixes exist, in increasing order of robustness.

Fix	Mechanism	Best for
Text normalization before synthesis	Convert “03/12/2026” to “March 12, 2026” in the prompt or pipeline before it reaches the TTS	Domain content the team controls: knowledge bases, scripted lines, dates
SSML `<say-as>` tags	Wrap values with `interpret-as="telephone"`, `interpret-as="date" format="mdy"`, `interpret-as="currency"`, `cardinal`, or `ordinal`	Dynamic values the agent reads back from tools or variables
Custom pronunciation dictionaries	Map domain terms, company acronyms, and product names to fixed pronunciations	Specialized vocabulary the built-in TTS dictionary lacks

Source: SIMBA Voice, with SSML <say-as> documented in the W3C SSML reference.
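For the second row, a small helper can wrap dynamic values before they are sent to the TTS. The sketch below only builds the markup; `say_as` is a hypothetical helper, and which `interpret-as` and `format` values a given TTS engine honors varies by provider, so the values should be checked against that engine’s documentation.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

def say_as(value: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Wrap a value in an SSML <say-as> element, escaping XML-special characters."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(value)}</say-as>"
```

Escaping matters here because tool results are untrusted text: an ampersand or angle bracket in a value would otherwise produce malformed SSML.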

Acronym handling needs its own rule. “FBI” should be spelled letter by letter as “F-B-I,” but “NASA” should be read as the word “Nassa.” Most TTS engines ship a built-in dictionary for common cases and need custom additions for specialized vocabulary. (SIMBA Voice)

The most reliable pattern goes further than any of the three fixes: pre-spell every number in the prompt and knowledge base so the model never performs a live conversion. Softcery’s prompt guide states it directly: “Spell things out the way they should be spoken… write ‘half past six in the morning’, not ‘06:30’; ‘nineteen seventy-four’, not ‘1974’. This removes a conversion the model can get wrong live.” The Hotel Birger Jarl knowledge base follows this throughout, with entries like “founded in nineteen seventy-four” and “Tulegatan, number eight.” The number never exists in numeric form, so there is no conversion to fail.

A real bug from Softcery’s iteration history shows why this matters. An earlier version of the hotel prompt instructed the agent to handle currency, and the agent read “195 SEK” as “S-E-K” – spelling the currency code letter by letter, because the initialism rule fired on it. The fix in the current prompt is a Pronunciation Guide rule: currency is always spoken as the name, never the code, so “195 SEK” becomes “one hundred ninety-five Swedish kronor.” The rule carries an explicit exception so the general initialism rule (“TV” to “T-V”) does not fight it. The same section extends the rule to the emergency number, phone numbers, addresses, and the pacing pauses between multi-step explanations.
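The precedence between the currency rule and the general initialism rule can be made explicit in code. The sketch below assumes the rule is applied as a text pass before synthesis, which is not how the Softcery prompt does it (there the rule lives in the prompt itself); the dictionaries and function name are hypothetical.

```python
import re

# Exceptions are checked first so the general initialism rule cannot fire on them.
CURRENCY_NAMES = {"SEK": "Swedish kronor"}  # currency is spoken as a name
READ_AS_WORD = {"NASA": "Nassa"}            # acronyms pronounced as words

def apply_pronunciation_rules(text: str) -> str:
    def speak(match: re.Match) -> str:
        token = match.group(0)
        if token in CURRENCY_NAMES:
            return CURRENCY_NAMES[token]
        if token in READ_AS_WORD:
            return READ_AS_WORD[token]
        return "-".join(token)  # general rule: spell letter by letter
    # Matches runs of two to five capitals, so shouted words like "MUST" are also spelled
    return re.sub(r"\b[A-Z]{2,5}\b", speak, text)
```

The original bug corresponds to deleting the first lookup: without it, “SEK” falls through to the letter-by-letter branch.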

Spelling names back

Names and high-precision identifiers fail in two directions. The TTS mispronounces them on the way out, and speech-to-text mis-transcribes them on the way in.

The outbound problem is a pronunciation problem. OpenAI’s recommended prompt structure includes a named “Reference Pronunciations” section with examples like “Pronounce ‘SQL’ as ‘sequel,’” “Pronounce ‘Kyiv’ as ‘KEE-iv,’” and “Pronounce ‘Huawei’ as ‘HWAH-way.’” (OpenAI Realtime guide) For any agent operating in a specific domain – hotel names, drug names, legal terms – this section is not optional.

The inbound problem is a confirmation problem. When a caller spells an order code or an email, the agent must read it back before acting on it. OpenAI’s spell-back examples are verbatim:

Order code: “Just to confirm, I heard O-R-D dash 3-1-2-5-B-2-3. Is that right?”
Email: “Just to confirm, that is c-h-e-n at example dot com, right?”
Phone: “You said 0-2-1-5-5-5-1-2-3-4, correct?”

(OpenAI realtime models guide) ElevenLabs reinforces the same point for tool calls: structured identifiers passed to a tool should carry an explicit format and example in the parameter description, because STT delivers spoken-form values into the conversation context. (ElevenLabs prompting guide)
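The read-back strings in those examples follow a regular shape, so they can be generated rather than left to the model. A minimal sketch, with hypothetical helper names, assuming codes use a hyphen as the group separator and that email domains are read as words:

```python
SPOKEN_SYMBOLS = {".": "dot", "_": "underscore"}

def spell_chars(text: str) -> str:
    """Separate every character with a hyphen, naming common symbols."""
    return "-".join(SPOKEN_SYMBOLS.get(ch, ch) for ch in text)

def spell_code(code: str) -> str:
    """Turn 'ORD-3125B23' into 'O-R-D dash 3-1-2-5-B-2-3'."""
    return " dash ".join(spell_chars(part) for part in code.split("-"))

def spell_email(email: str) -> str:
    """Spell the local part, read the domain as words."""
    local, _, domain = email.partition("@")
    return f"{spell_chars(local)} at {domain.replace('.', ' dot ')}"

def confirm_code(code: str) -> str:
    return f"Just to confirm, I heard {spell_code(code)}. Is that right?"
```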

One pattern that production teams rely on does not appear in the primary platform docs: phonetic-alphabet confirmation, where the agent confirms a spelled name as “A as in Alpha, N as in November.” Softcery uses it in-house for high-stakes name capture, but it is practitioner convention rather than a cited provider rule. Teams should treat it that way and test it against their own STT accuracy before relying on it.
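A phonetic read-back is a lookup table and a join. The sketch below uses the common English spellings of the NATO alphabet and a hypothetical `phonetic_readback` helper; characters outside A to Z are skipped, which a real implementation would need to handle for hyphenated or accented names.

```python
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliet",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray", "Y": "Yankee",
    "Z": "Zulu",
}

def phonetic_readback(name: str) -> str:
    """Spell a name as 'A as in Alpha, N as in November'."""
    return ", ".join(f"{ch} as in {NATO[ch]}" for ch in name.upper() if ch in NATO)
```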

Structuring the conversation

With speech made reliable, the next layer is flow: how the agent collects information turn by turn and how the system prompt itself is ordered to survive every turn.

Turn-efficient confirmation

Confirmation in voice trades against patience. Every read-back costs a turn, and a caller who hears their information repeated four times in four turns hangs up. The turn-efficient pattern collects one field per turn and confirms everything once.

OpenAI states the rule directly: “Don’t ask for name, date of birth, and phone number in one turn.” Collect each field individually, then “confirm everything at once.” (OpenAI Realtime guide) The same guide draws the line on what gets a read-back at all: skip read-backs when collecting intent, preference, or soft qualification data, and reserve them for high-precision identifiers such as codes, emails, phone numbers, and amounts. For agents that can perform write actions, OpenAI adds a further rule: “define clear confirmation boundaries before write actions.” (OpenAI realtime models guide)

Softcery’s hotel concierge prompt arrived at the identical pattern in production. Principle 4 of the prompt instructs: “If the guest makes multiple requests in one call, gather all of them, read them back together as one summary, get a single ‘yes,’ and then acknowledge them as noted.” A guest who asks for a taxi, a dinner reservation, and a wake-up call hears one summary and gives one confirmation, not three round-trips. The combined rule for any voice agent: collect one field per turn, confirm in a single batch, and read back only the load-bearing identifiers.
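The combined rule can be pictured as a small state object that hands out one field at a time and builds a single summary at the end. This is a sketch of the control flow only; the field names, the `HIGH_PRECISION` set, and the class itself are hypothetical, and in most voice stacks this logic lives in the prompt or in the orchestration layer rather than in a standalone class.

```python
from dataclasses import dataclass, field
from typing import Optional

# Only these fields are read back; soft data such as intent is not.
HIGH_PRECISION = {"phone", "email", "order_code", "amount"}

@dataclass
class IntakeState:
    required: list[str]
    collected: dict[str, str] = field(default_factory=dict)

    def next_field(self) -> Optional[str]:
        """Return the single field to ask for this turn, or None when done."""
        for name in self.required:
            if name not in self.collected:
                return name
        return None

    def summary(self) -> str:
        """One batched read-back covering only the high-precision fields."""
        parts = [f"{name.replace('_', ' ')} {value}"
                 for name, value in self.collected.items()
                 if name in HIGH_PRECISION]
        return "Just to confirm: " + ", ".join(parts) + ". Is that right?"
```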

System prompt structure

A voice-agent system prompt re-executes every turn, so its structure decides which instructions survive under latency and attention pressure. Every major provider converges on the same skeleton: identity first, examples last.

Provider	Section order
OpenAI Realtime	Role & Objective, Personality & Tone, Context, Reference Pronunciations, Tools, Instructions/Rules, Conversation Flow, Safety & Escalation
Vapi	Identity & Personality, Response Guidelines, Guardrails, Context, Workflow/Use Cases, Examples
Retell	Identity, Style Guardrails, Response Guidelines, Task Instructions, Objection Handling
ElevenLabs	Personality, Environment, Tone, Goal, Guardrails, Tools (block names approximate)
Softcery (hotel concierge)	Persona, Core Operating Principles, System Variables & Tools, Pronunciation Guide, Interaction Flow, Hotel Knowledge Base

Sources: OpenAI, Vapi, Retell, ElevenLabs. The ElevenLabs “six building blocks” naming is approximate, taken from a published guide summary rather than a re-fetched primary page.

Three properties hold across all five skeletons. Identity and hard guardrails come first because they must survive every turn, even when context pressure forces the model to attend to less of the prompt. Examples come last because they are the most token-expensive and the most expendable under pressure. Conversation flow is split into phases with explicit exit criteria – OpenAI’s example: “Exit to Discovery: Caller states they are a [X] customer” – so the agent knows when to leave one stage for the next.

Softcery’s skeleton adds one section the provider templates do not name: a dedicated Pronunciation Guide. It sits between System Variables & Tools and Interaction Flow, and it holds every number-spelling, currency, acronym, and address rule described earlier. Pulling those rules into a single labeled section keeps them from scattering through the prompt where they get missed in iteration.

Softcery’s prompt guide also enforces a separation the provider skeletons imply but rarely state: behavior and facts live in different sections. Core Operating Principles and Interaction Flow define behavior. The Knowledge Base holds facts. The guide is blunt about why: “When they mix, both rot.” A fact buried in a behavior rule gets edited as if it were a rule; a rule buried in the knowledge base gets treated as reference data the model can ignore.
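One way to keep the section order and the behavior/facts split from eroding during iteration is to assemble the prompt from named parts. The sketch below uses the Softcery section names from the table above; the `build_prompt` function and the `# ` header convention are assumptions for illustration, not the format of the actual prompt.

```python
SECTION_ORDER = [
    "Persona",
    "Core Operating Principles",   # behavior
    "System Variables & Tools",
    "Pronunciation Guide",
    "Interaction Flow",            # behavior
    "Hotel Knowledge Base",        # facts only
]

def build_prompt(sections: dict[str, str]) -> str:
    """Assemble sections in a fixed order and reject unknown section names."""
    unknown = set(sections) - set(SECTION_ORDER)
    if unknown:
        raise ValueError(f"unknown sections: {sorted(unknown)}")
    blocks = [f"# {name}\n{sections[name].strip()}"
              for name in SECTION_ORDER if name in sections]
    return "\n\n".join(blocks)
```

Because the order lives in one list, an editor cannot accidentally move the knowledge base above the guardrails, and a new ad hoc section fails loudly instead of silently mixing rules with facts.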

Two writing-style rules apply under context pressure: OpenAI recommends bullets over paragraphs and capitalized text for the key rules that must hold, since the prompt is read by the model, not spoken. (OpenAI Realtime guide) Prompt length itself is a latency cost – fewer tokens produce faster inference, which matters more in voice than anywhere else.

The strongest structural rule is to start minimal. OpenAI’s guidance: “Begin with a minimal prompt, run evaluations, then add instructions only for behaviors that fail in testing.” (OpenAI Realtime guide) A long prompt assembled up front from imagined failure modes carries dead weight on every turn.

How the model changes the prompt

The patterns above apply across LLMs, but the defaults each model brings to those patterns differ in ways that change the prompt itself, on top of the model trade-offs covered in choosing the right LLM. The same voice prompt that works against Claude Opus 4.7 needs different reasoning toggles, different format conventions, and different scaffolding decisions on Gemini 3.1 Live or Llama 4. Five model-level differences are worth budgeting for before the prompt is written.

Reasoning mode

The voice agent latency budget established that reasoning-mode variants do not belong in the live path. Each model family lets teams disable reasoning differently, and the safe configuration is not the default on every platform.

Gemini Live and Llama 4 are safe by default – Gemini defaults thinking to minimal, and Llama 4 has no reasoning toggle at all. OpenAI and Anthropic are not. OpenAI Realtime’s official guide directs production voice agents to start at low effort; Anthropic Claude needs both a low effort setting and an explicit anti-thinking line in the prompt: “Thinking adds latency and should only be used when it will meaningfully improve answer quality. When in doubt, respond directly.” (Anthropic prompting best practices) Qwen ships separate Instruct and Thinking checkpoints, so wire Instruct at deploy time.

Leave the wrong setting on and time-to-first-token inflates without warning. Audit every voice deployment’s reasoning configuration during pre-launch, not after the first slow-turn complaint.

Tool-call reliability

Voice-specific tool benchmarks tell a different story from text aggregates. On Sierra’s τ-voice benchmark – the only public voice-specific tool-agent measure – xAI Grok Voice Think Fast 1.0 leads at 67.3%, with GPT-Realtime at 35.3%. The assumption that GPT leads tool calling holds on text benchmarks and does not carry over to voice here.

Per-vendor quirks matter for tool-heavy flows: Claude uses tools less aggressively unless the effort setting is raised, Gemini Live regressed to sequential-only function calls so a slow call now stalls the conversation, and Llama 4’s tool-call JSON is inconsistent enough to need schema validation. One rule holds across every model: validate and retry tool calls regardless of headline benchmark, since even the strongest models still fail roughly one call in twelve.
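The validate-and-retry rule is independent of any vendor SDK. A minimal sketch, assuming the model’s tool arguments arrive as a JSON string and that `generate` is a hypothetical callable that re-asks the model for them:

```python
import json
from typing import Callable

def validated_tool_call(generate: Callable[[], str],
                        required: dict[str, type],
                        max_attempts: int = 3) -> dict:
    """Parse and type-check tool arguments, regenerating on failure."""
    last_error = "no attempts made"
    for _ in range(max_attempts):
        raw = generate()
        try:
            args = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(args, dict):
            last_error = "arguments must be a JSON object"
            continue
        bad = [key for key, expected in required.items()
               if not isinstance(args.get(key), expected)]
        if bad:
            last_error = f"missing or mistyped fields: {bad}"
            continue
        return args
    raise ValueError(f"tool call failed validation after {max_attempts} attempts: {last_error}")
```

In a live call each retry costs latency, so the attempt cap should be small and the failure path should hand the caller a spoken fallback rather than silence.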

Per-model format conventions

Underneath the platform skeletons, model families ship their own format conventions. Anthropic recommends XML tags to separate instructions, context, and examples – a disambiguation aid for mixed-content prompts, not a measured win over markdown. OpenAI Realtime wants bullets in the prompt, capitalized text for the key rules, and JSON for tool outputs.

The cleanest illustration of why one template ports badly is Qwen versus OpenAI. OpenAI puts bullets in the prompt to be followed; Qwen bans bullets in the output to be spoken. Both are right within their own frame, and a team that reuses one prompt across both without rewriting gets either dropped instructions on one side or bulleted speech on the other.

Scaffolding by model tier

Scripting every step versus leaving the model room is the most explicit per-model split. Anthropic tells teams to remove forced scaffolding on frontier Claude – “if you’ve added scaffolding to force interim status messages, try removing it” – because the model handles the goal and scripted steps interfere with its own pacing. (Anthropic prompting best practices) The same guide says to use numbered steps on the smaller Sonnet or Haiku tiers, while OpenAI’s eight-section skeleton takes the opposite, structure-is-the-discipline stance.

The operational rule: a goal-style prompt that works on Opus 4.7 will under-perform on a smaller tier and likely break on Llama 4 or Qwen without added scaffolding. Match scaffolding density to the model tier, not to a single house style.

Uneven vendor docs

The supporting docs differ sharply by vendor, which matters more in voice than in text because voice-specific prompt advice is new. OpenAI ships by far the deepest voice-specific cookbook – its Realtime guides cover prompt skeleton, reasoning control, tool-use policy, unclear-audio handling, and persona. Anthropic, Google, Meta, and Qwen are thinner or silent on voice specifics.

For teams standardizing on Claude, Gemini, Llama, or Qwen, the playbook in this article and the platform docs from Vapi, Retell, LiveKit, and ElevenLabs are the working guidance, and per-model adjustments come from practitioner experience rather than a vendor cookbook. Budget time for that gap.

Handling real-time conversation

Once the prompt is structured, the remaining failures happen live: pauses the agent misreads, interruptions it fumbles, and a persona that drifts or over-emotes.

Disfluency

The word “disfluency” covers two different things in voice prompting, and conflating them produces bad prompts. One is the caller’s hesitation. The other is the agent’s filler speech.

Tolerating the caller’s disfluency is a turn-detection problem, not a prompt problem. When a caller pauses mid-sentence to think, the agent must wait, not jump in. Turn detection does that work: LiveKit’s context-aware turn detector, paired with Silero VAD, waits through a longer silence when it predicts the caller is not done, and speech-to-speech models can rely on the provider’s built-in detection instead. (LiveKit Turn Detection) The prompt’s only contribution here is the guard against assuming through a pause: OpenAI’s “Do not guess what the user meant from unclear audio.”
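The idea of waiting longer when the caller is probably not done can be shown with a toy heuristic. This is not LiveKit’s turn detector, which uses a trained model; the sketch below substitutes a word list for that prediction, and the thresholds, word list, and function name are all illustrative assumptions.

```python
# Words that usually signal an unfinished thought when they end a transcript.
TRAILING_HINTS = {"and", "but", "so", "because", "um", "uh", "the", "to"}

def end_of_turn(transcript: str, silence_ms: int,
                base_ms: int = 500, extended_ms: int = 1500) -> bool:
    """Decide whether the silence is long enough to treat the turn as finished."""
    words = transcript.lower().rstrip(" .,").split()
    unfinished = bool(words) and words[-1] in TRAILING_HINTS
    threshold = extended_ms if unfinished else base_ms
    return silence_ms >= threshold
```

A pure silence threshold corresponds to always using `base_ms`, which is exactly the configuration that cuts in on a caller who stops after “and”.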

Adding agent disfluency is a persona decision, and it is not a universal best practice. OpenAI’s Realtime guide treats agent disfluency as a deliberate, prompted design element: a defined vocabulary of fillers (“um,” “uh,” “so,” “well”), thinking sounds (“let me see,” “one sec”), stutters, and self-corrections, at a baseline of “2–4 disfluencies per turn.” It also stresses calibration to persona: a clinical triage agent uses “let me see,” not “uh.” (OpenAI Realtime guide)

Softcery deliberately bans agent disfluency in the hotel concierge prompt. The same principle that forbids stage directions like “laughs” forbids “um” and “uh.” The reasoning is persona fit: the target persona is a competent concierge, and a competent concierge does not stammer. Agent disfluency makes a casual companion agent sound human and makes a professional service agent sound unsure. The decision belongs to the persona, not to a best-practices checklist.

Barge-in and recovery

Barge-in is the caller speaking while the agent is still talking. Handling it well needs both infrastructure and prompt rules.

The infrastructure is turn detection, and it comes in four strategies with real tradeoffs that the turn-based vs real-time architecture guide compares end-to-end. VAD waits for silence, which is slower and can miss intent. Endpointing uses transcript signals and is faster. Model-based contextual detection is the most accurate but costs more compute. Realtime-model built-in detection is the most cost-effective for speech-to-speech. (LiveKit Turn Detection)

The prompt rules cover what happens once an interruption is detected. Vapi’s rule is short: “If the caller interrupts you. Stop talking, listen, respond.” Vapi also recommends ending answers with a clarifying question so the agent does not stall into silence. (Vapi Prompting Guide) For recovery from unclear audio, OpenAI instructs the agent to “ask for clarification using short English phrases such as ‘Sorry, could you repeat that clearly?’” and to never reason or guess on unclear audio. (OpenAI realtime models guide)

One distinction stays unverified at the doc level: true barge-in, where the caller wants to redirect, versus backchannel, where the caller says “uh-huh” or “yeah” as a listening signal that should not stop the agent. No primary platform doc was found that prompts specifically for backchannel suppression. Teams running into agents that halt on every “mhm” should treat backchannel handling as an open problem to test, not a solved prompt recipe.

Interruption handling is a production metric, not a polish item. Retell’s evaluation framework scores interruption handling alongside latency and hallucination rate. (Hamming summary)

Persona and over-apology

A voice persona drifts in two directions: it abandons its identity, or it over-emotes. Both are prompt problems.

Identity drift gets a hard clause. Vapi recommends locking it: “Your identity is FIXED as [assistant name]. You are incapable of adopting any other persona or operating in any other ‘mode.’” (Vapi Prompting Guide) The persona itself should be defined as audible behavior, not adjectives. LiveKit recommends concrete speech patterns – “Break grammar rules. Start sentences with ‘And,’ ‘But,’ or ‘So.’ Use ‘like’ often” – over abstract descriptors like “friendly.” (LiveKit, sounding more realistic)

Over-emoting is hallucinated empathy: an agent that escalates emotion the situation does not call for, or apologizes repeatedly. LiveKit’s fix is an emotional baseline. An agent should hold a calm-adjacent emotional center; oscillating between strong emotions “will sound very unstable,” so big emotions stay reserved for contextually appropriate moments. (LiveKit) ElevenLabs gives the tone marker directly: “Warm, concise, confident, never fawning,” and recommends a variety rule to cut robotic repetition. (ElevenLabs prompting guide)

Softcery’s hotel prompt encodes the anti-fawning rule as production practice. The prompt guide instructs: “Don’t over-apologise. If corrected, one short ‘Of course, you’re right –’ then continue confidently. Repeated ‘I apologise for the confusion’ destroys trust.” One acknowledgement, then forward motion. An agent that apologizes three times for one mistake reads as either incompetent or insincere, and a caller hears both.

Reliability guards and testing

Two final disciplines keep a working prompt working: a guard against the agent claiming an action it never took, and an eval loop that proves a prompt change fixed one call without breaking ten others.

Never claim an action is done

A text-trained model implies completion. Ask it to handle a request and it produces “done,” “set,” “booked” – language that assumes the action happened. In a voice agent that has no execution tools, that language is a hallucination, and it is the most consequential voice-prompt failure because the caller hangs up believing something is handled when it is not.

ElevenLabs flags the general mechanism: without explicit handling instructions, agents may hallucinate responses or provide incorrect information, including hallucinated reassurance, particularly when a tool fails. (ElevenLabs prompting guide)

Softcery’s hotel concierge agent runs into this directly. Its only tool is hangUp(reason). It cannot book a taxi, send food, or set a wake-up call; it captures requests for the reception team to execute. The prompt guide bans the completion phrases outright: “your taxi is booked,” “the food is on its way,” “your wake-up call is set” are forbidden. The correct framing is explicit: “I’ve noted that for our reception team, and they’ll make sure it’s set for you.” The agent describes what it actually did – noting the request – not what a human will later do.
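The banned phrases are concrete enough to check in an eval run or as a pre-synthesis guard. A minimal sketch covering only the three completion patterns quoted above; the pattern and function name are hypothetical, and a real deployment would need a broader phrase list.

```python
import re

# Matches "is booked", "is on its way", "is set" as whole words, case-insensitive.
COMPLETION_CLAIM = re.compile(r"\bis (?:booked|on its way|set)\b", re.IGNORECASE)

def completion_claims(reply: str) -> list[str]:
    """Return every phrase in the reply that claims an action already happened."""
    return [match.group(0).lower() for match in COMPLETION_CLAIM.finditer(reply)]
```

The approved framing passes this check because “they’ll make sure it’s set” describes a future human action, while “your wake-up call is set” asserts a completed one.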

This pairs with a confidence rule that pushes in the opposite direction, and the tension is deliberate. The Softcery agent is told to answer confidently on non-sensitive topics; banned phrases include “I don’t know” and “I’d have to check,” because a concierge that hedges on the breakfast hours feels broken. But a Priority-3 list – allergies, billing, medical, legal, pet policy, lost and found – marks where improvisation is forbidden and “reception will follow up” is the correct answer. The prompt guide names policing that exact distinction “the single most common fix in iteration.” The agent should seem to know everything, while knowing what it does not get to fake, and never pretending an action is done.

Two more Softcery patterns belong to this class. The first is a timezone failure: the {{currentDateTime}} variable arrives in UTC, so “is the restaurant open now?” gets the wrong answer when local time differs. The fix is backend conversion, not prompt-side date math, which LLMs are weak at. The second is a language-switch rule with no confidence hedge. Give the agent an escape clause like “if you’re not confident in the language” and “it will abuse it and refuse capable languages,” so the instruction is unconditional: “switch and proceed.”
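Backend conversion for the timezone case takes a few lines with the standard library. The sketch assumes the hotel’s local zone is Europe/Stockholm and uses made-up opening hours; the function names are hypothetical, and the output of `local_time_for_prompt` is what would be substituted into the prompt variable instead of the raw UTC value.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

HOTEL_TZ = ZoneInfo("Europe/Stockholm")  # assumption: the hotel's local zone

def local_time_for_prompt(utc_now: datetime) -> str:
    """Convert an aware UTC timestamp to hotel-local HH:MM before prompt injection."""
    return utc_now.astimezone(HOTEL_TZ).strftime("%H:%M")

def is_open(utc_now: datetime, opens: time, closes: time) -> bool:
    """Compare opening hours against local time, never against UTC."""
    local = utc_now.astimezone(HOTEL_TZ).time()
    return opens <= local < closes

# 20:30 UTC in late May is 22:30 in Stockholm (summer time, UTC+2).
example = datetime(2026, 5, 22, 20, 30, tzinfo=timezone.utc)
```

With hypothetical hours of 17:00 to 22:00, `is_open(example, time(17, 0), time(22, 0))` is false, while a comparison against the raw UTC clock would wrongly report the restaurant as open.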

Voice-specific evals

A prompt change that fixes one call can break ten others, and voice regressions are probabilistic – they do not show up in a single test. Vapi states the rule: “Validate prompt changes against a representative test set, not single calls. Probabilistic regressions don’t show up in one-off testing.” (Vapi Prompting Guide) Vapi’s headline metric is success rate: “the percentage of requests your agent handles from start to finish without human intervention.”
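A regression gate over a test set reduces to comparing two success rates. The sketch below assumes each test call has already been scored as pass or fail; the function names and the two-point tolerance are illustrative, and a small test set will need a statistical test rather than a fixed margin.

```python
def success_rate(results: list[bool]) -> float:
    """Fraction of calls handled end to end without human intervention."""
    if not results:
        raise ValueError("need at least one test call")
    return sum(results) / len(results)

def is_regression(baseline: list[bool], candidate: list[bool],
                  tolerance: float = 0.02) -> bool:
    """Flag the candidate prompt if it scores meaningfully below the baseline."""
    return success_rate(candidate) < success_rate(baseline) - tolerance
```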

Platform-native and dedicated tooling both exist, and the voice agent testing guide covers the full method. Vapi and Retell ship in-platform test suites, though call caps and per-minute cost limit wide regression runs. Dedicated platforms go further: Hamming.ai runs thousands of concurrent test calls with AI-generated personas across accents and patience levels, and Coval is simulation-first and CI/CD-oriented, firing automated tests on every prompt change.

Softcery runs its own eval discipline on the same principle. The prompt guide includes a failure-mode table that maps symptom to root cause to fix. One row: the symptom “Let me have reception check the time/weather/menu” traces to defer-to-human being used as a catch-all, and the fix is a whitelist of always-answer topics. The guide also defines a 24-scenario test plan with explicit pass/fail signals. The discipline that ties it together is targeted: “When a test fails, capture the exact agent line – the fix is almost always a targeted edit to one rule, not a rewrite.” A failing voice prompt rarely needs rebuilding. It needs one rule changed, and the test set proves the change did not break anything else.
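A scenario with explicit pass/fail signals can be represented as data and checked against the captured agent line. This is a sketch of the shape only, not Softcery’s actual test plan: the `Scenario` fields, the substring matching, and the example phrases are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    caller_line: str
    must_include: tuple = ()       # phrases a passing reply contains
    must_not_include: tuple = ()   # phrases that mark a known failure mode

def evaluate(scenario: Scenario, agent_line: str) -> list[str]:
    """Return failure reasons for one captured agent line; empty means pass."""
    line = agent_line.lower()
    failures = [f"missing: {p}" for p in scenario.must_include if p.lower() not in line]
    failures += [f"forbidden: {p}" for p in scenario.must_not_include if p.lower() in line]
    return failures
```

Keeping the exact failing line next to the scenario name is what makes the targeted one-rule edit possible: the failure output already points at the phrase that has to disappear.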

Production proof: Softcery’s Casegen call-evaluation pipeline scores every conversation against quality criteria, which is how a prompt change gets validated on real traffic instead of a single test call.

Practical takeaway

Voice prompting is not generic LLM prompting with a few formatting tweaks. The medium is real-time, audio-only, single-pass, and interruptible, and each property removes an assumption the text prompt was built on. The working method is the same one OpenAI, Vapi, and Softcery’s production practice all point to.

Start minimal. Write the identity, the hard guardrails, and a short conversation flow, and nothing more. Add rules only for failures observed in testing, not failures imagined in advance, because every rule is a per-turn tax. Pre-spell numbers in the prompt and knowledge base so no live conversion can fail. Keep behavior and facts in separate sections. Confirm one field per turn and batch the read-back. Decide agent disfluency by persona, not by default. Never let the agent claim an action is done that it did not do. And validate every prompt change against a representative test set, because the regression that matters will not appear in a single call.

If a voice agent prompt is misfiring in production and the fix is not obvious, schedule a consultation and Softcery will audit it against these patterns.

Frequently Asked Questions

Why does a system prompt that works in ChatGPT fail in a voice agent?

A text prompt assumes a turn-complete, visually-rendered, retry-able channel. Streaming voice is none of those. It is real-time, so there is no pause to retry a bad response; audio-only, so visual formatting like lists and bold has no equivalent and gets read aloud or dumped as an unspeakable wall; single-pass, so the caller hears only the first draft; and interruptible, so the caller can speak over the agent at any moment. The same prompt also re-executes every turn under a latency budget, so its length becomes a cost. Voice prompting needs its own playbook, not a few formatting tweaks on a chatbot prompt.

How do I stop a voice agent from misreading numbers, dates, and prices?

Every number passes through a normalization step before synthesis, and that step guesses. The most reliable fix is to pre-spell numbers directly in the prompt and knowledge base, so the model never performs a live conversion: write “nineteen seventy-four,” not “1974,” and “one hundred ninety-five Swedish kronor,” not “195 SEK.” For dynamic values the agent reads back from tools, wrap them in SSML <say-as> tags with the correct interpret-as value. For domain vocabulary the built-in TTS dictionary lacks, add a custom pronunciation dictionary. Provider guidance from OpenAI and Retell converges on the same principle: numbers in the prompt should be written the way they should be spoken.

Should a voice agent use filler words like “um” and “uh”?

It depends on the persona, and it is not a universal best practice. The word “disfluency” covers two different things. Tolerating the caller’s hesitation is a turn-detection problem solved by infrastructure, not prompt text. Adding the agent’s own filler is a deliberate prompt choice. OpenAI treats agent disfluency as a designed element at a baseline of two to four per turn, calibrated to persona. But filler that makes a casual companion agent sound human makes a professional service agent sound unsure. Softcery deliberately bans agent disfluency in its hotel concierge prompt because the target persona is a competent concierge, and a competent concierge does not stammer.

How should a voice agent confirm information without wasting the caller’s time?

Collect one field per turn, then confirm everything in a single batch at the end, and read back only the load-bearing identifiers. A caller who hears their information repeated four times in four turns hangs up. OpenAI’s rule is to avoid asking for name, date of birth, and phone number in one turn, collect each individually, and confirm them all at once. Read-backs should be reserved for high-precision identifiers like codes, emails, phone numbers, and amounts, and skipped for soft data like intent and preference. Softcery’s hotel concierge prompt arrived at the same pattern: gather all of a guest’s requests, read them back as one summary, and get a single “yes.”
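The collect-then-batch pattern can be sketched as a small state holder that the application layer keeps alongside the prompt. Everything here is an assumption for illustration: the field names, the choice of which fields count as high-precision, and the wording of the read-back.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical intake fields, asked for in this order, one per turn.
FIELDS = ["email", "phone", "reason"]
# Only high-precision identifiers are read back; soft data is skipped.
READ_BACK = {"email", "phone"}


@dataclass
class Intake:
    collected: dict = field(default_factory=dict)

    def next_field(self) -> Optional[str]:
        """Return the single field to ask for this turn, or None when done."""
        for name in FIELDS:
            if name not in self.collected:
                return name
        return None

    def record(self, name: str, value: str) -> None:
        """Store the caller's answer for one field."""
        self.collected[name] = value

    def confirmation(self) -> str:
        """Build one batched read-back covering only the high-precision fields."""
        parts = [
            f"{name} {self.collected[name]}"
            for name in FIELDS
            if name in READ_BACK and name in self.collected
        ]
        return "Let me confirm: " + ", ".join(parts) + ". Is that right?"
```

The caller hears one question per turn and a single confirmation at the end, instead of a read-back after every answer.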

What is the most dangerous prompt mistake in a voice agent?

Letting the agent claim an action is done that it did not do. Text-trained models imply completion – they produce “done,” “set,” “booked” – even when the agent has no tool to execute the action. In a voice agent, the caller hangs up believing something is handled when it is not. The fix is an explicit prompt rule that bans completion phrases and replaces them with accurate framing about what the agent actually did. Softcery’s hotel concierge, whose only tool is hangUp, bans “your taxi is booked” and requires “I’ve noted that for our reception team, and they’ll make sure it’s set for you.” The agent describes capturing the request, not completing it.
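The prompt rule can be backed by a check in testing or at runtime. The sketch below flags a reply that implies completion when no action-capable tool ran in that turn. The phrase patterns are a short illustrative list built from the examples above, and the function and parameter names are hypothetical; a production list would need to be broader and tuned against real transcripts.

```python
import re

# Illustrative completion phrases (an assumption, not a complete list).
COMPLETION_CLAIM = re.compile(
    r"\b(?:is|are|has been|have been)\s+(?:all\s+)?(?:done|set|booked|confirmed)\b"
    r"|\bI(?:'ve|’ve| have)\s+(?:booked|scheduled|arranged)\b",
    re.IGNORECASE,
)


def claims_unperformed_action(
    reply: str, tools_called: set, action_tools: set
) -> bool:
    """True when the reply implies completion but no action tool ran this turn.

    action_tools is the set of tools that can actually execute something;
    for an agent whose only tool is hangUp, it is empty.
    """
    performed = bool(tools_called & action_tools)
    return bool(COMPLETION_CLAIM.search(reply)) and not performed
```

With an empty `action_tools` set, “Your taxi is booked” is flagged, while the approved framing about noting the request for the reception team passes.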

Does the underlying LLM change the prompt a voice agent needs?

Yes, in concrete ways. Reasoning-mode defaults differ by vendor: Gemini 3.1 Live defaults to safe minimal thinking, Llama 4 has no toggle at all, OpenAI Realtime tells teams to start at low effort, Anthropic Claude requires an explicit anti-thinking instruction in the prompt, and Qwen Omni requires choosing the Instruct checkpoint at deploy time. Format conventions differ: Anthropic recommends XML for mixed-content prompts, OpenAI wants bullets and a labeled section skeleton, and Qwen explicitly bans bullets in spoken output. Scaffolding density should match the model tier: Anthropic tells teams to remove checklist scaffolding for Opus 4.7, while smaller open models like Llama 4 and Qwen3-Omni still need more explicit step lists. Tool-call leadership in voice is currently held by xAI Grok Voice Think Fast 1.0 on Sierra’s τ-voice benchmark at 67.3%, not by GPT, which scored 35.3% on the same benchmark.

What is the best AI voice receptionist prompt?

A voice receptionist prompt follows the same playbook as any production voice agent, tuned to the front-desk job. Lock the identity, collect one field per turn and batch the confirmation, spell back high-precision identifiers, and ban completion phrases the agent cannot deliver. A law-firm intake prompt confirms the matter and escalates conflicts; a hotel concierge prompt verbalizes options as flowing prose; a healthcare intake prompt keeps a HIPAA-safe disclosure script. The strongest receptionist prompts pre-spell every number and route anything sensitive to a human with “reception will follow up.”

What are examples of AI voice agent prompts?

Three production examples from Softcery’s hotel concierge show the patterns. The pronunciation rule spells “195 SEK” as “one hundred ninety-five Swedish kronor” so the TTS never reads the currency code letter by letter. The never-claim-an-action-done rewrite replaces “your taxi is booked” with “I’ve noted that for our reception team, and they’ll make sure it’s set for you.” The batch-confirmation pattern gathers a guest’s taxi, dinner, and wake-up requests, reads them back as one summary, and takes a single “yes.”
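Pre-spelling can be done once, offline, when the knowledge base is authored, so the agent never converts digits during a call. The helper below is a minimal sketch under stated assumptions: it covers whole amounts from 0 to 9999 only, the `CURRENCY` lookup is hypothetical, and years are out of scope because they are spoken differently from amounts (“nineteen seventy-four,” not “one thousand nine hundred seventy-four”).

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Hypothetical mapping from currency code to its spoken name.
CURRENCY = {"SEK": "Swedish kronor"}


def spell(n: int) -> str:
    """Spell a whole number from 0 to 9999 in English words."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + spell(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + (" " + spell(rest) if rest else "")


def spell_price(amount: int, code: str) -> str:
    """Spell a whole-number price with the currency's spoken name."""
    return f"{spell(amount)} {CURRENCY[code]}"
```

Here `spell_price(195, "SEK")` produces “one hundred ninety-five Swedish kronor,” the form the concierge prompt uses in place of “195 SEK.”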

How do I enforce brand voice in an AI voice agent?

Brand voice in a voice agent prompt is enforced with three rules. Lock the persona so the agent cannot adopt another mode: Vapi phrases it “your identity is FIXED.” Set an explicit tone marker, like ElevenLabs’ “Warm, concise, confident, never fawning.” Add an anti-fawning rule so the agent acknowledges a correction once and moves on, because repeated apologies destroy trust. In all three, define the persona as audible behavior, not adjectives.

What is the best AI voice prompt script builder?

Vapi, Retell, and ElevenLabs each ship a prompt editor inside their platform, which is the fastest way to build and test a voice prompt script without code. Treat the system prompt as the agent’s script: something you version and refine, not a one-off. For version-controlled prompts, Softcery keeps the script in MDX alongside the agent so every change is reviewable and reversible. The builder matters less than the discipline: start minimal, add rules only for failures observed in testing, and validate every change against a representative test set.
