Handling IVR in Ultravox voice AI calls — DTMF, hold music, operator transfer
# Handling IVR in Ultravox-powered outbound voice AI calls
A practical recipe for getting an Ultravox-driven outbound voice agent to navigate real-world IVR menus reliably: emit DTMF digits at the right moment, ignore hold music, and hand off to an operator when needed. Battle-tested against multi-level IVRs in production-style telephony pipelines (Plivo / Twilio streaming call audio to Ultravox over a WebSocket).
If you just want the working prompt block, jump to **§3**.
---
## 1. Architecture in one diagram
```
your service ── creates voice call ─► [telephony provider: Plivo / Twilio]
                                          │ dials destination number
                                          ▼
                                   [target IVR / human]
                                          │ audio (μ-law 8kHz, bidirectional WebSocket)
                                          ▼
                                   [Ultravox voice model]
                                          │ optional: tool fire when AI needs grounded info
                                          ▼
                                   your callback URL
                                     body: {question, ...static params}
                                     resp: {answer, status}
                                          │
                                          ▼
                                   Ultravox reads aloud
```
The voice AI uses two channels mid-call: **speak** (to the human/IVR) and **fire a tool** (to your back-end for runtime info). IVR navigation uses the speak channel: Ultravox emits **DTMF** by sending digit telephony events through the audio stream.
---
## 2. Why naive IVR prompts fail
Things teams commonly try first and what goes wrong:
| Naive prompt instruction | Failure mode |
| --- | --- |
| "Press 1 for reservations" | IVR menus differ; a hardcoded "1 or 2" guess wastes a press, sometimes lands on the wrong department. |
| "Wait for human" (no qualifier) | Voice AI starts talking when corporate hold jingles play — it interprets the music as a person greeting and restarts its intro. |
| Listing all menu options as if the AI reads them | The model parrots the menu back instead of pressing a digit. |
| Nothing about hold music | Voice AI re-introduces every 5–10s while on hold, sounding broken. |
| No "don't claim human" rule | When asked "are you a person?" without explicit handling, model dodges or invents. |
Also surfaced: Ultravox v0.6 **sometimes verbalizes** the action. It emits literal text like `*presses 1*` instead of firing the actual DTMF event, then self-corrects on the next turn. This reproduced several times across calls, so the prompt needs to discourage stage-direction-style emoting.
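To see how often this emoting happens in your own calls, you can scan the agent turns of a transcript for asterisk-wrapped key-press narration. The sketch below is a heuristic, not part of Ultravox: the function name and regex are illustrative, and it assumes you have already extracted the agent's message texts into a list of strings.

```python
import re

# Heuristic: asterisk-wrapped stage directions that mention a key press,
# e.g. "*presses 1*" or "*pressing 0 for operator*".
EMOTE_RE = re.compile(r"\*[^*\n]*\bpress(?:es|ed|ing)?\b[^*\n]*\*", re.IGNORECASE)


def find_dtmf_emotes(agent_texts):
    """Return (turn_index, matched_text) pairs for turns that narrate a key press."""
    hits = []
    for index, text in enumerate(agent_texts):
        # Some turns may carry no text at all (None), so fall back to "".
        for match in EMOTE_RE.finditer(text or ""):
            hits.append((index, match.group(0)))
    return hits
```

A non-empty result on a call where the IVR step stalled is a hint that the model narrated the press instead of sending the digit.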
---
## 3. The IVR section (drop into `system_prompt` as-is)
```
IVR: If an automated menu plays, stay silent while it speaks, then send the
DTMF digit that maps to the department you need (e.g. reservations, support,
front desk). If unclear, try the operator. Hold music or corporate welcome
jingles are NOT a human — keep waiting. Do not restart the introduction
until a real person speaks. Send the actual DTMF digit; do not narrate
"*presses 1*" or describe the action.
```
Why each line earns its place:
- **"stay silent while it speaks"** — stops the AI talking over the menu. Otherwise STT cuts the menu mid-option and the model picks the wrong digit.
- **"DTMF digit that maps to the department you need"** — goal-language, not prescriptive. Model picks the correct option for whichever menu plays. Verified across multiple distinct IVR layouts.
- **"If unclear, try the operator"** — fallback to a human. The operator option is commonly 0, but it varies by IVR, so the prompt names the goal rather than a digit.
- **"Hold music or corporate welcome jingles are NOT a human"** — directly fixes the hold-loop issue. Without this, the agent restarts its intro every time the jingle pauses.
- **"Do not restart the introduction until a real person speaks"** — backstop for the same problem in different wording; model sometimes latches onto one phrasing and not the other.
- **"Send the actual DTMF digit; do not narrate `*presses 1*`"** — explicit fix for the verbalize-instead-of-press failure.
---
## 4. Repeatable prompt template
```
You are an AI assistant calling <COMPANY/PERSON> on behalf of <CUSTOMER_NAME>.
IVR: <the IVR section from §3 verbatim>
WHEN A HUMAN ANSWERS:
OPEN: <one-sentence introduction the AI speaks first>
REQUEST: <bulleted facts the call is about — dates, numbers, IDs>
GATHER: <bulleted info to extract from the agent>
IF THE AGENT ASKS FOR DETAILS YOU DO NOT KNOW: <fallback line>
CLOSE: <how to end. What NOT to commit to (payment, sign-up, etc.)>
STYLE: <tone, language, sentence length, "do not claim to be a human" line>
For any preference question you cannot answer, call the askAgent tool and
read the response back. Do not invent details.
```
Five blocks: identity / IVR / human handoff / data gathering / close + style. Last line ties in the runtime tool callback.
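If you generate prompts per call, the template can be filled programmatically. This is a minimal sketch: the function and argument names are hypothetical, and `ivr_section` is expected to hold the §3 block verbatim.

```python
def _bullets(items):
    """Render a list of strings as one '- item' line each."""
    return "\n".join(f"- {item}" for item in items)


def build_system_prompt(callee, customer, ivr_section, opening, request_facts,
                        gather_items, unknown_fallback, close, style):
    """Fill the template above; every argument is plain text supplied by the caller."""
    return "\n".join([
        f"You are an AI assistant calling {callee} on behalf of {customer}.",
        ivr_section.strip(),
        "WHEN A HUMAN ANSWERS:",
        f"OPEN: {opening}",
        "REQUEST:\n" + _bullets(request_facts),
        "GATHER:\n" + _bullets(gather_items),
        f"IF THE AGENT ASKS FOR DETAILS YOU DO NOT KNOW: {unknown_fallback}",
        f"CLOSE: {close}",
        f"STYLE: {style}",
        "For any preference question you cannot answer, call the askAgent tool and "
        "read the response back. Do not invent details.",
    ])
```

Keeping the IVR block as a single argument makes it easy to reuse the same tested wording across every call type.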
---
## 5. Runtime fallback tool
When the agent runs into a question the prompt didn't anticipate ("loyalty number?", "dietary preference?", "exact arrival time?"), let it call a tool that hits your back-end. Standard contract:
```
POST <your_callback_url>
Body: { question: string, workflow_id: string, user_id?: string }
Resp: { answer: string, status: "ok" | "unknown" }
```
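The `answer` is read aloud verbatim, so keep it to ≤2 sentences. Your endpoint should respond in under 15s, leaving headroom below the 20s tool timeout you configure: set `Timeout: "20s"` on the temporary tool, because the default is far shorter (see §6).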
`workflow_id` should be passed as a `staticParameter` on the temporary tool so the AI doesn't have to remember/include it on every call. Auth headers can ride on the temporary tool's HTTP config too (location: `PARAMETER_LOCATION_HEADER`).
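Put together, the tool definition might look like the dict below. Treat it as a sketch to check against the Ultravox tool reference. Only `temporaryTool`, `dynamicParameters`, the `PARAMETER_LOCATION_*` values and the `"20s"` timeout come from this note. The other key spellings (`modelToolName`, lowercase `timeout`, `staticParameters`, `http.baseUrlPattern`, `httpMethod`) and the bearer-token header are assumptions.

```python
def build_ask_agent_tool(callback_url, workflow_id, auth_token):
    """Return a temporary-tool definition as a plain dict (key names are assumptions)."""
    return {
        "temporaryTool": {
            "modelToolName": "askAgent",
            "description": "Ask the back-end a question the prompt did not cover.",
            # Explicit timeout, since the default is much shorter (see section 6).
            "timeout": "20s",
            # Filled in by the model on each call.
            "dynamicParameters": [
                {
                    "name": "question",
                    "location": "PARAMETER_LOCATION_BODY",
                    "schema": {"type": "string", "description": "The question to answer."},
                    "required": True,
                }
            ],
            # Fixed per call, so the model never has to remember them.
            "staticParameters": [
                {
                    "name": "workflow_id",
                    "location": "PARAMETER_LOCATION_BODY",
                    "value": workflow_id,
                },
                {
                    "name": "Authorization",
                    "location": "PARAMETER_LOCATION_HEADER",
                    "value": f"Bearer {auth_token}",
                },
            ],
            "http": {"baseUrlPattern": callback_url, "httpMethod": "POST"},
        }
    }
```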
---
## 6. Operational gotchas
- **`endReason: "unjoined"`, `billedDuration: "0s"`** — the telephony→Ultravox WebSocket stream never linked up within the join timeout. Almost always a webhook URL mismatch (e.g. ngrok URL rotated, but your service still advertises the old URL in the answer-webhook XML response). Fix: confirm your public base URL is current and restart the service.
- **`MaxDuration` defaults are short.** Most templates ship with 300s, yet real outbound calls (hotels, support) routinely involve 1–3 minute holds. Set `MaxDuration` to at least 600s for any human-handled outbound flow.
- **Ultravox `dynamicParameters` with `PARAMETER_LOCATION_BODY` arrive flat** at the body root, not nested under the tool name. e.g. when the tool fires with parameter `question`, the request body is `{"question": "...", "workflow_id": "..."}`, NOT `{"askAgent": {"question": "..."}}`. The tool webhook handler must read flat.
- **Ultravox tool default timeout is ~2s** unless you explicitly set `Timeout` on the `temporaryTool`. A back-end that answers through a slower LLM (GPT-4o, Sonnet) gets cancelled before it responds. Use a fast model (Haiku, GPT-4o-mini) for tools the voice AI waits on, AND set `Timeout: "20s"`.
- **DTMF emoting** — Ultravox v0.6 sometimes outputs `*presses 1*` as text rather than firing the digit. Mitigate via explicit prompt line. Re-evaluate on model upgrade.
- **Hold music = silence to STT.** While on hold, you'll see no `MESSAGE_ROLE_USER` text additions in the transcript for minutes at a time. That's normal. The agent will keep waiting per the prompt.
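The flat-body gotcha is easiest to get right by keeping the handler logic framework-free and reading keys straight from the body root. A minimal sketch, assuming the §5 contract: `handle_ask_agent`, `lookup` and the fallback wording are hypothetical names, not part of Ultravox.

```python
UNKNOWN_REPLY = "I don't have that detail right now."  # hypothetical wording


def handle_ask_agent(body, lookup):
    """Build the {answer, status} response for one tool call.

    `body` is the parsed JSON request body. `lookup` is a caller-supplied
    function (question, workflow_id, user_id) -> str or None.
    """
    # Parameters arrive flat at the body root, so read them directly.
    question = body.get("question")
    workflow_id = body.get("workflow_id")
    if not isinstance(question, str) or not question.strip() or not workflow_id:
        return {"answer": UNKNOWN_REPLY, "status": "unknown"}

    answer = lookup(question.strip(), workflow_id, body.get("user_id"))
    if not answer:
        return {"answer": UNKNOWN_REPLY, "status": "unknown"}
    return {"answer": answer, "status": "ok"}
```

A handler that expects the nested shape will find no `question` and fall into the `unknown` branch, which makes the mismatch visible in logs instead of failing silently. The sketch does not enforce the response deadline; that still depends on how fast `lookup` returns.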
---
## 7. How to inspect a call after-the-fact
Source of truth is Ultravox's API. Given a `call_id`:
```bash
UV_KEY=<your-ultravox-api-key>
# call status (joined, ended, endReason, billedDuration)
curl -sS https://api.ultravox.ai/api/calls/<CALL_ID> \
-H "X-API-Key: $UV_KEY" | jq
# full transcript
curl -sS "https://api.ultravox.ai/api/calls/<CALL_ID>/messages?pageSize=500" \
-H "X-API-Key: $UV_KEY" | jq
# audio recording (mono 8kHz WAV)
curl -sS -L -o /tmp/<CALL_ID>.wav \
https://api.ultravox.ai/api/calls/<CALL_ID>/recording \
-H "X-API-Key: $UV_KEY"
```
`endReason` cheat sheet:
- `hangup` — call completed normally (or `MaxDuration` hit; check `billedDuration` vs `maxDuration`).
- `unjoined` — telephony→Ultravox stream never linked. Almost certainly a webhook-URL issue.
- `agent_hangup` — AI hung up (typically a tool action).
- `timeout` / `connection_error` / `system_error` — see Ultravox docs.
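The cheat sheet can be turned into a small triage helper that runs over the JSON returned by the call-status request. This is a sketch under assumptions: durations are taken to be strings like `"300s"` (as in the `billedDuration: "0s"` example above), the 1-second tolerance for detecting a cap hit is arbitrary, and a missing `endReason` is assumed to mean the call is still in progress.

```python
def parse_seconds(value):
    """Parse a duration string such as "300s" or "12.5s" into float seconds."""
    if not value:
        return 0.0
    return float(value.rstrip("s"))


def triage_call(call):
    """Map a call-status dict to a one-line diagnosis."""
    reason = call.get("endReason")
    billed = parse_seconds(call.get("billedDuration"))
    cap = parse_seconds(call.get("maxDuration"))
    if reason is None:
        # Assumption: no endReason yet means the call has not ended.
        return "call has not ended yet"
    if reason == "unjoined":
        return "stream never linked: check the public webhook base URL"
    if reason == "hangup":
        # A hangup right at the cap usually means the duration limit fired.
        if cap and billed >= cap - 1.0:
            return "hangup at the duration cap: raise MaxDuration"
        return "hangup: completed normally"
    if reason == "agent_hangup":
        return "agent hung up (typically a tool action)"
    return f"{reason}: see Ultravox docs"
```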
---
## 8. Empirically observed call patterns
Two illustrative outbound calls through IVRs, one single-level and one multi-level:
**Call A — single-level IVR, agent fetches rates on hold**
- Heard menu → pressed correct digit (DTMF) → reservations.
- Reached a human and gave details; the human agent put the AI on a 2-min hold to fetch rates.
- Hit the 5-min `MaxDuration` cap during the hold. Polite auto-close fired.
- Lessons: ✅ flow-correct, ❌ duration too short → bump `MaxDuration`.
**Call B — multi-level menu, transferred to operator**
- IVR worked, transferred to operator at ~30s.
- Operator never picked up — only hold music for 4+ minutes.
- DTMF emote happened twice (`*presses*` as text) before model fired actual digit; self-corrected.
- 5-min cap hit, hangup.
- Lessons: ✅ IVR navigation, ❌ wait for operator (real-world variable), ❌ duration.
---
## 9. Quick wins to apply on day 1
1. Drop §3 verbatim into your `system_prompt`.
2. Use a fast LLM for runtime tool answers (Haiku-class) and set the temporary tool `Timeout` to 20s.
3. Bump `MaxDuration` to 600–900s for any outbound call to a human-staffed line.
4. Confirm your public webhook base URL (especially behind ngrok) before every test session — stale URLs are the #1 cause of `unjoined`.
5. Use the Ultravox API directly (not your local DB) to inspect call state. Locally stored state is often only eventually consistent and may lag.