The challenge
GoDaddy's legacy support bot was built on classical NLU: predefined intents, rigid slot filling, brittle fallbacks. It worked until it did not, and it stopped working loudest in the markets that mattered most, where code-switching, idiom, and non-Latin scripts broke the model's assumptions. We were shipping credible design on top of a surface that had run out of headroom.
The business signal tracked the user signal. Containment was plateauing. CSAT was flat. Escalation cost was rising. Each new market required a taxonomy expansion the team could not sustain. A replatforming was no longer optional. The question was how to sequence the migration without disturbing the user base the legacy system served.
Constraints
- No rip-and-replace. The legacy system was load-bearing across eight markets. Migration had to be gradual, measurable, and reversible at every step.
- Latency budget. Users abandoned within three seconds. Any LLM layer had to come in at or under the legacy NLU's latency, not above it.
- Observability gap. Prompt drift, hallucination, and tool-call failures had to be visible in production, not discovered by customers.
- Cross market parity. A design decision made in market A had to hold in markets B through H without separate tuning cycles.
- Responsible AI and compliance. Bias risk, output verification, and citation faithfulness had to be enforced at the system level, not left to prompt authors.
My approach
I led the AI product strategy and conversation architecture end to end, working across engineering, data science, support operations, localization, and product leadership. The work fell into five workstreams that ran in parallel once sequencing was in place.
- Vendor evaluation framework. Before the team committed to a model, I authored a decision framework scoring OpenAI, Anthropic, and AWS Bedrock against latency, cost, accuracy, reliability, bias risk, and scalability. The framework doubled as the build-vs-buy rationale that unlocked a $500K+ budget and leadership buy-in.
- Multi-agent architecture. I specified a pattern in which each agent owned a narrow responsibility: intent framing, tool execution, answer synthesis, safety review. A thin orchestrator brokered between them with explicit state, a versioned tool schema, and a memory policy that was reviewed as a first-class artifact.
- Grounded RAG with AI output verification. I shipped a retrieval-augmented generation pipeline with hybrid retrieval (semantic plus BM25), citation validation, and output verification guardrails. Responses that could not cite a grounded source were held back (a sketch of this gate follows the list). This single design choice reduced hallucinations by 60% compared to the ungrounded baseline and became the trust floor for the rest of the system.
- Prompt architecture as code. System prompts, tool specifications, and guardrails lived in a versioned repository, reviewed in pull requests, and tested against golden sets. Prompt strings never reached production as strings. They reached production as specifications with contracts.
- LangFuse observability and shadow mode. Every trace, every tool call, every token was accounted for. Each new agent ran in shadow mode against the legacy NLU for two weeks per flow before the first user saw its output. Cutover was one flow, one market, one traffic percentile at a time, with rollback criteria defined before every ramp.
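The retrieval fusion and the citation gate are easiest to see in code. Below is a minimal sketch, not the production implementation: the `Retriever` signature, the reciprocal-rank-fusion constant, and the gate's fallback behavior are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Passage:
    doc_id: str
    text: str

# A retriever maps (query, k) to ranked passages; a semantic retriever and
# a BM25 retriever both fit this shape. The signature is an assumption.
Retriever = Callable[[str, int], list[Passage]]

def hybrid_retrieve(query: str, retrievers: list[Retriever], k: int = 8) -> list[Passage]:
    """Fuse rankings from several retrievers (e.g. semantic plus BM25)
    with reciprocal rank fusion: passages ranked highly by any retriever
    float toward the top of the fused list."""
    fused: dict[str, float] = {}
    pool: dict[str, Passage] = {}
    for retrieve in retrievers:
        for rank, passage in enumerate(retrieve(query, k)):
            pool[passage.doc_id] = passage
            fused[passage.doc_id] = fused.get(passage.doc_id, 0.0) + 1.0 / (60 + rank)
    top = sorted(fused, key=fused.get, reverse=True)[:k]
    return [pool[doc_id] for doc_id in top]

def citation_gate(answer: str, cited_ids: set[str], passages: list[Passage]) -> str | None:
    """The trust floor: hold back any answer whose citations do not all
    resolve to passages that were actually retrieved. None means the
    response is withheld and the flow falls back instead of guessing."""
    retrieved = {p.doc_id for p in passages}
    if not cited_ids or not cited_ids <= retrieved:
        return None
    return answer
```

Held-back responses routed to a fallback rather than the user, which is what made "no uncited answer ships" enforceable at the system level rather than in prompt text.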
Key decisions
Multi-agent over monolith. A single large prompt could have handled the task surface in principle. I chose the multi-agent pattern because it made each failure localizable: when a response went wrong, I could tell which agent produced it, which tool it called, and which turn it drifted on. Monolithic prompts hide failures in the middle of the prompt body. Production systems need the opposite property.
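A minimal sketch of that localizability property, with assumed names throughout; the real agents, contracts, and state fields are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class TurnState:
    """Explicit state passed between agents; every handoff is inspectable."""
    user_message: str
    intent: str | None = None
    tool_results: list[dict] = field(default_factory=list)
    draft_answer: str | None = None
    hops: list[str] = field(default_factory=list)  # which agent ran, in order

class Agent(Protocol):
    name: str
    def run(self, state: TurnState) -> TurnState: ...

def orchestrate(state: TurnState, agents: list[Agent]) -> TurnState:
    """Run each narrow-responsibility agent in sequence. Because every
    handoff flows through TurnState, a bad response is attributable to
    the exact agent, tool call, and turn that produced it."""
    for agent in agents:
        state = agent.run(state)
        state.hops.append(agent.name)
    return state

# Assumed order, mirroring the responsibilities named above:
# intent framing -> tool execution -> answer synthesis -> safety review.
```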
LangFuse over a homegrown tracer. I assessed the buy versus build tradeoff against the cost of writing and maintaining an internal observability layer at production scale. LangFuse met the requirements for trace granularity and cost per call within the latency budget. Engineering time stayed on product work.
Shadow mode before first user. Every LLM decision ran in parallel with the legacy NLU for two weeks per flow before touching a user. The cost was delay. The return was that we never shipped a regression a customer had to find for us.
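In code, the shadow harness is small; the value is in the discipline around it. A minimal sketch, assuming the v2 LangFuse Python SDK (`langfuse.trace`, `trace.update`, `trace.score`) and treating the legacy client and new agent as injected callables, since neither interface is public.

```python
from typing import Callable
from langfuse import Langfuse  # assumes the v2 Python SDK

langfuse = Langfuse()  # credentials come from environment variables

def handle_turn_shadow(
    flow: str,
    market: str,
    user_message: str,
    legacy_nlu: Callable[[str], str],   # existing NLU client (assumed shape)
    llm_agent: Callable[[str], str],    # new agent entrypoint (assumed shape)
) -> str:
    """Serve the legacy answer; run the LLM agent in parallel for comparison
    only. In shadow mode, nothing the agent produces reaches the user."""
    trace = langfuse.trace(
        name=f"shadow:{flow}",
        metadata={"market": market, "mode": "shadow"},
    )
    legacy_answer = legacy_nlu(user_message)
    candidate = llm_agent(user_message)

    # Log both outputs plus a crude agreement score so two weeks of shadow
    # traffic yields a quantitative cutover signal per flow and market.
    trace.update(input=user_message,
                 output={"legacy": legacy_answer, "candidate": candidate})
    trace.score(name="shadow_agreement",
                value=1.0 if candidate.strip() == legacy_answer.strip() else 0.0)

    return legacy_answer  # only the legacy path is user-facing in shadow mode
```

A real comparison is richer than exact-match agreement; the point is that the signal lands in the same traces used for cutover and rollback decisions.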
Artifacts I authored or led
- Vendor evaluation framework (OpenAI vs Anthropic vs AWS Bedrock) with six weighted criteria, used as the build-vs-buy decision document for executive review
- Multi-agent state diagram: orchestrator plus four specialist agents plus safety gate, with explicit handoff contracts
- Prompt specification template (contract-style schema) adopted across all agent roles; sketched after this list
- Hybrid RAG pipeline design: semantic plus BM25, citation validation rubric, AI output verification thresholds
- Evaluation harness mapping prompt versions to containment and CSAT deltas
- Market-by-market rollout dashboard with automated rollback triggers in LangFuse; an illustrative trigger is sketched after this list
- Change management program: training curriculum, documentation, and internal enablement for 200+ non-technical employees across support, sales, and operations
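To make "prompt strings never reached production as strings" concrete, here is a minimal sketch of a contract-style specification; the field names and values are illustrative, not the adopted template.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptSpec:
    """A versioned prompt specification: reviewed in PRs, tested on golden sets."""
    role: str                    # e.g. "answer_synthesis"
    version: str                 # bumped on every behavioral change
    system_prompt: str
    tools: tuple[str, ...]       # names drawn from the versioned tool schema
    guardrails: tuple[str, ...]  # enforced checks, not prompt suggestions
    golden_set: str              # path to the eval cases this spec must pass

# Illustrative instance, not a real spec from the program.
ANSWER_SYNTHESIS_V3 = PromptSpec(
    role="answer_synthesis",
    version="3.2.0",
    system_prompt="Answer only from the cited passages...",
    tools=("search_kb", "get_account_state"),
    guardrails=("citation_required", "no_pii_echo"),
    golden_set="evals/answer_synthesis/golden.jsonl",
)
```

Because the spec is data, a pull request diff shows exactly which contract changed, and the golden set it must pass travels with the same artifact.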
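And a minimal sketch of an automated rollback trigger of the kind the dashboard encoded, with placeholder thresholds; the real criteria were defined before every ramp and varied by flow and market.

```python
# Placeholder thresholds; the actual rollback criteria were set per ramp.
ROLLBACK_RULES = {
    "max_containment_drop_pct": 2.0,    # vs. the matched legacy baseline
    "max_p95_latency_ms": 3000,         # the three-second abandonment line
    "max_citation_failure_rate": 0.01,  # uncited answers caught by the gate
}

def should_roll_back(window: dict[str, float]) -> bool:
    """Evaluate one ramp window's metrics (aggregated from traces)
    against its predeclared rollback criteria."""
    return (
        window["containment_delta_pct"] < -ROLLBACK_RULES["max_containment_drop_pct"]
        or window["p95_latency_ms"] > ROLLBACK_RULES["max_p95_latency_ms"]
        or window["citation_failure_rate"] > ROLLBACK_RULES["max_citation_failure_rate"]
    )
```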
Results
The measured outcomes were the headline. The structural outcome is the one I care about more: the unit of iteration got smaller. Instead of rewriting taxonomies every quarter, the team shipped prompt and tool specification changes daily, with traces to prove each change landed and rollback discipline that kept the bar high. The 200+ non-technical employees who ran the system after rollout kept running it after I left. That is the metric I track hardest.
About these numbers
The figures on this page are drawn from internal program reporting I authored or co-authored as the practitioner on the engagement. They are reproduced here in rounded form. They were not produced by an independent third party, and proprietary detail has been omitted where required by the engagement.
Lift figures (CSAT, accuracy, handle time, hallucination rate) reflect pre/post comparisons against a matched baseline using the cohort, time window, and measurement instrument noted in the case study. Volume and adoption figures come from production analytics dashboards. Cost figures reflect either avoided spend or unlocked budget in the named fiscal period.
- 25% faster response time: median end-to-end response latency, post-cutover vs. legacy NLU baseline, measured per market on matched flow cohorts.
- 40% CSAT lift: post-conversation CSAT survey deltas, post-cutover vs. pre-cutover, on matched flows across 8+ markets.
- 60% hallucination reduction: measured against the citation-faithfulness rubric used in the RAG evaluation (claim-level scoring), comparing the grounded RAG pipeline against the ungrounded baseline.
- $2M annual savings: estimated avoided support cost, derived from the containment lift applied to per-contact cost in the named fiscal year (the derivation method is sketched after this list). An estimate, not an audited finance figure.
- $500K+ vendor budget unlocked: strategic budget approved against the build-vs-buy framework I authored for executive review.
- 8+ markets and 20M+ users: the market count at the time of cutover and the named user base served by the support stack.
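For transparency on method only: the savings estimate is containment lift applied to contact volume and per-contact cost. Every input in the sketch below is a placeholder, not a figure from the program.

```python
# Illustrative derivation only; all three inputs are placeholders.
annual_contacts = 4_000_000   # hypothetical yearly support contact volume
containment_lift = 0.05       # hypothetical lift in contained conversations
cost_per_contact = 10.00      # hypothetical fully loaded cost per contact, USD

avoided_spend = annual_contacts * containment_lift * cost_per_contact
print(f"${avoided_spend:,.0f} estimated avoided support cost")  # -> $2,000,000
```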
What I would do differently
Invest in the evaluation harness before the first agent ships, not after the second one does. Every week we delayed formal evaluations cost us a week of trust with stakeholders who reasonably wanted to see numbers, not anecdotes. A related note to self: write the handoff contracts for downstream tools at the same time the tools are specified, not after the first integration failure. Handoff schemas are cheap to design up front and expensive to retrofit.
Collaborators
Partnered with data science on model selection, RLHF feedback loops, and fine-tuning strategy. Partnered with engineering on tool-call infrastructure, latency budgets, and the LangFuse integration. Partnered with L2 support operations on the handoff specifications that made the system safe to deploy at scale. Partnered with localization leads across eight markets on cultural adaptation and language support. Reported into product leadership for ROI tracking and rollout decisions.
Skills demonstrated
- AI product strategy
- Multi-agent architecture
- Vendor evaluation (build vs buy)
- Prompt architecture and versioning
- Hybrid RAG design
- AI output verification and guardrails
- LangFuse observability
- Shadow mode rollout design
- Evaluation harness design
- Cross market product strategy
- Latency and cost engineering
- Change management at enterprise scale
- Stakeholder alignment and ROI reporting