The challenge
GoDaddy's conversational AI routed on an intent taxonomy that had evolved faster than it had been designed. Intents had been added one at a time, shaped by the product team that needed them that week, with no single owner accountable for how the pieces fit together. At a smaller scale this is survivable. At 20M+ users across 8+ markets, the seams began to show. The classifier misrouted roughly 30% of inbound conversations. The teams who inherited those conversations spent most of their time correcting for the bad routing, not solving the underlying user problem.
The obvious fix was to retrain the classifier. The less obvious finding was that no amount of retraining would help, because the target labels themselves were inconsistent. The same user message could legitimately map to three intents depending on how the annotator felt that day. The problem was upstream of the model. It was in the taxonomy.
Constraints
- Production routing depended on it. The taxonomy was load-bearing. A redesign had to be rolled out without disrupting the existing agentic system that served millions of conversations weekly.
- Multi-market vocabulary. The same concept surfaces differently in eight languages and regional idioms. The taxonomy had to hold up across all of them without a per-market rewrite.
- Downstream contracts. Multiple systems consumed intent labels: routing logic, analytics, reporting, and business dashboards. Renaming an intent was never a cosmetic change.
- Reproducible rigor. The design had to be defensible. I needed a taxonomy that an independent reviewer could look at and verify was mutually exclusive and collectively exhaustive, not a judgment call.
My approach
I approached the work as a data design project, not a labeling project. The difference matters: a labeling project picks names for what already exists. A data design project defines the shape of the space before anything gets a name. The steps were sequential but each produced an artifact the team could review.
- Corpus mining. I pulled 50K+ anonymized real conversations across markets and product surfaces. The corpus was filtered for quality (complete turns, actionable user intent, representative of volume), and edge cases that the existing taxonomy handled poorly were deliberately oversampled.
- Clustering without predetermined labels. Before I drew the taxonomy, I clustered the corpus along two axes: what the user is trying to accomplish (the task) and which product surface it touches (the domain). The clustering surfaced 12 natural domains and a long tail of tasks within each (a sketch of this step follows the list). This is the moment the redesign stopped being opinion and started being data.
- Taxonomy design principles. I authored three principles the taxonomy had to satisfy: mutually exclusive (no user message can legitimately map to two intents), collectively exhaustive (every observed user message has a home, including fallbacks), and edge case explicit (ambiguous or adversarial inputs have their own intents rather than being silently misrouted).
- Iteration with SMEs. I walked product, support, and data science SMEs through the draft taxonomy in structured reviews. The goal was not consensus. The goal was to surface the edge cases each role saw but the others did not. The SMEs caught what the clustering could not.
- Prompt engineering for the classifier. With the taxonomy stable, I rewrote the classifier prompts to use the 150+ intents as explicit labels, with definitions, positive examples, and negative examples per intent (the prompt assembly is sketched after the list). RLHF signals from production operators were routed back into the classifier to tighten the highest-volume misroutes.
- A/B rollout. The new taxonomy shipped behind a classifier flag, one market at a time, with the legacy system running in parallel so routing quality could be measured against the old baseline on matched cohorts. Cutover was tied to per-market performance thresholds.
- Evaluation harness. I authored an accuracy evaluation that ran on every classifier change, using a held-out golden set labeled by SMEs (see the harness sketch after the list). The harness made taxonomy quality a continuous observable, not a project artifact.
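To make the clustering step concrete, here is a minimal sketch of the two-axis pass. It assumes sentence-transformer embeddings, k-means, and a hypothetical `user_text` / `surface` schema; the model name, cluster count, and column names are illustrative, not the production pipeline.

```python
# Sketch of the two-axis clustering: embed user turns, cluster by task,
# then cross-tabulate task clusters against product surface to see where
# candidate domains concentrate. Model name, cluster count, and column
# names are illustrative assumptions, not the production pipeline.
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_corpus(df: pd.DataFrame, n_task_clusters: int = 60) -> pd.DataFrame:
    """df is assumed to carry 'user_text' and 'surface' columns (hypothetical schema)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(df["user_text"].tolist(), show_progress_bar=False)

    # Axis 1: the task -- what the user is trying to accomplish.
    df = df.copy()
    df["task_cluster"] = KMeans(n_clusters=n_task_clusters, random_state=0).fit_predict(embeddings)

    # Axis 2: the domain -- which product surface the conversation touches.
    # The cross-tab shows which task clusters concentrate in which surfaces,
    # which is where the natural domains and their long tail of tasks appear.
    return pd.crosstab(df["task_cluster"], df["surface"])
```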
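The classifier prompt roughly follows the shape below: one block per intent with a definition, positive examples, and negative examples, plus an explicit instruction to use the edge-case intents. This is a minimal sketch; the dataclass fields, label names, and wording are assumptions, not the production template.

```python
# Sketch of the classification prompt assembly. Each intent carries a
# definition plus positive and negative examples, and the model is told to
# answer with exactly one label, including the explicit edge-case intents.
from dataclasses import dataclass, field

@dataclass
class Intent:
    name: str                 # e.g. "billing.refund_request" (hypothetical label)
    definition: str
    positive: list[str] = field(default_factory=list)
    negative: list[str] = field(default_factory=list)

def build_classifier_prompt(intents: list[Intent], user_message: str) -> str:
    blocks = []
    for intent in intents:
        blocks.append(
            f"### {intent.name}\n"
            f"Definition: {intent.definition}\n"
            f"Matches: {'; '.join(intent.positive)}\n"
            f"Does NOT match: {'; '.join(intent.negative)}"
        )
    return (
        "Classify the user message into exactly one intent label below.\n"
        "If the message is ambiguous, adversarial, or out of scope, use the "
        "matching edge-case intent rather than the closest real intent.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nUser message: {user_message}\nIntent:"
    )
```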
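The harness itself reduces to something like the sketch below: score every classifier change against the SME-labeled golden set, report per-intent precision and recall, and flag drift against the previous run. The drift threshold and data shapes are assumptions for illustration.

```python
# Sketch of the evaluation harness: per-intent precision/recall on the
# held-out golden set, plus a simple drift check against the previous run.
# The drift threshold is an illustrative assumption.
from sklearn.metrics import precision_recall_fscore_support

def evaluate(classify, golden_set, previous_scores=None, drift_threshold=0.05):
    """golden_set: list of (message, true_intent) pairs; classify: message -> intent label."""
    y_true = [label for _, label in golden_set]
    y_pred = [classify(message) for message, _ in golden_set]
    labels = sorted(set(y_true))

    precision, recall, _, support = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, zero_division=0
    )
    scores = {
        label: {"precision": float(p), "recall": float(r), "support": int(s)}
        for label, p, r, s in zip(labels, precision, recall, support)
    }

    # Drift detection: flag intents whose recall dropped versus the last run.
    drifted = []
    if previous_scores:
        for label, current in scores.items():
            prior = previous_scores.get(label)
            if prior and prior["recall"] - current["recall"] > drift_threshold:
                drifted.append(label)
    return scores, drifted
```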
Design decisions
Hierarchical, not flat. A flat taxonomy with 150+ intents would have been easier to classify against but harder to reason about. I chose a two-level structure: 12 domains at the top, specific intents underneath. The domains are stable, the intents under them can evolve as the product does, and routing logic can operate at either level depending on downstream needs.
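In practice the two-level structure can be represented as simply as the sketch below; the domain and intent names are placeholders, not the real taxonomy.

```python
# Sketch of the two-level taxonomy: stable domains at the top, evolving
# intents underneath, with routing possible at either level. All names
# here are placeholders, not the actual taxonomy.
from dataclasses import dataclass

@dataclass(frozen=True)
class IntentLabel:
    domain: str   # one of the 12 stable domains
    intent: str   # the specific intent, free to evolve under its domain

TAXONOMY = {
    "billing": ["refund_request", "invoice_question", "ambiguous_billing"],
    "domains": ["transfer_out", "dns_change", "out_of_scope_domains"],
    # ...remaining domains, each carrying its own explicit edge-case intents
}

def route(label: IntentLabel, at_domain_level: bool) -> str:
    # Downstream consumers can route on the coarse domain or the fine intent.
    return label.domain if at_domain_level else f"{label.domain}.{label.intent}"
```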
Edge cases as first-class intents. Ambiguous messages, adversarial inputs, and out-of-scope requests got their own intent buckets rather than being forced into the closest real intent. This felt wrong to stakeholders at first because it looked like fewer routing successes. But the routing successes it replaced had been silent failures. Counting them honestly was the point.
Artifacts I authored or led
- Conversation mining pipeline: corpus filtering, clustering, coverage reporting
- Taxonomy specification: 150+ intents across 12 domains with definitions, positive examples, negative examples, and edge case notes
- Classification prompt template used across all agent roles that consumed intent labels
- Accuracy evaluation harness: held-out golden set, per-intent precision and recall, drift detection
- A/B rollout plan with per market performance thresholds and rollback criteria
- Migration guide for downstream consumers (routing, analytics, reporting) so label changes did not break contracts
Results
The user-facing outcome was the 20% drop in handle time. The system-facing outcome was that the classifier could finally learn against a stable signal. The organizational outcome was that conversations about intent quality moved from opinion to evidence, because the evaluation harness made quality observable week over week.
About these numbers
The figures on this page are drawn from internal program reporting I authored or co-authored as the practitioner on the engagement. They are reproduced here in rounded form. They were not produced by an independent third party, and proprietary detail has been omitted where required by the engagement.
Lift figures (CSAT, accuracy, handle time, hallucination rate) reflect pre/post comparisons against a matched baseline using the cohort, time window, and measurement instrument noted in the case study. Volume and adoption figures come from production analytics dashboards. Cost figures reflect either avoided spend or unlocked budget in the named fiscal period.
- 30% misrouting reduction: measured against a held-out golden set labeled by SMEs across all 8 markets, post-rollout vs. legacy classifier on matched cohorts.
- 20% handle-time reduction: measured during the per-market A/B rollout window, comparing matched conversation cohorts on the new vs. legacy taxonomy.
- 50K+ conversations: refers to the anonymized corpus mined and clustered across markets and product surfaces; the held-out evaluation set is a separate, smaller SME-labeled golden set.
- 150+ intents and 12 domains: counts at the time of the cutover; the taxonomy has continued to evolve since then.
What I would do differently
Build the evaluation harness on day one, not at the end. I had a strong taxonomy four weeks in and an honest way to measure it eight weeks in. Reversing that order would have given stakeholders confidence earlier and saved a meeting or three. A related lesson: write the downstream migration guide in parallel with the taxonomy, not after it. Downstream consumers had legitimate concerns that were answerable with a good guide, and answering them late cost more trust than the guide would have cost to draft.
Collaborators
Partnered with data science on the clustering analysis and the classifier training. Partnered with product, support, and operations SMEs on the structured taxonomy reviews. Partnered with engineering on the classifier rollout and the evaluation harness. Reported into product leadership for rollout sequencing and cross market alignment.
Skills demonstrated
- NLP taxonomy design
- Corpus mining and clustering
- Mutually exclusive, collectively exhaustive classification
- Prompt engineering for classification
- RLHF signal integration
- Evaluation harness design
- Cross market rollout planning
- Downstream contract management