The challenge
Before GoDaddy committed to a retrieval-augmented generation (RAG) architecture for the production agentic support system, I needed an honest answer to a question every vendor was willing to answer for us: which RAG configuration actually serves our users best? The help center was clean and well structured. The team had three credible options on the table, each with a plausible story. None of the stories agreed. I ran a controlled evaluation so the architecture decision could rest on numbers rather than narratives.
Constraints
- Focused corpus. GoDaddy's support knowledge base is narrow and high quality compared to a general knowledge corpus. Results had to be reported with that scope stated plainly.
- Production latency budget. Answers had to come back inside the same latency envelope as existing support search. No latency regression was acceptable regardless of accuracy gain.
- Reproducible evaluation. Every question, every retrieved passage, and every generated answer logged. No "it feels better" judgments. The harness had to be runnable on demand by any engineer on the team.
- Citation faithfulness. Every generated answer had to be scored on whether it could cite its grounding source: AI output verification at evaluation time, not only at deployment time.
My approach
I designed the evaluation as a controlled comparison across three pipeline configurations, with prompt and model held constant so the retrieval strategy was the only variable that moved.
- Three pipeline configurations, isolated for clean comparison:
- Config A: small non-overlapping chunks, single-pass hybrid retrieval (semantic plus BM25)
- Config B: larger overlapping chunks, single-pass semantic retrieval only
- Config C: small chunks, multi-query rewrite with a re-ranker
- Golden question set built from real support queries, each labeled with expected citations by support SMEs.
- Precision, recall, and latency measured per configuration. Generated answers scored against a citation-faithfulness rubric I authored, checking not only that a citation was present but that the cited passage supported the claim.
- Blind review by support subject matter experts on a random subsample to catch what automated scoring missed.
- AI output verification thresholds applied to every generated answer: answers that could not cite the grounded source were not allowed to pass (a minimal sketch of this gate and the metric computation follows this list).
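The harness itself is small enough to sketch. The snippet below is a minimal illustration of the evaluation loop, not the production code: `retrieve`, `generate`, and `supports` are placeholders for the real pipeline stages and the claim-support check, and the metrics are plain set-overlap precision and recall against the SME-labeled citations.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenQuestion:
    question: str
    expected_citations: set  # passage ids labeled by support SMEs


@dataclass
class ConfigReport:
    precision: float
    recall: float
    p95_latency_ms: float
    faithfulness_pass_rate: float


def evaluate_config(
    golden_set: list,
    retrieve: Callable,   # question -> list of passage ids (placeholder for a real retriever)
    generate: Callable,   # (question, passages) -> list of (claim, cited_passage_id) pairs
    supports: Callable,   # (claim, passage_id) -> bool: does the cited passage support the claim?
) -> ConfigReport:
    precisions, recalls, latencies, passed_gate = [], [], [], []
    for q in golden_set:
        start = time.perf_counter()
        passages = retrieve(q.question)
        claims = generate(q.question, passages)
        latencies.append((time.perf_counter() - start) * 1000)

        # Retrieval quality against the SME-labeled expected citations.
        hits = set(passages) & q.expected_citations
        precisions.append(len(hits) / len(passages) if passages else 0.0)
        recalls.append(len(hits) / len(q.expected_citations) if q.expected_citations else 1.0)

        # Claim-level verification gate: every claim must cite a passage that supports it.
        passed_gate.append(all(supports(claim, pid) for claim, pid in claims))

    latencies.sort()
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    n = len(golden_set)
    return ConfigReport(sum(precisions) / n, sum(recalls) / n, p95, sum(passed_gate) / n)
```

Running this loop once per configuration with the same golden set, prompt, and model is what kept the comparison clean: only the retrieval stage changed between runs.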
What the evaluation showed
Hybrid retrieval beat dense-only. The intuition most teams carry into RAG is that semantic retrieval is strictly better than keyword retrieval. Over a clean support corpus, that intuition is wrong. Keyword retrieval catches technical terms, part numbers, and brand-specific jargon that the embedding model never learned to encode. Hybrid retrieval captured both, at a negligible latency cost.
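One common way to implement the hybrid step is reciprocal rank fusion over the two retrievers' rankings. The sketch below is illustrative rather than the production implementation; `dense_ranked` and `bm25_ranked` are hypothetical outputs from the semantic and keyword retrievers.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k: int = 60, top_n: int = 8):
    """Fuse several ranked lists of passage ids into a single ranking.

    A passage ranked highly by either the semantic retriever or BM25 floats
    to the top, so exact technical terms and part numbers survive even when
    the embedding model never learned to encode them.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, passage_id in enumerate(ranked):
            scores[passage_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical rankings from the two retrievers for a single query.
dense_ranked = ["kb-142", "kb-077", "kb-009"]  # semantic similarity order
bm25_ranked = ["kb-077", "kb-310", "kb-142"]   # keyword match order
print(reciprocal_rank_fusion([dense_ranked, bm25_ranked]))
```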
Multi-query rewriting and re-ranking did not earn their cost. Config C looked sophisticated and underperformed. Multi-query rewrites introduced redundant context that diluted precision. The re-ranker's added latency was not earned by the precision gain on this corpus. Sophisticated is not the same as correct.
Artifacts I authored or led
- Evaluation harness: question set, scoring rubric, pipeline runner, per-config report generator
- Citation-faithfulness rubric: claim-level scoring, not just citation presence
- Per-configuration precision, recall, and latency report
- Recommendation memo with deploy configuration, monitoring plan, and go/no-go criteria for production rollout
Results
The winning configuration, Config A with small non-overlapping chunks and hybrid retrieval, was the basis of the grounded RAG pipeline that shipped into the production agentic support system. Post-deployment, the architecture delivered a 45% answer accuracy lift, a 60% reduction in hallucinations, and a 30% reduction in human escalations compared to the pre-RAG baseline. The evaluation paid for itself six weeks after it ended.
About these numbers
The figures on this page are drawn from internal program reporting I authored or co-authored as the practitioner on the engagement. They are reproduced here in rounded form. They were not produced by an independent third party, and proprietary detail has been omitted where required by the engagement.
Lift figures (CSAT, accuracy, handle time, hallucination rate) reflect pre/post comparisons against a matched baseline using the cohort, time window, and measurement instrument noted in the case study. Volume and adoption figures come from production analytics dashboards. Cost figures reflect either avoided spend or unlocked budget in the named fiscal period. The lift arithmetic is sketched after the list below.
- 45% answer accuracy lift: measured on the SME-labeled golden question set comparing the deployed Config A pipeline against the pre-RAG baseline.
- 60% hallucination reduction: measured against the citation-faithfulness rubric (claim-level scoring of whether the cited passage supports each generated claim), not just citation presence.
- 30% reduction in human escalations: measured against pre-RAG production escalation rates over a matched post-deploy window.
- Three pipelines benchmarked: prompt and model held constant across configurations so retrieval strategy was the only variable that moved.
- Findings are corpus-specific. The GoDaddy support knowledge base is narrow and well structured; results may differ on broader or messier corpora.
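To make the arithmetic explicit, the sketch below assumes the lift and reduction figures are relative deltas against the matched baseline; that reading of the reporting convention, and the function names, are mine rather than quotes from the dashboards.

```python
def relative_lift(baseline: float, post_deploy: float) -> float:
    """Relative improvement over the matched pre-RAG baseline (0.45 == 45% lift)."""
    return (post_deploy - baseline) / baseline


def relative_reduction(baseline: float, post_deploy: float) -> float:
    """Relative decrease from the matched pre-RAG baseline (0.60 == 60% reduction)."""
    return (baseline - post_deploy) / baseline
```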
What I would do differently
Add a fourth configuration with pure BM25 retrieval from the start. I expected dense retrieval to dominate and did not budget for the sparse-only comparison. It is the first question a reviewer will ask, and it is the right question. The second lesson was cheaper to learn: citation-faithfulness scoring should be the first metric computed, not the third. It is the metric that separates a useful grounded answer from a confidently wrong one.
Collaborators
Worked with engineering on pipeline implementation and with support subject matter experts on the golden set construction and blind review. Partnered with data science on the evaluation harness and its reproducibility guarantees. Final recommendation reviewed with product leadership and the conversational AI architecture working group.
Skills demonstrated
- RAG evaluation design
- Retrieval strategy comparison (dense, sparse, hybrid)
- Golden set construction
- Citation faithfulness scoring
- AI output verification
- Latency and precision tradeoff analysis
- Blind SME review coordination
- Technical memo writing for leadership