Conversational AI Glossary

Plain-language definitions for the terms that come up most in my work: containment, evals, orchestration, RAG, intent, and the rest. Written the way I would explain them on a call, not the way a vendor data sheet would.

Citation faithfulness
Whether a cited source actually backs up the claim attached to it. You check it by reading the cited passage, not by confirming a citation exists. A confident answer with a citation that does not support it is worse than no citation at all.
Containment
The share of conversations a system resolves on its own, without handing off to a human. It is a core support metric, but it only counts when the user actually got helped. Containment that is really just deflection shows up later as a second contact.
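To make the gap between containment and deflection concrete, here is a minimal sketch. It assumes each conversation record carries two flags, `handed_off` and `repeat_contact`; both names are hypothetical, and how you detect a repeat contact (same user, same issue, some time window) is up to your own data.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    handed_off: bool      # conversation was escalated to a human
    repeat_contact: bool  # hypothetical flag: user came back about the same issue


def containment_rates(conversations):
    """Return (raw containment, containment net of repeat contacts)."""
    total = len(conversations)
    if total == 0:
        return 0.0, 0.0
    # Raw containment: no handoff happened.
    contained = [c for c in conversations if not c.handed_off]
    # Net containment: no handoff AND the user did not have to come back.
    resolved = [c for c in contained if not c.repeat_contact]
    return len(contained) / total, len(resolved) / total


sample = [
    Conversation(handed_off=False, repeat_contact=False),
    Conversation(handed_off=False, repeat_contact=True),   # deflected, not helped
    Conversation(handed_off=True, repeat_contact=False),
    Conversation(handed_off=False, repeat_contact=False),
]
raw, net = containment_rates(sample)
print(raw, net)  # 0.75 0.5
```

The distance between the two numbers is a rough read on how much of the reported containment is really deflection.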
Context engineering
Designing everything the model sees around a given prompt: retrieved documents, earlier turns, tool results, and system instructions. The prompt is one input. The context is the whole state the system assembles, turn after turn.
Conversation design
The craft of shaping how a system talks: word choice, tone, error recovery, persona, and the shape of a multi-turn exchange. For a conversational product, the writing is the interface, so the writing is the product.
Drift
Slow change in a model's behavior or its inputs over time, so a system that passed its checks at launch quietly degrades. You catch it by running evals continuously, not once.
Eval
A repeatable test that scores a system's output against known-good answers or a rubric. Evals turn 'this feels better' into a number you can compare across releases.
Eval harness
The machinery that runs evals on demand: the question set, the scoring rubric, the pipeline that runs it, and the per-version report. The teams that ship reliable AI build this first and wire it into every release.
Golden set
A held-out set of inputs with human-labeled correct answers, used to measure accuracy and catch regressions. You keep it stable so scores stay comparable as the system changes.
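A golden set check can be as small as the sketch below. It assumes exact-match scoring after light normalization, which only suits short factual answers; `answer_fn` is a hypothetical stand-in for whatever calls your system, and longer answers would need a rubric instead.

```python
def normalize(text):
    """Lowercase and collapse whitespace so formatting noise is not scored as an error."""
    return " ".join(text.lower().split())


def golden_set_accuracy(answer_fn, golden_set):
    """Fraction of golden-set questions where the system output matches the labeled answer."""
    if not golden_set:
        raise ValueError("golden set is empty")
    correct = sum(
        normalize(answer_fn(question)) == normalize(expected)
        for question, expected in golden_set
    )
    return correct / len(golden_set)


# Toy golden set and a toy "system" (a lookup table) purely for illustration.
golden = [("capital of France?", "Paris"), ("6 x 7?", "42"), ("sky color?", "blue")]
canned = {"capital of France?": " paris ", "6 x 7?": "41", "sky color?": "Blue"}
accuracy = golden_set_accuracy(lambda q: canned[q], golden)
print(round(accuracy, 3))  # 0.667
```

Running the same function against the same set on every release is what makes the scores comparable.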
Grounding
Tying an answer to specific source material (a document, a database row, a retrieved passage) instead of letting the model answer from memory. Grounding is what lets a citation be honest.
Hallucination
When a model produces something fluent and confident that neither its sources nor the facts support. It is the failure mode that grounding and citation checks exist to catch.
Handoff (escalation)
The point where a conversation moves from the AI to a human or another system. Good handoff design carries the full context across, so the person never has to start over.
Hybrid retrieval
Combining semantic (vector) search with keyword search so retrieval catches both meaning and exact terms like error codes and product names. It usually beats either method on its own.
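One common way to merge the two result lists is reciprocal rank fusion (RRF). I am naming it as an example, not as the only option. The sketch assumes each search has already returned a ranked list of document ids; the ids and the constant `k=60` are illustrative defaults.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical semantic search results
keyword_hits = ["c", "a", "d"]  # hypothetical keyword search results
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['a', 'c', 'b', 'd']
```

Documents that both methods rank highly rise to the top, while a document only one method found (like an exact error-code match) still makes the list.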
Intent
What a user is actually trying to do, named as a category the system can act on, like 'reset password' or 'cancel order'. It is the unit an intent classifier predicts.
Intent taxonomy
The full, organized set of intents a system recognizes, ideally mutually exclusive and built from real conversations rather than guesses. A muddy taxonomy is the quiet cause of most misrouting.
Latency
How long the system takes to respond. It is a design constraint, not an afterthought: a correct answer that arrives too late still loses the user.
LLM-as-judge
Using one language model to score another model's output against a rubric. It scales well, as long as you calibrate it against a human grader on a sample so the scores stay trustworthy.
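Calibration can start with a sample that both the judge and a human have labeled. This sketch assumes simple categorical labels such as "pass" and "fail", and reports raw agreement plus Cohen's kappa, which corrects for agreement you would get by chance. What counts as "good enough" agreement is a judgment call I am not encoding here.

```python
from collections import Counter


def judge_agreement(judge_labels, human_labels):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Agreement expected by chance, from each grader's label frequencies.
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return observed, 1.0  # both graders used one identical label throughout
    return observed, (observed - expected) / (1.0 - expected)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
print(judge_agreement(judge, human))  # (0.75, 0.5)
```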
Multi-agent system
An architecture where several role-specialized model calls each handle part of a task, coordinated by a thin orchestrator, instead of one prompt trying to do everything.
Orchestration
The cross-system work of routing a request, calling tools, tracking state, and handling fallbacks. Most of what people call 'agent' work is really orchestration: the AI doing the steps a human used to do by hand.
Prompt architecture
Treating prompts as versioned product specs with tests, kept in source control, instead of strings edited live in production. It lets you change a prompt and see the eval impact before it ships.
RAG (retrieval-augmented generation)
A pattern where the system retrieves relevant documents first, then asks the model to answer using them. It grounds answers in real sources instead of model memory.
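The retrieve-then-answer flow looks roughly like the sketch below. It assumes a toy keyword-overlap retriever and stops at building the prompt, since the model call depends on your provider. The document ids, the prompt wording, and both function names are hypothetical.

```python
def retrieve(query, documents, top_k=2):
    """Rank documents by how many query words they share (a stand-in for real search)."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def build_prompt(query, documents, top_k=2):
    """Assemble a prompt that tells the model to answer from the retrieved sources only."""
    doc_ids = retrieve(query, documents, top_k)
    sources = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the sources below and cite the source id.\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


knowledge_base = {
    "kb1": "reset your password from the login page",
    "kb2": "cancel an order within 30 minutes",
}
print(build_prompt("how do I reset my password", knowledge_base))
```

Keeping the source ids in the prompt is what makes a later citation check possible.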
Slot filling
Collecting the specific pieces of information a task needs (a date, an account number, a size) across one or more turns before the system acts.
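In code, slot filling comes down to tracking what is still missing and asking for it before acting. This is a minimal sketch; the slot names and the `("ask", slot)` / `("act", None)` return shape are assumptions for illustration.

```python
REQUIRED_SLOTS = ("date", "account_number")  # hypothetical slots for one task


def missing_slots(filled, required=REQUIRED_SLOTS):
    """Required slots that have no value yet, in the order they should be asked."""
    return [slot for slot in required if not filled.get(slot)]


def next_action(filled, required=REQUIRED_SLOTS):
    """Ask for the next missing slot, or act once everything is collected."""
    missing = missing_slots(filled, required)
    if missing:
        return ("ask", missing[0])
    return ("act", None)


state = {}
print(next_action(state))               # ('ask', 'date')
state["date"] = "2024-05-01"            # filled on a later turn
print(next_action(state))               # ('ask', 'account_number')
state["account_number"] = "12345"
print(next_action(state))               # ('act', None)
```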

Have a term you want defined here? Send me a note.

Let's build

Tell me what you are building and where it hurts.

I take a small number of engagements each quarter through Intelligent CX Consulting. If what you're reading here sounds like the thing you need, get in touch.

Hiring for a full-time role instead? My resume is at /resume.