Reference

Conversational AI Glossary

Plain-language definitions for the terms that come up most in my work: containment, evals, orchestration, RAG, intent, and the rest. Written the way I would explain them on a call, not the way a vendor data sheet would.

Citation faithfulness: Whether a cited source actually backs up the claim attached to it. You check it by reading the cited passage, not by confirming a citation exists. A confident answer with a citation that does not support it is worse than no citation at all.
Containment: The share of conversations a system resolves on its own, without handing off to a human. It is a core support metric, but it only counts when the user actually got helped. Containment that is really just deflection shows up later as a second contact.
Context engineering: Designing everything the model sees around a given prompt: retrieved documents, earlier turns, tool results, and system instructions. The prompt is one input. The context is the whole state the system assembles, turn after turn.
Conversation design: The craft of shaping how a system talks: word choice, tone, error recovery, persona, and the shape of a multi-turn exchange. For a conversational product, the writing is the interface, so the writing is the product.
Drift: Slow change in a model's behavior or its inputs over time, so a system that passed its checks at launch quietly degrades. You catch it by running evals continuously, not once.
Eval: A repeatable test that scores a system's output against known-good answers or a rubric. Evals turn 'this feels better' into a number you can compare across releases.
Eval harness: The machinery that runs evals on demand: the question set, the scoring rubric, the pipeline that runs it, and the per-version report. The teams that ship reliable AI build this first and wire it into every release.
Golden set: A held-out set of inputs with human-labeled correct answers, used to measure accuracy and catch regressions. You keep it stable so scores stay comparable as the system changes.
Grounding: Tying an answer to specific source material (a document, a database row, a retrieved passage) instead of letting the model answer from memory. Grounding is what lets a citation be honest.
Hallucination: When a model produces something fluent and confident that its sources and the facts do not support. It is the failure mode that grounding and citation checks exist to catch.
Handoff (escalation): The point where a conversation moves from the AI to a human or another system. Good handoff design carries the full context across, so the person never has to start over.
Hybrid retrieval: Combining semantic (vector) search with keyword search so retrieval catches both meaning and exact terms like error codes and product names. It usually beats either method on its own.
Intent: What a user is actually trying to do, named as a category the system can act on, like 'reset password' or 'cancel order'. It is the unit an intent classifier predicts.
Intent taxonomy: The full, organized set of intents a system recognizes, ideally mutually exclusive and built from real conversations rather than guesses. A muddy taxonomy is the quiet cause of most misrouting.
Latency: How long the system takes to respond. It is a design constraint, not an afterthought: a correct answer that arrives too late still loses the user.
LLM-as-judge: Using one language model to score another model's output against a rubric. It scales well, as long as you calibrate it against a human grader on a sample so the scores stay trustworthy.
Multi-agent system: An architecture where several role-specialized model calls each handle part of a task, coordinated by a thin orchestrator, instead of one prompt trying to do everything.
Orchestration: The cross-system work of routing a request, calling tools, tracking state, and handling fallbacks. Most of what people call 'agent' work is really orchestration: the AI doing the steps a human used to do by hand.
Prompt architecture: Treating prompts as versioned product specs with tests, kept in source control, instead of strings edited live in production. It lets you change a prompt and see the eval impact before it ships.
RAG (retrieval-augmented generation): A pattern where the system retrieves relevant documents first, then asks the model to answer using them. It grounds answers in real sources instead of model memory.
Slot filling: Collecting the specific pieces of information a task needs (a date, an account number, a size) across one or more turns before the system acts.

Have a term you want defined here? Send me a note.

Let's build

Tell me what you are building and where it hurts.

I take a small number of engagements each quarter through Intelligent CX Consulting . If what you're reading here sounds like the thing you need, get in touch.

Hiring for a full-time role instead? My resume is at /resume.