Reference
Conversational AI Glossary
Plain-language definitions for the terms that come up most in my work: containment, evals, orchestration, RAG, intent, and the rest. Written the way I would explain them on a call, not the way a vendor data sheet would.
- Citation faithfulness
- Whether a cited source actually backs up the claim attached to it. You check it by reading the cited passage, not by confirming a citation exists. A confident answer with a citation that does not support it is worse than no citation at all.
- Containment
- The share of conversations a system resolves on its own, without handing off to a human. It is a core support metric, but it only counts when the user actually got helped. Containment that is really just deflection shows up later as a second contact.
- Context engineering
- Designing everything the model sees around a given prompt: retrieved documents, earlier turns, tool results, and system instructions. The prompt is one input. The context is the whole state the system assembles, turn after turn.
- Conversation design
- The craft of shaping how a system talks: word choice, tone, error recovery, persona, and the shape of a multi-turn exchange. For a conversational product, the writing is the interface, so the writing is the product.
- Drift
- Slow change in a model's behavior or its inputs over time, so a system that passed its checks at launch quietly degrades. You catch it by running evals continuously, not once.
- Eval
- A repeatable test that scores a system's output against known-good answers or a rubric. Evals turn 'this feels better' into a number you can compare across releases.
- Eval harness
- The machinery that runs evals on demand: the question set, the scoring rubric, the pipeline that runs it, and the per-version report. The teams that ship reliable AI build this first and wire it into every release.
- Golden set
- A held-out set of inputs with human-labeled correct answers, used to measure accuracy and catch regressions. You keep it stable so scores stay comparable as the system changes.
- Grounding
- Tying an answer to specific source material (a document, a database row, a retrieved passage) instead of letting the model answer from memory. Grounding is what lets a citation be honest.
- Hallucination
- When a model produces something fluent and confident that its sources and the facts do not support. It is the failure mode that grounding and citation checks exist to catch.
- Handoff (escalation)
- The point where a conversation moves from the AI to a human or another system. Good handoff design carries the full context across, so the person never has to start over.
- Hybrid retrieval
- Combining semantic (vector) search with keyword search so retrieval catches both meaning and exact terms like error codes and product names. It usually beats either method on its own.
- Intent
- What a user is actually trying to do, named as a category the system can act on, like 'reset password' or 'cancel order'. It is the unit an intent classifier predicts.
- Intent taxonomy
- The full, organized set of intents a system recognizes, ideally mutually exclusive and built from real conversations rather than guesses. A muddy taxonomy, where two intents overlap or one is too broad, is a common and quiet cause of misrouting.
- Latency
- How long the system takes to respond. It is a design constraint, not an afterthought: a correct answer that arrives too late still loses the user.
- LLM-as-judge
- Using one language model to score another model's output against a rubric. It scales well, but the scores are only trustworthy if you calibrate the judge against a human grader on a sample.
- Multi-agent system
- An architecture where several role-specialized model calls each handle part of a task, coordinated by a thin orchestrator, instead of one prompt trying to do everything.
- Orchestration
- The cross-system work of routing a request, calling tools, tracking state, and handling fallbacks. Most of what people call 'agent' work is really orchestration: the AI doing the steps a human used to do by hand.
- Prompt architecture
- Treating prompts as versioned product specs with tests, kept in source control, instead of strings edited live in production. It lets you change a prompt and see the eval impact before it ships.
- RAG (retrieval-augmented generation)
- A pattern where the system retrieves relevant documents first, then asks the model to answer using them. It grounds answers in real sources instead of model memory.
- Slot filling
- Collecting the specific pieces of information a task needs (a date, an account number, a size) across one or more turns before the system acts.
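To make slot filling concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a production design: the `SlotState` class, the slot names, and the idea that some upstream step has already extracted values from each user turn are all hypothetical. It shows only the bookkeeping: remember what has been collected, know what is still missing, and act once nothing is.

```python
from dataclasses import dataclass, field


@dataclass
class SlotState:
    """Tracks which required slots have been collected across turns.

    Hypothetical helper for illustration; slot names are up to the caller.
    """

    required: tuple[str, ...]
    filled: dict[str, str] = field(default_factory=dict)

    def update(self, extracted: dict[str, str]) -> None:
        # Merge values pulled from the latest turn. Unknown slot names and
        # empty values are ignored; a later value overwrites an earlier one.
        for name, value in extracted.items():
            if name in self.required and value:
                self.filled[name] = value

    def missing(self) -> list[str]:
        # Slots still to ask for, in the order they were declared.
        return [name for name in self.required if name not in self.filled]

    def ready(self) -> bool:
        # True once every required slot has a value, so the system can act.
        return not self.missing()


if __name__ == "__main__":
    state = SlotState(required=("date", "size"))
    state.update({"date": "2024-05-01", "color": "red"})  # "color" is ignored
    print(state.missing())  # ['size']
    state.update({"size": "M"})
    print(state.ready())  # True
```

In a real system the interesting work is upstream (extracting values reliably) and downstream (deciding how to ask for what is missing), but the state itself can stay this small.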
Have a term you want defined here? Send me a note.
Let's build
Tell me what you are building and where it hurts.
I take a small number of engagements each quarter through Intelligent CX Consulting. If what you're reading here sounds like the thing you need, get in touch.
Hiring for a full-time role instead? My resume is at /resume.