Evaluations are the product

Christi Akinwumi

The most common thing missing from an AI system in production is the one piece that would tell you whether it works. Prompts get tuned. Models get swapped. Retrieval gets reindexed. And somewhere in the background, a team is launching features on a surface nobody has agreed how to measure. When the system behaves, nobody notices. When it misbehaves, everyone blames the nearest visible part, usually the model, and the cycle starts again. That missing piece has a name. It's the eval harness, and on serious AI teams it's no longer optional.

Evals usually get framed as quality assurance, something you apply to a feature after the feature is built. That framing is the problem. An eval suite isn't a testing step. It's the operational definition of what "working" means for a probabilistic system. Without it, you can't release a change without guessing whether you made things better or worse. With it, you have a spec that runs. That's the difference between the AI products that compound over time and the ones that drift until someone rewrites them.

Why vibes stopped scaling

In the prototype phase, vibes work. The founder tries the chatbot, nods, and releases it. A few engineers try it, nod, and release the next version. Manual judgement is doing the work of evaluation, and for a while that's enough. The signals are loud, the system is small, and the stakes are forgiving.

Production breaks that arrangement in three places at once. Traffic volume outruns what any human can skim. Input diversity outruns what the team anticipated. And the rate of change outruns what manual review can keep up with, because every prompt edit, model upgrade, and retrieval tweak can quietly regress a corner of the distribution nobody was looking at. Gartner reported a 1,445% surge in multi-agent system inquiries between the first quarter of 2024 and the second quarter of 2025.^[1] The industry has moved into orchestration at speed. In most teams, the evaluation practices haven't.

The failure mode is predictable. A change that looked good in developer testing reaches production and degrades a segment of users the team never explicitly tested against. Nobody notices for a week. By the time the support tickets arrive, three more changes have landed on top of the regression, and untangling which one caused what becomes an archaeology project. Teams that have been through this once tend to come out the other side believing, sincerely and sometimes expensively, that the eval harness should have existed before the first production user ever showed up.

What an eval actually is

The word eval covers more ground than it admits. In practice, a mature AI team runs three distinct evaluation types, and a healthy team knows which one it's reaching for. Conflating them is where most eval programs lose credibility.

Unit evals

These are assertions against a fixed input and a known good output. The user asks a specific question, and the system should return a specific answer or call a specific tool with specific arguments. Unit evals are narrow and boring and indispensable. They're your regression layer. When a prompt edit quietly breaks refund intent routing, the unit eval for refund intent is what tells you before a customer does. Most teams underinvest here because it looks trivial. It isn't. It's the floor under everything else.
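A minimal sketch of what that regression layer might look like, assuming the system exposes a single routing function. The names `route`, `ToolCall`, `start_refund`, and `general_help` are hypothetical stand-ins, and the keyword matching only fakes a model call so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """The tool the system chose and the arguments it passed."""
    name: str
    arguments: dict = field(default_factory=dict)


def route(message: str) -> ToolCall:
    """Hypothetical stand-in for the system under test (normally a model call)."""
    text = message.lower()
    if "refund" in text or "money back" in text:
        return ToolCall("start_refund", {"reason": "customer_request"})
    return ToolCall("general_help")


# Each case pins a fixed input to a known good tool call.
UNIT_CASES = [
    ("I want a refund for my last order",
     ToolCall("start_refund", {"reason": "customer_request"})),
    ("Can I get my money back?",
     ToolCall("start_refund", {"reason": "customer_request"})),
    ("What are your opening hours?", ToolCall("general_help")),
]


def run_unit_evals(system=route, cases=UNIT_CASES):
    """Return the (input, expected, actual) triples that failed."""
    failures = []
    for message, expected in cases:
        actual = system(message)
        if actual != expected:
            failures.append((message, expected, actual))
    return failures
```

An empty list from `run_unit_evals()` means the floor held. Anything else names the exact input that broke and what the system did instead.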

Task evals

These measure end-to-end behaviour across a representative sample of real tasks. Not a single turn but a full interaction. Did the user accomplish what they came for? Did the system use the right tools in the right order? Did the handoff carry the right context when the conversation escalated to a human? Task evals need labelled trajectories, which is where the work is, and they're what tell you whether your agentic workflow actually agents. They're also where the line between a demo and a product becomes visible.
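One way to score a labelled trajectory against the three questions above, sketched under assumptions: the trajectory format (a list of step dictionaries), the `handoff_to_human` tool name, and the label fields are all invented for illustration, not a standard schema.

```python
def tools_in_order(trajectory, expected_tools):
    """True if expected_tools appear in order; other steps may be interleaved."""
    called = iter(step["tool"] for step in trajectory)
    # `tool in called` consumes the iterator, so order is enforced.
    return all(tool in called for tool in expected_tools)


def handoff_has_context(trajectory, required_keys):
    """True if every handoff step carries all the required context keys."""
    handoffs = [s for s in trajectory if s["tool"] == "handoff_to_human"]
    return all(required_keys <= set(s.get("context", {})) for s in handoffs)


def score_task(trajectory, label):
    """Score one full interaction against its human-written label."""
    return {
        "goal_reached": trajectory[-1].get("outcome") == label["expected_outcome"],
        "tools_in_order": tools_in_order(trajectory, label["expected_tools"]),
        "handoff_context": handoff_has_context(
            trajectory, set(label["handoff_keys"])
        ),
    }


# A hypothetical labelled trajectory for a refund that escalates to a human.
trajectory = [
    {"tool": "lookup_order", "args": {"order_id": "A1"}},
    {"tool": "check_refund_policy", "args": {}},
    {
        "tool": "handoff_to_human",
        "context": {"order_id": "A1", "summary": "refund outside window"},
        "outcome": "escalated",
    },
]
label = {
    "expected_tools": ["lookup_order", "handoff_to_human"],
    "handoff_keys": ["order_id", "summary"],
    "expected_outcome": "escalated",
}
scores = score_task(trajectory, label)
```

Keeping the three checks separate matters: a run that reaches the goal with the tools out of order is a different failure from one that hands off without context.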

Rubric evals

Some qualities (tone, helpfulness, faithfulness to a retrieved source) resist a simple assertion. A rubric eval defines the criterion in writing, then scores outputs against it. Often the scorer is another language model, which is what the industry has taken to calling LLM-as-judge. It's a legitimate tool and a dangerous one. Legitimate because it scales. Dangerous because the judge is a probabilistic component too, with its own biases and its own failure modes. Rubric evals earn their keep when the rubric is strict, the examples are calibrated against human graders, and you audit the judge on a regular cadence. They lose their keep the moment a team treats the judge's output as ground truth.
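Auditing the judge can be as simple as comparing its verdicts with human grades on the same outputs. The sketch below uses raw agreement and Cohen's kappa, which corrects agreement for chance. Kappa is one common choice rather than the only one, and the sample labels are made up for illustration.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of examples where the judge matches the human grader."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance alone."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_counts[c] * judge_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both graders used a single identical label; nothing to correct.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative grades for the same four outputs on one rubric.
human_grades = ["pass", "pass", "fail", "fail"]
judge_grades = ["pass", "fail", "fail", "fail"]

raw = agreement_rate(human_grades, judge_grades)
kappa = cohens_kappa(human_grades, judge_grades)
```

Raw agreement flatters a judge when one label dominates, which is why the chance-corrected number is worth tracking alongside it. What threshold counts as acceptable is a team decision, not something this sketch settles.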

Evaluation-driven development

The analogy to test-driven development is tempting and only partly right. TDD asks the engineer to write the test before the code. In AI systems, writing the eval before the idea is often wasteful, because you discover the interesting failure modes by watching users interact with a working version. The useful discipline is narrower. Every change to a prompt, a retrieval strategy, or an orchestration pattern should land together with the eval that justifies it. The eval is the reason the change is safe to merge. No eval, no merge.

This works in practice because eval sets are cheaper to grow than to start. The first twenty labelled examples are the hardest to get. The next two hundred come almost for free, because once the harness exists, every production incident, every support escalation, every surprising log line is a candidate example. A team that has internalised this turns every failure into a permanent piece of scaffolding. Regressions become harder to release than fixes. The eval set compounds the way a good codebase does, and for the same reason: it captures decisions you'd otherwise have to remake every release.
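Turning a failure into scaffolding can be a single small function. This sketch assumes an incident record with `ticket` and `user_input` fields and a suite stored as a list of dictionaries. Both shapes are hypothetical.

```python
def promote_incident(suite, incident, expected_behaviour):
    """Return a new suite with the incident added as a labelled example.

    Inputs already present in the suite are skipped so it does not
    fill up with duplicates of the same failure.
    """
    if any(example["input"] == incident["user_input"] for example in suite):
        return suite
    example = {
        "id": f"incident-{incident['ticket']}",
        "input": incident["user_input"],
        "expected": expected_behaviour,
        "source": "production_incident",
    }
    return suite + [example]


suite = [{"id": "refund-001", "input": "I want a refund", "expected": "refund"}]
incident = {"ticket": "4812", "user_input": "charge me back for order A1"}
suite = promote_incident(suite, incident, expected_behaviour="refund")
```

The labelling step, deciding what the right behaviour would have been, still needs a person. The function only makes sure that decision is recorded once and replayed on every later change.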

A regression budget helps. Not every eval will pass on every change, and demanding one hundred per cent green is how teams learn to ignore the harness. What matters is that regressions are visible, that they're explained, and that the team shares a sense of what regression is acceptable in exchange for what improvement. The eval harness is a negotiation surface for trade-offs, not a pass-fail gate.
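A regression budget can be made concrete by diffing two runs of the same suite. In this sketch each run is a mapping from example id to pass or fail, and the budget of one explained regression is an arbitrary illustrative number, not a recommendation.

```python
def diff_runs(baseline, candidate):
    """Compare two runs, each mapping example id -> passed (bool)."""
    shared = baseline.keys() & candidate.keys()
    regressions = sorted(i for i in shared if baseline[i] and not candidate[i])
    fixes = sorted(i for i in shared if not baseline[i] and candidate[i])
    return regressions, fixes


def within_budget(regressions, explained, max_regressions):
    """A change is acceptable if every regression is explained and few enough."""
    unexplained = [i for i in regressions if i not in explained]
    return not unexplained and len(regressions) <= max_regressions


baseline = {"refund-001": True, "tone-007": True, "handoff-003": False}
candidate = {"refund-001": True, "tone-007": False, "handoff-003": True}

regressions, fixes = diff_runs(baseline, candidate)

# Hypothetical note a reviewer attaches to the change.
explained = {"tone-007": "shorter replies read as curt; accepted for the handoff fix"}
acceptable = within_budget(regressions, explained, max_regressions=1)
```

The output is a short list of named trade-offs rather than a single red or green light, which is what makes the negotiation possible.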

Evals as product spec

The strongest argument for investing in evals early isn't operational. It's editorial. The eval suite is, in practice, the clearest spec a product team can produce for an AI feature. Every example in the suite is a statement about what the system should do in a situation the team considered important enough to codify. Every rubric is a statement about what quality means on this surface. Every threshold is a statement about how close to that quality the team is willing to release.

This is the same argument I made in Content Engineering in 2026 from a different direction. Structure beats opinion. The evaluation set, like the typed content component, is what lets a team have a real conversation about quality instead of an argument about taste. It's what turns "this feels off" into "example 47 fails the faithfulness rubric and example 112 passes the tone rubric but fails the task rubric", which is a conversation you can actually resolve.

The tie to the other failure modes is direct. The feedback failures I described in The AI Isn't Broken are what happen when an organisation has no mechanism for promoting real production moments into an evaluation set. The context problems I walked through in Context Engineering are only solvable if you can measure whether a given context budget is actually improving the outcome. Evals are the instrument that makes every other AI discipline accountable to something outside its own opinion.

The Monday version

If you recognise yourself in this, the first move is unglamorous and fast. Pull the last week of production logs. Pick twenty real user turns, across the intents that matter most. Label what the right behaviour would have been. Commit those twenty examples to the repository in a format a test runner can execute. Add one command that runs them. Add one dashboard that shows the pass rate. That's your harness. It's crude, it's incomplete, and it's already more evaluation infrastructure than most AI features have ever had.
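The steps above fit in one file. This is a sketch, not a prescribed layout: the JSONL fields, the intent labels, and `classify_intent` (a keyword stub standing in for the real system) are all assumptions, and three examples stand in for the twenty. In a repository the examples would live in their own file and the script would be the one command.

```python
import json

# Labelled examples, one JSON object per line (JSONL).
EXAMPLES_JSONL = """\
{"id": "refund-001", "input": "I want my money back", "expected_intent": "refund"}
{"id": "hours-001", "input": "When do you open?", "expected_intent": "opening_hours"}
{"id": "cancel-001", "input": "Please cancel my subscription", "expected_intent": "cancel"}
"""


def classify_intent(text):
    """Hypothetical placeholder for the production system."""
    lowered = text.lower()
    if "money back" in lowered or "refund" in lowered:
        return "refund"
    if "open" in lowered:
        return "opening_hours"
    return "other"


def load_examples(jsonl_text):
    """Parse one labelled example per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def run_harness(examples, system):
    """Run every example and return per-example results plus the pass rate."""
    results = {
        ex["id"]: system(ex["input"]) == ex["expected_intent"] for ex in examples
    }
    pass_rate = sum(results.values()) / len(results)
    return results, pass_rate


if __name__ == "__main__":
    results, pass_rate = run_harness(load_examples(EXAMPLES_JSONL), classify_intent)
    for example_id, passed in results.items():
        print(f"{'PASS' if passed else 'FAIL'}  {example_id}")
    print(f"pass rate: {pass_rate:.0%}")
```

The stub deliberately misses the cancellation example, so the first run already shows a failure worth discussing. The dashboard can start as nothing more than that last printed line tracked over time.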

From there the work becomes additive. Each production incident adds an example. Each new feature adds a task trajectory. Each rubric gets calibrated against a human grader once a quarter. The harness grows the way a test suite grows: slowly, stubbornly, and in the direction of the problems the team actually hits. Over a year, the suite becomes the single richest description of what the product is, which is also the single richest description of what the product is for.

Evals aren't the boring infrastructure under the product. In AI systems, they're closer to the product than the model is. The model will keep changing. The prompts will keep getting rewritten. The retrieval will be reindexed, the orchestration rearchitected, the tools swapped out. What survives every one of those changes, and what tells you whether the system is still the one you meant to build, is the suite of examples that defines what good looks like. The teams who took evals seriously first will, in two years, look like the teams who took testing seriously in 2010: early, right, and quietly responsible for everything working.

References

[1] Gartner. Multiagent Systems in Enterprise AI: Efficiency, Innovation and Vendor Advantage. Gartner, 2025. Reports a 1,445% increase in client inquiries on multi-agent systems, Q1 2024 to Q2 2025.
