
Evals are the product.

The teams that ship reliable AI are not the ones with the best models or the cleverest prompts. They are the ones whose evaluation harness got built first, committed to version control, and wired into every release. The eval suite is the most honest product spec a team will ever have.

By Christi Akinwumi · 7 min read · May 2026

The most common artifact missing from an AI system in production is the one that would tell you whether it works. Prompts are tuned. Models are swapped. Retrieval is reindexed. Somewhere in the background, a team is launching features on a surface that nobody has agreed how to measure. When the system behaves well, nobody notices. When it misbehaves, everyone blames the nearest visible component, usually the model, and the cycle starts again. The shape of the missing artifact has a name. It is the eval harness, and in serious AI teams it is no longer optional.

Evals are often framed as quality assurance, something a team applies to a feature after the feature has been built. That framing is the problem. An eval suite is not a testing step. It is the operational definition of what "working" means for a probabilistic system. Without it, a team cannot ship a change without guessing whether the change made things better or worse. With it, the team has a spec that runs. That distinction is the difference between the AI products that compound over time and the ones that drift until someone rewrites them.

Why vibes stopped scaling

In the prototype phase, vibes work. The founder tries the chatbot, nods, and ships it. A few engineers try it, nod, and ship the next version. Manual judgement is doing the work of evaluation, and for a while that is enough. The signals are loud, the system is small, and the stakes are forgiving.

Production breaks that arrangement in three places at once. The volume of traffic exceeds what any human can skim. The diversity of inputs exceeds what the team anticipated. The rate of change exceeds what manual review can keep up with, because every prompt edit, model upgrade, and retrieval tweak can quietly regress a corner of the distribution that nobody was looking at. Gartner reported a 1,445% surge in multi-agent system inquiries between the first quarter of 2024 and the third quarter of 2025.[1] The industry has moved into orchestration at speed. The evaluation practices, in most teams, have not.

The failure mode is predictable. A change that looked good in developer testing ships to production and degrades a segment of users the team never explicitly tested against. Nobody notices for a week. By the time the support tickets arrive, three more changes have landed on top of the regression, and untangling which one caused what becomes an archaeology project. Teams that have been through this once tend to come out the other side believing, sincerely and sometimes expensively, that the eval harness should have existed before the first production user ever arrived.

What an eval actually is

The word eval covers more ground than it admits. In practice, a mature AI team runs three distinct evaluation types, and a healthy team knows which one it is reaching for. Conflating them is where most eval programs lose credibility.

Unit evals

These are assertions against a fixed input and a known good output. The user asks a specific question; the system should return a specific answer or call a specific tool with specific arguments. Unit evals are narrow and boring and indispensable. They are the regression layer. When a prompt edit quietly breaks refund intent routing, the unit eval for refund intent is what tells you before a customer does. Most teams underinvest in this layer because it looks trivial. It is not trivial. It is the floor under everything else.
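
As a sketch, and nothing more authoritative than that, a unit eval can be as plain as a test function. The `run_assistant` entry point and the refund intent names below are illustrative stand-ins, not a particular framework.

```python
# A minimal unit eval: fixed input, asserted behaviour.
# `run_assistant` is a hypothetical entry point into the system under test;
# swap in whatever your harness actually calls.

def run_assistant(message: str) -> dict:
    """Placeholder for the real system under test."""
    raise NotImplementedError

def test_refund_intent_routes_to_refund_tool():
    result = run_assistant("I was charged twice for my order, I want my money back")
    # Deliberately narrow assertions: right tool, right argument shape.
    assert result["tool"] == "issue_refund"
    assert result["arguments"]["category"] == "duplicate_charge"
```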

Task evals

These measure end-to-end behaviour across a representative sample of real tasks. Not a single turn but a full interaction. Did the user accomplish what they came for. Did the system use the right tools in the right order. Did the handoff carry the right context when the conversation escalated to a human. Task evals require labelled trajectories, which is where the work is, and they are what tell you whether your agentic workflow actually agents. They are also where the distinction between a demo and a product becomes visible.
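
A rough sketch of what a labelled trajectory might look like as plain data, with illustrative field names; the real shape depends on what your orchestration layer actually logs.

```python
# A labelled trajectory for a task eval, sketched as plain data.
# Field names are illustrative; the expectation covers the whole interaction,
# not a single turn.

expected = {
    "goal": "customer cancels subscription and receives confirmation",
    "required_tools_in_order": ["lookup_account", "cancel_subscription", "send_confirmation"],
    "handoff_context_must_include": ["account_id", "cancellation_reason"],
}

def check_trajectory(actual_tool_calls: list[str], handoff_context: dict) -> bool:
    # Required tools must appear in order; unrelated calls may interleave.
    remaining = iter(actual_tool_calls)
    in_order = all(tool in remaining for tool in expected["required_tools_in_order"])
    context_carried = all(key in handoff_context for key in expected["handoff_context_must_include"])
    return in_order and context_carried
```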

Rubric evals

Some qualities resist a simple assertion: tone, helpfulness, faithfulness to a retrieved source. A rubric eval defines the criterion in writing and then scores outputs against it. Often the scorer is another language model, which is what the industry has taken to calling LLM-as-judge. This is a legitimate tool and a dangerous one. It is legitimate because it scales. It is dangerous because the judge is a probabilistic component too, with its own biases and its own failure modes. Rubric evals earn their keep when the rubric is strict, the examples are calibrated against human graders, and the judge is audited on a regular cadence. They lose their keep when a team treats the judge's output as ground truth.
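
A minimal sketch of a rubric eval with an LLM judge, assuming a hypothetical `call_judge_model(prompt) -> str` function so the example does not commit to any particular client library.

```python
# A rubric eval with an LLM judge. `call_judge_model` is a hypothetical
# stand-in for whatever model client the team already uses. The rubric is
# written down and versioned so the judge can be audited against human graders.

FAITHFULNESS_RUBRIC = (
    "Score the answer 1 if every factual claim is supported by the retrieved "
    "passage, 0 otherwise. If you score 0, quote the unsupported claim."
)

def judge_faithfulness(answer: str, passage: str, call_judge_model) -> int:
    prompt = f"{FAITHFULNESS_RUBRIC}\n\nPassage:\n{passage}\n\nAnswer:\n{answer}\n\nScore:"
    verdict = call_judge_model(prompt)
    # The judge is probabilistic too: parse defensively rather than trusting its format.
    return 1 if verdict.strip().startswith("1") else 0
```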

Evaluation-driven development

The analogy to test-driven development is tempting and only partly correct. TDD asks the engineer to write the test before the code. In AI systems, writing the eval before the idea is often wasteful, because the interesting failure modes are discovered by watching users interact with a working version. The useful discipline is narrower. Every change to a prompt, a retrieval strategy, or an orchestration pattern should land together with the eval that justifies it. The eval is the reason the change is safe to merge. No eval, no merge.

This works in practice because eval sets are cheaper to grow than to start. The first twenty labelled examples are the hardest to get. The next two hundred come almost for free, because once the harness exists, every production incident, every support escalation, every surprising log line is a candidate example. A team that has internalised this turns every failure into a permanent piece of scaffolding. Regressions become harder to ship than fixes. The eval set compounds the way a good codebase does, and for the same reason: it captures decisions that would otherwise have to be remade every release.
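
One way that promotion step can look in practice, sketched with illustrative field names; the only non-negotiable part is that a human writes down the expected behaviour.

```python
# Promoting a production incident into a permanent eval example.
# `incident` stands in for whatever your logging produces; the labelled
# expectation is the part a human writes during triage.

import json
from pathlib import Path

def promote_to_eval(incident: dict, expected_behaviour: str,
                    path: str = "evals/examples.jsonl") -> None:
    example = {
        "input": incident["user_message"],
        "context": incident.get("retrieved_context", []),
        "expected": expected_behaviour,   # labelled by whoever triaged the incident
        "source": incident["trace_id"],   # keeps the link back to the real event
    }
    with Path(path).open("a") as f:
        f.write(json.dumps(example) + "\n")
```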

A regression budget helps. Not every eval will pass on every change, and demanding one hundred per cent green is how teams learn to ignore the harness. What matters is that regressions are visible, that they are explained, and that the team has a shared sense of what regression is acceptable in exchange for what improvement. The eval harness is a negotiation surface for trade-offs, not a pass-fail gate.
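
A regression budget can be as small as a dictionary and a comparison, something like the sketch below; the thresholds are illustrative, and the real ones are whatever the team agrees to.

```python
# A regression budget as data: a change may regress a suite slightly, but only
# inside an explicit, agreed limit. The numbers here are illustrative.

REGRESSION_BUDGET = {
    "unit": 0.00,    # the regression layer itself never gets to regress
    "task": 0.02,    # up to two points of pass rate, and only with an explanation
    "rubric": 0.05,
}

def within_budget(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    ok = True
    for suite, budget in REGRESSION_BUDGET.items():
        drop = baseline[suite] - candidate[suite]
        if drop > budget:
            print(f"{suite}: pass rate dropped {drop:.1%}, budget is {budget:.1%}")
            ok = False
    return ok
```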

Evals as product spec

The strongest argument for investing in evals early is not operational. It is editorial. The eval suite is, in practice, the clearest spec a product team can produce for an AI feature. Every example in the suite is a statement about what the system should do in a situation the team has considered important enough to codify. Every rubric is a statement about what quality means on this surface. Every threshold is a statement about how close to that quality the team is willing to ship.

This is the same argument I made in Content Engineering in 2026 from a different direction. Structure beats opinion. The evaluation set, like the typed content component, is what lets a team have a real conversation about quality instead of an argument about taste. It is what transforms "this feels off" into "example 47 fails the faithfulness rubric and example 112 passes the tone rubric but fails the task rubric", which is a conversation that can actually be resolved.

The tie to the other failure modes is direct. The feedback failures I described in The AI Isn't Broken are what happen when an organisation has no mechanism for promoting real production moments into an evaluation set. The context problems I walked through in Context Engineering are only solvable if the team can measure whether a given context budget is actually improving the outcome. Evals are the instrument that makes every other AI discipline accountable to something outside its own opinion.

The Monday version

For teams who recognise themselves in this, the first move is unglamorous and fast. Pull the last week of production logs. Pick twenty real user turns, across the intents that matter most. Label what the right behaviour would have been. Commit those twenty examples to the repository in a format a test runner can execute. Add one command that runs them. Add one dashboard that shows the pass rate. That is the harness. It is crude, it is incomplete, and it is already more evaluation infrastructure than most AI features have ever had.
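
That crude first harness might look something like this, assuming a JSONL file of labelled examples and a hypothetical `run_system` hook into the feature under test.

```python
# The crude first harness: a JSONL file of labelled examples, one command,
# one number. The substring check is deliberately crude and gets replaced
# as the suite grows.

import json

def run_harness(examples_path: str, run_system) -> float:
    with open(examples_path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    passed = 0
    for ex in examples:
        output = run_system(ex["input"])
        if ex["expected"].lower() in output.lower():
            passed += 1
        else:
            print(f"FAIL {ex.get('source', '?')}: expected {ex['expected']!r}")
    rate = passed / len(examples)
    print(f"pass rate: {rate:.0%} ({passed}/{len(examples)})")
    return rate
```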

From there the work becomes additive. Each production incident adds an example. Each new feature adds a task trajectory. Each rubric gets calibrated against a human grader once a quarter. The harness grows the way a test suite grows: slowly, stubbornly, and in the direction of the problems the team actually hits. Over a year, the suite becomes the single richest description of what the product is, which is also the single richest description of what the product is for.
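
The quarterly calibration can start as simply as an agreement rate between the judge and a human grader on the same sample, along the lines of this sketch; the field names are illustrative.

```python
# Quarterly judge calibration as a plain agreement rate between the LLM judge
# and a human grader scoring the same outputs. Field names are illustrative.

def judge_agreement(sample: list[dict]) -> float:
    """Each item carries a human_score and a judge_score for one output."""
    matches = sum(1 for item in sample if item["human_score"] == item["judge_score"])
    return matches / len(sample)

# If agreement drifts below whatever threshold the team agreed on, the rubric
# or the judge prompt gets revisited before its scores are trusted again.
```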

Evals are not the boring infrastructure under the product. In AI systems, they are closer to the product than the model is. The model will keep changing. The prompts will keep being rewritten. The retrieval will be reindexed, the orchestration rearchitected, the tools swapped out. What survives every one of those changes, and what tells the team whether the system is still the one they meant to build, is the suite of examples that defines what good looks like. The teams who took evals seriously first will, in two years, look like the teams who took testing seriously in 2010: early, right, and quietly responsible for everything working.

References

  [1] Gartner. Multi-agent systems client inquiry analysis, Q1 2024 – Q3 2025. Gartner Research, 2025. Cited figure: 1,445% increase in client inquiries on multi-agent systems. Verify the exact Gartner publication and add the URL before publishing.
