The eval discipline that keeps AI products honest

Most AI products in 2026 are shipping without proper evals. The teams that have figured it out are quietly running circles around the ones still relying on "looks good to me" as a release gate. Six months ago this was a niche operator concern. Today it is the single largest gap between AI products that are working and AI products that are silently regressing.

This essay describes the discipline that actually works in production. Not a framework. A discipline. The distinction matters.

What an eval actually is

Strip the jargon and an eval is simply: a fixed input, a desired property of the output, and a way to measure whether the property holds. In the simplest case the input is "summarise this contract" and the property is "extracts all parties named in section 1, in order, with no fabrications." The measurement might be a strict regex check, a programmatic schema validation, or another model grading the answer.
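That three-part definition can be sketched in a few lines. This is a minimal illustration, not a prescribed structure: the names EvalCase, run_case, names_in_order and fake_model are hypothetical, and fake_model stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    input_text: str               # the fixed input
    check: Callable[[str], bool]  # measures whether the desired property holds


def names_in_order(output: str, names: list[str]) -> bool:
    """True if every name appears in the output, in the given order."""
    positions = [output.find(n) for n in names]
    return all(p >= 0 for p in positions) and positions == sorted(positions)


def run_case(case: EvalCase, model: Callable[[str], str]) -> bool:
    """Run one eval: call the model on the fixed input, apply the check."""
    return case.check(model(case.input_text))


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "Parties: Acme Ltd; Borealis GmbH"


parties_case = EvalCase(
    name="contract-parties-in-order",
    input_text="summarise this contract",
    check=lambda out: names_in_order(out, ["Acme Ltd", "Borealis GmbH"]),
)
```

The check here is a plain string test. A schema validator or a second model acting as judge would slot into the same check field.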

The reason this matters in AI products is that the underlying models are non-deterministic in ways traditional testing assumed away. The same prompt with the same temperature can produce three different outputs across three runs, all plausible, with subtly different errors. A unit test that runs once and passes proves almost nothing about production behaviour.
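One way to account for this is to run the same case many times and look at the pass rate rather than a single result. The sketch below assumes a hypothetical make_cycling_model helper that imitates a non-deterministic model by cycling through canned outputs. A real suite would call the actual model instead.

```python
from itertools import cycle
from typing import Callable


def pass_rate(model: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], runs: int = 20) -> float:
    """Call the model `runs` times on one fixed prompt; return the fraction that pass."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    return sum(check(model(prompt)) for _ in range(runs)) / runs


def make_cycling_model(outputs: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a non-deterministic model: cycles through canned outputs."""
    it = cycle(outputs)
    return lambda prompt: next(it)


flaky = make_cycling_model(["England and Wales", "England and Wales", "Scotland"])
rate = pass_rate(
    flaky,
    "Which law governs this contract?",
    lambda out: out == "England and Wales",
    runs=30,
)
```

A single run of this stand-in would pass two times out of three, which is exactly the false comfort a one-shot unit test gives.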

Evals are the only mechanism we have found that converts model behaviour from "vibe-tested" to "engineered." They are not optional for products that touch revenue, regulated processes, or customer trust. We treat them with the same gravity a 2018 team would have given database migrations.

The four eval categories we actually use

Over the last eighteen months of running AI features in production across the companies we operate, we have settled on four categories of eval that, between them, catch nearly everything. We do not run all four for every feature — that would be theatre — but every feature gets at least one category and most get two.

Category one: schema evals. The model has to return output in a specific shape. We test that the shape holds across, typically, 50 to 200 fixed inputs. This is the cheapest eval to run and it catches the most embarrassing failures (model returns prose instead of JSON, returns the right JSON but with a missing required field, hallucinates fields not in the schema). Most products that only run schema evals are still better than the median, which is no evals at all.
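A schema eval needs nothing beyond the standard library. The sketch below is illustrative only: the REQUIRED_FIELDS schema and the field names are assumptions, and a real suite might use a schema-validation library instead.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"parties": list, "governing_law": str}


def schema_errors(raw: str, required: dict = REQUIRED_FIELDS) -> list[str]:
    """Return a list of shape problems in the model output; empty means it passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]  # e.g. the model returned prose
    if not isinstance(data, dict):
        return ["top level is not an object"]
    errors = []
    for field, expected_type in required.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for field: {field}")
    # Fields the schema does not define count as hallucinated.
    for field in sorted(data.keys() - required.keys()):
        errors.append(f"unexpected field: {field}")
    return errors
```

The three checks map onto the three failures named above: prose instead of JSON, a missing required field, and fields that are not in the schema.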

Category two: factual evals. The model has to extract or assert something specific about the input. We have a fixed input, we know what the right answer is, and we check the model produced it. "This contract is governed by the laws of England and Wales" — yes or no. "The party named in section 3.2 is X Limited" — yes or no. These evals require human labour to build the gold set but, once built, run forever. The compounding return is enormous.
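A factual eval is a loop over a gold set. In the sketch below the gold rows, the contract identifier and the extract callable are all hypothetical. The comparison ignores case and whitespace, which is an assumption rather than a rule.

```python
from typing import Callable

# Hypothetical gold set: each row is a question with a known right answer.
GOLD_SET = [
    {"contract_id": "c-001", "field": "governing_law", "expected": "England and Wales"},
    {"contract_id": "c-001", "field": "party_3_2", "expected": "X Limited"},
]


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not fail the eval."""
    return " ".join(text.lower().split())


def factual_failures(extract: Callable[[str, str], str], gold: list[dict]) -> list[tuple]:
    """Return (contract_id, field, got) for every gold row the extractor gets wrong."""
    failures = []
    for row in gold:
        got = extract(row["contract_id"], row["field"])
        if normalise(got) != normalise(row["expected"]):
            failures.append((row["contract_id"], row["field"], got))
    return failures
```

An empty list means every gold answer was reproduced. Anything else is a concrete, named failure to investigate.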

Category three: behaviour evals. The model has to behave in a way that is not directly checkable but that we can measure via a second model as judge. The classic example: "rewrite this email to be more polite without changing the meaning." Politeness is fuzzy. Meaning preservation is fuzzy. But you can have a second model score both axes on a 1-5 scale and reject anything below a threshold. The second model is wrong sometimes; the aggregate over 200 examples is reliable.
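The judge pattern can be sketched as a per-example threshold plus an aggregate pass rate. The judge callable below is hypothetical and would wrap a call to the second model. The threshold of 4 and the two axis names are assumptions chosen for illustration.

```python
from typing import Callable

# A judge takes (original, rewrite) and returns a 1-5 score per axis.
Judge = Callable[[str, str], dict]


def passes(scores: dict, threshold: int = 4) -> bool:
    """An example passes only if every axis is a valid 1-5 score at or above the threshold."""
    values = list(scores.values())
    return bool(values) and all(1 <= v <= 5 for v in values) and min(values) >= threshold


def behaviour_pass_rate(pairs: list[tuple[str, str]], judge: Judge,
                        threshold: int = 4) -> float:
    """Fraction of (original, rewrite) pairs the judge scores at or above the threshold."""
    if not pairs:
        raise ValueError("need at least one example")
    return sum(passes(judge(o, r), threshold) for o, r in pairs) / len(pairs)


def fake_judge(original: str, rewrite: str) -> dict:
    # Hypothetical stand-in for a second model: rewards "please", assumes meaning is kept.
    return {"politeness": 5 if "please" in rewrite.lower() else 2, "meaning": 5}
```

The release gate then compares the aggregate rate against a floor, so one wrong judgement does not block a change on its own.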

Category four: regression evals. A snapshot of the previous good behaviour of the model on a fixed input set, used to catch regressions when you change a prompt, switch a model, or update a tool. This is the most underused category and it is the one that has saved us the most actual money. Every time we change a prompt, the regression suite runs. If any of the 200 fixed cases diverges from its snapshot by more than a set threshold, the change is blocked. It is the AI equivalent of a CI test suite.
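A snapshot comparison can be sketched as follows. The essay does not say how divergence is measured, so the character-level similarity from the standard library's difflib and the 0.2 threshold are assumptions. An embedding distance or a field-by-field comparison would fit the same structure.

```python
from difflib import SequenceMatcher


def divergence(old: str, new: str) -> float:
    """0.0 for identical strings, approaching 1.0 as they share less text."""
    return 1.0 - SequenceMatcher(None, old, new).ratio()


def regressions(snapshot: dict, current: dict, threshold: float = 0.2) -> list[str]:
    """Return the ids of cases whose current output drifted past the threshold.

    A case missing from `current` is compared against an empty string,
    so it is reported as a regression rather than silently skipped.
    """
    return sorted(
        case_id
        for case_id, old in snapshot.items()
        if divergence(old, current.get(case_id, "")) > threshold
    )
```

A non-empty result is what blocks the change. The snapshot itself is only updated deliberately, when new behaviour has been reviewed and accepted.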

Two of those four categories will get you most of the way for most features. Pick the two that match what your product breaks on, not the two that sound most prestigious.

The discipline, in five rules

The framework matters less than the discipline of using it. The five rules we hold our team to:

Rule one: write the eval before the feature ships. If a feature is going into production without a passing eval, the feature does not go into production. This is the bright line. The instinct to ship now and add evals later is the instinct that leads to silently regressing systems six months out. Resist.

Rule two: every customer-reported bug becomes an eval. When a customer reports that the AI did the wrong thing, the first step is to convert the failure into a deterministic eval case that fails today. Then you fix it. Then the eval passes. This is the same loop as test-driven development, applied to AI. It compounds rapidly. After 18 months we have around 800 evals across the companies, almost all of them generated this way.

Rule three: evals run on every change, not on a schedule. A nightly eval run is useless. A pre-merge eval run is the actual gate. CI runs the evals. The PR fails if the evals fail. The work of getting this set up takes about a day. The work it saves over a year is dramatically more.
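The gate itself can be as small as a function that returns a process exit code. In this sketch the registry of named evals is hypothetical, and each entry is a zero-argument callable that returns True on pass.

```python
from typing import Callable


def gate(evals: dict) -> int:
    """Run every registered eval; return 0 if all pass, 1 otherwise."""
    failed = [name for name, run in evals.items() if not run()]
    for name in failed:
        print(f"FAIL {name}")
    print(f"{len(evals) - len(failed)}/{len(evals)} evals passed")
    return 1 if failed else 0


# Hypothetical registry; real entries would call the model and check its output.
EVALS: dict[str, Callable[[], bool]] = {
    "schema-contract-parse": lambda: True,
    "factual-governing-law": lambda: True,
}
```

A CI job would call sys.exit(gate(EVALS)) as its last step. Most CI systems treat a non-zero exit code as a failed check, which is what blocks the merge.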

Rule four: keep the gold set small and curated. Hundreds of evals is usually enough. Thousands is rare. The instinct to grow the eval set unboundedly is wrong because each additional eval has diminishing marginal value and the cost of running the suite grows linearly. We hold ourselves to roughly 200 evals per feature, ruthlessly curating which ones stay.

Rule five: never write evals against a single model's output. The point of evals is to compare a model's output to a ground truth, not to lock in current behaviour. A subtle but common mistake is "the eval passes because the eval was written by reading the model output." Always write the eval from the spec, not from the model's current behaviour.

Tooling, with honest notes

The honest version of tooling for evals in 2026 is that you do not need much.

We have tried most of the popular eval frameworks. They are all fine. They are also mostly overkill for the work most teams need to do. A spreadsheet of inputs and expected outputs, plus a 50-line script that runs them against the model and prints pass/fail, is genuinely sufficient for most products' first six months. The temptation to install a full eval platform on day one is usually wrong.
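For a sense of scale, here is a sketch of that script, assuming the spreadsheet is exported as CSV with two columns named input and expected. The column names, the sample rows and the containment check are assumptions. The model argument is any callable that takes a prompt and returns text.

```python
import csv
import io
from typing import Callable

# Hypothetical sheet contents; in practice this would be read from an exported file.
SHEET = """input,expected
Which law governs contract c-001?,England and Wales
Who is the party in section 3.2 of c-001?,X Limited
"""


def run_sheet(csv_text: str, model: Callable[[str], str]) -> list[tuple[str, bool]]:
    """Run every row against the model and print PASS or FAIL per row."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        output = model(row["input"])
        # Assumption: a row passes if the expected text appears in the output.
        ok = row["expected"].strip().lower() in output.lower()
        print("PASS" if ok else "FAIL", row["input"])
        results.append((row["input"], ok))
    return results
```

Swapping the containment check for an exact match or a schema check is a one-line change, which is much of the appeal of starting this small.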

When you outgrow the spreadsheet, typically when you have more than 50 evals or more than one feature being evaluated, the next step is to use Promptfoo or OpenAI Evals, or to roll your own. We have done all three. Rolling your own is fastest if you have a competent engineer and the tolerance to maintain it. Promptfoo is the lowest-friction path if you want batteries included.

The expensive eval platforms — the ones marketed as enterprise tooling — are almost always wrong for the stage of company that needs to be running evals. They are calibrated for teams that already have evals working at scale and now want them prettier. Buy them at year three, not year one.

What this looks like in our companies, specifically

To make this concrete, the AI feature stack at Homemove runs approximately:

220 schema evals across the structured-extraction features (contract parsing, listing classification, customer triage).
180 factual evals specifically for the contract-parsing model — these were largely built from customer-reported bugs and historical edge cases.
80 behaviour evals for the customer-facing chat assistant — these use a second model as judge and run weekly.
400+ regression evals capturing the current good behaviour across all features. These are the most numerous because they are the cheapest to maintain.

The whole suite runs in under five minutes on every PR. It blocks merges. It has caught roughly forty silent regressions in the last twelve months, every one of which would have shipped to production had the suite not been in place. The cost of building the suite was, in retrospect, a few weeks of two engineers' time. The return is impossible to argue with.

Why this matters for the Academy

We teach evals as part of the Coding pillar of the Academy curriculum because, in our hiring, evals are the single largest skill gap among graduate AI engineers in 2026. Graduates who have built impressive AI demos almost never know how to keep those demos from breaking in production. The students who leave our Academy understand this gap, have built their own eval suite for the product they shipped during the cohort, and have an answer to "how do you know it won't regress" that most hiring managers will not have heard before.

If you are an employer reading this and you are interviewing AI engineering candidates: ask them about evals. It is the single most discriminating question we know.

The Moonlabs Academy is a twelve-week cohort course in Derby. Evals are taught as part of the Coding pillar in week 4. Cursor + Claude Code + an eval discipline is the modern operator stack.


James Freestone
