KNOWLEDGE BASE

What Makes a Good Eval

When you first start thinking about evals, the instinct is to pick a side: either automate everything so it runs in CI, or have smart people read outputs and call it done. Most teams start in one of these two camps. The problem is that both paths end in a ship decision nobody trusts.

The fix is the 50/50 principle: automated metrics validated against human judgment on outcomes, and human judgment scaled by automation. The two exist to catch each other lying.

Without calibration, the automation-only path breaks the same way every time: the team tunes until the graph trends up, ships, and discovers that "pass" did not mean "good outcome."

What to do instead:

  • Pick a set of real outputs and have humans label them for outcome quality — did the interaction succeed, not how it scored.
  • Run your automated metrics on those same outputs and compare. Where the metric says "pass" and humans say "fail," you have a false positive. Where the metric says "fail" and humans say "good outcome," you have a false negative.
  • Iterate thresholds or judge prompts until that disagreement is small enough to trust. Then your automation is doing what you think it is doing.
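The comparison step above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: assume you have the same outputs labeled twice, once by humans for outcome quality and once by your automated gate.

```python
# Minimal sketch of the calibration step. `human` holds outcome judgments
# ("pass"/"fail") from people who read the outputs; `metric` holds what the
# automated gate said for the same outputs. All names are illustrative.

def calibration_report(human: list[str], metric: list[str]) -> dict[str, float]:
    """Compare automated verdicts against human outcome labels."""
    assert len(human) == len(metric), "labels must cover the same outputs"
    false_pos = sum(1 for h, m in zip(human, metric) if m == "pass" and h == "fail")
    false_neg = sum(1 for h, m in zip(human, metric) if m == "fail" and h == "pass")
    n = len(human)
    return {
        "false_positive_rate": false_pos / n,  # metric said pass, outcome was bad
        "false_negative_rate": false_neg / n,  # metric said fail, outcome was good
        "agreement": (n - false_pos - false_neg) / n,
    }

report = calibration_report(
    human=["pass", "fail", "pass", "pass", "fail"],
    metric=["pass", "pass", "pass", "fail", "fail"],
)
print(report)  # one false positive and one false negative out of five
```

The two disagreement rates are the numbers you iterate on: tighten thresholds or judge prompts, re-run the report, and stop when both are small enough that you would stake a release on them.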

Humans label outcomes — did the caller get the reservation confirmed, did the code suggestion compile and pass tests, did the extraction match the schema — not "is this a 0.72 on metric X." That is the ground truth your metrics must predict.

Two objections come up constantly:

  • "We cannot label everything." You do not need to. You need enough labeled overlap to know whether your automated gate is lying — often on the order of dozens of cases you actually read, not thousands of synthetic rows.
  • "Our LLM judge is smart enough." Maybe — prove it the same way. If it disagrees with human outcome labels at a rate you would not accept from a junior reviewer, it is not ready to own the decision.

Skip this merge and you end up with dashboards nobody references in a ship decision, or a PM manually reading outputs before every release because they do not trust the pipeline.

TL;DR — Automation scales your judgment; humans prove your automation is right. Run both or trust neither.