KNOWLEDGE BASE

What Makes a Good Eval

When you first start thinking about evals, the instinct is to pick a side: either automate everything so it runs in CI, or have smart people read outputs and call it done. Most teams start in one of these two camps. The problem is that both paths end in a ship decision nobody trusts.

The fix is the 50/50 principle: automated metrics validated against human judgment on outcomes, and human judgment scaled by automation. The two exist to catch each other lying.

Without calibration, the automation-only path breaks the same way every time: the team tunes until the graph trends up, ships, and discovers that "pass" did not mean "good outcome."

What to do instead:

  • Pick a set of real outputs and have humans label them for outcome quality — did the interaction succeed, not how it scored.
  • Run your automated metrics on those same outputs and compare. Where the metric says "pass" and humans say "fail," you have a false positive. Where the metric says "fail" and humans say "good outcome," you have a false negative.
  • Iterate thresholds or judge prompts until that disagreement is small enough to trust. Then your automation is doing what you think it is doing.
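The comparison step above can be sketched in a few lines. This is a minimal illustration, not any particular library's API: assume you have the same outputs labeled twice, once by humans for outcome quality and once by your automated gate.

```python
# Minimal sketch of the calibration step. `human` holds outcome judgments
# ("pass"/"fail") from people who read the outputs; `metric` holds what the
# automated gate said for the same outputs. All names are illustrative.

def calibration_report(human: list[str], metric: list[str]) -> dict[str, float]:
    """Compare automated verdicts against human outcome labels."""
    assert len(human) == len(metric), "labels must cover the same outputs"
    false_pos = sum(1 for h, m in zip(human, metric) if m == "pass" and h == "fail")
    false_neg = sum(1 for h, m in zip(human, metric) if m == "fail" and h == "pass")
    n = len(human)
    return {
        "false_positive_rate": false_pos / n,  # metric said pass, outcome was bad
        "false_negative_rate": false_neg / n,  # metric said fail, outcome was good
        "agreement": (n - false_pos - false_neg) / n,
    }

report = calibration_report(
    human=["pass", "fail", "pass", "pass", "fail"],
    metric=["pass", "pass", "pass", "fail", "fail"],
)
print(report)  # one false positive and one false negative out of five
```

The two disagreement rates are the numbers you iterate on: tighten thresholds or judge prompts, re-run the report, and stop when both are small enough that you would stake a release on them.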

Humans label outcomes — did the caller get the reservation confirmed, did the code suggestion compile and pass tests, did the extraction match the schema — not "is this a 0.72 on metric X." That is the ground truth your metrics must predict.

Two objections come up constantly:

  • "We cannot label everything." You do not need to. You need enough labeled overlap to know whether your automated gate is lying — often on the order of dozens of cases you actually read, not thousands of synthetic rows.
  • "Our LLM judge is smart enough." Maybe — prove it the same way. If it disagrees with human outcome labels at a rate you would not accept from a junior reviewer, it is not ready to own the decision.

Skip this merge and you end up with dashboards nobody references in a ship decision, or a PM manually reading outputs before every release because they do not trust the pipeline.

TL;DR — Automation scales your judgment; humans prove your automation is right. Run both or trust neither.