0.III: Foundations

What Should I Measure?

This is probably the most common question we get asked. Usually, folks come to us asking right away what metrics they should use, without telling us what use case they're building or how they built it. Is this an agent? Is this multi-turn? Is this user-facing? Our honest answer is that we have no clue. We can't know what metrics you should use because it's not yet clear what constitutes "good" for your app. This ties back to the previous section of this chapter, where we talked about the importance of writing good evaluations.

In fact, the problem can get worse: folks who turned on metrics and overcommitted to them way too early often come to us disappointed that the metrics don't mean anything, that they don't trust them, or that they don't even know what the metrics mean. To me, that last one is the most ridiculous. How could you turn on metrics and burn tokens on something you don't even understand? That's like your AI app outputting gibberish but continuing to generate new outputs just because it feels like the right thing to do.

So what should you do instead? There are really two checkboxes to tick before you can decide what to measure. The first is the business outcome we talked about in the previous section. The second is understanding the failure modes in your current AI app. If your app is already in production, even better: you're observing live users. If not, get anyone available, from the engineers building it to managers acting as a third pair of eyes, to test it out.

Ideally, everyone sits down together and figures out what the failure modes are. This refers to a process called error analysis, which we'll define in chapters two and three of this playbook.
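To make the idea concrete, here's a minimal sketch of what the output of such a failure-mode review might look like. Everything here is hypothetical: the trace data, the label names, and the `tally_failure_modes` helper are illustrative, not part of any library. The point is simply that counting labeled failures tells you which behavior is worth a metric first.

```python
from collections import Counter

# Hypothetical reviewed traces: each reviewer tags a trace with the
# failure mode(s) they observed (empty list = no failure found).
annotated_traces = [
    {"input": "Cancel my order", "failure_modes": ["wrong_tool_call"]},
    {"input": "What's your refund policy?", "failure_modes": []},
    {"input": "Summarize this doc", "failure_modes": ["hallucinated_citation", "too_verbose"]},
    {"input": "Book a meeting for Friday", "failure_modes": ["wrong_tool_call"]},
]

def tally_failure_modes(traces):
    """Count how often each failure mode appears across reviewed traces."""
    counts = Counter()
    for trace in traces:
        counts.update(trace["failure_modes"])
    return counts

print(tally_failure_modes(annotated_traces).most_common())
# [('wrong_tool_call', 2), ('hallucinated_citation', 1), ('too_verbose', 1)]
```

In this made-up sample, `wrong_tool_call` dominates, so a tool-calling metric would be the first thing worth turning on, rather than a generic quality score chosen up front.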