ChatGPT, the leading AI chatbot, has soared in popularity over the past year thanks to the seemingly omniscient GPT-4. Its ability to generate coherent, humanlike responses to previously unseen contexts has accelerated the development of other foundational large language models (LLMs), such as Anthropic’s Claude, Google’s Bard, and Meta’s open-source LLaMA model. This, in turn, has enabled ML engineers to build retrieval-based LLM applications around proprietary data like never before. But these applications continue to suffer from hallucinations, struggle to stay up to date with the latest information, and don’t always respond relevantly to prompts.
In this article, as the founder of Confident AI, the world’s first open-source evaluation infrastructure for LLM applications, I will outline how to evaluate LLM and retrieval pipelines, different workflows you can employ for evaluation, and the common pitfalls when building RAG applications that evaluation can solve.
Before we begin, does your current approach to evaluation look something like the code snippet below? You loop through a list of prompts, run your LLM application on each one of them, wait a minute or two for it to finish executing, manually inspect everything, and try to evaluate the quality of the output based on each input.
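If I had to guess, it looks something like this hypothetical sketch, where `query_llm_app` is a stand-in for whatever pipeline you’ve built:

```python
# A hypothetical version of the "eyeball it" workflow described above.
def query_llm_app(prompt: str) -> str:
    ...  # placeholder: call your retrieval pipeline + LLM here

prompts = [
    "What is our refund policy?",
    "Summarize the onboarding doc for new hires.",
    "How do I reset my password?",
]

for prompt in prompts:
    output = query_llm_app(prompt)
    print(f"Input: {prompt}\nOutput: {output}\n" + "-" * 40)
    # ...then you squint at each output and decide whether it "looks right"
```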
If this sounds familiar, you desperately need this article. (And hopefully, by the end of it, you’ll know how to stop eyeballing results.)
Evaluation is an involved process but has huge downstream benefits as you look to iterate on your LLM application. Building an LLM system without evaluations is akin to building a distributed backend system without any automated testing — although it might work at first, you’ll end up wasting more time fixing breaking changes than building the actual thing. (Fun fact: Did you know that AI-first applications suffer from a much lower one-month retention because users don’t revisit flaky products?)
By the way, if you're looking to get a better general sense of what LLM evaluation is, here is another great read.
The first step to any successful evaluation workflow for LLM applications is to create an evaluation dataset, or at least have a vague idea of the type of inputs your application is going to get. It might sound fancy and like a lot of work, but the truth is you’re probably already doing it as you eyeball outputs.
Let’s consider the eyeballing example above. Correct me if I’m wrong, but what you’re really trying to do is to judge an output based on what you’re expecting. You probably already know something about the knowledge base you’re working with and are likely aware of what retrieval results you expect to see should you also choose to print out the retrieved text chunks in your retrieval pipeline. The initial evals dataset doesn’t have to be comprehensive, but start by writing down a set of QAs with the relevant context:
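For example (the questions and contexts below are made up; swap in QAs drawn from your own knowledge base):

```python
# An illustrative evals dataset; every field except "input" is optional.
evals_dataset = [
    {
        "input": "What is the grace period for late payments?",
        "expected_output": "Payments can be made up to 10 days after the due date without penalty.",
        "context": "Section 4.2: A 10-day grace period applies to all late payments...",
    },
    {
        "input": "Who do I contact to request a refund?",
        "expected_output": "Refund requests should go to the billing team at billing@example.com.",
        "context": "Refunds are handled by the billing team (billing@example.com)...",
    },
]
```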
Here, the “input” is mandatory, but “expected_output” and “context” are optional (you’ll see why later).
If you wish to automate things, you can generate an evals dataset by looping through your knowledge base (which could be in a vector database like Qdrant) and asking GPT-3.5 to generate a set of QAs instead of writing them yourself. GPT-3.5 is flexible, versatile, and fast, but limited by the data it was trained on. (Ironically, you’re more likely to care about evaluation if you’re building in a domain that requires deep expertise, since your application relies more on the retrieval pipeline than on the foundational model itself.)
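As a rough sketch of what that could look like, assuming your knowledge base is already split into text chunks and using the OpenAI Python client (the prompt wording and JSON parsing below are my own assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()

# Placeholder: in practice you'd pull these chunks from your vector database (e.g. Qdrant).
knowledge_base_chunks = ["...chunk 1...", "...chunk 2..."]

def generate_qa(chunk: str) -> dict:
    """Asks GPT-3.5 to synthesize one QA pair grounded in a single chunk."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (
                "Generate one question a user might ask about the passage below, "
                "along with the ideal answer. Respond as JSON with the keys "
                '"input" and "expected_output".\n\n' + chunk
            ),
        }],
    )
    qa = json.loads(response.choices[0].message.content)
    qa["context"] = chunk  # keep the source chunk as the grounding context
    return qa

evals_dataset = [generate_qa(chunk) for chunk in knowledge_base_chunks]
```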
Lastly, you might wonder, “Why do I need an evaluation dataset when there are already standard LLM benchmarks out there?” Well, it’s because public benchmarks like Stanford HELM are of little use when it comes to evaluating an LLM application built on your proprietary data.
The next step in evaluating LLM applications is to decide on the set of metrics you want to evaluate your LLM application on. Some examples include:

- Factual consistency (is the output grounded in the retrieved context?)
- Answer relevancy (does the output actually address the input?)
- Coherence (is the output logically structured and easy to follow?)
- Toxicity (does the output contain harmful or offensive language?)
I’ll write about all the different types of metrics in another article, but as you can see, different metrics require different components in your evals dataset to reference against one another. Factual consistency doesn’t care about the input, and toxicity only cares about the output. (Here, we would call factual consistency a reference-based metric since it requires some sort of grounded context, while toxicity, for example, is a reference-less metric.)
This step involves taking all the relevant metrics you’ve previously identified and implementing a way to compute a score for each data point in your evals dataset. Here’s an example of how you might implement a scorer for factual consistency (DeepEval’s scorer takes a similar NLI-based approach; the snippet below is a simplified sketch with an illustrative model choice, not its exact code):
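```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any MNLI-style NLI model works; this particular checkpoint is just an example.
MODEL_NAME = "microsoft/deberta-large-mnli"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def factual_consistency_score(context: str, output: str) -> float:
    """Returns the probability (0-1) that `output` is entailed by `context`."""
    inputs = tokenizer(context, output, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    # For this checkpoint the labels are CONTRADICTION/NEUTRAL/ENTAILMENT;
    # check model.config.id2label if you swap in a different model.
    entailment_idx = model.config.label2id["ENTAILMENT"]
    return probs[entailment_idx].item()
```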
Here, we used a natural language inference model from Hugging Face to compute an entailment score ranging from 0 to 1 to measure factual consistency. It doesn’t have to be this particular implementation, but you get the point — you’ll have to decide how you want to compute a score for each metric and find a way to implement it. One thing to note is that LLM outputs are probabilistic in nature, so your implementation of the scorer should take this into account and not penalize outputs that are equally correct but different from what you expect.
At Confident AI, we use a combination of model-based, statistical, and LLM-based scorers depending on the type of metric we’re trying to evaluate. For example, we take a model-based approach to metrics such as factual consistency (NLI models) and answer relevancy (cross-encoders), while for more nuanced metrics such as coherence, we implemented a framework called G-Eval (which applies LLMs with Chain-of-Thought) for evaluation using GPT-4. (If you’re interested, here’s the paper that introduces G-Eval, a robust framework for using LLMs as evaluators.) In fact, the authors of the paper found that G-Eval outperforms traditional scores such as:

- ROUGE
- BERTScore
- MoverScore
- BARTScore
If you’re not familiar with these scores, don’t worry, I’ll be writing about all the different scores and metrics next week, so stay tuned.
Lastly, you’ll need to define a passing criterion for each metric; the passing criterion is the threshold a metric score must meet for your LLM application’s output to be deemed satisfactory for a given input. For example, the passing criterion for the factual consistency metric implemented above could be 0.6, since the metric outputs a score ranging from 0 to 1. (Similarly, the passing criterion might be 1 for a metric that outputs a binary 0-or-1 score.)
With everything in place, you can now loop through your evaluation dataset and evaluate each data point individually. The algorithm looks something like this:
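In code, a stripped-down version (reusing the hypothetical `query_llm_app`, `evals_dataset`, and `factual_consistency_score` from the earlier sketches, along with the 0.6 threshold) might look like:

```python
# A bare-bones evaluation loop: score each data point and check it against the threshold.
PASSING_THRESHOLD = 0.6

results = []
for datapoint in evals_dataset:
    actual_output = query_llm_app(datapoint["input"])
    score = factual_consistency_score(
        context=datapoint["context"],
        output=actual_output,
    )
    results.append({
        "input": datapoint["input"],
        "score": score,
        "passed": score >= PASSING_THRESHOLD,
    })

failures = [r for r in results if not r["passed"]]
print(f"{len(results) - len(failures)}/{len(results)} test cases passed")
```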
Now, you can stop eyeballing outputs and ensure that having confidence in your LLM application is as easy as having passing test cases.
There are several benefits of setting up an evaluation framework that would allow you to rapidly iterate and improve on your LLM application/retrieval pipeline:
Although your evaluation framework is now in place, it is flimsy and fragile, especially in the early days of deploying to production. This is because your users will start prompting your application in ways you’ve never expected, but that’s okay. To build a truly robust LLM application, you should:
Another way to evaluate LLM applications is an auto-evaluation approach, where LLMs act as judges that pick the best output when presented with several different choices. In fact, Databricks reports that LLM-as-a-judge agrees with human grading on over 80% of judgments. There are several points to note when using LLM-as-a-judge:
A possible approach to auto-evaluation is to generate outputs from several different versions of your LLM application for the same input, then have an LLM (typically GPT-4) judge which output is best.
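Sketched out, with the prompt wording being my own and GPT-4 assumed as the judge, that might look like:

```python
from openai import OpenAI

client = OpenAI()

def judge_best_output(question: str, candidates: list[str]) -> int:
    """Asks GPT-4 to pick the best candidate answer; returns its index."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\nCandidate answers:\n{numbered}\n\n"
                "Which candidate answers the question best? Reply with the number only."
            ),
        }],
    )
    # Note: in practice you'd want to guard against malformed replies here.
    return int(response.choices[0].message.content.strip()) - 1
```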
A problem I have with this approach, and why we haven’t implemented a way to do this at Confident AI, is that it leaves nothing actionable for subsequent iteration and improvement.
Evaluating LLM pipelines is essential to building robust applications, but evaluation is an involved and continuous process that requires a lot of work. If all you want is short-lived, untrustworthy evaluation, print statements are a great choice. However, if you want a robust evaluation infrastructure in your current development workflow, you can use Confident AI. We’ve done all the hard work for you already, and although we’re still in alpha, you can find us on GitHub⭐
Thank you for reading, and I’ll be back next week to talk about all the different metrics for LLM evaluation.
Subscribe to our weekly newsletter to stay confident in the AI systems you build.