Most developers don't evaluate their GPT outputs when building applications, even if that means introducing unnoticed breaking changes, because evaluation is very, very hard. In this article, you're going to learn how to evaluate ChatGPT (LLM) outputs the right way. (PS. If you want to learn how to build your own evaluation framework, click here.)
On the agenda:
To understand why LLMs are difficult to evaluate and why they're often referred to as a "black box", let's unpack what LLMs are and how they work.
ChatGPT is an example of a large language model (LLM) and was trained on huge amounts of data. To be exact, around 300 billion words from articles, tweets, r/tifu, Stack Overflow, how-to guides, and other pieces of data that were scraped off the internet 🤯
Anyway, the GPT behind "Chat" stands for Generative Pre-trained Transformer. A transformer is a specific neural network architecture that is particularly good at predicting the next few tokens (a token is roughly 4 characters of English text on average for ChatGPT, but it can be as short as one character or as long as a word depending on the specific encoding strategy).
So in fact, LLMs don't really "know" anything, but instead "understand" linguistic patterns due to the way in which they were trained, which often makes them pretty good at figuring out the right thing to say. Pretty manipulative, huh?
All jokes aside, if there's one thing you need to remember, it's this: the process of predicting the next plausible "best" token is probabilistic in nature. This means that LLMs can generate a variety of possible outputs for a given input, instead of always providing the same response. It is exactly this non-deterministic nature of LLMs that makes them challenging to evaluate, as there's often more than one appropriate response.
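To make the "probabilistic in nature" part concrete, here's a toy sketch of how a next token gets sampled. The candidate tokens and their raw scores (logits) are made up for illustration, and a real model scores tens of thousands of tokens at every step, so treat this as a rough picture rather than how ChatGPT is actually implemented:

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(candidates, logits, rng, temperature=1.0):
    """Pick one candidate token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidates for completing "The capital of France is ..."
candidates = [" Paris", " a", " the", " located"]
logits = [4.0, 1.5, 1.0, 0.5]  # made-up scores, not real model outputs

rng = random.Random(0)  # seeded so the run is repeatable
samples = [sample_next_token(candidates, logits, rng) for _ in range(1000)]
counts = {token: samples.count(token) for token in candidates}
print(counts)  # " Paris" dominates, but the other tokens still show up
```

Same input, different outputs across draws. That's the behaviour your evals have to cope with.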
When I say LLM applications, here are some examples of what I'm referring to:
LLM applications usually have one thing in common - they perform better when augmented with proprietary data to help with the task at hand. Want to build an internal chatbot that helps boost your employees' productivity? OpenAI certainly doesn't keep tabs on your company's internal data (hopefully 😥).
This matters because it is now not only OpenAI's job to ensure ChatGPT is performing as expected ⚖️ but also yours to make sure your LLM application is generating the desired outputs by using the right prompt templates, data retrieval pipelines, model architecture (if you're fine-tuning), etc.
Evaluations (I'll just call them evals from here on) help you measure how well your application is handling the task at hand. Without evals, you will introduce unnoticed breaking changes and will have to manually inspect all possible LLM outputs each time you iterate on your application 👀 which to me sounds like a terrible idea 💀
There are two ways everyone should know about when it comes to evals - with and without ChatGPT. In fact, you can learn how to build your own evaluation framework in under 20 minutes here.
A nice way to evaluate LLM outputs without using ChatGPT is to use other machine learning models from the field of NLP. You can use specific models to judge your outputs on different metrics such as factual correctness, relevancy, bias, and helpfulness (just to name a few, but the list goes on), despite non-deterministic outputs.
For example, we can use natural language inference (NLI) models (which output an entailment score) to determine how factually correct a response is based on some provided context. The higher the entailment score, the more factually correct an output is, which is particularly helpful if you're evaluating a long output that's not so black and white in terms of factual correctness.
You might also wonder how these models can possibly "know" whether a piece of text is factually correct 🤔 It turns out you can provide context to these models for them to take at face value 🥳 In fact, we call these contexts ground truths or references. A collection of these references is often referred to as an evaluation dataset.
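Here's a minimal sketch of how a reference-based metric could be wired up. Everything in it is an assumption for illustration: the `factual_consistency` helper, the choice to keep the best score across references, and especially `toy_overlap_scorer`, which is a crude word-overlap stand-in and NOT a real NLI model. In practice you'd pass in a function that wraps an actual NLI model's entailment score:

```python
from typing import Callable, Sequence

# An entailment scorer takes (premise, hypothesis) and returns a score in [0, 1].
EntailmentScorer = Callable[[str, str], float]


def factual_consistency(
    output: str, contexts: Sequence[str], scorer: EntailmentScorer
) -> float:
    """Score the output against each reference and keep the best match.

    Taking the maximum is an assumption made for this sketch.
    """
    if not contexts:
        raise ValueError("a reference-based metric needs at least one context")
    return max(scorer(context, output) for context in contexts)


def toy_overlap_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: fraction of hypothesis words found in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    hits = sum(1 for word in hypothesis_words if word in premise_words)
    return hits / len(hypothesis_words)


# Hypothetical references (ground truths) for an internal HR chatbot.
contexts = [
    "the office is closed on public holidays",
    "employees get 25 days of paid leave per year",
]

grounded = factual_consistency(
    "employees get 25 days of paid leave", contexts, toy_overlap_scorer
)
ungrounded = factual_consistency(
    "employees get unlimited unpaid sabbaticals", contexts, toy_overlap_scorer
)
print(grounded, ungrounded)  # the grounded answer scores higher
```

Notice that the original user input never appears: the output is judged purely against the references you supply.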
But not all metrics require references. For example, relevancy can be calculated using cross-encoder models (another type of ML model), and all you need to do is supply the input and output for it to determine how relevant they are to each other.
Off the top of my head, here's a list of reference-less metrics:
And here is a list of reference-based metrics:
Note that reference-based metrics don't require you to provide the initial input, as they only judge the output based on the provided context.
There's an emerging trend of using state-of-the-art LLMs (aka ChatGPT) to evaluate themselves or even other LLMs.
G-Eval is a recently developed framework that uses LLMs for evals.
I'll attach an image from the research paper that introduced G-Eval below, but in a nutshell, G-Eval is a two-part process - the first part generates evaluation steps, and the second uses the generated evaluation steps to output a final score.
Let's run through a concrete example. Firstly, to generate evaluation steps:
Once the evaluation steps have been generated:
Step 3 is actually pretty complicated 🙃 because to get the probability of the output tokens, you would typically need access to the raw model outputs, not just the final generated text. This step was introduced in the paper because it offers more fine-grained scores that better reflect the quality of outputs.
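To give you a feel for what that probability weighting looks like, here's a small sketch that computes an expected score as the sum of each candidate score times its probability. The log-probabilities are made up, the 1 to 5 scale is just an example, and renormalizing over only the score tokens is my own assumption (useful when an API only gives you the top few token probabilities):

```python
import math


def weighted_score(score_logprobs):
    """Expected score: each candidate score times its normalized probability."""
    probs = {score: math.exp(logprob) for score, logprob in score_logprobs.items()}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no probability mass on any score token")
    # Renormalize over the score tokens only (an assumption of this sketch).
    return sum(score * prob / total for score, prob in probs.items())


# Hypothetical log-probabilities for the score tokens "1" through "5".
example = {
    1: math.log(0.05),
    2: math.log(0.05),
    3: math.log(0.10),
    4: math.log(0.50),
    5: math.log(0.30),
}
print(round(weighted_score(example), 2))  # roughly 3.95, rather than a flat 4
```

The point is that two outputs which would both get a plain "4" can end up with different weighted scores, which is where the extra granularity comes from.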
Here's a diagram taken from the paper that can help you visualize what we learnt:
Utilizing GPT-4 with G-Eval outperformed traditional metrics in areas such as coherence, consistency, fluency, and relevancy 😳 but evaluations using LLMs can often be very expensive.
So, my recommendation would be to evaluate with G-Eval as a starting point to establish a performance standard and then transition to more cost-effective traditional methods where suitable.
By now, you probably feel inundated by all the jargon and definitely wouldn't want to implement everything from scratch. Imagine having to research what's the best way to compute each individual metric, train your own model for it, and code up an evaluation framework... 😰
Luckily, there are a few open source packages such as ragas and DeepEval that provide an evaluation framework so you don't have to write your own 😌
As the cofounder of Confident (the company behind DeepEval), I'm going to go ahead and shamelessly show you how you can unit test your LLM applications using DeepEval 😊 (but seriously, we have an amazing Pytest-like developer experience, are easy to set up, and offer a free platform for you to visualize your evaluation results)
Let's wrap things up with some coding.
To implement our much anticipated evals, create a project folder and initialize a Python virtual environment by running the code below in your terminal:
Your terminal prompt should now start with something like this:
Run the following code:
Lastly, set your OpenAI API key as an environment variable. We'll need OpenAI for G-Evals later (which basically means using LLMs for evaluation). In your terminal, paste in this with your own API key (get yours here if you don't already have one):
Let's create a file called `test_evals.py` (note that test files must start with "test"):
Paste in the following code:
Now run the test file:
For each of the test cases, there is a predefined metric provided by DeepEval, and each of these metrics outputs a score from 0 to 1. For example, `FactualConsistencyMetric(minimum_score=0.5)` means we want to evaluate how factually correct an output is, where `minimum_score=0.5` means the test will only pass if the output score is higher than the 0.5 threshold.
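If the pass/fail logic feels abstract, here's a framework-free sketch of the idea. `check_metric` is a hypothetical helper written for this article, not DeepEval's API, and I'm assuming a strict "higher than" comparison based on the description above:

```python
def check_metric(name, score, minimum_score=0.5):
    """Fail loudly when a metric score does not clear its threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"{name}: score {score} is outside the 0 to 1 range")
    # Strictly "higher than" the threshold, as an assumption of this sketch.
    assert score > minimum_score, (
        f"{name}: score {score:.2f} did not exceed {minimum_score:.2f}"
    )


def test_factual_consistency_clears_threshold():
    # 0.82 is a made-up score standing in for a real metric's output.
    check_metric("factual_consistency", score=0.82, minimum_score=0.5)
```

Since the function name starts with `test`, a Pytest-style runner will pick it up and report a failure whenever the score drops to or below the threshold.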
Let's go over the test cases one by one:
Notice how there are up to four moving parameters for a single test case - the input, the expected output, the actual output (of your application), and the context (that was used to generate the actual output). Depending on the metric you're testing, some parameters are optional, while some are mandatory.
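One way to picture the "optional vs mandatory" part is a small container for the four parameters plus a lookup of which fields each metric needs. The `EvalCase` class and the `REQUIRED_FIELDS` mapping below are hypothetical names for this sketch (not DeepEval's own classes), and the mapping only covers the two metrics described earlier:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalCase:
    """The four moving parameters of a single test case, all optional."""
    input: Optional[str] = None
    actual_output: Optional[str] = None
    expected_output: Optional[str] = None
    context: Optional[str] = None


# Hypothetical mapping of metric name to the fields it cannot do without.
REQUIRED_FIELDS = {
    "relevancy": ("input", "actual_output"),              # reference-less
    "factual_consistency": ("actual_output", "context"),  # reference-based
}


def missing_fields(case: EvalCase, metric: str) -> List[str]:
    """List the mandatory fields this case lacks for the given metric."""
    return [name for name in REQUIRED_FIELDS[metric] if getattr(case, name) is None]


case = EvalCase(
    actual_output="Employees get 25 days of paid leave.",
    context="Employees get 25 days of paid leave per year.",
)
print(missing_fields(case, "factual_consistency"))  # nothing missing
print(missing_fields(case, "relevancy"))            # the input is missing
```

Checking for missing fields up front gives you a clear error before a metric silently scores an incomplete test case.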
Lastly, what if you want to test more than one metric on the same input? Here's how you can aggregate metrics on a single test case:
Not so hard after all, huh? Write enough of these (10-20), and you'll have much better control over what you're building 🤗
PS. Here's a bonus feature DeepEval offers: a free web platform for you to view data on all your test runs.
Try running the following command:
Follow the instructions (login, get your API key, paste it in the CLI), and run the test again by typing in the same command:
Let me know what happens!
In this article, you've learnt:
With evals, you can stop making breaking changes to your LLM application ✅ quickly iterate on your implementation to improve on metrics you care about ✅ and most importantly be confident in the LLM application you build 😇
The source code for this tutorial is available here:
Thank you for reading, and till next time