Jeffrey Ip
Cofounder @ Confident AI, creating companies that tell stories. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

A Gentle Introduction to LLM Evaluation

November 22, 2023
11 min read

Most developers don't evaluate their GPT outputs when building applications, even if that means introducing unnoticed breaking changes, because evaluation is very, very hard. In this article, you're going to learn how to evaluate ChatGPT (LLM) outputs the right way. (PS. If you want to learn how to build your own evaluation framework, click here.)

On the agenda:

  • what are LLMs and why they're difficult to evaluate
  • different ways to evaluate LLM outputs
  • how to evaluate in Python


What are LLMs and what makes them so hard to evaluate?

To understand why LLMs are difficult to evaluate and why they're often referred to as a "black box", let's unpack what LLMs are and how they work.

ChatGPT is an example of a large language model (LLM) and was trained on huge amounts of data. To be exact, around 300 billion words from articles, tweets, r/tifu, Stack Overflow, how-to guides, and other pieces of data that were scraped off the internet 🤯

Anyway, the GPT behind "Chat" stands for Generative Pre-trained Transformer. A transformer is a specific neural network architecture that is particularly good at predicting the next few tokens (a token is roughly 4 characters of English text on average for ChatGPT, but it can be as short as one character or as long as a word depending on the specific encoding strategy).

So in fact, LLMs don't really "know" anything, but instead "understand" linguistic patterns due to the way in which they were trained, which often makes them pretty good at figuring out the right thing to say. Pretty manipulative huh?

All jokes aside, if there's one thing you need to remember, it's this: the process of predicting the next plausible "best" token is probabilistic in nature. This means that LLMs can generate a variety of possible outputs for a given input, instead of always providing the same response. It is exactly this non-deterministic nature of LLMs that makes them challenging to evaluate, as there's often more than one appropriate response.
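
To make "probabilistic" concrete, here is a minimal sketch of next-token sampling. It is not how ChatGPT is implemented: the token scores in `toy_logits` are made up, and `sample_next_token` is a hypothetical helper that applies a softmax with a temperature and draws one token at random.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample one token from a softmax over raw scores (logits)."""
    # Scale scores by temperature, then apply a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    tokens = list(weights)
    probs = [weights[tok] / total for tok in tokens]
    # Draw one token in proportion to its probability.
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for the token that follows "The cat sat on the".
toy_logits = {"mat": 2.0, "sofa": 1.5, "roof": 0.5}

rng = random.Random(0)  # seeded only so the demo is repeatable
samples = [sample_next_token(toy_logits, rng=rng) for _ in range(1000)]
counts = {tok: samples.count(tok) for tok in toy_logits}
print(counts)
```

The same input produces different tokens across draws, with likelier tokens showing up more often. That is the behaviour that makes exact-match testing of LLM outputs impractical.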

Why do we need to evaluate LLM applications?

When I say LLM applications, here are some examples of what I'm referring to:

  • Chatbots: For customer support, virtual assistants, or general conversational agents.
  • Code Assistance: Suggesting code completions, fixing code errors, or helping with debugging.
  • Legal Document Analysis: Helping legal professionals quickly understand the essence of long contracts or legal texts.
  • Personalized Email Drafting: Helping users draft emails based on context, recipient, and desired tone.

LLM applications usually have one thing in common - they perform better when augmented with proprietary data to help with the task at hand. Want to build an internal chatbot that helps boost your employees' productivity? OpenAI certainly doesn't keep tabs on your company's internal data (hopefully 😥).

This matters because it is now not only OpenAI's job to ensure ChatGPT is performing as expected ⚖️ but also yours to make sure your LLM application is generating the desired outputs by using the right prompt templates, data retrieval pipelines, model architecture (if you're fine-tuning), etc.

Evaluations (I'll just call them evals from here on) help you measure how well your application is handling the task at hand. Without evals, you will be introducing unnoticed breaking changes and will have to manually inspect all possible LLM outputs each time you iterate on your application 👀 which to me sounds like a terrible idea 💀

How to evaluate LLM outputs

There are two ways everyone should know about when it comes to evals - with and without ChatGPT. In fact, you can learn how to build your own evaluation framework in under 20 minutes here.

Evals without ChatGPT

A nice way to evaluate LLM outputs without using ChatGPT is to use other machine learning models from the field of NLP. You can use specific models to judge your outputs on different metrics such as factual correctness, relevancy, bias, and helpfulness (just to name a few, but the list goes on), despite non-deterministic outputs.

For example, we can use natural language inference (NLI) models (which output an entailment score) to determine how factually correct a response is based on some provided context. The higher the entailment score, the more factually correct an output is, which is particularly helpful if you're evaluating a long output that's not so black and white in terms of factual correctness.
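
As a rough sketch of where an entailment score comes from: many NLI classifiers emit one raw score (logit) per class, and the entailment score is the softmax probability of the entailment class. The logits below are made up, and the class order `[contradiction, neutral, entailment]` is an assumption that differs between models, so check the label mapping of whichever model you use.

```python
import math

def entailment_score(logits, entailment_index=2):
    """Softmax probability of the entailment class from raw NLI logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[entailment_index] / sum(exps)

# Hypothetical logits, assumed order: [contradiction, neutral, entailment].
supported = entailment_score([-2.1, 0.3, 3.4])    # output backed by the context
unsupported = entailment_score([3.0, 0.5, -1.8])  # output contradicting the context

print(round(supported, 3), round(unsupported, 3))
```

A real pipeline would get these logits by feeding the context (as the premise) and the LLM output (as the hypothesis) to an NLI model. Only the last step is shown here.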

You might also wonder how these models can possibly "know" whether a piece of text is factually correct 🤔 It turns out you can provide context to these models for them to take at face value 🥳 In fact, we call these contexts ground truths or references. A collection of these references is often referred to as an evaluation dataset.

But not all metrics require references. For example, relevancy can be calculated using cross-encoder models (another kind of ML model), and all you need to do is supply the input and output for it to determine how relevant they are to each other.

Off the top of my head, here's a list of reference-less metrics:

  • relevancy
  • bias
  • toxicity
  • helpfulness
  • harmlessness

And here is a list of reference-based metrics:

  • factual correctness
  • conceptual similarity

Note that reference-based metrics don't require you to provide the initial input, as they only judge the output based on the provided context.
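
One way to keep track of which metric needs which piece of data is a small record type with optional fields. This is only an illustrative sketch: `EvalCase`, `REQUIRED_FIELDS`, and `missing_fields` are hypothetical names, and the mapping reflects my reading of the metrics described above, not any library's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    """One evaluation example; only the LLM output is always required."""
    output: str
    input: Optional[str] = None            # needed by input/output metrics
    context: Optional[str] = None          # reference for factual correctness
    expected_output: Optional[str] = None  # reference for conceptual similarity

# Assumed mapping from metric name to the extra fields it needs.
REQUIRED_FIELDS = {
    "relevancy": ("input",),                       # reference-less
    "factual correctness": ("context",),           # reference-based
    "conceptual similarity": ("expected_output",)  # reference-based
}

def missing_fields(case, metric):
    """Return the names of required fields that are not set on the case."""
    return [name for name in REQUIRED_FIELDS[metric] if getattr(case, name) is None]

case = EvalCase(
    output="We offer a 30-day full refund at no extra costs.",
    context="All customers are eligible for a 30 day full refund at no extra costs.",
)
print(missing_fields(case, "factual correctness"))
print(missing_fields(case, "relevancy"))
```

Here the case has a context but no input, so it can be scored for factual correctness but not for relevancy.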

Using ChatGPT for Evals

There's an emerging trend of using state-of-the-art LLMs (aka ChatGPT) to evaluate themselves or even other LLMs.

G-Eval is a recently developed framework that uses LLMs for evals.

I'll attach an image from the research paper that introduced G-Eval below, but in a nutshell G-Eval is a two-part process - the first part generates evaluation steps, and the second uses the generated evaluation steps to output a final score.

Let's run through a concrete example. Firstly, to generate evaluation steps:

  1. introduce an evaluation task to ChatGPT (e.g. rate this summary from 1 - 5 based on relevancy)
  2. introduce the evaluation criteria (e.g. relevancy will be based on the collective quality of all sentences)

Once the evaluation steps have been generated:

  1. concatenate the input, evaluation steps, context, and the actual output
  2. ask it to generate a score between 1 - 5, where 5 is better than 1
  3. (Optional) take the probabilities of the output tokens from the LLM to normalize the score and take their weighted summation as the final result
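
The two parts can be sketched as two prompt-building helpers. The wording of the prompts is my own, not the paper's, and `build_steps_prompt` and `build_scoring_prompt` are hypothetical names. Sending the prompts to an LLM is left out.

```python
def build_steps_prompt(task, criteria):
    """Part 1: ask the LLM to turn a task and criteria into evaluation steps."""
    return (
        f"Task: {task}\n"
        f"Criteria: {criteria}\n"
        "List the evaluation steps you would follow to apply these criteria."
    )

def build_scoring_prompt(steps, input_text, context, actual_output):
    """Part 2: concatenate everything and ask for a score from 1 to 5."""
    return "\n\n".join([
        f"Evaluation steps:\n{steps}",
        f"Input:\n{input_text}",
        f"Context:\n{context}",
        f"Actual output:\n{actual_output}",
        "Following the evaluation steps, reply with a single score "
        "from 1 to 5, where 5 is best.",
    ])

steps_prompt = build_steps_prompt(
    task="Rate this summary from 1 - 5 based on relevancy",
    criteria="Relevancy will be based on the collective quality of all sentences",
)
scoring_prompt = build_scoring_prompt(
    steps="1. Read the source. 2. Check each sentence of the summary against it.",
    input_text="Summarize the article.",
    context="(the article text)",
    actual_output="(the summary to grade)",
)
print(steps_prompt)
print(scoring_prompt)
```

In practice the `steps` argument would be the LLM's reply to the first prompt rather than a hand-written string.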

Step 3 is actually pretty complicated 🙃 because to get the probability of the output tokens, you would typically need access to the raw model outputs, not just the final generated text. This step was introduced in the paper because it offers more fine-grained scores that better reflect the quality of outputs.
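
Assuming you can obtain a probability for each candidate score token, the weighting in step 3 is just an expected value. The probabilities in `toy_probs` are invented for illustration, and `weighted_score` is a hypothetical helper, not code from the paper.

```python
def weighted_score(score_probs):
    """Probability-weighted average of candidate scores.

    score_probs maps each candidate score (1-5) to the probability the
    model assigned to emitting that score token.
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive number")
    # Normalize in case the probabilities don't sum to exactly 1,
    # for example when only the top few tokens are available.
    return sum(score * p for score, p in score_probs.items()) / total

# Hypothetical token probabilities for the scores 1 through 5.
toy_probs = {1: 0.02, 2: 0.08, 3: 0.30, 4: 0.45, 5: 0.15}
final_score = weighted_score(toy_probs)
print(final_score)
```

A plain generation would most likely return 4 here, while the weighted result lands between 3 and 4, which is the finer granularity this step is meant to provide.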

Here's a diagram taken from the paper that can help you visualize what we learnt:

Using GPT-4 with G-Eval outperformed traditional metrics in areas such as coherence, consistency, fluency, and relevancy 😳 but evaluations using LLMs can often be very expensive.

So, my recommendation would be to evaluate with G-Eval as a starting point to establish a performance standard and then transition to more cost-effective traditional methods where suitable.

Evaluating LLM outputs in python

By now, you probably feel inundated by all the jargon and definitely wouldn't want to implement everything from scratch. Imagine having to research the best way to compute each individual metric, train your own model for it, and code up an evaluation framework... 😰

Luckily, there are a few open source packages such as ragas and DeepEval that provide an evaluation framework so you don't have to write your own 😌

As the cofounder of Confident (the company behind DeepEval), I'm going to go ahead and shamelessly show you how you can unit test your LLM applications using DeepEval 😊 (but seriously, we have an amazing Pytest-like developer experience, are easy to set up, and offer a free platform for you to visualize your evaluation results)

Let's wrap things up with some coding.

Setting up your test environment

To implement our much anticipated evals, create a project folder and initialize a Python virtual environment by running the commands below in your terminal:

mkdir evals-example
cd evals-example
python3 -m venv venv
source venv/bin/activate

Your terminal prompt should now start with the environment name in parentheses, `(venv)`.


Installing dependencies

Run the following code:

pip install deepeval

Setting your OpenAI API Key

Lastly, set your OpenAI API key as an environment variable. We'll need OpenAI for G-Eval later (which basically means using LLMs for evaluation). In your terminal, paste in the following, substituting your own API key (get yours here if you don't already have one):

export OPENAI_API_KEY="your-api-key-here"

Writing your first test file

Let's create a test file (note that test file names must start with "test"):


Paste in the following code:

from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.metrics.answer_relevancy import AnswerRelevancyMetric
from deepeval.metrics.conceptual_similarity import ConceptualSimilarityMetric
from deepeval.metrics.llm_eval import LLMEvalMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test
import openai

def test_factual_correctness():
    input = "What if these shoes don't fit?"
    context = "All customers are eligible for a 30 day full refund at no extra costs."
    output = "We offer a 30-day full refund at no extra costs."
    factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=input, output=output, context=context)
    assert_test(test_case, [factual_consistency_metric])

def test_relevancy():
    input = "What does your company do?"
    output = "Our company specializes in cloud computing"
    relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=input, output=output)
    assert_test(test_case, [relevancy_metric])

def test_conceptual_similarity():
    input = "What did the cat do?"
    output = "The cat climbed up the tree"
    expected_output = "The cat ran up the tree."
    conceptual_similarity_metric = ConceptualSimilarityMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=input, output=output, expected_output=expected_output)
    assert_test(test_case, [conceptual_similarity_metric])

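def test_humor():
    # Completion function the metric calls to get ChatGPT's judgment.
    # Note: openai.ChatCompletion is the pre-1.0 openai SDK interface.
    def make_chat_completion_request(prompt):
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",  # assumption: the model name was lost from this listing
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
        )
        return response.choices[0].message.content

    input = "Write me something funny related to programming"
    output = "Why did the programmer quit his job? Because he didn't get arrays!"
    llm_metric = LLMEvalMetric(
        criteria="How funny it is",
        # assumption: the helper above is passed in via `completion_function`
        completion_function=make_chat_completion_request,
    )
    test_case = LLMTestCase(query=input, output=output)
    assert_test(test_case, [llm_metric])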

Now run the test file by passing its name to `deepeval test run`:

deepeval test run <your-test-file>.py

For each of the test cases, there is a predefined metric provided by DeepEval, and each of these metrics outputs a score from 0 - 1. For example, `FactualConsistencyMetric(minimum_score=0.5)` means we want to evaluate how factually correct an output is, where `minimum_score=0.5` means the test will only pass if the output score is higher than the 0.5 threshold.

Let's go over the test cases one by one:

  1. `test_factual_correctness` tests how factually correct your LLM output is relative to the provided context.
  2. `test_relevancy` tests how relevant the output is relative to the given input.
  3. `test_conceptual_similarity` tests how conceptually similar the LLM output is relative to the expected output.
  4. `test_humor` tests how funny your LLM output is. This test case is the only test case that uses ChatGPT for evaluation.

Notice how there are up to four parameters for a single test case - the input, the expected output, the actual output (of your application), and the context (that was used to generate the actual output). Depending on the metric you're testing, some parameters are optional, while some are mandatory.

Lastly, what if you want to test more than one metric on the same input? Here's how you can aggregate metrics on a single test case:

def test_everything():
    input = "What did the cat do?"
    output = "The cat climbed up the tree"
    expected_output = "The cat ran up the tree."
    context = "The cat ran up the tree."
    conceptual_similarity_metric = ConceptualSimilarityMetric(minimum_score=0.5)
    relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
    factual_consistency_metric = FactualConsistencyMetric(minimum_score=0.5)
    test_case = LLMTestCase(query=input, output=output, context=context, expected_output=expected_output)
    assert_test(test_case, [conceptual_similarity_metric, relevancy_metric, factual_consistency_metric])
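
Conceptually, a multi-metric test like this passes only when every metric clears its own minimum score. The sketch below illustrates that rule in plain Python. It is not DeepEval's implementation: `failing_metrics` is a hypothetical helper, the scores are invented, and it assumes a score must be strictly higher than the minimum to pass.

```python
def failing_metrics(scores, minimum_scores):
    """Return the names of metrics whose score is not above its minimum."""
    return [
        name
        for name, minimum in minimum_scores.items()
        if not scores[name] > minimum
    ]

# Hypothetical scores for one test case, each between 0 and 1.
scores = {
    "conceptual_similarity": 0.82,
    "relevancy": 0.91,
    "factual_consistency": 0.44,
}
minimum_scores = {name: 0.5 for name in scores}

failed = failing_metrics(scores, minimum_scores)
print(failed)
```

With these made-up numbers the test case would fail, and the failure can be traced back to the factual consistency metric alone.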

Not so hard after all huh? Write enough of these (10-20), and you'll have much better control over what you're building 🤗

PS. And here's a bonus feature DeepEval offers: a free web platform for you to view data on all your test runs.

Try running the following command:

deepeval login

Follow the instructions (log in, get your API key, paste it in the CLI), and run the tests again with the same command:

deepeval test run <your-test-file>.py

Let me know what happens!


In this article, you've learnt:

  • how ChatGPT works
  • examples of LLM applications
  • why it's hard to evaluate LLM outputs
  • how to evaluate LLM outputs in python

With evals, you can stop making breaking changes to your LLM application ✅ quickly iterate on your implementation to improve on metrics you care about ✅ and most importantly be confident in the LLM application you build 😇

The source code for this tutorial is available here:

Thank you for reading, and till next time 🫡
