Jeffrey Ip
Cofounder @ Confident AI, creating companies that tell stories. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

How to Evaluate LLM Applications

November 10, 2023

ChatGPT, the leading code generator, has soared in popularity over the past year thanks to the seemingly omniscient GPT-4. Its ability to generate coherent and poetic responses to previously unseen contexts has accelerated the development of other foundational large language models (LLMs), such as Anthropic’s Claude, Google’s Bard, and Meta’s open-source LLaMA model. Consequently, this has enabled ML engineers to build retrieval-based LLM applications around proprietary data like never before. But these applications continue to suffer from hallucinations, struggle to keep up-to-date with the latest information, and don’t always respond relevantly to prompts.

In this article, as the founder of Confident AI, the world’s first open-source evaluation infrastructure for LLM applications, I will outline how to evaluate LLM and retrieval pipelines, different workflows you can employ for evaluation, and the common pitfalls when building RAG applications that evaluation can solve.

Evaluation is (not) Eyeballing Outputs

Before we begin, does your current approach to evaluation look something like the code snippet below? You loop through a list of prompts, run your LLM application on each one of them, wait a minute or two for it to finish executing, manually inspect everything, and try to evaluate the quality of the output based on each input.

If this sounds familiar, this article is for you. (And hopefully, by the end of it, you’ll know how to stop eyeballing results.)

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# Configure the chunk size used when the documents are indexed
service_context = ServiceContext.from_defaults(chunk_size=1000)
documents = SimpleDirectoryReader('data').load_data()
# Pass the service context so the chunk size above is actually applied
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=5)

def query(user_input):
    return query_engine.query(user_input).response

prompts = [...]

# Print every response and inspect it by hand
for prompt in prompts:
    print(query(prompt))

Evaluation as a Multi-Step, Iterative Process

Evaluation is an involved process but has huge downstream benefits as you look to iterate on your LLM application. Building an LLM system without evaluations is akin to building a distributed backend system without any automated testing — although it might work at first, you’ll end up wasting more time fixing breaking changes than building the actual thing. (Fun fact: Did you know that AI-first applications suffer from a much lower one-month retention because users don’t revisit flaky products?)

By the way, if you're looking to get a better general sense of what LLM evaluation is, here is another great read.

Step One — Creating an Evaluation Dataset

The first step to any successful evaluation workflow for LLM applications is to create an evaluation dataset, or at least have a rough idea of the type of inputs your application is going to get. It might sound fancy and like a lot of work, but the truth is you’re probably already doing it as you’re eyeballing outputs.

Let’s consider the eyeballing example above. Correct me if I’m wrong, but what you’re really trying to do is judge an output against what you expect. You probably already know something about the knowledge base you’re working with, and you likely know which text chunks you would expect to see if you printed out the results of your retrieval pipeline. The initial evals dataset doesn’t have to be comprehensive, but start by writing down a set of QAs with the relevant context:

dataset = [
    {
        "input": "...",
        "expected_output": "...",
        # context is a list of strings that represents ideally the
        # additional context your LLM application will receive at query time
        "context": ["..."]
    }
]

Here, the “input” is mandatory, but “expected_output” and “context” are optional (you’ll see why later).

If you wish to automate things, you can try to generate an evals dataset by looping through your knowledge base (which could be in a vector database like Qdrant) and asking GPT-3.5 to generate a set of QAs instead of writing them yourself. This approach is flexible, versatile, and fast, but the generated QAs are limited by the data GPT-3.5 was trained on. (Ironically, you’re more likely to care about evaluation if you’re building in a domain that requires deep expertise, since it’s more reliant on the retrieval pipeline than on the foundational model itself.)
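
As a rough sketch of that automated approach, the function below loops over text chunks and turns each one into an evals data point. The prompt wording, the JSON reply format, and the `complete` callable (a stand-in for whatever client you use to call GPT-3.5) are all assumptions made for illustration, not part of any particular library.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt; adjust the wording to your own domain
PROMPT_TEMPLATE = (
    "Write one question that can be answered using only the text below, "
    "followed by its answer. Respond as JSON with the keys "
    '"question" and "answer".\n\nText:\n{chunk}'
)

def generate_evals_dataset(
    chunks: List[str],
    complete: Callable[[str], str],
) -> List[Dict]:
    """Build one evals data point per knowledge-base chunk.

    `complete` is an assumed callable that sends a prompt to an LLM
    (for example GPT-3.5) and returns the raw text of its reply.
    """
    dataset = []
    for chunk in chunks:
        reply = complete(PROMPT_TEMPLATE.format(chunk=chunk))
        try:
            qa = json.loads(reply)
            question, answer = qa["question"], qa["answer"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Skip chunks where the model did not return the requested JSON
            continue
        dataset.append({
            "input": question,
            "expected_output": answer,
            "context": [chunk],
        })
    return dataset
```

Generated QAs are still worth a quick manual review before you rely on them, since the model can produce questions that the chunk does not actually answer.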

Lastly, you might wonder, “Why do I need an evaluation dataset when there are already standard LLM benchmarks out there?” Well, it’s because public benchmarks like Stanford HELM tell you little about an LLM application that’s built on your proprietary data.

Step Two — Identify Relevant Metrics for Evaluation

The next step in evaluating LLM applications is to decide on the set of metrics you want to evaluate your LLM application on. Some examples include:

  • factual consistency (how factually correct your LLM application is based on the respective context in your evals dataset)
  • answer relevancy (how relevant your LLM application’s outputs are based on the respective inputs in your evals dataset)
  • coherence (how logical and consistent your LLM application’s outputs are)
  • toxicity (whether your LLM application is outputting harmful content)
  • RAGAS (for RAG pipelines)
  • bias (pretty self-explanatory)

I’ll write about all the different types of metrics in another article, but as you can see, different metrics require different components in your evals dataset to reference against one another. Factual consistency doesn’t care about the input, and toxicity only cares about the output. (Here, we would call factual consistency a reference-based metric since it requires some sort of grounded context, while toxicity, for example, is a reference-less metric.)
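
One way to make these requirements explicit is to record which fields each metric needs and check them per data point. This is only a sketch: the metric names mirror the list above, and `actual_output` is an assumed field name for whatever your application returned.

```python
from typing import Dict, List

# Fields each metric needs, following the descriptions above.
# "actual_output" is an assumed name for the application's response.
REQUIRED_FIELDS = {
    "factual_consistency": ["actual_output", "context"],  # reference-based
    "answer_relevancy": ["actual_output", "input"],
    "coherence": ["actual_output"],
    "toxicity": ["actual_output"],  # reference-less
}

def applicable_metrics(data_point: Dict) -> List[str]:
    """Return the metrics whose required fields are all present and non-empty."""
    return [
        name
        for name, fields in REQUIRED_FIELDS.items()
        if all(data_point.get(field) for field in fields)
    ]
```

A data point without any context would simply skip factual consistency, which is why only the input is mandatory in the dataset.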

Step Three — Implement a Scorer to Compute Metric Scores

This step involves taking all the relevant metrics you’ve previously identified and implementing a way to compute a score for each data point in your evals dataset. Here’s an example of how you might implement a scorer for factual consistency (code taken from DeepEval):

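from scipy.special import softmax
from sentence_transformers import CrossEncoder

# Load the NLI model once instead of on every call
model = CrossEncoder('cross-encoder/nli-deberta-v3-large')

def predict(text_a: str, text_b: str) -> float:
    # Score both directions, since entailment is not symmetric
    scores = model.predict([(text_a, text_b), (text_b, text_a)])

    # Turn each row of logits into probabilities; index 1 is the
    # entailment class in this model's label mapping
    softmax_scores = softmax(scores, axis=1)
    score = softmax_scores[0][1]
    second_score = softmax_scores[1][1]
    return max(score, second_score)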

Here, we used a natural language inference model from Hugging Face to compute an entailment score ranging from 0–1 to measure factual consistency. It doesn’t have to be this particular implementation, but you get the point — you’ll have to decide how you want to compute a score for each metric and find a way to implement it. One thing to note is that LLM outputs are probabilistic in nature, so your implementation of the scorer should take this into account and not penalize outputs that are equally correct but different from what you expect.

At Confident AI, we use a combination of model-based, statistical, and LLM-based scorers, depending on the type of metric we’re trying to evaluate. For example, we use a model-based approach to evaluate metrics such as factual consistency (NLI models) and answer relevancy (cross-encoders), while for more nuanced metrics such as coherence, we implemented a framework called G-Eval (which applies LLMs with Chain-of-Thought) for evaluation using GPT-4. (If you’re interested, here’s the paper that introduces G-Eval, a robust framework for using LLMs for evaluation.) In fact, the authors of the paper found that G-Eval outperforms traditional scores such as:

  • BLEU (compares n-grams of the machine-generated text to n-grams of a reference translation and counts the number of matches)
  • BERTScore (a metric for evaluating text generation based on BERT embeddings)
  • ROUGE (a set of metrics for evaluating automatic summarization of texts as well as machine translation)
  • MoverScore (computes the distance between the contextual embeddings of words in the machine-generated text and those in a reference text)

If you’re not familiar with these scores, don’t worry, I’ll be writing about all the different scores and metrics next week, so stay tuned.
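
To give a flavor of what a statistical scorer does, here is a minimal sketch of clipped n-gram precision, the building block behind BLEU. It is not full BLEU (there is no brevity penalty and no averaging over several n-gram sizes), and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter
from typing import List

def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams that also appear in the reference.

    Each n-gram is credited at most as many times as it occurs in the
    reference, so repeating a matching phrase does not inflate the score.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matches / total

# Three of the five candidate bigrams appear in the reference
print(clipped_ngram_precision("the cat sat on the mat", "the cat is on the mat", 2))  # 0.6
```

A score like this rewards surface overlap only, which is one reason it can penalize answers that are correct but worded differently.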

Lastly, you’ll need to define a passing criterion for each metric; the passing criterion is the threshold that the metric score needs to meet for your LLM application output to be deemed satisfactory for a given input. For example, a passing criterion for the factual consistency metric implemented above could be 0.6, since the metric outputs a score ranging from 0 to 1. (Similarly, the passing criterion might be 1 for a metric that outputs a 0 or 1 binary score.)
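
In code, a passing criterion can be as small as a threshold stored next to the score. The `MetricResult` class below is a hypothetical container for illustration, and it assumes that a higher score is better.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    score: float
    threshold: float  # the passing criterion for this metric

    @property
    def passed(self) -> bool:
        # Assumes higher is better; flip the comparison for metrics
        # where a lower score is the desirable outcome
        return self.score >= self.threshold

print(MetricResult("factual_consistency", score=0.72, threshold=0.6).passed)  # True
print(MetricResult("binary_metric", score=0, threshold=1).passed)  # False
```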

Step Four — Apply each Metric to your Evaluation Dataset

With everything in place, you can now loop through your evaluation dataset and evaluate each data point individually. The algorithm looks something like this:

  • Loop through your evaluation dataset.
  • For each data point, run your LLM application based on the given input.
  • Once your LLM application has finished generating an output for a given data point, compute a score for each of the metrics you’ve previously defined.
  • Identify and log failing metrics (metrics where the passing criterion wasn’t met).
  • Iterate on your LLM application based on these failing metrics.
  • Repeat the steps above until no metrics are failing.
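
The loop above can be sketched in a few lines. Everything here is illustrative: `run_evaluation`, the `(scorer, threshold)` pairing, and the toy exact-match scorer are assumptions, and exact match is used only because it needs no model (a real scorer should not penalize answers that are correct but worded differently).

```python
from typing import Callable, Dict, List, Tuple

# A scorer receives the data point and the generated output and returns
# a score where higher is better
Scorer = Callable[[Dict, str], float]

def run_evaluation(
    dataset: List[Dict],
    app: Callable[[str], str],
    metrics: Dict[str, Tuple[Scorer, float]],
) -> List[Dict]:
    """Return one record per (data point, metric) pair that missed its threshold."""
    failures = []
    for data_point in dataset:
        actual_output = app(data_point["input"])
        for name, (scorer, threshold) in metrics.items():
            score = scorer(data_point, actual_output)
            if score < threshold:
                failures.append({
                    "input": data_point["input"],
                    "actual_output": actual_output,
                    "metric": name,
                    "score": score,
                    "threshold": threshold,
                })
    return failures

# Toy usage: a stand-in "application" and an exact-match scorer
def exact_match(data_point: Dict, actual_output: str) -> float:
    return 1.0 if actual_output == data_point["expected_output"] else 0.0

toy_dataset = [
    {"input": "2+2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
]
toy_app = {"2+2": "4", "capital of France": "Lyon"}.get
failures = run_evaluation(toy_dataset, toy_app, {"exact_match": (exact_match, 1.0)})
print(failures)  # one failing record, for the "capital of France" input
```

The returned failure records are what you iterate on: each one tells you which input, which metric, and by how much the output fell short.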

Now, you can stop eyeballing outputs and ensure that having confidence in your LLM application is as easy as having passing test cases.

Evaluation Helps You Iterate Towards the Optimal Hyperparameters

There are several benefits of setting up an evaluation framework that would allow you to rapidly iterate and improve on your LLM application/retrieval pipeline:

  • Taking a RAG-based application as an example, you can now run several nested for loops to find the optimal combination of hyperparameters such as chunk size, top k retrieval, embedding model, and prompt template that would yield the highest metric scores for your evaluation dataset.
  • You’ll be able to make marginal improvements without worrying about unnoticed breaking changes.
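
The nested loops from the first point can be written compactly with `itertools.product`. In this sketch, `score_fn` is a hypothetical function that rebuilds your pipeline with the given hyperparameters, runs it on the evaluation dataset, and returns a single aggregate score; the toy scoring function at the bottom only stands in for that.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Optional, Tuple

def grid_search(
    grid: Dict[str, Iterable],
    score_fn: Callable[..., float],
) -> Tuple[Optional[Dict], float]:
    """Try every combination in `grid` and keep the highest-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    # product(...) is equivalent to one nested for loop per hyperparameter
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-in for "build the pipeline and score it on the evals dataset"
def toy_score(chunk_size: int, top_k: int) -> float:
    return -abs(chunk_size - 512) - abs(top_k - 5)

best_params, best_score = grid_search(
    {"chunk_size": [256, 512, 1000], "top_k": [3, 5]}, toy_score
)
print(best_params)  # {'chunk_size': 512, 'top_k': 5}
```

Keep in mind that the number of combinations grows multiplicatively, so each extra hyperparameter makes a full sweep noticeably more expensive.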

Evaluation is Not Bullet-Proof Though

Although your evaluation framework is now in place, it is flimsy and fragile, especially in the early days of deploying to production. This is because your users will start prompting your application in ways you’ve never expected, but that’s okay. To build a truly robust LLM application, you should:

  • Identify unsatisfactory outputs, mark them for reproducibility, and add them to your evaluation dataset. This is known as continuous evaluation, and without it, you’ll find that your LLM application slowly drifts out of touch with what your users care most about. There are several ways to identify bad outputs, but the most foolproof is to use human evaluators.
  • Identify on a component level which part of your LLM pipeline is causing unsatisfactory outputs. This is known as evaluating with tracing, and without it, you’ll find yourself making unnecessary changes because you “think”, for example, that the retrieval component is not retrieving the relevant text chunks when it’s actually the prompt template that’s the problem.

(You can find an example of how tracing can be implemented for an example chatbot here.)
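
To illustrate the idea of component-level tracing (this is a generic sketch, not how any particular library implements it), a decorator can record what each pipeline component received and returned. The `traced` decorator, the `TRACE` list, and the placeholder `retrieve` and `generate` functions are all hypothetical.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory trace log; a real setup would send this somewhere persistent
TRACE: List[Dict[str, Any]] = []

def traced(component: str) -> Callable:
    """Record the inputs, output, and duration of each call to a component."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "component": component,
                "args": args,
                "kwargs": kwargs,
                "output": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator

@traced("retriever")
def retrieve(query: str) -> List[str]:
    # Placeholder retrieval; a real pipeline would query a vector store here
    return ["Paris is the capital of France."]

@traced("generator")
def generate(query: str, chunks: List[str]) -> str:
    # Placeholder generation; a real pipeline would call an LLM here
    return chunks[0]

def answer(query: str) -> str:
    return generate(query, retrieve(query))

answer("What is the capital of France?")
for entry in TRACE:
    print(entry["component"], "->", entry["output"])
```

With a trace like this, a bad answer can be checked step by step: if the retriever’s output already lacks the relevant chunk, changing the prompt template will not help.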

Other Approaches to Evaluation

Another way to evaluate LLM applications could be an auto-evaluation approach where LLMs are used as judges for picking the best output when presented with several different choices. In fact, data from Databricks claims that LLM-as-a-judge agrees with human grading on over 80% of judgments. There are several points to note when using LLM-as-a-judge:

  • GPT-3.5 works, but only if you provide an example.
  • GPT-4 works well even without an example.
  • Use low-precision grading scales like 1–5 or a binary scale, instead of something like 1–100, so that grades stay consistent and easy to interpret.

A possible approach to auto-evaluation is to:

  • Generate outputs on all different combinations of hyperparameters.
  • Ask GPT-4 to compare and pick the best set of outputs in a pairwise fashion.
  • Identify the set of hyperparameters for the best set of outputs GPT-4 has chosen.
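
The pairwise selection step could look something like the sketch below. The `judge` callable is an assumption: in practice it would wrap a GPT-4 call that is shown two sets of outputs and says whether the first is better, while the toy judge here just prefers longer outputs so the example runs without an API.

```python
from typing import Callable, Dict, Hashable, List

def pick_best(
    candidates: Dict[Hashable, List[str]],
    judge: Callable[[List[str], List[str]], bool],
) -> Hashable:
    """Return the key whose outputs survive a sequence of pairwise comparisons.

    `judge(a, b)` should return True when output set `a` is better than `b`.
    """
    keys = list(candidates)
    if not keys:
        raise ValueError("candidates must not be empty")
    best = keys[0]
    for challenger in keys[1:]:
        if judge(candidates[challenger], candidates[best]):
            best = challenger
    return best

# Toy judge standing in for an LLM call: prefers the longer set of outputs
def toy_judge(a: List[str], b: List[str]) -> bool:
    return sum(map(len, a)) > sum(map(len, b))

outputs_by_config = {
    "chunk_size=256": ["a"],
    "chunk_size=512": ["abc"],
    "chunk_size=1000": ["ab"],
}
print(pick_best(outputs_by_config, toy_judge))  # chunk_size=512
```

This sequential comparison assumes the judge is reasonably consistent across pairs; if it is not, the winner can depend on the order in which candidates are compared.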

A problem I have with this approach, and why we haven’t implemented a way to do this at Confident AI, is that it leaves nothing actionable for subsequent iteration and improvement.


Evaluating LLM pipelines is essential to building robust applications, but evaluation is an involved and continuous process that requires a lot of work. If you want to do short-lived, untrusted evaluation, print statements are a great choice. However, if you want to employ a robust evaluation infrastructure in your current development workflow, you can use Confident AI. We’ve done all the hard work for you already, and although we’re still in alpha, you can find us on GitHub.

Thank you for reading, and I’ll be back next week to talk about all the different metrics for LLM evaluation.
