ChatGPT, the leading AI chatbot, has soared in popularity over the past year thanks to the seemingly omniscient GPT-4. Its ability to generate coherent and even poetic responses to previously unseen contexts has accelerated the development of other foundational large language models (LLMs), such as Anthropic’s Claude, Google’s Bard, and Meta’s open-source LLaMA model. Consequently, this has enabled ML engineers to build retrieval-based LLM applications around proprietary data like never before. But these applications continue to suffer from hallucinations, struggle to stay up to date with the latest information, and don’t always respond relevantly to prompts.
In this article, as the founder of Confident AI, the world’s first open-source evaluation infrastructure for LLM applications, I will outline how to evaluate LLM and retrieval pipelines, different workflows you can employ for evaluation, and the common pitfalls when building RAG applications that evaluation can solve.
Evaluation is (not) Eyeballing Outputs
Before we begin, does your current approach to evaluation look something like the code snippet below? You loop through a list of prompts, run your LLM application on each one of them, wait a minute or two for it to finish executing, manually inspect everything, and try to evaluate the quality of the output based on each input.
If this sounds familiar, then you desperately need this article. (And hopefully, by the end of it, you’ll know how to stop eyeballing results.)
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import ServiceContext

# Build a simple retrieval pipeline over the documents in ./data
service_context = ServiceContext.from_defaults(chunk_size=1000)
documents = SimpleDirectoryReader('data').load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=5)

def query(user_input):
    return query_engine.query(user_input).response

# Eyeball the output of each prompt, one at a time
prompts = [...]
for prompt in prompts:
    print(query(prompt))

Evaluation as a Multi-Step, Iterative Process
Evaluation is an involved process but has huge downstream benefits as you look to iterate on your LLM application. Building an LLM system without evaluations is akin to building a distributed backend system without any automated testing — although it might work at first, you’ll end up wasting more time fixing breaking changes than building the actual thing. (Fun fact: Did you know that AI-first applications suffer from notably low one-month retention rates because users don’t revisit flaky products?)
To evaluate LLM applications, you need several components: an evaluation dataset (one that improves over time), a handful of evaluation metrics chosen and implemented for the criteria relevant to your use case, and evaluation infrastructure in place to continuously run real-time evaluations throughout the lifetime of your LLM application.
By the way, if you're looking to get a better general sense of what LLM evaluation is, here is another great read.
Step One—Creating an Evaluation Dataset
The first step to any successful evaluation workflow for LLM applications is to create an evaluation dataset, or at least have a rough idea of the type of inputs your application is going to get. It might sound fancy and like a lot of work, but the truth is you’re probably already doing it as you eyeball outputs.
Let’s consider the eyeballing example above. Correct me if I’m wrong, but what you’re really trying to do is to judge an output based on what you’re expecting. You probably already know something about the knowledge base you’re working with and are likely aware of what retrieval results you expect to see should you also choose to print out the retrieved text chunks in your retrieval pipeline. The initial evals dataset doesn’t have to be comprehensive, but start by writing down a set of QAs with the relevant context:
dataset = [
    {
        "input": "...",
        "expected_output": "...",
        # context is a list of strings that ideally represents the
        # additional context your LLM application will receive at query time
        "context": ["..."]
    },
    ...
]

Here, the “input” is mandatory, but “expected_output” and “context” are optional (you’ll see why later).
If you wish to automate things, you can try to generate an evals dataset by looping through your knowledge base (which could be in a vector database like Qdrant) and ask GPT-3.5 to generate a set of QAs instead of manually doing it yourself. It’s flexible, versatile, and fast, but limited by the data it was trained on. (Ironically, you’re more likely to care about evaluation if you’re building in a domain that requires deep expertise, since it’s more reliant on the retrieval pipeline rather than the foundational model itself.)
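For instance, here’s a minimal sketch of what that could look like, assuming your knowledge base has already been split into a list of text chunks and that you’re using the openai Python package (the prompt wording and the generate_qa_pairs helper are illustrative, not a prescribed API):

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def generate_qa_pairs(chunks, n_questions=1):
    # Draft QA pairs grounded in each knowledge base chunk
    dataset = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    f"Generate {n_questions} question-answer pair(s) grounded ONLY in the text "
                    "below. Respond with a JSON list of objects, each with 'input' and "
                    f"'expected_output' keys.\n\nText:\n{chunk}"
                ),
            }],
        )
        # Note: parsing the raw completion as JSON is brittle; you'll want
        # validation and retries in practice
        for qa in json.loads(response.choices[0].message.content):
            dataset.append({
                "input": qa["input"],
                "expected_output": qa["expected_output"],
                "context": [chunk],  # the chunk itself becomes the grounding context
            })
    return dataset

Just be sure to spot-check the generated pairs; synthetic QAs are a starting point for your evals dataset, not a substitute for reviewing it yourself.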
Lastly, you might wonder, “Why do I need an evaluation dataset when there are already standard LLM benchmarks out there?” Well, it’s because public benchmarks like Stanford HELM evaluate foundational models on generic tasks and tell you very little about how an LLM application built on your proprietary data will perform.
Step Two—Identify Relevant Metrics for Evaluation
The next step in evaluating LLM applications is to decide on the set of metrics you want to evaluate your LLM application on. Some examples include:
factual consistency (how factually correct your LLM application is based on the respective context in your evals dataset)
answer relevancy (how relevant your LLM application’s outputs are based on the respective inputs in your evals dataset)
coherence (how logical and consistent your LLM application’s outputs are)
toxicity (whether your LLM application is outputting harmful content)
RAGAS (for RAG pipelines)
bias (pretty self-explanatory)
I’ll write about all the different types of metrics in another article, but as you can see, different metrics require different components in your evals dataset to reference against one another. Factual consistency doesn’t care about the input, and toxicity only cares about the output. (Here, we would call factual consistency a reference-based metric since it requires some sort of grounded context, while toxicity, for example, is a reference-less metric.)
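To make the distinction concrete, here’s a rough sketch of the scorer signatures you might end up with (the function names are illustrative, not a real API):

# Reference-based: needs grounded context from your evals dataset
def factual_consistency(output: str, context: list[str]) -> float: ...

# Needs the input the output was generated for
def answer_relevancy(input: str, output: str) -> float: ...

# Reference-less: judges the output on its own
def toxicity(output: str) -> float: ...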
Step Three—Implement a Scorer to Compute Metric Scores
This step involves taking all the relevant metrics you’ve previously identified and implementing a way to compute a score for each data point in your evals dataset. Here’s an example of how you might implement a scorer for factual consistency (code taken from DeepEval):
from sentence_transformers import CrossEncoder
from scipy.special import softmax

def predict(text_a: str, text_b: str) -> float:
    # https://huggingface.co/cross-encoder/nli-deberta-v3-large
    model = CrossEncoder('cross-encoder/nli-deberta-v3-large')
    # Score both directions, since entailment is not symmetric
    scores = model.predict([(text_a, text_b), (text_b, text_a)])
    softmax_scores = softmax(scores, axis=1)
    # Index 1 corresponds to the "entailment" label for this model
    score = softmax_scores[0][1]
    second_score = softmax_scores[1][1]
    return max(score, second_score)

Here, we used a natural language inference model from Hugging Face to compute an entailment score ranging from 0 to 1 to measure factual consistency. It doesn’t have to be this particular implementation, but you get the point — you’ll have to decide how you want to compute a score for each metric and find a way to implement it. One thing to note is that LLM outputs are probabilistic in nature, so your implementation of the scorer should take this into account and not penalize outputs that are equally correct but different from what you expect.
At Confident AI, we use a combination of model-based, statistical, and LLM-based scorers depending on the type of metric we’re trying to evaluate. For example, we use a model-based approach to evaluate metrics such as factual consistency (NLI models) and answer relevancy (cross-encoders), while for more nuanced metrics such as coherence, we implemented a framework called G-Eval (which applies LLMs with Chain-of-Thought prompting) for evaluation using GPT-4. (If you’re interested, here’s the paper that introduces G-Eval, a robust framework for using LLMs for evaluation; a rough sketch of what a G-Eval-style scorer can look like follows the list below.) In fact, the authors of the paper found that G-Eval outperforms traditional scores such as:
BLEU (compares n-grams of the machine-generated text to n-grams of a reference translation and counts the number of matches)
BERTScore (a metric for evaluating text generation based on BERT embeddings)
ROUGE (a set of metrics for evaluating automatic summarization of texts as well as machine translation)
MoverScore (computes the distance between the contextual embeddings of words in the machine-generated text and those in a reference text)
If you’re not familiar with these scores, don’t worry; here's an in-depth article on all the different types of LLM evaluation metric scorers.
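As promised above, here’s a minimal sketch of what a G-Eval-style coherence scorer could look like, assuming the openai package and GPT-4 access. Note that the real framework also auto-generates the evaluation steps and weights the final score by token probabilities, which this simplified version skips:

from openai import OpenAI

client = OpenAI()

# Hand-written evaluation steps; G-Eval generates these with Chain-of-Thought
COHERENCE_STEPS = (
    "1. Read the output and identify its main points.\n"
    "2. Check whether the points follow a logical order and stay consistent.\n"
    "3. Penalize contradictions, abrupt topic jumps, and repetition."
)

def g_eval_coherence(output: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "You are evaluating the coherence of an LLM output.\n"
                f"Evaluation steps:\n{COHERENCE_STEPS}\n\n"
                f"Output to evaluate:\n{output}\n\n"
                "Work through the steps, then answer with a single integer "
                "from 1 to 5 on the final line."
            ),
        }],
    )
    last_line = response.choices[0].message.content.strip().splitlines()[-1]
    return int(last_line) / 5  # normalize to a 0-1 score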
Lastly, you’ll need to define a passing criterion for each metric; the passing criterion is the threshold the metric score needs to meet or exceed in order for your LLM application’s output to be deemed satisfactory for a given input. For example, a passing criterion for the factual consistency metric implemented above could be 0.6, since the metric outputs a score ranging from 0 to 1. (Similarly, the passing criterion might be 1 for a metric that outputs a binary 0 or 1 score.)
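Putting the pieces together, an evaluation run might then look something like the sketch below, which reuses the query function and the factual consistency predict scorer from earlier (the loop structure and result format are illustrative):

PASSING_CRITERION = 0.6

results = []
for case in dataset:
    actual_output = query(case["input"])  # your LLM application from earlier
    # Measure factual consistency of the output against the grounding context
    score = predict(actual_output, " ".join(case["context"]))
    results.append({
        "input": case["input"],
        "score": score,
        "passed": score >= PASSING_CRITERION,
    })

Any failing cases then become concrete, reproducible targets for your next iteration, instead of something you have to re-eyeball by hand.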