Recently, I’ve been hearing the term “LLM as a Judge” more than ever. That might be because I work on LLM evaluation for a living, but LLM judges are also taking over because it is becoming clear that they are a much better option for LLM evaluation than human evaluators, who are slow, costly, and labor-intensive.
But LLM judges do have their limitations, and using them without caution will cause you nothing but frustration. In this article, I’ll let you in on everything I know (so far) about using LLM judges for LLM (system) evaluation, including:
- What is LLM as a judge, and why is it so popular.
- LLM as a judge alternatives, and why they don’t cut it.
- Limitations of LLM judges and how to address them.
- Using LLM judges as scorers in LLM evaluation metrics via DeepEval.
Can’t wait? Neither can I.
What is “LLM as a Judge” and Why Do You Keep Hearing About it?
LLM-as-a-Judge refers to using LLMs to evaluate LLM responses based on any specific criteria of your choice, which is basically using LLMs to carry out LLM (system) evaluation. It was introduced in the Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena paper as an alternative to human evaluation, which is often expensive and time-consuming. The paper describes three types of LLM judges:
- Single Output Scoring (without reference): A judge LLM is provided with a scoring rubric as the criteria and prompted to assign a score to LLM responses based on various factors such as input to LLM system, retrieval context in RAG pipelines, etc.
- Single Output Scoring (with reference): Same as above, but because LLM judges can get flaky, the judge LLM is also given a reference (ideal, expected) output, which helps it return consistent scores.
- Pairwise Comparison: Given two LLM generated outputs, the judge LLM will pick which one is the better generation with respect to the input. This also requires custom criteria to determine what is “better”.
The concept is straightforward: provide an LLM with an evaluation criterion, and let it do the grading for you. But how and where exactly would you use LLMs to judge LLM responses?
“LLM as a Judge” can be used to augment LLM evaluation by using it as a scorer for LLM evaluation metrics (if you don’t know what an LLM evaluation metric is, I highly recommend reading this article here). To get started, simply provide the LLM of your choice with a clear and concise evaluation criterion or a rubric, and use it to compute a metric score (ranging from 0 to 1) based on various parameters, such as the input and generated output of your LLM. Here is an example of an evaluation prompt to an LLM judge to evaluate summary coherence:
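A minimal sketch of such a prompt, in plain Python with no particular LLM client assumed. The template wording, the `build_prompt` and `parse_score` helpers, and the "Score: N" reply format are all illustrative assumptions, not taken from any library. The judge's 1 to 5 rating is rescaled to the 0 to 1 range mentioned above.

```python
import re

# Hypothetical prompt template: the wording is an illustration only.
COHERENCE_PROMPT = """You are grading a summary for coherence.

Evaluation criteria:
Coherence (1-5): the collective quality of all sentences. The summary should be
well-structured and well-organized, building from sentence to sentence into a
coherent body of information, not just a heap of related facts.

Source text:
{source}

Summary:
{summary}

Reply with a single line of the form "Score: <integer from 1 to 5>".
"""


def build_prompt(source: str, summary: str) -> str:
    """Fill the template with the text being summarized and the summary to grade."""
    return COHERENCE_PROMPT.format(source=source, summary=summary)


def parse_score(reply: str) -> float:
    """Extract the 1-5 rating from the judge's reply and rescale it to 0-1."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return (int(match.group(1)) - 1) / 4
```

You would send the result of `build_prompt(...)` to the judge model of your choice and pass its text reply to `parse_score(...)`.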
By collecting these metric scores, you can create a comprehensive suite of LLM evaluation results, which can be used to benchmark, evaluate, and even regression test LLM (systems).
The growing trend of using LLMs as a scorer for LLM evaluation metrics to evaluate other LLMs is catching on because the alternatives just don’t cut it. LLM evaluation is vital to quantifying and identifying areas to improve LLM system performance, but human evaluation is slow, and traditional evaluation methods like BERTScore and ROUGE miss the mark by overlooking the deeper semantics in LLM generated text. Think about it: how could we expect traditional, much smaller NLP models to effectively judge not just paragraphs of open-ended generated text, but also content in formats like Markdown or JSON?
Alternatives to LLM Judges
Here are two popular alternatives to using LLMs for LLM evaluation and common reasons why they, in my opinion, are mistakenly preferred:
- Human Evaluation: Often seen as the gold standard due to its ability to understand context and nuance. However, it’s time-consuming, expensive, and can be inconsistent due to subjective interpretations. It’s not unusual for a real-world LLM application to generate approximately 100,000 responses a month. I don’t know about you, but it takes me about 45 seconds on average to read through a few paragraphs and make a judgment about them. That adds up to around 4.5 million seconds, or about 52 consecutive days each month (without taking lunch breaks), to evaluate every single generated LLM response.
- Traditional NLP Evaluation Methods: Traditional scorers such as BERTScore and ROUGE are great: they are fast, cheap, and reliable. However, as I pointed out in one of my previous articles comparing all types of LLM evaluation metric scorers, these methods have two fatal flaws: they require a reference text to compare the generated LLM outputs against, and they are incredibly inaccurate because they overlook semantics in LLM-generated outputs, which are often open to subjective interpretation and come in various complicated formats (e.g., JSON). Given that LLM outputs in production are open-ended and have no reference text, traditional evaluation methods hardly make the cut.
(Both human and traditional NLP evaluation methods also lack explainability, which is the ability to explain the evaluation score they have given.)
And so, LLM judges are currently the best available option. They are scalable, can be fine-tuned or prompt-engineered to mitigate bias, are relatively fast and cheap (though this depends on which method of evaluation you’re comparing against), and most importantly, can understand even extremely complicated pieces of generated text, regardless of the content itself and the format it is in. With that in mind, let’s go through the effectiveness of LLM judges and their pros and cons in LLM evaluation.
Confident AI: The LLM Evaluation Platform
The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.
LLMs, More Judgemental Than You Think
So the question is, how accurate are LLM judges? After all, LLMs are probabilistic models, and are still susceptible to hallucination, right?
Research has shown that when used correctly, state-of-the-art LLMs such as GPT-4 (yes, still GPT-4) can agree with human judgement up to 85% of the time, for both pairwise and single-output scoring. For those who are still skeptical, this number is actually even higher than the agreement among humans (81%).
The fact that GPT-4’s judgements hold up across both pairwise and single output scoring implies GPT-4 has a relatively stable internal rubric, and this stability can be further improved through chain-of-thought (CoT) prompting.
G-Eval
As introduced in one of my previous articles, G-Eval is a framework that uses CoT prompting to stabilize and make LLM judges more reliable and accurate in terms of metric score computation (scroll down to learn more about CoT).
G-Eval first generates a series of evaluation steps from the original evaluation criteria and then uses the generated steps to determine the final score via a form-filling paradigm (this is just a fancy way of saying G-Eval requires several pieces of information to work). For example, evaluating LLM output coherence using G-Eval involves constructing a prompt that contains the criteria and the text to be evaluated in order to generate evaluation steps, before using an LLM to output a score from 1 to 5 based on these steps (for a more detailed explanation, read this article instead).
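The two-step flow described above can be sketched in plain Python. This is not DeepEval's implementation or the paper's exact prompts: the `Judge` callable (prompt in, text reply out), the prompt wording, and both function names are hypothetical stand-ins for whatever LLM client you use.

```python
from typing import Callable, List

# Hypothetical judge interface: takes a prompt, returns the model's text reply.
Judge = Callable[[str], str]


def generate_steps(judge: Judge, criteria: str) -> List[str]:
    """Step 1: ask the judge to turn high-level criteria into concrete evaluation steps."""
    prompt = (
        "Write a short numbered list of evaluation steps for the criteria below.\n"
        f"Criteria: {criteria}"
    )
    reply = judge(prompt)
    return [line.strip() for line in reply.splitlines() if line.strip()]


def score_with_steps(judge: Judge, steps: List[str], text: str) -> int:
    """Step 2: ask the judge to follow the generated steps and return a 1-5 score."""
    prompt = (
        "Follow these evaluation steps:\n"
        + "\n".join(steps)
        + f"\n\nText to evaluate:\n{text}\n\n"
        "Answer with a single integer from 1 to 5."
    )
    score = int(judge(prompt).strip())
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score
```

Keeping the two steps separate means the generated steps can be inspected, cached, and reused across every text you score against the same criteria.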
As you’ll learn later, the technique presented in G-Eval actually aligns with various techniques we can use to improve LLM judgements. You can use G-Eval immediately in a few lines of code through DeepEval⭐, the open-source LLM evaluation framework.
LLMs are Not Perfect Though
As you might expect, LLM judges are not all rainbows and sunshine. They also suffer from several drawbacks, which include:
- Narcissistic Bias: It has been shown that LLMs may favor the answers generated by themselves. We use the word “may” because research has shown that although GPT-4 and Claude-v1 favor themselves with a 10% and 25% higher win rate respectively, they also favor other models, and GPT-3.5 does not favor itself.
- More is More: We humans all know the phrase less is more, but LLM judges tend to prefer more verbose text over more concise text. This is a problem in LLM evaluation because LLM computed evaluation scores might not accurately reflect the quality of the LLM generated text.
- Not-so-Fine-Grained Evaluation Scores: LLMs can be reliable judges when making high-level decisions, such as determining binary factual correctness or rating generated text on a simple 1–5 scale. However, as the scoring scale becomes more detailed with finer intervals, LLMs are more likely to produce arbitrary scores, making their judgments less reliable and more prone to randomness.
- Position Bias: When using LLM judges for pairwise comparisons, it has been shown that LLMs such as GPT-4 generally prefer the first generated LLM output over the second one.
Furthermore, there are other general considerations such as LLM hallucination. However, that’s not to say these can’t be solved. In the next section, we’ll go through some techniques on how to mitigate such limitations.
Improving LLM Judgements
Chain-Of-Thought Prompting
Chain-of-thought (CoT) prompting is an approach where the model is prompted to articulate its reasoning process. In the context of LLM judges, it involves including detailed evaluation steps in the prompt instead of vague, high-level criteria, which helps a judge LLM perform more accurate and reliable evaluations. This also helps LLMs align better with human expectations.
This is in fact the technique employed in G-Eval, which they call “auto-CoT”, and is of course implemented in DeepEval, which you can use like this:
Few-Shot Prompting
Few-shot prompting is a simple concept which involves including examples to better guide LLM judgements. It is definitely more computationally expensive as you’ll be including more input tokens, but few-shot prompting has been shown to increase GPT-4’s consistency from 65.0% to 77.5%.
Other than that, there’s not much to elaborate on here, and if you’ve ever tried playing around with different prompt templates you’ll know that adding a few examples in the prompts is probably the single most helpful thing one could do to steer LLM generated outputs.
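As a rough illustration, a few-shot judge prompt can be assembled by placing already-graded examples ahead of the text to be scored. The `build_few_shot_prompt` helper and its "Text / Score" layout are assumptions made for this sketch, not a prescribed format.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    criteria: str,
    examples: List[Tuple[str, int]],
    candidate: str,
) -> str:
    """Build a judge prompt that shows graded examples before the text to score."""
    parts = [f"Criteria: {criteria}", ""]
    for text, score in examples:
        # Each example pairs a text with the score a human grader assigned to it.
        parts += [f"Text: {text}", f"Score: {score}", ""]
    # The prompt ends with an open "Score:" for the judge to complete.
    parts += [f"Text: {candidate}", "Score:"]
    return "\n".join(parts)
```

Every example adds input tokens to each judge call, which is the extra cost mentioned above.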
Using Probabilities of Output Tokens
To make the computed evaluation score more continuous, instead of asking the judge LLM to output scores on a finer scale, which may introduce arbitrariness in the metric score, we can ask the LLM to generate 20 scores and use the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation, where each candidate score is weighted by its token probability. This minimizes bias in LLM scoring and smooths the final metric score without compromising accuracy.
Bonus: This is also implemented in DeepEval’s G-Eval implementation.
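The weighted summation itself is simple to sketch. The snippet below assumes you already have the log-probabilities of the candidate score tokens from your model provider; the `expected_score` function, its input format, and the default 1 to 5 range are assumptions for illustration, not DeepEval's internals.

```python
import math
from typing import Dict, Iterable


def expected_score(
    token_logprobs: Dict[str, float],
    valid_scores: Iterable[int] = range(1, 6),
) -> float:
    """Probability-weighted average over the score tokens the judge could have emitted."""
    allowed = set(valid_scores)
    mass: Dict[int, float] = {}
    for token, logprob in token_logprobs.items():
        try:
            score = int(token.strip())
        except ValueError:
            continue  # skip tokens that are not integers
        if score in allowed:
            mass[score] = mass.get(score, 0.0) + math.exp(logprob)
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no valid score tokens found")
    # Renormalize so probability spent on non-score tokens does not drag the score down.
    return sum(score * p for score, p in mass.items()) / total
```

For example, if the judge puts probability 0.6 on "4", 0.3 on "5", and 0.1 on "3", the result is roughly 4.2 rather than a flat 4.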
Reference-Guided Judging
Instead of single output, reference-free judging, providing an expected output as the ideal answer helps a judge LLM better align with human expectations. In your prompt, this can be as simple as incorporating it as an example in few-shot prompting.
Confining LLM Judgements
Instead of giving LLMs the entire generated output to evaluate, you can consider breaking it down into more fine-grained evaluations. For example, you can use an LLM to power question-answer-generation (QAG), a powerful technique for computing evaluation metric scores that are non-arbitrary because they are based on yes/no answers to close-ended questions. For example, if you would like to calculate the answer relevancy of an LLM output based on a given input, you can first extract all sentences in the LLM output, then have the judge decide for each sentence whether it is relevant to the input. The final answer relevancy score is then the proportion of relevant sentences in the LLM output. For a more complete example of QAG, read this article on how to use QAG to compute scores for various different RAG and text summarization metrics.
QAG is a powerful technique because it means LLM scores are no longer arbitrary and can be attributed to a mathematical formula. Breaking down the initial prompt to only include sentences instead of the entire LLM output can also help combat hallucinations as there is now less text to be analyzed.
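The answer relevancy example above can be sketched as follows. The `is_relevant` callable stands in for a yes/no call to a judge LLM and is hypothetical, and the regex sentence splitter is a deliberately naive assumption.

```python
import re
from typing import Callable


def answer_relevancy(
    question: str,
    answer: str,
    is_relevant: Callable[[str, str], bool],
) -> float:
    """Fraction of sentences in the answer that the judge marks as relevant to the question."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    # One close-ended yes/no verdict per sentence.
    relevant = sum(1 for sentence in sentences if is_relevant(question, sentence))
    return relevant / len(sentences)
```

The judge only ever answers a yes/no question about one sentence at a time, and the final score comes from the ratio rather than from a number the LLM made up.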
Swapping Positions
No rocket science here: to address position bias in pairwise LLM judges, we can simply run the comparison twice with the positions swapped, and only declare a win when an answer is preferred in both orders.
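A minimal sketch of the swap, assuming a hypothetical pairwise judge that takes the question and two answers and replies with "first" or "second". Treating a disagreement between the two orders as a tie is one common choice, made here as an assumption.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_winner(
    judge: PairwiseJudge,
    question: str,
    answer_a: str,
    answer_b: str,
) -> str:
    """Run the comparison in both orders; return "A", "B", or "tie"."""
    original_order = judge(question, answer_a, answer_b)
    swapped_order = judge(question, answer_b, answer_a)
    if original_order == "first" and swapped_order == "second":
        return "A"  # A preferred regardless of position
    if original_order == "second" and swapped_order == "first":
        return "B"  # B preferred regardless of position
    return "tie"  # the verdict flipped with the position, so it is not trusted
```

This doubles the number of judge calls per comparison, which is the price of filtering out verdicts that depend only on ordering.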
Fine-Tuning
For more domain-specific LLM judges, you might consider fine-tuning a custom open-source model like Llama-3.1. This is also worth considering if you would like faster inference and lower costs for LLM evaluation.
Using LLM Judges in LLM Evaluation Metrics
Lastly, LLM judges can be, and currently most often are, used to evaluate LLM systems by incorporating them as the scorer in an LLM evaluation metric.
A good implementation of an LLM evaluation metric will use all the mentioned techniques to improve the LLM judge scorer. For example, in DeepEval (give it a star here⭐) we already use QAG to confine LLM judgements in RAG metrics such as contextual precision, auto-CoTs and normalized output token probabilities for custom metrics such as G-Eval, and most importantly, few-shot prompting for all metrics to cover a wide variety of edge cases. For a full list of metrics that you can use immediately, click here.
To finish off this article, I’ll show you how you can leverage DeepEval’s metrics in a few lines of code. You can also find all the implementations on DeepEval’s GitHub, which is free and open-source.
Coherence
You’ve probably seen this a few times, a custom metric that you can implement via G-Eval:
Note that we turned on verbose_mode for G-Eval. When verbose mode is turned on in DeepEval, it prints the internal workings of an LLM judge and allows you to see all the intermediate judgements made.
Contextual Precision
Contextual precision is a RAG metric that determines whether the nodes retrieved in your RAG pipeline are ranked in the correct order, with relevant nodes ahead of irrelevant ones. This is important because LLMs tend to give more weight to nodes that are closer to the end of the prompt (recency bias). Contextual precision is calculated using QAG, where the relevance of each node is determined by an LLM judge by looking at the input. The final score is a weighted cumulative precision, and you can view the full explanation here.
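To make "weighted cumulative precision" concrete, here is a sketch that assumes the usual average-precision-style formula: for each rank k that holds a relevant node, take precision@k, then average over the relevant nodes. The per-node relevance verdicts would come from the LLM judge; here they are passed in as booleans. The function name and this exact formula are assumptions, so check DeepEval's documentation for its precise definition.

```python
from typing import Sequence


def contextual_precision(relevance: Sequence[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant node."""
    relevant_seen = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # precision@k: share of the top-k nodes that are relevant
            weighted_sum += relevant_seen / k
    if relevant_seen == 0:
        return 0.0
    return weighted_sum / relevant_seen
```

With verdicts `[True, False, True]` this gives (1/1 + 2/3) / 2, about 0.83, while `[False, True, True]` drops to about 0.58 because the irrelevant node is ranked first.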
Conclusion
You made it! It was a lot on LLM judges, but now at least we know what the different types of LLM judges are, their role in LLM evaluation, pros and cons, and the ways in which you can improve them.
The main objective of an LLM evaluation metric is to quantify the performance of your LLM (application), and to do this we have different scorers, of which the current best are LLM judges. Sure, there are drawbacks such as LLMs exhibiting bias in their judgements, but these can be mitigated with prompt engineering techniques such as CoT and few-shot prompting.
Don’t forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article useful, and as always, till next time.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?