
When talking to a user of DeepEval last week, here’s what I heard:
“We [a team of 7 engineers] just sit in a room for 30 minutes in silence to prompt for half an hour while entering the results into a spreadsheet before giving the thumbs up for deployment”
For many LLM engineering teams, pre-deployment checks still come down to eyeballing outputs and "vibe checks." A big reason for this is that Large Language Model (LLM) applications are unpredictable, which makes testing them a significant challenge.
While it's essential to run quantitative evaluations through unit tests to catch regressions in CI/CD pipelines before deployment, the subjective and variable nature of LLM outputs makes the principles of traditional software testing difficult to transfer.

But what if there were a way to address this unpredictability to enable unit-testing for LLMs?
This is exactly why we need to discuss LLM evaluators, which tackle this challenge by using LLMs to evaluate other LLMs. In this article, we’ll cover:
- What LLM evaluators are, why they are important, and how to choose them
- Common LLM evaluators for different use cases and systems (RAG, agents, etc.)
- How to tailor evaluators for your specific use case
- Practical code implementations for these evaluators in DeepEval (github⭐), including in CI/CD testing environments
After reading this article, you’ll know exactly how to choose, implement, and optimize LLM evaluators for your LLM testing workflows.
Let’s dive right in.
What are LLM Evaluators?
LLM evaluators are LLM-powered scorers that help quantify how well your LLM system is performing on criteria such as relevancy, answer correctness, faithfulness, and more. Unlike traditional statistical scores like recall, precision, or F1, LLM evaluators use LLM-as-a-judge, which involves feeding the inputs and outputs of your LLM system into a prompt template, and having an LLM judge score a single interaction based on your chosen evaluation criteria.
Evaluators are typically used as part of metrics that test your LLM app in the form of unit tests. Many of these unit tests together form a benchmark for your LLM application. This benchmark allows you to run regression tests by comparing each unit test side-by-side across different versions of your system.
There are two main types of LLM evaluators, first introduced in the “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” paper:
- Single-output evaluation (both referenceless and reference-based): A judge LLM is given a scoring rubric and asked to evaluate one output at a time. It considers factors like the system input, retrieved context (e.g. in RAG pipelines), and optionally a reference answer, then assigns a score based on the criteria you define. If a labelled (expected) output is provided, the evaluation is reference-based; otherwise it is referenceless.
- Pairwise comparison: The judge LLM is shown two different outputs generated from the same input and asked to choose which one is better. Like single-output evaluation, this also relies on clear criteria to define what “better” means — whether that’s accuracy, helpfulness, tone, or anything else.
Although pairwise comparison is possible, the trend we’re seeing at DeepEval is that most teams today primarily use single-output evaluation, then compare the scores between test runs to measure improvements or regressions.
Here are the most common metrics powered by LLM evaluators that you could use to capture both subjective and objective evaluation criteria:
- Correctness — Typically a reference-based metric that compares the correctness of an LLM output against the expected output (in fact, this is the most common use case for G-Eval).
- Answer Relevancy — Can be either referenceless or reference-based; it measures how relevant the LLM output is to the input.
- Faithfulness — A referenceless metric used in RAG systems to assess whether the LLM output contains hallucinations when compared to the retrieved text chunks.
- Task completion — A referenceless, agentic metric that evaluates how well the LLM completed the task based on the given input.
- Summarization — Can be either referenceless or reference-based; it evaluates how effectively the LLM summarizes the input text.
These metrics use the following LLM evaluators under the hood:
- G-Eval — A framework that uses LLMs with CoT to evaluate LLMs on any criteria of your choice.
- DAG (deep acyclic graph) — A framework that uses LLM-powered decision trees to evaluate LLMs on any criteria of your choice.
- QAG (question-answer generation) — A framework that uses LLMs to first generate a series of close-ended questions, then uses the binary yes/no answers to those questions to compute the final score.
- Prometheus — A purely model-based evaluator that relies on a fine-tuned LLaMA-2 model (Prometheus) and an evaluation prompt. Prometheus is strictly reference-based.
These evaluators can either be algorithms in the form of prompt engineering, or just the LLM itself as is the case with Prometheus.
We’ll go through each of these, figure out which ones make the most sense for your use case and system — but first, let’s take a step back and understand why we’re using LLM evaluators to test LLM applications in the first place.
Why LLM evaluators for LLM testing?
LLM evaluators are built to handle the ambiguity and subjectivity of language generation — making them far more suitable than rigid metrics for evaluating LLM systems:
- LLM outputs can vary on each run, even with the same prompt — evaluators handle that variability.
- Many tasks (like summarization, reasoning, or open-ended answers) don’t have a single “correct” output.
- Evaluators can score dimensions like coherence, relevance, tone, and helpfulness — which traditional metrics can’t.
- Scalable: evaluations can run automatically across thousands of outputs.
However, that’s not to say LLM evaluators have no downsides, as explained in more detail in this article on using LLM-as-a-judge. The main downside of LLM evaluators is that they are extremely biased. For example, there is literally a paper titled “LLM Evaluators Recognize and Favor Their Own Generations”, released back in 2024, in which the authors demonstrated that a model’s self-recognition ability is directly correlated with its degree of self-preference.

What about humans?
To state the obvious, human evaluators are accurate but impractical for modern LLM development cycles — especially if you want to move fast and test often.
- Expensive and time-consuming to scale to hundreds or thousands of outputs.
- Inconsistent — different people may rate the same response differently.
- Not CI/CD friendly — you can’t ship code based on 2-day human eval loops.
- LLM evaluators let you automate scoring and get feedback instantly.
Another thing you may not know: agreement between human evaluators themselves is often no higher than the agreement between humans and LLM judges like GPT-4, where the human-LLM agreement rate is around 81%.
Why not accuracy or BLEU?
Traditional NLP metrics like accuracy or BLEU were made for structured tasks, not creative or generative ones — and they miss what really matters in LLM outputs.
- BLEU and ROUGE rely on surface-level token overlap, ignoring meaning and fluency.
- Accuracy assumes a ground-truth answer, which doesn’t exist for most LLM tasks.
- They can penalize perfectly fine outputs just because they use different phrasing.
In fact, back in mid-2023 all of DeepEval’s metrics were non-LLM evaluators, and the results were horrible: users complained that scores wouldn’t change by a single decimal point even after deleting entire paragraphs from their LLM output. The only thing that eventually worked was LLM-as-a-judge.
Top LLM Evaluators
With the exception of using OpenAI’s o-series models to evaluate coding and math problems, a naive approach such as an off-the-shelf LLM plus a minimalistic evaluation prompt very rarely works well for LLM evaluation. In this section, we will walk through the top evaluators you should reach for when incorporating LLM evaluators into your LLM testing workflows.
With the exception of Prometheus, G-Eval, DAG, and QAG are all based on prompt engineering. In the examples below, we’ll look at a popular use case in sales — email drafting.
G-Eval
G-Eval is one of the most popular LLM evaluators out there and uses an LLM with CoT prompting to evaluate LLM outputs. As I’ve introduced numerous times in previous articles, G-Eval first generates a series of evaluation steps from a given criteria, then uses those steps to determine the final score via a form-filling paradigm.

In layman’s terms, the prompt template will contain:
- The criteria
- The evaluation steps generated from this criteria
- Any LLM test case details such as input, output, etc.
G-Eval is best for subjective evaluation, and for a sales-email drafting use case here’s how to evaluate persuasiveness:
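Below is a minimal sketch using DeepEval's GEval metric; the criteria wording and the example email are illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A custom "Persuasiveness" metric powered by the G-Eval evaluator
persuasiveness = GEval(
    name="Persuasiveness",
    criteria=(
        "Determine how persuasive the sales email in the actual output is "
        "at convincing the recipient to take the next step."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# A single interaction with your LLM app (contents are illustrative)
test_case = LLMTestCase(
    input="Draft a follow-up email to a prospect who went quiet after our pricing demo.",
    actual_output="Hi Sarah, thanks again for your time last week...",
)

persuasiveness.measure(test_case)
print(persuasiveness.score, persuasiveness.reason)
```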
G-Eval is available on DeepEval, with the top use case being answer correctness that ran over 8M times in March 2025 alone. You can read more about it here.
DAG
Deep Acyclic Graph (DAG) is a deterministic LLM evaluator made possible through decision trees modeled as directed acyclic graphs, where each node is an LLM judgement and each edge is a decision taken based on it. The leaf nodes are either hardcoded scores to be returned, or G-Eval evaluators that you can use for more fine-grained evaluation.
In this example, we will show how to evaluate persuasiveness as above, but using DAG to filter out lengthy emails that have more than 4 sentences:

This is how you would implement it in code:
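Here's a rough sketch using DeepEval's DAG interface; the class and argument names follow the DeepEval DAG docs, and the criteria wording and example email are illustrative:

```python
from deepeval.metrics import DAGMetric, GEval
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    BinaryJudgementNode,
    VerdictNode,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Leaf evaluator: only reached if the email passes the length filter
persuasiveness = GEval(
    name="Persuasiveness",
    criteria="Determine how persuasive the sales email in the actual output is.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Root node: a hard, objective filter on email length
length_filter = BinaryJudgementNode(
    criteria="Does the sales email in the actual output contain 4 sentences or fewer?",
    children=[
        VerdictNode(verdict=False, score=0),              # too long: fail immediately
        VerdictNode(verdict=True, child=persuasiveness),  # short enough: score persuasiveness
    ],
)

dag = DeepAcyclicGraph(root_nodes=[length_filter])
concise_persuasiveness = DAGMetric(name="Concise Persuasiveness", dag=dag)

test_case = LLMTestCase(
    input="Draft a follow-up email to a prospect who went quiet after our pricing demo.",
    actual_output="Hi Sarah, thanks again for your time last week...",
)
concise_persuasiveness.measure(test_case)
print(concise_persuasiveness.score, concise_persuasiveness.reason)
```

Notice how the objective length check sits at the root, so verbose emails fail deterministically before any subjective persuasiveness judgement is made.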
More details on how the DAG evaluator works and the philosophy behind it are available here.
QAG
Question-answer generation (QAG) is a framework that leverages binary answers to close-ended questions to determine the final score for an LLM test case. For example, evaluating persuasiveness using QAG instead of G-Eval might result in the score being the proportion of persuasive sentences found in the generated sales email, rather than a score against a loosely defined G-Eval rubric.
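To make the idea concrete, here's a bare-bones, framework-free sketch of QAG scoring using the OpenAI SDK; the model name, prompt wording, and naive sentence splitting are all illustrative, not how DeepEval implements it:

```python
from openai import OpenAI

client = OpenAI()

def qag_persuasiveness(email: str, model: str = "gpt-4o") -> float:
    """Score = proportion of sentences judged persuasive via yes/no questions."""
    sentences = [s.strip() for s in email.split(".") if s.strip()]
    if not sentences:
        return 0.0
    yes_count = 0
    for sentence in sentences:
        prompt = (
            "Answer strictly 'yes' or 'no'. Does the following sentence from a "
            f"sales email help persuade the reader to take action?\n\n{sentence}"
        )
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        if answer.strip().lower().startswith("yes"):
            yes_count += 1
    return yes_count / len(sentences)
```

Because the final score is computed from binary verdicts rather than a number the judge picks directly, QAG scores tend to be more reproducible.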
Prometheus
Lastly, Prometheus is an LLM evaluator where a LLaMA-2-Chat (7B & 13B) model is fine-tuned to accept a reference-based evaluation prompt template for rubric guided evaluation.
In our sales email example, Prometheus would involve:
- Evaluation Rubric
- Reference Answer
- Response to Evaluate
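For illustration, a Prometheus-style evaluation prompt roughly follows the structure below; the rubric wording and placeholders are illustrative paraphrases of the template described in the Prometheus paper:

```python
PROMETHEUS_STYLE_PROMPT = """###Task Description:
An instruction, a response to evaluate, a reference answer that gets a score of 5,
and a score rubric are given. Write feedback, then output a score from 1 to 5.

###The instruction to evaluate:
Draft a follow-up sales email to a prospect who went quiet after our pricing demo.

###Response to evaluate:
{response_to_evaluate}

###Reference Answer (Score 5):
{reference_sales_email}

###Score Rubrics:
[How persuasive is the email at getting the prospect to take the next step?]
Score 1: Not persuasive at all; no clear call to action.
Score 5: Highly persuasive, with a clear and compelling call to action.

###Feedback:"""
```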
However, as you’ll learn later, fine-tuning is the most complicated of the four approaches on this list, so I recommend using other methods to optimize your LLM evaluators before resorting to it.
LLM Evaluators Based on Use Case
Here’s the definition of a use case taken from DeepEval’s official documentation:
A use case refers to the specific application context — such as a medical chatbot, meeting summarizer, or travel planner agent.
Different use cases require different criteria, which means a different choice of metrics, and ultimately a different choice of LLM evaluator for each metric. Hence, the choice of LLM evaluators depends entirely on the criteria for your specific use case.
For example, for a medical chatbot you might use two metrics to evaluate its correctness and helpfulness. In this case, you would probably use G-Eval or Prometheus for both, because correctness and helpfulness are subjective criteria that don’t require the deterministic, objective evaluation DAG provides.
In fact, most of the time users actually prefer G-Eval, since it has a much lower barrier to entry than Prometheus, where you have to use a specific model from Hugging Face.
Here’s the general rule of thumb when selecting your evaluators:
- If the success criteria is purely subjective, use G-Eval
- If the success criteria is purely objective, use DAG
- If the success criteria is a mixture of both, use DAG with G-Eval as one of the leaf nodes.
If you’re wondering where QAG is, keep reading to find out.
LLM Evaluators Based on System Architecture
The QAG evaluator is slightly more dated and harder to build from scratch compared to G-Eval and DAG, which already have established interfaces in frameworks like DeepEval.
However, QAG is still great for predefined metrics such as answer relevancy, faithfulness, contextual recall, etc. which you can use directly in DeepEval as well. Here is an example of using QAG for the RAG metrics:
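For example (the test case contents below are illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Both of these predefined RAG metrics are QAG-based under the hood
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

test_case = LLMTestCase(
    input="What is your refund policy for annual plans?",
    actual_output="Annual plans can be refunded within 30 days of purchase.",
    retrieval_context=["Refunds are available within 30 days for annual subscriptions."],
)

evaluate(test_cases=[test_case], metrics=[answer_relevancy, faithfulness])
```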
QAG is also used for evaluating conversations, within DeepEval’s conversational metrics. To keep this article swift, we’ll leave more of that explanation to this article here.
LLM Evaluators for Responsible AI
LLM evaluators can also be used for responsible AI testing. Responsible AI (RAI) refers to safety criteria such as bias, fairness, inclusion, toxicity, etc., and is usually evaluated using G-Eval due to the subjectivity of these criteria.
However, the biggest difference between G-Eval for general use and for RAI use is that safety metrics built on top of the G-Eval evaluator often output a binary score instead of a continuous one. This is because safety criteria are stricter, and users usually don’t tolerate a partial score when it comes to safety.
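For instance, a bias check in DeepEval could look something like the sketch below, where the strict_mode flag is used to enforce a binary pass/fail score; the criteria wording is illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

bias_check = GEval(
    name="Bias",
    criteria=(
        "Determine whether the actual output is free of biased, unfair, "
        "or non-inclusive language towards any group."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    strict_mode=True,  # binary: 1 if the output passes the safety criteria, 0 otherwise
)
```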
Methods to Optimize LLM Evaluators
Even with the right metrics and evaluators, evaluation quality can vary widely based on how prompts are designed. A simple “rate this response from 1 to 5” often leads to vague, inconsistent scoring.
To improve evaluation reliability and alignment, prompt optimization is essential.
Using CoT Prompting
Chain-of-thought prompting encourages the evaluator model to explain its reasoning before giving a score. This can lead to more accurate, interpretable evaluations:
- Helps the LLM “think through” the evaluation criteria step by step.
- Encourages consistency and reduces random scoring.
- Especially useful when evaluating multi-step reasoning, complex answers, or abstract criteria.
Example:
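A hypothetical CoT-style evaluation prompt might look like this:

```python
COT_EVAL_PROMPT = """You are evaluating a sales email for persuasiveness.

First, reason step by step:
1. Identify the email's call to action.
2. Assess whether the email addresses the prospect's likely objections.
3. Judge whether the tone builds trust without being pushy.

Then, based only on your reasoning above, output a score from 1 (not persuasive)
to 5 (highly persuasive) in the format: Score: <number>

Email to evaluate:
{llm_output}"""
```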
⚠️ Note: CoT is less effective on OpenAI’s newer o-series models, which tend to perform better with shorter, direct prompts. This is why DeepEval’s G-Eval, for instance, drops CoT in favor of more concise prompting for these models.
In-context learning
In-context learning involves providing examples of what good and bad outputs look like, along with their evaluations, directly in the prompt. It:
- Aligns the model’s scoring with your expectations.
- Reduces ambiguity in what constitutes a “high” or “low” score.
- Helps normalize judgment across different types of input.
Use a few-shot format with:
- Clear input → output → reasoning → score examples.
- Balanced samples that highlight edge cases and typical answers.
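For instance, a hypothetical few-shot block you could prepend to your evaluation prompt:

```python
FEW_SHOT_EXAMPLES = """Example 1
Input: Draft a follow-up email after a pricing demo.
Output: "Hi Sarah, just checking in. Let me know if you have questions."
Reasoning: Polite but generic; adds no value and has a weak call to action.
Score: 2/5

Example 2
Input: Draft a follow-up email after a pricing demo.
Output: "Hi Sarah, based on your team's roadmap, the Growth plan covers the SSO
requirement you raised. Could we do a 15-minute call Thursday to confirm fit?"
Reasoning: References the prospect's context and ends with a specific, low-friction ask.
Score: 5/5

Now evaluate the following output using the same format:
Input: {input}
Output: {llm_output}
Reasoning:"""
```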
In reality, you’ll want to combine both in-context learning and G-Eval.
Fine-tuning Models
Fine-tuning is an involved process, as can be seen in the case of Prometheus; it may not be the best option for most users, but it’s nevertheless good to learn about. To fine-tune Prometheus, the authors created the Feedback Collection dataset, consisting of:
- 1,000 fine-grained rubrics
- 20k instructions
- 100k GPT-4-generated responses and feedback
Process:
- Started with 50 seed rubrics.
- Used GPT-4 to expand to 1,000 diverse rubrics.
- For each rubric, GPT-4 generated 20 instructions.
- For each instruction, GPT-4 generated 5 responses with feedback.
- Fine-tuned LLaMA-2-Chat (7B & 13B) to first generate feedback, then a score, following a Chain-of-Thought style approach.
The result is a reference-based scorer that matches GPT-4 in performance.

Using LLM Evaluators for LLM Testing
Decide on your metrics
Different LLM applications require tailored metrics based on their use case and architecture. When choosing evaluation metrics:
- Use no more than five metrics.
- Include at least one or two custom metrics (e.g., GEval, DAG).
- Avoid metrics without clear success criteria.
The last point’s important because LLM evaluators like G-Eval require well-defined criteria. Poorly defined metrics lead to unclear testing results.
Your five metrics should strike a balance between:
- 2–3 generic, system-level metrics (e.g., answer relevancy for RAG, tool correctness for agents).
- 1–2 custom, use case-specific metrics that reflect your application’s unique goals, independent of system architecture.
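For a RAG-powered sales email assistant, such a balance might look like the sketch below in DeepEval (the metric choices and thresholds are illustrative):

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# 2 generic, system-level metrics for the RAG component
generic_metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

# 1 custom, use case-specific metric for the sales email use case
persuasiveness = GEval(
    name="Persuasiveness",
    criteria="Determine how persuasive the sales email in the actual output is.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

metrics = generic_metrics + [persuasiveness]
```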
Here’s a flow chart for better visualization, and for more information and rationale on why I recommend this, click here:

Select the appropriate LLM evaluators
Now that you’ve identified your top metrics, the next step is to select evaluators best suited to accurately score them.
Start by mapping each metric to the most suitable evaluation method:
- Generic metrics (e.g. answer relevancy, tool correctness) can typically be scored using standard QAG with predefined rubrics and equations.
- Custom metrics (e.g. GEval, DAG) require evaluators that support flexible criteria and allow you to define your own scoring logic.
Remember, the generic metrics are the easy ones, since they are use case agnostic and cover a wide range of systems. It is the custom metrics for which you have to carefully choose your evaluator, based on how subjective or objective the evaluation criteria at hand are.
Incorporating LLM evaluators for unit-testing LLMs in CI/CD pipelines
The last step in using your beloved LLM evaluators to unit-test LLMs in a CI/CD pipeline is to implement them and wrap them in something like Pytest.
Fortunately, as the open-source LLM evaluation framework, DeepEval already handles everything for you. In this example, we’ll show how to use G-Eval to evaluate answer correctness in CI/CD pipelines:
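Here's a minimal sketch of what such a test file could look like; the test case contents and the my_llm_app stub are hypothetical stand-ins for your own application:

```python
# test_llm_app.py
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

def my_llm_app(query: str) -> str:
    # Hypothetical stand-in for your actual LLM application
    return "Annual plans can be refunded within 30 days of purchase."

def test_answer_correctness():
    user_input = "What is your refund policy for annual plans?"
    test_case = LLMTestCase(
        input=user_input,
        actual_output=my_llm_app(user_input),
        expected_output="Annual plans can be refunded within 30 days of purchase.",
    )
    assert_test(test_case, [correctness])
```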
Then the last step would be to simply run `deepeval test run` with your test file:
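Assuming the test file from the sketch above is named test_llm_app.py:

```bash
deepeval test run test_llm_app.py
```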
And congratulations 🎉🥳🎊🎁! You’ve successfully learnt how to use LLM evaluators to test your LLM applications so that you can stop relying on vibe checks.
(PS. DeepEval is known for unit-testing for LLMs, so click here to find out more if you’re interested!)
Conclusion
In this article, we went through the major LLM evaluators that are most commonly used and how to use them effectively depending on your use case. We covered G-Eval, DAG, QAG, and Prometheus, and saw that G-Eval is best suited for subjective evaluation, whereas DAG is better for objective, deterministic evaluation.
Ultimately, no matter how you implement your LLM evaluators, you’ll want to make sure they are accurate and reliable for your use case; otherwise you won’t be able to use them to unit-test your LLM application and save time on manual eyeballing.
Lastly, we also saw how DeepEval brings everything together by offering the entire LLM evaluator plus unit testing workflow in a few simple lines of code, which is also open-source.
Don’t forget to give ⭐ DeepEval a star on Github ⭐ if you found this article insightful, and as always, till next time.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?