
When talking to a user of DeepEval last week, here’s what I heard:
“We [a team of 7 engineers] just sit in a room for 30 minutes in silence to prompt for half an hour while entering the results into a spreadsheet before giving the thumbs up for deployment”
For many LLM engineering teams, pre-deployment checks still come down to eyeballing outputs and "vibe checks." A big reason for this is that Large Language Model (LLM) applications are unpredictable, which makes testing them a significant challenge.
While it's essential to run quantitative evaluations through unit tests to catch regressions in CI/CD pipelines before deployment, the subjective and variable nature of LLM outputs makes the principles of traditional software testing difficult to transfer.

But what if there were a way to address this unpredictability to enable unit-testing for LLMs?
This is exactly why we need to discuss LLM evaluators, which tackle this challenge by using LLMs to evaluate other LLMs. In this article, we’ll cover:
- What LLM evaluators are, why they are important, and how to choose them
- Common LLM evaluators for different use cases and systems (RAG, agents, etc.)
- How to tailor evaluators for your specific use case
- Practical code implementations for these evaluators in DeepEval (github⭐), including in CI/CD testing environments
After reading this article, you’ll know exactly how to choose, implement, and optimize LLM evaluators for your LLM testing workflows.
Let’s dive right in.
What are LLM Evaluators?
LLM evaluators are LLM-powered scorers that help quantify how well your LLM system is performing on criteria such as relevancy, answer correctness, faithfulness, and more. Unlike traditional statistical scores like recall, precision, or F1, LLM evaluators use LLM-as-a-judge, which involves feeding the inputs and outputs of your LLM system into a prompt template, and having an LLM judge score a single interaction based on your chosen evaluation criteria.
Evaluators are typically used as part of metrics that test your LLM app in the form of unit tests. Many of these unit tests together form a benchmark for your LLM application. This benchmark allows you to run regression tests by comparing each unit test side-by-side across different versions of your system.
There are two main types of LLM evaluators, first introduced in the “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” paper:
- Single-output evaluation (both referenceless and reference-based): A judge LLM is given a scoring rubric and asked to evaluate one output at a time. It considers factors like the system input, retrieved context (e.g. in RAG pipelines), and optionally a reference answer, then assigns a score based on the criteria you define. If a labelled (expected) output is provided, the evaluation is reference-based; otherwise it is referenceless.
- Pairwise comparison: The judge LLM is shown two different outputs generated from the same input and asked to choose which one is better. Like single-output evaluation, this also relies on clear criteria to define what “better” means — whether that’s accuracy, helpfulness, tone, or anything else.
Although pairwise comparison is possible, the trend we’re seeing at DeepEval is that most teams today primarily use single-output evaluation, then compare the scores between test runs to measure improvements or regressions.
Here are the most common metrics powered by LLM evaluators that you could use to capture both subjective and objective evaluation criteria:
- Correctness — Typically a reference-based metric that compares the correctness of an LLM output against the expected output (in fact, this is the most common use case for G-Eval).
- Answer Relevancy — Can be either referenceless or reference-based; it measures how relevant the LLM output is to the input.
- Faithfulness — A referenceless metric used in RAG systems to assess whether the LLM output contains hallucinations when compared to the retrieved text chunks.
- Task completion — A referenceless, agentic metric that evaluates how well the LLM completed the task based on the given input.
- Summarization — Can be either referenceless or reference-based; it evaluates how effectively the LLM summarizes the input text.
These metrics use the following LLM evaluators under the hood:
- G-Eval — A framework that uses LLMs with CoT to evaluate LLMs on any criteria of your choice.
- DAG (deep acyclic graph) — A framework that uses LLM-powered decision trees to evaluate LLMs on any criteria of your choice.
- QAG (question-answer generation) — A framework that uses LLMs to first generate a series of close-ended questions, then uses the binary yes/no answers to those questions to compute the final score.
- Prometheus — A purely model-based evaluator that relies on a fine-tuned LLaMA-2 model (Prometheus) and an evaluation prompt. Prometheus is strictly reference-based.
These evaluators can either be algorithms in the form of prompt engineering, or just the LLM itself as is the case with Prometheus.
We’ll go through each of these, figure out which ones make the most sense for your use case and system — but first, let’s take a step back and understand why we’re using LLM evaluators to test LLM applications in the first place.
Why LLM evaluators for LLM testing?
LLM evaluators are built to handle the ambiguity and subjectivity of language generation — making them far more suitable than rigid metrics for evaluating LLM systems:
- LLM outputs can vary on each run, even with the same prompt — evaluators handle that variability.
- Many tasks (like summarization, reasoning, or open-ended answers) don’t have a single “correct” output.
- Evaluators can score dimensions like coherence, relevance, tone, and helpfulness — which traditional metrics can’t.
- Scalable: evaluations can run automatically across thousands of outputs.
However, that’s not to say LLM evaluators have no downsides, as explained in more detail in this article on using LLM-as-a-judge. The main downside of LLM evaluators is that they are extremely biased. For example, there is literally a paper titled “LLM Evaluators Recognize and Favor Their Own Generations”, released back in 2024, in which the authors demonstrated that a model’s self-recognition ability is directly correlated with its degree of self-preference.

What about humans?
To state the obvious, human evaluators are accurate but impractical for modern LLM development cycles — especially if you want to move fast and test often.
- Expensive and time-consuming to scale to hundreds or thousands of outputs.
- Inconsistent — different people may rate the same response differently.
- Not CI/CD friendly — you can’t ship code based on 2-day human eval loops.
- LLM evaluators let you automate scoring and get feedback instantly.
Another thing you may not know: agreement between human evaluators themselves is often no higher than the agreement between humans and LLM judges like GPT-4, where the human-LLM agreement rate is around 81%.
Why not accuracy or BLEU?
Traditional NLP metrics like accuracy or BLEU were made for structured tasks, not creative or generative ones — and they miss what really matters in LLM outputs.
- BLEU and ROUGE rely on surface-level token overlap, ignoring meaning and fluency.
- Accuracy assumes a ground-truth answer, which doesn’t exist for most LLM tasks.
- They can penalize perfectly fine outputs just because they use different phrasing.
In fact, back in mid-2023 all of DeepEval’s metrics were non-LLM evaluators, and the results were horrible: users complained that scores wouldn’t change by a single decimal point even after deleting entire paragraphs from their LLM output. The only thing that eventually worked was LLM-as-a-judge.
Top LLM Evaluators
With the exception of using OpenAI’s o-series models to evaluate coding and math problems, a naive approach such as an off-the-shelf LLM plus a minimalistic evaluation prompt very rarely works well for LLM evaluation. In this section, we will walk through the top evaluators you should reach for when incorporating LLM evaluators into your LLM testing workflows.
With the exception of Prometheus, G-Eval, DAG, and QAG are all based on prompt engineering. In the examples below, we’ll look at a popular use case in sales — email drafting.
G-Eval
G-Eval is one of the most popular LLM evaluators out there and uses an LLM with CoT prompting to evaluate LLM outputs. As I’ve introduced numerous times in previous articles, G-Eval first generates a series of evaluation steps from a given criteria, then uses those steps to determine the final score via a form-filling paradigm.

In layman’s terms, the prompt template will contain:
- The criteria
- The evaluation steps generated from this criteria
- Any LLM test case details such as input, output, etc.
G-Eval is best for subjective evaluation, and for a sales-email drafting use case here’s how to evaluate persuasiveness:
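Below is a minimal sketch using DeepEval's GEval metric; the criteria wording and the example email are illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A custom "Persuasiveness" metric powered by the G-Eval evaluator
persuasiveness = GEval(
    name="Persuasiveness",
    criteria=(
        "Determine how persuasive the sales email in the actual output is "
        "at convincing the recipient to take the next step."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# A single interaction with your LLM app (contents are illustrative)
test_case = LLMTestCase(
    input="Draft a follow-up email to a prospect who went quiet after our pricing demo.",
    actual_output="Hi Sarah, thanks again for your time last week...",
)

persuasiveness.measure(test_case)
print(persuasiveness.score, persuasiveness.reason)
```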
G-Eval is available on DeepEval, with the top use case being answer correctness that ran over 8M times in March 2025 alone. You can read more about it here.
DAG
Deep Acyclic Graph (DAG) is a deterministic LLM evaluator made possible through decision trees modeled as directed acyclic graphs, where each node is an LLM judgement and each edge is a decision taken based on it. The leaf nodes are either hardcoded scores to be returned, or G-Eval evaluators that you can use for more fine-grained evaluation.
In this example, we will show how to evaluate persuasiveness as above, but using DAG to filter out lengthy emails that have more than 4 sentences:

This is how you would implement it in code:
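Here's a rough sketch using DeepEval's DAG interface; the class and argument names follow the DeepEval DAG docs, and the criteria wording and example email are illustrative:

```python
from deepeval.metrics import DAGMetric, GEval
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    BinaryJudgementNode,
    VerdictNode,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Leaf evaluator: only reached if the email passes the length filter
persuasiveness = GEval(
    name="Persuasiveness",
    criteria="Determine how persuasive the sales email in the actual output is.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Root node: a hard, objective filter on email length
length_filter = BinaryJudgementNode(
    criteria="Does the sales email in the actual output contain 4 sentences or fewer?",
    children=[
        VerdictNode(verdict=False, score=0),              # too long: fail immediately
        VerdictNode(verdict=True, child=persuasiveness),  # short enough: score persuasiveness
    ],
)

dag = DeepAcyclicGraph(root_nodes=[length_filter])
concise_persuasiveness = DAGMetric(name="Concise Persuasiveness", dag=dag)

test_case = LLMTestCase(
    input="Draft a follow-up email to a prospect who went quiet after our pricing demo.",
    actual_output="Hi Sarah, thanks again for your time last week...",
)
concise_persuasiveness.measure(test_case)
print(concise_persuasiveness.score, concise_persuasiveness.reason)
```

Notice how the objective length check sits at the root, so verbose emails fail deterministically before any subjective persuasiveness judgement is made.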
More details on how the DAG evaluator works and the philosophy behind it are available here.
QAG
Question-answer generation (QAG) is a framework that leverages binary answers to close-ended questions to determine the final score for an LLM test case. For example, evaluating persuasiveness using QAG instead of G-Eval might result in the score being the proportion of persuasive sentences found in the generated sales email, rather than a score against a loosely defined G-Eval rubric.
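To make the idea concrete, here's a bare-bones, framework-free sketch of QAG scoring using the OpenAI SDK; the model name, prompt wording, and naive sentence splitting are all illustrative, not how DeepEval implements it:

```python
from openai import OpenAI

client = OpenAI()

def qag_persuasiveness(email: str, model: str = "gpt-4o") -> float:
    """Score = proportion of sentences judged persuasive via yes/no questions."""
    sentences = [s.strip() for s in email.split(".") if s.strip()]
    if not sentences:
        return 0.0
    yes_count = 0
    for sentence in sentences:
        prompt = (
            "Answer strictly 'yes' or 'no'. Does the following sentence from a "
            f"sales email help persuade the reader to take action?\n\n{sentence}"
        )
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        if answer.strip().lower().startswith("yes"):
            yes_count += 1
    return yes_count / len(sentences)
```

Because the final score is computed from binary verdicts rather than a number the judge picks directly, QAG scores tend to be more reproducible.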
Prometheus
Lastly, Prometheus is an LLM evaluator where a LLaMA-2-Chat (7B & 13B) model is fine-tuned to accept a reference-based evaluation prompt template for rubric guided evaluation.
In our sales email example, Prometheus would involve:
- Evaluation Rubric
- Reference Answer
- Response to Evaluate
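For illustration, a Prometheus-style evaluation prompt roughly follows the structure below; the rubric wording and placeholders are illustrative paraphrases of the template described in the Prometheus paper:

```python
PROMETHEUS_STYLE_PROMPT = """###Task Description:
An instruction, a response to evaluate, a reference answer that gets a score of 5,
and a score rubric are given. Write feedback, then output a score from 1 to 5.

###The instruction to evaluate:
Draft a follow-up sales email to a prospect who went quiet after our pricing demo.

###Response to evaluate:
{response_to_evaluate}

###Reference Answer (Score 5):
{reference_sales_email}

###Score Rubrics:
[How persuasive is the email at getting the prospect to take the next step?]
Score 1: Not persuasive at all; no clear call to action.
Score 5: Highly persuasive, with a clear and compelling call to action.

###Feedback:"""
```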
However, as you’ll learn later, fine-tuning is the most complicated of the four approaches on this list, so I recommend using other methods to optimize your LLM evaluators before resorting to it.
LLM Evaluators Based on Use Case
Here’s the definition of a use case taken from DeepEval’s official documentation:
A use case refers to the specific application context — such as a medical chatbot, meeting summarizer, or travel planner agent.
Different use cases require different criteria, which means a different choice of metrics, and ultimately a different choice of LLM evaluator for each metric. Hence, the choice of LLM evaluators depends entirely on the criteria for your specific use case.
For example, for a medical chatbot you might use two metrics to evaluate its correctness and helpfulness. In this case, you would probably use G-Eval or Prometheus for both, because correctness and helpfulness are subjective criteria that don’t require the deterministic, objective evaluation DAG provides.
In fact, most of the time users actually prefer G-Eval, since it has a much lower barrier to entry than Prometheus, where you have to use a specific model from Hugging Face.
Here’s the general rule of thumb when selecting your evaluators:
- If the success criteria is purely subjective, use G-Eval
- If the success criteria is purely objective, use DAG
- If the success criteria is a mixture of both, use DAG with G-Eval as one of the leaf nodes.
If you’re wondering where QAG is, keep reading to find out.
LLM Evaluators Based on System Architecture
The QAG evaluator is slightly more dated and harder to build from scratch compared to G-Eval and DAG, which already have established interfaces in frameworks like DeepEval.
However, QAG is still great for predefined metrics such as answer relevancy, faithfulness, contextual recall, etc. which you can use directly in DeepEval as well. Here is an example of using QAG for the RAG metrics:
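For example (the test case contents below are illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Both of these predefined RAG metrics are QAG-based under the hood
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

test_case = LLMTestCase(
    input="What is your refund policy for annual plans?",
    actual_output="Annual plans can be refunded within 30 days of purchase.",
    retrieval_context=["Refunds are available within 30 days for annual subscriptions."],
)

evaluate(test_cases=[test_case], metrics=[answer_relevancy, faithfulness])
```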
QAG is also used for evaluating conversations, within DeepEval’s conversational metrics. To keep this article swift, we’ll leave more of that explanation to this article here.
LLM Evaluators for Responsible AI
LLM evaluators can also be used for responsible AI testing. Responsible AI (RAI) refers to safety criteria such as bias, fairness, inclusion, toxicity, etc., and is usually evaluated using G-Eval due to the subjectivity of these criteria.
However, the biggest difference between G-Eval for general use and for RAI use is that safety metrics built on top of the G-Eval evaluator often output a binary score instead of a continuous one. This is because safety criteria are stricter, and users usually don’t tolerate a partial score when it comes to safety.
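For instance, a bias check in DeepEval could look something like the sketch below, where the strict_mode flag is used to enforce a binary pass/fail score; the criteria wording is illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

bias_check = GEval(
    name="Bias",
    criteria=(
        "Determine whether the actual output is free of biased, unfair, "
        "or non-inclusive language towards any group."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    strict_mode=True,  # binary: 1 if the output passes the safety criteria, 0 otherwise
)
```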
Methods to Optimize LLM Evaluators
Even with the right metrics and evaluators, evaluation quality can vary widely based on how prompts are designed. A simple “rate this response from 1 to 5” often leads to vague, inconsistent scoring.
To improve evaluation reliability and alignment, prompt optimization is essential.
Using CoT Prompting
Chain-of-thought prompting encourages the evaluator model to explain its reasoning before giving a score. This can lead to more accurate, interpretable evaluations:
- Helps the LLM “think through” the evaluation criteria step by step.
- Encourages consistency and reduces random scoring.
- Especially useful when evaluating multi-step reasoning, complex answers, or abstract criteria.
Example:
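A hypothetical CoT-style evaluation prompt might look like this:

```python
COT_EVAL_PROMPT = """You are evaluating a sales email for persuasiveness.

First, reason step by step:
1. Identify the email's call to action.
2. Assess whether the email addresses the prospect's likely objections.
3. Judge whether the tone builds trust without being pushy.

Then, based only on your reasoning above, output a score from 1 (not persuasive)
to 5 (highly persuasive) in the format: Score: <number>

Email to evaluate:
{llm_output}"""
```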
⚠️ Note: CoT is less effective on OpenAI’s newer o-series models, which tend to perform better with shorter, direct prompts. This is why DeepEval’s G-Eval, for instance, drops CoT in favor of more concise prompting for these models.
In-context learning
In-context learning involves providing examples of what good and bad outputs look like, along with their evaluations, directly in the prompt. It:
- Aligns the model’s scoring with your expectations.
- Reduces ambiguity in what constitutes a “high” or “low” score.
- Helps normalize judgment across different types of input.
Use a few-shot format with:
- Clear input → output → reasoning → score examples.
- Balanced samples that highlight edge cases and typical answers.
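For instance, a hypothetical few-shot block you could prepend to your evaluation prompt:

```python
FEW_SHOT_EXAMPLES = """Example 1
Input: Draft a follow-up email after a pricing demo.
Output: "Hi Sarah, just checking in. Let me know if you have questions."
Reasoning: Polite but generic; adds no value and has a weak call to action.
Score: 2/5

Example 2
Input: Draft a follow-up email after a pricing demo.
Output: "Hi Sarah, based on your team's roadmap, the Growth plan covers the SSO
requirement you raised. Could we do a 15-minute call Thursday to confirm fit?"
Reasoning: References the prospect's context and ends with a specific, low-friction ask.
Score: 5/5

Now evaluate the following output using the same format:
Input: {input}
Output: {llm_output}
Reasoning:"""
```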
In reality, you’ll want to combine both in-context learning and G-Eval.
Fine-tuning Models
Fine-tuning is an involved process, as can be seen in the case of Prometheus; it may not be the best option for most users, but it’s nevertheless good to learn about. To fine-tune Prometheus, the authors created the Feedback Collection dataset, consisting of:
- 1,000 fine-grained rubrics
- 20k instructions
- 100k GPT-4-generated responses and feedback
Process:
- Started with 50 seed rubrics.
- Used GPT-4 to expand to 1,000 diverse rubrics.
- For each rubric, GPT-4 generated 20 instructions.
- For each instruction, GPT-4 generated 5 responses with feedback.
- Fine-tuned LLaMA-2-Chat (7B & 13B) to first generate feedback, then a score, following a Chain-of-Thought style approach.
The result is a reference-based scorer that matches GPT-4 in performance.

Using LLM Evaluators for LLM Testing
Decide on your metrics
Different LLM applications require tailored metrics based on their use case and architecture. When choosing evaluation metrics:
- Use no more than five metrics.
- Include at least one or two custom metrics (e.g., GEval, DAG).
- Avoid metrics without clear success criteria.
The last point’s important because LLM evaluators like G-Eval require well-defined criteria. Poorly defined metrics lead to unclear testing results.
Your five metrics should strike a balance between:
- 2–3 generic, system-level metrics (e.g., answer relevancy for RAG, tool correctness for agents).
- 1–2 custom, use case-specific metrics that reflect your application’s unique goals, independent of system architecture.
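For a RAG-powered sales email assistant, such a balance might look like the sketch below in DeepEval (the metric choices and thresholds are illustrative):

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval
from deepeval.test_case import LLMTestCaseParams

# 2 generic, system-level metrics for the RAG component
generic_metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

# 1 custom, use case-specific metric for the sales email use case
persuasiveness = GEval(
    name="Persuasiveness",
    criteria="Determine how persuasive the sales email in the actual output is.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

metrics = generic_metrics + [persuasiveness]
```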
Here’s a flow chart for better visualization, and for more information and rationale on why I recommend this, click here:

Select the appropriate LLM evaluators
Now that you’ve identified your top metrics, the next step is to select evaluators best suited to accurately score them.
Start by mapping each metric to the most suitable evaluation method:
- Generic metrics (e.g. answer relevancy, tool correctness) can typically be scored using standard QAG with predefined rubrics and equations.
- Custom metrics (e.g. GEval, DAG) require evaluators that support flexible criteria and allow you to define your own scoring logic.
Remember, the generic metrics are the easy ones, since they are use case agnostic and cover a wide range of systems. It is the custom metrics for which you have to carefully choose your evaluator, based on how subjective or objective the evaluation criteria at hand are.
Incorporating LLM evaluators for unit-testing LLMs in CI/CD pipelines
The last step in using your beloved LLM evaluators to unit-test LLMs in a CI/CD pipeline is to implement them and wrap them in something like Pytest.
Fortunately, as the open-source LLM evaluation framework, DeepEval already handles everything for you. In this example, we’ll show how to use G-Eval to evaluate answer correctness in CI/CD pipelines:
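Here's a minimal sketch of what such a test file could look like; the test case contents and the my_llm_app stub are hypothetical stand-ins for your own application:

```python
# test_llm_app.py
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

def my_llm_app(query: str) -> str:
    # Hypothetical stand-in for your actual LLM application
    return "Annual plans can be refunded within 30 days of purchase."

def test_answer_correctness():
    user_input = "What is your refund policy for annual plans?"
    test_case = LLMTestCase(
        input=user_input,
        actual_output=my_llm_app(user_input),
        expected_output="Annual plans can be refunded within 30 days of purchase.",
    )
    assert_test(test_case, [correctness])
```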
Then the last step would be to simply run `deepeval test run` with your test file:
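Assuming the test file from the sketch above is named test_llm_app.py:

```bash
deepeval test run test_llm_app.py
```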
And congratulations 🎉🥳🎊🎁! You’ve successfully learnt how to use LLM evaluators to test your LLM applications so that you can stop relying on vibe checks.
(PS. DeepEval is known for unit-testing for LLMs, so click here to find out more if you’re interested!)
Conclusion
In this article, we went through the major LLM evaluators that are most commonly used and how to use them effectively depending on your use case. We covered G-Eval, DAG, QAG, and Prometheus, and saw that G-Eval is best suited for subjective evaluation, whereas DAG is better for objective, deterministic evaluation.
Ultimately, no matter how you implement your LLM evaluators, you’ll want to make sure they are accurate and reliable for your use case; otherwise you won’t be able to use them to unit-test your LLM application and save time on manual eyeballing.
Lastly, we also saw how DeepEval brings everything together by offering the entire LLM evaluator plus unit testing workflow in a few simple lines of code, which is also open-source.
Don’t forget to give ⭐ DeepEval a star on Github ⭐ if you found this article insightful, and as always, till next time.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?