LLM-as-a-Judge Metrics

Understanding everything you need to know about LLM-as-a-Judge

Overview

LLM-as-a-Judge refers to using large language models (LLMs) to evaluate the outputs of other LLM systems. This approach enables scalable, cost-effective, and human-like assessment. It is:

  • More effective than traditional metrics such as BLEU or ROUGE
  • Faster than manual human evaluation
  • More reliable and consistent than human annotators

This technique works by crafting a rubric or evaluation prompt, feeding it alongside the input and output to a secondary LLM (“judge”), and having it return a quality score or decision.

Almost all metrics in deepeval are LLM-as-a-Judge, which means all metrics you’ll use on Confident AI are also LLM-as-a-Judge.

In fact, all custom metrics you create on Confident AI are powered by deepeval’s G-Eval metric, which you’ll learn more about later.

What is LLM-as-a-Judge?

LLM-as-a-Judge uses a dedicated LLM to grade or assess generated LLM outputs. You define a scoring criterion via an evaluation prompt, then the judge examines the input and output to assign a score or label based on that rubric.

Evaluation Prompt:

You are an expert judge. Your task is to rate how relevant the following response is based on the provided input. Rate on a scale from 1 to 5, where:
1 = Completely irrelevant
2 = Mostly irrelevant
3 = Somewhat relevant but with noticeable issues
4 = Mostly relevant with minor issues
5 = Fully relevant and accurate
Input:
{input}
LLM Response:
{output}
Please return only the numeric score (1 to 5) and no explanation.
Score:

You’ll notice that the parameters {input} and {output} coincidentally look just like the test case parameters from the previous section.

This technique, when done correctly, has been shown to exhibit an alignment rate with human preferences that exceeds the level of agreement between humans themselves (81%), as shown in the “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” paper, which was also the first paper to introduce LLM-as-a-Judge.
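To make the mechanics concrete, here’s a minimal sketch of how such a judge could be called, assuming an OpenAI-compatible client; the model name, prompt, and score parsing are illustrative, and on Confident AI this is all handled for you via deepeval:

main.py
# Minimal judge call: render the evaluation prompt with the test case's
# input/output, send it to a judge LLM, and parse the numeric score.
# The model name and parsing below are illustrative assumptions.
from openai import OpenAI

EVAL_PROMPT = """You are an expert judge. Rate how relevant the following response is
based on the provided input, on a scale from 1 (completely irrelevant) to 5 (fully relevant).
Input:
{input}
LLM Response:
{output}
Please return only the numeric score (1 to 5) and no explanation.
Score:"""

client = OpenAI()

def judge_relevancy(user_input: str, llm_output: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4.1",  # any capable judge model
        messages=[{"role": "user", "content": EVAL_PROMPT.format(input=user_input, output=llm_output)}],
    )
    return int(response.choices[0].message.content.strip())

print(judge_relevancy("What's your refund policy?", "We offer refunds within 30 days."))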

Two Types of Judges

In the section above, we actually saw a system prompt for evaluating single-turn LLM interactions. However, LLM-as-a-judge has two main types:

Single-Output
  • Evaluates LLM output based on a single interaction
  • Outputs numerical scores (e.g., 1-5 scale) for quantitative analysis
  • Can be referenceless (no expected output) or reference-based
  • Perfect for regression testing and production online evaluations

Suitable for: Most evaluation scenarios, especially when you need quantitative scores

Pairwise Comparison
  • Compares two responses to determine which is better
  • Outputs qualitative decisions (A, B, or Tie) rather than scores
  • Requires multiple LLM versions to run simultaneously
  • Less common due to complexity and lack of quantitative output

Suitable for: A/B testing scenarios where direct comparison is needed

The example evaluation prompt we saw earlier is a single-output, referenceless, single-turn LLM-as-a-judge.

Single-output

Single-output LLM-as-a-judge refers to evaluating an LLM output based solely on the single interaction at hand, which is represented in your evaluation prompt template. These judges can be either referenceless or reference-based.

Oftentimes when you run regression tests on LLM apps, you would run two rounds of evaluations using a single-output LLM-as-a-judge, then compare the two sets of scores to work out whether there are any regressions (see the sketch below).
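As a rough sketch of what that comparison looks like in practice (the scores and test IDs below are purely illustrative):

main.py
# Compare per-test-case scores between two versions of your LLM app,
# both graded by the same single-output judge. Values are illustrative.
scores_v1 = {"test_1": 0.80, "test_2": 0.92, "test_3": 0.67}
scores_v2 = {"test_1": 0.85, "test_2": 0.74, "test_3": 0.71}

for test_id, old_score in scores_v1.items():
    new_score = scores_v2[test_id]
    if new_score < old_score:
        print(f"Possible regression on {test_id}: {old_score:.2f} -> {new_score:.2f}")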

Referenceless

Referenceless single-output judging simply means there is no labelled, expected output or outcome for your LLM judge to anchor to as the ideal output. This is perfect for teams that:

  • Don’t have access to expected output/outcomes, such as in production environments where you wish to run online evals
  • Have trouble curating expected outputs/outcomes

Reference-based

Reference-based single-output judging gives better reliability, and also helps teams anchor towards what an ideal output or outcome should look like. Oftentimes the only addition to the evaluation prompt is an extra expected output variable:

You are an expert judge. Your task is to rate how relevant the following response is based on the provided input. Rate on a scale from 1 to 5, where:
1 = Completely irrelevant
2 = Mostly irrelevant
3 = Somewhat relevant but with noticeable issues
4 = Mostly relevant with minor issues
5 = Fully relevant and accurate
Input:
{input}
Expected Output:
{expected_output}
LLM Response:
{output}
Please return only the numeric score (1 to 5) and no explanation.
Score:

One major drawback of reference-based LLM-as-a-judge is that it cannot be used in production for online evals, since expected outputs aren’t available for live, incoming queries.

Pairwise comparison

Unlike single-output, pairwise LLM-as-a-judge is much less common because it:

  • Doesn’t output a score, meaning it is less quantitative for score analysis
  • Requires multiple versions of your LLM to run at once, which can be challenging

Essentially, instead of outputting a score, pairwise comparison aims to pick the better output or outcome based on the custom rubric at hand. The prompt template looks something more like this:

You are an expert judge. Your task is to compare two responses to the same input and decide which one is better based on relevance and accuracy.
Guidelines:
Choose Response A if it is clearly better.
Choose Response B if it is clearly better.
If both are equally good (or equally poor), choose Tie.
Input:
{input}
Expected Output (reference, if helpful):
{expected_output}
Response A:
{output_a}
Response B:
{output_b}
Please return only one of the following:
- A
- B
- Tie
Decision:

In Confident AI, out of the 40+ LLM evaluation metrics, only the Arena G-Eval metric uses pairwise comparison. However, internal benchmarking of deepeval’s Arena G-Eval metric shows nearly identical performance to referenceless single-output LLM-as-a-judge:

Arena G-Eval vs Single-Output

Single vs Multi-turn

Scoring single-turn LLM apps is straightforward, as we saw in the previous section. For single-turn evals, simply provide the test case parameters as dynamic variables in your evaluation prompt, and you get a score out.

Single-Turn LLM-as-a-Judge

However for multi-turn evals, you’ll need a prompt that:

  • Takes into account entire conversations
  • Calculates a score based on portions of a conversation
  • Considers any tool calling and retrieval context within turns

In fact, oftentimes a conversation can get lengthy, and the best way to evaluate it is to partition it into several lists of turns instead (see the sketch below):

Multi-Turn LLM-as-a-Judge
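Here’s a rough sketch of what this partitioning could look like; the window size and formatting are assumptions for illustration rather than deepeval’s actual implementation:

main.py
# Partition a long conversation into consecutive windows of turns, then
# format each window so it can be dropped into the multi-turn evaluation
# prompt. Window size and formatting are illustrative assumptions.
conversation = [
    {"role": "user", "content": "I'd like to book a flight."},
    {"role": "assistant", "content": "Sure, where are you flying to?"},
    {"role": "user", "content": "Tokyo, next Friday."},
    {"role": "assistant", "content": "Got it. Economy or business class?"},
]

def partition_turns(turns, window_size=2):
    # split the conversation into lists of `window_size` turns
    return [turns[i:i + window_size] for i in range(0, len(turns), window_size)]

for window in partition_turns(conversation):
    formatted = "\n".join(f"{turn['role']}: {turn['content']}" for turn in window)
    print(formatted)  # each window can then be judged by the multi-turn evaluation prompt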

Despite how different single and multi-turn LLM-as-a-judge may look, they both actually fall under the single-output LLM-as-a-judge category. Pairwise comparison is generally not done for multi-turn evals, since it would overload the LLM judge with too much context and hence doesn’t work as well as single-turn pairwise comparisons.

Confident AI Has You Covered

Confident AI already takes care of all LLM-as-a-judge implementation via deepeval, so don’t worry if this all looks too complicated to implement.

Techniques and Algorithms for LLM Judge Scoring

LLM-as-a-judge, at least for the implementations shown in the sections above, can suffer from several problems:

  • Reliability – Scores may vary across runs due to randomness or prompt sensitivity.
  • Bias – Judges can show position bias (favoring the first or last response), or favor outputs generated by the same model family as the judge itself.
  • Verbosity preference – Judges often reward longer, more detailed answers even when they are less accurate or less useful.
  • Accuracy – Judges may misinterpret the rubric, overlook factual mistakes, or hallucinate justifications for a score.

These limitations mean we need better techniques and algorithms, as implemented in Confident AI.

G-Eval

G-Eval is a SOTA, research-backed framework that uses single-output LLM-as-a-judge to evaluate LLM outputs on any custom criteria expressed in everyday language. Its evaluation algorithm is as follows:

  • Generate a series of CoTs (chains of thought) based on the initial criteria
  • Use these CoTs as evaluation steps in your evaluation prompt
  • Dynamically include test case arguments in the evaluation prompt as well

G-Eval was first introduced in the paper “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment”:

G-Eval Algorithm

G-Eval makes great LLM evaluation metrics for subjective criteria because it is accurate, easily tunable, and surprisingly consistent across runs. Here’s how you would use it in deepeval for running local evals:

main.py
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)
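
If you’d rather pin down the CoTs yourself instead of having G-Eval generate them from the criteria, GEval also accepts explicit evaluation_steps. Here’s a sketch (the steps and test case below are illustrative):

main.py
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    # explicit evaluation steps, used in place of auto-generated CoTs
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
        "Heavily penalize omission of important details.",
        "Vague language is acceptable as long as it is factually consistent.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

test_case = LLMTestCase(
    input="What is the boiling point of water at sea level?",
    actual_output="Water boils at 100°C at sea level.",
    expected_output="At sea level, water boils at 100 degrees Celsius.",
)
correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)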

In fact, all custom metrics you create on the platform are also powered by G-Eval. Confident AI now supports both single and multi-turn G-Eval:

main.py
from deepeval.test_case import ConversationalTestCase
from deepeval.metrics import ConversationalGEval

metric = ConversationalGEval(
    name="Professionalism",
    criteria="Determine whether the assistant has acted professionally based on the content."
)

More information on G-Eval can be found here.

G-Eval was designed as a stronger alternative to traditional reference-based metrics such as BLEU and ROUGE, which often fall short on subjective or open-ended tasks that demand creativity, nuance, and semantic understanding.

DAG

Deep Acyclic Graph (DAG) is a decision-tree-based, deterministic, single-output LLM-as-a-judge metric. Each node in the DAG either contains logic the LLM judge has to work through or represents a verdict; in the end, the leaf nodes return the score and reason.

Decision-Based LLM-as-a-Judge

The DAG metric is currently not yet available on the platform, but you can run it through deepeval for local evals:

main.py
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics.dag import (
    DeepAcyclicGraph,
    TaskNode,
    BinaryJudgementNode,
    NonBinaryJudgementNode,
    VerdictNode,
)
from deepeval.metrics import DAGMetric, GEval

geval_metric = GEval(
    name="Persuasiveness",
    criteria="Determine how persuasive the `actual output` is to getting a user booking in a call.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)

conciseness_node = BinaryJudgementNode(
    criteria="Does the actual output contain less than or equal to 4 sentences?",
    children=[
        VerdictNode(verdict=False, score=0),
        VerdictNode(verdict=True, child=geval_metric),
    ],
)

# create the DAG
dag = DeepAcyclicGraph(root_nodes=[conciseness_node])
metric = DAGMetric(name="Concise Persuasiveness", dag=dag)

Notice that you can include G-Eval as a leaf node as well. This allows you to apply LLM-as-a-judge for more decision-based filtering, while still allowing subjective scoring at the end.

The DAG metric is a better choice than G-Eval when you have hard, objective criteria to work through.

QAG

Question-answer-generation (QAG) is a single-output LLM-as-a-judge technique that computes LLM metric scores according to a mathematical formula. Instead of asking an LLM to come up with a score based on some criteria like G-Eval, QAG works by:

  • Breaking test case arguments down into more fine-grained “units”
  • Applying LLM-as-a-judge to each fine-grained “unit”
  • Aggregating the verdicts of each LLM judge to compute a score and reason

Here’s a tangible example with the answer relevancy metric:

  • Break the actual output down into “statements”, which are defined as coherent groups of text (e.g., sentences, paragraphs, etc.)
  • For each statement, determine whether it is relevant to the input
  • The final score is the proportion of relevant statements found in the actual output

On Confident AI, this is all handled by deepeval:

main.py
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(input="...", actual_output="...")
metric = AnswerRelevancyMetric()
metric.measure(test_case)
print(metric.score, metric.reason)  # QAG score and reason here

The reason it is called QAG is that the technique leverages closed-ended questions to confine LLM outputs to something that can be aggregated. In this example, instead of asking the LLM judge to do everything in a one-shot fashion, the algorithm only allows the LLM judge to output "yes" or "no" verdicts on whether each statement is relevant. This makes it possible to confine the score to a mathematical formula (see the sketch below).
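Concretely, the aggregation for answer relevancy boils down to a formula like the one below; the statements and verdicts are illustrative stand-ins for what the LLM judge would produce:

main.py
# QAG-style scoring: the judge is confined to "yes"/"no" verdicts per
# statement, and the final score comes from a simple formula.
# The statements and verdicts below are illustrative stand-ins.
statements = [
    "Paris is the capital of France.",  # relevant to the input
    "I also really enjoy croissants.",  # not relevant to the input
]
verdicts = ["yes", "no"]  # one closed-ended verdict per statement

relevant = sum(1 for verdict in verdicts if verdict == "yes")
score = relevant / len(statements)
print(score)  # 0.5, i.e. the proportion of relevant statements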

All of the RAG metrics are QAG-based metrics.

LLM arena

LLM arena traditionally refers to an Elo-based voting system for selecting the best performing LLM; in this case, we apply pairwise LLM-as-a-judge to automate the voting process.

In Confident AI, this is done using the ArenaGEval metric, which currently only supports single-turn test cases:

main.py
from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)
metric = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

The ArenaGEval metric is currently only available for local evals in development (same as DAG).

Using LLM Judges for Metrics

If LLM judges are the core evaluation engine, metrics are the scaffolding around it. A metric determines the:

  • Evaluation criteria/rubric
  • Evaluation algorithm
  • Passing threshold
  • Which test case parameters should be used

A test case passes only if all metrics have passed.
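For example, in deepeval a metric bundles its rubric, algorithm, passing threshold, and required test case parameters together, and assert_test fails unless every metric passes. A sketch (the threshold values and criteria are illustrative):

main.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, GEval

test_case = LLMTestCase(input="...", actual_output="...")

relevancy = AnswerRelevancyMetric(threshold=0.7)  # passing threshold
tone = GEval(
    name="Tone",
    criteria="Determine whether the actual output is polite and professional.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)

# the test case passes only if BOTH metrics meet their thresholds
assert_test(test_case, [relevancy, tone])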

In Confident AI, we generally won’t refer to LLM-as-a-judge directly going forward, because metrics encapsulate more information about how evals should be run.

Application-Based Metrics

Every LLM use case that you’re building should have 1-3 application-based metrics. These metrics are based entirely on the way your LLM app is built and are use case agnostic.

RAG

In a RAG context, there are 5 single-turn metrics that evaluate the retriever and generator as separate components (see the sketch after this list):

  • Answer Relevancy: Measures how relevant the LLM’s response is to the user’s query
  • Faithfulness: Evaluates whether the LLM’s response is supported by the retrieved context
  • Contextual Relevancy: Assesses how relevant the retrieved context is to the query
  • Contextual Recall: Measures the recall of the retrieved context against the expected output
  • Contextual Precision: Evaluates whether relevant chunks in the retrieved context are ranked higher than irrelevant ones
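
Here’s a sketch of how these five metrics map onto a single RAG test case in deepeval (the inputs and retrieval context are illustrative):

main.py
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)

test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return items within 30 days for a full refund.",
    expected_output="Items can be returned within 30 days of purchase.",  # needed for recall/precision
    retrieval_context=["Our return policy allows returns within 30 days of purchase."],
)

# generator metrics evaluate the response, retriever metrics evaluate the retrieved context
metrics = [
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
    ContextualRelevancyMetric(),
    ContextualRecallMetric(),
    ContextualPrecisionMetric(),
]
for metric in metrics:
    metric.measure(test_case)
    print(type(metric).__name__, metric.score)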

Agents

For agents, there are 3 main single-turn metrics centered around task completion and tool calling:

  • Task Completion: Evaluates whether the agent successfully completed the assigned task
  • Tool Correctness: Measures whether the agent used the correct tools for a given task
  • Argument Correctness: Measures whether the agent passed in the correct arguments for a given tool call

The task completion metric is an extremely unique one that evaluates not on test cases, but on entire traces.
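
As a sketch, tool correctness compares the tools your agent actually called against the tools you expected it to call; this assumes deepeval’s ToolCall and ToolCorrectnessMetric API, and the example values are illustrative:

main.py
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

test_case = LLMTestCase(
    input="Book me a table for two at 7pm tonight.",
    actual_output="Your table for two is booked for 7pm tonight.",
    # tools the agent actually called vs. tools it was expected to call
    tools_called=[ToolCall(name="search_restaurants"), ToolCall(name="book_table")],
    expected_tools=[ToolCall(name="book_table")],
)

metric = ToolCorrectnessMetric()
metric.measure(test_case)
print(metric.score, metric.reason)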

Chatbots

For chatbots, these will be multi-turn metrics:

  • Turn Relevancy: Evaluates how relevant each response is to the ongoing conversation
  • Turn Faithfulness: Evaluates whether each response is faithful to the retrieval context of its turn
  • Conversation Completeness: Measures whether the conversation addresses all aspects of the user’s request
  • Role Adherence: Evaluates how well the LLM adheres to its assigned role
  • Knowledge Retention: Measures how well the LLM retains information across conversation turns

You’ll notice Confident AI’s multi-turn metrics also take RAG into account.

Use Case-Specific Metrics

Use case-specific metrics, contrary to the previous section, are application agnostic, and we recommend having 1-2 custom metrics in your evaluation suite.

You’ll need to create custom metrics for use case-specific evaluation:

  • G-Eval: A general-purpose evaluation metric for LLM outputs
  • DAG: A decision-tree based LLM-evaluated metric
  • Conversational G-Eval: A multi-turn general-purpose evaluation metric for LLM conversations

Currently, only G-Eval is supported on the Confident AI platform. However, you can still leverage DAG in development by creating it locally.

Custom metrics are typically different based on your use case, while application-specific metrics (like RAG or agent metrics) remain consistent across similar LLM applications. For example, two conversational agents - one for medical advice and another for legal consultation - would use the same agent metrics like tool correctness but have different GEval metrics tailored to their respective industry-specific success criteria.

Create Custom Metrics

You can create metrics either locally or remotely on the platform.

Local Evals
  • Run evaluations locally using deepeval with full control over metrics
  • Support for custom metrics, DAG, and advanced evaluation algorithms

Suitable for: Python users, development, and pre-deployment workflows

Remote Evals
  • Run evaluations on Confident AI platform with pre-built metrics
  • Integrated with monitoring, datasets, and team collaboration features

Suitable for: Non-python users, online + offline evals for tracing in prod

You can learn everything about creating custom metrics here.

Next Steps

You now have everything you need to know to start running evaluations. Choose whichever option best suits you to begin: