Jeffrey Ip
Cofounder @ Confident AI, creator of DeepEval & DeepTeam. Working overtime to enforce responsible AI, with an unhealthy LLM evals addiction. Ex-Googler (YouTube), Microsoft AI (Office365).

LLM Arena-as-a-Judge: LLM-Evals for Comparison-Based Regression Testing

July 7, 2025 · 10 min read

Let’s imagine you’re building an LLM evaluation framework called DeepEval, and you talk to 20+ users a week. Out of those 20 users, over 15 of them ask this very question:

Which metrics should I use if I need to compare the [fill in the blank here] of different prompts/models?

Clearly, most people are still confused about how each metric works and which use cases it is for, and that’s not good. To make testing your prompts and models more accessible, we ought to use something more intuitive and simpler to understand.

So in this article, I’m introducing LLM Arena-as-a-Judge: a novel way to run automated, scalable, comparison-based LLM-as-a-judge that just tells you which iteration of your LLM app worked best.

With LLM Arena-as-a-Judge, you don’t pick a metric. You pick the better output. That’s it.

TL;DR

In this article, you’ll learn that:

  • LLM Arena is an Elo rating system built on human feedback, and the human voters can be replaced with LLM judges.
  • LLM “Arena”-as-a-judge can be extended beyond foundation models to A|B testing your own LLM apps.
  • DeepEval (100% open-source) makes LLM Arena-as-a-Judge a lot easier to use and simple to set up, in just 10 lines of code.
  • LLM Arena-as-a-judge is vulnerable to common biases, which can be mitigated by randomly swapping positions and borrowing the existing G-Eval algorithm.
  • LLM Arena-as-a-Judge is not a replacement for existing LLM-as-a-judge. Arena is easier to set up and achieves a similar agreement rate with humans, but it is less flexible than regular LLM-as-a-judge across use cases.

And there’s so much more below. Ready? Let’s begin.

What is LLM Arena?

First things first, let’s talk about what LLM Arena is. LLM Arena started as a community-driven benchmark designed to compare the outputs of LLMs in a pairwise format. Inspired by the need for human-like judgment at scale, Arena lets users vote on which model output is “better,” creating a leaderboard of LLMs based on the “Elo” rating system. When one model consistently beats another, its Elo score rises.
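
To make the Elo mechanic concrete, here is a minimal sketch of an Elo-style update after a single pairwise vote. It is only an illustration: the K-factor of 32 and the starting rating of 1,000 are arbitrary choices for this sketch, not the values or exact rating method used by the actual Chatbot Arena leaderboard.


def expected_score(rating_a: float, rating_b: float) -> float:
    # Probability that contestant A beats contestant B under the Elo model.
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    # Return updated (rating_a, rating_b) after a single vote.
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

ratings = {"model-a": 1000.0, "model-b": 1000.0}
# One vote where model-a's output was preferred: its rating rises, model-b's falls.
ratings["model-a"], ratings["model-b"] = update_elo(
    ratings["model-a"], ratings["model-b"], a_won=True
)
print(ratings)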

Over time, this creates a dynamic, crowd-sourced leaderboard that reflects community preferences across models like GPT-4, Claude, Mistral, and more.

Elo Leaderboard

At its core, LLM Arena is both a research tool and a public evaluation platform. Today, LLM Arena is primarily used for:

  • Benchmarking foundation models (e.g. GPT-4 vs Claude vs Gemini)
  • Tracking how open-source models stack up against the big proprietary players
  • Building leaderboards based on community preferences
  • Running studies on model alignment and helpfulness

It’s even cited in academic papers and model release blogs (see: Chatbot Arena leaderboard).

But here’s the catch: LLM Arena is a public benchmark. It’s not built for your internal development workflow.

This means:

  • You can’t plug in your own app, prompt, or model.
  • You can’t use it to test iterations of your LLM app.
  • You can’t integrate it into your LLM evaluation pipeline or CI process.
  • And there’s no real way to do large-scale comparison testing across dozens or hundreds of your own outputs.

In other words, LLM Arena is great for watching the race. But it’s not the right tool if you’re trying to run your own. That’s exactly where LLM Arena-as-a-Judge fits in, and we’re proud to be the first to open-source it at DeepEval.

But first, let’s go through why you’d want to use LLM Arena for LLM evals.

Why Replicate LLM Arena As An LLM Eval?

As discussed in one of my previous articles, there already exist many LLM evaluation metrics for agents, multi-turn chatbots, and RAG applications. Common ones include task completion, answer relevancy, and contextual recall, and they all plug into the same single-output workflow.

A standard LLM evaluation workflow involves looping through a dataset of inputs, running your LLM app to generate outputs, and forming test cases. You then apply a set of 3–5 evaluation metrics to each test case for testing your LLMs, typically right before deployment. (For example, evaluating 10 test cases with 3 metrics yields 30 individual scores.)
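
As a rough sketch, the single-output workflow looks something like the snippet below. It assumes DeepEval’s evaluate entrypoint and the AnswerRelevancyMetric, and run_my_llm_app is a hypothetical stand-in for your own application; swap in whichever metrics you actually use.


from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def run_my_llm_app(query: str) -> str:
    # Hypothetical stand-in for your own LLM app.
    return "..."

# Loop through a dataset of inputs, generate outputs, and form test cases.
inputs = ["What is the capital of France?", "Summarize our refund policy."]
test_cases = [
    LLMTestCase(input=query, actual_output=run_my_llm_app(query))
    for query in inputs
]

# Each metric scores every test case in isolation:
# 10 test cases x 3 metrics would yield 30 individual scores.
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])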

However, all of these metrics follow a single-output evaluation paradigm. This is a problem because if you’re trying to compare two different prompts (e.g., for regression testing), you must run them independently and then compare their absolute scores, even when those scores were generated in isolation and have no awareness of each other.

Side-by-side test case comparison for regression testing on Confident AI

Single-output LLM-as-a-judge has a key limitation: you need to define multiple separate metrics — correctness, relevance, coherence, etc. — to cover different goals. This adds complexity, and since outputs are judged in isolation, direct comparisons between them are difficult.

But this changes with pairwise LLM judge comparisons, or as you may now know it, LLM Arena-as-a-Judge. In fact, the first paper that introduced LLM-as-a-judge, titled “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”, also discussed pairwise comparisons as a type of LLM judge.

So now, let’s see how you can implement LLM Arena-as-a-Judge in your evaluation pipeline.

Using LLM Arena-as-a-Judge in DeepEval

Normally, you would have to hack together an evaluation pipeline that takes care of running evals asynchronously, handling errors gracefully, writing your own evaluation prompts, implementing CoT to stabilize evaluation results, and so on.

Although you can definitely try building everything from scratch yourself, fortunately we’ve already built and open-sourced everything for you in DeepEval ⭐, the open-source LLM evaluation framework.

First install DeepEval:


pip install deepeval

Then create an ArenaTestCase, with a list of “contestants” and their respective LLM interactions:


from deepeval.test_case import ArenaTestCase, LLMTestCase
from deepeval.metrics import ArenaGEval

a_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
        ),
    },
)

There are 4 types of test cases in DeepEval, each built for a different purpose. You can learn more in DeepEval’s docs here.

Finally, define your criteria for comparison using the Arena G-Eval metric, which incorporates the G-Eval algorithm for a comparison use case:


from deepeval.test_case import LLMTestCaseParams
from deepeval.metrics import ArenaGEval
...

arena_geval = ArenaGEval(
    name="Friendly",
    criteria="Choose the winner of the more friendly contestant based on the input and actual output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

arena_geval.measure(a_test_case)
print(arena_geval.winner, arena_geval.reason)

Note that LLM Arena-as-a-Judge can be either referenceless or reference-based. It all depends on whether you have an expected output for each given input.
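
For example, a reference-based setup might look like the sketch below. It assumes the same ArenaTestCase and ArenaGEval API shown above, together with LLMTestCase’s optional expected_output field and LLMTestCaseParams.EXPECTED_OUTPUT as an evaluation parameter:


from deepeval.test_case import ArenaTestCase, LLMTestCase, LLMTestCaseParams
from deepeval.metrics import ArenaGEval

# Each contestant's LLMTestCase now carries an expected output (the reference).
reference_based_test_case = ArenaTestCase(
    contestants={
        "GPT-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris",
            expected_output="Paris is the capital of France.",
        ),
        "Claude-4": LLMTestCase(
            input="What is the capital of France?",
            actual_output="Paris is the capital of France.",
            expected_output="Paris is the capital of France.",
        ),
    },
)

correctness_arena = ArenaGEval(
    name="Correctness",
    criteria="Choose the contestant whose actual output best matches the expected output",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)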

When you’re ready to scale up your arena, simply pass in a list of ArenaTestCases to compare everything:


from deepeval import compare
...

compare(arena=[a_test_case_1, a_test_case_2, ...], metric=arena_geval)

And that’s it! You’ll be able to see which contestant has won by popularity vote and choose the winner to ship to production.

The beauty of this is it leverages Arena G-Eval, a pairwise LLM-as-a-judge that uses the existing G-Eval algorithm for choosing the winning contestant for a given set of LLM interactions.

It allows anyone to define what it means to be “better” in everyday language, without ever having to understand how the metrics work. But for those who are curious, let’s dive deeper into what we’ve done to mitigate bias in our pairwise LLM-as-a-judge implementation.

Mitigating Bias in Arena

Mitigating bias involves designing a fair experiment, and for a comparison-based LLM-as-a-judge, this couldn’t be more important.

Blinded Trial

One of the most important parts of a fair experiment is not letting the scientist design it around their desired outcome. Similarly, in DeepEval, we hide each contestant’s name to prevent bias. For example, you wouldn’t want gpt-4.1 picking the contestant named “gpt-4.1” every time.

Randomize Positioning

Even when names are hidden, people tend to favor whatever’s on the left — or the right — depending on the day. To avoid this, DeepEval randomizes the position of each model’s output for every test. That way, you’re judging the content, not the layout.
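
A minimal illustration of the idea (not DeepEval’s internal implementation): hide the real names behind neutral labels, shuffle the order in which outputs are shown to the judge, then map the judge’s pick back to the real contestant.


import random

contestants = {
    "GPT-4": "Paris",
    "Claude-4": "Paris is the capital of France.",
}

# Blind the contestants and randomize their order for every comparison.
shuffled = list(contestants.items())
random.shuffle(shuffled)
blinded = {f"Contestant {chr(65 + i)}": output for i, (_, output) in enumerate(shuffled)}
label_to_name = {f"Contestant {chr(65 + i)}": name for i, (name, _) in enumerate(shuffled)}

# The judge only ever sees neutral labels in a random order.
judge_prompt = "\n".join(f"{label}: {output}" for label, output in blinded.items())

winner_label = "Contestant A"  # placeholder for whichever label the LLM judge returns
print(label_to_name[winner_label])  # map the pick back to the real contestant name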

G-Eval Algorithm

G-Eval is a metric I’ve written about numerous times in the past. It is a two-step algorithm that first generates a list of evaluation steps based on an initial criterion, then uses the generated steps as the “instructions” for your LLM judge to output a score between 0 and 1.

In the case of Arena G-Eval, we’ve taken the chain-of-thought (CoT) and form-filling paradigm approach (which you can learn more about here) to pick the winner instead of outputting a score.
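
To make those two steps concrete, here is a rough sketch of the flow. The prompts and the call_llm helper below are placeholders, not DeepEval’s actual prompts or internals.


def call_llm(prompt: str) -> str:
    # Placeholder for a call to the LLM judge of your choice.
    raise NotImplementedError

def arena_g_eval(criteria: str, blinded_contestants: dict) -> str:
    # Step 1: turn the plain-language criteria into concrete evaluation steps (CoT).
    steps = call_llm(
        f"Given the evaluation criteria: '{criteria}', "
        "generate a numbered list of concrete evaluation steps."
    )

    # Step 2 (form-filling): follow the steps and fill in a winner plus a reason,
    # instead of outputting a 0-1 score.
    outputs = "\n".join(f"{label}: {text}" for label, text in blinded_contestants.items())
    return call_llm(
        f"Follow these evaluation steps:\n{steps}\n\n"
        f"Contestant outputs:\n{outputs}\n\n"
        "Fill in: winner=<contestant label>, reason=<one sentence>."
    )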

Assessing The Effectiveness of LLM Arena-as-a-Judge

So should we ditch regular LLM-as-a-judge? No, not really, and here’s why. Regular LLM-as-a-judge has its benefits. It:

  1. Can be run in production, even without a reference for comparison.
  2. Is quantitative; you won’t get nice graphs out of LLM Arena-as-a-Judge.
  3. Can be customized to more fine-grained criteria.
  4. Can be extended to multi-turn conversations. Although you can technically also do this with LLM Arena-as-a-Judge, we’ve found that the LLM judge can easily get overloaded by several different ongoing conversations, which leads to hallucinations.

Our testing at Confident AI also indicates that LLM Arena-as-a-judge has the same alignment rate as our existing regular LLM-as-a-judge (specifically G-Eval) metrics:

Alignment rate of regular vs arena LLM-as-a-judge to human annotation, over 250k test cases each.

Our team aggregated customer feedback (in the form of thumbs up / thumbs down) on over half a million single-turn data points over the past month, split the data in half, and calculated the agreement between human feedback and regular/arena-based LLM-as-a-judge separately.

Note that although the evaluation criteria are not separated, both show an astonishing 95% alignment rate with what an end consumer of such evaluation results would agree on.

So when should you use which? Well, it depends. The way I see it, LLM Arena-as-a-Judge is easy to use, fast to set up, and requires no understanding of LLM evals. But if you have the time to learn how to use regular LLM-as-a-judge correctly, it is equally powerful and unlocks more features, such as quantitative scores, multi-turn evaluations, online evaluations in production, and much more you can learn about here.

Conclusion

In this article, we introduced a novel way to run LLM-as-a-judge evals, inspired by the original LLM Chatbot Arena. It addresses the barrier to entry of existing LLM evaluation metrics, where you have to curate and align a handful of single-output LLM metrics before using them to compute scores for and compare different versions of your LLM app.

Arena G-Eval makes this whole process easier by allowing you to regression test LLM apps by defining criteria in everyday language. It is available in DeepEval in 10 lines of code, is 100% open-source for anyone to use with any LLM, and performs just as well as regular G-Eval.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?
