
Arena

Quickly compare prompts, models, and AI connections side-by-side without running a full evaluation.

Overview

The Arena is a lightweight tool for rapid side-by-side comparison. It’s ideal when you want to quickly see how different prompts, models, or AI connections perform, without setting up a full evaluation with datasets and metrics.

Use the Arena when you:

  • Want to compare prompt variations to see which produces better outputs
  • Need to test a new model against your current one before committing
  • Are iterating on a prompt and want instant feedback
  • Want to demo differences between configurations to stakeholders
  • Don’t yet have a dataset and want to rely on vibes

The Arena is for quick, qualitative comparisons. For evaluation with metrics and test datasets, run an experiment instead.

How It Works

The Arena lets you set up two or more “contestants” and run the same input through all of them simultaneously. Each contestant can be either a Prompt or an AI Connection.

  1. Set up contestants: configure two or more contestants to compare
  2. Enter your message(s): or select existing prompts or AI connections
  3. Interpolate variables: if your input contains dynamic variables, enter their values
  4. Press Quick Run: execute all contestants and view results side-by-side

You can mix and match — for example, compare a new prompt against your production AI Connection to see if it’s ready for deployment.
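
Conceptually, a Quick Run fans the same input out to every contestant and collects the outputs for side-by-side viewing. The sketch below illustrates that idea in plain Python. It is not Confident AI's implementation or API: `Contestant`, `quick_run`, and the stand-in generator functions are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Contestant:
    """Hypothetical stand-in for one Arena contestant (a Prompt or an AI Connection)."""
    name: str
    generate: Callable[[str], str]  # maps the shared input to an output


def quick_run(contestants: List[Contestant], user_input: str) -> Dict[str, str]:
    """Send the same input to every contestant concurrently and collect outputs by name."""
    with ThreadPoolExecutor(max_workers=len(contestants)) as pool:
        futures = {c.name: pool.submit(c.generate, user_input) for c in contestants}
        return {name: future.result() for name, future in futures.items()}


# Stand-in generators; a real contestant would call a model or your deployed app.
contestants = [
    Contestant("new-prompt", lambda text: f"[concise] {text}"),
    Contestant("production-connection", lambda text: f"[detailed] {text}"),
]

results = quick_run(contestants, "Summarize our refund policy.")
for name, output in results.items():
    print(f"{name}: {output}")
```

Because every contestant receives an identical input, any difference in the outputs can be attributed to the contestant's own configuration rather than to the input.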

[Image: Arena Overview]

Using the Arena

1. Navigate to Arena

Go to Project > Arena from the sidebar.

2. Configure the Base Run

The base run is your starting point. Choose whether it generates outputs from a Prompt or an AI Connection.

[Image: Configure prompt in Arena]

To use a Prompt:
  1. Select a model provider (e.g., OpenAI) and model (e.g., gpt-4.1)
  2. Enter your prompt in the message composer — you can use variables like {variable_name}
  3. Optionally click Settings to configure model parameters

Use the Select prompt… dropdown to load a saved prompt from your Prompt Studio, or click Save Prompt to save your current prompt for later.

3. Add Comparison Runs

Click Add Contestant to add as many comparison runs as you need. Each contestant is independently configured — you can choose either a prompt or an AI Connection for each one.

Common comparison setups:

  • Prompt vs Prompt — compare different prompt variations on the same model
  • Model vs Model — compare the same prompt across different models
  • Prompt vs AI Connection — test a new prompt against your production system

4. Set Variable Values

If your prompts contain variables, expand the Variables panel at the bottom to provide values. These values will be interpolated into all prompts before execution.
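
To make the interpolation step concrete, here is a minimal sketch of how `{variable_name}` placeholders could be filled in across several prompts. It assumes placeholders are bare identifiers in single braces, as in the example above; the `interpolate` helper and the sample prompts and values are hypothetical and do not reflect how the platform performs the substitution internally.

```python
import re
from typing import Dict, List

# Matches placeholders such as {product} or {variable_name}.
VARIABLE_PATTERN = re.compile(r"\{(\w+)\}")


def interpolate(template: str, values: Dict[str, str]) -> str:
    """Replace each {variable_name} placeholder with its value.

    Raises KeyError if a placeholder has no value, so a missing
    variable is caught before anything is sent to a model.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value provided for variable '{name}'")
        return values[name]

    return VARIABLE_PATTERN.sub(substitute, template)


prompts: List[str] = [
    "Answer the question about {product} in a {tone} tone.",
    "You are a support agent for {product}. Be {tone}.",
]
values = {"product": "Acme CRM", "tone": "friendly"}

# The same values are applied to every prompt before execution.
rendered = [interpolate(p, values) for p in prompts]
for text in rendered:
    print(text)
```

A regular expression is used here rather than `str.format` so that braces which do not wrap a bare identifier, such as those in a JSON snippet, pass through unchanged.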

[Image: Set Dynamic Variables]

5. Run the Comparison

You have two options:

  • Quick Run — immediately execute all contestants and view results inline
  • Run as Experiment — run the comparison as a tracked experiment for more detailed analysis

Results appear below each contestant, allowing you to compare outputs side-by-side.

[Image: Arena Quick Run]

Advanced Prompt Usage

Include images

You can include images in your Arena comparisons by simply dragging and dropping them into the message composer. This is useful for testing vision-capable models with image inputs.

[Image: Include images in Arena prompt]

Only models that support image inputs (such as GPT-4o, Claude 3, and Gemini Pro Vision) will be able to generate responses based on the image. If a contestant uses a model without vision capabilities, it will only process the text portion of your input.
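
One way to picture the text-only fallback is as a filter over the parts of a message. The sketch below is illustrative only: the `TextPart` and `ImagePart` types and the `supports_vision` flag are hypothetical and are not taken from the platform or from any provider's message schema.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    path: str


MessagePart = Union[TextPart, ImagePart]


def parts_for_model(parts: List[MessagePart], supports_vision: bool) -> List[MessagePart]:
    """Return the parts a model will actually process.

    Vision-capable models receive everything; other models receive only the text parts.
    """
    if supports_vision:
        return list(parts)
    return [part for part in parts if isinstance(part, TextPart)]


message = [TextPart("What is shown in this chart?"), ImagePart("chart.png")]

vision_input = parts_for_model(message, supports_vision=True)
text_only_input = parts_for_model(message, supports_vision=False)

print(len(vision_input))     # 2
print(len(text_only_input))  # 1
```

When comparing a vision-capable contestant against one without vision, keep in mind that the two are effectively answering different inputs, so differences in output may come from the missing image rather than from the prompt or model quality.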

Configure models

You can also compare different model configs on the same prompt. For example, you can do a Quick Run that pits an Anthropic model against an OpenAI model, or vary only the model parameters:

[Image: Model Configs in Prompts]

Tips for Comparisons

  • Test one variable at a time — change only the prompt OR the model between contestants to isolate what’s causing differences
  • Use realistic inputs — test with inputs that represent your actual use cases
  • Try edge cases — compare how different configurations handle unusual or challenging inputs
  • Save winning prompts — when you find a prompt that works well, save it to Prompt Studio for use in evaluations
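
The "one variable at a time" tip can be checked mechanically if you keep a record of each contestant's setup. The following is a rough sketch under that assumption; `ContestantConfig` and `differing_fields` are hypothetical names, and the fields shown are only examples of what a configuration might contain.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ContestantConfig:
    """Hypothetical description of a contestant's setup."""
    prompt: str
    model: str
    temperature: float = 0.0


def differing_fields(a: ContestantConfig, b: ContestantConfig) -> List[str]:
    """List the field names whose values differ between two configurations."""
    left, right = asdict(a), asdict(b)
    return [name for name in left if left[name] != right[name]]


base = ContestantConfig(prompt="Summarize: {text}", model="gpt-4.1")
candidate = ContestantConfig(prompt="Summarize in one sentence: {text}", model="gpt-4.1")

changed = differing_fields(base, candidate)
if len(changed) != 1:
    print(f"Warning: {len(changed)} fields differ ({changed}); results may be hard to attribute.")
else:
    print(f"Clean comparison: only '{changed[0]}' differs.")
```

If more than one field differs, consider adding an intermediate contestant that changes only one of them, so each difference in output can be traced to a single cause.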

Next Steps

Once you’ve found a configuration that works (🎉), take it to the next level — run a full experiment with a dataset to get statistically meaningful results.

Experiments

Run systematic evaluations to compare 2 or more versions of your AI app with datasets and metrics.