Arena | Confident AI Docs

Overview

The Arena is a lightweight tool for rapid, side-by-side comparison. It’s ideal when you want to quickly see how different prompts, models, or AI Connections perform, without setting up a full evaluation with datasets and metrics.

Use the Arena when you:

Want to compare prompt variations to see which produces better outputs
Need to test a new model against your current one before committing
Are iterating on a prompt and want instant feedback
Want to demo differences between configurations to stakeholders
Don’t yet have a dataset and want to rely on vibes

The Arena is for quick, qualitative comparisons. For evaluation with metrics and test datasets, run an experiment instead.

How It Works

The Arena lets you set up two or more “contestants” and run the same input through all of them simultaneously. Each contestant can be either a Prompt or an AI Connection.

Set up contestants: configure two or more contestants to compare
Enter your message(s): type them in, or select existing prompts or AI Connections
Interpolate variables: if your input contains dynamic variables, enter their values
Press Quick Run: execute all contestants and view results side-by-side

You can mix and match — for example, compare a new prompt against your production AI Connection to see if it’s ready for deployment.
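
The flow above can be sketched in a few lines of Python. This is only an illustration of the idea, not Confident AI's implementation or API: the `Contestant` type, the `quick_run` function, and the two toy contestants are hypothetical, and each contestant is assumed to be a plain callable that maps an input string to an output string.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical stand-in: a contestant is any callable from input text to output text.
Contestant = Callable[[str], str]


def quick_run(contestants: Dict[str, Contestant], message: str) -> Dict[str, str]:
    """Send the same message to every contestant concurrently and collect outputs by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(contestants))) as pool:
        futures = {name: pool.submit(fn, message) for name, fn in contestants.items()}
        return {name: future.result() for name, future in futures.items()}


# Toy contestants; a real one would call a model or an AI Connection.
contestants: Dict[str, Contestant] = {
    "base": lambda text: text.upper(),
    "candidate": lambda text: text[::-1],
}

results = quick_run(contestants, "hello")
for name, output in results.items():
    print(f"{name}: {output}")
```

Every contestant receives the identical message, so any difference in the outputs comes from the contestants themselves.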


Using the Arena

Navigate to Arena

Go to Project > Arena from the sidebar.

Configure the Base Run

The base run is your starting point. Choose how it should generate outputs:

Prompt
AI Connection

If you choose Prompt:

Select a model provider (e.g., OpenAI) and model (e.g., gpt-4.1)
Enter your prompt in the message composer — you can use variables like {variable_name}
Optionally click Settings to configure model parameters

Use the Select prompt… dropdown to load a saved prompt from your Prompt Studio, or click Save Prompt to save your current prompt for later.

Add Comparison Runs

Click Add Contestant to add as many comparison runs as you need. Each contestant is independently configured — you can choose either a prompt or an AI Connection for each one.

Common comparison setups:

Prompt vs Prompt — compare different prompt variations on the same model
Model vs Model — compare the same prompt across different models
Prompt vs AI Connection — test a new prompt against your production system

Set Variable Values

If your prompts contain variables, expand the Variables panel at the bottom to provide values. These values will be interpolated into all prompts before execution.
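
As a rough mental model of this step, the sketch below fills `{variable_name}` placeholders with one shared set of values across several prompts. It assumes simple single-brace placeholders made of letters, digits, and underscores. The `interpolate` and `interpolate_all` helpers are hypothetical, and the Arena's actual interpolation rules (for example, how it treats missing values) may differ.

```python
import re
from typing import Dict, List

# Matches placeholders such as {variable_name}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def interpolate(template: str, values: Dict[str, str]) -> str:
    """Replace each {name} placeholder with its value; raise if a value is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value provided for variable '{name}'")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


def interpolate_all(prompts: List[str], values: Dict[str, str]) -> List[str]:
    """Apply the same variable values to every prompt."""
    return [interpolate(prompt, values) for prompt in prompts]


prompts = [
    "Summarize this for {audience}: {text}",
    "You are writing for {audience}. Shorten: {text}",
]
values = {"audience": "executives", "text": "Q3 revenue grew 12%."}

rendered = interpolate_all(prompts, values)
for prompt in rendered:
    print(prompt)
```

Raising on a missing value is a design choice made for this sketch: it surfaces a forgotten variable before any contestant runs.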

Run the Comparison

You have two options:

Quick Run — immediately execute all contestants and view results inline
Run as Experiment — run the comparison as a tracked experiment for more detailed analysis

Results appear below each contestant, allowing you to compare outputs side-by-side.

Advanced Prompt Usage

Include images

You can include images in your Arena comparisons by simply dragging and dropping them into the message composer. This is useful for testing vision-capable models with image inputs.

Only models that support image inputs (such as GPT-4o, Claude 3, and Gemini Pro Vision) will be able to generate responses based on the image. If a contestant uses a model without vision capabilities, it will only process the text portion of your input.

Configure models

You can also compare different model configurations on the same prompt, for example by running a Quick Run on an Anthropic model against an OpenAI model, or by changing only the model parameters.
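
One way to think about a parameter-only comparison is as two configurations that share everything except a single setting. The sketch below is illustrative only: `ModelConfig` is a hypothetical structure, and `temperature` is assumed here as an example parameter; the settings actually available depend on the provider and model you select.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical description of one contestant's model settings."""
    provider: str
    model: str
    temperature: float = 1.0  # assumed example parameter


# Base contestant: provider and model taken from the example above.
base = ModelConfig(provider="OpenAI", model="gpt-4.1", temperature=0.7)

# Second contestant: same provider and model, only the temperature changes.
low_temperature = replace(base, temperature=0.0)

print(base)
print(low_temperature)
```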

Tips for Comparisons

Test one variable at a time — change only the prompt OR the model between contestants to isolate what’s causing differences
Use realistic inputs — test with inputs that represent your actual use cases
Try edge cases — compare how different configurations handle unusual or challenging inputs
Save winning prompts — when you find a prompt that works well, save it to Prompt Studio for use in evaluations
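
The first tip can be checked mechanically if you keep your contestant settings written down. The sketch below is a hypothetical helper, not part of the Arena: it assumes each contestant is described by a plain dictionary with illustrative keys, and it reports which fields differ between two of them.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel so a missing key never equals an explicit None


def differing_fields(a: Dict[str, Any], b: Dict[str, Any]) -> List[str]:
    """Return the sorted names of fields whose values differ between two configurations."""
    keys = a.keys() | b.keys()
    return sorted(k for k in keys if a.get(k, _MISSING) != b.get(k, _MISSING))


# Illustrative configurations; the keys are assumptions made for this sketch.
base = {"prompt": "Summarize: {text}", "model": "gpt-4.1", "temperature": 0.7}
candidate = {"prompt": "Summarize in one sentence: {text}", "model": "gpt-4.1", "temperature": 0.7}

changed = differing_fields(base, candidate)
is_isolated = len(changed) == 1
print(changed, is_isolated)
```

If more than one field is listed, differences in the outputs cannot be attributed to a single change.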

Next Steps

Once you’ve found a configuration that works (🎉), take it to the next level — run a full experiment with a dataset to get statistically meaningful results.

Experiments

Run systematic evaluations to compare 2 or more versions of your AI app with datasets and metrics.