Arena

Quickly compare prompts, models, and AI connections side-by-side without running a full evaluation.

Overview

The Arena is a lightweight tool for rapid, side-by-side comparisons. It’s ideal when you want to quickly see how different prompts, models, or AI connections perform — without setting up a full evaluation with datasets and metrics.

Use the Arena when you:

  • Want to compare prompt variations to see which produces better outputs
  • Need to test a new model against your current one before committing
  • Are iterating on a prompt and want instant feedback
  • Want to demo differences between configurations to stakeholders
  • Don’t yet have a dataset and want to rely on vibes

The Arena is for quick, qualitative comparisons. For evaluation with metrics and test datasets, run an experiment instead.

How It Works

The Arena lets you set up two or more “contestants” and run the same input through all of them simultaneously. Each contestant can be either a Prompt or an AI Connection.

  1. Set up contestants — configure two or more contestants to compare
  2. Enter your input — write your message(s), or load existing prompts or AI Connections
  3. Interpolate variables — if your input contains variables, provide values for them
  4. Press Quick Run — execute all contestants and view results side-by-side

You can mix and match — for example, compare a new prompt against your production AI Connection to see if it’s ready for deployment.
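
If it helps to think of this in code, each contestant is essentially a label plus either an inline prompt or a reference to a saved AI Connection. The sketch below is purely illustrative: the class and field names are assumptions, not the Arena’s actual data model.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only; these names are assumptions, not the Arena's real schema.
@dataclass
class Contestant:
    label: str
    prompt: Optional[str] = None         # inline prompt text, if this contestant is a Prompt
    ai_connection: Optional[str] = None  # name of a saved AI Connection, used instead of a prompt
    model: Optional[str] = None          # e.g. "gpt-4.1"; not needed when ai_connection is set

# Mix and match: a draft prompt compared against the production AI Connection.
contestants = [
    Contestant(label="Draft prompt", prompt="Summarize this ticket: {ticket_text}", model="gpt-4.1"),
    Contestant(label="Production", ai_connection="support-summarizer-prod"),
]
```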

Arena Overview

Using the Arena

Configure the Base Run

The base run is your starting point. Choose how it should generate outputs:

Configure prompt in Arena
  1. Select a model provider (e.g., OpenAI) and model (e.g., gpt-4.1)
  2. Enter your prompt in the message composer — you can use variables like {variable_name}
  3. Optionally click Settings to configure model parameters

Use the Select prompt… dropdown to load a saved prompt from your Prompt Studio, or click Save Prompt to save your current prompt for later.
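
If you prefer to picture the result as data, a base run boils down to a provider, a model, one or more messages (optionally containing {variable} placeholders), and any parameters you set under Settings. The dictionary below is a hypothetical shape for illustration, not a format the Arena exports.

```python
# Hypothetical shape of a base run; keys and values are assumptions for illustration.
base_run = {
    "provider": "openai",
    "model": "gpt-4.1",
    "messages": [
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": "Summarize this ticket: {ticket_text}"},  # {variable} placeholder
    ],
    "settings": {  # optional model parameters from the Settings dialog
        "temperature": 0.2,
        "max_tokens": 512,
    },
}
```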

Add Comparison Runs

Click Add Contestant to add as many comparison runs as you need. Each contestant is independently configured — you can choose either a prompt or an AI Connection for each one.

Common comparison setups:

  • Prompt vs Prompt — compare different prompt variations on the same model
  • Model vs Model — compare the same prompt across different models
  • Prompt vs AI Connection — test a new prompt against your production system
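
In terms of the illustrative Contestant sketch above, these setups look roughly like this (model names and labels are example values, not recommendations):

```python
# Reusing the illustrative Contestant sketch from above; all names are example values.
prompt_vs_prompt = [
    Contestant(label="One sentence", prompt="Summarize in one sentence: {article}", model="gpt-4.1"),
    Contestant(label="Bulleted", prompt="Summarize as three bullet points: {article}", model="gpt-4.1"),
]

model_vs_model = [
    Contestant(label="OpenAI", prompt="Summarize: {article}", model="gpt-4.1"),
    Contestant(label="Anthropic", prompt="Summarize: {article}", model="claude-3-opus"),
]
# Prompt vs AI Connection looks like the mixed example shown earlier.
```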

Set Variable Values

If your prompts contain variables, expand the Variables panel at the bottom to provide values. These values will be interpolated into all prompts before execution.
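
Interpolation simply substitutes each {variable_name} placeholder with the value you supply, and the same values are applied to every contestant. Here is a minimal sketch of the idea, not the Arena’s actual implementation:

```python
# Minimal sketch of placeholder substitution; the Arena's real implementation may differ.
def interpolate(prompt: str, variables: dict[str, str]) -> str:
    for name, value in variables.items():
        prompt = prompt.replace("{" + name + "}", value)
    return prompt

variables = {"ticket_text": "My CSV export fails with a timeout after 30 seconds."}

contestant_prompts = [
    "Summarize this ticket: {ticket_text}",
    "Classify the urgency of this ticket: {ticket_text}",
]

# The same values are interpolated into every prompt before execution.
resolved = [interpolate(p, variables) for p in contestant_prompts]
print(resolved[0])  # Summarize this ticket: My CSV export fails with a timeout after 30 seconds.
```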

Set Dynamic Variables

Run the Comparison

You have two options:

  • Quick Run — immediately execute all contestants and view results inline
  • Run as Experiment — run the comparison as a tracked experiment for more detailed analysis

Results appear below each contestant, allowing you to compare outputs side-by-side.
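
Conceptually, a Quick Run fans the same resolved input out to every contestant and gathers the outputs for side-by-side review. The sketch below uses a stand-in call_model function; it is an analogy, not the Arena’s execution code.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for calling each contestant's configured model or AI Connection.
def call_model(label: str, resolved_input: str) -> str:
    return f"[{label}] response to: {resolved_input}"

labels = ["One-sentence prompt", "Bulleted prompt", "Production connection"]
resolved_input = "Summarize this ticket: My CSV export fails with a timeout after 30 seconds."

# Run every contestant on the same input and collect the outputs for side-by-side review.
with ThreadPoolExecutor() as pool:
    outputs = list(pool.map(lambda label: call_model(label, resolved_input), labels))

for label, output in zip(labels, outputs):
    print(f"{label}: {output}")
```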

Arena Quick Run

Advanced Prompt Usage

Include images

You can include images in your Arena comparisons by simply dragging and dropping them into the message composer. This is useful for testing vision-capable models with image inputs.

Include images in Arena prompt

Only models that support image inputs (such as GPT-4o, Claude 3, and Gemini Pro Vision) will be able to generate responses based on the image. If a contestant uses a model without vision capabilities, it will only process the text portion of your input.
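
If you are curious what an image input looks like at the API level, many chat-style providers accept a message whose content mixes text parts and image parts, roughly as shown below. The exact payload the Arena sends depends on the provider you select, so treat this as an illustrative shape only.

```python
# Illustrative content-parts message mixing text and an image reference.
# The exact payload depends on the selected provider; field names here are assumptions.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What error message is shown in this screenshot?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
    ],
}

# A contestant whose model lacks vision support would effectively see only the text part.
text_only = " ".join(part["text"] for part in message["content"] if part["type"] == "text")
print(text_only)
```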

Configure models

You can also compare different model configurations with the same prompt. For example, run a Quick Run with an Anthropic model against an OpenAI model, or keep the model the same and vary its parameters:

Model Configs in Prompts
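
One way to think about a model-config comparison: the prompt stays fixed while each contestant carries its own provider, model, and parameters. The values below are illustrative assumptions, not suggested defaults.

```python
# Same prompt, different model configurations; all values are illustrative assumptions.
shared_prompt = "Draft a polite reply to this complaint: {complaint}"

model_configs = [
    {"label": "OpenAI, deterministic", "provider": "openai",
     "model": "gpt-4.1", "settings": {"temperature": 0.0}},
    {"label": "OpenAI, creative", "provider": "openai",
     "model": "gpt-4.1", "settings": {"temperature": 1.0}},
    {"label": "Anthropic", "provider": "anthropic",
     "model": "claude-3-opus", "settings": {"temperature": 0.2}},
]

# Each entry would become one contestant running shared_prompt with its own settings.
for config in model_configs:
    print(config["label"], "->", config["model"], config["settings"])
```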

Tips for Comparisons

  • Test one variable at a time — change only the prompt OR the model between contestants to isolate what’s causing differences
  • Use realistic inputs — test with inputs that represent your actual use cases
  • Try edge cases — compare how different configurations handle unusual or challenging inputs
  • Save winning prompts — when you find a prompt that works well, save it to Prompt Studio for use in evaluations

Next Steps

Once you’ve found a configuration that works (🎉), take it to the next level — run a full experiment with a dataset to get statistically meaningful results.