Arena
Quickly compare prompts, models, and AI connections side-by-side without running a full evaluation.
Overview
The Arena is a lightweight tool for rapid, side-by-side comparisons. It's ideal when you want to quickly see how different prompts, models, or AI connections perform, without setting up a full evaluation with datasets and metrics.
Use the Arena when you:
- Want to compare prompt variations to see which produces better outputs
- Need to test a new model against your current one before committing
- Are iterating on a prompt and want instant feedback
- Want to demo differences between configurations to stakeholders
- Don’t yet have a dataset and want to rely on vibes
The Arena is for quick, qualitative comparisons. For quantitative evaluation with metrics and test datasets, run an experiment instead.
How It Works
The Arena lets you set up two or more "contestants" and run the same input through all of them simultaneously. Each contestant can be either a Prompt or an AI Connection.
- Set up contestants — configure two or more contestants to compare
- Enter your message(s) — type your input, or load existing prompts or AI connections
- Interpolate variables — if your input contains variables, provide values for them
- Press Quick Run — execute all contestants and view results side-by-side
You can mix and match — for example, compare a new prompt against your production AI Connection to see if it’s ready for deployment.
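To make the idea concrete, here is a minimal sketch of the "same input, several configurations" pattern the Arena automates. The `Contestant` class and `run_side_by_side` function are illustrative names, not part of the product's API, and the sketch only renders what would be sent to each contestant rather than calling any model.

```python
from dataclasses import dataclass

@dataclass
class Contestant:
    # Illustrative structure: each contestant pairs a model with a prompt template.
    name: str
    provider: str         # e.g. "openai" or "anthropic"
    model: str            # e.g. "gpt-4.1"
    prompt_template: str  # may contain {variables}

def run_side_by_side(contestants: list[Contestant], user_input: str) -> dict[str, str]:
    """Run the same input through every contestant and collect the results.

    In the Arena each contestant triggers a real model call; here we only
    render the request so the example stays self-contained.
    """
    results = {}
    for c in contestants:
        rendered = c.prompt_template.format(input=user_input)
        results[c.name] = f"[{c.provider}/{c.model}] {rendered}"
    return results

outputs = run_side_by_side(
    [
        Contestant("baseline", "openai", "gpt-4.1", "Answer concisely: {input}"),
        Contestant("candidate", "anthropic", "claude-3-5-sonnet", "Answer step by step: {input}"),
    ],
    "What does our refund policy say about opened items?",
)
for name, out in outputs.items():
    print(f"{name}: {out}")
```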
Using the Arena
Configure the Base Run
The base run is your starting point. Choose how it should generate outputs:
Prompt

- Select a model provider (e.g., OpenAI) and model (e.g., gpt-4.1)
- Enter your prompt in the message composer — you can use variables like {variable_name}
- Optionally click Settings to configure model parameters

Use the Select prompt… dropdown to load a saved prompt from your Prompt Studio, or click Save Prompt to save your current prompt for later.

AI Connection

- Select an existing AI Connection to generate outputs
Add Comparison Runs
Click Add Contestant to add as many comparison runs as you need. Each contestant is independently configured — you can choose either a prompt or an AI Connection for each one.
Common comparison setups:
- Prompt vs Prompt — compare different prompt variations on the same model
- Model vs Model — compare the same prompt across different models
- Prompt vs AI Connection — test a new prompt against your production system
Set Variable Values
If your prompts contain variables, expand the Variables panel at the bottom to provide values. These values will be interpolated into all prompts before execution.
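Conceptually, interpolation is plain placeholder substitution: every {variable_name} in a prompt is replaced with the value you supply. The helper below only illustrates that behavior (the Arena does this for you); the function name and regex are not part of the product.

```python
import re

def interpolate(template: str, variables: dict[str, str]) -> str:
    # Replace each {variable_name} placeholder with its supplied value;
    # unknown placeholders are left untouched.
    return re.sub(r"\{(\w+)\}", lambda m: variables.get(m.group(1), m.group(0)), template)

prompt = "Summarize this ticket for a {audience}: {ticket_text}"
print(interpolate(prompt, {
    "audience": "support engineer",
    "ticket_text": "Login fails after password reset.",
}))
# -> Summarize this ticket for a support engineer: Login fails after password reset.
```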
Run the Comparison
You have two options:
- Quick Run — immediately execute all contestants and view results inline
- Run as Experiment — run the comparison as a tracked experiment for more detailed analysis
Results appear below each contestant, allowing you to compare outputs side-by-side.
Advanced Prompt Usage
Include images
You can include images in your Arena comparisons by simply dragging and dropping them into the message composer. This is useful for testing vision-capable models with image inputs.
Only models that support image inputs (such as GPT-4o, Claude 3, and Gemini Pro Vision) will be able to generate responses based on the image. If a contestant uses a model without vision capabilities, it will only process the text portion of your input.
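For context, this is roughly how an image reaches a vision-capable model at the API level, shown here with the OpenAI Python SDK. The Arena handles this for you when you drag and drop an image; the model name and image URL below are placeholders for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # a vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What defect is visible in this product photo?"},
            # Placeholder URL for illustration only
            {"type": "image_url", "image_url": {"url": "https://example.com/defect.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```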
Configure models
You can also compare different model configurations for the same prompt. For example, run the same prompt on an Anthropic model versus an OpenAI model, or vary model parameters such as temperature.
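As a rough by-hand equivalent of such a quick run, the sketch below sends one prompt to an OpenAI model and an Anthropic model with different temperatures. It assumes both SDKs are installed and API keys are set in the environment; the model names are examples only.

```python
from openai import OpenAI
import anthropic

prompt = "Explain idempotency to a junior engineer in two sentences."

# Contestant A: OpenAI model with a low temperature
openai_reply = OpenAI().chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,
).choices[0].message.content

# Contestant B: Anthropic model with a higher temperature
anthropic_reply = anthropic.Anthropic().messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,
).content[0].text

print("OpenAI:   ", openai_reply)
print("Anthropic:", anthropic_reply)
```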
Tips for Comparisons
- Test one variable at a time — change only the prompt OR the model between contestants to isolate what’s causing differences
- Use realistic inputs — test with inputs that represent your actual use cases
- Try edge cases — compare how different configurations handle unusual or challenging inputs
- Save winning prompts — when you find a prompt that works well, save it to Prompt Studio for use in evaluations
Next Steps
Once you’ve found a configuration that works (🎉), take it to the next level — run a full experiment with a dataset to get statistically meaningful results.