Arena
Quickly compare prompts, models, and AI connections side-by-side without running a full evaluation.
The Arena is a lightweight tool for rapid, side-by-side comparison. It is ideal when you want to see how different prompts, models, or AI connections perform without setting up a full evaluation with datasets and metrics.
Use the Arena when you:
The Arena is for quick, qualitative comparisons. For evaluation with metrics and test datasets, run an experiment instead.
The Arena lets you set up two or more “contestants” and run the same input through all of them simultaneously. Each contestant can be either a Prompt or an AI Connection.
You can mix and match: for example, compare a new prompt against your production AI Connection to see whether it is ready for deployment.
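The fan-out described above can be pictured with a short sketch. This is not the Arena's implementation; it assumes a contestant can be modeled as any callable that maps input text to output text, and `run_arena`, `new_prompt`, and `production_connection` are hypothetical names used only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Assumption: a contestant is any callable taking the input text and returning output text.
Contestant = Callable[[str], str]


def run_arena(contestants: Dict[str, Contestant], user_input: str) -> Dict[str, str]:
    """Send the same input to every contestant in parallel and collect outputs by name."""
    with ThreadPoolExecutor(max_workers=max(1, len(contestants))) as pool:
        futures = {name: pool.submit(fn, user_input) for name, fn in contestants.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for a draft prompt and a production AI Connection.
def new_prompt(text: str) -> str:
    return f"[new prompt] {text.upper()}"


def production_connection(text: str) -> str:
    return f"[production] {text}"


results = run_arena(
    {"new_prompt": new_prompt, "production": production_connection},
    "hello",
)
for name, output in results.items():
    print(f"{name}: {output}")
```

Keeping the outputs keyed by contestant name makes it easy to lay them out side by side, which is the point of the comparison.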
The base run is your starting point. Choose how it should generate outputs:
Reference variables in a prompt with the {variable_name} syntax. Use the Select prompt… dropdown to load a saved prompt from your Prompt Studio, or click Save Prompt to save your current prompt for later.
Click Add Contestant to add as many comparison runs as you need. Each contestant is configured independently: you can choose either a prompt or an AI Connection for each one.
Common comparison setups:
If your prompts contain variables, expand the Variables panel at the bottom to provide values. These values will be interpolated into all prompts before execution.
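As a rough illustration of that interpolation step, the sketch below substitutes `{variable_name}` placeholders into several prompts from one shared set of values. The single-brace syntax is taken from this page; the `interpolate` helper and the choice to leave unknown placeholders untouched are assumptions, not documented Arena behavior.

```python
import re
from typing import Dict, List

# Matches placeholders such as {topic} or {audience}.
_VARIABLE = re.compile(r"\{(\w+)\}")


def interpolate(prompt: str, values: Dict[str, str]) -> str:
    """Replace each {name} with its value; placeholders without a value are left as-is."""
    return _VARIABLE.sub(lambda m: values.get(m.group(1), m.group(0)), prompt)


# One set of values is applied to every contestant's prompt.
values = {"topic": "vector databases", "audience": "beginners"}
prompts: List[str] = [
    "Explain {topic} to {audience}.",
    "Write a two-sentence summary of {topic} for {audience}.",
]
rendered = [interpolate(p, values) for p in prompts]
for text in rendered:
    print(text)
```

Because every prompt receives the same values, any difference in the outputs comes from the prompt or model rather than from the input.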
You have two options:
Results appear below each contestant, allowing you to compare outputs side-by-side.
You can include images in your Arena comparisons by dragging and dropping them into the message composer. This is useful for testing vision-capable models with image inputs.
Only models that support image inputs (such as GPT-4o, Claude 3, and Gemini Pro Vision) will be able to generate responses based on the image. If a contestant uses a model without vision capabilities, it will only process the text portion of your input.
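The text-only fallback can be modeled in a few lines. This is a sketch under stated assumptions: `ArenaInput`, `effective_input`, and the `supports_vision` flag are hypothetical names, and how the Arena actually detects vision support is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArenaInput:
    """Hypothetical container for one Arena message: text plus attached image paths."""
    text: str
    image_paths: List[str] = field(default_factory=list)


def effective_input(message: ArenaInput, supports_vision: bool) -> ArenaInput:
    """Return what a contestant would process: images are dropped for text-only models."""
    if supports_vision:
        return message
    return ArenaInput(text=message.text, image_paths=[])


message = ArenaInput(text="What is in this picture?", image_paths=["chart.png"])
vision_view = effective_input(message, supports_vision=True)
text_only_view = effective_input(message, supports_vision=False)
print(vision_view.image_paths, text_only_view.image_paths)
```

When reading results, keep in mind that a text-only contestant answered a different question than a vision-capable one, so their outputs are not directly comparable.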
You can also compare different model configurations on the same prompt, for example by running the prompt on a model from Anthropic and on one from OpenAI, or by changing the model parameters.
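One way to keep such comparisons readable is to record each contestant's configuration and check exactly which fields differ. The sketch below is illustrative only: `ModelConfig`, its field names, its default values, and the placeholder model names are assumptions, not the Arena's actual settings.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical per-contestant model settings."""
    provider: str
    model: str
    temperature: float = 0.7
    max_tokens: int = 512


def config_diff(a: ModelConfig, b: ModelConfig) -> Dict[str, Tuple[object, object]]:
    """Return only the fields whose values differ, as (a_value, b_value) pairs."""
    da, db = asdict(a), asdict(b)
    return {key: (da[key], db[key]) for key in da if da[key] != db[key]}


# Same provider and model, different sampling temperature.
base = ModelConfig(provider="anthropic", model="model-a")
cooler = ModelConfig(provider="anthropic", model="model-a", temperature=0.2)

# Different provider and model, same parameters as the base run.
other_provider = ModelConfig(provider="openai", model="model-b")

print(config_diff(base, cooler))
print(config_diff(base, other_provider))
```

Changing one field at a time between contestants makes it easier to attribute a difference in output to a specific setting.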
Once you’ve found a configuration that works (🎉), take it to the next level — run a full experiment with a dataset to get statistically meaningful results.