Now that you’ve run your AI app in the Arena (or created a test run), it’s time to systematically find out which version of your AI app is actually better.
Experiments let you compare two or more versions of your AI app side-by-side, using the same dataset and metrics.
This is fundamentally different from running single-turn or multi-turn evaluations:
Think of it this way: evals answer “how good is this version?” while experiments answer “which version is better?”
Experiments require the same dataset and same metric collection. This ensures you’re comparing apples to apples.
You can kick off an experiment in two ways:
Either way you’ll get the same result, but running experiments from the Arena is usually the more natural workflow.
We highly recommend changing only one independent variable (such as the prompt version) at a time so the comparison stays fair.
As covered in the Arena section, the Arena is great for quick, qualitative comparisons. But when you’re ready to graduate from vibes to data, it’s time to run a proper experiment.
Here’s the key difference:
Here’s what happens when you run an experiment:
For example, an experiment with 4 contestants, 5 goldens, and 3 metrics generates 20 outputs (4 x 5) and runs 60 evaluations (20 x 3).
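The arithmetic above can be sketched in a few lines, which is handy for estimating how much work an experiment will trigger before you run it. The function name `experiment_workload` is hypothetical and is not part of any Confident AI API.

```python
def experiment_workload(contestants: int, goldens: int, metrics: int) -> tuple[int, int]:
    """Return (outputs generated, evaluations run) for one experiment."""
    # Every contestant produces one output per golden.
    outputs = contestants * goldens
    # Every output is scored once by every metric.
    evaluations = outputs * metrics
    return outputs, evaluations


outputs, evaluations = experiment_workload(contestants=4, goldens=5, metrics=3)
print(outputs, evaluations)  # 20 60
```

Adding a contestant or a metric multiplies the work rather than adding to it, so large datasets with many metrics grow quickly.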
In the Arena, configure at least 2 contestants. Each can be a Prompt or AI Connection.
Click Run as Experiment in the top right corner. You’ll see a dialog to configure your experiment:
Variables Mapping is optional but highly recommended. If your prompts use variables like {input}, map them to golden fields here. Without this, every golden gets the same static prompt — not very useful for comparison.
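To see why the mapping matters, here is a minimal sketch of what mapping a prompt variable to a golden field amounts to conceptually. It is an illustration only, not how Confident AI implements it: `render_prompt`, the `mapping` dictionary, and the sample goldens are all assumptions made up for this example.

```python
def render_prompt(template: str, golden: dict, mapping: dict[str, str]) -> str:
    """Fill each prompt variable with the golden field it is mapped to."""
    values = {variable: golden[field] for variable, field in mapping.items()}
    return template.format(**values)


template = "Answer the customer question: {input}"
mapping = {"input": "input"}  # prompt variable -> golden field (hypothetical)
goldens = [
    {"input": "How do I reset my password?"},
    {"input": "Where is my invoice?"},
]

prompts = [render_prompt(template, golden, mapping) for golden in goldens]
for prompt in prompts:
    print(prompt)
```

With the mapping in place, each golden yields a different prompt. Without it, the template would be sent unchanged for every golden, so every contestant would be answering the same static text.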
Confident AI generates outputs for each contestant across your entire dataset, then runs your metrics on every test case.
When complete, you’ll see:
Metrics Overview — a side-by-side comparison of all contestants:
All differences are compared against the base run. You can switch which contestant is the base run to change your control — useful when you want to see how everything compares to a different baseline.
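As a rough illustration of comparing against a base run, the sketch below subtracts the base contestant's average score from every other contestant's, metric by metric. The contestant names, metric names, and scores are invented for the example, and `diff_against_base` is a hypothetical helper rather than a Confident AI function.

```python
def diff_against_base(
    scores: dict[str, dict[str, float]], base: str
) -> dict[str, dict[str, float]]:
    """Return each non-base contestant's per-metric difference from the base run."""
    base_scores = scores[base]
    return {
        name: {
            metric: round(value - base_scores[metric], 4)
            for metric, value in metrics.items()
        }
        for name, metrics in scores.items()
        if name != base
    }


# Illustrative average scores per contestant (made-up numbers).
scores = {
    "prompt_v1": {"relevancy": 0.70, "faithfulness": 0.80},
    "prompt_v2": {"relevancy": 0.85, "faithfulness": 0.75},
}

deltas = diff_against_base(scores, base="prompt_v1")
print(deltas)  # {'prompt_v2': {'relevancy': 0.15, 'faithfulness': -0.05}}
```

Switching the base to `prompt_v2` flips the sign of every difference, which is all that changing the control does: the underlying scores stay the same and only the reference point moves.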
Experiment Test Cases — drill into individual goldens:
Running experiments from Arena is perfect when you’re actively iterating. Set up your contestants, tweak prompts, and run experiments until you find a winner.
Already have test runs from previous evaluations? You can create an experiment directly from them — no need to re-run anything.
This is different from regression testing:
Click Create Experiment and select the other test runs you want to compare against.
Only test runs using the same dataset and metric collection can be compared in an experiment.
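The rule can be expressed as a simple check: a set of test runs is comparable only if they all share one dataset and one metric collection. The `RunSummary` class and its fields are assumptions made for this sketch and do not mirror Confident AI's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical summary of a test run, just enough to check comparability."""
    name: str
    dataset: str
    metric_collection: str


def comparable(runs: list[RunSummary]) -> bool:
    """True only when every run shares the same dataset and metric collection."""
    return len({(run.dataset, run.metric_collection) for run in runs}) == 1


run_a = RunSummary("baseline", dataset="support-qa", metric_collection="core")
run_b = RunSummary("new-prompt", dataset="support-qa", metric_collection="core")
run_c = RunSummary("other-data", dataset="billing-qa", metric_collection="core")

print(comparable([run_a, run_b]))  # True
print(comparable([run_a, run_c]))  # False
```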
The experiment view aggregates all your test runs and shows you:
A good experiment tells you more than just “A is better than B.” Here’s what to look for:
You should also avoid looking at the details of each contestant from the get-go to avoid bias. Confident AI intentionally hides contestant details unless you explicitly click the top of each column.
Now that you’ve run an experiment, the next step is making your results even more reliable. Your experiment is only as valid as the quality of your dataset. That means scaling your dataset and keeping it well-maintained.