A|B Regression Testing
You must have already run at least TWO test runs for this feature to work.
Overview
An A|B regression test allows you to see:
- Differences in metric scores for the same dataset across two LLM app versions
- Which prompts, models, etc. led to better scores
- Which matching test cases regressed or improved, and why
Regression testing on Confident AI is about comparing two versions of your LLM app side-by-side. Regression tests happen retrospectively - meaning you can only run a regression test once you've run an evaluation and created a test run you wish to benchmark against.
How It Works
Go to the Compare Test Results page in a testing report, and click on the Select Test Run to Compare Against button. There will be a dropdown of test runs for you to choose from - select the one you wish to compare against.
It can be difficult to tell which test run is the one you wish to compare against. This is where the identifier comes in handy when using the evaluate() function, as the identifier is displayed alongside the generated test run ID you see in the dropdown:
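Here is a minimal sketch of labeling a test run via evaluate(); the metric, test case contents, and identifier string are illustrative, and the identifier argument assumes a deepeval version that supports it:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What are your refund policies?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)

evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric()],
    # A human-readable label (hypothetical) shown next to the generated
    # test run ID, making this run easy to find in the comparison dropdown.
    identifier="v2-new-system-prompt",
)
```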
Or in deepeval test run:
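For example (the test file name is hypothetical, and the exact identifier flag is an assumption - check `deepeval test run --help` for the flag supported by your version):

```bash
deepeval test run test_llm_app.py --identifier "v2-new-system-prompt"
```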
Comparison Overview
Selecting a test run will automatically open the Comparison Overview, which shows:
- The difference in passing test cases
- The difference in metric score distributions per metric
The comparison displays data from your current test run on the left side and data from the selected comparison test run on the right side. This side-by-side view makes it easy to spot differences between the two runs.
Compare Test Cases Side-by-Side
The left-hand side displays test cases from the current test run, while the right-hand side displays test cases from the test run you are comparing against.
The test case comparer matches test cases from both test runs using test case names by default.
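Assuming your deepeval version supports the optional name field on LLMTestCase, giving each test case a stable, unique name is one way to make this pairing reliable across runs (the name and contents below are illustrative):

```python
from deepeval.test_case import LLMTestCase

# A stable, unique name lets the comparer pair this test case with its
# counterpart in the other test run when matching by name.
test_case = LLMTestCase(
    name="refund-policy-question",  # hypothetical name
    input="What are your refund policies?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)
```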
A test case highlighted in red indicates a regression - where at least one metric that previously passed in the comparison test run (right side) is now failing in the current test run (left side).
Match Test Cases by Input
To match by Input instead, toggle the "Match By" pill from Name to Input.
Matching and filtering may not work as expected when a test run contains duplicate inputs or names, so make sure both are unique within each test run.
Filters
When dealing with a large number of test cases, you can quickly identify regressions and improvements by clicking on the Filters tab.
Within each “All”, “Regressions”, or “Improvements” tab, you can also filter test cases based on changes in their metric scores. For example, to identify test case pairs where a metric’s score has significantly decreased (by more than 0.3):
- Click on the Metric Score filter
- Select the specific metric you want to analyze
- Choose "has decreased by more than" in the dropdown
- Set the change to 0.3