A|B Regression Testing

Learn how to regression test for different versions of your LLM app

You must have already run at least TWO test runs for this feature to work.

Overview

An A|B regression test allows you to see:

  • Differences in metric scores for the same dataset across two LLM app versions
  • Which prompts, models, etc. led to better scores
  • Regressions and improvements for matching test cases, and why they occurred

Regression testing on Confident AI is about comparing two different versions of your LLM app side-by-side. Regression tests happen retrospectively - meaning you can only run a regression test once you've run an evaluation and created a test run you wish to benchmark against.
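For example, here is a minimal sketch of how the two test runs you'll compare might be created. The generate_v1()/generate_v2() functions are hypothetical stand-ins for the two versions of your LLM app, and AnswerRelevancyMetric is just an example metric choice:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical inputs - replace with your own dataset
inputs = ["How do I reset my password?", "What plans do you offer?"]

def generate_v1(query: str) -> str:
    # Hypothetical: call version 1 of your LLM app here
    return "..."

def generate_v2(query: str) -> str:
    # Hypothetical: call version 2 (e.g. a new prompt or model) here
    return "..."

# First test run, benchmarking version 1
test_cases_v1 = [LLMTestCase(input=i, actual_output=generate_v1(i)) for i in inputs]
evaluate(test_cases=test_cases_v1, metrics=[AnswerRelevancyMetric()], identifier="v1 - original prompt")

# Second test run, benchmarking version 2 on the same inputs
test_cases_v2 = [LLMTestCase(input=i, actual_output=generate_v2(i)) for i in inputs]
evaluate(test_cases=test_cases_v2, metrics=[AnswerRelevancyMetric()], identifier="v2 - new prompt")

Running both evaluations on the same dataset is what makes the side-by-side comparison meaningful.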

How It Works

Go to the Compare Test Results page in a testing report, and click on the Select Test Run to Compare Against button. There will be a dropdown of test runs for you to choose from - select the one you wish to compare against.

Testing Reports on Confident AI

It can be difficult to know which test run is the one you wish to compare against. This is where the identifier comes in handy when using the evaluate() function, as the identifier will be displayed alongside the generated test run ID that you're seeing in the dropdown:

evaluate(identifier="My first test run", ...)

Or in deepeval test run:

$ deepeval test run test_llm_app.py -id "My first test run"
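If you go the CLI route, test_llm_app.py is a regular deepeval test file. Here is a minimal sketch of what it might contain, assuming a hypothetical generate() function and AnswerRelevancyMetric as the metric of choice:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def generate(query: str) -> str:
    # Hypothetical: call your LLM app here
    return "You can reset your password from the account settings page."

def test_password_reset():
    test_case = LLMTestCase(
        input="How do I reset my password?",
        actual_output=generate("How do I reset my password?"),
    )
    # assert_test fails the test if any metric scores below its threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])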

Overview Comparison

Selecting a test run will automatically open the Comparison Overview, which shows:

  • The difference in passing test cases
  • The difference in metric score distributions per metric

The comparison displays data from your current test run on the left side and data from the selected comparison test run on the right side. This side-by-side view makes it easy to spot differences between the two runs.

Compare Test Cases Side-by-Side

The left-hand side displays test cases from the current test run, while the right-hand side displays test cases from the test run you are comparing against.

The test case comparer matches test cases from both test runs using test case names by default.
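For name matching to pair the right test cases, both test runs should use consistent names. As a rough sketch (the name parameter on LLMTestCase is an assumption here - check that your deepeval version supports it):

from deepeval.test_case import LLMTestCase

# Assumption: LLMTestCase accepts an optional name; reusing the same
# name in both test runs lets the comparer pair the two test cases
test_case = LLMTestCase(
    name="password-reset",
    input="How do I reset my password?",
    actual_output="You can reset it from the account settings page.",
)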

A test case highlighted in red indicates a regression - where at least one metric that previously passed in the comparison test run (right side) is now failing in the current test run (left side).

Match test cases by input

To match by Input instead, toggle the "Match By" pill from Name to Input.

If a test run contains duplicate inputs or names, matching and filtering might not work as expected, so keep inputs and names unique within each test run.
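If you build test cases programmatically, a quick uniqueness check before calling evaluate() can catch duplicates early. A simple sketch in plain Python:

from collections import Counter

def assert_unique(test_cases):
    # Raise if any input or name appears more than once in a test run
    for field in ("input", "name"):
        counts = Counter(
            getattr(tc, field, None) for tc in test_cases
            if getattr(tc, field, None) is not None
        )
        duplicates = [value for value, n in counts.items() if n > 1]
        if duplicates:
            raise ValueError(f"Duplicate {field}s in test run: {duplicates}")

# assert_unique(test_cases) before calling evaluate(...)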

Filters

When dealing with a large number of test cases, you can easily identify regressions and improvements by clicking on the filters tab.

Within each “All”, “Regressions”, or “Improvements” tab, you can also filter test cases based on changes in their metric scores. For example, to identify test case pairs where a metric’s score has significantly decreased (by more than 0.3):

  1. Click on the Metric Score filter
  2. Select the specific metric you want to analyze
  3. Choose "has decreased by more than" in the dropdown
  4. Set the change to 0.3