Multi-Turn, E2E Testing
Learn how to run end-to-end testing for multi-turn use cases
Overview
Multi-turn, end-to-end testing requires:
- A multi-turn dataset of goldens
- A list of multi-turn metrics you wish to evaluate with
- A way to generate turns for multi-turn test cases at runtime
If you don't have human testers, the best way to generate turns is to use Confident AI's conversation simulator to simulate user interactions.
How It Works
- Pull your dataset from Confident AI
- Loop through the goldens in your dataset; for each golden:
  - Simulate turns from the golden
  - Map golden fields to test case parameters
  - Add the test case back to your dataset
- Run evaluation on the test cases
In this example, we'll use the following mock LLM app as the callback for turn simulation.
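A minimal sketch of such a callback is shown below. It assumes a recent deepeval version where conversations are modeled as Turn objects and the simulator passes the generated user input, the turns so far, and a thread ID to your callback; the callback signature, the chatbot_callback name, and the OpenAI model are illustrative stand-ins for your real application.

```python
from openai import OpenAI
from deepeval.test_case import Turn

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment


async def chatbot_callback(input: str, turns: list[Turn], thread_id: str) -> Turn:
    # Rebuild the conversation history from the turns generated so far
    messages = [{"role": turn.role, "content": turn.content} for turn in turns]
    messages.append({"role": "user", "content": input})

    # Mock LLM app: a single chat completion stands in for your real application
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return Turn(role="assistant", content=response.choices[0].message.content)
```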
Run E2E Tests Locally
Running evals locally is only possible if you are using the Python deepeval library. If you're working with TypeScript or any other language, skip to the remote end-to-end evals section instead.
Loop through goldens and simulate turns
Loop through your multi-turn goldens and simulate turns for each test case before adding the test cases back to your dataset.
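The sketch below shows one way this loop could look. The ConversationSimulator import path, its constructor, and the simulate() parameters are assumptions that vary between deepeval versions, so check the simulator docs for your installed version; the dataset alias is a placeholder.

```python
from deepeval.dataset import EvaluationDataset
from deepeval.conversation_simulator import ConversationSimulator  # assumed import path

# Pull the multi-turn dataset of goldens from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="My Multi-Turn Dataset")  # replace with your dataset alias

# Simulate turns for every golden using the callback defined earlier
simulator = ConversationSimulator(model_callback=chatbot_callback)
test_cases = simulator.simulate(goldens=dataset.goldens, max_turns=10)  # assumed signature

# Add the simulated test cases back to the dataset so they can be evaluated
for test_case in test_cases:
    dataset.add_test_case(test_case)
```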
Although a bit unconventional, you can also use any other means necessary to
generate turns in a ConversationalTestCase and map golden properties
manually for this step.
Run evaluation using evaluate()
The evaluate() function creates a test run and uploads the data to Confident AI once evaluations have completed locally.
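A minimal sketch, assuming you picked two of deepeval's built-in multi-turn metrics; swap in whichever metrics you chose for your use case.

```python
from deepeval import evaluate
from deepeval.metrics import (
    ConversationCompletenessMetric,
    ConversationRelevancyMetric,
)

# Runs every metric against every simulated test case, then uploads the
# results to Confident AI as a new test run
evaluate(
    test_cases=dataset.test_cases,
    metrics=[ConversationRelevancyMetric(), ConversationCompletenessMetric()],
)
```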
Done ✅. You should see a link to your newly created shareable testing report.
- The evaluate() function runs your test suite across all test cases and metrics
- Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
- A test case passes only if all metrics for it pass
- The test run’s pass rate is the proportion of test cases that pass
deepeval opens your browser automatically by default. To disable this
behavior, set CONFIDENT_BROWSER_OPEN=NO.
Run E2E Tests Remotely
Create metric collection
Go to Project > Metric > Collections and create a collection containing the multi-turn metrics you wish to evaluate with.
Pull dataset and simulate conversations
Using your language of choice (Python, TypeScript, or cURL), call your LLM app to construct a list of valid ConversationalTestCase data models.
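For example, in Python each test case might be constructed as in the sketch below, assuming a recent deepeval version where a ConversationalTestCase is built from Turn objects; the dataset alias, the conversation content, and the golden's scenario field are illustrative assumptions.

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import ConversationalTestCase, Turn

dataset = EvaluationDataset()
dataset.pull(alias="My Multi-Turn Dataset")  # replace with your dataset alias

test_cases = []
for golden in dataset.goldens:
    # Drive one conversation with your LLM app here (or reuse the simulator
    # from the local example) and record each exchange as user/assistant Turns.
    turns = [
        Turn(role="user", content="I'd like to move my flight to next Tuesday."),
        Turn(role="assistant", content="Sure, can you share your booking reference?"),
    ]
    test_cases.append(
        ConversationalTestCase(
            turns=turns,
            scenario=golden.scenario,  # assumed ConversationalGolden field
        )
    )
```

Once constructed, send these test cases to Confident AI to have them evaluated remotely against the metric collection you created above; the exact call or endpoint depends on your SDK of choice, so refer to the Confident AI docs for your language.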
Advanced Usage
This section is the same as the one for single-turn end-to-end testing, so refer to that guide to learn more.