Multi-Turn, E2E Testing

Learn how to run end-to-end testing for multi-turn use cases

Overview

Multi-turn, end-to-end testing requires:

  • A multi-turn dataset of goldens
  • A list of multi-turn metrics you wish to evaluate with
  • A way to generate turns for multi-turn test cases at runtime

If you don’t have human testers, the best way to generate turns is to use Confident AI’s conversation simulator to simulate user interactions.

How It Works

  1. Pull your dataset from Confident AI
  2. Loop through goldens in your dataset, for each golden:
    • Simulate turns from each golden
    • Map golden fields to test case parameters
    • Add test case back to your dataset
  3. Run evaluation on test cases

In this example, we’ll be using this mock LLM app as the callback for turn simulation:

callback.py
from deepeval.test_case import Turn
from typing import List

def chatbot_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    messages = [{"role": turn.role, "content": turn.content} for turn in turns]
    messages.append({"role": "user", "content": input})
    response = your_chatbot(messages)  # Replace with your chatbot
    return Turn(role="assistant", content=response)
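The your_chatbot function above is just a stand-in for your own application. As a purely illustrative sketch (not part of the original example), here is what it could look like if your chatbot were a thin wrapper around the OpenAI chat completions API; the file name, model name, and system prompt are all placeholders:

chatbot.py (hypothetical)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def your_chatbot(messages: list) -> str:
    # Prepend a system prompt and forward the accumulated conversation history
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "system", "content": "You are a helpful support agent."}] + messages,
    )
    return response.choices[0].message.content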

Run E2E Tests Locally

Running evals locally is only possible if you are using the Python deepeval library. If you’re working with TypeScript or any other language, skip to the remote end-to-end evals section instead.

1

Pull dataset

Pull your dataset (and create one if you haven’t already):

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")
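If you don’t have a multi-turn dataset on Confident AI yet, one way to create it is to build goldens in code and push them first. This is a minimal sketch; it assumes your deepeval version exposes ConversationalGolden with scenario, expected_outcome, and user_description fields and that EvaluationDataset supports push(), so adjust the field names and alias to your setup:

create_dataset.py (hypothetical)
from deepeval.dataset import EvaluationDataset, ConversationalGolden

# Describe the situations you want the simulator to role-play
goldens = [
    ConversationalGolden(
        scenario="A customer wants to return a pair of shoes bought last week.",
        expected_outcome="The chatbot collects the order number and issues a return label.",
        user_description="A polite but slightly impatient customer on mobile.",
    )
]

dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="YOUR-DATASET-ALIAS")  # creates the dataset on Confident AI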
2

Loop through goldens and simulate turns

Loop through your multi-turn goldens, simulate turns to create test cases, and add them back to your dataset:

main.py
from deepeval.simulator import ConversationSimulator
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

simulator = ConversationSimulator(model_callback=chatbot_callback)
for golden in dataset.goldens:
    test_case = simulator.simulate(golden)
    dataset.add_test_case(test_case)

Although a bit unconventional, you can also generate turns for a ConversationalTestCase by any other means and map golden properties manually for this step, as sketched below.
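For example, if your turns come from human testers or existing transcripts, you can construct the test case yourself. The sketch below is illustrative only; it assumes ConversationalTestCase accepts scenario and expected_outcome fields alongside turns in your deepeval version, and the hard-coded turns are placeholders for your own data:

manual_mapping.py (hypothetical)
from deepeval.test_case import ConversationalTestCase, Turn

for golden in dataset.goldens:
    # Replace with turns gathered from your own testers or logs
    turns = [
        Turn(role="user", content="Hi, I'd like to return my shoes."),
        Turn(role="assistant", content="Sure, could you share your order number?"),
    ]
    test_case = ConversationalTestCase(
        turns=turns,
        scenario=golden.scenario,                  # map golden fields manually
        expected_outcome=golden.expected_outcome,
    )
    dataset.add_test_case(test_case)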

3

Run evaluation using evaluate()

The evaluate() function creates a test run and uploads the results to Confident AI once evaluations have completed locally.

main.py
from deepeval.metrics import TurnRelevancyMetric
from deepeval import evaluate

# Replace with your metrics
evaluate(test_cases=dataset.test_cases, metrics=[TurnRelevancyMetric()])
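You can also pass several multi-turn metrics in the same call, and each one will be applied to every test case. The sketch below assumes KnowledgeRetentionMetric and ConversationCompletenessMetric are available in your installed deepeval version; swap in whichever metrics you actually use:

main.py
from deepeval.metrics import (
    TurnRelevancyMetric,
    KnowledgeRetentionMetric,        # assumed available in your deepeval version
    ConversationCompletenessMetric,  # assumed available in your deepeval version
)
from deepeval import evaluate

evaluate(
    test_cases=dataset.test_cases,
    metrics=[
        TurnRelevancyMetric(threshold=0.7),
        KnowledgeRetentionMetric(threshold=0.7),
        ConversationCompletenessMetric(threshold=0.7),
    ],
)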

Done ✅. You should see a link to your newly created sharable testing report.

  • The evaluate() function runs your test suite across all test cases and metrics
  • Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
  • A test case passes only if all metrics for it pass
  • The test run’s pass rate is the proportion of test cases that pass

deepeval opens your browser automatically by default. To disable this behavior, set CONFIDENT_BROWSER_OPEN=NO.
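For example, one way to do this from Python (rather than exporting the variable in your shell) is to set it before evaluate() runs; this snippet is illustrative only:

main.py
import os

os.environ["CONFIDENT_BROWSER_OPEN"] = "NO"  # set before evaluate() is called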

Multi-Turn Testing Reports

Run E2E Tests Remotely

1

Create metric collection

Go to Project > Metric > Collections:

Metric Collection for Remote Evals
Don’t forget to create a multi-turn collection.
2

Pull dataset and simulate conversations

Using your language of choice, call your LLM app to construct a list of valid ConversationalTestCase data models. In Python, you can reuse the simulator exactly as in the local example:

main.py
from deepeval.simulator import ConversationSimulator
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

simulator = ConversationSimulator(model_callback=chatbot_callback)
for golden in dataset.goldens:
    test_case = simulator.simulate(golden)
    dataset.add_test_case(test_case)
3

Call /v1/evaluate endpoint

main.py
from deepeval import evaluate

evaluate(test_cases=dataset.test_cases, metric_collection="YOUR-COLLECTION-NAME")

Advanced Usage

This section is identical to the one for single-turn, end-to-end testing, so please click here to learn more.