Multi-Turn Evals

Simulate conversations and run end-to-end testing for multi-turn use cases

Overview

Multi-turn evaluation requires:

  • A multi-turn dataset of conversational goldens
  • A callback function that wraps around your chatbot to generate conversation turns
  • A list of multi-turn metrics you wish to evaluate with

Each conversational golden must have a scenario before you can simulate user turns. It is also highly recommended to provide a user description for higher quality simulations.

How It Works

  1. Pull your multi-turn dataset from Confident AI
  2. Define a callback that invokes your chatbot to generate conversation turns
  3. Simulate conversations for each golden in your dataset
  4. Run evaluation on the resulting test cases

Define Your Callback

Define a callback that wraps around your chatbot and generates the next conversation turn:

callback.py
1from deepeval.test_case import Turn
2from typing import List
3
from typing import List

from deepeval.test_case import Turn

def chatbot_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    messages = [{"role": turn.role, "content": turn.content} for turn in turns]
    messages.append({"role": "user", "content": input})
    response = your_chatbot(messages)  # Replace with your chatbot
    return Turn(role="assistant", content=response)

The callback should accept an input string and, optionally, the list of previous Turns and the thread ID. It should return the next Turn in the conversation.
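The message-assembly logic inside the callback can be exercised without a real model. A minimal sketch, using a local stand-in for deepeval's Turn and a toy echo bot in place of your actual chatbot (both are illustrative assumptions, not deepeval APIs):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:  # local stand-in for deepeval.test_case.Turn
    role: str
    content: str

def toy_chatbot(messages: List[dict]) -> str:
    # Echoes the latest user message; replace with a real model call.
    return f"You said: {messages[-1]['content']}"

def chatbot_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    # Replay prior turns as chat messages, then append the new user input.
    messages = [{"role": t.role, "content": t.content} for t in turns]
    messages.append({"role": "user", "content": input})
    return Turn(role="assistant", content=toy_chatbot(messages))

history = [Turn("user", "Hi"), Turn("assistant", "Hello!")]
reply = chatbot_callback("Any VIP tickets left?", history, thread_id="thread-1")
```

The simulator supplies the simulated user input and the accumulated turns on each call, so the callback itself stays stateless.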

Run Evals Locally

Running evals locally is only possible with the Python deepeval library. For TypeScript or other languages, skip to remote evals.

1. Pull dataset

Pull your multi-turn dataset (and create one if you haven’t already):

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")
2. Simulate conversations

Create a simulator with your callback and generate test cases from your goldens:

main.py
from deepeval.simulator import ConversationSimulator

simulator = ConversationSimulator(model_callback=chatbot_callback)
for golden in dataset.goldens:
    test_case = simulator.simulate(golden)
    dataset.add_test_case(test_case)

You can also use any other means to generate turns in a ConversationalTestCase and map golden properties manually.
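When generating turns yourself, the mapping is mechanical: copy the golden's properties onto the test case alongside your turns. A rough sketch with local stand-in dataclasses (the real deepeval classes may have additional fields):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:  # stand-in for deepeval.test_case.Turn
    role: str
    content: str

@dataclass
class ConversationalTestCase:  # stand-in; real fields may differ
    turns: List[Turn]
    scenario: Optional[str] = None
    expected_outcome: Optional[str] = None

def golden_to_test_case(golden: dict, turns: List[Turn]) -> ConversationalTestCase:
    # Carry golden properties over so metrics can reference them at eval time.
    return ConversationalTestCase(
        turns=turns,
        scenario=golden.get("scenario"),
        expected_outcome=golden.get("expected_outcome"),
    )

tc = golden_to_test_case(
    {"scenario": "Buy a VIP ticket", "expected_outcome": "Ticket purchased"},
    [Turn("user", "Hi"), Turn("assistant", "Hello! How can I help?")],
)
```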

3. Run evaluation

The evaluate() function runs your test suite and uploads results to Confident AI:

main.py
from deepeval.metrics import TurnRelevancyMetric
from deepeval import evaluate

# Replace with your metrics
evaluate(test_cases=dataset.test_cases, metrics=[TurnRelevancyMetric()])

Done! You should see a link to your newly created shareable testing report.

  • Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
  • A test case passes only if all metrics for it pass
  • The test run’s pass rate is the proportion of test cases that pass
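The scoring rules above reduce to simple arithmetic. A sketch with hypothetical per-metric outcomes (the metric names are placeholders):

```python
from typing import Dict, List

# Hypothetical per-test-case metric outcomes: metric name -> passed?
results: List[Dict[str, bool]] = [
    {"relevancy": True,  "completeness": True},   # passes
    {"relevancy": True,  "completeness": False},  # fails: one metric failed
    {"relevancy": False, "completeness": False},  # fails
]

# Every metric runs on every test case: 3 cases x 2 metrics = 6 evaluations.
evaluations = sum(len(case) for case in results)

# A test case passes only if all of its metrics pass.
passed = [all(case.values()) for case in results]

# Pass rate is the proportion of passing test cases.
pass_rate = sum(passed) / len(passed)
```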

deepeval opens your browser automatically by default. To disable this behavior, set CONFIDENT_BROWSER_OPEN=NO.

Multi-Turn Testing Reports

Run Evals Remotely

1. Create metric collection

Go to Project > Metric > Collections:

Metric Collection for Remote Evals
Don’t forget to create a multi-turn collection.
2. Pull dataset and simulate conversations

Set run_remote to True to run simulations remotely:

main.py
from deepeval.simulator import ConversationSimulator
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

simulator = ConversationSimulator(model_callback=chatbot_callback, run_remote=True)
for golden in dataset.goldens:
    test_case = simulator.simulate(golden)
    dataset.add_test_case(test_case)
3. Run evaluation

main.py
from deepeval import evaluate

evaluate(test_cases=dataset.test_cases, metric_collection="YOUR-COLLECTION-NAME")

Advanced Usage

Early Stopping

To stop a simulation naturally before it reaches the maximum number of turns, provide an expected_outcome for each golden. The conversation will end automatically after the expected outcome has been reached.

main.py
from deepeval.dataset import ConversationalGolden

conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
)

If expected_outcome is not provided, the max_user_simulations parameter of the simulate method stops the simulation after a fixed number of user turns, which defaults to 10.
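Conceptually, the simulator's stopping logic is a loop with two exit conditions. A simplified sketch, where the outcome judge and the two lambdas are hypothetical stand-ins for the simulator's internals:

```python
from typing import Callable, List, Optional

def simulate_turns(
    next_user_turn: Callable[[int], str],
    chatbot: Callable[[str], str],
    outcome_reached: Optional[Callable[[List[str]], bool]] = None,
    max_user_simulations: int = 10,
) -> List[str]:
    """Alternate user/assistant turns until the outcome is reached or the cap hits."""
    transcript: List[str] = []
    for i in range(max_user_simulations):
        transcript.append(next_user_turn(i))          # simulated user turn
        transcript.append(chatbot(transcript[-1]))    # chatbot reply
        # Early stop once the expected outcome is judged to be reached.
        if outcome_reached is not None and outcome_reached(transcript):
            break
    return transcript

convo = simulate_turns(
    next_user_turn=lambda i: f"user message {i}",
    chatbot=lambda m: "Purchase confirmed!" if "2" in m else "Still looking...",
    outcome_reached=lambda t: "confirmed" in t[-1].lower(),
)
```

Without an outcome check, the same loop simply runs until max_user_simulations user turns have been generated.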

Extend Existing Turns

You can extend existing conversations by providing Turns on each golden. The simulator automatically detects them and continues simulating from where they leave off.

main.py
from deepeval.dataset import ConversationalGolden
from deepeval.test_case import Turn

conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    user_description="Andy Byron is the CEO of Astronomer.",
    turns=[
        Turn(role="user", content="Hi"),
        Turn(role="assistant", content="Hello! How can I help you today?"),
        Turn(role="user", content="I want to purchase a VIP ticket to a Coldplay concert."),
    ],
)