Multi-Turn Evals

Simulate conversations and run end-to-end testing for multi-turn use cases

Overview

Multi-turn evaluation requires:

  • A multi-turn dataset of conversational goldens
  • A callback function that wraps around your chatbot to generate conversation turns
  • A list of multi-turn metrics you wish to evaluate with

Each conversational golden must have a scenario before you can simulate user turns. It is also highly recommended to provide a user description for higher quality simulations.

How It Works

  1. Pull your multi-turn dataset from Confident AI
  2. Define a callback that invokes your chatbot to generate conversation turns
  3. Simulate conversations for each golden in your dataset
  4. Run evaluation on the resulting test cases

Define Your Callback

Define a callback that wraps around your chatbot and generates the next conversation turn:

callback.py
1from deepeval.test_case import Turn
2from typing import List
3
from typing import List

from deepeval.test_case import Turn

def chatbot_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    messages = [{"role": turn.role, "content": turn.content} for turn in turns]
    messages.append({"role": "user", "content": input})
    response = your_chatbot(messages)  # Replace with your chatbot
    return Turn(role="assistant", content=response)

The callback should accept an input string and, optionally, the list of previous Turns and the thread ID. It should return the next Turn in the conversation.
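The message-assembly logic inside the callback can be exercised without a real model. A minimal sketch, using a local stand-in for deepeval's Turn and a toy echo bot in place of your actual chatbot (both are illustrative assumptions, not deepeval APIs):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:  # local stand-in for deepeval.test_case.Turn
    role: str
    content: str

def toy_chatbot(messages: List[dict]) -> str:
    # Echoes the latest user message; replace with a real model call.
    return f"You said: {messages[-1]['content']}"

def chatbot_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    # Replay prior turns as chat messages, then append the new user input.
    messages = [{"role": t.role, "content": t.content} for t in turns]
    messages.append({"role": "user", "content": input})
    return Turn(role="assistant", content=toy_chatbot(messages))

history = [Turn("user", "Hi"), Turn("assistant", "Hello!")]
reply = chatbot_callback("Any VIP tickets left?", history, thread_id="thread-1")
```

The simulator supplies the simulated user input and the accumulated turns on each call, so the callback itself stays stateless.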

Run Evals Locally

Running evals locally is only possible with the Python deepeval library. For TypeScript or other languages, skip to remote evals.

1. Pull dataset

Pull your multi-turn dataset (and create one if you haven’t already):

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")
2. Simulate conversations

Create a simulator with your callback and generate test cases from your goldens:

main.py
from deepeval.simulator import ConversationSimulator

simulator = ConversationSimulator(model_callback=chatbot_callback)
for golden in dataset.goldens:
    test_case = simulator.simulate(golden)
    dataset.add_test_case(test_case)

You can also use any other means to generate turns in a ConversationalTestCase and map golden properties manually.
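When generating turns yourself, the mapping is mechanical: copy the golden's properties onto the test case alongside your turns. A rough sketch with local stand-in dataclasses (the real deepeval classes may have additional fields):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Turn:  # stand-in for deepeval.test_case.Turn
    role: str
    content: str

@dataclass
class ConversationalTestCase:  # stand-in; real fields may differ
    turns: List[Turn]
    scenario: Optional[str] = None
    expected_outcome: Optional[str] = None

def golden_to_test_case(golden: dict, turns: List[Turn]) -> ConversationalTestCase:
    # Carry golden properties over so metrics can reference them at eval time.
    return ConversationalTestCase(
        turns=turns,
        scenario=golden.get("scenario"),
        expected_outcome=golden.get("expected_outcome"),
    )

tc = golden_to_test_case(
    {"scenario": "Buy a VIP ticket", "expected_outcome": "Ticket purchased"},
    [Turn("user", "Hi"), Turn("assistant", "Hello! How can I help?")],
)
```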

3. Run evaluation

The evaluate() function runs your test suite and uploads results to Confident AI:

main.py
from deepeval.metrics import TurnRelevancyMetric
from deepeval import evaluate

# Replace with your metrics
evaluate(test_cases=dataset.test_cases, metrics=[TurnRelevancyMetric()])

Done! You should see a link to your newly created shareable testing report.

  • Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
  • A test case passes only if all metrics for it pass
  • The test run’s pass rate is the proportion of test cases that pass
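The scoring rules above reduce to simple arithmetic. A sketch with hypothetical per-metric outcomes (the metric names are placeholders):

```python
from typing import Dict, List

# Hypothetical per-test-case metric outcomes: metric name -> passed?
results: List[Dict[str, bool]] = [
    {"relevancy": True,  "completeness": True},   # passes
    {"relevancy": True,  "completeness": False},  # fails: one metric failed
    {"relevancy": False, "completeness": False},  # fails
]

# Every metric runs on every test case: 3 cases x 2 metrics = 6 evaluations.
evaluations = sum(len(case) for case in results)

# A test case passes only if all of its metrics pass.
passed = [all(case.values()) for case in results]

# Pass rate is the proportion of passing test cases.
pass_rate = sum(passed) / len(passed)
```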

deepeval opens your browser automatically by default. To disable this behavior, set CONFIDENT_BROWSER_OPEN=NO.

Multi-Turn Testing Reports

Run Evals Remotely

1. Create metric collection

Go to Project > Metric > Collections:

Metric Collection for Remote Evals
Don’t forget to create a multi-turn collection.
2. Pull dataset and simulate conversations

Set run_remote to True to run simulations remotely:

main.py
from deepeval.simulator import ConversationSimulator
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

simulator = ConversationSimulator(model_callback=chatbot_callback, run_remote=True)
for golden in dataset.goldens:
    test_case = simulator.simulate(golden)
    dataset.add_test_case(test_case)
3. Run evaluation

main.py
from deepeval import evaluate

evaluate(test_cases=dataset.test_cases, metric_collection="YOUR-COLLECTION-NAME")

Advanced Usage

Early Stopping

To stop a simulation naturally before it reaches the maximum number of turns, provide an expected_outcome for each golden. The conversation will end automatically after the expected outcome has been reached.

main.py
from deepeval.dataset import ConversationalGolden

conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
)

If expected_outcome is not provided, the max_user_simulations parameter of the simulate method stops the simulation after a fixed number of user turns, which defaults to 10.
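Conceptually, the simulator's stopping logic is a loop with two exit conditions. A simplified sketch, where the outcome judge and the two lambdas are hypothetical stand-ins for the simulator's internals:

```python
from typing import Callable, List, Optional

def simulate_turns(
    next_user_turn: Callable[[int], str],
    chatbot: Callable[[str], str],
    outcome_reached: Optional[Callable[[List[str]], bool]] = None,
    max_user_simulations: int = 10,
) -> List[str]:
    """Alternate user/assistant turns until the outcome is reached or the cap hits."""
    transcript: List[str] = []
    for i in range(max_user_simulations):
        transcript.append(next_user_turn(i))          # simulated user turn
        transcript.append(chatbot(transcript[-1]))    # chatbot reply
        # Early stop once the expected outcome is judged to be reached.
        if outcome_reached is not None and outcome_reached(transcript):
            break
    return transcript

convo = simulate_turns(
    next_user_turn=lambda i: f"user message {i}",
    chatbot=lambda m: "Purchase confirmed!" if "2" in m else "Still looking...",
    outcome_reached=lambda t: "confirmed" in t[-1].lower(),
)
```

Without an outcome check, the same loop simply runs until max_user_simulations user turns have been generated.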

Extend Existing Turns

You can extend existing conversations by providing Turns on each golden. The simulator automatically detects them and continues simulating from where they leave off.

main.py
from deepeval.dataset import ConversationalGolden
from deepeval.test_case import Turn

conversation_golden = ConversationalGolden(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    user_description="Andy Byron is the CEO of Astronomer.",
    turns=[
        Turn(role="user", content="Hi"),
        Turn(role="assistant", content="Hello! How can I help you today?"),
        Turn(role="user", content="I want to purchase a VIP ticket to a Coldplay concert."),
    ],
)