Multi-Turn Evals
Simulate conversations and run end-to-end testing for multi-turn use cases
Overview
Multi-turn evaluation requires:
- A multi-turn dataset of conversational goldens
- A callback function that wraps around your chatbot to generate conversation turns
- A list of multi-turn metrics you wish to evaluate with
Each conversational golden must have a scenario before you can simulate user
turns. It is also highly recommended to provide a user description for
higher-quality simulations.
How It Works
- Pull your multi-turn dataset from Confident AI
- Define a callback that invokes your chatbot to generate conversation turns
- Simulate conversations for each golden in your dataset
- Run evaluation on the resulting test cases
Define Your Callback
Define a callback that wraps around your chatbot and generates the next conversation turn:
The callback should accept an input and, optionally, a list of Turns and the
thread ID. It should return the next Turn in the conversation.
Run Evals Locally
Running evals locally is only possible with the Python deepeval library. For TypeScript or other languages, skip to remote evals.
Simulate conversations
Create a simulator with your callback and generate test cases from your goldens:
You can also use any other means to generate turns in a
ConversationalTestCase and map golden properties manually.
Run evaluation
The evaluate() function runs your test suite and uploads results to Confident AI:
Done! You should see a link to your newly created sharable testing report.
- Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
- A test case passes only if all metrics for it pass
- The test run’s pass rate is the proportion of test cases that pass
deepeval opens your browser automatically by default. To disable this
behavior, set `CONFIDENT_BROWSER_OPEN=NO`.
Run Evals Remotely
Create metric collection
Go to Project > Metric > Collections:
Advanced Usage
Early Stopping
To stop a simulation naturally before it reaches the maximum number of turns, provide an expected_outcome for each golden. The conversation will end automatically after the expected outcome has been reached.
If expected_outcome is not provided, the max_user_simulations parameter of
the simulate method stops the simulation after a set number of user turns
(10 by default).
Extend Existing Turns
You can extend existing conversations by providing existing Turns to each golden. The simulator detects these turns automatically and continues the simulation from where they leave off.