Multi-Turn Evals (No-Code)
Overview
Multi-turn evaluations test conversational interactions where context accumulates across multiple exchanges. These are use cases where the AI must maintain coherence throughout a conversation:
- Chatbots — customer support, sales assistants, or general-purpose chat
- Conversational agents — multi-step task completion with back-and-forth
- Agentic systems — complex workflows with tool calls and reasoning across turns
Unlike single-turn evals, multi-turn evals require generating the entire conversation before metrics can be applied; this simulation step is the most time-consuming part of the process.
Fortunately, Confident AI handles the simulation for you, so you don’t have to manually prompt your AI for hours on end.
Requirements
To run a multi-turn evaluation, you need:
- A multi-turn dataset — goldens with conversation starters or full conversation histories
- A multi-turn metric collection — metrics designed for conversational evaluation
Multi-turn metrics evaluate the conversation as a whole, not individual messages. Examples include turn faithfulness and turn contextual relevancy.
How it works
Multi-turn evals follow a 5-step process — the key difference from single-turn is the simulation step:
- Define metrics — choose conversational metrics (e.g., turn relevancy, conversation completeness)
- Create dataset — build goldens with conversation starters
- Configure output generation — set up your AI connection or prompt
- Simulate conversations — generate full conversations by simulating user turns
- Evaluate — run metrics against completed conversations
Here’s a visual representation of the data flow:
Because conversations must be fully simulated before evaluation, multi-turn evals can take slightly longer than single-turn. Plan accordingly for large datasets.
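If it helps to see the flow spelled out, the sketch below maps these steps onto a plain Python loop. It is purely illustrative and not Confident AI’s implementation: `generate_user_turn`, `call_your_app`, and `score_conversation` are hypothetical stand-ins for the platform’s simulator, your AI connection or prompt, and your metric collection respectively.

```python
# Illustrative sketch of the multi-turn eval data flow (not Confident AI's API).
# generate_user_turn, call_your_app, and score_conversation are hypothetical stand-ins.

def generate_user_turn(golden: dict, history: list[dict]) -> str:
    """The simulator produces the next user message from the golden's scenario."""
    return f"(simulated user message for scenario: {golden['scenario']})"

def call_your_app(history: list[dict]) -> str:
    """Your chatbot or agent replies, given the conversation so far."""
    return "(assistant reply)"

def score_conversation(history: list[dict], metrics: list[str]) -> dict[str, float]:
    """Multi-turn metrics score the completed conversation as a whole."""
    return {metric: 1.0 for metric in metrics}

def run_multi_turn_eval(goldens: list[dict], metrics: list[str], max_user_turns: int = 10):
    results = []
    for golden in goldens:                # step 2: each golden seeds one conversation
        history: list[dict] = []
        for _ in range(max_user_turns):   # step 4: simulate the full conversation first
            history.append({"role": "user", "content": generate_user_turn(golden, history)})
            history.append({"role": "assistant", "content": call_your_app(history)})
        results.append(score_conversation(history, metrics))  # step 5: metrics run only after simulation ends
    return results
```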
Controlling Simulations
To control simulations, edit the scenario, expected outcome, and user description fields of your goldens. Each field shapes the simulation in a different way:
- Scenario — sets the context and topic of the conversation, guiding what the simulated user will discuss and what situation they are in (e.g., “User is trying to book a flight to Paris for next weekend”)
- Expected outcome — defines the goal that must be achieved for the simulation to end successfully (e.g., “User successfully books a flight” or “User receives a refund confirmation”)
- User description — shapes the simulated user’s persona, tone, and behavior throughout the conversation (e.g., “A frustrated customer who is impatient and asks short, direct questions”)
Note that a simulation will automatically end once the maximum number of simulated user turns is reached, even if the expected outcome has not been met. This limit can be configured in the dropdown settings of a multi-turn dataset; the stopping rule is sketched below.
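As a concrete (and entirely assumed) example, a refund-support golden might carry fields along these lines, with the stopping rule above reduced to a one-line check; `MAX_USER_TURNS` is a stand-in for whatever limit you set in the dataset settings.

```python
# Hypothetical golden; the keys mirror the Scenario, Expected outcome, and
# User description fields shown in the dataset UI.
golden = {
    "scenario": "User is trying to get a refund for a flight cancelled yesterday",
    "expected_outcome": "User receives a refund confirmation number",
    "user_description": "A frustrated customer who is impatient and asks short, direct questions",
}

MAX_USER_TURNS = 10  # assumed value; the real limit comes from the dataset's dropdown settings

def should_stop(user_turns_simulated: int, outcome_met: bool) -> bool:
    """A simulation ends as soon as the expected outcome is met, or when the turn budget runs out."""
    return outcome_met or user_turns_simulated >= MAX_USER_TURNS
```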
Run an Evaluation
You can run an evaluation on a dataset by clicking the Evaluate button at the top right of the dataset page.
Select your dataset and metrics
- Navigate to Project > Datasets, and select your multi-turn dataset to evaluate
- Click Evaluate
- Select your multi-turn Metric Collection
Turn on simulations
Simulations must be enabled if you want Confident AI to call your AI app at evaluation time.
Well-crafted simulation instructions are key to realistic conversations. Be specific about the user’s goals, tone, and knowledge level through the scenario, expected outcome, and user description fields on your goldens.
Configure output generation
If simulations are turned on, select how your AI app will respond to each turn: via a Prompt or an AI Connection.
For prompt-based chatbots, select a prompt template that includes conversation history.
- In the evaluation setup, select this prompt as your output generation method
- Confident AI calls your LLM for each turn, passing the conversation history
You’ll need an existing prompt for this to work. If you haven’t already, you can create a prompt on the Prompt Studio.
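To make the “includes conversation history” requirement concrete, here is one shape such a template could take. The placeholder names (`conversation_history`, `latest_user_message`) and the rendering helper are assumptions for illustration; use whatever variable syntax your prompt in Prompt Studio actually defines.

```python
# Illustrative chat prompt that carries the conversation history forward each turn.
# Placeholder names are assumptions; match them to your own prompt's variables.
PROMPT_TEMPLATE = (
    "You are a helpful airline support assistant.\n\n"
    "Conversation so far:\n"
    "{conversation_history}\n\n"
    "User: {latest_user_message}\n"
    "Assistant:"
)

def render_prompt(history: list[dict], latest_user_message: str) -> str:
    """Flatten prior turns into the template so the model sees the full context on every turn."""
    formatted_history = "\n".join(f"{turn['role']}: {turn['content']}" for turn in history)
    return PROMPT_TEMPLATE.format(
        conversation_history=formatted_history,
        latest_user_message=latest_user_message,
    )
```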
Run and view results
Click Run Evaluation and wait for simulations to complete. This may take longer than single-turn evals due to the conversation generation step.
Your test run dashboard shows:
- Score distributions — average, median, and percentiles for each metric
- Pass/fail results — a conversation passes only if all metrics meet their thresholds (see the sketch after this list)
- Full conversation logs — review the complete simulated conversations
- Turn-by-turn analysis — see how the AI performed at each step
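If you want to reason about the pass/fail rule precisely, it boils down to an “all metrics above threshold” check like the one below; the metric names and threshold values are assumed for illustration.

```python
# A conversation passes only if every metric in the collection meets its threshold.
def conversation_passes(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    return all(scores[name] >= threshold for name, threshold in thresholds.items())

# Assumed example scores and thresholds:
scores = {"turn_relevancy": 0.84, "conversation_completeness": 0.61}
thresholds = {"turn_relevancy": 0.7, "conversation_completeness": 0.7}
print(conversation_passes(scores, thresholds))  # False: completeness falls below its threshold
```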
Regression Testing
Once you have two or more test runs, you can compare them side-by-side to identify regressions.