Multi-Turn Evals (No-Code)
Overview
Multi-turn evaluations test conversational interactions where context accumulates across multiple exchanges. These are use cases where the AI must maintain coherence throughout a conversation:
- Chatbots — customer support, sales assistants, or general-purpose chat
- Conversational agents — multi-step task completion with back-and-forth
- Agentic systems — complex workflows with tool calls and reasoning across turns
Unlike single-turn evals, multi-turn evals require generating the entire conversation before metrics can be applied; this simulation step is the most time-consuming part of the process.
Fortunately, Confident AI handles the simulation for you, so you don’t have to manually prompt your AI for hours on end.
Requirements
To run a multi-turn evaluation, you need:
- A multi-turn dataset — goldens with conversation starters or full conversation histories
- A multi-turn metric collection — metrics designed for conversational evaluation
Multi-turn metrics evaluate the conversation as a whole, not individual messages. Examples include turn faithfulness and turn contextual relevancy.
How it works
Multi-turn evals follow a 5-step process — the key difference from single-turn is the simulation step:
- Define metrics — choose conversational metrics (e.g., turn relevancy, conversation completeness)
- Create dataset — build goldens with conversation starters
- Configure output generation — set up your AI connection or prompt
- Simulate conversations — generate full conversations by simulating user turns
- Evaluate — run metrics against completed conversations
Here’s a visual representation of the data flow:
Because conversations must be fully simulated before evaluation, multi-turn evals can take slightly longer than single-turn. Plan accordingly for large datasets.
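If it helps to see the flow spelled out, the sketch below maps these steps onto a plain Python loop. It is purely illustrative and not Confident AI’s implementation: `generate_user_turn`, `call_your_app`, and `score_conversation` are hypothetical stand-ins for the platform’s simulator, your AI connection or prompt, and your metric collection respectively.

```python
# Illustrative sketch of the multi-turn eval data flow (not Confident AI's API).
# generate_user_turn, call_your_app, and score_conversation are hypothetical stand-ins.

def generate_user_turn(golden: dict, history: list[dict]) -> str:
    """The simulator produces the next user message from the golden's scenario."""
    return f"(simulated user message for scenario: {golden['scenario']})"

def call_your_app(history: list[dict]) -> str:
    """Your chatbot or agent replies, given the conversation so far."""
    return "(assistant reply)"

def score_conversation(history: list[dict], metrics: list[str]) -> dict[str, float]:
    """Multi-turn metrics score the completed conversation as a whole."""
    return {metric: 1.0 for metric in metrics}

def run_multi_turn_eval(goldens: list[dict], metrics: list[str], max_user_turns: int = 10):
    results = []
    for golden in goldens:                # step 2: each golden seeds one conversation
        history: list[dict] = []
        for _ in range(max_user_turns):   # step 4: simulate the full conversation first
            history.append({"role": "user", "content": generate_user_turn(golden, history)})
            history.append({"role": "assistant", "content": call_your_app(history)})
        results.append(score_conversation(history, metrics))  # step 5: metrics run only after simulation ends
    return results
```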
Controlling Simulations
To control simulations, edit the scenario, expected outcome, and user description fields of your goldens. Each field shapes the simulation in a different way:
- Scenario — sets the context and topic of the conversation, guiding what the simulated user will discuss and what situation they are in (e.g., “User is trying to book a flight to Paris for next weekend”)
- Expected outcome — defines the goal that must be achieved for the simulation to end successfully (e.g., “User successfully books a flight” or “User receives a refund confirmation”)
- User description — shapes the simulated user’s persona, tone, and behavior throughout the conversation (e.g., “A frustrated customer who is impatient and asks short, direct questions”)
Note that a simulation will automatically end once the maximum number of simulated user turns is reached, even if the expected outcome has not been met. This limit can be configured in the dropdown settings of a multi-turn dataset; the stopping rule is sketched below.
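As a concrete (and entirely assumed) example, a refund-support golden might carry fields along these lines, with the stopping rule above reduced to a one-line check; `MAX_USER_TURNS` is a stand-in for whatever limit you set in the dataset settings.

```python
# Hypothetical golden; the keys mirror the Scenario, Expected outcome, and
# User description fields shown in the dataset UI.
golden = {
    "scenario": "User is trying to get a refund for a flight cancelled yesterday",
    "expected_outcome": "User receives a refund confirmation number",
    "user_description": "A frustrated customer who is impatient and asks short, direct questions",
}

MAX_USER_TURNS = 10  # assumed value; the real limit comes from the dataset's dropdown settings

def should_stop(user_turns_simulated: int, outcome_met: bool) -> bool:
    """A simulation ends as soon as the expected outcome is met, or when the turn budget runs out."""
    return outcome_met or user_turns_simulated >= MAX_USER_TURNS
```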
Run an Evaluation
You can run an evaluation on a dataset by clicking the Evaluate button at the top right of the dataset page.
Select your dataset and metrics
- Navigate to Project > Datasets, and select your multi-turn dataset to evaluate
- Click Evaluate
- Select your multi-turn Metric Collection
Turn on simulations
Simulations must be enabled if you want Confident AI to call your AI app at evaluation time.
Well-crafted simulation instructions are key to realistic conversations. Be specific about the user’s goals, tone, and knowledge level through the scenario, expected outcome, and user description fields on your goldens.
Configure output generation
If simulations are turned on, select how your AI app will respond to each turn: via a Prompt or an AI Connection.
For prompt-based chatbots, select a prompt template that includes conversation history.
- In the evaluation setup, select this prompt as your output generation method
- Confident AI calls your LLM for each turn, passing the conversation history
You’ll need an existing prompt for this to work. If you haven’t already, you can create a prompt on the Prompt Studio.
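To make the “includes conversation history” requirement concrete, here is one shape such a template could take. The placeholder names (`conversation_history`, `latest_user_message`) and the rendering helper are assumptions for illustration; use whatever variable syntax your prompt in Prompt Studio actually defines.

```python
# Illustrative chat prompt that carries the conversation history forward each turn.
# Placeholder names are assumptions; match them to your own prompt's variables.
PROMPT_TEMPLATE = (
    "You are a helpful airline support assistant.\n\n"
    "Conversation so far:\n"
    "{conversation_history}\n\n"
    "User: {latest_user_message}\n"
    "Assistant:"
)

def render_prompt(history: list[dict], latest_user_message: str) -> str:
    """Flatten prior turns into the template so the model sees the full context on every turn."""
    formatted_history = "\n".join(f"{turn['role']}: {turn['content']}" for turn in history)
    return PROMPT_TEMPLATE.format(
        conversation_history=formatted_history,
        latest_user_message=latest_user_message,
    )
```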
Run and view results
Click Run Evaluation and wait for simulations to complete. This may take longer than single-turn evals due to the conversation generation step.
Your test run dashboard shows:
- Score distributions — average, median, and percentiles for each metric
- Pass/fail results — a conversation passes only if all metrics meet their thresholds (see the sketch after this list)
- Full conversation logs — review the complete simulated conversations
- Turn-by-turn analysis — see how the AI performed at each step
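If you want to reason about the pass/fail rule precisely, it boils down to an “all metrics above threshold” check like the one below; the metric names and threshold values are assumed for illustration.

```python
# A conversation passes only if every metric in the collection meets its threshold.
def conversation_passes(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    return all(scores[name] >= threshold for name, threshold in thresholds.items())

# Assumed example scores and thresholds:
scores = {"turn_relevancy": 0.84, "conversation_completeness": 0.61}
thresholds = {"turn_relevancy": 0.7, "conversation_completeness": 0.7}
print(conversation_passes(scores, thresholds))  # False: completeness falls below its threshold
```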
Regression Testing
Once you have two or more test runs, you can compare them side-by-side to identify regressions.