Multi-Turn Evals (No-Code)
Multi-Turn Evals (No-Code)
Multi-Turn Evals (No-Code)
Multi-turn evaluations test conversational interactions where context accumulates across multiple exchanges. These are use cases where the AI must maintain coherence throughout a conversation:
Unlike single-turn evals, multi-turn evals require generating the entire conversation before metrics can be applied, is the most time-consuming part of the process.
Fortunately, Confident AI handles the simulation aspect as well so you don’t have to manually prompt your AI for hours on end.
To run a multi-turn evaluation, you need:
Multi-turn metrics evaluate the conversation as a whole, not individual messages. Examples include turn faithfulness and turn contextual relevancy.
Multi-turn evals follow a 5-step process — the key difference from single-turn is the simulation step:
Here’s a visual representation of the data flow:
Because conversations must be fully simulated before evaluation, multi-turn evals can take slightly longer than single-turn. Plan accordingly for large datasets.
To control simulations within your dataset, you will have to edit the scenario, expected outcome, and user description fields of your goldens. Each field will control your simulations in a different way:
It is important to note that simulations will automatically end if the expected outcome is not met after the max number of user turns simulated. This can be configured in the dropdown settings of a multi-turn dataset.
You can evaluate on a dataset by clicking on the Evaluate button on the top right of a dataset page.
This must be enabled if you want to call your AI app during evaluation time
Well-crafted simulation instructions are key to realistic conversations. Be specific about the user’s goals, tone, and knowledge level through the use of scenarios, expected outcome, and user description fields on your goldens.
If simulations is turned on, and select how your AI app will respond to each turn:
For prompt-based chatbots, select a prompt template that includes conversation history.
You’ll need an existing prompt for this to work. If you haven’t already, you can create a prompt on the Prompt Studio.
Click Run Evaluation and wait for simulations to complete. This may take longer than single-turn evals due to the conversation generation step.
Your test run dashboard shows:
Once you have two or more test runs, you can compare them side-by-side to identify regressions.