Evaluate Threads
Run evaluations on multi-turn conversations by evaluating entire threads
Overview
Thread evaluations let you evaluate an entire multi-turn conversation as a single unit, rather than evaluating individual traces or spans in isolation. This is essential for conversational AI apps where quality depends on the full context of a conversation.
For evaluating individual traces and spans, see Evaluate Traces & Spans.
How It Works
Thread evaluations follow these steps:
- You create a multi-turn metric collection on Confident AI with the conversational metrics you want to run.
- Your app creates traces with a shared thread ID, setting `input` and `output` on each trace to represent conversation turns (see the sketch after this list).
- Once the conversation is complete, you call the evaluate thread function with the thread ID and metric collection name.
- Confident AI builds a conversational test case from the trace I/O values: each trace's `input` becomes a user turn, and each `output` becomes an assistant turn.
- Your multi-turn metrics run against the full conversation and results appear on the thread in the dashboard.
Only multi-turn metric collections work for thread evaluations. Using a single-turn collection will not produce results.
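The following is a minimal Python sketch of the trace-creation step, assuming the deepeval SDK's `observe` decorator and `update_current_trace` helper with `thread_id`, `input`, and `output` keyword arguments; the thread ID, placeholder LLM call, and sample messages are made up for illustration.

```python
from deepeval.tracing import observe, update_current_trace

THREAD_ID = "conversation-123"  # any stable ID shared by every trace in the conversation

def generate_response(user_message: str) -> str:
    # Placeholder for your actual LLM call
    return f"Echo: {user_message}"

@observe()
def chat_turn(user_message: str) -> str:
    response = generate_response(user_message)

    # Attach this trace to the thread and set the I/O that becomes a conversation turn
    update_current_trace(
        thread_id=THREAD_ID,
        input=user_message,  # becomes a user turn
        output=response,     # becomes an assistant turn
    )
    return response

# One trace per turn, all sharing the same thread ID
chat_turn("What's your refund policy?")
chat_turn("Does it apply to digital goods?")
```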
How Thread Evals Differ
The key difference is that you don’t set test case parameters for thread evals — instead, Confident AI automatically constructs the conversation from trace I/O:
- Trace `input` → user message
- Trace `output` → assistant message
This is why setting trace I/O correctly is critical for thread evaluations.
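To make the mapping concrete, the snippet below uses plain Python data, not the actual Confident AI API, to show how two traces from the same thread are stitched into an ordered list of conversation turns:

```python
# Illustrative only: two traces from the same thread, in chronological order
traces = [
    {"input": "What's your refund policy?", "output": "Refunds are available within 30 days."},
    {"input": "Does it apply to digital goods?", "output": "Yes, digital purchases are included."},
]

# Each trace contributes a user turn (input) followed by an assistant turn (output)
turns = []
for trace in traces:
    turns.append({"role": "user", "content": trace["input"]})
    turns.append({"role": "assistant", "content": trace["output"]})

# turns now holds the 4-turn conversation that your multi-turn metrics evaluate
```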
If you don’t set input and/or output on any traces in the thread,
Confident AI will have no turns to evaluate and the evaluation will produce no
results.
Evaluate a Thread
Thread evaluations must be triggered manually after a conversation has completed, since Confident AI cannot automatically know when a multi-turn conversation is finished.
Call the evaluate thread function once the conversation is done:
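Here is a minimal Python sketch, assuming `evaluate_thread` is importable from `deepeval.tracing` and accepts the thread ID and metric collection name as keyword arguments; confirm the exact import path against the SDK reference for your version.

```python
from deepeval.tracing import evaluate_thread

# Run the multi-turn metric collection against the finished conversation
evaluate_thread(
    thread_id="conversation-123",               # same ID set on every trace in the thread
    metric_collection="My Multi-Turn Metrics",  # name of your multi-turn metric collection
)
```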
The asynchronous version a_evaluate_thread is also available in Python.
Add Turn Context
You can optionally enrich each turn with tools called and retrieval context. This gives multi-turn metrics additional context about how each response was generated.
For more information on how trace parameters map to test case parameters, click here.
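A hedged Python sketch follows; it assumes `update_current_trace` also accepts `retrieval_context` and `tools_called` keyword arguments and that `ToolCall` is importable from `deepeval.test_case`. Treat the parameter names and values as illustrative and confirm them against the tracing reference linked above.

```python
from deepeval.test_case import ToolCall
from deepeval.tracing import observe, update_current_trace

@observe()
def chat_turn(user_message: str) -> str:
    documents = ["Refunds are available within 30 days of purchase."]  # e.g. from your retriever
    response = "Refunds are available within 30 days."                 # e.g. from your LLM

    update_current_trace(
        thread_id="conversation-123",
        input=user_message,
        output=response,
        # Optional extras that enrich this turn for multi-turn metrics
        retrieval_context=documents,
        tools_called=[ToolCall(name="search_docs", input_parameters={"query": user_message})],
    )
    return response
```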
Examples
Quick quiz: Given the code below, will Confident AI successfully evaluate the thread?
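Below is a Python sketch of the scenario in question; the function, thread ID, and collection names are illustrative, and it assumes `observe` accepts a `metric_collection` argument.

```python
from deepeval.tracing import evaluate_thread, observe, update_current_trace

@observe(metric_collection="My Single-Turn Metrics")  # trace-level metric collection
def chat_turn(user_message: str) -> str:
    response = f"Echo: {user_message}"

    # Only the thread ID is set; trace input and output are never set
    update_current_trace(thread_id="conversation-123")
    return response

chat_turn("What's your refund policy?")
evaluate_thread(
    thread_id="conversation-123",
    metric_collection="My Multi-Turn Metrics",
)
```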
Answer: No — the thread evaluation will produce no results because neither input nor output has been set on the trace. Without these, Confident AI has no conversation turns to evaluate.
The metric collection on the observe decorator/wrapper is for
trace-level single-turn evaluations. For thread evaluations, you need to
set input and output on the trace — these are what become conversation
turns — and call the evaluate thread function separately.