Evaluate Threads
Run evaluations on multi-turn conversations by evaluating entire threads
Overview
Thread evaluations let you evaluate an entire multi-turn conversation as a single unit, rather than evaluating individual traces or spans in isolation. This is essential for conversational AI apps where quality depends on the full context of a conversation.
For evaluating individual traces and spans, see Evaluate Traces & Spans.
How It Works
Thread evaluations follow these steps:
- You create a multi-turn metric collection on Confident AI with the conversational metrics you want to run.
- Your app creates traces with a shared thread ID, setting `input` and `output` on each trace to represent conversation turns.
- Once the conversation is complete, you call the evaluate thread function with the thread ID and metric collection name.
- Confident AI builds a conversational test case from the trace I/O values: each trace's `input` becomes a user turn, and each `output` becomes an assistant turn.
- Your multi-turn metrics run against the full conversation, and results appear on the thread in the dashboard.
Only multi-turn metric collections work for thread evaluations. Using a single-turn collection will not produce results.
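For illustration, steps 2 and 3 might look like the following minimal sketch using deepeval's tracing API. Here `generate_response` is a placeholder for your own LLM call, the thread ID and collection name are placeholders, and exact parameter names may vary by SDK version:

```python
from deepeval.tracing import observe, update_current_trace

THREAD_ID = "chat-session-123"  # placeholder: any stable ID shared by every turn

def generate_response(message: str) -> str:
    return f"(model reply to: {message})"  # stand-in for your actual LLM call

@observe()
def chat_turn(user_message: str) -> str:
    response = generate_response(user_message)
    # Associate this trace with the thread and record the turn's I/O
    update_current_trace(
        thread_id=THREAD_ID,
        input=user_message,   # becomes a user turn
        output=response,      # becomes an assistant turn
    )
    return response

chat_turn("What's the weather like in Berlin?")
chat_turn("Should I bring a jacket?")
```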
How Thread Evals Differ
The key difference is that you don't set test case parameters for thread evals. Instead, Confident AI automatically constructs the conversation from trace I/O:
- Trace `input` → user message
- Trace `output` → assistant message
This is why setting trace I/O correctly is critical for thread evaluations.
If you don’t set input and/or output on any traces in the thread,
Confident AI will have no turns to evaluate and the evaluation will produce no
results.
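For instance, two traces in the same thread assemble into a four-turn conversation. The structure below is illustrative of the constructed test case, not the literal API shape:

```python
# Trace 1: input="What's the weather like in Berlin?"
#          output="It's sunny, around 24°C."
# Trace 2: input="Should I bring a jacket?"
#          output="Probably not; it should stay warm."
#
# Confident AI builds a conversational test case equivalent to:
turns = [
    {"role": "user",      "content": "What's the weather like in Berlin?"},
    {"role": "assistant", "content": "It's sunny, around 24°C."},
    {"role": "user",      "content": "Should I bring a jacket?"},
    {"role": "assistant", "content": "Probably not; it should stay warm."},
]
```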
Evaluate a Thread
Thread evaluations must be triggered manually after a conversation has completed, since Confident AI cannot automatically know when a multi-turn conversation is finished.
Call the evaluate thread function once the conversation is done:
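A minimal sketch in Python, assuming `evaluate_thread` is importable from `deepeval.tracing` (the import path may vary by SDK version); the thread ID and collection name are placeholders:

```python
from deepeval.tracing import evaluate_thread

evaluate_thread(
    thread_id="chat-session-123",              # same ID your traces were sent with
    metric_collection="Conversation Quality",  # a multi-turn collection on Confident AI
)
```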
The asynchronous version `a_evaluate_thread` is also available in Python.
Add Turn Context
You can optionally enrich each turn with tools called and retrieval context. This gives multi-turn metrics additional context about how each response was generated.
For more information on how trace parameters map to test case parameters, click here.
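A sketch of enriching a turn, assuming `update_current_trace` also accepts `retrieval_context` and `tools_called` (a list of `ToolCall` objects); `retrieve_documents` and `generate_response` are placeholders for your own retriever and LLM call:

```python
from deepeval.tracing import observe, update_current_trace
from deepeval.test_case import ToolCall

def retrieve_documents(query: str) -> list[str]:
    return ["Berlin forecast: sunny, 24°C"]  # stand-in for your retriever

def generate_response(message: str, docs: list[str]) -> str:
    return f"(model reply using {len(docs)} docs)"  # stand-in for your LLM call

@observe()
def answer_with_rag(user_message: str) -> str:
    docs = retrieve_documents(user_message)
    response = generate_response(user_message, docs)
    update_current_trace(
        thread_id="chat-session-123",
        input=user_message,
        output=response,
        retrieval_context=docs,                           # retrieval context for this turn
        tools_called=[ToolCall(name="document_search")],  # tools used for this turn
    )
    return response
```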
Examples
Quick quiz: Given the code below, will Confident AI successfully evaluate the thread?
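A representative sketch (deepeval-style tracing; names are placeholders):

```python
from deepeval.tracing import observe, update_current_trace, evaluate_thread

def generate_response(message: str) -> str:
    return f"(model reply to: {message})"  # stand-in for your LLM call

@observe()
def chat_turn(user_message: str) -> str:
    response = generate_response(user_message)
    update_current_trace(
        thread_id="chat-session-123",
        metric_collection="Conversation Quality",
    )
    return response

chat_turn("What's the weather like in Berlin?")
evaluate_thread(
    thread_id="chat-session-123",
    metric_collection="Conversation Quality",
)
```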
Answer: No. The thread evaluation will produce no results because neither `input` nor `output` has been set on the trace. Without these, Confident AI has no conversation turns to evaluate.
The `metric_collection` set via `update_current_trace` is for trace-level, single-turn evaluations. For thread evaluations, you need to set `input` and `output` on the trace (these are what become conversation turns) and call the evaluate thread function separately.
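Continuing the sketch above, the fix would look something like this:

```python
@observe()
def chat_turn(user_message: str) -> str:
    response = generate_response(user_message)
    update_current_trace(
        thread_id="chat-session-123",
        input=user_message,   # becomes a user turn
        output=response,      # becomes an assistant turn
    )
    return response

chat_turn("What's the weather like in Berlin?")
evaluate_thread(
    thread_id="chat-session-123",
    metric_collection="Conversation Quality",  # a multi-turn collection
)
```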