Evaluate Threads
Run evaluations on multi-turn conversations by evaluating entire threads
Run evaluations on multi-turn conversations by evaluating entire threads
Thread evaluations let you evaluate an entire multi-turn conversation as a single unit, rather than evaluating individual traces or spans in isolation. This is essential for conversational AI apps where quality depends on the full context of a conversation.
For evaluating individual traces and spans, see Evaluate Traces & Spans.
Thread evaluations follow these steps:
input and output on each trace to represent conversation turns.input becomes a user turn, and each output becomes an assistant turn.Only multi-turn metric collections work for thread evaluations. Using a single-turn collection will not produce results.
To evaluate threads automatically without calling the evaluate thread function from your code, configure Evaluation Rules in Project Settings—a thread rule waits for the conversation to be idle for a configurable time limit, then runs your multi-turn metric collection.
The key difference is that you don’t set test case parameters for thread evals — instead, Confident AI automatically constructs the conversation from trace I/O:
input → user messageoutput → assistant messageThis is why setting trace I/O correctly is critical for thread evaluations.
If you don’t set input and/or output on any traces in the thread,
Confident AI will have no turns to evaluate and the evaluation will produce no
results.
Thread evaluations must be triggered manually after a conversation has completed, since Confident AI cannot automatically know when a multi-turn conversation is finished.
Call the evaluate thread function once the conversation is done:
The asynchronous version a_evaluate_thread is also available in Python.
You can optionally enrich each turn with tools called and retrieval context. This gives multi-turn metrics additional context about how each response was generated.
For more information on how trace parameters map to test case parameters, click here.
Quick quiz: Given the code below, will Confident AI successfully evaluate the thread?
Answer: No — the thread evaluation will produce no results because neither input nor output has been set on the trace. Without these, Confident AI has no conversation turns to evaluate.
The metric_collection set via update_current_trace is for trace-level
single-turn evaluations. For thread evaluations, you need to set input and
output on the trace — these are what become conversation turns — and call
the evaluate thread function separately.