Evaluate Threads

Run evaluations on multi-turn conversations by evaluating entire threads

Overview

Thread evaluations let you evaluate an entire multi-turn conversation as a single unit, rather than evaluating individual traces or spans in isolation. This is essential for conversational AI apps where quality depends on the full context of a conversation.

For evaluating individual traces and spans, see Evaluate Traces & Spans.

How It Works

Thread evaluations follow these steps:

  1. You create a multi-turn metric collection on Confident AI with the conversational metrics you want to run.
  2. Your app creates traces with a shared thread ID, setting input and output on each trace to represent conversation turns.
  3. Once the conversation is complete, you call the evaluate thread function with the thread ID and metric collection name.
  4. Confident AI builds a conversational test case from the trace I/O values — each trace’s input becomes a user turn, and each output becomes an assistant turn.
  5. Your multi-turn metrics run against the full conversation and results appear on the thread in the dashboard.

Only multi-turn metric collections work for thread evaluations. Using a single-turn collection will not produce results.

How Thread Evals Differ

|                   | Trace & Span Evals                            | Thread Evals                                   |
|-------------------|-----------------------------------------------|------------------------------------------------|
| Scope             | Single request/response                       | Entire multi-turn conversation                 |
| Metric collection | Single-turn metrics                           | Multi-turn metrics                             |
| When to run       | Real-time or retrospectively                  | Retrospectively only (after the conversation ends) |
| Data source       | Test case parameters you set on spans/traces  | Trace input/output values become conversation turns |

The key difference is that you don’t set test case parameters for thread evals — instead, Confident AI automatically constructs the conversation from trace I/O:

  • Trace input → user message
  • Trace output → assistant message

This is why setting trace I/O correctly is critical for thread evaluations.
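For illustration, a thread with two traced calls would be assembled into a conversation roughly like this (the assistant replies below are made-up placeholder values, not real outputs):

# Illustrative only: how trace I/O in a thread maps onto conversation turns.
conversation = [
    {"role": "user", "content": "What's the weather in SF?"},        # trace 1 input
    {"role": "assistant", "content": "It's sunny and 18°C in SF."},  # trace 1 output (placeholder)
    {"role": "user", "content": "What about tomorrow?"},             # trace 2 input
    {"role": "assistant", "content": "Light rain is expected."},     # trace 2 output (placeholder)
]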

If you don’t set input and/or output on any traces in the thread, Confident AI will have no turns to evaluate and the evaluation will produce no results.

Evaluate a Thread

Thread evaluations must be triggered manually after a conversation has completed, since Confident AI cannot automatically know when a multi-turn conversation is finished.

Call the evaluate thread function once the conversation is done:

main.py
from openai import OpenAI
from deepeval.tracing import observe, update_current_trace, evaluate_thread

client = OpenAI()
your_thread_id = "your-thread-id"

@observe()
def llm_app(query: str):
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
    update_current_trace(thread_id=your_thread_id, input=query, output=res)
    return res

llm_app("What's the weather in SF?")
llm_app("What about tomorrow?")

evaluate_thread(thread_id=your_thread_id, metric_collection="My Multi-Turn Collection")

The asynchronous version a_evaluate_thread is also available in Python.
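If your app already runs inside an event loop, a minimal sketch of the async variant looks like this (assuming a_evaluate_thread accepts the same parameters and is importable from the same module as evaluate_thread):

import asyncio

from deepeval.tracing import a_evaluate_thread

async def run_thread_eval():
    # Awaitable variant of evaluate_thread; call it once the conversation is done.
    await a_evaluate_thread(
        thread_id="your-thread-id",
        metric_collection="My Multi-Turn Collection",
    )

asyncio.run(run_thread_eval())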

Add Turn Context

You can optionally enrich each turn with tools called and retrieval context. This gives multi-turn metrics additional context about how each response was generated.

See the tracing documentation for more information on how trace parameters map to test case parameters.

main.py
from deepeval.tracing import observe, update_current_trace
from deepeval.test_case import ToolCall

@observe()
def llm_app(query: str):
    # retrieve() and generate() stand in for your app's own retrieval and generation logic
    chunks = retrieve(query)
    res = generate(query, chunks)
    update_current_trace(
        thread_id="your-thread-id",
        input=query,
        output=res,
        retrieval_context=[chunk.text for chunk in chunks],
        tools_called=[ToolCall(name="WebSearch")]
    )
    return res

Examples

Quick quiz: Given the code below, will Confident AI successfully evaluate the thread?

main.py
from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe(metric_collection="Collection 1")
def llm_app(query: str):
    update_current_trace(thread_id=your_thread_id)

llm_app("Hello")
evaluate_thread(thread_id=your_thread_id, metric_collection="Collection 2")

Answer: No — the thread evaluation will produce no results because neither input nor output has been set on the trace. Without these, Confident AI has no conversation turns to evaluate.

The metric collection on the observe decorator/wrapper is for trace-level single-turn evaluations. For thread evaluations, you need to set input and output on the trace — these are what become conversation turns — and call the evaluate thread function separately.
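For contrast, a version of the quiz code that would produce results might look like the sketch below. The hard-coded response is a stand-in for a real model call, and the collection name is assumed to refer to a multi-turn metric collection:

from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe()
def llm_app(query: str):
    res = "Hi! How can I help you today?"  # stand-in for a real model response
    # input and output on the trace are what become the conversation's turns
    update_current_trace(thread_id=your_thread_id, input=query, output=res)
    return res

llm_app("Hello")
evaluate_thread(thread_id=your_thread_id, metric_collection="My Multi-Turn Collection")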

Next Steps