Evaluate Threads
Run evaluations on multi-turn conversations by evaluating entire threads
Overview
Thread evaluations let you evaluate an entire multi-turn conversation as a single unit, rather than evaluating individual traces or spans in isolation. This is essential for conversational AI apps where quality depends on the full context of a conversation.
For evaluating individual traces and spans, see Evaluate Traces & Spans.
How It Works
Thread evaluations follow these steps:
- You create a multi-turn metric collection on Confident AI with the conversational metrics you want to run.
- Your app creates traces with a shared thread ID, setting `input` and `output` on each trace to represent conversation turns (see the sketch after this list).
- Once the conversation is complete, you call the evaluate thread function with the thread ID and metric collection name.
- Confident AI builds a conversational test case from the trace I/O values: each trace's `input` becomes a user turn, and each `output` becomes an assistant turn.
- Your multi-turn metrics run against the full conversation and results appear on the thread in the dashboard.
Only multi-turn metric collections work for thread evaluations. Using a single-turn collection will not produce results.
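The following is a minimal Python sketch of the trace-creation step, assuming the deepeval SDK's `observe` decorator and `update_current_trace` helper with `thread_id`, `input`, and `output` keyword arguments; the thread ID, placeholder LLM call, and sample messages are made up for illustration.

```python
from deepeval.tracing import observe, update_current_trace

THREAD_ID = "conversation-123"  # any stable ID shared by every trace in the conversation

def generate_response(user_message: str) -> str:
    # Placeholder for your actual LLM call
    return f"Echo: {user_message}"

@observe()
def chat_turn(user_message: str) -> str:
    response = generate_response(user_message)

    # Attach this trace to the thread and set the I/O that becomes a conversation turn
    update_current_trace(
        thread_id=THREAD_ID,
        input=user_message,  # becomes a user turn
        output=response,     # becomes an assistant turn
    )
    return response

# One trace per turn, all sharing the same thread ID
chat_turn("What's your refund policy?")
chat_turn("Does it apply to digital goods?")
```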
How Thread Evals Differ
The key difference is that you don’t set test case parameters for thread evals — instead, Confident AI automatically constructs the conversation from trace I/O:
- Trace `input` → user message
- Trace `output` → assistant message
This is why setting trace I/O correctly is critical for thread evaluations.
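To make the mapping concrete, the snippet below uses plain Python data, not the actual Confident AI API, to show how two traces from the same thread are stitched into an ordered list of conversation turns:

```python
# Illustrative only: two traces from the same thread, in chronological order
traces = [
    {"input": "What's your refund policy?", "output": "Refunds are available within 30 days."},
    {"input": "Does it apply to digital goods?", "output": "Yes, digital purchases are included."},
]

# Each trace contributes a user turn (input) followed by an assistant turn (output)
turns = []
for trace in traces:
    turns.append({"role": "user", "content": trace["input"]})
    turns.append({"role": "assistant", "content": trace["output"]})

# turns now holds the 4-turn conversation that your multi-turn metrics evaluate
```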
If you don’t set input and/or output on any traces in the thread,
Confident AI will have no turns to evaluate and the evaluation will produce no
results.
Evaluate a Thread
Thread evaluations must be triggered manually after a conversation has completed, since Confident AI cannot automatically know when a multi-turn conversation is finished.
Call the evaluate thread function once the conversation is done:
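Here is a minimal Python sketch, assuming `evaluate_thread` is importable from `deepeval.tracing` and accepts the thread ID and metric collection name as keyword arguments; confirm the exact import path against the SDK reference for your version.

```python
from deepeval.tracing import evaluate_thread

# Run the multi-turn metric collection against the finished conversation
evaluate_thread(
    thread_id="conversation-123",               # same ID set on every trace in the thread
    metric_collection="My Multi-Turn Metrics",  # name of your multi-turn metric collection
)
```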
The asynchronous version a_evaluate_thread is also available in Python.
Add Turn Context
You can optionally enrich each turn with tools called and retrieval context. This gives multi-turn metrics additional context about how each response was generated.
For more information on how trace parameters map to test case parameters, click here.
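A hedged Python sketch follows; it assumes `update_current_trace` also accepts `retrieval_context` and `tools_called` keyword arguments and that `ToolCall` is importable from `deepeval.test_case`. Treat the parameter names and values as illustrative and confirm them against the tracing reference linked above.

```python
from deepeval.test_case import ToolCall
from deepeval.tracing import observe, update_current_trace

@observe()
def chat_turn(user_message: str) -> str:
    documents = ["Refunds are available within 30 days of purchase."]  # e.g. from your retriever
    response = "Refunds are available within 30 days."                 # e.g. from your LLM

    update_current_trace(
        thread_id="conversation-123",
        input=user_message,
        output=response,
        # Optional extras that enrich this turn for multi-turn metrics
        retrieval_context=documents,
        tools_called=[ToolCall(name="search_docs", input_parameters={"query": user_message})],
    )
    return response
```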
Examples
Quick quiz: Given the code below, will Confident AI successfully evaluate the thread?
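Below is a Python sketch of the scenario in question; the function, thread ID, and collection names are illustrative, and it assumes `observe` accepts a `metric_collection` argument.

```python
from deepeval.tracing import evaluate_thread, observe, update_current_trace

@observe(metric_collection="My Single-Turn Metrics")  # trace-level metric collection
def chat_turn(user_message: str) -> str:
    response = f"Echo: {user_message}"

    # Only the thread ID is set; trace input and output are never set
    update_current_trace(thread_id="conversation-123")
    return response

chat_turn("What's your refund policy?")
evaluate_thread(
    thread_id="conversation-123",
    metric_collection="My Multi-Turn Metrics",
)
```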
Answer: No — the thread evaluation will produce no results because neither input nor output has been set on the trace. Without these, Confident AI has no conversation turns to evaluate.
The metric collection on the observe decorator/wrapper is for
trace-level single-turn evaluations. For thread evaluations, you need to
set input and output on the trace — these are what become conversation
turns — and call the evaluate thread function separately.