
Evaluate Threads

Run evaluations on multi-turn conversations by evaluating entire threads

Overview

Thread evaluations let you evaluate an entire multi-turn conversation as a single unit, rather than evaluating individual traces or spans in isolation. This is essential for conversational AI apps where quality depends on the full context of a conversation.

For evaluating individual traces and spans, see Evaluate Traces & Spans.

How It Works

Thread evaluations follow these steps:

  1. You create a multi-turn metric collection on Confident AI with the conversational metrics you want to run.
  2. Your app creates traces with a shared thread ID, setting input and output on each trace to represent conversation turns.
  3. Once the conversation is complete, you call the evaluate thread function with the thread ID and metric collection name.
  4. Confident AI builds a conversational test case from the trace I/O values — each trace’s input becomes a user turn, and each output becomes an assistant turn.
  5. Your multi-turn metrics run against the full conversation and results appear on the thread in the dashboard.

Only multi-turn metric collections work for thread evaluations. Using a single-turn collection will not produce results.

To evaluate threads automatically without calling the evaluate thread function from your code, configure Evaluation Rules in Project Settings—a thread rule waits for the conversation to be idle for a configurable time limit, then runs your multi-turn metric collection.
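
As a rough illustration of the idle-based rule described above, the sketch below checks whether a thread has been quiet for at least a configured time limit. The function name `is_thread_idle` and its parameters are hypothetical and only model the idea; this is not how Confident AI implements evaluation rules.

```python
from datetime import datetime, timedelta, timezone


def is_thread_idle(last_activity: datetime, idle_limit: timedelta, now: datetime) -> bool:
    """Return True once no new trace has arrived for at least idle_limit.

    Hypothetical helper: it only models the idle-window idea.
    """
    return now - last_activity >= idle_limit


# Fixed timestamps keep the example deterministic.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
idle_limit = timedelta(minutes=5)  # assumed value for the configurable time limit

# Last trace arrived 2 minutes ago: the conversation may still be going.
recent = is_thread_idle(now - timedelta(minutes=2), idle_limit, now)

# Last trace arrived 10 minutes ago: the thread counts as idle.
stale = is_thread_idle(now - timedelta(minutes=10), idle_limit, now)
```

A longer limit lowers the chance of evaluating a conversation that is still in progress, at the cost of later results.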

How Thread Evals Differ

| | Trace & Span Evals | Thread Evals |
| --- | --- | --- |
| Scope | Single request/response | Entire multi-turn conversation |
| Metric collection | Single-turn metrics | Multi-turn metrics |
| When to run | Real-time or retrospectively | Retrospectively only (after conversation ends) |
| Data source | Test case parameters you set on spans/traces | Trace input/output values become conversation turns |

The key difference is that you don’t set test case parameters for thread evals — instead, Confident AI automatically constructs the conversation from trace I/O:

  • Trace input → user message
  • Trace output → assistant message

This is why setting trace I/O correctly is critical for thread evaluations.

If no trace in the thread has an input or output set, Confident AI has no turns to evaluate and the evaluation produces no results.
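
The mapping from trace I/O to conversation turns can be pictured with a small sketch. The `Trace` and `Turn` classes and the `build_turns` function are hypothetical stand-ins written for this illustration; they are not deepeval classes, and the handling of a trace with only one of input or output set is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trace:
    """Hypothetical stand-in for a logged trace (not a deepeval class)."""
    thread_id: str
    input: Optional[str] = None
    output: Optional[str] = None


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


def build_turns(traces: List[Trace], thread_id: str) -> List[Turn]:
    """Flatten one thread's traces, in order, into conversation turns."""
    turns: List[Turn] = []
    for trace in traces:
        if trace.thread_id != thread_id:
            continue
        # Assumption: a missing input or output simply contributes no turn.
        if trace.input is not None:
            turns.append(Turn("user", trace.input))
        if trace.output is not None:
            turns.append(Turn("assistant", trace.output))
    return turns


traces = [
    Trace("t1", input="What's the weather in SF?", output="Sunny, 65F."),
    Trace("t1", input="What about tomorrow?", output="Rain is likely."),
    Trace("t2"),  # no input or output set on this trace
]

conversation = build_turns(traces, "t1")  # four turns, alternating roles
empty = build_turns(traces, "t2")         # no turns, so nothing to evaluate
```

The second thread shows the failure mode: with no I/O on its trace, the resulting conversation is empty.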

Evaluate a Thread

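When you trigger a thread evaluation from code, do so only after the conversation has completed: Confident AI cannot tell from individual traces when a multi-turn conversation is finished. (An evaluation rule, by contrast, approximates this by waiting for the thread to go idle.)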

Call the evaluate thread function once the conversation is done:

main.py

from openai import OpenAI
from deepeval.tracing import observe, update_current_trace, evaluate_thread

client = OpenAI()
your_thread_id = "your-thread-id"

@observe()
def llm_app(query: str):
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content
    # Attach this trace to the thread; input/output become the user/assistant turns
    update_current_trace(thread_id=your_thread_id, input=query, output=res)
    return res

# Each call produces one trace, i.e. one user turn and one assistant turn
llm_app("What's the weather in SF?")
llm_app("What about tomorrow?")

# Run the multi-turn collection once the conversation is over
evaluate_thread(thread_id=your_thread_id, metric_collection="My Multi-Turn Collection")

The asynchronous version a_evaluate_thread is also available in Python.
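
In an async application, the call pattern might look like the sketch below, which triggers evaluations for several finished threads concurrently. To keep the snippet runnable offline, `a_evaluate_thread_stub` is a hypothetical stand-in for `a_evaluate_thread`. It assumes the real coroutine takes the same `thread_id` and `metric_collection` keyword arguments as `evaluate_thread`, which this page does not confirm. The `close_conversations` helper is likewise illustrative.

```python
import asyncio
from typing import Awaitable, Callable, List


async def a_evaluate_thread_stub(thread_id: str, metric_collection: str) -> dict:
    """Hypothetical stand-in for a_evaluate_thread (assumed keyword arguments)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"thread_id": thread_id, "metric_collection": metric_collection}


async def close_conversations(
    thread_ids: List[str],
    metric_collection: str,
    evaluate: Callable[..., Awaitable[dict]] = a_evaluate_thread_stub,
) -> List[dict]:
    """Trigger one evaluation per finished thread and wait for all of them."""
    calls = [
        evaluate(thread_id=thread_id, metric_collection=metric_collection)
        for thread_id in thread_ids
    ]
    # gather preserves the order of the awaitables it was given
    return await asyncio.gather(*calls)


results = asyncio.run(
    close_conversations(["thread-a", "thread-b"], "My Multi-Turn Collection")
)
```

In real code you would pass the library's coroutine as `evaluate` instead of the stub, after checking its actual signature.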

Add Turn Context

You can optionally enrich each turn with tools called and retrieval context. This gives multi-turn metrics additional context about how each response was generated.

For more information on how trace parameters map to test case parameters, click here.

main.py

from deepeval.tracing import observe, update_current_trace
from deepeval.test_case import ToolCall

# retrieve() and generate() are placeholders for your app's own
# retrieval and generation functions; they are not defined here.

@observe()
def llm_app(query: str):
    chunks = retrieve(query)
    res = generate(query, chunks)
    update_current_trace(
        thread_id="your-thread-id",
        input=query,
        output=res,
        # Optional per-turn context made available to multi-turn metrics
        retrieval_context=[chunk.text for chunk in chunks],
        tools_called=[ToolCall(name="WebSearch")]
    )
    return res

Examples

Quick quiz: Given the code below, will Confident AI successfully evaluate the thread?

main.py

from deepeval.tracing import observe, update_current_trace, evaluate_thread

your_thread_id = "your-thread-id"

@observe()
def llm_app(query: str):
    update_current_trace(
        thread_id=your_thread_id,
        metric_collection="Collection 1"
    )

llm_app("Hello")
evaluate_thread(thread_id=your_thread_id, metric_collection="Collection 2")

Answer: No — the thread evaluation will produce no results because neither input nor output has been set on the trace. Without these, Confident AI has no conversation turns to evaluate.

The metric_collection set via update_current_trace is for trace-level single-turn evaluations. For thread evaluations, you need to set input and output on the trace — these are what become conversation turns — and call the evaluate thread function separately.

Next Steps

Thread Traces

Learn how to create threads, set I/O, and log tools called and retrieval context per turn.

Customize Traces

Add tags, metadata, and user info to your traces for filtering and analysis.