Evaluate Traces & Spans

Run online and offline evaluations on individual traces and spans

Overview

Online evaluations let you run metrics on traces and spans on-the-fly as they’re ingested into Confident AI, giving you real-time production monitoring of your AI’s quality.

You can also trigger evaluations retrospectively on historical traces and spans.

For evaluating multi-turn conversations (threads), see Evaluate Threads.

How It Works

Online evaluations for traces and spans follow these steps:

  1. You create a metric collection on Confident AI with the single-turn metrics you want to run.
  2. You reference that metric collection by name in the observe decorator/wrapper via the metric_collection parameter.
  3. Inside your observed function, you set test case parameters on the span or trace using the update_current_span or update_current_trace function.
  4. When the trace is sent to Confident AI, it runs the metrics in your collection against the test case data you provided.
  5. Results appear on the trace/span in the Confident AI dashboard.

Only referenceless metrics in your metric collection will run during tracing. Referenceless metrics can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools). Non-referenceless metrics are silently skipped.
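
For intuition, here is a rough sketch of the distinction using two DeepEval metrics (you add metrics to a collection in the Confident AI UI rather than in code; this is only to illustrate which kinds of metrics can run online):

```python
from deepeval.metrics import AnswerRelevancyMetric, ContextualRecallMetric

# Referenceless: judges actual_output against the input alone, so it can run
# on traces/spans as they are ingested.
answer_relevancy = AnswerRelevancyMetric()

# Reference-based: also requires expected_output (reference data), so it would
# be silently skipped during online evaluation.
contextual_recall = ContextualRecallMetric()
```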

Map Test Case Parameters

To run evaluations, you first need to understand how trace and span parameters map to test case parameters, which are what metrics evaluate against.

The parameters you pass to the update_current_span or update_current_trace function map directly to test case parameters:

| Trace/Span Parameter | Test Case Parameter | Description |
| --- | --- | --- |
| input | input | The input to your AI app |
| output | actual_output | The output of your AI app |
| expected_output | expected_output | The expected output of your AI app |
| retrieval_context | retrieval_context | List of retrieved text chunks from a retrieval system |
| context | context | List of ideal retrieved text chunks |
| tools_called | tools_called | List of ToolCall objects representing tools called |
| expected_tools | expected_tools | List of ToolCall objects representing expected tools |

All parameters are optional — you only need to provide the ones required by the metrics in your collection.

Each metric requires different test case parameters. For details on what each metric needs, refer to the official DeepEval documentation.
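
For instance, a RAG-style span might set retrieval_context alongside input and output so that any retrieval metrics in your collection have the parameters they need. This is a minimal sketch; the collection name, chunks, and answer below are hypothetical:

```python
from deepeval.tracing import observe, update_current_span

@observe(metric_collection="My RAG Collection")  # hypothetical collection name
def answer_question(query: str) -> str:
    # Hypothetical retrieval and generation steps
    chunks = ["Paris is the capital of France."]
    answer = "The capital of France is Paris."

    # input -> input, output -> actual_output, retrieval_context -> retrieval_context
    update_current_span(
        input=query,
        output=answer,
        retrieval_context=chunks,
    )
    return answer

answer_question("What is the capital of France?")
```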

Evaluate Spans Online

Provide a metric collection on the span's observe decorator/wrapper and set test case parameters via the update_current_span function:

main.py
```python
from deepeval.tracing import observe, update_current_span
from openai import OpenAI

client = OpenAI()

@observe(metric_collection="My Collection")
def llm_app(query: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    update_current_span(input=query, output=res)
    return res

llm_app("Write me a poem.")
```

If a metric collection name doesn’t match any collection on Confident AI, it will fail silently. Make sure the names align exactly (watch out for trailing spaces).

Evaluate Traces Online

Similar to spans, but use the update_current_trace function to set test case parameters. The metric collection must be set on the root-level (outermost) span.

main.py
```python
from deepeval.tracing import observe, update_current_trace
from openai import OpenAI

client = OpenAI()

@observe(metric_collection="My Collection")
def llm_app(query: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    update_current_trace(input=query, output=res)
    return res

llm_app("Write me a poem.")
```

You can run online evals on both traces and spans at the same time.
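
For example, here is a minimal sketch (the collection names, retrieval step, and prompt are illustrative) where the root span's collection evaluates the trace and a nested retriever span is evaluated with its own collection:

```python
from deepeval.tracing import observe, update_current_span, update_current_trace
from openai import OpenAI

client = OpenAI()

@observe(metric_collection="Retriever Collection")  # hypothetical span-level collection
def retrieve(query: str) -> list[str]:
    chunks = ["Paris is the capital of France."]  # hypothetical retrieval result
    update_current_span(input=query, retrieval_context=chunks)
    return chunks

@observe(metric_collection="Trace Collection")  # hypothetical trace-level collection
def llm_app(query: str) -> str:
    chunks = retrieve(query)
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{query}\n\nContext: {chunks}"}],
    ).choices[0].message.content

    update_current_trace(input=query, output=res)
    return res

llm_app("What is the capital of France?")
```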

If you specify a metric collection but don’t provide sufficient test case parameters for a metric, it will show up as an error on Confident AI but won’t block or cause issues in your code.

Run Evals Offline

You can also trigger evaluations on traces and spans that have already been ingested. This is useful for re-evaluating with new metrics or running evals on historical data.

Evaluate a trace

main.py
```python
from deepeval.tracing import evaluate_trace

evaluate_trace(trace_uuid="your-trace-uuid", metric_collection="Collection Name")
```

The asynchronous version a_evaluate_trace is also available.
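
For example, a minimal sketch of the async variant, assuming a_evaluate_trace is importable from deepeval.tracing and takes the same arguments as evaluate_trace:

```python
import asyncio

from deepeval.tracing import a_evaluate_trace

async def main():
    # Assumed to mirror evaluate_trace's arguments (trace_uuid, metric_collection)
    await a_evaluate_trace(
        trace_uuid="your-trace-uuid",
        metric_collection="Collection Name",
    )

asyncio.run(main())
```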

Your trace must already contain the necessary test case parameters — you cannot update them when evaluating retrospectively.

Evaluate a span

main.py
```python
from deepeval.tracing import evaluate_span

evaluate_span(span_uuid="your-span-uuid", metric_collection="Collection Name")
```

The asynchronous version a_evaluate_span is also available.

The metric collection you provide must be a single-turn collection.

Examples

Quick quiz: Given the code below, will Confident AI run online evaluations on the trace using metrics in "Collection 2" or "Collection 1"?

main.py
```python
from deepeval.tracing import observe, update_current_span, update_current_trace

@observe(metric_collection="Collection 1")
def outer_function():
    @observe(metric_collection="Collection 2")
    def inner_function():
        update_current_span(input=..., output=...)
        update_current_trace(input=..., output=...)
```

Answer: "Collection 1" runs for the trace, and "Collection 2" runs for the span.

This is because:

  1. The outer function creates the root span — its metric collection is used for trace-level evals
  2. The inner function creates a nested span — its metric collection is used for span-level evals
  3. The update_current_span call updates the innermost active span (the “inner” span)
  4. The update_current_trace call always updates the trace regardless of where it’s called

Next Steps

Now that you can evaluate individual traces and spans, learn how to evaluate entire conversations.