Evaluate Traces & Spans

Run online and offline evaluations on individual traces and spans

Overview

Online evaluations let you run metrics on traces and spans on-the-fly as they’re ingested into Confident AI, giving you real-time production monitoring of your AI’s quality.

You can also trigger evaluations retrospectively on historical traces and spans.

For evaluating multi-turn conversations (threads), see Evaluate Threads.

How It Works

Online evaluations for traces and spans follow these steps:

  1. You create a metric collection on Confident AI with the single-turn metrics you want to run.
  2. You reference that metric collection by name in the observe decorator/wrapper via the metric_collection parameter.
  3. Inside your observed function, you set test case parameters on the span or trace using the update_current_span or update_current_trace function.
  4. When the trace is sent to Confident AI, it runs the metrics in your collection against the test case data you provided.
  5. Results appear on the trace/span in the Confident AI dashboard.

Only referenceless metrics in your metric collection will run during tracing. Referenceless metrics can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools). Non-referenceless metrics are silently skipped.
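
For intuition, here is a rough sketch of the distinction using two DeepEval metrics (you add metrics to a collection in the Confident AI UI rather than in code; this is only to illustrate which kinds of metrics can run online):

```python
from deepeval.metrics import AnswerRelevancyMetric, ContextualRecallMetric

# Referenceless: judges actual_output against the input alone, so it can run
# on traces/spans as they are ingested.
answer_relevancy = AnswerRelevancyMetric()

# Reference-based: also requires expected_output (reference data), so it would
# be silently skipped during online evaluation.
contextual_recall = ContextualRecallMetric()
```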

Map Test Case Parameters

To run evaluations, you first need to understand how trace and span parameters map to test case parameters, which are what metrics evaluate against.

The parameters you pass to the update_current_span or update_current_trace function map directly to test case parameters:

| Trace/Span Parameter | Test Case Parameter | Description |
| --- | --- | --- |
| input | input | The input to your AI app |
| output | actual_output | The output of your AI app |
| expected_output | expected_output | The expected output of your AI app |
| retrieval_context | retrieval_context | List of retrieved text chunks from a retrieval system |
| context | context | List of ideal retrieved text chunks |
| tools_called | tools_called | List of ToolCall objects representing tools called |
| expected_tools | expected_tools | List of ToolCall objects representing expected tools |

All parameters are optional — you only need to provide the ones required by the metrics in your collection.

Each metric requires different test case parameters. For details on what each metric needs, refer to the official DeepEval documentation.
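
For instance, a RAG-style span might set retrieval_context alongside input and output so that any retrieval metrics in your collection have the parameters they need. This is a minimal sketch; the collection name, chunks, and answer below are hypothetical:

```python
from deepeval.tracing import observe, update_current_span

@observe(metric_collection="My RAG Collection")  # hypothetical collection name
def answer_question(query: str) -> str:
    # Hypothetical retrieval and generation steps
    chunks = ["Paris is the capital of France."]
    answer = "The capital of France is Paris."

    # input -> input, output -> actual_output, retrieval_context -> retrieval_context
    update_current_span(
        input=query,
        output=answer,
        retrieval_context=chunks,
    )
    return answer

answer_question("What is the capital of France?")
```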

Evaluate Spans Online

Provide a metric collection on the span's observe decorator/wrapper and set test case parameters via the update_current_span function:

main.py
```python
from deepeval.tracing import observe, update_current_span
from openai import OpenAI

client = OpenAI()

@observe(metric_collection="My Collection")
def llm_app(query: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    update_current_span(input=query, output=res)
    return res

llm_app("Write me a poem.")
```

If a metric collection name doesn’t match any collection on Confident AI, it will fail silently. Make sure the names align exactly (watch out for trailing spaces).

Evaluate Traces Online

Similar to spans, but use the update_current_trace function to set test case parameters. The metric collection must be set on the root-level (outermost) span.

main.py
```python
from deepeval.tracing import observe, update_current_trace
from openai import OpenAI

client = OpenAI()

@observe(metric_collection="My Collection")
def llm_app(query: str) -> str:
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    ).choices[0].message.content

    update_current_trace(input=query, output=res)
    return res

llm_app("Write me a poem.")
```

You can run online evals on both traces and spans at the same time.
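
For example, here is a minimal sketch (the collection names, retrieval step, and prompt are illustrative) where the root span's collection evaluates the trace and a nested retriever span is evaluated with its own collection:

```python
from deepeval.tracing import observe, update_current_span, update_current_trace
from openai import OpenAI

client = OpenAI()

@observe(metric_collection="Retriever Collection")  # hypothetical span-level collection
def retrieve(query: str) -> list[str]:
    chunks = ["Paris is the capital of France."]  # hypothetical retrieval result
    update_current_span(input=query, retrieval_context=chunks)
    return chunks

@observe(metric_collection="Trace Collection")  # hypothetical trace-level collection
def llm_app(query: str) -> str:
    chunks = retrieve(query)
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{query}\n\nContext: {chunks}"}],
    ).choices[0].message.content

    update_current_trace(input=query, output=res)
    return res

llm_app("What is the capital of France?")
```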

If you specify a metric collection but don’t provide sufficient test case parameters for a metric, it will show up as an error on Confident AI but won’t block or cause issues in your code.

Run Evals Offline

You can also trigger evaluations on traces and spans that have already been ingested. This is useful for re-evaluating with new metrics or running evals on historical data.

Evaluate a trace

main.py
```python
from deepeval.tracing import evaluate_trace

evaluate_trace(trace_uuid="your-trace-uuid", metric_collection="Collection Name")
```

The asynchronous version a_evaluate_trace is also available.
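
For example, a minimal sketch of the async variant, assuming a_evaluate_trace is importable from deepeval.tracing and takes the same arguments as evaluate_trace:

```python
import asyncio

from deepeval.tracing import a_evaluate_trace

async def main():
    # Assumed to mirror evaluate_trace's arguments (trace_uuid, metric_collection)
    await a_evaluate_trace(
        trace_uuid="your-trace-uuid",
        metric_collection="Collection Name",
    )

asyncio.run(main())
```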

Your trace must already contain the necessary test case parameters — you cannot update them when evaluating retrospectively.

Evaluate a span

main.py
```python
from deepeval.tracing import evaluate_span

evaluate_span(span_uuid="your-span-uuid", metric_collection="Collection Name")
```

The asynchronous version a_evaluate_span is also available.

The metric collection you provide must be a single-turn collection.

Examples

Quick quiz: Given the code below, will Confident AI run online evaluations on the trace using metrics in "Collection 2" or "Collection 1"?

main.py
```python
from deepeval.tracing import observe, update_current_span, update_current_trace

@observe(metric_collection="Collection 1")
def outer_function():
    @observe(metric_collection="Collection 2")
    def inner_function():
        update_current_span(input=..., output=...)
        update_current_trace(input=..., output=...)
```

Answer: "Collection 1" runs for the trace, and "Collection 2" runs for the span.

This is because:

  1. The outer function creates the root span — its metric collection is used for trace-level evals
  2. The inner function creates a nested span — its metric collection is used for span-level evals
  3. The update_current_span call updates the innermost active span (the “inner” span)
  4. The update_current_trace call always updates the trace regardless of where it’s called

Next Steps

Now that you can evaluate individual traces and spans, learn how to evaluate entire conversations.