Evaluate Traces & Spans
Overview
Online evaluations let you run metrics on traces and spans on-the-fly as they’re ingested into Confident AI, giving you real-time production monitoring of your AI’s quality.
You can also trigger evaluations retrospectively on historical traces and spans.
For evaluating multi-turn conversations (threads), see Evaluate Threads.
How It Works
Online evaluations for traces and spans follow these steps:
- You create a metric collection on Confident AI with the single-turn metrics you want to run.
- You reference that metric collection by name in the observe decorator/wrapper via the metric collection parameter.
- Inside your observed function, you set test case parameters on the span or trace using the update current span or update current trace function.
- When the trace is sent to Confident AI, it runs the metrics in your collection against the test case data you provided.
- Results appear on the trace/span in the Confident AI dashboard.
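Put together, the flow looks roughly like the sketch below. It is a minimal, hedged example: "My Metrics" is a hypothetical collection name, and the metric_collection and test_case parameter names are assumed from the steps above, so check the DeepEval tracing reference for exact signatures.
```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

# Step 2: reference the metric collection (created in step 1) by name
@observe(metric_collection="My Metrics")  # hypothetical collection name
def answer_question(query: str) -> str:
    output = f"Answer to: {query}"  # stand-in for your real LLM call

    # Step 3: set test case parameters on the current span
    update_current_span(test_case=LLMTestCase(input=query, actual_output=output))
    return output

# Steps 4-5: once the trace is ingested, Confident AI runs the collection's
# referenceless metrics against this data and displays results on the span.
answer_question("How long do refunds take?")
```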
Only referenceless metrics in your metric collection will run during tracing. Referenceless metrics can evaluate your LLM's performance without requiring reference data (like expected_output or expected_tools). Non-referenceless metrics are silently skipped.
Map Test Case Parameters
To run evaluations, you first need to understand how trace and span parameters map to test case parameters, which are what metrics evaluate against. The parameters you pass to the update current span or update current trace function map directly to these test case parameters.
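A sketch of the mapping, assuming update_current_span accepts a test_case argument as in DeepEval's tracing API ("My Metrics" is a hypothetical collection name); each LLMTestCase field you set becomes the test case parameter of the same name:
```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, ToolCall

@observe(metric_collection="My Metrics")  # hypothetical collection name
def answer(query: str) -> str:
    output = "You have 30 days to request a refund."
    update_current_span(
        test_case=LLMTestCase(
            input=query,                                         # -> input
            actual_output=output,                                # -> actual_output
            retrieval_context=["Refunds within 30 days."],       # -> retrieval_context
            tools_called=[ToolCall(name="search_docs")],         # -> tools_called
            expected_output="Refunds are allowed for 30 days.",  # -> expected_output (reference)
            expected_tools=[ToolCall(name="search_docs")],       # -> expected_tools (reference)
        )
    )
    return output
```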
All parameters are optional — you only need to provide the ones required by the metrics in your collection.
Each metric requires different test case parameters. For details on what each metric needs, refer to the official DeepEval documentation.
Evaluate Spans Online
Provide a metric collection on the span’s observe decorator/wrapper and set test case parameters via the update current span function:
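(A sketch, assuming the decorator parameter is named metric_collection; "RAG Metrics" is a hypothetical single-turn collection that already exists on Confident AI.)
```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

@observe(metric_collection="RAG Metrics")  # hypothetical collection name
def retrieve(query: str) -> list[str]:
    chunks = ["Our refund window is 30 days."]  # stand-in for a real retriever
    # Metrics in "RAG Metrics" run against this span's test case data
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=chunks[0],
            retrieval_context=chunks,
        )
    )
    return chunks

@observe()  # parent span with no metric collection, so no span-level evals here
def answer(query: str) -> str:
    context = retrieve(query)
    return f"Based on our policy: {context[0]}"

answer("What is the refund policy?")
```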
If a metric collection name doesn’t match any collection on Confident AI, it will fail silently. Make sure the names align exactly (watch out for trailing spaces).
Evaluate Traces Online
Similar to spans, but use the update current trace function to set test case parameters. The metric collection must be set on the root-level (outermost) span.
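(A sketch with the same caveats as above; update_current_trace is assumed to accept a test_case argument, and "Trace Metrics" is a hypothetical collection name.)
```python
from deepeval.tracing import observe, update_current_trace
from deepeval.test_case import LLMTestCase

# The metric collection sits on the root-level (outermost) span and is used
# for trace-level evaluations.
@observe(metric_collection="Trace Metrics")  # hypothetical collection name
def llm_app(query: str) -> str:
    output = f"Here is what I found for: {query}"  # stand-in for your real pipeline
    # Set test case parameters on the trace itself rather than the current span
    update_current_trace(test_case=LLMTestCase(input=query, actual_output=output))
    return output

llm_app("How do I reset my password?")
```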
If you specify a metric collection but don’t provide sufficient test case parameters for a metric, it will show up as an error on Confident AI but won’t block or cause issues in your code.
Run Evals Offline
You can also trigger evaluations on traces and spans that have already been ingested. This is useful for re-evaluating with new metrics or running evals on historical data.
Evaluate a trace
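A sketch of what this can look like, assuming evaluate_trace takes the trace's UUID and a collection name; the import path and parameter names are assumptions, so check the DeepEval reference for the exact signature.
```python
from deepeval.tracing import evaluate_trace  # assumed import path

# Re-run a metric collection against a trace already ingested on Confident AI.
# The trace UUID below is a placeholder taken from the dashboard or your logs.
evaluate_trace(
    trace_uuid="your-trace-uuid",
    metric_collection="My Metrics",  # hypothetical collection name
)
```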
a_evaluate_trace is also available. Your trace must already contain the necessary test case parameters; you cannot update them when evaluating retrospectively.
Evaluate a span
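Same idea, sketched with the same assumptions about the import path and parameter names:
```python
from deepeval.tracing import evaluate_span  # assumed import path

# Evaluate a single ingested span that already carries the necessary
# test case parameters.
evaluate_span(
    span_uuid="your-span-uuid",
    metric_collection="My Metrics",  # must be a single-turn collection
)
```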
a_evaluate_span is also available. The metric collection you provide must be a single-turn collection.
Examples
Quick quiz: Given the code below, will Confident AI run online evaluations on the trace using metrics in "Collection 2" or "Collection 1"?
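The original snippet isn't reproduced here, so below is a reconstruction consistent with the answer that follows: the outer (root) function references "Collection 1", the inner function references "Collection 2", and both update calls are made inside the inner function.
```python
from deepeval.tracing import observe, update_current_span, update_current_trace
from deepeval.test_case import LLMTestCase

@observe(metric_collection="Collection 2")
def inner(query: str) -> str:
    output = "Some answer"  # stand-in for a real LLM call
    # Updates the innermost active span (this one)
    update_current_span(test_case=LLMTestCase(input=query, actual_output=output))
    # Always updates the trace, regardless of where it's called
    update_current_trace(test_case=LLMTestCase(input=query, actual_output=output))
    return output

@observe(metric_collection="Collection 1")
def outer(query: str) -> str:
    return inner(query)

outer("Which collection runs where?")
```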
Answer: "Collection 1" runs for the trace, and "Collection 2" runs for the span.
This is because:
- The outer function creates the root span — its metric collection is used for trace-level evals
- The inner function creates a nested span — its metric collection is used for span-level evals
- The update current span call updates the innermost active span (the “inner” span)
- The update current trace call always updates the trace regardless of where it’s called
Next Steps
Now that you can evaluate individual traces and spans, learn how to evaluate entire conversations.