Evaluate Traces & Spans
Overview
Online evaluations let you run metrics on traces and spans on-the-fly as they’re ingested into Confident AI, giving you real-time production monitoring of your AI’s quality.
You can also trigger evaluations retrospectively on historical traces and spans.
For evaluating multi-turn conversations (threads), see Evaluate Threads.
How It Works
Online evaluations for traces and spans follow these steps:
- You create a metric collection on Confident AI with the single-turn metrics you want to run.
- You reference that metric collection by name via the `metric_collection` parameter: on the `observe` decorator/wrapper for span-level evals, or via `update_current_trace` for trace-level evals.
- Inside your observed function, you set test case parameters on the span or trace using the `update_current_span` or `update_current_trace` function.
- When the trace is sent to Confident AI, it runs the metrics in your collection against the test case data you provided.
- Results appear on the trace/span in the Confident AI dashboard.
Only referenceless metrics in your metric collection will run during tracing. Referenceless metrics can evaluate your LLM's performance without requiring reference data (like `expected_output` or `expected_tools`). Non-referenceless metrics are silently skipped.
Map Test Case Parameters
To run evaluations, you first need to understand how trace and span parameters map to test case parameters, which are what metrics evaluate against. The parameters you pass to the `update_current_span` or `update_current_trace` function map directly to the test case parameters that metrics evaluate against.
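As a rough sketch (field names are assumed from DeepEval's `LLMTestCase`; check the DeepEval documentation for the authoritative list), the mapping can be pictured as a simple lookup:

```python
# Illustrative only: how span/trace parameters are assumed to line up with
# DeepEval test case parameters (names taken from deepeval's LLMTestCase).
span_param_to_test_case_param = {
    "input": "input",                          # user input to your LLM app
    "output": "actual_output",                 # the generated response
    "retrieval_context": "retrieval_context",  # retrieved chunks, for RAG metrics
    "expected_output": "expected_output",      # reference data (not referenceless)
    "tools_called": "tools_called",            # tools your LLM actually invoked
}

# e.g. the span's "output" feeds the metric's "actual_output" parameter
print(span_param_to_test_case_param["output"])
```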
All parameters are optional — you only need to provide the ones required by the metrics in your collection.
Each metric requires different test case parameters. For details on what each metric needs, refer to the official DeepEval documentation.
Evaluate Spans Online
Provide a metric collection on the span's `observe` decorator/wrapper and set test case parameters via the `update_current_span` function.
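A minimal span-level sketch in Python, assuming DeepEval's tracing API (import paths and parameter names may differ between versions, and the collection name "My Span Metrics" is a placeholder):

```python
from deepeval.tracing import observe, update_current_span

@observe(metric_collection="My Span Metrics")  # placeholder collection name
def generate(query: str) -> str:
    answer = "..."  # replace with your LLM call
    # Map this span's data to test case parameters for the metrics to evaluate
    update_current_span(input=query, output=answer)
    return answer
```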
If the metric collection name doesn't match any collection on Confident AI, evaluation fails silently. Make sure the names match exactly (watch out for trailing spaces).
Evaluate Traces Online
Similar to spans, but use the update_current_trace/updateCurrentTrace function to set both the metric collection and test case parameters on the trace.
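A trace-level sketch under the same assumptions (DeepEval's tracing API; "My Trace Metrics" is a placeholder name):

```python
from deepeval.tracing import observe, update_current_trace

@observe()
def llm_app(query: str) -> str:
    answer = "..."  # replace with your LLM call
    # Set both the metric collection and test case parameters on the trace
    update_current_trace(
        metric_collection="My Trace Metrics",  # placeholder collection name
        input=query,
        output=answer,
    )
    return answer
```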
If you specify a metric collection but don’t provide sufficient test case parameters for a metric, it will show up as an error on Confident AI but won’t block or cause issues in your code.
Run Evals Offline
You can also trigger evaluations on traces and spans that have already been ingested. This is useful for re-evaluating with new metrics or running evals on historical data.
Evaluate a trace
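A sketch of a retrospective trace evaluation, assuming a synchronous `evaluate_trace` helper exists alongside the async variant (the import path, signature, UUID, and collection name are all assumptions):

```python
from deepeval.tracing import evaluate_trace  # import path assumed

# Re-run a metric collection against a trace that was already ingested.
evaluate_trace(
    trace_uuid="...",                      # UUID of the ingested trace
    metric_collection="My Trace Metrics",  # placeholder collection name
)
```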
`a_evaluate_trace` is also available as an async variant. Your trace must already contain the necessary test case parameters; you cannot update them when evaluating retrospectively.
Evaluate a span
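The span-level counterpart, under the same assumptions (synchronous `evaluate_span` mirroring the async variant; import path, signature, and names assumed):

```python
from deepeval.tracing import evaluate_span  # import path assumed

# Re-run a single-turn metric collection against an ingested span.
evaluate_span(
    span_uuid="...",                      # UUID of the ingested span
    metric_collection="My Span Metrics",  # placeholder collection name
)
```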
`a_evaluate_span` is also available as an async variant. The metric collection you provide must be a single-turn collection.
Examples
Quick quiz: Given the code below, which metric collection will Confident AI use for the trace, and which for the span?
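The quiz's original code sample isn't reproduced here; the following Python sketch, assuming DeepEval's tracing API, is a plausible reconstruction consistent with the answer below:

```python
from deepeval.tracing import observe, update_current_span, update_current_trace

@observe(metric_collection="Collection 2")
def inner(query: str) -> str:
    answer = "..."  # replace with your LLM call
    update_current_span(input=query, output=answer)  # updates the inner span
    return answer

@observe()
def outer(query: str) -> str:
    # Sets the trace-level collection, even though it's called from a span
    update_current_trace(metric_collection="Collection 1", input=query)
    return inner(query)
```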
Answer: "Collection 1" runs for the trace, and "Collection 2" runs for the span.
This is because:
- The trace-level metric collection is set via `update_current_trace(metric_collection=...)`, which can be called from any span.
- The inner function's `observe` decorator sets `"Collection 2"` as the metric collection for that span.
- The `update_current_span` call updates the innermost active span (the "inner" span).
- The `update_current_trace` call always updates the trace, regardless of where it's called.
Next Steps
Now that you can evaluate individual traces and spans, learn how to evaluate entire conversations.