Evaluate Traces & Spans
Online evaluations let you run metrics on traces and spans on-the-fly as they’re ingested into Confident AI, giving you real-time production monitoring of your AI’s quality.
You can also trigger evaluations retrospectively on historical traces and spans.
For evaluating multi-turn conversations (threads), see Evaluate Threads.
Prefer not to set metric_collection from your SDK? Configure Evaluation
Rules in Project Settings to
automatically run a metric collection on incoming traces and spans that match
your filters.
Online evaluations for traces and spans are driven by a metric collection, which you reference by name through the metric_collection parameter: on the observe decorator/wrapper for span-level evals, or via update_current_trace for trace-level evals.
Only referenceless metrics in your metric collection will run during tracing. Referenceless metrics can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools). Non-referenceless metrics are silently skipped.
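The skipping rule above can be sketched in plain Python. This is a toy model, not the platform's implementation: the collection contents and the required-parameter lists are illustrative assumptions, and only the two reference fields named in the text are treated as references.

```python
# Reference fields named in the text; a real platform may treat other fields as references too.
REFERENCE_PARAMS = {"expected_output", "expected_tools"}

# Hypothetical collection: metric name -> test case parameters it needs (illustrative only).
collection = {
    "Answer Relevancy": ["input", "actual_output"],
    "Faithfulness": ["input", "actual_output", "retrieval_context"],
    "Contextual Recall": ["input", "actual_output", "expected_output", "retrieval_context"],
    "Tool Correctness": ["input", "actual_output", "tools_called", "expected_tools"],
}


def split_by_reference(metrics):
    """Return (metrics that would run, metrics that would be skipped) during tracing."""
    runs, skipped = [], []
    for name, required in metrics.items():
        # A metric that needs any reference field is not referenceless.
        if REFERENCE_PARAMS & set(required):
            skipped.append(name)
        else:
            runs.append(name)
    return runs, skipped


runs, skipped = split_by_reference(collection)
```

Under these assumptions, the first two metrics would run online and the last two would be skipped. Consult the DeepEval documentation for each metric's actual requirements.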
To run evaluations, you first need to understand how trace and span parameters map to test case parameters, since test case parameters are what metrics evaluate.
The parameters you pass to the update_current_span or update_current_trace function map directly to the test case parameters that metrics evaluate against.
All parameters are optional — you only need to provide the ones required by the metrics in your collection.
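A minimal sketch of that mapping, written as plain Python. The exact table is an assumption here: it takes the test case side from the field names of DeepEval's LLMTestCase and assumes that the span or trace output becomes actual_output while every other name carries over unchanged. The helper to_test_case_params is hypothetical.

```python
# Assumed mapping: span/trace parameter -> test case parameter.
SPAN_TO_TEST_CASE = {
    "input": "input",
    "output": "actual_output",
    "expected_output": "expected_output",
    "retrieval_context": "retrieval_context",
    "context": "context",
    "tools_called": "tools_called",
    "expected_tools": "expected_tools",
}


def to_test_case_params(span_params):
    """Translate the parameters set on a span or trace into test case parameters.

    Unknown keys and None values are dropped, since every parameter is optional.
    """
    return {
        SPAN_TO_TEST_CASE[key]: value
        for key, value in span_params.items()
        if key in SPAN_TO_TEST_CASE and value is not None
    }


example = to_test_case_params({"input": "What is 2 + 2?", "output": "4", "context": None})
```

Here example holds only input and actual_output, which is enough for a metric that needs just those two fields.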
Each metric requires different test case parameters. For details on what each metric needs, refer to the official DeepEval documentation.
Provide a metric collection on the span’s observe decorator/wrapper and set test case parameters via the update_current_span function.
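A span-level setup might look like the sketch below. It assumes the Python SDK exposes observe and update_current_span from deepeval.tracing and that update_current_span accepts input and output keywords; check both against your installed version. The collection name "My Collection" and the function answer_question are hypothetical, and the fallback definitions exist only so the sketch runs where the SDK is not installed.

```python
try:
    # Assumed import path for the DeepEval tracing helpers.
    from deepeval.tracing import observe, update_current_span
except ImportError:
    # No-op stand-ins so the sketch still runs without the SDK.
    def observe(**_kwargs):
        return lambda fn: fn

    def update_current_span(**_kwargs):
        return None


@observe(metric_collection="My Collection")  # collection evaluated for this span
def answer_question(query: str) -> str:
    answer = f"You asked: {query}"  # placeholder for a real LLM call
    # Supply the test case parameters the metrics in the collection need.
    update_current_span(input=query, output=answer)
    return answer


result = answer_question("What is tracing?")
```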
If a metric collection name doesn’t match any collection on Confident AI, it will fail silently. Make sure the names align exactly (watch out for trailing spaces).
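Because a mismatch fails silently, a small local guard can catch typos before they reach production. The helper below is a hypothetical sketch, not part of any SDK: it compares a name against a list of collection names that you maintain yourself and raises instead of letting the mismatch pass unnoticed.

```python
def check_collection_name(name, known_names):
    """Return name if it matches exactly, otherwise raise with a hint."""
    if name in known_names:
        return name
    # Look for near misses that differ only by surrounding whitespace or letter case.
    close = [k for k in known_names if k.strip().casefold() == name.strip().casefold()]
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"Unknown metric collection {name!r}.{hint}")


KNOWN_COLLECTIONS = ["My Collection"]  # hypothetical list, kept in sync by hand

try:
    check_collection_name("My Collection ", KNOWN_COLLECTIONS)  # trailing space
except ValueError as err:
    message = str(err)
```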
Trace-level evaluation works like span-level evaluation, except that you use the update_current_trace/updateCurrentTrace function to set both the metric collection and the test case parameters on the trace.
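A trace-level setup might look like the sketch below. It assumes observe and update_current_trace are importable from deepeval.tracing and that update_current_trace accepts input and output alongside metric_collection; verify this against your SDK version. The collection name and the function answer_question are hypothetical, and the fallback definitions only keep the sketch runnable without the SDK.

```python
try:
    # Assumed import path for the DeepEval tracing helpers.
    from deepeval.tracing import observe, update_current_trace
except ImportError:
    # No-op stand-ins so the sketch still runs without the SDK.
    def observe(**_kwargs):
        return lambda fn: fn

    def update_current_trace(**_kwargs):
        return None


@observe()  # no span-level collection; evaluation is configured on the trace
def answer_question(query: str) -> str:
    answer = f"You asked: {query}"  # placeholder for a real LLM call
    # Set the collection and the test case parameters on the enclosing trace.
    update_current_trace(
        metric_collection="My Collection",
        input=query,
        output=answer,
    )
    return answer


result = answer_question("What is a trace?")
```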
If you specify a metric collection but don’t provide sufficient test case parameters for a metric, it will show up as an error on Confident AI but won’t block or cause issues in your code.
You can also trigger evaluations on traces and spans that have already been ingested. This is useful for re-evaluating with new metrics or running evals on historical data.
To evaluate an ingested trace, use evaluate_trace; a_evaluate_trace is also available. Your trace must already contain the necessary test case parameters, since you cannot update them when evaluating retrospectively.
To evaluate an ingested span, use evaluate_span; a_evaluate_span is also available. The metric collection you provide must be a single-turn collection.
Quick quiz: Given the code below, which metric collection will Confident AI use for the trace, and which for the span?
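The kind of code the quiz refers to can be sketched with a self-contained toy model. It does not use the DeepEval SDK: observe, update_current_span, and update_current_trace below are simplified stand-ins that only model which object each call touches, and the functions outer and inner are hypothetical.

```python
import functools

TRACES = []                              # finished traces, kept for inspection
_state = {"trace": None, "spans": []}    # open trace and stack of open spans (innermost last)


def observe(metric_collection=None):
    """Toy decorator: opens a span, plus a new trace when it is the root span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            is_root = not _state["spans"]
            if is_root:
                _state["trace"] = {"metric_collection": None, "spans": []}
            span = {"name": fn.__name__, "metric_collection": metric_collection}
            _state["trace"]["spans"].append(span)
            _state["spans"].append(span)
            try:
                return fn(*args, **kwargs)
            finally:
                _state["spans"].pop()
                if is_root:
                    TRACES.append(_state["trace"])
                    _state["trace"] = None
        return wrapper
    return decorator


def update_current_span(**params):
    _state["spans"][-1].update(params)   # innermost active span only


def update_current_trace(**params):
    _state["trace"].update(params)       # the trace, whichever span is active


@observe()
def outer(query):
    return inner(query)


@observe(metric_collection="Collection 2")
def inner(query):
    answer = f"echo: {query}"
    update_current_trace(metric_collection="Collection 1", input=query, output=answer)
    update_current_span(input=query, output=answer)
    return answer


outer("hello")
trace = TRACES[-1]
spans = {span["name"]: span for span in trace["spans"]}
```

Inspecting trace["metric_collection"] and spans["inner"]["metric_collection"] after the call shows which collection ends up on the trace and which on the span in this model.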
Answer: "Collection 1" runs for the trace, and "Collection 2" runs for the span.
This is because:
The trace’s metric collection is set via update_current_trace(metric_collection=...), which can be called from any span.
The observe decorator sets "Collection 2" as the metric collection for that span.
The update_current_span call updates the innermost active span (the “inner” span).
The update_current_trace call always updates the trace, regardless of where it’s called.
Now that you can evaluate individual traces and spans, learn how to evaluate entire conversations.