Online & Offline Evaluations
Overview
You can run evaluations during LLM tracing by running metrics on individual:
- Spans,
- Traces, and
- Threads
For online evaluations, you can do this by providing the name of the metric collection you’ve created on Confident AI in the @observe decorators during tracing. For offline evaluations, you can call the evaluate_thread function to manually trigger an evaluation.
Prerequisites
Create Metric Collections
To enable referenceless metrics to run, you need to create a metric collection on Confident AI. Your metric collection determines the set of metrics your evaluations will run.
If you’re planning to run offline evals on threads, create a multi-turn metric collection; for spans and traces, create a single-turn one. This is really important because the metric collection names need to match, as you’ll see in the next section.
Only the referenceless metrics you’ve enabled inside the metric collection are runnable upon tracing. Confident AI will simply ignore non-referenceless metrics.
To recap from the referenceless metrics section: referenceless metrics are a special type of metric that can evaluate your LLM’s performance without requiring reference data (like expected_output or expected_tools).
Online Evaluations
Online evaluations refer to evals that are run in real-time, as traces are ingested and sent to Confident AI.
Spans
Simply provide the name of the metric collection in the @observe decorator to tell Confident AI the specific set of referenceless metrics you wish to run.
You’ll also need to use update_current_span with an LLMTestCase at runtime to actually trigger an online evaluation on the server-side:
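The original code tabs are omitted here, but a minimal Python sketch looks roughly like the following, assuming the decorator accepts the collection name via a metric_collection keyword (the Python spelling of the metricCollection argument described below):

```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

# "My Span Metrics" is a placeholder for a single-turn metric collection
# you've created on Confident AI
@observe(metric_collection="My Span Metrics")
def generate(query: str) -> str:
    answer = "..."  # replace with your actual LLM call

    # Supplying an LLMTestCase at runtime is what triggers the online
    # evaluation for this span on the server side
    update_current_span(
        test_case=LLMTestCase(input=query, actual_output=answer)
    )
    return answer
```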
The metricCollection argument is an optional string that determines which metrics in your online metric collection will be run for the current span.
Supplying a metric name in metrics that doesn’t exist or isn’t activated on Confident AI will result in it failing silently. If metrics aren’t showing up on the platform, make sure the names align perfectly. (PS. Watch out for trailing spaces!)
If you specify a metricCollection but don’t update your current span with sufficient test case parameters for metric execution, it will simply show up as an error on Confident AI, and won’t block or cause issues in your code.
Traces
Running evals on traces is akin to running end-to-end evals, where you disregard the performance of individual spans within the trace and treat your application as a black box.
Similar to evals for spans, you would also provide a metricCollection name, but this time call the update_current_trace() function instead:
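Again, the original tabs are omitted; the sketch below is a minimal Python version, assuming the root @observe decorator takes the collection name via a metric_collection keyword:

```python
from deepeval.tracing import observe, update_current_trace
from deepeval.test_case import LLMTestCase

# The collection name is defined on the root span; "My Trace Metrics" is a
# placeholder for a single-turn metric collection on Confident AI
@observe(metric_collection="My Trace Metrics")
def llm_app(query: str) -> str:
    answer = "..."  # replace with your actual application logic

    # update_current_trace updates the trace as a whole, and can be called
    # anywhere inside the observed application
    update_current_trace(
        test_case=LLMTestCase(input=query, actual_output=answer)
    )
    return answer
```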
Also note that unlike evals on spans, the metricCollection MUST BE DEFINED on the top-level/root span. You can call update_current_trace anywhere in your observed application though.
Offline Evaluations
Offline evaluations refer to evals that are run retrospectively, after traces are ingested into Confident AI.
Threads
Thread evaluations are always offline, because you should only run evals on a thread once a multi-turn interaction has completed.
Since it is impossible for Confident AI to automatically know whether a multi-turn conversation has completed or not, you’ll have to trigger an offline evaluation using the evaluate_thread() method ONLY once you’re certain a conversation has completed:
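A minimal sketch, assuming evaluate_thread is importable from deepeval.tracing and accepts a thread identifier plus a metric collection name (both parameter names here are illustrative):

```python
from deepeval.tracing import evaluate_thread  # assumed import path

# Call only once you're certain the conversation has completed;
# "My Multi-Turn Metrics" is a placeholder for a multi-turn metric collection
evaluate_thread(
    thread_id="your-thread-id",
    metric_collection="My Multi-Turn Metrics",
)
```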
You can also use a_evaluate_thread, the async version of evaluate_thread():
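A sketch under the same assumptions, awaited from async code:

```python
import asyncio

from deepeval.tracing import a_evaluate_thread  # assumed import path

async def main() -> None:
    # Same illustrative parameters as the synchronous version above
    await a_evaluate_thread(
        thread_id="your-thread-id",
        metric_collection="My Multi-Turn Metrics",
    )

asyncio.run(main())
```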
You MUST set the input/output of individual traces in a thread for multi-turn evaluations to work. To recap, DeepEval uses the input of a trace as the "user" role content and the output of a trace as the "assistant" role content as turns in your thread.
If you don’t set the input and/or output, Confident AI will have nothing in your thread to evaluate.
You can also set the retrieval context and tools called (if any) for each assistant/AI turn as part of the output setting, which allows you to see the context in which an LLM generation was invoked:
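A hedged sketch of what this could look like, assuming update_current_trace accepts retrieval_context and tools_called keywords alongside output (see the threads section for the exact parameters):

```python
from deepeval.test_case import ToolCall
from deepeval.tracing import observe, update_current_trace

@observe()
def chat_turn(user_message: str) -> str:
    answer = "..."  # replace with your actual LLM call

    # Keyword names below are assumptions; they attach the retrieval context
    # and tool calls to this assistant turn's output
    update_current_trace(
        input=user_message,
        output=answer,
        retrieval_context=["Relevant chunk #1", "Relevant chunk #2"],
        tools_called=[ToolCall(name="search")],
    )
    return answer
```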
To learn more, visit the threads section.
Trace
Although less common since you can run online evals on traces, you can also run single-turn, offline evals on traces retrospectively:
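A minimal sketch, assuming evaluate_trace takes a trace identifier plus a single-turn metric collection name (parameter names are illustrative):

```python
from deepeval.tracing import evaluate_trace  # assumed import path

evaluate_trace(
    trace_uuid="your-trace-uuid",
    metric_collection="My Trace Metrics",
)
```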
The asynchronous version, a_evaluate_trace, is also available for traces.
Note that your trace must already contain the necessary test case parameters, as you cannot update them using the evaluate_trace function.
Span
The same offline evals can also be run in the same way for spans:
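A minimal sketch under the same assumptions as above, with an illustrative span identifier parameter:

```python
from deepeval.tracing import evaluate_span  # assumed import path

evaluate_span(
    span_uuid="your-span-uuid",
    metric_collection="My Span Metrics",
)
```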
The asynchronous version, a_evaluate_span, is also available for spans.
Remember, the metric collection you provide must be a single-turn one.
Examples
Trace and span evals
Quick quiz: Given the code below, will Confident AI run online evaluations on the trace using metrics in "Collection 2" or "Collection 1"?
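The original snippet isn’t reproduced here; the sketch below mirrors the setup described in the explanation that follows, with "Collection 1" on outer_function (the root span) and "Collection 2" on inner_function:

```python
from deepeval.tracing import observe, update_current_span, update_current_trace
from deepeval.test_case import LLMTestCase

@observe(metric_collection="Collection 2")
def inner_function(query: str) -> str:
    answer = "..."  # replace with your actual LLM call
    update_current_span(
        test_case=LLMTestCase(input=query, actual_output=answer)
    )
    return answer

@observe(metric_collection="Collection 1")
def outer_function(query: str) -> str:
    answer = inner_function(query)
    update_current_trace(
        test_case=LLMTestCase(input=query, actual_output=answer)
    )
    return answer
```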
Answer: This will run "Collection 1" for traces, and "Collection 2" for spans.
This is because in this example:
- When outer_function starts, it creates the “outer” span
- When inner_function is called, it creates the “inner” span on top
- Any calls to update_current_span() during inner_function’s execution will update the “inner” span, not the “outer” one
- Any calls to update_current_trace() at any point inside outer_function will update the entire trace, and online evals for traces MUST BE SET on the root-level span
Thread and trace evals
Quick quiz: Given the code below, will Confident AI run online evaluations on the thread using metrics in "Collection 2" or "Collection 1"?
Answer: This will NOT run "Collection 1" or "Collection 2", because neither the input nor the output has been specified in update_current_trace. This means Confident AI will have no turns to evaluate using metrics in your metric collection.
Note that setting the test_case for a trace has no bearing on the input and output.