Evaluation Rules | Confident AI Docs

Evaluation rules let you attach a metric collection to incoming traces, spans, or threads without changing your application code. When data is ingested into your project, each enabled rule checks whether the trace, span, or thread matches its conditions and runs the configured metric collection automatically.

Evaluation rules are a no-code alternative to passing metric_collection through the SDK. If your API call already supplies a metric collection, that value takes precedence: rules attach evaluations only when the call does not supply one.

Create an Evaluation Rule

To create an evaluation rule:

1. Navigate to Project Settings → Evaluation Rules
2. Click New Rule
3. Enter a unique Name (and an optional Description)
4. Pick a Data Model: Trace, Span, or Thread
5. (For span rules) Optionally pick a Span Type to restrict the rule to one type of span (e.g., LLM, Tool, Retriever)
6. Pick a Metric Collection to run when the rule matches
7. (Optional) Configure Filters to limit which entities the rule applies to
8. (Optional) Set a Sample Rate between 0 and 1 to evaluate only a fraction of matches
9. (For thread rules) Set a Time Limit in seconds, which is the inactivity window after which the thread is evaluated
10. Toggle Enabled on, then click Create Rule

You can quickly enable or disable a rule from the list view without opening the editor.
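
To make the form fields above concrete, here is a minimal sketch that models a rule as a Python dataclass and checks the constraints described in the steps. The class name, field names, and the default sample rate of 1.0 are assumptions made for illustration. They are not part of the Confident AI SDK or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical model of the rule form; names are illustrative, not a real API.
DATA_MODELS = {"trace", "span", "thread"}


@dataclass
class EvaluationRule:
    name: str
    data_model: str                           # "trace", "span", or "thread"
    metric_collection: str
    description: str = ""
    span_type: Optional[str] = None           # only meaningful for span rules
    filters: List[dict] = field(default_factory=list)
    sample_rate: float = 1.0                  # assumed default: evaluate every match
    time_limit_seconds: Optional[int] = None  # thread rules only
    enabled: bool = True

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the rule looks valid."""
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if self.data_model not in DATA_MODELS:
            problems.append(f"unknown data model: {self.data_model!r}")
        if self.span_type is not None and self.data_model != "span":
            problems.append("span_type only applies to span rules")
        if not 0.0 <= self.sample_rate <= 1.0:
            problems.append("sample_rate must be between 0 and 1")
        # Assumption: a thread rule cannot be saved without a time limit.
        if self.data_model == "thread" and not self.time_limit_seconds:
            problems.append("thread rules need a time limit in seconds")
        return problems
```

For example, a span rule restricted to LLM spans with a sample rate of 0.25 passes validation, while a thread rule with a sample rate of 1.5 and no time limit reports two problems.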

Data Models

Each rule applies to one of three data models. The data model determines what the rule can match against and when evaluations run.

Data Model	When It Runs	Compatible Metric Collections
`Trace`	At ingest, on each incoming trace	Single-turn
`Span`	At ingest, on each incoming span	Single-turn
`Thread`	After the thread has been idle for the time limit	Multi-turn

For span rules, you can additionally restrict the rule to a specific Span Type. Leave the span type blank to match every span regardless of type.

Only one enabled Thread rule can target a given metric collection at a time—Confident AI prevents duplicate thread rules so each conversation isn’t evaluated multiple times for the same metrics.
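
The compatibility table and the duplicate-thread-rule constraint can be sketched as two small checks. This is an illustrative model only: the dictionary keys and function names are hypothetical, and the platform's actual validation may differ.

```python
from typing import List

# Hypothetical lookup mirroring the table above.
REQUIRED_KIND = {
    "trace": "single-turn",
    "span": "single-turn",
    "thread": "multi-turn",
}


def is_compatible(data_model: str, collection_kind: str) -> bool:
    """True if a metric collection of this kind can be attached to the data model."""
    return REQUIRED_KIND.get(data_model) == collection_kind


def conflicts_with_existing_thread_rule(new_rule: dict, existing_rules: List[dict]) -> bool:
    """True if an enabled thread rule already targets the same metric collection."""
    if new_rule["data_model"] != "thread" or not new_rule["enabled"]:
        return False
    return any(
        rule["data_model"] == "thread"
        and rule["enabled"]
        and rule["metric_collection"] == new_rule["metric_collection"]
        for rule in existing_rules
    )
```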

Filters

Filters narrow down which traces, spans, or threads a rule applies to. Use filters to scope a rule to specific environments, tags, metadata fields, or other dimensions of your data.

For example, you can configure a Trace rule that only fires when metadata.env equals cloud, or a Span rule that only matches LLM spans whose latency exceeds a threshold.

Leave Filters empty to match every entity for the chosen data model.
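
As a rough illustration of how filters like the ones above could be evaluated, the sketch below resolves dotted field paths such as metadata.env and applies a comparison. The filter shape, the operator names (equals, greater_than), and the assumption that all filters must pass together are hypothetical. Filters are configured in the UI, and the platform's real operators may differ.

```python
from typing import Any, List


def get_path(entity: dict, path: str) -> Any:
    """Resolve a dotted path such as 'metadata.env'; returns None if missing."""
    current: Any = entity
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical operators, chosen to cover the two examples in the text.
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "greater_than": lambda actual, expected: actual is not None and actual > expected,
}


def matches(entity: dict, filters: List[dict]) -> bool:
    """An empty filter list matches everything; otherwise every filter must pass."""
    return all(
        OPERATORS[f["op"]](get_path(entity, f["field"]), f["value"])
        for f in filters
    )


# A trace rule scoped to metadata.env == "cloud".
trace = {"metadata": {"env": "cloud"}}
env_filter = [{"field": "metadata.env", "op": "equals", "value": "cloud"}]
print(matches(trace, env_filter))
```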

Sample Rate

Sample rate controls what fraction of matching entities a rule actually evaluates. A rule with a sample rate of 0.25 evaluates roughly one in four matches. Sampling is deterministic, so the same trace, span, or thread always receives the same sampling decision for a given rule.

This is useful when you want signal on metric trends without paying to evaluate every single ingested item.
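
One common way to get deterministic sampling is to hash the entity and rule identifiers into a bucket between 0 and 1 and compare it with the sample rate. The sketch below shows that technique. It is an assumption about how such sampling could work, not a description of Confident AI's implementation, and the function and parameter names are hypothetical.

```python
import hashlib


def should_sample(entity_id: str, rule_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether this entity is evaluated under this rule."""
    digest = hashlib.sha256(f"{rule_id}:{entity_id}".encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


# The same (entity, rule) pair always yields the same decision.
first = should_sample("trace-123", "rule-a", 0.25)
second = should_sample("trace-123", "rule-a", 0.25)
print(first == second)
```

Because the rule identifier is part of the hash input, two rules with the same sample rate select different subsets of entities in this sketch.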

Time Limit (Thread Rules)

Thread rules use a Time Limit (in seconds) to decide when a multi-turn conversation is “done” and ready to evaluate. After the rule’s filters match a trace in a thread, Confident AI waits the configured number of seconds—if no new traces arrive in that window, the thread is evaluated using the rule’s metric collection.

If new traces continue to arrive, the wait window resets. This lets you evaluate threads automatically without having to call the evaluate thread function from your code.
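
The inactivity window can be pictured as a timer that restarts on every new trace. The following sketch models that behavior with explicit timestamps. The class and method names are hypothetical, and it only illustrates the reset logic described above.

```python
from typing import Optional


class ThreadIdleTimer:
    """Tracks when a thread becomes eligible for evaluation (hypothetical helper)."""

    def __init__(self, time_limit_seconds: float):
        self.time_limit_seconds = time_limit_seconds
        self.last_trace_at: Optional[float] = None

    def on_trace(self, now: float) -> None:
        # Each new trace restarts the wait window.
        self.last_trace_at = now

    def is_ready(self, now: float) -> bool:
        # Ready once the thread has been idle for the full time limit.
        if self.last_trace_at is None:
            return False
        return now - self.last_trace_at >= self.time_limit_seconds


timer = ThreadIdleTimer(time_limit_seconds=300)
timer.on_trace(now=0)      # first trace starts the window
timer.on_trace(now=200)    # a later trace resets it
print(timer.is_ready(now=450), timer.is_ready(now=500))
```

With a 300-second limit and the last trace at t=200, the thread is not yet ready at t=450 and becomes ready at t=500.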

Interaction With API Metric Collections

When a trace or span is ingested with metric_collection set via the SDK, that value wins—the API path runs the metrics you supplied and rules do not add extra collections to that item. Rules only attach evaluations when the API call did not supply a metric collection.
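
The precedence described above reduces to a simple decision, sketched here with a hypothetical helper. The function name and argument shapes are assumptions for illustration, not part of the SDK.

```python
from typing import List, Optional


def collections_to_run(
    api_metric_collection: Optional[str],
    matched_rule_collections: List[str],
) -> List[str]:
    """Return the metric collections to run for one trace or span."""
    if api_metric_collection:
        # The value supplied via the SDK wins; rules add nothing.
        return [api_metric_collection]
    # No API value: every matching rule contributes its collection.
    return list(matched_rule_collections)
```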

For threads, evaluation rules are the way to evaluate threads automatically—there is no equivalent inline parameter on traces that triggers a thread evaluation. Threads can still be evaluated explicitly via the evaluate thread function.

Online Evaluations

Run metric collections on traces and spans inline from your SDK, in addition to (or instead of) using evaluation rules.

Evaluate Threads

Run multi-turn metrics on entire conversations once they’re complete.