Today we're launching AI Observability Workflows on Confident AI: a graph-based interface for managing everything that happens after your traces, spans, and threads hit the platform.
Dataset ingestion, queue ingestion, evaluation rules, and classifiers aren't new; you've been able to auto-ingest data and run online evals for a while. What's new is that they now live in one place — a graph-based editor where you stitch these tasks together into a pipeline and define the order they run in.
What is Observability Workflows?
Workflows gives you a single view of your entire post-ingestion pipeline. Incoming traces, spans, and threads sit at the top of a graph, and everything you've configured to run on them — ingestion tasks, evaluation rules, and classifiers — branches off below.
Use the Traces, Spans, and Threads buttons at the top to scope the page to a specific entity type. The graph and every tab update to show only the workflows relevant to that type, so you're always looking at one coherent pipeline rather than four disconnected settings screens.
Below the graph are four tabs — Dataset Ingestion, Queue Ingestion, Evaluation Rules, and Classifiers — each managing one kind of downstream task.
The graph editor: stitch tasks into an order of operations
The graph isn't just a picture of what you've configured — it's how you wire tasks together. Each task is a node, and the order you connect them in is the order they execute when data arrives.
That ordering matters, because tasks build on each other. A few examples:
- Evaluate, then ingest — run an evaluation rule first, then ingest only the traces that scored below your threshold into a dataset, so your golden set fills with real failures instead of noise.
- Classify, then route — classify incoming traces first, then route everything labelled a security risk or negative sentiment straight into an annotation queue for human review.
- Classify, then evaluate — label by use case first, then run the metric collection that actually fits that use case.
You decide the sequence. Drag tasks into the order you want, and Workflows runs them as a pipeline — each step operating on the results of the one before it — rather than as four independent jobs that happen to fire on the same data.
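To make the ordering concrete, here is a minimal Python sketch of the "evaluate, then ingest" example. It illustrates the idea only and is not how Workflows is implemented: the Trace class, the step functions, and the toy length-based metric are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Trace:
    """Hypothetical stand-in for an ingested trace."""
    id: str
    output: str
    score: Optional[float] = None


# A step returns the (possibly annotated) trace, or None to stop the pipeline.
Step = Callable[[Trace], Optional[Trace]]


def evaluate(trace: Trace) -> Optional[Trace]:
    # Toy metric (assumption): longer outputs score higher, capped at 1.0.
    trace.score = min(len(trace.output) / 20, 1.0)
    return trace


def keep_below(threshold: float) -> Step:
    # Pass a trace on only if it scored below the threshold.
    def step(trace: Trace) -> Optional[Trace]:
        failed = trace.score is not None and trace.score < threshold
        return trace if failed else None
    return step


def ingest_into(dataset: List[str]) -> Step:
    # Record the trace id in the target dataset (a plain list here).
    def step(trace: Trace) -> Optional[Trace]:
        dataset.append(trace.id)
        return trace
    return step


def run_pipeline(trace: Trace, steps: List[Step]) -> Optional[Trace]:
    # Each step operates on the result of the one before it.
    current: Optional[Trace] = trace
    for step in steps:
        current = step(current)
        if current is None:
            return None
    return current


golden_ids: List[str] = []
pipeline = [evaluate, keep_below(0.5), ingest_into(golden_ids)]

for t in [Trace("t1", "ok"), Trace("t2", "a long and thorough answer")]:
    run_pipeline(t, pipeline)

print(golden_ids)  # ['t1']
```

Swapping the order of the steps changes the outcome: with ingestion placed before the threshold step, both traces would land in the dataset, which is why the sequence you draw in the graph matters.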
Dataset ingestion
Dataset ingestion tasks continuously ingest matching traces, spans, or threads into a dataset as goldens. Each task runs automatically against incoming data and adds qualifying items without manual curation.
To create one:
- Select Traces, Spans, or Threads at the top of the page
- Open the Dataset Ingestion tab and click New ingestion task
- In the side drawer, pick the target dataset, set filters, and name the task
- Save
Each row shows the task's name, target dataset, data model, and golden count. Use the toggle to pause a task without deleting it, or the edit and delete icons to manage it.
Queue ingestion
Queue ingestion tasks route matching traces, spans, or threads into an annotation queue for human review — so the examples worth a closer look reach reviewers automatically.
To create one:
- Select Traces, Spans, or Threads
- Open the Queue Ingestion tab and click New ingestion task
- Choose the target annotation queue and configure the task in the side drawer
- Save
Rows show the task name, target queue, data model, and how many items have been ingested so far. Toggle, edit, and delete work exactly as they do for dataset ingestion.
Evaluation rules
Evaluation rules run a metric collection on incoming data at ingest time — no code changes required. They're a no-code complement to inline evaluation: a rule only fires when the SDK call that produced the data didn't already pass a metric_collection. If the SDK supplies one, that value wins and the rule is skipped.
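The precedence described above can be sketched in a few lines of Python. This is a sketch of the stated behavior under simple assumptions, not the platform's code: the function name and its two arguments are hypothetical, and None stands for "nothing was supplied".

```python
from typing import Optional


def resolve_metric_collection(
    sdk_value: Optional[str], rule_value: Optional[str]
) -> Optional[str]:
    """Pick which metric collection runs on an item (hypothetical helper)."""
    # An inline value passed through the SDK wins, and the rule is skipped.
    if sdk_value is not None:
        return sdk_value
    # Otherwise a matching evaluation rule, if any, supplies the collection.
    return rule_value


print(resolve_metric_collection("inline-collection", "rule-collection"))  # inline-collection
print(resolve_metric_collection(None, "rule-collection"))  # rule-collection
```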
To create a rule, open the Evaluation Rules tab, click New rule, and configure it in the side drawer. The key fields:
- Data Model: Trace, Span, or Thread. Trace and span rules run at ingest on each item and require a single-turn collection; thread rules require a multi-turn collection and run once a thread has been idle for a configurable Time Limit (default 300s).
- Metric Collection: the collection to run on matching items.
- Filters: scope the rule by environment, tags, metadata, latency, and more. Leave empty to match everything.
- Sample Rate: the fraction of matching items the rule fires on (0.0–1.0). Sampling is deterministic, so the same item always makes the same decision.
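One common way to get deterministic sampling is to hash a stable item identifier into a fixed range and compare it against the sample rate. The sketch below shows that general technique in Python. It is an assumption about how such a decision could be made, not a description of how Confident AI does it, and the should_sample function and its item_id argument are hypothetical.

```python
import hashlib


def should_sample(item_id: str, sample_rate: float) -> bool:
    """Return the same decision every time for the same item and rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # Hash the id and read the first 8 bytes as an integer in [0, 2**64).
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # The item is sampled when its bucket falls in the first sample_rate share.
    return bucket < sample_rate * 2**64


# Repeated calls for the same item always agree.
print(should_sample("trace-123", 0.25) == should_sample("trace-123", 0.25))  # True
```

A useful property of this approach is that raising the rate only ever adds items: anything sampled at 0.25 is still sampled at 0.5.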
For threads, evaluation rules are the primary way to evaluate automatically — there's no inline SDK parameter that triggers a thread-level eval. Note that only one enabled thread rule can target a given metric collection at a time.
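The idle condition for thread rules amounts to a timestamp comparison. Here is a minimal Python sketch of that check, assuming a thread counts as idle once the time since its last activity reaches the Time Limit. The is_thread_idle helper is hypothetical, and whether the real check fires exactly at the limit or just after it is an assumption.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_TIME_LIMIT_S = 300  # default Time Limit mentioned in the field list


def is_thread_idle(
    last_activity: datetime,
    now: datetime,
    time_limit_s: int = DEFAULT_TIME_LIMIT_S,
) -> bool:
    """True once the thread has seen no activity for time_limit_s seconds."""
    return now - last_activity >= timedelta(seconds=time_limit_s)


last = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_thread_idle(last, last + timedelta(seconds=120)))  # False
print(is_thread_idle(last, last + timedelta(seconds=300)))  # True
```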
Classifiers
Classifiers assign labels to traces and threads as they're ingested, based on a description and a set of labels you define. Those labels surface as Signals and as filterable dimensions across the Observatory and Dashboards. (Classifiers aren't available for spans — switch to Traces or Threads to manage them.)
There's no rule engine or regex under the hood. When a classifier runs, the model sees the classifier's description, each label's description, and the trace or thread payload, then picks one label or returns "no match." Specificity is everything — concrete label descriptions ("label as Negative if the user expresses frustration or restates the same question after a wrong answer") drive accuracy far more than anything else.
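To show why label descriptions carry so much weight, here is a rough Python sketch of how the three inputs could be assembled into a single prompt and how the answer could be mapped back to a label. Everything in it is hypothetical: the prompt wording, the build_prompt and classify helpers, and the complete callable that stands in for a model call. It is not the prompt Confident AI uses.

```python
from typing import Callable, Dict, Optional

NO_MATCH = "no match"


def build_prompt(description: str, labels: Dict[str, str], payload: str) -> str:
    # The model only sees what is written here, so vague label
    # descriptions leave it with very little to go on.
    label_lines = "\n".join(f"- {name}: {desc}" for name, desc in labels.items())
    return (
        f"Task: {description}\n"
        f"Labels:\n{label_lines}\n"
        f"Payload:\n{payload}\n"
        f'Answer with exactly one label name, or "{NO_MATCH}".'
    )


def classify(
    description: str,
    labels: Dict[str, str],
    payload: str,
    complete: Callable[[str], str],
) -> Optional[str]:
    """Return one known label, or None when the model reports no match."""
    answer = complete(build_prompt(description, labels, payload)).strip()
    # Anything that is not a defined label is treated as "no match".
    return answer if answer in labels else None


labels = {
    "Negative": "the user expresses frustration or restates the same "
                "question after a wrong answer",
    "Positive": "the user confirms the answer solved their problem",
}


def fake_model(prompt: str) -> str:
    # Stub in place of a real model call, for illustration only.
    return "Negative"


print(classify("Classify user sentiment", labels, "That is still wrong.", fake_model))  # Negative
```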
A few things worth knowing:
- Generate Labels proposes a starter set from your recent data via a summarize → cluster → label pipeline, so you don't have to define a taxonomy from scratch. Recommended labels show Accept / Decline actions.
- Auto Classify lets a classifier propose new labels when none of your existing ones fit — leave it on to keep discovering edge cases, off for a fixed taxonomy.
- Sample Rate is a project-wide setting controlling what fraction of incoming items get classified, and each classification logs a usage event you can track under Project Settings → Data Usage.
Get started
Workflows is live on Confident AI now.
- Read the Workflows documentation for the full reference on every tab, field, and option
- Open the Workflows page in your project to see the graph for your own traces, spans, and threads
And keep an eye on the blog this week — we've got more shipping.

