Workflows gives you a single view of the entire post-ingestion pipeline for your traces, spans, and threads — dataset ingestion tasks, queue ingestion tasks, evaluation rules, and classifiers — visualised as a graph and managed through a set of tabs below it.
Use the Traces, Spans, and Threads buttons at the top to scope the graph and all tabs to a specific entity type. Everything on the page updates to show only the workflows relevant to that type.
Dataset ingestion tasks continuously ingest matching traces, spans, or threads into a dataset as goldens. Each task runs automatically against incoming data and adds qualifying items to the target dataset without manual intervention.
To create a dataset ingestion task:
Each task row shows its name, target dataset, data model, and golden count. Use the toggle to enable or disable a task without deleting it. Click the edit icon to update its configuration, or the delete icon to remove it permanently.
Queue ingestion tasks continuously route matching traces, spans, or threads into an annotation queue for human review. Use these to automatically populate queues with data that meets specific criteria.
To create a queue ingestion task:
Each task row shows its name, target queue, data model, and how many items have been ingested so far. The toggle, edit, and delete controls work the same way as for dataset ingestion tasks.
Evaluation rules automatically run a metric collection on incoming traces, spans, or threads at ingest time — without any code changes. They fire only when the SDK call that produced the data did not already supply a metric collection, making them a no-code complement to inline evaluation.
If your SDK call already passes metric_collection, that value wins and the rule is skipped for that item.
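The precedence between an SDK-supplied metric collection and an evaluation rule can be sketched as a small function. This is an illustration only: `resolve_metric_collection` and its parameters are hypothetical names, not part of the SDK, and the sketch assumes that "not supplied" means the value is `None`.

```python
from typing import Optional


def resolve_metric_collection(
    sdk_value: Optional[str], rule_value: Optional[str]
) -> Optional[str]:
    """Pick the metric collection to run for one ingested item.

    Hypothetical helper: models the documented precedence only.
    """
    if sdk_value is not None:
        # The SDK call supplied a collection, so the rule is skipped.
        return sdk_value
    # No inline collection: fall back to the matching rule, if any.
    return rule_value


print(resolve_metric_collection("inline-collection", "rule-collection"))
print(resolve_metric_collection(None, "rule-collection"))
```

The first call keeps the inline value; the second falls back to the rule's collection.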
To create an evaluation rule:
Filters narrow which traces, spans, or threads a rule applies to. Filters can target environment, tags, metadata fields, latency, and other dimensions. Filter tabs for eval metrics, annotations, and signals are not available in rules — those dimensions don’t exist at ingest time.
Leave Filters empty to match every entity for the chosen data model.
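One way to picture why an empty filter set matches everything: if a rule applies only when every filter condition holds, then a rule with no conditions has nothing that can fail. The sketch below assumes a simplified, hypothetical filter shape (a dict of field name to required value, equality only); the actual filter builder supports more operators than this.

```python
def matches(entity: dict, filters: dict) -> bool:
    """Return True when the entity satisfies every filter condition.

    Hypothetical model: filters map a field name to a required value.
    all() over an empty iterable is True, so empty filters match everything.
    """
    return all(entity.get(field) == value for field, value in filters.items())


trace = {"environment": "production", "tag": "checkout"}

print(matches(trace, {}))                              # no filters
print(matches(trace, {"environment": "production"}))   # one matching filter
print(matches(trace, {"environment": "staging"}))      # one failing filter
```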
For threads, evaluation rules are the primary way to run evaluations automatically — there is no equivalent inline SDK parameter that triggers a thread-level evaluation. Threads can still be evaluated explicitly via the Evaluate Threads function if needed.
Only one enabled thread rule can target a given metric collection at a time. If another enabled thread rule already targets the same collection, you cannot enable a second one until the first is disabled.
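The uniqueness constraint on thread rules can be modelled as a check that runs before a rule is enabled. This is a minimal sketch under stated assumptions: `ThreadRule` and `can_enable` are hypothetical names used for illustration, not platform APIs.

```python
from dataclasses import dataclass


@dataclass
class ThreadRule:
    # Hypothetical representation of a thread evaluation rule.
    name: str
    metric_collection: str
    enabled: bool = False


def can_enable(candidate: ThreadRule, rules: list) -> bool:
    """Allow enabling only if no other enabled rule targets the same collection."""
    return not any(
        other.enabled
        and other is not candidate
        and other.metric_collection == candidate.metric_collection
        for other in rules
    )


active = ThreadRule("support-quality", "Support Metrics", enabled=True)
draft = ThreadRule("support-quality-v2", "Support Metrics")
rules = [active, draft]

print(can_enable(draft, rules))   # blocked while `active` is enabled
active.enabled = False
print(can_enable(draft, rules))   # allowed once the conflict is disabled
```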
Classifiers assign labels to traces and threads as they are ingested, based on a description and a set of labels you define. The labels they produce surface as Signals and as filterable dimensions across the Observatory and Dashboards.
Classifiers are not available for Spans. Switch to the Traces or Threads tab to see and manage classifiers.
When a classifier runs, the underlying LLM receives:
The model picks one label or returns “no match.” There is no rule engine, no metadata-based pre-filtering, and no regex — everything depends on how the descriptions read against the data.
Specificity matters. Vague label descriptions yield vague labels. Concrete examples in each description (e.g. “label as Negative if the user expresses frustration, gives up, or restates the same question because of a wrong answer”) drive accuracy more than any other lever.
To create a classifier:
After creating a classifier, click the edit icon on its row to open the classifier editor in a side drawer. This is where you manage labels and configure generation settings.
Each classifier has one or more labels. Add labels manually with New Label (Name + Description) or auto-suggest them in bulk via Generate Labels. Each label has its own enable toggle — disabled labels are not assigned to new items but remain in the classifier’s history.
If you don’t yet know what labels you need, Generate Labels proposes a set from your recent traces or threads. Click Configure Generation first to set the prompt and clustering parameters, then Generate Labels to run the three-stage pipeline:
Recommended labels show Accept (✓) and Decline (✕) actions instead of the regular edit menu. Accepted labels become regular labels and start running on the next ingestion tick. Declined labels are deleted. Re-running generation while recommendations are still pending discards the old ones first.
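The lifecycle of recommended labels (accept, decline, and re-running generation) can be summarised as a small state model. The class and method names below are hypothetical and exist only to illustrate the behaviour described above; they do not correspond to any SDK or API.

```python
from dataclasses import dataclass, field


@dataclass
class Label:
    name: str
    recommended: bool = False  # True while awaiting Accept or Decline


@dataclass
class Classifier:
    labels: list = field(default_factory=list)

    def accept(self, name: str) -> None:
        # An accepted recommendation becomes a regular label.
        for label in self.labels:
            if label.name == name and label.recommended:
                label.recommended = False

    def decline(self, name: str) -> None:
        # A declined recommendation is deleted.
        self.labels = [
            l for l in self.labels if not (l.name == name and l.recommended)
        ]

    def regenerate(self, proposed: list) -> None:
        # Pending recommendations are discarded before new ones are added.
        self.labels = [l for l in self.labels if not l.recommended]
        self.labels += [Label(name, recommended=True) for name in proposed]


clf = Classifier([Label("Billing")])
clf.regenerate(["Refund request", "Login issue"])
clf.accept("Refund request")
clf.regenerate(["Shipping delay"])  # drops the still-pending "Login issue"
print([(l.name, l.recommended) for l in clf.labels])
```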
Generation configuration (summary prompt and number of clusters) must be saved before the Generate Labels button becomes active.
The Auto Classify toggle in the classifier editor is separate from the top-level Enabled toggle:
Leave Auto Classify on if you want to keep discovering edge cases; turn it off if you want a fixed taxonomy.
The Sample Rate below the classifier list controls what fraction of incoming traces (or threads) are sent for classification — 1.0 classifies everything, 0.1 classifies roughly one in ten. This is a project-wide setting shared across all enabled classifiers for that data model.
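To build intuition for what a sample rate means, the sketch below models sampling as an independent random draw per item. This is an assumption made for illustration: the platform's actual sampling mechanism is not described here, and `should_classify` is a hypothetical name.

```python
import random


def should_classify(sample_rate: float, rng: random.Random) -> bool:
    """Decide whether one incoming item is sent for classification.

    Assumes independent random sampling; random() returns a value in
    [0.0, 1.0), so a rate of 1.0 always passes and 0.0 never does.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    return rng.random() < sample_rate


rng = random.Random(42)  # seeded so repeated runs give the same count
sampled = sum(should_classify(0.1, rng) for _ in range(10_000))
print(f"classified {sampled} of 10000 items")
```

Because the draw is random, a rate of 0.1 yields roughly, not exactly, one classified item in ten.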
For thread classifiers, Time Limit defines how many seconds of inactivity must pass before a thread is eligible for classification. The classification runs once no new trace has arrived for that period. Set this long enough that follow-up turns have stopped arriving, but not so long that you miss the conversation window.
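The inactivity check behind Time Limit can be sketched as a comparison between the time since the last trace and the configured limit. The function name, the use of epoch seconds, and the inclusive `>=` comparison are all assumptions made for this illustration.

```python
def eligible_for_classification(
    last_trace_at: float, now: float, time_limit_s: float
) -> bool:
    """Return True once the thread has been idle for at least time_limit_s.

    Hypothetical helper: timestamps are seconds since the epoch.
    """
    return now - last_trace_at >= time_limit_s


last_trace_at = 1_000.0
time_limit_s = 300.0  # five minutes of inactivity

print(eligible_for_classification(last_trace_at, 1_120.0, time_limit_s))  # idle 2 min
print(eligible_for_classification(last_trace_at, 1_300.0, time_limit_s))  # idle 5 min
```

A new trace arriving in the thread would move `last_trace_at` forward and restart the wait.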
Each classification logs a usage event for billing. For live usage and projected cost, see the Signals line under Project Settings → Data Usage.