
Setting Up Trigger Moments (Online Evals)

Written by Brian Neville-O'Neill, GTM for Developer Tools @ Confident AI

Most teams start evaluating their LLM app in one of two ways: they either run evals only offline against a test set, or they try to evaluate everything in production at once.

Both approaches miss the point. Offline evals don't catch the failures that only show up with real user inputs. And evaluating everything in production generates noise that makes it hard to act on anything.

The better approach is to pick specific moments in your application where evaluations actually matter, set those up first, and expand from there.

We call these trigger moments — the points in your app where an online eval should run.

Start With One Trigger Moment, Not Five

You don't need to instrument your entire pipeline on day one. Start with the single point where failures would hurt the most.

For most teams, that's after generating a response, before returning it to the user. This is the moment where you can catch hallucinations, policy violations, and off-topic answers before they reach anyone.

Set up one eval here. Get it running. Watch the results for a few days. Then expand.
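As a sketch of what that single post-response trigger moment looks like (with stand-in functions for the LLM call and the metric — this is illustrative, not any particular SDK):

```python
def generate_response(user_input: str) -> str:
    """Stand-in for your LLM call (hypothetical)."""
    return f"You asked about {user_input}."

def score_response(user_input: str, response: str) -> float:
    """Stand-in for an online eval metric such as relevancy (hypothetical).
    A real metric would call an LLM judge or a trained scorer."""
    return 1.0 if user_input.lower() in response.lower() else 0.0

flagged: list[tuple[str, str, float]] = []

def handle_request(user_input: str, threshold: float = 0.5) -> str:
    response = generate_response(user_input)
    # Trigger moment: evaluate after generation, before returning to the user.
    score = score_response(user_input, response)
    if score < threshold:
        flagged.append((user_input, response, score))  # surface in your dashboard
    return response
```

The point is where the scoring hook sits, not the toy metric: one check, right after generation, with failures logged somewhere you can watch for a few days.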

👉 Set up your first online eval

The Order We Recommend

Here's the sequence that works for most LLM applications, from highest to lowest priority:

1. After response generation

This is your first line of defense. Run evals on every response (or a sample, depending on volume) to score things like correctness, relevance, and hallucination risk.

If you're only going to set up one trigger moment, make it this one.
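If volume makes evaluating every response too expensive, sampling is the usual compromise. A minimal sketch of a sample-rate gate (the rate and seeding are illustrative choices, not prescribed values):

```python
import random

def should_evaluate(sample_rate: float, rng: random.Random) -> bool:
    """Decide whether to run evals on this response.
    sample_rate=1.0 evaluates everything; 0.1 evaluates roughly 10%."""
    return rng.random() < sample_rate

rng = random.Random(42)  # seeded only so this sketch is reproducible
evaluated = sum(should_evaluate(0.1, rng) for _ in range(10_000))
```

Start at 1.0 while traffic is low, then dial the rate down as volume grows.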

2. After tool execution (if your app uses tools)

If your app calls external tools — APIs, databases, search — evaluate the tool call itself. Was the right tool selected? Were the arguments correct? Did the result actually support the final answer?

This is where agentic apps break in ways that are invisible from the final output alone. A response can look fine even when the underlying tool call was wrong.
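A toy version of those three checks against a tool-call span (the span shape and `allowed_tools` schema are hypothetical; a real app would use its own tool registry):

```python
def evaluate_tool_span(span: dict, allowed_tools: dict) -> dict:
    """Score a tool-call span independently of the final answer (sketch).
    `allowed_tools` maps tool names to their required argument names."""
    name, args = span["tool"], span["args"]
    correct_tool = name in allowed_tools
    args_complete = correct_tool and all(a in args for a in allowed_tools[name])
    return {
        "correct_tool": correct_tool,      # was the right tool selected?
        "args_complete": args_complete,    # were the arguments correct?
        "result_nonempty": bool(span.get("result")),  # did it return anything usable?
    }

span = {"tool": "search_orders", "args": {"order_id": "A123"},
        "result": [{"status": "shipped"}]}
report = evaluate_tool_span(span, {"search_orders": ["order_id"]})
```

Even checks this crude catch the "response looks fine, tool call was wrong" failure mode, because they never look at the final output at all.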

3. On specific traces and spans

Once you have response-level and tool-level evals running, go deeper. Attach evals to individual spans in your traces so you can pinpoint exactly where quality degrades in a multi-step workflow.

This is especially useful for RAG pipelines — you can evaluate retrieval quality separately from generation quality instead of guessing which one caused a bad output.
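To make the RAG case concrete, here are two deliberately crude proxies that score the retrieval span and the generation span separately (real metrics like contextual relevancy and faithfulness would use an LLM judge; the keyword overlap here is only to show the split):

```python
def retrieval_score(question: str, chunks: list[str]) -> float:
    """Toy proxy: fraction of retrieved chunks that touch a question keyword."""
    keywords = set(question.lower().split())
    hits = sum(any(w in c.lower() for w in keywords) for c in chunks)
    return hits / len(chunks) if chunks else 0.0

def generation_score(answer: str, chunks: list[str]) -> float:
    """Toy proxy for faithfulness: how much of the answer is grounded in context."""
    context = " ".join(chunks).lower()
    words = [w for w in answer.lower().split() if len(w) > 4]
    grounded = sum(w in context for w in words)
    return grounded / len(words) if words else 0.0

question = "What is the refund window?"
chunks = ["Refunds are accepted within 30 days.", "Shipping takes 5 days."]
answer = "Refunds are accepted within 30 days."
retrieval, generation = retrieval_score(question, chunks), generation_score(answer, chunks)
```

When a bad output comes in, two separate scores tell you which span to fix: a low retrieval score with a high generation score points at your retriever, not your prompt.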

4. On threads and conversations

If you're building a multi-turn system (chatbot, copilot, agent with memory), add thread-level evals last. These measure things like turn relevancy, role adherence, and whether the conversation actually resolved the user's request.

Thread-level evals are the hardest to get right, so don't start here.
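For a sense of what a thread-level metric aggregates, here is a toy turn-relevancy pass over a whole conversation (keyword overlap stands in for the LLM judge a real metric would use; the thread shape is hypothetical):

```python
def _words(text: str) -> set[str]:
    return {w.strip("?,.!'\"") for w in text.lower().split()}

def turn_relevancy(user_msg: str, assistant_msg: str) -> float:
    """Toy proxy: keyword overlap between a user turn and its reply."""
    u = {w for w in _words(user_msg) if len(w) > 3}
    return len(u & _words(assistant_msg)) / len(u) if u else 1.0

def evaluate_thread(turns: list[tuple[str, str]]) -> dict:
    """Aggregate per-turn scores into one thread-level report (sketch)."""
    scores = [turn_relevancy(u, a) for u, a in turns]
    return {"mean_turn_relevancy": sum(scores) / len(scores), "turns": len(scores)}

turns = [
    ("How do I reset my password?",
     "Click 'Forgot password' on the login page to reset it."),
    ("Thanks, that worked", "Glad it worked, anything else?"),
]
report = evaluate_thread(turns)
```

Notice how much machinery this needs compared to a single post-response check — per-turn scoring, aggregation, and ideally a resolution check on the final turn — which is exactly why it comes last.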

What to Skip (For Now)

A few things that sound useful but tend to slow teams down early on:

  • Evaluating every single span from day one. You'll drown in data. Start with the output, then work backward to the spans that contribute to failures.
  • Blocking responses on eval results. Unless you have a hard compliance requirement, run evals asynchronously at first. Blocking adds latency, and you'll want to tune your eval thresholds before you trust them enough to gate responses.
  • Building custom metrics before using the defaults. Start with out-of-the-box metrics like answer relevancy, faithfulness, and hallucination. Custom metrics are worth it later, but they're a distraction when you're still setting up the pipeline.
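On the second point, "asynchronously" just means scoring happens off the request path. A minimal sketch using a background worker and a queue (stdlib only; the toy metric is a placeholder):

```python
import queue
import threading

eval_queue: "queue.Queue[tuple[str, str]]" = queue.Queue()
scores: list[float] = []

def eval_worker() -> None:
    """Runs evals off the request path so responses are never blocked."""
    while True:
        item = eval_queue.get()
        if item is None:
            break
        user_input, response = item
        scores.append(1.0 if user_input in response else 0.0)  # toy metric
        eval_queue.task_done()

threading.Thread(target=eval_worker, daemon=True).start()

def handle_request(user_input: str) -> str:
    response = f"Answer: {user_input}"       # stand-in LLM call
    eval_queue.put((user_input, response))   # fire-and-forget eval
    return response                          # returns immediately, no added latency
```

Once you trust the thresholds, the same hook can be moved inline to gate responses; until then, the user never waits on a judge.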

Setting This Up in Confident AI

In Confident AI, trigger moments map directly to how you configure online evaluations against traces, spans, and threads.

The setup is:

  1. Instrument your app to send traces to Confident AI.
  2. Pick the trigger moment you want to start with (we recommend post-response).
  3. Select which metrics to run at that trigger point.
  4. Deploy, and monitor results in the dashboard.
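Steps 2 and 3 amount to a small piece of configuration. The shape below is purely illustrative — consult the Confident AI docs for the actual SDK calls and payloads:

```python
def build_online_eval_config(trigger: str, metrics: list[str],
                             sample_rate: float = 1.0) -> dict:
    """Assemble an online-eval config for one trigger moment (hypothetical
    shape, not the real Confident AI payload)."""
    assert trigger in {"post_response", "post_tool", "span", "thread"}
    return {"trigger": trigger, "metrics": metrics, "sample_rate": sample_rate}

# Step 2: post-response first; step 3: default metrics before custom ones.
config = build_online_eval_config("post_response",
                                  ["answer_relevancy", "faithfulness"])
```

Adding a second trigger moment later is then just a second config entry, which is what makes the incremental rollout cheap.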

From there, add trigger moments incrementally as you learn where your app actually fails.

👉 Get started with online evals