When tracing goes live, every span becomes visible for the first time. The natural move is to score all of them: retrieval, tool selection, planner decisions. It feels like progress, and it is common. Most teams do this before they have defined what a passing end-to-end run even means for the customer.
A few weeks later you have six component-level charts and no answer to the one question leadership is asking: "Did we make the product better for the user?" The only number that answers that question is the final output — what the user actually saw.
Component-level evaluation is the right tool when you already know the shipped answer is wrong and you need to find where it broke:
- Retrieval quality
- Tool selection accuracy
- Individual span behavior
- Planner vs executor decisions
It is the wrong place to start. If you have not defined what a passing end-to-end run means for the customer, component scores give you numbers for every step and no rule for which number should block a deploy.
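To make "which number should block a deploy" concrete, here is a minimal sketch of a gate driven by a single end-to-end metric, with component scores logged for debugging but never gating. The threshold, the pass rate, and every name here are invented for illustration:

```python
# Sketch of a deploy gate driven by one end-to-end metric.
# Component scores are recorded for debugging but do not gate the deploy.
# The baseline value and all names are hypothetical.

def should_block_deploy(e2e_pass_rate: float, baseline: float = 0.85) -> bool:
    """Block the deploy when the end-to-end pass rate regresses below baseline."""
    return e2e_pass_rate < baseline

# Logged for later triage, deliberately absent from the gate condition.
component_scores = {"retrieval": 0.91, "tool_selection": 0.88}

print(should_block_deploy(0.80))  # below baseline: block
print(should_block_deploy(0.90))  # at or above baseline: ship
```

The point of the sketch is the asymmetry: one top-level number decides, everything else is diagnostic context.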
So where should you start?
The objection is always "we need component metrics to know what to fix." You do — after you can score the outcome. Otherwise you are debugging in a vacuum: you might push retrieval scores up while the user-facing answer still misses the point, and you will not know until someone reads the output by hand.
Start end-to-end. Validate those top-level metrics against human-labeled outcomes. When a regression or a production spike clearly implicates one layer — retrieval went empty, wrong tool selected, hallucination after a specific step — add component-level evals for that subsystem and treat them as scalpels, not as the definition of quality.
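The validation step above can be sketched as a tiny harness: score only the final output, then check how often the automated grader agrees with human labels before trusting it as a gate. Everything here is hypothetical, including the toy keyword-overlap grader, which stands in for whatever rubric or LLM-as-judge scorer a real system would use:

```python
# Minimal end-to-end eval sketch: score the final user-visible output,
# then validate the automated scorer against human-labeled outcomes.
# score_output, PASS_THRESHOLD, and the example data are all invented.

PASS_THRESHOLD = 0.6  # assumed cutoff; tune until it tracks human judgment

def score_output(question: str, answer: str) -> float:
    """Stand-in grader: fraction of question terms that appear in the answer."""
    q_terms = set(question.lower().split())
    a_terms = set(answer.lower().split())
    return len(q_terms & a_terms) / max(len(q_terms), 1)

def agreement_with_humans(examples: list[dict]) -> float:
    """Fraction of examples where the automated pass/fail matches the
    human label. Validate this before using the metric as a deploy gate."""
    matches = 0
    for ex in examples:
        auto_pass = score_output(ex["question"], ex["answer"]) >= PASS_THRESHOLD
        matches += auto_pass == ex["human_pass"]
    return matches / len(examples)

examples = [
    {"question": "reset my password",
     "answer": "you can reset your password in settings",
     "human_pass": True},
    {"question": "reset my password",
     "answer": "our company was founded in 2010",
     "human_pass": False},
]
print(agreement_with_humans(examples))
```

Only after agreement is high enough to trust does the automated score earn the right to block deploys; until then, the human labels are the metric.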
The progression that holds up:
- End-to-end first
- Tracing for visibility
- Component-level evals where they explain a failure you already see at the top level
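When a top-level failure does implicate one layer, the component eval can be narrow. As a sketch of the "retrieval went empty" case from above, assuming traces carry a hypothetical `retrieved_docs` field:

```python
# Sketch of a component-level "scalpel" eval, added only after end-to-end
# failures pointed at retrieval returning nothing. The trace shape and the
# retrieved_docs field name are assumptions, not a real tracing schema.

def retrieval_nonempty_rate(traces: list[dict]) -> float:
    """Fraction of traces whose retrieval step returned at least one document."""
    nonempty = sum(1 for t in traces if t.get("retrieved_docs"))
    return nonempty / len(traces)

traces = [
    {"retrieved_docs": ["doc-1"]},
    {"retrieved_docs": []},          # the failure mode this eval exists to catch
    {"retrieved_docs": ["doc-2", "doc-3"]},
]
print(retrieval_nonempty_rate(traces))
```

It measures one thing, explains one observed top-level failure, and never pretends to define overall quality.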
TL;DR — Score the output the user sees. Everything else is debugging.