When tracing goes live, every span becomes visible for the first time. The natural move is to score all of them: retrieval, tool selection, planner decisions. It feels like progress, and it is common. Most teams do this before they have defined what a passing end-to-end run even means for the customer.
A few weeks later you have six component-level charts and no answer to the one question leadership is asking: "Did we make the product better for the user?" The only number that answers that question is the final output — what the user actually saw.
Component-level evaluation is the right tool when you already know the shipped answer is wrong and you need to find where it broke:
- Retrieval quality
- Tool selection accuracy
- Individual span behavior
- Planner vs executor decisions
It is the wrong place to start. If you have not defined what a passing end-to-end run means for the customer, component scores give you numbers for every step and no rule for which number should block a deploy.
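To make "which number should block a deploy" concrete, here is a minimal sketch of a gate driven by a single end-to-end metric, with component scores logged for debugging but never gating. The threshold, the pass rate, and every name here are invented for illustration:

```python
# Sketch of a deploy gate driven by one end-to-end metric.
# Component scores are recorded for debugging but do not gate the deploy.
# The baseline value and all names are hypothetical.

def should_block_deploy(e2e_pass_rate: float, baseline: float = 0.85) -> bool:
    """Block the deploy when the end-to-end pass rate regresses below baseline."""
    return e2e_pass_rate < baseline

# Logged for later triage, deliberately absent from the gate condition.
component_scores = {"retrieval": 0.91, "tool_selection": 0.88}

print(should_block_deploy(0.80))  # below baseline: block
print(should_block_deploy(0.90))  # at or above baseline: ship
```

The point of the sketch is the asymmetry: one top-level number decides, everything else is diagnostic context.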
So where should you start?
The objection is always "we need component metrics to know what to fix." You do — after you can score the outcome. Otherwise you are debugging in a vacuum: you might push retrieval scores up while the user-facing answer still misses the point, and you will not know until someone reads the output by hand.
Start end-to-end. Validate those top-level metrics against human-labeled outcomes. When a regression or a production spike clearly implicates one layer — retrieval went empty, wrong tool selected, hallucination after a specific step — add component-level evals for that subsystem and treat them as scalpels, not as the definition of quality.
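The validation step above can be sketched as a tiny harness: score only the final output, then check how often the automated grader agrees with human labels before trusting it as a gate. Everything here is hypothetical, including the toy keyword-overlap grader, which stands in for whatever rubric or LLM-as-judge scorer a real system would use:

```python
# Minimal end-to-end eval sketch: score the final user-visible output,
# then validate the automated scorer against human-labeled outcomes.
# score_output, PASS_THRESHOLD, and the example data are all invented.

PASS_THRESHOLD = 0.6  # assumed cutoff; tune until it tracks human judgment

def score_output(question: str, answer: str) -> float:
    """Stand-in grader: fraction of question terms that appear in the answer."""
    q_terms = set(question.lower().split())
    a_terms = set(answer.lower().split())
    return len(q_terms & a_terms) / max(len(q_terms), 1)

def agreement_with_humans(examples: list[dict]) -> float:
    """Fraction of examples where the automated pass/fail matches the
    human label. Validate this before using the metric as a deploy gate."""
    matches = 0
    for ex in examples:
        auto_pass = score_output(ex["question"], ex["answer"]) >= PASS_THRESHOLD
        matches += auto_pass == ex["human_pass"]
    return matches / len(examples)

examples = [
    {"question": "reset my password",
     "answer": "you can reset your password in settings",
     "human_pass": True},
    {"question": "reset my password",
     "answer": "our company was founded in 2010",
     "human_pass": False},
]
print(agreement_with_humans(examples))
```

Only after agreement is high enough to trust does the automated score earn the right to block deploys; until then, the human labels are the metric.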
The progression that holds up:
- End-to-end first
- Tracing for visibility
- Component-level evals where they explain a failure you already see at the top level
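When a top-level failure does implicate one layer, the component eval can be narrow. As a sketch of the "retrieval went empty" case from above, assuming traces carry a hypothetical `retrieved_docs` field:

```python
# Sketch of a component-level "scalpel" eval, added only after end-to-end
# failures pointed at retrieval returning nothing. The trace shape and the
# retrieved_docs field name are assumptions, not a real tracing schema.

def retrieval_nonempty_rate(traces: list[dict]) -> float:
    """Fraction of traces whose retrieval step returned at least one document."""
    nonempty = sum(1 for t in traces if t.get("retrieved_docs"))
    return nonempty / len(traces)

traces = [
    {"retrieved_docs": ["doc-1"]},
    {"retrieved_docs": []},          # the failure mode this eval exists to catch
    {"retrieved_docs": ["doc-2", "doc-3"]},
]
print(retrieval_nonempty_rate(traces))
```

It measures one thing, explains one observed top-level failure, and never pretends to define overall quality.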
TL;DR — Score the output the user sees. Everything else is debugging.