Single-Turn, Component-Level Evals
Learn how to run component-level testing for single-turn use cases
Overview
Component-level testing lets you evaluate individual parts of your LLM application — retrievers, generators, tools, planners — rather than just the final output. This is essential for debugging complex pipelines where you need to pinpoint exactly which component is failing.
Requirements:
- A dataset of goldens — determines how many times your app runs
- LLM tracing setup with `@observe` decorators
- Metrics defined per component via the `metrics` parameter in `@observe()`
Component-level testing is currently only supported when using deepeval
Python, and must run locally. However, ad-hoc online evals for components
in production are still available here.
How It Works
- Set up LLM tracing with `@observe` decorators and define `metrics` for each component
- Pull your dataset from Confident AI
- Loop through goldens using the `evals_iterator()` and invoke your LLM app
Unlike end-to-end testing, you don't need to use `golden.input` to call your
LLM app. The `evals_iterator()` simply controls how many times your app runs
(once per golden). Test case fields are set via `update_current_span()` inside
each component — the golden just determines the iteration count.
Run Component-Level Tests Locally
This section is nearly identical to this part of the previous section, where we use LLM tracing to run end-to-end evals. This is because LLM tracing makes it convenient to evaluate any part of your application.
In this example, we're essentially just swapping `update_current_trace` for
`update_current_span`.
We're also using the same mock LLM app from the previous section to demonstrate LLM tracing:
See Mock LLM App
Setup LLM tracing, and define metrics
Decorate your application with the `@observe` decorator, and provide `metrics` for components that you wish to evaluate:
The example above shows how we are tracing our LLM app by simply adding a few `@observe` decorators:
- Each `@observe` decorator creates a span, which represents a component
- A trace, on the other hand, is created by the top-level `@observe` decorator, and is made up of many spans/components
- You include a list of `metrics` in `@observe()` for components you wish to evaluate, and call the `update_current_span` function inside said components to create test cases for evaluation
When you call `update_current_span()` to set inputs, outputs, retrieval contexts, etc., deepeval automatically maps these to create `LLMTestCase`s.
Pull dataset, and loop through goldens
Pull your dataset and use the `.evals_iterator()` to iterate. The iterator controls how many times your LLM app runs — once per golden in your dataset.
Since test case fields are populated via `update_current_span()` inside your components, you can pass any input to your LLM app — or use `golden.input` if your test scenarios require specific inputs.
Done ✅. You should see a link to your newly created sharable testing report.
You can also run your for-loop asynchronously:
When you call your LLM app inside a dataset's `evals_iterator()`, deepeval automatically captures invocations of your LLM app and creates test cases dynamically based on the hierarchy of `@observe`-decorated components. Here is some more information about component-level evals:
- For components that are `@observe`-decorated but have no `metrics` attached, Confident AI will simply not test those components and display them as regular spans instead
- You would generally not use reference-based metrics for component-level testing. This is because goldens are designed to map 1-to-1 to test cases, which makes arguments such as `expected_output` redundant