Single-Turn, Component-Level Testing
Learn how to run component-level testing for single-turn use cases
Overview
Single-turn, component-level testing requires:
- A dataset of goldens
- Setting up LLM tracing
- A list of metrics for each component you wish to test
- Construction of test cases at the component level at runtime
This is currently only supported when using deepeval in Python, and must be
run locally. However, ad-hoc online evals for components in production are
still available.
How It Works
- Set up LLM tracing
- Pull your dataset from Confident AI
- Loop through goldens and call your LLM app using the evals_iterator()
If you have a component-level setup, you can also automatically run end-to-end testing, as shown in the previous section.
See Mock LLM App
For this section, we’ll be using the same mock LLM app as in the previous section to demonstrate LLM tracing:
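Since the original snippet isn’t reproduced here, below is a minimal sketch of such a mock app; the retriever/generator/llm_app names and the hardcoded chunks are illustrative placeholders rather than the exact code from the previous section:

```python
from deepeval.tracing import observe


@observe()
def retriever(query: str) -> list[str]:
    # Mock retrieval: return hardcoded chunks instead of querying a vector DB
    return ["Hardcoded text chunk 1", "Hardcoded text chunk 2"]


@observe()
def generator(query: str, text_chunks: list[str]) -> str:
    # Mock generation: swap in a real LLM call in your own app
    return "Mock LLM response to: " + query


@observe()
def llm_app(query: str) -> str:
    # Top-level component: calling this creates the trace, nested calls create spans
    return generator(query, retriever(query))
```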
Run Component-Level Tests Locally
This section is nearly identical to the corresponding part of the previous section, where we used LLM tracing to run end-to-end evals. This is because LLM tracing makes it convenient to evaluate any part of your application.
In this example, we’re essentially just swapping update_current_trace for
update_current_span.
Set up LLM tracing, and define metrics
Decorate your application with the @observe decorator, and provide metrics for components that you wish to evaluate:
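Here is a sketch of what this can look like for the mock app above (the metric choice and the exact update_current_span keyword arguments are illustrative; adapt them to the components you actually want to test):

```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


@observe()
def retriever(query: str) -> list[str]:
    return ["Hardcoded text chunk 1", "Hardcoded text chunk 2"]


@observe(metrics=[AnswerRelevancyMetric()])
def generator(query: str, text_chunks: list[str]) -> str:
    answer = "Mock LLM response to: " + query
    # Create a test case for this component so the attached metrics can be run
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output=answer,
            retrieval_context=text_chunks,
        )
    )
    return answer


@observe()
def llm_app(query: str) -> str:
    return generator(query, retriever(query))
```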
The example above shows how we are tracing our LLM app by simply adding a few @observe decorators:
- Each @observe decorator creates a span, which represents a component
- A trace, on the other hand, is created by the top-level @observe decorator, and is made up of many spans/components
- You include a list of metrics in @observe() for components you wish to evaluate, and call the update_current_span function inside said components to create test cases for evaluation
When you call update_current_span() to set inputs, outputs, retrieval_contexts, etc., deepeval automatically maps these to create LLMTestCases.
Pull dataset, and loop through goldens
Pull your dataset in the same way as before, and use .evals_iterator() to loop through your goldens. You will use the data in your goldens (most likely the input) to call your LLM app:
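For example (the dataset alias below is a placeholder, and llm_app is the traced mock app from above):

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Dataset")  # placeholder alias: use your own dataset's alias

for golden in dataset.evals_iterator():
    # Invoking the traced app is enough: test cases for components with
    # metrics attached are created and evaluated automatically
    llm_app(golden.input)
```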
Done ✅. You should see a link to your newly created sharable testing report.
You can also run your for-loop asynchronously:
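As a sketch of the async variant, assuming an async version of the mock app (a_llm_app here is hypothetical) and deepeval's pattern of registering the task via dataset.evaluate() so the iterator waits for it to finish:

```python
import asyncio

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Dataset")  # placeholder alias

for golden in dataset.evals_iterator():
    # a_llm_app is an assumed async version of the traced mock app
    task = asyncio.create_task(a_llm_app(golden.input))
    dataset.evaluate(task)
```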
When you call your LLM app inside a dataset’s evals_iterator(), deepeval automatically captures invocations of your LLM app and creates test cases dynamically based on the hierarchy of your @observe-decorated components. Here is some more information about component-level evals:
- For components that are decorated with @observe but have no metrics attached, Confident AI will simply not test those components and will display them as regular spans instead
- You would generally not use reference-based metrics for component-level testing. This is because goldens are designed to map 1-to-1 to test cases, which makes arguments such as expected_output redundant