Single-Turn, E2E Testing
Learn how to run end-to-end testing for single-turn use cases
Overview
Single-turn, end-to-end testing requires:
- A dataset of goldens
- A list of metrics you wish to evaluate with
- Construction of test cases at runtime
The test cases that you construct will be LLMTestCases that encapsulate your system's end-to-end inputs and outputs.
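For example, a minimal single-turn LLMTestCase looks like this (the strings below are illustrative):

```python
from deepeval.test_case import LLMTestCase

# A minimal single-turn test case: the end-to-end input to your system,
# the final output it produced, and (optionally) fields mapped from your golden
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a refund within 30 days of purchase.",
    expected_output="We offer a 30-day refund policy.",
)
```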
How It Works
1. Pull your dataset from Confident AI
2. Loop through the goldens in your dataset; for each golden:
   - Invoke your LLM app using the golden's input to generate test case parameters such as the actual output, retrieval context, and tools called
   - Map golden fields to test case parameters
   - Add the resulting test case back to your dataset
3. Run an evaluation on your test cases
There are many ways to run evaluations on Confident AI (step 3. above). For running evals locally:
- Using the evaluate() function
- With the .evals_iterator() via LLM tracing
- Using deepeval test run in CI/CD
For running evals remotely:
- Using the Evals API
- Also using the evaluate() function
For this section, we'll be using this mock LLM app, which is a simple RAG pipeline:
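(The pipeline below is a stubbed sketch; the retrieve, generate, and rag_pipeline names are placeholders for your own components.)

```python
# Stubbed single-turn RAG pipeline used throughout this section.
# Replace retrieve() and generate() with your real vector store and LLM calls.

def retrieve(query: str) -> list[str]:
    # Pretend vector search over your knowledge base
    return [
        "Our refund policy lasts 30 days.",
        "Shipping takes 3-5 business days.",
    ]

def generate(query: str, context: list[str]) -> str:
    # Pretend LLM generation grounded in the retrieved context
    return f"Based on our policy: {context[0]}"

def rag_pipeline(query: str) -> str:
    context = retrieve(query)
    return generate(query, context)
```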
Run E2E Tests Locally
Running evals locally is only possible if you are using the Python deepeval library. If you’re working with Typescript or any other language, skip to the remote end-to-end evals section instead.
Loop through goldens to create test cases
A native for-loop calling your LLM app would do for this step:
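Here's a sketch of that loop, assuming the mock rag_pipeline from above and a hypothetical dataset alias:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Pull the dataset of goldens from Confident AI (alias is hypothetical)
dataset = EvaluationDataset()
dataset.pull(alias="My E2E Dataset")

for golden in dataset.goldens:
    # 1. Invoke the mock RAG pipeline from above using the golden's input
    actual_output = rag_pipeline(golden.input)

    # 2. Map golden fields to test case parameters
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        expected_output=golden.expected_output,
    )

    # 3. Add the test case back to the dataset
    dataset.add_test_case(test_case)
```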
You'll notice that if you also want to return other test case parameters, such as the retrieval_context, you'll have to rewrite your LLM app. We'll address this problem in the next section.
Run evaluation using evaluate()
The evaluate() function creates a test run and uploads the data to Confident AI once evaluations have completed locally.
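A minimal sketch, reusing the dataset from the previous step (the metric choice is illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric

# Runs every metric against every test case, then uploads the finished
# test run to Confident AI
evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()],
)
```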
Done ✅. You should see a link to your newly created sharable testing report.
- The evaluate() function runs your test suite across all test cases and metrics
- Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
- A test case passes only if all metrics for it pass
- The test run’s pass rate is the proportion of test cases that pass
deepeval opens your browser automatically by default. To disable this
behavior, set CONFIDENT_BROWSER_OPEN=NO.
The evaluate() function is extremely unopinionated and non-intrusive, which means it is great for teams looking for a lightweight approach to running LLM evaluations. However, it also means that:
- You have to handle a lot of the ETL yourself to map test case fields, even rewriting your LLM app at times to return the correct data
- No visibility - you will still want to be able to debug your LLM app even if it is an end-to-end evaluation
In the next section, we’ll show how you can avoid this ETL hellhole and bring LLM traces to end-to-end testing.
LLM Tracing for Local E2E Testing
LLM tracing solves all problems associated with constructing test cases.
Set up LLM tracing
All you need is to add a few lines of code to your existing LLM app (we’ll be using the example from above):
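Below is a sketch of the traced version of the mock pipeline. The @observe decorator and update_current_trace come from deepeval's tracing module; the exact keyword arguments accepted by update_current_trace may vary by deepeval version, so treat the test_case mapping as an assumption:

```python
from deepeval.test_case import LLMTestCase
from deepeval.tracing import observe, update_current_trace

@observe()
def retrieve(query: str) -> list[str]:
    # Span: pretend vector search
    return [
        "Our refund policy lasts 30 days.",
        "Shipping takes 3-5 business days.",
    ]

@observe()
def generate(query: str, context: list[str]) -> str:
    # Span: pretend LLM generation
    return f"Based on our policy: {context[0]}"

@observe()
def rag_pipeline(query: str) -> str:
    # The top-level @observe creates the trace; nested calls become spans
    context = retrieve(query)
    output = generate(query, context)

    # Set end-to-end test case parameters on the trace at runtime
    update_current_trace(
        test_case=LLMTestCase(
            input=query,
            actual_output=output,
            retrieval_context=context,
        )
    )
    return output
```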
The example above shows how we are tracing our LLM app by simply adding a few @observe decorators:
- Each @observe decorator creates a span, which represents a component in your LLM app
- A trace, on the other hand, is created by the top-level @observe decorator and is made up of many spans/components
- When you run end-to-end testing, you can call the update_current_trace function anywhere inside your traced application to set test case parameters
Don't worry too much about learning everything you can about LLM tracing for now. We'll go through it in a dedicated LLM tracing section.
In the next section on component-level testing, we will simply swap the
update_current_trace function with update_current_span to construct test
cases on a component-level.
Pull dataset, and loop through goldens
Pull your dataset in the same way as before, and use the .evals_iterator() to loop through your goldens. You will use data in your goldens (most likely the input) to call your LLM app:
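A sketch of this loop; passing metrics directly to .evals_iterator() is an assumption based on recent deepeval versions, and the dataset alias is hypothetical:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric

dataset = EvaluationDataset()
dataset.pull(alias="My E2E Dataset")  # hypothetical alias

# Each iteration invokes the traced app; test case parameters come from the trace
for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    rag_pipeline(golden.input)
```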
Done ✅. You should see a link to your newly created sharable testing report. This is literally all it takes to run end-to-end evaluations, with the added benefit of a full testing report with tracing included on Confident AI.
You can also run your for-loop asynchronously:
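A sketch of the async variant. It assumes your traced app exposes an async entry point (a_rag_pipeline here is hypothetical) and that dataset.evaluate(task) is available for registering async tasks with the iterator, as in recent deepeval versions; check the deepeval docs for the exact async pattern:

```python
import asyncio

from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe

@observe()
async def a_rag_pipeline(query: str) -> str:
    # Hypothetical async entry point wrapping the traced pipeline above
    return rag_pipeline(query)

dataset = EvaluationDataset()
dataset.pull(alias="My E2E Dataset")  # hypothetical alias

for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    # Schedule the async invocation instead of awaiting it inline
    task = asyncio.create_task(a_rag_pipeline(golden.input))
    dataset.evaluate(task)  # assumption: registers the task for the iterator to await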
Run E2E Tests Remotely
Remote end-to-end evals offer no traceability for debugging, but they are great because:
- Team members can build metrics without going through code
- They are supported through the Evals API, for any language
This is possible via Confident AI’s Evals API.
Pull dataset and construct test cases
Using your language of choice, you would call your LLM app to construct a list of valid LLMTestCase data models.
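Here's a Python sketch; the same request can be made from TypeScript or cURL. The dataset pull and test case construction use deepeval, while the endpoint URL, auth header, and request schema for the Evals API are placeholders you should replace with the values from the API reference:

```python
import requests
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My E2E Dataset")  # hypothetical alias

# Construct LLMTestCase-shaped payloads from your goldens
test_cases = []
for golden in dataset.goldens:
    actual_output = rag_pipeline(golden.input)
    test_cases.append(
        {
            "input": golden.input,
            "actual_output": actual_output,
            "expected_output": golden.expected_output,
        }
    )

# Send the test cases to the Evals API for remote evaluation.
# The URL, auth header, and body schema below are placeholders; use the
# endpoint and schema documented in the Evals API reference.
EVALS_API_URL = "https://..."  # placeholder
response = requests.post(
    EVALS_API_URL,
    headers={"Authorization": "Bearer <YOUR_CONFIDENT_API_KEY>"},  # placeholder
    json={"testCases": test_cases},  # placeholder schema
)
response.raise_for_status()
```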
Advanced Usage
Now that you've learnt how to run a single-turn, end-to-end evaluation, here are a few things you should also do.
Log prompts and models
Tell Confident AI the configurations used in your LLM app during the evaluation.
This will help Confident AI tell you which of your hyperparameters performed better retrospectively.
Simply add a free-form key-value pair to the hyperparameters argument in the evaluate() function:
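(A sketch, reusing the test cases and metric from earlier; the key names and values are illustrative.)

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric

evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()],
    # Free-form key-value pairs describing this run's configuration
    hyperparameters={
        "model": "gpt-4o",
        "prompt version": "v1.2",
        "temperature": "0.7",
    },
)
```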
If you're keeping prompts on Confident AI for prompt optimization, you can also provide the prompt object directly:
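A sketch, assuming a prompt kept on Confident AI under a hypothetical alias (the Prompt class and .pull() usage follow deepeval's prompt management API; verify the exact import path against the docs):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.prompt import Prompt

# Pull a versioned prompt kept on Confident AI (alias is hypothetical)
prompt = Prompt(alias="RAG System Prompt")
prompt.pull()

evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"model": "gpt-4o", "system prompt": prompt},
)
```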
Add identifier to test runs
The identifier argument allows you to name test runs, which will come in extremely handy when you're trying to run regression tests on them on the platform.
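A sketch (the identifier string is illustrative):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric

evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()],
    identifier="v1.2 regression run",  # any name that helps you find this test run later
)
```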
Name test cases
Similar to the identifier, naming test cases allows you to search and match test cases across different test runs during regression testing.
By default, Confident AI will match test cases based on matching inputs, so naming test cases is not strictly required for regression testing.
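If you do want to name them, LLMTestCase accepts a name at construction time; a sketch (the name is illustrative):

```python
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    name="refund-policy-001",  # hypothetical stable name, reused across test runs
    input="What is your refund policy?",
    actual_output="You can request a refund within 30 days of purchase.",
)
```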