Single-Turn, E2E Testing

Learn how to run end-to-end testing for single-turn use cases

Overview

Single-turn, end-to-end testing requires:

  • A dataset of goldens
  • A list of metrics you wish to evaluate with
  • Construction of test cases at runtime

Each test case you construct will be an LLMTestCase that encapsulates your system's end-to-end inputs and outputs.
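
For reference, here's a minimal sketch of what such a test case looks like; the field values below are placeholders, not real data:

from deepeval.test_case import LLMTestCase

# Placeholder values - in practice the input comes from your golden,
# and the outputs are generated by your LLM app at runtime
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can get a refund within 30 days.",
    retrieval_context=["Refunds are accepted within 30 days of purchase."]  # optional, used by RAG metrics
)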

How It Works

  1. Pull your dataset from Confident AI
  2. Loop through goldens in your dataset, for each golden:
    • Invoke your LLM app using golden inputs to generate test case parameters such as the actual output and tools called
    • Map golden fields to test case parameters
    • Add test case back to your dataset
  3. Run evaluation on test cases

There are many ways to do step 3. For running evals locally:

  • Using the evaluate() function
  • With the .evals_iterator() via LLM tracing
  • Using deepeval test run in CI/CD (see the sketch after these lists)

For running evals remotely:

  • Using the Evals API
  • Also using the evaluate() function, with a metric collection
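
For the CI/CD option above, a minimal sketch looks something like the following. The file name and the llm_app import path are assumptions for illustration; you'd point them at your own app:

test_llm_app.py
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from main import llm_app  # assumed import path to your LLM app

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# One pytest test per golden; run with `deepeval test run test_llm_app.py`
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden):
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    assert_test(test_case, [AnswerRelevancyMetric()])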

For this section, we'll be using this mock LLM app, which is a simple RAG pipeline:

main.py
from openai import OpenAI

def llm_app(query: str) -> str:
    # Retriever for your vector db
    def retriever(query: str) -> list[str]:
        return ["List", "of", "text", "chunks"]

    # Generator that combines retrieved context with user query
    def generator(query: str, text_chunks: list[str]) -> str:
        return OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": query}
            ]
        ).choices[0].message.content

    # Calls retriever then generator
    return generator(query, retriever(query))

Run E2E Tests Locally

Running evals locally is only possible if you are using the Python deepeval library. If you're working with TypeScript or any other language, skip to the remote end-to-end evals section instead.

1. Pull dataset

Pull your dataset (and create one if you haven’t already):

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

2. Loop through goldens to create test cases

A simple for-loop over your goldens, calling your LLM app for each one, will do for this step:

main.py
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input)
    )
    dataset.add_test_case(test_case)

You'll notice that if you also want to return other test case parameters, such as the retrieval_context, you'll have to rewrite your LLM app to expose them. We'll address this problem in the next section.
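
To illustrate the problem, this is roughly what that rewrite would look like for the mock RAG pipeline above. It's only a sketch: llm_app_with_context is a hypothetical name, and it assumes you're willing to change your app's return type:

# Hypothetical rewrite of llm_app so it also exposes what the retriever returned
def llm_app_with_context(query: str) -> tuple[str, list[str]]:
    text_chunks = ["List", "of", "text", "chunks"]  # retriever output
    answer = "..."                                  # generator output, unchanged from before
    return answer, text_chunks

for golden in dataset.goldens:
    actual_output, retrieval_context = llm_app_with_context(golden.input)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    dataset.add_test_case(test_case)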

3. Run evaluation using evaluate()

The evaluate() function creates a test run and uploads the results to Confident AI once evaluations have completed locally.

main.py
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

# Replace with your metrics
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])

Done ✅. You should see a link to your newly created sharable testing report.

  • The evaluate() function runs your test suite across all test cases and metrics
  • Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
  • A test case passes only if all metrics for it pass
  • The test run’s pass rate is the proportion of test cases that pass

deepeval opens your browser automatically by default. To disable this behavior, set CONFIDENT_BROWSER_OPEN=NO.
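
If you'd rather set this in code than in your shell, one option is the sketch below; setting the variable in your shell before running the script remains the most reliable approach:

import os

# Equivalent to setting CONFIDENT_BROWSER_OPEN=NO in your shell;
# set it before calling evaluate()
os.environ["CONFIDENT_BROWSER_OPEN"] = "NO"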

Single-Turn Testing Reports

The evaluate() function is extremely unopinionated and non-intrusive, which means it is great for teams looking for a lightweight approach to running LLM evaluations. However, it also means that:

  • You have to handle a lot of the ETL yourself to map test case fields, even rewriting your LLM app at times to return the correct data
  • No visibility - you will still want to be able to debug your LLM app even if it is an end-to-end evaluation

In the next section, we’ll show how you can avoid this ETL hellhole and bring LLM traces to end-to-end testing.

LLM Tracing for Local E2E Testing

LLM tracing solves both of the problems above when constructing test cases.

1. Set up LLM tracing

All you need is to add a few lines of code to your existing LLM app (we’ll be using the example from above):

main.py
from openai import OpenAI
from deepeval.tracing import observe, update_current_trace

@observe()
def llm_app(query: str) -> str:

    @observe()
    def retriever(query: str) -> list[str]:
        chunks = ["List", "of", "text", "chunks"]
        update_current_trace(retrieval_context=chunks)
        return chunks

    @observe()
    def generator(query: str, text_chunks: list[str]) -> str:
        res = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}]
        ).choices[0].message.content
        update_current_trace(input=query, output=res)
        return res

    return generator(query, retriever(query))

The example above shows how we are tracing our LLM app by simply adding a few @observe decorators:

  • Each @observe decorator creates a span, which represents a component in your LLM app
  • A trace, on the other hand, is created by the top-level @observe decorator and is made up of many spans/components
  • When you run end-to-end testing, you can call the update_current_trace function anywhere inside your traced application to set test case parameters

Don't worry too much about learning everything you can about LLM tracing for now. We'll go through it in a dedicated LLM tracing section.

In the next section on component-level testing, we will simply swap the update_current_trace function with update_current_span to construct test cases at the component level.
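
As a rough preview, that swap would look something like this inside the retriever component. Treat the keyword arguments as an assumption (that update_current_span accepts the same ones as update_current_trace); the exact API is covered in the component-level testing section:

from deepeval.tracing import observe, update_current_span

@observe()
def retriever(query: str) -> list[str]:
    chunks = ["List", "of", "text", "chunks"]
    # Attach test case parameters to this span (component) rather than the whole trace;
    # assumes update_current_span accepts the same keyword arguments as update_current_trace
    update_current_span(retrieval_context=chunks)
    return chunks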

2. Pull dataset and loop through goldens

Pull your dataset in the same way as before, and use the .evals_iterator() to loop through your goldens. You will use data in your goldens (most likely the input) to call your LLM app:

main.py
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    llm_app(golden.input)  # Replace with your LLM app

Done ✅. You should see a link to your newly created sharable testing report. This is literally all it takes to run end-to-end evaluations, with the added benefit of a full testing report with tracing included on Confident AI.

Single-Turn Testing Reports (with Tracing)

You can also run your for-loop asynchronously:

import asyncio
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    task = asyncio.create_task(a_llm_app(golden.input))
    dataset.evaluate(task)
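
For this to work, a_llm_app needs to be an async version of your LLM app. A minimal sketch, assuming you use AsyncOpenAI and skip the retriever for brevity, might look like:

from openai import AsyncOpenAI
from deepeval.tracing import observe, update_current_trace

@observe()
async def a_llm_app(query: str) -> str:
    # Async version of the generator from the mock RAG pipeline above
    res = (await AsyncOpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )).choices[0].message.content
    update_current_trace(input=query, output=res)
    return res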

Run E2E Tests Remotely

Remote end-to-end evals offer no traceability for debugging, but are great because:

  • Team members can build metrics without going through code
  • Supported through the Evals API, for any language

This is possible via Confident AI’s Evals API.

1. Create metric collection

Go to Project > Metric > Collections:

Metric Collection for Remote Evals

2. Pull dataset and construct test cases

Using your language of choice, you would call your LLM app to construct a list of valid LLMTestCase data models.

main.py
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    dataset.add_test_case(test_case)

3. Call the /v1/evaluate endpoint

main.py
from deepeval import evaluate

evaluate(test_cases=dataset.test_cases, metric_collection="YOUR-COLLECTION-NAME")

Advanced Usage

Now that you've learnt how to run a single-turn, end-to-end evaluation, here are a few things you should also do.

Log prompts and models

Tell Confident AI the configurations used in your LLM app during the evaluation.

This will help Confident AI tell you which of your hyperparameters performed better retrospectively.

Simply add a free-form key-value pair to the hyperparameters argument in the evaluate() function:

evaluate(
    hyperparameters={
        "Model": "YOUR-MODEL",
        "Prompt Version": "YOUR-PROMPT-VERSION"
    },
    test_cases=[...],
    metrics=[...]
)

If you're keeping prompts on Confident AI for prompt optimization, you can also provide the prompt object directly:

from deepeval.prompt import Prompt

# Pull the prompt you're versioning on Confident AI
prompt = Prompt(alias="YOUR-PROMPT-ALIAS")
prompt.pull()

evaluate(
    hyperparameters={
        "Model": "YOUR-MODEL",
        "Prompt": prompt
    },
    test_cases=[...],
    metrics=[...]
)

Add identifier to test runs

The identifier argument allows you to name test runs, which comes in extremely handy when you're running regression tests on them on the platform.

evaluate(
    identifier="Any custom string",
    test_cases=[...],
    metrics=[...]
)

Name test cases

Similar to the identifier, naming test cases allows you to search and match test cases across different test runs during regression testing.

By default, Confident AI will match test cases based on matching inputs, so naming test cases is not strictly required for regression testing.

evaluate(
    test_cases=[LLMTestCase(name="Any custom string", ...)],
    metric_collection="..."
)