Single-Turn, E2E Testing

Learn how to run end-to-end testing for single-turn use cases

Overview

Single-turn, end-to-end testing requires:

  • A dataset of goldens
  • A list of metrics you wish to evaluate with
  • Construction of test cases at runtime

Each test case you construct will be an LLMTestCase that encapsulates your system's end-to-end inputs and outputs.
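
For reference, here's a minimal sketch of what such a test case looks like; the field values below are placeholders, not real data:

from deepeval.test_case import LLMTestCase

# Placeholder values - in practice the input comes from your golden,
# and the outputs are generated by your LLM app at runtime
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can get a refund within 30 days.",
    retrieval_context=["Refunds are accepted within 30 days of purchase."]  # optional, used by RAG metrics
)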

How It Works

  1. Pull your dataset from Confident AI
  2. Loop through goldens in your dataset, for each golden:
    • Invoke your LLM app using golden inputs to generate test case parameters such as the actual output and tools called
    • Map golden fields to test case parameters
    • Add test case back to your dataset
  3. Run evaluation on test cases

There are many ways to do step 3. For running evals locally:

  • Using the evaluate() function
  • With the .evals_iterator() via LLM tracing
  • Using deepeval test run in CI/CD (see the sketch after these lists)

For running evals remotely:

  • Using the Evals API
  • Also using the evaluate() function, with a metric collection
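
For the CI/CD option above, a minimal sketch looks something like the following. The file name and the llm_app import path are assumptions for illustration; you'd point them at your own app:

test_llm_app.py
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from main import llm_app  # assumed import path to your LLM app

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# One pytest test per golden; run with `deepeval test run test_llm_app.py`
@pytest.mark.parametrize("golden", dataset.goldens)
def test_llm_app(golden):
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    assert_test(test_case, [AnswerRelevancyMetric()])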

For this section, we'll be using this mock LLM app, which is a simple RAG pipeline:

main.py
from openai import OpenAI

def llm_app(query: str) -> str:
    # Retriever for your vector db
    def retriever(query: str) -> list[str]:
        return ["List", "of", "text", "chunks"]

    # Generator that combines retrieved context with user query
    def generator(query: str, text_chunks: list[str]) -> str:
        return OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": query}
            ]
        ).choices[0].message.content

    # Calls retriever then generator
    return generator(query, retriever(query))

Run E2E Tests Locally

Running evals locally is only possible if you are using the Python deepeval library. If you're working with TypeScript or any other language, skip to the remote end-to-end evals section instead.

1. Pull dataset

Pull your dataset (and create one if you haven’t already):

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

2. Loop through goldens to create test cases

A simple for-loop over your goldens, calling your LLM app for each one, will do for this step:

main.py
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input)
    )
    dataset.add_test_case(test_case)

You'll notice that if you also want to return other test case parameters, such as the retrieval_context, you'll have to rewrite your LLM app to expose them. We'll address this problem in the next section.
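
To illustrate the problem, this is roughly what that rewrite would look like for the mock RAG pipeline above. It's only a sketch: llm_app_with_context is a hypothetical name, and it assumes you're willing to change your app's return type:

# Hypothetical rewrite of llm_app so it also exposes what the retriever returned
def llm_app_with_context(query: str) -> tuple[str, list[str]]:
    text_chunks = ["List", "of", "text", "chunks"]  # retriever output
    answer = "..."                                  # generator output, unchanged from before
    return answer, text_chunks

for golden in dataset.goldens:
    actual_output, retrieval_context = llm_app_with_context(golden.input)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=actual_output,
        retrieval_context=retrieval_context
    )
    dataset.add_test_case(test_case)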

3. Run evaluation using evaluate()

The evaluate() function creates a test run and uploads the results to Confident AI once evaluations have completed locally.

main.py
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

# Replace with your metrics
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])

Done ✅. You should see a link to your newly created sharable testing report.

  • The evaluate() function runs your test suite across all test cases and metrics
  • Each metric is applied to every test case (e.g., 10 test cases × 2 metrics = 20 evaluations)
  • A test case passes only if all metrics for it pass
  • The test run’s pass rate is the proportion of test cases that pass

deepeval opens your browser automatically by default. To disable this behavior, set CONFIDENT_BROWSER_OPEN=NO.
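
If you'd rather set this in code than in your shell, one option is the sketch below; setting the variable in your shell before running the script remains the most reliable approach:

import os

# Equivalent to setting CONFIDENT_BROWSER_OPEN=NO in your shell;
# set it before calling evaluate()
os.environ["CONFIDENT_BROWSER_OPEN"] = "NO"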

Single-Turn Testing Reports

The evaluate() function is extremely unopinionated and non-intrusive, which means it is great for teams looking for a lightweight approach to running LLM evaluations. However, it also means that:

  • You have to handle a lot of the ETL yourself to map test case fields, even rewriting your LLM app at times to return the correct data
  • No visibility - you will still want to be able to debug your LLM app even if it is an end-to-end evaluation

In the next section, we’ll show how you can avoid this ETL hellhole and bring LLM traces to end-to-end testing.

LLM Tracing for Local E2E Testing

LLM tracing solves both of the problems above when constructing test cases.

1. Set up LLM tracing

All you need is to add a few lines of code to your existing LLM app (we’ll be using the example from above):

main.py
from openai import OpenAI
from deepeval.tracing import observe, update_current_trace

@observe()
def llm_app(query: str) -> str:

    @observe()
    def retriever(query: str) -> list[str]:
        chunks = ["List", "of", "text", "chunks"]
        update_current_trace(retrieval_context=chunks)
        return chunks

    @observe()
    def generator(query: str, text_chunks: list[str]) -> str:
        res = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}]
        ).choices[0].message.content
        update_current_trace(input=query, output=res)
        return res

    return generator(query, retriever(query))

The example above shows how we are tracing our LLM app by simply adding a few @observe decorators:

  • Each @observe decorator creates a span, which represents a component in your LLM app
  • A trace, on the other hand, is created by the top-level @observe decorator and is made up of many spans/components
  • When you run end-to-end testing, you can call the update_current_trace function anywhere inside your traced application to set test case parameters

Don't worry too much about learning everything you can about LLM tracing for now. We'll go through it in a dedicated LLM tracing section.

In the next section on component-level testing, we will simply swap the update_current_trace function with update_current_span to construct test cases at the component level.
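
As a rough preview, that swap would look something like this inside the retriever component. Treat the keyword arguments as an assumption (that update_current_span accepts the same ones as update_current_trace); the exact API is covered in the component-level testing section:

from deepeval.tracing import observe, update_current_span

@observe()
def retriever(query: str) -> list[str]:
    chunks = ["List", "of", "text", "chunks"]
    # Attach test case parameters to this span (component) rather than the whole trace;
    # assumes update_current_span accepts the same keyword arguments as update_current_trace
    update_current_span(retrieval_context=chunks)
    return chunks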

2. Pull dataset and loop through goldens

Pull your dataset in the same way as before, and use the .evals_iterator() to loop through your goldens. You will use data in your goldens (most likely the input) to call your LLM app:

main.py
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    llm_app(golden.input)  # Replace with your LLM app

Done ✅. You should see a link to your newly created sharable testing report. This is literally all it takes to run end-to-end evaluations, with the added benefit of a full testing report with tracing included on Confident AI.

Single-Turn Testing Reports (with Tracing)

You can also run your for-loop asynchronously:

import asyncio
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    task = asyncio.create_task(a_llm_app(golden.input))
    dataset.evaluate(task)
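
For this to work, a_llm_app needs to be an async version of your LLM app. A minimal sketch, assuming you use AsyncOpenAI and skip the retriever for brevity, might look like:

from openai import AsyncOpenAI
from deepeval.tracing import observe, update_current_trace

@observe()
async def a_llm_app(query: str) -> str:
    # Async version of the generator from the mock RAG pipeline above
    res = (await AsyncOpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}]
    )).choices[0].message.content
    update_current_trace(input=query, output=res)
    return res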

Run E2E Tests Remotely

Remote end-to-end evals offer no traceability for debugging, but are great because:

  • Team members can build metrics without going through code
  • Supported through the Evals API, for any language

This is possible via Confident AI’s Evals API.

1. Create metric collection

Go to Project > Metric > Collections:

Metric Collection for Remote Evals

2. Pull dataset and construct test cases

Using your language of choice, you would call your LLM app to construct a list of valid LLMTestCase data models.

main.py
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    dataset.add_test_case(test_case)

3. Call the /v1/evaluate endpoint

main.py
from deepeval import evaluate

evaluate(test_cases=dataset.test_cases, metric_collection="YOUR-COLLECTION-NAME")

Advanced Usage

Now that you've learnt how to run a single-turn, end-to-end evaluation, here are a few things you should also do.

Log prompts and models

Tell Confident AI the configurations used in your LLM app during the evaluation.

This will help Confident AI tell you which of your hyperparameters performed better retrospectively.

Simply add a free-form key-value pair to the hyperparameters argument in the evaluate() function:

evaluate(
    hyperparameters={
        "Model": "YOUR-MODEL",
        "Prompt Version": "YOUR-PROMPT-VERSION"
    },
    test_cases=[...],
    metrics=[...]
)

If you're keeping prompts on Confident AI for prompt optimization, you can also provide the prompt object directly:

from deepeval.prompt import Prompt

# Pull the prompt you're versioning on Confident AI
prompt = Prompt(alias="YOUR-PROMPT-ALIAS")
prompt.pull()

evaluate(
    hyperparameters={
        "Model": "YOUR-MODEL",
        "Prompt": prompt
    },
    test_cases=[...],
    metrics=[...]
)

Add identifier to test runs

The identifier argument allows you to name test runs, which comes in extremely handy when you're running regression tests on them on the platform.

evaluate(
    identifier="Any custom string",
    test_cases=[...],
    metrics=[...]
)

Name test cases

Similar to the identifier, naming test cases allows you to search and match test cases across different test runs during regression testing.

By default, Confident AI will match test cases based on matching inputs, so naming test cases is not strictly required for regression testing.

evaluate(
    test_cases=[LLMTestCase(name="Any custom string", ...)],
    metric_collection="..."
)