Single-Turn, Component-Level Testing

Learn how to run component-level testing for single-turn use cases

Overview

Single-turn, component-level testing requires:

  • A dataset of goldens
  • Setting up LLM tracing
  • A list of metrics for each component you wish to test
  • Construction of test cases at the component-level at runtime

This is currently only supported for deepeval users in Python, and tests must run locally. However, ad-hoc online evals for components in production are still available here.

How It Works

  1. Set up LLM tracing
  2. Pull your dataset from Confident AI
  3. Loop through goldens using the evals_iterator() and call your LLM app

If you have a component-level setup, you can also automatically run end-to-end testing, as shown in the previous section.

For this section, we’ll be using the same mock LLM app as in the previous section to demonstrate LLM tracing:

main.py
from openai import OpenAI

def llm_app(query: str) -> str:
    # Retriever for your vector db
    def retriever(query: str) -> list[str]:
        return ["List", "of", "text", "chunks"]

    # Generator that combines retrieved context with user query
    def generator(query: str, text_chunks: list[str]) -> str:
        return OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": query}
            ]
        ).choices[0].message.content

    # Calls retriever then generator
    return generator(query, retriever(query))

Run Component-Level Tests Locally

This section is nearly identical to this part of the previous section, where we used LLM tracing to run end-to-end evals. That’s because LLM tracing makes it convenient to evaluate any part of your application.

In this example, the main difference is that we call update_current_span instead of update_current_trace.
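To make the swap concrete, here’s a minimal sketch of a single component showing both calls side by side. The generator function and its body are illustrative only; you would use update_current_trace when testing the whole trace end-to-end, and update_current_span when testing this specific component (the trace-level call is shown commented out and mirrors the keyword arguments used for spans in this guide):

from deepeval.tracing import observe, update_current_trace, update_current_span

@observe()
def generator(query: str) -> str:
    res = "..."  # your LLM call goes here

    # End-to-end testing (previous section): set test case fields on the whole trace
    # update_current_trace(input=query, output=res)

    # Component-level testing (this section): set test case fields on the current span
    update_current_span(input=query, output=res)
    return res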

1

Set up LLM tracing and define metrics

Decorate your application with the @observe decorator, and provide metrics for components that you wish to evaluate:

main.py
from openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric
from deepeval.tracing import observe, update_current_span

@observe()
def llm_app(query: str) -> str:

    @observe(metrics=[ContextualRelevancyMetric()], embedder="your-embedding-model-name")
    def retriever(query: str) -> list[str]:
        chunks = ["List", "of", "text", "chunks"]
        update_current_span(input=query, retrieval_context=chunks)
        return chunks

    @observe(metrics=[AnswerRelevancyMetric()])
    def generator(query: str, text_chunks: list[str]) -> str:
        res = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}]
        ).choices[0].message.content
        update_current_span(input=query, output=res)
        return res

    return generator(query, retriever(query))

The example above shows how we are tracing our LLM app by simply adding a few @observe decorators:

  • Each @observe decorator creates a span, which represents a component
  • A trace, on the other hand, is created by the top-level @observe decorator and is made up of many spans/components
  • Include a list of metrics in @observe() for the components you wish to evaluate, and call update_current_span inside those components to create test cases for evaluation

When you call update_current_span() to set the input, output, retrieval_context, etc., deepeval automatically maps these values to create LLMTestCases.
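For intuition, the call on the generator span above maps to an LLMTestCase roughly like the one below. This is only an illustration of the mapping (the example strings are made up); you never construct these test cases yourself:

from deepeval.test_case import LLMTestCase

# Approximately what deepeval builds from
# update_current_span(input=query, output=res) on the generator span
test_case = LLMTestCase(
    input="Why is the sky blue?",                     # from input=
    actual_output="Because of Rayleigh scattering.",  # from output=
)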

2

Pull your dataset and loop through goldens

Pull your dataset in the same way as before, and use .evals_iterator() to loop through your goldens. You’ll use the data in each golden (most likely the input) to call your LLM app:

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator():
    llm_app(golden.input)  # Replace with your LLM app

Done ✅. You should see a link to your newly created shareable testing report.

[video]

You can also run your for-loop asynchronously:

import asyncio
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator():
    task = asyncio.create_task(a_llm_app(golden.input))
    dataset.evaluate(task)
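The loop above assumes a_llm_app, an async version of your LLM app. Here’s a minimal sketch of what that might look like, assuming your version of deepeval supports @observe on async functions and using openai’s AsyncOpenAI client:

from openai import AsyncOpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span

@observe(metrics=[AnswerRelevancyMetric()])
async def a_llm_app(query: str) -> str:
    response = await AsyncOpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    res = response.choices[0].message.content
    update_current_span(input=query, output=res)
    return res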

When you call your LLM app inside a dataset’s evals_iterator(), deepeval automatically captures each invocation and dynamically creates test cases based on the hierarchy of your @observe-decorated components. A few more things to know about component-level evals:

  • For components that are decorated with @observe but have no metrics attached, Confident AI will simply not evaluate them and will display them as regular spans instead (see the sketch after this list)
  • You would generally not use reference-based metrics for component-level testing. This is because goldens are designed to map one-to-one to test cases, which makes reference-based arguments such as expected_output redundant at the component level
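For example, here’s a sketch of a component that is traced but not evaluated. The rerank function is hypothetical and only illustrates the pattern of an @observe decorator with no metrics attached:

from deepeval.tracing import observe

# No metrics attached: this still appears as a span in the trace,
# but no test case is created and nothing is evaluated for it.
@observe()
def rerank(text_chunks: list[str]) -> list[str]:
    return sorted(text_chunks, key=len)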