Single-Turn, Component-Level Evals

Learn how to run component-level testing for single-turn use cases

Overview

Component-level testing lets you evaluate individual parts of your LLM application — retrievers, generators, tools, planners — rather than just the final output. This is essential for debugging complex pipelines where you need to pinpoint exactly which component is failing.

Requirements:

Component-level testing is currently only supported in deepeval Python and must be run locally. However, ad-hoc online evals for components in production are still available here.

How It Works

  1. Set up LLM tracing with @observe decorators and define metrics for each component
  2. Pull your dataset from Confident AI
  3. Loop through goldens using the evals_iterator() and invoke your LLM app

Unlike end-to-end testing, you don’t need to use golden.input to call your LLM app. The evals_iterator() simply controls how many times your app runs (once per golden). Test case fields are set via update_current_span() inside each component — the golden just determines the iteration count.
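Putting these steps together, a component-level test run has roughly the following shape. This is a condensed sketch with placeholder names (the component, metric, and dataset alias are illustrative); the full walkthrough below covers each step in detail:

from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span

# 1. Trace a component and attach the metrics you want to evaluate it with
@observe(metrics=[AnswerRelevancyMetric()])
def generator(query: str) -> str:
    output = "..."  # placeholder: call your LLM here
    update_current_span(input=query, output=output)
    return output

# 2. Pull your dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# 3. Run your app once per golden; spans with metrics become test cases
for golden in dataset.evals_iterator():
    generator(golden.input)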

Run Component-Level Tests Locally

This section is nearly identical to this part of the previous section, where we use LLM tracing to run end-to-end evals. That's because the same LLM tracing setup can evaluate anything in your application, from the final output down to individual components.

The main difference in this example is that we call update_current_span instead of update_current_trace.

We're also using the same mock LLM app from the previous section to demonstrate LLM tracing:

main.py
from openai import OpenAI

def llm_app(query: str) -> str:
    # Retriever for your vector db
    def retriever(query: str) -> list[str]:
        return ["List", "of", "text", "chunks"]

    # Generator that combines retrieved context with user query
    def generator(query: str, text_chunks: list[str]) -> str:
        return OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": query}
            ]
        ).choices[0].message.content

    # Calls retriever then generator
    return generator(query, retriever(query))

1. Set up LLM tracing and define metrics

Decorate your application with the @observe decorator, and provide metrics for components that you wish to evaluate:

main.py
from openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric
from deepeval.tracing import observe, update_current_span

@observe()
def llm_app(query: str) -> str:

    @observe(metrics=[ContextualRelevancyMetric()], embedder="your-embedding-model-name")
    def retriever(query: str) -> list[str]:
        chunks = ["List", "of", "text", "chunks"]
        update_current_span(input=query, retrieval_context=chunks)
        return chunks

    @observe(metrics=[AnswerRelevancyMetric()])
    def generator(query: str, text_chunks: list[str]) -> str:
        res = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}],
        ).choices[0].message.content
        update_current_span(input=query, output=res)
        return res

    return generator(query, retriever(query))

The example above shows how we trace our LLM app by simply adding a few @observe decorators:

  • Each @observe decorator creates a span, which represents a component
  • A trace, on the other hand, is created by the top-level @observe decorator and is made up of many spans/components
  • You include a list of metrics in @observe() for the components you wish to evaluate, and call the update_current_span function inside those components to create test cases for evaluation

When you call update_current_span() to set inputs, outputs, retrieval contexts, etc., deepeval automatically maps these fields to create LLMTestCases.
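For example, the update_current_span(input=query, output=res) call inside generator maps conceptually to a test case like the one below. This is only an illustration of the mapping, with placeholder values; deepeval builds the actual LLMTestCase for you:

from deepeval.test_case import LLMTestCase

# Roughly what deepeval assembles for the generator span
test_case = LLMTestCase(
    input="How tall is Mount Everest?",         # from input=query
    actual_output="Mount Everest is 8,849 m.",  # from output=res
)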


2. Pull dataset, and loop through goldens

Pull your dataset and use the .evals_iterator() to iterate. The iterator controls how many times your LLM app runs — once per golden in your dataset.

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for _ in dataset.evals_iterator():
    llm_app("any input")  # golden.input is optional for component-level testing

Since test case fields are populated via update_current_span() inside your components, you can pass any input to your LLM app — or use golden.input if your test scenarios require specific inputs.
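For example, if your goldens contain realistic user queries, you can pass them straight through (assuming the same llm_app and dataset as above):

for golden in dataset.evals_iterator():
    llm_app(golden.input)  # use the golden's input when your scenarios depend on it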

Done ✅. You should see a link to your newly created sharable testing report.

Component-Level Testing Report

You can also run your for-loop asynchronously:

import asyncio
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator():
    task = asyncio.create_task(a_llm_app(golden.input))
    dataset.evaluate(task)
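Here, a_llm_app refers to an async version of your LLM app. As a rough sketch, assuming a single generator-style component and using OpenAI's async client as just one possible implementation:

from openai import AsyncOpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span

@observe(metrics=[AnswerRelevancyMetric()])
async def a_llm_app(query: str) -> str:
    # Async call to the LLM; swap in your own client or components here
    response = await AsyncOpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    res = response.choices[0].message.content
    update_current_span(input=query, output=res)
    return res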

When you call your LLM app inside a dataset's evals_iterator(), deepeval automatically captures each invocation of your LLM app and creates test cases dynamically based on the hierarchy of @observe-decorated components. Here is some more info about component-level evals:

  • For components that are decorated with @observe but have no metrics attached, Confident AI will simply not test those components and will display them as regular spans instead (see the sketch after this list)
  • You would generally not use reference-based metrics for component-level testing. This is because goldens are designed to map 1-to-1 to test cases, which makes arguments such as expected_output redundant
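For instance, a hypothetical planner component that is traced without metrics still shows up in the trace view; it just never produces a test case:

from deepeval.tracing import observe

@observe()  # no metrics attached: traced and displayed as a regular span, never evaluated
def planner(query: str) -> str:
    # hypothetical component, for illustration only
    return "retrieve-then-generate"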