Single-Turn, Component-Level Evals

Learn how to run component-level testing for single-turn use cases

Overview

Component-level testing lets you evaluate individual parts of your LLM application — retrievers, generators, tools, planners — rather than just the final output. This is essential for debugging complex pipelines where you need to pinpoint exactly which component is failing.

Requirements:

Component-level testing is currently only supported in deepeval Python and must be run locally. However, ad-hoc online evals for components in production are still available here.

How It Works

  1. Set up LLM tracing with @observe decorators and define metrics for each component
  2. Pull your dataset from Confident AI
  3. Loop through goldens using the evals_iterator() and invoke your LLM app

Unlike end-to-end testing, you don’t need to use golden.input to call your LLM app. The evals_iterator() simply controls how many times your app runs (once per golden). Test case fields are set via update_current_span() inside each component — the golden just determines the iteration count.
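Putting these steps together, a component-level test run has roughly the following shape. This is a condensed sketch with placeholder names (the component, metric, and dataset alias are illustrative); the full walkthrough below covers each step in detail:

from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span

# 1. Trace a component and attach the metrics you want to evaluate it with
@observe(metrics=[AnswerRelevancyMetric()])
def generator(query: str) -> str:
    output = "..."  # placeholder: call your LLM here
    update_current_span(input=query, output=output)
    return output

# 2. Pull your dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# 3. Run your app once per golden; spans with metrics become test cases
for golden in dataset.evals_iterator():
    generator(golden.input)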

Run Component-Level Tests Locally

This section is nearly identical to this part of the previous section, where we use LLM tracing to run end-to-end evals. That's because the same LLM tracing setup can evaluate anything in your application, from the final output down to individual components.

The main difference in this example is that we call update_current_span instead of update_current_trace.

We're also using the same mock LLM app from the previous section to demonstrate LLM tracing:

main.py
from openai import OpenAI

def llm_app(query: str) -> str:
    # Retriever for your vector db
    def retriever(query: str) -> list[str]:
        return ["List", "of", "text", "chunks"]

    # Generator that combines retrieved context with user query
    def generator(query: str, text_chunks: list[str]) -> str:
        return OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": query}
            ]
        ).choices[0].message.content

    # Calls retriever then generator
    return generator(query, retriever(query))

1. Set up LLM tracing and define metrics

Decorate your application with the @observe decorator, and provide metrics for components that you wish to evaluate:

main.py
from openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, ContextualRelevancyMetric
from deepeval.tracing import observe, update_current_span

@observe()
def llm_app(query: str) -> str:

    @observe(metrics=[ContextualRelevancyMetric()], embedder="your-embedding-model-name")
    def retriever(query: str) -> list[str]:
        chunks = ["List", "of", "text", "chunks"]
        update_current_span(input=query, retrieval_context=chunks)
        return chunks

    @observe(metrics=[AnswerRelevancyMetric()])
    def generator(query: str, text_chunks: list[str]) -> str:
        res = OpenAI().chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": query}],
        ).choices[0].message.content
        update_current_span(input=query, output=res)
        return res

    return generator(query, retriever(query))

The example above shows how we trace our LLM app by simply adding a few @observe decorators:

  • Each @observe decorator creates a span, which represents a component
  • A trace, on the other hand, is created by the top-level @observe decorator and is made up of many spans/components
  • You include a list of metrics in @observe() for the components you wish to evaluate, and call the update_current_span function inside those components to create test cases for evaluation

When you call update_current_span() to set inputs, outputs, retrieval contexts, etc., deepeval automatically maps these fields to create LLMTestCases.
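For example, the update_current_span(input=query, output=res) call inside generator maps conceptually to a test case like the one below. This is only an illustration of the mapping, with placeholder values; deepeval builds the actual LLMTestCase for you:

from deepeval.test_case import LLMTestCase

# Roughly what deepeval assembles for the generator span
test_case = LLMTestCase(
    input="How tall is Mount Everest?",         # from input=query
    actual_output="Mount Everest is 8,849 m.",  # from output=res
)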


2. Pull dataset, and loop through goldens

Pull your dataset and use the .evals_iterator() to iterate. The iterator controls how many times your LLM app runs — once per golden in your dataset.

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for _ in dataset.evals_iterator():
    llm_app("any input")  # golden.input is optional for component-level testing

Since test case fields are populated via update_current_span() inside your components, you can pass any input to your LLM app — or use golden.input if your test scenarios require specific inputs.
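For example, if your goldens contain realistic user queries, you can pass them straight through (assuming the same llm_app and dataset as above):

for golden in dataset.evals_iterator():
    llm_app(golden.input)  # use the golden's input when your scenarios depend on it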

Done ✅. You should see a link to your newly created sharable testing report.

Component-Level Testing Report

You can also run your for-loop asynchronously:

import asyncio
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator():
    task = asyncio.create_task(a_llm_app(golden.input))
    dataset.evaluate(task)
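Here, a_llm_app refers to an async version of your LLM app. As a rough sketch, assuming a single generator-style component and using OpenAI's async client as just one possible implementation:

from openai import AsyncOpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import observe, update_current_span

@observe(metrics=[AnswerRelevancyMetric()])
async def a_llm_app(query: str) -> str:
    # Async call to the LLM; swap in your own client or components here
    response = await AsyncOpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    res = response.choices[0].message.content
    update_current_span(input=query, output=res)
    return res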

When you call your LLM app inside a dataset's evals_iterator(), deepeval automatically captures each invocation of your LLM app and creates test cases dynamically based on the hierarchy of @observe-decorated components. Here is some more info about component-level evals:

  • For components that are decorated with @observe but have no metrics attached, Confident AI will simply not test those components and will display them as regular spans instead (see the sketch after this list)
  • You would generally not use reference-based metrics for component-level testing. This is because goldens are designed to map 1-to-1 to test cases, which makes arguments such as expected_output redundant
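For instance, a hypothetical planner component that is traced without metrics still shows up in the trace view; it just never produces a test case:

from deepeval.tracing import observe

@observe()  # no metrics attached: traced and displayed as a regular span, never evaluated
def planner(query: str) -> str:
    # hypothetical component, for illustration only
    return "retrieve-then-generate"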