Pull Datasets

Pull datasets locally to use them for evaluation.

Overview

In the previous section, we learnt how to push and queue goldens via Confident’s Evals API. In this section, we will learn how to:

  • Pull single and multi-turn datasets for evaluation
  • Access custom column values from goldens
  • Parse multi-modal goldens (images) into an evaluatable format
  • Use the evals_iterator to run evals on single-turn datasets (Python only)

How it works

Code-driven evals follow a similar process to no-code evals, but you control the evaluation loop:

  1. Pull dataset — fetch goldens from Confident AI using the Evals API
  2. Invoke AI app — call your AI app with each golden’s input
  3. Create test cases — map golden fields and AI outputs into test cases
  4. Run evaluation — execute metrics on your test cases and push results


The key difference from no-code evals is that you control the evaluation loop — pulling goldens, invoking your AI app, and constructing test cases all happen in your code.
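The four steps above can be sketched library-free as follows; every name here (pull_dataset, llm_app, run_metric, and the golden fields) is a stand-in for your own code and the deepeval APIs shown later in this section:

```python
# Library-free sketch of the code-driven evaluation loop.
# All names here are stand-ins, not real deepeval APIs.

def pull_dataset(alias):
    # 1. Pull dataset: in practice this calls the Evals API.
    return [{"input": "What is 2 + 2?", "expected_output": "4"}]

def llm_app(prompt):
    # 2. Invoke AI app: replace with your real application.
    return "4"

def run_metric(test_case):
    # 4. Run evaluation: a toy exact-match "metric".
    return test_case["actual_output"] == test_case["expected_output"]

goldens = pull_dataset("YOUR-DATASET-ALIAS")
test_cases = [
    # 3. Create test cases: map golden fields plus the app's output.
    {**golden, "actual_output": llm_app(golden["input"])}
    for golden in goldens
]
results = [run_metric(tc) for tc in test_cases]
print(results)  # [True]
```

In the real loop, the dicts become Golden and LLMTestCase objects and the toy metric becomes a deepeval metric, but the control flow is the same.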

You can manage your datasets in any project by configuring a CONFIDENT_API_KEY.

  • For default usage, set CONFIDENT_API_KEY as an environment variable.
  • To target a specific project, pass a confident_api_key directly when creating the EvaluationDataset.
```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(confident_api_key="confident_us...")
dataset.delete(alias="YOUR-DATASET-ALIAS")
```

When both are provided, the confident_api_key passed to EvaluationDataset always takes precedence over the environment variable.

Pull Goldens via Evals API

Datasets are either single-turn or multi-turn: pulling a single-turn dataset gives you single-turn goldens, and pulling a multi-turn dataset gives you multi-turn goldens.

You are responsible for mapping single-turn goldens to single-turn test cases, and multi-turn goldens to multi-turn test cases.

By default, pulling goldens via the Evals API only pulls finalized goldens. Below is a single-turn dataset example (multi-turn usage is covered in step 2 below):

1. Pull goldens

First use the .pull() method:

main.py

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

print(dataset.goldens)  # Check it's pulled correctly
```
2. Construct test cases

Then loop through your dataset of goldens to create a list of test cases:

main.py

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input),
        # map any additional fields here
    )
    dataset.add_test_case(test_case)
```

For multi-turn datasets, you will create ConversationalTestCases instead:

main.py

```python
from deepeval.test_case import ConversationalTestCase

for golden in dataset.goldens:
    test_case = simulate(golden)  # simulate conversation
    dataset.add_test_case(test_case)
```
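If you don't have a conversation simulator yet, here is a rough, library-free sketch of what a simulate() helper might do; chat_app, the golden's scenario field, and the plain turn dicts are all stand-ins, not deepeval APIs:

```python
# Library-free sketch of a simulate() helper: drive a multi-turn
# conversation against a chat app and collect the turns.
# chat_app and the golden fields below are stand-ins, not deepeval APIs.

def chat_app(history):
    # Placeholder chat app: echoes the last user message.
    return f"echo: {history[-1]['content']}"

def simulate(golden, max_turns=3):
    turns = []
    for i in range(max_turns):
        user_msg = f"{golden['scenario']} (turn {i + 1})"
        turns.append({"role": "user", "content": user_msg})
        turns.append({"role": "assistant", "content": chat_app(turns)})
    return turns

golden = {"scenario": "User asks about refunds"}
turns = simulate(golden)
print(len(turns))  # 6: three user/assistant pairs
```

In practice you would map the collected turns into a ConversationalTestCase rather than keeping raw dicts.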
3. Run an evaluation

Because you called .add_test_case() in the previous step, Confident AI will automatically associate every test run you create with evaluate with your dataset:

```python
from deepeval import evaluate

evaluate(test_cases=dataset.test_cases, metrics=[...])
```

Using Custom Columns

If your dataset has custom columns, you can access them via the custom_column_key_values field on each golden:

main.py

```python
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    # Access custom column values
    difficulty = golden.custom_column_key_values.get("difficulty")
    category = golden.custom_column_key_values.get("category")

    # Use them in your test case or LLM app invocation
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input, difficulty=difficulty),
    )
    dataset.add_test_case(test_case)
```

Using Images

Any golden text field (such as input, scenario, etc.) that contains an image will embed it in the format [DEEPEVAL:IMAGE:url], where url is a public URL that can be accessed by anyone.
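To make the format concrete, here is a minimal, library-free parser for such a string (the URL is made up for illustration; in practice you would use deepeval's convert_to_multi_modal_array utility described below):

```python
import re

# Minimal, library-free parser for the [DEEPEVAL:IMAGE:url] format.
# The URL in golden_input is made up for illustration.
IMAGE_PATTERN = re.compile(r"\[DEEPEVAL:IMAGE:(.*?)\]")

def split_text_and_images(text):
    parts = []
    last = 0
    for match in IMAGE_PATTERN.finditer(text):
        if match.start() > last:
            parts.append(("text", text[last:match.start()]))
        parts.append(("image", match.group(1)))
        last = match.end()
    if last < len(text):
        parts.append(("text", text[last:]))
    return parts

golden_input = "Describe this chart: [DEEPEVAL:IMAGE:https://example.com/chart.png] in one sentence."
print(split_text_and_images(golden_input))
# [('text', 'Describe this chart: '), ('image', 'https://example.com/chart.png'), ('text', ' in one sentence.')]
```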

For goldens containing images, you can parse and use them as follows.

The deepeval Python SDK offers a utility method called convert_to_multi_modal_array, which converts a string containing images in the [DEEPEVAL:IMAGE:url] format into a list of strings and MLLMImage items:

```python
from deepeval.dataset import EvaluationDataset
from deepeval.utils import convert_to_multi_modal_array

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

for golden in dataset.goldens:
    multimodal_array = convert_to_multi_modal_array(golden.input)
```

The multimodal_array here is a list of strings and MLLMImage items. You can loop over it to construct a messages array with images to pass to your MLLM. Here's an example of constructing the message content for OpenAI:

```python
from deepeval.test_case import MLLMImage

messages = []
for element in multimodal_array:
    if isinstance(element, str):
        messages.append({"type": "text", "text": element})
    elif isinstance(element, MLLMImage):
        if element.url:
            messages.append(
                {
                    "type": "image_url",
                    "image_url": {"url": element.url},
                }
            )
```
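Note that the list built above is the content of a single user message, not the top-level messages array. A minimal sketch of the full request shape (the content parts and model name below are placeholders):

```python
# Wrap the multimodal content parts in a single user message.
# content_parts stands in for the list built above; the model
# name is a placeholder, not a real model.
content_parts = [
    {"type": "text", "text": "Describe this chart:"},
    {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
]

request = {
    "model": "your-multimodal-model",
    "messages": [{"role": "user", "content": content_parts}],
}
print(request["messages"][0]["role"])  # user
```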

This is only required when using datasets in code - Confident AI automatically handles image parsing and conversion on the platform.

deepeval’s native models, such as GPTModel and GeminiModel, automatically parse images in the [DEEPEVAL:IMAGE:url] format. You can pass any golden field containing images to their .generate() or .a_generate() methods, and deepeval handles the images internally:

```python
from deepeval.models import GPTModel
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

model = GPTModel(model="gpt-5.2")

for golden in dataset.goldens:
    print(model.generate(golden.input))  # Images are automatically handled by deepeval
```

Using Evals Iterator

Typically, you would just provide your dataset as a list of test cases for evaluation. However, if you’re running single-turn, end-to-end OR component-level evaluations and using deepeval in Python, you can use the evals_iterator() instead:

main.py

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator():
    llm_app(golden.input)  # Replace with your LLM app

# Async version
# import asyncio
#
# for golden in dataset.evals_iterator():
#     task = asyncio.create_task(a_llm_app(golden.input))
#     dataset.evaluate(task)
```

You’ll need to trace your LLM app to make this work. Read this section on running single-turn end-to-end evals with tracing to learn more.

Datasets in CI/CD

Using datasets in CI/CD follows the same pattern as local evaluation — pull your dataset, create test cases, and run evaluation. The only difference is that you use assert_test() instead of evaluate() to integrate with pytest:

test_llm_app.py

```python
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    dataset.add_test_case(test_case)


@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, metrics=[AnswerRelevancyMetric()])
```

Then run with deepeval test run test_llm_app.py to execute your tests. Learn more about setting up automated testing in the Unit-Testing in CI/CD section.

Next Steps

Now that you’re familiar with the full dataset lifecycle, it’s time to dive into running evaluations end to end.