Pull Datasets for Testing

Pull datasets to use them for evaluation

Overview

In the previous section, we learned how to curate datasets by managing goldens directly on the platform or via Confident’s Evals API. In this section, we will learn how to:

  • Pull datasets for LLM testing
  • Associate datasets with test runs
  • Use the evals_iterator to run evals on datasets (for Python users)

Pull Goldens via Evals API

Datasets are either single-turn or multi-turn: pulling a single-turn dataset gives you single-turn goldens, and pulling a multi-turn dataset gives you multi-turn goldens.

You are responsible for mapping single-turn goldens to single-turn test cases, and multi-turn goldens to multi-turn test cases.

Pulling goldens via the Evals API will only pull finalized goldens by default.
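If you also need goldens that haven’t been finalized, the pull call can be adjusted. Below is a minimal sketch, assuming your deepeval version exposes a finalized parameter on .pull() — check the method signature for your release before relying on it:

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
# Assumption: finalized=False also includes goldens that haven't been finalized yet
dataset.pull(alias="YOUR-DATASET-ALIAS", finalized=False)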

1. Pull goldens

First, use the .pull() method:

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

print(dataset.goldens)  # Check it's pulled correctly
2. Construct test cases

Then loop through your dataset of goldens to create a list of test cases:

main.py
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input),  # replace llm_app with a call to your LLM application
        # map any additional fields here
    )
    dataset.add_test_case(test_case)
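If your goldens also carry reference data, you can map those fields onto the test case too. A minimal sketch, assuming your goldens have expected_output populated and that a hypothetical llm_app returns both an answer and its retrieved context:

from deepeval.test_case import LLMTestCase

for golden in dataset.goldens:
    # Hypothetical llm_app that returns (answer, retrieved_chunks)
    answer, retrieved_chunks = llm_app(golden.input)
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=answer,
        expected_output=golden.expected_output,  # reference answer stored on the golden
        retrieval_context=retrieved_chunks,      # lets you run retrieval/RAG metrics
    )
    dataset.add_test_case(test_case)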

For multi-turn datasets, you will create ConversationalTestCases instead:

main.py
from deepeval.test_case import ConversationalTestCase

for golden in dataset.goldens:
    test_case = simulate(golden)  # simulate() is your own conversation-generation logic
    dataset.add_test_case(test_case)
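Here, simulate() stands in for however you generate a conversation with your chatbot. As a rough sketch of what it could return, assuming the Turn-based ConversationalTestCase API in recent deepeval versions and a hypothetical llm_chatbot function:

from deepeval.test_case import ConversationalTestCase, Turn

def simulate(golden):
    # Hypothetical single-exchange "simulation": in practice you would run your
    # chatbot in a loop (or use a conversation simulator) to build up the turns.
    user_message = golden.scenario  # assumption: your multi-turn goldens expose a scenario
    assistant_reply = llm_chatbot(user_message)  # replace with your chatbot
    return ConversationalTestCase(
        turns=[
            Turn(role="user", content=user_message),
            Turn(role="assistant", content=assistant_reply),
        ]
    )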
3. Run an evaluation

Because you called .add_test_case() in the previous step, Confident AI will automatically associate the resulting test run with your dataset each time you run evaluate:

from deepeval import evaluate

evaluate(test_cases=dataset.test_cases, metrics=[...])
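For example, with a concrete metric (AnswerRelevancyMetric is used here purely as an illustration; pick whichever deepeval metrics fit your use case):

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric

evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric(threshold=0.7)],
)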

Using Evals Iterator

Typically, you would just provide your dataset as a list of test cases for evaluation. However, if you’re running single-turn, end-to-end OR component-level evaluations and are using deepeval in Python, you can use the evals_iterator() instead:

main.py
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator():
    llm_app(golden.input)  # Replace with your LLM app

# Async version
# import asyncio
#
# for golden in dataset.evals_iterator():
#     task = asyncio.create_task(a_llm_app(golden.input))
#     dataset.evaluate(task)

You’ll need to trace your LLM app to make this work. Read this section on running single-turn end-to-end evals with tracing to learn more.
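As a rough sketch of what that tracing can look like, assuming your deepeval version provides the @observe decorator and update_current_span in deepeval.tracing:

from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase

@observe()  # traces each call to llm_app as a span
def llm_app(user_input: str) -> str:
    response = generate_answer(user_input)  # hypothetical call to your model
    # Attach a test case to the current span so metrics can be run against it
    update_current_span(
        test_case=LLMTestCase(input=user_input, actual_output=response)
    )
    return response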