Pull Datasets
Overview
In the previous section, we learnt how to push and queue goldens via Confident’s Evals API. In this section, we will learn how to:
- Pull single and multi-turn datasets for evaluation
- Access custom column values from goldens
- Parse multi-modal goldens (images) into an evaluatable format
- Use the `evals_iterator` to run evals on single-turn datasets (Python only)
How it works
Code-driven evals follow a similar process to no-code evals, but you control the evaluation loop:
- Pull dataset — fetch goldens from Confident AI using the Evals API
- Invoke AI app — call your AI app with each golden’s input
- Create test cases — map golden fields and AI outputs into test cases
- Run evaluation — execute metrics on your test cases and push results
Here’s a visual representation of the data flow:
The key difference from no-code evals is that you control the evaluation loop — pulling goldens, invoking your AI app, and constructing test cases all happen in your code.
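The four steps above can be sketched as a plain-Python skeleton. This is illustrative only, not real SDK calls: `fetch_goldens`, `my_ai_app`, and `score` are hypothetical stand-ins for pulling via the Evals API, your own application, and a metric of your choice.

```python
from dataclasses import dataclass

@dataclass
class Golden:                 # minimal stand-in for a pulled golden
    input: str
    expected_output: str

def fetch_goldens():          # step 1: in practice, pull via the Evals API
    return [Golden(input="What is 2 + 2?", expected_output="4")]

def my_ai_app(prompt: str) -> str:   # step 2: your AI app (stubbed here)
    return "4"

def score(actual: str, expected: str) -> float:   # step 4: a toy exact-match metric
    return 1.0 if actual.strip() == expected.strip() else 0.0

test_cases = []
for golden in fetch_goldens():
    actual = my_ai_app(golden.input)              # invoke your AI app
    test_cases.append({                           # step 3: map golden + output to a test case
        "input": golden.input,
        "actual_output": actual,
        "expected_output": golden.expected_output,
    })

results = [score(tc["actual_output"], tc["expected_output"]) for tc in test_cases]
```

In a real run, the stubs are replaced by the Evals API pull, your production app, and deepeval metrics — but the loop shape stays the same.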
You can manage your datasets in any project by configuring a CONFIDENT_API_KEY.
- For default usage, set `CONFIDENT_API_KEY` as an environment variable.
- To target a specific project, pass a `confident_api_key` directly when creating the `EvaluationDataset`.
When both are provided, the confident_api_key passed to EvaluationDataset always takes precedence over the environment variable.
Pull Goldens via Evals API
Datasets are either single-turn or multi-turn: pulling a single-turn dataset gives you single-turn goldens, and pulling a multi-turn dataset gives you multi-turn goldens.
You are responsible for mapping each golden to a test case of the same type — single-turn goldens to single-turn test cases, and multi-turn goldens to multi-turn test cases.
Pulling goldens via the Evals API will only pull finalized goldens by default. Below is a single-turn dataset example (click here for multi-turn usage of datasets):
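Below is a sketch of the pull-and-map flow. The comments show where the real deepeval calls go (they require an installed SDK and a `CONFIDENT_API_KEY`); the executable part models a pulled golden and its mapping to a single-turn test case in plain Python so the data shape is clear. The dataset alias and field values are made up.

```python
# In practice, with deepeval installed and CONFIDENT_API_KEY set:
#   from deepeval.dataset import EvaluationDataset
#   dataset = EvaluationDataset()
#   dataset.pull(alias="My Dataset")   # pulls finalized goldens by default
#   goldens = dataset.goldens

# Illustrative stand-in for one pulled single-turn golden:
golden = {
    "input": "Summarize our refund policy.",
    "expected_output": "Refunds within 30 days.",
}

def your_llm_app(prompt: str) -> str:   # stub for your AI app
    return "Refunds within 30 days."

# Map the golden's fields plus your app's output into a single-turn test case:
test_case = {
    "input": golden["input"],
    "actual_output": your_llm_app(golden["input"]),
    "expected_output": golden["expected_output"],
}
```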
Using Custom Columns
If your dataset has custom columns, you can access them via the custom_column_key_values field on each golden:
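As a sketch, assuming `custom_column_key_values` is a plain mapping of column name to value (the shape the field name suggests — verify the exact type against the SDK), the stand-in `Golden` class and column names below are made up for illustration:

```python
class Golden:                             # stand-in for a pulled golden
    def __init__(self, input, custom_column_key_values):
        self.input = input
        self.custom_column_key_values = custom_column_key_values

golden = Golden(
    input="What plan am I on?",
    custom_column_key_values={"persona": "enterprise admin", "region": "EU"},
)

# Read a custom column value off the golden:
persona = golden.custom_column_key_values["persona"]
```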
Using Images
Any golden text field (such as input, scenario, etc.) that contains an image represents it in the format `[DEEPEVAL:IMAGE:url]`, where the url inside is a public URL that can be accessed by anyone.
For goldens containing images, you can parse the field and use the parts as follows:
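One way to split such a field yourself is a small regex pass. This is an illustrative helper, not part of the SDK; `split_text_and_images` is a hypothetical name.

```python
import re

IMAGE_PATTERN = re.compile(r"\[DEEPEVAL:IMAGE:(.*?)\]")

def split_text_and_images(field: str):
    """Split a golden text field into ("text", chunk) and ("image", url) parts."""
    parts = []
    last = 0
    for match in IMAGE_PATTERN.finditer(field):
        text = field[last:match.start()]
        if text:
            parts.append(("text", text))
        parts.append(("image", match.group(1)))   # the public image URL
        last = match.end()
    if field[last:]:
        parts.append(("text", field[last:]))
    return parts

field = "Describe this chart: [DEEPEVAL:IMAGE:https://example.com/chart.png] in one sentence."
parts = split_text_and_images(field)
```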
The deepeval Python SDK offers a utility method called `convert_to_multi_modal_array`, which converts a string containing images in the `[DEEPEVAL:IMAGE:url]` format into a list of strings and `MLLMImage` items.
The resulting multimodal array is a list of strings and `MLLMImage` items; you can loop over it to construct a messages array with images to pass to your MLLM. Here’s an example showing how to construct a messages array for OpenAI:
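As a sketch: assuming the multimodal array mixes plain strings with image objects exposing a `url` attribute (a stand-in `ImagePart` class is used below instead of deepeval's `MLLMImage` — check the SDK for the exact attribute names), an OpenAI chat-completions `messages` array can be built like this:

```python
class ImagePart:                      # stand-in for deepeval's MLLMImage
    def __init__(self, url: str):
        self.url = url

multimodal_array = [
    "Describe this chart: ",
    ImagePart("https://example.com/chart.png"),
    " in one sentence.",
]

# Build OpenAI-style content parts from the mixed array:
content = []
for item in multimodal_array:
    if isinstance(item, str):
        content.append({"type": "text", "text": item})
    else:
        content.append({"type": "image_url", "image_url": {"url": item.url}})

messages = [{"role": "user", "content": content}]
```

The `messages` list can then be passed to a vision-capable chat-completions call.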
This is only required when using datasets in code - Confident AI automatically handles image parsing and conversion on the platform.
deepeval’s native models like `GPTModel` and `GeminiModel` automatically parse images in the `[DEEPEVAL:IMAGE:url]` format. You can simply pass any golden field containing images to the `.generate()` or `.a_generate()` methods, and deepeval handles the images for you internally.
Using Evals Iterator
Typically, you would just provide your dataset as a list of test cases for evaluation. However, if you’re running single-turn, end-to-end or component-level evaluations and using deepeval in Python, you can use the `evals_iterator()` instead:
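The pattern looks like the loop below. This is an illustrative mock: `FakeDataset.evals_iterator` stands in for deepeval's real iterator, which additionally collects traces from your app and runs metrics for you.

```python
class FakeGolden:                       # stand-in for a pulled golden
    def __init__(self, input):
        self.input = input

class FakeDataset:                      # stand-in for EvaluationDataset
    def __init__(self, goldens):
        self.goldens = goldens

    def evals_iterator(self):
        # The real iterator also triggers evaluation around each golden.
        for golden in self.goldens:
            yield golden

def your_llm_app(prompt: str) -> str:   # stub; in practice, your traced app
    return f"answer to: {prompt}"

dataset = FakeDataset([FakeGolden("Hi"), FakeGolden("Bye")])

outputs = []
for golden in dataset.evals_iterator():
    outputs.append(your_llm_app(golden.input))   # just invoke your app per golden
```

Note how the loop body only invokes your app — constructing test cases and scoring is handled by the iterator itself in the real SDK.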
You’ll need to trace your LLM app to make this work. Read this section on running single-turn end-to-end evals with tracing to learn more.
Datasets in CI/CD
Using datasets in CI/CD follows the same pattern as local evaluation — pull your dataset, create test cases, and run evaluation. The only difference is that you use assert_test() instead of evaluate() to integrate with pytest:
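Here is a skeleton test file showing the shape of this pattern. It is hedged: the test cases are hard-coded stand-ins for goldens you would pull, and a plain `assert` replaces deepeval's `assert_test()` so the sketch runs standalone.

```python
import pytest

# Stand-ins for pulled goldens already mapped to test cases:
TEST_CASES = [
    {"input": "2 + 2?", "actual_output": "4", "expected_output": "4"},
    {"input": "Capital of France?", "actual_output": "Paris", "expected_output": "Paris"},
]

@pytest.mark.parametrize("test_case", TEST_CASES)
def test_llm_app(test_case):
    # In practice: assert_test(test_case, metrics=[...]) from deepeval
    assert test_case["actual_output"] == test_case["expected_output"]
```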
Then run with deepeval test run test_llm_app.py to execute your tests. Learn more about setting up automated testing in the Unit-Testing in CI/CD section.
Next Steps
Now that you’re familiar with the full dataset lifecycle, time to dive into running evaluations end to end.