Pull Datasets
Overview
In the previous section, we learnt how to push and queue goldens via Confident’s Evals API. In this section, we will learn how to:
- Pull single and multi-turn datasets for evaluation
- Access custom column values from goldens
- Parse multi-modal goldens (images) into an evaluatable format
- Use the `evals_iterator` to run evals on single-turn datasets (Python only)
How it works
Code-driven evals follow a similar process to no-code evals, but you control the evaluation loop:
- Pull dataset — fetch goldens from Confident AI using the Evals API
- Invoke AI app — call your AI app with each golden’s input
- Create test cases — map golden fields and AI outputs into test cases
- Run evaluation — execute metrics on your test cases and push results
Here’s a visual representation of the data flow:
The key difference from no-code evals is that you control the evaluation loop — pulling goldens, invoking your AI app, and constructing test cases all happen in your code.
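The four steps above can be sketched as a plain-Python skeleton. This is illustrative only, not real SDK calls: `fetch_goldens`, `my_ai_app`, and `score` are hypothetical stand-ins for pulling via the Evals API, your own application, and a metric of your choice.

```python
from dataclasses import dataclass

@dataclass
class Golden:                 # minimal stand-in for a pulled golden
    input: str
    expected_output: str

def fetch_goldens():          # step 1: in practice, pull via the Evals API
    return [Golden(input="What is 2 + 2?", expected_output="4")]

def my_ai_app(prompt: str) -> str:   # step 2: your AI app (stubbed here)
    return "4"

def score(actual: str, expected: str) -> float:   # step 4: a toy exact-match metric
    return 1.0 if actual.strip() == expected.strip() else 0.0

test_cases = []
for golden in fetch_goldens():
    actual = my_ai_app(golden.input)              # invoke your AI app
    test_cases.append({                           # step 3: map golden + output to a test case
        "input": golden.input,
        "actual_output": actual,
        "expected_output": golden.expected_output,
    })

results = [score(tc["actual_output"], tc["expected_output"]) for tc in test_cases]
```

In a real run, the stubs are replaced by the Evals API pull, your production app, and deepeval metrics — but the loop shape stays the same.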
You can manage your datasets in any project by configuring a CONFIDENT_API_KEY.
- For default usage, set `CONFIDENT_API_KEY` as an environment variable.
- To target a specific project, pass a `confident_api_key` directly when creating the `EvaluationDataset`.
When both are provided, the confident_api_key passed to EvaluationDataset always takes precedence over the environment variable.
Pull Goldens via Evals API
Datasets are either single-turn or multi-turn: pulling a single-turn dataset gives you single-turn goldens, and pulling a multi-turn dataset gives you multi-turn goldens.
You are responsible for mapping each golden to a test case of the same type — single-turn goldens to single-turn test cases, and multi-turn goldens to multi-turn test cases.
Pulling goldens via the Evals API will only pull finalized goldens by default. Below is a single-turn dataset example (click here for multi-turn usage of datasets):
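Below is a sketch of the pull-and-map flow. The comments show where the real deepeval calls go (they require an installed SDK and a `CONFIDENT_API_KEY`); the executable part models a pulled golden and its mapping to a single-turn test case in plain Python so the data shape is clear. The dataset alias and field values are made up.

```python
# In practice, with deepeval installed and CONFIDENT_API_KEY set:
#   from deepeval.dataset import EvaluationDataset
#   dataset = EvaluationDataset()
#   dataset.pull(alias="My Dataset")   # pulls finalized goldens by default
#   goldens = dataset.goldens

# Illustrative stand-in for one pulled single-turn golden:
golden = {
    "input": "Summarize our refund policy.",
    "expected_output": "Refunds within 30 days.",
}

def your_llm_app(prompt: str) -> str:   # stub for your AI app
    return "Refunds within 30 days."

# Map the golden's fields plus your app's output into a single-turn test case:
test_case = {
    "input": golden["input"],
    "actual_output": your_llm_app(golden["input"]),
    "expected_output": golden["expected_output"],
}
```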
Using Custom Columns
If your dataset has custom columns, you can access them via the custom_column_key_values field on each golden:
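As a sketch, assuming `custom_column_key_values` is a plain mapping of column name to value (the shape the field name suggests — verify the exact type against the SDK), the stand-in `Golden` class and column names below are made up for illustration:

```python
class Golden:                             # stand-in for a pulled golden
    def __init__(self, input, custom_column_key_values):
        self.input = input
        self.custom_column_key_values = custom_column_key_values

golden = Golden(
    input="What plan am I on?",
    custom_column_key_values={"persona": "enterprise admin", "region": "EU"},
)

# Read a custom column value off the golden:
persona = golden.custom_column_key_values["persona"]
```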
Using Images
Any golden text field (such as input, scenario, etc.) that contains an image represents it in the format `[DEEPEVAL:IMAGE:url]`, where the url inside is a public URL that can be accessed by anyone.
For goldens containing images, you can parse the field and use the parts as follows:
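One way to split such a field yourself is a small regex pass. This is an illustrative helper, not part of the SDK; `split_text_and_images` is a hypothetical name.

```python
import re

IMAGE_PATTERN = re.compile(r"\[DEEPEVAL:IMAGE:(.*?)\]")

def split_text_and_images(field: str):
    """Split a golden text field into ("text", chunk) and ("image", url) parts."""
    parts = []
    last = 0
    for match in IMAGE_PATTERN.finditer(field):
        text = field[last:match.start()]
        if text:
            parts.append(("text", text))
        parts.append(("image", match.group(1)))   # the public image URL
        last = match.end()
    if field[last:]:
        parts.append(("text", field[last:]))
    return parts

field = "Describe this chart: [DEEPEVAL:IMAGE:https://example.com/chart.png] in one sentence."
parts = split_text_and_images(field)
```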
The deepeval Python SDK offers a utility method called `convert_to_multi_modal_array`, which converts a string containing images in the `[DEEPEVAL:IMAGE:url]` format into a list of strings and `MLLMImage` items.
The resulting multimodal array is a list of strings and `MLLMImage` items; you can loop over it to construct a messages array with images to pass to your MLLM. Here’s an example showing how to construct a messages array for OpenAI:
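As a sketch: assuming the multimodal array mixes plain strings with image objects exposing a `url` attribute (a stand-in `ImagePart` class is used below instead of deepeval's `MLLMImage` — check the SDK for the exact attribute names), an OpenAI chat-completions `messages` array can be built like this:

```python
class ImagePart:                      # stand-in for deepeval's MLLMImage
    def __init__(self, url: str):
        self.url = url

multimodal_array = [
    "Describe this chart: ",
    ImagePart("https://example.com/chart.png"),
    " in one sentence.",
]

# Build OpenAI-style content parts from the mixed array:
content = []
for item in multimodal_array:
    if isinstance(item, str):
        content.append({"type": "text", "text": item})
    else:
        content.append({"type": "image_url", "image_url": {"url": item.url}})

messages = [{"role": "user", "content": content}]
```

The `messages` list can then be passed to a vision-capable chat-completions call.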
This is only required when using datasets in code - Confident AI automatically handles image parsing and conversion on the platform.
deepeval’s native models like `GPTModel` and `GeminiModel` automatically parse images in the `[DEEPEVAL:IMAGE:url]` format. You can simply pass any golden field containing images to the `.generate()` or `.a_generate()` methods, and deepeval handles the images for you internally.
Using Evals Iterator
Typically, you would just provide your dataset as a list of test cases for evaluation. However, if you’re running single-turn, end-to-end or component-level evaluations and using deepeval in Python, you can use the `evals_iterator()` instead:
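The pattern looks like the loop below. This is an illustrative mock: `FakeDataset.evals_iterator` stands in for deepeval's real iterator, which additionally collects traces from your app and runs metrics for you.

```python
class FakeGolden:                       # stand-in for a pulled golden
    def __init__(self, input):
        self.input = input

class FakeDataset:                      # stand-in for EvaluationDataset
    def __init__(self, goldens):
        self.goldens = goldens

    def evals_iterator(self):
        # The real iterator also triggers evaluation around each golden.
        for golden in self.goldens:
            yield golden

def your_llm_app(prompt: str) -> str:   # stub; in practice, your traced app
    return f"answer to: {prompt}"

dataset = FakeDataset([FakeGolden("Hi"), FakeGolden("Bye")])

outputs = []
for golden in dataset.evals_iterator():
    outputs.append(your_llm_app(golden.input))   # just invoke your app per golden
```

Note how the loop body only invokes your app — constructing test cases and scoring is handled by the iterator itself in the real SDK.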
You’ll need to trace your LLM app to make this work. Read this section on running single-turn end-to-end evals with tracing to learn more.
Datasets in CI/CD
Using datasets in CI/CD follows the same pattern as local evaluation — pull your dataset, create test cases, and run evaluation. The only difference is that you use assert_test() instead of evaluate() to integrate with pytest:
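Here is a skeleton test file showing the shape of this pattern. It is hedged: the test cases are hard-coded stand-ins for goldens you would pull, and a plain `assert` replaces deepeval's `assert_test()` so the sketch runs standalone.

```python
import pytest

# Stand-ins for pulled goldens already mapped to test cases:
TEST_CASES = [
    {"input": "2 + 2?", "actual_output": "4", "expected_output": "4"},
    {"input": "Capital of France?", "actual_output": "Paris", "expected_output": "Paris"},
]

@pytest.mark.parametrize("test_case", TEST_CASES)
def test_llm_app(test_case):
    # In practice: assert_test(test_case, metrics=[...]) from deepeval
    assert test_case["actual_output"] == test_case["expected_output"]
```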
Then run with deepeval test run test_llm_app.py to execute your tests. Learn more about setting up automated testing in the Unit-Testing in CI/CD section.
Next Steps
Now that you’re familiar with the full dataset lifecycle, time to dive into running evaluations end to end.