Pull Datasets
Overview
In the previous section, we learnt how to push and queue goldens via Confident’s Evals API. In this section, we will learn how to:
- Pull single and multi-turn datasets for evaluation
- Access custom column values from goldens
- Parse multi-modal goldens (images) into an evaluatable format
- Use the
evals_iteratorto run evals on single-turn datasets (Python only)
How it works
Code-driven evals follow a similar process to no-code evals, but you control the evaluation loop:
- Pull dataset — fetch goldens from Confident AI using the Evals API
- Invoke AI app — call your AI app with each golden’s input
- Create test cases — map golden fields and AI outputs into test cases
- Run evaluation — execute metrics on your test cases and push results
Here’s a visual representation of the data flow:
The key difference from no-code evals is that you control the evaluation loop — pulling goldens, invoking your AI app, and constructing test cases all happen in your code.
You can manage your datasets in any project by configuring a CONFIDENT_API_KEY.
- For default usage, set
CONFIDENT_API_KEYas an environment variable. - To target a specific project, pass a
confident_api_keydirectly when creating theEvaluationDataset.
When both are provided, the confident_api_key passed to EvaluationDataset always takes precedence over the environment variable.
Pull Goldens via Evals API
Datasets are either single or multi-turn, and you should know that pulling a single-turn dataset will give you single-turn goldens, and vice versa.
You will be responsible for mapping single-turn goldens to single-turn test cases, and vice versa.
Pulling goldens via the Evals API will only pull finalized goldens by default. Below is a single-turn dataset example (click here for multi-turn usage of datasets):
For reproducible evaluation runs, pin to a specific dataset version by passing
version="00.00.01" (Python) or { version: "00.00.01" } (TypeScript) to
pull(...). Omitting version pulls the latest version, or unversioned goldens
if the dataset has no versions yet. See Versioning
Datasets
for details.
Python
Typescript
curL
Using Custom Columns
If your dataset has custom columns, you can access them via the custom_column_key_values field on each golden:
Python
Typescript
Using Images
Any (list of) golden text fields (such as input, scenario, etc.) that contains an image will be in the format of [DEEPEVAL:IMAGE:url]. The url inside the [DEEPEVAL:IMAGE:url] format is a public url that can be accessed by anyone.
For goldens containing images, here you can parse and use it accordingly as follows:
Python
Typescript
curL
Custom
The deepeval python SDK offers a utility method called convert_to_multi_modal_array. This method is useful for converting a string containing images in the [DEEPEVAL:IMAGE:url] format into a list of strings and MLLMImage items.
The multimodal_array here is a list containing strings and MLLMImages, you can loop over this array to construct a messages array with images to pass to your MLLM. Here’s an example showing how to construct messages array for openai:
This is only required when using datasets in code - Confident AI automatically handles image parsing and conversion on the platform.
deepeval’s native models like GPTModel, GeminiModel automatically parse the images inside the [DEEPEVAL:IMAGE:url] formats for you, you can simply pass any golden field with images inside the .generate() or .a_generate() methods and deepeval automatically handles images for you internally!
Using Evals Iterator
Typically, you would just provide your dataset as a list of test cases for evaluation. However, if you’re running single-turn, end-to-end OR component-level evaluations and using deepeval in Python, you can use the evals_iterator() instead:
You’ll need to trace your LLM app to make this work. Read this section on running single-turn end-to-end evals with tracing to learn more.
Datasets in CI/CD
Using datasets in CI/CD follows the same pattern as local evaluation — pull your dataset, create test cases, and run evaluation. The only difference is that you use assert_test() instead of evaluate() to integrate with pytest:
Then run with deepeval test run test_llm_app.py to execute your tests. Learn more about setting up automated testing in the Unit-Testing in CI/CD section.
Next Steps
Now that you’re familiar with the full dataset lifecycle, time to dive into running evaluations end to end.