Unit-Testing in CI/CD

Set up an automated pre-deployment workflow in CI/CD

Overview

For Python users specifically, you can leverage deepeval's native integration with pytest to run unit tests on your LLM app in CI/CD pipelines.

Currently, only end-to-end testing is supported in CI/CD. Evals must be run locally.

Set up CI Environment

1

Create test file

Create a test_[name].py file and paste in the following code:

test_llm_app.py
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# Generate an output for each golden by calling your LLM app
for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    dataset.add_test_case(test_case)

# Loop through test cases using pytest
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, metrics=[AnswerRelevancyMetric()])  # Replace with your metrics

If you haven’t already, you can learn how to run single-turn end-to-end evals locally here.
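In the test file above, llm_app is a placeholder for your own LLM application. As a point of reference, a minimal sketch of such a function might look like the following (assuming the openai package and an OPENAI_API_KEY in your environment; swap in your own app's logic):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_app(input: str) -> str:
    # Hypothetical example: replace with your actual LLM application logic
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content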

In the test file we've created, we need at least one test function (a function whose name starts with test_ and that calls assert_test()). Do NOT call evaluate() the way you've learnt in previous sections, as it is not part of the pytest integration.

To make sure everything works, run deepeval test run in your terminal to trigger the test file:

$ deepeval test run test_llm_app.py

Done ✅. The deepeval test run command integrates natively with pytest and creates one test run only.

2

Set up .yml file

Create a YAML file to execute your test file automatically in CI/CD pipelines. Here's an example that uses poetry for installation, OPENAI_API_KEY for the LLM judge that runs evals locally, and CONFIDENT_API_KEY to send results to Confident AI:

unit-testing.yml
name: Unit-Testing LLM App

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v3

      - name: Set up python
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/cache@v3
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}

      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        run: poetry install --no-interaction --no-root --only main

      - name: Install project
        run: poetry install --no-interaction --only main

      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval test run test_llm_app.py

Remember to provide your CONFIDENT_API_KEY, otherwise you won't be able to access your datasets or create test runs on Confident AI once evaluation completes.

3

Include in GitHub Workflows

Last step is to automate everything:

  1. Create a .github/workflows directory in your repository if you don't already have one
  2. Place your unit-testing.yml file in this directory
  3. Make sure to set up your Confident AI API key as a secret in your GitHub repository (for example, via the GitHub CLI as shown below)
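If you use the GitHub CLI, one way to add these secrets is from your terminal (this assumes gh is installed and authenticated for your repository; you'll be prompted to paste each value):

$ gh secret set CONFIDENT_API_KEY
$ gh secret set OPENAI_API_KEY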

Now, whenever you make a commit and push changes, GitHub Actions will automatically execute your tests based on the specified triggers.

Log Prompts and Models

Similar to how you can log prompts, models, and other parameters using evaluate(), you can also do so with a test file:

test_llm_app.py
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test
from typing import Union
import deepeval

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# Generate an output for each golden by calling your LLM app
for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    dataset.add_test_case(test_case)

# Loop through test cases using pytest
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, metrics=[AnswerRelevancyMetric()])  # Replace with your metrics

# Log configs used in LLM app at this point in time
@deepeval.log_hyperparameters()
def hyperparameters() -> dict[str, Union[str, int, float]]:
    # Return an empty dict if there's nothing to log
    return {
        "Model": "gpt-4o",
        "Temperature": 1,
        "Chunk Size": 500,
    }

When you run deepeval test run, Confident AI will automatically associate your hyperparameters with the test run you’ve created.

Flag Configs

deepeval test run is a powerful command that lets you run unit tests as if you were using pytest. There are dozens of flags for customizing deepeval test run, including the number of parallel processes, error handling, and more.

Parallelization

Provide a number to the -n flag to specify how many processes to use.

deepeval test run test_example.py -n 4

In this case, -n 4 means deepeval will spin up 4 processes and evaluate 4 test cases at once.

Cache

Provide the -c flag (with no arguments) to read from the local deepeval cache instead of re-evaluating test cases on the same metrics.

deepeval test run test_example.py -c

This is extremely useful if you're running a large number of test cases. For example, let's say you're running 1,000 test cases using deepeval test run but the run errors out partway through. With -c, rerunning the command skips all of the previously evaluated test cases and only evaluates the remaining ones.

Ignore Errors

The -i flag (with no arguments) allows you to ignore errors raised during metric executions in a test run.

deepeval test run test_example.py -i

You can combine different flags, such as the -i, -c, and -n flags, to execute any uncached test cases in parallel while ignoring any errors along the way:

deepeval test run test_example.py -i -c -n 2

Verbose Mode

The -v flag (with no arguments) turns on verbose_mode for all metrics run using deepeval test run. If you don't supply the -v flag, each metric's verbose_mode defaults to its value at instantiation.

deepeval test run test_example.py -v

When a metric’s verbose_mode is True, it prints the intermediate steps used to calculate said metric to the console during evaluation.
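You can also enable verbose logging for an individual metric at instantiation instead of using the -v flag; here's a brief sketch using the AnswerRelevancyMetric from earlier:

from deepeval.metrics import AnswerRelevancyMetric

# verbose_mode=True prints this metric's intermediate steps during evaluation,
# even when the -v flag is not supplied
metric = AnswerRelevancyMetric(verbose_mode=True)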

Skip Test Cases

The -s flag (with no arguments) allows you to skip metric executions where the test case is missing parameters (such as retrieval_context) that are required for evaluation. This is helpful, for example, if you're using a metric such as the ContextualPrecisionMetric but don't want to apply it when the retrieval_context is None.

deepeval test run test_example.py -s
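To illustrate with a hypothetical, minimal example (not part of the test file created earlier): the test case below has no retrieval_context, which the ContextualPrecisionMetric requires, so running with -s skips that metric instead of raising an error:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric
from deepeval import assert_test

def test_llm_app_without_retrieval_context():
    # No retrieval_context is set, which ContextualPrecisionMetric requires.
    # With `deepeval test run test_example.py -s`, this metric execution is
    # skipped instead of erroring out.
    test_case = LLMTestCase(
        input="What is deepeval?",
        actual_output="deepeval is an open-source LLM evaluation framework.",
        expected_output="An open-source framework for evaluating LLM applications.",
    )
    assert_test(test_case, metrics=[ContextualPrecisionMetric()])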

Identifier

The -id flag followed by a string allows you to name test runs so you can better identify them in testing reports and during regression testing.

deepeval test run test_example.py -id "My Latest Test Run"

Repeats

Provide a number to the -r flag to specify how many times each test case should be repeated during the test run.

deepeval test run test_example.py -r 2