Unit-Testing in CI/CD

Set up an automated pre-deployment workflow in CI/CD

Overview

For Python users specifically, you can leverage deepeval's native integration with pytest to run unit tests on your LLM app in CI/CD pipelines.

Currently, only end-to-end testing is supported in CI/CD. Evals must be run locally.

Set up CI Environment

1

Create test file

Create a test_[name].py file and paste in the following code:

test_llm_app.py
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# Generate an output for each golden by calling your LLM app
for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    dataset.add_test_case(test_case)

# Loop through test cases using pytest
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, metrics=[AnswerRelevancyMetric()])  # Replace with your metrics

If you haven’t already, you can learn how to run single-turn end-to-end evals locally here.
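In the test file above, llm_app is a placeholder for your own LLM application. As a point of reference, a minimal sketch of such a function might look like the following (assuming the openai package and an OPENAI_API_KEY in your environment; swap in your own app's logic):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_app(input: str) -> str:
    # Hypothetical example: replace with your actual LLM application logic
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content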

In the test file we've created, we need at least one test function (a function whose name starts with test_ and that calls assert_test()). Do NOT call evaluate() the way you've learnt in previous sections, as it is not part of the pytest integration.

To make sure everything works, run deepeval test run in your terminal to trigger the test file:

$ deepeval test run test_llm_app.py

Done ✅. The deepeval test run command integrates natively with pytest and creates one test run only.

2

Set up .yml file

Create a YAML file to execute your test file automatically in CI/CD pipelines. Here's an example that uses poetry for installation, OPENAI_API_KEY for the LLM judge that runs evals locally, and CONFIDENT_API_KEY to send results to Confident AI:

unit-testing.yml
name: Unit-Testing LLM App

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v3

      - name: Set up python
        id: setup-python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install Poetry
        uses: snok/install-poetry@v1
        with:
          virtualenvs-create: true
          virtualenvs-in-project: true
          installer-parallel: true

      - name: Load cached venv
        id: cached-poetry-dependencies
        uses: actions/cache@v3
        with:
          path: .venv
          key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}

      - name: Install dependencies
        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
        run: poetry install --no-interaction --no-root --only main

      - name: Install project
        run: poetry install --no-interaction --only main

      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval test run test_llm_app.py

Remember to provide your CONFIDENT_API_KEY, otherwise you won't be able to access your datasets or create test runs on Confident AI once evaluation completes.

3

Include in GitHub Workflows

Last step is to automate everything:

  1. Create a .github/workflows directory in your repository if you don't already have one
  2. Place your unit-testing.yml file in this directory
  3. Make sure to set up your Confident AI API key as a secret in your GitHub repository (for example, via the GitHub CLI as shown below)
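If you use the GitHub CLI, one way to add these secrets is from your terminal (this assumes gh is installed and authenticated for your repository; you'll be prompted to paste each value):

$ gh secret set CONFIDENT_API_KEY
$ gh secret set OPENAI_API_KEY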

Now, whenever you make a commit and push changes, GitHub Actions will automatically execute your tests based on the specified triggers.

Log Prompts and Models

Similar to how you can log prompts, models, and other parameters using evaluate(), you can also do so with a test file:

test_llm_app.py
import pytest
from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import assert_test
from typing import Union
import deepeval

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# Generate an output for each golden by calling your LLM app
for golden in dataset.goldens:
    test_case = LLMTestCase(input=golden.input, actual_output=llm_app(golden.input))
    dataset.add_test_case(test_case)

# Loop through test cases using pytest
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_llm_app(test_case: LLMTestCase):
    assert_test(test_case, metrics=[AnswerRelevancyMetric()])  # Replace with your metrics

# Log configs used in LLM app at this point in time
@deepeval.log_hyperparameters()
def hyperparameters() -> dict[str, Union[str, int, float]]:
    # Return an empty dict if there's nothing to log
    return {
        "Model": "gpt-4o",
        "Temperature": 1,
        "Chunk Size": 500,
    }

When you run deepeval test run, Confident AI will automatically associate your hyperparameters with the test run you’ve created.

Flag Configs

deepeval test run is a powerful command that lets you run unit tests as if you were using pytest. There are dozens of flags for customizing deepeval test run, including the number of parallel processes, error handling, and more.

Parallelization

Provide a number to the -n flag to specify how many processes to use.

deepeval test run test_example.py -n 4

In this case, -n 4 means deepeval will spin up 4 processes and evaluate 4 test cases at once.

Cache

Provide the -c flag (with no arguments) to read from the local deepeval cache instead of re-evaluating test cases on the same metrics.

deepeval test run test_example.py -c

This is extremely useful if you're running a large number of test cases. For example, let's say you're running 1,000 test cases using deepeval test run but the run errors out partway through. With -c, rerunning the command skips all of the previously evaluated test cases and only evaluates the remaining ones.

Ignore Errors

The -i flag (with no arguments) allows you to ignore errors raised during metric executions in a test run.

deepeval test run test_example.py -i

You can combine different flags, such as the -i, -c, and -n flags, to execute any uncached test cases in parallel while ignoring any errors along the way:

deepeval test run test_example.py -i -c -n 2

Verbose Mode

The -v flag (with no arguments) turns on verbose_mode for all metrics run using deepeval test run. If you don't supply the -v flag, each metric's verbose_mode defaults to its value at instantiation.

deepeval test run test_example.py -v

When a metric’s verbose_mode is True, it prints the intermediate steps used to calculate said metric to the console during evaluation.
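You can also enable verbose logging for an individual metric at instantiation instead of using the -v flag; here's a brief sketch using the AnswerRelevancyMetric from earlier:

from deepeval.metrics import AnswerRelevancyMetric

# verbose_mode=True prints this metric's intermediate steps during evaluation,
# even when the -v flag is not supplied
metric = AnswerRelevancyMetric(verbose_mode=True)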

Skip Test Cases

The -s flag (with no arguments) allows you to skip metric executions where the test case is missing parameters (such as retrieval_context) that are required for evaluation. This is helpful, for example, if you're using a metric such as the ContextualPrecisionMetric but don't want to apply it when the retrieval_context is None.

deepeval test run test_example.py -s
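To illustrate with a hypothetical, minimal example (not part of the test file created earlier): the test case below has no retrieval_context, which the ContextualPrecisionMetric requires, so running with -s skips that metric instead of raising an error:

from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualPrecisionMetric
from deepeval import assert_test

def test_llm_app_without_retrieval_context():
    # No retrieval_context is set, which ContextualPrecisionMetric requires.
    # With `deepeval test run test_example.py -s`, this metric execution is
    # skipped instead of erroring out.
    test_case = LLMTestCase(
        input="What is deepeval?",
        actual_output="deepeval is an open-source LLM evaluation framework.",
        expected_output="An open-source framework for evaluating LLM applications.",
    )
    assert_test(test_case, metrics=[ContextualPrecisionMetric()])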

Identifier

The -id flag followed by a string allows you to name test runs so you can better identify them in testing reports and during regression testing.

deepeval test run test_example.py -id "My Latest Test Run"

Repeats

Provide a number to the -r flag to specify how many times each test case should be repeated during the test run.

deepeval test run test_example.py -r 2