Unit-Testing in CI/CD
Set up an automated pre-deployment workflow in CI/CD
Overview
For Python users specifically, you can leverage deepeval’s native integration with pytest to run unit tests on your LLM app in CI/CD pipelines.
Currently, only end-to-end testing is supported in CI/CD. All other evals must be run locally.
Set Up CI Environment
Create test file
Create a test_[name].py file and paste in the following code:
The code differs for single-turn and multi-turn E2E tests; the example below shows the single-turn variant.
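Here is a minimal single-turn sketch. The metric, threshold, and hard-coded test case are illustrative; in practice you would generate actual_output by calling your LLM app, or pull a dataset from Confident AI.

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric


def test_answer_relevancy():
    # Hard-coded test case for illustration; replace actual_output
    # with the response generated by your LLM app
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    metric = AnswerRelevancyMetric(threshold=0.5)
    # assert_test() is what makes this a deepeval unit test
    assert_test(test_case, [metric])
```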
If you haven’t already, you can learn how to run single-turn end-to-end evals locally here.
In the test file we’ve created, we need at least one test function (a function whose name starts with test_ and that calls assert_test()). Do NOT call evaluate() as you did in previous sections, since it is not part of the pytest integration suite.
To make sure everything works, run deepeval test run in your terminal to trigger the test file:
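For example (test_example.py is a placeholder for your own test file name):

```bash
deepeval test run test_example.py
```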
Done ✅. The deepeval test run command integrates natively with pytest and creates exactly one test run.
Set up .yml file
Create a YAML file to execute your test file automatically in CI/CD pipelines. Here’s an example that uses poetry for installation, an OPENAI_API_KEY for your LLM judge to run evals, and a CONFIDENT_API_KEY to send results to Confident AI:
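A sketch of such a workflow for GitHub Actions. The workflow name, trigger events, Python version, poetry installer action, and test file name are illustrative choices; adapt them to your repository.

```yaml
name: LLM Unit Tests

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"

      - name: Install Poetry
        uses: snok/install-poetry@v1

      - name: Install dependencies
        run: poetry install --no-interaction

      - name: Run deepeval unit tests
        env:
          # Both keys should be stored as repository secrets
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval test run test_example.py
```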
Remember to provide your CONFIDENT_API_KEY, otherwise you won’t be able to access your datasets or create test runs on Confident AI upon completing the evaluation.
Include in GitHub Workflows
The last step is to automate everything:
- Create a .github/workflows directory in your repository if you don’t already have one
- Place your unit-testing.yml file in this directory
- Make sure to set up your Confident AI API key as a secret in your GitHub repository
Now, whenever you make a commit and push changes, GitHub Actions will automatically execute your tests based on the specified triggers.
Log Prompts and Models
Similar to how you can log prompts, models, and other parameters using evaluate(), you can also do so in a test file:
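A sketch based on deepeval’s hyperparameter-logging decorator, added to the same test file. The model name, prompt template, and extra parameters below are placeholders for your own values.

```python
import deepeval


@deepeval.log_hyperparameters(model="gpt-4", prompt_template="You are a helpful assistant.")
def hyperparameters():
    # Return a dict of any additional parameters you want logged
    # alongside the test run (an empty dict is also fine)
    return {"temperature": 1, "chunk size": 500}
```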
When you run deepeval test run, Confident AI will automatically associate your hyperparameters with the test run you’ve created.
Flag Configs
deepeval test run is a powerful command that allows you to run unit tests as if you’re using pytest. There are dozens of flags for customizing deepeval test run, including the number of parallel processes, error handling, and more.
Parallelization
Provide a number to the -n flag to specify how many processes to use.
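For example (test_example.py is a placeholder for your test file):

```bash
deepeval test run test_example.py -n 4
```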
In this case, -n 4 means deepeval will spin up 4 processes and evaluate 4 test cases at once.
Cache
Provide the -c flag (with no arguments) to read from the local deepeval cache instead of re-evaluating test cases on the same metrics.
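For example (again using the placeholder test file):

```bash
deepeval test run test_example.py -c
```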
This is extremely useful if you’re running a large number of test cases. For example, let’s say you’re running 1000 test cases using deepeval test run, but you encounter an error on the 999th test case. The cache functionality allows you to skip all the test cases that have already been evaluated and only re-run the remaining ones.
Ignore Errors
The -i flag (with no arguments) allows you to ignore errors for metric executions during a test run.
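For example:

```bash
deepeval test run test_example.py -i
```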
You can combine different flags, such as -i, -c, and -n, to execute any uncached test cases in parallel while ignoring any errors along the way:
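For instance (the file name and process count are illustrative):

```bash
deepeval test run test_example.py -i -c -n 4
```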
Verbose Mode
The -v flag (with no arguments) allows you to turn on verbose_mode for all metrics run using deepeval test run. Not supplying the -v flag will default each metric’s verbose_mode to its value at instantiation.
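For example:

```bash
deepeval test run test_example.py -v
```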
When a metric’s verbose_mode is True, it prints the intermediate steps used to calculate said metric to the console during evaluation.
Skip Test Cases
The -s flag (with no arguments) allows you to skip metric executions where the test case is missing parameters (such as retrieval_context) that are required for evaluation. An example of where this is helpful is if you’re using a metric such as the ContextualPrecisionMetric but don’t want to apply it when the retrieval_context is None.
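For example:

```bash
deepeval test run test_example.py -s
```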
Identifier
The -id flag followed by a string allows you to name test runs and better identify them in testing reports and during regression testing.
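For example (the identifier string is illustrative):

```bash
deepeval test run test_example.py -id "My first test run"
```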
Repeats
Provide a number to the -r flag to specify how many times to rerun each test case.
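For example, to run each test case twice:

```bash
deepeval test run test_example.py -r 2
```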