Jacky Wong
ML Scientist/Engineer.

DeepEval — Synthetic Data, Bulk Review, Custom Metric Logging and more!

September 14, 2023

For those new to it, DeepEval provides a Pythonic way to run offline evaluations on your LLM pipelines so you can launch comfortably into production. Think of it as a testing suite for LLMs.

In this product update, we include a number of improvements such as:

  • Synthetic Data Creation Using LLMs
  • Bulk Review For Synthetic Data Creation
  • Custom Metric Logging
  • Improved Developer Experience + CLI Improvements

🧨 Synthetic Data Creation

For Retrieval Augmented Generation (RAG) applications built with tools like LlamaIndex, developers want a quick and easy way to measure the performance of their RAG pipeline.

This is now achievable in just 1 line of code.

dataset = create_evaluation_query_answer_pairs(
    context="FastAPI is a Python language.",
)

Under the hood, it uses ChatGPT to automatically create n query-answer pairs. It uses a simple ChatGPT prompt, takes in the original context, and feeds it into an LLMTestCase. The LLMTestCase abstraction is one of the building blocks of DeepEval and is what allows you to measure the performance of these RAG pipelines.
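To make that flow concrete, here is a minimal sketch of how query-answer pairs could be generated from a context and wrapped in test cases. It is not DeepEval's actual implementation: SketchTestCase, build_prompt, make_pairs, and fake_complete are hypothetical names, the prompt wording is invented, and the chat-model call is replaced by a canned reply so the example runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SketchTestCase:
    """Hypothetical stand-in for DeepEval's LLMTestCase (not the real class)."""
    query: str
    expected_output: str
    context: str


def build_prompt(context: str, n: int) -> str:
    # A simple instruction asking the model for n question-answer pairs as JSON.
    return (
        f"Write {n} question-answer pairs about the text below. "
        'Reply with a JSON list of objects with "query" and "answer" keys.\n\n'
        f"Text: {context}"
    )


def make_pairs(
    context: str, n: int, complete: Callable[[str], str]
) -> List[SketchTestCase]:
    # `complete` is any function that sends a prompt to a chat model
    # and returns the reply as a string.
    raw = complete(build_prompt(context, n))
    pairs = json.loads(raw)
    return [
        SketchTestCase(query=p["query"], expected_output=p["answer"], context=context)
        for p in pairs[:n]
    ]


def fake_complete(prompt: str) -> str:
    # Canned reply standing in for a real chat-model call (assumption).
    return json.dumps(
        [{"query": "What is FastAPI?", "answer": "A Python web framework."}]
    )


cases = make_pairs("FastAPI is a Python web framework.", n=1, complete=fake_complete)
print(cases[0].query)
```

Passing the model call in as a function keeps the sketch testable without an API key; in practice you would swap fake_complete for a real client call.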

Interested in finding out more? Read about how to run this here.

👮 Bulk Review Datasets

Once you have created synthetic data, you can easily add or remove individual rows. Here is a sample screenshot of the dashboard for reviewing synthetic data.

Example dashboard when bulk reviewing

The best part? The dashboard runs entirely in Python and can be self-hosted. This is done simply by running:


When reviewing the dataset, you can easily delete or add rows depending on which data you think is important for your evaluation.
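The review step boils down to adding and deleting rows. The sketch below shows that idea on a plain in-memory list. ReviewableDataset, Row, add_row, and delete_row are hypothetical names used only for illustration; they are not DeepEval's dataset API or the dashboard itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Row:
    query: str
    expected_output: str


@dataclass
class ReviewableDataset:
    """Hypothetical in-memory dataset; not DeepEval's own dataset class."""
    rows: List[Row] = field(default_factory=list)

    def add_row(self, query: str, expected_output: str) -> None:
        self.rows.append(Row(query, expected_output))

    def delete_row(self, index: int) -> Row:
        # Return the removed row so a reviewer could restore it if needed.
        return self.rows.pop(index)


dataset = ReviewableDataset()
dataset.add_row("What is FastAPI?", "A Python web framework.")
dataset.add_row("Who wrote FastAPI?", "Unknown")  # a weak pair to discard
removed = dataset.delete_row(1)
dataset.add_row("Is FastAPI written in Python?", "Yes.")
print([row.query for row in dataset.rows])
```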

📏 Custom Metric Logging

Custom metric logging has been one of the most ❤️ features. Now, you can do it with DeepEval and save it right onto the Confident AI dashboard.

You can define a custom metric in just a few lines of code:

from deepeval.metrics.metric import Metric
from deepeval.test_case import LLMTestCase
from deepeval import run_test

class LengthMetric(Metric):
    """This metric checks if the output is more than 3 characters long"""

    def __init__(self, minimum_length: int = 3):
        self.minimum_length = minimum_length

    def measure(self, test_case: LLMTestCase):
        # Score is the length of the model output (assumes the test case
        # stores the response in an `output` field)
        score = len(test_case.output)
        self.success = score > self.minimum_length
        return score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Length"

# All you need to actually run this test:
metric = LengthMetric()
# Defining a custom test case (placeholder values; field names are assumed)
test_case = LLMTestCase(
    query="placeholder",
    output="This is a long sentence that is more than 3 characters",
)
run_test(test_case, metrics=[metric])

🧠 Developer Experience Improvements

  • We’ve added 2 new abstractions aimed at making it easier to write tests following our framework: The RAG Evaluation Framework (https://docs.confident-ai.com/docs/framework)

While building this, we added a new LLMTestCase abstraction designed to provide flexibility when running these tests. If you want to dive into the framework, we recommend reading more about it here.

In addition, we also added 2 new ways to run tests:

  • run_test allows users to run a test when provided an LLMTestCase
  • assert_test allows users to ensure a test passes; otherwise, it raises an error

These Pythonic abstractions are intended to make it easier to run tests, log them to the server (if the API key is set), and treat them as independent Pytest tests.
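The difference between the two can be sketched as follows. This is an illustration based only on the descriptions above, not DeepEval's source: run_test_sketch, assert_test_sketch, SketchTestCase, SketchResult, and MinLengthMetric are hypothetical stand-ins, and server logging is left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchTestCase:
    """Hypothetical stand-in for LLMTestCase."""
    query: str
    output: str


@dataclass
class SketchResult:
    metric_name: str
    score: float
    success: bool


class MinLengthMetric:
    """Hypothetical metric: passes when the output is longer than a minimum."""

    def __init__(self, minimum_length: int = 3):
        self.minimum_length = minimum_length

    def measure(self, test_case: SketchTestCase) -> int:
        return len(test_case.output)

    def passes(self, score: int) -> bool:
        return score > self.minimum_length


def run_test_sketch(test_case: SketchTestCase, metrics: list) -> List[SketchResult]:
    # Run every metric and report the outcome without raising.
    results = []
    for metric in metrics:
        score = metric.measure(test_case)
        results.append(
            SketchResult(type(metric).__name__, score, metric.passes(score))
        )
    return results


def assert_test_sketch(test_case: SketchTestCase, metrics: list) -> None:
    # Same run, but any failing metric raises, so pytest marks the test failed.
    failed = [r for r in run_test_sketch(test_case, metrics) if not r.success]
    if failed:
        names = ", ".join(r.metric_name for r in failed)
        raise AssertionError(f"Metrics failed: {names}")


short = SketchTestCase(query="Hi", output="Ok")
results = run_test_sketch(short, [MinLengthMetric()])
print(results[0].success)

try:
    assert_test_sketch(short, [MinLengthMetric()])
except AssertionError as exc:
    print(exc)
```

Because the assert variant raises AssertionError, calling it inside a function named test_... is enough for pytest to collect and report it like any other test.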

And that’s all!

DeepEval is iterating fast and adding a number of metrics, with an ambitious roadmap that includes adding guardrails, improving synthetic data creation, and making significant improvements to our dashboard.

