LLM Evaluation Quickstart

A 5-minute quickstart guide for a code-driven LLM evaluation workflow

Overview

Confident AI lets you test AI apps in code as part of a pre-deployment workflow, with support for:

  • Single-turn evaluation: Treats each input-output pair as a distinct AI interaction.
    • End-to-end: Treats your AI app as a black box.
    • Component-level: Built for agentic use cases—debug each agent step and component (planner, tools, memory, retriever, prompts) with granular assertions.
  • Multi-turn evaluation: Validate full conversations for consistency, state/memory retention, etc.
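
As a conceptual sketch (plain Python dicts, not the deepeval API), the unit under test in single-turn evaluation is one input/output pair, while multi-turn evaluation scores an entire conversation:

```python
# Conceptual sketch only: these dict shapes are illustrative, not deepeval types.
# Single-turn: one input and the app's corresponding output.
single_turn = {
    "input": "What's the weather like in SF?",
    "actual_output": "It's sunny and 65F in San Francisco.",
}

# Multi-turn: the full conversation, so metrics can check consistency
# and state/memory retention across turns.
multi_turn = {
    "turns": [
        {"role": "user", "content": "Book me a flight to SF."},
        {"role": "assistant", "content": "Sure, which dates?"},
        {"role": "user", "content": "Next Friday."},
    ]
}

print(len(multi_turn["turns"]))  # 3
```

In deepeval, these two shapes correspond to single-turn and multi-turn test cases, which you'll see in the steps below.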

You can run evals either locally in code or remotely on Confident AI; both give you the same functionality:

Local Evals
  • Run evaluations locally using deepeval with full control over metrics
  • Support for custom metrics, DAG, and advanced evaluation algorithms

Suitable for: Python users, development, and pre-deployment workflows

Remote Evals
  • Run evaluations on Confident AI platform with pre-built metrics
  • Integrated with monitoring, datasets, and team collaboration features

Suitable for: Non-Python users, and online + offline evals for tracing in production

Run Your First Eval

This example walks through a single-turn, end-to-end evaluation in code.

You’ll need to get your API key as shown in the setup and installation section before continuing.

1. Login with API key

$ export CONFIDENT_API_KEY="confident_us..."
2. Create a dataset

It is mandatory to create a dataset for a proper evaluation workflow.

If a dataset isn't feasible for your team at this point, set up LLM tracing to run ad-hoc evaluations without one instead; Confident AI will then generate datasets for you automatically.

main.py
from deepeval.dataset import EvaluationDataset, Golden

# goldens are what make up your dataset
goldens = [Golden(input="What's the weather like in SF?")]

# create the dataset
dataset = EvaluationDataset(goldens=goldens)

# save it to Confident AI
dataset.push(alias="YOUR-DATASET-ALIAS")

Done ✅. You should now see your dataset on the platform.

3. Create a metric

Create a metric locally in deepeval. Here, we’re using the AnswerRelevancyMetric() for demo purposes.

main.py
from deepeval.metrics import AnswerRelevancyMetric

relevancy = AnswerRelevancyMetric()  # using this for the sake of simplicity
4. Configure evaluation model

Since all metrics in deepeval use LLM-as-a-judge, you will also need to configure your LLM judge provider. To use OpenAI for evals:

$ export OPENAI_API_KEY="sk-..."

You can also use a different model provider, since deepeval integrates with all major ones.

5. Create a test run

A test run is a benchmark/snapshot of your AI app’s performance at any point in time. You’ll need to:

  • Convert all goldens in your dataset into test cases, then
  • Use the metric you’ve created to evaluate each test case
main.py
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

# Pull from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# Create test cases
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input)  # Replace with your AI app
    )
    dataset.add_test_case(test_case)

# Run an evaluation
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])
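
The loop above assumes an llm_app callable that takes the golden's input and returns your app's response as a string. A minimal stand-in so the pipeline runs end to end (hypothetical; swap in your real LLM app) could be:

```python
def llm_app(user_input: str) -> str:
    # Placeholder: return a canned answer so the evaluation pipeline runs.
    # In practice, this would call your actual LLM application and return
    # its final response for the given input.
    return f"(stub response for: {user_input})"

print(llm_app("What's the weather like in SF?"))
```

Any callable with this signature works; deepeval only sees the string you pass as actual_output.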

Lastly, execute main.py to run your first single-turn, end-to-end evaluation:

$ python main.py

✅ Done. You just created your first test run, with a shareable testing report auto-generated on Confident AI.

Testing Report on Confident AI

There are two main pages in a testing report:

  • Overview - Shows test run metadata, such as the dataset used for testing, plus the average, median, and score distribution of each metric.
  • Test Cases - Shows every test case in your test run, including AI-generated summaries of your test bench and per-metric data for in-depth debugging and analysis.
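
For intuition, the aggregates on the Overview page are just summary statistics over per-test-case metric scores. With illustrative (made-up) scores:

```python
from statistics import mean, median

# Hypothetical per-test-case scores for one metric across a test run
scores = [0.91, 0.78, 0.85, 0.62, 0.88]

# Average and median shown on the Overview page; the distribution is
# just these scores bucketed into a histogram.
print(round(mean(scores), 3))  # 0.808
print(median(scores))          # 0.85
```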

When you have two or more test runs, you can also start running A/B regression tests.

Next Steps

Now that you’ve run your first evaluation, dive deeper into single-turn testing: