Quickstart

5 min quickstart guide for LLM evaluation

Overview

Confident AI offers a wide range of features for testing LLM apps in development as part of a pre-deployment workflow, including:

  • End-to-end testing: Treats your LLM app as a black box. Best for simple architectures such as calling raw model endpoints or lightweight RAG pipelines.
  • Component-level testing: Built for agentic use cases—debug each agent step and component (planner, tools, memory, retriever, prompts) with granular assertions.
  • Multi-turn evaluation: Validate full conversations for consistency, state/memory retention, etc.

LLM-as-a-Judge metrics from deepeval are used throughout the platform to auto-score outputs (with or without references) for all of these use cases.
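For instance, a single metric can score a single-turn test case on its own (a minimal sketch, where the input/output pair is a placeholder):

from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Placeholder input/output pair for illustration
test_case = LLMTestCase(
    input="What are your shipping times?",
    actual_output="We ship within 3-5 business days.",
)

metric = AnswerRelevancyMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)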

You don’t always need code to run evals.

You can run evals either locally or remotely on Confident AI; both give you the same functionality:

Local Evals
  • Run evaluations locally using deepeval with full control over metrics
  • Support for custom metrics, DAG, and advanced evaluation algorithms

Suitable for: Python users, development, and pre-deployment workflows

Remote Evals
  • Run evaluations on Confident AI platform with pre-built metrics
  • Integrated with monitoring, datasets, and team collaboration features

Suitable for: Non-Python users, and online + offline evals for tracing in production

Key Capabilities

A|B Regression Testing

Catch regressions across different versions of your LLM app with side-by-side test case comparisons

Sharable Testing Reports

Comprehensive AI testing reports that can be shared across your organization

Prompt and Model Insights

Identify the optimal set of prompts, models, and parameters

Unit-Testing in CI/CD

Integrate native pytest evaluations into your deployment CI/CD pipelines
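As an example, deepeval's pytest integration lets you fail a CI job when a metric score drops below its threshold (a minimal sketch; the test case values are placeholders):

test_llm_app.py
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    # Placeholder test case - in practice, generate actual_output from your LLM app
    test_case = LLMTestCase(
        input="What are your shipping times?",
        actual_output="We ship within 3-5 business days.",
    )
    # Raises an assertion error (failing the test) if the score is below threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

You would typically run this file with pytest, or deepeval's own test runner, as part of your pipeline.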

Run Your First Eval

This example walks through a single-turn, end-to-end evaluation in code.

You’ll need to get your API key as shown in the setup and installation section before continuing.

1

Log in with your API key

$ export CONFIDENT_API_KEY="confident_us..."
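If you're working with the deepeval CLI, you can typically log in this way instead (the exact flag shown here is an assumption and may vary by deepeval version):

$ deepeval login --confident-api-key "confident_us..."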
2

Create a dataset

It is mandatory to create a dataset for a proper evaluation workflow.

If a dataset isn't feasible for your team at this point, set up LLM tracing to run ad-hoc evaluations without one instead; Confident AI will generate datasets for you automatically this way.
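As a rough sketch of that alternative, tracing is set up by decorating the functions that make up your LLM app (assuming deepeval's observe decorator; llm_app is a hypothetical function name):

from deepeval.tracing import observe

@observe()  # traces every call to this function on Confident AI
def llm_app(query: str) -> str:
    # Your actual retrieval / agent / model-calling logic goes here
    ...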

You can create one in the UI under Project > Datasets, and upload goldens to your dataset via CSV:

Create Dataset on Confident AI
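If you'd rather stay in code, you can also create and push goldens programmatically (a minimal sketch; the alias and golden inputs are placeholders):

from deepeval.dataset import EvaluationDataset, Golden

# Placeholder goldens - replace with inputs that reflect your use case
goldens = [
    Golden(input="What are your shipping times?"),
    Golden(input="How do I return an item?"),
]

dataset = EvaluationDataset(goldens=goldens)
dataset.push(alias="YOUR-DATASET-ALIAS")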
3

Create a metric

Create a metric locally in deepeval. Here, we’re using the AnswerRelevancyMetric() for demo purposes.

main.py
from deepeval.metrics import AnswerRelevancyMetric

relevancy = AnswerRelevancyMetric()  # Using this for the sake of simplicity
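If the built-in metrics don't cover your criteria, you can also define a custom LLM-as-a-Judge metric such as GEval (a minimal sketch with a hypothetical "Correctness" criteria):

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Hypothetical custom criteria for illustration
correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)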
4

Configure evaluation model

Since all metrics in deepeval use LLM-as-a-Judge, you will also need to configure your LLM judge provider. To use OpenAI for evals:

$ export OPENAI_API_KEY="sk-..."

You can also use ANY other model provider, since deepeval integrates with all of them.
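For example, most metrics accept a model argument for the judge, so you could pin it to a specific model ("gpt-4o" below is just an illustrative choice):

from deepeval.metrics import AnswerRelevancyMetric

# "gpt-4o" is an illustrative judge model; any supported provider works
relevancy = AnswerRelevancyMetric(model="gpt-4o")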

5

Create a test run

A test run is a benchmark/snapshot of your LLM app’s performance at any point in time. You’ll need to:

  • Convert all goldens in your dataset into test cases, then
  • Use the metric you’ve created to evaluate each test case
main.py
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate

# Pull from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# Create test cases
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input)  # Replace with your LLM app
    )
    dataset.add_test_case(test_case)

# Run an evaluation
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])
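Note that llm_app above is a placeholder for your own application. A hypothetical stand-in, assuming an OpenAI-backed app, might look like this:

# Hypothetical llm_app stand-in for illustration only
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_app(query: str) -> str:
    # Replace with your actual RAG pipeline, agent, or model call
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content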

Lastly, execute main.py to run your first single-turn, end-to-end evaluation:

$ python main.py

✅ Done. You just created your first test run, with a sharable testing report auto-generated on Confident AI.

Testing Report on Confident AI

There are two main pages in a testing report:

  • Overview - Shows metadata of your test run, such as the dataset used for testing, and the average, median, and distribution of each metric
  • Test Cases - Shows all the test cases in your test run, including AI-generated summaries of your test bench and metric data for in-depth debugging and analysis.

When you have two or more test runs, you can also start running A|B regression tests.
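To make these comparisons more informative, you can optionally log hyperparameters (e.g. model and prompt version) with each test run; the keys and values below are placeholders:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric

# dataset.test_cases built as in the step above
evaluate(
    test_cases=dataset.test_cases,
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={"model": "gpt-4o-mini", "prompt version": "v1"},
)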

Next Steps

Now that you’ve run your first single-turn, end-to-end evaluation, we’ll dive into these core evaluation concepts and techniques:

  • Core Concepts - Learn what goldens and test cases are, and how they form the foundation of end-to-end, component-level, and multi-turn evals.
  • Use Cases - Understand when to use each evaluation type for agents, AI workflows, RAG, and chatbots.
  • LLM-as-a-Judge Metrics - Discover 40+ available metrics for different use cases, and how to select them.