LLM Evaluation Quickstart

5 min quickstart guide for a code-driven LLM evaluation workflow

Overview

Confident AI lets you test AI apps in code as part of a pre-deployment workflow. It supports:

Single-turn evaluation: Input-output as distinct AI interactions.
- End-to-end: Treats your AI app as a black box.
- Component-level: Built for agentic use cases—debug each agent step and component (planner, tools, memory, retriever, prompts) with granular assertions.
Multi-turn evaluation: Validate full conversations for consistency, state/memory retention, etc.
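
As a rough illustration of the distinction above, the sketch below models a single-turn interaction and a multi-turn conversation as plain Python dataclasses. The names `SingleTurnCase`, `Turn`, and `MultiTurnCase` are hypothetical and are not deepeval's API; the "memory check" at the end is a toy substring test standing in for a real metric.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SingleTurnCase:
    """One input/output pair, judged on its own (hypothetical shape)."""
    input: str
    actual_output: str


@dataclass
class Turn:
    """One message in a conversation (hypothetical shape)."""
    role: str  # "user" or "assistant"
    content: str


@dataclass
class MultiTurnCase:
    """A whole conversation, judged as a unit (hypothetical shape)."""
    turns: List[Turn] = field(default_factory=list)

    def assistant_replies(self) -> List[str]:
        # Collect only what the AI app said, in order.
        return [t.content for t in self.turns if t.role == "assistant"]


single = SingleTurnCase(
    input="What's the weather like in SF?",
    actual_output="It is foggy in San Francisco today.",
)

conversation = MultiTurnCase(
    turns=[
        Turn("user", "My name is Ada."),
        Turn("assistant", "Nice to meet you, Ada."),
        Turn("user", "What is my name?"),
        Turn("assistant", "Your name is Ada."),
    ]
)

# Toy memory-retention check: the final reply should still mention the name.
remembers_name = "Ada" in conversation.assistant_replies()[-1]
```

A single-turn case can be scored in isolation, whereas the multi-turn check only makes sense when the earlier turns are available.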

You can run evals in code either locally or remotely on Confident AI. Both give you the same functionality:

Local Evals

Run evaluations locally using deepeval with full control over metrics
Support for custom metrics, DAG, and advanced evaluation algorithms

Suitable for: Python users, development, and pre-deployment workflows

Remote Evals

Run evaluations on Confident AI platform with pre-built metrics
Integrated with monitoring, datasets, and team collaboration features

Suitable for: Non-Python users, online + offline evals for tracing in prod

Vibe Code Your Evals

Let your coding agent build the eval suite for you — datasets, metrics, pytest files, and shareable Confident AI reports. Better yet, use DeepEval as your build-loop ground truth: your agent runs the evals, reads the failures and reason strings, makes the smallest app change, and re-runs to confirm. Choose the install method for your agent below.

Claude Code (plugin)

Cursor, Codex, Windsurf & others (Skills CLI)

Run these four commands in Claude Code:

$ /plugin marketplace add confident-ai/deepeval
$ /plugin install deepeval@confident-ai-plugins
$ /reload-plugins
$ /plugins

The /plugins command should list DeepEval Plugin under your installed plugins.

Once installed, open the project you want to evaluate and tell your agent what you need. Example prompts:

“Create a DeepEval pytest eval suite for this app, generate ~30 goldens, and push results to Confident AI.”
“My app is a RAG pipeline — set up DeepEval evals with retrieval-focused metrics.”
“Generate a dataset from the docs in ./knowledge and run them through DeepEval.”

Your agent will run the intake questions, pick metrics, generate goldens with deepeval generate, and produce a committed pytest suite you can rerun in CI.

Point your agent at our LLM-friendly docs so it picks the right metrics and APIs: llms.txt indexes every page (append .md to any docs URL for that page’s raw Markdown). You can also connect your agent directly to our docs MCP server.

The Claude Code plugin is Python-first today. TypeScript support via Claude Code is coming soon — for now, follow the TypeScript steps below directly.

Run Your First Eval

This example walks through a single-turn, end-to-end evaluation in code.

You’ll need to get your API key as shown in the setup and installation section before continuing.

Python

TypeScript

Create a dataset

It is mandatory to create a dataset for a proper evaluation workflow.

If your team cannot create a dataset at this point, set up LLM tracing to run ad-hoc evaluations without one. Confident AI will then generate datasets for you automatically.

Code

On Platform

main.py

from deepeval.dataset import EvaluationDataset, Golden

# Goldens are what make up your dataset
goldens = [Golden(input="What's the weather like in SF?")]

# Create the dataset
dataset = EvaluationDataset(goldens=goldens)

# Save it to Confident AI
dataset.push(alias="YOUR-DATASET-ALIAS")

Done ✅. You should now see your dataset on the platform.

Create a metric

Create a metric locally in deepeval. Here, we’re using the AnswerRelevancyMetric() for demo purposes.

main.py

from deepeval.metrics import AnswerRelevancyMetric

relevancy = AnswerRelevancyMetric()  # Using this for the sake of simplicity

Configure evaluation model

Since all metrics in deepeval use LLM-as-a-Judge, you will also need to configure your LLM judge provider. To use OpenAI for evals:

$ export OPENAI_API_KEY="sk-..."

You can also use any model provider since deepeval integrates with all of them.
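
If you want to fail fast before spending time on a run, a small preflight check can confirm that the key is visible to Python. This is a minimal sketch that assumes the OpenAI judge configured above; the helper name `judge_key_is_configured` is hypothetical and is not part of deepeval.

```python
import os


def judge_key_is_configured(env=None) -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    # Accept a mapping for testing; default to the real process environment.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not judge_key_is_configured():
    print("OPENAI_API_KEY is not set; export it before running main.py.")
```

The check only verifies that a value is present, not that the key is valid.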

Create a test run

A test run is a benchmark/snapshot of your AI app’s performance at any point in time. You’ll need to:

Convert all goldens in your dataset into test cases, then
Use the metric you’ve created to evaluate each test case

main.py

from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate


def llm_app(query: str) -> str:
    # Placeholder: replace with a call to your AI app
    return f"Placeholder answer to: {query}"


# Pull the dataset from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

# Convert each golden into a test case
for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=llm_app(golden.input),
    )
    dataset.add_test_case(test_case)

# Run an evaluation
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])

Lastly, run main.py to run your first single-turn, end-to-end evaluation:

$ python main.py

✅ Done. You just created your first test run, with a shareable testing report auto-generated on Confident AI.

Testing Report on Confident AI

There are two main pages in a testing report:

Overview - Shows metadata for your test run, such as the dataset used for testing, along with the average, median, and distribution of each metric.
Test Cases - Shows all the test cases in your test run, including AI-generated summaries of your test bench and metric data for in-depth debugging and analysis.
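
To make the Overview statistics concrete, the sketch below computes an average, a median, and a simple bucketed distribution for one metric using only the standard library. The scores are made-up illustration data, and the bucket width and the `bucket` helper are assumptions; this is not how Confident AI computes its report.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical per-test-case scores for one metric, each between 0 and 1.
scores = [0.92, 0.85, 0.40, 0.77, 0.98, 0.61]


def bucket(score: float, width: float = 0.25) -> str:
    """Label the bucket a score falls into, e.g. '0.75-1.00'."""
    # Clamp so a perfect 1.0 lands in the top bucket rather than a new one.
    index = min(int(score / width), int(1 / width) - 1)
    low = index * width
    return f"{low:.2f}-{low + width:.2f}"


summary = {
    "average": round(mean(scores), 3),
    "median": round(median(scores), 3),
    "distribution": dict(Counter(bucket(s) for s in scores)),
}
print(summary)
```

Looking at the distribution alongside the average helps show whether a middling score comes from uniformly mediocre outputs or from a few outright failures.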

When you have two or more test runs, you can also start running A|B regression tests.
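
The idea behind a regression test between two runs can be sketched in plain Python: line up the scores for the same inputs and flag the ones that dropped. The data, the `find_regressions` helper, and the 0.05 tolerance are all hypothetical; the platform's own comparison may work differently.

```python
# Hypothetical metric scores keyed by golden input, for two test runs.
run_a = {"weather in SF": 0.90, "refund policy": 0.80, "reset password": 0.75}
run_b = {"weather in SF": 0.92, "refund policy": 0.55, "reset password": 0.75}


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return inputs whose score dropped by more than `tolerance`."""
    regressions = {}
    for key, old_score in baseline.items():
        new_score = candidate.get(key)
        if new_score is None:
            continue  # Only compare test cases present in both runs.
        if old_score - new_score > tolerance:
            regressions[key] = (old_score, new_score)
    return regressions


regressions = find_regressions(run_a, run_b)
for key, (old, new) in regressions.items():
    print(f"{key}: {old:.2f} -> {new:.2f}")
```

A tolerance keeps small score fluctuations between runs from being reported as regressions.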

Next Steps

Now that you’ve run your first evaluation, dive deeper into single-turn testing:

End-to-End Evals

Treat your AI app as a black box. Learn how to use LLM tracing for better debugging, run remote evals, and log hyperparameters for A|B testing.

Component-Level Evals

Test individual components like retrievers, generators, and tools. Built for agentic use cases where you need granular assertions.
