LLM Evaluation Quickstart
5 min quickstart guide for a code-driven LLM evaluation workflow
Overview
Confident AI offers a code-driven, pre-deployment workflow for testing AI apps, with features for:
- Single-turn evaluation: Treats each input-output pair as a distinct AI interaction.
- End-to-end: Treats your AI app as a black box.
- Component-level: Built for agentic use cases—debug each agent step and component (planner, tools, memory, retriever, prompts) with granular assertions.
- Multi-turn evaluation: Validate full conversations for consistency, state/memory retention, etc.
You can run evals via code either locally or remotely on Confident AI, both of which give you the same functionality:
- Run evaluations locally using `deepeval` with full control over metrics, including support for custom metrics, DAG, and advanced evaluation algorithms. Suitable for: Python users, development, and pre-deployment workflows.
- Run evaluations on the Confident AI platform with pre-built metrics, integrated with monitoring, datasets, and team collaboration features. Suitable for: non-Python users, online + offline evals for tracing in prod.
Run Your First Eval
This example walks through a single-turn, end-to-end evaluation in code.
You’ll need to get your API key as shown in the setup and installation section before continuing.
Create a dataset
Creating a dataset is mandatory for a proper evaluation workflow.
If a dataset is not possible for your team at this point, set up LLM tracing to run ad-hoc evaluations without a dataset instead. Confident AI will generate datasets for you automatically this way.
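In code, a dataset is a collection of goldens pushed to Confident AI under an alias. A minimal sketch, assuming the alias `"My Evals Dataset"` and that your Confident AI API key is already configured (this call hits the platform, so it is not runnable without credentials):

```python
from deepeval.dataset import EvaluationDataset, Golden

# Goldens hold the inputs (and optionally expected outputs) to test against
goldens = [
    Golden(input="What's your return policy?"),
    Golden(input="How do I reset my password?"),
]

dataset = EvaluationDataset(goldens=goldens)

# Requires your Confident AI API key (e.g. set via `deepeval login`)
dataset.push(alias="My Evals Dataset")
```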
Done ✅. You should now see your dataset on the platform.
Create a metric
Create a metric locally in deepeval. Here, we’re using the AnswerRelevancyMetric() for demo purposes.
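For instance (the 0.7 threshold is an illustrative choice, not a required value):

```python
from deepeval.metrics import AnswerRelevancyMetric

# threshold: the minimum score (0-1) a test case needs to pass this metric
metric = AnswerRelevancyMetric(threshold=0.7)
```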
Configure evaluation model
Since all metrics in deepeval use LLM-as-a-judge, you will also need to configure your LLM judge provider. To use OpenAI for evals:
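One way is to export your OpenAI key as an environment variable before running evals (placeholder value shown):

```shell
# deepeval's default OpenAI judge reads this environment variable
export OPENAI_API_KEY="<your-openai-api-key>"
```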
You can also use any model provider, since deepeval integrates with all of them.
Create a test run
A test run is a benchmark/snapshot of your AI app’s performance at any point in time. You’ll need to:
- Convert all goldens in your dataset into test cases, then
- Use the metric you’ve created to evaluate each test case
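Put together, a minimal `main.py` might look like the following sketch — the dataset alias and `your_llm_app` are placeholders for your own dataset and application code, and running it requires Confident AI and OpenAI credentials:

```python
# main.py — minimal single-turn, end-to-end evaluation sketch
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def your_llm_app(prompt: str) -> str:
    # Placeholder: call your actual AI app here
    return "..."


# Pull the goldens you pushed earlier from Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")

# Convert each golden into a test case by generating an actual output
test_cases = [
    LLMTestCase(input=golden.input, actual_output=your_llm_app(golden.input))
    for golden in dataset.goldens
]

# Run the eval; results are uploaded to Confident AI as a test run
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```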
Lastly, run main.py to kick off your first single-turn, end-to-end evaluation:
✅ Done. You just created your first test run, with a shareable testing report auto-generated on Confident AI.
There are two main pages in a testing report:
- Overview - Shows metadata for your test run, such as the dataset used for testing, plus the average, median, and score distribution of each metric.
- Test Cases - Shows all the test cases in your test run, including AI-generated summaries of your test bench and metric data for in-depth debugging and analysis.
When you have two or more test runs, you can also start running A|B regression tests.
Next Steps
Now that you’ve run your first evaluation, dive deeper into single-turn testing:
Treat your AI app as a black box. Learn how to use LLM tracing for better debugging, run remote evals, and log hyperparameters for A|B testing.
Test individual components like retrievers, generators, and tools. Built for agentic use cases where you need granular assertions.