LLM Evaluation Quickstart
5 min quickstart guide for a code-driven LLM evaluation workflow
Confident AI offers a range of features for testing AI apps in code as part of a pre-deployment workflow.
You can run evals in code either locally or remotely on Confident AI; both give you the same functionality:
Locally: deepeval with full control over metrics. Suitable for: Python users, development, and pre-deployment workflows
Remotely on Confident AI. Suitable for: non-Python users, online + offline evals for tracing in prod
Let your coding agent build the eval suite for you — datasets, metrics, pytest files, and shareable Confident AI reports. Better yet, use DeepEval as your build-loop ground truth: your agent runs the evals, reads the failures and reason strings, makes the smallest app change, and re-runs to confirm. Choose the install method for your agent below.
Run these four commands in Claude Code:
The /plugins command should list DeepEval Plugin under your installed plugins.
Once installed, open the project you want to evaluate and tell your agent what you need. Example prompts:
./knowledge and run them through DeepEval.”
Your agent will run the intake questions, pick metrics, generate goldens with deepeval generate, and produce a committed pytest suite you can rerun in CI.
Point your agent at our LLM-friendly docs so it picks the right metrics and APIs: llms.txt indexes every page (append .md to any docs URL for that page’s raw Markdown). You can also connect your agent directly to our docs MCP server.
The Claude Code plugin is Python-first today. TypeScript support via Claude Code is coming soon — for now, follow the TypeScript steps below directly.
This example walks through a single-turn, end-to-end evaluation in code.
You’ll need to get your API key as shown in the setup and installation section before continuing.
It is mandatory to create a dataset for a proper evaluation workflow.
If a dataset is not possible for your team at this point, set up LLM tracing to run ad-hoc evaluations without a dataset instead. Confident AI will generate datasets for you automatically this way.
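For the dataset step, a minimal sketch might look like the following. It assumes deepeval is installed and that you are already authenticated with your Confident AI API key. The helper names (make_golden_inputs, push_dataset) and the alias you pass in are hypothetical, chosen only for illustration; EvaluationDataset, Golden, and push are deepeval's own.

```python
def make_golden_inputs(questions):
    """Hypothetical helper: strip whitespace, drop blanks and duplicates,
    and keep the original order of the remaining questions."""
    seen, cleaned = set(), []
    for question in questions:
        question = question.strip()
        if question and question not in seen:
            seen.add(question)
            cleaned.append(question)
    return cleaned


def push_dataset(inputs, alias):
    """Wrap each input in a Golden and push the dataset to Confident AI.

    deepeval is imported lazily so the helper above works without it.
    Assumes your Confident AI API key is already configured.
    """
    from deepeval.dataset import EvaluationDataset, Golden

    dataset = EvaluationDataset(goldens=[Golden(input=text) for text in inputs])
    dataset.push(alias=alias)  # alias is the dataset name shown on the platform
    return dataset
```

Calling push_dataset(make_golden_inputs(raw_questions), alias="My Evals Dataset") would then upload the goldens; the alias here is only an example name.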
Done ✅. You should now see your dataset on the platform.
Create a metric locally in deepeval. Here, we’re using the AnswerRelevancyMetric() for demo purposes.
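As a rough sketch of the metric step, the snippet below builds an AnswerRelevancyMetric with an explicit threshold. The build_metric and passes helpers are hypothetical names used for illustration, and the 0.7 threshold is an arbitrary example value. The passes function only illustrates the pass rule as we understand it (a score at or above the threshold passes); it is not part of deepeval.

```python
def passes(score: float, threshold: float = 0.5) -> bool:
    """Illustrative pass rule: a score at or above the threshold passes."""
    return score >= threshold


def build_metric(threshold: float = 0.7):
    """Create the demo metric with an explicit threshold.

    deepeval is imported lazily so this module loads without it installed.
    The 0.7 default is an example value, not a recommendation.
    """
    from deepeval.metrics import AnswerRelevancyMetric

    return AnswerRelevancyMetric(threshold=threshold, include_reason=True)
```

A stricter threshold makes the same output harder to pass, so it is worth choosing the value deliberately rather than relying on the default.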
Since all metrics in deepeval use LLM-as-a-Judge, you will also need to configure your LLM judge provider. To use OpenAI for evals:
You can also use any model provider, since deepeval integrates with all of them.
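When OpenAI is the judge, the key is read from the OPENAI_API_KEY environment variable. A small pre-flight check like the sketch below can surface a missing key before any evaluation starts. The missing_judge_env helper is a hypothetical name, and the check is an optional convenience rather than anything deepeval requires.

```python
import os


def missing_judge_env(env=None, required=("OPENAI_API_KEY",)):
    """Return the names of required variables that are unset or empty.

    `env` defaults to the real process environment; pass a dict to test.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]
```

Calling missing_judge_env() at the top of your script and raising an error when the list is non-empty gives a clearer failure than one that surfaces halfway through a run.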
A test run is a benchmark/snapshot of your AI app’s performance at any point in time. You’ll need to:
Lastly, run main.py to run your first single-turn, end-to-end evaluation:
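A main.py for this step might look like the sketch below. It assumes the dataset was already pushed under the alias you pass in, that your Confident AI and OpenAI keys are configured, and that my_ai_app is a hypothetical stand-in for your real application. The helper names are illustrative; evaluate, EvaluationDataset, LLMTestCase, and AnswerRelevancyMetric are deepeval's own.

```python
def my_ai_app(question: str) -> str:
    """Hypothetical stand-in for your real AI app."""
    return f"Echo: {question}"


def collect_outputs(inputs, app):
    """Run the app once per input and pair each input with its output."""
    return [(text, app(text)) for text in inputs]


def run_test_run(alias: str):
    """Pull the dataset, run the app on every golden, and evaluate.

    deepeval is imported lazily so the helpers above work without it.
    """
    from deepeval import evaluate
    from deepeval.dataset import EvaluationDataset
    from deepeval.metrics import AnswerRelevancyMetric
    from deepeval.test_case import LLMTestCase

    dataset = EvaluationDataset()
    dataset.pull(alias=alias)  # fetch the goldens stored on Confident AI

    pairs = collect_outputs([golden.input for golden in dataset.goldens], my_ai_app)
    test_cases = [
        LLMTestCase(input=text, actual_output=output) for text, output in pairs
    ]
    return evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric()])
```

To execute it, add a call such as run_test_run("My Evals Dataset") at the bottom of the file (the alias is an example) and run python main.py.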
✅ Done. You just created your first test run, with a shareable testing report auto-generated on Confident AI.
There are two main pages in a testing report:
When you have two or more test runs, you can also start running A|B regression tests.
Now that you’ve run your first evaluation, dive deeper into single-turn testing:
Treat your AI app as a black box. Learn how to use LLM tracing for better debugging, run remote evals, and log hyperparameters for A|B testing.
Test individual components like retrievers, generators, and tools. Built for agentic use cases where you need granular assertions.