Code-Evals
Overview
Code-Eval lets you create and execute custom metrics by writing Python code directly on the Confident AI platform. Unlike G-Eval which uses natural language criteria, Code-Eval gives you full programmatic control over your evaluation logic using the deepeval framework.
Why Code-Eval?
Code-Eval is ideal when you need evaluation logic that can’t be expressed in natural language:
- Exact format validation — Verify JSON structure, regex patterns, or specific output formats
- Deterministic scoring — Apply consistent, rule-based logic without LLM variability
- Complex calculations — Perform multi-step computations, statistical analysis, or aggregations
- Custom business rules — Implement domain-specific validation logic unique to your use case
For subjective evaluations like tone, helpfulness, or nuanced quality checks, use G-Eval instead — it handles LLM-as-a-judge reasoning and is easier to create without code.
How It Works
Code-Eval works exactly like creating a custom metric in deepeval. You write a Python class that inherits from BaseMetric and implement the evaluation logic.
However, on Confident AI you can only edit the following methods:
- a_measure() — the async method where your evaluation logic runs
- is_successful() — determines whether the test case passed
All other parts of the metric (initialization, properties, etc.) are handled by the platform.
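Conceptually, a metric on the platform looks like the sketch below; the class name and scaffolding outside the two methods are illustrative of what the platform manages for you.

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class MyCustomMetric(BaseMetric):
    # Initialization, threshold, metric name, etc. are pre-filled by the platform.

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Editable: compute and set self.score, self.reason, and self.success here.
        ...

    def is_successful(self) -> bool:
        # Editable (pre-filled): whether the test case passed.
        return self.success
```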
Available Packages
Your code runs in a secure environment with access to:
- deepeval library — Always the latest version from GitHub, including all utilities like BaseMetric, and test case types
- Standard Python libraries — json, re, math, collections, datetime, etc.
- No external network calls — For security reasons, external API calls are not supported (for now)
Create Code-Eval via the UI
Code-Eval metrics can be created under Project > Metrics > Library.
Fill in metric details
Provide the metric name, and optionally a description. You can also toggle whether you’re creating a single-turn or multi-turn metric.
Your metric name must be unique in your project and not clash with any of the default metric names.
Write your evaluation code
Instead of defining criteria, evaluation steps, and rubrics like in G-Eval, you write Python code that computes the evaluation score directly.
Your code must inherit from the appropriate base class and implement:
- a_measure(test_case) — Your async evaluation logic that sets self.score, self.reason, and self.success
- is_successful() — Returns whether the metric passed based on the threshold (pre-filled for you, not recommended to change)
Your a_measure() method does not have to return self.score, although it is recommended that you do so.
Single-turn
Multi-turn
For single-turn metrics, inherit from BaseMetric and accept an LLMTestCase:
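As a minimal, illustrative sketch (the JSON-validity check and class name are examples, not the platform template; on the platform everything outside a_measure() and is_successful() is pre-filled for you):

```python
import json

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class ValidJSONMetric(BaseMetric):
    """Deterministic example metric: passes if actual_output parses as JSON."""

    def __init__(self, threshold: float = 0.5):
        # On Confident AI, initialization like this is handled for you.
        self.threshold = threshold

    async def a_measure(self, test_case: LLMTestCase) -> float:
        try:
            json.loads(test_case.actual_output)
            self.score = 1.0
            self.reason = "actual_output is valid JSON."
        except (json.JSONDecodeError, TypeError):
            self.score = 0.0
            self.reason = "actual_output could not be parsed as JSON."

        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success
```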
The LLMTestCase object gives you access to parameters such as input, actual_output, expected_output, and more.
For more details on test case parameters, see Test Cases, Goldens, and Datasets.
Advanced Usage
Set verbose logs
Use self.verbose_logs to log intermediate steps and decision paths in your evaluation logic. This is useful for debugging complex metrics and understanding how scores are computed.
Verbose logs are displayed in the Confident AI dashboard alongside your metric results, making it easy to trace through evaluation decisions.
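For example, you might collect intermediate steps inside a_measure() and join them into self.verbose_logs. The keyword check below is purely hypothetical:

```python
async def a_measure(self, test_case: LLMTestCase) -> float:
    steps = []

    word_count = len(test_case.actual_output.split())
    steps.append(f"Word count: {word_count}")

    has_keyword = "refund" in test_case.actual_output.lower()
    steps.append(f"Contains keyword 'refund': {has_keyword}")

    self.score = 1.0 if has_keyword else 0.0
    self.success = self.score >= self.threshold

    # The joined log appears next to the metric result on the dashboard
    self.verbose_logs = "\n".join(steps)
    return self.score
```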
Log reasoning
Use self.reason to provide a human-readable explanation of the score. This helps users understand why a particular score was given and is displayed in the evaluation results.
A clear self.reason makes it much easier to understand evaluation results,
especially when reviewing failed test cases or debugging unexpected scores.
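For instance, a reason that embeds the computed details is more useful than a bare pass/fail message. The required-keys check below is hypothetical and assumes json is imported:

```python
async def a_measure(self, test_case: LLMTestCase) -> float:
    payload = json.loads(test_case.actual_output)
    missing = [key for key in ("name", "email") if key not in payload]

    self.score = 0.0 if missing else 1.0
    self.reason = (
        f"Output JSON is missing required keys: {', '.join(missing)}"
        if missing
        else "Output JSON contains all required keys."
    )
    self.success = self.score >= self.threshold
    return self.score
```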
Raise exceptions
You can also raise an error just as you normally would in Python and log it to self.error.
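A hypothetical guard inside a_measure() might look like this:

```python
async def a_measure(self, test_case: LLMTestCase) -> float:
    if test_case.actual_output is None:
        # Record the failure so it surfaces in the evaluation results
        self.error = "actual_output is missing, so the metric cannot be computed."
        raise ValueError(self.error)

    # ... rest of your evaluation logic ...
```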