Code-Evals
Overview
Code-Eval lets you create and execute custom metrics by writing Python code directly on the Confident AI platform. Unlike G-Eval which uses natural language criteria, Code-Eval gives you full programmatic control over your evaluation logic using the deepeval framework.
Why Code-Eval?
Code-Eval is ideal when you need evaluation logic that can’t be expressed in natural language:
- Exact format validation — Verify JSON structure, regex patterns, or specific output formats
- Deterministic scoring — Apply consistent, rule-based logic without LLM variability
- Complex calculations — Perform multi-step computations, statistical analysis, or aggregations
- Custom business rules — Implement domain-specific validation logic unique to your use case
For subjective evaluations like tone, helpfulness, or nuanced quality checks, use G-Eval instead — it handles LLM-as-a-judge reasoning and is easier to create without code.
How It Works
Code-Eval works exactly like creating a custom metric in deepeval. You write a Python class that inherits from BaseMetric and implement the evaluation logic.
However, on Confident AI you can only edit the following methods:
- a_measure() — the async method where your evaluation logic runs
- is_successful() — determines whether the test case passed
All other parts of the metric (initialization, properties, etc.) are handled by the platform.
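Conceptually, a metric on the platform looks like the sketch below; the class name and scaffolding outside the two methods are illustrative of what the platform manages for you.

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class MyCustomMetric(BaseMetric):
    # Initialization, threshold, metric name, etc. are pre-filled by the platform.

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Editable: compute and set self.score, self.reason, and self.success here.
        ...

    def is_successful(self) -> bool:
        # Editable (pre-filled): whether the test case passed.
        return self.success
```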
Available Packages
Your code runs in a secure environment with access to:
- deepeval library — Always the latest version from GitHub, including all utilities like BaseMetric, and test case types
- Standard Python libraries — json, re, math, collections, datetime, etc.
- No external network calls — For security reasons, external API calls are not supported (for now)
Create Code-Eval via the UI
Code-Eval metrics can be created under Project > Metrics > Library.
Fill in metric details
Provide the metric name, and optionally a description. You can also toggle whether you’re creating a single-turn or multi-turn metric.
Your metric name must be unique in your project and not clash with any of the default metric names.
Write your evaluation code
Instead of defining criteria, evaluation steps, and rubrics like in G-Eval, you write Python code that computes the evaluation score directly.
Your code must inherit from the appropriate base class and implement:
- a_measure(test_case) — Your async evaluation logic that sets self.score, self.reason, and self.success
- is_successful() — Returns whether the metric passed based on the threshold (pre-filled for you, not recommended to change)
Your a_measure() method does not have to return self.score, although it is recommended that you do so.
Single-turn
Multi-turn
For single-turn metrics, inherit from BaseMetric and accept an LLMTestCase:
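As a minimal, illustrative sketch (the JSON-validity check and class name are examples, not the platform template; on the platform everything outside a_measure() and is_successful() is pre-filled for you):

```python
import json

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase


class ValidJSONMetric(BaseMetric):
    """Deterministic example metric: passes if actual_output parses as JSON."""

    def __init__(self, threshold: float = 0.5):
        # On Confident AI, initialization like this is handled for you.
        self.threshold = threshold

    async def a_measure(self, test_case: LLMTestCase) -> float:
        try:
            json.loads(test_case.actual_output)
            self.score = 1.0
            self.reason = "actual_output is valid JSON."
        except (json.JSONDecodeError, TypeError):
            self.score = 0.0
            self.reason = "actual_output could not be parsed as JSON."

        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return self.success
```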
The LLMTestCase object gives you access to parameters such as input, actual_output, expected_output, and more.
For more details on test case parameters, see Test Cases, Goldens, and Datasets.
Advanced Usage
Set verbose logs
Use self.verbose_logs to log intermediate steps and decision paths in your evaluation logic. This is useful for debugging complex metrics and understanding how scores are computed.
Verbose logs are displayed in the Confident AI dashboard alongside your metric results, making it easy to trace through evaluation decisions.
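For example, you might collect intermediate steps inside a_measure() and join them into self.verbose_logs. The keyword check below is purely hypothetical:

```python
async def a_measure(self, test_case: LLMTestCase) -> float:
    steps = []

    word_count = len(test_case.actual_output.split())
    steps.append(f"Word count: {word_count}")

    has_keyword = "refund" in test_case.actual_output.lower()
    steps.append(f"Contains keyword 'refund': {has_keyword}")

    self.score = 1.0 if has_keyword else 0.0
    self.success = self.score >= self.threshold

    # The joined log appears next to the metric result on the dashboard
    self.verbose_logs = "\n".join(steps)
    return self.score
```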
Log reasoning
Use self.reason to provide a human-readable explanation of the score. This helps users understand why a particular score was given and is displayed in the evaluation results.
A clear self.reason makes it much easier to understand evaluation results,
especially when reviewing failed test cases or debugging unexpected scores.
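For instance, a reason that embeds the computed details is more useful than a bare pass/fail message. The required-keys check below is hypothetical and assumes json is imported:

```python
async def a_measure(self, test_case: LLMTestCase) -> float:
    payload = json.loads(test_case.actual_output)
    missing = [key for key in ("name", "email") if key not in payload]

    self.score = 0.0 if missing else 1.0
    self.reason = (
        f"Output JSON is missing required keys: {', '.join(missing)}"
        if missing
        else "Output JSON contains all required keys."
    )
    self.success = self.score >= self.threshold
    return self.score
```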
Raise exceptions
You can also raise an error just as you normally would in Python and log it to self.error.
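A hypothetical guard inside a_measure() might look like this:

```python
async def a_measure(self, test_case: LLMTestCase) -> float:
    if test_case.actual_output is None:
        # Record the failure so it surfaces in the evaluation results
        self.error = "actual_output is missing, so the metric cannot be computed."
        raise ValueError(self.error)

    # ... rest of your evaluation logic ...
```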