Code-Evals

Create custom metrics using Python code on Confident AI

Overview

Code-Eval lets you create and execute custom metrics by writing Python code directly on the Confident AI platform. Unlike G-Eval, which uses natural language criteria, Code-Eval gives you full programmatic control over your evaluation logic using the deepeval framework.

Code-Eval metrics also execute on Confident AI itself: your evaluation code runs on the platform, not locally.

Why Code-Eval?

Code-Eval is ideal when you need evaluation logic that can’t be expressed in natural language:

  • Exact format validation — Verify JSON structure, regex patterns, or specific output formats
  • Deterministic scoring — Apply consistent, rule-based logic without LLM variability
  • Complex calculations — Perform multi-step computations, statistical analysis, or aggregations
  • Custom business rules — Implement domain-specific validation logic unique to your use case

For subjective evaluations like tone, helpfulness, or nuanced quality checks, use G-Eval instead — it handles LLM-as-a-judge reasoning and is easier to create without code.

How It Works

Code-Eval works exactly like creating a custom metric in deepeval. You write a Python class that inherits from BaseMetric and implement the evaluation logic.

However, on Confident AI you can only edit the following methods:

  • a_measure() — the async method where your evaluation logic runs
  • is_successful() — determines whether the test case passed

All other parts of the metric (initialization, properties, etc.) are handled by the platform.

Available Packages

Your code runs in a secure environment with access to:

  • deepeval library — Always the latest version from GitHub, including all utilities like BaseMetric and the test case types
  • Standard Python libraries — json, re, math, collections, datetime, etc.
  • No external network calls — For security reasons, external API calls are not supported (for now)
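
For example, a deterministic format check needs nothing beyond the standard library. Below is a minimal sketch, assuming a hypothetical JSONFormatMetric that requires an "answer" key in the output (the class name and required key are illustrative, not part of deepeval); it follows the BaseMetric structure described above:

import json

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class JSONFormatMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Deterministic, rule-based scoring: no LLM judge involved
        try:
            payload = json.loads(test_case.actual_output)
            if isinstance(payload, dict) and "answer" in payload:
                self.score = 1
                self.reason = "Output is valid JSON and contains the 'answer' key"
            else:
                self.score = 0
                self.reason = "Output is valid JSON but is missing the 'answer' key"
        except (json.JSONDecodeError, TypeError):
            # Not parseable as JSON (or actual_output was not a string)
            self.score = 0
            self.reason = "Output is not valid JSON"

        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            try:
                self.success = self.score >= self.threshold
            except TypeError:
                self.success = False
        return self.success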

Create Code-Eval via the UI

Code-Eval metrics can be created under Project > Metrics > Library.

1. Fill in metric details

Provide the metric name, and optionally a description. You can also toggle whether you’re creating a single-turn or multi-turn metric.

General Metric Info

Your metric name must be unique in your project and not clash with any of the default metric names.

2. Write your evaluation code

Instead of defining criteria, evaluation steps, and rubrics like in G-Eval, you write Python code that computes the evaluation score directly.

Your code must inherit from the appropriate base class and implement:

  • a_measure(test_case) — Your async evaluation logic that sets self.score, self.reason, and self.success
  • is_successful() — Returns whether the metric passed based on the threshold (pre-filled for you, not recommended to change)

Your a_measure() method does not have to return self.score, although it is recommended that you do so.

For single-turn metrics, inherit from BaseMetric and accept an LLMTestCase:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CodeMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Your evaluation logic here
        if len(test_case.actual_output) > 5:
            self.score = 1
        else:
            self.score = 0

        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            try:
                self.success = self.score >= self.threshold
            except TypeError:
                self.success = False
        return self.success

The LLMTestCase object gives you access to parameters such as input, actual_output, expected_output, and more.

For more details on test case parameters, see Test Cases, Goldens, and Datasets.
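
As a further illustration, here is a minimal sketch of an exact-match metric that compares actual_output against expected_output (the class name ExactMatchMetric is hypothetical, and the sketch assumes your test cases set expected_output):

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ExactMatchMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Score 1 if the actual output matches the expected output exactly
        # (ignoring surrounding whitespace), 0 otherwise
        actual = (test_case.actual_output or "").strip()
        expected = (test_case.expected_output or "").strip()

        self.score = 1 if actual == expected else 0
        self.reason = (
            "Output matches the expected output exactly"
            if self.score
            else "Output differs from the expected output"
        )

        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        else:
            try:
                self.success = self.score >= self.threshold
            except TypeError:
                self.success = False
        return self.success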

3. Review and save

Once you’ve written your code, make sure everything looks right in the final review page, and click Save.

You can now add your Code-Eval metric to a metric collection to start running remote evals.

Advanced Usage

Set verbose logs

Use self.verbose_logs to log intermediate steps and decision paths in your evaluation logic. This is useful for debugging complex metrics and understanding how scores are computed.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CodeMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Log anything for debugging purposes
        self.verbose_logs = "Wow I can't believe I can do this on Confident AI"

        # ... your scoring logic that sets self.score and self.success goes here ...
        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        return self.success

Verbose logs are displayed in the Confident AI dashboard alongside your metric results, making it easy to trace through evaluation decisions.
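
For instance, here is a sketch that collects intermediate checks into verbose_logs as the score is computed (the specific checks below are purely illustrative):

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CodeMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        logs = []

        # Record each intermediate check as it happens
        word_count = len(test_case.actual_output.split())
        logs.append(f"Word count: {word_count}")

        ends_with_period = test_case.actual_output.strip().endswith(".")
        logs.append(f"Ends with a period: {ends_with_period}")

        self.score = 1 if word_count >= 3 and ends_with_period else 0
        logs.append(f"Final score: {self.score}")

        # The joined logs appear next to the metric result on the dashboard
        self.verbose_logs = "\n".join(logs)
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        return self.success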

Log reasoning

Use self.reason to provide a human-readable explanation of the score. This helps users understand why a particular score was given and is displayed in the evaluation results.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CodeMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Set any reason you wish
        self.reason = "Wow I can't believe I can do this on Confident AI"

        # ... your scoring logic that sets self.score and self.success goes here ...
        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        return self.success

A clear self.reason makes it much easier to understand evaluation results, especially when reviewing failed test cases or debugging unexpected scores.
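
For example, here is a sketch where the reason is derived from the same values used to compute the score (the minimum-length rule itself is just an illustration):

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CodeMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        output_length = len(test_case.actual_output)
        self.score = 1 if output_length > 5 else 0

        # Explain the score in plain language so failed test cases are easy to review
        if self.score:
            self.reason = f"Output has {output_length} characters, which meets the minimum of 6."
        else:
            self.reason = f"Output has only {output_length} characters; at least 6 are required."

        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        return self.success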

Raise exceptions

You can also raise errors just as you normally would in Python and record them on self.error:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CodeMetric(BaseMetric):
    async def a_measure(self, test_case: LLMTestCase) -> float:
        try:
            raise ValueError("Raising an error because I feel like it")
        except Exception as e:
            # Surface the error before re-raising
            self.error = str(e)
            raise

        return self.score

    def is_successful(self) -> bool:
        if self.error is not None:
            self.success = False
        return self.success