Code-Evals
Code-Eval lets you create and execute custom metrics by writing Python code directly on the Confident AI platform. Unlike G-Eval, which uses natural language criteria, Code-Eval gives you full programmatic control over your evaluation logic using the deepeval framework.
Code-Eval is ideal when you need evaluation logic that can’t be expressed in natural language:
For subjective evaluations like tone, helpfulness, or nuanced quality checks, use G-Eval instead — it handles LLM-as-a-judge reasoning and is easier to create without code.
Code-Eval works exactly like creating a custom metric in deepeval. You write a Python class that inherits from BaseMetric and implement the evaluation logic.
However, on Confident AI you can only edit the following methods:
a_measure(): the async method where your evaluation logic runs
is_successful(): determines whether the test case passed
All other parts of the metric (initialization, properties, etc.) are handled by the platform.
Your code runs in a secure environment with access to:
deepeval library: always the latest version from GitHub, including all utilities such as BaseMetric and the test case types
Standard library modules: json, re, math, collections, datetime, etc.
Code-Eval metrics can be created under Project > Metrics > Library.
Provide the metric name, and optionally a description. You can also toggle whether you’re creating a single-turn or multi-turn metric.
Your metric name must be unique in your project and not clash with any of the default metric names.
Instead of defining criteria, evaluation steps, and rubrics like in G-Eval, you write Python code that computes the evaluation score directly.
Your code must inherit from the appropriate base class and implement:
a_measure(test_case): your async evaluation logic, which sets self.score, self.reason, and self.success
is_successful(): returns whether the metric passed based on the threshold (pre-filled for you; changing it is not recommended)
Your a_measure() method does not have to return self.score, although it is recommended that you do so.
For single-turn metrics, inherit from BaseMetric and accept an LLMTestCase:
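A minimal sketch of a single-turn metric, runnable locally. The metric name ExactMatchMetric, its exact-match scoring rule, and the __init__ method are illustrative assumptions (on Confident AI the platform handles initialization). The fallback classes in the except branch are stand-ins so the sketch also runs without deepeval installed; they are not the real classes.

```python
import asyncio

try:
    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase
except ImportError:
    # Stand-ins (not the real classes) so the sketch runs without deepeval.
    from types import SimpleNamespace as LLMTestCase

    class BaseMetric:
        score = reason = success = error = verbose_logs = None


class ExactMatchMetric(BaseMetric):  # hypothetical metric name
    def __init__(self, threshold: float = 0.5):
        # Only needed for local runs; the platform handles initialization.
        self.threshold = threshold

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Compare the outputs after trimming surrounding whitespace.
        actual = (test_case.actual_output or "").strip()
        expected = (test_case.expected_output or "").strip()
        self.score = 1.0 if actual == expected else 0.0
        self.reason = (
            "Output matches the expected output."
            if self.score == 1.0
            else "Output differs from the expected output."
        )
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return bool(self.success)


test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris ",
    expected_output="Paris",
)
metric = ExactMatchMetric()
score = asyncio.run(metric.a_measure(test_case))
print(score, metric.is_successful(), metric.reason)
```

Only a_measure() and is_successful() correspond to what you would edit on the platform; the rest is scaffolding for running the sketch on your own machine.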
The LLMTestCase object gives you access to parameters such as input, actual_output, expected_output, and more.
For more details on test case parameters, see Test Cases, Goldens, and Datasets.
Use self.verbose_logs to log intermediate steps and decision paths in your evaluation logic. This is useful for debugging complex metrics and understanding how scores are computed.
Verbose logs are displayed in the Confident AI dashboard alongside your metric results, making it easy to trace through evaluation decisions.
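One way to record intermediate steps is to collect log lines during evaluation and join them into self.verbose_logs at the end. The sketch below assumes verbose_logs holds a single string; the metric name KeywordCoverageMetric, its REQUIRED keywords, and the __init__ method are hypothetical, and the fallback classes are stand-ins so it runs without deepeval installed.

```python
import asyncio

try:
    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase
except ImportError:
    # Stand-ins (not the real classes) so the sketch runs without deepeval.
    from types import SimpleNamespace as LLMTestCase

    class BaseMetric:
        score = reason = success = error = verbose_logs = None


class KeywordCoverageMetric(BaseMetric):  # hypothetical metric name
    REQUIRED = ("refund", "30 days")  # hypothetical keywords to look for

    def __init__(self, threshold: float = 0.5):
        # Only needed for local runs; the platform handles initialization.
        self.threshold = threshold

    async def a_measure(self, test_case: LLMTestCase) -> float:
        logs = []
        output = test_case.actual_output.lower()
        hits = 0
        for keyword in self.REQUIRED:
            found = keyword in output
            hits += int(found)
            # Log each decision so the score can be traced afterwards.
            logs.append(f"keyword {keyword!r}: {'found' if found else 'missing'}")
        self.score = hits / len(self.REQUIRED)
        logs.append(f"score = {hits}/{len(self.REQUIRED)} = {self.score:.2f}")
        self.verbose_logs = "\n".join(logs)
        self.reason = f"{hits} of {len(self.REQUIRED)} required keywords were present."
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return bool(self.success)


metric = KeywordCoverageMetric()
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a refund at any time.",
)
asyncio.run(metric.a_measure(test_case))
print(metric.verbose_logs)
```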
Use self.reason to provide a human-readable explanation of the score. This helps users understand why a particular score was given and is displayed in the evaluation results.
A clear self.reason makes it much easier to understand evaluation results, especially when reviewing failed test cases or debugging unexpected scores.
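As an illustration, a reason is most useful when it states the concrete numbers behind the score. The sketch below uses a hypothetical MaxWordsMetric with an assumed word limit and scoring rule; the __init__ method and the fallback classes exist only so it runs locally without deepeval installed.

```python
import asyncio

try:
    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase
except ImportError:
    # Stand-ins (not the real classes) so the sketch runs without deepeval.
    from types import SimpleNamespace as LLMTestCase

    class BaseMetric:
        score = reason = success = error = verbose_logs = None


class MaxWordsMetric(BaseMetric):  # hypothetical metric name
    def __init__(self, max_words: int = 20, threshold: float = 0.8):
        # Only needed for local runs; the platform handles initialization.
        self.max_words = max_words
        self.threshold = threshold

    async def a_measure(self, test_case: LLMTestCase) -> float:
        count = len(test_case.actual_output.split())
        if count <= self.max_words:
            self.score = 1.0
            self.reason = (
                f"Output has {count} words, within the {self.max_words}-word limit."
            )
        else:
            # Assumed rule: scale the score down in proportion to the overshoot.
            self.score = self.max_words / count
            self.reason = (
                f"Output has {count} words, {count - self.max_words} over the "
                f"{self.max_words}-word limit; score scaled to {self.score:.2f}."
            )
        self.success = self.score >= self.threshold
        return self.score

    def is_successful(self) -> bool:
        return bool(self.success)


metric = MaxWordsMetric()
long_case = LLMTestCase(input="Summarize the report.", actual_output="word " * 40)
asyncio.run(metric.a_measure(long_case))
print(metric.reason)
```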
You can also raise an exception as you normally would in Python and record it in self.error:
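A minimal sketch of that pattern: catch the exception inside a_measure(), store its message in self.error, and re-raise. The metric name RequiresExpectedOutputMetric and the rule that a missing expected_output is an error are assumptions for illustration; the __init__ method and the fallback classes exist only so the sketch runs locally without deepeval installed.

```python
import asyncio

try:
    from deepeval.metrics import BaseMetric
    from deepeval.test_case import LLMTestCase
except ImportError:
    # Stand-ins (not the real classes) so the sketch runs without deepeval.
    from types import SimpleNamespace as LLMTestCase

    class BaseMetric:
        score = reason = success = error = verbose_logs = None


class RequiresExpectedOutputMetric(BaseMetric):  # hypothetical metric name
    def __init__(self, threshold: float = 0.5):
        # Only needed for local runs; the platform handles initialization.
        self.threshold = threshold

    async def a_measure(self, test_case: LLMTestCase) -> float:
        try:
            if test_case.expected_output is None:
                raise ValueError("expected_output is required for this metric")
            matches = test_case.actual_output == test_case.expected_output
            self.score = 1.0 if matches else 0.0
            self.reason = "Outputs match." if matches else "Outputs differ."
            self.success = self.score >= self.threshold
            return self.score
        except Exception as exc:
            # Record the failure on the metric, then let the exception propagate.
            self.error = str(exc)
            raise

    def is_successful(self) -> bool:
        # A metric that errored is never considered successful.
        if self.error is not None:
            return False
        return bool(self.success)


metric = RequiresExpectedOutputMetric()
bad_case = LLMTestCase(input="q", actual_output="a", expected_output=None)
try:
    asyncio.run(metric.a_measure(bad_case))
except ValueError:
    print("metric failed:", metric.error)
```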