Custom Metrics

Learn how to create and use custom metrics locally and remotely

Overview

Custom metrics are among the most important metrics for testing LLM apps, as they allow you to evaluate against criteria specific to your use case. You can create and use custom metrics either:

  • Locally, to run evals on your machine before sending test results to Confident AI
  • Remotely, to run evals on Confident AI directly

Running custom metrics locally gives you code-level control over your metrics, but they are limited to Python users using deepeval and are not available for online/offline evals in production.

Local Evals
  • Run evaluations locally using deepeval with full control over metrics
  • Support for custom metrics, DAG, and advanced evaluation algorithms

Suitable for: Python users, development, and pre-deployment workflows

Remote Evals
  • Run evaluations on Confident AI platform with pre-built metrics
  • Integrated with monitoring, datasets, and team collaboration features

Suitable for: Non-python users, online + offline evals for tracing in prod

How It Works

Currently, all custom metrics on Confident AI use the G-Eval algorithm, which:

  1. Generates a list of evaluation steps, using your initial custom criteria
  2. Uses this list of evaluation steps to compute a score from 0 - 10
  3. Takes a weighted summation over the judge's possible scores to make the final score more reliable (sketched below)
  4. Divides the final score by 10 to normalize it to the range 0 - 1
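
A minimal sketch of that scoring math, assuming the LLM judge returns a probability for each candidate integer score (the exact implementation may differ):

# Hypothetical probabilities the judge assigns to each integer score from 0 - 10;
# the numbers below are illustrative only.
judge_score_probs = {7: 0.10, 8: 0.55, 9: 0.30, 10: 0.05}

# A weighted summation over the candidate scores is less noisy than taking a
# single sampled score.
raw_score = sum(score * prob for score, prob in judge_score_probs.items())  # = 8.3 on a 0 - 10 scale

# Divide by 10 to normalize the final score to the 0 - 1 range.
final_score = raw_score / 10  # = 0.83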

Custom metrics also accept an optional rubric, which you can use to confine custom metric scores to certain ranges.

The evaluation steps do not determine what score G-Eval gives; they simply guide the LLM judge, as a form of CoT, to output something more reliable.

If you’re creating custom metrics locally, you can implement anything you wish.

Common Mistakes to Avoid

There are a few common mistakes when creating a custom metric for the first time:

  • Criteria/evaluation steps are not detailed enough
    Many issues arise when criteria are too vague. If you expect the metric to handle X, Y, or Z, make sure those requirements are actually written into the criteria so the metric can use them.

  • Required test case parameters are not referenced
    When defining a custom metric, explicitly state what each parameter means and how they connect. For example, specify that the 'actual output' should semantically match the 'expected output'.

  • Vague language in criteria
    Avoid terms like “accurate” without explanation. Instead, define what “accurate” means - for example, “does not contradict the 'retrieval context'” rather than a general “factually correct” (see the example after this list).

  • Trying to set the rubric in the evaluation steps
    The evaluation steps are meant to help the LLM-as-a-Judge think. If you try to define the score that should be output instead of telling the metric how to weigh the different parameters at play, it can lead to worse results.

  • Non-quantitative language in rubric
    If a rubric range of 0–1 represents “low accuracy,” spell out what that means in practice — for example, does it allow 2–3 contradictions between the "actual output" and "retrieval context"? Always use concrete definitions like this rather than vague labels such as “low,” “medium,” or “high” without explanation.
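
For example, a vague criteria like the first one below can be tightened into something the metric can actually act on (the wording is illustrative):

Vague: Determine whether the 'actual output' is accurate.

Better: Using the 'retrieval context' as the only source of truth, determine whether the 'actual output' is accurate - it should not contradict any facts in the 'retrieval context', and should not introduce claims that the 'retrieval context' does not support.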

Create Custom Metrics on the Platform

Single or multi-turn custom metrics can be created under Project > Metrics > Library.

1. Fill in metric details

Provide the metric name, and optionally a description. You can also toggle whether you’re creating a single-turn or multi-turn metric.

General Metric Info

Your metric name must be unique within your project and must not clash with any of the default metric names.

2. Select required parameters

Custom metrics need to know which test case parameters they should consider during evaluation for the results to be accurate and reliable - this step lets you specify exactly that. The example below shows selecting single-turn test case parameters for a single-turn metric, but you can also do the same for multi-turn parameters.

Evaluation Parameters
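
For reference, if you're also using deepeval, these parameters map to the fields of a single-turn test case - a minimal sketch with hypothetical values:

from deepeval.test_case import LLMTestCase

# The parameters you mark as required on the platform correspond to test case
# fields like these; the values below are hypothetical.
test_case = LLMTestCase(
    input="What should I do about a persistent headache?",
    actual_output="Stay hydrated and rest; see a doctor if it lasts more than a few days.",
    expected_output="Recommend rest, hydration, and consulting a doctor if symptoms persist.",
    retrieval_context=["Persistent headaches lasting several days warrant medical attention."],
)
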
3. Define custom criteria

A custom criteria helps Confident AI generate evaluation steps and is what separates an out-of-the-box metric from a custom one.

You must mention the names of the required parameters you’ve selected in the previous step. For example, if you’ve selected “Input” and “Actual Output” for a single-turn use case, your criteria could be something like:

Given the 'input' and 'actual output' which are the query and answer to an AI medical chatbot,
determine whether the 'actual output' is relevant and helpful to the 'input'.
Penalize heavily if not helpful. Relevancy is not so important.
Metric Criteria

Criteria are used for generating evaluation steps, and not used directly for evaluation.

4. Outline evaluation steps (optional)

This step is optional because Confident AI will auto-generate evaluation steps based on your criteria if you don’t provide any.

However, providing evaluation steps gives custom metrics more reliable scores, as Confident AI will skip the step-generation process when they are provided.

You should not outline what scores to return at this stage (that goes in the rubric which we will talk about later).
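
For example, evaluation steps for the medical chatbot criteria above might look like this (the wording is illustrative):

1. Check whether the 'actual output' directly addresses the question asked in the 'input'.
2. Check whether the 'actual output' gives actionable, helpful guidance; penalize heavily if it does not.
3. Consider how relevant the 'actual output' is to the 'input', but weigh relevancy less than helpfulness.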

Optional Evaluation Steps
5. Set up rubric (optional)

Lastly, you can optionally provide a rubric to confine evaluation scores. Your list of rubric ranges must:

  • Not overlap in score range
  • Contain a clear expected outcome for each score range
  • Collectively cover the full 0 - 10 range

The rubric score is defined on a 0–10 scale, but the final score reported by Confident AI is normalized to a 0–1 range. We use integers for the rubric since LLM-as-a-Judge performs more reliably with whole numbers, and then divide by 10 afterward to convert it into the normalized scale.
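
For example, a rubric for the medical chatbot metric above could look like this (ranges and wording are illustrative):

0 - 3: The 'actual output' does not address the 'input' or gives no actionable guidance.
4 - 6: The 'actual output' partially addresses the 'input' but misses key guidance or contains 2 - 3 irrelevant or unhelpful statements.
7 - 10: The 'actual output' directly addresses the 'input' and gives clear, actionable guidance with no irrelevant content.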

Optional Rubric
6. Review and save

Once you have your criteria set, make sure everything looks right in the final review page, and click Save.

You can now add your custom metric to a metric collection to start running remote evals.

Create Custom Metrics Locally

You can create custom metrics using various methods through deepeval.

The best place to learn how to create custom metrics locally is deepeval’s documentation.
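
As a quick sketch, a custom metric built locally with deepeval's GEval could look like this (check deepeval's docs for the exact, up-to-date API; the test case values are hypothetical):

from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# The criteria mirrors the medical chatbot example above and references the
# required parameters by name, as recommended in "Common Mistakes to Avoid".
helpfulness = GEval(
    name="Helpfulness",
    criteria=(
        "Given the 'input' and 'actual output' which are the query and answer "
        "to an AI medical chatbot, determine whether the 'actual output' is "
        "relevant and helpful to the 'input'. Penalize heavily if not helpful."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input="What should I do about a persistent headache?",
    actual_output="Stay hydrated, rest, and see a doctor if it lasts more than a few days.",
)

# Runs the eval locally; if you're logged in to Confident AI, the test results
# are sent to the platform afterwards.
evaluate(test_cases=[test_case], metrics=[helpfulness])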