Custom Metrics
Create custom metrics for your specific use case
Overview
Custom metrics are among the most important tools for testing LLM apps because they let you evaluate against criteria specific to your use case. You can create and use custom metrics either:
- Locally, to run evals on your machine before sending test results to Confident AI, best for code-driven evals
- Remotely, to run evals on Confident AI directly, perfect for no-code evaluation workflows
Local Evals
- Run evaluations locally using deepeval with full control over metrics
- Support for custom metrics, DAG, and advanced evaluation algorithms
Suitable for: Python users, development, and pre-deployment workflows
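For example, a local evaluation run with deepeval typically looks like the following minimal sketch. The test case contents and the built-in AnswerRelevancyMetric are illustrative; any deepeval metric, including your custom ones, plugs in the same way:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A single test case capturing one interaction with your LLM app
test_case = LLMTestCase(
    input="What is the return policy?",
    actual_output="You can return any item within 30 days of purchase.",
)

# Built-in metric used here for illustration; custom metrics are passed the same way
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs locally and sends the test results to Confident AI when you're logged in
evaluate(test_cases=[test_case], metrics=[metric])
```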
Remote Evals
- Run evaluations on Confident AI platform with pre-built metrics
- Integrated with monitoring, datasets, and team collaboration features
Suitable for: Non-Python users, online and offline evals for tracing in production
Available Custom Metrics
There are two types of custom metrics you can create:
- G-Eval: LLM-as-a-judge metrics defined using natural language criteria. G-Eval is the most common approach for custom metrics since it requires no coding and can evaluate nuanced, subjective criteria like tone, helpfulness, or domain-specific correctness (see the sketch after this list).
- Code-Evals: Programmatic metrics written in Python directly on Confident AI using the deepeval framework. Use code-based metrics when you need deterministic logic, external API calls, or complex computations that can’t be expressed in natural language.
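To make the G-Eval option concrete, here is a minimal sketch of defining a G-Eval metric locally with deepeval. The metric name, criteria, and threshold are illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Natural-language criteria; no scoring code required
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually correct "
        "based on the expected output."
    ),
    # Which parts of the test case the judge LLM should consider
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)
```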
Running custom metrics locally gives you code-level control over your metrics, but this option is limited to Python users running deepeval and is not available for online/offline evals in production.
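As a sketch of what that code-level control looks like, the snippet below subclasses deepeval's BaseMetric to implement a fully programmatic metric. The length check is just an illustrative stand-in for your own scoring logic:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class OutputLengthMetric(BaseMetric):
    """Illustrative code-based metric: passes if the output stays concise."""

    def __init__(self, max_chars: int = 500, threshold: float = 0.5):
        self.threshold = threshold
        self.max_chars = max_chars

    def measure(self, test_case: LLMTestCase) -> float:
        # Deterministic scoring logic; no LLM judge involved
        self.score = 1.0 if len(test_case.actual_output) <= self.max_chars else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Output Length"
```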
How It Works
Custom metrics follow a simple workflow:
- Create a metric — Define your custom metric either locally using deepeval or remotely via the Confident AI UI. For G-Eval, you’ll provide natural language criteria; for Code-Evals, you’ll write Python code directly on the platform.
- Add to a metric collection — Group your metric into a metric collection where you can configure settings like threshold (minimum passing score) and strictness (how harshly to penalize failures).
- Run evaluations — Execute your metrics either locally via deepeval or remotely through the Confident AI platform.
- View results — Analyze scores, reasoning, and pass/fail status in Confident AI’s dashboard.
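Putting the local version of these steps together, an end-to-end run might look like the following sketch. The criteria and test case are illustrative, step 2 (grouping metrics into a collection) happens on the Confident AI platform, and results appear in the dashboard when you are logged in to Confident AI:

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Step 1: create a metric (G-Eval, defined with natural language criteria)
helpfulness = GEval(
    name="Helpfulness",
    criteria="Assess whether the actual output directly and helpfully answers the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)

# Step 3: run evaluations against one or more test cases
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Go to Settings > Security and click 'Reset password'.",
)
evaluate(test_cases=[test_case], metrics=[helpfulness])

# Step 4: view scores, reasoning, and pass/fail status in Confident AI's dashboard
```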
Next Steps
Now that you know your options, choose how you’d like to create your custom metrics: