Custom Metrics
Create custom metrics for your specific use case
Overview
Custom metrics are among the most important tools for testing LLM apps because they let you evaluate against criteria specific to your use case. You can create and use custom metrics either:
- Locally, to run evals on your machine before sending test results to Confident AI, best for code-driven evals
- Remotely, to run evals on Confident AI directly, perfect for no-code evaluation workflows
Local Evals
- Run evaluations locally using deepeval with full control over metrics
- Support for custom metrics, DAG, and advanced evaluation algorithms
Suitable for: Python users, development, and pre-deployment workflows
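For example, a local evaluation run with deepeval typically looks like the following minimal sketch. The test case contents and the built-in AnswerRelevancyMetric are illustrative; any deepeval metric, including your custom ones, plugs in the same way:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A single test case capturing one interaction with your LLM app
test_case = LLMTestCase(
    input="What is the return policy?",
    actual_output="You can return any item within 30 days of purchase.",
)

# Built-in metric used here for illustration; custom metrics are passed the same way
metric = AnswerRelevancyMetric(threshold=0.7)

# Runs locally and sends the test results to Confident AI when you're logged in
evaluate(test_cases=[test_case], metrics=[metric])
```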
Remote Evals
- Run evaluations on Confident AI platform with pre-built metrics
- Integrated with monitoring, datasets, and team collaboration features
Suitable for: Non-Python users, online and offline evals for tracing in production
Available Custom Metrics
There are two types of custom metrics you can create:
- G-Eval: LLM-as-a-judge metrics defined using natural language criteria. G-Eval is the most common approach for custom metrics since it requires no coding and can evaluate nuanced, subjective criteria like tone, helpfulness, or domain-specific correctness (see the sketch after this list).
- Code-Evals: Programmatic metrics written in Python directly on Confident AI using the deepeval framework. Use code-based metrics when you need deterministic logic, external API calls, or complex computations that can’t be expressed in natural language.
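To make the G-Eval option concrete, here is a minimal sketch of defining a G-Eval metric locally with deepeval. The metric name, criteria, and threshold are illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Natural-language criteria; no scoring code required
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually correct "
        "based on the expected output."
    ),
    # Which parts of the test case the judge LLM should consider
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.5,
)
```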
Running custom metrics locally gives you code-level control over your metrics, but this option is limited to Python users running deepeval and is not available for online/offline evals in production.
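As a sketch of what that code-level control looks like, the snippet below subclasses deepeval's BaseMetric to implement a fully programmatic metric. The length check is just an illustrative stand-in for your own scoring logic:

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class OutputLengthMetric(BaseMetric):
    """Illustrative code-based metric: passes if the output stays concise."""

    def __init__(self, max_chars: int = 500, threshold: float = 0.5):
        self.threshold = threshold
        self.max_chars = max_chars

    def measure(self, test_case: LLMTestCase) -> float:
        # Deterministic scoring logic; no LLM judge involved
        self.score = 1.0 if len(test_case.actual_output) <= self.max_chars else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Output Length"
```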
How It Works
Custom metrics follow a simple workflow:
- Create a metric — Define your custom metric either locally using deepeval or remotely via the Confident AI UI. For G-Eval, you’ll provide natural language criteria; for Code-Evals, you’ll write Python code directly on the platform.
- Add to a metric collection — Group your metric into a metric collection where you can configure settings like threshold (minimum passing score) and strictness (how harshly to penalize failures).
- Run evaluations — Execute your metrics either locally via deepeval or remotely through the Confident AI platform.
- View results — Analyze scores, reasoning, and pass/fail status in Confident AI’s dashboard.
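Putting the local version of these steps together, an end-to-end run might look like the following sketch. The criteria and test case are illustrative, step 2 (grouping metrics into a collection) happens on the Confident AI platform, and results appear in the dashboard when you are logged in to Confident AI:

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Step 1: create a metric (G-Eval, defined with natural language criteria)
helpfulness = GEval(
    name="Helpfulness",
    criteria="Assess whether the actual output directly and helpfully answers the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.5,
)

# Step 3: run evaluations against one or more test cases
test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Go to Settings > Security and click 'Reset password'.",
)
evaluate(test_cases=[test_case], metrics=[helpfulness])

# Step 4: view scores, reasoning, and pass/fail status in Confident AI's dashboard
```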
Next Steps
Now that you know your options, choose how you’d like to create your custom metrics: