LLM Metrics on Confident AI

Overview of metrics in Confident AI

Overview

Metrics are the foundation of LLM evaluation on Confident AI. They define the criteria used to score and assess your LLM outputs — whether you’re testing in development, running experiments, or monitoring production systems.

On Confident AI, “metrics” refers to evaluation metrics that assess the quality of LLM outputs — not operational metrics like latency, cost, or token usage. For tracking operational data, see Latency, Cost, and Error Tracking.

Confident AI provides two categories of metrics:

  • Pre-built metrics — Battle-tested metrics for common evaluation scenarios like answer relevancy, faithfulness, hallucination detection, and more
  • Custom metrics — Create your own metrics tailored to your specific use case using G-Eval (natural language criteria) or Code-Evals (Python code)

Both categories support single-turn (individual LLM interactions) and multi-turn (conversational) evaluations.

How Metrics Work

Metrics on Confident AI follow a simple pattern:

  1. Define your metrics — Choose from pre-built metrics or create custom ones
  2. Group into collections — Add metrics to a metric collection with specific settings (threshold, strictness, etc.)
  3. Run evaluations — Use the collection for test runs, experiments, or production monitoring
  4. Analyze results — View scores, reasoning, and pass/fail status in the dashboard

Metric collections are required for remote evaluations on Confident AI. For local evaluations using deepeval, you can use metrics directly without collections.
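The four steps above can be sketched in plain Python. This is a minimal illustration of the pattern only, not the Confident AI or deepeval API: `Metric`, `CollectionEntry`, `run_collection`, and the toy `keyword_overlap` scorer (standing in for an LLM judge) are all hypothetical names, and treating strictness as "only a perfect score passes" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical types for illustration; not the Confident AI or deepeval API.
@dataclass
class Metric:
    name: str
    score_fn: Callable[[str, str], float]  # (input, output) -> score in [0, 1]


@dataclass
class CollectionEntry:
    metric: Metric
    threshold: float = 0.5
    strict: bool = False  # assumption: strict means only a score of 1.0 passes


def keyword_overlap(query: str, output: str) -> float:
    """Toy stand-in for an LLM judge: fraction of query words found in the output."""
    query_words = set(query.lower().split())
    output_words = set(output.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & output_words) / len(query_words)


def run_collection(entries: List[CollectionEntry], query: str, output: str) -> Dict[str, dict]:
    """Score one interaction with every metric in the collection."""
    results = {}
    for entry in entries:
        score = entry.metric.score_fn(query, output)
        required = 1.0 if entry.strict else entry.threshold
        results[entry.metric.name] = {"score": score, "passed": score >= required}
    return results


# 1. Define a metric, 2. group it into a collection with settings,
# 3. run the evaluation, 4. inspect score and pass/fail status.
overlap_metric = Metric("Keyword Overlap", keyword_overlap)
collection = [CollectionEntry(overlap_metric, threshold=0.5)]
results = run_collection(collection, "reset my password", "open settings to reset your password")
strict_results = run_collection(
    [CollectionEntry(overlap_metric, strict=True)],
    "reset my password",
    "open settings to reset your password",
)
```

Keeping the threshold and strictness on the collection entry rather than on the metric mirrors the idea that the same metric can be reused in different collections with different settings.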

Pre-built Metrics

Confident AI offers a comprehensive library of pre-built metrics powered by LLM-as-a-judge:

Single-Turn

Metric                  Description
Answer Relevancy        Measures how relevant the response is to the input query
Faithfulness            Checks if the response is grounded in the provided context
Hallucination           Detects fabricated or unsupported information
Contextual Precision    Evaluates retrieval ranking quality
Contextual Recall       Measures retrieval completeness
Contextual Relevancy    Assesses relevance of retrieved context
Bias                    Detects biased content in responses
Toxicity                Identifies toxic or harmful content
Summarization           Evaluates summary quality and accuracy
Task Completion         Checks if the task was successfully completed
Tool Correctness        Validates correct tool/function usage

Custom Metrics

When pre-built metrics don’t fit your use case, create custom metrics:

G-Eval

Define evaluation criteria in natural language. Best for subjective qualities like tone, helpfulness, or domain-specific correctness.

Code-Evals

Write Python code directly on Confident AI. Best for deterministic checks, format validation, or complex calculations.
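As an illustration of the kind of deterministic check that suits Code-Evals, the sketch below validates that an output is a JSON object containing certain keys. The function name `json_format_score`, its signature, and the required keys are hypothetical; the interface that Confident AI expects for Code-Evals is not described here, so treat this as the logic only.

```python
import json
from typing import Iterable


def json_format_score(
    actual_output: str,
    required_keys: Iterable[str] = ("answer", "sources"),  # hypothetical keys
) -> float:
    """Return 1.0 if the output is a JSON object with all required keys, else 0.0."""
    try:
        parsed = json.loads(actual_output)
    except (json.JSONDecodeError, TypeError):
        # Not parseable as JSON (or not a string at all).
        return 0.0
    if not isinstance(parsed, dict):
        # Valid JSON, but not an object (for example a list or a number).
        return 0.0
    return 1.0 if all(key in parsed for key in required_keys) else 0.0


valid_score = json_format_score('{"answer": "42", "sources": []}')
missing_key_score = json_format_score('{"answer": "42"}')
not_json_score = json_format_score("The answer is 42.")
```

Because the check involves no LLM call, the same output always produces the same score, which is what makes this style a good fit for format validation.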

Next Steps

Ready to start evaluating? Here’s where to go next:

  • Metric Collections: Learn how to group metrics and configure settings for remote evaluations.
  • Custom Metrics: Create metrics tailored to your specific evaluation needs.