LLM Metrics on Confident AI

Overview of metrics in Confident AI

Overview

Metrics are the foundation of LLM evaluation on Confident AI. They define the criteria used to score and assess your LLM outputs — whether you’re testing in development, running experiments, or monitoring production systems.

On Confident AI, “metrics” refers to evaluation metrics that assess the quality of LLM outputs — not operational metrics like latency, cost, or token usage. For tracking operational data, see Latency, Cost, and Error Tracking.

Confident AI provides two categories of metrics:

  • Pre-built metrics — Battle-tested metrics for common evaluation scenarios like answer relevancy, faithfulness, hallucination detection, and more
  • Custom metrics — Create your own metrics tailored to your specific use case using G-Eval (natural language criteria) or Code-Evals (Python code)

Both categories support single-turn (individual LLM interactions) and multi-turn (conversational) evaluations.

How Metrics Work

Metrics on Confident AI follow a simple pattern:

  1. Define your metrics — Choose from pre-built metrics or create custom ones
  2. Group into collections — Add metrics to a metric collection with specific settings (threshold, strictness, etc.)
  3. Run evaluations — Use the collection for test runs, experiments, or production monitoring
  4. Analyze results — View scores, reasoning, and pass/fail status in the dashboard

Metric collections are required for remote evaluations on Confident AI. For local evaluations using deepeval, you can use metrics directly without collections.
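For example, a local evaluation with deepeval might look like the sketch below; the metric choice, test case contents, and threshold are illustrative assumptions, not values prescribed by this page.

```python
# Minimal local evaluation with deepeval (no metric collection needed).
# The input/output strings and the 0.7 threshold are illustrative only.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# 1. Define your metric
relevancy = AnswerRelevancyMetric(threshold=0.7)

# 2. Build a single-turn test case from your LLM app's input and output
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a refund within 30 days of purchase.",
)

# 3. Run the evaluation; scores, reasoning, and pass/fail status appear
#    in the dashboard when you are logged in via `deepeval login`
evaluate(test_cases=[test_case], metrics=[relevancy])
```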

Pre-built Metrics

Confident AI offers a comprehensive library of pre-built metrics powered by LLM-as-a-judge:

  • Answer Relevancy: Measures how relevant the response is to the input query
  • Faithfulness: Checks if the response is grounded in the provided context
  • Hallucination: Detects fabricated or unsupported information
  • Contextual Precision: Evaluates retrieval ranking quality
  • Contextual Recall: Measures retrieval completeness
  • Contextual Relevancy: Assesses relevance of retrieved context
  • Bias: Detects biased content in responses
  • Toxicity: Identifies toxic or harmful content
  • Summarization: Evaluates summary quality and accuracy
  • Task Completion: Checks if the task was successfully completed
  • Tool Correctness: Validates correct tool/function usage
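As a rough sketch of how one of these pre-built metrics is used with deepeval, the snippet below runs Faithfulness against a single test case; the retrieval context, strings, and threshold are made-up values for illustration.

```python
# Standalone use of a pre-built RAG metric; all values below are illustrative.
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How long is the warranty?",
    actual_output="The warranty lasts two years.",
    # Faithfulness checks the output against the retrieved context
    retrieval_context=["All hardware products include a two-year warranty."],
)

metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)

print(metric.score)            # numeric score between 0 and 1
print(metric.reason)           # the LLM judge's explanation
print(metric.is_successful())  # pass/fail against the threshold
```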

Custom Metrics

When pre-built metrics don’t fit your use case, create custom metrics in one of two ways:

  • G-Eval: Define your evaluation criteria in natural language and have an LLM judge score outputs against them
  • Code-Evals: Write your evaluation logic as Python code for programmatic scoring
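For instance, a G-Eval metric defined with deepeval might look like this sketch; the metric name and criteria are hypothetical, and Code-Evals follow a similar pattern with Python logic in place of natural language criteria.

```python
# A custom G-Eval metric defined in natural language; the criteria here
# are a hypothetical example, not a recommended rubric.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent "
             "with the input question and free of contradictions.",
    # Which fields of the test case the LLM judge should consider
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="When was the company founded?",
    actual_output="The company was founded in 2019.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```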

Next Steps

Ready to start evaluating? Here’s where to go next:

  • Browse the pre-built metrics and add the ones you need to a metric collection
  • Create a custom metric with G-Eval or Code-Evals