Metrics are the foundation of LLM evaluation on Confident AI. They define the criteria used to score and assess your LLM outputs — whether you’re testing in development, running experiments, or monitoring production systems.
On Confident AI, “metrics” refers to evaluation metrics that assess the quality of LLM outputs — not operational metrics like latency, cost, or token usage. For tracking operational data, see Latency, Cost, and Error Tracking.
Confident AI provides two categories of metrics: pre-built metrics and custom metrics.
Both categories support single-turn (individual LLM interactions) and multi-turn (conversational) evaluations.
Metrics on Confident AI follow a simple pattern:
Metric collections are required for remote evaluations on Confident AI. For local evaluations using deepeval, you can use metrics directly without collections.
Confident AI offers a library of pre-built metrics powered by LLM-as-a-judge.
When pre-built metrics don’t fit your use case, create custom metrics:
Define evaluation criteria in natural language. Best for subjective qualities like tone, helpfulness, or domain-specific correctness.
Write Python code directly on Confident AI. Best for deterministic checks, format validation, or complex calculations.
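To make the "deterministic checks" case concrete, here is a minimal sketch of the kind of logic a code-based metric might contain: a format validation that scores an output by whether it is a JSON object with the expected keys. The names `CheckResult` and `json_format_check`, the required keys, and the threshold are all hypothetical and chosen for illustration. This is plain Python, not Confident AI's actual interface for code-based metrics, so adapt it to whatever signature the platform expects.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical result container; not a Confident AI type."""
    score: float   # fraction of required keys found, between 0.0 and 1.0
    passed: bool   # True when score meets the threshold
    reason: str    # human-readable explanation of the score


def json_format_check(
    actual_output: str,
    required_keys: tuple[str, ...] = ("answer", "sources"),  # assumed schema
    threshold: float = 1.0,
) -> CheckResult:
    """Deterministically score an LLM output on JSON format and required keys."""
    try:
        payload = json.loads(actual_output)
    except json.JSONDecodeError as exc:
        return CheckResult(0.0, False, f"not valid JSON: {exc.msg}")

    if not isinstance(payload, dict):
        return CheckResult(0.0, False, "top-level JSON value is not an object")

    missing = [key for key in required_keys if key not in payload]
    # With no required keys, any JSON object counts as fully valid.
    if required_keys:
        score = (len(required_keys) - len(missing)) / len(required_keys)
    else:
        score = 1.0

    reason = (
        "all required keys present"
        if not missing
        else f"missing keys: {', '.join(missing)}"
    )
    return CheckResult(score, score >= threshold, reason)


if __name__ == "__main__":
    print(json_format_check('{"answer": "42", "sources": []}'))
    print(json_format_check('{"answer": "42"}'))
    print(json_format_check("plain text, not JSON"))
```

Because the check involves no LLM judge, the same output always receives the same score, which is what makes this style a good fit for format validation. Subjective qualities such as tone or helpfulness are better expressed as natural-language criteria.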
Ready to start evaluating? Here’s where to go next: