LLM Metrics on Confident AI
Overview
Metrics are the foundation of LLM evaluation on Confident AI. They define the criteria used to score your LLM outputs — whether you’re testing in development, running experiments, or monitoring production systems.
On Confident AI, “metrics” refers to evaluation metrics that assess the quality of LLM outputs — not operational metrics like latency, cost, or token usage. For tracking operational data, see Latency, Cost, and Error Tracking.
Confident AI provides two categories of metrics:
- Pre-built metrics — Battle-tested metrics for common evaluation scenarios like answer relevancy, faithfulness, hallucination detection, and more
- Custom metrics — Create your own metrics tailored to your specific use case using G-Eval (natural language criteria) or Code-Evals (Python code)
Both categories support single-turn (individual LLM interactions) and multi-turn (conversational) evaluations.
How Metrics Work
Metrics on Confident AI follow a simple pattern:
- Define your metrics — Choose from pre-built metrics or create custom ones
- Group into collections — Add metrics to a metric collection with specific settings (threshold, strictness, etc.)
- Run evaluations — Use the collection for test runs, experiments, or production monitoring
- Analyze results — View scores, reasoning, and pass/fail status in the dashboard
Metric collections are required for remote evaluations on Confident AI. For local evaluations using deepeval, you can use metrics directly without collections.
Pre-built Metrics
Confident AI offers a comprehensive library of pre-built metrics powered by LLM-as-a-judge:
Single-Turn
Multi-Turn
Custom Metrics
When pre-built metrics don’t fit your use case, create custom metrics:
Define evaluation criteria in natural language. Best for subjective qualities like tone, helpfulness, or domain-specific correctness.
Write Python code directly on Confident AI. Best for deterministic checks, format validation, or complex calculations.
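A Code-Eval is ordinary Python that returns a score. As a minimal stdlib-only sketch of a deterministic format check (the function name and the 0-to-1 scoring convention are assumptions for illustration, not Confident AI's exact signature):

```python
import json


def evaluate_json_format(actual_output: str) -> float:
    """Deterministic check: return 1.0 if the output is valid JSON
    with a top-level "answer" key, else 0.0."""
    try:
        parsed = json.loads(actual_output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(parsed, dict) and "answer" in parsed else 0.0


print(evaluate_json_format('{"answer": "Paris"}'))  # valid JSON -> 1.0
print(evaluate_json_format("not json"))             # invalid -> 0.0
```

Because the check is pure code, it runs without a judge LLM and always produces the same score for the same output.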
Next Steps
Ready to start evaluating? Here’s where to go next: