LLM Metrics on Confident AI

Overview of metrics in Confident AI

Overview

Metrics are the foundation of LLM evaluation on Confident AI. They define the criteria used to score and assess your LLM outputs — whether you’re testing in development, running experiments, or monitoring production systems.

On Confident AI, “metrics” refers to evaluation metrics that assess the quality of LLM outputs — not operational metrics like latency, cost, or token usage. For tracking operational data, see Latency, Cost, and Error Tracking.

Confident AI provides two categories of metrics:

  • Pre-built metrics — Battle-tested metrics for common evaluation scenarios like answer relevancy, faithfulness, hallucination detection, and more
  • Custom metrics — Create your own metrics tailored to your specific use case using G-Eval (natural language criteria) or Code-Evals (Python code)

Both categories support single-turn (individual LLM interactions) and multi-turn (conversational) evaluations.

How Metrics Work

Metrics on Confident AI follow a simple pattern:

  1. Define your metrics — Choose from pre-built metrics or create custom ones
  2. Group into collections — Add metrics to a metric collection with specific settings (threshold, strictness, etc.)
  3. Run evaluations — Use the collection for test runs, experiments, or production monitoring
  4. Analyze results — View scores, reasoning, and pass/fail status in the dashboard

Metric collections are required for remote evaluations on Confident AI. For local evaluations using deepeval, you can use metrics directly without collections.
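For example, a local evaluation with deepeval might look like the sketch below; the metric choice, test case contents, and threshold are illustrative assumptions, not values prescribed by this page.

```python
# Minimal local evaluation with deepeval (no metric collection needed).
# The input/output strings and the 0.7 threshold are illustrative only.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# 1. Define your metric
relevancy = AnswerRelevancyMetric(threshold=0.7)

# 2. Build a single-turn test case from your LLM app's input and output
test_case = LLMTestCase(
    input="What is your refund policy?",
    actual_output="You can request a refund within 30 days of purchase.",
)

# 3. Run the evaluation; scores, reasoning, and pass/fail status appear
#    in the dashboard when you are logged in via `deepeval login`
evaluate(test_cases=[test_case], metrics=[relevancy])
```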

Pre-built Metrics

Confident AI offers a comprehensive library of pre-built metrics powered by LLM-as-a-judge:

  • Answer Relevancy: Measures how relevant the response is to the input query
  • Faithfulness: Checks if the response is grounded in the provided context
  • Hallucination: Detects fabricated or unsupported information
  • Contextual Precision: Evaluates retrieval ranking quality
  • Contextual Recall: Measures retrieval completeness
  • Contextual Relevancy: Assesses relevance of retrieved context
  • Bias: Detects biased content in responses
  • Toxicity: Identifies toxic or harmful content
  • Summarization: Evaluates summary quality and accuracy
  • Task Completion: Checks if the task was successfully completed
  • Tool Correctness: Validates correct tool/function usage
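As a rough sketch of how one of these pre-built metrics is used with deepeval, the snippet below runs Faithfulness against a single test case; the retrieval context, strings, and threshold are made-up values for illustration.

```python
# Standalone use of a pre-built RAG metric; all values below are illustrative.
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="How long is the warranty?",
    actual_output="The warranty lasts two years.",
    # Faithfulness checks the output against the retrieved context
    retrieval_context=["All hardware products include a two-year warranty."],
)

metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)

print(metric.score)            # numeric score between 0 and 1
print(metric.reason)           # the LLM judge's explanation
print(metric.is_successful())  # pass/fail against the threshold
```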

Custom Metrics

When pre-built metrics don’t fit your use case, create custom metrics in one of two ways:

  • G-Eval: Define your evaluation criteria in natural language and have an LLM judge score outputs against them
  • Code-Evals: Write your evaluation logic as Python code for programmatic scoring
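For instance, a G-Eval metric defined with deepeval might look like this sketch; the metric name and criteria are hypothetical, and Code-Evals follow a similar pattern with Python logic in place of natural language criteria.

```python
# A custom G-Eval metric defined in natural language; the criteria here
# are a hypothetical example, not a recommended rubric.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually consistent "
             "with the input question and free of contradictions.",
    # Which fields of the test case the LLM judge should consider
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="When was the company founded?",
    actual_output="The company was founded in 2019.",
)

correctness.measure(test_case)
print(correctness.score, correctness.reason)
```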

Next Steps

Ready to start evaluating? Here’s where to go next:

  • Browse the pre-built metrics and add the ones you need to a metric collection
  • Create a custom metric with G-Eval or Code-Evals