Metric Collection

Metric collections let you group metric runs together on Confident AI.

Overview

A metric collection on Confident AI is a collection of metrics and their respective settings. It is what allows you to run evaluations remotely. This can be for either:

  • Evals in development, through the Evals APIs
  • Evals for LLM tracing, through the means of online or offline evals

Metric collections are strictly used for remote evals. Each is identified by a unique name and requires no code to manage.

Both single and multi-turn metric collections are supported.

Local Evals
  • Run evaluations locally using deepeval with full control over metrics
  • Support for custom metrics, DAG, and advanced evaluation algorithms

Suitable for: Python users, development, and pre-deployment workflows
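To make the contrast concrete, here is a minimal sketch of what a local metric looks like conceptually. It uses only the standard library; the class and field names (`measure`, `is_successful`, `threshold`, `score`) mirror the general shape of a deepeval-style metric but are illustrative, not the actual library API.

```python
from dataclasses import dataclass

# Illustrative stand-in for an LLM test case: in a real local eval this
# would carry the model input, actual output, expected output, etc.
@dataclass
class TestCase:
    input: str
    actual_output: str
    expected_output: str

class ExactMatchMetric:
    """Toy local metric: scores 1.0 when the actual output exactly
    matches the expected output, 0.0 otherwise."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score = None

    def measure(self, test_case: TestCase) -> float:
        self.score = 1.0 if test_case.actual_output == test_case.expected_output else 0.0
        return self.score

    def is_successful(self) -> bool:
        return self.score is not None and self.score >= self.threshold

metric = ExactMatchMetric(threshold=0.5)
metric.measure(TestCase("What is 2+2?", "4", "4"))
print(metric.is_successful())  # True
```

Running locally gives you full control over this logic; with remote evals, the equivalent configuration lives in a metric collection on the platform instead.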

Remote Evals
  • Run evaluations on Confident AI platform with pre-built metrics
  • Integrated with monitoring, datasets, and team collaboration features

Suitable for: non-Python users, and online + offline evals for tracing in production

Create a Metric Collection

You can create a single or multi-turn metric collection under Project > Metrics > Collections. All you need to do is provide it with a unique name, select the appropriate metrics, and edit their settings (if required).

Metric Collection for Remote Evals

You can use metric collections for any remote evals in Confident AI.

Understanding Metric Collections

Metric collections and metrics are connected indirectly via metric settings, which specify each metric's threshold, strictness, and other options for a given collection.

Metric Collection: A group of metrics that you wish to evaluate together (either for a test run or online evaluation).

Metric Settings: Configuration options for how a metric within a metric collection should be evaluated, including the threshold, strictness, and whether to include reasoning.

When you run remote evals by providing a metric collection name, Confident AI fetches the metrics and their settings for that collection, then uses these configurations to run the evals.
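Conceptually, this lookup-then-evaluate flow can be sketched as below. All names, fields, and the in-memory registry are illustrative assumptions for this sketch, not the actual Confident AI API or storage model.

```python
from dataclasses import dataclass, field

@dataclass
class MetricSettings:
    # Per-collection configuration for one metric (fields illustrative).
    threshold: float = 0.5
    strict: bool = False
    include_reason: bool = True

@dataclass
class MetricCollection:
    # A named group of metrics, each paired with its own settings.
    name: str
    metrics: dict = field(default_factory=dict)

# Pretend registry standing in for collections stored on the platform.
COLLECTIONS = {
    "production-chatbot": MetricCollection(
        name="production-chatbot",
        metrics={
            "Answer Relevancy": MetricSettings(threshold=0.7),
            "Faithfulness": MetricSettings(threshold=0.8, strict=True),
        },
    )
}

def resolve_collection(name: str) -> MetricCollection:
    """Step 1 of a remote eval: look up the collection by its unique
    name to retrieve its metrics and their settings."""
    if name not in COLLECTIONS:
        raise KeyError(f"No metric collection named {name!r}")
    return COLLECTIONS[name]

collection = resolve_collection("production-chatbot")
for metric_name, settings in collection.metrics.items():
    # Step 2: each metric is evaluated with its own per-collection settings.
    print(metric_name, settings.threshold, settings.strict)
```

The key design point the sketch illustrates: settings belong to the collection, not to the metric itself, so the same metric can run with different thresholds or strictness in different collections.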