Introduction

Welcome to Confident AI's Evals API reference.

What is Evals API?

The RESTful Evals API lets organizations offload evaluations, ingest LLM traces, and manage datasets, prompt versions, and more on Confident AI. It allows you to:

  • Run metrics remotely on Confident AI, without having to manage the infrastructure overhead
  • Keep a centralized admin dashboard for all ingested evals, traces, datasets, prompts, and so on
  • Manage user annotations and manipulate LLM traces
  • Most importantly, build your own custom LLMOps pipeline

All evaluations run through the Evals API are powered by DeepEval, the open-source LLM evaluation framework.

Evaluation metrics via the Evals API are 100% powered by ⭐ DeepEval 💯

DeepEval is one of the most widely adopted LLM evaluation frameworks in the world, with over 10k stars and 20 million daily evaluations.

[Figure: DeepEval star growth (star history chart)]

Key Capabilities

The Evals API offers the same functionality as the platform UI, with more low-level control than clicking through the interface:

  • Comprehensive single-turn and multi-turn LLM testing
  • Experiment with different versions of prompts and models
  • Detect unexpected breaking changes through evals
  • LLM tracing to debug and monitor in production
  • Track product analytics and user stats
  • Include human-in-the-loop review to identify what needs to be worked on

Get Started

Start building your own LLMOps pipeline with the Evals API.

5 Min Quickstart

Run your first remote LLM evaluation.

Authentication

Learn how authentication works in the Evals API.

Data Models

Understand core data models and how they connect.

API Conventions

Understand conventions such as response formats and status codes.

Main Endpoints

Access a full suite of endpoints to manage evaluations, datasets, prompts, traces, and more.

Metrics
  • Define custom metrics tailored to your use cases
  • Create and update metrics in batches to fit your specific needs
Metric Collections
  • Create and manage collections of metrics to run evals on test cases, traces, spans, and threads
  • Update metric collections to match your use case
Datasets
  • Store and manage golden datasets for consistent testing
  • Pull datasets to be used for evaluation, for both single and multi-turn use cases
Evaluation
  • Create test runs on a list of test cases
  • Get detailed scoring and feedback on model performance
Tracing
  • Track and analyze your AI’s execution workflow
  • Get full visibility into LLM calls and component interactions
Prompt
  • Manage and version prompt templates programmatically
  • Track prompt performance and iterate on improvements
Annotations
  • Add human feedback and annotations to evaluation results
  • Create feedback loops for continuous model improvement
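
As a rough illustration of what calling one of these endpoints could look like, the sketch below builds (but does not send) an HTTP request for a remote evaluation. Everything specific in it is an assumption made for illustration and is not confirmed by this page: the placeholder base URL, the `/v1/evaluate` path, the `CONFIDENT_API_KEY` header name, and the `metricCollection`, `llmTestCases`, `input`, and `actualOutput` field names. Check the Quickstart and Authentication pages for the actual values before using it.

```python
import json
import urllib.request

# All of the values below are assumptions for illustration only.
BASE_URL = "https://api.example.com"   # placeholder host, not the real API host
EVALUATE_PATH = "/v1/evaluate"         # assumed path for creating a test run
API_KEY_HEADER = "CONFIDENT_API_KEY"   # assumed name of the auth header


def build_evaluation_request(api_key, metric_collection, test_cases, base_url=BASE_URL):
    """Build a POST request that would ask the API to evaluate a list of test cases.

    The request is only constructed here; nothing is sent over the network.
    """
    payload = {
        "metricCollection": metric_collection,  # assumed field: name of a metric collection
        "llmTestCases": test_cases,             # assumed field: list of test case objects
    }
    return urllib.request.Request(
        url=base_url + EVALUATE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", API_KEY_HEADER: api_key},
    )


# Example: one single-turn test case with hypothetical content.
example_request = build_evaluation_request(
    api_key="<PROJECT-API-KEY>",
    metric_collection="My Collection",
    test_cases=[{"input": "What is the refund policy?", "actualOutput": "Refunds within 30 days."}],
)
print(example_request.get_method(), example_request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` once the real URL, header, and payload shape have been confirmed. Building the request separately from sending it keeps the payload easy to inspect and test.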

FAQs

How is the Evals API different from DeepEval?

The Evals API provides more low-level control than the DeepEval client and provides benefits that DeepEval alone doesn’t offer:

Managed Infrastructure: Serverless evaluations on our managed servers, error handling for metric failures and retries, cost management and billing optimization, automatic scaling based on evaluation volume.

Platform Dashboard: Visual results for each customer dataset, historical tracking and trends, team collaboration features, custom analytics dashboards.

How is the Evals API different from using the platform?

The Evals API and platform serve different use cases in your LLM application development workflow:

Platform (Dashboard): Use when your engineering teams need to improve an LLM application. It provides visual test case creation, interactive evaluation results, team collaboration features, and built-in dashboards.

Evals API: Use when building an LLM application that needs to automate evaluations for different customers, run evaluations programmatically, build custom dashboards, integrate into existing workflows, or scale across multiple customer environments.

Both approaches use the same underlying evaluation engine, so you can start with the platform for development and use the API for production automation.
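
To make the multi-customer automation case more concrete, here is a minimal sketch of how logged LLM interactions might be grouped into one evaluation payload per customer before being submitted. The record keys (`customer_id`, `input`, `actual_output`) and the payload field names (`metricCollection`, `llmTestCases`, `actualOutput`) are hypothetical and chosen for illustration; the real request schema should be taken from the Data Models and Evaluation pages.

```python
from collections import defaultdict


def group_test_cases_by_customer(records):
    """Group flat interaction records into {customer_id: [test_case, ...]}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["customer_id"]].append(
            # Hypothetical test case shape; field names are assumptions.
            {"input": record["input"], "actualOutput": record["actual_output"]}
        )
    return dict(grouped)


def build_payloads_per_customer(records, metric_collection):
    """Return {customer_id: payload}, one evaluation payload for each customer."""
    return {
        customer_id: {"metricCollection": metric_collection, "llmTestCases": cases}
        for customer_id, cases in group_test_cases_by_customer(records).items()
    }


# Example records, as an application might log them (made-up data).
records = [
    {"customer_id": "acme", "input": "Hi", "actual_output": "Hello!"},
    {"customer_id": "globex", "input": "Reset my password", "actual_output": "Use the reset link."},
    {"customer_id": "acme", "input": "Pricing?", "actual_output": "See the pricing page."},
]
payloads = build_payloads_per_customer(records, metric_collection="Support Quality")
for customer_id, payload in payloads.items():
    print(customer_id, len(payload["llmTestCases"]))
```

Each payload could then be submitted as its own test run, which keeps results separable per customer.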

Who is this for?
  1. Organizations that need to scale evaluations across multiple customers or environments while maintaining visibility into results.
  2. Users who aren’t working with Python or TypeScript. If you are working with either Python or TypeScript, using DeepEval as your client library is highly recommended.