Introduction
What is the Evals API?
The RESTful Evals API enables organizations to offload evaluations, ingest LLM traces, and manage datasets, prompt versions, and more on Confident AI. It allows you to:
- Run metrics remotely on Confident AI, without having to manage the infrastructure overhead
- Keep a centralized admin dashboard for all ingested evals, traces, datasets, prompts, and more
- Manage user annotations and manipulate LLM traces
- And most importantly, build your custom LLMOps pipeline
All evaluations run using the Evals API are powered by DeepEval, the open-source LLM evaluation framework.
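For example, offloading an evaluation can be as simple as a single HTTP call. The sketch below is illustrative only: the endpoint path, header name, and payload fields are placeholder assumptions rather than the documented API surface, so refer to the endpoint reference for the exact schema.

```python
import os
import requests

# Placeholder base URL, header, and endpoint -- check the endpoint
# reference for the real paths and payload schema.
BASE_URL = "https://api.confident-ai.com"
API_KEY = os.environ["CONFIDENT_API_KEY"]

# Ask Confident AI to evaluate a single test case against a metric
# collection; the metrics themselves run remotely on Confident AI.
response = requests.post(
    f"{BASE_URL}/v1/evaluate",  # hypothetical endpoint
    headers={"CONFIDENT_API_KEY": API_KEY},  # hypothetical header name
    json={
        "metricCollection": "My Metrics",  # hypothetical field names
        "testCases": [
            {
                "input": "What is your refund policy?",
                "actualOutput": "You can request a refund within 30 days.",
            }
        ],
    },
)
response.raise_for_status()
print(response.json())  # scores and reasons computed remotely
```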
Key Capabilities
The Evals API offers the same functionality as the platform UI, with more low-level control than clicking around in the dashboard:
- Comprehensive single-turn, multi-turn LLM testing
- Experiment with different versions of prompts and models
- Detect unexpected breaking changes through evals
- LLM tracing to debug and monitor in production
- Track product analytics and user stats
- Include human-in-the-loop review to surface what needs improvement
Get Started
Start building your own LLMOps pipeline with the Evals API.
Run your first remote LLM evaluation.
Learn how authentication works in the Evals API (sketched briefly after this list).
Understand core data models and how they connect.
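As a quick preview of the authentication step above: every request to the Evals API carries your Confident AI API key. The header name and path below are placeholders for illustration; the authentication page documents the exact scheme.

```python
import os
import requests

API_KEY = os.environ["CONFIDENT_API_KEY"]  # project API key from Confident AI

# Hypothetical endpoint; the point is simply that the API key travels
# as a request header on every call.
response = requests.get(
    "https://api.confident-ai.com/v1/datasets",
    headers={"CONFIDENT_API_KEY": API_KEY},
)
response.raise_for_status()
print(response.json())
```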
Main Endpoints
Access a full suite of endpoints to manage evaluations, datasets, prompts, traces, and more. A usage sketch follows the list below.
- Define custom metrics tailored to your use cases
- Create and update metrics in batches to fit your specific needs
- Create and manage collections of metrics to run evals on test cases, traces, spans, and threads
- Update metric collections to match your use case
- Store and manage golden datasets for consistent testing
- Pull datasets to be used for evaluation, for both single and multi-turn use cases
- Create test runs on lists of test cases
- Get detailed scoring and feedback on model performance
- Track and analyze your AI’s execution workflow
- Get full visibility into LLM calls and component interactions
- Manage and version prompt templates programmatically
- Track prompt performance and iterate on improvements
- Add human feedback and annotations to evaluation results
- Create feedback loops for continuous model improvement
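To make the endpoint groups above concrete, here is a rough sketch of how they might be stitched into a custom LLMOps pipeline: pull goldens from a dataset, generate outputs with your own application, create a test run, and attach a human annotation. Every path and field name is an illustrative assumption; see the individual endpoint pages for the real request and response schemas.

```python
import os
import requests

BASE_URL = "https://api.confident-ai.com"  # placeholder base URL
HEADERS = {"CONFIDENT_API_KEY": os.environ["CONFIDENT_API_KEY"]}  # placeholder header


def generate_output(prompt: str) -> str:
    """Your own LLM application goes here; stubbed for illustration."""
    return "stub output from your LLM app"


# 1. Pull goldens from a dataset (hypothetical endpoint and fields).
goldens = requests.get(
    f"{BASE_URL}/v1/datasets/my-dataset/goldens", headers=HEADERS
).json()

# 2. Build test cases by running your application over the goldens.
test_cases = [
    {"input": g["input"], "actualOutput": generate_output(g["input"])}
    for g in goldens
]

# 3. Create a test run that evaluates the test cases against a metric
#    collection remotely on Confident AI.
test_run = requests.post(
    f"{BASE_URL}/v1/test-runs",
    headers=HEADERS,
    json={"metricCollection": "My Metrics", "testCases": test_cases},
).json()

# 4. Attach a human annotation for human-in-the-loop review.
requests.post(
    f"{BASE_URL}/v1/annotations",
    headers=HEADERS,
    json={"testRunId": test_run["id"], "rating": 5, "comment": "Looks good"},
).raise_for_status()
```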
FAQs
How is the Evals API different from DeepEval?
The Evals API provides more low-level control than the DeepEval client and offers benefits that DeepEval alone doesn't:
Managed Infrastructure: Serverless evaluations on our managed servers, error handling for metric failures and retries, cost management and billing optimization, automatic scaling based on evaluation volume.
Platform Dashboard: Visual results for each customer dataset, historical tracking and trends, team collaboration features, custom analytics dashboards.
How is the Evals API different from using the platform?
The Evals API and platform serve different use cases in your LLM application development workflow:
Platform (Dashboard): Use when your engineering teams need to improve an LLM application. It provides visual test case creation, interactive evaluation results, team collaboration features, and built-in dashboards.
Evals API: Use when building an LLM application that needs to automate evaluations for different customers, run evaluations programmatically, build custom dashboards, integrate into existing workflows, or scale across multiple customer environments.
Both approaches use the same underlying evaluation engine, so you can start with the platform for development and use the API for production automation.
Who is this for?
- Organizations that need to scale evaluations across multiple customers or environments while maintaining visibility into results.
- Users who aren’t working with Python or TypeScript. If you are working with either Python or TypeScript, using DeepEval as your client library is highly recommended.