Introduction
What is the Evals API?
The RESTful Evals API enables organizations to offload evaluations, ingest LLM traces, and manage datasets, prompt versions, and more on Confident AI. It allows you to:
- Run metrics remotely on Confident AI, without having to manage the infrastructure overhead
- Keep a centralized admin dashboard for all ingested evals, traces, datasets, prompts, and more
- Manage user annotations and manipulate LLM traces
- And most importantly, build your custom LLMOps pipeline
All evaluations run using the Evals API are powered by DeepEval, the open-source LLM evaluation framework.
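For example, offloading an evaluation can be as simple as a single HTTP call. The sketch below is illustrative only: the endpoint path, header name, and payload fields are placeholder assumptions rather than the documented API surface, so refer to the endpoint reference for the exact schema.

```python
import os
import requests

# Placeholder base URL, header, and endpoint -- check the endpoint
# reference for the real paths and payload schema.
BASE_URL = "https://api.confident-ai.com"
API_KEY = os.environ["CONFIDENT_API_KEY"]

# Ask Confident AI to evaluate a single test case against a metric
# collection; the metrics themselves run remotely on Confident AI.
response = requests.post(
    f"{BASE_URL}/v1/evaluate",  # hypothetical endpoint
    headers={"CONFIDENT_API_KEY": API_KEY},  # hypothetical header name
    json={
        "metricCollection": "My Metrics",  # hypothetical field names
        "testCases": [
            {
                "input": "What is your refund policy?",
                "actualOutput": "You can request a refund within 30 days.",
            }
        ],
    },
)
response.raise_for_status()
print(response.json())  # scores and reasons computed remotely
```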
Key Capabilities
The Evals API offers the same functionality as the platform UI, with more low-level control than clicking around in the dashboard:
- Comprehensive single-turn, multi-turn LLM testing
- Experiment with different versions of prompts and models
- Detect unexpected breaking changes through evals
- LLM tracing to debug and monitor in production
- Track product analytics and user stats
- Include human-in-the-loop review to surface what needs improvement
Get Started
Start building your own LLMOps pipeline with the Evals API.
Run your first remote LLM evaluation.
Learn how authentication works in the Evals API (sketched briefly after this list).
Understand core data models and how they connect.
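As a quick preview of the authentication step above: every request to the Evals API carries your Confident AI API key. The header name and path below are placeholders for illustration; the authentication page documents the exact scheme.

```python
import os
import requests

API_KEY = os.environ["CONFIDENT_API_KEY"]  # project API key from Confident AI

# Hypothetical endpoint; the point is simply that the API key travels
# as a request header on every call.
response = requests.get(
    "https://api.confident-ai.com/v1/datasets",
    headers={"CONFIDENT_API_KEY": API_KEY},
)
response.raise_for_status()
print(response.json())
```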
Main Endpoints
Access a full suite of endpoints to manage evaluations, datasets, prompts, traces, and more. A usage sketch follows the list below.
- Define custom metrics tailored to your use cases
- Create and update metrics in batches to fit your specific needs
- Create and manage collections of metrics to run evals on test cases, traces, spans, and threads
- Update metric collections to match your use case
- Store and manage golden datasets for consistent testing
- Pull datasets to be used for evaluation, for both single and multi-turn use cases
- Create test runs on lists of test cases
- Get detailed scoring and feedback on model performance
- Track and analyze your AI’s execution workflow
- Get full visibility into LLM calls and component interactions
- Manage and version prompt templates programmatically
- Track prompt performance and iterate on improvements
- Add human feedback and annotations to evaluation results
- Create feedback loops for continuous model improvement
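To make the endpoint groups above concrete, here is a rough sketch of how they might be stitched into a custom LLMOps pipeline: pull goldens from a dataset, generate outputs with your own application, create a test run, and attach a human annotation. Every path and field name is an illustrative assumption; see the individual endpoint pages for the real request and response schemas.

```python
import os
import requests

BASE_URL = "https://api.confident-ai.com"  # placeholder base URL
HEADERS = {"CONFIDENT_API_KEY": os.environ["CONFIDENT_API_KEY"]}  # placeholder header


def generate_output(prompt: str) -> str:
    """Your own LLM application goes here; stubbed for illustration."""
    return "stub output from your LLM app"


# 1. Pull goldens from a dataset (hypothetical endpoint and fields).
goldens = requests.get(
    f"{BASE_URL}/v1/datasets/my-dataset/goldens", headers=HEADERS
).json()

# 2. Build test cases by running your application over the goldens.
test_cases = [
    {"input": g["input"], "actualOutput": generate_output(g["input"])}
    for g in goldens
]

# 3. Create a test run that evaluates the test cases against a metric
#    collection remotely on Confident AI.
test_run = requests.post(
    f"{BASE_URL}/v1/test-runs",
    headers=HEADERS,
    json={"metricCollection": "My Metrics", "testCases": test_cases},
).json()

# 4. Attach a human annotation for human-in-the-loop review.
requests.post(
    f"{BASE_URL}/v1/annotations",
    headers=HEADERS,
    json={"testRunId": test_run["id"], "rating": 5, "comment": "Looks good"},
).raise_for_status()
```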
FAQs
How is the Evals API different from DeepEval?
The Evals API provides more low-level control than the DeepEval client and offers benefits that DeepEval alone doesn't:
Managed Infrastructure: Serverless evaluations on our managed servers, error handling for metric failures and retries, cost management and billing optimization, automatic scaling based on evaluation volume.
Platform Dashboard: Visual results for each customer dataset, historical tracking and trends, team collaboration features, custom analytics dashboards.
How is the Evals API different from using the platform?
The Evals API and platform serve different use cases in your LLM application development workflow:
Platform (Dashboard): Use when your engineering teams need to improve an LLM application. It provides visual test case creation, interactive evaluation results, team collaboration features, and built-in dashboards.
Evals API: Use when building an LLM application that needs to automate evaluations for different customers, run evaluations programmatically, build custom dashboards, integrate into existing workflows, or scale across multiple customer environments.
Both approaches use the same underlying evaluation engine, so you can start with the platform for development and use the API for production automation.
Who is this for?
- Organizations that need to scale evaluations across multiple customers or environments while maintaining visibility into results.
- Users who aren’t working with Python or TypeScript. If you are working with either Python or TypeScript, using DeepEval as your client library is highly recommended.