
Introduction to LLM Evaluation

Run LLM evals with or without code — choose the workflow that fits your team.

Overview

LLM evaluation on Confident AI means benchmarking your LLM application against datasets before deployment. It can be done in two ways:

  • No-code, directly in the platform UI. Best for QAs, PMs, and SMEs.
  • Code-driven, using the deepeval (or deepeval.ts) framework. Best for engineers and QAs.

Both approaches give you access to the same comprehensive evaluation metrics and insights — the difference is in how you run them.

This page covers pre-deployment evals only. Online evals for production monitoring on observability data are documented separately.

What you can evaluate

Both code-driven and no-code workflows let you evaluate all three use cases:

Single-Turn

One input → one output interactions like Q&A, summarization, or classification tasks.

Multi-Turn

Conversational interactions where context builds across multiple exchanges.

Agentic Workflows

Complex systems with tool calls, reasoning chains, and multi-step execution.
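
To make the three use cases concrete, here is a minimal sketch of how the data for each might be shaped. The class and field names (SingleTurnCase, MultiTurnCase, AgenticCase, tools_called, and so on) are hypothetical and chosen for illustration; they are not the deepeval or platform schema.

```python
from dataclasses import dataclass, field


@dataclass
class SingleTurnCase:
    """One input and one output, e.g. a Q&A pair (hypothetical shape)."""
    input: str
    actual_output: str


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class MultiTurnCase:
    """A conversation whose context builds across exchanges."""
    turns: list[Turn] = field(default_factory=list)


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgenticCase:
    """One request plus the tool calls the agent made while answering it."""
    input: str
    actual_output: str
    tools_called: list[ToolCall] = field(default_factory=list)


single = SingleTurnCase(input="What is 2 + 2?", actual_output="4")

multi = MultiTurnCase(turns=[
    Turn("user", "Book me a flight."),
    Turn("assistant", "Where to?"),
    Turn("user", "Lisbon."),
])

agentic = AgenticCase(
    input="What's the weather in Lisbon?",
    actual_output="Sunny, 24 C.",
    tools_called=[ToolCall("get_weather", {"city": "Lisbon"})],
)
```

The point of the sketch is only that each use case carries different evidence: a single pair, an ordered list of turns, or an output together with the steps taken to produce it.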

Choose your workflow

Run evals entirely in the platform UI without writing any code, or run them programmatically with deepeval:

No-Code Evals
  • Run experiments on single and multi-prompt AI apps
  • Compare prompts and models in Arena

Suitable for: PMs, QA teams, rapid prototyping

Code-Driven Evals
  • Automated regression testing in CI/CD
  • Full control over output generation
  • Version-controlled eval logic

Suitable for: Engineers, automated testing

Not sure which to pick?

Most teams use both approaches. Start with no-code to explore and experiment, then move to code-driven for automated regression testing in CI/CD. The results from both workflows appear in the same dashboards.

Key Capabilities

Dataset Management

Create, organize, and version datasets of test cases to systematically benchmark your LLM applications

Experimentation

Run experiments to compare prompts, models, and parameters with detailed analysis and insights

A|B Regression Testing

Catch regressions on different versions of your AI app with side-by-side test case comparisons
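
As a rough illustration of the idea behind side-by-side regression checks, the sketch below compares per-test-case scores from two versions of an app and reports the cases that got worse. The function name find_regressions, the score dictionaries, and the tolerance parameter are all assumptions made for this example; the platform performs this comparison for you in its UI.

```python
def find_regressions(baseline, candidate, tolerance=0.0):
    """Return {case_id: (baseline_score, candidate_score)} for cases that got worse.

    A case counts as a regression when the candidate score falls more than
    `tolerance` below the baseline score. Cases missing from the candidate
    run are skipped.
    """
    regressions = {}
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            continue
        if new_score < old_score - tolerance:
            regressions[case_id] = (old_score, new_score)
    return regressions


# Hypothetical metric scores (0 to 1) for the same test cases on two app versions.
baseline_scores = {"refund-policy": 0.9, "shipping-time": 0.8, "greeting": 1.0}
candidate_scores = {"refund-policy": 0.6, "shipping-time": 0.85, "greeting": 1.0}

regressions = find_regressions(baseline_scores, candidate_scores, tolerance=0.05)
print(regressions)  # {'refund-policy': (0.9, 0.6)}
```

Keying both runs by the same test case identifier is what makes a side-by-side comparison possible; a small tolerance avoids flagging noise from non-deterministic scoring.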

Unit-Testing in CI/CD

Integrate native pytest evaluations into your deployment CI/CD pipelines
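
The shape of an eval written as a unit test can be sketched in plain Python. This example deliberately avoids the deepeval API: keyword_coverage is a toy stand-in for a real metric, and fake_llm_app and GOLDENS are hypothetical placeholders for your application and dataset. In practice the scoring would come from deepeval's metrics rather than keyword matching.

```python
import re


def keyword_coverage(actual_output, expected_keywords):
    """Toy metric: fraction of expected keywords that appear in the output."""
    if not expected_keywords:
        return 1.0
    words = set(re.findall(r"[a-z0-9]+", actual_output.lower()))
    hits = sum(1 for kw in expected_keywords if kw.lower() in words)
    return hits / len(expected_keywords)


def fake_llm_app(prompt):
    """Placeholder for the application under test (hypothetical)."""
    return "You can request a refund within 30 days of purchase."


# Hypothetical goldens: an input paired with keywords a good answer should contain.
GOLDENS = [
    ("What is the refund policy?", ["refund", "30", "days"]),
]


def test_answers_cover_expected_keywords():
    """pytest collects this function by its `test_` prefix."""
    for prompt, keywords in GOLDENS:
        output = fake_llm_app(prompt)
        assert keyword_coverage(output, keywords) >= 0.8
```

Saved in a file named with a test_ prefix, a function like this is picked up by pytest, so a failing score fails the CI job the same way a failing unit test would.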

Learn the fundamentals

New to LLM evaluation? These concepts will help you get the most out of your evals:

  • Single vs Multi-Turn Evals — understand when to use each approach
  • Test Cases, Goldens, and Datasets — the building blocks of evaluation
  • LLM-as-a-Judge Metrics — how automated scoring works