Introduction to LLM Evaluation

Run LLM evals with or without code — choose the workflow that fits your team.

Overview

LLM evaluation on Confident AI refers to benchmarking your LLM application against datasets in a pre-deployment setting. It can be done in two ways:

  • No-code: run evals directly in the platform UI, best for QAs, PMs, and SMEs, or
  • Code-driven: use the deepeval (or deepeval.ts) framework, best for engineers and QAs.

Both approaches give you access to the same comprehensive evaluation metrics and insights — the difference is in how you run them.

For those looking to use online evals for production monitoring on observability data, click here.

What you can evaluate

Both code-driven and no-code workflows allow you to evaluate all three use cases, with a minimal code sketch after the list:

Single-Turn

One input → one output interactions like Q&A, summarization, or classification tasks.

Multi-Turn

Conversational interactions where context builds across multiple exchanges.

Agentic Workflows

Complex systems with tool calls, reasoning chains, and multi-step execution.
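
In the code-driven workflow, each of these use cases maps to a test case type in deepeval. Here's a minimal single-turn sketch; the strings below are placeholders, and which optional fields you populate depends on the metrics you run:

```python
from deepeval.test_case import LLMTestCase

# A single-turn test case: one input, the output your app produced,
# plus optional fields (expected output, retrieval context) that
# some metrics require.
test_case = LLMTestCase(
    input="What are your refund terms?",
    actual_output="You can get a full refund within 30 days of purchase.",
    expected_output="Refunds are available within 30 days.",
    retrieval_context=["Refund policy: full refund within 30 days of purchase."],
)
```

Multi-turn and agentic evals follow the same pattern with their own test case types (for example, deepeval's ConversationalTestCase for conversations).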

Choose your workflow

Run evals entirely in the platform UI without writing any code, or use deepeval programmatically.

Not sure which to pick?

Most teams use both approaches. Start with no-code to explore and experiment, then move to code-driven for automated regression testing in CI/CD. The results from both workflows appear in the same dashboards.

Key Capabilities

Dataset Management

Create, organize, and version datasets of test cases to systematically benchmark your LLM applications
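
On the code-driven side, a sketch of what pulling such a dataset looks like with deepeval (the dataset alias below is hypothetical):

```python
from deepeval.dataset import EvaluationDataset

# Pull a dataset of goldens that lives on Confident AI by its alias
# ("My Benchmark Dataset" is a placeholder for your own dataset's name).
dataset = EvaluationDataset()
dataset.pull(alias="My Benchmark Dataset")

# Goldens hold the inputs (and optionally expected outputs) you run
# your LLM application on to produce test cases for evaluation.
for golden in dataset.goldens:
    print(golden.input)
```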

Experimentation

Run experiments to compare prompts, models, and parameters with detailed analysis and insights
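
In code, an experiment is typically a call to deepeval's evaluate() over your test cases; the hyperparameters argument is assumed here as a way to tag the run with what you're comparing, and the model name, prompt version, and threshold are illustrative:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Summarize our refund policy in one sentence.",
    actual_output="Full refunds are available within 30 days of purchase.",
)

# Score the test case(s) with the chosen metric(s); the hyperparameters
# dict (illustrative keys and values) records what this run was comparing.
evaluate(
    test_cases=[test_case],
    metrics=[AnswerRelevancyMetric(threshold=0.7)],
    hyperparameters={"model": "gpt-4o", "prompt version": "v2"},
)
```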

A/B Regression Testing

Catch regressions on different versions of your AI app with side-by-side test case comparisons

Unit-Testing in CI/CD

Integrate native pytest evaluations into your deployment CI/CD pipelines
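
A minimal sketch of what this looks like with deepeval's pytest integration (the test case contents and metric threshold are placeholders):

```python
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_password_reset_answer():
    # Placeholder values; in CI you would call your LLM application
    # to generate actual_output for each golden in your dataset.
    test_case = LLMTestCase(
        input="How do I reset my password?",
        actual_output="Go to Settings > Security and choose 'Reset password'.",
    )
    # Fails the pytest test if the metric score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

In CI/CD this is usually kicked off with deepeval's test runner (e.g. `deepeval test run test_llm_app.py`), which can also report results back to the platform.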

Learn the fundamentals

New to LLM evaluation? These concepts will help you get the most out of your evals: