For AI agents: a documentation index is available at the root level at /llms.txt and /llms-full.txt. Append /llms.txt to any URL for a page-level index, or .md for the markdown version of any page.
Trust CenterStatusSupportGet a demoPlatform
DocumentationEvals API ReferenceIntegrations & OTELPlatform SettingsSelf-HostingChangelog
DocumentationEvals API ReferenceIntegrations & OTELPlatform SettingsSelf-HostingChangelog
  • Get Started
    • Introduction
    • Setup and Installation
  • LLM Evaluation
    • Introduction
      • Quickstart
      • Single-Turn Evals
      • Multi-Turn Evals
      • Arena
    • Experiments
  • Metrics
    • Introduction
    • Metric Collections
    • Custom Metrics
  • LLM Tracing
    • Introduction
    • Signals
    • Troubleshooting
  • Human-in-the-Loop
    • Introduction
    • Collect Feedback
  • Reporting & Analytics
    • Dashboards
    • Executive Insights
  • Red Teaming
    • Introduction
    • Quickstart
    • Frameworks & Policies
    • Risk Profiles
    • Red Team Using DeepTeam
  • Resources
    • Why Confident AI
    • Support
    • Data Handling
    • LLM Use Cases
LogoLogo
Trust CenterStatusSupportGet a demoPlatform
On this page
  • Overview
  • Run your first evaluation
  • Generating AI Outputs
  • Next Steps
LLM EvaluationNo-Code Evals

No-Code Evals Quickstart

Run your first evaluation in the platform UI — no code required.

Was this page helpful?
Previous

Single-Turn Evals (No-Code)

Evaluate one-shot interactions like Q&A, summarization, and classification.

Next
Built with

Overview

This quickstart walks you through running your first no-code evaluation on Confident AI. By the end, you’ll have:

  • Created a metric collection to define what you’re evaluating
  • Built a dataset with goldens
  • Run an evaluation and viewed results on the dashboard

A no-code evaluation workflow allows non-technical team members to run an end-to-end iteration of your AI app without leaving Confident AI.

You’ll need a Confident AI account to follow along. Sign up here if you haven’t already.

Run your first evaluation

Run your first evaluation by following this example for a single-turn, QA use case:

1

Create a Metric Collection

A metric collection groups the metrics you want to evaluate together.

Creating a metric collection
  1. Navigate to Metric Collections in the sidebar
  2. Click Create Metric Collection
  3. Give it a name (e.g., “RAG Quality Metrics”)
  4. Select the metrics you want to include:
    • Answer Relevancy — measures if the output addresses the input
    • Faithfulness — measures if the output is grounded in the context
    • Add any other metrics relevant to your use case
  5. Click Save

Start with 2-3 metrics for your first evaluation. You can always add more later.

2

Create a Dataset

Datasets contain the goldens you’ll use to generate AI outputs.

Creating a dataset with goldens
  1. Navigate to Datasets in the sidebar
  2. Click Create Dataset
  3. Give it a name (e.g., “QA Test Cases”)
  4. Add your golden:
    • Input: The user query (e.g., “What is the refund policy?”)
    • Expected Output (optional): The ideal response
    • Actual Output: The AI app’s output to evaluate
  5. Click Save

We’ll cover all the ways you can generate AI outputs in later sections.

For this quickstart, provide a hardcoded actual output (don’t worry, we won’t be doing this later):

FieldExample Value
Input”What is the refund policy?”
Actual Output”You can request a refund within 30 days of purchase by contacting support.”
3

Run the Evaluation

Now let’s evaluate your goldens against your metrics.

  1. Click the Evaluate button on an individual dataset’s page
  2. Select your Metric Collection (e.g., “Agentic Quality Metrics”)
  3. Click Run Evaluation

The evaluation will process each test case and score it against your selected metrics.

4

View Results on Dashboard

Once your run an evaluation, you will be redirected to a test run. Wait for a moment for evaluation to complete, and ✅ done!. You’ve run your first no-code evaluation.

Viewing test run results

In the testing report, you can analyze:

  • Individual test cases — drill down into specific failures to understand what went wrong
  • Score distributions — view average, median, and percentile breakdowns for each metric
  • Pass/fail results — a test case passes only if all its metrics meet their thresholds
  • AI-generated summary — get an automated analysis of patterns and issues across your test run

In later sections, you can find out more on what a test run offers.

Generating AI Outputs

In the quickstart above, we hardcoded the actual output directly in the dataset. This is useful for quick tests, but highly not recommedned. This is because you should aim to test changes made to your AI app, not static outputs that are pre-computed.

Confident AI offers more powerful ways to generate outputs dynamically:

  1. Single prompt generation — define a prompt template in the platform and Confident AI calls your configured LLM provider to generate outputs automatically. Ideal for testing prompt variations or comparing models.

  2. AI Connections — connect directly to your deployed AI system. If it’s reachable via HTTP(s), it’s testable. Customize request payloads, parse custom response structures, and pass headers or auth tokens.

AI connections are powerful because it allows Confident AI to test your AI apps as they are. However, it does require an initial small setup time from engineering.

AI Connections let you test your actual AI system end-to-end, catching integration issues that prompt-only testing misses.

Next Steps

Now that you’ve completed a basic evaluation, learn how to handle different use cases:

Single-Turn Evals

Evaluate one-shot Q&A, summarization, and classification tasks with generated outputs.

Multi-Turn Evals

Evaluate conversational AI where context builds across multiple exchanges.