No-Code Evals Quickstart | Confident AI Docs

Overview

This quickstart walks you through running your first no-code evaluation on Confident AI. By the end, you’ll have:

Created a metric collection to define what you’re evaluating
Built a dataset with goldens
Run an evaluation and viewed results on the dashboard

A no-code evaluation workflow allows non-technical team members to run an end-to-end iteration of your AI app without leaving Confident AI.

You’ll need a Confident AI account to follow along. Sign up here if you haven’t already.

Run your first evaluation

Run your first evaluation by following this example for a single-turn, QA use case:

Create a Metric Collection

A metric collection groups the metrics you want to evaluate together.

Creating a metric collection

Navigate to Metric Collections in the sidebar
Click Create Metric Collection
Give it a name (e.g., “RAG Quality Metrics”)
Select the metrics you want to include:
- Answer Relevancy — measures if the output addresses the input
- Faithfulness — measures if the output is grounded in the context
- Add any other metrics relevant to your use case
Click Save

Start with 2-3 metrics for your first evaluation. You can always add more later.

Create a Dataset

Datasets contain the goldens you’ll use to generate AI outputs.

Creating a dataset with goldens

Navigate to Datasets in the sidebar
Click Create Dataset
Give it a name (e.g., “QA Test Cases”)
Add your golden:
- Input: The user query (e.g., “What is the refund policy?”)
- Expected Output (optional): The ideal response
- Actual Output: The AI app’s output to evaluate
Click Save

We’ll cover all the ways you can generate AI outputs in later sections.

For this quickstart, provide a hardcoded actual output (don’t worry, we won’t be doing this later):

Field	Example Value
Input	”What is the refund policy?”
Actual Output	”You can request a refund within 30 days of purchase by contacting support.”

Run the Evaluation

Now let’s evaluate your goldens against your metrics.

Navigate to Evaluations in the sidebar
Click New Evaluation
Select your Dataset (e.g., “QA Test Cases”)
Select your Metric Collection (e.g., “RAG Quality Metrics”)
Click Run Evaluation

The evaluation will process each test case and score it against your selected metrics.

View Results on Dashboard

Once your run an evaluation, you will be redirected to a test run. Wait for a moment for evaluation to complete, and ✅ done!. You’ve run your first no-code evaluation.

Viewing test run results

In the testing report, you can analyze:

Individual test cases — drill down into specific failures to understand what went wrong
Score distributions — view average, median, and percentile breakdowns for each metric
Pass/fail results — a test case passes only if all its metrics meet their thresholds
AI-generated summary — get an automated analysis of patterns and issues across your test run

In later sections, you can find out more on what a test run offers.

Generating AI Outputs

In the quickstart above, we hardcoded the actual output directly in the dataset. This is useful for quick tests, but highly not recommedned. This is because you should aim to test changes made to your AI app, not static outputs that are pre-computed.

Confident AI offers more powerful ways to generate outputs dynamically:

Single prompt generation — define a prompt template in the platform and Confident AI calls your configured LLM provider to generate outputs automatically. Ideal for testing prompt variations or comparing models.
AI Connections — connect directly to your deployed AI system. If it’s reachable via HTTP(s), it’s testable. Customize request payloads, parse custom response structures, and pass headers or auth tokens.

AI connections are powerful because it allows Confident AI to test your AI apps as they are. However, it does require an initial small setup time from engineering.

AI Connections let you test your actual AI system end-to-end, catching integration issues that prompt-only testing misses.

Next Steps

Now that you’ve completed a basic evaluation, learn how to handle different use cases:

Single-Turn Evals

Evaluate one-shot Q&A, summarization, and classification tasks with generated outputs.

Multi-Turn Evals

Evaluate conversational AI where context builds across multiple exchanges.