Single-Turn Evals (No-Code)

Evaluate one-shot interactions like Q&A, summarization, and classification.

Overview

Single-turn evaluations test single input → single output interactions, where each request is independent and doesn’t rely on conversation history:

  • Q&A systems — answering questions from documents or knowledge bases
  • Summarization — condensing long content into key points
  • Classification — categorizing text into predefined labels
  • RAG pipelines — retrieval-augmented generation with context

Single-turn evals treat your AI app as a black box — only the output, tools called, and retrieval context matter for evaluation.
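Conceptually, that black box boils down to a handful of fields per test case. The sketch below is plain Python for illustration only, not the Confident AI API; the field names simply mirror the parameters mentioned above.

```python
# Illustrative shape of a single-turn test case; not an official API.
test_case = {
    "input": "What is your refund policy?",                         # the request
    "actual_output": "You can request a refund within 30 days.",    # your AI app's response
    "retrieval_context": ["Refunds are accepted within 30 days."],  # retrieved chunks (RAG)
    "tools_called": ["search_knowledge_base"],                      # tools invoked, if any
}
```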

Requirements

To run a single-turn evaluation, you need:

  1. A single-turn dataset — goldens with input and optionally expected_output, context, etc.
  2. A single-turn metric collection — the metrics you want to evaluate against

If you completed the Quickstart, you already have both of these ready.
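For reference, a single-turn dataset is just a collection of goldens. Below is a minimal sketch of what two goldens might look like; only input is required, and the other fields shown are optional examples.

```python
# Illustrative goldens for a single-turn dataset; only "input" is required.
goldens = [
    {
        "input": "Summarize our shipping policy in one sentence.",
        "expected_output": "Orders ship within two business days, free over $50.",
        "context": ["Shipping policy: orders ship within two business days ..."],
    },
    {
        "input": "Is the warranty transferable?",
        # expected_output and context omitted: both are optional
    },
]
```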

How it works

No-code evals follow a simple 4-step process:

  1. Define metrics — choose what aspects of quality to measure (e.g., relevancy, faithfulness)
  2. Create dataset — build goldens with parameters such as inputs and expected outputs
  3. Generate AI output — provide actual outputs from your AI app
  4. Evaluate — run metrics against your test cases and view results

Here’s a visual representation of the data flow during evaluation:

The “AI app” shown in the diagram can be anything from a single prompt, to a multi-prompt pipeline, to a full application reachable over the internet. More on this in later sections.
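If it helps to see the same flow in code, here is a minimal Python sketch of steps 1–4. Everything in it is hypothetical (my_ai_app, score_metric, the metric names and thresholds are stand-ins); in the no-code flow, Confident AI runs this loop for you.

```python
# Hypothetical end-to-end sketch of the evaluation data flow.
# my_ai_app and score_metric are stand-ins, not real APIs.

def my_ai_app(user_input: str) -> str:
    """Stand-in for your AI app: a single prompt, a pipeline, or a hosted endpoint."""
    return "You can request a refund within 30 days of purchase."

def score_metric(metric_name: str, test_case: dict) -> float:
    """Stand-in for an LLM-as-a-judge metric that returns a score between 0 and 1."""
    return 0.9

metrics = {"Answer Relevancy": 0.7, "Faithfulness": 0.7}  # 1. define metrics (name -> threshold)
goldens = [{"input": "What is your refund policy?"}]      # 2. create dataset

results = []
for golden in goldens:
    actual_output = my_ai_app(golden["input"])            # 3. generate AI output
    test_case = {**golden, "actual_output": actual_output}
    scores = {name: score_metric(name, test_case) for name in metrics}              # 4. evaluate
    passed = all(scores[name] >= threshold for name, threshold in metrics.items())  # all metrics must pass
    results.append({"test_case": test_case, "scores": scores, "passed": passed})
```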

Run an Evaluation

You can evaluate a dataset by clicking the Evaluate button at the top right of its page.

[Screenshot: Evaluate button on single-turn datasets]

Step 1: Select your dataset and metrics

  1. Navigate to Project > Datasets, and select your single-turn dataset to evaluate
  2. Click Evaluate
  3. Select your single-turn Metric Collection

Step 2: Configure output generation

Select how to generate actual outputs:

[Screenshot: Configure prompt for single-turn evaluation]

For single-prompt systems, select a prompt template that Confident AI will use to call your configured LLM provider.

  1. Select your desired prompt and the version of the prompt as your output generation method
  2. Map any golden fields from your current dataset to any variables defined within your prompt
  3. Confident AI calls your prompt for each golden and generates outputs automatically

You’ll need an existing prompt for this to work. If you haven’t already, you can create one in the Prompt Studio.
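As a rough illustration of the field-to-variable mapping described above, suppose your prompt defines two variables and your goldens carry input and context fields. The double-brace syntax, variable names, and render_prompt helper below are assumptions made for this sketch, not an exact reproduction of Prompt Studio.

```python
# Hypothetical prompt template with two variables.
prompt_template = """Answer the question using only the provided context.

Context:
{{context}}

Question:
{{question}}
"""

# How golden fields might be mapped onto the prompt's variables.
variable_mapping = {
    "question": "input",   # prompt variable <- golden field
    "context": "context",
}

def render_prompt(golden: dict) -> str:
    """Fill the template with values from a golden (illustration only)."""
    rendered = prompt_template
    for variable, golden_field in variable_mapping.items():
        rendered = rendered.replace("{{" + variable + "}}", str(golden.get(golden_field, "")))
    return rendered
```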

Step 3: Run and view results

Click Run Evaluation and wait for it to complete. You’ll be redirected to your test run dashboard showing:

  • Score distributions — average, median, and percentiles for each metric
  • Pass/fail results — a test case passes only if all metrics meet their thresholds
  • AI-generated summary — automated analysis of patterns and issues
  • Individual test cases — drill down into specific failures

[Screenshot: Single-turn test run results]
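For intuition on how the dashboard numbers relate to individual test cases, here is a rough sketch of how summary statistics could be derived from per-metric scores. The scores and threshold are made up for illustration; this is not how Confident AI computes them internally.

```python
from statistics import mean, median, quantiles

# Made-up per-test-case scores for one metric across a test run.
relevancy_scores = [0.91, 0.78, 0.64, 0.88, 0.95, 0.42, 0.81]
threshold = 0.7

print("average:", round(mean(relevancy_scores), 2))
print("median: ", round(median(relevancy_scores), 2))
print("p90:    ", round(quantiles(relevancy_scores, n=10)[-1], 2))

# A test case passes only if *all* of its metrics meet their thresholds;
# for a single metric, that reduces to a simple threshold check.
passing = sum(score >= threshold for score in relevancy_scores)
print(f"passing: {passing}/{len(relevancy_scores)}")
```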

Regression Testing

Once you have two or more test runs, you can compare them side-by-side to identify regressions.

Step 1: Open regression testing

  1. Go to your test run’s A|B Regression Test tab
  2. Click New Regression Test
  3. Select the test runs you want to compare

Step 2: Analyze regressions

The comparison view highlights:

  • Regressions (red) — test cases that got worse
  • Improvements (green) — test cases that got better
  • Side-by-side scores — metric comparisons across runs

[Screenshot: A|B regression testing]
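Conceptually, a regression comparison is just a per-test-case diff of metric scores between two runs. The sketch below uses made-up scores to illustrate the idea; it is not how Confident AI computes the comparison internally.

```python
# Simplified illustration of an A|B comparison between two test runs.
# Scores are keyed by the golden's input so the same test case lines up across runs.
run_a = {"What is your refund policy?": 0.92, "Is the warranty transferable?": 0.81}
run_b = {"What is your refund policy?": 0.95, "Is the warranty transferable?": 0.64}

for test_case, score_a in run_a.items():
    score_b = run_b[test_case]
    if score_b < score_a:
        label = "regression"     # shown in red
    elif score_b > score_a:
        label = "improvement"    # shown in green
    else:
        label = "unchanged"
    print(f"{label:>12}: {test_case} ({score_a:.2f} -> {score_b:.2f})")
```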

Name your test runs with identifiers (e.g., “gpt-4o baseline”, “claude-3.5 v2”) to make regression comparisons easier to track.

Next Steps