Single-Turn Evals (No-Code)
Evaluate one-shot interactions like Q&A, summarization, and classification.
Overview
Single-turn evaluations test one input → one output interactions. These are use cases where each request is independent and doesn’t rely on conversation history:
- Q&A systems — answering questions from documents or knowledge bases
- Summarization — condensing long content into key points
- Classification — categorizing text into predefined labels
- RAG pipelines — retrieval-augmented generation with context
Single-turn evals treat your AI app as a black box — only the output, tools called, and retrieval context matter for evaluation.
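For intuition, here is roughly what a single-turn test case boils down to, sketched with the open-source deepeval Python SDK. The no-code flow never requires you to write this, and the example question, answer, and policy text are made up — it only illustrates the "black box" idea that metrics see the output and retrieval context, not the app's internals:

```python
from deepeval.test_case import LLMTestCase

# One input, one output, no conversation history. The metrics only ever see
# what the app produced (plus any retrieval context) — the app itself stays a black box.
test_case = LLMTestCase(
    input="What is the refund window for annual plans?",
    actual_output="Annual plans can be refunded within 30 days of purchase.",
    retrieval_context=[
        "Refund policy: annual subscriptions are refundable for 30 days after purchase."
    ],
)
```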
Requirements
To run a single-turn evaluation, you need:
- A single-turn dataset — goldens with `input` and optionally `expected_output`, `context`, etc.
- A single-turn metric collection — the metrics you want to evaluate against
If you completed the Quickstart, you already have both of these ready.
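If you prefer to create goldens programmatically rather than in the UI, a minimal sketch with the deepeval SDK looks something like the following (assuming you are logged in to Confident AI; the dataset alias and golden contents are made up for illustration):

```python
from deepeval.dataset import EvaluationDataset, Golden

# A golden is the "question" half of a test case: it carries an input and optional
# reference fields (expected_output, context), but no actual output yet.
dataset = EvaluationDataset(goldens=[
    Golden(
        input="Summarize the incident report in two sentences.",
        expected_output="A two-sentence summary covering root cause and customer impact.",
    ),
])

# Push the goldens to Confident AI so they appear under Project > Datasets.
dataset.push(alias="My Single-Turn Dataset")
```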
How it works
No-code evals follow a simple 4-step process:
- Define metrics — choose what aspects of quality to measure (e.g., relevancy, faithfulness)
- Create dataset — build goldens with parameters such as inputs and expected outputs
- Generate AI output — provide actual outputs from your AI app
- Evaluate — run metrics against your test cases and view results
Here’s a visual representation of the data flow during evaluation:
Your “AI app” as shown in the diagram can be anything from a single-prompt or multi-prompt system to any AI app reachable over the internet. More on this in later sections.
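To make the data flow concrete, here is the same four-step loop as a purely illustrative Python sketch using the deepeval SDK. None of this is needed for the no-code flow — `my_ai_app` is a hypothetical stand-in for your application, and the metric assumes an evaluation model (e.g., an OpenAI key) is configured:

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# 1. Define metrics
metrics = [AnswerRelevancyMetric(threshold=0.7)]

# 2. Create dataset: goldens with inputs (and optional expected outputs)
goldens = [{"input": "How do I reset my password?"}]

# 3. Generate AI output by calling your app — a black box from the
#    evaluator's point of view. my_ai_app is a hypothetical placeholder.
def my_ai_app(user_input: str) -> str:
    return "Go to Settings > Security and click 'Reset password'."

test_cases = [
    LLMTestCase(input=g["input"], actual_output=my_ai_app(g["input"]))
    for g in goldens
]

# 4. Evaluate: run every metric against every test case
evaluate(test_cases=test_cases, metrics=metrics)
```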
Run an Evaluation
You can run an evaluation on a dataset by clicking the Evaluate button at the top right of the dataset page.
Select your dataset and metrics
- Navigate to Project > Datasets, and select your single-turn dataset to evaluate
- Click Evaluate
- Select your single-turn Metric Collection
Configure output generation
Select how to generate actual outputs — either from a Prompt or through an AI Connection.
Prompt
For single-prompt systems, select a prompt template that Confident AI will use to call your configured LLM provider.
- Select your desired prompt and the version of the prompt as your output generation method
- Map any golden fields from your current dataset to any variables defined within your prompt
- Confident AI calls your prompt for each golden and generates outputs automatically
You’ll need an existing prompt for this to work. If you haven’t created one yet, you can do so in Prompt Studio.
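The golden-to-variable mapping is essentially template interpolation: each golden field you map is substituted into the matching placeholder in your prompt before the LLM call. A plain-Python illustration of the idea (the placeholder syntax and field names here are hypothetical, not the platform's exact template format):

```python
# Hypothetical prompt version with two placeholders.
prompt_template = (
    "Answer the question using only the provided context.\n\n"
    "Context: {context}\n\nQuestion: {input}"
)

# A golden's fields mapped onto the prompt's variables.
golden = {
    "input": "What is the refund window for annual plans?",
    "context": "Annual subscriptions are refundable for 30 days after purchase.",
}

# For every golden, the rendered prompt is sent to your configured LLM provider
# to produce that golden's actual output.
rendered_prompt = prompt_template.format(**golden)
print(rendered_prompt)
```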
Run and view results
Click Run Evaluation and wait for it to complete. You’ll be redirected to your test run dashboard showing:
- Score distributions — average, median, and percentiles for each metric
- Pass/fail results — a test case passes only if all metrics meet their thresholds
- AI-generated summary — automated analysis of patterns and issues
- Individual test cases — drill down into specific failures
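The pass/fail rule is strict: a test case passes only when every metric in the collection meets its threshold. In plain-Python terms (the metric names and scores below are illustrative):

```python
# A test case passes only if all metrics meet their thresholds.
def test_case_passes(metric_scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    return all(score >= thresholds[name] for name, score in metric_scores.items())

# Example: faithfulness falls below its threshold, so the whole test case fails.
scores = {"Answer Relevancy": 0.92, "Faithfulness": 0.61}
thresholds = {"Answer Relevancy": 0.7, "Faithfulness": 0.7}
print(test_case_passes(scores, thresholds))  # False
```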
Regression Testing
Once you have two or more test runs, you can compare them side-by-side to identify regressions.
Name your test runs with identifiers (e.g., “gpt-4o baseline”, “claude-3.5 v2”) to make regression comparisons easier to track.