No-Code Evals Quickstart
Run your first evaluation in the platform UI — no code required.
Overview
This quickstart walks you through running your first no-code evaluation on Confident AI. By the end, you’ll have:
- Created a metric collection to define what you’re evaluating
- Built a dataset with goldens
- Run an evaluation and viewed results on the dashboard
A no-code evaluation workflow lets non-technical team members iterate on your AI app end-to-end without leaving Confident AI.
You’ll need a Confident AI account to follow along. Sign up here if you haven’t already.
Run your first evaluation
Follow the example below for a single-turn QA use case:
Create a Metric Collection
A metric collection groups the metrics you want to evaluate together.
- Navigate to Metric Collections in the sidebar
- Click Create Metric Collection
- Give it a name (e.g., “RAG Quality Metrics”)
- Select the metrics you want to include:
- Answer Relevancy — measures whether the output addresses the input
- Faithfulness — measures whether the output is grounded in the context (see the example below)
- Add any other metrics relevant to your use case
- Click Save
Start with 2-3 metrics for your first evaluation. You can always add more later.
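To make the distinction between these two metrics concrete, here is a hypothetical single-turn test case annotated with what each metric compares. The field names are illustrative only and are not the platform's exact schema:

```python
# A hypothetical RAG test case (field names are illustrative, not a required schema).
test_case = {
    "input": "What is the refund policy?",                     # the user query
    "actual_output": "You can request a refund within 30 days.",
    "retrieval_context": [
        "Refunds are available within 30 days of purchase.",   # retrieved document
    ],
}

# Answer Relevancy: does actual_output address the input?
# Faithfulness:     is actual_output supported by the retrieval_context?
```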
Create a Dataset
Datasets contain the goldens you’ll use to generate AI outputs.
- Navigate to Datasets in the sidebar
- Click Create Dataset
- Give it a name (e.g., “QA Test Cases”)
- Add your golden:
- Input: The user query (e.g., “What is the refund policy?”)
- Expected Output (optional): The ideal response
- Actual Output: The AI app’s output to evaluate
- Click Save
We’ll cover all the ways you can generate AI outputs in later sections.
For this quickstart, provide a hardcoded actual output (don’t worry, we won’t be doing this later):
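For example, a hardcoded golden might look like the following (shown as a Python dict purely for illustration; in the UI you simply fill in these fields):

```python
# A hypothetical hardcoded golden for this quickstart (illustrative values only).
golden = {
    "input": "What is the refund policy?",
    "expected_output": "Customers can request a full refund within 30 days of purchase.",
    # Hardcoded for the quickstart only; later sections generate this dynamically.
    "actual_output": "We offer a full refund within 30 days of purchase, no questions asked.",
}
```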
Run the Evaluation
Now let’s evaluate your goldens against your metrics.
- Navigate to Evaluations in the sidebar
- Click New Evaluation
- Select your Dataset (e.g., “QA Test Cases”)
- Select your Metric Collection (e.g., “RAG Quality Metrics”)
- Click Run Evaluation
The evaluation will process each test case and score it against your selected metrics.
View Results on Dashboard
Once you run an evaluation, you'll be redirected to a test run. Wait a moment for the evaluation to complete, and ✅ done! You've run your first no-code evaluation.
In the testing report, you can analyze:
- Individual test cases — drill down into specific failures to understand what went wrong
- Score distributions — view average, median, and percentile breakdowns for each metric
- Pass/fail results — a test case passes only if all of its metrics meet their thresholds (see the sketch below)
- AI-generated summary — get an automated analysis of patterns and issues across your test run
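The pass/fail rule above can be summarized in a single line of logic. A minimal sketch, assuming each metric produces a 0–1 score with a per-metric threshold (the numbers below are made up):

```python
# Illustrative pass/fail logic: a test case passes only if every metric
# in the collection meets its threshold (scores and thresholds are made up).
scores = {"Answer Relevancy": 0.91, "Faithfulness": 0.78}
thresholds = {"Answer Relevancy": 0.7, "Faithfulness": 0.8}

passed = all(scores[metric] >= thresholds[metric] for metric in thresholds)
print(passed)  # False: Faithfulness (0.78) is below its 0.8 threshold
```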
Later sections cover everything a test run offers in more detail.
Generating AI Outputs
In the quickstart above, we hardcoded the actual output directly in the dataset. This is useful for quick tests, but it isn't recommended in practice: you should be testing the changes you make to your AI app, not static outputs that were pre-computed.
Confident AI offers more powerful ways to generate outputs dynamically:
- Single prompt generation — define a prompt template in the platform, and Confident AI calls your configured LLM provider to generate outputs automatically. Ideal for testing prompt variations or comparing models.
- AI Connections — connect directly to your deployed AI system. If it's reachable over HTTP(S), it's testable. You can customize request payloads, parse custom response structures, and pass headers or auth tokens.
AI Connections are powerful because they let Confident AI test your AI app exactly as it runs. However, they do require a small initial setup from engineering.
AI Connections let you test your actual AI system end-to-end, catching integration issues that prompt-only testing misses.
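For reference, the kind of endpoint an AI Connection talks to is just an HTTP service your team already runs. Below is a minimal sketch of such an endpoint, assuming a FastAPI app guarded by a bearer-token header; the route, payload shape, response field, and auth check are all hypothetical, since Confident AI lets you customize the request payload, response parsing, and headers to match whatever your endpoint expects:

```python
# Hypothetical HTTP endpoint that an AI Connection could call (illustrative only).
# The route, payload shape, and auth check are assumptions, not a required schema.
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    input: str   # the golden's input would be sent here

class ChatResponse(BaseModel):
    output: str  # the generated output would be parsed from this field

@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest, authorization: str = Header("")) -> ChatResponse:
    if authorization != "Bearer my-secret-token":  # hypothetical auth token
        raise HTTPException(status_code=401, detail="Unauthorized")
    # A real app would call its LLM pipeline here; a canned reply keeps the sketch self-contained.
    return ChatResponse(output=f"(generated answer for: {req.input})")
```

Once an endpoint like this is reachable, the AI Connection is configured to send each golden's input in the request and read the generated output from the response.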
Next Steps
Now that you’ve completed a basic evaluation, learn how to handle different use cases: