Presenting...

The open-source LLM evaluation framework.

Star on GitHub
The Complete Guide to LLM Experimentation: Compare Prompts, Models, and Agents

Every AI team eventually ends up in the same debate: is this new prompt actually better, or do we just like the 3 examples we happened to look at?

This is what LLM experimentation is for. It gives you a controlled way to compare two or more versions of your AI app against the same dataset, using the same metrics, before you push changes to production.

No vibes, no cherry-picked examples, and no gut feel. Instead, you get a repeatable process for deciding which prompt, model, tool set, or agent configuration should actually ship.

In this guide, I'll walk through the full LLM experimentation workflow: what LLM experimentation means, how it differs from evaluation, how to choose what to optimize, how to curate datasets and metrics, how to interpret results, how to extend experiments into production, and how Confident AI supports the loop end-to-end.

TL;DR

  • LLM evaluation answers "how good is this version?" LLM experimentation answers "which version is better?"
  • The only fair comparison uses the same dataset and evaluation metrics. If prompt A and prompt B see different inputs, you are not running an experiment. You are collecting anecdotes.
  • Change one variable at a time. Compare prompt vs prompt, model vs model, tool set vs tool set, or retrieval config vs retrieval config. If everything changes at once, you will not know what worked.
  • Use the 2 + 3 rule for metrics. Start with 2 general-purpose metrics and 3 custom metrics that encode what good means for your product.
  • Look beyond LLM-as-a-judge scores. Latency, cost, tool correctness, retries, token usage, and escalation rate can matter just as much as answer quality.
  • Experiments do not end after the report. Winners should be versioned, A/B tested when appropriate, monitored in production, and used to improve future datasets.

What Is LLM Experimentation?

LLM experimentation is the process of comparing multiple versions of an LLM app under controlled conditions. Each version runs against the same evaluation dataset and gets scored by the same metric collection so you can see which version performs better.

Three prompt variants evaluated against the same dataset and metrics.

Three prompt variants are evaluated against the same dataset and metrics so their outputs can be compared fairly.

The basic experimentation process looks like this:

  1. Pick the parameter you want to optimize: prompt, model, tool set, retrieval config (top-k, chunking, reranker), agent architecture (planner, memory, reflection loop), etc.
  2. Choose a fixed dataset that represents the use case.
  3. Prepare a collection of metrics that evaluate what you care about.
  4. Run every variant against the same dataset and metric collection.
  5. Compare aggregate scores, inspect failures, and decide what to ship or investigate next.

That sounds obvious, but even small setup mistakes can make experiment results hard to trust. If each variant changes more than one parameter, if you optimize too many metrics at once, or if the main metric does not match the product goal, it becomes difficult to know what actually improved.

Choosing a winner is also not always straightforward. If version A performs better on 50% of test cases and version B wins on the other 50%, you need to inspect the trade-offs before deciding what to ship.

LLM Experimentation vs LLM Evaluation

Think of it this way:

  • An eval tells you whether one version passed your quality bar.
  • An experiment tells you which version performed best against one or more alternatives.

That distinction matters because evals are easy to mistake for experiments. If you run prompt A today, prompt B tomorrow, compare the pass rates, and call it a decision, the comparison is noisy unless both versions ran on the same dataset, with the same metrics, under the same conditions.

Confident AI regression testing suite

Confident AI regression tests make sure prompt, model, and agent changes are compared against the same examples before release.

An evaluation is still the atomic unit of experimentation. You cannot compare two versions unless each version was evaluated. The difference is that experimentation adds the comparison layer: same inputs, same metric collection, same runtime assumptions, and a decision about which version should move forward.

LLM Experiments vs A/B Tests

An LLM experiment is usually done offline in development or staging. You can compare many prompt, model, tool, retrieval, or agent variants against the same dataset and evaluation metrics. Think: "Prompt v4 performed better than prompt v3 and v2 on task completion, but lost on faithfulness."

An A/B test compares two versions, usually A vs B. In development, that might mean an A/B-style comparison on a fixed dataset. In production, it usually means splitting live traffic across two LLM app versions and measuring product outcomes. For example: "Prompt v4 improved support resolution compared to prompt v3, but produced more user-reported hallucinations."

In practice, offline experiments and A/B tests in production are usually used together to continuously improve an LLM app or agent:

  • Offline experiments help you eliminate bad variants before users see them.
  • Production A/B tests help you confirm that a strong offline winner still wins under real traffic.

Running an LLM Experiment

Running an LLM experiment should be easy once you have the right setup. A good setup is controlled in the right ways: clear baseline, clear variants, fixed dataset, and fixed metrics.

Run experiment modal for selecting an evaluation dataset and metric collection.

Confident AI lets you run an LLM experiment by selecting the dataset and metric collection that every variant will use.

The minimum workflow for an LLM app or agent looks like this:

  1. Choose what to optimize. Decide whether the experiment is about the prompt, model, tool set, retrieval config, agent architecture, or another parameter.
  2. Choose a baseline app or agent version. This is usually the current production version of the LLM app or agent.
  3. Create one or more app or agent variants. These are the proposed versions you want to compare against the baseline. Ideally, only the parameter you want to optimize should change.
  4. Run every version on the same evaluation dataset. Every variant should see the same inputs.
  5. Score every version with the same metric collection. Every variant should be judged by the same quality bar.

This lets you answer the practical question: if we swapped production to this version, what would likely get better and what would likely get worse?

Choosing What to Optimize

Most LLM apps have a lot of knobs. Prompts, models, tools, retrieval, temperature, memory, response format, and planner instructions can all change behavior.

Like any controlled experiment, you should start by choosing one independent variable. If you change five knobs at once and the result improves, you might have a better app, but you will not know which change helped or whether another change introduced a regression.

Parameter

When to Optimize

What You Compare

Prompt

You want better wording, instructions, examples, formatting, or refusal behavior

Two different task prompts, such as a concise instruction prompt vs a few-shot prompt

System prompt

You want to change the app's role, policy, tone, or global constraints

Two different system prompts, such as a strict policy-first prompt vs a more conversational prompt

Model

You want to trade off quality, cost, latency, or reliability

Different models, such as GPT-4o vs Claude vs Gemini

Tool set

You want to know whether an agent can complete tasks more reliably with different tools

Different tool configurations, such as the current tools vs a version with a refund lookup tool added

Retrieval config

You want better grounding or context selection in a RAG workflow

Different retrieval settings, such as top-k 3 vs top-k 8, or no reranker vs reranker

Temperature / decoding

You want to test creativity, determinism, or consistency

Different decoding settings, such as temperature 0 vs 0.3 vs 0.7

Memory strategy

You want a multi-turn agent to remember useful context without drifting

Different memory approaches, such as no memory vs short-term memory vs summarized memory

Agent planning

You want to reduce loops, wrong tool calls, or unnecessary reasoning

Different planning instructions, such as a direct tool-use planner vs a step-by-step planner

The point is not that you can never run multi-variable experiments. You can, especially for exploratory work. But if multiple variables change at once, you should not treat the result as proof that any one variable caused the improvement.

For most teams, one-variable experiments are the most effective path to better release decisions.

Curating Your Evaluation Dataset

Your dataset is the test bed your variants have to run across. If it is too easy, every version wins. If it is random, the result is noisy. If it is not representative, you optimize for a fake product.

Dataset editor showing evaluation test cases for an LLM experiment.

Confident AI datasets give every prompt, model, or agent variant the same test cases to run against.

For each use case, I recommend starting with roughly 100 test cases. That is enough to include happy paths, edge cases, ambiguous inputs, long-tail failures, and high-value customer scenarios without making every experiment painfully slow.

A good experiment dataset usually includes a few general categories, plus use-case-specific cases for the kind of LLM app you are testing:

  • Common successful workflows so you do not regress the core behavior.
  • Known failure cases from production, logs, support tickets, or manual review.
  • Ambiguous inputs where the app should ask for clarification or avoid over-assuming.
  • Policy-sensitive cases if the app handles security, compliance, safety, escalation, or other strict rules.
  • RAG cases if the app uses retrieval and needs expected context or grounding requirements.
  • Agentic tasks if the app uses tools, multi-step sequencing, or final-answer quality depends on intermediate actions.

Do not build one giant generic dataset and call it a day. Instead, create separate datasets for each use case, such as billing support, onboarding, sales email generation, medical intake summarization, or internal knowledge search.

This matters because a prompt can be great for one workflow and weak for another. The same is true for metrics: Faithfulness might matter for an internal knowledge search workflow, while Tool Correctness might matter more for an agent that updates records. Even within the same dataset, not every metric needs to apply to every test case, because different examples may be testing different behaviors. If your dataset and metric suite blend every workflow together, the aggregate score can hide the exact slice you need to improve.

Choosing Your Metrics

Your metrics are the scoreboard. If the scoreboard is wrong, the experiment is wrong.

Metrics library showing reusable metrics for LLM evaluation.

Confident AI lets you start with reusable metrics, then add custom metrics for the product-specific behavior your experiment needs to measure.

For most LLM experiments, start with the 2 + 3 rule: use 2 out-of-the-box metrics for common failure modes and 3 custom metrics for the use-case-specific behavior that makes your app good or bad. This gives you broad coverage without turning the experiment into a dashboard with 20 competing scores.

For example, a RAG support assistant might use Answer Relevancy and Faithfulness as the two general metrics, then add custom metrics for refund policy compliance, clarifying question quality, and customer tone.

The exact suite depends on the app. A RAG workflow might care most about grounding and retrieval quality, while an agentic workflow might need trace-level metrics for tool correctness, task completion, memory, and unnecessary loops. Not every metric needs to apply to every test case.

Not every signal should be an LLM-as-a-judge score either:

Signal

Why It Matters

Latency

A higher-quality answer can still be unusable if it is too slow

Cost

A variant can win on quality but be too expensive to ship broadly

Tool retries

The final answer may look correct while the trace shows instability

Escalation rate

A prompt that sounds better may still fail to resolve the task

Abstention or refusal rate

The variant may become too cautious or not cautious enough

Together, these metrics and signals tell you whether a variant is accurate, useful, efficient, and safe to ship.

Before you run big experiments, align your LLM-as-a-judge metrics against a small set of examples you understand. If your metric says a bad response is good, the experiment will optimize in the wrong direction very confidently.

Comparing Prompts Side by Side

Prompt experiments deserve their own workflow because prompts are uniquely open-ended. Model choice is constrained by the models available to you. Prompt design is not. There are effectively infinite ways to rewrite instructions, add examples, change tone, specify output format, or emphasize policy.

Side-by-side prompt comparison in Confident AI.

Confident AI lets you compare prompt variants side by side before running the full experiment.

That is why you usually should not send every prompt idea straight into a full experiment. Before scoring 100+ test cases, run each candidate prompt on one representative test case and inspect the answer it generates. The goal is not to declare a winner yet. The goal is to catch prompts that are not worth evaluating at scale.

Prompt versioning makes this much easier. If each prompt candidate is saved as a version, you can run experiments without losing track of which wording, examples, or output format produced each result.

For prompt variants, a clean setup looks like this:

  1. Use the current production prompt as the baseline.
  2. Add 2-4 candidate prompts that represent meaningfully different strategies.
  3. Keep the model, tools, retrieval config, and metrics fixed.
  4. Run a quick one-test-case check to see what each prompt actually generates.
  5. Remove prompts that fail basic formatting, tone, refusal, or instruction-following expectations.
  6. Run the strongest prompt candidates on the full dataset.

That quick run is important. You do not need to spend time scoring 100+ examples for a prompt that fails basic formatting, ignores the task, or refuses every request because you over-tightened the policy language.

Interpreting Experiment Results

A good experiment report does more than tell you "B scored higher." It shows whether B is better in the ways that matter.

When you read the report, look for:

  • Clear winners: one variant wins across the metrics and test cases that matter most.
  • Close calls: the difference is small enough that you should collect more test cases before shipping.
  • Trade-offs: one prompt improves relevance but hurts concision, tone, faithfulness, cost, or latency.
  • Failure clusters: the average looks fine, but one slice fails, such as billing, long conversations, retrieval-heavy cases, or tool-calling tasks.
  • Outliers: a narrow but important failure could be worse than a small aggregate improvement.

The mistake is treating experimentation like a leaderboard. If prompt B improves average Answer Relevancy by 4% but fails three high-value enterprise billing cases that prompt A handled correctly, B is probably not the version to ship. The report should show where variants diverge, because that is usually where the decision lives.

Running Experiments in Production

Offline experiments should catch obvious regressions. Production experiments tell you whether the strongest candidate still works under real traffic, real retrieval data, and real business constraints.

The relationship should look like this:

  1. Use offline experiments to eliminate weak variants.
  2. Ship the strongest candidate behind versioning.
  3. Run a canary, shadow deployment, or A/B test if the change is meaningful.
  4. Monitor online evals and product metrics.
  5. Feed production failures back into the next offline experiment.

This is what turns experimentation from a pre-deployment checklist into a continuous improvement loop.

A/B Testing Prompts and Models

Use offline experiments to eliminate bad candidates, then use shadow traffic, canaries, or production A/B tests to confirm the strongest version under real traffic.

Confident AI production traces and online evaluation metrics.

Confident AI lets you monitor production traces, online evaluation metrics, and failure rates after an A/B test or rollout.

The setup should look a lot like the offline experiment: keep the independent variable clear, release two versions such as prompt v3 and prompt v4, and monitor the same target metrics. The difference is that production can surface new issues, so watch traces, user feedback, online evals, latency, and cost while the test is running.

Expanding Your Evaluation Dataset for Experiments

Your evaluation dataset should get better every time production teaches you something.

The highest-value test cases usually come from:

  • Failed online evals where the shipped version violated a metric in production.
  • User-reported issues from support tickets, thumbs-down feedback, or sales calls.
  • Automatically surfaced issues that your metrics did not catch, such as hallucinations, tool errors, retrieval misses, and repeated retries surfaced by Confident AI's trace analysis.
  • Human-in-the-loop review queues where reviewers marked responses as low quality.
  • High-impact business failures such as missed escalations, incorrect refund guidance, or broken onboarding flows.

When one of these issues appears, do not just fix the prompt. Add the input, context, expected behavior, metadata, and failure reason back into your dataset so every future prompt, model, or agent variant has to handle that case.

Expanding Your Metric Suite

Production should also teach you when your metrics are incomplete or misaligned.

Confident AI metric alignment comparing human annotations and automated metrics.

Confident AI metric alignment shows where automated metrics agree or disagree with human annotations.

For example, maybe your LLM-as-a-judge metric says a support response is relevant and faithful, but human reviewers keep flagging it because it sounds too cold. That does not always mean you need a brand-new metric; it might mean your existing tone metric needs better criteria. Maybe the agent completes the task, but only after five unnecessary tool calls. Maybe answers are correct, but too slow for the UX. These are metric gaps or metric-alignment problems.

The best way to expand your metric suite is through a human-in-the-loop evaluation workflow:

  1. Sample production traces and outputs.
  2. Have humans review quality, failure type, and severity.
  3. Compare human labels against your automated metrics.
  4. Identify where automated metrics disagree with humans.
  5. Add new metrics, revise existing criteria, or tune thresholds to capture the missing judgment.
  6. Re-run historical examples to make sure the new metric behaves correctly.

This matters most for subjective criteria like tone, helpfulness, escalation quality, and policy nuance. LLM-as-a-judge metrics are powerful, but the criteria behind custom metrics still need to be tuned against the way your team actually judges quality.

Common LLM Experimentation Pitfalls

Most LLM experimentation failures come from bad experimental design.

  • Changing too many variables at once. If prompt, model, temperature, tools, and retrieval all change together, you will not know what caused the result.
  • Using different datasets per variant. Every variant must run on the same inputs.
  • Optimizing for one metric only. A variant can improve relevance while hurting faithfulness, safety, cost, or latency.
  • Trusting averages too much. Aggregate scores hide catastrophic failures in narrow but important slices.
  • Skipping the smoke test. A quick run on 10-20 cases can catch broken formatting, bad tool schemas, or obviously weak prompts before the full experiment.
  • Ignoring close calls. If the difference is tiny, collect more test cases before declaring a winner.
  • Forgetting cost and latency. A better answer that is too slow or too expensive might be a worse product.
  • Not versioning the experiment. If you cannot trace the winner back to a prompt, model, dataset, metric collection, and app release, you cannot reproduce the decision.

Another common pitfall is assuming every change needs a full experiment. Fast-moving teams may not run a complete prompt-vs-prompt or model-vs-model experiment for every small update, but they should still run scheduled evals. Experiments tell you which version performed best at one point in time; scheduled evals tell you whether the shipped version keeps passing the same quality bar as prompts, models, retrieval indexes, tools, and traffic change.

Why Confident AI Is the Best Platform to Run LLM Experiments

A useful LLM experimentation platform is not just a playground. A playground helps you brainstorm; an experimentation platform helps you make a decision you can defend.

Confident AI lets teams compare prompts, models, AI connections, and agent configurations against the same dataset and metric collection. You can version prompts, run quick checks in Arena, launch full experiments, and inspect the winning and losing examples from the same workflow.

Confident AI Arena comparing two AI app connections with prompt versions and hyperparameters.

Confident AI Arena lets teams compare real AI app connections, prompt versions, and hyperparameters before running the full experiment.

The workflow difference is that Confident AI can evaluate outputs from your actual AI app endpoint. You do not have to rebuild your prompts, chains, tools, and retrieval flow inside the platform just to run an experiment. If your app can return an output, Confident AI can score it against your dataset and metrics, which means you can test the hyperparameters you actually care about: prompts, models, retrieval config, tool sets, agent architecture, and more.

After deployment, Confident AI keeps the loop going with production A/B testing, online evals, automatic issue surfacing from traces, metric recommendations, human-in-the-loop workflows, and custom dashboards for quality, cost, latency, and failure rates. Experiments should not end as static reports. They should become the next release decision, production monitor, and dataset update.

Conclusion

LLM experimentation is not complicated, but it does require discipline. Compare variants against the same dataset. Score them with the same metrics. Change one variable at a time. Read both the aggregate scores and the individual failures. Then choose the version that wins on the quality bar you actually care about.

That is how you compare prompts side by side, test GPT-4o against Claude, and evaluate agent configurations without touching production. More importantly, it gives your team a shared way to decide what improved, what regressed, and what is ready to move forward.

The next time a prompt, model, or agent change looks better in a few examples, turn it into an experiment before treating it like a release decision. If it wins, you have evidence. If it loses, you found the regression before users did.

Thanks for reading. When you are ready to run this workflow on your own AI app, get started with Confident AI for free.

Frequently Asked Questions

What is LLM experimentation?
LLM experimentation is the process of comparing multiple versions of an LLM app under controlled conditions. Each version runs against the same dataset and metric collection so you can decide which prompt, model, retrieval config, tool set, or agent configuration performs better.
How do I compare two prompts side by side against the same test dataset?
Create prompt A and prompt B, run both against the exact same dataset, score both with the same metric collection, and compare aggregate scores plus per-test-case failures. The dataset and metrics must stay fixed; otherwise you are not comparing prompts, you are comparing different test conditions.
How do you A/B test LLM prompts?
Start with an offline experiment where each prompt runs on the same dataset and metrics. Use that to eliminate weak prompt candidates before users see them. If the change is meaningful, promote the strongest candidate behind versioning and run a production A/B test, canary, or shadow deployment to compare live outcomes.
What is the difference between prompt testing and prompt experimentation?
Prompt testing usually checks whether one prompt works on a set of examples. Prompt experimentation compares multiple prompt variants under the same conditions so you can decide which version performs better and what trade-offs each variant introduces.
What is the best platform for running side-by-side LLM experiments with evaluation metrics?
Confident AI is the best platform for side-by-side LLM experiments with evaluation metrics because it compares prompts, models, AI connections, and agent configurations against the same dataset and metric collection. It shows aggregate winners, metric-by-metric trade-offs, per-test-case outputs, traces for debugging, online evals, and production monitoring.
How many test cases should I use for an LLM experiment?
Start with roughly 100 test cases per use case. That is usually enough to cover common workflows, known failures, ambiguous inputs, policy-sensitive cases, and edge cases without making every experiment too slow. Add more examples when results are close or the workflow is high risk.
What metrics should I use for LLM experimentation?
Use the 2 + 3 rule: 2 out-of-the-box metrics for common quality dimensions and 3 custom metrics for your product-specific quality bar. Depending on the app, include metrics like Answer Relevancy, Faithfulness, Task Completion, Tool Correctness, tone, policy adherence, latency, cost, token usage, and escalation rate.
What is the best workflow for iterating on prompts with evaluation data instead of gut feel?
Explore variants quickly, run a small smoke test, formalize the strongest candidates into an experiment, run the same dataset and metrics across every variant, inspect outliers, promote the winner, then monitor production. Confident AI supports this loop with Arena, experiments, test reports, traces, online evals, and production monitoring.
How do I compare prompt variants and know which one is actually better?
Look for consistent wins across the metrics that matter, not just one higher average. Inspect trade-offs, close calls, failure clusters, and outlier failures. A variant is actually better when it improves the quality dimensions you care about without introducing unacceptable regressions in safety, faithfulness, latency, or cost.
What's the best way to A/B test different LLM models before deploying?
Run an offline controlled experiment before touching production. Use the same prompt, dataset, retrieval context, parameters, and metric collection for every model. Compare quality, cost, latency, and outlier failures together. The best model is the one that wins for your use case, not the one with the most impressive public benchmark score.
Should I A/B test LLM models in production or offline first?
Start offline first. Use a fixed dataset and metric collection to eliminate bad model candidates before users see them, then use shadow traffic, canaries, or production A/B tests to confirm the winner under real traffic. Confident AI supports the full loop: offline experiments, test reports, traces, scheduled evals, online evals, and production monitoring.
How do I test whether GPT-4o or Claude is better for my use case?
Set GPT-4o and Claude up as contestants in the same experiment, run both on your own evaluation dataset, and score them with your use-case metrics. Include quality metrics like Answer Relevancy, Faithfulness, Task Completion, and custom G-Eval metrics, plus cost and latency. Then inspect where each model fails before choosing a winner.
How do I run experiments to compare agent configurations like different tools or system prompts?
Change one agent variable at a time: system prompt, tool set, model, retrieval config, memory, or planning instruction. Run every configuration on the same task dataset and score with trace-level metrics like Task Completion, Tool Correctness, Faithfulness, safety, latency, and cost. Use traces to see why the winning configuration won.
How do I run controlled experiments on my LLM app without changing production?
Use your current production version as the baseline, create offline variants, and run them against a fixed dataset and metric collection. Confident AI experiments let you compare prompts, AI connections, models, and app configurations without deploying the variant first.
What is the difference between LLM evaluation and LLM experimentation?
LLM evaluation tells you how good one version is. LLM experimentation tells you which version is better by comparing multiple versions against the same dataset and metrics. Evals produce a snapshot; experiments produce a decision.

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an "aha!" moment, who knows?

The Eval Platform for AI Quality & Observability

Confident AI is the leading platform to evaluate AI apps on the cloud, with metrics open-sourced through DeepEval.

Regression test and evaluate LLM apps.
Easily A|B test prompts and models.
Edit and manage datasets on the cloud.
LLM observability with online evals.
Publicly sharable testing reports.
Automated human feedback collection.

More stories from us...