
I want you to meet Johnny. Johnny’s a great guy — LLM engineer, did MUN back in high school, valedictorian, graduated summa cum laude. But Johnny had one problem at work: no matter how hard he tried, he couldn’t get his manager to care about LLM evaluation.
Imagine being able to say, “This new version of our LLM support chatbot will increase customer ticket resolutions by 15%,” or “This RAG QA’s going to save 10 hours per week per analyst starting next sprint.” That was Johnny’s dream — using LLM evaluation results to forecast real-world impact before shipping to production.
But like most dreams, Johnny's fell apart too.
Most evaluation efforts fail because:
- The metrics don't work: they aren't reliable, meaningful, or aligned with your use case.
- Even if the metrics work, they don't map to a business KPI: you can't connect the scores to real-world outcomes.
As a wise man once said, if LLM evaluation results don’t mean anything, who gives a sh*t?

This LLM evaluation playbook is about fixing that. By the end, you'll know how to design an outcome-based LLM testing process that drives decisions, and confidently say, "Our pass rate just jumped from 70% to 85%, which means we're likely to cut support tickets by 20% once this goes live." This way, your engineering sprint goals can start to become as simple as optimizing metrics.
You’ll learn:
- What LLM evaluation is, why 95% of LLM evaluation efforts fail, and how not to become a victim of pointless LLM evals
- How to connect LLM evaluation results to production impact, so your team can forecast improvements in user satisfaction, cost savings, or other KPIs before shipping.
- How to build an outcome-driven LLM evaluation process, including curating the right dataset, choosing meaningful metrics, and setting up a reliable testing workflow.
- How to create a production-grade testing suite using DeepEval to scale LLM evaluation, but only after you've aligned your metrics.
I'll also include code samples for you to take action on.
What Is LLM Evaluation and Why Is It Broken?
There’s good news and bad news: LLM evaluation works — but it doesn’t work for most people. And most people haven’t read this article, yet.
LLM evaluation is the process of systematically testing Large Language Model (LLM) applications using metrics like answer relevance, correctness, factual accuracy, and similarity. The core idea is straightforward: define a diverse set of test cases that provide sufficient use case coverage, then use these metrics to determine how many of them your LLM application passes whenever you tweak your prompts, model choices, or system architecture.
This is what a test case looks like, which evaluates an individual LLM interaction:
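In code, a test case could look something like this minimal sketch using DeepEval's LLMTestCase (the example values are made up for a customer support chatbot):

```python
from deepeval.test_case import LLMTestCase

# One LLM interaction: the user's input, what your app actually generated,
# plus optional parameters like the retrieval context and a labelled expected output
test_case = LLMTestCase(
    input="My order #1234 hasn't arrived yet, what's going on?",
    actual_output="Sorry about that! Order #1234 was delayed by the courier and is now expected on Friday.",
    expected_output="Apologize, confirm the courier delay, and give the customer the updated Friday delivery estimate.",
    retrieval_context=["Order #1234: shipped Monday, courier delay reported, new ETA Friday."],
)
```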

There’s an input to your LLM application, the generated “actual output” based on this input, and other dynamic parameters such as the retrieval context in a RAG pipeline or reference-based parameters like the expected output that represents labelled/target outputs. But fixing the process isn’t as simple as defining test cases and choosing metrics.
LLM evaluation often feels broken — because it’s not predictive of any desirable outcome meant to be delivered by your LLM, and therefore it doesn’t lead to anything meaningful.
You can’t point to improved test results and confidently say they’ll drive a measurable increase in ROI, and without a clear objective, there’s no real direction to improve. To address this, let’s look at the two modes of evaluation — and why focusing on end-to-end evaluation is key to staying aligned with your business goals.
Component-Level vs End-to-End Evaluation
LLM applications — especially with the rise of agentic workflows — are undoubtedly complex. Understandably, there can be many interactions across different components that are potential candidates for evaluation: embedding models interact with LLMs in a RAG pipeline, different agents may have sub-agents, each with their own tools, and so on. But for our objective of making LLM evaluation meaningful, we ought to focus on end-to-end evaluation instead, because that’s what users see.

End-to-end evaluation involves assessing the overall performance of your LLM application by treating it as a black box — feeding it inputs and comparing the generated outputs against expectations using chosen metrics. We’re focusing on end-to-end evaluation not because it’s simpler, but because these results are the ones that actually correlate with business KPIs.
Think about it: how can the performance of a triple nested RAG pipeline buried inside your agentic workflow possibly be used to explain an X% increase in automated support ticket resolution for a customer support LLM chat agent, for example?
LLM Evaluation Must Correlate to ROI
Working on DeepEval, we often see engineers turn away from LLM evaluation after trying it out. Sometimes they just aren’t ready yet — still in the prototyping phase — but more often, they can't align on the ROI.
So we asked ourselves: If LLM evaluation is supposed to quantify how well your system achieves its intended goals, why do teams fail to benefit from it in 95% of cases?
The answer was painfully simple — and it exposed a faulty assumption. People are evaluating their LLM applications, but not against their actual goals. Even worse, most users don’t realize this disconnect themselves, because these evaluation metrics are just so convincing.
For example, here are some common LLM evaluation metrics used to evaluate LLM apps across use cases like chatbots, RAG QA systems, agent planners, and writing assistants:
- Correctness — Measures whether the output is factually accurate and logically sound.
- Answer Relevancy — Assesses how directly the output addresses the user’s query.
- Tonality — Evaluates whether the response matches the desired tone (e.g. professional, friendly, concise).
- Faithfulness (i.e. the absence of hallucination) — Checks if the output stays grounded in the retrieved context in RAG pipelines without fabricating information.
- Tool Use — Verifies whether external tools (APIs, functions, databases) were used correctly and when appropriate.
These metrics seem valid — and they are. But the problem is that your LLM application doesn’t exist just to be “correct” or “relevant.” It exists to deliver ROI: to save time in internal workflows like RAG QA, or to reduce costs by automating customer support through LLM chat agents.
So what now? Should you start renaming your metrics to "user satisfaction", "revenue generated", or "time saved" instead? No, not during development. Those are production outcomes, not development metrics. What you can do is correlate your evaluation metrics with production outcomes, and use those metrics as reliable proxies for success.

Without a clear metric-outcome relationship, it's hard to even convince yourself that an improvement matters. When you have a metric-outcome connection in place, though, aligning engineering goals becomes clear: improve the right metric, and you're moving the needle toward business impact.
How to Set Up a Correlated Metric-Outcome Relationship
This might not be what you want to hear, but you need humans. LLM evaluation scales human judgement; it doesn't replace it. Repeat after me: LLM evaluation scales human judgement, it doesn't replace it.
Humans-in-the-Loop
If you don’t have enough end-user feedback to curate a dataset of 25–50 “good” and “bad” outcomes as LLM test cases, you don’t need LLM evaluation — you need more users.
If there's only one thing you remember from this article, it is this: humans are used to label desirable or undesirable OUTCOMES, not the expected scores of metrics you "think" will be useful. An outcome can be anything: a user closing the screen on your chatbot (bad outcome), loving the customer support experience after getting their ticket resolved (good outcome), or never interacting with your text-to-SQL system again (bad outcome). Whatever product metrics you use, you know them better than I do.
If you’re a large enterprise that requires rigorous evaluation before deployment, that’s understandable too. In that case, use your engineering team to crowdsource test cases. Ask everyone to contribute 5–10 examples and label them as “good” or “bad” outcomes. It’s not as ideal as real end-user feedback, but it still works.
At a minimum, you should have:
- 25–50 human-labeled input-output pairs, each with a verdict of desirable or undesirable outcome, and ideally with reasoning and expected outputs included, especially for the "bad" ones (see the sketch after this list)
- A roughly 50/50 mix of good and bad outcomes
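For concreteness, here's a rough sketch of what two human-labeled records might look like (plain Python, purely illustrative; map the fields onto whatever product signals you actually track):

```python
# Purely illustrative: each record is an input-output pair plus a human verdict on the OUTCOME
labeled_outcomes = [
    {
        "input": "How do I reset my password?",
        "actual_output": "Click 'Forgot password' on the login page and follow the emailed link.",
        "outcome": "good",  # ticket resolved, no follow-up needed
        "reasoning": "Correct answer; the user closed the ticket immediately.",
    },
    {
        "input": "Why was I charged twice this month?",
        "actual_output": "Please check our pricing page for more information.",
        "outcome": "bad",  # user escalated to a human agent
        "reasoning": "Deflected the question instead of addressing the duplicate charge.",
        "expected_output": "Acknowledge the duplicate charge and open a refund request for the extra payment.",
    },
]
```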
But don’t get carried away. It’s just as important not to start with too many test cases. You should be able to personally review each one. If the dataset gets so big that you find yourself skimming, that’s a problem — it’ll hurt the next step of setting up the metric-outcome relationship.
And if you’re thinking about using LLMs to generate synthetic test cases — don’t. We’re strict about this. Why? Because you’ll waste time. You’ll generate synthetic data, realize it doesn’t work, give up, and go back to eyeballing. A complete waste of time.
“The best part about synthetic data generation is you don’t have to do it.”
Remember, you’re trying to drive ROI with your LLM app in the real-world, not drive ROI in a simulation.
Aligning Your Metrics
At its core, your evaluation metrics should reflect whether the desirable outcome is achieved. If a human marked an output as an undesirable outcome, your evaluation should fail that test case. If it passes instead, that's a false positive, and a sign your metric is misaligned. Similarly, if a human says the output is "good" but the test fails, that's a false negative.

Each test case can be evaluated using one or multiple metrics, and a test case passes if and only if all of its metrics pass. The point isn't to blindly optimize for a metric score, but to make sure the metric, whether it's correctness, answer relevancy, or something else, actually produces a pass/fail result that agrees with what humans would say.
In the next section, we'll dive into choosing and combining metrics. For now, aim for this benchmark: your metrics should match the human-annotated outcome at least 95% of the time, which translates to a combined false positive and false negative rate below 5%. If your evaluation regularly disagrees with human feedback, you're optimizing for the wrong signal, and that undermines any effort to improve your LLM system.
How to Align Your LLM Evaluation Metrics
Very likely, you won't get your LLM metrics right on the first try, which is why you should treat metric design as an iterative process: start simple, experiment with scoring styles, tweak thresholds, refine LLM-as-a-judge prompts, and layer in multiple metrics when needed. Check out our metric selection guide for each step in the process.
Goal when implementing our metrics: make the test case pass/fail results consistent with the expected outcomes of the human-curated test cases.

We'll be using Johnny's support chatbot example for this section. Here, a resolved support ticket is the desirable, good outcome, and an unresolved one is the bad outcome.
1. Start with one metric
Pick the single most important metric aligned with your chatbot’s purpose — say, answer correctness. Start by testing whether the answers provided are factually accurate and useful for resolving support tickets. Remember, our goal is to see whether this aligns with human judgement.
Here’s how you can define an answer correctness metric in DeepEval ⭐, an open-source LLM evaluation framework I've been working on for the past 2 years:
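Something like the following minimal sketch (the criteria wording is just an example for Johnny's support chatbot; tweak it for your own use case):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually accurate and actually "
        "resolves the customer's support question in the input."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
# correctness.measure(test_case) then gives you a 0-1 score plus the judge's reasoning
```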
Find more information on implementing G-Eval with DeepEval.
DeepEval's metrics use LLM-as-a-judge, and in this particular example we used G-Eval, a SOTA way to create custom, task-specific metrics using CoT prompting for utmost reliability and accuracy (here is a great read on what G-Eval is if you're interested).
We'll talk more about LLM-as-a-judge later, but the reason we can use LLMs to evaluate LLMs is that LLMs actually align with human judgements (81%) more than humans align with each other, making them the best evaluator for LLM evaluation.
2. Choose between binary and continuous scores
Decide whether you want a simple pass/fail system or a more flexible scoring range. Binary scores (0 or 1) are straightforward and great for deployment decisions, but they lack nuance. Continuous scores (e.g., 0.0–1.0) let you capture degrees of quality and adjust thresholds based on your tolerance for errors. For example, an answer that’s mostly correct but slightly flawed might score 0.8 — giving you room to tune what counts as a “pass.”
Metric scores are all continuous in DeepEval (docs here) by default, but you can always make them binary by turning on strict_mode, which only passes if the score is perfect (i.e. 1/1):
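A sketch with strict_mode turned on (same illustrative correctness criteria as before):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output correctly resolves the support question in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    strict_mode=True,  # binary: the metric only passes on a perfect score
)
```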
Note that you don't have to use binary scores for all metrics. For example, you might want to make more one-dimensional metrics, such as hallucination, binary, while keeping relevancy continuous.
This also means that if you rely on binary scores completely, you might find yourself adding more metrics than necessary to capture all the dimensions of what makes an LLM output "good" or "bad".
3. Adjust your thresholds
If you’re using continuous scores, your threshold determines what counts as a “pass.” Set it too low, and you’ll get false positives. Set it too high, and you’ll get false negatives. Tune the threshold until your evaluation consistently agrees with the expected label of the 25–50 curated test cases you have.
All metrics in DeepEval range from 0–1, and here’s how you can adjust your threshold in DeepEval:
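A sketch of the same illustrative correctness metric with a custom threshold:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output correctly resolves the support question in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,  # this metric only passes when the score is >= 0.7
)
```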
4. Improve LLM-as-a-Judge
Often, no matter what thresholds you use — binary or continuous — the real issue lies in how your LLM-based evaluation is implemented. There are so many different LLM-as-a-judge techniques for scoring LLM evaluation metrics, with G-Eval being one of them (and in fact I’ve written a full comprehensive guide on all the different scoring methods for LLM evaluation metrics here).
In a nutshell, you'll need to incorporate different techniques such as few-shot prompting, using different metrics, switching from a referenceless to a reference-based approach, etc.
You can read the full guide on how to optimize LLM evaluators here, but here is a quick example showing how you can tune metric scores in DeepEval by using a reference-based G-Eval metric instead, comparing the "actual output"s to the "expected output"s of your LLM CS chatbot for the same correctness criteria we used in the earlier examples:
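A sketch of what that could look like (the criteria wording is again illustrative; the key change is including the expected output as an evaluation parameter):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output is factually consistent with the "
        "expected output when resolving the support question in the input."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,  # reference-based: judge against the human-labelled output
    ],
)
```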
You can also write out the evaluation steps, instead of a criteria, for a better-controlled G-Eval algorithm:
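For example (the steps below are illustrative; write steps that reflect the reasoning your human labelers actually used):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether facts in the actual output contradict facts in the expected output.",
        "Heavily penalize omission of details the customer needs to resolve their ticket.",
        "Vague or non-committal answers are acceptable only if the expected output is also vague.",
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```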
If you’re not sure what a test case is, or what it evaluates, click here.
5. Use multiple metrics
Sometimes a single metric like "correctness" doesn't fully explain why a test case fails. You might notice that users reject outputs not just for being incorrect, but for being irrelevant or overly verbose. In those cases, adding another metric, like "answer relevancy", can capture what correctness alone misses. Layering multiple metrics helps ensure a test case passes or fails exactly when it's supposed to.
Here’s a DeepEval example:
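A sketch combining the illustrative correctness metric with DeepEval's built-in AnswerRelevancyMetric (the metric choice and threshold are examples, not prescriptions):

```python
from deepeval import evaluate
from deepeval.metrics import GEval, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output correctly resolves the support question in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
relevancy = AnswerRelevancyMetric(threshold=0.7)

test_case = LLMTestCase(
    input="My order #1234 hasn't arrived yet, what's going on?",
    actual_output="Order #1234 was delayed by the courier and is now expected on Friday.",
)

# The test case only passes if BOTH metrics pass
evaluate(test_cases=[test_case], metrics=[correctness, relevancy])
```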
DeepEval offers 30+ ready-made metrics to cover all the use cases you may have so you don't have to build your own, and it runs on any LLM, in any environment, anywhere, anytime, all at once.
You can get started by visiting the official documentation for DeepEval.
Validating Your Metric-Outcome Relationship
Once you've aligned your metrics with human feedback, it's time to validate that your evaluation actually predicts real-world outcomes. Start by hiding some test case labels (blind data), then score them using your collection of metrics. As you add more data, your false positive and false negative rates should stay the same. If they don't, your metrics aren't generalizing, and likely haven't covered enough edge cases.

Keep iterating until your test case pass/fail rates hold steady, even as you scale up.
Repeat this loop until it sticks:
- Add more blind-labeled test cases (e.g., new edge cases or borderline outputs), making sure these are also human-generated
- Run your metrics without looking at the labels
- Compare metric results with the human-labeled outcomes
- Track your false positive and false negative rate (aim for <5% combined; see the sketch after this list)
- If alignment breaks, revisit your metrics, thresholds, and all the other techniques we talked about.
- Repeat until metric alignment remains stable across all new data
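The bookkeeping for this loop is simple enough to do in a few lines of plain Python (purely illustrative; the results here are made-up pairs of human label and metric verdict):

```python
# Purely illustrative: pairs of (human label, did all metrics pass?)
results = [("good", True), ("bad", False), ("bad", True), ("good", True), ("good", False)]

false_positives = sum(1 for label, passed in results if label == "bad" and passed)
false_negatives = sum(1 for label, passed in results if label == "good" and not passed)
misalignment = (false_positives + false_negatives) / len(results)

# Aim for a combined misalignment rate below 5% before trusting your metrics
print(f"FP: {false_positives}, FN: {false_negatives}, misalignment: {misalignment:.0%}")
```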
At the end, you should be able to draw a nice graph of test case pass rate vs. your desired outcome. For a customer support use case, your desired outcome would be the ticket resolution rate, which should give you a graph looking something like this:

Even if your passing rate hits 100%, your "desirable outcome proportion" might still fall short of 1. This can happen for several reasons — your evaluation dataset might lack sufficient coverage, or in some cases, there are simply limitations to what AI can handle, and that’s perfectly okay.
How to Scale LLM Evaluations (When You’re Ready)
Just like building a startup, you shouldn’t scale aggressively until you’ve found PMF.
In your case, it will be MOF (metric-outcome fit, trust me on this), and once you've reached this point it means your metrics have meaning and can finally be used to evaluate LLM test cases and tie them to real-world ROI.
Set up an LLM testing suite
You need an LLM testing suite. I don't care which one you use, but please don't go for CSVs. Comparing hundreds of individual test cases with potentially multiple metrics across a few pre-deployment test runs is extremely ineffective, and if you've gone through all the trouble to align your metrics, you should either build something of your own or use something off the shelf like Confident AI, the DeepEval platform (slightly biased).
But seriously, Confident AI is free and 100% integrated with DeepEval. We’ve done all the hard work for you already, and it’s in dark mode:

Just run this command in the CLI to get started:
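Assuming you already have deepeval installed, the command should look something like this (check the quickstart docs linked below if it has changed):

```bash
deepeval login
```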
Here is the quickstart docs for Confident AI.
Unit testing in CI/CD (for regressions)
LLM evaluation should be integrated directly into your CI/CD pipeline. Treat your evaluation suite like unit tests: if the percentage of passing test cases drops (i.e. there's a regression), deployment should be automatically blocked. Why? Because you now know that your LLM application will almost certainly bring in less value in production, so don't ship it.
This is also where your LLM testing suite comes in. You should set up a workflow that:
- Runs unit tests in CI/CD pipelines
- Uploads the results to your testing suite of choice for data persistence and collaboration
If you use DeepEval + Confident AI, this is achieved by creating a test file, which is akin to Pytest for LLMs:
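A minimal sketch of such a test file (the filename and test data are placeholders; in practice you'd load your human-curated dataset instead of hardcoding test cases):

```python
# test_chatbot.py
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output correctly resolves the support question in the input.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# In practice, load your human-curated dataset instead of hardcoding test cases
test_cases = [
    LLMTestCase(
        input="My order #1234 hasn't arrived yet, what's going on?",
        actual_output="Order #1234 was delayed by the courier and is now expected on Friday.",
    ),
]

@pytest.mark.parametrize("test_case", test_cases)
def test_support_chatbot(test_case: LLMTestCase):
    # Fails the CI job if any metric fails, blocking the deployment
    assert_test(test_case, [correctness, AnswerRelevancyMetric(threshold=0.7)])
```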
Finally, create a .yaml file to execute this test file using the deepeval test run command in CI/CD environments like GitHub Actions.
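A rough sketch of what that workflow file could look like for GitHub Actions (the file name, secret names, and Python version are assumptions; adapt them to your project):

```yaml
# .github/workflows/llm-tests.yml (names, versions, and secrets are placeholders)
name: LLM regression tests
on: pull_request

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Run DeepEval test suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}        # or whichever judge model you use
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}  # only if you report results to Confident AI
        run: deepeval test run test_chatbot.py
```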
When your testing file runs, everything will be populated automatically on Confident AI. Again, here is the Confident AI documentation for this in full.
Prompt and model tracking
You should also keep track of your LLM system configurations when running unit tests. After all, you don't want to "forget" what the implementation of your LLM app was a week ago, when the pass rate was at its highest.
You can do this by logging hyperparameters in DeepEval (in the same test file we saw above):
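A rough sketch based on DeepEval's hyperparameter-logging decorator (the model name, prompt template, and extra keys are placeholders; check the docs linked below for the exact signature):

```python
import deepeval

# Log the model and prompt template used for this test run,
# plus any extra key-value pairs you care about (all values here are placeholders)
@deepeval.log_hyperparameters(model="gpt-4.1", prompt_template="You are a helpful support agent...")
def hyperparameters():
    return {"temperature": 0, "chunk_size": 500}
```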
This allows you to also compare parameters like this:

Full documentation here.
Debugging evals with tracing
Even though we're evaluating the end-to-end LLM system, you should also add tracing to debug which components of your system might not be delivering the passing test cases that you want.
There are tools like Datadog or New Relic available, but LLM-specialized observability tools like Confident AI allow you to incorporate tracing within your LLM testing suite:
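For illustration, here's a minimal sketch assuming DeepEval's @observe tracing decorator (the component functions are hypothetical, and exact decorator usage may differ between versions, so treat this as a starting point and check the docs):

```python
from deepeval.tracing import observe

# Hypothetical components of Johnny's support chatbot; the decorator records
# each component's inputs and outputs so failing test cases can be debugged
@observe()
def retrieve(query: str) -> list[str]:
    return ["Order #1234: courier delay reported, new ETA Friday."]

@observe()
def support_chatbot(query: str) -> str:
    context = retrieve(query)
    # call your LLM here with the retrieved context
    return f"Based on our records: {context[0]}"
```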

It's your choice whether to use Confident AI or something else, but the docs for Confident AI are here.
Adding more human feedback to dataset
Continually adding fresh human feedback ensures your metrics stay relevant over time. Without it, your evaluation risks drifting into irrelevance or redundancy — scoring well on outdated patterns while missing new failure modes. Regularly check that your metric scores still align with human judgment the same way they did a week, a month, or even a year ago.
Confident AI offers APIs through DeepEval for you to queue human feedback for ingestion into datasets; see the Confident AI documentation for the exact API.
Production monitoring
Production monitoring isn’t the first priority — but once everything else is in place, it becomes a powerful validation layer. Are users satisfied with outputs your tests marked as “passing”? Are they abandoning flows your metrics said were “good”?
You can also enable online metrics to score live responses (see docs for how Confident AI can do it here), but only do this after you’ve established strong offline evaluation, good test coverage, and clear metric-outcome alignment. Otherwise, you’re just adding noise.
Conclusion
In this article, we discussed what LLM evaluation is, the difference between component-level and end-to-end evaluation, and why end-to-end evaluation is the mode of evaluation you want to be looking at when tying testing results to meaningful business KPIs.
This is because LLM evaluation should be outcome-based, and outcomes are things such as user satisfaction, retention, and so on. You should spend great effort aligning your test case pass/fail rate with business KPIs, in order to predict how development testing results will drive ROI in production even before deployment.
The steps are simple:
- Collect human-labeled test cases
- Align your metrics such that the test case pass/fail rate agrees with the outcomes from your human-curated test cases (<5% combined false positive/negative rate is ideal)
- Keep iterating on your metrics until the passing rate stays consistent, even on new test cases
With this, you should be able to justify how LLM evaluation is helping you, and not run LLM evals just because it is “best practice”.
A lot of this workflow can be automated with DeepEval + Confident AI, and in fact this is why we built our products this way. You wouldn't have to build your own test suite, play around with messy CSV files for dataset curation, or stitch together disjointed products like Datadog and Google Sheets for debugging your LLM app.
Don’t forget to give ⭐ DeepEval a star on Github ⭐ if you found this article insightful, and as always, till next time.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?