<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Weekly Dose of Confident AI</title>
    <link>https://www.confident-ai.com/blog</link>
    <description>Resources to help teams build reliable AI systems - guides, tutorials, personal experiences, and essays to test LLM apps in every way possible.</description>
    <language>en-us</language>
    <lastBuildDate>Wed, 17 Jun 2026 12:01:52 GMT</lastBuildDate>
    <atom:link href="https://www.confident-ai.com/feed.xml" rel="self" type="application/rss+xml"/>
    
    <item>
      <title>Human-in-the-Loop Workflows for AI Agent Evaluation: Complete Guide</title>
      <link>https://www.confident-ai.com/blog/human-in-the-loop-llm-evaluation-guide</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/human-in-the-loop-llm-evaluation-guide</guid>
      <description>A practical guide to human-in-the-loop workflows for AI agent evaluation: how SMEs review AI agent failures, align automated metrics, and improve evaluation datasets.</description>
      <content:encoded><![CDATA[Automated evals are how AI teams scale AI quality, but not every quality decision should begin and end with a metric score. This is especially true for AI agents. A bad response might come from the final answer, the retrieved context, a tool call, a routing decision, a policy boundary, or a multi-turn breakdown.

LLM judge metrics can catch a lot of this, but they still need human judgment to define what good output looks like, check whether metric scores match human expectations, and keep evaluation datasets grounded in truth.

Human-in-the-loop workflows keep evals aligned with human expectations and make that process systematic. Instead of asking subject matter experts (SMEs), QA teams, or domain experts to inspect random outputs forever (not scalable), they let you route the right cases to the right reviewers, capture structured feedback, and turn that feedback into better evals, to the point where reviewers are barely needed.

In this guide, I'll walk through the three human-in-the-loop wor...]]></content:encoded>
      <pubDate>Sat, 13 Jun 2026 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https:/images/blog/human-in-the-loop-ai-agent-evaluation-cover.png" type="image/png"/>
    </item>
    <item>
      <title>The Complete Guide to LLM Experimentation: Compare Prompts, Models, and Agents</title>
      <link>https://www.confident-ai.com/blog/llm-experimentation-complete-guide</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-experimentation-complete-guide</guid>
      <description>A practical guide to running LLM experiments across prompts, models, tools, datasets, metrics, production A/B tests, and human-in-the-loop feedback loops.</description>
      <content:encoded><![CDATA[Every AI team eventually ends up in the same debate: is this new prompt actually better, or do we just like the 3 examples we happened to look at?

This is what LLM experimentation is for. It gives you a controlled way to compare two or more versions of your AI app against the same dataset, using the same metrics, before you push changes to production.

No vibes, no cherry-picked examples, and no gut feel. Instead, you get a repeatable process for deciding which prompt, model, tool set, or agent configuration should actually ship.

In this guide, I'll walk through the full LLM experimentation workflow: what LLM experimentation means, how it differs from evaluation, how to choose what to optimize, how to curate datasets and metrics, how to interpret results, how to extend experiments into production, and how Confident AI supports the loop end-to-end.

 TL;DR

 LLM evaluation answers "how good is this version?" LLM experimentation answers "which version is better?"
 The only fair comparison...]]></content:encoded>
      <pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https:/images/blog/llm-experimentation-cover.png" type="image/png"/>
    </item>
    <item>
      <title>Three Ways AI Systems Fail Even When Evals Pass</title>
      <link>https://www.confident-ai.com/blog/three-ways-ai-systems-fail-even-when-evals-pass</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/three-ways-ai-systems-fail-even-when-evals-pass</guid>
      <description>AI systems can pass evals while still behaving incorrectly. This post explores three common failure modes that slip through output-based evaluation.</description>
      <content:encoded><![CDATA[The gap between correctness and behavior

Most teams building AI systems today have some form of evaluation in place. They run test cases, measure accuracy, and check whether the system produces the expected output. In isolation, these evaluations often look reassuring. The system answers correctly. The metrics look strong. The model appears to be working.

And yet, when those same systems are exposed to real usage, something starts to break down. The failures are not always obvious. The system still produces plausible answers. In some cases, it even continues to produce correct ones. But the behavior becomes inconsistent, brittle, and difficult to trust.

This is not a fringe issue. Across multiple studies and production environments, AI systems routinely produce answers that appear correct while relying on flawed reasoning, incomplete context, or unstable internal decision paths. Research into AI-assisted search has found error rates that are far higher than most teams would expect, e...]]></content:encoded>
      <pubDate>Tue, 07 Apr 2026 00:00:00 GMT</pubDate>
      <author>Brian Neville-O&apos;Neill</author>
      <media:thumbnail url="https:/authors/brian-no.png"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/3w4ZyLX3xx3xddoTDMYmvY/a1ee43279d92f6341f6e35adad208dc1/AI_mishaps_and_the_correct_path.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>Your AI Agent Passed Evals. That’s the Problem.</title>
      <link>https://www.confident-ai.com/blog/your-ai-agent-passes-evals-thats-the-problem</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/your-ai-agent-passes-evals-thats-the-problem</guid>
      <description>Passing evals doesn’t mean your system works. It means your tests didn’t catch how it fails.</description>
      <content:encoded><![CDATA[Most teams take passing evals as a sign their system is working. It’s usually a sign they’re measuring the wrong thing.

You run your test suite. The outputs look good. The agent completes tasks correctly. Maybe you track accuracy across a handful of scenarios. Everything points in the same direction: This is ready. Then you ship it.

A couple of weeks later, things start to feel off. Not catastrophic failures. The system still “works.” But not in a way you fully trust. It does the right thing for the wrong reasons. It skips steps you assumed were happening. It makes decisions that technically produce a correct result, but would never pass review in a real workflow.

Nothing in your evals told you this was going to happen. Because your evals weren’t designed to catch it.

Most evaluation setups answer a very specific question: did the model produce the right output? That’s a useful question. It’s just not the one that matters once the system starts making decisions.

Passing evals does...]]></content:encoded>
      <pubDate>Mon, 06 Apr 2026 00:00:00 GMT</pubDate>
      <author>Brian Neville-O&apos;Neill</author>
      <media:thumbnail url="https:/authors/brian-no.png"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/MaRgbOsLfzCasWdlmrWun/1441cf71a4a81927d0db411fd6cf12fc/agent_first_llm_evaluation.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>Launch Week Day 5 (5/5): Generate Datasets from Your Data Sources</title>
      <link>https://www.confident-ai.com/blog/launch-week-q1-2026-day-5-dataset-generation</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/launch-week-q1-2026-day-5-dataset-generation</guid>
      <description>Your best evaluation data already exists — it&apos;s sitting in Google Drive, SharePoint, Notion, and S3. Dataset generation on Confident AI turns your existing documents into evaluation-ready datasets automatically.</description>
      <content:encoded><![CDATA[Welcome to the final day of Confident AI's Launch Week. We've spent the week shipping error analysis, scheduled evals, auto-ingest, and auto-categorization, and today we're closing out with the feature teams have been asking us for more than anything else.

Launch Week Day 5 (5/5): Generate evaluation datasets directly from your data sources — Google Drive, SharePoint, Notion, S3, and more.

 The Dataset Problem Nobody Wants to Admit

Here's the dirty secret of LLM evaluation: most teams are evaluating on datasets they made up.

Not "made up" in the malicious sense, but handcrafted. Someone on the team sat down, wrote 30–50 question-answer pairs based on what they thought users would ask, and called it a golden dataset. Maybe they got a PM to review it. Maybe they didn't. Either way, the dataset reflects the team's imagination, not reality.

And the problem compounds. Your RAG pipeline retrieves from a knowledge base that has thousands of documents — product specs, HR policies, legal co...]]></content:encoded>
      <pubDate>Sat, 04 Apr 2026 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/1zQ4VylCaQKNaoBFKbmru6/9832ea954c5dfd92c4c643992c34151f/Screenshot_2026-04-07_at_2.53.23_AM.png" type="image/png"/>
    </item>
    <item>
      <title>Launch Week Day 4 (4/5): Auto-Categorize Traces &amp; Threads</title>
      <link>https://www.confident-ai.com/blog/launch-week-q1-2026-day-4-trace-categorization</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/launch-week-q1-2026-day-4-trace-categorization</guid>
      <description>You can&apos;t improve what you can&apos;t see. Auto-categorization tells you what your users are actually asking, detects response drift, and shows you which categories perform best — and which ones need help.</description>
      <content:encoded><![CDATA[Welcome to Day 4 of Confident AI's Launch Week.

Day 1 was Automated Error Analysis. Day 2 was Scheduled Evals. Day 3 was Auto-Ingest Traces. Today we're launching something that changes how you see your production traffic.

Launch Week Day 4 (4/5): Auto-categorize traces and threads.

 You Don't Know What Your Users Are Asking

Here's the uncomfortable truth about most AI agents in production: you have thousands of traces flowing through your system, and you have no structured understanding of what users are actually asking about.

You might have vibes. You might have anecdotes from support tickets. Maybe someone on your team pulls up a few traces every week and eyeballs them. But if I asked you right now — "what are the top 10 categories of questions your users asked last week, and which ones is your model struggling with?" — most teams can't answer that.

And if you can't answer that, you can't prioritize. You're guessing about what to improve, which prompts to rewrite, and which fail...]]></content:encoded>
      <pubDate>Fri, 03 Apr 2026 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/z4AOyzhvie3lP2E0PkWwj/8236bebef29ddda02fc9a5f3ca92cf18/Screenshot_2026-04-03_at_2.20.54_AM.png" type="image/png"/>
    </item>
    <item>
      <title>Launch Week Day 3 (3/5): Auto-Ingest Traces into Datasets &amp; Annotation Queues</title>
      <link>https://www.confident-ai.com/blog/launch-week-q1-2026-day-3-auto-ingest-traces</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/launch-week-q1-2026-day-3-auto-ingest-traces</guid>
      <description>Production traces are the best dataset you’ll ever get — but most teams never turn them into one. With auto-ingest, your traces flow straight into datasets and annotation queues, continuously.</description>
      <content:encoded><![CDATA[Welcome to Day 3 of Confident AI’s Launch Week.

Day 1 was Automated Error Analysis. Day 2 was Scheduled Evals. Today is the missing piece that makes both of those workflows actually sustainable.

Launch Week Day 3 (3/5): Auto-ingest traces into datasets and annotation queues.

 The Most Valuable Data You’re Not Using

Every LLM team says they want “more real data”.

Then they ship to production… and their traces just sit there.

Not because they don’t care — but because turning traces into something useful is a surprisingly annoying workflow:

1. Export traces from your observability system (or your internal logs).
2. Normalize the schema so you can use them as a dataset (inputs, outputs, context, tool calls, metadata).
3. Sample intelligently (because you can’t label everything).
4. Route the right examples to humans for review (the ones that actually matter).
5. Do it again next week because the world changed and your model drifted.

If this sounds familiar, it’s because most teams e...]]></content:encoded>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
      <author>Brian Romain</author>
      <media:thumbnail url="https:/authors/brian.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/5fmJSQA1J8GoVyHv210lt2/097b707f26b01e5215d6b8faf55ae15a/Screenshot_2026-04-02_at_2.40.35_AM.png" type="image/png"/>
    </item>
    <item>
      <title>Launch Week Day 2 (2/5): Scheduled Evals</title>
      <link>https://www.confident-ai.com/blog/launch-week-q1-2026-day-2-scheduled-evals</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/launch-week-q1-2026-day-2-scheduled-evals</guid>
      <description>Everyone agrees evals should run regularly. But nobody remembers to actually run them. Scheduled Evals fixes that — set the frequency, configure your mappings, and never scramble before a release again.</description>
      <content:encoded><![CDATA[Welcome to Day 2 of Confident AI's Launch Week. Yesterday we launched Automated Error Analysis — today, we're tackling something that sounds simple but trips up almost every team I talk to.

Launch Week Day 2 (2/5): Scheduled Evals.

 The Workflow Nobody Talks About

PMs and engineers run evaluations in CI/CD before deployment — that's table stakes. If you're not doing that, you've got bigger problems.

But there's another evaluation workflow that no one talks about: the one where stakeholders need evals run every X days so the team can sit together, review the results, find problems in their AI agents, and fix them before shipping.

It's the "recurring quality check": the process that should be running in the background on a cadence, catching regressions and drift that a one-time CI/CD run will never surface.

Except... they forget. Every time.

 The Problem

Here's what the typical workflow looks like today:

1. Someone says "we should run evals regularly." Everyone agrees. Heads nod...]]></content:encoded>
      <pubDate>Wed, 01 Apr 2026 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/1BrU5al6FZU31Ykpao9hqc/5434572c6082829dcd5a1fbb34268193/Screenshot_2026-04-01_at_2.12.00_AM.png" type="image/png"/>
    </item>
    <item>
      <title>Announcing Launch Week Q1 &apos;26! Day 1: Automated Error Analysis</title>
      <link>https://www.confident-ai.com/blog/launch-week-q1-2026-day-1-error-analysis</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/launch-week-q1-2026-day-1-error-analysis</guid>
      <description>Error analysis used to mean pulling traces in code, hacking together an LLM to recommend metrics, and hoping for the best. Not anymore.</description>
      <content:encoded><![CDATA[Today, I'm excited to announce Confident AI's first Launch Week of 2026. This week, we're doing 5 days of launches to show something new every day, and we're kicking it off with a feature we've been working on for quite some time.

Launch Week Day 1 (1/5): Error Analysis, Fully Automated.

 What is Error Analysis?

Error analysis is the process of identifying failure modes in your LLM app's production traces through human review. You look at real outputs, figure out where and why things are going wrong, and use those findings to pick the right evaluation metrics to monitor going forward. It's the critical step between "my LLM app is live" and "I actually know what's breaking and how to catch it automatically."

 The Problem

If you've ever tried to set up error analysis for your LLM app in production, you know the pain. The typical workflow looks something like this:

1. Pull traces from production. You write some code to export your traces, maybe from a logging pipeline or an obse...]]></content:encoded>
      <pubDate>Tue, 31 Mar 2026 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/2r21o7OYHfPhpO15848i6q/f14618230b7707fc3982289f61347604/Screenshot_2026-03-31_at_3.14.40_AM.png" type="image/png"/>
    </item>
    <item>
      <title>Multi-Turn LLM Evaluation in 2026: What You Need to Know</title>
      <link>https://www.confident-ai.com/blog/multi-turn-llm-evaluation-in-2026</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/multi-turn-llm-evaluation-in-2026</guid>
      <description>In this article, I&apos;ll break down multi-turn LLM evaluation — how it differs from single-turn, what metrics actually matter, and how to implement it.</description>
      <content:encoded><![CDATA[Last month I got on a call with a team building a voice AI agent for insurance claims. They had all the right pieces: retrieval pipeline, tool calling, memory, handoffs to human agents. Their single-turn evals were passing at 92%. And yet, every week, their support inbox filled up with complaints about the bot "going in circles" and "forgetting what I just said."

The problem wasn't the model. It wasn't the prompts. It was that they were evaluating each turn in isolation — like grading a movie by looking at random frames instead of watching the film.

Multi-turn LLM evaluation is fundamentally different from single-turn evaluation, and in 2026, most teams are still getting it wrong. They either repurpose single-turn metrics and hope for the best, or they skip conversational evaluation entirely because it feels too complex.

In this article, I'll break down everything you need to know about evaluating multi-turn LLM apps: what it actually is, how it differs from single-turn, what metrics ma...]]></content:encoded>
      <pubDate>Sun, 22 Mar 2026 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6NDKZI0pMcyfEc9q3ZiM3r/e3c6d58c187e0147ab41781d8c726a11/multi-turn-llm-evaluation.jpeg" type="image/jpeg"/>
    </item>
    <item>
      <title>The Step-By-Step Guide to MCP Evaluation</title>
      <link>https://www.confident-ai.com/blog/the-step-by-step-guide-to-mcp-evaluation</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/the-step-by-step-guide-to-mcp-evaluation</guid>
      <description>This article will teach you everything you need to evaluate MCP-based LLM applications.</description>
      <content:encoded><![CDATA[AI is getting smarter by the day, but intelligence alone isn’t enough. To be truly useful, AI needs to do more than just answer questions — it needs to complete real tasks.

Enter MCP (Model Context Protocol) — a framework that turns everyday LLM applications into AI agents on steroids.

Introduced by Anthropic in late 2024, MCP enables large language models to interact with the outside world through a standardized protocol. Instead of reinventing the wheel each time, developers can now plug their AI models into a shared ecosystem of resources. This makes AI applications more scalable, efficient, and capable of tackling a wider range of tasks.

But, as we all know, with great power comes great complexity. Giving AI access to MCP servers is one thing, but making sure it uses them correctly? That’s a completely different story. Is your AI making good use of the MCP’s resources? Passing the right arguments? Completing the actual task?

That’s where MCP evaluation comes in. In this guide, ...]]></content:encoded>
      <pubDate>Sat, 25 Oct 2025 00:00:00 GMT</pubDate>
      <author>Cale</author>
      <media:thumbnail url="https:/authors/cale.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/2J4hTr7si0VXtHWyrUMME/99550512b13215a6baba9a9126f2ff7d/MCP_Blog_Cover_Featured.png" type="image/png"/>
    </item>
    <item>
      <title>AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows</title>
      <link>https://www.confident-ai.com/blog/definitive-ai-agent-evaluation-guide</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/definitive-ai-agent-evaluation-guide</guid>
      <description>A practical guide to evaluating AI agents with LLM metrics and tracing—plus when human review matters, how it calibrates judges, and workflows that combine CI, sampling, and production signals.</description>
      <content:encoded><![CDATA[AI agents are complicated: models calling tools, tools invoking other agents, nested swarms, and systems that combine all of the above. It is confusing just to describe.

“AI agents” is a broad label. It can mean single-shot background jobs, multi-turn conversational agents like RAG chatbots, voice assistants, or agents inside a larger agentic stack. More surface area means more ways a run can go sideways:

 Wrong tools or arguments, or misreading what a tool returned
 Retry or planning loops that never converge
 False task completion: the transcript says “done” but nothing actually changed
 Drift from the user’s intent across turns
 Traces that look fine to an LLM judge while still blowing cost, latency, or patience
 Busywork in the log: circular summaries or reasoning thrash without a real action

That variety is why agent evaluation feels overwhelming. It gets tractable once you treat it like any other critical system: clear scenarios, full traces, metrics that match the product, and ...]]></content:encoded>
      <pubDate>Tue, 07 Oct 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/1u4pT5sUA9faoRS86JX72X/0b8fbc2e376934b03b7803f93eed8928/ChatGPT_Image_Oct_8__2025__06_55_45_PM.png" type="image/png"/>
    </item>
    <item>
      <title>LLM Arena-as-a-Judge: LLM-Evals for Comparison-Based Regression Testing</title>
      <link>https://www.confident-ai.com/blog/llm-arena-as-a-judge-llm-evals-for-comparison-based-testing</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-arena-as-a-judge-llm-evals-for-comparison-based-testing</guid>
      <description>In this article, you&apos;ll learn everything about running LLM Arena-as-a-judge as a novel way to regression test LLMs.</description>
      <content:encoded><![CDATA[Let’s imagine you’re building an LLM evaluation framework called DeepEval, and you talk to 20+ users a week. Out of those 20 users, over 15 of them ask this very question:

 Which metrics should I use if I need to compare the [fill in the blank here] of different prompts/models?

Clearly, most people are still confused about how each metric works and which use cases it is for, and that’s not good. To make testing your prompts and models more accessible, we ought to use something more intuitive and simple to understand.

So in this article, I’m introducing LLM Arena-as-a-Judge, a novel way to run automated, scalable, comparison-based LLM-as-a-judge that just tells you which iteration of your LLM app worked best.

With LLM Arena-as-a-Judge, you don’t pick a metric. You pick the better output. That’s it.

 TL;DR

In this article, you’ll learn that:

 LLM Arena is an Elo rating system based on human feedback, and how to replace humans with LLM judges.
 LLM “Arena”-as-a-judge can be extended not just...]]></content:encoded>
      <pubDate>Sun, 06 Jul 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6xjYm7c3BtuD1YW9tL0H2L/571fe718cac170d7ac1cd8d7d6605807/686a903531fab9fb78ae29e7_llm-arena.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>RAG Evaluation Metrics: Assessing Answer Relevancy, Faithfulness, Contextual Relevancy, And More</title>
      <link>https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more</guid>
      <description>This article will go through everything you&apos;ll need for RAG evaluation, including metrics, and best practices.</description>
      <content:encoded><![CDATA[Building a Retrieval-Augmented Generation (RAG) pipeline isn’t mission impossible; knowing if it’s actually working is. You might fine-tune prompts, tweak retrievers, and switch models, but if your system is still hallucinating or citing the wrong context, what’s really broken? Without the right RAG evaluation metrics, it’s guesswork.

Sure, 2025 is the year of AI agents, but let’s face it: most agentic systems still have a RAG pipeline somewhere in their AI workflow, and it’s vital that you’re able to secure the quality of that too.

So, in this article, we’ll go through everything you’ll need for RAG evaluation:

 What is RAG evaluation, how is it different from regular LLM and AI agent evaluation, and common points of failure
 Retriever metrics such as contextual relevancy, recall, and precision
 Generator metrics such as answer relevancy and faithfulness
 How to run RAG evaluation: both end-to-end and at the component level
 Best practices, including RAG evaluation in CI pipelines and...]]></content:encoded>
      <pubDate>Tue, 03 Jun 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/3L9gNQbcb2EXFje31M1cbT/088195ef1a8dac61c32816b4aa453732/rag-eval.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>LLM Evals Framework That Predicts ROI: A Step-by-Step Guide</title>
      <link>https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook</guid>
      <description>Most LLM evals fail because their metrics don&apos;t predict ROI. Build outcome-based evals that correlate with business KPIs.</description>
      <content:encoded><![CDATA[I want you to meet Johnny. Johnny’s a great guy — LLM engineer, did MUN back in high school, valedictorian, graduated summa cum laude. But Johnny had one problem at work: no matter how hard he tried, he couldn’t get his manager to care about LLM evaluation.

Imagine being able to say, “This new version of our LLM support chatbot will increase customer ticket resolutions by 15%,” or “This RAG QA’s going to save 10 hours per week per analyst starting next sprint.” That was Johnny’s dream: using LLM evaluation results to forecast real-world impact before shipping to production.

But like most dreams, Johnny’s fell apart too.

Johnny's problem isn't unique. Across the industry, LLM evals efforts are disconnected from business outcomes. Teams run evaluations, hit 80% pass rates, and still can't answer: 'So what?'

Most evaluation efforts fail because:

 The metrics didn’t work — they weren’t reliable, meaningful, or aligned with your use case.
 Even if the metrics worked, they didn’t map t...]]></content:encoded>
      <pubDate>Fri, 02 May 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/5UHHNUo8SVtOGAJtBqPT8j/3136ca2e93f770e710a4eb5cfadd149e/image.png" type="image/png"/>
    </item>
    <item>
      <title>G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation</title>
      <link>https://www.confident-ai.com/blog/g-eval-the-definitive-guide</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/g-eval-the-definitive-guide</guid>
      <description>This article goes through everything on G-Eval for anyone to easily evaluate LLM apps on any task specific criteria.</description>
      <content:encoded><![CDATA[Evaluating Large Language Model (LLM) applications is just as important as unit testing traditional software. But building an effective LLM evaluation pipeline isn’t so straightforward. A strong eval workflow demands a wide range of custom LLM metrics tailored to your LLM app’s task, goals, characteristics, and quality standards.

That’s where G-Eval comes in.

G-Eval is an LLM-eval that makes it easy to build research-backed, LLM-as-a-judge custom metrics, often from just a single sentence written in plain language. An evaluation prompt for G-Eval might look something like this:

python
prompt_template = """
🧠 Answer Correctness Evaluation
Task:
Rate the assistant's answer based on how correct it is given the question and context.

Criteria:
Correctness (1–5) — Does the answer factually align with the provided context and directly address the question?

Steps:
1. Read the question and context.
2. Check if the answer is factually correct and relevant.

Question: {question}
Context: {context...]]></content:encoded>
      <pubDate>Wed, 30 Apr 2025 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/4FfM4rbQQD8vndPQTGZLDM/8cbef433bf39000f8a76af4e45b2a3d2/image.png" type="image/png"/>
    </item>
    <item>
      <title>Top LLM Evaluators for Testing LLM Systems at Scale</title>
      <link>https://www.confident-ai.com/blog/top-llm-evaluators-for-testing-llms-at-scale</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/top-llm-evaluators-for-testing-llms-at-scale</guid>
      <description>In this article, we&apos;ll go through all the top LLM evaluators in 2025 including G-Eval and other LLM-as-a-judges.</description>
      <content:encoded><![CDATA[When talking to a user of DeepEval last week, here’s what I heard:

 “We [a team of 7 engineers] just sit in a room for 30 minutes in silence to prompt for half an hour while entering the results into a spreadsheet before giving the thumbs up for deployment” 

For many LLM engineering teams, pre-deployment checks still involve eyeballing outputs and “vibe checks.” A big reason is that Large Language Model (LLM) applications are unpredictable, which makes testing them a significant challenge.

While it’s essential to run quantitative evaluations through unit tests to catch regressions in CI/CD pipelines before deployment, the subjective and variable nature of LLM outputs makes principles in traditional software testing difficult to transfer.


LLM evaluators save testing time

But what if there were a way to address this unpredictability to enable unit-testing for LLMs?

This is exactly why we need to discuss LLM evaluators, which tackle this challenge by using LLM...]]></content:encoded>
      <pubDate>Mon, 21 Apr 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/7p5oFNrC44eaeLckrKl7m9/a201684a5b4b4c0d9e8a15a00072c09b/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>How I raised Confident AI&apos;s $2.2M seed round in 5 days</title>
      <link>https://www.confident-ai.com/blog/how-i-closed-confident-ais-2-2m-seed-round-in-5-days</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-i-closed-confident-ais-2-2m-seed-round-in-5-days</guid>
      <description>Announcing Confident AI&apos;s seed round, with participation from a bunch of great investors.</description>
      <content:encoded><![CDATA[Today I’m proud to announce Confident AI’s oversubscribed $2.2M seed round with participation from Y Combinator, Flex Capital, Oliver Jung, Vermilion Cliffs Ventures, Liquid 2 Ventures, January Capital, and Rebel Fund. My cofounder Kritin and I couldn’t be more grateful for our investors’ trust in our open-source approach to LLM evaluation.

But this article isn’t about money or LLM evaluation. It is all about my first experience fundraising as a first-time founder.

 How It Started

As you may know, we’ve been building DeepEval for the past year and have since grown it to become one of the most adopted, if not already the most adopted, LLM evaluation frameworks in the world. DeepEval is used at enterprises such as BCG, AstraZeneca, Stellantis, and Mercedes-Benz (what are LLMs doing in cars?), to name a few, and what started off as a bootstrapped open-source package became the thing that got us into YC.


DeepEval's unwavering growth

However, building in this space is tough — there...]]></content:encoded>
      <pubDate>Wed, 19 Mar 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/7FRWZ644Kb3oC9Vq5WhdKu/58ec1116fed956301dc7ee45e3b80f7e/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>How I Built Deterministic LLM Evaluation Metrics for DeepEval</title>
      <link>https://www.confident-ai.com/blog/how-i-built-deterministic-llm-evaluation-metrics-for-deepeval</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-i-built-deterministic-llm-evaluation-metrics-for-deepeval</guid>
      <description>In this article, I&apos;m sharing how I&apos;ve built DeepEval&apos;s latest deterministic, LLM-powered, custom metric.</description>
      <content:encoded><![CDATA[A little more than a month ago, I had several calls with a few DeepEval users and noticed a clear divide: those who were happy with the out-of-the-box metrics and those who weren’t.

For context, DeepEval is an open-source LLM evaluation framework I’ve been working on for the past year, and all of its LLM evaluation metrics use LLM-as-a-judge. It’s grown to nearly half a million monthly downloads and close to 5,000 GitHub stars. With over 800k evaluations run daily, engineers now use it to unit test LLM applications such as RAG pipelines, agents, and chatbots.

The users who weren’t satisfied with our metrics had a simple reason: the metrics didn’t fit their use case, and they weren’t deterministic enough since they were all evaluated using LLM-as-a-judge. That’s a real problem because the whole point of DeepEval is to eliminate the need for engineers to build their own evaluation metrics and pipelines. If our built-in metrics aren’t usable and people have to go through that effort anyway...]]></content:encoded>
      <pubDate>Sun, 09 Feb 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6IArqmZCtYXYy7Rn0hWewG/3ffb214ae673831b7a481d5112b6e346/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>LLM Agent Evaluation Metrics in 2026: Tool Calling, Task Completion, Reasoning, and Trace-Based Evals</title>
      <link>https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide</guid>
      <description>Learn how to evaluate LLM agents end-to-end with tool calling, task completion, reasoning, trace-based evals, human review, and DeepEval code examples.</description>
      <content:encoded><![CDATA[Last weekend, I asked Cursor to ship an internal analytics dashboard. Instead, it spiraled through a series of unfortunate tool calls, planning cycles, and greps. I ended up with an overheated laptop, a 404 page, and a bill that cost an arm and a leg (mind you, this was Opus 4.8 on extra high).


An agent that failed to ship the feature after a long and expensive trajectory.

That’s the problem with LLM agents in 2026: a small misunderstanding, bad tool call, or broken assumption can compound across hundreds of reasoning loops. An agent can look busy, reason intelligently, call the right-looking tools, and still fail to complete the task. And even when it succeeds, it may have used the wrong tools, passed bad inputs, looped needlessly, or torched all your tokens to get there.

LLM agent evaluation is how you prevent those failures before they compound in production. More importantly, it tells you why an agent failed — which tool, input, reasoning step, or handoff actually broke — instea...]]></content:encoded>
      <pubDate>Mon, 27 Jan 2025 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/7a3JJxEZLZHyUa81BOw9FM/6e9ea6d05e265be2edd67993fcbc3029/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>LLM Guardrails for Data Leakage, Prompt Injection, and More</title>
      <link>https://www.confident-ai.com/blog/llm-guardrails-the-ultimate-guide-to-safeguard-llm-systems</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-guardrails-the-ultimate-guide-to-safeguard-llm-systems</guid>
      <description>In this article, you&apos;ll learn everything you need to know about LLM guardrails and how to use them for LLM security.</description>
      <content:encoded><![CDATA[Whether you’re managing sensitive user data, avoiding harmful outputs, or ensuring adherence to regulatory standards, crafting the right LLM guardrails is essential for safe, scalable Large Language Model (LLM) applications. Guardrails are proactive and prescriptive — designed to handle edge cases, limit failures, and maintain trust in live systems. Building a solid foundation of guardrails ensures that your LLM doesn’t just perform well on paper but thrives safely and effectively in the hands of your users.

While LLM evaluation focuses on refining accuracy, relevance, and overall functionality, implementing effective LLM guardrails is about actively mitigating risks in real-time production environments (P.S. guardrails are a great way to stay compliant with guidelines like the OWASP Top 10 2025).

This article will teach you everything you need to know about LLM guardrails, with code samples included. We’ll dive into:

 What LLM guardrails are, how they are different from LLM evalua...]]></content:encoded>
      <pubDate>Sun, 26 Jan 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/14p8sd3RUs4vuOYsZA2C6D/1597ff305b9e88d960dfa2a7415df1da/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>OWASP Top 10 2025 for LLM Applications: What’s new? Risks, and Mitigation Techniques</title>
      <link>https://www.confident-ai.com/blog/owasp-top-10-2025-for-llm-applications-risks-and-mitigation-techniques</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/owasp-top-10-2025-for-llm-applications-risks-and-mitigation-techniques</guid>
      <description>In this article, we&apos;ll go through what the OWASP Top 10 is, as well as what&apos;s new in their latest 2025 guidelines.</description>
      <content:encoded><![CDATA[It’s terrifying to think that 2025 is the year of LLM agents, and yet here we are… LLMs are still ridiculously vulnerable to jailbreaking. Sure, DeepSeek made huge ripples in the AI community when it launched a couple of weeks ago, and I admit it’s incredibly powerful. But this X user still easily managed to generate a meth recipe with a simple prompt injection.


X user generating a meth recipe with DeepSeek

Now imagine plugging these very same LLMs into medical tools, legal systems, and financial services we use every day.

Did I forget to mention that more than half (53%, to be exact) of companies building AI agents right now aren’t even fine-tuning their models? Honestly, I can’t blame them: fine-tuning costs a fortune to do effectively. However, this means that any vulnerabilities in these LLMs will carry over to the agents you use. So, don’t be too surprised when your ‘drug research AI’ suddenly decides to moonlight as a meth cook.

Jokes aside, the safety and security of LLMs i...]]></content:encoded>
      <pubDate>Sat, 18 Jan 2025 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/5woEwHr9XgtGX8qAgA2WIp/3797a9146edcf931b1ffbc0fb1e2ddb0/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>The People&apos;s Choice of Top LLM Evaluation Tools in 2025</title>
      <link>https://www.confident-ai.com/blog/greatest-llm-evaluation-tools-in-2025</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/greatest-llm-evaluation-tools-in-2025</guid>
      <description>In this article, we&apos;ll bring you a hand-picked, carefully curated list of top LLM evaluation tools in the market.</description>
      <content:encoded><![CDATA[Let’s cut to the chase. There are tons of LLM evaluation tools out there, and they all look, feel, and sound the same. “Ship your LLM with confidence”, “No more guesswork for LLMs”, yeah right.

Why is this happening? Well, turns out LLM evaluation is a problem faced by anyone and everyone building LLM applications, and it sure is a painful one. LLM practitioners rely on “vibe checks”, “intuition”, “gut feel”, and resort to “making up test results to keep my manager happy”.

As a result, there are more than enough high-quality LLM evaluation solutions out there in the market, and as the author of DeepEval ⭐, the open-source LLM evaluation framework with almost half a million monthly downloads, it is my duty to let you know the people's choice of top LLM evaluation tools based on their pros and cons.

But before we begin with our top 5 list, let’s go over the basics.

 Why Should You Take LLM Evaluation Seriously?

Here's a common scenario you will be or are already running into: You...]]></content:encoded>
      <pubDate>Wed, 15 Jan 2025 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/14lJNzPPh7zuruzDoN7Ou7/bb7ccbbacefb3aae0d58d14520a6845d/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>The Comprehensive LLM Safety Guide: Navigate AI regulations and Best Practices for LLM Safety</title>
      <link>https://www.confident-ai.com/blog/the-comprehensive-llm-safety-guide-navigate-ai-regulations-and-best-practices-for-llm-safety</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/the-comprehensive-llm-safety-guide-navigate-ai-regulations-and-best-practices-for-llm-safety</guid>
      <description>In this article, we&apos;ll teach you about LLM regulations and how to maintain the safety of your LLM applications.</description>
      <content:encoded><![CDATA[With great power comes great responsibility. As LLMs become more powerful, they are entrusted with increasing autonomy. This means less human oversight, greater access to personal data, and an ever-expanding role in handling real-life tasks.

From managing weekly grocery orders to overseeing complex investment portfolios, LLMs present a tempting target for hackers and malicious actors eager to exploit them. Ignoring these risks could have serious ethical, legal, and financial repercussions. As pioneers of this technology, we have a duty to prioritize and uphold LLM safety.

Although much of this territory is uncharted, it’s not entirely a black box. Governments worldwide are stepping up with new AI regulations, and extensive research is underway to develop risk mitigation strategies and frameworks. Today, we’ll dive into these topics, covering:

 What LLM Safety entails
 Government AI regulations and their impact on LLMs
 Key LLM vulnerabilities to watch out for
 Current LLM safety resea...]]></content:encoded>
      <pubDate>Sat, 02 Nov 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/31oVeRezjS5U6LwylzpBKv/29bbf3b4eb35051f8c5e1b8faf41ac91/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>How to Jailbreak LLMs One Step at a Time: Top Techniques and Strategies</title>
      <link>https://www.confident-ai.com/blog/how-to-jailbreak-llms-one-step-at-a-time</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-to-jailbreak-llms-one-step-at-a-time</guid>
      <description>In this article, I&apos;ll show you how to jailbreak your LLM application to detect its vulnerabilities.</description>
      <content:encoded><![CDATA[If you’ve ever heard of LLM red teaming at all, you’ve likely encountered several notable attacks: prompt injections, data poisoning, denial-of-service (DoS) attacks, and more. However, when it comes to manipulating an LLM into generating undesirable or harmful outputs, nothing is quite as powerful as LLM jailbreaking.

In fact, this study demonstrates that SOTA models like GPT-4 were successfully compromised with just a few jailbreaking queries.

Still, while LLM jailbreaking has become a widely discussed topic, its definition can vary across different contexts, leading to some confusion about what it truly entails. Do not fear — today, I’ll guide you through everything you need to know about jailbreaking, including:

 What LLM jailbreaking is and its various types
 Key research and breakthroughs in jailbreaking
 A step-by-step guide to crafting high-quality jailbreak attacks to identify vulnerabilities in your LLM application
 How to use DeepTeam ⭐, the open-source LLM red teaming framework t...]]></content:encoded>
      <pubDate>Wed, 30 Oct 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6LR951XSxIeqRULsGu6FZm/571f980db724e472a4344dea964a5baa/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>What is LLM Observability? - The Ultimate LLM Observability Guide</title>
      <link>https://www.confident-ai.com/blog/what-is-llm-observability-the-ultimate-llm-monitoring-guide</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/what-is-llm-observability-the-ultimate-llm-monitoring-guide</guid>
      <description>In this article, I&apos;ll share what you should definitely look for in your next LLM Observability solution.</description>
      <content:encoded><![CDATA[After months of intense evaluation, red teaming, and iterating on hundreds of prompts for your LLM application, launch day is finally here. Yet, as the final countdown ticks away, those familiar doubts linger: Have you truly covered all the bases? Could someone find an exploit? Is your model really ready for the real world? No seriously, is it really ready?

The truth is, no matter how thorough your preparation is, it’s impossible to test for every potential issue that could occur post-deployment. But here’s the good news: you’re not alone. With the right LLM monitoring, oversight, and observability tools, you can effectively manage and mitigate those risks, ensuring a smoother path ahead, which brings us to LLM observability.

By the end of this article, you'll be equipped with all the knowledge you need to empower your LLM application with observability. Let's begin.

 What is LLM Observability?

LLM observability provides teams with powerful insights to keep LLMs on track, ensurin...]]></content:encoded>
      <pubDate>Tue, 29 Oct 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/4jlmGrXOGTf2cLLfTPlvfo/b6ed2c42bd25fd2e63fe4eec4f1db62f/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Top LLM Chatbot Evaluation Metrics: Conversation Testing Techniques</title>
      <link>https://www.confident-ai.com/blog/llm-chatbot-evaluation-explained-top-chatbot-evaluation-metrics-and-testing-techniques</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-chatbot-evaluation-explained-top-chatbot-evaluation-metrics-and-testing-techniques</guid>
      <description>In this article, you&apos;ll learn how to evaluate LLM chatbots using chatbot evaluation metrics and conversation testing techniques.</description>
      <content:encoded><![CDATA[Here’s a true story: Just last week, I was sorting out a delayed shipment and spent some time texting my dedicated customer support rep, Johnny. Johnny was great. He was polite and responsive, so much to the point that I felt bad leaving him on read at times. But as our conversation dragged on, he kept asking the same old questions, and his generic suggestions weren’t helping. Hmm, I thought to myself, maybe this isn’t Johnny after all.

I’m no detective, but it was obvious that Johnny was in fact an LLM chatbot. As much as I appreciated Johnny’s demeanor, he kept forgetting what I told him, and his answers were oftentimes long and robotic. This is why LLM chatbot evaluation is imperative for deploying production-grade LLM conversational agents.

In this article, I’ll teach you how to evaluate LLM chatbots so thoroughly that you’ll be able to quantify whether they’re convincing enough to pass as real people. More importantly, you’ll be able to use these evaluation results to identify how...]]></content:encoded>
      <pubDate>Sat, 05 Oct 2024 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/7cF9brxzF74MMiPRifHXKS/6f8d73a747c6ec063151238c18c63dea/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>LLM-as-a-Judge Simply Explained: The Complete Guide to Run LLM Evals at Scale</title>
      <link>https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method</guid>
      <description>Complete guide to LLM-as-a-Judge: how it works, single-output vs pairwise scoring, G-Eval, DAG, prompting techniques, and how to use LLM judges for scalable LLM evaluation.</description>
      <content:encoded><![CDATA[Here’s the strange thing about LLM evaluation in 2026: an LLM judge agrees with human reviewers about 85% of the time, more often than two humans agree with each other on the same task. That’s the bet behind LLM-as-a-Judge, now the default method for evaluating LLM applications at scale.

The idea is simple. Have one LLM score the outputs of another against criteria you define: answer relevancy, faithfulness, helpfulness, bias, correctness. No annotation team, no waiting weeks, no five-figure human eval contract. Just a prompt and an API call.

But LLM judges aren’t plug-and-play. With the wrong scoring method, the wrong prompt, or the wrong rubric, your eval scores end up just as flaky as the model you’re testing.

This guide is the complete playbook on LLM-as-a-Judge: what it is, single-output vs pairwise scoring, the techniques that make judges accurate (G-Eval, DAG, chain-of-thought, few-shot prompting), how to handle bias and other limitations, and how to wire it all into production evaluation metri...]]></content:encoded>
      <pubDate>Sun, 01 Sep 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/1VYUUwxRWmohOmLgyhL0Vk/164d61dc87afa179f9a4f60906e992bf/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>The Definitive LLM Security Guide: OWASP Top 10 2025, Safety Risks and How to Detect Them</title>
      <link>https://www.confident-ai.com/blog/the-comprehensive-guide-to-llm-security</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/the-comprehensive-guide-to-llm-security</guid>
      <description>In this article, I&apos;ll go through all the major pillars of LLM security you must know and how to mitigate them.</description>
      <content:encoded><![CDATA[Just the other day, I was experimenting with dialogue-based LLM jailbreaking and managed to crack GPT-4 and GPT-4o multiple times, unleashing a chaotic mix of humorous responses. But the fun and games stop when your system gets hacked, data leaks, and you’re hit with unimaginable legal and financial consequences.

As LLMs evolve, especially with Agentic RAG systems that can access and manage data, we must ensure their security to prevent any damaging outcomes.

In this article, I’ll be teaching you about the pillars of LLM security, different risks and vulnerabilities involved, and the best practices to keep these models — and your systems — safe.

 What is LLM Security?

LLM security involves identifying and mitigating vulnerabilities in large language models, such as their tendency to spread misinformation or generate harmful content. The range of potential vulnerabilities is vast, and companies prioritize them differently based on their unique needs.

For example, financial institution...]]></content:encoded>
      <pubDate>Mon, 19 Aug 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6gIrNeguQmmtJPUaiCeRFH/067250f98d923349072690999f621f67/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>LLM Red Teaming: The Complete Step-By-Step Guide To LLM Safety</title>
      <link>https://www.confident-ai.com/blog/red-teaming-llms-a-step-by-step-guide</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/red-teaming-llms-a-step-by-step-guide</guid>
      <description>In this article, you&apos;ll learn about LLM red teaming and how it can be carried out using DeepTeam.</description>
      <content:encoded><![CDATA[When Gemini first released its image generation capabilities, it generated human faces as people of color, even when it shouldn't. Although this may be hilarious to some, it soon became evident that as Large Language Models (LLMs) advanced and evolved, so did their risks, which include:

 Disclosing PII
 Misinformation
 Bias
 Hate Speech
 Harmful Content

These are only a few of the myriad vulnerabilities that exist within LLM systems. In the case of Gemini, it was the severe inherent biases within its training data that were ultimately reflected in the "politically correct" images you see.


Gemini Politically Correct Generations

If you don't want your AI to appear infamously on the front page of X or Reddit, it’s crucial to red team your LLM system. This helps identify harmful behaviors your LLM application is vulnerable to, so you can build the necessary defenses (using LLM guardrails) to safeguard your company from security, compliance, and reputational risks.

However, LL...]]></content:encoded>
      <pubDate>Sat, 29 Jun 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/7ghtoObcUdK7MRSKtnGvmV/a23621f97c69f889d1a2cf16aa2fbafe/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Evaluating LLM Systems: Essential Metrics, Benchmarks, and Best Practices</title>
      <link>https://www.confident-ai.com/blog/evaluating-llm-systems-metrics-benchmarks-and-best-practices</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/evaluating-llm-systems-metrics-benchmarks-and-best-practices</guid>
      <description>In this article, you&apos;ll learn how to evaluate LLM systems using LLM evaluation metrics and benchmark datasets.</description>
      <content:encoded><![CDATA[Manually evaluating LLM systems is tedious, time-consuming, and frustrating. If you’ve ever found yourself looping through a set of prompts to manually inspect each corresponding LLM output, you’ll be happy to know that this article will teach you everything you need to know about LLM evaluation to ensure the longevity of you and your LLM application.

LLM evaluation refers to the process of ensuring LLM outputs are aligned with human expectations, which can range from ethical and safety considerations, to more practical criteria such as the correctness and relevancy of LLM outputs. From an engineering perspective, these LLM outputs can often be found in the form of unit test cases, while evaluation criteria can be packaged in the form of LLM evaluation metrics.

On the agenda, we have:

 What is the difference between LLM and LLM system evaluation, and their benefits
 Offline evaluations, what are LLM system benchmarks, how to construct evaluation datasets and choose the ri...]]></content:encoded>
      <pubDate>Mon, 24 Jun 2024 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/zpRFwPPrjnvdma24lKFnJ/287deacacff774763b117e88b2b238f9/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Using LLMs for Synthetic Data Generation: The Definitive Guide</title>
      <link>https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms</guid>
      <description>In this article, I&apos;ll show you everything you need to know about generating realistic synthetic datasets using LLMs.</description>
      <content:encoded><![CDATA[Constructing a large-scale, comprehensive dataset to test LLM outputs can be a laborious, costly, and challenging process, especially if done from scratch. But what if I told you that it’s now possible to generate the same thousands of high-quality test cases you spent weeks painstakingly crafting, in just a few minutes?

Synthetic data generation leverages LLMs to create quality data without the need to manually collect, clean, and annotate massive datasets. With models like GPT-4, it's now possible to synthetically produce datasets that are more comprehensive and diverse than human-labeled ones, in far less time, which can be used to benchmark LLM (systems) with the help of some LLM evaluation metrics.

In this article, I’ll teach you everything you need to know on how to use LLMs to generate synthetic datasets (which for example can be used to evaluate RAG pipelines). We’ll explore:

 Synthetic generation methods (Distillation and Self-Improvement)
 What data evolution is, various evolut...]]></content:encoded>
      <pubDate>Thu, 09 May 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/2Rv6LjFJeV8vKB0mlM0WFx/18b6da8a2b19312aa154e625cbc07e13/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>How to Build an LLM Evaluation Framework, from Scratch</title>
      <link>https://www.confident-ai.com/blog/how-to-build-an-llm-evaluation-framework-from-scratch</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-to-build-an-llm-evaluation-framework-from-scratch</guid>
      <description>In this article, you&apos;re going to learn how to build the world&apos;s most robust and scalable LLM evaluation framework.</description>
      <content:encoded><![CDATA[Let’s set the stage: I’m about to change my prompt template for the 44th time when I get a message from my manager: “Hey Jeff, I hope you’re doing well today. Have you seen the newly open-sourced Mistral model? I’d like you to try it out since I heard it gives better results than the LLaMA2 you’re using.”

Oh no, I think to myself, not again.

This frustrating interruption (and by this I mean the release of new models) is why I, as the creator of DeepEval, am here today to teach you how to build an LLM evaluation framework to systematically identify the best hyperparameters for your LLM systems.

Want to be one terminal command away from knowing whether you should be using the newly released Claude 3 Opus model, or which prompt template you should be using? Let’s begin.

 What is an LLM Evaluation Framework?

An LLM evaluation framework is a software package that is designed to evaluate and test outputs of LLM systems on a range of different criteria. The performance of an LLM system (whi...]]></content:encoded>
      <pubDate>Fri, 05 Apr 2024 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/4povb6tJYaezb5IoEP4YiV/7f911c87f3fc4c57a1d0fea161636a25/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Top LLM Benchmarks Explained: MMLU, HellaSwag, BBH, and Beyond</title>
      <link>https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond</guid>
      <description>In this article, I&apos;m going to go through all the top LLM benchmarks currently used and why they matter.</description>
      <content:encoded><![CDATA[Just earlier this month, Anthropic unveiled their latest Claude 3 Opus model, which was preceded by Mistral's Le Large model a week prior, which was again preceded by Google's Gemini Ultra 1.5, which was of course released shortly after Ultra 1.0. With more LLMs than ever being released at breakneck speed, it is now imperative to quantify LLM performance on a standard set of tasks. So the question is, how?

LLM benchmarks offer a structured framework for evaluating LLMs across a variety of tasks. Understanding when and how to leverage them is crucial not just for comparing models, but also for building a reliable and fail-safe model.

In this article, I’m going to walk you through everything you need to know about LLM benchmarks. We’ll explore:

 What LLM benchmarks are and how to pick the one for your needs.
 All the key benchmarks in technical reports and industry. (MMLU, HellaSwag, BBH, etc.)
 The limitations of LLM benchmarks, and ways to get around them by generating synthetic...]]></content:encoded>
      <pubDate>Sat, 16 Mar 2024 00:00:00 GMT</pubDate>
      <author>Kritin Vongthongsri</author>
      <media:thumbnail url="https:/authors/kritin.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/4DPySdM4Ij59ydJFYEWvjl/c26bc9f295a345c1b1743e7c0413f8ee/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>LLM Testing in 2026: Top Methods and Strategies</title>
      <link>https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies</guid>
      <description>In this article, we&apos;ll learn everything there is to know about LLM testing, including best practices and methods to test LLMs.</description>
      <content:encoded><![CDATA[Just a week ago, I was on a call with a DeepEval user who told me she considers testing and evaluating large language models (LLMs) as distinct concepts. When I asked for her definition of LLM testing, this is what she said:

 Evaluating LLMs to us is more about choosing the right LLMs through benchmarks, whereas LLM testing is more about exploring the unexpected things that can go wrong in different scenarios. 

Since I’ve already written quite an article on everything you need to know about LLM evaluation metrics, for this article we’ll dive into how to use these metrics for LLM testing instead. We’ll explore what LLM testing is, different test approaches and edge cases to look out for, highlight best practices for LLM testing, as well as how to carry out LLM testing through DeepEval, the open-source LLM testing framework.

And before I forget, here's my go-to "graph" to explain the importance of testing, especially for unit testing AI agents.



Convinced? Let's dive right in.

...]]></content:encoded>
      <pubDate>Sun, 25 Feb 2024 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/5xiX86CGHWhcIoyJCoRZkD/c428040059c6d74ed7e9eeb090d594a7/65dba2a09f2a11e1429dd624_testing-p-1080.jpg" type="image/jpeg"/>
    </item>
    <item>
      <title>The Ultimate Guide to Fine-Tune LLaMA 3, With LLM Evaluations</title>
      <link>https://www.confident-ai.com/blog/the-ultimate-guide-to-fine-tune-llama-2-with-llm-evaluations</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/the-ultimate-guide-to-fine-tune-llama-2-with-llm-evaluations</guid>
      <description>In this article, we&apos;ll walk through how to fine-tune and evaluate a LLaMA-2 model using Hugging Face and DeepEval.</description>
      <content:encoded><![CDATA[Fine-tuning a Large Language Model (LLM) comes with tons of benefits when compared to relying on proprietary foundational models such as OpenAI’s GPT models. Think about it: you get 10x cheaper inference cost, 10x faster tokens per second, and don’t have to worry about any shady stuff OpenAI’s doing behind their APIs. The way everyone should be thinking about fine-tuning is not how we can outperform OpenAI or replace RAG, but how we can maintain the same performance while cutting down on inference time and cost for a specific use case.



But let’s face it, the average Joe building RAG applications isn’t confident in their ability to fine-tune an LLM: training data are hard to collect, methodologies are hard to understand, and fine-tuned models are hard to evaluate. And so, fine-tuning has become the best vitamin for LLM practitioners. You’ll often hear excuses such as “Fine-tuning isn’t a priority right now”, “We’ll try with RAG and move to fine-tuning if necessary”, and the classic “Its...]]></content:encoded>
      <pubDate>Tue, 20 Feb 2024 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/7IVVs4Otrp3hCz4cEQlbBu/8e9e26a24ec09c08a636ed0a720119ad/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>RAG Evaluation: The Definitive Guide to Unit Testing RAG in CI/CD</title>
      <link>https://www.confident-ai.com/blog/how-to-evaluate-rag-applications-in-ci-cd-pipelines-with-deepeval</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-to-evaluate-rag-applications-in-ci-cd-pipelines-with-deepeval</guid>
      <description>In this tutorial, we&apos;ll walk through how to set up a full testing suite for RAG applications using DeepEval.</description>
      <content:encoded><![CDATA[Retrieval-Augmented Generation (RAG) has become the most popular way to provide LLMs with extra context to generate tailored outputs. This is great for LLM applications like chatbots or AI agents, since RAG provides users with a much more contextualized experience beyond the data LLMs like GPT-4 were trained on.

Unsurprisingly, LLM practitioners quickly ran into problems with evaluating RAG applications during development. But thanks to research done by RAGAs, evaluating the generic retriever-generator performance of RAG systems is now a somewhat solved problem in 2024. Don’t get me wrong, building RAG applications remains a challenge: you could be using the wrong embedding model, a bad chunking strategy, or outputting responses in the wrong format, which is exactly what frameworks like LlamaIndex are trying to solve.

But now, as RAG architectures grow in complexity and collaboration among LLM practitioners on these projects increases, the occurrence of breaking changes is becoming mo...]]></content:encoded>
      <pubDate>Mon, 05 Feb 2024 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/7DdNqoWRb6LchNAEMNHEYh/bd95dcf9874fb5edc605dbdbcfda065b/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide</title>
      <link>https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation</guid>
      <description>In this article, I&apos;ll walk through everything you need to know about LLM evaluation metrics, with code samples.</description>
      <content:encoded><![CDATA[It is no secret that evaluating the outputs of Large Language Models (LLMs) is essential for anyone building robust LLM applications. Whether you're fine-tuning for accuracy, enhancing contextual relevance in a RAG pipeline, or increasing task completion rate in an AI agent, choosing the right evaluation metrics is critical. Yet, LLM evaluation remains notoriously difficult, especially when it comes to deciding what to measure and how.

Having built one of the most adopted LLM evaluation frameworks myself, I’ll use this article to teach you everything you need to know about LLM evaluation metrics, with code samples included. Ready for the long list? Let’s begin.

(Update: For metrics evaluating AI agents, check out this new article)

 TL;DR

Key takeaways:

 LLM metrics measure output quality across dimensions like correctness and relevance.
 Common mistake: relying on traditional scorers like BLEU/ROUGE, which do not capture the semantic nuance in LLM outputs.
 LLM-as-a-judge is the most reliable me...]]></content:encoded>
      <pubDate>Mon, 22 Jan 2024 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/65nURaRhfXuOMqxF3sbH0y/d195c11f1d3b6d4d899e70f7819293e0/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>An Introduction to LLM Benchmarking</title>
      <link>https://www.confident-ai.com/blog/the-current-state-of-benchmarking-llms</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/the-current-state-of-benchmarking-llms</guid>
      <description>In this article, I&apos;ll show how benchmarking can help you choose the right LLM for your use case.</description>
      <content:encoded><![CDATA[Picture LLMs ranging from 7 billion to over 100 billion parameters, each more powerful than the last. Among them are the giants: Mistral 7 billion, Mixtral 8x7 billion, Llama 70 billion, and the colossal Falcon 180 billion. Yet, there also exist models like Phi-1, Phi-1.5, and Falcon 1B, striving for similar prowess with a leaner framework of 1 to 4 billion parameters. Each model, big or small, shares a common goal: to master the art of language, excelling in tasks like summarization, question answering, and named entity recognition.

But across all of these cases, Large Language Models (LLMs) universally share some very flawed behaviors:

 Some prompts cause LLMs to produce gibberish outputs, known as 'jailbreaking prompts'.
 LLMs are not always factually correct, a phenomenon also known as 'hallucination'.
 LLMs can exhibit unexpected behaviors that are unsafe for consumers to utilize.

It is evident that merely training LLMs is not sufficient. Thus, the question arises: How can we conf...]]></content:encoded>
      <pubDate>Mon, 25 Dec 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/23qYSYI9xoiLVCIJFaCT0q/74b9de872393216880cd3b20bcfeb6ef/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>A Step-By-Step Guide to Evaluating an LLM Text Summarization Task</title>
      <link>https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task</guid>
      <description>In this article, I&apos;ll teach you how to create your own text summarization metric.</description>
      <content:encoded><![CDATA[When you imagine what a good summary for a 10-page research paper looks like, you likely picture a concise, comprehensive overview that accurately captures all key findings and data from the original work, presented in a clear and easily understandable format.

This might sound extremely obvious to us (I mean, who doesn’t know what a good summary looks like?), yet for large language models (LLMs) like GPT-4, grasping this simple concept to accurately and reliably evaluate a text summarization task remains a significant challenge.

In this article, I’m going to share how we built our own bulletproof LLM-Evals (metrics evaluated using LLMs) to evaluate a text summarization task. In summary (no pun intended), it involves asking closed-ended questions to:

1. Identify misalignment in factuality between the original text and summary.
2. Identify exclusion of details in the summary from the original text.

 Existing Problems with Text Summarization Metrics

 Traditional, non-LLM Evals

Historicall...]]></content:encoded>
      <pubDate>Sun, 17 Dec 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/3oSPNlD5J6GvFFh5YANn17/e5bd8f8f7dc23915a9e4ec73901108a9/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Why OpenAI Assistants is a Big Win for LLM Evaluation</title>
      <link>https://www.confident-ai.com/blog/why-openai-assistants-is-a-big-win-for-llm-evaluation</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/why-openai-assistants-is-a-big-win-for-llm-evaluation</guid>
      <description>In this article, I&apos;ll share how JudgementalGPT, our in-house evaluator, was built using OpenAI&apos;s Assistants.</description>
      <content:encoded><![CDATA[A week after the famous, or infamous, OpenAI Dev Day, we at Confident AI released JudgementalGPT, an LLM agent built using OpenAI's Assistants API and designed specifically for evaluating other LLM applications. What initially started off as an experimental idea quickly turned into a prototype that we were eager to ship, as we received feedback from users that JudgementalGPT gave more accurate and reliable results when compared to other state-of-the-art LLM-based evaluation approaches such as G-Eval.

Understandably, knowing that Confident AI is the world's first open-source evaluation infrastructure for LLMs, many demanded more transparency into how JudgementalGPT was built after our initial public release:

 I thought it's all open source, but it seems like JudgementalGPT, in particular, is a black box for users. It would be great if we had more knowledge on how this is built.

So here you go, dear anonymous internet stranger, this article is dedicated to you.

 Limitations of LL...]]></content:encoded>
      <pubDate>Tue, 21 Nov 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/5ORobDoTX2j1cE7rnwdoaY/5f934d40d456f83ec190bec0aaad2eb6/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Become a Prompt Artist: Understanding the Midjourney LLM</title>
      <link>https://www.confident-ai.com/blog/become-a-prompt-artist-understanding-the-midjourney-llm</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/become-a-prompt-artist-understanding-the-midjourney-llm</guid>
      <description>In this interactive tutorial, I&apos;ll show you how to become a Midjournalist and create the images you imagine.</description>
      <content:encoded><![CDATA[By now, you've probably seen your fair share of jaw-dropping AI images on your social feeds and thought to yourself, "How are people creating these amazing images?" So, you jump onto Midjourney, eager to create your own. But what you produce isn't quite what you expected. You try again, only to be disappointed once more. Sound familiar?

To generate what you want, you need to know how to prompt Midjourney's LLM the right way. To be honest, I've spent my fair share of time failing to generate what I want for Confident's weekly blog post cover images (such as the current one above).

So, are you ready for a lighthearted and interactive tutorial? Let's begin!

 Getting Started with your first Midjourney artwork

To get started with Midjourney, sign up for Discord if you haven't already and complete the registration process. Once you have Discord up and running, open the Midjourney website and click "Join Beta".



Once you've signed up, you can select a paid or a free plan. Use...]]></content:encoded>
      <pubDate>Wed, 15 Nov 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6ZxumC761romYdAm0Hq1pm/8f7de15524d75b7c27558c6ad1655ca8/image_2025-07-28_084532032.png" type="image/jpeg"/>
    </item>
    <item>
      <title>How to Evaluate LLM Applications: The Complete Guide</title>
      <link>https://www.confident-ai.com/blog/how-to-evaluate-llm-applications</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-to-evaluate-llm-applications</guid>
      <description>In this article, we will demystify how to evaluate LLM applications and RAG pipelines the right way.</description>
      <content:encoded><![CDATA[ChatGPT, the leading code generator, has soared in popularity over the past year thanks to the seemingly omniscient GPT-4. Its ability to generate coherent and poetic responses to previously unseen contexts has accelerated the development of other foundational large language models (LLMs), such as Anthropic’s Claude, Google’s Bard, and Meta’s open-source LLaMA model. Consequently, this has enabled ML engineers to build retrieval-based LLM applications around proprietary data like never before. But these applications continue to suffer from hallucinations, struggle to keep up to date with the latest information, and don’t always respond relevantly to prompts.

In this article, as the founder of Confident AI, the world’s first open-source evaluation infrastructure for LLM applications, I will outline how to evaluate LLM and retrieval pipelines, different workflows you can employ for evaluation, and the common pitfalls when building RAG applications that evaluation can solve.

 Evaluation is (n...]]></content:encoded>
      <pubDate>Tue, 07 Nov 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/5qXaZ5KjZwyYEVIo9SQWvC/36b752d495a7c8556879d4fd0c86e1d8/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Why we replaced Pinecone with PGVector</title>
      <link>https://www.confident-ai.com/blog/why-we-replaced-pinecone-with-pgvector</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/why-we-replaced-pinecone-with-pgvector</guid>
      <description>Do you really need a dedicated vector database for your Generative AI application? Our experience says not always.</description>
      <content:encoded><![CDATA[Pinecone, the leading closed-source vector database provider, is known for being fast, scalable, and easy to use. Its ability to allow users to perform blazing-fast vector search makes it a popular choice for large-scale RAG applications. Our initial infrastructure for Confident AI, the world’s first open-source evaluation infrastructure for LLMs, utilized Pinecone to cluster LLM observability log data in production. However, after weeks of experimentation, we made the decision to replace it entirely with pgvector. Pinecone’s simplistic design is deceptive due to several hidden complexities, particularly in integrating with existing data storage solutions. For example, it forces a complicated architecture, and its restrictive metadata storage capacity makes it troublesome for managing data-intensive workloads.

In this article, I will explain why vector databases like Pinecone might not be the best choice for LLM applications and when you should avoid them.

 Pinecone Optimizes for Fast Vector ...]]></content:encoded>
      <pubDate>Sun, 29 Oct 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/lb37dsKYWYvPZMiqeLxNw/6300b77f9664a8a25c7081fcb31c2396/image_2025-07-28_074850109.png" type="image/jpeg"/>
    </item>
    <item>
      <title>What is Retrieval Augmented Generation (RAG)?</title>
      <link>https://www.confident-ai.com/blog/what-is-retrieval-augmented-generation</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/what-is-retrieval-augmented-generation</guid>
      <description>In this article, we&apos;re going to dive deep into the RAG rabbit hole.</description>
      <content:encoded><![CDATA[Large language models like GPT-4 are powerful and versatile generators of natural language, but also extremely limited by the data they were trained on. To get around this problem, there’s a lot of recent talk about leveraging RAG-based systems, but what is RAG, what can it be used for, and why should you care?

In this article, I’m going to talk about what RAG is and how to implement a RAG-based LLM application (yes, with a complete code sample).

PS. Click here for a great read on how to unit test RAG applications in CI/CD pipelines.

 What is RAG?

Retrieval augmented generation is a technique in NLP that allows LLMs like ChatGPT to generate customized outputs that are outside the scope of the data they were trained on. An LLM application without RAG is akin to asking ChatGPT to summarize an email without providing the actual email as context.

A RAG system consists of two primary components: the retriever and the generator.

The retriever is responsible for searching through the knowled...]]></content:encoded>
      <pubDate>Sun, 22 Oct 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6bLDpE2WKggf7E4nRuCtbC/c0adcceabe46ec733012f54e563d6f2c/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>A Gentle Introduction to LLM Evaluation</title>
      <link>https://www.confident-ai.com/blog/a-gentle-introduction-to-llm-evaluation</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/a-gentle-introduction-to-llm-evaluation</guid>
      <description>In this article, we&apos;ll introduce the ways in which you can carry out automated LLM evaluation.</description>
      <content:encoded><![CDATA[Most developers don't set up a process to automatically evaluate LLM outputs when building LLM applications, even if that means introducing unnoticed breaking changes, because evaluation can be an extremely challenging task. In this article, you're going to learn how to evaluate LLM outputs the right way. (PS. if you want to learn how to build your own evaluation framework, click here.)

On the agenda:

 what are LLMs and why they're difficult to evaluate
 different ways to evaluate LLM outputs in Python
 how to evaluate LLMs using DeepEval

Enjoy!

 What are LLMs and what makes them so hard to evaluate?

To understand why LLMs are difficult to evaluate and why they're often referred to as a "black box", let's unpack what LLMs are and how they work.

GPT-4 is an example of a large language model (LLM) and was trained on huge amounts of data. To be exact, around 300 billion words from articles, tweets, r/tifu, stackoverflow, how-to guides, and other pieces of data that were scraped off the ...]]></content:encoded>
      <pubDate>Tue, 03 Oct 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/20h6iNly3WljpPfb2k1GG/4b4fcb155cbe41c4022307fa633d34d2/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>How to build a PDF QA chatbot using OpenAI and ChromaDB</title>
      <link>https://www.confident-ai.com/blog/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb</guid>
      <description>In this article, you&apos;ll learn how to build a RAG-based chatbot on your PDFs using OpenAI and ChromaDB.</description>
      <content:encoded><![CDATA[TL;DR

In this article, you'll learn how to build a RAG-based chatbot to chat with any PDF of your choice so you can achieve your lifelong dream of talking to PDFs 😏 In the end, I'll also show how you can test what you've built.

I know, I wrote something similar in my last article on building a customer support chatbot 😅 but this week we're going to dive deep into how to use the raw OpenAI API to chat with PDF data (including text trapped in visuals like tables) stored in ChromaDB, as well as how to use Streamlit to build the chatbot UI.

 Introducing RAG, Vector Databases, and OCR

Before we dive into the code, let's break down what we're going to implement 🕵️ To begin, OCR (Optical Character Recognition) is a technology within the field of computer vision that recognizes the characters present in a document and converts them into text. This is particularly helpful in the case of tables and charts in documents 😬 We'll be using OCR provided by Azure Cognitive Services in this tutori...]]></content:encoded>
      <pubDate>Tue, 26 Sep 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/WteyPyGMNNYSNa3Buu6zD/230dbf184d6526cafbc4a360f5869445/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Building a customer support chatbot using GPT-3.5 and LlamaIndex</title>
      <link>https://www.confident-ai.com/blog/building-a-customer-support-chatbot-using-gpt-3-5-and-llamaindex</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/building-a-customer-support-chatbot-using-gpt-3-5-and-llamaindex</guid>
      <description>In this article, you&apos;ll learn how to create a customer support chatbot using GPT-3.5 and LlamaIndex.</description>
      <content:encoded><![CDATA[Introducing the OpenAI API and LlamaIndex

In this tutorial, we're going to use GPT-3.5 provided by the OpenAI API. GPT-3.5 is a machine learning model and is like a super-smart computer buddy made by OpenAI. It's been trained with tons of data from the internet so it can chat, answer questions, and help with all sorts of language tasks.

But, you might wonder, can raw, out-of-the-box GPT-3.5 answer customer support questions that are specific to my own internal data?

Unfortunately, the answer is no 😔 because as you may know, GPT models have only been trained on public data up until 2021. This is precisely why we need open source frameworks like LlamaIndex! These frameworks help connect your internal data sources with GPT-3.5, so your chatbot can output tailored responses based on data that regular ChatGPT doesn't know about 😊 (PS. if you want to learn how to use the raw OpenAI API to build a chatbot instead of a framework like LlamaIndex, here is another great tutorial.)

Pretty cool, huh? Lets...]]></content:encoded>
      <pubDate>Tue, 19 Sep 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/6H3c9DsCWfDMapkdJP5dR0/f8d5cd4239bc1ecb8c02d2cc05c18b5d/image.png" type="image/jpeg"/>
    </item>
    <item>
      <title>Generating synthetic data with LLMs - Part 1</title>
      <link>https://www.confident-ai.com/blog/how-to-generate-synthetic-data-using-llms-part-1</link>
      <guid isPermaLink="true">https://www.confident-ai.com/blog/how-to-generate-synthetic-data-using-llms-part-1</guid>
      <description>LLMs make synthetic data easy to leverage, but how exactly can we make these generated data relevant and useful?</description>
      <content:encoded><![CDATA[The ability to use AI to generate data out of thin air is one of those things that seem too good to be true. Think about it: you can get your hands on quality data without needing to manually collect, clean, and annotate massive datasets.

But, as you might expect, synthetic data is not without its caveats. Although it is convenient, efficient, and cost-effective, the quality of synthetic data is only as good as the method used to generate it. Settle for rudimentary methods, and you’ll end up with unusable datasets that don’t represent real-world data well.

In this article, I’m going to share how we managed to generate realistic textual synthetic data at Confident AI. Let's dive right into it.

 What is synthetic data?

First and foremost, synthetic data is data artificially generated in an attempt to simulate real-world data. Unlike real-world data that is collected from observations or actual events (e.g., tweets on the platform formerly known as Twitter), synthetic data is made up, som...]]></content:encoded>
      <pubDate>Fri, 08 Sep 2023 00:00:00 GMT</pubDate>
      <author>Jeffrey Ip</author>
      <media:thumbnail url="https:/authors/jeffrey.jpg"/>
      <enclosure url="https://images.ctfassets.net/otwaplf7zuwf/bHEj7d7NTkCTWA3PhJh7G/b8d637e3d376b6e63b76bf144b9226ee/image.png" type="image/jpeg"/>
    </item>
  </channel>
</rss>