Confident AI Blog - Resources to help teams stay confident in AI

Three Ways AI Systems Fail Even When Evals Pass

AI systems can pass evals while still behaving incorrectly. This post explores three common failure modes that slip through output-based evaluation.

Brian Neville-O'Neill

Apr 7, 2026 · 12 min read
Your AI Agent Passed Evals. That’s the Problem.

Passing evals doesn’t mean your system works. It means your tests didn’t catch how it fails.

Brian Neville-O'Neill

Apr 6, 2026 · 4 min read
Multi-Turn LLM Evaluation in 2026: What You Need to Know

In this article, I'll break down multi-turn LLM evaluation — how it differs from single-turn, what metrics actually matter, and how to implement it.

Jeffrey Ip

Mar 22, 2026 · 14 min read
The Step-By-Step Guide to MCP Evaluation

This article will teach you everything you need to evaluate MCP-based LLM applications.

Cale

Oct 25, 2025 · 9 min read
AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows

A practical guide to evaluating AI agents with LLM metrics and tracing—plus when human review matters, how it calibrates judges, and workflows that combine CI, sampling, and production signals.

Jeffrey Ip

Oct 7, 2025 · 20 min read
LLM Arena-as-a-Judge: LLM-Evals for Comparison-Based Regression Testing

In this article, you'll learn everything about running LLM Arena-as-a-Judge as a novel way to regression-test LLMs.

Jeffrey Ip

Jul 6, 2025 · 10 min read
RAG Evaluation Metrics: Assessing Answer Relevancy, Faithfulness, Contextual Relevancy, And More

This article will go through everything you'll need for RAG evaluation, including metrics and best practices.

Jeffrey Ip

Jun 3, 2025 · 9 min read
LLM Evals Framework That Predicts ROI: A Step-by-Step Guide

Most LLM evals fail because their metrics don't predict ROI. Learn to build outcome-based evals that correlate with business KPIs.

Jeffrey Ip

May 2, 2025 · 16 min read
G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation

This article goes through everything about G-Eval so anyone can easily evaluate LLM apps on any task-specific criteria.

Kritin Vongthongsri

Apr 30, 2025 · 14 min read
Top LLM Evaluators for Testing LLM Systems at Scale

In this article, we'll go through all the top LLM evaluators in 2025, including G-Eval and other LLM-as-a-judge approaches.

Jeffrey Ip

Apr 21, 2025 · 15 min read