
AI systems can pass evals while still behaving incorrectly. This post explores three common failure modes that slip through output-based evaluation.

Passing evals doesn’t mean your system works. It means your tests didn’t catch how it fails.

In this article, I'll break down multi-turn LLM evaluation — how it differs from single-turn, what metrics actually matter, and how to implement it.
This article will teach you everything you need to evaluate MCP-based LLM applications.

A practical guide to evaluating AI agents with LLM metrics and tracing—plus when human review matters, how it calibrates judges, and workflows that combine CI, sampling, and production signals.
In this article, you'll learn everything about running LLM Arena-as-a-judge, a novel way to regression-test LLMs.
This article will go through everything you'll need for RAG evaluation, including metrics and best practices.
Most LLM evals fail because their metrics don't predict ROI. Build outcome-based evals that correlate with business KPIs instead.
This article goes through everything on G-Eval so anyone can easily evaluate LLM apps on any task-specific criteria.
In this article, we'll go through all the top LLM evaluators in 2025, including G-Eval and other LLM-as-a-judge evaluators.