LLM Arena-as-a-Judge: LLM-Evals for Comparison-Based Regression Testing

Everything you need to run LLM Arena-as-a-judge: a comparison-based, pairwise approach to regression testing LLMs, with DeepEval code to set it up.

Jeffrey Ip

Jul 6, 2025

10 min read

RAG Evaluation Metrics: Assessing Answer Relevancy, Faithfulness, Contextual Relevancy, And More

RAG evaluation metrics — answer relevancy, faithfulness, and contextual relevancy — measure retrieval and generation quality, with working DeepEval code examples.

Jeffrey Ip

Jun 3, 2025

9 min read

LLM Evals Framework That Predicts ROI: A Step-by-Step Guide

Most LLM evals fail because their metrics don't predict ROI. This playbook shows how to build outcome-based evals that correlate with real business KPIs and user value.

Jeffrey Ip

May 2, 2025

16 min read

G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation

The definitive guide to G-Eval: how this LLM-as-a-judge metric works, chain-of-thought scoring, and how to evaluate LLM apps on custom criteria with DeepEval.

Kritin Vongthongsri

Apr 30, 2025

14 min read

Top LLM Evaluators for Testing LLM Systems at Scale

A rundown of the top LLM evaluators for testing at scale in 2025, from G-Eval and other LLM-as-a-judge metrics to the tooling that runs them in production.

Jeffrey Ip

Apr 21, 2025

15 min read

How I raised Confident AI's $2.2M seed round in 5 days

Confident AI raised an oversubscribed $2.2M seed round in 5 days. Here's the fundraising strategy, the investor conversations, and the hard lessons from the raise.

Jeffrey Ip

Mar 19, 2025

8 min read

How I Built Deterministic LLM Evaluation Metrics for DeepEval

DeepEval's DAG metrics make LLM-as-a-judge scoring deterministic by running outputs through a decision tree. Here's how I built these reliable, explainable metrics.

Jeffrey Ip

Feb 9, 2025

9 min read

LLM Agent Evaluation Metrics in 2026: Tool Calling, Task Completion, Reasoning, and Trace-Based Evals

Learn how to evaluate LLM agents end-to-end with tool calling, task completion, reasoning, trace-based evals, human review, and DeepEval code examples.

Kritin Vongthongsri

Jan 27, 2025

17 min read

The People's Choice of Top LLM Evaluation Tools in 2025

A hand-picked, carefully curated list of the best LLM evaluation tools in 2025, compared on metrics, features, and pricing to help you choose the right one.

Jeffrey Ip

Jan 15, 2025

6 min read

What is LLM Observability? - The Ultimate LLM Observability Guide

LLM observability is the practice of tracing, monitoring, and evaluating LLM apps in production. Learn what it covers and what to look for when choosing a tool.

Kritin Vongthongsri

Oct 29, 2024

9 min read
