LLM Arena-as-a-Judge: LLM-Evals for Comparison-Based Regression Testing

Everything you need to run LLM Arena-as-a-judge: a comparison-based, pairwise approach to regression testing LLMs, with DeepEval code to set it up.

Jeffrey Ip

Jul 6, 2025

10 min read

RAG Evaluation Metrics: Assessing Answer Relevancy, Faithfulness, Contextual Relevancy, And More

RAG evaluation metrics — answer relevancy, faithfulness, and contextual relevancy — measure retrieval and generation quality, with working DeepEval code examples.

Jeffrey Ip

Jun 3, 2025

9 min read

LLM Evals Framework That Predicts ROI: A Step-by-Step Guide

Most LLM evals fail because their metrics don't predict ROI. This playbook shows how to build outcome-based evals that correlate with real business KPIs and user value.

Jeffrey Ip

May 2, 2025

16 min read

G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation

The definitive guide to G-Eval: how this LLM-as-a-judge metric works, chain-of-thought scoring, and how to evaluate LLM apps on custom criteria with DeepEval.

Kritin Vongthongsri

Apr 30, 2025

14 min read

Top LLM Evaluators for Testing LLM Systems at Scale

A rundown of the top LLM evaluators for testing at scale in 2025, from G-Eval and other LLM-as-a-judge metrics to the tooling that runs them in production.

Jeffrey Ip

Apr 21, 2025

15 min read

How I raised Confident AI's $2.2M seed round in 5 days

Confident AI raised an oversubscribed $2.2M seed round in 5 days. Here's the fundraising strategy, the investor conversations, and the hard lessons from the raise.

Jeffrey Ip

Mar 19, 2025

8 min read

How I Built Deterministic LLM Evaluation Metrics for DeepEval

DeepEval's DAG metrics make LLM-as-a-judge scoring deterministic by running outputs through a decision tree. Here's how I built these reliable, explainable metrics.

Jeffrey Ip

Feb 9, 2025

9 min read

LLM Agent Evaluation Metrics in 2026: Tool Calling, Task Completion, Reasoning, and Trace-Based Evals

Learn how to evaluate LLM agents end-to-end with tool calling, task completion, reasoning, trace-based evals, human review, and DeepEval code examples.

Kritin Vongthongsri

Jan 27, 2025

17 min read

LLM Guardrails for Data Leakage, Prompt Injection, and More

LLM guardrails are input and output guards that block data leakage, prompt injection, and off-topic responses in real time. Learn the main types and how to add them.

Jeffrey Ip

Jan 26, 2025

15 min read

OWASP Top 10 2025 for LLM Applications: What’s New, Risks, and Mitigation Techniques

The 2025 OWASP Top 10 for LLM applications ranks risks from prompt injection to sensitive data disclosure. See what changed this year and how to mitigate each one.

Kritin Vongthongsri

Jan 18, 2025

14 min read
