AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows

A practical guide to evaluating AI agents with LLM metrics and tracing—plus when human review matters, how it calibrates judges, and workflows that combine CI, sampling, and production signals.

Jeffrey Ip

Oct 7, 2025

20 min read

RAG Evaluation Metrics: Assessing Answer Relevancy, Faithfulness, Contextual Relevancy, And More

This article covers everything you need for RAG evaluation, including metrics and best practices.

Jeffrey Ip

Jun 3, 2025

9 min read

LLM Evals Framework That Predicts ROI: A Step-by-Step Guide

Most LLM evals fail because their metrics don't predict ROI. Build outcome-based evals that correlate with business KPIs.

Jeffrey Ip

May 2, 2025

16 min read

G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation

This article covers everything about G-Eval, so anyone can easily evaluate LLM apps on any task-specific criteria.

Kritin Vongthongsri

Apr 30, 2025

14 min read

How I Built Deterministic LLM Evaluation Metrics for DeepEval

In this article, I share how I built DeepEval's latest deterministic, LLM-powered custom metric.

Jeffrey Ip

Feb 9, 2025

9 min read

LLM Guardrails for Data Leakage, Prompt Injection, and More

In this article, you'll learn everything you need to know about LLM guardrails and how to use them for LLM security.

Jeffrey Ip

Jan 26, 2025

15 min read

How to Jailbreak LLMs One Step at a Time: Top Techniques and Strategies

In this article, I'll show you how to jailbreak your LLM application to detect its vulnerabilities.

Kritin Vongthongsri

Oct 30, 2024

16 min read

Top LLM Chatbot Evaluation Metrics: Conversation Testing Techniques

In this article, you'll learn about the top LLM chatbot evaluation metrics and conversation testing techniques.

Jeffrey Ip

Oct 5, 2024

10 min read

LLM-as-a-Judge Simply Explained: The Complete Guide to Run LLM Evals at Scale

Complete guide to LLM-as-a-Judge: how it works, single-output vs pairwise scoring, G-Eval, DAG, prompting techniques, and how to use LLM judges for scalable LLM evaluation.

Jeffrey Ip

Sep 1, 2024

13 min read

LLM Red Teaming: The Complete Step-By-Step Guide To LLM Safety

In this article, you'll learn about LLM red teaming and how it can be carried out using DeepTeam.

Kritin Vongthongsri

Jun 29, 2024

16 min read