Weekly Dose of Confident AI

https://www.confident-ai.com/blog
Resources to help teams build reliable AI systems - guides, tutorials, personal experiences, and essays to test LLM apps in every way possible.
en-us
Wed, 17 Jun 2026 12:01:52 GMT

Human-in-the-Loop Workflows for AI Agent Evaluation: Complete Guide
https://www.confident-ai.com/blog/human-in-the-loop-llm-evaluation-guide
A practical guide to human-in-the-loop workflows for AI agent evaluation: how SMEs review AI agent failures, align automated metrics, and improve evaluation datasets.
Sat, 13 Jun 2026 00:00:00 GMT Kritin Vongthongsri

The Complete Guide to LLM Experimentation: Compare Prompts, Models, and Agents
https://www.confident-ai.com/blog/llm-experimentation-complete-guide
A practical guide to running LLM experiments across prompts, models, tools, datasets, metrics, production A/B tests, and human-in-the-loop feedback loops.
Wed, 10 Jun 2026 00:00:00 GMT Kritin Vongthongsri

Three Ways AI Systems Fail Even When Evals Pass
https://www.confident-ai.com/blog/three-ways-ai-systems-fail-even-when-evals-pass
AI systems can pass evals while still behaving incorrectly. This post explores three common failure modes that slip through output-based evaluation.
Tue, 07 Apr 2026 00:00:00 GMT Brian Neville-O'Neill

Your AI Agent Passed Evals. That’s the Problem.
https://www.confident-ai.com/blog/your-ai-agent-passes-evals-thats-the-problem
Passing evals doesn’t mean your system works. It means your tests didn’t catch how it fails.
Mon, 06 Apr 2026 00:00:00 GMT Brian Neville-O'Neill

Launch Week Day 5 (5/5): Generate Datasets from Your Data Sources
https://www.confident-ai.com/blog/launch-week-q1-2026-day-5-dataset-generation
Your best evaluation data already exists — it's sitting in Google Drive, SharePoint, Notion, and S3. Dataset generation on Confident AI turns your existing documents into evaluation-ready datasets automatically.
Sat, 04 Apr 2026 00:00:00 GMT Jeffrey Ip

Launch Week Day 4 (4/5): Auto-Categorize Traces & Threads
https://www.confident-ai.com/blog/launch-week-q1-2026-day-4-trace-categorization
You can't improve what you can't see. Auto-categorization tells you what your users are actually asking, detects response drift, and shows you which categories perform best — and which ones need help.
Fri, 03 Apr 2026 00:00:00 GMT Jeffrey Ip

Launch Week Day 3 (3/5): Auto-Ingest Traces into Datasets & Annotation Queues
https://www.confident-ai.com/blog/launch-week-q1-2026-day-3-auto-ingest-traces
Production traces are the best dataset you’ll ever get — but most teams never turn them into one. With auto-ingest, your traces flow straight into datasets and annotation queues, continuously.
Thu, 02 Apr 2026 00:00:00 GMT Brian Romain

Launch Week Day 2 (2/5): Scheduled Evals
https://www.confident-ai.com/blog/launch-week-q1-2026-day-2-scheduled-evals
Everyone agrees evals should run regularly. But nobody remembers to actually run them. Scheduled Evals fixes that — set the frequency, configure your mappings, and never scramble before a release again.
Wed, 01 Apr 2026 00:00:00 GMT Kritin Vongthongsri

Announcing Launch Week Q1 '26! Day 1: Automated Error Analysis
https://www.confident-ai.com/blog/launch-week-q1-2026-day-1-error-analysis
Error analysis used to mean pulling traces in code, hacking together an LLM to recommend metrics, and hoping for the best. Not anymore.
Tue, 31 Mar 2026 00:00:00 GMT Jeffrey Ip

Multi-Turn LLM Evaluation in 2026: What You Need to Know
https://www.confident-ai.com/blog/multi-turn-llm-evaluation-in-2026
In this article, I'll break down multi-turn LLM evaluation — how it differs from single-turn, what metrics actually matter, and how to implement it.
Sun, 22 Mar 2026 00:00:00 GMT Jeffrey Ip

The Step-By-Step Guide to MCP Evaluation
https://www.confident-ai.com/blog/the-step-by-step-guide-to-mcp-evaluation
This article will teach you everything you need to evaluate MCP-based LLM applications.
Sat, 25 Oct 2025 00:00:00 GMT Cale

AI Agent Evaluation: Metrics, Traces, Human Review, and Workflows
https://www.confident-ai.com/blog/definitive-ai-agent-evaluation-guide
A practical guide to evaluating AI agents with LLM metrics and tracing—plus when human review matters, how it calibrates judges, and workflows that combine CI, sampling, and production signals.
Tue, 07 Oct 2025 00:00:00 GMT Jeffrey Ip

LLM Arena-as-a-Judge: LLM-Evals for Comparison-Based Regression Testing
https://www.confident-ai.com/blog/llm-arena-as-a-judge-llm-evals-for-comparison-based-testing
In this article, you'll learn everything about running LLM Arena-as-a-judge as a novel way to regression test LLMs.
Sun, 06 Jul 2025 00:00:00 GMT Jeffrey Ip

RAG Evaluation Metrics: Assessing Answer Relevancy, Faithfulness, Contextual Relevancy, And More
https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more
This article will go through everything you'll need for RAG evaluation, including metrics and best practices.
Tue, 03 Jun 2025 00:00:00 GMT Jeffrey Ip

LLM Evals Framework That Predicts ROI: A Step-by-Step Guide
https://www.confident-ai.com/blog/the-ultimate-llm-evaluation-playbook
Most LLM evals fail because metrics don't predict ROI. Build outcome-based evals that correlate with business KPIs.
Fri, 02 May 2025 00:00:00 GMT Jeffrey Ip

G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation
https://www.confident-ai.com/blog/g-eval-the-definitive-guide
This article goes through everything on G-Eval so anyone can easily evaluate LLM apps on any task-specific criteria.
Wed, 30 Apr 2025 00:00:00 GMT Kritin Vongthongsri

Top LLM Evaluators for Testing LLM Systems at Scale
https://www.confident-ai.com/blog/top-llm-evaluators-for-testing-llms-at-scale
In this article, we'll go through all the top LLM evaluators in 2025, including G-Eval and other LLM-as-a-judge methods.
Mon, 21 Apr 2025 00:00:00 GMT Jeffrey Ip

How I raised Confident AI's $2.2M seed round in 5 days
https://www.confident-ai.com/blog/how-i-closed-confident-ais-2-2m-seed-round-in-5-days
Announcing Confident AI's seed round, with participation from a bunch of great investors.
Wed, 19 Mar 2025 00:00:00 GMT Jeffrey Ip

How I Built Deterministic LLM Evaluation Metrics for DeepEval
https://www.confident-ai.com/blog/how-i-built-deterministic-llm-evaluation-metrics-for-deepeval
In this article, I'm sharing how I've built DeepEval's latest deterministic, LLM-powered, custom metric.
Sun, 09 Feb 2025 00:00:00 GMT Jeffrey Ip

LLM Agent Evaluation Metrics in 2026: Tool Calling, Task Completion, Reasoning, and Trace-Based Evals
https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide
Learn how to evaluate LLM agents end-to-end with tool calling, task completion, reasoning, trace-based evals, human review, and DeepEval code examples.
Mon, 27 Jan 2025 00:00:00 GMT Kritin Vongthongsri

LLM Guardrails for Data Leakage, Prompt Injection, and More
https://www.confident-ai.com/blog/llm-guardrails-the-ultimate-guide-to-safeguard-llm-systems
In this article, you'll learn everything you need to know about LLM guardrails and how to use them for LLM security.
Sun, 26 Jan 2025 00:00:00 GMT Jeffrey Ip

OWASP Top 10 2025 for LLM Applications: What’s new? Risks, and Mitigation Techniques
https://www.confident-ai.com/blog/owasp-top-10-2025-for-llm-applications-risks-and-mitigation-techniques
In this article, we'll go through what the OWASP Top 10 is, as well as what's new in the latest 2025 guidelines.
Sat, 18 Jan 2025 00:00:00 GMT Kritin Vongthongsri

The People's Choice of Top LLM Evaluation Tools in 2025
https://www.confident-ai.com/blog/greatest-llm-evaluation-tools-in-2025
In this article, we'll bring you a hand-picked, carefully curated list of top LLM evaluation tools in the market.
Wed, 15 Jan 2025 00:00:00 GMT Jeffrey Ip

The Comprehensive LLM Safety Guide: Navigate AI regulations and Best Practices for LLM Safety
https://www.confident-ai.com/blog/the-comprehensive-llm-safety-guide-navigate-ai-regulations-and-best-practices-for-llm-safety
In this article, we'll teach you about LLM regulations and how to maintain the safety of your LLM applications.
Sat, 02 Nov 2024 00:00:00 GMT Kritin Vongthongsri

How to Jailbreak LLMs One Step at a Time: Top Techniques and Strategies
https://www.confident-ai.com/blog/how-to-jailbreak-llms-one-step-at-a-time
In this article, I'll show you how to jailbreak your LLM application to detect its vulnerabilities.
Wed, 30 Oct 2024 00:00:00 GMT Kritin Vongthongsri

What is LLM Observability? - The Ultimate LLM Observability Guide
https://www.confident-ai.com/blog/what-is-llm-observability-the-ultimate-llm-monitoring-guide
In this article, I'll share what you should definitely look for in your next LLM Observability solution.
Tue, 29 Oct 2024 00:00:00 GMT Kritin Vongthongsri

Top LLM Chatbot Evaluation Metrics: Conversation Testing Techniques
https://www.confident-ai.com/blog/llm-chatbot-evaluation-explained-top-chatbot-evaluation-metrics-and-testing-techniques
In this article, you'll learn about LLM red teaming and how it can be carried out using DeepTeam.
Sat, 05 Oct 2024 00:00:00 GMT Jeffrey Ip

LLM-as-a-Judge Simply Explained: The Complete Guide to Run LLM Evals at Scale
https://www.confident-ai.com/blog/why-llm-as-a-judge-is-the-best-llm-evaluation-method
Complete guide to LLM-as-a-Judge: how it works, single-output vs pairwise scoring, G-Eval, DAG, prompting techniques, and how to use LLM judges for scalable LLM evaluation.
Sun, 01 Sep 2024 00:00:00 GMT Kritin Vongthongsri

The Definitive LLM Security Guide: OWASP Top 10 2025, Safety Risks and How to Detect Them
https://www.confident-ai.com/blog/the-comprehensive-guide-to-llm-security
In this article, I'll go through all the major pillars of LLM security you must know and how to mitigate them.
Mon, 19 Aug 2024 00:00:00 GMT Kritin Vongthongsri

LLM Red Teaming: The Complete Step-By-Step Guide To LLM Safety
https://www.confident-ai.com/blog/red-teaming-llms-a-step-by-step-guide
In this article, you'll learn about LLM red teaming and how it can be carried out using DeepTeam.
Sat, 29 Jun 2024 00:00:00 GMT Kritin Vongthongsri

Evaluating LLM Systems: Essential Metrics, Benchmarks, and Best Practices
https://www.confident-ai.com/blog/evaluating-llm-systems-metrics-benchmarks-and-best-practices
In this article, you'll learn how to evaluate LLM systems using LLM evaluation metrics and benchmark datasets.
Mon, 24 Jun 2024 00:00:00 GMT Jeffrey Ip

Using LLMs for Synthetic Data Generation: The Definitive Guide
https://www.confident-ai.com/blog/the-definitive-guide-to-synthetic-data-generation-using-llms
In this article, I'll show you everything you need to know about generating realistic synthetic datasets using LLMs.
Thu, 09 May 2024 00:00:00 GMT Kritin Vongthongsri

How to Build an LLM Evaluation Framework, from Scratch
https://www.confident-ai.com/blog/how-to-build-an-llm-evaluation-framework-from-scratch
In this article, you're going to learn how to build the world's most robust and scalable LLM evaluation framework.
Fri, 05 Apr 2024 00:00:00 GMT Jeffrey Ip

Top LLM Benchmarks Explained: MMLU, HellaSwag, BBH, and Beyond
https://www.confident-ai.com/blog/llm-benchmarks-mmlu-hellaswag-and-beyond
In this article, I'm going to go through all the top LLM benchmarks currently used and why they matter.
Sat, 16 Mar 2024 00:00:00 GMT Kritin Vongthongsri

LLM Testing in 2026: Top Methods and Strategies
https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies
In this article, we'll learn everything there is to know about LLM testing, including best practices and methods to test LLMs.
Sun, 25 Feb 2024 00:00:00 GMT Jeffrey Ip

The Ultimate Guide to Fine-Tune LLaMA 3, With LLM Evaluations
https://www.confident-ai.com/blog/the-ultimate-guide-to-fine-tune-llama-2-with-llm-evaluations
In this article, we'll walk through how to fine-tune and evaluate a LLaMA-2 model using Hugging Face and DeepEval.
Tue, 20 Feb 2024 00:00:00 GMT Jeffrey Ip

RAG Evaluation: The Definitive Guide to Unit Testing RAG in CI/CD
https://www.confident-ai.com/blog/how-to-evaluate-rag-applications-in-ci-cd-pipelines-with-deepeval
In this tutorial, we'll walk through how to set up a full testing suite for RAG applications using DeepEval.
Mon, 05 Feb 2024 00:00:00 GMT Jeffrey Ip

LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide
https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
In this article, I'll walk through everything you need to know about LLM evaluation metrics, with code samples.
Mon, 22 Jan 2024 00:00:00 GMT Jeffrey Ip

An Introduction to LLM Benchmarking
https://www.confident-ai.com/blog/the-current-state-of-benchmarking-llms
In this article, I'll show how benchmarking can help you choose the right LLM for your use case.
Mon, 25 Dec 2023 00:00:00 GMT Jeffrey Ip

A Step-By-Step Guide to Evaluating an LLM Text Summarization Task
https://www.confident-ai.com/blog/a-step-by-step-guide-to-evaluating-an-llm-text-summarization-task
In this article, I'll teach you how to create your own text summarization metric.
Sun, 17 Dec 2023 00:00:00 GMT Jeffrey Ip

Why OpenAI Assistants is a Big Win for LLM Evaluation
https://www.confident-ai.com/blog/why-openai-assistants-is-a-big-win-for-llm-evaluation
In this article, I'll share how JudgmentalGPT, our in-house evaluator, was built using OpenAI's Assistants.
Tue, 21 Nov 2023 00:00:00 GMT Jeffrey Ip

Become a Prompt Artist: Understanding the Midjourney LLM
https://www.confident-ai.com/blog/become-a-prompt-artist-understanding-the-midjourney-llm
In this interactive tutorial, I'll show you how to become a Midjournalist to create image you image.
Wed, 15 Nov 2023 00:00:00 GMT Jeffrey Ip

How to Evaluate LLM Applications: The Complete Guide
https://www.confident-ai.com/blog/how-to-evaluate-llm-applications
In this article, we will explain how to evaluate LLM applications and RAG pipelines the right way.
Tue, 07 Nov 2023 00:00:00 GMT Jeffrey Ip

Why we replaced Pinecone with PGVector
https://www.confident-ai.com/blog/why-we-replaced-pinecone-with-pgvector
Do you really need a dedicated vector database for your Generative AI application? Our experience says not always.
Sun, 29 Oct 2023 00:00:00 GMT Jeffrey Ip

What is Retrieval Augmented Generation (RAG)?
https://www.confident-ai.com/blog/what-is-retrieval-augmented-generation
In this article, we're going to dive deep into the RAG rabbit hole.
Sun, 22 Oct 2023 00:00:00 GMT Jeffrey Ip

A Gentle Introduction to LLM Evaluation
https://www.confident-ai.com/blog/a-gentle-introduction-to-llm-evaluation
In this article, we'll introduce the ways in which you can carry out automated LLM evaluation.
Tue, 03 Oct 2023 00:00:00 GMT Jeffrey Ip

How to build a PDF QA chatbot using OpenAI and ChromaDB
https://www.confident-ai.com/blog/how-to-build-a-pdf-qa-chatbot-using-openai-and-chromadb
In this article, you'll learn how to build a RAG-based chatbot on your PDFs using OpenAI and ChromaDB.
Tue, 26 Sep 2023 00:00:00 GMT Jeffrey Ip

Building a customer support chatbot using GPT-3.5 and LlamaIndex
https://www.confident-ai.com/blog/building-a-customer-support-chatbot-using-gpt-3-5-and-llamaindex
In this article, you'll learn how to create a customer support chatbot using GPT-3.5 and LlamaIndex.
Tue, 19 Sep 2023 00:00:00 GMT Jeffrey Ip

Generating synthetic data with LLMs - Part 1
https://www.confident-ai.com/blog/how-to-generate-synthetic-data-using-llms-part-1
LLMs make synthetic data easy to leverage, but how exactly can we make this generated data relevant and useful?
Fri, 08 Sep 2023 00:00:00 GMT Jeffrey Ip