Confident AI Blog - Resources to help teams stay confident in AI

Top LLM Benchmarks Explained: MMLU, HellaSwag, BBH, and Beyond

In this article, I'll go through the top LLM benchmarks in use today and why they matter.

Kritin Vongthongsri

Mar 16, 2024 · 12 min read

LLM Testing in 2026: Top Methods and Strategies

In this article, we'll learn everything there is to know about LLM testing, including best practices and methods for testing LLMs.

Jeffrey Ip

Feb 25, 2024 · 8 min read

The Ultimate Guide to Fine-Tune LLaMA 3, With LLM Evaluations

In this article, we'll walk through how to fine-tune and evaluate a LLaMA-2 model using Hugging Face and DeepEval.

Jeffrey Ip

Feb 20, 2024 · 12 min read

RAG Evaluation: The Definitive Guide to Unit Testing RAG in CI/CD

In this tutorial, we'll walk through how to set up a full testing suite for RAG applications using DeepEval.

Jeffrey Ip

Feb 5, 2024 · 9 min read

LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide

In this article, I'll walk through everything you need to know about LLM evaluation metrics, with code samples.

Jeffrey Ip

Jan 22, 2024 · 16 min read

An Introduction to LLM Benchmarking

In this article, I'll show how benchmarking can help you choose the right LLM for your use case.

Jeffrey Ip

Dec 25, 2023 · 17 min read

A Step-By-Step Guide to Evaluating an LLM Text Summarization Task

In this article, I'll teach you how to create your own text summarization metric.

Jeffrey Ip

Dec 17, 2023 · 8 min read

Why OpenAI Assistants is a Big Win for LLM Evaluation

In this article, I'll share how JudgmentalGPT, our in-house evaluator, was built using OpenAI's Assistants.

Jeffrey Ip

Nov 21, 2023 · 6 min read

Become a Prompt Artist: Understanding the Midjourney LLM

In this interactive tutorial, I'll show you how to become a Midjournalist and create the images you imagine.

Jeffrey Ip

Nov 15, 2023 · 16 min read

How to Evaluate LLM Applications: The Complete Guide

In this article, we'll break down how to evaluate LLM applications and RAG pipelines the right way.

Jeffrey Ip

Nov 7, 2023 · 10 min read