Top LLM Benchmarks Explained: MMLU, HellaSwag, BBH, and Beyond
In this article, I'm going to go through all the top LLM benchmarks currently used and why they matter.
LLM Testing in 2026: Top Methods and Strategies
In this article, we'll learn everything there is to know about LLM testing, including best practices and methods for testing LLMs.
The Ultimate Guide to Fine-Tune LLaMA 3, With LLM Evaluations
In this article, we'll walk through how to fine-tune and evaluate a LLaMA 3 model using Hugging Face and DeepEval.
RAG Evaluation: The Definitive Guide to Unit Testing RAG in CI/CD
In this tutorial, we'll walk through how to set up a full testing suite for RAG applications using DeepEval.

LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide
In this article, I'll walk through everything you need to know about LLM evaluation metrics, with code samples.

An Introduction to LLM Benchmarking
In this article, I'll show how benchmarking can help you choose the right LLM for your use case.
A Step-By-Step Guide to Evaluating an LLM Text Summarization Task
In this article, I'll teach you how to create your own text summarization metric.
Why OpenAI Assistants is a Big Win for LLM Evaluation
In this article, I'll share how JudgmentalGPT, our in-house evaluator, was built using OpenAI's Assistants.
Become a Prompt Artist: Understanding the Midjourney LLM
In this interactive tutorial, I'll show you how to become a Midjournalist and create the images you imagine.
How to Evaluate LLM Applications: The Complete Guide
In this article, we will break down how to evaluate LLM applications and RAG pipelines the right way.

