
AI systems can pass evals while still behaving incorrectly. This post explores three common failure modes that slip through output-based evaluation.

Passing evals doesn’t mean your system works. It means your tests didn’t catch how it fails.

In this article, I'll break down multi-turn LLM evaluation — how it differs from single-turn, what metrics actually matter, and how to implement it.
This article will teach you everything you need to evaluate MCP-based LLM applications.

A practical guide to evaluating AI agents with LLM metrics and tracing—plus when human review matters, how it calibrates judges, and workflows that combine CI, sampling, and production signals.
In this article, you'll learn everything about running LLM Arena-as-a-judge, a novel way to regression-test LLMs.
This article will go through everything you'll need for RAG evaluation, including metrics and best practices.
Most LLM evals fail because their metrics don't predict ROI. Build outcome-based evals that correlate with business KPIs instead.
This article goes through everything on G-Eval so anyone can easily evaluate LLM apps on any task-specific criteria.
In this article, we'll go through all the top LLM evaluators in 2025, including G-Eval and other LLM-as-a-judge evaluators.