If you are building something conversational (a chat assistant, a copilot, an agent with memory), you will need to decide how to evaluate full conversations, not just individual replies. Most teams start with single-turn coverage because the data is tractable and the mental model is simple. That is a fine starting point, but if your product is a conversation, it is not enough.
The failures users actually complain about are relational — they live across turns, not inside any single message:
- The model contradicts something it said four turns ago
- Context from earlier in the conversation gets dropped
- The topic drifts without the user steering it
- The assistant declares the issue resolved while the user still has a problem
Each individual reply scores fine. The thread is wrong. Users say "it forgot what I said" or "it argued with me" — and single-turn scores will never catch that.
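To make the "dropped context" failure concrete, here is a minimal sketch of a thread-level check that scores the whole transcript rather than any single reply. The `Turn` dataclass, the `dropped_context` helper, and the naive substring matching are all illustrative choices, not a prescribed implementation; a real harness would use an LLM judge or entailment model instead of string containment.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str  # "user" or "assistant"
    text: str

def dropped_context(transcript: list[Turn], fact: str) -> bool:
    """Flag the cross-turn failure where the user stated `fact`,
    but no assistant turn after that point ever refers to it."""
    stated_at = next(
        (i for i, t in enumerate(transcript)
         if t.role == "user" and fact.lower() in t.text.lower()),
        None,
    )
    if stated_at is None:
        return False  # fact never stated, so nothing to drop
    later = [t for t in transcript[stated_at + 1:] if t.role == "assistant"]
    return bool(later) and not any(
        fact.lower() in t.text.lower() for t in later
    )
```

Note that every assistant reply in a failing transcript can be polite and locally sensible; only a check that sees the turns before it can detect that the order number was asked for twice.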
Does your product have conversations?
If yes, you need multi-turn coverage. In development, multi-turn data is almost always scarce — real transcripts are messy, partial, or locked in privacy review — so most teams ship with only single-turn suites and tell themselves the chat product is covered. It is not. The failures above only surface in production, and reproducing them with a single-turn harness is nearly impossible because the bug lives across turns, not inside one.
If no — if your product is single-turn (a search bar with AI answers, an email draft generator, a document extraction API) — single-turn evaluation is the right and sufficient strategy. You do not need thread-level metrics for a product that does not have threads.
The objections from conversational teams are familiar: "We do not have conversation goldens." "Simulation will not feel real." Both are fair. But no multi-turn strategy means you are choosing production as your first multi-turn test environment. That is the most expensive lab you have, and the failures arrive with real users attached.
If the product is multi-turn, you need datasets and metrics aimed at threads, and pre-production usually needs simulated multi-turn runs so you are not waiting on live traffic to discover holes. Single-turn evals are necessary but insufficient — skipping the second half is a decision you will feel in the first month live.
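A simulated multi-turn run can be as simple as a scripted user driving the assistant and handing the resulting transcript to thread-level checks. The sketch below assumes an `assistant_fn` callable that takes the conversation so far and returns a reply; the function name and the scripted-user design are assumptions for illustration, not a fixed API.

```python
def simulate_conversation(assistant_fn, user_script, max_turns=8):
    """Drive the assistant through a scripted multi-turn scenario and
    return the full transcript as (role, text) pairs, ready to be
    scored by thread-level metrics instead of per-reply checks."""
    transcript = []
    for user_msg in user_script[:max_turns]:
        transcript.append(("user", user_msg))
        reply = assistant_fn(transcript)  # model sees the whole thread
        transcript.append(("assistant", reply))
    return transcript
```

In practice the scripted user is often itself an LLM playing a persona, but even a fixed script like this lets you replay a known cross-turn failure in CI instead of waiting for it to recur in production.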
TL;DR — Single-turn evals grade replies; multi-turn evals grade conversations. If your product is a conversation, you need both.