Introduction to LLM Evaluation
Run LLM evals with or without code — choose the workflow that fits your team.
LLM evaluation on Confident AI refers to benchmarking via datasets in a pre-deployment setting. It can be done in two ways:
No-code, entirely in the platform UI.
Code-driven, with the deepeval (or deepeval.ts) framework, best for engineers and QAs.
Both approaches give you access to the same comprehensive evaluation metrics and insights; the difference is in how you run them.
If you are looking to use online evals for production monitoring on observability data, see the online evals documentation instead.
Both code-driven and no-code workflows allow you to evaluate all 3 use cases:
Single-turn: One input → one output interactions like Q&A, summarization, or classification tasks.
Multi-turn: Conversational interactions where context builds across multiple exchanges.
Agentic: Complex systems with tool calls, reasoning chains, and multi-step execution.
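To make the difference between these three shapes concrete, the sketch below models them as plain Python dataclasses. All class and field names here (`SingleTurnCase`, `MultiTurnCase`, `AgenticCase`, `Turn`, `ToolCall`, `describe`) are hypothetical and chosen for illustration only; they are not deepeval's or Confident AI's actual types.

```python
from dataclasses import dataclass, field


@dataclass
class SingleTurnCase:
    """One input -> one output, e.g. Q&A, summarization, classification."""
    input: str
    actual_output: str


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class MultiTurnCase:
    """A conversation where context builds across exchanges."""
    turns: list[Turn] = field(default_factory=list)


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class AgenticCase:
    """A multi-step run: the final answer plus the tool calls made on the way."""
    input: str
    actual_output: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def describe(case) -> str:
    """Return a short label for the use case a test case belongs to."""
    if isinstance(case, AgenticCase):
        return f"agentic ({len(case.tool_calls)} tool calls)"
    if isinstance(case, MultiTurnCase):
        return f"multi-turn ({len(case.turns)} turns)"
    return "single-turn"
```

The point of the sketch is only that each use case carries different information: a single-turn case is a pair, a multi-turn case is an ordered list of turns, and an agentic case also records the intermediate steps that led to the output.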
Run evals entirely in the platform UI without writing any code, or use deepeval programmatically:
No-code. Suitable for: PMs, QA teams, rapid prototyping
Code-driven. Suitable for: Engineers, automated testing
Most teams use both approaches. Start with no-code to explore and experiment, then move to code-driven for automated regression testing in CI/CD. The results from both workflows appear in the same dashboards.
Create, organize, and version datasets of test cases to systematically benchmark your LLM applications
Run experiments to compare prompts, models, and parameters with detailed analysis and insights
Catch regressions on different versions of your AI app with side-by-side test case comparisons
Integrate native pytest evaluations into your deployment CI/CD pipelines
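As a rough illustration of how dataset benchmarking, regression checks, and a pytest-style CI gate fit together, here is a minimal standard-library sketch. It deliberately does not use deepeval's API: the `exact_match` metric, the `DATASET` contents, and the `app_v1`/`app_v2` stand-ins are all assumptions made up for this example, and a real setup would likely swap in the framework's own metrics and test cases.

```python
def exact_match(expected: str, actual: str) -> float:
    """Toy metric: 1.0 when the normalized strings match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_eval(dataset, app):
    """Score every test case in the dataset against one version of the app."""
    return [exact_match(case["expected_output"], app(case["input"])) for case in dataset]


def find_regressions(dataset, baseline_app, candidate_app):
    """Return the inputs where the candidate scores lower than the baseline."""
    old_scores = run_eval(dataset, baseline_app)
    new_scores = run_eval(dataset, candidate_app)
    return [
        case["input"]
        for case, old, new in zip(dataset, old_scores, new_scores)
        if new < old
    ]


# Hypothetical dataset and two stand-in "app versions" (no real LLM calls).
DATASET = [
    {"input": "capital of France?", "expected_output": "Paris"},
    {"input": "2 + 2?", "expected_output": "4"},
]


def app_v1(prompt: str) -> str:
    return {"capital of France?": "Paris", "2 + 2?": "4"}[prompt]


def app_v2(prompt: str) -> str:
    # Same answers with different casing; the toy metric treats them as equal.
    return {"capital of France?": "paris", "2 + 2?": "4"}[prompt]


def test_no_regressions():
    """pytest-style gate: fails the run if the candidate regresses on any case."""
    assert find_regressions(DATASET, app_v1, app_v2) == []
```

Running this file under pytest in a CI job would block a deployment whenever `find_regressions` returns a non-empty list, which is the same idea as the side-by-side test case comparison described above, reduced to its simplest form.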
New to LLM evaluation? These concepts will help you get the most out of your evals: