Introduction to LLM Evaluation
Run LLM evals with or without code — choose the workflow that fits your team.
Overview
LLM evaluation on Confident AI refers to benchmarking via datasets in a pre-deployment setting, and it can be done in two ways:
- No-code, directly in the platform UI; best for QAs, PMs, and SMEs
- Code-driven, using the deepeval (or deepeval.ts) framework; best for engineers and QAs
Both approaches give you access to the same comprehensive evaluation metrics and insights — the difference is in how you run them.
For those looking to use online evals for production monitoring on observability data, click here.
What you can evaluate
Both code-driven and no-code workflows allow you to evaluate all 3 use cases:
- Single-turn: one input → one output interactions like Q&A, summarization, or classification tasks (sketched in code after this list).
- Multi-turn: conversational interactions where context builds across multiple exchanges.
- Agentic: complex systems with tool calls, reasoning chains, and multi-step execution.
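To make the single-turn case concrete, here is a minimal sketch using the Python deepeval framework. The LLMTestCase fields shown (input, actual_output) are standard deepeval parameters; your_llm_app is a hypothetical placeholder for your own application's generation call.
```python
# Minimal single-turn test case in deepeval (Python).
from deepeval.test_case import LLMTestCase

def your_llm_app(question: str) -> str:
    # Hypothetical placeholder: swap in your real LLM call here.
    return "Paris is the capital of France."

question = "What is the capital of France?"
test_case = LLMTestCase(
    input=question,                        # what the user sent
    actual_output=your_llm_app(question),  # what your app generated
)
```
Multi-turn and agentic interactions have their own dedicated test case types and fields in deepeval; see the single vs multi-turn guide linked at the end of this page.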
Choose your workflow
Run evals entirely in the platform UI without writing any code, or use deepeval programmatically:

No-code (platform UI)
- Run experiments on single and multi-prompt AI apps
- Compare prompts and models in Arena
Suitable for: PMs, QA teams, rapid prototyping

Code-driven (deepeval)
- Automated regression testing in CI/CD (see the sketch after this list)
- Full control over output generation
- Version-controlled eval logic
Suitable for: Engineers, automated testing
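For the code-driven path, the sketch below shows what a regression test in CI/CD can look like with deepeval's pytest integration. It assumes a judge model is configured for the metric (e.g. an OpenAI API key for AnswerRelevancyMetric), and your_llm_app is a hypothetical stand-in for your application.
```python
# test_llm_app.py — run in CI with: deepeval test run test_llm_app.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def your_llm_app(question: str) -> str:
    # Hypothetical placeholder: swap in your real LLM call here.
    return "You can return unused items within 30 days for a full refund."

def test_refund_policy_answer():
    test_case = LLMTestCase(
        input="What is your refund policy?",
        actual_output=your_llm_app("What is your refund policy?"),
    )
    # Fails the test (and the CI job) if the relevancy score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```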
Not sure which to pick?
Most teams use both approaches. Start with no-code to explore and experiment, then move to code-driven for automated regression testing in CI/CD. The results from both workflows appear in the same dashboards.
Key Capabilities
- Create, organize, and version datasets of test cases to systematically benchmark your LLM applications
- Run experiments to compare prompts, models, and parameters with detailed analysis and insights
- Catch regressions across different versions of your AI app with side-by-side test case comparisons
- Integrate native pytest evaluations into your deployment CI/CD pipelines (see the workflow sketch after this list)
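The sketch below ties these capabilities together: it pulls a dataset of goldens from Confident AI, generates outputs with a hypothetical your_llm_app function, and runs an evaluation whose results land in the same dashboards as no-code runs. It assumes a recent deepeval version, that you are logged in via deepeval login, and that a dataset aliased "My Dataset" already exists on Confident AI.
```python
# Sketch of the dataset → evaluation workflow (assumes `deepeval login` has been run
# and a dataset aliased "My Dataset" exists on Confident AI).
from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def your_llm_app(question: str) -> str:
    # Hypothetical placeholder: swap in your real LLM call here.
    return "..."

# Pull the versioned goldens (inputs without outputs) from Confident AI.
dataset = EvaluationDataset()
dataset.pull(alias="My Dataset")

# Turn each golden into a runnable test case by generating an actual output.
test_cases = [
    LLMTestCase(input=golden.input, actual_output=your_llm_app(golden.input))
    for golden in dataset.goldens
]

# Run the metric; results appear alongside your no-code runs on the platform.
evaluate(test_cases=test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])
```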
Learn the fundamentals
New to LLM evaluation? These concepts will help you get the most out of your evals:
- Single vs Multi-Turn Evals — understand when to use each approach
- Test Cases, Goldens, and Datasets — the building blocks of evaluation
- LLM-as-a-Judge Metrics — how automated scoring works