LLM Evals Your Team Will Love. Not Dread.

Postman for AI evaluation. Connect via API, simulate conversations, and test entire AI workflows — not just prompts. No CSVs. No waiting on engineering.

TRUSTED BY 500+ LEADING AI COMPANIES
Panasonic
Toshiba
Samsung
Phreesia
BCG
Epic Games
Humach
Finom
Amdocs
ByteDance
PLATFORM

Testing you'll actually want to run.

Multi-turn conversation testing

Simulate full conversations end-to-end and catch failures that only surface across multiple exchanges. Test your app the way your users actually use it.
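
For teams scripting evals outside the UI, here is a minimal sketch of a multi-turn test using DeepEval, the open-source framework behind our metrics. The conversation content is illustrative, and names like Turn and ConversationCompletenessMetric reflect recent DeepEval releases, so check the docs for your version.

```python
# Illustrative multi-turn eval with DeepEval; assumes a judge model
# (e.g. OPENAI_API_KEY) is configured. API details may vary by version.
from deepeval import evaluate
from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import ConversationCompletenessMetric

# A scripted exchange shaped like a real user session.
convo = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I'd like to change my flight."),
        Turn(role="assistant", content="Sure, what's your booking reference?"),
        Turn(role="user", content="ABC123. Move me to Friday, please."),
        Turn(role="assistant", content="Done! You're on Friday's 9am departure."),
    ]
)

# Conversational metrics score the dialogue as a whole, not turn by turn.
evaluate(test_cases=[convo], metrics=[ConversationCompletenessMetric(threshold=0.7)])
```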

Side-by-side experiments

Change any variable — model, prompt, system logic — and compare results across every metric and pipeline step. See exactly what improved and what regressed.
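
Conceptually, an experiment is the same golden set pushed through two configurations and scored with the same metrics. A rough sketch with DeepEval, where generate_v1 and generate_v2 are placeholders for your two pipeline variants:

```python
# A/B experiment sketch: same golden inputs, two variants, one metric.
# Assumes a judge model (e.g. OPENAI_API_KEY) is configured.
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

golden_inputs = ["How do I reset my password?", "What's your refund policy?"]
metric = AnswerRelevancyMetric(threshold=0.7)

# Placeholders: swap in calls to your real pipeline variants.
def generate_v1(q): return f"(v1 answer to: {q})"
def generate_v2(q): return f"(v2 answer to: {q})"

def mean_score(generate):
    scores = []
    for q in golden_inputs:
        metric.measure(LLMTestCase(input=q, actual_output=generate(q)))
        scores.append(metric.score)
    return sum(scores) / len(scores)

print("v1:", mean_score(generate_v1))  # e.g. current prompt
print("v2:", mean_score(generate_v2))  # e.g. candidate prompt
```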

Alignment metrics with humans

Compare metric scores against human annotations to surface false positives and negatives. Know exactly where your evals agree with your team — and where they don't.
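
Under the hood this is a standard agreement calculation: treat the metric's pass/fail verdicts as predictions and the human annotations as ground truth, then look at where they disagree. A sketch with scikit-learn and made-up labels:

```python
# Metric-vs-human alignment sketch; the labels below are made up.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

human_labels    = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = human marked output good
metric_verdicts = [1, 0, 0, 1, 1, 0, 1, 0]  # 1 = eval metric passed

# Off-diagonal cells are the false positives and false negatives.
tn, fp, fn, tp = confusion_matrix(human_labels, metric_verdicts).ravel()
print(f"false positives: {fp}, false negatives: {fn}")
print(f"Cohen's kappa: {cohen_kappa_score(human_labels, metric_verdicts):.2f}")
```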

Automated evals on every change

Think GitHub Actions, but for evals: product managers and domain experts can tweak prompts, and evaluations run automatically on every change.
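
In practice that means wiring an eval suite into CI. With DeepEval this can be a pytest-style test file that a `deepeval test run` step executes on every push, for example from a GitHub Actions workflow. A minimal sketch with illustrative test data:

```python
# test_chatbot.py -- run in CI with `deepeval test run test_chatbot.py`.
# Assumes a judge model (e.g. OPENAI_API_KEY) is configured in CI secrets.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    # Fails the CI job if relevancy drops below the threshold.
    assert_test(case, [AnswerRelevancyMetric(threshold=0.5)])
```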

MCP-native workflow

Evaluate, iterate, and ship without leaving your favorite IDE — Cursor, Claude Code, or any MCP-compatible editor. Run evals, pull team results, and push fixes in one workflow.

METRICS

Metrics your org can rally behind.
Powered by DeepEval.

50+ research-backed eval metrics used by teams at OpenAI, Google, and Microsoft — from hallucination and faithfulness to tone, safety, and task completion.
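
Custom criteria work the same way. For example, DeepEval's GEval lets you define an LLM-as-a-judge metric in plain language; the criteria and test data below are illustrative:

```python
# Custom LLM-as-a-judge metric via DeepEval's GEval (illustrative).
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

tone = GEval(
    name="Professional Tone",
    criteria="Determine whether the actual output is polite and professional.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

case = LLMTestCase(
    input="Where is my order?!",
    actual_output="I'm sorry for the delay. Let me check on that right away.",
)
tone.measure(case)
print(tone.score, tone.reason)
```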

INTEGRATIONS

Works with your stack. All of it.

Evaluate with any model provider, instrument with any framework, and run evals in any CI/CD pipeline.

Model Providers
OpenAI
Claude
Gemini
Azure OpenAI
AWS Bedrock
Vertex AI
Mistral
LiteLLM
Portkey
Frameworks
LangChain
LlamaIndex
CrewAI
OpenAI Agents
Vercel AI SDK
LangGraph
PydanticAI
OpenTelemetry
CI/CD
GitHub Actions
GitLab CI
Jenkins
CircleCI
Buildkite
Azure Pipelines
FAQ

Have a Question?

Check out our FAQs below, or talk to a human. They won't hallucinate.

Do I need to integrate an SDK to get started?
If your AI app is reachable through APIs, no. Point to any endpoint and start sending requests — just like Postman. No SDK, no code changes, no engineering dependency to start running evals.

What kinds of metrics do you offer?
We offer 50+ research-backed metrics, mostly LLM-as-a-judge evaluators that use a language model to assess quality, tone, safety, and more. Every metric is powered by DeepEval, the open-source evaluation framework used by teams at OpenAI, Google, and Microsoft.

Can I test multi-turn conversations?
Yes. Unlike most eval tools that only test single prompts, you can simulate full multi-turn conversations end-to-end and catch failures that only surface across multiple exchanges.

How do side-by-side experiments work?
Change any variable — model, prompt, system logic — and run your golden dataset against both versions. Results are compared side by side across every metric and pipeline step, so you can see exactly what improved and what regressed.

Does every eval require engineering work?
No. Engineers connect endpoints and configure pipelines; product managers and domain experts can then tweak prompts, run experiments, and evaluate results — no engineering bottleneck required.

Do you support my model provider?
Yes. We support major LLM providers like OpenAI, Anthropic, and Google; cloud providers like AWS Bedrock, Vertex AI, and Azure OpenAI; and gateways such as Portkey and LiteLLM.

How do I know my app is returning the right content?
We offer a built-in tool that helps you verify your app is returning the correct content for testing. Payloads are flexible, and outputs can be parsed from any format you return.