G-Eval

Learn how to create a G-Eval metric for custom evaluation algorithms

Overview

G-Eval is a research-backed, LLM-as-a-Judge framework that lets you define custom evaluation metrics using plain language criteria. Confident AI uses G-Eval under the hood to power most custom LLM-as-a-judge metrics created on the platform.

Common use cases include answer correctness, coherence, tonality, safety, custom RAG evaluation, and summarization quality.

Why G-Eval?

G-Eval addresses common pitfalls of LLM-as-a-judge systems:

  • Inconsistent scoring — CoT decomposition forces structured reasoning, reducing randomness across runs
  • Lack of fine-grained judgment — Probability-weighted scoring enables nuanced differentiation between similar outputs
  • Verbosity and narcissistic bias — Customizable criteria let you explicitly penalize or reward specific behaviors

It is also extremely reliable (±0.02 variation in scores over 10+ runs). This means that with enough care in writing your G-Eval criteria, you will be able to achieve the metric results you expect.

How It Works

G-Eval works in a few simple steps:

  1. Generates a list of evaluation steps from your initial custom criteria
  2. Uses these evaluation steps to compute a score from 0 - 10
  3. Takes a probability-weighted summation over the judge's output scores to make the final score more fine-grained and reliable
  4. Divides the final score by 10 to normalize it to the range 0 - 1
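
As a rough illustration (not Confident AI's actual implementation), steps 3 and 4 can be sketched in Python like this, assuming the judge returns probabilities over candidate integer scores:

# Hypothetical judge output: probabilities assigned to candidate scores on the 0 - 10 scale
token_probs = {7: 0.6, 8: 0.3, 6: 0.1}

# Step 3: probability-weighted summation of the candidate scores
weighted_score = sum(score * prob for score, prob in token_probs.items())  # 7.2

# Step 4: divide by 10 to normalize into the 0 - 1 range
final_score = weighted_score / 10  # 0.72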

G-Eval also provides an optional rubric, which you can use to confine custom metric scores to certain ranges.

The evaluation steps do not determine what score G-Eval gives; they simply guide the LLM judge, as a form of CoT, to produce more reliable output.

If you’re creating G-Eval locally via deepeval, you can also use the .upload() method to create the metric on Confident AI.

Writing an Effective Criteria

Follow these best practices when creating a G-Eval metric:

  • Be specific and detailed in your criteria
    If you expect the metric to handle X, Y, or Z, make sure those requirements are explicitly written into the criteria. The more detailed your criteria, the more reliable the evaluation.

  • Explicitly reference all required parameters
    State what each parameter means and how they connect. For example, specify that the 'actual output' should semantically match the 'expected output', rather than leaving the relationship implicit.

  • Use precise, concrete language
    Define exactly what terms like “accurate” mean — e.g., “does not contradict the 'retrieval context'” is better than a vague “factually correct”.

  • Keep evaluation steps focused on thinking, not scoring
    Evaluation steps should guide the LLM-as-a-Judge through its reasoning process. Save score definitions for the rubric — mixing them into the evaluation steps leads to worse results.

  • Use quantitative definitions in your rubric
    When defining score ranges, spell out what they mean in practice. For example, instead of labeling 0–1 as “low accuracy,” specify something concrete like “2–3 contradictions between the 'actual output' and 'retrieval context'”.
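
To make these guidelines concrete, here is an illustrative contrast between a vague criteria and a more specific, parameter-aware one (the wording is an example, not a template):

# Vague: leaves "accurate" undefined and never names the test case parameters
vague_criteria = "Check if the answer is accurate."

# Specific: names the parameters, defines what counts as a failure, and states priorities
specific_criteria = (
    "Using the 'input' and 'retrieval context', determine whether the 'actual output' "
    "contains any claims that contradict the 'retrieval context'. "
    "Penalize contradictions heavily; do not penalize omissions of information "
    "that was never asked for in the 'input'."
)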

Create G-Eval via the UI

Single or multi-turn G-Eval metrics can be created under Project > Metrics > Library.

Step 1: Fill in metric details

Provide the metric name, and optionally a description. You can also toggle whether you’re creating a single-turn or multi-turn metric.

[Screenshot: General Metric Info]

Your metric name must be unique in your project and must not clash with any of the default metric names.

Step 2: Select required parameters

Custom metrics need to know which test case parameters they should consider during evaluation for the results to be accurate and reliable - this step gives you the opportunity to do exactly that. The example below shows selecting single-turn test case parameters for a single-turn metric, but you can also do this for multi-turn parameters.

[Screenshot: Evaluation Parameters]
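
If you later recreate the same metric in code, this selection corresponds to deepeval's evaluation_params. A minimal sketch for a single-turn metric that considers the input and actual output might look like this:

from deepeval.test_case import LLMTestCaseParams

# Code equivalent of selecting the "Input" and "Actual Output" parameters in the UI
evaluation_params = [
    LLMTestCaseParams.INPUT,
    LLMTestCaseParams.ACTUAL_OUTPUT,
]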

Step 3: Define custom criteria

A custom criteria helps Confident AI generate evaluation steps and is what separates an out-of-the-box metric from a custom one.

You must mention the names of the required parameters you’ve selected from the previous step. For example, if you’ve selected “Input” and “Actual Output” for a single-turn use case, your criteria could be something like:

Given the 'input' and 'actual output' which are the query and answer to an AI medical chatbot,
determine whether the 'actual output' is relevant and helpful to the 'input'.
Penalize heavily if not helpful. Relevancy is not so important.

[Screenshot: Metric Criteria]

Criteria are used for generating evaluation steps, and not used directly for evaluation.

Step 4: Outline evaluation steps (optional)

This step is optional because Confident AI will auto-generate evaluation steps based on your criteria if you don’t provide any.

However, providing evaluation steps gives custom metrics more reliable scores, as Confident AI will skip the step-generation process when they are supplied.

You should not outline what scores to return at this stage (that goes in the rubric which we will talk about later).
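
For example, continuing the medical chatbot criteria from the previous step, illustrative evaluation steps (written here as a Python list for readability; in the UI you enter them as plain text) might be:

# Illustrative evaluation steps for the medical chatbot criteria above
evaluation_steps = [
    "Read the 'input' to understand what the user is asking the medical chatbot.",
    "Check whether the 'actual output' directly addresses and helps with the question in the 'input'.",
    "Heavily penalize responses that are unhelpful, treating minor relevancy issues as less important.",
]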

[Screenshot: Optional Evaluation Steps]

Step 5: Set up a rubric (optional)

Lastly, you can optionally provide a rubric to confine evaluation scores. The rubric's list of score ranges must:

  • Not overlap in score range
  • Contain a clear expected outcome for each score range
  • Be inclusive of 0 - 10

The rubric score is defined on a 0–10 scale, but the final score reported by Confident AI is normalized to a 0–1 range. We use integers for the rubric since LLM-as-a-Judge performs more reliably with whole numbers, and then divide by 10 afterward to convert it into the normalized scale.
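
For instance, a rubric for a correctness metric could be laid out like this (ranges and wording are purely illustrative):

# Illustrative rubric: non-overlapping ranges covering 0 - 10, each with a concrete expected outcome
rubric = {
    (0, 2): "3 or more contradictions between the 'actual output' and 'expected output'",
    (3, 6): "1 - 2 contradictions, or a major omission of requested detail",
    (7, 8): "No contradictions, but minor omissions of detail",
    (9, 10): "No contradictions and no meaningful omissions",
}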

[Screenshot: Optional Rubric]

Step 6: Review and save

Once you have your criteria set, make sure everything looks right in the final review page, and click Save.

You can now add your custom metric to a metric collection to start running remote evals.

Create G-Eval in Code

You can create G-Eval metrics locally using deepeval and upload them to Confident AI.

Use GEval for evaluating single LLM interactions:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
)

You can also provide explicit evaluation_steps instead of criteria for more control:

correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'",
        "Heavily penalize omission of detail",
        "Vague language or contradicting opinions are OK"
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
)
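
Before uploading, you can optionally sanity-check the metric locally against a test case. The example values below are hypothetical, and running this requires an LLM judge (for example, an OpenAI API key) to be configured for deepeval:

from deepeval.test_case import LLMTestCase

# Hypothetical test case used to sanity-check the metric locally
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
)

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)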

Once you’re happy with your GEval metric, call the .upload() method to create it on Confident AI. This syncs your local metric to the platform, where you can add it to metric collections and run remote evaluations.

correctness_metric.upload()

For more details on parameters, rubrics, and advanced usage, see the deepeval documentation for GEval and ConversationalGEval.

Under the hood, .upload() calls the Evals API to create a custom G-Eval metric. Note that the name of your G-Eval metric must not already be taken on your Confident AI project.

POST /v1/metrics
import requests

url = "https://api.confident-ai.com/v1/metrics"

payload = {
    "name": "Correctness",
    "multiTurn": False,
    "criteria": "Determine if the `actual output` is correct based on the `expected output`.",
    "evaluationParams": ["actualOutput", "expectedOutput"]
}
headers = {
    "CONFIDENT_API_KEY": "<PROJECT-API-KEY>",
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

print(response.json())