G-Eval
Learn how to create a G-Eval metric for custom evaluation algorithms
Overview
G-Eval is a research-backed, LLM-as-a-Judge framework that lets you define custom evaluation metrics using plain language criteria. Confident AI uses G-Eval under the hood to power most custom LLM-as-a-judge metrics created on the platform.
Common use cases include answer correctness, coherence, tonality, safety, custom RAG evaluation, and summarization quality.
Why G-Eval?
G-Eval addresses common pitfalls of LLM-as-a-judge systems:
- Inconsistent scoring — CoT decomposition forces structured reasoning, reducing randomness across runs
- Lack of fine-grained judgment — Probability-weighted scoring enables nuanced differentiation between similar outputs
- Verbosity and narcissistic bias — Customizable criteria let you explicitly penalize or reward specific behaviors
It is also extremely reliable, with scores typically varying by no more than ±0.02 across 10+ runs. This means that with enough care in writing your G-Eval criteria and evaluation steps, you can iterate until the metric consistently produces the results you expect.
How It Works
G-Eval works in a few simple steps:
- Generates a list of evaluation steps from your initial custom criteria
- Uses these evaluation steps to compute a score from 0 - 10
- Takes a probability-weighted summation over the possible scores to make the result more fine-grained and reliable (illustrated in the sketch below)
- Divides the final score by 10 to normalize it to the range 0 - 1
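As a rough illustration of the probability-weighted step, here is a minimal sketch of how a final score could be derived from the judge's score-token probabilities, following the approach described in the G-Eval paper. The probabilities below are made up, and Confident AI's internal implementation may differ:

```python
# Hypothetical probabilities the LLM judge assigns to each candidate score token.
token_probs = {7: 0.15, 8: 0.55, 9: 0.25, 10: 0.05}

# Probability-weighted summation over the candidate scores (0 - 10 scale).
weighted_score = sum(score * p for score, p in token_probs.items())  # ≈ 8.2

# Divide by 10 to normalize into the 0 - 1 range reported by Confident AI.
normalized_score = weighted_score / 10  # ≈ 0.82
print(normalized_score)
```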
G-Eval also provides an optional rubric, which you can use to confine custom metric scores to specific ranges.
The evaluation steps do not determine what score G-Eval gives; they simply guide the LLM judge, as a form of CoT, toward more reliable output.
If you’re creating G-Eval locally via deepeval, you can also use the upload method to create a metric.
Writing an Effective Criteria
Follow these best practices when creating a G-Eval metric:
- Be specific and detailed in your criteria: If you expect the metric to handle X, Y, or Z, make sure those requirements are explicitly written into the criteria. The more detailed your criteria, the more reliable the evaluation (see the sketch after this list).
- Explicitly reference all required parameters: State what each parameter means and how they connect. For example, specify that the 'actual output' should semantically match the 'expected output', rather than leaving the relationship implicit.
- Use precise, concrete language: Define exactly what terms like "accurate" mean. For example, "does not contradict the 'retrieval context'" is better than a vague "factually correct".
- Keep evaluation steps focused on thinking, not scoring: Evaluation steps should guide the LLM-as-a-Judge through its reasoning process. Save score definitions for the rubric; mixing them into the evaluation steps leads to worse results.
- Use quantitative definitions in your rubric: When defining score ranges, spell out what they mean in practice. For example, instead of labeling 0–1 as "low accuracy," specify something concrete like "2–3 contradictions between the 'actual output' and 'retrieval context'".
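To make these practices concrete, here is an illustrative before-and-after. The wording and parameter names ('input', 'actual output', 'retrieval context') are examples, not platform defaults:

```python
# Illustrative only: a vague criteria vs. one that follows the practices above.
vague_criteria = "Check if the answer is accurate."

specific_criteria = (
    "Determine whether the 'actual output' answers the question in the 'input' "
    "without contradicting any facts in the 'retrieval context'. Penalize unsupported "
    "claims and omissions of key details; do not penalize differences in style or length."
)
```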
Create G-Eval via the UI
Single or multi-turn G-Eval metrics can be created under Project > Metrics > Library.
Fill in metric details
Provide the metric name, and optionally a description. You can also toggle whether you’re creating a single-turn or multi-turn metric.
Your metric name must be unique within your project and must not clash with any of the default metric names.
Select required parameters
Custom metrics need to know which test case parameters to consider during evaluation for the results to be accurate and reliable, and this step gives you the opportunity to specify exactly that. The example below shows selecting single-turn test case parameters for a single-turn metric, but the same applies to multi-turn parameters.
Define custom criteria
A custom criteria helps Confident AI generate evaluation steps and is what separates an out-of-the-box metric from a custom one.
You must mention the names of the required parameters you've selected in the previous step. For example, if you've selected "Input" and "Actual Output" for a single-turn use case, your criteria could be something like: "Determine whether the 'Actual Output' answers the question in the 'Input' accurately and completely."
Criteria are used for generating evaluation steps, and not used directly for evaluation.
Outline evaluation steps (optional)
This step is optional because Confident AI will auto-generate evaluation steps based on your criteria if you don't provide them.
However, providing evaluation steps gives custom metrics more reliable scores, as Confident AI will skip the step-generation process when they are supplied.
You should not outline what scores to return at this stage (that goes in the rubric which we will talk about later).
Setup rubric (optional)
Lastly, you can optionally provide a set of rubrics to confine evaluation scores. Your list of rubrics must:
- Not overlap in score range
- Contain a clear expected outcome for each score range
- Collectively cover the full 0 - 10 range
The rubric score is defined on a 0–10 scale, but the final score reported by Confident AI is normalized to a 0–1 range. We use integers for the rubric since LLM-as-a-Judge performs more reliably with whole numbers, and then divide by 10 afterward to convert it into the normalized scale.
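For reference, an equivalent rubric defined in code might look like the sketch below. This assumes a recent deepeval release that exposes a Rubric helper for GEval; the import path, score ranges, and expected outcomes are example wording, so check the deepeval docs for the exact fields:

```python
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric  # assumed import path; may differ by version
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the 'actual output' is factually correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    # Example rubric: ranges do not overlap, each has a clear expected outcome,
    # and together they cover the full 0 - 10 scale.
    rubric=[
        Rubric(score_range=(0, 2), expected_outcome="Contradicts the 'expected output' on most key facts."),
        Rubric(score_range=(3, 6), expected_outcome="Some contradictions or important omissions."),
        Rubric(score_range=(7, 10), expected_outcome="Factually aligned with the 'expected output', with at most minor omissions."),
    ],
)
```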
Create G-Eval in Code
You can create G-Eval metrics locally using deepeval and upload them to Confident AI.
Single-turn
Use GEval for evaluating single LLM interactions:
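Here is a minimal sketch; the metric name, criteria wording, and test case values are illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define a custom metric from a plain-language criteria.
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the 'actual output' is factually correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

# Example test case; swap in your own values.
test_case = LLMTestCase(
    input="When did Apollo 11 land on the Moon?",
    actual_output="Apollo 11 landed on the Moon on July 20, 1969.",
    expected_output="July 20, 1969.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
```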
You can also provide explicit evaluation_steps instead of criteria for more control:
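For example (the steps and parameter choices below are illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Providing evaluation_steps skips the step-generation process entirely.
correctness_metric = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the facts in the 'actual output' contradict any facts in the 'expected output'.",
        "Heavily penalize omission of important details from the 'expected output'.",
        "Vague language or minor stylistic differences are acceptable.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)
```

Multi-turn
Use ConversationalGEval for evaluating entire conversations. The sketch below assumes a recent deepeval version where ConversationalTestCase is built from Turn objects; older releases structure turns differently, so check the deepeval docs for your installed version:

```python
from deepeval.metrics import ConversationalGEval
from deepeval.test_case import ConversationalTestCase, Turn

# Example criteria; the turns below are illustrative.
professionalism_metric = ConversationalGEval(
    name="Professionalism",
    criteria="Determine whether the assistant's turns remain professional and helpful throughout the conversation.",
)

convo_test_case = ConversationalTestCase(
    turns=[
        Turn(role="user", content="I was charged twice this month."),
        Turn(role="assistant", content="Sorry about that! Let me look into the duplicate charge for you."),
    ]
)

professionalism_metric.measure(convo_test_case)
print(professionalism_metric.score, professionalism_metric.reason)
```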
Once you’re happy with your GEval metric, call the .upload() method to create it on Confident AI. This syncs your local metric to the platform, where you can add it to metric collections and run remote evaluations.
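For example, with the metric defined above:

```python
# Creates the custom metric on Confident AI; the metric name must be unique in your project.
correctness_metric.upload()
```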
For more details on parameters, rubrics, and advanced usage, see the deepeval documentation for GEval and ConversationalGEval.
Under the hood, .upload() calls the Evals API to create a custom G-Eval metric. Note that the name of your G-Eval metric must not already be taken on your Confident AI project.