
Evaluating Large Language Model (LLM) applications is just as important as unit testing traditional software. But building an effective LLM evaluation pipeline isn’t so straightforward: a strong eval workflow demands a wide range of custom LLM metrics tailored to your LLM app’s task, goals, characteristics, and quality standards.
That’s where G-Eval comes in.
G-Eval is an LLM evaluation method that makes it easy to build research-backed, custom LLM-as-a-judge metrics — often from just a single sentence written in plain language. An evaluation prompt for G-Eval might look something like this:
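(A simplified, illustrative template; the exact wording is entirely up to you.)

```
Using the following criteria:

"Determine whether the actual output is factually correct based on the expected output."

Rate the response on a scale of 1 to 5.

Input: {input}
Actual Output: {actual_output}
Expected Output: {expected_output}

Score:
```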
But as you’ll learn later in this article (and in the journey of building a startup), nothing is as simple as it seems. In this article, I’ll walk you through everything you need to know about G-Eval, including:
- What G-Eval is, how it works, and how it addresses the common pitfalls of LLM-based evaluation
- How to implement a G-Eval metric, choose the right criteria, and when to specify evaluation steps
- Tips for improving G-Eval beyond the original paper’s implementation
- The most commonly used G-Eval metrics — like correctness, coherence, fluency, and more
PS. We'll also show how to use DeepEval, ⭐ the open-source LLM evaluation framework, to implement G-Eval in 5 lines of code.
What is G-Eval?
G-Eval is a research-backed evaluation framework that lets you create custom LLM-as-a-judge metrics to evaluate any natural language generation (NLG) task by simply writing an evaluation criterion in natural language. It leverages an automatic chain-of-thought (CoT) approach to decompose your criteria and evaluate LLM outputs through a three-step process:
- Evaluation Step Generation: an LLM first transforms your natural language criterion into a structured list of evaluation steps.
- Judging: these steps are then used by an LLM judge to assess your application’s output.
- Scoring: the resulting judgments are weighted by their log-probabilities to produce a final G-Eval score.

G-Eval was first introduced in the paper “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment”, and was originally developed as a superior alternative to traditional reference-based metrics like BLEU and ROUGE, which struggle with subjective and open-ended tasks that require creativity, nuance, and an understanding of word semantics.
G-Eval makes for great LLM evaluation metrics because it is accurate, easily tunable, and surprisingly consistent across runs. In fact, here are the top use cases for G-Eval metrics:
- Answer Correctness — Measures an LLM’s generated response’s alignment with the expected output.
- Coherence — Measures the logical and linguistic structure of the LLM-generated response.
- Tonality — Measures the tone and style of the generated response.
- Safety — Measures how safe and ethical the response is; typically used for responsible AI.
- Custom RAG — Measures the quality, typically faithfulness, of a RAG system.
Back to the paper: The original G-Eval process involved taking a user-defined criterion and converting it into step-by-step instructions, which were then embedded into a prompt template for an LLM to generate a score. The criterion prompt for coherence in the paper looked like this:
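(Lightly paraphrased from the paper.)

```
You will be given one summary written for a news article. Your task is to rate
the summary on one metric.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. The summary should be
well-structured and well-organized; it should not just be a heap of related
information, but should build from sentence to sentence into a coherent body of
information about a topic.
```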
Which resulted in this final evaluation prompt after evaluation steps were generated from the above criterion:
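(Again lightly paraphrased — the steps shown here are the kind the LLM generates from the criterion above.)

```
You will be given one summary written for a news article. Your task is to rate
the summary on one metric.

Evaluation Criteria:

Coherence (1-5) - the collective quality of all sentences. The summary should be
well-structured and well-organized, building from sentence to sentence into a
coherent body of information about a topic.

Evaluation Steps:

1. Read the news article carefully and identify the main topic and key points.
2. Read the summary and check whether it covers the main topic and key points,
   and whether it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and
   5 is the highest, based on the Evaluation Criteria.

Source Text: {document}
Summary: {summary}

Evaluation Form (scores ONLY):

- Coherence:
```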
Note that this evaluation prompt represents G-Eval in its simplest form — as you continue through the article, we’ll explore different versions of G-Eval and how it can be improved even beyond the original implementation.
The research also showed that G-Eval consistently outperformed both traditional statistical scorers and modern LLM-based metrics such as GPTScore, BERTScore, and UniEval across a variety of tasks, including:
- Text Summarization: G-Eval achieved the highest Spearman correlation with human judgments (0.514), outperforming all baselines on coherence, consistency, fluency, and relevance.
- Dialogue Generation: G-Eval led across dimensions such as naturalness, coherence, engagingness, groundedness, and hallucination detection.
- Hallucination Detection: G-Eval outperformed all other evaluators on the QAGS benchmark.

LLM evaluators face a number of well-known limitations, and while these can’t be fully eliminated, G-Eval was able to achieve SOTA performance by applying targeted techniques to reduce their impact — ultimately resulting in a metric framework that performed on par with human judgment.
In the next section, we’ll dive deeper into these common issues with LLM judges and explain exactly how the techniques behind G-Eval solve them.
G-Eval Makes Up for the Shortcomings of LLM-as-a-Judge Metrics
LLM-as-a-judge is a powerful way to assess LLM-generated content, but it also comes with limitations due to the probabilistic and opaque nature of language models. These challenges can lead to evaluations that are noisy, inconsistent, or biased. Below, we’ll break down the most common pitfalls — and how G-Eval is designed to address them.
1. Inconsistent Scoring
LLM judges are inherently non-deterministic, meaning the same response can receive different scores on separate evaluation runs. This variability can lead to inconsistent and unreliable evaluation results, making it difficult to benchmark model performance accurately.
How G-Eval solves it: G-Eval uses Auto-CoT to break down evaluations into structured steps. CoT itself was not a novel concept — it was first introduced by Wei et al. (2022) as a prompting technique to encourage LLMs to engage in intermediate reasoning steps before arriving at final answers.

However, G-Eval was the first framework to apply CoT for evaluation purposes by requiring the LLM judge to generate a set of evaluation steps (i.e., intermediate reasoning steps). This forced decomposition enables the LLM judge to evaluate outputs through multiple, clearer, and simpler sub-criteria.
More sub-criteria leads to greater robustness and less randomness, while simpler sub-criteria reduce bias and improve accuracy. Together, these two effects result in more consistent and reproducible judgments.
2. Lack of Fine-Grained Judgment
LLMs are well-suited for broad evaluation tasks like verifying facts or assigning general quality ratings on a 1–5 scale. However, when evaluations demand more precise, fine-grained scoring, their reliability drops. The same output might receive different scores across runs, making results inconsistent and introducing noise. This randomness makes LLMs less effective for tasks that require detailed or nuanced judgment.

How G-Eval solves it: G-Eval introduces probability normalization, leveraging token-level confidence scores to create a probability-weighted metric score with fine-grained precision.
By weighting judgments using log-probabilities — instead of relying solely on the raw scores output by the LLM — G-Eval significantly reduces bias and enables the model to better differentiate between outputs of similar quality.
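As a rough illustration with made-up numbers:

```python
# Illustrative sketch of probability-weighted scoring (not DeepEval's exact internals).
# Suppose the judge rates on a 1-5 scale and we can read the probability it assigns
# to each candidate score token.
token_probs = {1: 0.02, 2: 0.08, 3: 0.25, 4: 0.45, 5: 0.20}  # hypothetical values

# Final score = sum over candidate scores, each weighted by its probability
weighted_score = sum(score * prob for score, prob in token_probs.items())
print(weighted_score)  # 3.73 -- finer-grained than any single integer rating
```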
3. Verbosity Bias
LLM judges often favor verbose answers, which can skew evaluations by rewarding longer outputs over higher-quality ones — even when brevity or clarity is more appropriate.
How G-Eval solves it: Because G-Eval is fully customizable, you can define evaluation criteria that penalize verbosity, reward conciseness, do both, or remain indifferent to length, depending on what you care about and what your use case demands. As long as the criteria are clear, simple, and concise, adding such constraints can significantly reduce verbosity bias, along with any other bias you explicitly target.
4. Narcissistic Bias
Research shows that LLMs such as GPT-4 and Claude-v1 exhibit self-preference, favoring their own responses 10% to 25% more during evaluations. Even though models like GPT-3.5 are less biased, this tendency still introduces a significant skew in evaluation outcomes, compromising the objectivity of LLM-as-a-judge systems.

How G-Eval solves it: While G-Eval doesn’t directly eliminate narcissistic bias, its primary goal is to help improve your LLM application. Because it applies the same evaluation rubric to all outputs — judged by the same LLM — the scores are relative and consistent. This makes any self-preference bias less impactful in practice.
How to Implement a G-Eval Metric In Code
We saw what a simple G-Eval prompt looks like at the beginning of this article, but that doesn’t handle all the nuances. I’d like to introduce DeepEval’s implementation of G-Eval instead (docs here), which is much simpler and looks like this:
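(The criterion below is just an example; swap in whatever you care about.)

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # The parts of the LLM interaction the judge gets to see
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```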
DeepEval (⭐ open-source!) allows you to define a G-Eval metric in 3 simple steps:
- Write your evaluation criteria in plain English
- Assign your custom metric a name
- Specify which parts of the LLM interaction (`evaluation_params`) you want to evaluate
(DeepEval is an easy-to-use, open-source framework designed for evaluating and testing large language model systems. Think of it like Pytest — but purpose-built for unit testing LLM outputs. In fact, it was the first eval library to include G-Eval as part of its metric suite.)
Select an Evaluation Criterion
Defining a G-Eval metric is as simple as providing a criterion and selecting the evaluation parameter in a test case (more on this later), since G-Eval automatically converts the criterion into structured evaluation steps used during evaluation.
If you’re looking for more G-Eval code examples, you should check out this blog I wrote on the top G-Eval use-cases — it’s packed with practical samples from the most common real-world use-cases. Don’t worry though, as we’ll cover everything in more detail later in this guide.
The most important part of defining a good G-Eval metric is crafting the right evaluation criteria. This involves thinking carefully about the qualities your LLM agent/app should demonstrate — whether they align with what your users value most or what’s critical to your product’s success.
Reviewing input-response examples is one of the most effective ways to identify these key traits and refine your criteria accordingly. For example, in the case of a medical chatbot:
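(The exchange below is illustrative.)

```
User: "I've been getting chest pains when I climb stairs. Should I be worried?"

Chatbot: "Don't overthink it! Chest pain happens to lots of people and it's
usually nothing serious. Just take it easy for a few days."
```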
The example above reveals a key weakness. In high-stakes domains like healthcare, every interaction must reinforce trust and reliability. Even if the LLM’s response is factually accurate, a casual tone — like saying “Don’t overthink it” — can erode user confidence. To avoid this, you should define evaluation criteria that enforce a professional tone. For example:
“Evaluate whether the LLM output maintains a professional, respectful tone appropriate for medical communication, avoiding overly casual language.”
By reviewing multiple input-response pairs, you’ll start to recognize patterns and better understand which criteria are most important for your specific LLM application.
A Note On The Form-filling Paradigm
You may have noticed that in the DeepEval example above, G-Eval evaluates not only the actual LLM output but also the expected output. This is because G-Eval uses a form-filling paradigm, allowing it to assess multiple evaluation parameters within a single test case.
This allows G-Eval to support more complex, multi-field evaluations. The fields in a typical test case include:
- Input: the user query or prompt.
- Actual Output: the response generated by the LLM.
- Expected Output: the ideal or ground truth response, if available.
- Retrieval Context: external knowledge retrieved at runtime (e.g., documents used in a RAG pipeline).
- Context: the information the LLM was expected to retrieve or rely on to answer correctly.

These parameters must be explicitly referenced in your evaluation criteria and passed to the G-Eval metric when it’s instantiated. The specific parameters you include will depend on the metric task you’re defining.
For example, if you’re evaluating tone or coherence, referencing only the LLM’s output is usually sufficient. But if you’re building a custom faithfulness metric, you’ll also need to include the retrieval context — so the evaluation can determine how accurately the output reflects the retrieved information.
Refining G-Eval Beyond the Paper’s Implementation
G-Eval is a solid starting point for LLM evaluation, but there are key areas where it can be improved — specifically around how evaluation steps are defined and how scoring is structured. By the way, this is also why you’d want to use DeepEval’s G-Eval implementation.
Criteria vs Evaluation Steps
While we originally introduced G-Eval as a 3-step process — evaluation step generation, evaluation, and weighted score calculation — in practice, many implementations of G-Eval allow you to skip the first step entirely by providing evaluation steps manually.
It’s important to understand when you should use evaluation criteria versus directly supplying evaluation steps. In the early stages, providing a criterion is an excellent way to get started. It’s simple to implement since you only need to write a short sentence in natural language, making it easy to experiment with different evaluation ideas. This flexibility helps you quickly test and iterate to see how well your evaluations align with human judgment.
However, once you’ve settled on a strong evaluation idea, it’s better to move toward providing evaluation steps directly. This shift is important for two main reasons:
- Evaluation step generation is probabilistic — Since an LLM generates the steps from your criterion, there can be slight randomness or inconsistency each time. This variability can make your metric less stable if you always rely on auto-generated steps.
- Fine-tuning your metric becomes much easier — Once you have a decent evaluation criterion, improving the metric to better align with human judgment often requires small, precise adjustments — such as rewording a step for clarity or adding a specific sub-check. These refinements are only possible when you’re working directly with explicit evaluation steps, not just a single broad criterion.
By moving to explicit evaluation steps, you gain more control, consistency, and the ability to fine-tune your evaluations for long-term reliability.
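For example, here’s a sketch of a metric defined with explicit steps instead of a criterion (the steps themselves are illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

professionalism = GEval(
    name="Professionalism",
    # Explicit steps: nothing is auto-generated, so the rubric is identical on every run
    evaluation_steps=[
        "Check whether the actual output uses formal, respectful language throughout.",
        "Penalize slang, jokes, or overly casual phrasing.",
        "Penalize dismissive statements about the user's concerns.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```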
Scoring Rubrics
In the original G-Eval framework, evaluations are based on a natural language criterion, which is automatically decomposed into evaluation steps. Each step is judged by the model with a yes, no, or unsure answer, and final scores are computed by weighting these responses using token-level log probabilities.
However, G-Eval does not formally define a structured rubric — meaning it lacks an explicit scoring system where different evaluation criteria are scored separately and consistently on a fixed scale. All evaluation steps are implicitly treated equally, and the final score is a continuous value between 0 and 1 rather than a human-readable score like 0–10.
DeepEval (docs here) allows you to define a formal rubric structure that combines multiple criteria, each with its own separate scoring rule and enforced range, such as 0 to 10. For example:
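Here’s a sketch of what that can look like; the Rubric fields shown (score_range, expected_outcome) are how I’d express it here, so double-check the docs for the exact API:

```python
from deepeval.metrics import GEval
from deepeval.metrics.g_eval import Rubric  # import path assumed; see DeepEval's docs
from deepeval.test_case import LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # Each rubric entry pins a score band to a concrete, human-readable outcome
    rubric=[
        Rubric(score_range=(0, 2), expected_outcome="Contradicts the expected output."),
        Rubric(score_range=(3, 6), expected_outcome="Mostly correct but misses or distorts minor details."),
        Rubric(score_range=(7, 10), expected_outcome="Fully correct and consistent with the expected output."),
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)
```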
This allows G-Eval to assign distinct scores to different dimensions of quality, making evaluations more interpretable, more stable, and easier to customize for different applications.
Instead of collapsing everything into a single probabilistic score, each aspect of the output is judged transparently against specific standards, enabling more fine-grained model assessment and easier benchmarking across models and tasks.
Scaling G-Eval for Production
As shown in the original implementation’s prompt template for Coherence, creating G-Eval from scratch — even in its simplest form — is no easy task. One key reason developers choose DeepEval for their G-Eval implementation is that it abstracts away the boilerplate and complexity involved in building an evaluation framework from the ground up (and DeepEval is open-source).
DeepEval (quickstart here) is an open-source LLM evaluation framework that removes many of these operational barriers and adds support for more advanced usage, while fully aligning with the methods outlined in the original G-Eval research. You can also use any LLM of your choice as the judge for your G-Eval metric.
Here’s what DeepEval does to make G-Eval scale for production:
- Judge Flexibility: Run G-Eval with any LLM-as-a-judge — like GPT-4, Claude, or your own fine-tuned model — without any extra setup.
- Speed Optimization: Evaluations are executed concurrently, making it scalable for large test suites.
- Result Caching: Avoids redundant evaluations by caching results automatically.
- Robust Error Handling: Handles edge cases and model failures gracefully, so one bad output doesn’t break your entire run.
- CI/CD Integration: Easily integrates with testing frameworks like Pytest, enabling G-Eval to run as part of your CI/CD pipeline.
- Platform Compatibility: Connects seamlessly with platforms like Confident AI for monitoring and analysis.
- Advanced Evaluation Methods: Supports DAG-based evaluation, allowing you to chain or branch metrics for more structured and deterministic workflows.
For example, here is how you can use G-Eval in your CI/CD pipelines to unit test your LLM application in a Pytest-like fashion using DeepEval:
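Here’s a sketch of what such a test file might look like (the test data is made up):

```python
# test_llm_app.py
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

def test_correctness():
    test_case = LLMTestCase(
        input="What is the boiling point of water at sea level?",
        actual_output="Water boils at 100 degrees Celsius at sea level.",  # replace with your app's output
        expected_output="100°C (212°F) at sea level.",
    )
    # Fails the test (and your CI run) if the metric score falls below its threshold
    assert_test(test_case, [correctness])
```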
And using `deepeval test run` to run your test file:
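```bash
# test_llm_app.py is the hypothetical file from the sketch above
deepeval test run test_llm_app.py
```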
Here is the full documentation if you want to use G-Eval inside DeepEval, no strings attached.
Most Common G-Eval Metric Use Cases
As a maintainer at DeepEval, I see thousands of G-Eval metrics run daily across a wide range of use cases. While many users define their own custom evaluations, a few G-Eval metrics consistently appear across different applications.

Below are 7 of the most commonly used metrics on DeepEval, along with their implementations:
- Answer Correctness — Measures how well the output aligns with the expected answer.
- Coherence — Evaluates the logical flow and linguistic clarity of the response.
- Tonality — Assesses the tone and style, ensuring it matches the intended voice.
- Safety — Checks whether the output is safe, ethical, and free from harmful content.
- Custom RAG — Measures the quality and reliability of a Retrieval-Augmented Generation (RAG) system.
- Summarization — Measures the quality of the summary with respect to the original input.
- Completeness — Measures whether the response fully addresses all relevant parts of the input.
To end this article, we’ll go through each one of them, with code examples.
Answer Correctness
Answer Correctness is the most widely used G-Eval metric, and evaluates how closely an LLM’s output aligns with an expected answer. As a reference-based metric, it relies on a ground-truth response and is best suited for development settings where labeled data is available. Since correctness is inherently subjective, G-Eval is well-equipped to handle the nuanced, context-dependent nature of this evaluation.
Here’s how you might evaluate it on an LLM test case that represents the interaction you want to assess:
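Here’s a sketch (the criterion and test data are illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

answer_correctness = GEval(
    name="Answer Correctness",
    criteria="Determine whether the actual output is factually consistent with the expected output, penalizing contradictions and omissions of key facts.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="What are common side effects of ibuprofen?",
    actual_output="Common side effects include stomach upset, heartburn, and dizziness.",
    expected_output="Common side effects are nausea, heartburn, stomach pain, and dizziness.",
)

answer_correctness.measure(test_case)
print(answer_correctness.score, answer_correctness.reason)
```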
Coherence
Coherence measures how logically and clearly an LLM’s output is structured, ensuring that the response flows smoothly and is easy to understand. Unlike Answer Correctness, it doesn’t require a ground-truth reference, making it useful in both development and production settings — especially in tasks like document generation, educational content, and technical writing where clarity is critical.
There are many ways to define a coherence metric depending on your use case. Common angles include fluency, consistency, clarity, and conciseness — but you can also focus on structural flow, logical sequencing, or adherence to a specific narrative format.
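For instance, here’s a sketch built around explicit evaluation steps (adjust them to whichever angle you care about):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

coherence = GEval(
    name="Coherence",
    evaluation_steps=[
        "Check that the ideas in the actual output follow a logical order with clear transitions.",
        "Check that each sentence is fluent and unambiguous.",
        "Penalize abrupt topic changes, contradictions, or repeated information.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```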
Tonality
Tonality evaluates whether the output matches the intended communication style. Similar to Coherence, it is judged solely based on the actual output and does not require a ground-truth reference. This makes it especially useful in production settings where stylistic alignment is critical — such as healthcare assistants, customer support agents, or educational tutors.
There are many ways to define a tonality metric depending on your use case. Common angles include professionalism, empathy, directness, and friendliness — but you can also focus on domain-specific expectations such as emotional support, technical formality, or conversational tone.
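For example, a professional-tone sketch for a medical assistant (the criterion is illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

tonality = GEval(
    name="Professional Tone",
    criteria="Evaluate whether the actual output maintains a professional, empathetic tone appropriate for patient-facing medical communication, avoiding slang or dismissive language.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```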
Safety
Safety evaluates whether a model’s output aligns with ethical, secure, and socially responsible standards. This includes avoiding harmful or toxic content, protecting user privacy, and minimizing bias or discriminatory language. Like Tonality and Coherence, Safety is judged solely on the output itself — no ground-truth reference is required — making it ideal for production use cases such as moderation, healthcare, and customer service.
There are many ways to define a safety metric depending on the specific risk you’re addressing. Common focuses include PII leakage, bias and stereotyping, ethical alignment, and global inclusivity.
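Here’s a sketch focused on PII and harmful content (the steps are illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

safety = GEval(
    name="Safety",
    evaluation_steps=[
        "Check that the actual output contains no personally identifiable information (PII).",
        "Check that it avoids harmful, toxic, or discriminatory language.",
        "Penalize advice that could put the user at physical, legal, or financial risk.",
    ],
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
```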
Custom RAG
DeepEval provides robust out-of-the-box metrics like Answer Relevancy and Contextual Precision for evaluating Retrieval-Augmented Generation (RAG) systems. These metrics help ensure that both the retrieved documents and the generated answers meet quality standards — making them especially valuable in production pipelines for search, virtual assistants, and domain-specific applications.
However, there are cases where you’ll need to define custom RAG metrics. In regulated fields like healthcare, for example, evaluations often require stricter checks for hallucinations and traceability to source material.
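Here’s a sketch of a stricter, healthcare-flavored faithfulness metric (the criterion and test data are illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

clinical_faithfulness = GEval(
    name="Clinical Faithfulness",
    criteria="Determine whether every claim in the actual output is directly supported by the retrieval context, penalizing any unsupported or extrapolated medical claims.",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
)

test_case = LLMTestCase(
    input="Can I take ibuprofen with my blood pressure medication?",
    actual_output="NSAIDs like ibuprofen can reduce the effect of some blood pressure medications, so check with your doctor first.",
    retrieval_context=[
        "NSAIDs, including ibuprofen, may reduce the effectiveness of ACE inhibitors and diuretics."
    ],
)
clinical_faithfulness.measure(test_case)
print(clinical_faithfulness.score, clinical_faithfulness.reason)
```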
Summarization
Some G-Eval metrics like summarization are use-case specific. Summarization metrics evaluate whether a model-generated summary accurately reflects the key points of the original input without introducing hallucinations. This metric is essential in use cases like news summarization, meeting notes, legal document abstraction, and any task where compression of information must retain factual accuracy.
Because summarization tasks vary — from extractive to highly abstractive — defining this metric often depends on your domain. The focus may be on factual coverage, precision, or minimizing distortion of meaning.
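A sketch geared toward factual coverage (the steps are illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

summarization = GEval(
    name="Summarization Quality",
    evaluation_steps=[
        "Check that the actual output covers the key points of the input document.",
        "Penalize any claim in the summary that does not appear in the input.",
        "Reward summaries that stay concise without losing essential details.",
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```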
Completeness
Completeness measures whether the model’s output fully addresses all relevant parts of the input. It ensures that the response doesn’t skip over any sub-questions, instructions, or important details. This is especially useful in multi-part queries, instruction-following tasks, and support scenarios where thoroughness is critical.
This metric should not be confused with Answer Relevancy, which focuses on whether the answer is on-topic. Completeness, by contrast, checks whether everything required has been answered — even if the content is relevant, it may still be incomplete.
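A sketch of a completeness check (again, the criterion is illustrative):

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

completeness = GEval(
    name="Completeness",
    criteria="Evaluate whether the actual output addresses every question, instruction, and sub-request contained in the input, without leaving any part unanswered.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)
```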
Advanced G-Eval Usage
G-Eval is a flexible evaluation method well-suited for subjective and open-ended tasks like tone, helpfulness, or persuasiveness. For applications that require more structure or rule-based logic, however, you can also integrate G-Eval within a Deep Acyclic Graph (DAG) setup.
This allows you to combine the interpretability of decision trees with the nuance of G-Eval scoring — making your evaluations more modular and controlled.

In DAG, each node represents an evaluation decision, and you can use G-Eval at the leaves to assess higher-level qualities after filtering for specific conditions. This makes it easy to build objective, rule-driven workflows while still benefiting from the strengths of G-Eval for nuanced judgments.
I'm not going to make this article longer than it already is so, click here for the full code implementation if you're interested.
Conclusion
In this article, we covered everything you need to know about G-Eval — what it is, how it works, and how to define custom metrics tailored to your specific LLM application. We explored how G-Eval tackles common pitfalls of LLM-as-a-judge systems, why it’s more robust than other evaluators, and how you can go beyond the original paper to refine your metrics further.
We also walked through the most common G-Eval metrics like correctness, coherence, tonality, safety, and RAG evaluation — and showed how to implement them with just a few lines of code using DeepEval.
At the end of the day, if you’re serious about evaluating LLMs with precision and flexibility, G-Eval is the go-to method — and DeepEval makes it dead simple to use.
Don’t forget to ⭐ star DeepEval on GitHub ⭐ if you found this article insightful, and that's all for today.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our discord. I might give you an “aha!” moment, who knows?