Jeffrey Ip
Cofounder @ Confident AI, creating companies that tell stories. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

A Step-By-Step Guide to Evaluating an LLM Text Summarization Task

December 17, 2023
8 min read

When you imagine what a good summary for a 10-page research paper looks like, you likely picture a concise, comprehensive overview that accurately captures all key findings and data from the original work, presented in a clear and easily understandable format.

This might sound extremely obvious to us (I mean, who doesn’t know what a good summary looks like?), yet for large language models (LLMs) like GPT-4, grasping this simple concept to accurately and reliably evaluate a text summarization task remains a significant challenge.

In this article, I’m going to share how we built our own bullet-proof LLM-Evals (metrics evaluated using LLMs) to evaluate a text-summarization task. In summary (no pun intended), it involves asking closed-ended questions to:

  1. Identify misalignment in factuality between the original text and summary.
  2. Identify exclusion of details in the summary from the original text.

Existing Problems with Text Summarization Metrics

Traditional, non-LLM Evals

Historically, model-based scorers (e.g., BertScore and ROUGE) have been used to evaluate the quality of text summaries. These metrics, as I outlined here, are useful but often focus on surface-level features like word overlap and semantic similarity:

  • Word Overlap Metrics: Metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation) often compare the overlap of words or phrases between the generated summary and a reference summary. If both summaries are of similar length, the likelihood of a higher overlap increases, potentially leading to higher scores.
  • Semantic Similarity Metrics: Tools like BertScore evaluate the semantic similarity between the generated summary and the reference. Longer summaries might cover more content from the reference text, which could result in a higher similarity score, even if the summary isn’t necessarily better in terms of quality or conciseness.
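To make the word-overlap limitation concrete, here is a minimal ROUGE-1-style recall sketch in pure Python. It is a simplified illustration of the idea, not a replacement for a real implementation (e.g., the `rouge-score` package, which adds stemming, n-grams, and precision/F1 variants):

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Count each reference token at most as often as it appears in both texts
    overlap = sum(min(count, cand_counts[token]) for token, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the inclusion score measures detail coverage"
print(rouge1_recall(reference, "the inclusion score measures coverage"))  # high overlap
print(rouge1_recall(reference, "a completely unrelated sentence"))       # near zero
```

Notice that the metric rewards token overlap alone: a summary could repeat the reference's words while being factually wrong and still score well.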

Moreover, these metrics struggle especially when the original text is composed of concatenated text chunks, which is often the case for a retrieval augmented generation (RAG) summarization use case. This is because they often fail to assess whether a summary faithfully covers information that is scattered across the combined text chunks.


In one of my previous articles, I introduced G-Eval, an LLM-Evals framework that can be used for a summarization task. It usually involves providing the original text to an LLM like GPT-4 and asking it to generate a score and provide a reason for its evaluation. However, although better than traditional approaches, evaluating text summarization with LLMs presents its own set of challenges:

  1. Arbitrariness: LLM Chains of Thought (CoTs) are often arbitrary, which is particularly noticeable when the models omit details that humans would typically consider essential to include in the summary.
  2. Bias: LLMs often overlook factual inconsistencies between the summary and original text as they tend to prioritize summaries that reflect the style and content present in their training data.

In a nutshell, arbitrariness causes LLM-Evals to overlook the exclusion of essential details (or at least hinders their ability to identify what should be considered essential), while bias causes LLM-Evals to overlook factual inconsistencies between the original text and the summary.

LLM-Evals can be Engineered to Overcome Arbitrariness and Bias

Unsurprisingly, while developing our own summarization metric at Confident AI, we ran into all the problems I mentioned above. However, we came across this paper introducing the Question-Answer Generation (QAG) framework, which was instrumental in overcoming arbitrariness and bias in LLM-Evals.

Question-Answer Generation

The Question-Answer Generation (QAG) framework is a process where closed-ended questions are generated from some text (which in our case is either the original text or the summary), before a language model (LM) is asked to answer them based on some reference text.

Let’s take this text for example:

The ‘inclusion score’ is calculated as the percentage of assessment questions for which both the summary and the original document provide a ‘yes’ answer. This method ensures that the summary not only includes key information from the original text but also accurately represents it. A higher inclusion score indicates a more comprehensive and faithful summary, signifying that the summary effectively encapsulates the crucial points and details from the original content.

A sample question according to QAG would be:

Is the ‘inclusion score’ the percentage of assessment questions for which both the summary and the original document provide a ‘yes’ answer?

To which, the answer would be ‘yes’.

QAG is essential for evaluating a summarization task because closed-ended questions remove the stochasticity that leads to arbitrariness and bias in LLM-Evals. (For those interested, here is another great read on why QAG is so great as an LLM metric scorer.)
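As a rough sketch, the answering step of QAG might look like this in Python. Here `llm` stands in for any prompt-in, text-out model call, and the prompt wording and normalization are my own illustration rather than DeepEval's actual templates:

```python
def qag_answer(question: str, reference_text: str, llm) -> str:
    """Ask a closed-ended question against a reference text.

    `llm` is any callable that takes a prompt string and returns text,
    e.g. a thin wrapper around your model API of choice.
    """
    prompt = (
        "Answer strictly with 'yes', 'no', or 'idk' based ONLY on the text below.\n"
        f"Text: {reference_text}\n"
        f"Question: {question}"
    )
    raw = llm(prompt).strip().lower()
    # Normalize free-form model output to one of the three labels
    for label in ("yes", "no", "idk"):
        if raw.startswith(label):
            return label
    return "idk"  # treat unparsable answers as "don't know"
```

Because every answer collapses to one of three labels, scoring reduces to counting label matches instead of judging free-form text, which is exactly what removes the stochasticity.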

A Text Summarization Metric is the Combination of Inclusion and Alignment Scores

If you re-read the first paragraph of this article, you’ll notice that at the end of the day, you only care about two things in a summarization task:

  1. Inclusion of details.
  2. Factual alignment between the original text and summary.

Therefore, a text summarization task can be evaluated by calculating an inclusion and alignment score respectively, and combining the two to yield a final summarization score.

Calculating Inclusion Score

So here’s a fun challenge for you: Given these two pieces of information, can you deduce how we’re going to calculate the inclusion score using QAG?

  1. Inclusion measures the amount of detail included in the summary from the original text.
  2. QAG requires a reference text to generate a set of closed-ended questions.

Spoiler alert, here’s the algorithm:

  1. Generate n questions from the original text in a summarization task.
  2. For each question, generate either a ‘yes’, ‘no’, or ‘idk’ answer using information from the original text and summary individually. The ‘idk’ answer from the summary represents the case where the summary does not contain enough information to answer the question.

The higher the number of matching answers, the greater the inclusion score. This is because matching answers indicate the summary is both factually correct and contains sufficient detail to answer the question. A ‘no’ from the summary indicates a contradiction, whereas an ‘idk’ indicates omission. (Since the questions are generated from the original text, answers from the original text should all be ‘yes’.)

Why use the original text to generate the questions? Because the original text is effectively a superset of the summary: to accurately determine whether any important details have been overlooked, you need to generate questions from the text that contains the more comprehensive set of details.

A QAG example for calculating inclusion
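The inclusion algorithm above can be condensed into a small scoring function. This is a simplified sketch of the counting step, not DeepEval's actual implementation; it assumes the questions were generated from the original text, so the original's answers are all ‘yes’ and only the summary's answers need checking:

```python
def inclusion_score(summary_answers: list[str]) -> float:
    """Fraction of questions (generated from the original text) that the
    summary also answers 'yes' to.

    'no' indicates a contradiction, 'idk' indicates an omission --
    both lower the score.
    """
    matches = sum(1 for answer in summary_answers if answer == "yes")
    return matches / len(summary_answers)

# Two matches out of four questions -> inclusion score of 0.5
print(inclusion_score(["yes", "yes", "idk", "no"]))
```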

Calculating Alignment Score

The general algorithm to calculate the alignment score is identical to the one used for inclusion. However, note that in the case of alignment, we utilize the summary as the reference text to generate close-ended questions instead. This is because for alignment, we only want to detect cases of hallucination and contradiction, so we are more concerned with the original text’s answers to the summary’s questions.

The ‘idk’ or ‘no’ answer from the original text indicates either a hallucination or contradiction respectively. (Again, answers from the summary should all be ‘yes’.)

Combining the Scores

There are several ways you can combine the scores to generate a final summarization score. You can take an average, use geometric/harmonic means, or take the minimum of the two, just to list a few options. For now, we at Confident AI are opting for the straightforward approach of taking the minimum of the two scores. This choice prioritizes ensuring accurate scoring, but keep in mind we’re constantly monitoring results to refine our methodology as needed. Remember to choose whatever makes sense for your use case!
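The combination strategies above can be sketched as follows. Taking the minimum (the approach described above) means a summary is only ever as good as its weaker dimension; the function name and `method` parameter are my own illustration:

```python
def summarization_score(inclusion: float, alignment: float, method: str = "min") -> float:
    """Combine inclusion and alignment into a single summarization score."""
    if method == "min":
        # A summary is only as good as its weaker dimension
        return min(inclusion, alignment)
    if method == "avg":
        return (inclusion + alignment) / 2
    if method == "harmonic":
        # Penalizes imbalance more than the plain average does
        return 2 * inclusion * alignment / (inclusion + alignment)
    raise ValueError(f"unknown method: {method}")

# A well-aligned but incomplete summary is capped by its inclusion score
print(summarization_score(0.9, 0.5))
```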


LLM-Evals are superior to traditional metrics, but LLMs have arbitrary CoTs and bias that cause them to overlook factual misalignment and exclusion of details during evaluation. If you want to personally make LLM-Evals for a text summarization task more reliable, QAG is a great framework.

However, if you just want to use an LLM-Eval that evaluates a summary based on QAG, you can use DeepEval. We’ve done all the hard work for you already. Just provide an original text and the summary to calculate a summarization score in 10 lines of code. (PS. You can also manually supply a set of assessment questions used to calculate the inclusion score if you know what type of summaries you’re expecting.)

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import SummarizationMetric

test_case = LLMTestCase(
  input="the original text...",
  actual_output="the summary..."
)
summarization_metric = SummarizationMetric()
evaluate([test_case], [summarization_metric])

Thank you for reading and as always, till next time.

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Schedule a call with me here (it’s free), or ask us anything in our discord. I might give you an “aha!” moment, who knows?

