Summarization

Summarization is a single-turn metric that determines whether your summarizer is generating factually correct summaries.

Overview

The summarization metric is a single-turn metric that uses LLM-as-a-judge to evaluate an LLM’s ability to summarize text. It generates close-ended questions from the original text and checks if the summary can answer them accurately.

Here’s a good read on how our summarization metric was developed.

The summarization metric assumes the original text to be the input and the summary generated as the actual output.

Required Parameters

These are the parameters you must supply in your test case to run evaluations for the summarization metric:

input
string · Required

The text sent to your summarization agent to summarize.

actual_output
string · Required

The summary generated by your summarization agent for the given input.
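
For example, a minimal test case for this metric pairs the original text with the generated summary; the strings below are placeholder content for illustration:

from deepeval.test_case import LLMTestCase

# The original text goes into `input`, and the generated summary into `actual_output`.
test_case = LLMTestCase(
    input="The full article or document sent to your summarization agent...",
    actual_output="The summary your summarization agent produced...",
)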

How Is It Calculated?

The summarization metric breaks the score into alignment_score and coverage_score.


\text{Summarization} = \min(\text{Alignment Score}, \text{Coverage Score})

The final score is the minimum of:

  • alignment_score, which determines whether the summary contains hallucinated information or information that contradicts the original text.
  • coverage_score, which determines whether the summary contains the necessary information from the original text.
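
As a minimal illustration of how the two sub-scores combine, assuming both have already been computed as values between 0 and 1:

# Illustrative values only; deepeval computes these sub-scores for you.
alignment_score = 0.9   # the summary makes no claims that contradict the original text
coverage_score = 0.6    # the summary answers 3 of 5 assessment questions correctly

# The final summarization score is the weaker of the two dimensions.
summarization_score = min(alignment_score, coverage_score)
print(summarization_score)  # 0.6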

Create Locally

You can create the SummarizationMetric in deepeval as follows:

from deepeval.metrics import SummarizationMetric

metric = SummarizationMetric()

Here’s a list of parameters you can configure when creating a SummarizationMetric:

threshold
number · Defaults to 0.5

A float to represent the minimum passing threshold.

assessment_questions
list of strings · Defaults to questions generated by deepeval at evaluation time

A list of close-ended questions that can be answered with either a yes or a no. These are questions you ideally want your summary to be able to answer; they are useful for applying custom criteria for what makes a good summary.

n
number · Defaults to 5

The number of assessment questions to generate when assessment_questions is not provided.

truths_extraction_limit
number

An integer which, when set, determines the maximum number of factual truths to extract from the input.

model
string | Object · Defaults to gpt-4.1

A string specifying which of OpenAI’s GPT models to use OR any custom LLM model of type DeepEvalBaseLLM.

include_reason
boolean · Defaults to true

A boolean to enable the inclusion of a reason for the evaluation score.

async_mode
boolean · Defaults to true

A boolean to enable concurrent execution within the measure() method.

strict_mode
boolean · Defaults to false

A boolean to enforce a binary metric score: 1 for perfection, 0 otherwise.

verbose_mode
boolean · Defaults to false

A boolean to print the intermediate steps used to calculate the metric score.
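
For example, a more fully configured metric might look like the sketch below; the assessment questions and model choice are illustrative, not required:

from deepeval.metrics import SummarizationMetric

metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4.1",
    assessment_questions=[
        "Does the summary mention the product launch date?",
        "Does the summary state who is affected by the change?",
    ],
    include_reason=True,
    verbose_mode=True,
)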

This can be used for both single-turn E2E and component-level testing.
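
A minimal single-turn end-to-end run, assuming a test case like the one shown earlier, could look like this:

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase

metric = SummarizationMetric(threshold=0.5)
test_case = LLMTestCase(
    input="The original text sent to your summarizer...",
    actual_output="The generated summary...",
)

# Measure a single test case directly...
metric.measure(test_case)
print(metric.score, metric.reason)

# ...or run it through deepeval's evaluate() for end-to-end testing.
evaluate(test_cases=[test_case], metrics=[metric])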

Create Remotely

If you are not using deepeval in Python, or you want to run evals remotely on Confident AI, you can use the summarization metric by adding it to a single-turn metric collection. This allows you to use the summarization metric for:

  • Single-turn E2E testing
  • Single-turn component-level testing
  • Online and offline evals for traces and spans