Summarization
Summarization is a single-turn metric that determines whether your summarizer is generating factually correct summaries.
Overview
The summarization metric is a single-turn metric that uses LLM-as-a-judge to evaluate an LLM’s ability to summarize text. It generates close-ended questions from the original text and checks if the summary can answer them accurately.
Here’s a good read on how our summarization metric was developed.
The summarization metric assumes the original text to be the input and the summary generated as the actual output.
Required Parameters
These are the parameters you must supply in your test case to run evaluations for the summarization metric:

- `input`: The text sent to your summarization agent to summarize.
- `actual_output`: The summary generated by your summarization agent for the given input.
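For illustration, a test case for this metric might be constructed as follows (a minimal sketch with placeholder text):

```python
from deepeval.test_case import LLMTestCase

# Placeholder values; use the real text and your summarizer's real output.
test_case = LLMTestCase(
    input="<the original text sent to your summarizer>",
    actual_output="<the summary your summarizer generated>",
)
```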
How Is It Calculated?
The summarization metric breaks the score into alignment_score and coverage_score.
The final score is the minimum of:

- `alignment_score`, which determines whether the summary contains information that is hallucinated or contradictory to the original text.
- `coverage_score`, which determines whether the summary contains the necessary information from the original text.
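As a rough illustration (a conceptual sketch, not the library's internals), the two sub-scores combine like this:

```python
# Conceptual sketch only, not deepeval's actual implementation.
def final_summarization_score(alignment_score: float, coverage_score: float) -> float:
    # The weaker sub-score caps the final score, so a summary must be both
    # faithful (alignment) and complete (coverage) to score well.
    return min(alignment_score, coverage_score)

final_summarization_score(0.9, 0.5)  # -> 0.5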
Create Locally
You can create the SummarizationMetric in deepeval as follows:
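A minimal sketch, reusing the test case from above (the threshold and model choice are illustrative):

```python
from deepeval.metrics import SummarizationMetric

metric = SummarizationMetric(
    threshold=0.5,        # minimum passing score
    model="gpt-4o",       # or any custom model of type DeepEvalBaseLLM
    include_reason=True,  # attach a reason alongside the score
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
```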
Here’s a list of parameters you can configure when creating a SummarizationMetric:
- `threshold`: A float to represent the minimum passing threshold.
- `assessment_questions`: A list of close-ended questions that can be answered with either a yes or a no. These are questions you ideally want your summary to be able to answer; they are helpful when you have custom criteria for what makes a good summary.
- `n`: The number of assessment questions to generate when `assessment_questions` is not provided.
- `truths_extraction_limit`: An integer which, when set, determines the maximum number of factual truths to extract from the input.
- `model`: A string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type `DeepEvalBaseLLM`.
- `include_reason`: A boolean to enable the inclusion of a reason for its evaluation score.
- `async_mode`: A boolean to enable concurrent execution within the `measure()` method.
- `strict_mode`: A boolean to enforce a binary metric score: 1 for perfection, 0 otherwise.
- `verbose_mode`: A boolean to print the intermediate steps used to calculate the metric score.
This can be used for both single-turn E2E and component-level testing.
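For example, an end-to-end run can pass the metric to deepeval's evaluate function (a sketch reusing the test_case and metric from above):

```python
from deepeval import evaluate

# Run a single-turn end-to-end evaluation over one or more test cases.
evaluate(test_cases=[test_case], metrics=[metric])
```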
Create Remotely
If you are not using deepeval in Python, or want to run evals remotely on Confident AI, you can use the summarization metric by adding it to a single-turn metric collection. This allows you to use the summarization metric for:
- Single-turn E2E testing
- Single-turn component-level testing
- Online and offline evals for traces and spans