Contextual Recall

Contextual Recall is a single-turn metric used to evaluate a RAG retriever.

Overview

The contextual recall metric is a single-turn RAG metric that uses LLM-as-a-judge to assess whether your retriever has surfaced enough relevant context to produce an answer similar to the expected output.

When using the contextual recall metric, the input of a test case should contain only the query, not the entire prompt.

Required Parameters

These are the parameters you must supply in your test case to run evaluations for the contextual recall metric:

input
string · Required

The input query you supply to your RAG application.

expected_output
string · Required

The expected output your RAG application has to generate for a given input.

retrieval_context
list of string · Required

The retrieved context your retriever outputs for a given input, sorted by rank.
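As a sketch, a test case supplying these three parameters might look like the following (the query, expected output, and context strings are placeholder values, not from the deepeval docs):

```python
from deepeval.test_case import LLMTestCase

# Placeholder values; substitute your application's actual
# query, expected answer, and retrieved chunks.
test_case = LLMTestCase(
    input="Who wrote 'Pride and Prejudice'?",
    expected_output="'Pride and Prejudice' was written by Jane Austen.",
    retrieval_context=[
        "Jane Austen (1775-1817) was an English novelist.",
        "Her works include 'Pride and Prejudice' (1813).",
    ],
)
```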

How Is It Calculated?

The contextual recall metric first extracts distinct statements from the expected output using an LLM, then uses the same LLM to check how many of those statements are supported by the retrieved context nodes.


$$\text{Contextual Recall} = \frac{\text{Number of Attributable Statements}}{\text{Total Number of Statements}}$$

The final score is the proportion of statements in the expected output that are attributable to the retrieval context.
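To make the arithmetic concrete, here is a minimal pure-Python sketch of the final scoring step. The attribution verdicts themselves come from the LLM judge; the function name here is hypothetical and not part of deepeval:

```python
def contextual_recall(attributable: list[bool]) -> float:
    """Score = number of attributable statements / total statements."""
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)

# Suppose the judge extracted 4 statements from the expected output
# and found 3 of them supported by the retrieval context:
verdicts = [True, True, True, False]
print(contextual_recall(verdicts))  # 0.75
```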

Create Locally

You can create the ContextualRecallMetric in deepeval as follows:

from deepeval.metrics import ContextualRecallMetric

metric = ContextualRecallMetric()

Here’s a list of parameters you can configure when creating a ContextualRecallMetric:

threshold
number · Defaults to 0.5

A float to represent the minimum passing threshold.

model
string | Object · Defaults to gpt-4.1

A string specifying which of OpenAI’s GPT models to use OR any custom LLM model of type DeepEvalBaseLLM.

include_reason
boolean · Defaults to true

A boolean to include a reason for the evaluation score.

async_mode
boolean · Defaults to true

A boolean to enable concurrent execution within the measure() method.

strict_mode
boolean · Defaults to false

A boolean to enforce a binary metric score: 1 for perfection, 0 otherwise.

verbose_mode
boolean · Defaults to false

A boolean to print the intermediate steps used to calculate the metric score.

evaluation_template
ContextualRecallTemplate · Defaults to deepeval's template

An instance of the ContextualRecallTemplate class, which allows you to override the default prompts used to compute the ContextualRecallMetric score.

This can be used for both single-turn E2E and component-level testing.
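Putting the parameters together, a configured instance might look like the following sketch (the values are illustrative, and a custom model would need to subclass DeepEvalBaseLLM):

```python
from deepeval.metrics import ContextualRecallMetric

metric = ContextualRecallMetric(
    threshold=0.7,        # fail test cases scoring below 0.7
    model="gpt-4.1",      # or any custom DeepEvalBaseLLM instance
    include_reason=True,  # attach a natural-language justification
    strict_mode=False,    # keep the continuous 0-1 score
)
```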

Create Remotely

If you are not using deepeval in Python, or want to run evals remotely on Confident AI, you can use the contextual recall metric by adding it to a single-turn metric collection. This lets you use the contextual recall metric for:

  • Single-turn E2E testing
  • Single-turn component-level testing
  • Online and offline evals for traces and spans