Tool Correctness

Tool Correctness is a single-turn metric that evaluates an agent's tool-calling ability.

Overview

The tool correctness metric is a single-turn metric that evaluates your LLM agent’s ability to call tools correctly. Unlike other metrics, it does not use an LLM for evaluation.

The tool correctness metric requires you to supply both the tools called and the expected tools in your test case.

Required Parameters

These are the parameters you must supply in your test case to run evaluations for the tool correctness metric (a complete test case example follows the list):

input
string, required

The input supplied to your LLM agent.

actual_output
string, required

The final output your LLM agent generated for the given input.

tools_called
list, required

A list of ToolCall instances representing the tools your LLM agent called, in the order they were called.

expected_tools
list, required

A list of ToolCall instances representing the tools you expect your LLM agent to call, in the order they should be called.
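Here is a minimal sketch of a test case that supplies all four parameters; the input, output, and tool values are purely illustrative:

from deepeval.test_case import LLMTestCase, ToolCall

test_case = LLMTestCase(
    input="What's the weather like in Paris today?",  # illustrative input
    actual_output="It is 20°C and sunny in Paris today.",  # illustrative output
    # Tools the agent actually called, in calling order
    tools_called=[ToolCall(name="WeatherTool", input_parameters={"city": "Paris"})],
    # Tools you expected the agent to call, in the expected order
    expected_tools=[ToolCall(name="WeatherTool", input_parameters={"city": "Paris"})],
)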

How Is It Calculated?

The tool correctness metric is deterministic: it calculates the score by iterating over the tools called and the expected tools to check whether your agent called the appropriate tools.


\text{Tool Correctness} = \frac{\text{Number of Correctly Used Tools (or Correct Input Parameters / Outputs)}}{\text{Total Number of Tools Called}}

The final score is the proportion of correctly used tools out of the total number of tools called.
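For example, under the default criteria (matching tool names only), the score works out as in the following sketch. This is a simplification to illustrate the formula, not deepeval's exact implementation:

# Simplified illustration of the default (name-only) scoring
tools_called = ["WebSearch", "Calculator"]     # what the agent called
expected_tools = ["WebSearch", "WeatherTool"]  # what it should have called

correctly_used = sum(1 for tool in tools_called if tool in expected_tools)
score = correctly_used / len(tools_called)     # 1 / 2 = 0.5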

Create Locally

You can create the ToolCorrectnessMetric in deepeval as follows:

from deepeval.metrics import ToolCorrectnessMetric

metric = ToolCorrectnessMetric()

Here’s a list of parameters you can configure when creating a ToolCorrectnessMetric:

threshold
number, defaults to 0.5

A float to represent the minimum passing threshold.

evaluation_params
list, defaults to an empty list

A list of ToolCallParams indicating the strictness of the correctness criteria. Available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT.

should_consider_ordering
boolean, defaults to false

A boolean which, when set to True, takes into account the order in which the tools were called.

should_exact_match
boolean, defaults to false

A boolean which, when set to True, requires the tools called and the expected tools to match exactly.

include_reason
boolean, defaults to true

A boolean to enable the inclusion of a reason for the evaluation score.

strict_mode
boolean, defaults to false

A boolean to enforce a binary metric score: 1 for perfection, 0 otherwise.

verbose_mode
boolean, defaults to false

A boolean to print the intermediate steps used to calculate the metric score.
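As a sketch, here is a stricter configuration that also checks input parameters and calling order, then runs the metric on the test case defined earlier. The threshold value is illustrative, and ToolCallParams is assumed to be importable from deepeval.test_case:

from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import ToolCallParams

metric = ToolCorrectnessMetric(
    threshold=0.8,  # illustrative passing threshold
    evaluation_params=[ToolCallParams.INPUT_PARAMETERS],  # also compare input parameters
    should_consider_ordering=True,  # calling order must match
)

metric.measure(test_case)  # test_case from the example above
print(metric.score, metric.reason)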

This can be used for both single-turn E2E and component-level testing.

Create Remotely

If you are not using deepeval in Python, or you want to run evals remotely on Confident AI, you can use the tool correctness metric by adding it to a single-turn metric collection. This allows you to use the tool correctness metric for:

  • Single-turn E2E testing
  • Single-turn component-level testing
  • Online and offline evals for traces and spans