Tool Correctness
Tool Correctness is a single-turn metric that evaluates an agent's tool-calling ability
Overview
The tool correctness metric is a single-turn metric that evaluates your LLM agent's ability to call the correct tools. Unlike most other metrics, it does not use an LLM for evaluation; the score is computed deterministically.
To run it, you must supply both the tools called and the expected tools in your test case.
Required Parameters
These are the parameters you must supply in your test case to run evaluations with the tool correctness metric:
- input: The input supplied to your LLM agent.
- actual_output: The final output your LLM agent generated for the given input.
- tools_called: A list of ToolCall instances representing the tools your LLM agent called, in the order they were called.
- expected_tools: A list of ToolCall instances representing the tools your LLM agent is expected to call, in the order they should be called.
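For example, a test case for this metric might look like the following sketch (the tool name, input parameters, and agent outputs here are hypothetical):

```python
from deepeval.test_case import LLMTestCase, ToolCall

# A hypothetical weather-agent interaction
test_case = LLMTestCase(
    # input: what was sent to your LLM agent
    input="What's the weather like in Paris today?",
    # actual_output: the agent's final answer
    actual_output="It is currently 18°C and sunny in Paris.",
    # tools_called: tools the agent actually invoked, in calling order
    tools_called=[
        ToolCall(name="get_weather", input_parameters={"city": "Paris"})
    ],
    # expected_tools: tools the agent should have invoked, in the expected order
    expected_tools=[
        ToolCall(name="get_weather", input_parameters={"city": "Paris"})
    ],
)
```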
How Is It Calculated?
The tool correctness metric uses a deterministic approach to calculate the score: it iterates over the tools called and the expected tools to check whether your agent called the appropriate tools.
The final score is the proportion of correctly used tools out of all the tools called.
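Expressed as an equation:

$$
\text{Tool Correctness} = \frac{\text{Number of Correctly Used Tools}}{\text{Total Number of Tools Called}}
$$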
Create Locally
You can create the ToolCorrectnessMetric in deepeval as follows:
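A minimal sketch (the test case contents and threshold value are illustrative):

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

metric = ToolCorrectnessMetric(threshold=0.5)

test_case = LLMTestCase(
    input="Find me flights from London to Tokyo.",
    actual_output="I found 3 flights from London to Tokyo.",
    tools_called=[ToolCall(name="search_flights")],
    expected_tools=[ToolCall(name="search_flights")],
)

# Run the metric as a standalone
metric.measure(test_case)
print(metric.score)
print(metric.reason)
```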
Here’s a list of parameters you can configure when creating a ToolCorrectnessMetric:
- threshold: A float representing the minimum passing threshold.
- evaluation_params: A list of ToolCallParams indicating the strictness of the correctness criteria; available options are ToolCallParams.INPUT_PARAMETERS and ToolCallParams.OUTPUT.
- should_consider_ordering: A boolean which, when set to True, takes into account the order in which the tools were called.
- should_exact_match: A boolean which, when set to True, requires the tools called and expected tools to be exactly the same.
- include_reason: A boolean to include a reason for the evaluation score.
- strict_mode: A boolean to enforce a binary metric score: 1 for perfection, 0 otherwise.
- verbose_mode: A boolean to print the intermediate steps used to calculate the metric score.
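Putting these together, a more fully configured metric might look like the sketch below (argument values are illustrative, and it is assumed here that ToolCallParams is exported from deepeval.test_case; check your installed version):

```python
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import ToolCallParams  # assumed import path

metric = ToolCorrectnessMetric(
    threshold=0.7,                  # minimum passing score
    evaluation_params=[             # also require matching input parameters and outputs
        ToolCallParams.INPUT_PARAMETERS,
        ToolCallParams.OUTPUT,
    ],
    should_consider_ordering=True,  # calling order must match
    should_exact_match=False,       # tools called need not equal expected tools exactly
    include_reason=True,            # attach a reason to the score
    strict_mode=False,              # keep the score continuous rather than binary
    verbose_mode=False,             # don't print intermediate steps
)
```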
The tool correctness metric can be used for both single-turn E2E and component-level testing.
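For single-turn E2E testing, the metric is typically passed to deepeval's evaluate function along with your test cases, as in this sketch (reusing the metric and test case from the example above):

```python
from deepeval import evaluate

# End-to-end: evaluate one or more test cases against the metric
evaluate(test_cases=[test_case], metrics=[metric])
```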
Create Remotely
If you are not using deepeval's Python library, or you want to run evals remotely on Confident AI, you can use the tool correctness metric by adding it to a single-turn metric collection. This allows you to use the tool correctness metric for:
- Single-turn E2E testing
- Single-turn component-level testing
- Online and offline evals for traces and spans