
LLM agents suck. I spent the entire past week building a web-crawling agent — only to watch it crawl at a snail’s pace, repeat pointless function calls, and spiral into infinite reasoning loops. Eventually, I threw in the towel and scrapped it for a simple web-scraping script that took 30 minutes to code.

Alright, I’m not anti-LLM agents — I’m building an AI company, after all. That said, building an agent that’s efficient, reliable, and scalable is no easy task. The good news? Once you’ve pinpointed and eliminated the bottlenecks, the automation upside is enormous. The key is knowing where and how to evaluate your agent effectively.
Over the past year, I’ve helped hundreds of companies stress-test their agents, curate benchmarks, and drive performance improvements. Today, I’ll walk you through everything you need to know to evaluate LLM agents effectively.
LLM Agent Evaluation vs LLM Evaluation
LLM agent evaluation is the process of assessing autonomous AI workflows on performance metrics such as task completion. That may sound similar to regular LLM evaluation — but it isn’t.
To understand how LLM agent evaluation differs from traditional LLM evaluation, it’s important to recognize what makes agents fundamentally different:
- Architectural complexity: Agents are built from multiple components, often chained together in intricate workflows.
- Tool usage: They can invoke external tools and APIs to complete tasks.
- Autonomy: Agents operate with minimal human input, making dynamic decisions on what to do next.
- Reasoning frameworks: They often rely on advanced planning or decision-making strategies to guide behavior.
This complexity makes evaluation challenging: unlike typical RAG evaluation, where only the end-to-end system needs to be assessed, an agent might:
- Call tools in varying sequences with different inputs.
- Invoke other sub-agents, each with its own set of goals.
- Generate non-deterministic behaviors based on state, memory, or context.
These component-level interactions must not be neglected when performing agentic evaluations.

As a result, LLM agents are evaluated at two distinct levels:
- End-to-end evaluation: Treats the entire system as a black box, focusing on whether the overall task was completed successfully given a specific input.
- Component-level evaluation: Examines individual parts (like sub-agents, RAG pipelines, or API calls) to identify where failures or bottlenecks occur.
This layered approach helps diagnose both surface-level and deep-rooted issues in agent performance, and before we dive deeper into evaluation, let's understand how an agent works.
Characteristics of LLM Agents
In the previous section, we briefly introduced the core characteristics of LLM agents: tool calling, autonomy, and reasoning. These traits give agents their unique capabilities and real-world reach, but are often themselves the source of errors.
- Tool invocation & API calls: agents can call external services — updating databases, booking restaurants, trading stocks, scraping websites — enabling real-world interaction. At the same time, mis-chosen tools, bad parameters, or unexpected outputs can derail the entire workflow.
- High autonomy: agents plan and execute a sequence of steps with a high level of autonomy — gathering information, invoking tools, then synthesizing a final answer. This multi-step process boosts capability but, much like exploding gradients, magnifies the impact of any individual mistake.
- Intermediate reasoning: agents deliberate before taking action, using reasoning frameworks like ReAct that help them make deliberate choices — but flawed logic can lead to infinite loops or misdirected tool calls.
These 3 characteristics set LLM agents apart, but not every agent invokes tools or engages in true reasoning. Agents operate at different autonomy levels depending on their purpose and use case: a basic chatbot isn’t the same as JARVIS. Defining these autonomy levels is crucial, since each tier demands its own evaluation criteria and techniques.
Different Types of LLM Agents
LLM agents can be classified into 4 distinct levels, each successively more advanced and autonomous than the last. Although we’re still in the early stages, the rapid pace of AI development could see us moving quickly through these tiers. Clearly defining the scope is crucial for determining which evaluations are needed to keep AI systems both safe and high-performing.
Level 1: Generator Agent
Most LLM agents in production today are Generator Agents. These include basic customer support chatbots and RAG-based applications. Agents at this level are purely reactive — responding to user queries without any ability to reflect, refine, or improve beyond their training data or provided context.

Level 2: Tool-Calling Agent
When people talk about LLM Agents, they’re usually referring to Tool-Calling Agents — this is where most AI development is happening today. These agents can decide when to retrieve information from APIs, databases, or search engines and can execute tasks using external tools, such as booking a flight, browsing the web, or running calculations. These agents are still reactive, however.

Level 3: Planning Agent
Planning agents take AI beyond simple tool use by structuring multi-step workflows and making results-based execution choices. Unlike Tool-Calling Agents, they detect state changes, refine their approach, and sequence tasks intelligently. Take, for example, an advanced debugging agent that analyzes logs, attempts fixes, and verifies solutions before proceeding. Even agents at this level are still reactive, however: they can’t initiate tasks on their own and don’t persist beyond a single workflow.

Level 4: Autonomous Agent
Autonomous Agents don’t just follow commands — they initiate actions, persist across sessions, and adapt based on feedback. Unlike lower-level agents, they can execute tasks without needing constant user input. In theory, they are self-improving by nature and could even develop new solutions beyond predefined workflows, although fully independent, self-improving agents remain out of reach in today’s landscape.
Since most agents today operate at Level 2, I’ll take a deep dive into the three key aspects of agent evaluation that we briefly discussed earlier: Tool-Calling Evaluation, Agent Workflow Evaluation, and Reasoning Evaluation. By examining relevant metrics and sharing practical examples, I’ll demonstrate why these evaluations are crucial to your LLM agent evaluation pipeline.
Evaluating Tool-Use
Evaluating tool use focuses on two critical aspects: Tool Correctness, which determines whether the correct tools were called, and Tool-Calling Efficiency, which evaluates whether the tools were used in the most efficient way to achieve the desired results. These tool metrics are crucial for Level 2 tool-calling agents but remain important at Levels 3 and 4, where tool usage continues to play a significant role.
Tool Correctness
Tool Correctness assesses whether an agent’s tool-calling behavior aligns with expectations by verifying that all required tools were correctly called. Unlike most LLM evaluation metrics, the Tool Correctness metric is a deterministic measure rather than an LLM judge.

At its most basic level, evaluating tool selection alone is sufficient. But more often than not, you’ll also want to assess the Input Parameters passed into these tools and the Output Accuracy of the results they generate:
- Tool Selection: Comparing the tools the agent calls to the ideal set of tools required for a given user input.
- Input Parameters: Evaluating the accuracy of the input parameters passed into the tools against ground truth references.
- Output Accuracy: Verifying the generated outputs of the tools against the expected ground truth.
It’s important to note that these parameters represent levels of strictness rather than distinct metrics, as evaluating input parameters and output accuracy depends on the correct tools being called. If the wrong tools are used, evaluating these parameters becomes irrelevant.
Furthermore, the Tool Correctness score doesn’t have to be binary or require exact matching:
- Order Independence: The order of tool calls may not matter as long as all necessary tools are used. In such cases, evaluation can focus on comparing sets of tools rather than exact sequences.
- Frequency Flexibility: The number of times each tool is called may be less significant than ensuring the correct tools are selected and used effectively.
These considerations all depend on your evaluation criteria, which are strongly tied to your LLM agent’s use case. For example, a medical AI agent responsible for diagnosing a patient might query the “patient symptom checker” tool after retrieving data from the “medical history database” tool, rather than in the reverse order. As long as both tools are used correctly and all relevant information is accounted for, the diagnosis could still be accurate.
The same flexibility in scoring applies to Input Parameters and Output Accuracy. If a tool requires multiple input parameters, you might calculate the percentage of correct parameters rather than demand an exact match. Similarly, if the output is a numerical value, you could measure its percentage deviation from the expected result.
Ultimately, your definition of the Tool Correctness metric should align with your evaluation criteria and use case to ensure it effectively reflects the desired outcomes.
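To make this concrete, here is a minimal sketch of a Tool Correctness check in DeepEval, assuming its `ToolCorrectnessMetric` and `ToolCall` APIs (names and defaults may differ slightly between versions):

```python
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import ToolCorrectnessMetric

# Compare the tools the agent actually called against the expected set
test_case = LLMTestCase(
    input="Diagnose the patient based on their reported symptoms.",
    actual_output="The symptoms are most consistent with a seasonal allergy.",
    tools_called=[
        ToolCall(name="medical_history_database"),
        ToolCall(name="patient_symptom_checker"),
    ],
    expected_tools=[
        ToolCall(name="patient_symptom_checker"),
        ToolCall(name="medical_history_database"),
    ],
)

metric = ToolCorrectnessMetric()  # deterministic, no LLM judge involved
metric.measure(test_case)
print(metric.score, metric.reason)
```

Whether the reversed ordering above counts as correct depends on how strictly you configure the metric, which mirrors the order-independence and frequency considerations discussed earlier.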
Tool Efficiency
Equally important to tool correctness is tool efficiency. Inefficient tool-calling patterns can increase response times, frustrate users, and significantly raise operational costs.
Think about it: imagine a chatbot helping you book a flight. If it first checks the weather, then converts currency, and only afterward searches for flights, it’s taking an unnecessarily convoluted route. Sure, it might get the job done eventually, but wouldn’t it be far better if it went straight to the flight API?
Let’s explore how tool efficiency can be evaluated, starting with deterministic methods:
- Redundant Tool Usage measures how many tools are invoked unnecessarily — those that do not directly contribute to achieving the intended outcome. This can be calculated as the percentage of unnecessary tools relative to the total number of tool invocations.
- Tool Frequency evaluates whether tools are being called more often than necessary. This method penalizes tools that exceed a predefined threshold for the number of calls required to complete a task (many times this is just 1).
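As a rough illustration, here is a minimal sketch of how these two deterministic checks could be computed over a recorded list of tool calls (the helper functions and data shapes are hypothetical, not part of any library):

```python
from collections import Counter

def redundant_tool_usage(tools_called: list[str], necessary_tools: set[str]) -> float:
    """Fraction of tool invocations that did not contribute to the task."""
    if not tools_called:
        return 0.0
    redundant = [t for t in tools_called if t not in necessary_tools]
    return len(redundant) / len(tools_called)

def tool_frequency_penalty(tools_called: list[str], max_calls: dict[str, int]) -> float:
    """Fraction of calls that exceeded each tool's allowed call count (often just 1)."""
    if not tools_called:
        return 0.0
    counts = Counter(tools_called)
    excess = sum(max(0, n - max_calls.get(tool, 1)) for tool, n in counts.items())
    return excess / len(tools_called)

# Example: a flight-booking agent that took a detour through unrelated tools
calls = ["weather_api", "currency_converter", "flight_search", "flight_search"]
print(redundant_tool_usage(calls, {"flight_search"}))       # 0.5
print(tool_frequency_penalty(calls, {"flight_search": 1}))  # 0.25
```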
While these deterministic metrics provide a solid foundation, evaluating tool efficiency for more complex LLM agents can be challenging. Tool-calling behavior in such agents can quickly become branched, nested, and convoluted (trust me, I’ve tried).
A more flexible approach is to use an LLM as a judge. DeepEval’s method, for example, extracts the user’s goal (the task the agent needs to accomplish) and evaluates the tool-calling trajectory, using the tools called (name, description, input parameters, output) and the list of available tools, to determine whether that trajectory was the most efficient way to accomplish the goal.
This metric not only simplifies efficiency calculation but also avoids the need for rigid specifications, such as a fixed number of tool calls. Instead, it evaluates efficiency based on the tools available and their relevance to the task at hand.
Evaluating Agentic Workflows
While tool-calling metrics are essential for assessing LLM agents, they focus only on tool usage. However, effective evaluation requires a broader perspective — one that examines the agent’s entire workflow.
This includes assessing the full process: from the initial user input, through the reasoning steps and tool interactions, to the final response provided to the user.
Task Completion
A critical metric for assessing agent workflows is Task Completion (also known as task success or goal accuracy). This metric measures how effectively an LLM agent completes a user-given task. The definition of “task completion” can vary significantly depending on the task’s context. Similar to tool use, task completion is most critical for Level 2 tool-calling agents and will continue to be important at higher autonomy levels.
Consider AgentBench, which was the first benchmarking tool designed to evaluate the ability of LLMs to act as agents. It tests LLMs across eight distinct environments, each with unique task completion criteria, including:

- Digital Card Game: here, the task completion criterion is clear and objective — the agent’s goal is to win the game. The corresponding metric is the win rate, or the number of times the agent wins.
- Web Shopping: here, task completion is less straightforward. AgentBench uses a custom metric to evaluate the product purchased by the agent against the ideal product. This metric considers multiple factors, such as price similarity and attribute similarity, which is determined through text matching.
Custom metrics like these are highly effective when the scope of tasks is limited and accompanied by a large dataset with ground-truth labels. However, in real-world applications, agents are often required to perform a diverse set of tasks—many of which may lack predefined ground-truth datasets.
For example, an LLM agent equipped with tools like a web browser can perform virtually unlimited web-based tasks. In such cases, collecting and evaluating interactions in production becomes impractical, as ground-truth references cannot be defined for every possible task. This complexity necessitates a more adaptable and scalable evaluation framework.
DeepEval’s Task Completion metric addresses these challenges by leveraging LLMs to:
- Determine the task from the user’s input.
- Analyze the reasoning steps, tool usage, and final response to assess whether the task was successfully completed.
With this approach, you no longer need to rely on predefined ground-truth datasets or rigid custom criteria. Instead, DeepEval gives you the flexibility to evaluate tasks of all kinds.
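As a rough sketch, the metric can be applied to a test case that records the agent’s input, final output, and tool calls (this assumes DeepEval’s `TaskCompletionMetric` and `ToolCall` APIs, which may differ slightly between versions):

```python
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric

# The metric infers the task from the input, then judges whether the
# final output (and the tool calls along the way) actually completed it.
test_case = LLMTestCase(
    input="Plan a 3-day itinerary for Paris, including museums and food.",
    actual_output="Day 1: Louvre and a bistro in Le Marais. Day 2: ...",
    tools_called=[
        ToolCall(name="museum_search", output=["Louvre", "Musée d'Orsay"]),
        ToolCall(name="restaurant_search", output=["Le Marais bistros"]),
    ],
)

metric = TaskCompletionMetric(threshold=0.7)
metric.measure(test_case)
print(metric.score, metric.reason)
```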
G-Eval for Custom Agent Metrics
Sometimes, you’ll want to evaluate something specific about your LLM agent. G-Eval is a framework that leverages LLMs with chain-of-thought (CoT) reasoning to evaluate outputs based on ANY custom criteria.
This means you can define custom metrics in natural language to assess your agent’s workflow. Because G-Eval is a custom metric, its role is equally important at all autonomy levels, from basic assistants to the most autonomous ones.
Consider a Restaurant Booking Assistant. A common issue might arise where the agent tells the user, “The restaurant is fully booked,” but leaves out important context, such as whether it checked alternative dates or nearby restaurants. For users, this can feel incomplete or unhelpful. To ensure the output reflects the full scope of the agent’s efforts and improves user experience, you could define custom evaluation criteria with G-Eval, such as:
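Here is a hedged sketch of what that could look like with DeepEval’s G-Eval metric; the criteria wording and test-case fields are illustrative rather than prescriptive:

```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# Custom criteria written in natural language: penalize answers that omit
# the alternatives the agent actually explored (other dates, nearby venues).
completeness_metric = GEval(
    name="Booking Response Completeness",
    criteria=(
        "Determine whether the actual output not only answers the booking "
        "request, but also mentions any alternative dates or nearby "
        "restaurants that were checked when the requested slot is unavailable."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Book a table for two at Osteria Nonna this Friday at 7pm.",
    actual_output="Sorry, the restaurant is fully booked on Friday.",
)

completeness_metric.measure(test_case)
print(completeness_metric.score, completeness_metric.reason)
```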
Agentic Reasoning Evaluation
We’ve all seen benchmarks like MMLU and reasoning tasks such as BoolQ being used to test an LLM’s ability to handle mathematical, commonsense, and causal reasoning. While these benchmarks are useful, they often assume that a model’s reasoning skills are entirely dependent on its inherent capabilities. But in practice, that’s rarely the whole story.
In real-world scenarios, your LLM agent’s reasoning is shaped by much more than just the model itself. Things like the prompt template (e.g., chain-of-thought reasoning), tool usage, and the agent’s architecture all play critical roles. Testing the model in isolation might give you a starting point, but it won’t tell you how well your agent performs in real-world workflows where these factors come into play.
On top of that, you need to think about your agent’s specific domain. Every task and workflow is different, and tailoring evaluations to your unique use case is the best way to ensure your agent’s reasoning is both accurate and useful.
Here are a few metrics you can use to evaluate agent-specific reasoning:
- Reasoning Relevancy: is the reasoning behind each tool call clearly tied to what the user is asking for? For example, if the agent queries a restaurant database, it should make sense why it’s doing that — it’s checking availability because the user requested it.
- Reasoning Coherence: Does the reasoning follow a logical, step-by-step process? Each step should add value and make sense in the context of the task.
Agentic reasoning is somewhat important for Level 2 tool-calling agents, though variability is limited by standardized frameworks like ReAct. It becomes increasingly critical at Level 3, as agents take on a planning role and intermediate reasoning steps grow both more important and more domain-specific.
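Neither Reasoning Relevancy nor Reasoning Coherence ships as an off-the-shelf deterministic metric, so one pragmatic option (an assumption on my part, not a DeepEval built-in) is to surface the agent’s reasoning trace in the test case and score it with G-Eval evaluation steps:

```python
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

# Hypothetical setup: the agent's intermediate reasoning is appended to the
# output so the LLM judge can see it (there is no dedicated reasoning field).
reasoning_coherence = GEval(
    name="Reasoning Coherence",
    evaluation_steps=[
        "Identify each reasoning step and tool call described in the actual output.",
        "Check that every tool call is justified by the user's request in the input.",
        "Check that the steps follow logically, with no loops or irrelevant detours.",
        "Penalize any step that does not move the agent closer to the task goal.",
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="Find me a vegan restaurant near the Louvre that is open tonight.",
    actual_output=(
        "Reasoning: the user wants a vegan restaurant near the Louvre -> "
        "queried restaurant_db filtered by 'vegan' and location -> verified "
        "opening hours. Answer: Le Potager du Marais is open until 11pm."
    ),
)

reasoning_coherence.measure(test_case)
print(reasoning_coherence.score, reasoning_coherence.reason)
```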
Component-level Evaluations
So far, we’ve discussed tool-calling, task completion, reasoning, and custom metrics — and shown how these evaluations can be done on DeepEval from an end-to-end perspective. As a reminder, end-to-end evaluations treat an LLM application’s components as a black box, focusing on the main user-facing parameters such as the input and output (with the exception of retrieval contexts and tool calls). This view is essential for metrics like task completion, which assesses the entire agentic flow to determine whether an outcome was achieved.
However, AI agents are modular. A complex agent might combine multiple retrieval engines, several LLM generators, sub-agents (or “swarms”), and numerous tool invocations. You can certainly measure things like contextual recall for RAG agents or tool correctness end-to-end, but those scores won’t pinpoint which component is underperforming.
For instance, a multi-retriever setup might flag “retrieval needs improvement,” but it won’t tell you whether Retriever A or Retriever B is the bottleneck.
By defining metrics at the component level, you gain precise visibility into each part of your system. That lets you identify the true failure points — whether a specific retriever, a particular generator, or a sub-agent — and focus your efforts on the most impactful, lowest-hanging fruit. This applies to any of the metrics that we previously discussed, except for task completion.
RAG Metrics
RAG metrics can be applied at both end-to-end and component-level testing. The five core RAG metrics are Answer Relevancy, Faithfulness, Contextual Relevancy, Contextual Precision, and Contextual Recall.
Four of these — Answer Relevancy, Faithfulness, Contextual Precision, and Contextual Recall — depend on the LLM’s generated answer, so they must be used at the agent level (where you have both retrieved contexts and the final output).
Contextual Relevancy, however, only compares the input query against the retrieved passages — so it can be utilized directly at the retriever level.
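As an illustration, a retriever-only check might look like the following sketch, assuming DeepEval’s `ContextualRelevancyMetric`; the retriever call itself is a hypothetical placeholder:

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric

# Only the query and the retrieved passages are needed at this level,
# so no LLM generation is involved.
query = "What is the refund window for annual plans?"
retrieved_chunks = my_retriever.search(query, top_k=5)  # hypothetical retriever

test_case = LLMTestCase(
    input=query,
    actual_output="",  # not used by this metric, but required by the test case
    retrieval_context=retrieved_chunks,
)

metric = ContextualRelevancyMetric(threshold=0.6)
metric.measure(test_case)
print(metric.score, metric.reason)
```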
Tool Use
Tool Efficiency and Tool Correctness metrics rely solely on the record of tool invocations and the expected calls. Since they don’t require the LLM’s output, you can evaluate them at the most granular component level, per tool call or per set of related tool calls. This lets you isolate and optimize each tool integration (or group of related tools) rather than lumping all tool interactions into a single end-to-end list.
Safety Metrics
Safety metrics (bias, toxicity, harmful content, etc.) can be applied at both the LLM level (raw model outputs) and the agent level (post-tool-call or aggregated responses). As long as you have an output string, you can compute these safety scores on individual generator level or on the agent level, giving you flexibility to monitor and remediate issues wherever they arise.
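A minimal sketch, assuming DeepEval’s `BiasMetric` and `ToxicityMetric` and treating `output_text` as whatever response string you want to screen, whether from a single generator or the agent’s final reply:

```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import BiasMetric, ToxicityMetric

# Works the same whether `output_text` came from one generator component
# or from the agent's final, post-tool-call response.
test_case = LLMTestCase(
    input="Summarize the candidate's resume.",
    actual_output=output_text,  # placeholder for the string under test
)

for metric in (BiasMetric(threshold=0.5), ToxicityMetric(threshold=0.5)):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score)
```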
Component-Level Evaluations with Tracing
Component-level evaluation works best when done in conjunction with LLM tracing. As the name suggests, tracing lets you trace each individual component in your LLM agent—whether it’s a retrieval call, a reranker, or a custom tool invocation—that contributes to the final response or action. By inspecting these traces, you can quickly identify low-scoring components, understand exactly where performance is lagging, and apply targeted fixes.
General observability platforms like Datadog or New Relic can capture system-level metrics, but LLM-specialized tools such as Confident AI embed tracing directly into your LLM testing suite. For more information on how to set up and use tracing with Confident AI, see the Tracing on Confident AI documentation.
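As a rough sketch of what component-level tracing can look like in code, assuming DeepEval’s `@observe` decorator and `update_current_span` helper (exact names and arguments may vary by version), a traced retriever with its own metric might be written like this:

```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase
from deepeval.metrics import ContextualRelevancyMetric

# Each decorated component becomes a traced span with its own metric,
# so low scores can be pinned to a specific retriever rather than the
# whole agent.
@observe(metrics=[ContextualRelevancyMetric()])
def retriever_a(query: str) -> list[str]:
    chunks = vector_store_a.search(query, top_k=5)  # hypothetical vector store
    update_current_span(
        test_case=LLMTestCase(
            input=query,
            actual_output="",
            retrieval_context=chunks,
        )
    )
    return chunks
```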

Conclusion
Can’t believe you made it all the way here! Congratulations on becoming an expert in evaluating LLM agents. To recap, LLM agents stand out from regular LLM applications due to their ability to call tools and perform reasoning.
This means we need to evaluate tool-calling, reasoning steps, and entire agentic workflows that combine these capabilities. Fortunately, DeepEval provides these metrics out of the box and ready to use.
Don’t forget that you’ll also need to focus on other aspects beyond just agentic metrics to get a comprehensive evaluation. Thanks for reading, and don’t forget to give ⭐ DeepEval a star on GitHub ⭐ if you found this article useful.
Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?