LlamaIndex

Use Confident AI for LLM observability and evals for LlamaIndex

Overview

LlamaIndex is an LLM framework that makes it easy to build knowledge agents from complex data. Confident AI allows you to trace and evaluate LlamaIndex agents in just a few lines of code.

Tracing Quickstart

1. Install Dependencies

Run the following command to install the required packages:

$ pip install -U deepeval llama-index

2. Set Up Confident AI Key

Log in to Confident AI using your Confident API key.

$ deepeval login

3. Instrument LlamaIndex

Call instrument_llama_index once at startup, passing LlamaIndex’s root dispatcher. Every subsequent LlamaIndex call in your application will automatically be traced and sent to Confident AI.

main.py
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index
instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
)

async def main():
    return await agent.run("What is 3 * 12?")

asyncio.run(main())

instrument_llama_index registers DeepEval's handler with LlamaIndex's instrumentation dispatcher. From that point on, all LlamaIndex spans and events are captured automatically, with no other code changes required.
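To illustrate why a single registration on the root dispatcher can be enough, here is a minimal, self-contained sketch of the dispatcher pattern. ToyDispatcher and CollectingHandler are hypothetical names made up for this illustration; they are not LlamaIndex or DeepEval classes, and the real implementation is more involved. The assumption being modeled is that events emitted on a child dispatcher bubble up to the root, so a handler attached there sees all of them.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ToyDispatcher:
    """Hypothetical stand-in for an instrumentation dispatcher."""

    name: str
    parent: Optional["ToyDispatcher"] = None
    handlers: List[Callable[[str, str], None]] = field(default_factory=list)

    def add_handler(self, handler: Callable[[str, str], None]) -> None:
        self.handlers.append(handler)

    def emit(self, event: str, source: Optional[str] = None) -> None:
        # Deliver to handlers registered here, then bubble up to the parent.
        source = source or self.name
        for handler in self.handlers:
            handler(source, event)
        if self.parent is not None:
            self.parent.emit(event, source)


class CollectingHandler:
    """Records every (source, event) pair it receives."""

    def __init__(self) -> None:
        self.seen: List[Tuple[str, str]] = []

    def __call__(self, source: str, event: str) -> None:
        self.seen.append((source, event))


root = ToyDispatcher("root")
llm_dispatcher = ToyDispatcher("llm", parent=root)
tool_dispatcher = ToyDispatcher("tool", parent=root)

handler = CollectingHandler()
root.add_handler(handler)  # registered once, at the root

# Events emitted anywhere below the root still reach the handler.
llm_dispatcher.emit("chat_start")
tool_dispatcher.emit("call_tool")
llm_dispatcher.emit("chat_end")
```

In this toy model, handler.seen ends up holding all three events even though nothing was registered on the child dispatchers, which mirrors the "call once at startup" usage described above.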

You can directly view the traces on Confident AI by clicking on the link printed in the console output.

What Gets Traced

The integration captures the following span types automatically:

Span type | When it is created
--------- | ------------------
Agent | Any Workflow.run() or FunctionAgent.run() call
LLM | Each LLMChatStartEvent / LLMChatEndEvent pair; includes input messages, model name, and the inferred provider (e.g. OpenAI, Anthropic)
Tool | Any call_tool / acall_tool / acall invocation; captures tool name, inputs, and outputs
Generic | All other instrumented LlamaIndex methods

Retrieval context from RetrievalEndEvent is automatically attached to the enclosing span, making it available for retrieval-based metrics.

Each span is tagged with integration: "LlamaIndex" so you can filter by framework on Confident AI.
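The span-type rules above can be read as a simple classification. The sketch below is only an illustration of those rules: classify_span and tag_span are hypothetical helpers, the order in which the checks run is an assumption, and none of this is the integration's actual code.

```python
from typing import Dict, Optional

# Names taken from the span-type rules described above.
AGENT_OWNERS = {"Workflow", "FunctionAgent"}
TOOL_METHODS = {"call_tool", "acall_tool", "acall"}
LLM_EVENTS = {"LLMChatStartEvent", "LLMChatEndEvent"}


def classify_span(owner: str, method: str, event: Optional[str] = None) -> str:
    """Map an instrumented call to a span type (hypothetical helper)."""
    if event in LLM_EVENTS:
        return "LLM"
    if method in TOOL_METHODS:
        return "Tool"
    if method == "run" and owner in AGENT_OWNERS:
        return "Agent"
    # Everything else falls through to the catch-all type.
    return "Generic"


def tag_span(owner: str, method: str, event: Optional[str] = None) -> Dict[str, str]:
    """Build a toy span record carrying the framework tag."""
    return {
        "type": classify_span(owner, method, event),
        "integration": "LlamaIndex",
    }


agent_span = tag_span("FunctionAgent", "run")
llm_span = tag_span("OpenAI", "chat", event="LLMChatStartEvent")
tool_span = tag_span("FunctionTool", "acall")
other_span = tag_span("VectorIndexRetriever", "retrieve")
```

Filtering on the integration key of such records is the toy equivalent of filtering by framework on Confident AI.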

Advanced Features

Set trace attributes

You can attach metadata, user identifiers, and other attributes to a trace by wrapping your LlamaIndex call inside the trace context manager.

main.py
import asyncio
from llama_index.core.agent import FunctionAgent
from llama_index.llms.openai import OpenAI
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.tracing import trace

instrument_llama_index(instrument.get_dispatcher())

agent = FunctionAgent(
    tools=[],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant.",
)

async def handle_request(user_input: str, user_id: str, thread_id: str):
    with trace(
        user_id=user_id,
        thread_id=thread_id,
        tags=["production", "v2"],
        metadata={"environment": "prod"},
    ):
        return await agent.run(user_input)

asyncio.run(handle_request("Hello!", user_id="user-42", thread_id="conv-99"))

name (str): The name of the trace.

tags (List[str]): Tags are string labels that help you group related traces.

metadata (Dict): Attach arbitrary metadata to the trace.

thread_id (str): Supply the thread or conversation ID to view and evaluate conversations.

user_id (str): Supply the user ID to enable user analytics.

input (Any): Override the top-level input recorded for this trace.

output (Any): Override the top-level output recorded for this trace.

retrieval_context (List[str]): Explicitly set the retrieval context for this trace.

context (List[str]): Contextual information available to the model at inference time.

expected_output (str): The expected or ground-truth output for this trace.

tools_called (List[ToolCall]): Manually specify the tools called during this trace.

expected_tools (List[ToolCall]): The expected tools that should have been called.

Each attribute is optional and works the same way as the native tracing features on Confident AI.
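Because every attribute is optional, one convenient pattern is to collect per-request values in one place and drop anything that was not supplied. The following is a minimal sketch of that idea using only the standard library; build_trace_attributes is a hypothetical helper, not part of DeepEval, and it covers only a subset of the attributes listed above.

```python
from typing import Any, Dict, List, Optional


def build_trace_attributes(
    user_id: Optional[str] = None,
    thread_id: Optional[str] = None,
    name: Optional[str] = None,
    tags: Optional[List[str]] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect trace attributes, omitting any that were left unset."""
    candidates = {
        "name": name,
        "user_id": user_id,
        "thread_id": thread_id,
        "tags": tags,
        "metadata": metadata,
    }
    # Only keep attributes the caller actually provided.
    return {key: value for key, value in candidates.items() if value is not None}


attrs = build_trace_attributes(
    user_id="user-42",
    thread_id="conv-99",
    tags=["production", "v2"],
)
```

The resulting dictionary could then be unpacked into the context manager, for example with trace(**attrs): inside a request handler such as handle_request.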

Evals Usage

Online evals

You can run online evals on your LlamaIndex application to evaluate all incoming traces on Confident AI’s servers. This approach is recommended when your application is in production.

1. Create metric collection

Create a metric collection on Confident AI with the metrics you wish to use to evaluate your LlamaIndex application.


The LlamaIndex integration automatically captures input and actual_output for Agent and LLM spans. Use metrics that only require those fields (e.g. Answer Relevancy, Task Completion) unless you also supply retrieval_context, context, expected_output, or expected_tools explicitly via the trace context manager or span context objects.
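One way to reason about which metrics fit is to compare the fields each metric needs against the fields that are available. The sketch below is a standard-library illustration only: usable_metrics is a hypothetical helper, and the REQUIRED_FIELDS table is an assumption (the first two entries follow the note above; the Faithfulness entry is assumed and should be checked against the metric's documentation).

```python
from typing import Dict, Iterable, List, Set

# Fields the note above says are captured automatically.
AUTO_CAPTURED: Set[str] = {"input", "actual_output"}

# Assumed field requirements, for illustration only.
REQUIRED_FIELDS: Dict[str, Set[str]] = {
    "Answer Relevancy": {"input", "actual_output"},
    "Task Completion": {"input", "actual_output"},
    "Faithfulness": {"input", "actual_output", "retrieval_context"},
}


def usable_metrics(extra_fields: Iterable[str] = ()) -> List[str]:
    """Return metrics whose required fields are all available."""
    available = AUTO_CAPTURED | set(extra_fields)
    return sorted(
        name for name, needed in REQUIRED_FIELDS.items() if needed <= available
    )


defaults_only = usable_metrics()
with_retrieval = usable_metrics(["retrieval_context"])
```

Under these assumptions, defaults_only contains only the two metrics that need input and actual_output, while supplying retrieval_context additionally unlocks the third.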

2. Run evals

Pass metric_collection to the trace context manager to evaluate the entire trace with your chosen metric collection.

main.py
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.tracing import trace

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant.",
)

async def llm_app(user_input: str):
    with trace(metric_collection="my_metric_collection"):
        return await agent.run(user_input)

asyncio.run(llm_app("What is 3 * 12?"))

All incoming traces will now be evaluated using metrics from your metric collection.

Span-level evals

For finer-grained control, you can attach metrics or a metric collection directly to individual Agent or LLM spans using AgentSpanContext or LlmSpanContext. This lets you evaluate specific spans independently.

main.py
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.tracing import trace
from deepeval.tracing.trace_context import AgentSpanContext
from deepeval.metrics import AnswerRelevancyMetric

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
)

answer_relevancy = AnswerRelevancyMetric()

async def llm_app(user_input: str):
    agent_span_context = AgentSpanContext(
        metrics=[answer_relevancy],
    )
    with trace(agent_span_context=agent_span_context):
        return await agent.run(user_input)

asyncio.run(llm_app("What is 3 * 12?"))

AgentSpanContext is applied to Agent spans (i.e. workflow or agent .run() calls):

metrics (List[BaseMetric]): A list of DeepEval metric instances to evaluate this agent span with.

metric_collection (str): Name of a metric collection on Confident AI to use for evaluation.

expected_output (str): The expected output for the agent span.

expected_tools (List[ToolCall]): The expected tools that should have been called.

context (List[str]): Contextual information for the span.

retrieval_context (List[str]): Retrieved documents or chunks for the span.


LlmSpanContext is applied to LLM spans (i.e. individual LLM calls):

metrics (List[BaseMetric]): A list of DeepEval metric instances to evaluate this LLM span with.

metric_collection (str): Name of a metric collection on Confident AI to use for evaluation.

prompt (Prompt): A Prompt object from deepeval.prompt to associate a managed prompt with this LLM span.

expected_output (str): The expected output for the LLM span.

expected_tools (List[ToolCall]): The expected tools that should have been called.

context (List[str]): Contextual information for the span.

retrieval_context (List[str]): Retrieved documents or chunks for the span.

View on Confident AI

You can view the evals on Confident AI by clicking on the link in the output printed in the console.