LlamaIndex

Use Confident AI for LLM observability and evals with LlamaIndex

Overview

LlamaIndex is an LLM framework that makes it easy to build knowledge agents from complex data. Confident AI allows you to trace and evaluate LlamaIndex agents in just a few lines of code.

Tracing Quickstart

1. Install Dependencies

Run the following command to install the required packages:

$ pip install -U deepeval llama-index
2. Set Up Confident AI Key

Log in to Confident AI using your Confident API key.

$ deepeval login
3. Configure LlamaIndex

Instrument LlamaIndex using instrument_llama_index to enable Confident AI’s LlamaIndexHandler.

main.py
import asyncio
from llama_index.llms.openai import OpenAI
from llama_index.core.agent import FunctionAgent
import llama_index.core.instrumentation as instrument

from deepeval.integrations.llama_index import instrument_llama_index

# Instrument LlamaIndex so DeepEval captures traces and publishes them to Confident AI
instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
)

async def llm_app(input: str):
    return await agent.run(input)

asyncio.run(llm_app("What is 3 * 12?"))

Now, whenever your LlamaIndex agent runs, DeepEval will collect the traces and publish them to Confident AI.

You can directly view the traces on Confident AI by clicking on the link in the output printed in the console.

Evals Usage

Online evals

You can run online evals on your LlamaIndex agent, which will run evaluations on all incoming traces on Confident AI’s servers. This approach is recommended if your agent is in production.

1. Create metric collection

Create a metric collection on Confident AI with the metrics you wish to use to evaluate your LlamaIndex agent.


Your metric collection should only contain metrics that don’t require retrieval_context, context, expected_output, or expected_tools for evaluation, since these parameters aren’t available on incoming traces.
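As a rough illustration of this constraint (note that the collection itself is configured in the Confident AI UI, not in code), a metric like Answer Relevancy qualifies because it only needs a trace’s input and output, while a metric like Faithfulness does not, since it also requires retrieval_context. In DeepEval terms:

from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

# Suitable for online evals: only needs the trace's input and actual output.
AnswerRelevancyMetric(threshold=0.7)

# Would not qualify: it also requires retrieval_context.
FaithfulnessMetric(threshold=0.7)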

2. Run evals

Confident AI supports online evals for LlamaIndex’s FunctionAgent, ReActAgent, and CodeActAgent. Replace your LlamaIndex agent with DeepEval’s equivalent, and provide your metric collection name as the metric_collection argument to the agent.

main.py
import asyncio
from llama_index.llms.openai import OpenAI
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.integrations.llama_index import FunctionAgent

instrument_llama_index(instrument.get_dispatcher())

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

# DeepEval's FunctionAgent accepts metric_collection, which tells Confident AI
# which metrics to run on each incoming trace
agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
    metric_collection="<your-metric-collection-name>",
)

async def llm_app(input: str):
    return await agent.run(input)

asyncio.run(llm_app("What is 3 * 12?"))

All incoming traces will now be evaluated using metrics from your metric collection.

End-to-end evals

End-to-end evals run locally against your LlamaIndex agent, and are the recommended approach if your agent is in a development or testing environment.

1. Create metric

from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

Similar to online evals, you can only run end-to-end evals with metrics that don’t require retrieval_context, context, expected_output, or expected_tools for evaluation.

2. Run evals

Pass your metrics to the agent. Then, use the dataset’s evals_iterator to invoke your LlamaIndex agent for each golden.

main.py
import asyncio
from llama_index.llms.openai import OpenAI
import llama_index.core.instrumentation as instrument
from deepeval.integrations.llama_index import instrument_llama_index
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.integrations.llama_index import FunctionAgent
from deepeval.dataset import EvaluationDataset, Golden

instrument_llama_index(instrument.get_dispatcher())
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini", include_reason=True)

def multiply(a: float, b: float) -> float:
    """Useful for multiplying two numbers."""
    return a * b

# Pass metrics directly to DeepEval's FunctionAgent for end-to-end evals
agent = FunctionAgent(
    tools=[multiply],
    llm=OpenAI(model="gpt-4o-mini"),
    system_prompt="You are a helpful assistant that can perform calculations.",
    metrics=[answer_relevancy_metric],
)

async def llm_app(input: str):
    return await agent.run(input)

dataset = EvaluationDataset(
    goldens=[Golden(input="What is 3 * 12?"), Golden(input="What is 4 * 13?")]
)

# Invoke the agent for each golden; dataset.evaluate() collects the traced
# results into a test run on Confident AI
for golden in dataset.evals_iterator():
    task = asyncio.create_task(llm_app(golden.input))
    dataset.evaluate(task)

This will automatically generate a test run with evaluated traces using inputs from your dataset.
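If your goldens already live on Confident AI, you can also pull a dataset instead of defining goldens inline. Below is a minimal sketch, continuing the example above and assuming a dataset with the alias "your-dataset-alias" exists in your project:

from deepeval.dataset import EvaluationDataset

# Pull goldens from a dataset hosted on Confident AI
# ("your-dataset-alias" is a placeholder for your own dataset alias).
dataset = EvaluationDataset()
dataset.pull(alias="your-dataset-alias")

# The evals_iterator loop is unchanged from the inline-goldens example.
for golden in dataset.evals_iterator():
    task = asyncio.create_task(llm_app(golden.input))
    dataset.evaluate(task)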

View on Confident AI

You can view the evals on Confident AI by clicking on the link in the output printed in the console.