OpenAI

Use Confident AI for LLM observability and evals for OpenAI

Overview

Confident AI lets you trace and evaluate OpenAI calls, whether standalone or used as a component within a larger application.

Tracing Quickstart

1. Install Dependencies

Run the following command to install the required packages:

$pip install -U deepeval openai
2. Setup Confident AI Key

Login to Confident AI using your Confident API key.

$deepeval login
3. Configure OpenAI

To begin tracing your OpenAI calls as a component in your application, import OpenAI from DeepEval instead.

main.py
from deepeval.openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather in France?"},
    ],
)

DeepEval’s OpenAI client traces the chat.completions.create, beta.chat.completions.parse, and responses.create methods.
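For example, a call made through the Responses API is traced the same way as a chat completion. The snippet below is a minimal sketch assuming the same drop-in client; the model and input are placeholders.

from deepeval.openai import OpenAI

client = OpenAI()

# The traced client also covers the Responses API
response = client.responses.create(
    model="gpt-4o-mini",
    input="What is the weather in France?",
)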

4. Run OpenAI

Invoke your OpenAI client by executing the script:

$python main.py

You can view the traces directly on Confident AI by clicking the link printed in the console output.

Advanced Usage

Logging prompts

If you are managing prompts on Confident AI and wish to log them, pass your Prompt object to the trace context.

main.py
from deepeval.openai import OpenAI
from deepeval.prompt import Prompt
from deepeval.tracing import trace

prompt = Prompt(alias="my-prompt")
prompt.pull(version="00.00.01")

client = OpenAI()

with trace(prompt=prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt.interpolate(name="John")},  # interpolated prompt must be a string system prompt
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

Logging threads

Threads group related traces together and are useful for chat apps, agents, or any multi-turn interaction. Learn more about threads here. You can set the thread_id in the trace context.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(thread_id="test_thread_id_1"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

Other trace attributes

Confident AI’s advanced LLM tracing features let you set additional attributes on each trace when invoking your OpenAI client.

For example, user_id can be used to enable user analytics (you can learn more about user IDs here). Similarly, metadata attaches arbitrary metadata to the trace.

You can set these attributes in the trace context when invoking your OpenAI client.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(
    thread_id="test_thread_id_1",
    metadata={"test_metadata_1": "test_metadata_1"},
):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

These attributes override any trace attributes that were set using the update_current_trace method.

name (str): The name of the trace.
tags (List[str]): String labels that help you group related traces.
metadata (Dict): Attach any metadata to the trace.
thread_id (str): The thread or conversation ID, used to view and evaluate conversations.
user_id (str): The user ID, used to enable user analytics.

Each attribute is optional, and works the same way as the native tracing features on Confident AI.
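For reference, several of these attributes can be combined in a single trace context. The snippet below is a minimal sketch assuming trace accepts each attribute listed above as a keyword argument; all values are placeholders.

from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(
    name="weather-query",          # placeholder trace name
    tags=["demo", "weather"],      # placeholder tags
    metadata={"env": "staging"},   # placeholder metadata
    thread_id="test_thread_id_1",
    user_id="user_123",            # placeholder user ID
):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the weather in France?"},
        ],
    )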

Evals Usage

Online evals

If your OpenAI application is in production and you still want to run evaluations on your traces, use online evals. Online evals run evaluations on all incoming traces on Confident AI’s server.

1. Create metric collection

Create a metric collection on Confident AI with the metrics you wish to use to evaluate your OpenAI agent. Copy the name of the metric collection.

2. Run evals

Set the llm_metric_collection name in the trace context when invoking your OpenAI client to evaluate LLM spans.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(llm_metric_collection="test_collection_1"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

End-to-end evals

Confident AI allows you to run end-to-end evals that evaluate your OpenAI calls directly. This is recommended if you are testing your OpenAI calls in isolation.

1. Create metric

from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True,
)

You can only run end-to-end evals on OpenAI using metrics that evaluate the input, output, or tools_called. You can also pass parameters like expected_output, expected_tools, context, and retrieval_context to the trace context (a variant showing this appears after step 2).

2. Run evals

Replace your OpenAI client with DeepEval’s. Then, use the dataset’s evals_iterator to invoke your OpenAI client for each golden.

main.py
from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset
from deepeval.tracing import trace

client = OpenAI()

dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

for golden in dataset.evals_iterator():
    # run OpenAI client
    with trace(
        llm_metrics=[AnswerRelevancyMetric(), BiasMetric()],
        expected_output=golden.expected_output,
    ):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": golden.input},
            ],
        )

This will automatically generate a test run with evaluated OpenAI traces using inputs from your dataset.
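If your goldens also carry retrieval context, the additional parameters mentioned in step 1 can be forwarded through the same trace context. The variant below is a hedged sketch: it assumes your dataset’s goldens populate retrieval_context and that the trace context accepts it alongside expected_output.

from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset
from deepeval.tracing import trace

client = OpenAI()

dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

for golden in dataset.evals_iterator():
    with trace(
        llm_metrics=[AnswerRelevancyMetric()],
        expected_output=golden.expected_output,
        retrieval_context=golden.retrieval_context,  # assumes your goldens include retrieval context
    ):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": golden.input},
            ],
        )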

Using OpenAI in component-level evals

You can also evaluate OpenAI calls through component-level evals. This approach is recommended if you are testing your OpenAI calls as a component within a larger application.

1. Create metric

from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True,
)

As with end-to-end evals, you can only use metrics that evaluate input, output, or tools_called.

2. Run evals

Replace your OpenAI client with DeepEval’s. Then, use the dataset’s evals_iterator to invoke your LLM application for each golden.

Make sure that each function or method in your LLM application is decorated with @observe.

from deepeval.openai import OpenAI
from deepeval.tracing import observe, trace
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric

client = OpenAI()

@observe()
def generate_response(input: str, expected_output: str) -> str:
    with trace(
        llm_metrics=[AnswerRelevancyMetric()],
        expected_output=expected_output,
    ):
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": input},
            ],
        )
    return response.choices[0].message.content

# Create dataset
dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

# Run component-level evaluation
for golden in dataset.evals_iterator():
    generate_response(golden.input, golden.expected_output)