OpenAI

Use Confident AI for LLM observability and evals for OpenAI

Overview

Confident AI lets you trace and evaluate OpenAI calls, whether standalone or used as a component within a larger application.

Tracing Quickstart

1. Install Dependencies

Run the following command to install the required packages:

$pip install -U deepeval openai
2. Setup Confident AI Key

Login to Confident AI using your Confident API key.

$deepeval login
3. Configure OpenAI

To begin tracing your OpenAI calls as a component in your application, import OpenAI from DeepEval instead.

main.py
from deepeval.openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the weather in France?"},
    ],
)

DeepEval’s OpenAI client traces the chat.completions.create, beta.chat.completions.parse, and responses.create methods.
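For example, a call made through the Responses API is traced the same way as a chat completion. The snippet below is a minimal sketch assuming the same drop-in client; the model and input are placeholders.

from deepeval.openai import OpenAI

client = OpenAI()

# The traced client also covers the Responses API
response = client.responses.create(
    model="gpt-4o-mini",
    input="What is the weather in France?",
)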

4. Run OpenAI

Invoke your OpenAI client by executing the script:

$python main.py

You can view the traces directly on Confident AI by clicking the link printed in the console output.

Advanced Usage

Logging prompts

If you are managing prompts on Confident AI and wish to log them, pass your Prompt object to the trace context.

main.py
from deepeval.openai import OpenAI
from deepeval.prompt import Prompt
from deepeval.tracing import trace

prompt = Prompt(alias="my-prompt")
prompt.pull(version="00.00.01")

client = OpenAI()

with trace(prompt=prompt):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": prompt.interpolate(name="John")},  # interpolated prompt must be a string system prompt
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

Logging threads

Threads group related traces together and are useful for chat apps, agents, or any multi-turn interaction. Learn more about threads here. You can set the thread_id in the trace context.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(thread_id="test_thread_id_1"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

Other trace attributes

Confident AI’s advanced LLM tracing features let you set additional attributes on each trace when invoking your OpenAI client.

For example, user_id can be used to enable user analytics (you can learn more about user IDs here). Similarly, metadata attaches arbitrary metadata to the trace.

You can set these attributes in the trace context when invoking your OpenAI client.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(
    thread_id="test_thread_id_1",
    metadata={"test_metadata_1": "test_metadata_1"},
):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

These attributes override any trace attributes that were set using the update_current_trace method.

name (str): The name of the trace.
tags (List[str]): String labels that help you group related traces.
metadata (Dict): Attach any metadata to the trace.
thread_id (str): The thread or conversation ID, used to view and evaluate conversations.
user_id (str): The user ID, used to enable user analytics.

Each attribute is optional, and works the same way as the native tracing features on Confident AI.
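For reference, several of these attributes can be combined in a single trace context. The snippet below is a minimal sketch assuming trace accepts each attribute listed above as a keyword argument; all values are placeholders.

from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(
    name="weather-query",          # placeholder trace name
    tags=["demo", "weather"],      # placeholder tags
    metadata={"env": "staging"},   # placeholder metadata
    thread_id="test_thread_id_1",
    user_id="user_123",            # placeholder user ID
):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the weather in France?"},
        ],
    )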

Evals Usage

Online evals

If your OpenAI application is in production and you still want to run evaluations on your traces, use online evals. Online evals run evaluations on all incoming traces on Confident AI’s server.

1. Create metric collection

Create a metric collection on Confident AI with the metrics you wish to use to evaluate your OpenAI agent. Copy the name of the metric collection.

2. Run evals

Set the llm_metric_collection name in the trace context when invoking your OpenAI client to evaluate LLM spans.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

client = OpenAI()

with trace(llm_metric_collection="test_collection_1"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

End-to-end evals

Confident AI allows you to run end-to-end evals that evaluate your OpenAI calls directly. This is recommended if you are testing your OpenAI calls in isolation.

1. Create metric

from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True,
)

You can only run end-to-end evals on OpenAI using metrics that evaluate the input, output, or tools_called. You can also pass parameters like expected_output, expected_tools, context, and retrieval_context to the trace context (a variant showing this appears after step 2).

2. Run evals

Replace your OpenAI client with DeepEval’s. Then, use the dataset’s evals_iterator to invoke your OpenAI client for each golden.

main.py
from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset
from deepeval.tracing import trace

client = OpenAI()

dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

for golden in dataset.evals_iterator():
    # run OpenAI client
    with trace(
        llm_metrics=[AnswerRelevancyMetric(), BiasMetric()],
        expected_output=golden.expected_output,
    ):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": golden.input},
            ],
        )

This will automatically generate a test run with evaluated OpenAI traces using inputs from your dataset.
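If your goldens also carry retrieval context, the additional parameters mentioned in step 1 can be forwarded through the same trace context. The variant below is a hedged sketch: it assumes your dataset’s goldens populate retrieval_context and that the trace context accepts it alongside expected_output.

from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset
from deepeval.tracing import trace

client = OpenAI()

dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

for golden in dataset.evals_iterator():
    with trace(
        llm_metrics=[AnswerRelevancyMetric()],
        expected_output=golden.expected_output,
        retrieval_context=golden.retrieval_context,  # assumes your goldens include retrieval context
    ):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": golden.input},
            ],
        )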

Using OpenAI in component-level evals

You can also evaluate OpenAI calls through component-level evals. This approach is recommended if you are testing your OpenAI calls as a component within a larger application.

1. Create metric

from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True,
)

As with end-to-end evals, you can only use metrics that evaluate input, output, or tools_called.

2. Run evals

Replace your OpenAI client with DeepEval’s. Then, use the dataset’s evals_iterator to invoke your LLM application for each golden.

Make sure that each function or method in your LLM application is decorated with @observe.

from deepeval.openai import OpenAI
from deepeval.tracing import observe, trace
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric

client = OpenAI()

@observe()
def generate_response(input: str, expected_output: str) -> str:
    with trace(
        llm_metrics=[AnswerRelevancyMetric()],
        expected_output=expected_output,
    ):
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": input},
            ],
        )
    return response.choices[0].message.content

# Create dataset
dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

# Run component-level evaluation
for golden in dataset.evals_iterator():
    generate_response(golden.input, golden.expected_output)