Portkey

Portkey AI serves as a unified interface for interacting with LLMs.

Overview

Confident AI lets you trace and evaluate Portkey LLM calls, whether standalone or used as a component within a larger application.

Tracing Quickstart

1. Install Dependencies

Run the following command to install the required packages:

$pip install -U deepeval portkey-ai
2. Set Up Confident AI Key

Log in to Confident AI using your Confident API key.

$deepeval login
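
If you are running in a non-interactive environment such as CI, you can typically skip the browser login and supply the key through an environment variable instead. This assumes your deepeval version reads the CONFIDENT_API_KEY variable:

$export CONFIDENT_API_KEY=<your-confident-api-key>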
3. Configure Portkey

To begin tracing your Portkey LLM calls as a component in your application, import OpenAI from deepeval.openai and point it at Portkey's gateway by passing PORTKEY_GATEWAY_URL as the base_url.

main.py
from deepeval.openai import OpenAI
from portkey_ai import PORTKEY_GATEWAY_URL

portkey = OpenAI(
    base_url=PORTKEY_GATEWAY_URL,
    api_key="<PORTKEY_API_KEY>"
)

response = portkey.chat.completions.create(
    model="@slug/<model>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Portkey"}
    ],
)

DeepEval’s OpenAI client traces the chat.completions.create method.
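
As in the later examples, you can print the completion content to confirm the call was routed through the Portkey gateway:

print(response.choices[0].message.content)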

4. Run Portkey

Invoke your application by executing the script:

$python main.py

You can view the traces directly on Confident AI by clicking the link printed in the console output.

Advanced Usage

Logging prompts

If you are managing prompts on Confident AI and wish to log them, pass your Prompt to the trace context around the create call.

main.py
from portkey_ai import PORTKEY_GATEWAY_URL

from deepeval.openai import OpenAI
from deepeval.prompt import Prompt
from deepeval.tracing import trace

portkey = OpenAI(
    base_url=PORTKEY_GATEWAY_URL,
    api_key="<PORTKEY_API_KEY>"
)

prompt = Prompt(alias="my_prompt")
prompt.pull(version="00.00.01")

with trace(prompt=prompt):
    response = portkey.chat.completions.create(
        model="@slug/<model>",
        messages=[
            {"role": "system", "content": prompt.interpolate(name="John")},  # string system prompt
            {"role": "user", "content": "What is Portkey"}
        ],
    )

print(response.choices[0].message.content)

This is an example of using STRING type prompt interpolation.

Logging threads

Threads are used to group related traces together, and are useful for chat apps, agents, or any multi-turn interactions. Learn more about threads here. You can set the thread_id in the trace context.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

from portkey_ai import PORTKEY_GATEWAY_URL

portkey = OpenAI(
    base_url=PORTKEY_GATEWAY_URL,
    api_key="<PORTKEY_API_KEY>"
)

with trace(thread_id="test_thread_id_1"):
    response = portkey.chat.completions.create(
        model="@slug/<model>",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Portkey"}
        ],
    )

print(response.choices[0].message.content)
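
Because traces that share a thread_id are grouped into one thread, a multi-turn conversation simply reuses the same identifier across calls. Here is a minimal sketch building on the portkey client above; the thread identifier is a hypothetical value:

THREAD_ID = "user_123_support_chat"  # hypothetical conversation identifier

# First turn
with trace(thread_id=THREAD_ID):
    first = portkey.chat.completions.create(
        model="@slug/<model>",
        messages=[{"role": "user", "content": "What is Portkey"}],
    )

# Second turn, traced into the same thread on Confident AI
with trace(thread_id=THREAD_ID):
    second = portkey.chat.completions.create(
        model="@slug/<model>",
        messages=[
            {"role": "user", "content": "What is Portkey"},
            {"role": "assistant", "content": first.choices[0].message.content},
            {"role": "user", "content": "How does it work with Confident AI?"},
        ],
    )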


Evals Usage

Online evals

If your Portkey application is in production and you still want to run evaluations on your traces, use online evals. They let you run evaluations on all incoming traces on Confident AI’s server.

1. Create metric collection

Create a metric collection on Confident AI with the metrics you wish to use to evaluate your Portkey calls. Copy the name of the metric collection.

2. Run evals

Set the llm_metric_collection name in the trace context when invoking your OpenAI client to evaluate LLM spans.

main.py
from deepeval.openai import OpenAI
from deepeval.tracing import trace

from portkey_ai import PORTKEY_GATEWAY_URL

client = OpenAI(
    base_url=PORTKEY_GATEWAY_URL,
    api_key="<PORTKEY_API_KEY>"
)

with trace(llm_metric_collection="test_collection_1"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, how are you?"},
        ],
    )

End-to-end evals

Confident AI allows you to run end-to-end evals on your OpenAI client to evaluate your Portkey calls directly. This is recommended if you are testing your Portkey calls in isolation.

1. Create metric

from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

You can only run end-to-end evals on Portkey using metrics that evaluate input, output, or tools_called. You can pass parameters like expected_output, expected_tools, context and retrieval_context to the trace context.
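
For instance, a reference answer and retrieved chunks can be attached to the trace context alongside the metrics. The snippet below is a minimal sketch with placeholder values; the full end-to-end loop follows in the next step:

from deepeval.metrics import AnswerRelevancyMetric
from deepeval.tracing import trace

with trace(
    llm_metrics=[AnswerRelevancyMetric()],
    expected_output="Portkey is an AI gateway.",                      # reference answer for this golden
    retrieval_context=["Portkey routes requests to LLM providers."],  # chunks your app retrieved
):
    ...  # your portkey.chat.completions.create(...) call goes here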

2. Run evals

Replace your OpenAI client with DeepEval’s. Then, use the dataset’s evals_iterator to invoke your OpenAI client for each golden. Remember to set base_url and api_key to the Portkey gateway URL and your Portkey API key.

main.py
from deepeval.openai import OpenAI
from deepeval.metrics import AnswerRelevancyMetric, BiasMetric
from deepeval.dataset import EvaluationDataset
from deepeval.tracing import trace

from portkey_ai import PORTKEY_GATEWAY_URL

client = OpenAI(
    base_url=PORTKEY_GATEWAY_URL,
    api_key="<PORTKEY_API_KEY>"
)

dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

for golden in dataset.evals_iterator():
    with trace(
        llm_metrics=[AnswerRelevancyMetric(), BiasMetric()],
        expected_output=golden.expected_output,
    ):
        client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": golden.input}
            ],
        )

This will automatically generate a test run with evaluated Portkey traces using inputs from your dataset.

Using Portkey in component-level evals

You can also evaluate Portkey calls through component-level evals. This approach is recommended if you are testing your Portkey calls as a component in a larger application system.

1. Create metric

from deepeval.metrics import AnswerRelevancyMetric

answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)

As with end-to-end evals, you can only use metrics that evaluate input, output, or tools_called.

2. Run evals

Replace your OpenAI client with DeepEval’s. Then, use the dataset’s evals_iterator to invoke your LLM application for each golden.

Make sure that each function or method in your LLM application is decorated with @observe.

from deepeval.openai import OpenAI
from deepeval.tracing import observe, trace
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric

from portkey_ai import PORTKEY_GATEWAY_URL

client = OpenAI(
    base_url=PORTKEY_GATEWAY_URL,
    api_key="<PORTKEY_API_KEY>"
)

@observe()
def generate_response(input: str, expected_output: str) -> str:
    with trace(
        llm_metrics=[AnswerRelevancyMetric()],
        expected_output=expected_output,
    ):
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": input},
            ],
        )
    return response.choices[0].message.content

# Create dataset
dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

# Run component-level evaluation
for golden in dataset.evals_iterator():
    generate_response(golden.input, golden.expected_output)
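
If the Portkey call is only one stage of a larger pipeline, the same pattern extends naturally: decorate every component with @observe so each shows up as its own span, and keep the trace context around the LLM call. The sketch below assumes a hypothetical retrieve_documents helper standing in for a real retrieval step:

from deepeval.openai import OpenAI
from deepeval.tracing import observe, trace
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric

from portkey_ai import PORTKEY_GATEWAY_URL

client = OpenAI(base_url=PORTKEY_GATEWAY_URL, api_key="<PORTKEY_API_KEY>")

@observe()
def retrieve_documents(query: str) -> list:
    # Hypothetical retrieval component; a real app would query a vector store here
    return ["Portkey is an AI gateway that routes requests to LLM providers."]

@observe()
def rag_pipeline(input: str, expected_output: str) -> str:
    documents = retrieve_documents(input)
    with trace(
        llm_metrics=[AnswerRelevancyMetric()],
        expected_output=expected_output,
    ):
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {"role": "system", "content": "Answer using only this context: " + " ".join(documents)},
                {"role": "user", "content": input},
            ],
        )
    return response.choices[0].message.content

dataset = EvaluationDataset()
dataset.pull("your-dataset-alias")

for golden in dataset.evals_iterator():
    rag_pipeline(golden.input, golden.expected_output)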