Jeffrey Ip
Cofounder @ Confident, creating companies that tell stories. Ex-Googler (YouTube), Microsoft AI (Office365). Working overtime to enforce responsible AI.

Generating synthetic data with LLMs - Part 1

September 8, 2023
·
6 min read

The ability to use AI to generate data out of thin air is one of those things that seem too good to be true — think about it, you can get your hands on quality data without needing to manually collect, clean, and annotate massive datasets.

But, as you might expect, synthetic data is not without its caveats. Although it is convenient, efficient, and cost-effective, the quality of synthetic data is only as good as the method used to generate it. Settle for rudimentary methods, and you'll end up with unusable datasets that don't represent real-world data well.

In this article, I’m going to share how we managed to generate realistic textual synthetic data at Confident AI. Let's dive right into it.

What is synthetic data?

First and foremost, synthetic data is artificially generated data that attempts to simulate real-world data. Unlike real-world data, which is collected from observations or actual events (e.g., tweets on the platform formerly known as Twitter), synthetic data is made up, sometimes entirely, but more commonly based on a small subset of real-world data (a process also known as data augmentation).

This kind of data is often used for testing, training, and validating machine learning models, especially in scenarios where real-world data is scarce or difficult to collect.

The struggles in generating textual data

Historically, while demand for synthetic data rose steadily over the years, advancements in generation methods struggled to keep pace.

Methods available at the time were often simplistic, perhaps relying on basic statistical methods, or they were too domain-specific and hard to generalize, meaning they lacked the complexity to mimic real-world data in a meaningful way.
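For intuition, a "basic statistical method" here might be as simple as a bigram (Markov-chain) sampler, which strings words together based only on the single previous word. This toy sketch (not from the article) shows why such approaches cannot capture long-range structure:

```python
import random

def train_bigrams(corpus: str) -> dict:
    """Build a bigram table: each word maps to the list of words observed after it."""
    words = corpus.split()
    table = {}
    for current, following in zip(words, words[1:]):
        table.setdefault(current, []).append(following)
    return table

def sample_text(table: dict, start: str, length: int, seed: int = 0) -> str:
    """Generate text by repeatedly picking a random observed successor.

    Only the previous word is considered, so any dependency spanning more
    than two words is lost.
    """
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = table.get(out[-1])
        if not successors:
            break  # dead end: the last word was never followed by anything
        out.append(rng.choice(successors))
    return " ".join(out)

table = train_bigrams("synthetic data is useful and synthetic data is cheap")
print(sample_text(table, "synthetic", 5))
```

Locally the output looks plausible, but there is no mechanism tying a word to anything earlier than its immediate predecessor, which is exactly the limitation discussed below.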

Let’s take Generative Adversarial Networks (GANs) as an example. GANs employed a novel architecture of two neural networks — a generator and a discriminator — that competed with each other. The competition between these two networks resulted in the generation of highly realistic and complex synthetic data.

However, as one might have guessed from the title of this article, there were still major drawbacks when leveraging GANs to generate textual data.

  1. Mode Collapse: A phenomenon where the generator starts to produce the same output (or very similar outputs) over and over again.
  2. Difficult to Train: GANs are notoriously hard to train, with issues like vanishing/exploding gradients and oscillations in loss.
  3. Long-Range Dependencies: Textual data often involves long-range dependencies (e.g., the subject of a sentence affecting a verb that appears much later), and capturing these effectively is a challenge even for advanced GAN architectures.
  4. Very Needy: They require lots of data to train on (ironically).

Needless to say, there are a lot of hurdles to overcome when it comes to textual data. Let's cut to the chase and see why you should use LLMs instead.

Confident AI: Everything You Need for LLM Evaluation

An all-in-one platform to evaluate and test LLM applications, fully integrated with DeepEval.

LLM evaluation metrics for ANY use case.
Real-time evaluations and monitoring.
Hyperparameters discovery.
Manage evaluation datasets on the cloud.

Generating synthetic data with LLMs

Love it or hate it, large language models (LLMs) like GPT-4 have democratized textual synthetic data. Let's say I want to generate some queries related to the topic of synthetic data. All I have to do is use either ChatGPT or OpenAI's API to generate a set of questions. For example, here's how you can do it in Python (note: I'm using GPT-3.5):


# Note: this snippet uses the ChatCompletion interface of the openai Python
# SDK (pre-1.0 versions), which was current when this article was written
import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

response = openai.ChatCompletion.create(
  model="gpt-3.5-turbo",
  messages=[
    {
      "role": "user",
      "content": "Generate 5 questions one might have on synthetic data."
    }
  ],
  temperature=1,
  max_tokens=256
)

print(response.choices[0].message.content)

Here’s a sample output:


1. What are the key advantages of using synthetic data over real-world data 
   in machine learning models?

2. How is synthetic data generated and how closely can it mimic the 
   characteristics of original datasets?

3. Is synthetic data reliable for training machine learning models in 
   sensitive sectors like healthcare or finance?

4. What are the ethical considerations associated with using synthetic data, 
   especially when it is used to replace or supplement personally identifiable 
   information?

5. Can synthetic data be used to address the challenges of data imbalance in 
   machine learning, and if so, how effective is it compared to traditional 
   resampling techniques?

While the generated data is quite varied, it may not accurately reflect real-world conditions, making it less useful for certain applications. Fortunately, by carefully crafting the input prompts, we can improve the authenticity of the synthetic data.
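Before using generations like these downstream, you will typically want to split the model's numbered list into individual records. Here is a minimal parsing sketch; the "1. ..." numbering format with wrapped continuation lines is an assumption based on the sample output above:

```python
import re

def parse_numbered_questions(text: str) -> list[str]:
    """Split a numbered list returned by the model into individual questions.

    Assumes items start with "<number>." and may wrap across multiple lines,
    as in the sample output above.
    """
    questions = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # A new item begins with "<number>." at the start of the line
        if re.match(r"^\d+\.\s", stripped):
            if current:
                questions.append(" ".join(current))
            # Drop the leading "1. " prefix and start a new item
            current = [re.sub(r"^\d+\.\s*", "", stripped)]
        else:
            # Continuation of the previous (wrapped) item
            current.append(stripped)
    if current:
        questions.append(" ".join(current))
    return questions

sample = """1. What are the key advantages of using synthetic data over real-world data
   in machine learning models?

2. How is synthetic data generated and how closely can it mimic the
   characteristics of original datasets?"""

print(parse_numbered_questions(sample))
```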

Using Dynamic Prompts Templates to make Synthetic Data Realistic

The pervasive problem with synthetic data generation is there’s often a mismatch between the generative distribution and the distribution of real-world data.

However, due to the versatile and adaptable nature of LLMs, we can easily ground generated data by dynamically changing the prompt (basically string interpolation!). For example, you might want to wrap the OpenAI API call in a function and have it accept additional context as a parameter:


import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

def generate_synthetic_data(context):
  prompt = f""" 
    Generate 5 questions one might have on synthetic data. In your questions, also take
    into account the context below.
    Context: {context}
  """

  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
      {
        "role": "user",
        "content": prompt
      }
    ],
    temperature=1,
    max_tokens=256
  )

  return response.choices[0].message.content
  
print(generate_synthetic_data("Data Privacy is a huge concern for enterprises."))

Here’s a sample output (don’t forget we’re using GPT-3.5!):


1. How can synthetic data help enterprises address data privacy concerns while 
   still maintaining the ability to perform data analytics and testing?

2. What are the key differences between real data and synthetic data in terms 
   of their privacy implications for enterprises?

3. Are there any legal or regulatory considerations that enterprises should 
   be aware of when using synthetic data to safeguard data privacy?

4. How can enterprises ensure that synthetic data accurately represents their 
   real data while preserving the privacy of sensitive information?

5. What are the potential limitations or challenges that enterprises may face 
   when implementing synthetic data solutions to protect data privacy, and 
   how can they mitigate these challenges effectively?

As you can see, the output is much more relevant and significantly improved compared to the previous output without dynamic prompting.
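To scale this up, you could sweep the same prompt template across a list of grounding contexts. The helper below is a hypothetical sketch (not from the article); it accepts any generator callable, so in practice you would pass in the generate_synthetic_data function defined earlier:

```python
def build_dataset(contexts, generate):
    """Collect model generations for each grounding context.

    `generate` is any callable mapping a context string to model output,
    e.g. the generate_synthetic_data function defined earlier.
    """
    return [{"context": context, "questions": generate(context)} for context in contexts]

# Example with a stand-in generator, so the sketch runs without an API key
fake_generate = lambda context: f"1. How does synthetic data help when {context.lower()}"

dataset = build_dataset(
    [
        "Data Privacy is a huge concern for enterprises.",
        "Healthcare datasets are often small and heavily regulated.",
    ],
    fake_generate,
)
print(len(dataset))
```

Keeping the generator injectable also makes it easy to swap models or mock out the API call in tests.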

Conclusion

In this article, we explored ways to contextualize synthetic data effectively. LLMs like GPT-3.5 offer a simple yet powerful way of generating data through careful prompt design.

Stay tuned for our Part 2 guide on diversifying your synthetic data set!

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Schedule a call with me here (it's free), or ask us anything in our Discord. I might give you an "aha!" moment, who knows?


Stay Confident

Subscribe to our weekly newsletter to stay confident in the AI systems you build.
