The ability to use AI to generate data out of thin air is one of those things that seems too good to be true: you can get your hands on quality data without having to manually collect, clean, and annotate massive datasets.
But, as you might expect, synthetic data is not without its caveats. Although it is convenient, efficient, and cost-effective, synthetic data is only as good as the method used to generate it. Settle for rudimentary methods, and you'll end up with unusable datasets that don't represent real-world data well.
In this article, I’m going to share how we managed to generate realistic textual synthetic data at Confident AI. Let's dive right into it.
First and foremost, synthetic data is artificially generated data that attempts to simulate real-world data. Unlike real-world data, which is collected from observations or actual events (e.g., tweets on the platform formerly known as Twitter), synthetic data is made up, sometimes entirely, but more commonly based on a small subset of real-world data (also known as data augmentation).
This kind of data is often used for testing, training, and validating machine learning models, especially in scenarios where real-world data is scarce or difficult to collect.
Historically, while demand for synthetic data rose steadily over the years, advancements in generation methods struggled to keep pace.
Methods available at the time were often simplistic, relying on basic statistical techniques, or too domain-specific to generalize, meaning they lacked the complexity to mimic real-world data in a meaningful way.
Let’s take Generative Adversarial Networks (GANs) as an example. GANs employed a novel architecture of two neural networks — a generator and a discriminator — that competed with each other. The competition between these two networks resulted in the generation of highly realistic and complex synthetic data.
However, as one might have guessed from the title of this article, there were still major drawbacks when leveraging GANs to generate textual data.
Needless to say, there are a lot of hurdles to overcome and consider when it comes to textual data. Let's cut to the chase and see why you should use LLMs instead.
Love it or hate it, large language models (LLMs) like GPT-4 have democratized textual synthetic data. Let's say I want to generate some tweets related to the topic of synthetic data. All I have to do is use either ChatGPT or OpenAI's API. For example, here's how you can do it in Python (note: I'm using GPT-3.5):
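Here's a minimal sketch using the OpenAI Python SDK; the exact prompt wording and parameters are illustrative assumptions on my part:

```python
from openai import OpenAI

# The client reads OPENAI_API_KEY from your environment
client = OpenAI()

# Ask GPT-3.5 to make up some tweets about synthetic data
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "Generate 5 realistic tweets about the topic of synthetic data.",
        }
    ],
)

print(response.choices[0].message.content)
```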
Here’s a sample output:
While the generated data is quite varied, it may not accurately reflect real-world conditions, making it less useful for certain applications. Fortunately, by carefully crafting the input prompts, we can improve the authenticity of the synthetic data.
The pervasive problem with synthetic data generation is that there is often a mismatch between the generative distribution and the distribution of real-world data.
However, due to the versatile and adaptable nature of LLMs, we can easily ground generated data by dynamically changing the prompt (it's basically string interpolation!). For example, you can wrap the OpenAI API call in a function that accepts additional context as parameters:
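Again, a sketch rather than the exact code: the function name, signature, and prompt template below are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()

def generate_tweets(topic: str, context: str, n: int = 5) -> str:
    """Generate `n` synthetic tweets about `topic`, grounded in `context`."""
    # String interpolation: inject the real-world context straight into
    # the prompt to pull the output toward the target distribution
    prompt = (
        f"Using the following context:\n{context}\n\n"
        f"Generate {n} realistic tweets about {topic}."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example: ground the generation in a snippet of real-world data
print(generate_tweets(
    topic="synthetic data",
    context="ML teams are turning to synthetic data because large, "
            "annotated real-world datasets are expensive to collect.",
))
```

Because the context is just a string, you can slot in anything from retrieved documents to rows of an existing dataset.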
Here’s a sample output (don’t forget we’re using GPT-3.5!):
As you can see, the output is much more relevant and a significant improvement over the previous iteration without dynamic prompting.
In this article, we explored ways to contextualize synthetic data effectively. LLMs like GPT-3.5 offer a simple yet powerful way of generating data through careful prompt design.
Stay tuned for our Part 2 guide on diversifying your synthetic data set!