Kritin Vongthongsri
Cofounder @ Confident AI | Empowering LLM practitioners with Evals | Previously AI/ML at Fintech Startup | ML + CS @ Princeton

Using LLMs for Synthetic Data Generation: The Definitive Guide

June 11, 2024 · 12 min read

Constructing a large-scale, comprehensive dataset to test LLM outputs can be a laborious, costly, and challenging process, especially if done from scratch. But what if I told you that it’s now possible to generate the same thousands of high-quality test cases you spent weeks painstakingly crafting, in just a few minutes?

Synthetic data generation leverages LLMs to create quality data without the need to manually collect, clean, and annotate massive datasets. With models like GPT-4, it's now possible to synthetically produce datasets that can be more comprehensive and diverse than human-labeled ones, in far less time. These datasets can then be used to benchmark LLM systems with the help of LLM evaluation metrics.

In this article, I’ll teach you everything you need to know on how to use LLMs to generate synthetic datasets (which for example can be used to evaluate RAG pipelines). We’ll explore:

  • Synthetic generation methods (Distillation and Self-Improvement)
  • What data evolution is, various evolution techniques, and its role in synthetic data generation
  • A step-by-step tutorial on creating high-quality synthetic data from scratch using LLMs
  • How to use DeepEval to generate synthetic datasets in under 5 lines of code.

Intrigued? Let’s dive in.

What is Synthetic Data Generation, Using LLMs?

Synthetic data generation using LLMs means using an LLM to create artificial data, often in the form of datasets that can be used to train, fine-tune, and even evaluate LLMs themselves. Generating synthetic datasets is faster than scouring public datasets and cheaper than human annotation, and it can also yield higher quality and more diverse data, which is imperative for building an LLM evaluation framework.

The process starts with the creation of synthetic queries, which are generated using context from your knowledge base (often in the form of documents) as the ground truth. The generated queries are then "evolved" multiple times to make them more complex and realistic. Combined with the original contexts they were generated from, these evolved queries make up your final synthetic dataset. Although optional, you can also generate a target label for each synthetic query-context pair, which acts as the expected output of your LLM system for that query.

A Data Synthesizer Architecture

When it comes to generating a synthetic dataset for evaluation, there are two main methods: self-improvement from using your model’s output, or distillation from a more advanced model.

  • Self-improvement: involves your model generating data iteratively from its own output without external dependencies
  • Distillation: involves using a stronger model to generate synthetic data to evaluate a weaker model

Self-improvement methods, such as Self-Instruct or SPIN, are limited by a model’s capabilities and may suffer from amplified biases and errors. In contrast, distillation techniques are limited only by the best model available, which generally yields higher quality generation.

Data Survival-Of-The-Fittest

Let’s clarify what data evolution is and why it’s so important to synthetic data generation using LLMs. Data evolution, first introduced in Microsoft’s Evol-Instruct, involves iteratively enhancing an existing set of queries to generate more complex and diverse ones through prompt engineering. This step is crucial for ensuring the quality, comprehensiveness, complexity, and diversity of the dataset. It’s what makes synthetic data superior to public or human-annotated datasets.

In fact, the original authors produced 250,000 instructions by evolving Alpaca's 52,000-instruction dataset, which was itself generated from just 175 human-created seed instructions. There are three types of data evolution:

  • In-Depth Evolving: Expands a simple instruction into a more detailed and intricate version.
  • In-Breadth Evolving: Produces new, diverse instructions to enrich the dataset.
  • Elimination Evolving: Removes less effective or failed instructions.
In-depth (blue) and in-breadth (red) evolutions of ‘1+1’ query

There are several ways to perform in-depth evolution, such as complicating inputs, increasing the need for reasoning, or adding multiple steps to complete a task. Each approach contributes to a higher level of sophistication in the generated data.

In-depth evolution ensures the creation of nuanced, high-quality queries, while in-breadth evolution enhances diversity and comprehensiveness. By evolving each query or instruction multiple times, we increase its complexity, resulting in a rich and multifaceted dataset. But enough of me talking, let's see how to put everything into action.

Take this query for example:

What is 1+1?

We can in-depth-evolve it to something like this instead:

In what situation does 1+1 not equal 2?

I hope we can all agree that this is more complicated and realistic than a generic 1+1. In the next section, we'll show how to actually employ these evolution methods when generating synthetic datasets.
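
Elimination evolving, the third type listed earlier, is essentially a filter applied after each evolution. Here is a minimal sketch of what such a filter might look like. The function name `should_eliminate` and the marker lists are hypothetical heuristics chosen for illustration, not the exact criteria used by Evol-Instruct or DeepEval:

```python
import re

# Hypothetical heuristics for illustration; tune them for your own data.
REFUSAL_MARKERS = ("sorry", "i cannot", "as an ai")
PROMPT_LEAKAGE = ("rewritten input", "given input")


def should_eliminate(original: str, evolved: str) -> bool:
    """Return True if the evolved query should be dropped from the dataset."""
    text = evolved.strip()
    lowered = text.lower()
    # 1. Nothing usable came back (empty or punctuation only).
    if not re.search(r"[A-Za-z0-9]", text):
        return True
    # 2. No information gain: the evolution returned the original unchanged.
    if lowered == original.strip().lower():
        return True
    # 3. The model refused instead of rewriting the query.
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return True
    # 4. The model echoed parts of the evolution prompt.
    if any(marker in lowered for marker in PROMPT_LEAKAGE):
        return True
    return False


candidates = [
    "In what situation does 1+1 not equal 2?",
    "What is 1+1?",
    "Sorry, I cannot rewrite this.",
    "Rewritten Input: What is 2+2?",
    "???",
]
kept = [c for c in candidates if not should_eliminate("What is 1+1?", c)]
print(kept)
```

Only the first candidate survives here. Simple string checks like these are cheap to run on every evolved query, though they will miss subtler failures such as queries that can no longer be answered from the context.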

Confident AI: The LLM Evaluation Platform

The all-in-one platform to evaluate and test LLM applications on the cloud, fully integrated with DeepEval.

Regression test and evaluate LLM apps on the cloud.
LLM evaluation metrics for ANY use case.
Real-time LLM observability and tracing.
Automated human feedback collection.
Generate evaluation datasets on the cloud.
LLM security, risk, and vulnerability scanning.

Step-By-Step Guide: Generating Synthetic Data Using LLMs

Before we begin, let's be reminded of the data synthesizer architecture we'll be building:

A Data Synthesizer Architecture

You'll notice there are five main steps:

  1. Document Chunking
  2. Context Generation
  3. Query Generation
  4. Data Evolution
  5. Label/Expected Output Generation (optional)

Sure, DeepEval already has a feature-complete data synthesizer ready for synthetic dataset generation (which I'll show you later in the final section), but for those curious about how it's done, let's begin.

1. Document Chunking

The first step is to chunk your document. As the name suggests, document chunking means dividing it into smaller, meaningful ‘chunks.’ This way, you can break down larger documents into manageable sub-documents while maintaining their context. Chunking also allows for embedding generation in documents that exceed the token limit of the embedding model.

This step is essential because it helps identify semantically similar chunks and generate queries or tasks based on shared contexts.

There are several chunking strategies, like fixed-size chunking and context-aware chunking. You can also adjust hyperparameters such as chunk size and chunk overlap. In the example below, we’ll use token-based chunking with a chunk size of 1024 tokens and no overlap. Here’s how you can chunk your document:


pip install langchain langchain-community langchain-openai langchain-text-splitters pypdf numpy

# Step 1. Chunk Documents
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=1024, chunk_overlap=0)
loader = PyPDFLoader("chatbot_information.pdf")
raw_chunks = loader.load_and_split(text_splitter)
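
To make the two hyperparameters concrete, here is a small dependency-free sketch of fixed-size chunking with overlap. It is not how `TokenTextSplitter` is implemented: the `chunk_tokens` helper is hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, chunk_overlap: int = 0) -> list[list[str]]:
    """Split a token list into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many tokens after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: words stand in for tokens to keep the example readable.
tokens = "a b c d e f g h i j".split()
no_overlap = chunk_tokens(tokens, chunk_size=4)
with_overlap = chunk_tokens(tokens, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With no overlap the ten tokens fall into three chunks, the last one shorter. With an overlap of two, each chunk repeats the last two tokens of the previous one, which helps keep sentences that straddle a boundary intact at the cost of some duplicated text.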

Once you have the chunks, convert each one into an embedding. These embeddings capture the semantic meaning of each chunk. Keep them index-aligned with the chunk contents (embeddings[i] corresponds to content[i]) so that each chunk's text and vector can be looked up together.


from langchain_openai import OpenAIEmbeddings
...

embedding_model = OpenAIEmbeddings(api_key="...")
content = [rc.page_content for rc in raw_chunks]
embeddings = embedding_model.embed_documents(content)
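
If you would rather not rely on two parallel lists staying in sync, one option is to bundle each chunk's text with its vector. The sketch below assumes a small `Chunk` dataclass and a `build_chunks` helper, both hypothetical names, and uses toy two-dimensional vectors in place of real embeddings:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One document chunk together with its embedding vector."""
    content: str
    embedding: list[float]


def build_chunks(content: list[str], embeddings: list[list[float]]) -> list[Chunk]:
    # Guard against the two lists drifting out of sync.
    if len(content) != len(embeddings):
        raise ValueError("content and embeddings must have the same length")
    return [Chunk(text, vector) for text, vector in zip(content, embeddings)]


# Toy values for illustration; real embeddings have hundreds or thousands of dimensions.
demo_content = ["Chatbots answer questions.", "Refunds take five days."]
demo_embeddings = [[0.1, 0.9], [0.8, 0.2]]
chunks = build_chunks(demo_content, demo_embeddings)
print(chunks[0].content, chunks[0].embedding)
```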

2. Context Generation

To generate context, start by randomly selecting a chunk of data to act as your focal anchor for finding related information.


# Step 2: Generate context by selecting chunks
import random
...

reference_index = random.randint(0, len(embeddings) - 1)
reference_embedding = embeddings[reference_index]
contexts = [content[reference_index]]

Next, set a similarity threshold and use cosine similarity to identify related chunks to build your context:


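import numpy as np
...

similarity_threshold = 0.8
similar_indices = []
for i, embedding in enumerate(embeddings):
    # Skip the reference chunk itself: it is already in `contexts`
    if i == reference_index:
        continue
    # Cosine similarity between the reference chunk and chunk i
    product = np.dot(reference_embedding, embedding)
    norm = np.linalg.norm(reference_embedding) * np.linalg.norm(embedding)
    similarity = product / norm
    if similarity >= similarity_threshold:
        similar_indices.append(i)

# Add every sufficiently similar chunk to the context
for i in similar_indices:
    contexts.append(content[i])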

This step is crucial because it allows you to enhance the robustness of your queries by diversifying sources of information around the same topic. By including multiple chunks of data that share a similar theme, you also provide the model with richer, more nuanced information on the subject.

This ensures that your queries cover the topic comprehensively, resulting in more well-rounded and accurate responses.
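
To see how the threshold shapes the context, here is a dependency-free worked example on toy two-dimensional vectors. The helper names `cosine_similarity` and `select_similar` are hypothetical, and real embeddings would of course have far more dimensions:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return dot / norm


def select_similar(reference_index: int, vectors: list[list[float]], threshold: float) -> list[int]:
    """Indices of vectors at least `threshold` similar to the reference, excluding the reference itself."""
    reference = vectors[reference_index]
    return [
        i
        for i, vector in enumerate(vectors)
        if i != reference_index and cosine_similarity(reference, vector) >= threshold
    ]


# Toy vectors: index 1 points almost the same way as index 0, index 3 is orthogonal to it.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]]
strict = select_similar(0, toy_vectors, threshold=0.8)
loose = select_similar(0, toy_vectors, threshold=0.5)
print(strict, loose)
```

At a threshold of 0.8 only the near-duplicate direction is selected, while lowering it to 0.5 also admits the vector at similarity 0.6. A lower threshold gives broader but less focused contexts, so it is worth inspecting a few generated contexts before settling on a value.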

3. Query Generation

Now comes the fun part with LLMs. Using a structured prompt, have a GPT model generate a series of tasks or queries for the context you just created.

Provide a prompt that asks the model to act as a copywriter, generating JSON objects containing an input key, which is the query. Each input should either be a question or statement answerable using the provided context.


# Step 3. Generate a series of queries for similar chunks
from langchain_openai import ChatOpenAI
...

prompt = f"""I want you to act as a copywriter. Based on the given context, 
which is a list of strings, please generate a list of JSON objects 
with an `input` key. The `input` can either be a question or a 
statement that can be addressed by the given context.

contexts:
{contexts}"""

# invoke() returns a message object; .content holds the generated text
query = ChatOpenAI(openai_api_key="...").invoke(prompt).content

This step forms the basis of your queries, which will be evolved and included in the final dataset.
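
The model's reply is a string, so it still has to be parsed into individual queries before evolution. Below is a minimal sketch of that parsing step. The `parse_inputs` helper is hypothetical, and it assumes the model returns a JSON list, possibly wrapped in a Markdown code fence, which models often but not always do:

```python
import json
import re

# Built programmatically so this example does not contain a literal code fence.
FENCE = "`" * 3


def parse_inputs(raw: str) -> list[str]:
    """Extract the non-empty `input` strings from a model reply, or [] if it is not valid JSON."""
    # Strip an optional Markdown fence (with or without a "json" tag) around the payload.
    match = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, flags=re.DOTALL)
    payload = match.group(1) if match else raw
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    inputs = []
    for item in data:
        # Keep only well-formed objects with a non-empty string under "input".
        if isinstance(item, dict) and isinstance(item.get("input"), str):
            text = item["input"].strip()
            if text:
                inputs.append(text)
    return inputs


# A made-up reply for illustration.
example_reply = (
    FENCE + "json\n"
    '[{"input": "How do chatbots use natural language understanding?"}, '
    '{"input": ""}, {"other": 1}]\n' + FENCE
)
queries = parse_inputs(example_reply)
print(queries)
```

Returning an empty list on malformed output keeps the pipeline running, and you can simply retry the generation for that context.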

4. Query Evolution

Next, we’ll evolve our queries from Step 3 using multiple evolution templates. You can define as many templates as you want, but we’ll focus on three: multi-context understanding, multi-step reasoning, and hypothetical scenario.


# Evolution prompt templates as plain strings (not f-strings), so the
# {context} and {original_input} placeholders survive until they are filled in Step 4
multi_context_template = """
I want you to rewrite the given `input` so that it requires readers to use information from all elements in `Context`.

1. `Rewritten Input` should require information from all `Context` elements.
2. `Rewritten Input` must be concise and fully answerable from `Context`. 
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

reasoning_template = """
I want you to rewrite the given `input` so that it explicitly requests multi-step reasoning.

1. `Rewritten Input` should require multiple logical connections or inferences.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

hypothetical_scenario_template = """
I want you to rewrite the given `input` to incorporate a hypothetical or speculative scenario.

1. `Rewritten Input` should encourage applying knowledge from `Context` to deduce outcomes.
2. `Rewritten Input` should be concise and understandable.
3. Do not use phrases like 'based on the provided context.'
4. `Rewritten Input` must be fully answerable from `Context`.
5. `Rewritten Input` should not exceed 15 words.

Context: {context}
Input: {original_input}
Rewritten Input:
"""

You can see that each template imposes specific constraints on the output. Feel free to adjust them based on how you want your evaluation queries to appear in the final dataset. We’ll use these templates to evolve the original queries multiple times, randomly selecting templates each time.


# Step 4. Evolve Queries
...

example_generated_query = "How do chatbots use natural language understanding?"
context = contexts 
original_input = example_generated_query 
evolution_templates = [multi_context_template, reasoning_template, hypothetical_scenario_template]

# Number of evolution steps to apply
num_evolution_steps = 3

# Create the chat model once and reuse it for every evolution step
llm = ChatOpenAI(openai_api_key="...")

# Function to perform random evolution steps
def evolve_query(original_input, context, steps):
    current_input = original_input
    for _ in range(steps):
        # Choose a random (or using custom logic) template from the list
        chosen_template = random.choice(evolution_templates)
        # Replace the placeholders with the current context and input
        evolved_prompt = chosen_template.replace("{context}", str(context)).replace("{original_input}", current_input)
        # invoke() returns a message object; keep only its text so the next step receives a string
        current_input = llm.invoke(evolved_prompt).content
    return current_input

# Evolve the input by randomly selecting the evolution type
evolved_query = evolve_query(original_input, context, num_evolution_steps)

And there you have it, our final evolved query! Repeat this process to generate more queries and further refine your dataset. For evaluation purposes, you’ll need to properly format these input queries and contexts into a suitable testing framework.
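
The template choice does not have to be purely random. As one example of custom selection logic, the sketch below samples templates without replacement so that each evolution type is applied at most once per query. The `evolve_without_repeats` function, the shortened templates, and the `fake_llm` stand-in are all hypothetical; in practice you would pass a function that calls your real model:

```python
import random
from typing import Callable, Optional


def evolve_without_repeats(
    query: str,
    context: list[str],
    templates: dict[str, str],
    llm: Callable[[str], str],
    steps: int,
    seed: Optional[int] = None,
) -> tuple[str, list[str]]:
    """Apply up to `steps` distinct templates in random order; return the result and the order used."""
    rng = random.Random(seed)
    # There cannot be more distinct steps than there are templates.
    steps = min(steps, len(templates))
    chosen = rng.sample(sorted(templates), steps)
    current = query
    for name in chosen:
        prompt = templates[name].replace("{context}", str(context)).replace("{original_input}", current)
        current = llm(prompt).strip()
    return current, chosen


# Shortened stand-ins for the three evolution templates.
toy_templates = {
    "multi_context": "Use all context. Context: {context} Input: {original_input}",
    "reasoning": "Require multi-step reasoning. Context: {context} Input: {original_input}",
    "hypothetical": "Add a hypothetical scenario. Context: {context} Input: {original_input}",
}


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: echoes the input and marks it as evolved.
    return prompt.rsplit("Input: ", 1)[1] + " (evolved)"


evolved, order = evolve_without_repeats(
    "How do chatbots work?", ["Chatbots use NLU."], toy_templates, fake_llm, steps=3, seed=0
)
print(order)
print(evolved)
```

Returning the order of templates alongside the evolved query also gives you a record of how each row was produced, which is handy when you later want to analyze which evolution types your system struggles with.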

5. Expected Output Generation

Although this step is optional, I would highly recommend generating expected outputs for each evolved query. This is because it is easier for a human evaluator to correct and annotate expected outputs than to create them from scratch.


# Step 5. Generate Expected Output
...

# Define prompt template (a plain string; the placeholders are filled in below)
expected_output_template = """
I want you to generate an answer for the given `input`. This answer has to be factually aligned to the provided context.

Context: {context}
Input: {evolved_query}
Answer:
"""

# Fill in the values
prompt = expected_output_template.replace("{context}", str(context)).replace("{evolved_query}", evolved_query)

# Generate expected output; invoke() returns a message object, so keep only its text
expected_output = ChatOpenAI(openai_api_key="...").invoke(prompt).content

As a final step to wrap things up, combine the evolved query, context, and expected output as a data row in your synthetic dataset.


from pydantic import BaseModel
from typing import Optional, List
...

class SyntheticData(BaseModel):
    query: str
    # Default to None so rows without a generated label are still valid
    expected_output: Optional[str] = None
    context: List[str]

synthetic_data = SyntheticData(
    query=evolved_query,
    expected_output=expected_output,
    context=context
)

# Simple implementation of synthetic dataset
synthetic_dataset = []
synthetic_dataset.append(synthetic_data)

Now all you need to do is repeat steps 2-5 (chunking and embedding only need to happen once per document) until you have a reasonably sized synthetic dataset, which you can later use to evaluate and test your LLM systems!
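
Since generation costs API calls, you will probably want to save the dataset rather than regenerate it for every test run. Here is a minimal sketch that writes rows to a JSON Lines file and reads them back. It uses a standard-library dataclass named `Row` (a hypothetical stand-in for the pydantic model above) so that it runs without extra dependencies, and the file name is arbitrary:

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class Row:
    """One row of the synthetic dataset."""
    query: str
    context: List[str]
    expected_output: Optional[str] = None


def save_jsonl(rows: List[Row], path: Path) -> None:
    # One JSON object per line, so rows can be appended or streamed later.
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(asdict(row), ensure_ascii=False) + "\n")


def load_jsonl(path: Path) -> List[Row]:
    with path.open(encoding="utf-8") as f:
        return [Row(**json.loads(line)) for line in f if line.strip()]


# Made-up rows for illustration.
rows = [
    Row(query="How do chatbots use NLU?", context=["Chatbots use NLU."], expected_output="They parse intent."),
    Row(query="What if NLU fails?", context=["Chatbots use NLU."]),
]
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "synthetic_dataset.jsonl"
    save_jsonl(rows, dataset_path)
    restored = load_jsonl(dataset_path)
print(restored == rows)
```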

Generating Synthetic Datasets Using DeepEval

In this final section, I'd like to show you a battle-tested data synthesizer I've open-sourced in DeepEval. It covers everything from synthetic data generation to formatting the results into test cases ready for LLM evaluation and testing, and you can use it in just a few lines of code. And the best part is, you can leverage ANY LLM of your choice. Here’s how you can use DeepEval for synthetic dataset generation:


pip install deepeval

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(
    document_paths=['example.txt', 'example.docx', 'example.pdf'],
    max_goldens_per_document=2
)

You can read more about how to use DeepEval's synthesizer to generate synthetic datasets in DeepEval's docs. In summary, DeepEval takes in your documents and does all the chunking and context generation for you, before generating synthetic "goldens", which are the data rows that ultimately form your synthetic dataset. Easy enough?

Conclusion

Generating synthetic datasets using LLMs is great because it is a quick and cheap way to get your hands on large amounts of data. However, generated data can look extremely repetitive and oftentimes doesn't represent the underlying data distribution well enough to be deemed useful. In this article, we talked about how to solve this problem by first selecting relevant context from documents, before using it to generate queries that can be used to test and evaluate your LLM systems.

We also explored data evolution, which we used to make synthetic queries more realistic. If you're looking to build a data synthesizer from scratch, this article serves as a great tutorial. However, if you're looking for something more robust and production ready, you can use DeepEval. It is open-source, extremely easy to use (seriously), and has a whole evaluation and testing suite that lets you use the generated synthetic dataset to test and evaluate your LLM systems.

Thank you for reading, and if you've found this article useful, don't forget to give ⭐ DeepEval a star on GitHub ⭐!

* * * * *

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an “aha!” moment, who knows?
