Introducing Synthetic Data Generation Pipelines: Customize how you generate data

Jun 25, 2026 · 3 min read

Jeffrey Ip

Co-founder @ Confident AI. Creator of DeepEval & DeepTeam. Building an unhealthy LLM evals addiction. Ex-Googler (YouTube), Microsoft AI (Office365).

Today we're launching Synthetic Data Generation Pipelines on Confident AI: a simple way to bring your team's synthetic data generation setup onto the platform without losing the flexibility that made it useful locally.

Synthetic data is most useful when it is grounded in the same information your AI uses in production. Now you can build that process as a pipeline: pick your data sources, configure the generation steps, and let Confident AI turn that context into evaluation-ready examples.

Configure a synthetic data generation pipeline by choosing the data sources and generation steps that fit your use case.

The problem: local pipelines did not consolidate well

The teams building the best synthetic datasets usually already had something working locally. They had custom scripts, hand-picked sources, prompt templates, filters, evolution steps, and styling logic that matched their use case.

The problem started when that setup needed to become shared infrastructure. Once teams tried to consolidate generation on a platform, they often had to flatten everything into a rigid workflow: one source, one generation path, one way to produce examples.

That made generation easier to centralize, but worse for the teams actually producing the data.

Mix and match your context

Every team keeps its source material in different places. Product docs might live in Google Drive, customer records in Salesforce, warehouse data in Snowflake, and domain-specific media in a custom corpus.

Synthetic Data Generation Pipelines let you bring those sources together instead of forcing generation to start from a single bucket of text.

Use sources like:

Snowflake for structured warehouse data
Salesforce for customer and business context
Google Drive for docs, specs, and knowledge files
Custom corpora for images, video, or domain-specific source material

Choose the sources that matter for a dataset, then reuse that setup whenever you need more examples.
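
To make the idea of a reusable source selection concrete, here is a minimal sketch in plain Python. It is not Confident AI's API: `SourceRef`, `GenerationSetup`, and every locator value are hypothetical names invented for illustration. It only shows how a mixed set of sources might be described once and reused.

```python
from dataclasses import dataclass, field


# Hypothetical types for illustration only; this is not Confident AI's API.
@dataclass(frozen=True)
class SourceRef:
    kind: str     # e.g. "snowflake", "salesforce", "google_drive", "custom_corpus"
    locator: str  # assumed identifier: a table, object type, folder, or path


@dataclass
class GenerationSetup:
    name: str
    sources: list[SourceRef] = field(default_factory=list)

    def kinds(self) -> set[str]:
        """Return the distinct source kinds this setup pulls context from."""
        return {source.kind for source in self.sources}


# One setup that mixes three kinds of sources; all values are made up.
support_setup = GenerationSetup(
    name="support-bot-goldens",
    sources=[
        SourceRef("google_drive", "Product Docs"),
        SourceRef("salesforce", "Case"),
        SourceRef("snowflake", "ANALYTICS.TICKETS"),
    ],
)
```

Keeping the selection in one named object is what makes it reusable: the same `support_setup` could be passed to every later generation run instead of being rebuilt by hand.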

Customize each generation step

The pipeline gives you control over how data is created, not just where it comes from.

You can configure the maximum number of concurrent generation requests, decide whether expected outputs should be generated alongside each golden, and tune the downstream steps that shape the final dataset.
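
As a rough sketch of what those two settings control, the snippet below caps concurrent requests with an `asyncio.Semaphore` and makes the expected output optional. Everything here is an assumption for illustration: `GenerationConfig`, `Golden`, `fake_generate`, and `generate_all` are hypothetical names, and `fake_generate` stands in for a real model call.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


# Hypothetical config; the field names are assumptions, not the platform's.
@dataclass
class GenerationConfig:
    max_concurrent: int = 4
    include_expected_output: bool = True


@dataclass
class Golden:
    input: str
    expected_output: Optional[str] = None


async def fake_generate(context: str, with_expected: bool) -> Golden:
    """Stand-in for a model call; sleeps briefly instead of hitting an API."""
    await asyncio.sleep(0.01)
    return Golden(
        input=f"Question about: {context}",
        expected_output=f"Answer grounded in: {context}" if with_expected else None,
    )


async def generate_all(contexts, config):
    """Generate one golden per context, never exceeding max_concurrent at once."""
    semaphore = asyncio.Semaphore(config.max_concurrent)
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def bounded(context):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_generate(context, config.include_expected_output)
            finally:
                in_flight -= 1

    goldens = await asyncio.gather(*(bounded(c) for c in contexts))
    return list(goldens), peak
```

Calling `asyncio.run(generate_all(contexts, GenerationConfig(max_concurrent=3)))` returns the goldens in input order along with the observed peak concurrency, which stays at or below the configured cap.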

The flow is straightforward:

Select the data sources to pull context from
Construct context from those sources
Filter the generated candidates for quality
Evolve the examples into useful variations
Apply the final styling for your dataset
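
The steps above can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not the platform's implementation: every function name is hypothetical, the `generate` step is a placeholder for a model call (the list filters "generated candidates", so some generation step is assumed between context construction and filtering), and the filter, evolution, and styling rules are deliberately trivial.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    context: str
    input: str


def construct_contexts(documents, chunk_size=2):
    """Group sentences from each document into small context chunks."""
    contexts = []
    for doc in documents:
        sentences = [s.strip() for s in doc.split(".") if s.strip()]
        for i in range(0, len(sentences), chunk_size):
            contexts.append(". ".join(sentences[i:i + chunk_size]) + ".")
    return contexts


def generate(contexts):
    """Placeholder for a model call: one candidate question per context."""
    return [Candidate(c, f"What does the source say about: {c}") for c in contexts]


def filter_candidates(candidates, min_words=4):
    """Drop candidates whose context is too short to ground a useful example."""
    return [c for c in candidates if len(c.context.split()) >= min_words]


def evolve(candidates):
    """Keep each candidate and add one harder variation of it."""
    evolved = []
    for c in candidates:
        evolved.append(c)
        evolved.append(Candidate(c.context, c.input + " Explain your reasoning."))
    return evolved


def style(candidates, prefix="[support] "):
    """Apply a final, dataset-specific format to every input."""
    return [Candidate(c.context, prefix + c.input) for c in candidates]


def run_pipeline(documents):
    contexts = construct_contexts(documents)
    candidates = generate(contexts)
    kept = filter_candidates(candidates)
    return style(evolve(kept))
```

Because each stage takes and returns plain lists, a team could swap in its own filter or evolution logic without touching the other stages, which is the kind of flexibility the pipeline is meant to preserve.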

That means you can generate examples for support bots, internal search, sales assistants, document QA, multimodal workflows, and any other AI system that depends on real context.

From source material to eval data

The goal is simple: make it easier to create datasets that actually resemble production, while preserving the generation logic your team already knows works.

Instead of keeping that logic trapped in local scripts, you can define the generation pipeline once on Confident AI and keep improving it as your eval needs change.

Get started

Synthetic Data Generation Pipelines are live on Confident AI now.

Open your project, create a generation config, and start by selecting the data sources you want Confident AI to use.

Do you want to brainstorm how to evaluate your LLM (application)? Ask us anything in our Discord. I might give you an "aha!" moment, who knows?
