Data generation
Overview
The Data Generation page allows you to create high-quality training data using large language models (LLMs). You can:
- Generate new rows (entire synthetic datasets)
- Generate new columns (labels, summaries, or other outputs) based on existing datasets
All jobs are scoped to your current project.
Data Generation List View
When you land on the Data Generation tab, you'll see a table of all jobs in the current project.
Table Columns
Each row includes:
- Job name
- Source dataset (only shown for "Generate columns" jobs)
- Problem type — either Generate rows or Generate columns
- Created date
- Status — In Progress or Complete
Table Features
- Search/filter by job name, file name, or problem type
- Edit mode (top-left) lets you select and delete multiple jobs
- Cancel button to exit edit mode
Row Actions
Click the action menu on any row to:
- Copy — pre-fills a new generation job with the same settings
- Rename
- Delete
Start a New Data Generation Job
Click the New Data Generation button to open the job creation form.
The form is divided into the following sections:
Data Generation Details
Choose a generation method.
Use a Demo Recipe
Pick from a few built-in generation recipes such as:
- Haiku generation
- Essay scoring
- Text-to-SQL
These are great for trying out the system or prototyping ideas.
Use Custom Configuration (most common)
- Name your data generation job (optional; autogenerated if blank)
- Problem Type:
  - Generate rows (default): the LLM creates brand-new records
  - Generate columns: the LLM adds a new column to an existing dataset
- Number of Samples: how many rows you want to generate (default: 10, but typically you'll want 1,000 or more)
If you choose Generate columns, a fourth section called Dataset Selection will appear, allowing you to pick the input dataset the LLM will enrich.
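The custom-configuration fields above can be pictured as a small settings object. The sketch below is illustrative only; the field names, defaults, and validation are assumptions, not the product's actual API.

```python
# Illustrative sketch of the custom-configuration fields as a settings
# object. Field names are assumptions for illustration, not the
# product's actual API.
from dataclasses import dataclass
import uuid

VALID_PROBLEM_TYPES = {"generate_rows", "generate_columns"}

@dataclass
class DataGenerationJobConfig:
    name: str = ""                       # optional; autogenerated if blank
    problem_type: str = "generate_rows"  # default problem type
    num_samples: int = 10                # default 10; often 1,000 or more

    def __post_init__(self):
        if not self.name:
            # Mimic the "autogenerated if blank" behavior with a random suffix.
            self.name = f"data-gen-{uuid.uuid4().hex[:8]}"
        if self.problem_type not in VALID_PROBLEM_TYPES:
            raise ValueError(f"unknown problem type: {self.problem_type}")
        if self.num_samples < 1:
            raise ValueError("num_samples must be at least 1")

cfg = DataGenerationJobConfig(num_samples=1000)
print(cfg.name, cfg.problem_type, cfg.num_samples)
```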
Dataset Selection (if generating columns)
This section appears only if you select Generate columns as your problem type.
Choose a dataset from your project that the LLM will read from and enrich (e.g., by generating labels or answers).
Use cases include:
- Labeling call center transcripts
- Annotating support tickets with topics
- Adding summaries or titles to long-form text
This is ideal for producing high-quality, labeled datasets that can be used to fine-tune smaller LLMs, reducing production inference costs.
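Conceptually, a Generate columns job walks the source dataset and asks the LLM for one new value per row. A minimal sketch of that loop, with a toy `label_with_llm` function standing in for the real model call:

```python
# Sketch of the "Generate columns" idea: enrich each row of an existing
# dataset with one LLM-generated value. label_with_llm is a toy stand-in;
# a real job would prompt the configured LLM instead.
def label_with_llm(text: str) -> str:
    return "refund" if "refund" in text.lower() else "other"

def generate_column(rows: list[dict], source_col: str, new_col: str) -> list[dict]:
    enriched = []
    for row in rows:
        row = dict(row)  # copy so the input dataset is not mutated
        row[new_col] = label_with_llm(row[source_col])
        enriched.append(row)
    return enriched

transcripts = [
    {"transcript": "Customer asked for a refund on order 1234."},
    {"transcript": "Caller wanted to update their shipping address."},
]
labeled = generate_column(transcripts, "transcript", "topic")
print(labeled[0]["topic"])  # → refund
```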
LLM Config
Configure the LLM used for data generation. This is typically a large, hosted model (e.g., GPT-4, Claude, Gemini).
- Model Selection: Choose from the list of models configured for your organization
- Max New Tokens: maximum tokens the model is allowed to generate (default: 1024)
- Temperature: controls randomness
  - 0.0–0.3: more predictable
  - 0.7+: more creative
  - Default is 0.2
These models are not the ones you will fine-tune — they are used to create high-quality labeled data for training your smaller, production-ready models.
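Temperature works by rescaling the model's next-token distribution before sampling: lower values sharpen it toward the most likely token, higher values flatten it. A small numeric illustration using plain softmax (not any particular model's implementation):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax: low temperature
    # sharpens the distribution, high temperature flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.2)   # near-deterministic
high = softmax_with_temperature(logits, 1.0)  # more spread out
print(round(low[0], 3), round(high[0], 3))    # → 0.993 0.659
```

At temperature 0.2 the top token gets almost all the probability mass, which is why low temperatures produce predictable output.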
Transformation Instruction Templates
This section helps define how the large language model should generate your data.
Optional: Improve Configuration
This is the most important starting point for setting up your data generation. Here's how to use it:
1. Click the Improve Config ✨ button at the bottom of the screen.
   This opens a popup dialog where you can provide specific instructions for how to improve your configuration.
2. In the dialog, enter what you want out of the data generation (optional; you can leave it blank for general improvements).
   Example instructions:
   - "I want a haiku about nature, especially themes like the moon or the stars."
   - "Generate a call center transcript with 3 binary classification labels."
   - "Create more diverse examples with better formatting."
3. Click Improve Config in the dialog.
   This uses an LLM to auto-fill the following fields:
   - Random Seed Dictionary (optional, improves variety)
   - User Prompt Template
4. Review the generated fields to ensure they match your intent. Edit them if needed.
You can manually write all of these fields if you want to, but using "Improve Config" is the fastest and most reliable method.
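One way to picture how a Random Seed Dictionary adds variety: each generation draws random values from the dictionary and substitutes them into the User Prompt Template. The dictionary contents and `{placeholder}` template syntax below are assumptions for illustration, not the product's actual format.

```python
import random

# Hypothetical Random Seed Dictionary: each key maps to candidate values.
seed_dict = {
    "theme": ["the moon", "the stars", "autumn rain", "mountain mist"],
    "tone": ["serene", "melancholy", "playful"],
}

# Hypothetical User Prompt Template with {placeholders} for seed values.
user_prompt_template = "Write a haiku about {theme} in a {tone} tone."

def build_prompt(template: str, seeds: dict, rng: random.Random) -> str:
    # Draw one random value per key so repeated calls vary the prompt.
    chosen = {key: rng.choice(values) for key, values in seeds.items()}
    return template.format(**chosen)

rng = random.Random(42)  # fixed seed for reproducibility
print(build_prompt(user_prompt_template, seed_dict, rng))
```

Because each sample is prompted with a different combination of seed values, the generated rows cover more of the space you care about instead of repeating one phrasing.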
Optional: Check or Reset the Config
- Click Check Config to run a few test examples and preview generated rows
- Click Reset Settings to clear the form and start fresh
Start Data Generation
Click Start Data Generation to begin the job using the provided instructions and model settings.
After You Start: Viewing a Data Generation Job
Once your job starts, the system will show a job summary view with the following details:
- Status: In Progress or Complete
- Runtime and ETA
- Cost (may be $0 for small jobs)
- Total Input Tokens
- Total Generated Tokens
- Speed (tokens per second)
You’ll also see:
- A text length distribution chart
- A word cloud of generated terms
- A config summary (including prompts and settings)
- A log of generation activity
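The speed and distribution figures in the summary can be derived from the job's raw counts. A sketch of the arithmetic, with function and field names assumed for illustration:

```python
def tokens_per_second(total_generated_tokens: int, runtime_seconds: float) -> float:
    # Speed as shown in the summary: generated tokens divided by runtime.
    return total_generated_tokens / runtime_seconds

def text_length_histogram(texts: list[str], bucket_size: int = 50) -> dict:
    # Bucket generated texts by character length, roughly what a
    # text length distribution chart displays.
    hist: dict[int, int] = {}
    for t in texts:
        bucket = (len(t) // bucket_size) * bucket_size
        hist[bucket] = hist.get(bucket, 0) + 1
    return hist

print(tokens_per_second(12_000, 60.0))                    # → 200.0
print(text_length_histogram(["a" * 30, "b" * 70, "c" * 80]))  # → {0: 1, 50: 2}
```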
View Your Generated Dataset
At the top of the job summary page, click Check Result Dataset to jump directly to the newly created dataset.
This takes you to the dataset page, where you can:
- View row/column metadata
- Explore token statistics
- Preview a sample of your data
To learn more about working with datasets, see the Datasets documentation.