Data generation
Overview
The Data Generation page allows you to create high-quality training data using large language models (LLMs). You can:
- Generate new rows (entire synthetic datasets)
- Generate new columns (labels, summaries, or other outputs) based on existing datasets
All jobs are scoped to your current project.
Data Generation List View
When you land on the Data Generation tab, you'll see a table of all jobs in the current project.
Table Columns
Each row includes:
- Job name
- Source dataset (only shown for "Generate columns" jobs)
- Problem type — either Generate rows or Generate columns
- Created date
- Status — In Progress or Complete
Table Features
- Search/filter by job name, file name, or problem type
- Edit mode (top-left) lets you select and delete multiple jobs
- Cancel button to exit edit mode
Row Actions
Click the action menu on any row to:
- Copy — pre-fills a new generation job with the same settings
- Rename
- Delete
Start a New Data Generation Job
Click the New Data Generation button to open the job creation form.
The form is divided into the following sections:
Data Generation Details
Choose a generation method.
Use a Demo Recipe
Pick from a few built-in generation recipes such as:
- Haiku generation
- Essay scoring
- Text-to-SQL
These are great for trying out the system or prototyping ideas.
Use Custom Configuration (most common)
- Name your data generation job (optional; autogenerated if blank)
- Problem Type:
  - Generate rows (default): the LLM creates brand-new records
  - Generate columns: the LLM adds a new column to an existing dataset
- Number of Samples: how many rows you want to generate (default: 10, but typically you'll want 1,000 or more)
If you choose Generate columns, a fourth section called Dataset Selection will appear, allowing you to pick the input dataset the LLM will enrich.
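The custom-configuration fields above can be pictured as a small settings object. The sketch below is illustrative only; the field names, defaults, and validation are assumptions, not the product's actual API.

```python
# Illustrative sketch of the custom-configuration fields as a settings
# object. Field names are assumptions for illustration, not the
# product's actual API.
from dataclasses import dataclass
import uuid

VALID_PROBLEM_TYPES = {"generate_rows", "generate_columns"}

@dataclass
class DataGenerationJobConfig:
    name: str = ""                       # optional; autogenerated if blank
    problem_type: str = "generate_rows"  # default problem type
    num_samples: int = 10                # default 10; often 1,000 or more

    def __post_init__(self):
        if not self.name:
            # Mimic the "autogenerated if blank" behavior with a random suffix.
            self.name = f"data-gen-{uuid.uuid4().hex[:8]}"
        if self.problem_type not in VALID_PROBLEM_TYPES:
            raise ValueError(f"unknown problem type: {self.problem_type}")
        if self.num_samples < 1:
            raise ValueError("num_samples must be at least 1")

cfg = DataGenerationJobConfig(num_samples=1000)
print(cfg.name, cfg.problem_type, cfg.num_samples)
```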
Dataset Selection (if generating columns)
This section appears only if you select Generate columns as your problem type.
Choose a dataset from your project that the LLM will read from and enrich (e.g., by generating labels or answers).
Use cases include:
- Labeling call center transcripts
- Annotating support tickets with topics
- Adding summaries or titles to long-form text
This is ideal for producing high-quality, labeled datasets that can be used to fine-tune smaller LLMs, reducing production inference costs.
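Conceptually, a Generate columns job walks the source dataset and asks the LLM for one new value per row. A minimal sketch of that loop, with a toy `label_with_llm` function standing in for the real model call:

```python
# Sketch of the "Generate columns" idea: enrich each row of an existing
# dataset with one LLM-generated value. label_with_llm is a toy stand-in;
# a real job would prompt the configured LLM instead.
def label_with_llm(text: str) -> str:
    return "refund" if "refund" in text.lower() else "other"

def generate_column(rows: list[dict], source_col: str, new_col: str) -> list[dict]:
    enriched = []
    for row in rows:
        row = dict(row)  # copy so the input dataset is not mutated
        row[new_col] = label_with_llm(row[source_col])
        enriched.append(row)
    return enriched

transcripts = [
    {"transcript": "Customer asked for a refund on order 1234."},
    {"transcript": "Caller wanted to update their shipping address."},
]
labeled = generate_column(transcripts, "transcript", "topic")
print(labeled[0]["topic"])  # → refund
```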
LLM Config
Configure the LLM used for data generation. This is typically a large, hosted model (e.g., GPT-4, Claude, Gemini).
- Model Selection: Choose from the list of models configured for your organization
- Max New Tokens: maximum tokens the model is allowed to generate (default: 1024)
- Temperature: controls randomness
  - 0.0–0.3: more predictable
  - 0.7+: more creative
  - Default is 0.2
These models are not the ones you will fine-tune — they are used to create high-quality labeled data for training your smaller, production-ready models.
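Temperature works by rescaling the model's next-token distribution before sampling: lower values sharpen it toward the most likely token, higher values flatten it. A small numeric illustration using plain softmax (not any particular model's implementation):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature before softmax: low temperature
    # sharpens the distribution, high temperature flattens it.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, 0.2)   # near-deterministic
high = softmax_with_temperature(logits, 1.0)  # more spread out
print(round(low[0], 3), round(high[0], 3))    # → 0.993 0.659
```

At temperature 0.2 the top token gets almost all the probability mass, which is why low temperatures produce predictable output.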
Transformation Instruction Templates
This section helps define how the large language model should generate your data.
Optional: Improve Configuration
This is the most important starting point for setting up your data generation. Here's how to use it:
1. Click the Improve Config ✨ button at the bottom of the screen.
   This opens a popup dialog where you can provide specific instructions for how to improve your configuration.
2. In the dialog, enter what you want out of the data generation (optional; you can leave it blank for general improvements).
   Example instructions:
   - "I want a haiku about nature, especially themes like the moon or the stars."
   - "Generate a call center transcript with 3 binary classification labels."
   - "Create more diverse examples with better formatting."
3. Click Improve Config in the dialog.
   This uses an LLM to auto-fill the following fields:
   - Random Seed Dictionary (optional, improves variety)
   - User Prompt Template
4. Review the generated fields to ensure they match your intent. Edit them if needed.
You can manually write all of these fields if you want to, but using "Improve Config" is the fastest and most reliable method.
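One way to picture how a Random Seed Dictionary adds variety: each generation draws random values from the dictionary and substitutes them into the User Prompt Template. The dictionary contents and `{placeholder}` template syntax below are assumptions for illustration, not the product's actual format.

```python
import random

# Hypothetical Random Seed Dictionary: each key maps to candidate values.
seed_dict = {
    "theme": ["the moon", "the stars", "autumn rain", "mountain mist"],
    "tone": ["serene", "melancholy", "playful"],
}

# Hypothetical User Prompt Template with {placeholders} for seed values.
user_prompt_template = "Write a haiku about {theme} in a {tone} tone."

def build_prompt(template: str, seeds: dict, rng: random.Random) -> str:
    # Draw one random value per key so repeated calls vary the prompt.
    chosen = {key: rng.choice(values) for key, values in seeds.items()}
    return template.format(**chosen)

rng = random.Random(42)  # fixed seed for reproducibility
print(build_prompt(user_prompt_template, seed_dict, rng))
```

Because each sample is prompted with a different combination of seed values, the generated rows cover more of the space you care about instead of repeating one phrasing.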
Optional: Check or Reset the Config
- Click Check Config to run a few test examples and preview generated rows
- Click Reset Settings to clear the form and start fresh
Start Data Generation
Click Start Data Generation to begin the job using the provided instructions and model settings.
After You Start: Viewing a Data Generation Job
Once your job starts, the system will show a job summary view with the following details:
- Status: In Progress or Complete
- Runtime and ETA
- Cost (may be $0 for small jobs)
- Total Input Tokens
- Total Generated Tokens
- Speed (tokens per second)
You’ll also see:
- A text length distribution chart
- A word cloud of generated terms
- A config summary (including prompts and settings)
- A log of generation activity
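The speed and distribution figures in the summary can be derived from the job's raw counts. A sketch of the arithmetic, with function and field names assumed for illustration:

```python
def tokens_per_second(total_generated_tokens: int, runtime_seconds: float) -> float:
    # Speed as shown in the summary: generated tokens divided by runtime.
    return total_generated_tokens / runtime_seconds

def text_length_histogram(texts: list[str], bucket_size: int = 50) -> dict:
    # Bucket generated texts by character length, roughly what a
    # text length distribution chart displays.
    hist: dict[int, int] = {}
    for t in texts:
        bucket = (len(t) // bucket_size) * bucket_size
        hist[bucket] = hist.get(bucket, 0) + 1
    return hist

print(tokens_per_second(12_000, 60.0))                    # → 200.0
print(text_length_histogram(["a" * 30, "b" * 70, "c" * 80]))  # → {0: 1, 50: 2}
```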
View Your Generated Dataset
At the top of the job summary page, click Check Result Dataset to jump directly to the newly created dataset.
This takes you to the dataset page, where you can:
- View row/column metadata
- Explore token statistics
- Preview a sample of your data
To learn more about working with datasets, see the Datasets documentation.