Data generation
Overview
The Data Generation page allows you to create high-quality training data using large language models (LLMs). You can:
- Generate new rows (entire synthetic datasets)
- Generate new columns (labels, summaries, or other outputs) based on existing datasets
All jobs are scoped to your current project.
Data Generation List View
When you land on the Data Generation tab, you'll see a table of all jobs in the current project.
Table Columns
Each row includes:
- Job name
- Source dataset (only shown for "Generate columns" jobs)
- Problem type: either Generate rows or Generate columns
- Created date
- Status: either In Progress or Complete
Table Features
- Search/filter by job name, file name, or problem type
- Edit mode (top-left) lets you select and delete multiple jobs
- Cancel button to exit edit mode
Row Actions
Click the action menu on any row to:
- Copy: pre-fills a new generation job with the same settings
- Rename
- Delete
Start a New Data Generation Job
Click the New Data Generation button to open the job creation form.
The form is divided into the following sections:
Data Generation Details
Choose a generation method.
Use a Demo Recipe
Pick from a few built-in generation recipes such as:
- Haiku generation
- Essay scoring
- Text-to-SQL
These are great for trying out the system or prototyping ideas.
Use Custom Configuration (most common)
- Name your data generation job (optional; autogenerated if blank)
- Problem Type:
  - Generate rows (default): the LLM creates brand-new records
  - Generate columns: the LLM adds a new column to an existing dataset
- Number of Samples: How many rows you want to generate (default: 10, but typically you'll want 1,000 or more)
If you choose Generate columns, a fourth section called Dataset Selection will appear, allowing you to pick the input dataset the LLM will enrich.
Dataset Selection (if generating columns)
This section appears only if you select Generate columns as your problem type.
Choose a dataset from your project that the LLM will read from and enrich (e.g., by generating labels or answers).
Use cases include:
- Labeling call center transcripts
- Annotating support tickets with topics
- Adding summaries or titles to long-form text
This is ideal for producing high-quality, labeled datasets that can be used to fine-tune smaller LLMs, reducing production inference costs.
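Conceptually, a Generate columns job reads each row of the selected dataset, prompts the configured LLM, and writes the response into a new column. The sketch below illustrates that idea outside the UI; `call_llm` and the sentiment prompt are hypothetical placeholders, not part of the product's API.

```python
from typing import Callable

# Illustrative sketch only; not Enterprise LLM Studio's API.
# `call_llm` stands in for whichever hosted model you have configured.
def generate_label_column(rows: list[dict], call_llm: Callable[[str], str]) -> list[dict]:
    """Add a 'label' column to each row by asking an LLM to classify its text."""
    labeled = []
    for row in rows:
        prompt = (
            "Classify the sentiment of this support ticket as positive, "
            f"neutral, or negative.\n\nTicket:\n{row['text']}"
        )
        labeled.append({**row, "label": call_llm(prompt).strip().lower()})
    return labeled

# Example: rows = [{"id": 1, "text": "My order arrived broken and support never replied."}]
# After the job, each row also carries a "label" column, e.g. "negative".
```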
LLM Config
Configure the LLM used for data generation. This is typically a large, hosted model (e.g., GPT-4, Claude, Gemini).
- Model Selection: Choose from the list of models configured for your organization
- Max New Tokens: Maximum tokens the model is allowed to generate (default: 1024)
- Temperature: Controls randomness (see the sketch below)
  - 0.0-0.3: More predictable output
  - 0.7+: More creative output
  - Default is 0.2
These models are not the ones you will fine-tune; they are used to create high-quality labeled data for training your smaller, production-ready models.
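Temperature here works the way temperature scaling usually does in LLM sampling: the model's next-token scores are divided by the temperature before being turned into probabilities. A toy sketch of the effect, using made-up logits rather than anything produced by the product:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then normalize into a probability distribution."""
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # ~[0.99, 0.01, 0.00] -> near-deterministic
print(softmax_with_temperature(logits, 1.0))  # ~[0.66, 0.24, 0.10] -> more varied
```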
Transformation Instruction Templates
This section helps define how the large language model should generate your data.
Suggestions to Improve
At the very top of this section, you'll find a text box labeled Suggestions to Improve.
This is the most important starting point. Here's how to use it:
- In the box, write what you want out of the data generation. Example:
  - "I want a haiku about nature, especially themes like the moon or the stars."
  - "Generate a call center transcript with 3 binary classification labels."
- Click Improve Config at the bottom of the screen. This uses an LLM to auto-fill the following fields (see the sketch below for a conceptual example of what they do):
  - Random Seed Dictionary (optional, improves variety)
  - System Prompt Template
  - User Prompt Template
- Review the generated fields to ensure they match your intent. Edit them if needed.
You can manually write all of these fields if you want to, but using "Improve Config" is the fastest and most reliable method.
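The exact format of these fields is specific to Enterprise LLM Studio, but conceptually the Random Seed Dictionary supplies values that are substituted into the prompt templates so each generated row differs. A minimal sketch of that idea; the field names, templates, and substitution logic below are illustrative assumptions, not what Improve Config will literally produce:

```python
import random

# Hypothetical seed dictionary: candidate values to vary across generated rows.
seed_dictionary = {
    "theme": ["the moon", "the stars", "autumn rain", "mountain mist"],
    "tone": ["calm", "wistful", "joyful"],
}

system_prompt_template = "You are a poet who writes traditional 5-7-5 haiku."
user_prompt_template = "Write a haiku about {theme} in a {tone} tone."

def build_prompts(n: int) -> list[tuple[str, str]]:
    """Draw random seed values and fill the templates, one (system, user) pair per row."""
    prompts = []
    for _ in range(n):
        values = {key: random.choice(options) for key, options in seed_dictionary.items()}
        prompts.append((system_prompt_template, user_prompt_template.format(**values)))
    return prompts

for system_prompt, user_prompt in build_prompts(3):
    print(user_prompt)
```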
Optional: Check or Reset the Config
- Click Check Config to run a few test examples and preview generated rows
- Click Reset Settings to clear the form and start fresh
Start Data Generation
Click Start Data Generation to begin the job using the provided instructions and model settings.
After You Start: Viewing a Data Generation Job
Once your job starts, the system will show a job summary view with the following details:
- Status: In Progress or Complete
- Runtime and ETA
- Cost (may be $0 for small jobs)
- Total Input Tokens
- Total Generated Tokens
- Speed (tokens per second)
You'll also see:
- A text length distribution chart
- A word cloud of generated terms
- A config summary (including prompts and settings)
- A log of generation activity
View Your Generated Dataset
At the top of the job summary page, click Check Result Dataset to jump directly to the newly created dataset.
This takes you to the dataset page, where you can:
- View row/column metadata
- Explore token statistics
- Preview a sample of your data
To learn more about working with datasets, see the Datasets documentation.