
Data generation

Overview

The Data Generation page allows you to create high-quality training data using large language models (LLMs). You can:

  • Generate new rows (entire synthetic datasets)
  • Generate new columns (labels, summaries, or other outputs) based on existing datasets

All jobs are scoped to your current project.

Data Generation List View

When you land on the Data Generation tab, you'll see a table of all jobs in the current project.

Table Columns

Each row includes:

  • Job name
  • Source dataset (only shown for "Generate columns" jobs)
  • Problem type: either Generate rows or Generate columns
  • Created date
  • Status: In Progress or Complete

Table Features

  • Search/filter by job name, file name, or problem type
  • Edit mode (top-left) lets you select and delete multiple jobs
  • Cancel button to exit edit mode

Row Actions

Click the action menu on any row to:

  • Copy: pre-fills a new generation job with the same settings
  • Rename
  • Delete

Start a New Data Generation Job

Click the New Data Generation button to open the job creation form.

The form is divided into the following sections:

Data Generation Details

Choose a generation method.

Use a Demo Recipe

Pick from a few built-in generation recipes such as:

  • Haiku generation
  • Essay scoring
  • Text-to-SQL

These are great for trying out the system or prototyping ideas.

Use Custom Configuration (most common)

  • Name your data generation job (optional; autogenerated if blank)
  • Problem Type:
    • Generate rows (default): LLM creates brand new records
    • Generate columns: LLM adds a new column to an existing dataset
  • Number of Samples: How many rows you want to generate (default: 10, but typically you'll want 1,000 or more)

If you choose Generate columns, an additional section called Dataset Selection will appear, allowing you to pick the input dataset the LLM will enrich.
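
To make these options concrete, the sketch below shows how a custom job's settings fit together as a simple configuration object. The key names and values are illustrative assumptions, not the platform's actual schema.

  # Illustrative only: key names and values are assumptions, not the
  # platform's real job schema.
  custom_job = {
      "name": "support-ticket-labels-v1",   # optional; autogenerated if left blank
      "problem_type": "generate_columns",   # or "generate_rows" (the default)
      "num_samples": 1000,                  # default is 10; larger runs are typical
      # Only used for "generate_columns" jobs (see Dataset Selection below):
      "source_dataset": "support_tickets",
  }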

Dataset Selection (if generating columns)

This section appears only if you select Generate columns as your problem type.

Choose a dataset from your project that the LLM will read from and enrich (e.g., by generating labels or answers).

Use cases include:

  • Labeling call center transcripts
  • Annotating support tickets with topics
  • Adding summaries or titles to long-form text

This is ideal for producing high-quality, labeled datasets that can be used to fine-tune smaller LLMs, reducing production inference costs.
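
For example, a column-generation job over call center transcripts might take a row with only a transcript field and write a new label column next to it. The field names and label value below are hypothetical, shown only to illustrate the shape of the enrichment.

  # Hypothetical before/after rows for a "Generate columns" job.
  # Field names and the label value are illustrative, not a fixed schema.
  source_row = {
      "transcript": "Customer: My router keeps dropping the connection every evening...",
  }

  enriched_row = {
      "transcript": "Customer: My router keeps dropping the connection every evening...",
      "topic_label": "connectivity_issue",   # new column written by the generation LLM
  }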

LLM Config

Configure the LLM used for data generation. This is typically a large, hosted model (e.g., GPT-4, Claude, Gemini).

  • Model Selection: Choose from the list of models configured for your organization
  • Max New Tokens: Maximum tokens the model is allowed to generate (default: 1024)
  • Temperature: Controls randomness
    • 0.0–0.3: More predictable
    • 0.7+: More creative
    • Default is 0.2

These models are not the ones you will fine-tune; they are used to create high-quality labeled data for training your smaller, production-ready models.
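
The temperature setting behaves like the standard softmax temperature scaling most LLM samplers use: values near 0 sharpen the next-token distribution (more predictable output), while higher values flatten it (more variety). The toy numbers below are a generic sketch of that effect, not the platform's actual sampler.

  import math

  def softmax_with_temperature(logits, temperature):
      # Lower temperature sharpens the distribution; higher temperature flattens it.
      scaled = [score / temperature for score in logits]
      peak = max(scaled)
      exps = [math.exp(s - peak) for s in scaled]
      total = sum(exps)
      return [e / total for e in exps]

  toy_logits = [2.0, 1.0, 0.5]                      # made-up next-token scores
  print(softmax_with_temperature(toy_logits, 0.2))  # ~[0.99, 0.007, 0.0006]: near-deterministic
  print(softmax_with_temperature(toy_logits, 1.0))  # ~[0.63, 0.23, 0.14]: noticeably more varied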

Transformation Instruction Templates

This section helps define how the large language model should generate your data.

Suggestions to Improve

At the very top of this section, you’ll find a text box labeled Suggestions to Improve.

This is the most important starting point. Here’s how to use it:

  1. In the box, describe what you want the data generation to produce.

    Examples:

    • “I want a haiku about nature, especially themes like the moon or the stars.”
    • “Generate a call center transcript with 3 binary classification labels.”
  2. Click Improve Config at the bottom of the screen.

    This uses an LLM to auto-fill the following fields:

    • Random Seed Dictionary (optional, improves variety)
    • System Prompt Template
    • User Prompt Template
  3. Review the generated fields to ensure they match your intent. Edit them if needed.

You can manually write all of these fields if you want to, but using "Improve Config" is the fastest and most reliable method.
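
To show what the auto-filled fields typically look like together, here is a hand-written sketch of a random seed dictionary feeding the prompt templates: one value is drawn from each seed list and substituted into the user prompt so repeated samples vary. The placeholder syntax and field contents are assumptions for illustration, not the exact output of Improve Config.

  import random

  # Hypothetical auto-filled fields; contents and {placeholder} syntax are
  # illustrative, so review whatever Improve Config actually produces.
  random_seed_dictionary = {
      "theme": ["the moon", "the stars", "autumn rain", "a quiet forest"],
      "tone": ["serene", "wistful", "playful"],
  }
  system_prompt_template = "You are a poet who writes traditional 5-7-5 haiku."
  user_prompt_template = "Write a haiku about nature featuring {theme}, in a {tone} tone."

  def build_prompt():
      # Drawing one seed value per key keeps each generated sample slightly different.
      seeds = {key: random.choice(values) for key, values in random_seed_dictionary.items()}
      return system_prompt_template, user_prompt_template.format(**seeds)

  system_msg, user_msg = build_prompt()
  print(system_msg)
  print(user_msg)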

Optional: Check or Reset the Config

  • Click Check Config to run a few test examples and preview generated rows
  • Click Reset Settings to clear the form and start fresh

Start Data Generation

Click Start Data Generation to begin the job using the provided instructions and model settings.

After You Start: Viewing a Data Generation Job

Once your job starts, the system will show a job summary view with the following details:

  • Status: In Progress or Complete
  • Runtime and ETA
  • Cost (may be $0 for small jobs)
  • Total Input Tokens
  • Total Generated Tokens
  • Speed (tokens per second)

You’ll also see:

  • A text length distribution chart
  • A word cloud of generated terms
  • A config summary (including prompts and settings)
  • A log of generation activity

View Your Generated Dataset

At the top of the job summary page, click Check Result Dataset to jump directly to the newly created dataset.

This takes you to the dataset page, where you can:

  • View row/column metadata
  • Explore token statistics
  • Preview a sample of your data

To learn more about working with datasets, see the Datasets documentation.

