Version: v0.17.0

Datasets

Overview

The Datasets page lets you manage all datasets within your current project. From here, you can view, search, filter, and take actions on individual datasets — including launching experiments directly from the dataset view.

You can access Datasets from:

The top navigation tabs, or
The Datasets card on the homepage of a project

view-datasets

List datasets

When you open the Datasets page, you’ll see a table showing all datasets in the current project.

Table columns

Each row represents one dataset and includes:

Name
Number of rows
File size (e.g. KB, MB)
File type (e.g. CSV, Parquet)
Created date

You can:

Search by name
Search by file type
Customize columns using the column visibility toggle (top-right of the table)

Edit dataset

Click the Edit button (top-left) to:

Select multiple datasets
Bulk delete selected datasets

edit-dataset

Use the Cancel button (bottom of the screen) to exit edit mode.

Row actions

Each dataset row includes a dropdown menu with:

Rename — opens a rename dialog
Delete — removes the dataset from the project

Add new dataset

Click the ➕ New Dataset button (top of the page) to upload your own data.

You’ll be taken to a new screen where you can:

Name your dataset (optional — a default name will be auto-generated if left blank)
Upload a file via drag-and-drop or file picker

Supported file formats

.csv — Comma-separated values
.parquet or .pq — Parquet format
.json — JSON structured data

info

For text-based problem types (Language Modeling, Classification), datasets should contain at least one column of text data.
For vision-based problem types (Image Classification, Object Detection, Multimodal), datasets should contain a column with image data (base64-encoded or binary). Object detection datasets additionally require an annotation column with bounding box annotations in COCO format.
If your dataset contains only input text (without labels), you can generate outputs using the Data Generation feature before fine-tuning.

Once uploaded, you’ll be returned to the Datasets page and your new dataset will appear in the list.

Use a demo dataset

If you don’t have a dataset handy, you can start with one of our pre-configured demo datasets.

On the New Dataset screen, scroll down to find:

“…or get started quickly with our demo datasets”

Clicking any of the demos will immediately import that dataset into your project.

demo-datasets

Available demo datasets

Text Datasets
Image Datasets
Multimodal Datasets
Data Generation

Dataset	Problem Type	Size	Description
Chatbot Conversations	Language Modeling	14 MB • ~12K rows	Instruction-output pairs for training conversational AI models (OASST2)
Text-to-SQL	Language Modeling	3.7 MB • 3K rows	Question-SQL pairs for training models to convert natural language to SQL
Sentiment Classification	Classification	20 MB • ~25K rows	IMDB movie reviews for binary sentiment classification
Multi-Label Text Classification	Classification	5.1 MB • ~10K rows	Financial news with multiple category labels (Reuters-21578)

Dataset	Problem Type	Size	Description
Digit Classification	Image Classification	15 MB • ~60K rows	Handwritten digit images for classification (MNIST)
Mini ImageNet	Image Classification	99 MB • 800 rows	Generic image dataset with 50 categories from ImageNet
VOC 2012	Object Detection	109 MB • 1K rows	Object detection dataset with bounding box annotations for 20 object categories (PASCAL VOC)

Dataset	Problem Type	Size	Description
Mini TextVQA	Visual Question Answering	250 MB • 650 rows	Visual question answering with images and text-based questions (TextVQA)

Dataset	Problem Type	Size	Description
Topic-Based Data Generation	Data Generation	7 KB • 120 rows	Simple topic prompts for synthetic content generation
Essay-Enhanced Data Generation	Data Generation	110 KB • 120 rows	Essay samples with topics for advanced synthetic data generation
Text-to-SQL Queries	Data Generation	2.1 MB • 10K rows	Natural language queries mapped to SQL table schemas

Note

The specific demos available may change over time — they’re designed to help you quickly test and explore LLM fine-tuning in the platform.

After importing a demo, you’ll be redirected back to the Datasets list where you can explore and use it like any other dataset.

Dataset overview page

Clicking on a dataset opens its Overview page.

Header actions

At the top of the page:

Rename the dataset via the edit icon next to the name
Create Experiment to kick off a fine-tuning run using this dataset
Delete the dataset

Metadata summary

Right below the header, you’ll see metadata about the dataset:

File size
Row and column count
Token count (total tokens in the dataset)
Created date
Number of experiments launched using this dataset

Column details

Scroll down to view the Column Details section:

Each column is listed with:
- Column name
- Max token length for that column (useful for prompt planning)
You can collapse/expand this section as needed

Dataset sample

Below the column details is the Dataset Sample table:

Shows up to 20 preview rows
You can expand individual rows to view full text if it’s truncated
Especially helpful for verifying formatting and text content in large-language tasks

Feedback

Submit and view feedback for this page
Send feedback about H2O Enterprise LLM Studio to cloud-feedback@h2o.ai

Overview​

List datasets​

Table columns​

Edit dataset​

Row actions​

Add new dataset​

Supported file formats​

Use a demo dataset​

Available demo datasets​

Dataset overview page​

Header actions​

Metadata summary​

Column details​

Dataset sample​