Dataset formats
To use a dataset with one of the supported problem types, you need to format it in a specific way. The instructions below describe how to format your dataset for each supported problem type.
Image regression
- Format
- Example
The data for an image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).
folder_name.zip (1)
├───csv_name.csv (2)
└───image_folder_name (3)
    ├───name_of_image.image_extension
    ├───name_of_image.image_extension
    ├───name_of_image.image_extension
    └───...
- The available dataset connectors require the data for an image regression experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image name specifies an image extension
Note
- Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
- If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing the numerical labels (targets)
Note
H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.
- An optional fold column containing cross-validation fold indexes
Note
Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, while all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
- An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image regression experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
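The format described above can be assembled with a short script. The following is a minimal sketch that uses only the Python standard library. The helper name `build_dataset_zip`, the file and folder names (`labels.csv`, `images`), and the column names (`image`, `label`, `fold`) are hypothetical choices for illustration, not names required by H2O Hydrogen Torch. The round-robin fold assignment is just one possible strategy.

```python
import csv
import io
import zipfile


def build_dataset_zip(zip_path, rows, image_bytes, n_folds=5,
                      csv_name="labels.csv", image_folder="images"):
    """Write a .zip file holding one .csv file and one image folder.

    rows: list of (image_name, label) pairs; each name includes its extension.
    image_bytes: dict mapping image_name -> raw file contents.
    All names here are illustrative assumptions, not required by the product.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image", "label", "fold"])
    for index, (image_name, label) in enumerate(rows):
        # Round-robin fold assignment: integer fold indexes 0 .. n_folds - 1.
        writer.writerow([image_name, label, index % n_folds])

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The .csv file sits at the top level of the archive.
        archive.writestr(csv_name, buffer.getvalue())
        # Every image named in the image column goes into the image folder.
        for image_name, _ in rows:
            archive.writestr(f"{image_folder}/{image_name}", image_bytes[image_name])
    return zip_path
```

In this sketch the image names in the .csv file omit the folder, so the data folder (`images`) would need to be specified when importing the dataset, as described in the note on the data directory.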
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
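One way to produce the three .csv files is to split a single table of records. The sketch below is an illustration under assumptions: the function name `split_csv_text`, the default split fractions, and the column names used in the usage example are hypothetical, and a random split is only one possible strategy.

```python
import csv
import io
import random


def split_csv_text(rows, label_columns, train_fraction=0.8,
                   validation_fraction=0.1, seed=0):
    """Return CSV text for train, validation, and test dataframes.

    rows: list of dicts that share the same keys (the column names).
    label_columns: names of the label columns, omitted from the test file.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    n_valid = int(len(shuffled) * validation_fraction)
    parts = {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }
    columns = list(rows[0].keys())
    output = {}
    for name, part in parts.items():
        # The test file keeps the same format but drops the label column(s).
        keep = [c for c in columns if name != "test" or c not in label_columns]
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=keep,
                                extrasaction="ignore", lineterminator="\n")
        writer.writeheader()
        writer.writerows(part)
        output[name] = buffer.getvalue()
    return output
```

For example, `split_csv_text(rows, ["label"])` returns a dict with the keys `train`, `validation`, and `test`; each value can be written to its own .csv file in the .zip file.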
The coins_image_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image regression problem. The .zip file contains a .csv file and an image folder. The structure of the .zip file is as follows:
coins_image_regression.zip
├───coins_image_regression.csv
└───images
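Before uploading an archive with this structure, it can help to confirm that every image listed in the .csv file is present in the image folder. The following is a minimal sketch using the Python standard library. The function name `find_missing_images` is hypothetical, and the image column name must be supplied by the caller because the column names of coins_image_regression.csv are not stated here.

```python
import csv
import io
import zipfile


def find_missing_images(zip_file, csv_name, image_column, data_folder=""):
    """Return image names listed in the .csv file but absent from the archive.

    data_folder is prepended to each image name; leave it empty when the
    names in the image column already include the folder.
    """
    with zipfile.ZipFile(zip_file) as archive:
        members = set(archive.namelist())
        with archive.open(csv_name) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            listed = [row[image_column] for row in csv.DictReader(text)]
    prefix = f"{data_folder.rstrip('/')}/" if data_folder else ""
    return [name for name in listed if prefix + name not in members]
```

An empty list means every image named in the image column was found. A call might look like `find_missing_images("coins_image_regression.zip", "coins_image_regression.csv", image_column, "images")`, where `image_column` is whatever the image column is called in that file.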