Dataset formats
Overview
Your dataset needs to be formatted in a specific way for each supported problem type. The sections below describe how to format a dataset for each supported problem type.
With H2O Label Genie (a Wave application in H2O AI Cloud), you can label your image, text, and audio data to generate annotated datasets supported in H2O Hydrogen Torch. To learn more, see H2O Label Genie | Docs.
To learn how to import a formatted (preprocessed) dataset, see Import a dataset.
Image
Image regression
- Format
- Example
The data for an image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described in this section
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require label columns
- (1) The available dataset connectors require the data for an image regression experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - One or more label columns containing the numerical labels (targets)
    Note: H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple label columns and choose which labels to predict when starting a new experiment.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- (3) An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image regression experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
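As a minimal sketch of how the optional fold column could be filled, the following snippet deals image names into N folds (0 to N-1) after a seeded shuffle. The helper name assign_folds and the image names are hypothetical and only for illustration; H2O Hydrogen Torch does not require folds to be created this way.

```python
import random


def assign_folds(image_names, n_folds=5, seed=0):
    """Return a dict mapping each image name to a fold index in 0..n_folds-1."""
    names = list(image_names)
    # Shuffle with a fixed seed so the assignment is reproducible
    random.Random(seed).shuffle(names)
    # Deal the shuffled names round-robin so fold sizes differ by at most one
    return {name: i % n_folds for i, name in enumerate(names)}


# Hypothetical image names, for illustration only
folds = assign_folds(["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"], n_folds=3)
```

The resulting values can be written to the fold column of the .csv file next to the image and label columns.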
The coins_image_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch that was formatted to solve an image regression problem. The .zip file contains a .csv file and an image folder. The structure of the .zip file is as follows:
coins_image_regression.zip
│ └───coins_image_regression.csv
│ │
│ └───images
│ └───95_1477858074.jpg
│ └───95_1477858068.jpg
│ └───95_1477858062.jpg
│ ...
The first three rows of the .csv file are as follows:
image_path | label | fold |
---|---|---|
105_1479344562.jpg | 105 | 1 |
105_1479344940.jpg | 105 | 2 |
125_1479424716.jpg | 125 | 1 |
- In this example, the image column (image_path) does not specify the data directory. Therefore, the data directory needs to be specified when uploading the dataset by selecting the images folder as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.
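The layout shown above (a .csv file and an image folder at the top level of a .zip file) can be produced with the Python standard library. The following is a minimal sketch, not an official tool: the function name build_dataset_zip and its parameters are hypothetical, and images are passed as raw bytes only to keep the example self-contained.

```python
import csv
import io
import zipfile


def build_dataset_zip(zip_path, csv_name, rows, image_folder, images):
    """Write a .csv file and an image folder into a .zip file.

    rows   -- list of dicts, one per image (for example image_path, label, fold)
    images -- dict mapping image file name to the raw bytes of that image
    """
    # Serialize the rows to CSV text in memory
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The .csv file sits at the top level of the archive
        archive.writestr(csv_name, buffer.getvalue())
        for name, data in images.items():
            # Every image is stored under the image folder inside the archive
            archive.writestr(f"{image_folder}/{name}", data)
```

Because the image column in this layout holds bare file names, the image folder still has to be selected as the data folder when the dataset is imported.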
3D image regression
- Format
- Example
The data for a 3D image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described in this section
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require label columns
- (1) The available dataset connectors require the data for a 3D image regression experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - One or more label columns containing the numerical labels (targets)
    Note: H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple label columns and choose which labels to predict when starting a new experiment.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- (3) An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image regression experiment.
  Note: All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.
The mnist_3d_image_regression_3d.zip file is a preprocessed dataset in H2O Hydrogen Torch that was formatted to solve a 3D image regression problem. The .zip file contains a .csv file and an image folder. The structure of the .zip file is as follows:
mnist_3d_image_regression_3d.zip
│ └───train.csv
│ │
│ └───images
│ └───39385.npy
│ └───28837.npy
│ └───35708.npy
│ ...
The first three rows of the train.csv file are as follows:
image | label |
---|---|
39385.npy | 1 |
28837.npy | 0 |
35708.npy | 2 |
- In this example, the image column does not specify the data directory. Therefore, the data directory needs to be specified when uploading the dataset by selecting the images folder as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
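Before uploading a dataset, it can help to confirm that every name in the image column has a matching file in the .zip file. The following is a minimal sketch of such a check using only the standard library; the function name find_missing_images and its parameters are hypothetical, and the check is not part of H2O Hydrogen Torch.

```python
import csv
import io
import zipfile


def find_missing_images(zip_path, csv_name, image_column, data_folder=""):
    """Return image names listed in the .csv file that are absent from the .zip file."""
    with zipfile.ZipFile(zip_path) as archive:
        members = set(archive.namelist())
        text = archive.read(csv_name).decode("utf-8")

    missing = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row[image_column]
        # Prepend the data folder when the image column holds bare file names
        member = f"{data_folder}/{name}" if data_folder else name
        if member not in members:
            missing.append(name)
    return missing
```

For the example above, a call such as find_missing_images("mnist_3d_image_regression_3d.zip", "train.csv", "image", "images") would return an empty list if every listed .npy file is present in the images folder.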
Image classification
- Format
- Example
The data for an image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described in this section
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require label columns
- (1) The available dataset connectors require the data for an image classification experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
    Note:
    - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, while in multi-label problems, each class is a separate label. For N label columns, only a single column can be set to 1 in multi-class problems, while any number of columns (including all or none) can be set to 1 in multi-label problems.
    - For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns. The experiment is then treated as a multi-class classification experiment, which leads to an incorrect calculation of the precision, recall, F1, F05, or F2 metric.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- (3) An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image classification experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
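The 0/1 requirement for binary label columns can be met by mapping string labels before the .csv file is written. The following is a minimal sketch of that mapping; the helper name to_binary_labels and the example labels are hypothetical.

```python
def to_binary_labels(labels, positive_class):
    """Map each label to 1 if it equals positive_class, otherwise 0."""
    unique = set(labels)
    # Guard against silently collapsing a multi-class column into two values
    if len(unique) > 2:
        raise ValueError(f"expected at most 2 distinct labels, found {len(unique)}")
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical string labels, for illustration only
encoded = to_binary_labels(["roses", "tulips", "roses"], positive_class="roses")
```

Here encoded holds integer 0/1 values that can be stored in the label column instead of the original strings.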
The flower_image_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch that was formatted to solve a multi-class image classification problem. The structure of the .zip file is as follows:
flower_image_classification.zip
│ └───train.csv
│ │
│ └───images
│ └───100080576_f52e8ee070_n.jpg
│ └───10043234166_e6dd915111_n.jpg
│ └───1008566138_6927679c8a.jpg
│ ...
The first three rows of the train.csv file are as follows:
image | label |
---|---|
5777669976_a205f61e5b.jpg | roses |
4860145119_b1c3cbaa4e_n.jpg | roses |
15011625580_7974c44bce.jpg | roses |
- In this example, the image column does not specify the data directory. Therefore, the data directory needs to be specified when uploading the dataset by selecting the images folder as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
3D image classification
- Format
- Example
The data for a 3D image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described in this section
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require label columns
- (1) The available dataset connectors require the data for a 3D image classification experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
    Note:
    - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, while in multi-label problems, each class is a separate label. For N label columns, only a single column can be set to 1 in multi-class problems, while any number of columns (including all or none) can be set to 1 in multi-label problems.
    - For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns. The experiment is then treated as a multi-class classification experiment, which leads to an incorrect calculation of the precision, recall, F1, F05, or F2 metric.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- (3) An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image classification experiment.
  Note: All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.
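To illustrate the difference between multi-class and multi-label label columns, the following sketch one-hot encodes a single label column into N columns and checks whether a set of 0/1 rows is consistent with a multi-class problem (exactly one column set to 1 per row). Both helper names are hypothetical, and the code only illustrates the rule described in the note above.

```python
def one_hot_encode(labels):
    """Return (class_names, rows) where each row has a single 1 in its class column."""
    class_names = sorted(set(labels))
    index = {name: i for i, name in enumerate(class_names)}
    rows = []
    for label in labels:
        row = [0] * len(class_names)
        row[index[label]] = 1
        rows.append(row)
    return class_names, rows


def is_multi_class(rows):
    """Return True if every row of 0/1 values has exactly one column set to 1."""
    return all(sum(row) == 1 for row in rows)


# Labels in the style of a single integer label column, for illustration only
class_names, rows = one_hot_encode([1, 0, 2])
```

Rows produced by one_hot_encode always pass is_multi_class, whereas multi-label rows such as [1, 1, 0] or [0, 0, 0] do not.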
The mnist_3d_image_classification_3d.zip file is a preprocessed dataset in H2O Hydrogen Torch that was formatted to solve a multi-class 3D image classification problem. The structure of the .zip file is as follows:
mnist_3d_image_classification_3d.zip
│ └───train.csv
│ │
│ └───images
│ └───39385.npy
│ └───28837.npy
│ └───35708.npy
│ ...
The first three rows of the train.csv file are as follows:
image | label |
---|---|
39385.npy | 1 |
28837.npy | 0 |
35708.npy | 2 |
- In this example, the image column does not specify the data directory. Therefore, the data directory needs to be specified when uploading the dataset by selecting the images folder as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Image metric learning
- Format
- Example
The data for an image metric learning experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described in this section
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require a label column
- (1) The available dataset connectors require the data for an image metric learning experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - A label column containing the class names
    Note: Similar images should have the same class name.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- (3) An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image metric learning experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
The bicycle_image_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch that was formatted to solve an image metric learning problem. The structure of the .zip file is as follows:
bicycle_image_metric_learning.zip
│ └───train.csv
│ │
│ └───images
│ └───181783211141_0.jpg
│ └───181596348104_1.jpg
│ └───171166528893_0.jpg
│ ...
The first three rows of the .csv file are as follows:
image | label | fold |
---|---|---|
181783211141_0.JPG | 181783211141 | 0 |
181596348104_1.JPG | 181596348104 | 2 |
171166528893_0.JPG | 171166528893 | 0 |
- In this example, the image column does not specify the data directory. Therefore, the data directory needs to be specified when uploading the dataset by selecting the images folder as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Image object detection
- Formats
- Format conversions
H2O Hydrogen Torch supports several dataset (data) formats for an image object detection experiment. Supported formats are as follows:
Hydrogen Torch format
The data following the Hydrogen Torch format for an image object detection experiment is structured as follows: a .zip file (1) containing a .pq file (Parquet) (2) and an image folder (3).
folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:
- A train .pq file needs to follow the format described in this section
- A validation .pq file needs to follow the same format as a train .pq file
- A test .pq file needs to follow the same format as a train .pq file, but does not require the class_id, x_min, x_max, y_min, and y_max columns
- (1) The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .pq file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - A class_id column containing the class names of each bounding box. Each row of the dataset should contain a list of class names, where each element in the list refers to a single box
  - x_min, x_max, y_min, and y_max columns containing the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a list of coordinates, where each element in the list refers to a single box
    Note:
    - The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
    - The lists in the class_id, x_min, x_max, y_min, and y_max columns need to have equal lengths, matching the total number of bounding boxes in the respective image. If an image has no boxes, all lists need to be empty.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- (3) An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
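The equal-length requirement for the list columns can be checked row by row before the .pq file is written. The following is a minimal sketch, assuming each row is available as a dict of Python lists; the helper name check_box_lists is hypothetical and not part of H2O Hydrogen Torch.

```python
BOX_COLUMNS = ("class_id", "x_min", "x_max", "y_min", "y_max")


def check_box_lists(row):
    """Return the number of boxes in a row, or raise if the list lengths differ."""
    lengths = {column: len(row[column]) for column in BOX_COLUMNS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"list lengths differ: {lengths}")
    return lengths["class_id"]


# Hypothetical rows: one image with two boxes and one image without boxes
with_boxes = {
    "class_id": ["wheat", "wheat"],
    "x_min": [689, 718], "x_max": [754, 768],
    "y_min": [884, 464], "y_max": [920, 516],
}
without_boxes = {column: [] for column in BOX_COLUMNS}
```

A row without boxes passes the check with a count of 0, because all five lists are empty.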
Example
The global_wheat_image_object_detection.zip file is a preprocessed dataset in H2O Hydrogen Torch that was formatted following the Hydrogen Torch format to solve an image object detection problem. The structure of the .zip file is as follows:
global_wheat_image_object_detection.zip
│ └───train.pq
│ │
│ └───images
│ └───7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg
│ └───3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg
│ └───37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg
│ ...
Three random rows from the .pq file are as follows:
image | class_id | x_min | y_min | x_max | y_max |
---|---|---|---|---|---|
7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg | ['wheat' 'wheat' 'wheat' ...] | [689 718 382 ...] | [884 464 42 ...] | [754 768 450 ...] | [920 516 101 ...] |
3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg | ['wheat' 'wheat' 'wheat' ...] | [924 698 904 ...] | [195 10 32 ...] | [981 763 938 ...] | [247 101 79 ...] |
37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg | ['wheat' 'wheat' 'wheat' ...] | [919 811 4 ...] | [535 820 96 ...] | [1024 912 71 ...] | [613 894 164 ...] |
- In this example, the image column does not specify the data directory. Therefore, the data directory needs to be specified when uploading the dataset by selecting the images folder as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.
Individual boxes format
The data following the individual boxes format for an image object detection experiment is structured as follows: a .zip file (1) containing a .csv file (2) and an image folder (3):
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described in this section
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require the class_id, x_min, x_max, y_min, and y_max columns
- (1) The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - A class_id column containing the class name of each box. Each row of the dataset should contain a single box
  - x_min, x_max, y_min, and y_max columns containing the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a single coordinate value for the corresponding bounding box
    Note:
    - The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
    - If an image has no boxes, the class_id, x_min, x_max, y_min, and y_max columns should be empty for that image.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- (3) An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
Example
image | x_min | y_min | x_max | y_max | class_id |
---|---|---|---|---|---|
bafc.jpg | 311 | 43 | 378 | 134 | wheat |
bafc.jpg | 276 | 83 | 354 | 153 | wheat |
bafc.jpg | 442 | 309 | 541 | 381 | wheat |
cryv.jpg | 301 | 13 | 328 | 124 | wheat |
cryv.jpg | 246 | 80 | 344 | 113 | wheat |
cryv.jpg | 432 | 303 | 341 | 181 | wheat |
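Because each box is defined by its upper-left and lower-right corners, a simple consistency check is that x_min is smaller than x_max and y_min is smaller than y_max. The following is a minimal sketch of that check, assuming the usual image coordinate convention in which y grows downward and assuming every row contains a box; the helper name invalid_boxes is hypothetical.

```python
def invalid_boxes(rows):
    """Return the rows whose corners are not ordered as upper-left then lower-right."""
    return [
        row
        for row in rows
        if not (row["x_min"] < row["x_max"] and row["y_min"] < row["y_max"])
    ]


# Two rows copied from the example table above
rows = [
    {"image": "bafc.jpg", "x_min": 311, "y_min": 43, "x_max": 378, "y_max": 134, "class_id": "wheat"},
    {"image": "cryv.jpg", "x_min": 432, "y_min": 303, "x_max": 341, "y_max": 181, "class_id": "wheat"},
]
flagged = invalid_boxes(rows)
```

With these inputs, the last cryv.jpg row is flagged, since its x_max (341) is smaller than its x_min (432) and its y_max (181) is smaller than its y_min (303).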
COCO format
The data following the COCO format for an image object detection experiment is structured as follows: a .zip file (1) containing a .json file (2) and an image folder (3):
folder_name.zip (1)
│ └───json_name.json (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:
- A train .json file needs to follow the format described in this section
- A validation .json file needs to follow the same format as a train .json file
- A test .json file needs to follow the same format as a train .json file, but does not require labels
- (1) The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .json file that contains labels in the COCO format.
- (3) A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
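For orientation, the following sketch shows the parts of a COCO annotation file that the COCO conversion on this page reads (images, categories, and annotations) and how a COCO box, stored as [x, y, width, height], maps to corner coordinates. The file content and the helper name coco_bbox_to_corners are hypothetical and only for illustration; real COCO files usually carry additional fields.

```python
def coco_bbox_to_corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, width, height = bbox
    return x, y, x + width, y + height


# Hypothetical minimal annotation file content, for illustration only
coco = {
    "images": [{"id": 1, "file_name": "bafc.jpg"}],
    "categories": [{"id": 1, "name": "wheat"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [311, 43, 67, 91]},
    ],
}

corners = [coco_bbox_to_corners(a["bbox"]) for a in coco["annotations"]]
```

Here the single box converts to x_min 311, y_min 43, x_max 378, and y_max 134.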
Pascal VOC format
The data following the Pascal VOC format for an image object detection experiment is structured as follows: a .zip file (1) containing a folder with .xml files with labels (2) and an image folder (3):
folder_name.zip (1)
│ └───xml_folder_name (2)
│ └───name_of_image.xml
│ └───name_of_image.xml
│ └───name_of_image.xml
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple folders with labels in the .zip file that you can use as train, validation, and test datasets:
- A train folder with labels needs to follow the format described in this section
- A validation folder with labels should have the same format as a train folder
- A test folder with labels should have the same format as a train folder, but labels are not required
- (1) The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A folder that contains .xml files with labels in the Pascal VOC format.
- (3) An image folder that contains all the images specified in the .xml files; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
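As an illustration of what a single Pascal VOC label file contains, the following sketch parses a small annotation with the standard library. The XML content and the helper name parse_voc are hypothetical; only the elements that the Pascal VOC conversion on this page reads (filename, object, name, and bndbox) are included, and real files usually carry more.

```python
from xml.etree import ElementTree

# Hypothetical annotation for a single image, for illustration only
VOC_XML = """
<annotation>
  <filename>bafc.jpg</filename>
  <object>
    <name>wheat</name>
    <bndbox><xmin>311</xmin><ymin>43</ymin><xmax>378</xmax><ymax>134</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return (file_name, boxes) where each box is (name, xmin, ymin, xmax, ymax)."""
    root = ElementTree.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        coords = [int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return root.findtext("filename"), boxes


file_name, boxes = parse_voc(VOC_XML)
```

Each .xml file describes one image, so a folder of such files yields one list of boxes per image.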
Individual Boxes to Hydrogen Torch format
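import pandas as pd

# Read the individual boxes dataset (one row per bounding box)
df = pd.read_csv("/data/train.csv")

# Group the boxes by image so that each row holds lists of values;
# images without any box (all values missing) get empty lists
df = df.groupby(["image_id"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()

# Write the processed dataset as a Parquet file
df[["image_id", "class_id", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)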
COCO to Hydrogen Torch format
import json

import pandas as pd


def get_object_detection(df):
    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])

    # COCO boxes are [x, y, width, height]; convert them to corner coordinates
    annotations["x_min"] = annotations["bbox"].map(lambda x: x[0]).astype(int)
    annotations["y_min"] = annotations["bbox"].map(lambda x: x[1]).astype(int)
    annotations["x_max"] = annotations["bbox"].map(lambda x: x[0] + x[2]).astype(int)
    annotations["y_max"] = annotations["bbox"].map(lambda x: x[1] + x[3]).astype(int)
    annotations = annotations[
        ["image_id", "category_id", "x_min", "y_min", "x_max", "y_max"]
    ]
    annotations["category_id"] = annotations["category_id"].astype(int)

    # Attach the class name of each box
    annotations = annotations.merge(
        categories[["id", "name"]].drop_duplicates(), left_on="category_id", right_on="id", how="left"
    )
    # Attach the file name of each image; the right join keeps images without boxes
    annotations = annotations.merge(
        images[["id", "file_name"]].drop_duplicates(), left_on="image_id", right_on="id", how="right"
    )
    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)
    return annotations


# Read data
with open("/data/COCO_train_annos.json", "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_object_detection(train)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
train_ann[["file_name", "name", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
Pascal VOC to Hydrogen Torch format
import glob
import os
from xml.etree import ElementTree
import pandas as pd
from tqdm import tqdm
observations = []
for xml in tqdm(glob.glob("/data/Annotations/*.xml")):
    tree = ElementTree.parse(xml)
    root = tree.getroot()
    objs = root.findall("object")
    for obj in objs:
        name = obj.find("name").text
        bndbox = obj.find("bndbox")
        xmin = float(bndbox.findtext("xmin")) - 1
        ymin = float(bndbox.findtext("ymin")) - 1
        xmax = float(bndbox.findtext("xmax"))
        ymax = float(bndbox.findtext("ymax"))
        try:
            img_name = root.findall("path")[0].text.split("/")[-1]
        except Exception:
            img_name = root.findall("filename")[0].text
        observations.append(
            (
                img_name,
                name,
                xmin,
                ymin,
                xmax,
                ymax,
            )
        )
df = pd.DataFrame(
    observations, columns=["image", "class_id", "x_min", "y_min", "x_max", "y_max"]
)
# Prepare the processed dataset
df = df.groupby(["image"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
df.to_parquet("/data/train.pq", engine="pyarrow", index=False)
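To see what the Pascal VOC loop above extracts from a single annotation, the following sketch parses one inline XML string with the standard library only. The XML content and the helper name parse_voc are made up for illustration. The script above subtracts 1 from xmin and ymin, presumably because Pascal VOC pixel coordinates are 1-based; the sketch keeps that behavior.

```python
from xml.etree import ElementTree

# Hypothetical annotation, shaped like a Pascal VOC file
VOC_XML = """
<annotation>
  <filename>img_001.jpg</filename>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return one (image, class, x_min, y_min, x_max, y_max) tuple per object."""
    root = ElementTree.fromstring(xml_text.strip())
    file_name = root.findtext("filename")
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append(
            (
                file_name,
                obj.findtext("name"),
                float(box.findtext("xmin")) - 1,
                float(box.findtext("ymin")) - 1,
                float(box.findtext("xmax")),
                float(box.findtext("ymax")),
            )
        )
    return rows


print(parse_voc(VOC_XML))  # [('img_001.jpg', 'dog', 47.0, 239.0, 195.0, 371.0)]
```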
Image semantic segmentation
- Formats
- Helper functions
- Format conversions
H2O Hydrogen Torch supports several dataset (data) formats for an image semantic segmentation experiment. Supported formats are as follows:
Hydrogen Torch format
The data following the Hydrogen Torch format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):
folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:
- A train .pq file needs to follow the format described above
- A validation .pq file needs to follow the same format as a train .pq file
- A test .pq file needs to follow the same format as a train .pq file, but does not need a class_id and rle_mask column

- The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .pq file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
  - A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
    Note: The class_id and rle_mask lists in each row must have the same length, equal to the total number of classes.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image semantic segmentation experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
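The row-level rules above (every row lists all classes, one mask string per class, empty string when a class is absent) can be checked before writing the .pq file. The following is a minimal sketch in plain Python; the function name check_semantic_row and the sample rows are hypothetical, and the check reflects only the rules stated above, not H2O Hydrogen Torch's own validation.

```python
def check_semantic_row(row, all_classes):
    """Return a list of problems found in one row (a dict of column values)."""
    problems = []
    if len(row["class_id"]) != len(row["rle_mask"]):
        problems.append("class_id and rle_mask differ in length")
    if set(row["class_id"]) != set(all_classes):
        problems.append("class_id does not list every class")
    if not all(isinstance(mask, str) for mask in row["rle_mask"]):
        problems.append("every mask must be a string (empty if absent)")
    return problems


classes = ["shoes", "pants"]
good = {"image": "img_0001.png", "class_id": ["shoes", "pants"], "rle_mask": ["5 2 9 1", ""]}
bad = {"image": "img_0002.png", "class_id": ["shoes"], "rle_mask": ["5 2"]}

print(check_semantic_row(good, classes))  # []
print(check_semantic_row(bad, classes))   # ['class_id does not list every class']
```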
Example
The fashion_image_semantic_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image semantic segmentation problem. The structure of the .zip file is as follows:
fashion_image_semantic_segmentation.zip
│ └───train.pq
│ │
│ └───images
│ └───img_0458.png
│ └───img_0604.png
│ └───img_0668.png
│ ...
Three random rows from the .pq file are as follows:
image | class_id | rle_mask |
---|---|---|
img_0458.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['180629 7 181447 17... |
img_0604.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['189672 2 190493 9... |
img_0668.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['108023 11 108848 11... |
- In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
COCO format
The data following the COCO format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):
folder_name.zip (1)
│ └───json_name.json (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:
- A train .json file needs to follow the format described above
- A validation .json file needs to follow the same format as a train .json file
- A test .json file needs to follow the same format as a train .json file, but does not require labels

- The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .json file that contains labels in a COCO format.
- A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder during an image semantic segmentation experiment.
RLE encoding and decoding functions
from typing import Tuple
import numpy as np
def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.
    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """
    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)
def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.
    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction
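The encoding produced by mask2rle is a sequence of "start length" pairs, where start is a 1-based pixel position and pixels are read column by column (top to bottom, then left to right). The following dependency-free sketch reproduces that encoding on a tiny mask so the string format is easy to inspect; rle_encode is a hypothetical illustrative name, not a Hydrogen Torch function.

```python
def rle_encode(mask):
    """Encode a mask given as a list of rows (height x width) of 0/1 values."""
    height, width = len(mask), len(mask[0])
    # Read pixels column by column (column-major order)
    pixels = [mask[r][c] for c in range(width) for r in range(height)]
    runs = []
    start = None
    for position, value in enumerate(pixels, start=1):  # 1-based positions
        if value and start is None:
            start = position
        elif not value and start is not None:
            runs += [start, position - start]
            start = None
    if start is not None:  # a run that reaches the last pixel
        runs += [start, len(pixels) - start + 1]
    return " ".join(str(r) for r in runs)


mask = [
    [1, 0, 1],
    [1, 0, 0],
]
print(rle_encode(mask))  # 1 2 5 1
```

Read column by column, the mask is 1 1 0 0 1 0, so the first run starts at position 1 with length 2 and the second starts at position 5 with length 1.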
`.csv` file with masks to Hydrogen Torch format
import pandas as pd
df = pd.read_csv("/data/train.csv")
# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()
df[["image_id", "class_id", "rle_mask"]].to_parquet(
"/data/train.pq", engine="pyarrow", index=False
)
COCO to Hydrogen Torch format
import json
import numpy as np
import pandas as pd
from pycocotools.coco import COCO
def get_semantic_segmentation(df, coco_path):
    coco = COCO(coco_path)
    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    images = images[["id", "file_name"]].drop_duplicates()
    images.columns = ["image_id", "file_name"]
    categories = categories[["id", "name"]].drop_duplicates()
    categories.columns = ["category_id", "name"]
    # Filter out _background_ class
    categories = categories[categories.name != "_background_"]
    # Build one row per (image, class) pair so that every image lists all classes
    all_labels = [
        pd.DataFrame({"file_name": x, "name": categories.name.unique()})
        for x in images.file_name.unique()
    ]
    all_labels = pd.concat(all_labels)
    all_labels = all_labels.merge(images).merge(categories).reset_index(drop=True)
    rles = []
    for idx, row in all_labels.iterrows():
        semantic_annotations = [
            x
            for x in df["annotations"]
            if x["image_id"] == row["image_id"]
            and int(x["category_id"]) == row["category_id"]
        ]
        if len(semantic_annotations) == 0:
            # No mask for this class in this image: use an empty string
            rles.append("")
            continue
        # Merge all instance masks of the class into a single semantic mask
        semantic_mask = np.max(
            [coco.annToMask(x) for x in semantic_annotations], axis=0
        )
        # mask2rle() is defined in "Helper functions" section
        rles.append(mask2rle(semantic_mask))
    all_labels["rle_mask"] = rles
    return all_labels
# Read data
train_path = "/data/COCO_train_annos.json"
with open(train_path, "r") as fp:
    train = json.load(fp)
# Parse COCO format
train_ann = get_semantic_segmentation(df=train, coco_path=train_path)
# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: x.to_list()).reset_index()
train_ann[["file_name", "name", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
3D image semantic segmentation
- Format
- Example
- Helper functions
The data for a 3D image semantic segmentation experiment needs to be structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):
folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:
- A train .pq file needs to follow the format described above
- A validation .pq file needs to follow the same format as a train .pq file
- A test .pq file needs to follow the same format as a train .pq file, but does not need a class_id and rle_mask column

- The available dataset connectors require the data for a 3D image semantic segmentation experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .pq file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
  - A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
    Note: The class_id and rle_mask lists in each row must have the same length, equal to the total number of classes.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image semantic segmentation experiment.
  Note: All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.
The covid_ct_image_semantic_segmentation_3d.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a 3D image semantic segmentation problem. The structure of the .zip file is as follows:
covid_ct_image_semantic_segmentation_3d.zip
│ └───train.pq
│ │
│ └───images
│ └───coronacases_org_001.npy
│ └───coronacases_org_002.npy
│ └───coronacases_org_003.npy
│ ...
Three random rows from the .pq file are as follows:
image | class_id | rle_mask |
---|---|---|
coronacases_org_001.npy | ['lung'] | ['171087 6 171095 7 171... |
coronacases_org_002.npy | ['lung'] | ['6439 8 6563 15 6689... |
coronacases_org_003.npy | ['lung'] | ['103983 1 119580 9 119... |
- In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
RLE encoding and decoding functions
from typing import Tuple
import numpy as np
def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.
    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """
    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)
def rle2mask_3d(mask_rle: str, shape: Tuple) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask (3D version).
    Args:
        mask_rle: RLE-encoded string
        shape: (height, width, depth) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """
    shape = [shape[1], shape[0], shape[2]]
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1] * shape[2], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape).transpose(2, 1, 0)
Image instance segmentation
- Formats
- Helper functions
- Format conversions
H2O Hydrogen Torch supports several dataset (data) formats for an image instance segmentation experiment. Supported formats are as follows:
Hydrogen Torch format
The data following the Hydrogen Torch format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):
folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:
- A train .pq file needs to follow the format described above
- A validation .pq file needs to follow the same format as a train .pq file
- A test .pq file needs to follow the same format as a train .pq file, but does not require a class_id and rle_mask column

- The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .pq file containing the following columns:
  - An image column containing the names of the images for the experiment, where each image has an image extension specified
    Note:
    - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
    - If the names of the images don't specify the data directory (location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - A class_id column containing the class names of each instance mask. Each row of the dataset should contain a list of class names, where each element in the list refers to a single mask instance.
  - A rle_mask column containing run-length-encoded (RLE) masks for each instance from the class_id column. Each row of the dataset should contain a list of RLE-encoded masks, where each element in the list refers to a single instance.
    Note: The class_id and rle_mask lists in each row must have the same length, equal to the number of instances in the respective image. If an image contains no instances, both lists need to be empty.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
- An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image instance segmentation experiment.
  Note: All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
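Unlike the semantic segmentation format, each list element here describes one instance, and an image without instances gets two empty lists. The following plain-Python sketch shows how per-instance records roll up into one row per image under those rules; the function name to_instance_rows and the sample file names and RLE strings are hypothetical.

```python
def to_instance_rows(image_names, instances):
    """Group (image, class_name, rle) records into one row per image."""
    rows = {
        name: {"image": name, "class_id": [], "rle_mask": []}
        for name in image_names
    }
    for image, class_name, rle in instances:
        rows[image]["class_id"].append(class_name)
        rows[image]["rle_mask"].append(rle)
    return list(rows.values())


rows = to_instance_rows(
    ["a.jpg", "b.jpg"],
    [("a.jpg", "car", "3 2"), ("a.jpg", "car", "10 4")],
)
for row in rows:
    print(row)
# {'image': 'a.jpg', 'class_id': ['car', 'car'], 'rle_mask': ['3 2', '10 4']}
# {'image': 'b.jpg', 'class_id': [], 'rle_mask': []}
```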
Example
The coco_image_instance_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image instance segmentation problem. The structure of the .zip file is as follows:
coco_image_instance_segmentation.zip
│ └───train.pq
│ │
│ └───images
│ └───000000151231.jpg
│ └───000000433826.jpg
│ └───000000061159.jpg
│ ...
Three random rows from the .pq file are as follows:
image_id | class_id | rle_mask |
---|---|---|
000000151231.jpg | ['car' 'car'] | ['91949 7 92375 14 92801... |
000000433826.jpg | ['car' 'car'] | ['224473 3 224952 4 22... |
000000061159.jpg | ['car' 'car'] | ['161665 9 162291 25... |
- In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
COCO format
The data following the COCO format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):
folder_name.zip (1)
│ └───json_name.json (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:
- A train .json file needs to follow the format described above
- A validation .json file needs to follow the same format as a train .json file
- A test .json file needs to follow the same format as a train .json file, but does not require labels

- The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .json file that contains labels in a COCO format.
- A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run an image instance segmentation experiment.
RLE encoding and decoding functions
from typing import Tuple
import numpy as np
def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.
    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """
    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)
def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.
    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction
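Decoding reverses the "start length" pairs: each pair marks a run of mask pixels in a flat, column-major array, which is then laid back out as rows and columns. The following dependency-free sketch mirrors what rle2mask does on a tiny example; rle_decode is a hypothetical illustrative name, not a Hydrogen Torch function.

```python
def rle_decode(mask_rle, shape):
    """Decode an RLE string into a list of rows for shape = (height, width)."""
    height, width = shape
    flat = [0] * (height * width)
    numbers = [int(n) for n in mask_rle.split()]
    for start, length in zip(numbers[0::2], numbers[1::2]):
        # Starts are 1-based, so shift by one for Python indexing
        for position in range(start - 1, start - 1 + length):
            flat[position] = 1
    # Flat index c * height + r holds the pixel at row r, column c
    return [[flat[c * height + r] for c in range(width)] for r in range(height)]


print(rle_decode("1 2 5 1", (2, 3)))  # [[1, 0, 1], [1, 0, 0]]
print(rle_decode("", (2, 2)))         # [[0, 0], [0, 0]]
```

An empty string decodes to an all-background mask.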
`.csv` file with masks to Hydrogen Torch format
import pandas as pd
df = pd.read_csv("/data/train.csv")
# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()
df[["image_id", "class_id", "rle_mask"]].to_parquet(
"/data/train.pq", engine="pyarrow", index=False
)
COCO to H2O Hydrogen Torch format
import json
import pandas as pd
from pycocotools.coco import COCO
def get_instance_segmentation(df, coco_path):
    coco = COCO(coco_path)
    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])
    rles = []
    for annotation in df["annotations"]:
        # mask2rle() is defined in "Helper functions" section
        mask = mask2rle(coco.annToMask(annotation))
        rles.append(mask)
    annotations["rle_mask"] = rles
    # Treat empty masks as missing so they are dropped when grouping
    annotations.loc[annotations.rle_mask == "", "rle_mask"] = float("nan")
    annotations = annotations[["image_id", "category_id", "rle_mask"]]
    annotations["category_id"] = annotations["category_id"].astype(int)
    annotations = annotations.merge(
        categories[["id", "name"]].drop_duplicates(),
        left_on="category_id",
        right_on="id",
        how="left",
    )
    # Right merge keeps images that have no annotations
    annotations = annotations.merge(
        images[["id", "file_name"]].drop_duplicates(),
        left_on="image_id",
        right_on="id",
        how="right",
    )
    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)
    return annotations
# Read data
train_path = "/data/COCO_train_annos.json"
with open(train_path, "r") as fp:
    train = json.load(fp)
# Parse COCO format
train_ann = get_instance_segmentation(df=train, coco_path=train_path)
# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
train_ann[["file_name", "name", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
Text
Text regression
- Formats
- Example
The data for a text regression experiment can be formatted following format 1 or 2.
Format 1: A .csv file.
csv_name.csv (1)(2)
Format 2: A .zip file containing a .csv file.
folder_name.zip (1)
│ └───csv_name.csv (2)
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

- The available dataset connectors require the data for a text regression experiment to be in a .zip or .csv file.
  Note: To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .csv file containing the following columns:
  - A text column containing the texts for the experiment
  - One or more label columns containing the numerical labels (targets)
    Note: H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
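As an illustration of the column layout above, the following sketch writes a small .csv with a text column, one numerical label column, and an optional fold column using only the standard library. The column names (text, rating, fold), the sample texts, and the round-robin fold assignment are assumptions made for the example; H2O Hydrogen Torch lets you choose the relevant columns when importing the dataset.

```python
import csv
import io


def build_csv(records, n_folds):
    """Return CSV text with text, rating, and fold columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "rating", "fold"])
    for index, (text, rating) in enumerate(records):
        # Round-robin fold indexes: 0, 1, ..., n_folds - 1, 0, 1, ...
        writer.writerow([text, rating, index % n_folds])
    return buffer.getvalue()


records = [("first query", 0.2), ("second query", 1.0), ("third query", 0.6)]
print(build_csv(records, n_folds=2))
# text,rating,fold
# first query,0.2,0
# second query,1.0,1
# third query,0.6,0
```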
The wellformed_query_text_regression.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text regression problem.
Two random rows from the .csv file are as follows:
rating | text |
---|---|
0.2 | The European Union includes how many ? |
1.0 | What is released when an ion is formed ? |
- The rating column refers to the label column.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Text classification
- Formats
- Example
The data for a text classification experiment can be formatted following format 1 or 2.
Format 1: A .csv file.
csv_name.csv (1)(2)
Format 2: A .zip file containing a .csv file.
folder_name.zip (1)
│ └───csv_name.csv (2)
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

- The available dataset connectors require the data for a text classification experiment to be in a .zip or .csv file.
  Note: To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .csv file containing the following columns:
  - A text column containing the texts for the experiment
  - One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
    Note:
    - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, while the classes represent unique labels for multi-label problems. For N label class columns, in multi-class problems, only a single column could be set to 1, while in multi-label problems, all or none could be set to 1.
    - For binary classification experiments utilizing precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns, the experiment is treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
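The two labeling rules above can be sketched in plain Python: mapping string labels to 0/1 for a binary problem, and building indicator columns for a multi-label problem where any number of classes (including none) can be set to 1. The function names and sample labels are hypothetical and chosen only for illustration.

```python
def encode_binary_labels(labels, positive):
    """Map string labels to 0/1 so binary metrics such as F1 are computed correctly."""
    return [1 if label == positive else 0 for label in labels]


def to_indicator_columns(label_sets, classes):
    """Build one 0/1 column per class for multi-label data (one row per sample)."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_sets]


print(encode_binary_labels(["Positive", "Negative", "Positive"], positive="Positive"))
# [1, 0, 1]

print(to_indicator_columns([{"sports"}, set(), {"sports", "politics"}], ["sports", "politics"]))
# [[1, 0], [0, 0], [1, 1]]
```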
The amazon_reviews_text_classification.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text classification problem.
The first two rows of the .csv file are as follows:
text | label |
---|---|
GREAT!!!!! Review: I got this toy a couple of days ago and I ABSOLUTELY LOVE IT! It is so much more realistic looking than my other baby born comfort seat. All though I dont have a baby born I had one before but I sold it at a garage sale. So I use It for my berenguar baby doll. And it even has the buckle that goes across the shoulder like a real babies car seat!!!! DEFFINATELY WORTH THE MONEY!!!!!! | Positive |
This Or "Dixie Chicken" Presents Them At A Peak Review: Though lyrically the overall feel of this record is slightly provincial, it can still transport me to places I wanna be. Musically, this pop product from California is stylistically consistent. Yet the instrumentation is diverse and each member is resourceful. But it's Lowell George's vocals and slide guitar that are primarily at the center. He's not flashy and that's a positive. You get treated to 12-bar blues, a song of prescription meds for tripping and a blues with an accordian.But the three highlights are "Easy To Slip", a jaunty acoustic/electric number about lighting up and the sheer joy that memory drifting can project, "Teenage Nervous Breakdown" in which they switch to the domain of energy-driven rock and roll and the title track, a leisurely-paced country blues in which a generous helping of background vocals provides just the right amount of tension. | Positive |
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Text sequence to sequence
- Formats
- Example
The data for a text sequence to sequence experiment can be formatted following format 1 or 2.
Format 1: A .csv file.
csv_name.csv (1)(2)
Format 2: A .zip file containing a .csv file.
folder_name.zip (1)
│ └───csv_name.csv (2)
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require an output_text column

- The available dataset connectors require the data for a text sequence to sequence experiment to be in a .zip or .csv file.
  Note: To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- A .csv file containing the following columns:
  - An input-text column containing the input texts
  - An output-text column containing the output texts
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, …, N-1 values or 1, 2, 3, …, N values) or categorical values.
The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text sequence to sequence problem. The structure of the .zip file is as follows:
cnn_dailymail_text_sequence_to_sequence.zip
│ └───train.csv
A random row from the .csv file is as follows:
text | summary | id |
---|---|---|
It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It's a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he wants to. "While I believe I have the authority to carry out this military action without specific congressional authorization, I know that the country will be stronger if we take this course, and our actions will be even more effective," he said. "We should have this debate, because the issues are too big for business as usual." Obama said top congressional leaders had agreed to schedule a debate when the body returns to Washington on September 9. The Senate Foreign Relations Committee will hold a hearing over the matter on Tuesday, Sen. Robert Menendez said. Transcript: Read Obama's full remarks . Syrian crisis: Latest developments . U.N. inspectors leave Syria . Obama's remarks came shortly after U.N. inspectors left Syria, carrying evidence that will determine whether chemical weapons were used in an attack early last week in a Damascus suburb. "The aim of the game here, the mandate, is very clear -- and that is to ascertain whether chemical weapons were used -- and not by whom," U.N. 
spokesman Martin Nesirky told reporters on Saturday. But who used the weapons in the reported toxic gas attack in a Damascus suburb on August 21 has been a key point of global debate over the Syrian crisis. Top U.S. officials have said there's no doubt that the Syrian government was behind it, while Syrian officials have denied responsibility and blamed jihadists fighting with the rebels. British and U.S. intelligence reports say the attack involved chemical weapons, but U.N. officials have stressed the importance of waiting for an official report from inspectors. The inspectors will share their findings with U.N. Secretary-General Ban Ki-moon Ban, who has said he wants to wait until the U.N. team's final report is completed before presenting it to the U.N. Security Council. The Organization for the Prohibition of Chemical Weapons, which nine of the inspectors belong to, said Saturday that it could take up to three weeks to analyze the evidence they collected. "It needs time to be able to analyze the information and the samples," Nesirky said. He noted that Ban has repeatedly said there is no alternative to a political solution to the crisis in Syria, and that "a military solution is not an option." Bergen: Syria is a problem from hell for the U.S. Obama: 'This menace must be confronted' Obama's senior advisers have debated the next steps to take, and the president's comments Saturday came amid mounting political pressure over the situation in Syria. Some U.S. lawmakers have called for immediate action while others warn of stepping into what could become a quagmire. Some global leaders have expressed support, but the British Parliament's vote against military action earlier this week was a blow to Obama's hopes of getting strong backing from key NATO allies. On Saturday, Obama proposed what he said would be a limited military action against Syrian President Bashar al-Assad. Any military attack would not be open-ended or include U.S. ground forces, he said. 
Syria's alleged use of chemical weapons earlier this month "is an assault on human dignity," the president said. A failure to respond with force, Obama argued, "could lead to escalating use of chemical weapons or their proliferation to terrorist groups who would do our people harm. In a world with many dangers, this menace must be confronted." Syria missile strike: What would happen next? Map: U.S. and allied assets around Syria . Obama decision came Friday night . On Friday night, the president made a last-minute decision to consult lawmakers. What will happen if they vote no? It's unclear. A senior administration official told CNN that Obama has the authority to act without Congress -- even if Congress rejects his request for authorization to use force. Obama on Saturday continued to shore up support for a strike on the al-Assad government. He spoke by phone with French President Francois Hollande before his Rose Garden speech. "The two leaders agreed that the international community must deliver a resolute message to the Assad regime -- and others who would consider using chemical weapons -- that these crimes are unacceptable and those who violate this international norm will be held accountable by the world," the White House said. Meanwhile, as uncertainty loomed over how Congress would weigh in, U.S. military officials said they remained at the ready. 5 key assertions: U.S. intelligence report on Syria . Syria: Who wants what after chemical weapons horror . Reactions mixed to Obama's speech . A spokesman for the Syrian National Coalition said that the opposition group was disappointed by Obama's announcement. "Our fear now is that the lack of action could embolden the regime and they repeat his attacks in a more serious way," said spokesman Louay Safi. "So we are quite concerned." Some members of Congress applauded Obama's decision. 
House Speaker John Boehner, Majority Leader Eric Cantor, Majority Whip Kevin McCarthy and Conference Chair Cathy McMorris Rodgers issued a statement Saturday praising the president. "Under the Constitution, the responsibility to declare war lies with Congress," the Republican lawmakers said. "We are glad the president is seeking authorization for any military action in Syria in response to serious, substantive questions being raised." More than 160 legislators, including 63 of Obama's fellow Democrats, had signed letters calling for either a vote or at least a "full debate" before any U.S. action. British Prime Minister David Cameron, whose own attempt to get lawmakers in his country to support military action in Syria failed earlier this week, responded to Obama's speech in a Twitter post Saturday. "I understand and support Barack Obama's position on Syria," Cameron said. An influential lawmaker in Russia -- which has stood by Syria and criticized the United States -- had his own theory. "The main reason Obama is turning to the Congress: the military operation did not get enough support either in the world, among allies of the US or in the United States itself," Alexei Pushkov, chairman of the international-affairs committee of the Russian State Duma, said in a Twitter post. In the United States, scattered groups of anti-war protesters around the country took to the streets Saturday. "Like many other Americans...we're just tired of the United States getting involved and invading and bombing other countries," said Robin Rosecrans, who was among hundreds at a Los Angeles demonstration. What do Syria's neighbors think? Why Russia, China, Iran stand by Assad . Syria's government unfazed . After Obama's speech, a military and political analyst on Syrian state TV said Obama is "embarrassed" that Russia opposes military action against Syria, is "crying for help" for someone to come to his rescue and is facing two defeats -- on the political and military levels. 
Syria's prime minister appeared unfazed by the saber-rattling. "The Syrian Army's status is on maximum readiness and fingers are on the trigger to confront all challenges," Wael Nader al-Halqi said during a meeting with a delegation of Syrian expatriates from Italy, according to a banner on Syria State TV that was broadcast prior to Obama's address. An anchor on Syrian state television said Obama "appeared to be preparing for an aggression on Syria based on repeated lies." A top Syrian diplomat told the state television network that Obama was facing pressure to take military action from Israel, Turkey, some Arabs and right-wing extremists in the United States. "I think he has done well by doing what Cameron did in terms of taking the issue to Parliament," said Bashar Jaafari, Syria's ambassador to the United Nations. Both Obama and Cameron, he said, "climbed to the top of the tree and don't know how to get down." The Syrian government has denied that it used chemical weapons in the August 21 attack, saying that jihadists fighting with the rebels used them in an effort to turn global sentiments against it. British intelligence had put the number of people killed in the attack at more than 350. On Saturday, Obama said "all told, well over 1,000 people were murdered." U.S. Secretary of State John Kerry on Friday cited a death toll of 1,429, more than 400 of them children. No explanation was offered for the discrepancy. Iran: U.S. military action in Syria would spark 'disaster' Opinion: Why strikes in Syria are a bad idea . | Syrian official: Obama climbed to the top of the tree, "doesn't know how to get down" Obama sends a letter to the heads of the House and Senate . Obama to seek congressional approval on military action against Syria . Aim is to determine whether CW were used, not by whom, says U.N. spokesman . | 0001d1afc246a7964130f43ae940af6bc6c57f01 |
- In this example, the text column refers to the input-text column, while the summary column refers to the output-text column.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Text span prediction
- Formats
- Example
The data for a text span prediction experiment can be formatted following format 1 or 2.
- Format 1
- Format 2
A .csv file.
csv_name.csv (1)(2)
A .zip file containing a .csv file.
folder_name.zip (1)
│ └───csv_name.csv (2)
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require an answer column

- (1) The available dataset connectors require the data for a text span prediction experiment to be in a .zip or .csv file.
  Note: To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - A context column containing the input texts
  - A question column containing the questions (that the input context text can answer)
  - An answer column containing the substrings from the context column that answer the questions (question column)
  - An optional answer-start column containing the start of the substring answers in the context column
    Note:
    - The start of the substring answers needs to be specified by integers representing the index where the answer starts in the context.
    - If you do not provide an answer-start column, H2O Hydrogen Torch selects the first occurrence of the answer in the context.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
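The optional answer-start column described above can be precomputed before upload. The following is a minimal sketch, not the method H2O Hydrogen Torch uses internally: it assumes zero-based character indexes, and the column name `answer_start` and the helper `add_answer_start` are hypothetical names chosen for illustration.

```python
def add_answer_start(rows):
    """Return copies of the rows with an integer 'answer_start' index added."""
    out = []
    for row in rows:
        # str.find returns the index of the first occurrence, or -1 if absent
        start = row["context"].find(row["answer"])
        if start < 0:
            raise ValueError(f"answer not found in context: {row['answer']!r}")
        out.append({**row, "answer_start": start})
    return out


rows = [
    {
        "question": "Where is the Grotto?",
        "context": "Immediately behind the basilica is the Grotto.",
        "answer": "behind the basilica",
    }
]
result = add_answer_start(rows)
```

Raising an error for answers that do not occur in the context makes formatting problems visible before the dataset is imported.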
The squad_text_span_prediction.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text span prediction problem. The structure of the .zip file is as follows:
squad_text_span_prediction.zip
│ └───squad_v1.csv
The following is a random row from the .csv file:
question | context | answer |
---|---|---|
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? | Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. | Saint Bernadette Soubirous |
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Text token classification
- Formats
- Example
- Conversions
The data for a text token classification experiment can be formatted following format 1 or 2.
- Format 1
- Format 2
A .pq (parquet) file.
parquet_name.pq (1)(2)
A .zip file containing a .pq (parquet) file.
folder_name.zip (1)
│ └───parquet_name.pq (2)
You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:
- A train .pq file needs to follow the format described above
- A validation .pq file needs to follow the same format as a train .pq file
- A test .pq file needs to follow the same format as a train .pq file, but does not require a label column

- (1) The available dataset connectors require the data for a text token classification experiment to be in a .zip or .pq file.
  Note: To learn how to upload your .zip or .pq file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .pq file containing the following columns:
  - A text column containing tokenized text; each sample should have a list of string tokens
  - A label column containing token labels for the tokenized text; each sample should have a list of token labels. Labels should be represented as categorical string values
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
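Before writing the parquet file, it can help to sanity-check the text and label columns. The sketch below assumes that each sample carries exactly one string label per token, which is how token labels are usually aligned but is not stated explicitly above; the helper `check_token_rows` and the column names `text` and `label` are hypothetical.

```python
def check_token_rows(rows, text_col="text", label_col="label"):
    """Return the indexes of rows whose tokens and labels are not aligned."""
    bad = []
    for i, row in enumerate(rows):
        tokens, labels = row[text_col], row[label_col]
        same_length = len(tokens) == len(labels)
        all_strings = all(isinstance(label, str) for label in labels)
        if not (same_length and all_strings):
            bad.append(i)
    return bad


rows = [
    {"text": ["Nijmeh", "of", "Lebanon"], "label": ["B-ORG", "O", "B-LOC"]},
    {"text": ["Saudi", "Arabia"], "label": ["B-LOC"]},  # one label is missing
]
bad_rows = check_token_rows(rows)
```

Rows reported by the check can then be fixed or dropped before the dataframe is saved.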
The conll2003_text_token_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text token classification problem. The structure of the .zip file is as follows:
conll2003_text_token_classification.zip
│ └───test.pq
│ └───train.pq
│ └───validation.pq
The following is a random row from the train.pq file:
id | text | pos_tags | chunk_tags | ner_tags |
---|---|---|---|---|
4158 | ['Nijmeh' 'of' 'Lebanon' 'beat' 'Nasr' 'of' 'Saudi' 'Arabia' '1-0' '(' 'halftime' '1-0' ')' 'in' 'their' 'Asian' 'club' 'championship' 'second' 'round' 'first' 'leg' 'tie' 'on' 'Saturday' '.'] | ['NNS' 'IN' 'NNP' 'VBD' 'NNP' 'IN' 'NNP' 'NNP' 'NNP' '(' 'NN' 'CD' ')' 'IN' 'PRP$' 'JJ' 'NN' 'NN' 'NN' 'NN' 'JJ' 'NN' 'NN' 'IN' 'NNP' '.'] | ['B-NP' 'B-VP' 'B-VP' 'I-VP' 'B-NP' 'I-NP' 'B-PP' 'B-NP' 'O' 'O' 'B-NP' 'B-NP' 'I-NP' 'I-NP' 'B-PP' 'B-NP' 'I-NP' 'B-NP' 'I-NP' 'B-VP' 'B-NP' 'B-PP' 'B-VP' 'O'] | ['B-ORG' 'O' 'B-LOC' 'O' 'B-ORG' 'O' 'B-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-MISC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O'] |
- The *_tags columns refer to the label column and can only be selected when running a text token classification experiment. Only one column from the available label columns can be selected when running an experiment.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Convert `CoNLL-2003` dataset
from pathlib import Path
import pandas as pd
try:
    import datasets
except ImportError:
    raise ImportError("Need datasets>=1.11.0 to download English CoNLL2003 data!")
dataset = datasets.load_dataset("conll2003")
for subset in dataset:
    out_path = Path(f"/data/conll2003/{subset}.pq")
    out_path.parent.mkdir(exist_ok=True, parents=True)
    df = pd.DataFrame(dataset[subset])
    # Decode the label encoded labels
    for feature in dataset[subset].features:
        if isinstance(dataset[subset].features[feature], datasets.Sequence):
            feat = dataset[subset].features[feature].feature
            if isinstance(feat, datasets.ClassLabel):
                df[feature] = df[feature].apply(feat.int2str)
    df.rename(columns={"tokens": "text"}, inplace=True)
    df.to_parquet(out_path, engine="pyarrow", index=False)
Text metric learning
- Formats
- Example
The data for a text metric learning experiment can be formatted following format 1 or 2.
- Format 1
- Format 2
A .csv file.
csv_name.csv (1)(2)
A .zip file containing a .csv file.
folder_name.zip (1)
│ └───csv_name.csv (2)
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require a label column

- (1) The available dataset connectors require the data for a text metric learning experiment to be in a .zip or .csv file.
  Note: To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - A text column containing the input texts
  - A label column containing the class names
    Note: Texts that are similar should have the same class name.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
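As an illustration of the column layout above, the following sketch serializes a few text and label pairs to CSV and adds an optional fold column. It uses a simple round-robin assignment with fold indexes 0 to N-1; this is only one possible way to build folds (it does not stratify by label), and the helper `write_with_folds` and the sample texts are hypothetical.

```python
import csv
import io


def write_with_folds(samples, n_folds=3):
    """Serialize (text, label) pairs to CSV text with a round-robin fold column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["text", "label", "fold"])
    for i, (text, label) in enumerate(samples):
        writer.writerow([text, label, i % n_folds])  # folds 0 .. n_folds - 1
    return buffer.getvalue()


samples = [
    ("how do I install a package ?", "install"),
    ("how can I add new software ?", "install"),
    ("my wifi keeps dropping", "network"),
    ("wireless connection is unstable", "network"),
]
csv_text = write_with_folds(samples, n_folds=2)
```

Similar texts share a class name in the label column, as required for metric learning.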
The ubuntu_text_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text metric learning problem. The structure of the .zip file is as follows:
ubuntu_text_metric_learning.zip
│ └───train.csv
│ └───test.csv
The following is a random row from the train.csv file:
text | label | fold |
---|---|---|
what is the easiest way to strip a desktop edition to a server edition ? | 16 | 1 |
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Audio
Audio regression
- Format
- Example
The data for an audio regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)

- (1) The available dataset connectors require the data for an audio regression experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An audio column containing the names of the audio files for the experiment, where each audio file has an audio extension specified
    Note:
    - Audio files can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
    - Suppose the names of the audio files don't specify the data directory (location of the audio files in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - One or more label columns containing the numerical labels (targets)
    Note: H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting an audio regression experiment.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
- (3) An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audio files in this folder to run the audio regression experiment.
  Note: All audio files need to have an audio extension. Audio files can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
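Because every file named in the audio column must exist in the audio folder, a quick consistency check on the finished ZIP file can catch mistakes before upload. The following sketch uses only the Python standard library; the helper `missing_audio_files`, the file names, and the folder name `audios` are hypothetical, and the demonstration archive contains empty placeholder files rather than real audio.

```python
import csv
import io
import zipfile


def missing_audio_files(zip_bytes, csv_name, audio_col, data_folder):
    """Return audio names listed in the CSV that are absent from the data folder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        members = set(zf.namelist())
        with zf.open(csv_name) as fh:
            reader = csv.DictReader(io.TextIOWrapper(fh, encoding="utf-8", newline=""))
            names = [row[audio_col] for row in reader]
    return [name for name in names if f"{data_folder}/{name}" not in members]


# Build a tiny in-memory ZIP to exercise the check (placeholder bytes, not real audio)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("meta.csv", "audio,label,fold\na.ogg,2,0\nb.ogg,9,1\n")
    zf.writestr("audios/a.ogg", b"")
missing = missing_audio_files(buf.getvalue(), "meta.csv", "audio", "audios")
```

For a ZIP file on disk, the same function can be called with the bytes read from that file.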
The amnist_audio_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an audio regression problem. The .zip file contains a .csv file and an audio folder. The structure of the .zip file is:
amnist_audio_regression.zip
│ └───amnist_meta.csv
│ │
│ └───amnist_audios
│ └───0_01_0.ogg
│ └───0_01_1.ogg
│ └───0_01_2.ogg
│ ...
The first three rows of the .csv file are:
audio | label | fold |
---|---|---|
2_26_2.ogg | 2 | 0 |
2_26_38.ogg | 2 | 1 |
9_26_47.ogg | 9 | 2 |
- In this example, the data directory in the audio column is not specified. That being the case, it needs to be specified when uploading the dataset, and the amnist_audios folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Audio classification
- Format
- Example
The data for an audio classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)

- (1) The available dataset connectors require the data for an audio classification experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An audio column containing the names of the audio files for the experiment, where each audio file has an audio extension specified
    Note:
    - Audio files can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
    - Suppose the names of the audio files don't specify the data directory (location of the audio files in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  - One or more label columns containing either multi-class labels (one-hot encoded) or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
    Note:
    - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. The classes are mutually exclusive in multi-class problems, while the classes represent unique labels for multi-label problems. For N label class columns, exactly one column can be set to 1 in multi-class problems, while in multi-label problems any number of columns (including all or none) can be set to 1.
    - For binary classification experiments utilizing precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using a one-hot encoder. As a result, the experiment is treated as a multi-class classification experiment, which leads to an incorrect calculation of the precision, recall, F1, F05, or F2 metric.
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
- (3) An audio folder that contains all the audio files specified in the audio column above; H2O Hydrogen Torch uses the audio files in this folder to run the audio classification experiment.
  Note: All audio files need to have an audio extension. Audio files can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
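For binary classification with precision, recall, or F-score metrics, the note above requires 0/1 values in the label column. A minimal sketch of that conversion follows; the helper `to_binary_labels` and the choice of which class counts as positive are assumptions made for illustration.

```python
def to_binary_labels(labels, positive_class):
    """Map string class names to 0/1 integers, with positive_class mapped to 1."""
    classes = set(labels)
    if len(classes) > 2:
        raise ValueError(f"expected at most two classes, got {sorted(classes)}")
    return [1 if label == positive_class else 0 for label in labels]


labels = ["dog", "chainsaw", "dog"]
binary = to_binary_labels(labels, positive_class="dog")
```

The resulting list can replace the string label column in the .csv file before the dataset is zipped.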
The esc10_audio_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multi-class audio classification problem. The structure of the .zip file is:
esc10_audio_classification.zip
│ └───esc10_meta.csv
│ │
│ └───audio_esc10
│ └───2-37806-B-40.wav
│ └───5-200339-A-1.wav
│ └───1-172649-D-40.wav
│ ...
The first three rows of the .csv file are:
filename | fold | label |
---|---|---|
1-100032-A-0.wav | 0 | dog |
1-110389-A-0.wav | 0 | dog |
1-116765-A-41.wav | 0 | chainsaw |
- In this example, the data directory in the filename column is not specified. That being the case, it needs to be specified when uploading the dataset, and the audio_esc10 folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Speech
Speech recognition
- Format
- Example
The data for a speech recognition experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).
folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:
- A train .csv file needs to follow the format described above
- A validation .csv file needs to follow the same format as a train .csv file
- A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)

- (1) The available dataset connectors require the data for a speech recognition experiment to be in a .zip file.
  Note: To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
- (2) A .csv file containing the following columns:
  - An audio column containing the names of the audio files for the experiment, where each audio file has an audio extension specified
    Note:
    - To learn about supported audio extensions for a speech recognition experiment, see Supported audio extensions for speech recognition.
    - Suppose the names of the audio files don't specify the data directory (location of the audio files in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    Tip: For most supported speech architectures, use speech audio files of up to 30 seconds. Attempting to train with longer speech samples may lead to:
    - Out-of-memory (OOM) issues, even on high-VRAM GPUs
    - Poor training performance
  - One label column containing the text transcript of the audio
  - An optional fold column containing cross-validation fold indexes
    Note: The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
- (3) An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audio files in this folder to run the experiment.
  Note: All audio files need to have an audio extension. To learn about supported audio extensions, see Supported audio extensions for speech recognition.
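To follow the 30-second recommendation from the tip above, the duration of each .wav file can be measured before the dataset is assembled. The sketch below relies on Python's standard `wave` module; the helper `wav_duration_seconds` is a hypothetical name, and the one-second silent clip exists only to make the example self-contained.

```python
import io
import wave


def wav_duration_seconds(wav_file):
    """Return the duration of a .wav file (path or file object) in seconds."""
    with wave.open(wav_file, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


# Create one second of 8 kHz, 16-bit mono silence in memory for the demonstration
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(8000)
    wf.writeframes(b"\x00\x00" * 8000)
buf.seek(0)

duration = wav_duration_seconds(buf)
too_long = duration > 30  # threshold taken from the tip above
```

Files flagged as too long could be split or excluded, and the measured value could be stored in an extra column of the .csv file for later filtering.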
The minds14_US_speech_recognition.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a speech recognition problem. The structure of the .zip file is:
minds14-US_speech_recognition.zip
│ └───annotations.csv
│ └───audio
│ └───0.wav
│ └───1.wav
│ └───2.wav
│ ...
The first three rows of the .csv file are:
file | transcript | duration |
---|---|---|
0.wav | I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER [...] | 11 |
1.wav | I'M WONDERING HOW TO SET UP A JOINT ACCOUNT WITH MY WIFE [...] | 7 |
2.wav | HI I'D LIKE TO SET UP A JOINT ACCOUNT WIH MY PARTNER I'M NOT SEEING [...] | 24 |
- The duration column is not a required column when formatting your dataset for a speech recognition experiment.
- In this example, the data directory in the file column is not specified. That being the case, it needs to be specified when uploading the dataset, and the audio folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
- To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
Data collection
Example 1: Amazon S3
The following Python script parses the folder structure of an Amazon S3 bucket and collects images into a single dataset. In other words, the script demonstrates how to create a new dataset (ZIP file) from several files in S3 and then upload it back to S3.
# Import libraries
# We use `boto3` to connect to S3
# Optionally `tqdm` can be used to show download progress
# We use pandas for data manipulation
from boto3.session import Session
from tqdm import tqdm
import pandas as pd
import shutil
import os

# Set AWS credentials
# You can set them directly or use the environment variables, if those are set
aws_access_key = os.environ["AWS_ACCESS_KEY_ID"]
aws_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]

# Set list of bucket paths that contain image files
images_bucket_subfolders = ["h2o-release/hydrogen-torch/data-prep"]
# Set path to the train CSV
csv_path = "h2o-release/hydrogen-torch/data-prep/csvs/train.csv"
# Set allowed file extensions (a tuple, so it can be passed to str.endswith)
allowed_extensions = (".jpg", ".jpeg", ".png")

# Files will be downloaded to data folder
data_folder = "data"
image_folder = f"{data_folder}/images"
os.makedirs(image_folder, exist_ok=True)

# Connect to S3
s3 = Session(aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key).resource("s3")

# Download train.csv
bucket, csv_path = csv_path.split("/", 1)
output_csv_path = f"{data_folder}/{os.path.basename(csv_path)}"
s3.Bucket(bucket).download_file(csv_path, output_csv_path)

# Make sure the "Image Column" contains only file names, not full paths
image_col = "image"
data = pd.read_csv(output_csv_path)
data[image_col] = data[image_col].map(os.path.basename)
data.to_csv(output_csv_path, index=False)

# Download all image files
for images_bucket_subfolder in images_bucket_subfolders:
    if "/" in images_bucket_subfolder:
        bucket, subfolder = images_bucket_subfolder.split("/", 1)
    else:
        bucket, subfolder = images_bucket_subfolder, ""
    s3_bucket = s3.Bucket(bucket)
    # List the whole bucket, or only the objects under the subfolder prefix
    if subfolder:
        files = s3_bucket.objects.filter(Prefix=f"{subfolder}/")
    else:
        files = s3_bucket.objects.all()
    files = list(files)
    for file in tqdm(files):
        if file.key.endswith(allowed_extensions):
            s3_bucket.download_file(file.key, f"{image_folder}/{os.path.basename(file.key)}")

# Create ZIP file that can be imported to H2O Hydrogen Torch
zip_file_name = "flowers_image_classification"
full_zip_file_name = shutil.make_archive(zip_file_name, "zip", data_folder)

# Set desired S3 path where to upload the ZIP file in format "bucket_name" or "bucket_name/subfolder_1/.../subfolder_n"
upload_bucket_path = "YOUR_BUCKET_NAME/SUB_FOLDER"

# Upload the ZIP file (use the path returned by make_archive as the local source)
rel_zip_file_name = os.path.basename(full_zip_file_name)
upload_path = f"{upload_bucket_path}/{rel_zip_file_name}"
upload_bucket, upload_zip_path = upload_path.split("/", 1)
s3.Bucket(upload_bucket).upload_file(full_zip_file_name, upload_zip_path)
Supported audio extensions for speech recognition
For speech recognition, H2O Hydrogen Torch supports the following audio extension:
- Uncompressed: .wav
Supported audio extensions for audio processing
The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:
- Uncompressed: .wav, .aiff
- Lossless compressed: .flac
- Lossy compressed: .mp3, .ogg
Supported image extensions for image processing
The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:
- Windows bitmaps: .bmp
- JPEG files: .jpeg, .jpg, .jpe
- JPEG 2000 files: .jp2
- Portable Network Graphics: .png
- WebP: .webp
- Portable image format: .pbm, .pgm, .ppm, .pnm
- TIFF files: .tiff, .tif
- Radiance HDR: .hdr
- NumPy data array: .npy
  Note: For 2D image processing, the data must be of shape [height, width, channels].
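The list of supported image extensions can be turned into a simple pre-upload check. The sketch below flags file names whose extension is not in that list; matching extensions case-insensitively is an assumption (the list above only shows lowercase extensions), and the helper `unsupported_images` and the sample file names are hypothetical.

```python
from pathlib import PurePosixPath

# Extensions copied from the list of supported image extensions above
SUPPORTED_IMAGE_EXTENSIONS = {
    ".bmp", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pnm", ".tiff", ".tif", ".hdr", ".npy",
}


def unsupported_images(names):
    """Return the names whose extension is not in the supported set."""
    return [
        name
        for name in names
        if PurePosixPath(name).suffix.lower() not in SUPPORTED_IMAGE_EXTENSIONS
    ]


flagged = unsupported_images(["cat.JPG", "scan.tif", "icon.gif", "volume.npy"])
```

The same pattern applies to the audio extension lists by swapping in the corresponding set.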
Supported 3D image extensions for 3D image processing
For 3D image problem types, H2O Hydrogen Torch supports the following 3D image extension:
- NumPy data array: .npy
  Note: For 3D image processing, the data must be of shape [height, width, depth, channels].