Version: v1.3.0

Dataset formats

Overview

Before you start an experiment, you need to format (prepare) your dataset according to the problem type you want to solve. Below, you can find instructions on formatting your dataset for each supported problem type.

With H2O Label Genie (a Wave application in H2O AI Cloud), you can label your image, text, and audio data to generate annotated datasets supported in H2O Hydrogen Torch. To learn more, see H2O Label Genie | Docs.

Note

To learn how to import a formatted (preprocessed) dataset, see Import a dataset.

Image

Image regression

Format
Example

The data for an image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

The available dataset connectors require the data for an image regression experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing the numerical labels (targets)
  Note
  H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image regression experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
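
The layout described above can be assembled with Python's standard library. The following is a minimal sketch under stated assumptions, not an official H2O Hydrogen Torch utility: the function name build_dataset_zip, the column names image and label, the file name train.csv, and the folder name images are all hypothetical choices made for illustration.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def build_dataset_zip(zip_path, image_dir, rows, csv_name="train.csv", folder="images"):
    """Write a .zip file holding one .csv file and one image folder.

    rows is a list of (image_file_name, numerical_label) pairs (hypothetical layout).
    """
    image_dir = Path(image_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # (2) The .csv file: an image column and one numerical label column.
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(["image", "label"])
        writer.writerows(rows)
        zf.writestr(csv_name, buffer.getvalue())

        # (3) The image folder: every image named in the image column.
        for image_name, _ in rows:
            zf.write(image_dir / image_name, arcname=f"{folder}/{image_name}")
    return zip_path


# Demo with two placeholder files standing in for real images.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "raw"
    source.mkdir()
    for name in ("a.jpg", "b.jpg"):
        (source / name).write_bytes(b"placeholder")
    archive = build_dataset_zip(Path(tmp) / "dataset.zip", source, [("a.jpg", 1.5), ("b.jpg", 2.0)])
    with zipfile.ZipFile(archive) as zf:
        members = sorted(zf.namelist())
        csv_text = zf.read("train.csv").decode("utf-8")
```

Because the image column in this sketch holds bare file names, the images folder would still need to be chosen as the data folder when the dataset is imported.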

The coins_image_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image regression problem. The .zip file contains a .csv file and an image folder. The structure of the .zip file is as follows:

coins_image_regression.zip
│   └───coins_image_regression.csv
│   │
│   └───images
│       └───95_1477858074.jpg
│       └───95_1477858068.jpg
│       └───95_1477858062.jpg
│       ...

The first three rows of the .csv file are as follows:

image_path	label	fold
105_1479344562.jpg	105	1
105_1479344940.jpg	105	2
125_1479424716.jpg	125	1

Note

In this example, the data directory in the image column (image_path) is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.
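
If your own .csv file has no fold column yet, one way to create integer fold indexes is sketched below. This is an illustrative assumption rather than the method used to build the coins dataset: the helper name add_fold_column, the number of folds, and the seed are hypothetical, and the folds are assigned at random without stratification.

```python
import numpy as np
import pandas as pd


def add_fold_column(df, n_folds=5, seed=0):
    """Return a copy of df with an integer fold column holding values 0..n_folds-1."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions, then deal them out round-robin so fold sizes differ by at most one.
    order = rng.permutation(len(df))
    folds = np.empty(len(df), dtype=int)
    folds[order] = np.arange(len(df)) % n_folds
    out = df.copy()
    out["fold"] = folds
    return out


df = pd.DataFrame({"image_path": [f"img_{i}.jpg" for i in range(10)], "label": range(10)})
df = add_fold_column(df, n_folds=5)
```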

3D image regression

Format
Example

The data for a 3D image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

The available dataset connectors require the data for a 3D image regression experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing the numerical labels (targets)
  Note
  H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image regression experiment.
Note
All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.

The mnist_3d_image_regression_3d.zip file is a preprocessed dataset in H2O Hydrogen Torch that was formatted to solve a 3D image regression problem. The .zip file contains a .csv file and an image folder. The structure of the .zip file is as follows:

mnist_3d_image_regression_3d.zip
│   └───train.csv
│   │
│   └───images
│       └───39385.npy
│       └───28837.npy
│       └───35708.npy
│       ...

The first three rows of the train.csv file are as follows:

image	label
39385.npy	1
28837.npy	0
35708.npy	2

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
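
The example dataset stores each 3D image as a .npy file. The sketch below shows how a NumPy array can be written to and read back from that file format. The array shape, the dtype, and the file name volume_0001.npy are assumptions made for illustration; this section does not state the axis order or shape that H2O Hydrogen Torch expects, so check the supported 3D image extensions page before preparing real data.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical volume: a 16x16x16 block of zeros with a small bright cube inside.
volume = np.zeros((16, 16, 16), dtype=np.float32)
volume[4:8, 4:8, 4:8] = 1.0

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "volume_0001.npy"
    # np.save writes the array in NumPy's .npy format, preserving shape and dtype.
    np.save(path, volume)
    restored = np.load(path)
```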

Image classification

Format
Example

The data for an image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

The available dataset connectors require the data for an image classification experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
  Note
  - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label, so an image can belong to several classes at once. For N label columns, only one column can be set to 1 in each row in multi-class problems, while any number of columns (from none to all) can be set to 1 in multi-label problems.
  - For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns, so the experiment is treated as a multi-class classification experiment and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image classification experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
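
The label layouts described above can be prepared with pandas before the .csv file is written. The sketch below is illustrative only: the class names, the choice of "dog" as the positive class, and the intermediate tags column are assumptions, not part of any H2O Hydrogen Torch dataset.

```python
import pandas as pd

# Binary classification: map string labels to 0/1 so the label column is numeric.
binary = pd.DataFrame({"image": ["a.jpg", "b.jpg", "c.jpg"], "label": ["cat", "dog", "cat"]})
binary["label"] = (binary["label"] == "dog").astype(int)  # "dog" as positive class is an assumption

# Multi-label classification: one 0/1 column per class; any number of columns may be 1 in a row.
tags = pd.DataFrame({"image": ["a.jpg", "b.jpg", "c.jpg"], "tags": [["sky", "tree"], [], ["tree"]]})
classes = ["sky", "tree"]
for name in classes:
    tags[name] = tags["tags"].map(lambda values, name=name: int(name in values))
multi_label = tags.drop(columns="tags")
```

Mapping the binary labels to 0/1 up front avoids the one-hot encoding of string labels that the note above warns about.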

The flower_image_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multi-class image classification problem. The structure of the .zip file is as follows:

flower_image_classification.zip
│   └───train.csv
│   │
│   └───images
│       └───100080576_f52e8ee070_n.jpg
│       └───10043234166_e6dd915111_n.jpg
│       └───1008566138_6927679c8a.jpg
│       ...

The first three rows of the train.csv file are as follows:

image	label
5777669976_a205f61e5b.jpg	roses
4860145119_b1c3cbaa4e_n.jpg	roses
15011625580_7974c44bce.jpg	roses

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.
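
Before uploading a dataset of your own, it can help to confirm that every image listed in the .csv file is present in the data folder. The following is a minimal sketch using only the standard library; the helper name missing_images and its default arguments (train.csv, image, images) are assumptions that mirror the example above, and the name comparison is case-sensitive.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def missing_images(zip_path, csv_name="train.csv", image_column="image", data_folder="images"):
    """Return image names listed in the .csv file that are absent from the data folder in the .zip file."""
    with zipfile.ZipFile(zip_path) as zf:
        members = set(zf.namelist())
        with zf.open(csv_name) as handle:
            reader = csv.DictReader(io.TextIOWrapper(handle, encoding="utf-8", newline=""))
            names = [row[image_column] for row in reader]
    return [name for name in names if f"{data_folder}/{name}" not in members]


# Demo: the .csv file lists two images, but only one is stored in the archive.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "dataset.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("train.csv", "image,label\na.jpg,roses\nb.jpg,roses\n")
        zf.writestr("images/a.jpg", b"placeholder")
    missing = missing_images(archive)
```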

3D image classification

Format
Example

The data for a 3D image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

The available dataset connectors require the data for a 3D image classification experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
  Note
  - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label, so an image can belong to several classes at once. For N label columns, only one column can be set to 1 in each row in multi-class problems, while any number of columns (from none to all) can be set to 1 in multi-label problems.
  - For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns, so the experiment is treated as a multi-class classification experiment and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image classification experiment.
Note
All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.

The mnist_3d_image_classification_3d.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multi-class 3D image classification problem. The structure of the .zip file is as follows:

mnist_3d_image_classification_3d.zip
│   └───train.csv
│   │
│   └───images
│       └───39385.npy
│       └───28837.npy
│       └───35708.npy
│       ...

The first three rows of the train.csv file are as follows:

image	label
39385.npy	1
28837.npy	0
35708.npy	2

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Image metric learning

Format
Example

The data for an image metric learning experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a label column

The available dataset connectors require the data for an image metric learning experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A label column containing the class names
  Note
  Similar images should have the same class name.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image metric learning experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

The bicycle_image_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image metric learning problem. The structure of the .zip file is as follows:

bicycle_image_metric_learning.zip
│   └───train.csv
│   │
│   └───images
│       └───181783211141_0.jpg
│       └───181596348104_1.jpg
│       └───171166528893_0.jpg
│       ...

The first three rows of the .csv file are as follows:

image	label	fold
181783211141_0.JPG	181783211141	0
181596348104_1.JPG	181596348104	2
171166528893_0.JPG	171166528893	0

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Image object detection

Formats
Format conversions

H2O Hydrogen Torch supports several dataset (data) formats for an image object detection experiment. Supported formats are as follows:

Hydrogen Torch format
Individual boxes format
COCO format
Pascal VOC format

Hydrogen Torch format

The data following the Hydrogen Torch format for an image object detection experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3).

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not require a class_id, x_min, x_max, y_min, and y_max column

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each bounding box. Each row of the dataset should contain a list of class names, where each element in the list refers to a single box
- An x_min, x_max, y_min, and y_max column corresponding to the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a list of coordinates, where each element in the list refers to a single box
  Note
  - The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
  - The length of each list for the class_id, x_min, x_max, y_min, and y_max needs to be equal and needs to refer to the total number of bounding boxes in each respective image. If a box is not present for a given image, all lists need to be empty.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
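
The rule that the class_id, x_min, x_max, y_min, and y_max lists must have equal lengths in every row can be checked before the .pq file is written. The sketch below is one possible check, assuming the data is already in a pandas DataFrame; the helper name rows_with_mismatched_boxes and the sample values are hypothetical.

```python
import pandas as pd

BOX_COLUMNS = ["class_id", "x_min", "y_min", "x_max", "y_max"]


def rows_with_mismatched_boxes(df):
    """Return the index labels of rows whose per-box lists differ in length."""
    lengths = df[BOX_COLUMNS].apply(lambda column: column.map(len))
    # A row is consistent when every list column has the same length (0 for images without boxes).
    consistent = lengths.nunique(axis=1) == 1
    return df.index[~consistent].tolist()


df = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg"],
        "class_id": [["wheat", "wheat"], [], ["wheat"]],
        "x_min": [[10, 40], [], [5]],
        "y_min": [[12, 44], [], [6]],
        "x_max": [[30, 80], [], [25, 26]],  # one coordinate too many: deliberately broken row
        "y_max": [[32, 90], [], [27]],
    }
)
bad_rows = rows_with_mismatched_boxes(df)
```

In this sample, the row for b.jpg passes the check because all of its lists are empty, which is how an image without boxes is represented.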

Example

The global_wheat_image_object_detection.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image object detection problem. The structure of the .zip file is as follows:

global_wheat_image_object_detection.zip
│   └───train.pq
│   │
│   └───images
│       └───7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg
│       └───3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg
│       └───37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg
│       ...

Three random rows from the .pq file are as follows:

image	class_id	x_min	y_min	x_max	y_max
7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg	['wheat' 'wheat' 'wheat' ...]	[689 718 382 ...]	[884 464 42 ...]	[754 768 450 ...]	[920 516 101 ...]
3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg	['wheat' 'wheat' 'wheat' ...]	[924 698 904 ...]	[195 10 32 ...]	[981 763 938 ...]	[247 101 79 ...]
37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg	['wheat' 'wheat' 'wheat' ...]	[919 811 4 ...]	[535 820 96 ...]	[1024 912 71 ...]	[613 894 164 ...]

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Individual boxes format

The data following the individual boxes format for an image object detection experiment is structured as follows: A .zip file (1) containing a .csv file (2) and an image folder (3):

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a class_id, x_min, x_max, y_min, and y_max column

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each box. Each row of the dataset should contain a single box
- An x_min, x_max, y_min, and y_max column containing the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a single coordinate value for a corresponding bounding box
  Note
  - The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
  - If a box is not present for a given image, the class_id, x_min, x_max, y_min, and y_max columns should be empty.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Example

image	x_min	y_min	x_max	y_max	class_id
bafc.jpg	311	43	378	134	wheat
bafc.jpg	276	83	354	153	wheat
bafc.jpg	442	309	541	381	wheat
cryv.jpg	301	13	328	124	wheat
cryv.jpg	246	80	344	113	wheat
cryv.jpg	432	303	341	181	wheat
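
To see how the individual boxes format relates to the Hydrogen Torch format, the sketch below expands a frame with one row per image into one row per box. It relies on the multi-column form of DataFrame.explode, which requires pandas 1.3 or newer. The sample frame, including the image empty.jpg without boxes, is made up for illustration.

```python
import pandas as pd

BOX_COLUMNS = ["class_id", "x_min", "y_min", "x_max", "y_max"]

per_image = pd.DataFrame(
    {
        "image": ["bafc.jpg", "empty.jpg"],
        "class_id": [["wheat", "wheat"], []],
        "x_min": [[311, 276], []],
        "y_min": [[43, 83], []],
        "x_max": [[378, 354], []],
        "y_max": [[134, 153], []],
    }
)

# Multi-column explode emits one row per box; an image with empty lists
# becomes a single row whose box columns hold missing values.
per_box = per_image.explode(BOX_COLUMNS, ignore_index=True)
```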

COCO format

The data following the COCO format for an image object detection experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

A train .json file needs to follow the format described above
A validation .json file needs to follow the same format as a train .json file
A test .json file needs to follow the same format as a train .json file, but does not require labels

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .json file that contains labels in a COCO format.
A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
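
For orientation, the sketch below builds a minimal COCO-style label file in Python. The keys shown (images, categories, annotations, and the bbox field stored as x, y, width, height) follow the general COCO convention and match the fields read by the COCO conversion code in this document. The file name, ids, and sizes are invented, and which optional fields H2O Hydrogen Torch reads is not stated here, so treat this as an assumption-laden illustration.

```python
import json

# Hypothetical single-image dataset; file names, ids, and sizes are invented for illustration.
coco = {
    "images": [{"id": 1, "file_name": "bafc.jpg", "width": 1024, "height": 1024}],
    "categories": [{"id": 1, "name": "wheat"}],
    "annotations": [
        # bbox is [x_min, y_min, width, height] in pixels.
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [311, 43, 67, 91], "area": 67 * 91, "iscrowd": 0}
    ],
}


def corners(bbox):
    """Convert a COCO [x, y, width, height] box to (x_min, y_min, x_max, y_max)."""
    x, y, width, height = bbox
    return x, y, x + width, y + height


text = json.dumps(coco, indent=2)
box = corners(coco["annotations"][0]["bbox"])
```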

Pascal VOC format

The data following the Pascal VOC format for an image object detection experiment is structured as follows: A .zip file (1) containing a folder with .xml files with labels (2) and an image folder (3):

folder_name.zip (1)
│   └───xml_folder_name (2)
│       └───name_of_image.xml
│       └───name_of_image.xml
│       └───name_of_image.xml
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple folders with labels in the .zip file that you can use as train, validation, and test datasets:

A train folder with labels needs to follow the format described above
A validation folder with labels should have the same format as a train folder
A test folder with labels should have the same format as a train folder, but labels are not required

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A folder that contains .xml files with labels in a Pascal VOC format.
An image folder that contains all the images specified in the .xml files; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
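
As a rough illustration of a Pascal VOC label file, the sketch below generates a minimal annotation with Python's ElementTree. It includes only the elements read by the Pascal VOC conversion code in this document (filename, object, name, and bndbox with xmin, ymin, xmax, ymax). Real Pascal VOC files usually carry more fields, and the helper name voc_annotation and the sample values are assumptions.

```python
from xml.etree import ElementTree as ET


def voc_annotation(filename, boxes):
    """Build a minimal Pascal VOC annotation; boxes is a list of (name, xmin, ymin, xmax, ymax)."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    for name, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), (xmin, ymin, xmax, ymax)):
            ET.SubElement(bndbox, tag).text = str(value)
    return root


root = voc_annotation("bafc.jpg", [("wheat", 312, 44, 378, 134)])
xml_text = ET.tostring(root, encoding="unicode")
```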

Individual Boxes to Hydrogen Torch format

import pandas as pd


# Read data
df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()

df[["image_id", "class_id", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

COCO to Hydrogen Torch format

import json
import pandas as pd


def get_object_detection(df):
    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])

    annotations["x_min"] = annotations["bbox"].map(lambda x: x[0]).astype(int)
    annotations["y_min"] = annotations["bbox"].map(lambda x: x[1]).astype(int)
    annotations["x_max"] = annotations["bbox"].map(lambda x: x[0] + x[2]).astype(int)
    annotations["y_max"] = annotations["bbox"].map(lambda x: x[1] + x[3]).astype(int)

    # Copy the selection so the assignment below does not write to a view of the original frame
    annotations = annotations[
        ["image_id", "category_id", "x_min", "y_min", "x_max", "y_max"]
    ].copy()

    annotations["category_id"] = annotations["category_id"].astype(int)
    annotations = annotations.merge(
        categories[["id", "name"]].drop_duplicates(), left_on="category_id", right_on="id", how="left"
    )
    annotations = annotations.merge(
        images[["id", "file_name"]].drop_duplicates(), left_on="image_id", right_on="id", how="right"
    )

    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)

    return annotations


# Read data
with open("/data/COCO_train_annos.json", "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_object_detection(train)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
train_ann[["file_name", "name", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Pascal VOC to Hydrogen Torch format

import glob
import os
from xml.etree import ElementTree

import pandas as pd
from tqdm import tqdm


observations = []

for xml in tqdm(glob.glob("/data/Annotations/*.xml")):
    tree = ElementTree.parse(xml)
    root = tree.getroot()
    objs = root.findall("object")

    for obj in objs:
        name = obj.find("name").text

        bndbox = obj.find("bndbox")
        xmin = float(bndbox.findtext("xmin")) - 1
        ymin = float(bndbox.findtext("ymin")) - 1
        xmax = float(bndbox.findtext("xmax"))
        ymax = float(bndbox.findtext("ymax"))
        
        try:
            img_name = root.findall("path")[0].text.split("/")[-1]
        except Exception:
            img_name = root.findall("filename")[0].text

        observations.append(
            (
                img_name,
                name,
                xmin,
                ymin,
                xmax,
                ymax,
            )
        )

df = pd.DataFrame(
    observations, columns=["image", "class_id", "x_min", "y_min", "x_max", "y_max"]
)

# Prepare the processed dataset
df = df.groupby(["image"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
df.to_parquet("/data/train.pq", engine="pyarrow", index=False)
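
To see what the loop above extracts from one annotation file, the sketch below parses an inline XML string instead of files on disk. The sample XML and its values are hypothetical; the shift of xmin and ymin by 1 mirrors the script above, presumably because Pascal VOC pixel coordinates are 1-based.

```python
from xml.etree import ElementTree

# Hypothetical Pascal VOC annotation with a single object
SAMPLE_XML = """
<annotation>
    <filename>img_0001.jpg</filename>
    <object>
        <name>dog</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
</annotation>
"""

root = ElementTree.fromstring(SAMPLE_XML.strip())

rows = []
for obj in root.findall("object"):
    bndbox = obj.find("bndbox")
    rows.append(
        (
            root.findtext("filename"),
            obj.findtext("name"),
            # Same shift as the script above: only the min corner moves by 1
            float(bndbox.findtext("xmin")) - 1,
            float(bndbox.findtext("ymin")) - 1,
            float(bndbox.findtext("xmax")),
            float(bndbox.findtext("ymax")),
        )
    )

print(rows)
```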

Image semantic segmentation

Formats
Helper functions
Format conversions

H2O Hydrogen Torch supports several dataset (data) formats for an image semantic segmentation experiment. Supported formats are as follows:

Hydrogen Torch format
COCO format

Hydrogen Torch format

The data following the Hydrogen Torch format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not need a class_id and rle_mask column

The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
- A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
  Note
  The class_id and rle_mask lists in each row must have the same length, equal to the total number of classes.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image semantic segmentation experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
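
The column rules above can be checked before you package the dataset. The following is a minimal sketch in plain Python; the rows, class names, and the helper find_format_problems are hypothetical and are not part of H2O Hydrogen Torch.

```python
from typing import Dict, List

# Hypothetical rows mirroring the columns described above
rows: List[Dict[str, object]] = [
    {"image": "img_0001.png", "class_id": ["shoes", "pants"], "rle_mask": ["5 3 12 2", ""]},
    {"image": "img_0002.png", "class_id": ["shoes", "pants"], "rle_mask": ["", "7 4"]},
]


def find_format_problems(rows, all_classes) -> List[str]:
    """Return readable problems; an empty list means the rows look consistent."""
    problems = []
    for row in rows:
        # Every row should list all classes, even those without a mask
        if set(row["class_id"]) != set(all_classes):
            problems.append(f"{row['image']}: class_id must list every class")
        # One RLE string (possibly empty) per class
        if len(row["class_id"]) != len(row["rle_mask"]):
            problems.append(f"{row['image']}: class_id and rle_mask lengths differ")
    return problems


print(find_format_problems(rows, ["shoes", "pants"]))
```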

Example

The fashion_image_semantic_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image semantic segmentation problem. The structure of the .zip file is as follows:

fashion_image_semantic_segmentation.zip
│   └───train.pq
│   │
│   └───images
│       └───img_0458.png
│       └───img_0604.png
│       └───img_0668.png
│           ...

Three random rows from the .pq file are as follows:

image	class_id	rle_mask
img_0458.png	['shoes' 'pants' 'dress' 'coat' 'shirt']	['180629 7 181447 17...
img_0604.png	['shoes' 'pants' 'dress' 'coat' 'shirt']	['189672 2 190493 9...
img_0668.png	['shoes' 'pants' 'dress' 'coat' 'shirt']	['108023 11 108848 11...

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

COCO format

The data following the COCO format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

A train .json file needs to follow the format described above
A validation .json file needs to follow the same format as a train .json file
A test .json file needs to follow the same format as a train .json file, but does not require labels

The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .json file that contains labels in a COCO format.
A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder during an image semantic segmentation experiment.
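
For orientation, the sketch below builds a minimal COCO-style annotation structure as a Python dictionary and checks that its references are consistent. The field names follow the public COCO annotation layout; all values are hypothetical, and the check itself is an illustration rather than something H2O Hydrogen Torch is documented to perform.

```python
import json

# Hypothetical, minimal COCO-style annotation content for one image
coco = {
    "images": [{"id": 1, "file_name": "img_0001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "shirt"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,  # refers to images[].id
            "category_id": 1,  # refers to categories[].id
            # Polygon as a flat [x1, y1, x2, y2, ...] list
            "segmentation": [[100, 100, 200, 100, 200, 200, 100, 200]],
            "bbox": [100, 100, 100, 100],  # [x, y, width, height]
            "area": 10000,
            "iscrowd": 0,
        }
    ],
}

# Collect annotations that point at a missing image or category
image_ids = {image["id"] for image in coco["images"]}
category_ids = {category["id"] for category in coco["categories"]}
dangling = [
    annotation["id"]
    for annotation in coco["annotations"]
    if annotation["image_id"] not in image_ids
    or annotation["category_id"] not in category_ids
]
print(dangling)

# json.dumps(coco) yields the text you would store in the .json file
serialized = json.dumps(coco)
```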

Details

RLE encoding and decoding functions

from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.

    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """

    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)


def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.

    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """

    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction
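
To make the encoding concrete, the sketch below re-implements the same "start length" scheme in plain Python for a tiny mask. Pixels are read column by column and start positions are 1-based, which matches the helpers above. The function name rle_encode_column_major is hypothetical and is meant only as an illustration, not as a replacement for mask2rle.

```python
from typing import List


def rle_encode_column_major(mask: List[List[int]]) -> str:
    """Pure-Python illustration of the 'start length' encoding used above."""
    height, width = len(mask), len(mask[0])
    # Read pixels column by column (top to bottom, then left to right)
    pixels = [mask[row][col] for col in range(width) for row in range(height)]

    runs = []
    position = 0
    while position < len(pixels):
        if pixels[position] == 1:
            start = position
            while position < len(pixels) and pixels[position] == 1:
                position += 1
            runs += [start + 1, position - start]  # 1-based start, run length
        else:
            position += 1
    return " ".join(str(value) for value in runs)


mask = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
print(rle_encode_column_major(mask))  # 2 1 4 2 9 1
```

An all-background mask encodes to an empty string, which is the value the format expects when a class has no mask.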

Details

.csv file with masks to Hydrogen Torch format

import pandas as pd

df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

df[["image_id", "class_id", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
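
The grouping step above turns one row per image and class into one row per image with list-valued columns. The sketch below shows the same reshaping with the standard library, assuming a long-format CSV with image_id, class_id, and rle_mask columns; the sample content is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format input: one row per (image, class) pair
LONG_CSV = """image_id,class_id,rle_mask
img_0001.png,shoes,5 3 12 2
img_0001.png,pants,
img_0002.png,shoes,
img_0002.png,pants,7 4
"""

grouped = defaultdict(lambda: {"class_id": [], "rle_mask": []})
for row in csv.DictReader(io.StringIO(LONG_CSV)):
    grouped[row["image_id"]]["class_id"].append(row["class_id"])
    grouped[row["image_id"]]["rle_mask"].append(row["rle_mask"])

for image_id, columns in grouped.items():
    print(image_id, columns["class_id"], columns["rle_mask"])
```

One caveat when using pandas for this step: pd.read_csv reads empty fields as NaN by default, so you may need keep_default_na=False or a fillna("") call to keep empty masks as empty strings.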

Details

COCO to Hydrogen Torch format

import json

import numpy as np
import pandas as pd
from pycocotools.coco import COCO


def get_semantic_segmentation(df, coco_path):
    coco = COCO(coco_path)

    images = pd.DataFrame(df["images"])
    images = images[["id", "file_name"]].drop_duplicates()
    images.columns = ["image_id", "file_name"]

    categories = pd.DataFrame(df["categories"])
    categories = categories[["id", "name"]].drop_duplicates()
    categories.columns = ["category_id", "name"]
    # Filter out _background_ class
    categories = categories[categories.name != "_background_"]

    # One row per (image, class) pair so that every image lists all classes
    all_labels = [
        pd.DataFrame({"file_name": x, "name": categories.name.unique()})
        for x in images.file_name.unique()
    ]
    all_labels = pd.concat(all_labels)
    all_labels = all_labels.merge(images).merge(categories).reset_index(drop=True)

    rles = []
    for _, row in all_labels.iterrows():
        semantic_annotations = [
            x
            for x in df["annotations"]
            if x["image_id"] == row["image_id"]
            and int(x["category_id"]) == row["category_id"]
        ]

        if len(semantic_annotations) == 0:
            # No mask for this class in this image
            rles.append("")
            continue
        # Union of all masks of this class in this image
        semantic_mask = np.max(
            [coco.annToMask(x) for x in semantic_annotations], axis=0
        )
        # mask2rle() is defined in the "Helper functions" section
        rles.append(mask2rle(semantic_mask))

    all_labels["rle_mask"] = rles

    return all_labels


# Read data
train_path = "/data/COCO_train_annos.json"
with open(train_path, "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_semantic_segmentation(df=train, coco_path=train_path)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: x.to_list()).reset_index()
train_ann[["file_name", "name", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

3D image semantic segmentation

Format
Example
Helper functions

The data for a 3D image semantic segmentation experiment needs to be structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not need a class_id and rle_mask column

The available dataset connectors require the data for a 3D image semantic segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
- A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
  Note
  The class_id and rle_mask lists in each row must have the same length, equal to the total number of classes.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image semantic segmentation experiment.
Note
All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.

The covid_ct_image_semantic_segmentation_3d.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a 3D image semantic segmentation problem. The structure of the .zip file is as follows:

covid_ct_image_semantic_segmentation_3d.zip
│   └───train.pq
│   │
│   └───images
│       └───coronacases_org_001.npy
│       └───coronacases_org_002.npy
│       └───coronacases_org_003.npy
│           ...

Three random rows from the .pq file are as follows:

image	class_id	rle_mask
coronacases_org_001.npy	['lung']	['171087 6 171095 7 171...
coronacases_org_002.npy	['lung']	['6439 8 6563 15 6689...
coronacases_org_003.npy	['lung']	['103983 1 119580 9 119...

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Details

RLE encoding and decoding functions

from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.

    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """

    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)


def rle2mask_3d(mask_rle: str, shape: Tuple) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask (3D version).

    Args:
        mask_rle: RLE-encoded string
        shape: (height, width, depth) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """
    shape = [shape[1], shape[0], shape[2]]
    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1] * shape[2], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape).transpose(2, 1, 0)
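
Reading the 3D helper above, the array returned by rle2mask_3d has its axes ordered (depth, height, width), and the depth index varies fastest along the RLE positions. The sketch below spells out that mapping for a single voxel. It is derived from the reshape and transpose calls above rather than from a separate specification, and the function name flat_index_3d is hypothetical.

```python
def flat_index_3d(d: int, h: int, w: int, height: int, depth: int) -> int:
    """1-based RLE position of voxel [d, h, w] in the array returned by rle2mask_3d."""
    return w * height * depth + h * depth + d + 1


# Hypothetical volume with height=4, width=3, depth=2
print(flat_index_3d(d=0, h=0, w=0, height=4, depth=2))  # first voxel
print(flat_index_3d(d=1, h=0, w=0, height=4, depth=2))  # same pixel, next slice
print(flat_index_3d(d=0, h=1, w=0, height=4, depth=2))  # next row, first slice
print(flat_index_3d(d=1, h=3, w=2, height=4, depth=2))  # last voxel
```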

Image instance segmentation

Formats
Helper functions
Format conversions

H2O Hydrogen Torch supports several dataset (data) formats for an image instance segmentation experiment. Supported formats are as follows:

Hydrogen Torch format
COCO format

Hydrogen Torch format

The data following the Hydrogen Torch format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not require a class_id and rle_mask column

The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each instance mask. Each row of the dataset should contain a list of class names, where each element in the list refers to a single mask instance.
- A rle_mask column containing run-length-encoded (RLE) masks for each instance from the class_id column. Each row of the dataset should contain a list of RLE-encoded masks, where each element in the list refers to a single instance.
  Note
  The length of each class_id and rle_mask list must be equal while referring to the total number of instances in each respective image. If an instance is not present for a given image, all lists need to be empty.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image instance segmentation experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
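
Unlike semantic segmentation, each list element here describes one instance, so the list length varies from image to image. The sketch below illustrates that with two hypothetical rows, including an image without instances; the helper count_instances is an assumption made for illustration.

```python
# Hypothetical rows: one list element per instance, empty lists for images without instances
rows = [
    {
        "image": "000000000001.jpg",
        "class_id": ["car", "car", "person"],
        "rle_mask": ["10 4 30 4", "101 6", "200 3 220 3"],
    },
    {"image": "000000000002.jpg", "class_id": [], "rle_mask": []},
]


def count_instances(row) -> int:
    """Return the number of instances in a row, or raise if the lists are misaligned."""
    if len(row["class_id"]) != len(row["rle_mask"]):
        raise ValueError(f"{row['image']}: class_id and rle_mask lengths differ")
    return len(row["class_id"])


print([count_instances(row) for row in rows])  # [3, 0]
```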

Example

The coco_image_instance_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image instance segmentation problem. The structure of the .zip file is as follows:

coco_image_instance_segmentation.zip
│   └───train.pq
│   │
│   └───images
│       └───000000151231.jpg
│       └───000000433826.jpg
│       └───000000061159.jpg
│           ...

Three random rows from the .pq file are as follows:

image_id	class_id	rle_mask
000000151231.jpg	['car' 'car']	['91949 7 92375 14 92801...
000000433826.jpg	['car' 'car']	['224473 3 224952 4 22...
000000061159.jpg	['car' 'car']	['161665 9 162291 25...

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

COCO format

The data following the COCO format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

A train .json file needs to follow the format described above
A validation .json file needs to follow the same format as a train .json file
A test .json file needs to follow the same format as a train .json file, but does not require labels

The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .json file that contains labels in a COCO format.
A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run an image instance segmentation experiment.

Details

RLE encoding and decoding functions

from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.

    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """

    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)


def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.

    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """

    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction
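
The decoding direction can be illustrated without NumPy as well. The sketch below expands "start length" pairs into a small mask, assuming 1-based start positions and column-by-column pixel order as in the helpers above; the function name rle_decode_column_major is hypothetical.

```python
from typing import List


def rle_decode_column_major(rle: str, height: int, width: int) -> List[List[int]]:
    """Pure-Python illustration of decoding 'start length' pairs into a mask."""
    values = [int(token) for token in rle.split()]
    flat = [0] * (height * width)
    for start, length in zip(values[0::2], values[1::2]):
        for position in range(start - 1, start - 1 + length):  # starts are 1-based
            flat[position] = 1
    # Positions run down each column first, so pixel (row, col) sits at col * height + row
    return [[flat[col * height + row] for col in range(width)] for row in range(height)]


for row in rle_decode_column_major("2 1 4 2 9 1", height=3, width=3):
    print(row)
```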

Details

.csv file with masks to Hydrogen Torch format

import pandas as pd


df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

df[["image_id", "class_id", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Details

COCO to Hydrogen Torch format

import json

import pandas as pd
from pycocotools.coco import COCO


def get_instance_segmentation(df, coco_path):
    coco = COCO(coco_path)

    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])

    rles = []
    for annotation in df["annotations"]:
        # mask2rle() is defined in the "Helper functions" section
        mask = mask2rle(coco.annToMask(annotation))
        rles.append(mask)

    annotations["rle_mask"] = rles
    # Treat empty masks as missing values
    annotations.loc[annotations.rle_mask == "", "rle_mask"] = float("nan")

    annotations = annotations[["image_id", "category_id", "rle_mask"]].copy()

    annotations["category_id"] = annotations["category_id"].astype(int)
    # Attach class names
    annotations = annotations.merge(
        categories[["id", "name"]].drop_duplicates(),
        left_on="category_id",
        right_on="id",
        how="left",
    )
    # Attach file names; the right join keeps images without annotations
    annotations = annotations.merge(
        images[["id", "file_name"]].drop_duplicates(),
        left_on="image_id",
        right_on="id",
        how="right",
    )

    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)

    return annotations


# Read data
train_path = "/data/COCO_train_annos.json"
with open(train_path, "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_instance_segmentation(df=train, coco_path=train_path)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
train_ann[["file_name", "name", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Text

Text regression

Formats
Example

The data for a text regression experiment can be formatted following format 1 or 2.

Format 1
Format 2

A .csv file.

csv_name.csv (1)(2)

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

The available dataset connectors require the data for a text regression experiment to be in a .zip or .csv file.

Note

To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

A .csv file containing the following columns:
- A text column containing the texts for the experiment
- One or more label columns containing the numerical labels (targets)
  Note
  H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
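
If your data has no fold column yet, you can add one before uploading. The sketch below assigns folds in round-robin order and writes the result as CSV text. The first two samples come from the example dataset on this page, the remaining ones are hypothetical, and the round-robin scheme is only one possible way to split the data.

```python
import csv
import io

# (label, text) rows; the last two are hypothetical
samples = [
    (0.2, "The European Union includes how many ?"),
    (1.0, "What is released when an ion is formed ?"),
    (0.6, "How tall is Mount Everest ?"),
    (0.8, "Who wrote Hamlet ?"),
]
n_folds = 2

buffer = io.StringIO()  # replace with open("train.csv", "w", newline="") to write a file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["rating", "text", "fold"])
for index, (rating, text) in enumerate(samples):
    # Round-robin assignment yields fold indexes 0 .. n_folds - 1
    writer.writerow([rating, text, index % n_folds])

print(buffer.getvalue())
```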

The wellformed_query_text_regression.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text regression problem.

Two random rows from the .csv file are as follows:

rating	text
0.2	The European Union includes how many ?
1.0	What is released when an ion is formed ?

Note

The rating column refers to the label column.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Text classification

Formats
Example

The data for a text classification experiment can be formatted following format 1 or 2.

Format 1
Format 2

A .csv file.

csv_name.csv (1)(2)

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

The available dataset connectors require the data for a text classification experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- A text column containing the texts for the experiment
- One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
  Note
  - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label. With N label columns, only a single column can be set to 1 in a multi-class problem, while any number of columns (including all or none) can be set to 1 in a multi-label problem.
  - For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns, the experiment is treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
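
The two label layouts described above can be prepared with a few lines of plain Python. The sketch below maps string labels to 0/1 for the binary case and builds one-hot columns for the multi-class case. The label values and the one_hot helper are hypothetical, and which string counts as the positive class is your choice.

```python
from typing import Dict, List

labels = ["Positive", "Negative", "Positive"]  # hypothetical string labels

# Binary case: map strings to 0/1 so the column stays a single binary label
binary = [1 if label == "Positive" else 0 for label in labels]
print(binary)  # [1, 0, 1]


def one_hot(values: List[str]) -> List[Dict[str, int]]:
    """One 0/1 column per class; exactly one column is 1 in each row."""
    classes = sorted(set(values))
    return [{name: int(value == name) for name in classes} for value in values]


print(one_hot(["cat", "dog", "bird"]))
```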

The amazon_reviews_text_classification.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text classification problem.

The first two rows of the .csv file are as follows:

text	label
GREAT!!!!! Review: I got this toy a couple of days ago and I ABSOLUTELY LOVE IT! It is so much more realistic looking than my other baby born comfort seat. All though I dont have a baby born I had one before but I sold it at a garage sale. So I use It for my berenguar baby doll. And it even has the buckle that goes across the shoulder like a real babies car seat!!!! DEFFINATELY WORTH THE MONEY!!!!!!	Positive
This Or "Dixie Chicken" Presents Them At A Peak Review: Though lyrically the overall feel of this record is slightly provincial, it can still transport me to places I wanna be. Musically, this pop product from California is stylistically consistent. Yet the instrumentation is diverse and each member is resourceful. But it's Lowell George's vocals and slide guitar that are primarily at the center. He's not flashy and that's a positive. You get treated to 12-bar blues, a song of prescription meds for tripping and a blues with an accordian.But the three highlights are "Easy To Slip", a jaunty acoustic/electric number about lighting up and the sheer joy that memory drifting can project, "Teenage Nervous Breakdown" in which they switch to the domain of energy-driven rock and roll and the title track, a leisurely-paced country blues in which a generous helping of background vocals provides just the right amount of tension.	Positive

Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Text sequence to sequence

Formats
Example

The data for a text sequence to sequence experiment can be formatted following format 1 or 2.

Format 1
Format 2

A .csv file.

csv_name.csv (1)(2)

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require an output text column

The available dataset connectors require the data for a text sequence to sequence experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An input text column containing the input texts
- An output text column containing the output texts
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
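
Packaging the files for Format 2 can be scripted with the standard library. The sketch below writes a train and a test .csv into a .zip archive, with the test file omitting the output text column. The column names text and summary and the sample rows are hypothetical, and the archive is kept in memory only to keep the example self-contained.

```python
import csv
import io
import zipfile


def to_csv_text(header, rows) -> str:
    """Serialize rows to CSV text; the csv module quotes fields that contain commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical column names and rows
train_csv = to_csv_text(
    ["text", "summary"], [["A long article, with commas.", "A short summary."]]
)
test_csv = to_csv_text(["text"], [["Another long article."]])  # no output text column

archive = io.BytesIO()  # pass a file path instead to write the archive to disk
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("train.csv", train_csv)
    zf.writestr("test.csv", test_csv)

with zipfile.ZipFile(archive) as zf:
    names = zf.namelist()
print(names)
```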

The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text sequence to sequence problem. The structure of the .zip file is as follows:

cnn_dailymail_text_sequence_to_sequence.zip
│   └───train.csv

A random row from the .csv file is as follows:

Details

Random row

text	summary	id
It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It's a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he wants to. "While I believe I have the authority to carry out this military action without specific congressional authorization, I know that the country will be stronger if we take this course, and our actions will be even more effective," he said. "We should have this debate, because the issues are too big for business as usual." Obama said top congressional leaders had agreed to schedule a debate when the body returns to Washington on September 9. The Senate Foreign Relations Committee will hold a hearing over the matter on Tuesday, Sen. Robert Menendez said. Transcript: Read Obama's full remarks . Syrian crisis: Latest developments . U.N. inspectors leave Syria . Obama's remarks came shortly after U.N. inspectors left Syria, carrying evidence that will determine whether chemical weapons were used in an attack early last week in a Damascus suburb. "The aim of the game here, the mandate, is very clear -- and that is to ascertain whether chemical weapons were used -- and not by whom," U.N. 
spokesman Martin Nesirky told reporters on Saturday. But who used the weapons in the reported toxic gas attack in a Damascus suburb on August 21 has been a key point of global debate over the Syrian crisis. Top U.S. officials have said there's no doubt that the Syrian government was behind it, while Syrian officials have denied responsibility and blamed jihadists fighting with the rebels. British and U.S. intelligence reports say the attack involved chemical weapons, but U.N. officials have stressed the importance of waiting for an official report from inspectors. The inspectors will share their findings with U.N. Secretary-General Ban Ki-moon Ban, who has said he wants to wait until the U.N. team's final report is completed before presenting it to the U.N. Security Council. The Organization for the Prohibition of Chemical Weapons, which nine of the inspectors belong to, said Saturday that it could take up to three weeks to analyze the evidence they collected. "It needs time to be able to analyze the information and the samples," Nesirky said. He noted that Ban has repeatedly said there is no alternative to a political solution to the crisis in Syria, and that "a military solution is not an option." Bergen: Syria is a problem from hell for the U.S. Obama: 'This menace must be confronted' Obama's senior advisers have debated the next steps to take, and the president's comments Saturday came amid mounting political pressure over the situation in Syria. Some U.S. lawmakers have called for immediate action while others warn of stepping into what could become a quagmire. Some global leaders have expressed support, but the British Parliament's vote against military action earlier this week was a blow to Obama's hopes of getting strong backing from key NATO allies. On Saturday, Obama proposed what he said would be a limited military action against Syrian President Bashar al-Assad. Any military attack would not be open-ended or include U.S. ground forces, he said. 
Syria's alleged use of chemical weapons earlier this month "is an assault on human dignity," the president said. A failure to respond with force, Obama argued, "could lead to escalating use of chemical weapons or their proliferation to terrorist groups who would do our people harm. In a world with many dangers, this menace must be confronted." Syria missile strike: What would happen next? Map: U.S. and allied assets around Syria . Obama decision came Friday night . On Friday night, the president made a last-minute decision to consult lawmakers. What will happen if they vote no? It's unclear. A senior administration official told CNN that Obama has the authority to act without Congress -- even if Congress rejects his request for authorization to use force. Obama on Saturday continued to shore up support for a strike on the al-Assad government. He spoke by phone with French President Francois Hollande before his Rose Garden speech. "The two leaders agreed that the international community must deliver a resolute message to the Assad regime -- and others who would consider using chemical weapons -- that these crimes are unacceptable and those who violate this international norm will be held accountable by the world," the White House said. Meanwhile, as uncertainty loomed over how Congress would weigh in, U.S. military officials said they remained at the ready. 5 key assertions: U.S. intelligence report on Syria . Syria: Who wants what after chemical weapons horror . Reactions mixed to Obama's speech . A spokesman for the Syrian National Coalition said that the opposition group was disappointed by Obama's announcement. "Our fear now is that the lack of action could embolden the regime and they repeat his attacks in a more serious way," said spokesman Louay Safi. "So we are quite concerned." Some members of Congress applauded Obama's decision. 
House Speaker John Boehner, Majority Leader Eric Cantor, Majority Whip Kevin McCarthy and Conference Chair Cathy McMorris Rodgers issued a statement Saturday praising the president. "Under the Constitution, the responsibility to declare war lies with Congress," the Republican lawmakers said. "We are glad the president is seeking authorization for any military action in Syria in response to serious, substantive questions being raised." More than 160 legislators, including 63 of Obama's fellow Democrats, had signed letters calling for either a vote or at least a "full debate" before any U.S. action. British Prime Minister David Cameron, whose own attempt to get lawmakers in his country to support military action in Syria failed earlier this week, responded to Obama's speech in a Twitter post Saturday. "I understand and support Barack Obama's position on Syria," Cameron said. An influential lawmaker in Russia -- which has stood by Syria and criticized the United States -- had his own theory. "The main reason Obama is turning to the Congress: the military operation did not get enough support either in the world, among allies of the US or in the United States itself," Alexei Pushkov, chairman of the international-affairs committee of the Russian State Duma, said in a Twitter post. In the United States, scattered groups of anti-war protesters around the country took to the streets Saturday. "Like many other Americans...we're just tired of the United States getting involved and invading and bombing other countries," said Robin Rosecrans, who was among hundreds at a Los Angeles demonstration. What do Syria's neighbors think? Why Russia, China, Iran stand by Assad . Syria's government unfazed . After Obama's speech, a military and political analyst on Syrian state TV said Obama is "embarrassed" that Russia opposes military action against Syria, is "crying for help" for someone to come to his rescue and is facing two defeats -- on the political and military levels. 
Syria's prime minister appeared unfazed by the saber-rattling. "The Syrian Army's status is on maximum readiness and fingers are on the trigger to confront all challenges," Wael Nader al-Halqi said during a meeting with a delegation of Syrian expatriates from Italy, according to a banner on Syria State TV that was broadcast prior to Obama's address. An anchor on Syrian state television said Obama "appeared to be preparing for an aggression on Syria based on repeated lies." A top Syrian diplomat told the state television network that Obama was facing pressure to take military action from Israel, Turkey, some Arabs and right-wing extremists in the United States. "I think he has done well by doing what Cameron did in terms of taking the issue to Parliament," said Bashar Jaafari, Syria's ambassador to the United Nations. Both Obama and Cameron, he said, "climbed to the top of the tree and don't know how to get down." The Syrian government has denied that it used chemical weapons in the August 21 attack, saying that jihadists fighting with the rebels used them in an effort to turn global sentiments against it. British intelligence had put the number of people killed in the attack at more than 350. On Saturday, Obama said "all told, well over 1,000 people were murdered." U.S. Secretary of State John Kerry on Friday cited a death toll of 1,429, more than 400 of them children. No explanation was offered for the discrepancy. Iran: U.S. military action in Syria would spark 'disaster' Opinion: Why strikes in Syria are a bad idea .	Syrian official: Obama climbed to the top of the tree, "doesn't know how to get down" Obama sends a letter to the heads of the House and Senate . Obama to seek congressional approval on military action against Syria . Aim is to determine whether CW were used, not by whom, says U.N. spokesman .	0001d1afc246a7964130f43ae940af6bc6c57f01

Note

In this example, the text column refers to the input-text column, while the summary column refers to the output-text column.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Text span prediction

Formats
Example

The data for a text span prediction experiment can be formatted following format 1 or 2.

Format 1
Format 2

A .csv file.

csv_name.csv (1)(2)

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require an answer column

The available dataset connectors require the data for a text span prediction experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- A context column containing/representing the input texts
- A question column containing/representing the questions (that the input context text can answer)
- An answer column containing/representing the substrings from the context column that answer the questions (question column)
- An optional answer-start column containing/representing the start of the substring answers in the context column
  Note
  - The start of the substring answers needs to be specified by integers representing the index where the answer starts in the context.
  - If you do not provide an answer-start column, H2O Hydrogen Torch selects the first occurrence of the answer in the context.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
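As a minimal sketch of the column layout described above, the following Python snippet builds one row with a context, question, answer, and answer-start value and writes it to a .csv file. The column names (`context`, `question`, `answer`, `answer_start`), the file name, and the sample row are illustrative assumptions, not names required by H2O Hydrogen Torch. The start index is computed with `str.find`, which returns the first occurrence of the answer and so mirrors the default behavior described in the note on the answer-start column.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows; replace them with your own data.
rows = [
    {
        "context": "The Grotto is a replica of the grotto at Lourdes, France.",
        "question": "Where is the original grotto located?",
        "answer": "Lourdes, France",
    },
]


def add_answer_start(row):
    """Return a copy of the row with the index of the answer's first occurrence."""
    start = row["context"].find(row["answer"])
    if start == -1:
        raise ValueError(f"Answer {row['answer']!r} is not a substring of the context")
    return {**row, "answer_start": start}


prepared = [add_answer_start(row) for row in rows]

# Illustrative output location (a temporary folder) and file name.
out_path = Path(tempfile.mkdtemp()) / "span_train.csv"
with open(out_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["context", "question", "answer", "answer_start"]
    )
    writer.writeheader()
    writer.writerows(prepared)
```

Raising an error when the answer is not found in the context is a design choice of this sketch: it surfaces rows where the answer is not an exact substring before the dataset is uploaded.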

The squad_text_span_prediction.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text span prediction problem. The structure of the .zip file is as follows:

squad_text_span_prediction.zip
│   └───squad_v1.csv

The following is a random row from the .csv file:

question	context	answer
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?	Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.	Saint Bernadette Soubirous

Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Text token classification

Formats
Example
Conversions

The data for a text token classification experiment can be formatted following format 1 or 2.

Format 1
Format 2

A .pq (parquet) file.

parquet_name.pq (1)(2)

A .zip file containing a .pq (parquet) file.

folder_name.zip (1)
│   └───parquet_name.pq (2)

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not require a label column

The available dataset connectors require the data for a text token classification experiment to be in a .zip or .pq file.
Note
To learn how to upload your .zip or .pq file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- A text column containing tokenized text: each sample should have a list of string tokens
- A label column containing token labels for the tokenized text; each sample should have a list of token labels. Labels should be represented as categorical string values
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
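Before writing a .pq file, it can help to check that every sample has the expected shape. The sketch below is one possible check, not part of H2O Hydrogen Torch: the function name `validate_token_samples` and the sample data are hypothetical, and the rule that each sample needs exactly one label per token is an assumption inferred from the column descriptions above.

```python
def validate_token_samples(samples):
    """Check that each sample pairs a list of string tokens with a list of string labels.

    Assumption: one label per token, so both lists must have the same length.
    """
    problems = []
    for i, (tokens, labels) in enumerate(samples):
        if not all(isinstance(token, str) for token in tokens):
            problems.append(f"sample {i}: tokens must be strings")
        if not all(isinstance(label, str) for label in labels):
            problems.append(f"sample {i}: labels must be categorical strings")
        if len(tokens) != len(labels):
            problems.append(
                f"sample {i}: {len(tokens)} tokens but {len(labels)} labels"
            )
    return problems


# Hypothetical samples given as (tokens, labels) pairs.
samples = [
    (["Nijmeh", "of", "Lebanon"], ["B-ORG", "O", "B-LOC"]),
    (["Saudi", "Arabia"], ["B-LOC"]),  # deliberately broken: one label is missing
]

for problem in validate_token_samples(samples):
    print(problem)
```

An empty result means the samples passed these checks and can be stored as list-valued text and label columns in a parquet file.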

The conll2003_text_token_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text token classification problem. The structure of the .zip file is as follows:

conll2003_text_token_classification.zip
│   └───test.pq
│   └───train.pq
│   └───validation.pq

The following is a random row from the train.pq file:

id	text	pos_tags	chunk_tags	ner_tags
4158	['Nijmeh' 'of' 'Lebanon' 'beat' 'Nasr' 'of' 'Saudi' 'Arabia' '1-0' '(' 'halftime' '1-0' ')' 'in' 'their' 'Asian' 'club' 'championship' 'second' 'round' 'first' 'leg' 'tie' 'on' 'Saturday' '.']	['NNS' 'IN' 'NNP' 'VBD' 'NNP' 'IN' 'NNP' 'NNP' 'NNP' '(' 'NN' 'CD' ')' 'IN' 'PRP$' 'JJ' 'NN' 'NN' 'NN' 'NN' 'JJ' 'NN' 'NN' 'IN' 'NNP' '.']	['B-NP' 'B-VP' 'B-VP' 'I-VP' 'B-NP' 'I-NP' 'B-PP' 'B-NP' 'O' 'O' 'B-NP' 'B-NP' 'I-NP' 'I-NP' 'B-PP' 'B-NP' 'I-NP' 'B-NP' 'I-NP' 'B-VP' 'B-NP' 'B-PP' 'B-VP' 'O']	['B-ORG' 'O' 'B-LOC' 'O' 'B-ORG' 'O' 'B-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-MISC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O']

Note

The *_tags columns refer to the label column and can only be selected when running a text token classification experiment. Only one column from the available label columns can be selected when running an experiment.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Details

Convert CoNLL-2003 dataset

from pathlib import Path

import pandas as pd

try:
    import datasets
except ImportError as err:
    raise ImportError(
        "Need datasets>=1.11.0 to download English CoNLL2003 data!"
    ) from err

dataset = datasets.load_dataset("conll2003")

for subset in dataset:
    out_path = Path(f"/data/conll2003/{subset}.pq")
    out_path.parent.mkdir(exist_ok=True, parents=True)

    df = pd.DataFrame(dataset[subset])

    # Decode the integer-encoded labels back into their string names
    for feature in dataset[subset].features:
        if isinstance(dataset[subset].features[feature], datasets.Sequence):
            feat = dataset[subset].features[feature].feature

            if isinstance(feat, datasets.ClassLabel):
                df[feature] = df[feature].apply(feat.int2str)

    df.rename(columns={"tokens": "text"}, inplace=True)

    df.to_parquet(out_path, engine="pyarrow", index=False)

Text metric learning

Formats
Example

The data for a text metric learning experiment can be formatted following format 1 or 2.

Format 1
Format 2

A .csv file.

csv_name.csv (1)(2)

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a label column

The available dataset connectors require the data for a text metric learning experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- A text column containing the input texts
- A label column containing the class names
  Note
  Texts that are similar should have the same class name.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
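The optional fold column can be filled in many ways. The sketch below shows one simple option, a shuffled and evenly sized assignment of the integers 0 to N-1. The function name `assign_folds`, the seed, and the sample rows are hypothetical, and random assignment is an assumption of this sketch; depending on your data, a split that keeps related rows in the same fold may be more appropriate.

```python
import random


def assign_folds(n_rows, n_folds, seed=0):
    """Return a shuffled list of fold indexes (0 .. n_folds-1) of near-equal size."""
    folds = [i % n_folds for i in range(n_rows)]
    random.Random(seed).shuffle(folds)
    return folds


# Hypothetical rows: (text, label) pairs where similar texts share a class name.
rows = [
    ("how do I install a package ?", "install"),
    ("what is the command to install software ?", "install"),
    ("how can I change my password ?", "password"),
    ("how do I reset a forgotten password ?", "password"),
    ("which command lists open ports ?", "network"),
    ("how do I see listening ports ?", "network"),
]

folds = assign_folds(len(rows), n_folds=3)

# Each entry is (text, label, fold), matching the column order of the example file.
table = [(text, label, fold) for (text, label), fold in zip(rows, folds)]
```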

The ubuntu_text_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text metric learning problem. The structure of the .zip file is as follows:

ubuntu_text_metric_learning.zip
│   └───train.csv
│   └───test.csv

The following is a random row from the train.csv file:

text	label	fold
what is the easiest way to strip a desktop edition to a server edition ?	16	1

Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Audio

Audio regression

Format
Example

The data for an audio regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3)
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)

The available dataset connectors require the data for an audio regression experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
  Note
  - Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
  - Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing the numerical labels (targets)
  Note
  H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting an audio regression experiment.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audios in this folder to run the audio regression experiment.
Note
All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
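The layout described above can be assembled and sanity-checked with a short script. The following is a minimal sketch that uses only the Python standard library: it confirms that every file named in the audio column exists in the audio folder and has one of the extensions listed under Supported audio extensions for audio processing, then writes the .zip file. The function name `build_dataset_zip`, the file and folder names, and the case-insensitive extension check are assumptions made for illustration.

```python
import csv
import tempfile
import zipfile
from pathlib import Path

# Extensions listed under "Supported audio extensions for audio processing".
SUPPORTED = {".wav", ".aiff", ".flac", ".mp3", ".ogg"}


def build_dataset_zip(csv_path, audio_dir, zip_path, audio_col="audio"):
    """Verify every file named in the audio column, then zip the CSV and audio folder."""
    csv_path, audio_dir = Path(csv_path), Path(audio_dir)
    with open(csv_path, newline="", encoding="utf-8") as f:
        names = [row[audio_col] for row in csv.DictReader(f)]

    missing = [n for n in names if not (audio_dir / n).is_file()]
    unsupported = [n for n in names if Path(n).suffix.lower() not in SUPPORTED]
    if missing or unsupported:
        raise ValueError(f"missing: {missing}, unsupported extension: {unsupported}")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, csv_path.name)
        for n in sorted(set(names)):
            zf.write(audio_dir / n, f"{audio_dir.name}/{n}")
    return zip_path


# Tiny self-contained demo that uses empty placeholder files instead of real audio.
work = Path(tempfile.mkdtemp())
(work / "audios").mkdir()
for name in ("a.ogg", "b.ogg"):
    (work / "audios" / name).touch()
(work / "meta.csv").write_text(
    "audio,label,fold\na.ogg,2,0\nb.ogg,9,1\n", encoding="utf-8"
)

archive = build_dataset_zip(work / "meta.csv", work / "audios", work / "dataset.zip")
```

Because the audio column in this demo holds bare file names, the audio folder would still need to be selected as the data folder when the dataset is imported.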

The amnist_audio_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an audio regression problem. The .zip file contains a .csv file and an audio folder. The structure of the .zip file is:

amnist_audio_regression.zip
│   └───amnist_meta.csv
│   │
│   └───amnist_audios
│        └───0_01_0.ogg
│        └───0_01_1.ogg
│        └───0_01_2.ogg
│           ...

The first three rows of the .csv file are:

audio	label	fold
2_26_2.ogg	2	0
2_26_38.ogg	2	1
9_26_47.ogg	9	2

Note

In this example, the data directory in the audio column is not specified. That being the case, it needs to be specified when uploading the dataset, and the amnist_audios folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Audio classification

Format
Example

The data for an audio classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3) 
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)

The available dataset connectors require the data for an audio classification experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
  Note
  - Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
  - Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing either multi-class labels (One-hot encoded) or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
  Note
  - H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. The classes are mutually exclusive in multi-class problems, while the classes represent unique labels for multi-label problems. For N label class columns, in multi-class problems, only a single column could be set to 1, while in multi-label problems, all or none could be set to 1.
  - For binary classification experiments utilizing precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using a one-hot encoder method, resulting in the experiment being treated as a multi-class classification experiment and leading to an incorrect calculation of the precision, recall, F1, F05, or F2 metric.
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An audio folder that contains all the audio files specified in the audio column above; H2O Hydrogen Torch uses the audios in this folder to run the audio classification experiment.
Note
All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
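The note on binary classification metrics above implies that string labels should be converted to 0/1 values before the dataset is uploaded. The following is a minimal sketch of such a conversion; the function name `encode_binary_labels`, the choice of positive class, and the sample labels are hypothetical.

```python
def encode_binary_labels(labels, positive_class):
    """Map string labels to 0/1 values, with 1 marking the positive class."""
    classes = set(labels)
    if len(classes) != 2 or positive_class not in classes:
        raise ValueError(
            f"expected exactly two classes including {positive_class!r}, "
            f"got {sorted(classes)}"
        )
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical string labels for a binary problem.
labels = ["dog", "chainsaw", "dog", "dog"]
encoded = encode_binary_labels(labels, positive_class="dog")
print(encoded)  # [1, 0, 1, 1]
```

Which class is encoded as 1 matters for precision, recall, and the F-scores, so the positive class is passed explicitly rather than inferred.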

The esc10_audio_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multiclass audio classification problem. The structure of the .zip file is:

esc10_audio_classification.zip
│   └───esc10_meta.csv
│   │
│   └───audio_esc10
│       └───2-37806-B-40.wav
│       └───5-200339-A-1.wav
│       └───1-172649-D-40.wav
│       ...

The first three rows of the .csv file are:

filename	fold	label
1-100032-A-0.wav	0	dog
1-110389-A-0.wav	0	dog
1-116765-A-41.wav	0	chainsaw

Note

In this example, the data directory in the filename column is not specified. That being the case, it needs to be specified when uploading the dataset, and the audio_esc10 folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Speech

Speech recognition

Format
Example

The data for a speech recognition experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3) 
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)

The available dataset connectors require the data for a speech recognition experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
  Note
  - To learn about supported audio extensions for a speech recognition experiment, see Supported audio extensions for speech recognition.
  - Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
  tip
  For most supported speech architectures, utilize speech audios of up to 30 seconds. Attempting to train with longer speech samples may lead to:
  - Out-of-memory (OOM) issues even on high VRAM GPUs
  - Poor training performance
- One label column containing the text transcript of the audio
- An optional fold column containing cross-validation fold indexes
  Note
  The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audios in this folder to run the experiment.
Note
All audios need to have an audio extension. To learn about supported audio extensions, see Supported audio extensions for speech recognition.
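To act on the tip about keeping speech audios to 30 seconds or less, clip durations can be measured before the dataset is packaged. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM .wav files; the function names, the 8000 Hz sample rate, and the silent demo files are assumptions made only to keep the example self-contained.

```python
import tempfile
import wave
from pathlib import Path

MAX_SECONDS = 30  # upper bound suggested for most supported speech architectures


def wav_duration_seconds(path):
    """Return the duration of a .wav file in seconds (frames / frame rate)."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def split_by_duration(paths, max_seconds=MAX_SECONDS):
    """Split paths into (kept, too_long) according to their duration."""
    kept, too_long = [], []
    for path in paths:
        if wav_duration_seconds(path) <= max_seconds:
            kept.append(path)
        else:
            too_long.append(path)
    return kept, too_long


def write_silence(path, seconds, rate=8000):
    """Write a silent 16-bit mono .wav file; used only to build demo inputs."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


work = Path(tempfile.mkdtemp())
write_silence(work / "short.wav", seconds=2)
write_silence(work / "long.wav", seconds=31)

kept, too_long = split_by_duration([work / "short.wav", work / "long.wav"])
```

Files that end up in `too_long` could be trimmed or split into shorter segments, with their transcripts adjusted to match, before they are added to the .csv file.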

The minds14_US_speech_recognition.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a speech recognition problem. The structure of the .zip file is:

minds14-US_speech_recognition.zip
│   └───annotations.csv
│   └───audio
│       └───0.wav
│       └───1.wav
│       └───2.wav
│       ...

The first three rows of the .csv file are:

file	transcript	duration
0.wav	I WOULD LIKE TO SET UP A JOINT ACCOUNT WITH MY PARTNER [...]	11
1.wav	I'M WONDERING HOW TO SET UP A JOINT ACCOUNT WITH MY WIFE [...]	7
2.wav	HI I'D LIKE TO SET UP A JOINT ACCOUNT WIH MY PARTNER I'M NOT SEEING [...]	24

Note

The duration column is not a required column when formatting your dataset for a speech recognition experiment.
In this example, the data directory in the file column is not specified. That being the case, it needs to be specified when uploading the dataset, and the audio folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

Data collection

Example 1: Amazon S3

The following Python script parses the folder structure of an Amazon S3 bucket and collects images into a single dataset. In other words, the script demonstrates how to create a new dataset (.zip file) from several files in S3 and then upload it back to S3.

# Import libraries

# We use `boto3` to connect to S3
# Optionally `tqdm` can be used to show download progress
# We use pandas for data manipulation
from boto3.session import Session
from tqdm import tqdm
import pandas as pd
import shutil
import os


# Set AWS credentials
# You can set them directly or use the environment variables, if those are set
aws_access_key = os.environ["AWS_ACCESS_KEY_ID"]
aws_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]

# Set list of bucket paths, that contain image files
images_bucket_subfolders = ["h2o-release/hydrogen-torch/data-prep"]

# Set path to the train CSV
csv_path = "h2o-release/hydrogen-torch/data-prep/csvs/train.csv"

# Set allowed file extensions
allowed_extensions = [".jpg", ".jpeg", ".png"]

# Files will be downloaded to data folder
data_folder = "data"
image_folder = f"{data_folder}/images"
os.makedirs(image_folder, exist_ok=True)

# Connect to S3
s3 = Session(aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key).resource("s3")


# Download train.csv
bucket, csv_path = csv_path.split("/", 1)
output_csv_path = f"{data_folder}/{os.path.basename(csv_path)}"
s3.Bucket(bucket).download_file(csv_path, output_csv_path)


# Make sure the "Image Column" contains only file names, not full paths
image_col = "image"

data = pd.read_csv(output_csv_path)
data[image_col] = data[image_col].map(os.path.basename)
# to_csv writes the file and returns None, so its result is not assigned
data.to_csv(output_csv_path, index=False)


# Download all image files
for images_bucket_subfolder in images_bucket_subfolders:

    if "/" in images_bucket_subfolder:
        bucket, subfolder = images_bucket_subfolder.split("/", 1)
    else:
        bucket, subfolder = images_bucket_subfolder, ""

    s3_bucket = s3.Bucket(bucket)
    files = s3_bucket.objects

    if subfolder:
        files = files.filter(Prefix=f"{subfolder}/")


    files = list(files)

    for file in tqdm(files):
        if file.key.endswith(tuple(allowed_extensions)):
            s3_bucket.download_file(file.key, f"{image_folder}/{os.path.basename(file.key)}")


# Create ZIP file that can be imported to H2O Hydrogen Torch
zip_file_name = "flowers_image_classification"
full_zip_file_name = shutil.make_archive(zip_file_name, 'zip', data_folder)

# Set desired S3 path where to upload the ZIP file in format "bucket_name" or "bucket_name/subfolder_1/.../subfolder_n"
upload_bucket_path = "YOUR_BUCKET_NAME/SUB_FOLDER"

# Upload the ZIP file
rel_zip_file_name = os.path.basename(full_zip_file_name)
upload_path = f"{upload_bucket_path}/{rel_zip_file_name}"
upload_bucket, upload_zip_path = upload_path.split("/", 1)
# Use the path returned by make_archive so the upload does not depend on the working directory
s3.Bucket(upload_bucket).upload_file(full_zip_file_name, upload_zip_path)

Supported audio extensions for speech recognition

For speech recognition, H2O Hydrogen Torch supports the following audio extension:

Uncompressed (.wav).

Supported audio extensions for audio processing

The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:

Uncompressed: .wav, .aiff
Lossless compressed: .flac
Lossy compressed: .mp3, .ogg

Supported image extensions for image processing

The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:

Windows bitmaps: .bmp
JPEG files: .jpeg, .jpg, .jpe
JPEG 2000 files: .jp2
Portable Network Graphics: .png
WebP: .webp
Portable image format: .pbm, .pgm, .ppm, .pnm
TIFF files: .tiff, .tif
Radiance HDR: .hdr
NumPy data array: .npy
note
For 2D image processing, the data must be of shape [height, width, channels].

Supported 3D image extensions for 3D image processing

For 3D image problem types, H2O Hydrogen Torch supports the following 3D image extension:

NumPy data array: .npy
note
For 3D image processing, the data must be of shape [height, width, depth, channels].
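The shape requirements for .npy files can be checked before upload. The following sketch, which assumes NumPy is installed, loads a .npy file and confirms that it has three dimensions for 2D images ([height, width, channels]) or four dimensions for 3D images ([height, width, depth, channels]). The function name `check_npy_shape` and the demo array sizes are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np


def check_npy_shape(path, three_d=False):
    """Load a .npy file and confirm it has the expected number of dimensions.

    2D images: [height, width, channels]; 3D images: [height, width, depth, channels].
    """
    array = np.load(path)
    expected_ndim = 4 if three_d else 3
    if array.ndim != expected_ndim:
        raise ValueError(
            f"{path}: expected {expected_ndim} dimensions, got shape {array.shape}"
        )
    return array.shape


# Demo arrays with arbitrary sizes, saved to a temporary folder.
work = Path(tempfile.mkdtemp())
np.save(work / "image_2d.npy", np.zeros((64, 48, 3), dtype=np.uint8))
np.save(work / "image_3d.npy", np.zeros((64, 48, 16, 1), dtype=np.uint8))

shape_2d = check_npy_shape(work / "image_2d.npy")
shape_3d = check_npy_shape(work / "image_3d.npy", three_d=True)
```

The check only counts dimensions; it cannot tell whether the axes are in the required order, so that still has to be confirmed against how the arrays were created.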

