Version: v1.3.0

Dataset formats

Overview

Your dataset needs to be formatted in a specific way for each supported problem type. The sections below describe how to format a dataset for each supported problem type.

With H2O Label Genie (a Wave application in H2O AI Cloud), you can label your image, text, and audio data to generate annotated datasets supported in H2O Hydrogen Torch. To learn more, see H2O Label Genie | Docs.

Note

To learn how to import a formatted (preprocessed) dataset, see Import a dataset.

Image

Image regression

The data for an image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

  1. The available dataset connectors require the data for an image regression experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing the numerical labels (targets)
      Note

      H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image regression experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
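
The layout above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an official tool: the column names `image`, `label`, and `fold`, the file name `train.csv`, the folder name `images`, and the sample values are all hypothetical, and placeholder bytes stand in for real image data.

```python
import csv
import io
import zipfile

# Hypothetical sample data: image file names and numeric targets.
samples = [
    ("img_001.jpg", 3.2),
    ("img_002.png", 1.7),
    ("img_003.jpg", 4.9),
    ("img_004.jpg", 2.4),
]
N_FOLDS = 2  # hypothetical number of cross-validation folds


def build_train_csv(rows, n_folds):
    """Return CSV text with an image, a label, and an optional fold column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image", "label", "fold"])
    for index, (image_name, label) in enumerate(rows):
        # Round-robin assignment yields fold indexes 0 .. n_folds - 1.
        writer.writerow([image_name, label, index % n_folds])
    return buffer.getvalue()


def build_zip(csv_text, image_names, image_folder="images"):
    """Return the bytes of a .zip file holding train.csv and an image folder."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("train.csv", csv_text)
        for name in image_names:
            # Placeholder bytes stand in for real image data.
            zf.writestr(f"{image_folder}/{name}", b"placeholder")
    return archive.getvalue()


csv_text = build_train_csv(samples, N_FOLDS)
zip_bytes = build_zip(csv_text, [name for name, _ in samples])

with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
    print(zf.namelist())
```

Because the image column in this sketch holds bare file names, the `images` folder would have to be chosen as the data folder when importing the dataset.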

3D image regression

The data for a 3D image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

  1. The available dataset connectors require the data for a 3D image regression experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing the numerical labels (targets)
      Note

      H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image regression experiment.
    Note

    All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.

Image classification

The data for an image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

  1. The available dataset connectors require the data for an image classification experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
      Note
      • H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label. For N label columns, only a single column can be set to 1 in a multi-class problem, while any number of columns, from none to all, can be set to 1 in a multi-label problem.
      • For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is transformed into multiple columns with a one-hot encoder. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image classification experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
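
The label rules above can be illustrated with a small, self-contained Python sketch. The class names, the helper functions, and the reading of "only a single column" as "exactly one column per row" are assumptions made for illustration; none of this is part of H2O Hydrogen Torch.

```python
def one_hot(labels, classes):
    """Return one row of 0/1 values per label, with one column per class."""
    return [[1 if label == cls else 0 for cls in classes] for label in labels]


def is_valid_multi_class(rows):
    """Multi-class (assumed): exactly one column is set to 1 in every row."""
    return all(sum(row) == 1 for row in rows)


def to_binary(labels, positive):
    """Map string labels to 0/1 values for a binary label column."""
    return [1 if label == positive else 0 for label in labels]


classes = ["cat", "dog", "bird"]  # hypothetical class names
multi_class_rows = one_hot(["dog", "cat", "bird"], classes)

# Multi-label rows may set any number of columns, from none to all.
multi_label_rows = [[1, 1, 0], [0, 0, 0], [1, 1, 1]]

print(multi_class_rows)
print(is_valid_multi_class(multi_class_rows))
print(is_valid_multi_class(multi_label_rows))
print(to_binary(["yes", "no", "yes"], positive="yes"))
```

Converting string labels to 0/1 before import, as `to_binary` does, keeps a binary experiment from being treated as a multi-class one.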

3D image classification

The data for a 3D image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label columns

  1. The available dataset connectors require the data for a 3D image classification experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
      Note
      • H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label. For N label columns, only a single column can be set to 1 in a multi-class problem, while any number of columns, from none to all, can be set to 1 in a multi-label problem.
      • For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is transformed into multiple columns with a one-hot encoder. The experiment is then treated as a multi-class classification experiment, and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image classification experiment.
    Note

    All images need to have an image extension. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.

Image metric learning

The data for an image metric learning experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require a label column
  1. The available dataset connectors require the data for an image metric learning experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A label column containing the class names
      Note

      Similar images should have the same class name.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image metric learning experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Image object detection

H2O Hydrogen Torch supports several dataset (data) formats for an image object detection experiment. Supported formats are as follows:

Hydrogen Torch format

The data following the Hydrogen Torch format for an image object detection experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3).

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not require the class_id, x_min, x_max, y_min, and y_max columns
  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each bounding box. Each row of the dataset should contain a list of class names, where each element in the list refers to a single box
    • An x_min, x_max, y_min, and y_max column corresponding to the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a list of coordinates, where each element in the list refers to a single box
      Note
      • The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
      • The lists in the class_id, x_min, x_max, y_min, and y_max columns need to have the same length, equal to the total number of bounding boxes in the respective image. If an image has no boxes, all lists need to be empty.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
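
The list rules above lend themselves to a quick sanity check before the rows are written to a .pq file. The following sketch is an illustration only: the `validate_row` helper and the sample rows are hypothetical, and the corner check assumes image coordinates in which the upper-left corner has the smaller x and y values.

```python
BOX_COLUMNS = ["class_id", "x_min", "x_max", "y_min", "y_max"]


def validate_row(row):
    """Return a list of problems found in one per-image row (empty if none)."""
    lengths = {len(row[column]) for column in BOX_COLUMNS}
    if len(lengths) != 1:
        return ["box lists differ in length"]
    problems = []
    boxes = zip(row["x_min"], row["y_min"], row["x_max"], row["y_max"])
    for index, (x_min, y_min, x_max, y_max) in enumerate(boxes):
        # Assumed: the upper-left corner lies above and left of the lower-right one.
        if not (x_min < x_max and y_min < y_max):
            problems.append(f"box {index} has inverted corners")
    return problems


# Hypothetical rows: two boxes, no boxes, and mismatched list lengths.
good_row = {
    "image": "a.jpg",
    "class_id": ["wheat", "wheat"],
    "x_min": [10, 50], "y_min": [20, 60],
    "x_max": [30, 90], "y_max": [40, 80],
}
empty_row = {"image": "b.jpg", "class_id": [], "x_min": [], "y_min": [],
             "x_max": [], "y_max": []}
bad_row = {"image": "c.jpg", "class_id": ["wheat"], "x_min": [10, 50],
           "y_min": [20], "x_max": [30], "y_max": [40]}

for row in (good_row, empty_row, bad_row):
    print(row["image"], validate_row(row))
```

Rows that pass the check could then be written to a parquet file, for example with the pandas `DataFrame.to_parquet` method, which is left out here to keep the sketch free of third-party dependencies.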

Example

The global_wheat_image_object_detection.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image object detection problem. The structure of the .zip file is as follows:

global_wheat_image_object_detection.zip
│   └───train.pq
│   │
│   └───images
│       └───7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg
│       └───3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg
│       └───37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg
│       ...

The following are three random rows from the .pq file:

image | class_id | x_min | y_min | x_max | y_max
7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg | ['wheat' 'wheat' 'wheat' ...] | [689 718 382 ...] | [884 464 42 ...] | [754 768 450 ...] | [920 516 101 ...]
3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg | ['wheat' 'wheat' 'wheat' ...] | [924 698 904 ...] | [195 10 32 ...] | [981 763 938 ...] | [247 101 79 ...]
37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg | ['wheat' 'wheat' 'wheat' ...] | [919 811 4 ...] | [535 820 96 ...] | [1024 912 71 ...] | [613 894 164 ...]
Note
  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.
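
When the image column holds bare file names, as in this example, it can help to confirm before uploading that every name resolves inside the chosen data folder. The sketch below is a hypothetical check, not part of H2O Hydrogen Torch: the `missing_images` helper, the archive contents, and the file names are made up for illustration.

```python
import io
import posixpath
import zipfile


def missing_images(zip_bytes, image_names, data_folder=""):
    """Return the image names that are not found under data_folder in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        members = set(zf.namelist())
    return [
        name for name in image_names
        if posixpath.join(data_folder, name) not in members
    ]


# Build a small in-memory archive with placeholder file contents.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("train.pq", b"placeholder")
    zf.writestr("images/a.jpg", b"placeholder")
    zf.writestr("images/b.jpg", b"placeholder")
zip_bytes = archive.getvalue()

image_column = ["a.jpg", "b.jpg", "c.jpg"]  # hypothetical image column values
print(missing_images(zip_bytes, image_column, data_folder="images"))
print(missing_images(zip_bytes, image_column))
```

With the data folder set to `images`, only the file that is absent from the archive is reported; without a data folder, none of the bare names resolve.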

Individual boxes format

The data following the individual boxes format for an image object detection experiment is structured as follows: A .zip file (1) containing a .csv file (2) and an image folder (3):

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require the class_id, x_min, x_max, y_min, and y_max columns

  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each box. Each row of the dataset should contain a single box
    • An x_min, x_max, y_min, and y_max column containing the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a single coordinate value for a corresponding bounding box
      Note
      • The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
      • If a box is not present for a given image, the class_id, x_min, x_max, y_min, and y_max columns should be empty.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Example
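image    | x_min | y_min | x_max | y_max | class_id
bafc.jpg | 311   | 43    | 378   | 134   | wheat
bafc.jpg | 276   | 83    | 354   | 153   | wheat
bafc.jpg | 442   | 309   | 541   | 381   | wheat
cryv.jpg | 301   | 13    | 328   | 124   | wheat
cryv.jpg | 246   | 80    | 344   | 113   | wheat
cryv.jpg | 432   | 303   | 341   | 181   | wheat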
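
The individual boxes format (one row per box) and the Hydrogen Torch format (one row per image with lists) carry the same information. As a rough sketch of how the two relate, the following Python groups per-box rows into per-image lists. The CSV content, the `group_boxes` helper, and the treatment of an empty class_id as "no box" are assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV in the individual boxes format; c.jpg has no boxes.
CSV_TEXT = """image,x_min,y_min,x_max,y_max,class_id
a.jpg,10,20,30,40,wheat
a.jpg,50,60,90,80,wheat
b.jpg,5,15,25,35,wheat
c.jpg,,,,,
"""

BOX_COLUMNS = ["class_id", "x_min", "y_min", "x_max", "y_max"]


def group_boxes(csv_text):
    """Group one-box-per-row records into one record of lists per image."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        lists = grouped.setdefault(
            row["image"], {column: [] for column in BOX_COLUMNS}
        )
        if row["class_id"] == "":
            continue  # An image without boxes keeps empty lists.
        lists["class_id"].append(row["class_id"])
        for column in BOX_COLUMNS[1:]:
            lists[column].append(int(row[column]))
    return grouped


grouped = group_boxes(CSV_TEXT)
for image_name, lists in grouped.items():
    print(image_name, lists)
```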

COCO format

The data following the COCO format for an image object detection experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

  • A train .json file needs to follow the format described above
  • A validation .json file needs to follow the same format as a train .json file
  • A test .json file needs to follow the same format as a train .json file, but does not require labels
  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .json file that contains labels in a COCO format.
  3. A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
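
For orientation, a COCO-style label file for object detection is commonly organized into `images`, `categories`, and `annotations` sections, with each bounding box stored as `[x, y, width, height]`. The sketch below builds such a file with the standard `json` module. All values are hypothetical, and which fields H2O Hydrogen Torch actually reads is not stated here, so treat this as an assumption-laden illustration rather than a specification.

```python
import json


def corners_to_coco_bbox(x_min, y_min, x_max, y_max):
    """Convert corner coordinates to the COCO [x, y, width, height] layout."""
    return [x_min, y_min, x_max - x_min, y_max - y_min]


bbox = corners_to_coco_bbox(10, 20, 30, 40)

# Hypothetical single-image dataset with one labeled box.
coco = {
    "images": [
        {"id": 1, "file_name": "a.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "wheat"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,      # refers to images[].id
            "category_id": 1,   # refers to categories[].id
            "bbox": bbox,
            "area": bbox[2] * bbox[3],
            "iscrowd": 0,
        },
    ],
}

json_text = json.dumps(coco, indent=2)
print(json_text)
```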

Pascal VOC format

The data following the Pascal VOC format for an image object detection experiment is structured as follows: A .zip file (1) containing a folder with .xml files with labels (2) and an image folder (3):

folder_name.zip (1)
│   └───xml_folder_name (2)
│       └───name_of_image.xml
│       └───name_of_image.xml
│       └───name_of_image.xml
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple folders with labels in the .zip file that you can use as train, validation, and test datasets:

  • A train folder with labels needs to follow the format described above
  • A validation folder with labels should have the same format as a train folder
  • A test folder with labels should have the same format as a train folder, but labels are not required
  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A folder that contains .xml files with labels in a Pascal VOC format.
  3. An image folder that contains all the images specified in the .xml files; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
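
A Pascal VOC label file is an XML document with one `object` element per bounding box. As a minimal sketch, the following Python writes such a file with the standard library. The `build_voc_annotation` helper and the sample values are hypothetical, only commonly used VOC elements are included, and which elements H2O Hydrogen Torch requires is an open assumption.

```python
import xml.etree.ElementTree as ET


def build_voc_annotation(file_name, width, height, boxes):
    """Return Pascal VOC style XML text for one image and its boxes."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = file_name
    size = ET.SubElement(root, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for name, x_min, y_min, x_max, y_max in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        bndbox = ET.SubElement(obj, "bndbox")
        corners = (("xmin", x_min), ("ymin", y_min),
                   ("xmax", x_max), ("ymax", y_max))
        for tag, value in corners:
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical image with two labeled boxes.
xml_text = build_voc_annotation(
    "a.jpg", 640, 480,
    [("wheat", 10, 20, 30, 40), ("wheat", 50, 60, 90, 80)],
)
print(xml_text)
```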

Image semantic segmentation

H2O Hydrogen Torch supports several dataset (data) formats for an image semantic segmentation experiment. Supported formats are as follows:

Hydrogen Torch format

The data following the Hydrogen Torch format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not need the class_id and rle_mask columns

  1. The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
    • A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
      Note

      The class_id and rle_mask lists must have the same length, equal to the total number of classes.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image semantic segmentation experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
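
Run-length encoding stores a binary mask as pairs of a start position and a run length. The sketch below shows one common variant in plain Python. It assumes 1-indexed start positions over an already flattened mask; the pixel ordering used when flattening and the exact convention H2O Hydrogen Torch expects are not stated here, so check them before relying on this. The helper names and sample masks are hypothetical.

```python
def rle_encode(flat_mask):
    """Encode a flat 0/1 sequence as 'start length start length ...'."""
    runs = []
    run_start = None
    for position, value in enumerate(flat_mask, start=1):  # 1-indexed (assumed)
        if value and run_start is None:
            run_start = position
        elif not value and run_start is not None:
            runs += [run_start, position - run_start]
            run_start = None
    if run_start is not None:
        # Close a run that extends to the end of the mask.
        runs += [run_start, len(flat_mask) + 1 - run_start]
    return " ".join(str(number) for number in runs)


def rle_decode(rle, size):
    """Rebuild the flat 0/1 sequence of the given size from an RLE string."""
    flat_mask = [0] * size
    numbers = [int(number) for number in rle.split()]
    for start, length in zip(numbers[0::2], numbers[1::2]):
        for position in range(start - 1, start - 1 + length):
            flat_mask[position] = 1
    return flat_mask


# Hypothetical flattened masks for two classes; 'pants' is absent from the image.
masks = {"shoes": [0, 1, 1, 0, 0, 1, 1, 1], "pants": [0] * 8}
class_id = list(masks)
rle_mask = [rle_encode(mask) for mask in masks.values()]

print(class_id)
print(rle_mask)
```

A class without any mask pixels encodes to an empty string, so the class_id and rle_mask lists keep the same length.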

Example

The fashion_image_semantic_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image semantic segmentation problem. The structure of the .zip file is as follows:

fashion_image_semantic_segmentation.zip
│   └───train.pq
│   │
│   └───images
│       └───img_0458.png
│       └───img_0604.png
│       └───img_0668.png
│       ...

The following are three random rows from the .pq file:

image        | class_id                                 | rle_mask
img_0458.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['180629 7 181447 17...
img_0604.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['189672 2 190493 9...
img_0668.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['108023 11 108848 11...
Note
  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

COCO format

The data following the COCO format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

  • A train .json file needs to follow the format described above
  • A validation .json file needs to follow the same format as a train .json file
  • A test .json file needs to follow the same format as a train .json file, but does not require labels
  1. The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .json file that contains labels in a COCO format.
  3. A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run the image semantic segmentation experiment.

3D image semantic segmentation

The data for a 3D image semantic segmentation experiment needs to be structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not need the class_id and rle_mask columns

  1. The available dataset connectors require the data for a 3D image semantic segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
    • A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
      Note

      In every row, the class_id and rle_mask lists must have the same length, and that length must equal the total number of classes.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the 3D image semantic segmentation experiment.
    Note

    All images need to have an image extension specified. To learn about supported 3D image extensions, see Supported 3D image extensions for 3D image processing.
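
As a rough illustration of the class_id and rle_mask columns described above, the following sketch builds one row in which every class appears in class_id and a class without a mask gets an empty string. The "start length" pair encoding, the 1-based start index, the way the mask is flattened, and all names used here (rle_encode, build_row, the class names, the file name) are assumptions made for illustration; verify them against how your masks were produced.

```python
def rle_encode(flat_mask, index_base=1):
    """Encode a flat 0/1 sequence as space-separated 'start length' pairs.

    The pair layout and the 1-based start index are assumptions.
    """
    runs = []
    run_start = None
    for i, value in enumerate(flat_mask):
        if value and run_start is None:
            run_start = i  # a new run of foreground pixels begins
        elif not value and run_start is not None:
            runs.extend([run_start + index_base, i - run_start])
            run_start = None
    if run_start is not None:  # close a run that reaches the end
        runs.extend([run_start + index_base, len(flat_mask) - run_start])
    return " ".join(str(n) for n in runs)


def build_row(image_name, all_classes, masks_by_class):
    """Return one dataset row with equally long class_id and rle_mask lists."""
    rle_masks = [
        rle_encode(masks_by_class[name]) if name in masks_by_class else ""
        for name in all_classes
    ]
    return {"image": image_name, "class_id": list(all_classes), "rle_mask": rle_masks}


# Hypothetical class names and a tiny flattened mask, for illustration only.
row = build_row("scan_001.npy", ["liver", "kidney"], {"liver": [0, 1, 1, 0, 1]})
print(row)
```

In this sketch the second class has no mask, so its rle_mask entry is an empty string and both lists keep the same length.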

Image instance segmentation

H2O Hydrogen Torch supports several dataset (data) formats for an image instance segmentation experiment. Supported formats are as follows:

Hydrogen Torch format

The data following the Hydrogen Torch format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not require a class_id and rle_mask column
  1. The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each instance mask. Each row of the dataset should contain a list of class names, where each element in the list refers to a single mask instance.
    • A rle_mask column containing run-length-encoded (RLE) masks for each instance from the class_id column. Each row of the dataset should contain a list of RLE-encoded masks, where each element in the list refers to a single instance.
      Note

      In every row, the class_id and rle_mask lists must have the same length, equal to the number of instances in that image. If an image contains no instances, both lists need to be empty.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image instance segmentation experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
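
The per-row rules above (one class name and one RLE mask per instance, both lists empty when an image has no instances) can be checked before zipping the dataset. The following is a minimal sketch; check_instance_row is a hypothetical helper, and the test that each mask consists of integer pairs is an assumption based on the "start length" layout of typical RLE strings.

```python
def check_instance_row(class_ids, rle_masks):
    """Return a list of problems found in one row (an empty list means OK)."""
    problems = []
    if len(class_ids) != len(rle_masks):
        problems.append(
            f"class_id has {len(class_ids)} entries but rle_mask has {len(rle_masks)}"
        )
    for position, mask in enumerate(rle_masks):
        numbers = mask.split()
        # Assumption: a mask is a sequence of non-negative integer pairs.
        if len(numbers) % 2 != 0 or not all(n.isdigit() for n in numbers):
            problems.append(f"rle_mask[{position}] is not a list of integer pairs")
    return problems


# Illustrative rows: a valid one, a mismatched one, and an image without instances.
ok_row = check_instance_row(["car", "car"], ["91949 7", "224473 3 224952 4"])
bad_row = check_instance_row(["car"], [])
empty_row = check_instance_row([], [])
print(ok_row, bad_row, empty_row)
```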

Example

The coco_image_instance_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image instance segmentation problem. The structure of the .zip file is as follows:

coco_image_instance_segmentation.zip
│ └───train.pq
│ │
│ └───images
│ └───000000151231.jpg
│ └───000000433826.jpg
│ └───000000061159.jpg
│ ...

The following are three random rows from the .pq file:

image_id | class_id | rle_mask
000000151231.jpg | ['car' 'car'] | ['91949 7 92375 14 92801...
000000433826.jpg | ['car' 'car'] | ['224473 3 224952 4 22...
000000061159.jpg | ['car' 'car'] | ['161665 9 162291 25...
Note
  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets in H2O Hydrogen Torch.

COCO format

The data following the COCO format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│ └───json_name.json (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

  • A train .json file needs to follow the format described above
  • A validation .json file needs to follow the same format as a train .json file
  • A test .json file needs to follow the same format as a train .json file, but does not require labels
  1. The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .json file that contains labels in a COCO format.
  3. A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run an image instance segmentation experiment.
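
For orientation, the following sketch assembles a minimal annotation file with the three top-level sections commonly found in COCO instance annotations (images, categories, annotations) and checks that every annotation points at an existing image and category. The file name, image size, and polygon are made-up illustrative values, and the source does not state which COCO fields H2O Hydrogen Torch actually reads, so treat the field selection as an assumption.

```python
import json

# Made-up values for illustration; bbox is [x, y, width, height].
coco = {
    "images": [
        {"id": 1, "file_name": "name_of_image.jpg", "width": 640, "height": 480}
    ],
    "categories": [{"id": 1, "name": "car"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            # One polygon as a flat list of x, y coordinates.
            "segmentation": [[10.0, 10.0, 60.0, 10.0, 60.0, 40.0, 10.0, 40.0]],
            "bbox": [10.0, 10.0, 50.0, 30.0],
            "area": 1500.0,
            "iscrowd": 0,
        }
    ],
}

# Collect annotations that reference a missing image or category.
image_ids = {image["id"] for image in coco["images"]}
category_ids = {category["id"] for category in coco["categories"]}
dangling = [
    annotation["id"]
    for annotation in coco["annotations"]
    if annotation["image_id"] not in image_ids
    or annotation["category_id"] not in category_ids
]

coco_json = json.dumps(coco, indent=2)
print(dangling)
```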

Text

Text regression

The data for a text regression experiment can be formatted following format 1 or 2.

A .csv file.

csv_name.csv (1)(2)
  1. The available dataset connectors require the data for a text regression experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A text column containing the texts for the experiment
    • One or more label columns containing the numerical labels (targets)
      Note

      H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
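
A minimal sketch of such a .csv file, written with Python's standard csv module, might look as follows. The column names (text, label, fold), the sample texts, and the round-robin fold assignment are all illustrative assumptions; the actual columns are selected when the dataset is imported or the experiment is created.

```python
import csv
import io

# Made-up samples: (text, numerical label).
samples = [
    ("The delivery was quick and the product works well.", 4.5),
    ("Stopped working after two days.", 1.0),
    ("Average quality for the price.", 3.0),
    ("Exceeded my expectations.", 5.0),
]
n_folds = 2

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "label", "fold"])
for index, (text, label) in enumerate(samples):
    # Round-robin assignment yields fold indexes 0 .. n_folds - 1.
    writer.writerow([text, label, index % n_folds])

csv_text = buffer.getvalue()
print(csv_text)
```

To save the result, write csv_text to a file such as csv_name.csv before uploading it.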

Text classification

The data for a text classification experiment can be formatted following format 1 or 2.

A .csv file.

csv_name.csv (1)(2)
  1. The available dataset connectors require the data for a text classification experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A text column containing the texts for the experiment
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
      Note
      • H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label. With N label columns, only a single column can be set to 1 per row in multi-class problems, while in multi-label problems any number of columns, including all or none, can be set to 1.
      • For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns, so the experiment is treated as a multi-class classification experiment and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
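
Because string labels in a binary problem end up one-hot encoded, it can help to map them to 0/1 before building the .csv file. The following is a small sketch of that mapping; to_binary_labels and the example label values are hypothetical.

```python
def to_binary_labels(values, positive_value):
    """Map label values to 1 (positive class) or 0 (everything else)."""
    return [1 if value == positive_value else 0 for value in values]


# Illustrative string labels for a binary problem.
raw_labels = ["spam", "ham", "ham", "spam"]
binary_labels = to_binary_labels(raw_labels, positive_value="spam")
print(binary_labels)  # [1, 0, 0, 1]
```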

Text sequence to sequence

The data for a text sequence to sequence experiment can be formatted following format 1 or 2.

A .csv file.

csv_name.csv (1)(2)
  1. The available dataset connectors require the data for a text sequence to sequence experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An input-text column containing/representing the input texts
    • An output-text column containing/representing the output texts
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Text span prediction

The data for a text span prediction experiment can be formatted following format 1 or 2.

A .csv file.

csv_name.csv (1)(2)
  1. The available dataset connectors require the data for a text span prediction experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A context column containing/representing the input texts
    • A question column containing/representing the questions (that the input context text can answer)
    • An answer column containing/representing the substrings from the context column that answer the questions (question column)
    • An optional answer-start column containing/representing the start of the substring answers in the context column
      Note
      • The start of the substring answers needs to be specified by integers representing the index where the answer starts in the context.
      • If you do not provide an answer-start column, H2O Hydrogen Torch selects the first occurrence of the answer in the context.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
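
If you prefer to provide the answer-start column yourself, the start index can be derived from the context and the answer. The sketch below assumes the index is a zero-based character offset (the text above only says "index") and picks the first occurrence, mirroring the default behavior described in the note; find_answer_start is a hypothetical helper.

```python
def find_answer_start(context, answer):
    """Return the character index of the first occurrence of answer, or None."""
    index = context.find(answer)
    return index if index >= 0 else None


# Illustrative context in which the answer occurs twice.
context = "Hydrogen Torch trains deep learning models. Torch models run on GPUs."
answer = "Torch"

start = find_answer_start(context, answer)
print(start)  # 9
print(context[start:start + len(answer)])  # Torch
```

When the intended answer is a later occurrence, set the index explicitly rather than relying on the first match.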

Text token classification

The data for a text token classification experiment can be formatted following format 1 or 2.

A .pq (parquet) file.

parquet_name.pq (1)(2)
  1. The available dataset connectors require the data for a text token classification experiment to be in a .zip or .pq file.
    Note

    To learn how to upload your .zip or .pq file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • A text column containing tokenized text: each sample should have a list of string tokens
    • A label column containing token labels for the tokenized text; each sample should have a list of token labels. Labels should be represented as categorical string values
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
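
A quick sanity check for one sample could look like the sketch below. It assumes that every token needs exactly one label, so both lists have the same length; the text above implies this but does not state it. The helper name and the B-/O style tags are illustrative assumptions, not a required labeling scheme.

```python
def check_token_sample(tokens, labels):
    """Return True when tokens and labels are equally long lists of strings."""
    return (
        len(tokens) == len(labels)
        and all(isinstance(token, str) for token in tokens)
        and all(isinstance(label, str) for label in labels)
    )


# Illustrative sample: one categorical string label per token.
tokens = ["Alice", "works", "at", "H2O.ai"]
labels = ["B-PER", "O", "O", "B-ORG"]

sample_ok = check_token_sample(tokens, labels)
print(sample_ok)  # True
```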

Text metric learning

The data for a text metric learning experiment can be formatted following format 1 or 2.

A .csv file.

csv_name.csv (1)(2)
  1. The available dataset connectors require the data for a text metric learning experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A text column containing the input texts
    • A label column containing the class names
      Note

      Texts that are similar should have the same class name.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Audio

Audio regression

The data for an audio regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)
  1. The available dataset connectors require the data for an audio regression experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
      Note
      • Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
      • Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing the numerical labels (targets)
      Note

      H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting an audio regression experiment.

    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audios in this folder to run the audio regression experiment.
    Note

    All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
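
The layout above (a .csv file next to an audio folder inside one .zip file) can be assembled with Python's standard library. The following is a sketch under stated assumptions: the function name, the train.csv and audio names, the column names, and the placeholder files are all hypothetical, and the .csv file stores bare file names, so the audio folder would have to be selected as the data folder on import.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path


def build_dataset_zip(zip_path, rows, audio_dir, audio_folder_name="audio"):
    """Write train.csv plus the referenced audio files into one .zip file."""
    csv_buffer = io.StringIO()
    writer = csv.writer(csv_buffer)
    writer.writerow(["audio", "label"])
    writer.writerows(rows)

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("train.csv", csv_buffer.getvalue())
        for file_name, _label in rows:
            # Store each file under the audio folder inside the archive.
            archive.write(Path(audio_dir) / file_name, f"{audio_folder_name}/{file_name}")
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "source"
    source_dir.mkdir()
    rows = [("clip_1.wav", 0.3), ("clip_2.wav", 1.7)]
    for file_name, _label in rows:
        # Placeholder bytes stand in for real audio content.
        (source_dir / file_name).write_bytes(b"placeholder")

    zip_path = build_dataset_zip(Path(tmp) / "audio_regression.zip", rows, source_dir)
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())

print(names)
```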

Audio classification

The data for an audio classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)
  1. The available dataset connectors require the data for an audio classification experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
      Note
      • Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
      • Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
      Note
      • H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label. With N label columns, only a single column can be set to 1 per row in multi-class problems, while in multi-label problems any number of columns, including all or none, can be set to 1.
      • For binary classification experiments that use precision, recall, F1, F05, or F2 as a metric, the label column needs to contain 0/1 values. If the column contains string values, it is one-hot encoded into multiple columns, so the experiment is treated as a multi-class classification experiment and the precision, recall, F1, F05, or F2 metric is calculated incorrectly.
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column above; H2O Hydrogen Torch uses the audios in this folder to run the audio classification experiment.
    Note

    All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.

Speech

Speech recognition

The data for a speech recognition experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require a label column(s)
  1. The available dataset connectors require the data for a speech recognition experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
      Note
      • To learn about supported audio extensions for a speech recognition experiment, see Supported audio extensions for speech recognition.
      • Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
      tip

      For most supported speech architectures, utilize speech audios of up to 30 seconds. Attempting to train with longer speech samples may lead to:

      • Out-of-memory (OOM) issues even on high VRAM GPUs
      • Poor training performance
    • One label column containing the text transcript of the audio
    • An optional fold column containing cross-validation fold indexes
      Note

      The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audios in this folder to run the experiment.
    Note

    All audios need to have an audio extension. To learn about supported audio extensions, see Supported audio extensions for speech recognition.
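
To spot clips that exceed the recommended 30 seconds before training, the duration of each .wav file can be read with Python's standard wave module, which handles uncompressed PCM files. This is only a sketch: wav_duration_seconds and MAX_SECONDS are hypothetical names, and the generated silent clip exists purely to make the example self-contained.

```python
import tempfile
import wave
from pathlib import Path

MAX_SECONDS = 30.0  # limit taken from the tip above


def wav_duration_seconds(path):
    """Return the duration of an uncompressed .wav file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.wav"
    with wave.open(str(sample_path), "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 16-bit samples
        wav_file.setframerate(16000)  # 16 kHz
        wav_file.writeframes(b"\x00\x00" * 16000 * 2)  # 2 seconds of silence
    duration = wav_duration_seconds(sample_path)

too_long = duration > MAX_SECONDS
print(duration, too_long)
```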

Data collection

Example 1: Amazon S3

The following Python script parses the folder structure of an Amazon S3 bucket and collects images into a single dataset. In other words, it shows how to build a new dataset (.zip file) from several files in S3 and upload it back to S3.

# Import libraries

# We use `boto3` to connect to S3
# Optionally `tqdm` can be used to show download progress
# We use pandas for data manipulation
from boto3.session import Session
from tqdm import tqdm
import pandas as pd
import shutil
import os


# Set AWS credentials
# You can set them directly or use the environment variables, if those are set
aws_access_key = os.environ["AWS_ACCESS_KEY_ID"]
aws_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]

# Set list of bucket paths, that contain image files
images_bucket_subfolders = ["h2o-release/hydrogen-torch/data-prep"]

# Set path to the train CSV
csv_path = "h2o-release/hydrogen-torch/data-prep/csvs/train.csv"

# Set allowed file extensions
allowed_extensions = [".jpg",".jpeg", ".png"]

# Files will be downloaded to data folder
data_folder = "data"
image_folder = f"{data_folder}/images"
os.makedirs(image_folder, exist_ok=True)

# Connect to S3
s3 = Session(aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key).resource("s3")


# Download train.csv
bucket, csv_path = csv_path.split("/", 1)
output_csv_path = f"{data_folder}/{os.path.basename(csv_path)}"
s3.Bucket(bucket).download_file(csv_path, output_csv_path)


# Make sure the "Image Column" contains only file names, not full paths
image_col = "image"

data = pd.read_csv(output_csv_path)
data[image_col] = data[image_col].map(os.path.basename)
data.to_csv(output_csv_path, index=False)


# Download all image files
for images_bucket_subfolder in images_bucket_subfolders:

    # Split "bucket/sub/folder" into the bucket name and the key prefix
    if "/" in images_bucket_subfolder:
        bucket, subfolder = images_bucket_subfolder.split("/", 1)
    else:
        bucket, subfolder = images_bucket_subfolder, ""

    s3_bucket = s3.Bucket(bucket)

    # List either the objects under the prefix or the whole bucket
    if subfolder:
        files = s3_bucket.objects.filter(Prefix=f"{subfolder}/")
    else:
        files = s3_bucket.objects.all()

    files = list(files)

    for file in tqdm(files):
        # Download only files with an allowed image extension
        if any(file.key.endswith(ext) for ext in allowed_extensions):
            s3_bucket.download_file(file.key, f"{image_folder}/{os.path.basename(file.key)}")


# Create ZIP file that can be imported to H2O Hydrogen Torch
zip_file_name = "flowers_image_classification"
full_zip_file_name = shutil.make_archive(zip_file_name, 'zip', data_folder)

# Set desired S3 path where to upload the ZIP file in format "bucket_name" or "bucket_name/subfolder_1/.../subfolder_n"
upload_bucket_path = "YOUR_BUCKET_NAME/SUB_FOLDER"

# Upload the ZIP file
rel_zip_file_name = os.path.basename(full_zip_file_name)
upload_path = f"{upload_bucket_path}/{rel_zip_file_name}"
upload_bucket, upload_zip_path = upload_path.split("/", 1)
s3.Bucket(upload_bucket).upload_file(rel_zip_file_name, upload_zip_path)

Supported audio extensions for speech recognition

For speech recognition, H2O Hydrogen Torch supports the following audio extension:

  • Uncompressed (.wav).

Supported audio extensions for audio processing

The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:

  • Uncompressed: .wav, .aiff
  • Lossless compressed: .flac
  • Lossy compressed: .mp3, .ogg

Supported image extensions for image processing

The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:

  • Windows bitmaps: .bmp
  • JPEG files: .jpeg, .jpg, .jpe
  • JPEG 2000 files: .jp2
  • Portable Network Graphics: .png
  • WebP: .webp
  • Portable image format: .pbm, .pgm, .ppm, .pnm
  • TIFF files: .tiff, .tif
  • Radiance HDR: .hdr
  • NumPy data array: .npy
    note

    For 2D image processing, the data must be of shape [height, width, channels].
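
Before zipping a dataset, the file names in the image column can be checked against the list above. The sketch below is one way to do that; unsupported_files is a hypothetical helper, and comparing extensions in lower case is an assumption, since the list above does not say whether upper-case extensions are accepted.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED_IMAGE_EXTENSIONS = {
    ".bmp", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pnm", ".tiff", ".tif", ".hdr", ".npy",
}


def unsupported_files(file_names):
    """Return the names whose extension is missing or not in the supported set."""
    return [
        name
        for name in file_names
        if Path(name).suffix.lower() not in SUPPORTED_IMAGE_EXTENSIONS
    ]


# Illustrative file names, including one without any extension.
rejected = unsupported_files(["a.jpg", "b.PNG", "c.gif", "d"])
print(rejected)  # ['c.gif', 'd']
```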

Supported 3D image extensions for 3D image processing

For 3D image problem types, H2O Hydrogen Torch supports the following 3D image extension:

  • NumPy data array: .npy
    note

    For 3D image processing, the data must be of shape [height, width, depth, channels].

