Version: v1.2.0

Dataset formats

You need to format your dataset in a specific way for each supported problem type. The sections below describe the required format for each one.

Image regression

The data for an image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image regression experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing the numerical labels (targets)
      Note

      H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if no validation dataframe is provided for the experiment. If no fold column is specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image regression experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
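
The fold behavior described in the note on the fold column can be sketched in plain Python. This is only an illustration, not how H2O Hydrogen Torch implements it; the helper names `assign_folds` and `fold_splits`, the column name `fold`, and the sample rows are hypothetical.

```python
import random

def assign_folds(rows, n_folds=5, seed=0):
    """Return copies of rows with a 'fold' value in 0..n_folds-1 assigned at random."""
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)
    folded = [dict(row) for row in rows]
    for position, row_index in enumerate(order):
        folded[row_index]["fold"] = position % n_folds
    return folded

def fold_splits(rows):
    """Yield (fold_value, train_rows, validation_rows) for every distinct fold value."""
    for value in sorted({row["fold"] for row in rows}):
        validation = [row for row in rows if row["fold"] == value]
        train = [row for row in rows if row["fold"] != value]
        yield value, train, validation

# Hypothetical sample data: ten images with one numerical label each.
rows = [{"image": f"img_{i}.jpg", "label": float(i)} for i in range(10)]
rows_with_folds = assign_folds(rows)
splits = list(fold_splits(rows_with_folds))
```

Providing your own fold column is mainly useful when a random split is not appropriate, for example when related records should stay in the same fold.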

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
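
As a rough sketch of the layout described in this section, the following Python builds a .zip archive with one .csv file and one image folder using only the standard library. The function name, the file names `train.csv` and `images`, the column names, and the placeholder image bytes are all hypothetical.

```python
import csv
import io
import zipfile

def build_image_regression_zip(records, image_bytes, image_folder="images", csv_name="train.csv"):
    """Return the bytes of a .zip holding one .csv file and one image folder.

    records: list of dicts with an 'image' name and numeric label columns.
    image_bytes: dict mapping image name to the raw bytes of that image.
    """
    fieldnames = list(records[0].keys())
    csv_buffer = io.StringIO()
    writer = csv.DictWriter(csv_buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)

    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(csv_name, csv_buffer.getvalue())
        for row in records:
            name = row["image"]
            archive.writestr(f"{image_folder}/{name}", image_bytes[name])
    return zip_buffer.getvalue()

# Hypothetical records; the bytes below stand in for real image files.
records = [
    {"image": "a.jpg", "label": 0.5, "fold": 0},
    {"image": "b.png", "label": 1.25, "fold": 1},
]
placeholder_images = {"a.jpg": b"not-a-real-image", "b.png": b"not-a-real-image"}
zip_bytes = build_image_regression_zip(records, placeholder_images)
```

In this sketch the image column holds bare file names without the folder, so the data folder would have to be selected when importing the dataset, as described in the note on the image column.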

Image classification

The data for an image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image classification experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
      Note

      H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, while in multi-label problems, each class is an independent label and several can apply to the same image. For N label columns, a multi-class problem allows only a single column to be set to 1 in each row, while a multi-label problem allows any number of columns, from none to all, to be set to 1.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if no validation dataframe is provided for the experiment. If no fold column is specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image classification experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
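
The difference between one-hot encoded multi-class and multi-label columns can be checked with a short script before uploading. The sketch below assumes that a multi-class row has exactly one label column set to 1; the function name, the class names, and the sample rows are hypothetical.

```python
def check_one_hot_rows(rows, label_columns, problem="multi-class"):
    """Return the indexes of rows whose 0/1 label columns break the rule for the problem."""
    bad_rows = []
    for index, row in enumerate(rows):
        values = [int(row[column]) for column in label_columns]
        if any(value not in (0, 1) for value in values):
            bad_rows.append(index)  # labels must be 0 or 1 in either case
        elif problem == "multi-class" and sum(values) != 1:
            bad_rows.append(index)  # assumption: exactly one class per row
    return bad_rows

label_columns = ["cat", "dog", "bird"]
rows = [
    {"image": "a.jpg", "cat": 1, "dog": 0, "bird": 0},
    {"image": "b.jpg", "cat": 1, "dog": 1, "bird": 0},
    {"image": "c.jpg", "cat": 0, "dog": 0, "bird": 0},
]
multi_class_errors = check_one_hot_rows(rows, label_columns, problem="multi-class")
multi_label_errors = check_one_hot_rows(rows, label_columns, problem="multi-label")
```

The second and third rows are valid multi-label rows but would not fit a multi-class problem under the assumption above.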

Image metric learning

The data for an image metric learning experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image metric learning experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A label column containing the class names
      Note

      Similar images should have the same class name.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if no validation dataframe is provided for the experiment. If no fold column is specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image metric learning experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require a label column

Image object detection

H2O Hydrogen Torch supports several dataset (data) formats for an image object detection experiment. Supported formats are as follows:

Hydrogen Torch format

The data following the Hydrogen Torch format for an image object detection experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3).

folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each bounding box. Each row of the dataset should contain a list of class names, where each element in the list refers to a single box
    • An x_min, x_max, y_min, and y_max column corresponding to the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a list of coordinates, where each element in the list refers to a single box
      Note
      • The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
      • The length of each list for the class_id, x_min, x_max, y_min, and y_max needs to be equal and needs to refer to the total number of bounding boxes in each respective image. If a box is not present for a given image, all lists need to be empty.
    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if no validation dataframe is provided for the experiment. If no fold column is specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not require the class_id, x_min, x_max, y_min, and y_max columns
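
The list-based rows of the Hydrogen Torch format can be validated before they are written to a .pq file. The following is a minimal sketch in plain Python; the function name and sample rows are hypothetical, and the check that a minimum coordinate does not exceed its maximum is an assumption derived from the upper-left and lower-right corner description.

```python
BOX_COLUMNS = ("class_id", "x_min", "x_max", "y_min", "y_max")

def validate_detection_row(row):
    """Return a list of problems found in one row of the list-based format."""
    problems = []
    lengths = {column: len(row[column]) for column in BOX_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"list lengths differ: {lengths}")
        return problems
    for i in range(lengths["class_id"]):
        # Assumption: the upper-left corner must not lie beyond the lower-right corner.
        if row["x_min"][i] > row["x_max"][i] or row["y_min"][i] > row["y_max"][i]:
            problems.append(f"box {i} has a minimum greater than its maximum")
    return problems

good_row = {"image": "a.jpg", "class_id": ["wheat", "wheat"],
            "x_min": [689, 718], "y_min": [884, 464],
            "x_max": [754, 768], "y_max": [920, 516]}
empty_row = {"image": "b.jpg", "class_id": [], "x_min": [], "y_min": [], "x_max": [], "y_max": []}
bad_row = {"image": "c.jpg", "class_id": ["wheat"],
           "x_min": [10, 20], "y_min": [5], "x_max": [30], "y_max": [40]}
```

One possible way to store validated rows is `pandas.DataFrame.to_parquet`, which needs a parquet engine such as pyarrow to be installed.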

Example

The global_wheat_image_object_detection.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image object detection problem. The structure of the .zip file is as follows:

global_wheat_image_object_detection.zip
│ └───train.pq
│ │
│ └───images
│ └───7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg
│ └───3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg
│ └───37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg
│ ...

The following are three random rows from the .pq file:

image | class_id | x_min | y_min | x_max | y_max
7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg | ['wheat' 'wheat' 'wheat' ...] | [689 718 382 ...] | [884 464 42 ...] | [754 768 450 ...] | [920 516 101 ...]
3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg | ['wheat' 'wheat' 'wheat' ...] | [924 698 904 ...] | [195 10 32 ...] | [981 763 938 ...] | [247 101 79 ...]
37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg | ['wheat' 'wheat' 'wheat' ...] | [919 811 4 ...] | [535 820 96 ...] | [1024 912 71 ...] | [613 894 164 ...]
Note
  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Individual boxes format

The data following the individual boxes format for an image object detection experiment is structured as follows: A .zip file (1) containing a .csv file (2) and an image folder (3):

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class name of each box. Each row of the dataset should describe a single box
    • An x_min, x_max, y_min, and y_max column containing the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a single coordinate value for a corresponding bounding box
      Note
      • The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
      • If a box is not present for a given image, the class_id, x_min, x_max, y_min, and y_max columns should be empty.
    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if no validation dataframe is provided for the experiment. If no fold column is specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require the class_id, x_min, x_max, y_min, and y_max columns

Example

image | x_min | y_min | x_max | y_max | class_id
bafc.jpg31143378134wheat
bafc.jpg27683354153wheat
bafc.jpg442309541381wheat
cryv.jpg30113328124wheat
cryv.jpg24680344113wheat
cryv.jpg432303341181wheat
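
The individual boxes format and the Hydrogen Torch format carry the same information: the first stores one box per row, the second one image per row with list columns. The sketch below shows one way to group individual box rows into per-image lists. The function name and the sample values are hypothetical and are not taken from the example table.

```python
def group_boxes_by_image(box_rows):
    """Collapse one-box-per-row records into one record per image with list columns."""
    grouped = {}
    for row in box_rows:
        record = grouped.setdefault(
            row["image"],
            {"image": row["image"], "class_id": [],
             "x_min": [], "y_min": [], "x_max": [], "y_max": []},
        )
        if row["class_id"] in ("", None):
            continue  # image without boxes: keep the empty lists
        record["class_id"].append(row["class_id"])
        for column in ("x_min", "y_min", "x_max", "y_max"):
            record[column].append(float(row[column]))
    return list(grouped.values())

# Hypothetical rows: two boxes for a.jpg and no box for b.jpg.
box_rows = [
    {"image": "a.jpg", "x_min": 10, "y_min": 20, "x_max": 50, "y_max": 60, "class_id": "wheat"},
    {"image": "a.jpg", "x_min": 15, "y_min": 25, "x_max": 40, "y_max": 45, "class_id": "wheat"},
    {"image": "b.jpg", "x_min": "", "y_min": "", "x_max": "", "y_max": "", "class_id": ""},
]
image_rows = group_boxes_by_image(box_rows)
```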

COCO format

The data following the COCO format for an image object detection experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│ └───json_name.json (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .json file that contains labels in a COCO format.
  3. A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

  • A train .json file needs to follow the format described above
  • A validation .json file needs to follow the same format as a train .json file
  • A test .json file needs to follow the same format as a train .json file, but does not require labels
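
For orientation, a COCO label file is a JSON object with `images`, `annotations`, and `categories` lists, and each bounding box is stored as [x, y, width, height]. The following sketch converts corner coordinates into that layout. The function name and sample data are hypothetical, and this page does not state which COCO fields H2O Hydrogen Torch actually reads, so treat the field selection as an assumption.

```python
import json

def to_coco(image_rows, image_sizes):
    """Build a COCO-style dictionary from per-image box lists.

    image_rows: dicts with 'image', 'class_id', 'x_min', 'y_min', 'x_max', 'y_max'.
    image_sizes: dict mapping image name to (width, height).
    """
    class_names = sorted({name for row in image_rows for name in row["class_id"]})
    category_ids = {name: index + 1 for index, name in enumerate(class_names)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": cid, "name": name} for name, cid in category_ids.items()],
    }
    for image_id, row in enumerate(image_rows, start=1):
        width, height = image_sizes[row["image"]]
        coco["images"].append(
            {"id": image_id, "file_name": row["image"], "width": width, "height": height}
        )
        boxes = zip(row["class_id"], row["x_min"], row["y_min"], row["x_max"], row["y_max"])
        for name, x_min, y_min, x_max, y_max in boxes:
            box_width, box_height = x_max - x_min, y_max - y_min
            coco["annotations"].append({
                "id": len(coco["annotations"]) + 1,
                "image_id": image_id,
                "category_id": category_ids[name],
                "bbox": [x_min, y_min, box_width, box_height],  # COCO uses [x, y, width, height]
                "area": box_width * box_height,
                "iscrowd": 0,
            })
    return coco

image_rows = [{"image": "a.jpg", "class_id": ["wheat"],
               "x_min": [10], "y_min": [20], "x_max": [50], "y_max": [60]}]
coco = to_coco(image_rows, {"a.jpg": (1024, 1024)})
coco_json = json.dumps(coco, indent=2)
```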

Pascal VOC format

The data following the Pascal VOC format for an image object detection experiment is structured as follows: A .zip file (1) containing a folder with .xml files with labels (2) and an image folder (3):

folder_name.zip (1)
│ └───xml_folder_name (2)
│ └───name_of_image.xml
│ └───name_of_image.xml
│ └───name_of_image.xml
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A folder that contains .xml files with labels in a Pascal VOC format.
  3. An image folder that contains all the images specified in the .xml files; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple folders with labels in the .zip file that you can use as train, validation, and test datasets:

  • A train folder with labels needs to follow the format described above
  • A validation folder with labels should have the same format as a train folder
  • A test folder with labels should have the same format as a train folder, but labels are not required
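
A Pascal VOC label file is an XML document with one `object` element per box, each holding a `name` and a `bndbox` with `xmin`, `ymin`, `xmax`, and `ymax`. The sketch below writes such a document with the Python standard library. The function name and sample values are hypothetical, the depth of 3 assumes an RGB image, and which elements H2O Hydrogen Torch requires is not stated on this page.

```python
import xml.etree.ElementTree as ET

def to_voc_xml(image_name, width, height, boxes):
    """Return a Pascal VOC annotation string for one image.

    boxes: list of (class_name, x_min, y_min, x_max, y_max) tuples.
    """
    annotation = ET.Element("annotation")
    ET.SubElement(annotation, "filename").text = image_name
    size = ET.SubElement(annotation, "size")
    for tag, value in (("width", width), ("height", height), ("depth", 3)):
        ET.SubElement(size, tag).text = str(value)
    for class_name, x_min, y_min, x_max, y_max in boxes:
        obj = ET.SubElement(annotation, "object")
        ET.SubElement(obj, "name").text = class_name
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", x_min), ("ymin", y_min), ("xmax", x_max), ("ymax", y_max)):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(annotation, encoding="unicode")

voc_xml = to_voc_xml("a.jpg", 1024, 1024, [("wheat", 10, 20, 50, 60)])
```

Each image gets its own .xml file in the label folder, named after the image as shown in the structure above.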

Image semantic segmentation

H2O Hydrogen Torch supports several dataset (data) formats for an image semantic segmentation experiment. Supported formats are as follows:

Hydrogen Torch format

The data following the Hydrogen Torch format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
    • A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
      Note

      The length of each class_id and rle_mask list must be equal while referring to the total number of classes.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if no validation dataframe is provided for the experiment. If no fold column is specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image semantic segmentation experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not require a class_id and rle_mask column
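
Run-length encoding stores a binary mask as a sequence of runs of foreground pixels. The sketch below encodes a mask as "start length" pairs, which is what the sample rows in the example below appear to use. The pixel ordering (row by row) and the 1-based start index are assumptions, and the function names are hypothetical; verify the exact convention expected by H2O Hydrogen Torch before relying on it.

```python
def rle_encode(mask_rows):
    """Encode a binary mask (list of rows of 0/1) as 'start length start length ...'.

    Assumptions: pixels are flattened row by row and starts are 1-indexed.
    """
    flat = [pixel for row in mask_rows for pixel in row]
    runs = []
    start = None
    for index, pixel in enumerate(flat):
        if pixel and start is None:
            start = index  # a new run of foreground pixels begins
        elif not pixel and start is not None:
            runs.extend([start + 1, index - start])
            start = None
    if start is not None:  # run that reaches the end of the mask
        runs.extend([start + 1, len(flat) - start])
    return " ".join(str(value) for value in runs)

def rle_decode(rle, height, width):
    """Invert rle_encode under the same assumptions."""
    flat = [0] * (height * width)
    values = [int(value) for value in rle.split()]
    for start, length in zip(values[0::2], values[1::2]):
        for index in range(start - 1, start - 1 + length):
            flat[index] = 1
    return [flat[r * width:(r + 1) * width] for r in range(height)]

mask = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
encoded = rle_encode(mask)
empty_encoded = rle_encode([[0, 0], [0, 0]])
```

A mask without any foreground pixels encodes to an empty string, which matches the requirement to provide an empty string for a class that has no mask.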

Example

The fashion_image_semantic_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image semantic segmentation problem. The structure of the .zip file is as follows:

fashion_image_semantic_segmentation.zip
│ └───train.pq
│ │
│ └───images
│ └───img_0458.png
│ └───img_0604.png
│ └───img_0668.png
│ ...

The following are three random rows from the .pq file:

image | class_id | rle_mask
img_0458.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['180629 7 181447 17...
img_0604.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['189672 2 190493 9...
img_0668.png | ['shoes' 'pants' 'dress' 'coat' 'shirt'] | ['108023 11 108848 11...
Note
  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

COCO format

The data following the COCO format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│ └───json_name.json (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .json file that contains labels in a COCO format.
  3. A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run the image semantic segmentation experiment.
Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

  • A train .json file needs to follow the format described above
  • A validation .json file needs to follow the same format as a train .json file
  • A test .json file needs to follow the same format as a train .json file, but does not require labels

Image instance segmentation

H2O Hydrogen Torch supports several dataset (data) formats for an image instance segmentation experiment. Supported formats are as follows:

Hydrogen Torch format

The data following the Hydrogen Torch format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│ └───pq_name.pq (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • An image column containing the names of the images for the experiment, where each image has an image extension specified
      Note
      • Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
      • Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • A class_id column containing the class names of each instance mask. Each row of the dataset should contain a list of class names, where each element in the list refers to a single mask instance.
    • A rle_mask column containing run-length-encoded (RLE) masks for each instance from the class_id column. Each row of the dataset should contain a list of RLE-encoded masks, where each element in the list refers to a single instance.
      Note

      The length of each class_id and rle_mask list must be equal while referring to the total number of instances in each respective image. If an instance is not present for a given image, all lists need to be empty.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if no validation dataframe is provided for the experiment. If no fold column is specified, five folds are assigned randomly, which is not always the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.

  3. An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image instance segmentation experiment.
    Note

    All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not require a class_id and rle_mask column

Example

The coco_image_instance_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image instance segmentation problem. The structure of the .zip file is as follows:

coco_image_instance_segmentation.zip
│ └───train.pq
│ │
│ └───images
│ └───000000151231.jpg
│ └───000000433826.jpg
│ └───000000061159.jpg
│ ...

The following are three random rows from the .pq file:

image_id | class_id | rle_mask
000000151231.jpg | ['car' 'car'] | ['91949 7 92375 14 92801...
000000433826.jpg | ['car' 'car'] | ['224473 3 224952 4 22...
000000061159.jpg | ['car' 'car'] | ['161665 9 162291 25...
Note
  • In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
  • To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

COCO format

The data following the COCO format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│ └───json_name.json (2)
│ │
│ └───image_folder_name (3)
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ └───name_of_image.image_extension
│ ...
  1. The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .json file that contains labels in a COCO format.
  3. A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run an image instance segmentation experiment.
Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

  • A train .json file needs to follow the format described above
  • A validation .json file needs to follow the same format as a train .json file
  • A test .json file needs to follow the same format as a train .json file, but does not require labels
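
The layout above can be sketched with the Python standard library. This is a minimal illustration, not the product's reference tooling: the top-level keys (images, categories, annotations) and annotation fields follow the public COCO convention, while the file names train.json, images, and example_0001.jpg, the image size, and the polygon are hypothetical values chosen for the example.

```python
import io
import json
import zipfile


def build_coco_labels():
    """Return a minimal COCO-style label dictionary (one image, one instance)."""
    return {
        "images": [
            # Hypothetical image entry; width and height are made-up values.
            {"id": 1, "file_name": "example_0001.jpg", "width": 640, "height": 480},
        ],
        "categories": [{"id": 1, "name": "car"}],
        "annotations": [
            {
                "id": 1,
                "image_id": 1,      # refers to images[].id
                "category_id": 1,   # refers to categories[].id
                # One polygon as a flat [x1, y1, x2, y2, ...] list.
                "segmentation": [[10.0, 10.0, 60.0, 10.0, 60.0, 40.0, 10.0, 40.0]],
                "bbox": [10.0, 10.0, 50.0, 30.0],  # [x, y, width, height]
                "area": 1500.0,
                "iscrowd": 0,
            }
        ],
    }


def write_coco_zip(target, labels, json_name="train.json", image_folder="images"):
    """Write the .json file and an image folder into one .zip archive."""
    with zipfile.ZipFile(target, "w") as archive:
        archive.writestr(json_name, json.dumps(labels))
        for image in labels["images"]:
            # Empty placeholder bytes stand in for the real image content.
            archive.writestr(f"{image_folder}/{image['file_name']}", b"")


buffer = io.BytesIO()
write_coco_zip(buffer, build_coco_labels())
with zipfile.ZipFile(buffer) as archive:
    archive_names = sorted(archive.namelist())
print(archive_names)
```

For a real dataset, the placeholder bytes would be replaced by the actual image files, and a test .json file could omit the annotations.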

Text regression

The data for a text regression experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│ └───csv_name.csv (2)
  1. The available dataset connectors require the data for a text regression experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A text column containing the texts for the experiment
    • One or more label columns containing the numerical labels (targets)
      Note

      H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
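
As a rough sketch of format 2, the snippet below builds a train and a test .csv file in memory and packs them into one .zip archive. The column names text, label, and fold, the file names, and the sample rows are assumptions made for the example; the fold values are random integers in the 0 to N-1 range described above.

```python
import csv
import io
import random
import zipfile


def to_csv_text(header, rows):
    """Serialize a header and rows to CSV text (fields are quoted when needed)."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical samples: (text, numerical label).
samples = [
    ("The service was quick.", 4.5),
    ("The package arrived late.", 1.0),
    ("Average experience, nothing special.", 3.0),
]

rng = random.Random(0)  # fixed seed so the fold assignment is reproducible
n_folds = 3
train_rows = [(text, label, rng.randrange(n_folds)) for text, label in samples]
train_csv = to_csv_text(["text", "label", "fold"], train_rows)

# The test file keeps the text column but drops the label and fold columns.
test_csv = to_csv_text(["text"], [(text,) for text, _ in samples])

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", train_csv)
    archive.writestr("test.csv", test_csv)
```

Using the csv module rather than manual string joins keeps texts that contain commas or quotes in a single field.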

Text classification

The data for a text classification experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│ └───csv_name.csv (2)
  1. The available dataset connectors require the data for a text classification experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A text column containing the texts for the experiment
    • One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
      Note

      H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, while the classes represent unique labels for multi-label problems. For N label class columns, in multi-class problems, only a single column could be set to 1, while in multi-label problems, all or none could be set to 1.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
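
The difference between multi-class and multi-label label columns can be illustrated with a small sketch. The class names, texts, and helper functions below are hypothetical; the only rule taken from the description above is that multi-class rows set exactly one of the N label columns to 1, while multi-label rows may set any number of them, including none.

```python
def one_hot_rows(samples, classes):
    """Turn (text, set_of_labels) pairs into rows with one 0/1 column per class."""
    rows = []
    for text, labels in samples:
        row = {"text": text}
        for name in classes:
            row[name] = 1 if name in labels else 0
        rows.append(row)
    return rows


def is_multi_class(rows, classes):
    """True when every row sets exactly one label column to 1."""
    return all(sum(row[name] for name in classes) == 1 for row in rows)


classes = ["sports", "politics", "tech"]

# Multi-class: classes are mutually exclusive, one label per text.
multi_class = one_hot_rows(
    [("The match ended 2-1.", {"sports"}), ("A new phone was released.", {"tech"})],
    classes,
)

# Multi-label: a text can carry several labels, or none at all.
multi_label = one_hot_rows(
    [("Parliament debated a tech bill.", {"politics", "tech"}), ("Nothing relevant here.", set())],
    classes,
)

print(is_multi_class(multi_class, classes), is_multi_class(multi_label, classes))
```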

Text sequence to sequence

The data for a text sequence to sequence experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│ └───csv_name.csv (2)
  1. The available dataset connectors require the data for a text sequence to sequence experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An input-text column containing/representing the input texts
    • An output-text column containing/representing the output texts
    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require an output-text column

Text span prediction

The data for a text span prediction experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│ └───csv_name.csv (2)
  1. The available dataset connectors require the data for a text span prediction experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A context column containing/representing the input texts
    • A question column containing/representing the questions (that the input context text can answer)
    • An answer column containing/representing the substrings from the context column that answer the questions (question column)
    • An optional answer-start column containing/representing the start of the substring answers in the context column
      Note
      • The start of the substring answers needs to be specified by integers representing the index where the answer starts in the context.
      • If you do not provide an answer-start column, H2O Hydrogen Torch will select the first occurrence of the answer in the context.
    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require an answer column
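
If you prefer to provide the answer-start column yourself, it could be derived as in the sketch below. It assumes the index is a zero-based character offset into the context (the convention used by SQuAD-style datasets; the text above only says the index is an integer), and the column names and sample row are hypothetical. Like the default behavior described above, it picks the first occurrence of the answer.

```python
def add_answer_start(row):
    """Return a copy of the row with the offset of the first answer occurrence."""
    start = row["context"].find(row["answer"])
    if start == -1:
        # The answer must be a substring of the context.
        raise ValueError(f"answer {row['answer']!r} not found in context")
    return {**row, "answer_start": start}


# Hypothetical sample; "Paris" occurs twice in the context.
row = {
    "context": "The Eiffel Tower is in Paris. Paris is the capital of France.",
    "question": "Where is the Eiffel Tower?",
    "answer": "Paris",
}

with_start = add_answer_start(row)
start = with_start["answer_start"]
# Slicing the context at the stored offset recovers the answer text.
print(start, row["context"][start:start + len(row["answer"])])
```

When the same answer text appears more than once and a later occurrence is the correct one, the offset needs to be set explicitly instead of relying on the first match.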

Text token classification

The data for a text token classification experiment can be formatted following format 1 or 2.

Format 1

A .pq (parquet) file.

parquet_name.pq (1)(2)

Format 2

A .zip file containing a .pq (parquet) file.

folder_name.zip (1)
│ └───parquet_name.pq (2)
  1. The available dataset connectors require the data for a text token classification experiment to be in a .zip or .pq file.
    Note

    To learn how to upload your .zip or .pq file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .pq file containing the following columns:
    • A text column containing tokenized text: each sample should have a list of string tokens
    • A label column containing token labels for the tokenized text; each sample should have a list of token labels. Labels should be represented as categorical string values
    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .pq file needs to follow the format described above
  • A validation .pq file needs to follow the same format as a train .pq file
  • A test .pq file needs to follow the same format as a train .pq file, but does not require a label column
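
Before writing such rows to a .pq file, a quick consistency check can help. The sketch below is an illustration under assumptions: the column names text and label, the tag names, and the rule that each sample needs exactly one label per token are not stated verbatim above, although a per-token label list implies matching lengths.

```python
def validate_token_rows(rows):
    """Return a list of human-readable problems found in tokenized samples."""
    problems = []
    for index, row in enumerate(rows):
        tokens, labels = row["text"], row["label"]
        if not all(isinstance(token, str) for token in tokens):
            problems.append(f"row {index}: tokens must be strings")
        if not all(isinstance(label, str) for label in labels):
            problems.append(f"row {index}: labels must be categorical strings")
        if len(tokens) != len(labels):
            problems.append(
                f"row {index}: {len(tokens)} tokens but {len(labels)} labels"
            )
    return problems


# Hypothetical samples with made-up tag names.
rows = [
    {"text": ["Ada", "lives", "in", "London"], "label": ["B-PER", "O", "O", "B-LOC"]},
    {"text": ["Hello", "world"], "label": ["O"]},  # one label is missing
]

for problem in validate_token_rows(rows):
    print(problem)
```

Rows that pass the check could then be written to Parquet, for example with pandas' DataFrame.to_parquet, which needs a Parquet engine such as pyarrow installed.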

Text metric learning

The data for a text metric learning experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│ └───csv_name.csv (2)
  1. The available dataset connectors require the data for a text metric learning experiment to be in a .zip or .csv file.
    Note

    To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • A text column containing the input texts
    • A label column containing the class names
      Note

      Texts that are similar should have the same class name.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require a label column

Audio regression

The data for an audio regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
  1. The available dataset connectors require the data for an audio regression experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
      Note
      • Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
      • Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing the numerical labels (targets)
      Note

      H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting an audio regression experiment.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audios in this folder to run the audio regression experiment.
    Note

    All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
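
Because the audio folder has to contain every file named in the audio column, it can be useful to check an archive before uploading it. The following sketch uses only the Python standard library; the names train.csv, audio, and clip_*.wav, the column names, and the label values are hypothetical, and the generated clips are short silent .wav files used as stand-ins for real recordings.

```python
import csv
import io
import wave
import zipfile


def silent_wav_bytes(n_frames=800, sample_rate=8000):
    """Create a short mono 16-bit .wav clip of silence, returned as bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def build_audio_zip(rows, available_clips, audio_folder="audio"):
    """Pack a train.csv file and an audio folder into an in-memory .zip file."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["audio", "label"])
    writer.writerows(rows)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("train.csv", out.getvalue())
        for name in available_clips:
            archive.writestr(f"{audio_folder}/{name}", silent_wav_bytes())
    return buffer


def missing_audio_files(buffer, csv_name="train.csv", audio_column="audio", audio_folder="audio"):
    """List audio names referenced in the .csv file but absent from the folder."""
    with zipfile.ZipFile(buffer) as archive:
        names = set(archive.namelist())
        text = archive.read(csv_name).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return [
        row[audio_column]
        for row in reader
        if f"{audio_folder}/{row[audio_column]}" not in names
    ]


rows = [("clip_a.wav", 0.25), ("clip_b.wav", 0.75), ("clip_c.wav", 0.5)]
complete = build_audio_zip(rows, ["clip_a.wav", "clip_b.wav", "clip_c.wav"])
incomplete = build_audio_zip(rows, ["clip_a.wav", "clip_b.wav"])
print(missing_audio_files(complete))
print(missing_audio_files(incomplete))
```

In this sketch the audio column holds bare file names, so the folder name is passed separately, mirroring the case where the data directory is selected when importing the dataset.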

Audio classification

The data for an audio classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│ └───csv_name.csv (2)
│ │
│ └───audio_folder_name (3)
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ └───name_of_audio.audio_extension
│ ...
  1. The available dataset connectors require the data for an audio classification experiment to be in a .zip file.
    Note

    To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

  2. A .csv file containing the following columns:
    • An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
      Note
      • Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
      • Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
    • One or more label columns containing either multi-class labels (One-hot encoded) or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
      Note

      H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. The classes are mutually exclusive in multi-class problems, while the classes represent unique labels for multi-label problems. For N label class columns, in multi-class problems, only a single column could be set to 1, while in multi-label problems, all or none could be set to 1.

    • An optional fold column containing cross-validation fold indexes
      Note

      Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

  3. An audio folder that contains all the audio files specified in the audio column above; H2O Hydrogen Torch uses the audios in this folder to run the audio classification experiment.
    Note

    All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

  • A train .csv file needs to follow the format described above
  • A validation .csv file needs to follow the same format as a train .csv file
  • A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

Supported audio extensions for audio processing

The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:

  • Uncompressed: .wav, .aiff
  • Lossless compressed: .flac
  • Lossy compressed: .mp3, .ogg

Supported image extensions for image processing

The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:

  • Windows bitmaps: .bmp
  • JPEG files: .jpeg, .jpg, .jpe
  • JPEG 2000 files: .jp2
  • Portable Network Graphics: .png
  • WebP: .webp
  • Portable image format: .pbm, .pgm, .ppm, .pnm
  • TIFF files: .tiff, .tif
  • OpenEXR image files: .exr
  • Radiance HDR: .hdr
  • NumPy data array: .npy (data must be of shape [height, width, channels])
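
A list like the one above lends itself to a simple pre-upload check. The sketch below copies the extensions from the list; the helper name and sample file names are hypothetical, and treating extensions as case-insensitive is an assumption that the list does not confirm.

```python
from pathlib import Path

# Extensions copied from the supported image extensions list above.
IMAGE_EXTENSIONS = {
    ".bmp", ".jpeg", ".jpg", ".jpe", ".jp2", ".png", ".webp",
    ".pbm", ".pgm", ".ppm", ".pnm", ".tiff", ".tif", ".exr", ".hdr", ".npy",
}


def unsupported_images(names):
    """Return the file names whose extension is not in the supported set."""
    # Assumption: extensions are compared case-insensitively.
    return [name for name in names if Path(name).suffix.lower() not in IMAGE_EXTENSIONS]


print(unsupported_images(["a.jpg", "b.PNG", "c.gif", "no_extension"]))
```

The check looks only at file names; it does not open the files, so it cannot verify, for instance, that a .npy array has the [height, width, channels] shape.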
