Version: v1.2.0

Dataset formats

The dataset for each supported problem type must be formatted in a specific way. The sections below describe how to format your dataset for each supported problem type.

Image regression

Format
Example

The data for an image regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image regression experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing the numerical labels (targets)
  Note
  H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image regression experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
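
The fold column described above can be added to the .csv file before the dataset is zipped. The following is a minimal sketch and not H2O Hydrogen Torch's own splitting logic: the helper name assign_folds is hypothetical, the rows are made up, and the column names image_path, label, and fold are assumptions borrowed from the example dataset.

```python
import pandas as pd


def assign_folds(df: pd.DataFrame, n_folds: int = 5, seed: int = 0) -> pd.DataFrame:
    """Return a shuffled copy of df with an evenly sized `fold` column (0..n_folds-1)."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Row i of the shuffled frame goes to fold i mod n_folds.
    shuffled["fold"] = shuffled.index % n_folds
    return shuffled


# Hypothetical rows; real image names and labels come from your own data.
df = pd.DataFrame(
    {
        "image_path": [f"img_{i}.jpg" for i in range(10)],
        "label": [float(i) for i in range(10)],
    }
)
df = assign_folds(df, n_folds=5)

# Each fold value defines one holdout split: rows with that value validate,
# all remaining rows train.
for fold in sorted(df["fold"].unique()):
    valid = df[df["fold"] == fold]
    train = df[df["fold"] != fold]
    print(fold, len(train), len(valid))
```

Assigning folds yourself is useful when a purely random split is not appropriate, for example when related images should stay in the same fold.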

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
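
The layout above can be assembled with the Python standard library. This is a minimal sketch under stated assumptions: the file names train.csv, images, and my_dataset.zip are hypothetical, and placeholder bytes stand in for real image files.

```python
import csv
import tempfile
import zipfile
from pathlib import Path

# Hypothetical working directory and file names, used only for illustration.
root = Path(tempfile.mkdtemp())
image_dir = root / "images"
image_dir.mkdir()

rows = [("a.jpg", 1.5, 0), ("b.jpg", 2.0, 1)]
for name, _, _ in rows:
    # Placeholder bytes stand in for real image files.
    (image_dir / name).write_bytes(b"placeholder")

# Write the .csv file with an image column, a label column, and a fold column.
csv_path = root / "train.csv"
with open(csv_path, "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["image_path", "label", "fold"])
    writer.writerows(rows)

# Put the .csv file at the top level of the archive and the images in a folder.
zip_path = root / "my_dataset.zip"
with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(csv_path, arcname="train.csv")
    for image in sorted(image_dir.iterdir()):
        zf.write(image, arcname=f"images/{image.name}")

with zipfile.ZipFile(zip_path) as zf:
    names = zf.namelist()
print(names)
```

Validation and test .csv files can be added to the same archive with further zf.write calls.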

The coins_image_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image regression problem. The .zip file contains a .csv file and an image folder. The structure of the .zip file is as follows:

coins_image_regression.zip
│   └───coins_image_regression.csv
│   │
│   └───images
│       └───95_1477858074.jpg
│       └───95_1477858068.jpg
│       └───95_1477858062.jpg
│       ...

The first three rows of the .csv file are as follows:

image_path	label	fold
105_1479344562.jpg	105	1
105_1479344940.jpg	105	2
125_1479424716.jpg	125	1

Note

In this example, the data directory in the image column (image_path) is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data Folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.
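
Because the image column holds bare file names in this example, it can help to confirm before uploading that every listed image exists in the folder you plan to select as the data folder. The sketch below is one way to do that with the standard library; the helper name missing_images and the archive contents are hypothetical, and the check is not part of H2O Hydrogen Torch.

```python
import csv
import io
import zipfile


def missing_images(zip_bytes, csv_name, image_column, data_folder):
    """Return image names listed in the .csv that are absent from data_folder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        members = set(zf.namelist())
        with zf.open(csv_name) as fp:
            reader = csv.DictReader(io.TextIOWrapper(fp, encoding="utf-8"))
            names = [row[image_column] for row in reader]
    # The comparison is case-sensitive: "a.JPG" does not match "a.jpg".
    return [name for name in names if f"{data_folder}/{name}" not in members]


# Build a tiny in-memory archive with hypothetical file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("train.csv", "image_path,label,fold\na.jpg,105,1\nb.jpg,125,2\n")
    zf.writestr("images/a.jpg", b"placeholder")

missing = missing_images(buffer.getvalue(), "train.csv", "image_path", "images")
print(missing)
```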

Image classification

Format
Example

The data for an image classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image classification experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
  Note
  H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive; in multi-label problems, each class is an independent label, so an image can carry several labels or none. For N label columns, exactly one column can be set to 1 in each row of a multi-class problem, while in a multi-label problem any number of columns, from none to all, can be set to 1.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image classification experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
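
The difference between the multi-class and multi-label column layouts can be illustrated with pandas. This is a sketch with made-up rows; the column names image, label, roses, and tulips are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical multi-class data with a single label column.
df = pd.DataFrame(
    {"image": ["a.jpg", "b.jpg", "c.jpg"], "label": ["roses", "tulips", "roses"]}
)

# One column per class, set to 1 where the row belongs to that class.
one_hot = pd.get_dummies(df["label"]).astype(int)
multi_class = pd.concat([df[["image"]], one_hot], axis=1)

# Multi-class: exactly one label column is 1 in every row.
assert (multi_class[one_hot.columns].sum(axis=1) == 1).all()

# Hypothetical multi-label data: any number of label columns may be 1.
multi_label = pd.DataFrame(
    {"image": ["d.jpg", "e.jpg"], "roses": [1, 0], "tulips": [1, 0]}
)
row_sums = multi_label[["roses", "tulips"]].sum(axis=1).tolist()
print(row_sums)
```

In the multi-label frame, the first image carries both labels and the second carries none, which is valid only for multi-label problems.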

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

The flower_image_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multi-class image classification problem. The structure of the .zip file is as follows:

flower_image_classification.zip
│   └───train.csv
│   │
│   └───images
│       └───100080576_f52e8ee070_n.jpg
│       └───10043234166_e6dd915111_n.jpg
│       └───1008566138_6927679c8a.jpg
│       ...

The first three rows of the train.csv file are as follows:

image	label
5777669976_a205f61e5b.jpg	roses
4860145119_b1c3cbaa4e_n.jpg	roses
15011625580_7974c44bce.jpg	roses

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Image metric learning

Format
Example

The data for an image metric learning experiment needs to be in a .zip file (1) containing a .csv file (2) and an image folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image metric learning experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A label column containing the class names
  Note
  Similar images should have the same class name.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image metric learning experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a label column

The bicycle_image_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an image metric learning problem. The structure of the .zip file is as follows:

bicycle_image_metric_learning.zip
│   └───train.csv
│   │
│   └───images
│       └───181783211141_0.jpg
│       └───181596348104_1.jpg
│       └───171166528893_0.jpg
│       ...

The first three rows of the .csv file are as follows:

image	label	fold
181783211141_0.JPG	181783211141	0
181596348104_1.JPG	181596348104	2
171166528893_0.JPG	171166528893	0
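
In the rows above, each label matches the part of the image name before the underscore. If your own files follow a similar naming scheme, the label column can be derived from the image names. The sketch below assumes a "<item id>_<view index>.<extension>" pattern, which is an inference from this example and not a requirement of H2O Hydrogen Torch.

```python
import pandas as pd

# The first and third names come from the example above; the second is a
# hypothetical extra view of the first item.
df = pd.DataFrame(
    {"image": ["181783211141_0.JPG", "181783211141_1.JPG", "181596348104_1.JPG"]}
)

# Assumed naming scheme: "<item id>_<view index>.<extension>".
df["label"] = df["image"].str.split("_").str[0]

# Images that share a label are treated as similar.
images_per_label = df.groupby("label")["image"].count().to_dict()
print(images_per_label)
```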

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be specified as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Image object detection

Formats
Format conversions

H2O Hydrogen Torch supports several dataset (data) formats for an image object detection experiment. Supported formats are as follows:

Hydrogen Torch format
Individual boxes format
COCO format
Pascal VOC format

Hydrogen Torch format

The data following the Hydrogen Torch format for an image object detection experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3).

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each bounding box. Each row of the dataset should contain a list of class names, where each element in the list refers to a single box
- An x_min, x_max, y_min, and y_max column corresponding to the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a list of coordinates, where each element in the list refers to a single box
  Note
  - The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
  - The length of each list for the class_id, x_min, x_max, y_min, and y_max needs to be equal and needs to refer to the total number of bounding boxes in each respective image. If a box is not present for a given image, all lists need to be empty.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not require the class_id, x_min, x_max, y_min, and y_max columns
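
The list-length rule for the box columns can be checked before the dataframe is written to a .pq file. The following sketch uses made-up rows and a hypothetical helper, invalid_rows. It also flags boxes where a minimum coordinate is not smaller than the matching maximum, which is an extra sanity check inferred from the upper-left and lower-right corner description and not a documented rule.

```python
import pandas as pd

BOX_COLUMNS = ["class_id", "x_min", "y_min", "x_max", "y_max"]


def invalid_rows(df: pd.DataFrame) -> list:
    """Return index labels of rows whose box lists are inconsistent."""
    bad = []
    for idx, row in df.iterrows():
        # All five lists must describe the same number of boxes.
        lengths = {len(row[col]) for col in BOX_COLUMNS}
        x_ok = all(lo < hi for lo, hi in zip(row["x_min"], row["x_max"]))
        y_ok = all(lo < hi for lo, hi in zip(row["y_min"], row["y_max"]))
        if len(lengths) != 1 or not (x_ok and y_ok):
            bad.append(idx)
    return bad


# Hypothetical rows: two valid images, one swapped box, one length mismatch.
df = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
        "class_id": [["wheat", "wheat"], [], ["wheat"], ["wheat"]],
        "x_min": [[10, 50], [], [30], [1, 2]],
        "y_min": [[5, 60], [], [40], [1]],
        "x_max": [[40, 90], [], [20], [5]],
        "y_max": [[25, 80], [], [70], [5]],
    }
)
bad = invalid_rows(df)
print(bad)
```

The second row shows an image without boxes: all five lists are empty, which passes the check.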

Example

The global_wheat_image_object_detection.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image object detection problem. The structure of the .zip file is as follows:

global_wheat_image_object_detection.zip
│   └───train.pq
│   │
│   └───images
│       └───7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg
│       └───3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg
│       └───37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg
│       ...

Three random rows from the .pq file are as follows:

image	class_id	x_min	y_min	x_max	y_max
7cca65c2cfb161be75fa41b754ef5263ee10e679dc8900f1fa75f845899abafc.jpg	['wheat' 'wheat' 'wheat' ...]	[689 718 382 ...]	[884 464 42 ...]	[754 768 450 ...]	[920 516 101 ...]
3c6154081943882478110d2ea7ad0eef89cd954b6bd290d161385f9a5accc2fd.jpg	['wheat' 'wheat' 'wheat' ...]	[924 698 904 ...]	[195 10 32 ...]	[981 763 938 ...]	[247 101 79 ...]
37a8db49093fd08a3be9ce48bbfb1a697b5da8dd51ac9fa53fc28d924888ace8.jpg	['wheat' 'wheat' 'wheat' ...]	[919 811 4 ...]	[535 820 96 ...]	[1024 912 71 ...]	[613 894 164 ...]

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Individual boxes format

The data following the individual boxes format for an image object detection experiment is structured as follows: A .zip file (1) containing a .csv file (2) and an image folder (3):

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each box. Each row of the dataset should contain a single box
- An x_min, x_max, y_min, and y_max column containing the bounding box locations describing the spatial location of the objects. For each column, each row of the dataset should contain a single coordinate value for a corresponding bounding box
  Note
  - The bounding box location is represented as a rectangular box, which is determined by the x and y coordinates of the upper-left and lower-right corners.
  - If a box is not present for a given image, the class_id, x_min, x_max, y_min, and y_max columns should be empty.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require the class_id, x_min, x_max, y_min, and y_max columns

Example

image	x_min	y_min	x_max	y_max	class_id
bafc.jpg	311	43	378	134	wheat
bafc.jpg	276	83	354	153	wheat
bafc.jpg	442	309	541	381	wheat
cryv.jpg	301	13	328	124	wheat
cryv.jpg	246	80	344	113	wheat
cryv.jpg	432	303	341	181	wheat
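
To see how one-box-per-row data relates to the list-per-image layout of the Hydrogen Torch format, the rows can be grouped by image with pandas. The sketch below uses the first rows of the example above plus a hypothetical image without boxes (none.jpg), and assumes a comma-separated file.

```python
import io

import pandas as pd

# First rows of the example above, plus a hypothetical image without boxes.
csv_text = """image,x_min,y_min,x_max,y_max,class_id
bafc.jpg,311,43,378,134,wheat
bafc.jpg,276,83,354,153,wheat
cryv.jpg,301,13,328,124,wheat
none.jpg,,,,,
"""
boxes = pd.read_csv(io.StringIO(csv_text))

# Empty cells are read as NaN; an image whose cells are all NaN gets empty lists.
grouped = (
    boxes.groupby("image")
    .agg(lambda x: [] if pd.isnull(x).all() else x.to_list())
    .reset_index()
)
print(grouped[["image", "class_id", "x_min"]])
```

Because of the empty cells, pandas reads the coordinate columns as floats, so the grouped lists hold values such as 311.0.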

COCO format

The data following the COCO format for an image object detection experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .json file that contains labels in a COCO format.
A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

A train .json file needs to follow the format described above
A validation .json file needs to follow the same format as a train .json file
A test .json file needs to follow the same format as a train .json file, but does not require labels
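
For orientation, a COCO-style label file has three top-level lists: images, categories, and annotations. The sketch below builds a minimal one with made-up values. Which keys H2O Hydrogen Torch requires is not stated here, so treat the exact set of fields as an assumption. In COCO, a bbox is [x, y, width, height], measured from the upper-left corner of the box.

```python
import json

# Minimal COCO-style structure with hypothetical values.
coco = {
    "images": [{"id": 1, "file_name": "a.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "wheat"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,  # refers to images[].id
            "category_id": 1,  # refers to categories[].id
            "bbox": [100, 50, 40, 30],  # [x, y, width, height]
            "area": 1200,
            "iscrowd": 0,
        }
    ],
}

# Convert the COCO box to corner coordinates (x_min, y_min, x_max, y_max).
x, y, w, h = coco["annotations"][0]["bbox"]
corners = (x, y, x + w, y + h)
print(corners)

text = json.dumps(coco, indent=2)
```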

Pascal VOC format

The data following the Pascal VOC format for an image object detection experiment is structured as follows: A .zip file (1) containing a folder with .xml files with labels (2) and an image folder (3):

folder_name.zip (1)
│   └───xml_folder_name (2)
│       └───name_of_image.xml
│       └───name_of_image.xml
│       └───name_of_image.xml
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image object detection experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A folder that contains .xml files with labels in a Pascal VOC format.
An image folder that contains all the images specified in the .xml files; H2O Hydrogen Torch uses the images in this folder to run the image object detection experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple folders with labels in the .zip file that you can use as train, validation, and test datasets:

A train folder with labels needs to follow the format described above
A validation folder with labels should have the same format as a train folder
A test folder with labels should have the same format as a train folder, but labels are not required
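
A Pascal VOC label file is an .xml document with one object element per bounding box. The sketch below writes and re-reads a minimal annotation with the standard library; the values are made up, and real Pascal VOC files usually carry additional elements (such as size) that are omitted here.

```python
from xml.etree import ElementTree as ET

# Build a minimal annotation with one object (hypothetical values).
annotation = ET.Element("annotation")
ET.SubElement(annotation, "filename").text = "a.jpg"
obj = ET.SubElement(annotation, "object")
ET.SubElement(obj, "name").text = "wheat"
bndbox = ET.SubElement(obj, "bndbox")
for tag, value in (("xmin", 101), ("ymin", 51), ("xmax", 140), ("ymax", 80)):
    ET.SubElement(bndbox, tag).text = str(value)

xml_text = ET.tostring(annotation, encoding="unicode")

# Read the boxes back as (class name, xmin, ymin, xmax, ymax) tuples.
root = ET.fromstring(xml_text)
parsed = [
    (
        o.findtext("name"),
        *(int(o.find("bndbox").findtext(t)) for t in ("xmin", "ymin", "xmax", "ymax")),
    )
    for o in root.findall("object")
]
print(parsed)
```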

Details

Individual Boxes to Hydrogen Torch format

import pandas as pd


# Read data
df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()

df[["image_id", "class_id", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Details

COCO to Hydrogen Torch format

import json
import pandas as pd


def get_object_detection(df):
    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])

    annotations["x_min"] = annotations["bbox"].map(lambda x: x[0]).astype(int)
    annotations["y_min"] = annotations["bbox"].map(lambda x: x[1]).astype(int)
    annotations["x_max"] = annotations["bbox"].map(lambda x: x[0] + x[2]).astype(int)
    annotations["y_max"] = annotations["bbox"].map(lambda x: x[1] + x[3]).astype(int)

    annotations = annotations[
        ["image_id", "category_id", "x_min", "y_min", "x_max", "y_max"]
    ].copy()  # copy so the assignment below does not modify a view

    annotations["category_id"] = annotations["category_id"].astype(int)
    annotations = annotations.merge(
        categories[["id", "name"]].drop_duplicates(), left_on="category_id", right_on="id", how="left"
    )
    annotations = annotations.merge(
        images[["id", "file_name"]].drop_duplicates(), left_on="image_id", right_on="id", how="right"
    )

    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)

    return annotations


# Read data
with open("/data/COCO_train_annos.json", "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_object_detection(train)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
train_ann[["file_name", "name", "x_min", "y_min", "x_max", "y_max"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Details

Pascal VOC to Hydrogen Torch format

import glob
import os
from xml.etree import ElementTree

import pandas as pd
from tqdm import tqdm


observations = []

for xml in tqdm(glob.glob("/data/Annotations/*.xml")):
    tree = ElementTree.parse(xml)
    root = tree.getroot()
    objs = root.findall("object")

    for obj in objs:
        name = obj.find("name").text

        bndbox = obj.find("bndbox")
        xmin = float(bndbox.findtext("xmin")) - 1
        ymin = float(bndbox.findtext("ymin")) - 1
        xmax = float(bndbox.findtext("xmax"))
        ymax = float(bndbox.findtext("ymax"))
        
        try:
            img_name = root.findall("path")[0].text.split("/")[-1]
        except Exception:
            img_name = root.findall("filename")[0].text

        observations.append(
            (
                img_name,
                name,
                xmin,
                ymin,
                xmax,
                ymax,
            )
        )

df = pd.DataFrame(
    observations, columns=["image", "class_id", "x_min", "y_min", "x_max", "y_max"]
)

# Prepare the processed dataset
df = df.groupby(["image"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
df.to_parquet("/data/train.pq", engine="pyarrow", index=False)

Image semantic segmentation

Formats
Helper functions
Format conversions

H2O Hydrogen Torch supports several dataset (data) formats for an image semantic segmentation experiment. Supported formats are as follows:

Hydrogen Torch format
COCO format

Hydrogen Torch format

The data following the Hydrogen Torch format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - If the image names don't specify the data directory (the location of the images in the .zip file), you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each mask. Each row of the dataset should contain a list of all possible class names
- A rle_mask column containing run-length-encoded (RLE) masks for each class from the class_id column. If there is no mask for a given class, an empty string has to be provided
  Note
  The class_id and rle_mask lists in each row must have the same length, equal to the total number of classes.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column: records with that value form the holdout validation sample, and all remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. If a fold column is not specified, five folds are assigned randomly, which is sometimes not the desired strategy. The fold column can contain integers (0, 1, 2, …, N-1 or 1, 2, 3, …, N) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image semantic segmentation experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not require the class_id and rle_mask columns
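
The relationship between the class_id and rle_mask columns can be checked with a few lines of pandas before writing the .pq file. The rows below are made up. The sketch also requires every row to list the classes in the same order, which is an assumption drawn from the example dataset and not a documented rule.

```python
import pandas as pd

# Hypothetical rows: every row lists all classes; a missing mask is an empty string.
df = pd.DataFrame(
    {
        "image": ["a.png", "b.png"],
        "class_id": [["shoes", "pants"], ["shoes", "pants"]],
        "rle_mask": [["1 3 10 2", ""], ["", "4 2"]],
    }
)

classes = list(df["class_id"].iloc[0])

# Each row needs one RLE string per class, with classes in the same order.
ok = all(
    list(row_classes) == classes and len(row_masks) == len(classes)
    for row_classes, row_masks in zip(df["class_id"], df["rle_mask"])
)
print(ok)
```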

Example

The fashion_image_semantic_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image semantic segmentation problem. The structure of the .zip file is as follows:

fashion_image_semantic_segmentation.zip
│   └───train.pq
│   │
│   └───images
│       └───img_0458.png
│       └───img_0604.png
│       └───img_0668.png
│       ...

Three random rows from the .pq file are as follows:

image	class_id	rle_mask
img_0458.png	['shoes' 'pants' 'dress' 'coat' 'shirt']	['180629 7 181447 17...
img_0604.png	['shoes' 'pants' 'dress' 'coat' 'shirt']	['189672 2 190493 9...
img_0668.png	['shoes' 'pants' 'dress' 'coat' 'shirt']	['108023 11 108848 11...

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

COCO format

The data following the COCO format for an image semantic segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image semantic segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .json file that contains labels in a COCO format.
A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder during an image semantic segmentation experiment.

Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

A train .json file needs to follow the format described above
A validation .json file needs to follow the same format as a train .json file
A test .json file needs to follow the same format as a train .json file, but does not require labels
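The COCO layout described above can be sketched with the Python standard library. This is a minimal sketch, not a complete COCO specification: the field names follow the public COCO annotation layout, while the file names (train.json, images/img_0001.png), the check_references helper, and the empty placeholder image are assumptions made for illustration.

```python
import io
import json
import zipfile

# Minimal COCO-style label file: one image, one category, one polygon annotation.
coco = {
    "images": [{"id": 1, "file_name": "img_0001.png", "width": 4, "height": 3}],
    "categories": [{"id": 1, "name": "shoes"}],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            # Polygon given as x, y pairs (a 2x2 rectangle here).
            "segmentation": [[1, 0, 3, 0, 3, 2, 1, 2]],
            "bbox": [1, 0, 2, 2],  # x, y, width, height
            "area": 4,
            "iscrowd": 0,
        }
    ],
}


def check_references(labels):
    """Return ids of annotations that point to an unknown image or category."""
    image_ids = {image["id"] for image in labels["images"]}
    category_ids = {category["id"] for category in labels["categories"]}
    return [
        ann["id"]
        for ann in labels["annotations"]
        if ann["image_id"] not in image_ids or ann["category_id"] not in category_ids
    ]


# Build the .zip in memory: the .json file next to an image folder.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.json", json.dumps(coco))
    # Placeholder bytes only; a real dataset stores the actual image file here.
    archive.writestr("images/img_0001.png", b"")

with zipfile.ZipFile(buffer) as archive:
    archive_names = archive.namelist()

print(check_references(coco), archive_names)
```

Running a reference check like this before uploading can catch annotations that point to images or categories missing from the .json file.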

Details

RLE encoding and decoding functions

from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.

    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """

    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)


def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.

    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """

    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction
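
A quick round trip on a tiny mask shows what the two helpers produce. The sketch below repeats the functions above so that it runs on its own; the 3x3 mask is made up for illustration. Runs are encoded as start and length pairs, with 1-based pixel positions counted column by column (top to bottom, then left to right).

```python
from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    # Flatten column by column, pad with zeros, and locate value changes.
    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]  # turn run ends into run lengths
    return " ".join(str(r) for r in runs)


def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    s = mask_rle.split()
    starts, lengths = [np.asarray(v, dtype=int) for v in (s[0::2], s[1::2])]
    starts -= 1  # RLE positions are 1-based
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, starts + lengths):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")


mask = np.array(
    [
        [0, 1, 0],
        [0, 1, 1],
        [0, 0, 0],
    ],
    dtype=np.uint8,
)

rle = mask2rle(mask)
decoded = rle2mask(rle, mask.shape)

print(rle)  # 4 2 8 1
print(np.array_equal(decoded, mask))  # True
```

Read column by column, the mask is 0 0 0 1 1 0 0 1 0, so the first run starts at position 4 with length 2 and the second starts at position 8 with length 1.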

Details

.csv file with masks to Hydrogen Torch format

import pandas as pd

df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

df[["image_id", "class_id", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)
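
The grouping step above can be tried on a small in-memory table. This is a sketch that assumes the source .csv file holds one row per image and mask pair, which is what the groupby call implies; the file names and RLE strings are made up.

```python
import pandas as pd

# Assumed input layout: one row per (image, class, mask).
long_df = pd.DataFrame(
    {
        "image_id": ["a.png", "a.png", "b.png"],
        "class_id": ["shoes", "pants", "shoes"],
        "rle_mask": ["4 2", "10 3", "1 5"],
    }
)

# Same aggregation as above: collect all values of each image into lists.
wide_df = long_df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

for row in wide_df.itertuples(index=False):
    print(row.image_id, row.class_id, row.rle_mask)
```

After grouping, each image occupies a single row, and the class_id and rle_mask cells hold lists of the same length.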

Details

COCO to Hydrogen Torch format

import json

import numpy as np
import pandas as pd
from pycocotools.coco import COCO


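def get_semantic_segmentation(df, coco_path):
    coco = COCO(coco_path)

    # Load the image and category tables from the parsed COCO .json file
    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])

    images = images[["id", "file_name"]].drop_duplicates()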
    images.columns = ["image_id", "file_name"]

    categories = categories[["id", "name"]].drop_duplicates()
    categories.columns = ["category_id", "name"]
    # Filter out _background_ class
    categories = categories[categories.name != "_background_"]

    all_labels = [
        pd.DataFrame({"file_name": x, "name": categories.name.unique()})
        for x in images.file_name.unique()
    ]
    all_labels = pd.concat(all_labels)
    all_labels = all_labels.merge(images).merge(categories).reset_index(drop=True)

    rles = []
    for _, row in all_labels.iterrows():
        semantic_annotations = [
            x
            for x in df["annotations"]
            if x["image_id"] == row["image_id"]
            and int(x["category_id"]) == row["category_id"]
        ]

        if len(semantic_annotations) == 0:
            rles.append("")
            continue
        semantic_mask = np.max(
            [coco.annToMask(x) for x in semantic_annotations], axis=0
        )
        # mask2rle() is defined in "Helper functions" section
        rles.append(mask2rle(semantic_mask))

    all_labels["rle_mask"] = rles

    return all_labels


# Read data
train_path = "/data/COCO_train_annos.json"
with open(train_path, "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_semantic_segmentation(df=train, coco_path=train_path)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: x.to_list()).reset_index()
train_ann[["file_name", "name", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Image instance segmentation

Formats
Helper functions
Format conversions

H2O Hydrogen Torch supports several dataset (data) formats for an image instance segmentation experiment. Supported formats are as follows:

Hydrogen Torch format
COCO format

Hydrogen Torch format

The data following the Hydrogen Torch format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .pq file (parquet) (2) and an image folder (3):

folder_name.zip (1)
│   └───pq_name.pq (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- An image column containing the names of the images for the experiment, where each image has an image extension specified
  Note
  - Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.
  - Suppose the names of the images don't specify the data directory (location of the images in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- A class_id column containing the class names of each instance mask. Each row of the dataset should contain a list of class names, where each element in the list refers to a single mask instance.
- A rle_mask column containing run-length-encoded (RLE) masks for each instance from the class_id column. Each row of the dataset should contain a list of RLE-encoded masks, where each element in the list refers to a single instance.
  Note
  The length of each class_id and rle_mask list must be equal while referring to the total number of instances in each respective image. If an instance is not present for a given image, all lists need to be empty.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An image folder that contains all the images specified in the image column; H2O Hydrogen Torch uses the images in this folder to run the image instance segmentation experiment.
Note
All images need to have an image extension. Images can contain a mix of supported image extensions. To learn about supported image extensions, see Supported image extensions for image processing.

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not require the class_id and rle_mask columns
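
The equal-length requirement for the class_id and rle_mask lists can be checked before you zip the dataset. The helper below is a hypothetical sketch (find_mismatched_rows is not part of H2O Hydrogen Torch), and the rows are made up for illustration.

```python
import pandas as pd


def find_mismatched_rows(df: pd.DataFrame) -> list:
    """Return index labels of rows whose class_id and rle_mask lists differ in length."""
    class_lengths = df["class_id"].apply(len)
    mask_lengths = df["rle_mask"].apply(len)
    return df.index[class_lengths != mask_lengths].tolist()


df = pd.DataFrame(
    {
        "image": ["a.jpg", "b.jpg", "c.jpg"],
        "class_id": [["car", "car"], [], ["car"]],
        # The last row is deliberately broken: one class name but no mask.
        "rle_mask": [["1 3", "10 2"], [], []],
    }
)

print(find_mismatched_rows(df))  # [2]
```

The second row, with two empty lists, is valid: it describes an image without any instances.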

Example

The coco_image_instance_segmentation.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted following the Hydrogen Torch format to solve an image instance segmentation problem. The structure of the .zip file is as follows:

coco_image_instance_segmentation.zip
│   └───train.pq
│   │
│   └───images
│       └───000000151231.jpg
│       └───000000433826.jpg
│       └───000000061159.jpg
│           ...

Three random rows from the .pq file are as follows:

image_id	class_id	rle_mask
000000151231.jpg	['car' 'car']	['91949 7 92375 14 92801...
000000433826.jpg	['car' 'car']	['224473 3 224952 4 22...
000000061159.jpg	['car' 'car']	['161665 9 162291 25...

Note

In this example, the data directory in the image column is not specified. Therefore, it needs to be specified when uploading the dataset, and the images folder needs to be selected as the value for the Data folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

COCO format

The data following the COCO format for an image instance segmentation experiment is structured as follows: A .zip file (1) containing a .json file (2) and an image folder (3):

folder_name.zip (1)
│   └───json_name.json (2)
│   │
│   └───image_folder_name (3)
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

The available dataset connectors require the data for an image instance segmentation experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .json file that contains labels in a COCO format.
A folder containing all the images specified in the .json file; H2O Hydrogen Torch uses the images in this folder to run an image instance segmentation experiment.

Note

You can have multiple .json files in the .zip file that you can use as train, validation, and test datasets:

A train .json file needs to follow the format described above
A validation .json file needs to follow the same format as a train .json file
A test .json file needs to follow the same format as a train .json file, but does not require labels

Details

RLE encoding and decoding functions

from typing import Tuple

import numpy as np


def mask2rle(x: np.ndarray) -> str:
    """
    Converts input masks into RLE-encoded strings.

    Args:
        x: numpy array of shape (height, width), 1 - mask, 0 - background
    Returns:
        RLE string
    """

    pixels = x.T.flatten()
    pixels = np.concatenate([[0], pixels, [0]])
    runs = np.where(pixels[1:] != pixels[:-1])[0] + 1
    runs[1::2] -= runs[::2]
    return " ".join(str(x) for x in runs)


def rle2mask(mask_rle: str, shape: Tuple[int, int]) -> np.ndarray:
    """
    Converts RLE-encoded string into the binary mask.

    Args:
        mask_rle: RLE-encoded string
        shape: (height,width) of array to return
    Returns:
        binary mask: 1 - mask, 0 - background
    """

    s = mask_rle.split()
    starts, lengths = [np.asarray(x, dtype=int) for x in (s[0:][::2], s[1:][::2])]
    starts -= 1
    ends = starts + lengths
    img = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for lo, hi in zip(starts, ends):
        img[lo:hi] = 1
    return img.reshape(shape, order="F")  # Needed to align to RLE direction

Details

.csv file with masks to Hydrogen Torch format

import pandas as pd


df = pd.read_csv("/data/train.csv")

# Prepare the processed dataset
df = df.groupby(["image_id"]).agg(lambda x: x.to_list()).reset_index()

df[["image_id", "class_id", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Details

COCO to H2O Hydrogen Torch format

import json
import pandas as pd
from pycocotools.coco import COCO


def get_instance_segmentation(df, coco_path):
    coco = COCO(coco_path)

    images = pd.DataFrame(df["images"])
    categories = pd.DataFrame(df["categories"])
    annotations = pd.DataFrame(df["annotations"])

    rles = []
    for annotation in df["annotations"]:
        # mask2rle() is defined in "Helper functions" section
        mask = mask2rle(coco.annToMask(annotation))
        rles.append(mask)

    annotations["rle_mask"] = rles
    annotations.loc[annotations.rle_mask == "", "rle_mask"] = float("nan")

    # Copy the selection so that the column assignment below modifies a new frame
    annotations = annotations[["image_id", "category_id", "rle_mask"]].copy()

    annotations["category_id"] = annotations["category_id"].astype(int)
    annotations = annotations.merge(
        categories[["id", "name"]].drop_duplicates(),
        left_on="category_id",
        right_on="id",
        how="left",
    )
    annotations = annotations.merge(
        images[["id", "file_name"]].drop_duplicates(),
        left_on="image_id",
        right_on="id",
        how="right",
    )

    annotations.drop(["id_x", "id_y", "image_id"], axis=1, inplace=True)
    
    return annotations


# Read data
train_path = "/data/COCO_train_annos.json"
with open(train_path, "r") as fp:
    train = json.load(fp)

# Parse COCO format
train_ann = get_instance_segmentation(df=train, coco_path=train_path)

# Prepare the processed dataset
train_ann = train_ann.groupby(["file_name"]).agg(lambda x: [] if pd.isnull(x).all() else x.to_list()).reset_index()
train_ann[["file_name", "name", "rle_mask"]].to_parquet(
    "/data/train.pq", engine="pyarrow", index=False
)

Text regression

Format
Example

The data for a text regression experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

The available dataset connectors require the data for a text regression experiment to be in a .zip or .csv file.

Note

To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.

A .csv file containing the following columns:
- A text column containing the texts for the experiment
- One or more label columns containing the numerical labels (targets)
  Note
  H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting a new experiment.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
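
When random folds are not the desired strategy, you can add the fold column yourself. The sketch below keeps all records that share a group value in the same fold, one common reason to avoid a random split. The add_group_folds helper and the source_id column are hypothetical, and the texts and ratings are made up.

```python
import numpy as np
import pandas as pd


def add_group_folds(df, group_col, n_folds=5, seed=0):
    """Assign fold indexes 0..n_folds-1 so that each group stays in one fold."""
    rng = np.random.default_rng(seed)
    groups = sorted(set(df[group_col]))
    order = rng.permutation(len(groups))
    # Deal the shuffled groups out to folds in round-robin fashion.
    fold_of = {groups[i]: rank % n_folds for rank, i in enumerate(order)}
    out = df.copy()
    out["fold"] = out[group_col].map(fold_of)
    return out


df = pd.DataFrame(
    {
        "text": ["q1", "q2", "q3", "q4", "q5", "q6"],
        "rating": [0.2, 1.0, 0.4, 0.8, 0.6, 0.0],
        "source_id": ["a", "a", "b", "b", "c", "c"],
    }
)

folded = add_group_folds(df, "source_id", n_folds=3)
print(folded[["source_id", "fold"]])
```

Each group lands in exactly one fold, so records from the same source never appear in both the training data and the holdout validation sample of the same model.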

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

The wellformed_query_text_regression.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text regression problem.

Two random rows from the .csv file are as follows:

rating	text
0.2	The European Union includes how many ?
1.0	What is released when an ion is formed ?

Note

The rating column refers to the label column.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Text classification

Format
Example

The data for a text classification experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

The available dataset connectors require the data for a text classification experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- A text column containing the texts for the experiment
- One or more label columns containing either one-hot encoded multi-class labels or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
  Note
  H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. In multi-class problems, the classes are mutually exclusive, so each record belongs to exactly one class. In multi-label problems, each class is an independent label, so a record can carry several labels or none. For N one-hot label columns, exactly one column is set to 1 per record in a multi-class problem, while any number of columns (from none to all) can be set to 1 in a multi-label problem.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)
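
The one-hot label columns described above can be derived from a list-valued column. This is a sketch under assumed names: the tags column, the class names, and the is_multi_class helper are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "text": ["match report", "election and stadium funding", "weather update"],
        "tags": [["sports"], ["sports", "politics"], []],
    }
)

classes = ["politics", "sports"]

# One label column per class: 1 if the record carries the label, 0 otherwise.
for name in classes:
    df[name] = df["tags"].apply(lambda tags, name=name: int(name in tags))


def is_multi_class(frame, label_cols):
    """True if every record has exactly one label column set to 1."""
    return bool((frame[label_cols].sum(axis=1) == 1).all())


print(df[["text"] + classes])
print(is_multi_class(df, classes))  # False: records carry 1, 2, and 0 labels
```

Because the records carry one, two, and zero labels, this table only fits a multi-label problem; a multi-class table would have exactly one 1 per row.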

The amazon_reviews_text_classification.csv file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text classification problem.

The first two rows of the .csv file are as follows:

text	label
GREAT!!!!! Review: I got this toy a couple of days ago and I ABSOLUTELY LOVE IT! It is so much more realistic looking than my other baby born comfort seat. All though I dont have a baby born I had one before but I sold it at a garage sale. So I use It for my berenguar baby doll. And it even has the buckle that goes across the shoulder like a real babies car seat!!!! DEFFINATELY WORTH THE MONEY!!!!!!	Positive
This Or "Dixie Chicken" Presents Them At A Peak Review: Though lyrically the overall feel of this record is slightly provincial, it can still transport me to places I wanna be. Musically, this pop product from California is stylistically consistent. Yet the instrumentation is diverse and each member is resourceful. But it's Lowell George's vocals and slide guitar that are primarily at the center. He's not flashy and that's a positive. You get treated to 12-bar blues, a song of prescription meds for tripping and a blues with an accordian.But the three highlights are "Easy To Slip", a jaunty acoustic/electric number about lighting up and the sheer joy that memory drifting can project, "Teenage Nervous Breakdown" in which they switch to the domain of energy-driven rock and roll and the title track, a leisurely-paced country blues in which a generous helping of background vocals provides just the right amount of tension.	Positive

Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Text sequence to sequence

Format
Example

The data for a text sequence to sequence experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

The available dataset connectors require the data for a text sequence to sequence experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An input-text column containing/representing the input texts
- An output-text column containing/representing the output texts
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require an output-text column
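
A .zip file with separate train, validation, and test .csv files can be assembled with the Python standard library. This is a minimal sketch: the file names, the text and summary column names, and the rows_to_csv helper are assumptions made for illustration.

```python
import csv
import io
import zipfile


def rows_to_csv(rows, fieldnames):
    """Serialize dict rows to CSV text, keeping only the listed columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


train_rows = [{"text": "long article one", "summary": "short one"}]
validation_rows = [{"text": "long article two", "summary": "short two"}]
test_rows = [{"text": "long article three", "summary": "not needed"}]

archive_bytes = io.BytesIO()
with zipfile.ZipFile(archive_bytes, "w") as archive:
    archive.writestr("train.csv", rows_to_csv(train_rows, ["text", "summary"]))
    archive.writestr("validation.csv", rows_to_csv(validation_rows, ["text", "summary"]))
    # The test file keeps only the input-text column.
    archive.writestr("test.csv", rows_to_csv(test_rows, ["text"]))

with zipfile.ZipFile(archive_bytes) as archive:
    names = archive.namelist()
    test_header = archive.read("test.csv").decode().splitlines()[0]

print(names, test_header)
```

To write the archive to disk instead of memory, pass a file path to zipfile.ZipFile in place of the BytesIO buffer.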

The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text sequence to sequence problem. The structure of the .zip file is as follows:

cnn_dailymail_text_sequence_to_sequence.zip
│   └───train.csv

A random row from the .csv file is as follows:

Details

Random row

text	summary	id
It's official: U.S. President Barack Obama wants lawmakers to weigh in on whether to use military force in Syria. Obama sent a letter to the heads of the House and Senate on Saturday night, hours after announcing that he believes military action against Syrian targets is the right step to take over the alleged use of chemical weapons. The proposed legislation from Obama asks Congress to approve the use of military force "to deter, disrupt, prevent and degrade the potential for future uses of chemical weapons or other weapons of mass destruction." It's a step that is set to turn an international crisis into a fierce domestic political battle. There are key questions looming over the debate: What did U.N. weapons inspectors find in Syria? What happens if Congress votes no? And how will the Syrian government react? In a televised address from the White House Rose Garden earlier Saturday, the president said he would take his case to Congress, not because he has to -- but because he wants to. "While I believe I have the authority to carry out this military action without specific congressional authorization, I know that the country will be stronger if we take this course, and our actions will be even more effective," he said. "We should have this debate, because the issues are too big for business as usual." Obama said top congressional leaders had agreed to schedule a debate when the body returns to Washington on September 9. The Senate Foreign Relations Committee will hold a hearing over the matter on Tuesday, Sen. Robert Menendez said. Transcript: Read Obama's full remarks . Syrian crisis: Latest developments . U.N. inspectors leave Syria . Obama's remarks came shortly after U.N. inspectors left Syria, carrying evidence that will determine whether chemical weapons were used in an attack early last week in a Damascus suburb. "The aim of the game here, the mandate, is very clear -- and that is to ascertain whether chemical weapons were used -- and not by whom," U.N. 
spokesman Martin Nesirky told reporters on Saturday. But who used the weapons in the reported toxic gas attack in a Damascus suburb on August 21 has been a key point of global debate over the Syrian crisis. Top U.S. officials have said there's no doubt that the Syrian government was behind it, while Syrian officials have denied responsibility and blamed jihadists fighting with the rebels. British and U.S. intelligence reports say the attack involved chemical weapons, but U.N. officials have stressed the importance of waiting for an official report from inspectors. The inspectors will share their findings with U.N. Secretary-General Ban Ki-moon Ban, who has said he wants to wait until the U.N. team's final report is completed before presenting it to the U.N. Security Council. The Organization for the Prohibition of Chemical Weapons, which nine of the inspectors belong to, said Saturday that it could take up to three weeks to analyze the evidence they collected. "It needs time to be able to analyze the information and the samples," Nesirky said. He noted that Ban has repeatedly said there is no alternative to a political solution to the crisis in Syria, and that "a military solution is not an option." Bergen: Syria is a problem from hell for the U.S. Obama: 'This menace must be confronted' Obama's senior advisers have debated the next steps to take, and the president's comments Saturday came amid mounting political pressure over the situation in Syria. Some U.S. lawmakers have called for immediate action while others warn of stepping into what could become a quagmire. Some global leaders have expressed support, but the British Parliament's vote against military action earlier this week was a blow to Obama's hopes of getting strong backing from key NATO allies. On Saturday, Obama proposed what he said would be a limited military action against Syrian President Bashar al-Assad. Any military attack would not be open-ended or include U.S. ground forces, he said. 
Syria's alleged use of chemical weapons earlier this month "is an assault on human dignity," the president said. A failure to respond with force, Obama argued, "could lead to escalating use of chemical weapons or their proliferation to terrorist groups who would do our people harm. In a world with many dangers, this menace must be confronted." Syria missile strike: What would happen next? Map: U.S. and allied assets around Syria . Obama decision came Friday night . On Friday night, the president made a last-minute decision to consult lawmakers. What will happen if they vote no? It's unclear. A senior administration official told CNN that Obama has the authority to act without Congress -- even if Congress rejects his request for authorization to use force. Obama on Saturday continued to shore up support for a strike on the al-Assad government. He spoke by phone with French President Francois Hollande before his Rose Garden speech. "The two leaders agreed that the international community must deliver a resolute message to the Assad regime -- and others who would consider using chemical weapons -- that these crimes are unacceptable and those who violate this international norm will be held accountable by the world," the White House said. Meanwhile, as uncertainty loomed over how Congress would weigh in, U.S. military officials said they remained at the ready. 5 key assertions: U.S. intelligence report on Syria . Syria: Who wants what after chemical weapons horror . Reactions mixed to Obama's speech . A spokesman for the Syrian National Coalition said that the opposition group was disappointed by Obama's announcement. "Our fear now is that the lack of action could embolden the regime and they repeat his attacks in a more serious way," said spokesman Louay Safi. "So we are quite concerned." Some members of Congress applauded Obama's decision. 
House Speaker John Boehner, Majority Leader Eric Cantor, Majority Whip Kevin McCarthy and Conference Chair Cathy McMorris Rodgers issued a statement Saturday praising the president. "Under the Constitution, the responsibility to declare war lies with Congress," the Republican lawmakers said. "We are glad the president is seeking authorization for any military action in Syria in response to serious, substantive questions being raised." More than 160 legislators, including 63 of Obama's fellow Democrats, had signed letters calling for either a vote or at least a "full debate" before any U.S. action. British Prime Minister David Cameron, whose own attempt to get lawmakers in his country to support military action in Syria failed earlier this week, responded to Obama's speech in a Twitter post Saturday. "I understand and support Barack Obama's position on Syria," Cameron said. An influential lawmaker in Russia -- which has stood by Syria and criticized the United States -- had his own theory. "The main reason Obama is turning to the Congress: the military operation did not get enough support either in the world, among allies of the US or in the United States itself," Alexei Pushkov, chairman of the international-affairs committee of the Russian State Duma, said in a Twitter post. In the United States, scattered groups of anti-war protesters around the country took to the streets Saturday. "Like many other Americans...we're just tired of the United States getting involved and invading and bombing other countries," said Robin Rosecrans, who was among hundreds at a Los Angeles demonstration. What do Syria's neighbors think? Why Russia, China, Iran stand by Assad . Syria's government unfazed . After Obama's speech, a military and political analyst on Syrian state TV said Obama is "embarrassed" that Russia opposes military action against Syria, is "crying for help" for someone to come to his rescue and is facing two defeats -- on the political and military levels. 
Syria's prime minister appeared unfazed by the saber-rattling. "The Syrian Army's status is on maximum readiness and fingers are on the trigger to confront all challenges," Wael Nader al-Halqi said during a meeting with a delegation of Syrian expatriates from Italy, according to a banner on Syria State TV that was broadcast prior to Obama's address. An anchor on Syrian state television said Obama "appeared to be preparing for an aggression on Syria based on repeated lies." A top Syrian diplomat told the state television network that Obama was facing pressure to take military action from Israel, Turkey, some Arabs and right-wing extremists in the United States. "I think he has done well by doing what Cameron did in terms of taking the issue to Parliament," said Bashar Jaafari, Syria's ambassador to the United Nations. Both Obama and Cameron, he said, "climbed to the top of the tree and don't know how to get down." The Syrian government has denied that it used chemical weapons in the August 21 attack, saying that jihadists fighting with the rebels used them in an effort to turn global sentiments against it. British intelligence had put the number of people killed in the attack at more than 350. On Saturday, Obama said "all told, well over 1,000 people were murdered." U.S. Secretary of State John Kerry on Friday cited a death toll of 1,429, more than 400 of them children. No explanation was offered for the discrepancy. Iran: U.S. military action in Syria would spark 'disaster' Opinion: Why strikes in Syria are a bad idea .	Syrian official: Obama climbed to the top of the tree, "doesn't know how to get down" Obama sends a letter to the heads of the House and Senate . Obama to seek congressional approval on military action against Syria . Aim is to determine whether CW were used, not by whom, says U.N. spokesman .	0001d1afc246a7964130f43ae940af6bc6c57f01

Note

In this example, the text column refers to the input-text column, while the summary column refers to the output-text column.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Text span prediction

Format
Example

The data for a text span prediction experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

The available dataset connectors require the data for a text span prediction experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- A context column containing the input texts
- A question column containing the questions (that the input context text can answer)
- An answer column containing the substrings from the context column that answer the questions (question column)
- An optional answer-start column containing the start index of each substring answer in the context column
  Note
  - The start of the substring answers needs to be specified by integers representing the index where the answer starts in the context.
  - If you do not provide an answer-start column, H2O Hydrogen Torch will select the first occurrence of the answer in the context.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
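
The answer-start convention described above can be sketched with Python's built-in `str.find`, which returns the index of the first occurrence and so mirrors the default behavior stated in the note. This is a minimal sketch, not the product's implementation: the column names (`context`, `question`, `answer`, `answer_start`) and the helper `add_answer_start` are illustrative assumptions, and the index is assumed to be a character offset because the text above only says "index".

```python
import csv
import io


def add_answer_start(rows):
    """Return copies of the rows with an integer answer_start added.

    The value is the offset of the first occurrence of the answer in the
    context (assumed here to be a character offset). A row whose answer
    is not a substring of its context raises ValueError.
    """
    result = []
    for row in rows:
        start = row["context"].find(row["answer"])
        if start == -1:
            raise ValueError(f"Answer not found in context: {row['answer']!r}")
        result.append({**row, "answer_start": start})
    return result


rows = [
    {
        "context": "The Grotto is a replica of the grotto at Lourdes, France.",
        "question": "Where is the original grotto?",
        "answer": "Lourdes, France",
    }
]

rows_with_start = add_answer_start(rows)

# Serialize the rows as .csv text using the hypothetical column names.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["context", "question", "answer", "answer_start"]
)
writer.writeheader()
writer.writerows(rows_with_start)
csv_text = buffer.getvalue()
```

Providing the column explicitly matters mostly when an answer string occurs more than once in the context and the intended occurrence is not the first one.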

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require an answer column

The squad_text_span_prediction.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text span prediction problem. The structure of the .zip file is as follows:

squad_text_span_prediction.zip
│   └───squad_v1.csv

The following is a random row from the .csv file:

question	context	answer
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?	Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.	Saint Bernadette Soubirous

Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Text token classification

Format
Example
Conversions

The data for a text token classification experiment can be formatted following format 1 or 2.

Format 1

A .pq (parquet) file.

parquet_name.pq (1)(2)

Format 2

A .zip file containing a .pq (parquet) file.

folder_name.zip (1)
│   └───parquet_name.pq (2)

The available dataset connectors require the data for a text token classification experiment to be in a .zip or .pq file.
Note
To learn how to upload your .zip or .pq file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .pq file containing the following columns:
- A text column containing tokenized text: each sample should have a list of string tokens
- A label column containing token labels for the tokenized text; each sample should have a list of token labels. Labels should be represented as categorical string values
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
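
Before writing the parquet file, it can help to check rows against the column requirements above. The following is a minimal sketch under stated assumptions: the helper `check_token_rows` and its default column names are hypothetical, and the rule that each sample has exactly one label per token is an assumption inferred from the description, not a documented check.

```python
def check_token_rows(rows, text_column="text", label_column="ner_tags"):
    """Return a list of (row_index, problem) tuples; empty if all rows pass."""
    problems = []
    for index, row in enumerate(rows):
        tokens = row[text_column]
        labels = row[label_column]
        # Tokens and labels are both expected to be lists of strings.
        if not all(isinstance(token, str) for token in tokens):
            problems.append((index, "tokens must be strings"))
        if not all(isinstance(label, str) for label in labels):
            problems.append((index, "labels must be strings"))
        # Assumption: one label per token.
        if len(tokens) != len(labels):
            problems.append((index, "token and label counts differ"))
    return problems


rows = [
    {"text": ["Nijmeh", "of", "Lebanon"], "ner_tags": ["B-ORG", "O", "B-LOC"]},
    {"text": ["Saudi", "Arabia"], "ner_tags": ["B-LOC"]},
]
problems = check_token_rows(rows)
```

Here the second row is reported because it has two tokens but only one label.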

Note

You can have multiple .pq files in the .zip file that you can use as train, validation, and test dataframes:

A train .pq file needs to follow the format described above
A validation .pq file needs to follow the same format as a train .pq file
A test .pq file needs to follow the same format as a train .pq file, but does not require a label column

The conll2003_text_token_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text token classification problem. The structure of the .zip file is as follows:

conll2003_text_token_classification.zip
│   └───test.pq
│   └───train.pq
│   └───validation.pq

The following is a random row from the train.pq file:

id	text	pos_tags	chunk_tags	ner_tags
4158	['Nijmeh' 'of' 'Lebanon' 'beat' 'Nasr' 'of' 'Saudi' 'Arabia' '1-0' '(' 'halftime' '1-0' ')' 'in' 'their' 'Asian' 'club' 'championship' 'second' 'round' 'first' 'leg' 'tie' 'on' 'Saturday' '.']	['NNS' 'IN' 'NNP' 'VBD' 'NNP' 'IN' 'NNP' 'NNP' 'NNP' '(' 'NN' 'CD' ')' 'IN' 'PRP$' 'JJ' 'NN' 'NN' 'NN' 'NN' 'JJ' 'NN' 'NN' 'IN' 'NNP' '.']	['B-NP' 'B-VP' 'B-VP' 'I-VP' 'B-NP' 'I-NP' 'B-PP' 'B-NP' 'O' 'O' 'B-NP' 'B-NP' 'I-NP' 'I-NP' 'B-PP' 'B-NP' 'I-NP' 'B-NP' 'I-NP' 'B-VP' 'B-NP' 'B-PP' 'B-VP' 'O']	['B-ORG' 'O' 'B-LOC' 'O' 'B-ORG' 'O' 'B-LOC' 'I-LOC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'B-MISC' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O' 'O']

Note

The *_tags columns refer to the label column and can only be selected when running a text token classification experiment. Only one column from the available label columns can be selected when running an experiment.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Details

Convert CoNLL-2003 dataset

from pathlib import Path

import pandas as pd

try:
    import datasets
except ImportError:
    raise ImportError("Need datasets>=1.11.0 to download English CoNLL2003 data!")

dataset = datasets.load_dataset("conll2003")

for subset in dataset:
    out_path = Path(f"/data/conll2003/{subset}.pq")
    out_path.parent.mkdir(exist_ok=True, parents=True)

    df = pd.DataFrame(dataset[subset])

    # Convert integer-encoded labels back to their string names
    for feature in dataset[subset].features:
        if isinstance(dataset[subset].features[feature], datasets.Sequence):
            feat = dataset[subset].features[feature].feature

            if isinstance(feat, datasets.ClassLabel):
                df[feature] = df[feature].apply(feat.int2str)

    df.rename(columns={"tokens": "text"}, inplace=True)

    df.to_parquet(out_path, engine="pyarrow", index=False)

Text metric learning

Format
Example

The data for a text metric learning experiment can be formatted following format 1 or 2.

Format 1

A .csv file.

csv_name.csv (1)(2)

Format 2

A .zip file containing a .csv file.

folder_name.zip (1)
│   └───csv_name.csv (2)

The available dataset connectors require the data for a text metric learning experiment to be in a .zip or .csv file.
Note
To learn how to upload your .zip or .csv file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- A text column containing the input texts
- A label column containing the class names
  Note
  Texts that are similar should have the same class name.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
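
When random fold assignment is not the desired strategy, the fold column can be computed before upload. The sketch below shows one possible alternative: keeping all rows that share a group key in the same fold, so related records never appear in both training and validation. The helper `assign_group_folds` and the group keys are hypothetical; the grouping rule is an illustration, not a recommendation from H2O Hydrogen Torch.

```python
def assign_group_folds(group_keys, num_folds=5):
    """Map each row to a fold index so that rows sharing a key share a fold."""
    # Sort the unique keys so the assignment is deterministic.
    unique_keys = sorted(set(group_keys))
    # Spread the groups across folds 0 .. num_folds - 1 in round-robin order.
    key_to_fold = {key: i % num_folds for i, key in enumerate(unique_keys)}
    return [key_to_fold[key] for key in group_keys]


# Hypothetical group key per row, for example the author of each text.
group_keys = ["user_a", "user_b", "user_a", "user_c", "user_b"]
folds = assign_group_folds(group_keys, num_folds=3)
```

The resulting list can be written as the fold column, one value per row, using the integer values 0 to N-1 described in the note.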

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require a label column

The ubuntu_text_metric_learning.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a text metric learning problem. The structure of the .zip file is as follows:

ubuntu_text_metric_learning.zip
│   └───train.csv
│   └───test.csv

The following is a random row from the train.csv file:

text	label	fold
what is the easiest way to strip a desktop edition to a server edition ?	16	1

Note

To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Audio regression

Format
Example

The data for an audio regression experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3)
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...

The available dataset connectors require the data for an audio regression experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
  Note
  - Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
  - Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing the numerical labels (targets)
  Note
  H2O Hydrogen Torch can train models that predict multiple labels simultaneously. You can provide multiple columns with multiple unique labels and choose which labels to predict when starting an audio regression experiment.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An audio folder that contains all the audio files specified in the audio column; H2O Hydrogen Torch uses the audios in this folder to run the audio regression experiment.
Note
All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
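
The layout requirements above can be checked before upload. The following is a minimal sketch using only the Python standard library: it reads the .csv file from the archive and reports audio names that have an unsupported extension or are missing from the audio folder. The helper `find_audio_problems`, the file names, and the `audio` column name are hypothetical; the extension set is copied from the list of supported audio extensions on this page, and matching extensions case-sensitively is an assumption.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath

# Taken from "Supported audio extensions for audio processing".
SUPPORTED_AUDIO_EXTENSIONS = {".wav", ".aiff", ".flac", ".mp3", ".ogg"}


def find_audio_problems(zip_bytes, csv_name, audio_folder, audio_column="audio"):
    """Return (audio_name, problem) tuples for rows that break the layout."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        members = set(archive.namelist())
        with archive.open(csv_name) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            for row in csv.DictReader(text):
                name = row[audio_column]
                if PurePosixPath(name).suffix not in SUPPORTED_AUDIO_EXTENSIONS:
                    problems.append((name, "unsupported extension"))
                elif f"{audio_folder}/{name}" not in members:
                    problems.append((name, "missing from audio folder"))
    return problems


# Build a tiny in-memory archive to demonstrate the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "meta.csv",
        "audio,label\n0_01_0.ogg,0.5\n0_01_1.ogg,1.5\nnotes.txt,2.0\n",
    )
    archive.writestr("audios/0_01_0.ogg", b"")

problems = find_audio_problems(buffer.getvalue(), "meta.csv", "audios")
```

In this example the second row is reported as missing from the folder and the third as having an unsupported extension.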

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

The amnist_audio_regression.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve an audio regression problem. The .zip file contains a .csv file and an audio folder. The structure of the .zip file is:

amnist_audio_regression.zip
│   └───amnist_meta.csv
│   │
│   └───amnist_audios
│        └───0_01_0.ogg
│        └───0_01_1.ogg
│        └───0_01_2.ogg
│           ...

The first three rows of the .csv file are:

audio	label	fold
2_26_2.ogg	2	0
2_26_38.ogg	2	1
9_26_47.ogg	9	2

Note

In this example, the data directory in the audio column is not specified. That being the case, it needs to be specified when uploading the dataset, and the amnist_audios folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Audio classification

Format
Example

The data for an audio classification experiment needs to be in a .zip file (1) containing a .csv file (2) and an audio folder (3).

folder_name.zip (1)
│   └───csv_name.csv (2)
│   │
│   └───audio_folder_name (3) 
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       └───name_of_audio.audio_extension
│       ...

The available dataset connectors require the data for an audio classification experiment to be in a .zip file.
Note
To learn how to upload your .zip file as your dataset in H2O Hydrogen Torch, see Dataset connectors.
A .csv file containing the following columns:
- An audio column containing the names of the audios for the experiment, where each audio has an audio extension specified
  Note
  - Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
  - Suppose the names of the audio files don't specify the data directory (location of the audios in the .zip file). In that case, you can specify the data directory (data folder) when uploading the dataset or before the dataset is used for an experiment. For more information, see Import dataset settings.
- One or more label columns containing either multi-class labels (one-hot encoded) or multiple multi-class/multi-label labels. For multi-class and binary classification, a single label column is sufficient
  Note
  H2O Hydrogen Torch can solve both multi-class and multi-label classification problems. The classes are mutually exclusive in multi-class problems, while the classes represent unique labels for multi-label problems. For N label columns, in multi-class problems, only a single column can be set to 1 in each row, while in multi-label problems, any number of columns (from none to all) can be set to 1.
- An optional fold column containing cross-validation fold indexes
  Note
  Adding a fold column splits the data into subsets. A separate model is trained for each value in the fold column, where records with the corresponding value create a holdout validation sample while all the remaining records are used for training. A holdout validation sample is created if a validation dataframe is not provided during an experiment. Five-folds are assigned randomly if a fold column is not specified, which is sometimes not the desired strategy. The fold column can include integers (0, 1, 2, … , N-1 values or 1, 2, 3… , N values) or categorical values.
An audio folder that contains all the audio files specified in the audio column above; H2O Hydrogen Torch uses the audios in this folder to run the audio classification experiment.
Note
All audios need to have an audio extension. Audios can contain a mix of supported audio extensions. To learn about supported audio extensions, see Supported audio extensions for audio processing.
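
The difference between multi-class and multi-label rows described above can be expressed as a small check over the N label columns of one row. This is a minimal sketch: the helper `check_label_row` and the problem-type strings are hypothetical names chosen for illustration.

```python
def check_label_row(values, problem_type):
    """Return True if a row of 0/1 label values fits the problem type."""
    # Every label column must hold either 0 or 1.
    if any(value not in (0, 1) for value in values):
        return False
    if problem_type == "multi-class":
        # Classes are mutually exclusive: exactly one column is set to 1.
        return sum(values) == 1
    if problem_type == "multi-label":
        # Any number of columns, from none to all, can be set to 1.
        return True
    raise ValueError(f"Unknown problem type: {problem_type!r}")


one_hot_row = [0, 1, 0]
two_labels_row = [1, 1, 0]
```

For example, `check_label_row(two_labels_row, "multi-class")` is `False`, while the same row is valid for a multi-label problem.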

Note

You can have multiple .csv files in the .zip file that you can use as train, validation, and test dataframes:

A train .csv file needs to follow the format described above
A validation .csv file needs to follow the same format as a train .csv file
A test .csv file needs to follow the same format as a train .csv file, but does not require label column(s)

The esc10_audio_classification.zip file is a preprocessed dataset in H2O Hydrogen Torch and was formatted to solve a multiclass audio classification problem. The structure of the .zip file is:

esc10_audio_classification.zip
│   └───esc10_meta.csv
│   │
│   └───audio_esc10
│       └───2-37806-B-40.wav
│       └───5-200339-A-1.wav
│       └───1-172649-D-40.wav
│       ...

The first three rows of the .csv file are:

filename	fold	label
1-100032-A-0.wav	0	dog
1-110389-A-0.wav	0	dog
1-116765-A-41.wav	0	chainsaw

Note

In this example, the data directory in the filename column is not specified. That being the case, it needs to be specified when uploading the dataset, and the audio_esc10 folder needs to be selected as the value for the Data Folder setting. For more information, see Import dataset settings.
To learn how to access one of the preprocessed datasets in H2O Hydrogen Torch, see Preprocessed datasets.

Supported audio extensions for audio processing

The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:

Uncompressed: .wav, .aiff
Lossless compressed: .flac
Lossy compressed: .mp3, .ogg

Supported image extensions for image processing

The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:

Windows bitmaps: .bmp
JPEG files: .jpeg, .jpg, .jpe
JPEG 2000 files: .jp2
Portable Network Graphics: .png
WebP: .webp
Portable image format: .pbm, .pgm, .ppm, .pnm
TIFF files: .tiff, .tif
OpenEXR image files: .exr
Radiance HDR: .hdr
NumPy data array: .npy (data must be of shape [height, width, channels])

Feedback

Submit and view feedback for this page
Send feedback about H2O Hydrogen Torch to cloud-feedback@h2o.ai
