Version: v1.3.0

Demo (preprocessed) datasets

Overview

In H2O Hydrogen Torch, you can access demo (preprocessed) datasets to explore supported problem types.

Import a demo (preprocessed) dataset

To import a demo (preprocessed) dataset to H2O Hydrogen Torch, follow these steps:

1. In the H2O Hydrogen Torch navigation menu, click Import dataset.
2. In the Source box, select AWS S3.
3. In the S3 bucket name box, enter h2o-release/hydrogen-torch/1.3.0.
4. In the File name list, select one of the Demo (preprocessed) datasets in H2O Hydrogen Torch.
5. Click Continue.
6. Again, click Continue.
7. Again, click Continue.
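
The value entered in the S3 bucket name box combines a bucket name with a key prefix. As a minimal sketch, the Python helper below splits that value and builds an S3 URI for one demo file. The helper name `demo_dataset_uri` is hypothetical, and treating the first path segment as the bucket name is an assumption about how the value is interpreted. The code only builds a string and makes no network calls.

```python
# Value entered in the "S3 bucket name" box, taken from the steps above.
BUCKET_FIELD = "h2o-release/hydrogen-torch/1.3.0"


def demo_dataset_uri(file_name: str, bucket_field: str = BUCKET_FIELD) -> str:
    """Build an s3:// URI for a demo file (hypothetical helper).

    Assumption: the first path segment is the bucket name and the
    remainder is a key prefix.
    """
    bucket, _, prefix = bucket_field.strip("/").partition("/")
    key = f"{prefix}/{file_name}" if prefix else file_name
    return f"s3://{bucket}/{key}"


uri = demo_dataset_uri("flower_image_classification.zip")
print(uri)
```

Whether the bucket can be read without credentials is not stated here, so treat the resulting URI as a reference to the file's location rather than a guaranteed download link.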

Note

After importing a preprocessed dataset, you can use it for an experiment.
To learn how to preprocess your dataset for a particular supported problem type, see Dataset formats.

Demo (preprocessed) datasets in H2O Hydrogen Torch

Below are the demo (preprocessed) datasets in H2O Hydrogen Torch.

Image

flower_image_classification.zip

Description: The flower_image_classification.zip file is a preprocessed dataset that contains images of dandelions, daisies, roses, tulips, and sunflowers.
Dataset columns: image, label
Problem type: Image classification

Note

To learn more about the dataset, see Flowers Dataset.

coins_image_regression.zip

Description: The coins_image_regression.zip file is a preprocessed dataset that contains a collection of images with one or more coins. Each image has been labeled to indicate the sum of its coins. The currency of the coins is the Brazilian Real (R$).
Dataset columns: image_path, label, fold
Problem type: Image regression

Note

To learn more about the dataset, see Brazilian Coins.
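
Several demo datasets, including this one, list a fold column. As a rough sketch of how such a column can be used, the Python snippet below holds out one fold for validation and trains on the rest. It assumes the fold column stores integer fold identifiers. The rows are made-up placeholders, and `split_by_fold` is a hypothetical helper, not part of H2O Hydrogen Torch.

```python
import csv
import io

# Made-up rows using the columns listed above (image_path, label, fold).
SAMPLE_CSV = """image_path,label,fold
img_001.jpg,75,0
img_002.jpg,110,1
img_003.jpg,25,0
img_004.jpg,200,2
"""


def split_by_fold(rows, validation_fold):
    """Return (train, validation) rows; assumes integer fold identifiers."""
    train = [r for r in rows if int(r["fold"]) != validation_fold]
    valid = [r for r in rows if int(r["fold"]) == validation_fold]
    return train, valid


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_rows, valid_rows = split_by_fold(rows, validation_fold=0)
print(len(train_rows), len(valid_rows))
```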

global_wheat_image_object_detection.zip

Description: The global_wheat_image_object_detection.zip file is a preprocessed dataset that contains a collection of images of wheat fields with bounding boxes for each identified wheat head.
Dataset columns: image, class_id, x_min, y_min, x_max, y_max
Problem type: Single-class object detection

Note

To learn more about the dataset, see Global Wheat Dataset.
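
The x_min, y_min, x_max, and y_max columns suggest that each bounding box is stored as two opposite corners. The Python sketch below, under that assumption, checks that a box is well-formed and computes its area. The `Box` class and the sample coordinates are illustrative only and do not come from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One bounding box, assuming corner coordinates (hypothetical type)."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def is_valid(self) -> bool:
        # A usable box needs a positive width and a positive height.
        return self.x_min < self.x_max and self.y_min < self.y_max

    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


box = Box(x_min=10, y_min=20, x_max=60, y_max=50)
print(box.is_valid(), box.area())
```

A check like this can catch swapped or degenerate coordinates before a dataset is imported.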

bicycle_image_metric_learning.zip

Description: The bicycle_image_metric_learning.zip file is a preprocessed dataset that contains images of online bicycle ads. Each ad has multiple images marked by their class ID.
Dataset columns: image, label, fold
Problem type: Image metric learning

Note

To learn more about the dataset, see The Stanford Online Products dataset.

fashion_image_semantic_segmentation.zip

Description: The fashion_image_semantic_segmentation.zip file is a preprocessed dataset that contains images corresponding to fashion/apparel segmentations. The dataset also contains images of people wearing various clothing types in multiple poses.
Dataset columns: image, class_id, rle_mask
Problem type: Semantic segmentation

Note

To learn more about the dataset, see Clothing Co-Parsing Dataset.

coco_image_instance_segmentation.zip

Description: The coco_image_instance_segmentation.zip file is a preprocessed dataset that contains a subsample of the famous Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class. In other words, all images contain a car or multiple cars.
Dataset columns: image_id, class_id, rle_mask
Problem type: Image instance segmentation

Note

To learn more about the dataset, see COCO Dataset.

covid_ct_image_semantic_segmentation_3d.zip

Description: The covid_ct_image_semantic_segmentation_3d.zip file is a preprocessed dataset that contains a collection of 20 3D computed tomography (CT) images depicting the human chest and lungs.
Dataset columns: image, class_id, rle_mask
Problem type: 3D image semantic segmentation

Note

To learn more about the dataset, see COVID-19 CT scans.

mnist_3d_image_regression_3d.zip

Description: The mnist_3d_image_regression_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
Dataset columns: image, label
Problem type: 3D image regression

Note

To learn more about the MNIST database, see The MNIST database of handwritten digits.

mnist_3d_image_classification_3d.zip

Description: The mnist_3d_image_classification_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
Dataset columns: image, label
Problem type: 3D image classification

Note

To learn more about the MNIST database, see The MNIST database of handwritten digits.

Text

amazon_reviews_text_classification.csv

Description: The amazon_reviews_text_classification.csv file is a preprocessed dataset that contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.
Dataset columns: text, label
Problem type: Text classification

Note

To learn more about the dataset, see Amazon product data.

cnn_dailymail_text_sequence_to_sequence.zip

Description: The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset that contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.
Dataset columns: text, summary, id
Problem type: Text sequence to sequence

Note

To learn more about the dataset, see abisee/cnn-dailymail.

wellformed_query_text_regression.csv

Description: The wellformed_query_text_regression.csv file is a preprocessed dataset that contains a collection of search queries. Each query is rated between 0 and 1 to indicate how well-formed it is.
Dataset columns: text, rating
Problem type: Text regression

Note

To learn more about the dataset, see Query-wellformedness Dataset.

conll2003_text_token_classification.zip

Description: The conll2003_text_token_classification.zip file is a preprocessed dataset that contains a collection of text pieces with their named entities specified. Named entities are abstract or physical objects, such as a person or a product, that can be indicated with a proper name.
Dataset columns: id, text, pos_tags, chunk_tags, ner_tags
Problem type: Text token classification

Note

To learn more about the dataset, see Language-Independent Named Entity Recognition (II).

squad_text_span_prediction.zip

Description: The squad_text_span_prediction.zip file is a preprocessed dataset that contains questions with answers and contexts that can be used to answer the questions.
Dataset columns: question, context, answer
Problem type: Text span prediction

Note

To learn more about the dataset, see The Stanford Question Answering Dataset.
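
In span prediction, the answer is a span of the context. As a hedged illustration, the Python sketch below locates an answer inside its context and returns character offsets. It assumes the answer appears verbatim in the context. The sample strings are made up, and `find_answer_span` is a hypothetical helper rather than an H2O Hydrogen Torch function.

```python
from typing import Optional, Tuple


def find_answer_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first match, or None.

    Assumption: the answer text occurs verbatim in the context.
    """
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


context = "H2O Hydrogen Torch supports text span prediction."
answer = "text span prediction"
span = find_answer_span(context, answer)
print(span)
```

Slicing the context with the returned offsets (`context[start:end]`) recovers the answer text, which makes for a quick sanity check on a row.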

ubuntu_text_metric_learning.zip

Description: The ubuntu_text_metric_learning.zip file is a preprocessed dataset that contains a collection of questions from AskUbuntu.com. Similar questions are grouped into clusters, identified by the label column.
Dataset columns: text, label, fold
Problem type: Text metric learning

Note

To learn more about the dataset and its use in research, refer to the following arXiv paper: Semi-supervised Question Retrieval with Gated Convolutions, NAACL 2016, Tao Lei et al.
To view the original dataset from the authors, visit the following GitHub repository: AskUbuntu Question Dataset.

Audio

esc10_audio_classification.zip

Description: The esc10_audio_classification.zip file is a preprocessed dataset that contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.
Dataset columns: filename, fold, label
Problem type: Audio classification

Note

To learn more about the dataset, see ESC-50: Dataset for Environmental Sound Classification.

amnist_audio_regression.zip

Description: The amnist_audio_regression.zip file is a preprocessed dataset that contains a collection of 30,000 audio samples of spoken digits (0-9) from sixty different speakers.
Dataset columns: audio, label, fold
Problem type: Audio regression

Note

To learn more about the dataset, see Audio MNIST.

Speech

minds14_US_speech_recognition.zip

Description: The minds14_US_speech_recognition.zip file is a preprocessed dataset that contains a collection of 558 speech samples (EN-US subset) related to phone banking. The dataset was manually re-annotated as the transcriptions from the original dataset were generated using automated speech models.
Dataset columns: file, transcript, duration
Problem type: Speech recognition

Note

To learn more about the dataset, see MInDS-14.
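
The column lists above can double as a quick checklist when preparing your own data in the same shape as a demo dataset. The Python sketch below records the columns of a few demo files, copied from this page, and reports which expected columns a given header lacks. The `missing_columns` helper is hypothetical, and the sample header is made up.

```python
# Column lists copied from the dataset entries on this page.
EXPECTED_COLUMNS = {
    "flower_image_classification.zip": ["image", "label"],
    "coins_image_regression.zip": ["image_path", "label", "fold"],
    "amazon_reviews_text_classification.csv": ["text", "label"],
    "minds14_US_speech_recognition.zip": ["file", "transcript", "duration"],
}


def missing_columns(file_name, header):
    """Return the expected columns that are absent from the header."""
    expected = EXPECTED_COLUMNS[file_name]
    return [column for column in expected if column not in header]


missing = missing_columns("coins_image_regression.zip", ["image_path", "label"])
print(missing)
```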
