Version: v1.4.0

Demo (preprocessed) datasets

Overview

In H2O Hydrogen Torch, you can access demo (preprocessed) datasets to explore supported problem types.

Import a demo (preprocessed) dataset

To import a demo (preprocessed) dataset to H2O Hydrogen Torch, follow these steps:

In the H2O Hydrogen Torch navigation menu, click Import dataset.
In the Source box, select AWS S3.
In the S3 bucket name box, enter h2o-release/hydrogen-torch/1.4.0.
In the File name list, select one of the Demo (preprocessed) datasets.
Click Continue.
Again, click Continue.
Again, click Continue.
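
The value entered in the S3 bucket name box combines a bucket with a key prefix. As a rough sketch only (this is not part of the H2O Hydrogen Torch UI or any H2O API), the following Python splits that value and builds the S3 URI of a selected demo file. The function name demo_dataset_uri is hypothetical, and the sketch assumes the selected file sits directly under the prefix.

```python
def demo_dataset_uri(bucket_and_prefix: str, file_name: str) -> str:
    """Build an s3:// URI from the 'S3 bucket name' value and a file name."""
    # The first path segment is the bucket; the remainder is the key prefix.
    bucket, _, prefix = bucket_and_prefix.strip("/").partition("/")
    key = f"{prefix}/{file_name}" if prefix else file_name
    return f"s3://{bucket}/{key}"


uri = demo_dataset_uri(
    "h2o-release/hydrogen-torch/1.4.0",
    "flower_image_classification.zip",
)
print(uri)
```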

Note

After importing a preprocessed dataset, you can use it for an experiment.
To learn how to preprocess your dataset for a particular supported problem type, see Dataset formats.

Demo (preprocessed) datasets

Below are the demo (preprocessed) datasets in H2O Hydrogen Torch.

Image

flower_image_classification.zip

Description: This preprocessed dataset contains images of dandelions, daisies, roses, tulips, and sunflowers.
Dataset columns: image, label
Problem type: Image classification

note

To learn more about this dataset, see Flowers Dataset.

coins_image_regression.zip

Description: This preprocessed dataset contains a collection of images with one or more coins. Each image has been labeled with the total value of its coins. The currency of the coins is the Brazilian Real (R$).
Dataset columns: image_path, label, fold
Problem type: Image regression

note

To learn more about this dataset, see Brazilian Coins.
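
A fold column is commonly used to assign each row to a cross-validation fold. Assuming that is its role in this dataset, the sketch below shows how such a column can separate training rows from validation rows. The rows are made up and only mimic the documented columns; split_by_fold is a hypothetical helper.

```python
import csv
import io

# Hypothetical rows that only mimic the documented columns.
SAMPLE_CSV = """image_path,label,fold
img_001.jpg,1.25,0
img_002.jpg,0.60,1
img_003.jpg,2.10,0
img_004.jpg,0.35,2
"""


def split_by_fold(rows, validation_fold):
    """Return (train, validation); rows in validation_fold are held out."""
    train = [r for r in rows if int(r["fold"]) != validation_fold]
    validation = [r for r in rows if int(r["fold"]) == validation_fold]
    return train, validation


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train, validation = split_by_fold(rows, validation_fold=0)
print(len(train), len(validation))
```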

global_wheat_image_object_detection.zip

Description: This preprocessed dataset consists of wheat field images annotated with bounding boxes around identified wheat heads.
Dataset columns: image, class_id, x_min, y_min, x_max, y_max
Problem type: Single-class object detection

note

To learn more about this dataset, see Global Wheat Dataset.
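
The columns x_min, y_min, x_max, and y_max describe a bounding box by two opposite corners. As an illustration, the following sketch derives the width, height, and area of one box from those four values. The Box type and the sample coordinates are hypothetical and are not taken from the dataset.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A bounding box given by its minimum and maximum corner coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min

    @property
    def area(self) -> float:
        return self.width * self.height


box = Box(x_min=10, y_min=20, x_max=60, y_max=50)  # hypothetical values
print(box.width, box.height, box.area)
```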

bicycle_image_metric_learning.zip

Description: This preprocessed dataset contains images of online bicycle ads. Each ad has multiple images marked by their class ID.
Dataset columns: image, label, fold
Problem type: Image metric learning

note

To learn more about this dataset, see The Stanford Online Products dataset.

fashion_image_semantic_segmentation.zip

Description: This preprocessed dataset contains images corresponding to fashion/apparel segmentations. The dataset also contains images of people wearing various clothing types in multiple poses.
Dataset columns: image, class_id, rle_mask
Problem type: Semantic segmentation

note

To learn more about this dataset, see Clothing Co-Parsing Dataset.
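
The column name rle_mask suggests that masks are stored with run-length encoding, which records the lengths of consecutive runs of pixels instead of every pixel. This page does not describe the exact encoding used in the files, so the sketch below only illustrates the general idea under an assumed format: a list of run lengths that alternate between background (0) and foreground (1), starting with background. The decode_rle helper is hypothetical.

```python
def decode_rle(counts, size):
    """Expand alternating run lengths (background first) into a flat 0/1 list."""
    flat = []
    value = 0  # assumed convention: runs alternate 0, 1, 0, 1, ...
    for run in counts:
        flat.extend([value] * run)
        value = 1 - value
    if len(flat) != size:
        raise ValueError(f"runs cover {len(flat)} pixels, expected {size}")
    return flat


# Hypothetical run lengths for a 12-pixel mask.
mask = decode_rle([3, 2, 4, 1, 2], size=12)
print(mask)
```

To check the real encoding, see Dataset formats before reusing this sketch.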

coco_image_instance_segmentation.zip

Description: This preprocessed dataset contains a subsample of the famous Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class. In other words, all images contain a car or multiple cars.
Dataset columns: image_id, class_id, rle_mask
Problem type: Image instance segmentation

note

To learn more about this dataset, see COCO Dataset.

covid_ct_image_semantic_segmentation_3d.zip

Description: This preprocessed dataset contains a collection of 20 3D computed tomography (CT) images depicting the human chest and lungs.
Dataset columns: image, class_id, rle_mask
Problem type: 3D image semantic segmentation

note

To learn more about this dataset, see COVID-19 CT scans.

mnist_3d_image_regression_3d.zip

Description: This preprocessed dataset contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
Dataset columns: image, label
Problem type: 3D image regression

note

To learn more about the MNIST database, see The MNIST database of handwritten digits.

mnist_3d_image_classification_3d.zip

Description: This preprocessed dataset contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
Dataset columns: image, label
Problem type: 3D image classification

note

To learn more about the MNIST database, see The MNIST database of handwritten digits.

Text

amazon_reviews_text_classification.csv

Description: This preprocessed dataset contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.
Dataset columns: text, label
Problem type: Text classification

note

To learn more about this dataset, see Amazon product data.
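
Before using a CSV file with the documented columns, it can help to confirm that its header contains them. The sketch below checks a header against an expected column list. The sample rows and the missing_columns helper are hypothetical and do not come from the dataset.

```python
import csv
import io

EXPECTED_COLUMNS = ["text", "label"]

# Hypothetical content that only mimics the documented columns.
SAMPLE_CSV = """text,label
"Great product. Works as described.",positive
"Stopped working. Broke after two days.",negative
"""


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in expected if name not in header]


print(missing_columns(SAMPLE_CSV, EXPECTED_COLUMNS))
```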

cnn_dailymail_text_sequence_to_sequence.zip

Description: This preprocessed dataset contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.
Dataset columns: text, summary, id
Problem type: Text sequence to sequence

note

To learn more about this dataset, see abisee/cnn-dailymail.

wellformed_query_text_regression.csv

Description: This preprocessed dataset contains a collection of search queries. Each query has a rating between 0 and 1 that indicates how well-formed the query is.
Dataset columns: text, rating
Problem type: Text regression

note

To learn more about this dataset, see Query-wellformedness Dataset.

conll2003_text_token_classification.zip

Description: This preprocessed dataset contains a collection of text pieces that have their named entities specified. Named entities refer to abstract or physical objects, such as a person or a product, that can be indicated with a proper name.
Dataset columns: id, text, pos_tags, chunk_tags, ner_tags
Problem type: Text token classification

note

To learn more about this dataset, see Language-Independent Named Entity Recognition (II).

squad_text_span_prediction.zip

Description: This preprocessed dataset contains questions with answers and contexts that can be used to answer the questions.
Dataset columns: question, context, answer
Problem type: Text span prediction

note

To learn more about this dataset, see The Stanford Question Answering Dataset.
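
In span prediction, the answer is a span of the context. Assuming each answer appears verbatim in its context, the sketch below locates the character offsets of an answer. The example row and the find_answer_span helper are hypothetical.

```python
def find_answer_span(context: str, answer: str):
    """Return (start, end) character offsets of answer in context, or None."""
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


# Hypothetical row with the documented columns: question, context, answer.
question = "Which problem type is supported?"
context = "H2O Hydrogen Torch supports text span prediction."
answer = "text span prediction"

span = find_answer_span(context, answer)
print(span, context[span[0]:span[1]])
```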

ubuntu_text_metric_learning.zip

Description: This preprocessed dataset contains a collection of questions from AskUbuntu.com. Similar questions are grouped into clusters (label).
Dataset columns: text, label, fold
Problem type: Text metric learning

note

To learn more about the dataset and its use in research, refer to the following arXiv paper: Semi-supervised Question Retrieval with Gated Convolutions, NAACL 2016, Tao Lei et al.
To view the original dataset from the authors, visit the following GitHub repository: AskUbuntu Question Dataset.

Image and text

food_101_imageandtext_classification.zip

Description: This preprocessed dataset contains recipe titles and images of 5 types of salads: Beet, Caesar, Caprese, Greek, and Seaweed.
Dataset columns: image, text, label
Problem type: Image and text classification

note

To learn more about this dataset, see Food-101.

Audio

esc10_audio_classification.zip

Description: This preprocessed dataset contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.
Dataset columns: filename, fold, label
Problem type: Audio classification

note

To learn more about this dataset, see ESC-50: Dataset for Environmental Sound Classification.

amnist_audio_regression.zip

Description: This preprocessed dataset contains a collection of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
Dataset columns: audio, label, fold
Problem type: Audio regression

note

To learn more about this dataset, see Audio MNIST.

Speech

minds14_US_speech_recognition.zip

Description: This preprocessed dataset contains a collection of 558 speech samples (EN-US subset) related to phone banking. The dataset was manually re-annotated because the transcriptions in the original dataset were generated using automated speech models.
Dataset columns: file, transcript, duration
Problem type: Speech recognition

note

To learn more about this dataset, see MInDS-14.

Graph

ogbn-mag_graph_node_classification.zip

Description: This preprocessed dataset represents a heterogeneous network derived from a subset of the Microsoft Academic Graph (MAG). It contains four types of entities: papers (736,389 nodes), authors (1,134,649 nodes), institutions (8,740 nodes), and fields of study (59,965 nodes). There are four types of directed relationships between these entities: an author is "affiliated with" an institution, an author "writes" a paper, a paper "cites" another paper, and a paper "has a topic of" a field of study. Each paper is associated with a 128-dimensional word2vec feature vector that captures its content, while the other entity types do not have additional features. This dataset is structured to facilitate the analysis of complex interrelationships among academic entities.
Dataset columns: node_id, label
Problem type: Graph node classification

note

To learn more about this dataset, see Dataset ogbn-mag.
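
As a quick worked check of the node counts listed in the description, the following sketch sums the four entity types to get the size of the whole graph. The dictionary keys are illustrative labels, not identifiers taken from the dataset files.

```python
# Node counts as stated in the description above.
node_counts = {
    "paper": 736_389,
    "author": 1_134_649,
    "institution": 8_740,
    "field_of_study": 59_965,
}

total_nodes = sum(node_counts.values())
print(total_nodes)  # 1939743
```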

ogbn-proteins_graph_node_regression.zip

Description: This preprocessed dataset represents an undirected and weighted graph that illustrates various biological relationships between proteins. In this graph, each node represents a protein, while the edges denote different types of associations between these proteins, such as physical interactions, co-expression, or evolutionary relationships (homology). Each edge is associated with 8-dimensional features, where each dimension reflects the confidence level of a specific association type, with values ranging from 0 to 1 (with higher values indicating greater confidence). The proteins in this dataset are derived from 8 distinct species.
Dataset columns: label0 ... label111, node_id
Problem type: Graph node regression

note

To learn more about this dataset, see Dataset ogbn-proteins.
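
The shorthand label0 ... label111 in the column list stands for a run of numbered label columns. Assuming the numbering is contiguous, that is 112 label columns plus node_id, which the following sketch spells out.

```python
# Assumes the label columns are numbered contiguously from 0 to 111.
label_columns = [f"label{i}" for i in range(112)]
columns = label_columns + ["node_id"]

print(len(label_columns), columns[0], columns[-2], columns[-1])
```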
