Version: v1.3.0

Demo (preprocessed) datasets

Overview

In H2O Hydrogen Torch, you can access demo (preprocessed) datasets to explore supported problem types.

Import a demo (preprocessed) dataset

To import a demo (preprocessed) dataset into H2O Hydrogen Torch, follow these steps:

  1. In the H2O Hydrogen Torch navigation menu, click Import dataset.
  2. In the Source box, select AWS S3.
  3. In the S3 bucket name box, enter h2o-release/hydrogen-torch/1.3.0.
  4. In the File name list, select one of the Demo (preprocessed) datasets in H2O Hydrogen Torch.
  5. Click Continue.
  6. Again, click Continue.
  7. Again, click Continue.
Note
  • After importing a preprocessed dataset, you can use it for an experiment.
  • To learn how to preprocess your dataset for a particular supported problem type, see Dataset formats.
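
For scripted work outside the UI, the value entered in step 3 can be read as a bucket name followed by a key prefix. The sketch below assumes that reading. The helper names split_s3_location and s3_uri are hypothetical and are not part of H2O Hydrogen Torch, and whether the resulting object is readable with your credentials is not stated on this page.

```python
from typing import Tuple


def split_s3_location(location: str) -> Tuple[str, str]:
    """Split 'bucket/prefix/...' into (bucket, prefix).

    Assumption: the first path component is the bucket name and the
    remainder is a key prefix.
    """
    bucket, _, prefix = location.strip("/").partition("/")
    return bucket, prefix


def s3_uri(location: str, file_name: str) -> str:
    """Build an s3:// URI for a file under the given location."""
    bucket, prefix = split_s3_location(location)
    key = f"{prefix}/{file_name}" if prefix else file_name
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    # Value from step 3 plus one of the file names listed on this page.
    print(s3_uri("h2o-release/hydrogen-torch/1.3.0",
                 "flower_image_classification.zip"))
```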

Demo (preprocessed) datasets in H2O Hydrogen Torch

Below are the demo (preprocessed) datasets in H2O Hydrogen Torch.

Image

flower_image_classification.zip

  • Description: The flower_image_classification.zip file is a preprocessed dataset that contains images of dandelions, daisies, roses, tulips, and sunflowers.
  • Dataset columns: image, label
  • Problem type: Image classification
note

To learn more about the dataset, see Flowers Dataset.

coins_image_regression.zip

  • Description: The coins_image_regression.zip file is a preprocessed dataset that contains a collection of images with one or more coins. Each image has been labeled to indicate the sum of its coins. The currency of the coins is the Brazilian Real (R$).
  • Dataset columns: image_path, label, fold
  • Problem type: Image regression
note

To learn more about the dataset, see Brazilian Coins.

global_wheat_image_object_detection.zip

  • Description: The global_wheat_image_object_detection.zip file is a preprocessed dataset that contains a collection of images of wheat fields with bounding boxes for each identified wheat head.
  • Dataset columns: image, class_id, x_min, y_min, x_max, y_max
  • Problem type: Single-class object detection
note

To learn more about the dataset, see Global Wheat Dataset.
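
As a quick sanity check on bounding-box columns such as the ones listed above, the following sketch flags boxes whose minimum corner is not strictly below the maximum corner. It assumes one box per row with numeric pixel coordinates, which this page does not confirm. The sample rows and the invalid_boxes helper are hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the layout of the real file may differ.
SAMPLE = """image,class_id,x_min,y_min,x_max,y_max
img_001.jpg,wheat,10,20,110,80
img_001.jpg,wheat,200,40,180,90
"""


def invalid_boxes(rows):
    """Return the indices of rows whose box has non-positive width or height."""
    bad = []
    for i, row in enumerate(rows):
        x_min, y_min = float(row["x_min"]), float(row["y_min"])
        x_max, y_max = float(row["x_max"]), float(row["y_max"])
        if not (x_min < x_max and y_min < y_max):
            bad.append(i)
    return bad


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(invalid_boxes(rows))  # the second sample row has x_min > x_max
```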

bicycle_image_metric_learning.zip

  • Description: The bicycle_image_metric_learning.zip file is a preprocessed dataset that contains images of online bicycle ads. Each ad has multiple images marked by their class ID.
  • Dataset columns: image, label, fold
  • Problem type: Image metric learning
note

To learn more about the dataset, see The Stanford Online Products dataset.

fashion_image_semantic_segmentation.zip

  • Description: The fashion_image_semantic_segmentation.zip file is a preprocessed dataset that contains images of people wearing various clothing types in multiple poses, together with the corresponding fashion/apparel segmentations.
  • Dataset columns: image, class_id, rle_mask
  • Problem type: Semantic segmentation
note

To learn more about the dataset, see Clothing Co-Parsing Dataset.

coco_image_instance_segmentation.zip

  • Description: The coco_image_instance_segmentation.zip file is a preprocessed dataset that contains a subsample of the Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class, so every image contains one or more cars.
  • Dataset columns: image_id, class_id, rle_mask
  • Problem type: Image instance segmentation
note

To learn more about the dataset, see COCO Dataset.

covid_ct_image_semantic_segmentation_3d.zip

  • Description: The covid_ct_image_semantic_segmentation_3d.zip file is a preprocessed dataset that contains a collection of 20 3D computed tomography (CT) images depicting the human chest and lungs.
  • Dataset columns: image, class_id, rle_mask
  • Problem type: 3D image semantic segmentation
note

To learn more about the dataset, see COVID-19 CT scans.

mnist_3d_image_regression_3d.zip

  • Description: The mnist_3d_image_regression_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
  • Dataset columns: image, label
  • Problem type: 3D image regression
note

To learn more about the MNIST database, see The MNIST database of handwritten digits.

mnist_3d_image_classification_3d.zip

  • Description: The mnist_3d_image_classification_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
  • Dataset columns: image, label
  • Problem type: 3D image classification
note

To learn more about the MNIST database, see The MNIST database of handwritten digits.

Text

amazon_reviews_text_classification.csv

  • Description: The amazon_reviews_text_classification.csv file is a preprocessed dataset that contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.
  • Dataset columns: text, label
  • Problem type: Text classification
note

To learn more about the dataset, see Amazon product data.
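
Before using a downloaded CSV dataset, it can help to confirm that its header matches the columns listed above. The following is a minimal sketch using only the Python standard library. The missing_columns helper and the sample row (including its label value) are hypothetical.

```python
import csv
import io

# Columns listed on this page for amazon_reviews_text_classification.csv.
EXPECTED_COLUMNS = {"text", "label"}


def missing_columns(csv_file, expected):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    return sorted(set(expected) - set(reader.fieldnames or []))


# Hypothetical in-memory sample standing in for the real file.
sample = io.StringIO('text,label\n"Great product. Works as described.",1\n')
print(missing_columns(sample, EXPECTED_COLUMNS))
```

To check a file on disk, pass an open file object instead, for example open(path, newline="", encoding="utf-8").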

cnn_dailymail_text_sequence_to_sequence.zip

  • Description: The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset that contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.
  • Dataset columns: text, summary, id
  • Problem type: Text sequence to sequence
note

To learn more about the dataset, see abisee/cnn-dailymail.

wellformed_query_text_regression.csv

  • Description: The wellformed_query_text_regression.csv file is a preprocessed dataset that contains a collection of search queries. Each query has a rating between 0 and 1 that indicates how well-formed the query is.
  • Dataset columns: text, rating
  • Problem type: Text regression
note

To learn more about the dataset, see Query-wellformedness Dataset.

conll2003_text_token_classification.zip

  • Description: The conll2003_text_token_classification.zip file is a preprocessed dataset that contains a collection of text pieces with their named entities annotated. Named entities are abstract or physical objects, such as a person or a product, that can be referred to by a proper name.
  • Dataset columns: id, text, pos_tags, chunk_tags, ner_tags
  • Problem type: Text token classification
note

To learn more about the dataset, see Language-Independent Named Entity Recognition (II).

squad_text_span_prediction.zip

  • Description: The squad_text_span_prediction.zip file is a preprocessed dataset that contains questions with answers and contexts that can be used to answer the questions.
  • Dataset columns: question, context, answer
  • Problem type: Text span prediction
note

To learn more about the dataset, see The Stanford Question Answering Dataset.

ubuntu_text_metric_learning.zip

  • Description: The ubuntu_text_metric_learning.zip file is a preprocessed dataset that contains a collection of questions from AskUbuntu.com. Similar questions are grouped into clusters, identified by the label column.
  • Dataset columns: text, label, fold
  • Problem type: Text metric learning

Audio

esc10_audio_classification.zip

  • Description: The esc10_audio_classification.zip file is a preprocessed dataset that contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.
  • Dataset columns: filename, fold, label
  • Problem type: Audio classification
note

To learn more about the dataset, see ESC-50: Dataset for Environmental Sound Classification.

amnist_audio_regression.zip

  • Description: The amnist_audio_regression.zip file is a preprocessed dataset that contains a collection of 30,000 audio samples of spoken digits (0-9) from sixty different speakers.
  • Dataset columns: audio, label, fold
  • Problem type: Audio regression
note

To learn more about the dataset, see Audio MNIST.

Speech

minds14_US_speech_recognition.zip

  • Description: The minds14_US_speech_recognition.zip file is a preprocessed dataset that contains a collection of 558 speech samples (EN-US subset) related to phone banking. The dataset was manually re-annotated because the transcriptions in the original dataset were generated using automated speech models.
  • Dataset columns: file, transcript, duration
  • Problem type: Speech recognition
note

To learn more about the dataset, see MInDS-14.
