Version: v1.2.0

Preprocessed datasets

In H2O Hydrogen Torch, you can access preprocessed datasets to explore supported problem types.

Import preprocessed dataset

To import a preprocessed dataset into H2O Hydrogen Torch, follow these steps:

  1. In the H2O Hydrogen Torch navigation menu, click Import dataset.
  2. In the File name list, select one of the preprocessed datasets in H2O Hydrogen Torch.
  3. Click Continue.
  4. Again, click Continue.
  5. Again, click Continue.
Note
  • After importing a preprocessed dataset, you can use it for an experiment.
  • To learn how to preprocess your own dataset for a particular supported problem type, see Dataset formats.

Preprocessed datasets in H2O Hydrogen Torch

Flower image classification

  • File name: flower_image_classification.zip
  • Description: The dataset contains images of dandelions, daisies, roses, tulips, and sunflowers.
  • Dataset columns: image, label
  • Problem type: Image classification
Note

To learn more about the dataset, see Flowers Dataset.

Coins image regression

  • File name: coins_image_regression.zip
  • Description: The dataset contains a collection of images with one or more coins. Each image has been labeled to indicate the sum of its coins. The currency of the coins is the Brazilian Real (R$).
  • Dataset columns: image_path, label, fold
  • Problem type: Image regression
Note

To learn more about the dataset, see Brazilian Coins.
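
The column list above includes a fold column. A minimal sketch of how such a column could be used to hold out one fold for validation, assuming that fold is an integer fold index (this page does not state that). The helper name split_by_fold and the sample rows are hypothetical and are not taken from the real dataset:

```python
def split_by_fold(rows, validation_fold):
    """Split rows into (train, validation) using the value in the fold column."""
    train = [row for row in rows if row["fold"] != validation_fold]
    validation = [row for row in rows if row["fold"] == validation_fold]
    return train, validation


# Made-up rows for illustration only; paths and labels are hypothetical.
rows = [
    {"image_path": "a.jpg", "label": 1.50, "fold": 0},
    {"image_path": "b.jpg", "label": 0.75, "fold": 1},
    {"image_path": "c.jpg", "label": 2.00, "fold": 0},
]

train, validation = split_by_fold(rows, validation_fold=1)
print(len(train), len(validation))
```

Keeping the fold assignment in the data itself makes the split reproducible across runs.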

Global wheat image object detection

  • File name: global_wheat_image_object_detection.zip
  • Description: The dataset contains a collection of images of wheat fields with bounding boxes for each identified wheat head.
  • Dataset columns: image, class_id, x_min, y_min, x_max, y_max
  • Problem type: Single-class object detection
Note

To learn more about the dataset, see Global Wheat Dataset.
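
The bounding-box columns listed above can be sanity-checked before training. The following is a rough sketch, assuming the coordinates are pixel values with the origin at the top-left corner of the image and that the image size is known (neither is stated on this page). The helper name box_problems and the sample row are hypothetical:

```python
def box_problems(row, image_width, image_height):
    """Return a list of problems found in one bounding-box row."""
    problems = []
    x_min, y_min = float(row["x_min"]), float(row["y_min"])
    x_max, y_max = float(row["x_max"]), float(row["y_max"])
    if x_min >= x_max:
        problems.append("x_min must be smaller than x_max")
    if y_min >= y_max:
        problems.append("y_min must be smaller than y_max")
    if x_min < 0 or y_min < 0 or x_max > image_width or y_max > image_height:
        problems.append("box extends outside the image")
    return problems


# Made-up row for illustration only; values are hypothetical.
good = {"image": "example.jpg", "class_id": "wheat",
        "x_min": 10, "y_min": 20, "x_max": 110, "y_max": 80}
bad = dict(good, x_max=5)  # x_max is now left of x_min

print(box_problems(good, 1024, 1024))
print(box_problems(bad, 1024, 1024))
```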

Amazon reviews text classification

  • File name: amazon_reviews_text_classification.csv
  • Description: The dataset contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.
  • Dataset columns: text, label
  • Problem type: Text classification
Note

To learn more about the dataset, see Amazon product data.
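
When preparing your own CSV file in the same shape, it can help to confirm that the header contains the expected columns. A minimal sketch using only the Python standard library; the helper name missing_columns, the sample rows, and the 0/1 label encoding are assumptions for illustration and are not taken from the real dataset:

```python
import csv
import io


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Made-up rows for illustration only.
sample = (
    "text,label\n"
    '"Great product. Works as described.",1\n'
    '"Broke after a week.",0\n'
)

print(missing_columns(sample, ["text", "label"]))
print(missing_columns(sample, ["text", "rating"]))
```

The same check applies to any of the column lists on this page by changing the expected names.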

Stanford bicycle image metric learning

  • File name: bicycle_image_metric_learning.zip
  • Description: The dataset contains images of online bicycle ads. Each ad has multiple images marked by their class ID.
  • Dataset columns: image, label, fold
  • Problem type: Image metric learning
Note

To learn more about the dataset, see The Stanford Online Products dataset.

Fashion image semantic segmentation

  • File name: fashion_image_semantic_segmentation.zip
  • Description: The dataset contains images of people wearing various clothing types in multiple poses, annotated for fashion/apparel segmentation.
  • Dataset columns: image, class_id, rle_mask
  • Problem type: Semantic segmentation
Note

To learn more about the dataset, see Clothing Co-Parsing Dataset.

CNN/Daily Mail text sequence to sequence

  • File name: cnn_dailymail_text_sequence_to_sequence.zip
  • Description: The dataset contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.
  • Dataset columns: text, summary, id
  • Problem type: Text sequence to sequence
Note

To learn more about the dataset, see abisee/cnn-dailymail.

Well-formed query text regression

  • File name: wellformed_query_text_regression.csv
  • Description: The dataset contains a collection of search queries. Each query has a rating between 0 and 1 that indicates how well-formed the query is.
  • Dataset columns: text, rating
  • Problem type: Text regression
Note

To learn more about the dataset, see Query-wellformedness Dataset.

CoNLL-2003 text token classification

  • File name: conll2003_text_token_classification.zip
  • Description: The dataset contains a collection of text pieces with their named entities annotated. Named entities are abstract or physical objects, such as a person or a product, that can be identified by a proper name.
  • Dataset columns: id, text, pos_tags, chunk_tags, ner_tags
  • Problem type: Text token classification
Note

To learn more about the dataset, see Language-Independent Named Entity Recognition (II).

SQuAD text span prediction

  • File name: squad_text_span_prediction.zip
  • Description: The dataset contains questions with answers and contexts that can be used to answer the questions.
  • Dataset columns: question, context, answer
  • Problem type: Text span prediction
Note

To learn more about the dataset, see The Stanford Question Answering Dataset.

Ubuntu text metric learning

  • File name: ubuntu_text_metric_learning.zip
  • Description: The dataset contains a preprocessed collection of questions from AskUbuntu.com. Similar questions are grouped into clusters, identified by the label column.
  • Dataset columns: text, label, fold
  • Problem type: Text metric learning

COCO cars image instance segmentation

  • File name: coco_image_instance_segmentation.zip
  • Description: The dataset contains a subsample of the Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class, so every image contains one or more cars.
  • Dataset columns: image_id, class_id, rle_mask
  • Problem type: Image instance segmentation
Note

To learn more about the dataset, see COCO Dataset.

Environmental sound audio classification

  • File name: esc10_audio_classification.zip
  • Description: The dataset contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.
  • Dataset columns: filename, fold, label
  • Problem type: Audio classification
Note

To learn more about the dataset, see ESC-50: Dataset for Environmental Sound Classification.

MNIST audio regression

  • File name: amnist_audio_regression.zip
  • Description: The dataset contains a collection of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
  • Dataset columns: audio, label, fold
  • Problem type: Audio regression
Note

To learn more about the dataset, see Audio MNIST.
