Preprocessed datasets
In H2O Hydrogen Torch, you can access preprocessed datasets to explore supported problem types.
Import preprocessed dataset
To import a preprocessed dataset into H2O Hydrogen Torch, follow these steps:
- In the H2O Hydrogen Torch navigation menu, click Import dataset.
- In the File name list, select one of the preprocessed datasets in H2O Hydrogen Torch.
- Click Continue.
- Again, click Continue.
- Again, click Continue.
- After you import a preprocessed dataset, you can use it for an experiment.
- To learn how to preprocess your own dataset for a particular supported problem type, see Dataset formats.
Preprocessed datasets in H2O Hydrogen Torch
Flower image classification
- File name: flower_image_classification.zip
- Description: The dataset contains images of dandelions, daisies, roses, tulips, and sunflowers.
- Dataset columns: image, label
- Problem type: Image classification
To learn more about the dataset, see Flowers Dataset.
Coins image regression
- File name: coins_image_regression.zip
- Description: The dataset contains a collection of images with one or more coins. Each image has been labeled to indicate the sum of its coins. The currency of the coins is the Brazilian Real (R$).
- Dataset columns: image_path, label, fold
- Problem type: Image regression
To learn more about the dataset, see Brazilian Coins.
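Several of the preprocessed datasets, including this one, list a fold column. The following is a minimal sketch of how such a column could be used to hold out one fold for validation. It assumes that fold stores an integer fold index per row and that the table is available as CSV; neither is stated on this page. The sample rows and the split_by_fold helper are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical sample rows that reuse the column names listed above
# (image_path, label, fold). The values are invented, not real data.
SAMPLE_CSV = """image_path,label,fold
img_001.jpg,1.25,0
img_002.jpg,0.50,1
img_003.jpg,2.10,0
img_004.jpg,0.75,2
"""


def split_by_fold(rows, validation_fold):
    """Return (train, validation) row lists, holding out one fold.

    Assumes each row has a "fold" value that parses as an integer.
    """
    train = [r for r in rows if int(r["fold"]) != validation_fold]
    valid = [r for r in rows if int(r["fold"]) == validation_fold]
    return train, valid


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_rows, valid_rows = split_by_fold(rows, validation_fold=0)
print(len(train_rows), "training rows,", len(valid_rows), "validation rows")
```

Holding out a different fold index each time gives a simple cross-validation loop over the same table.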
Global wheat image object detection
- File name: global_wheat_image_object_detection.zip
- Description: The dataset contains a collection of images of wheat fields with bounding boxes for each identified wheat head.
- Dataset columns: image, class_id, x_min, y_min, x_max, y_max
- Problem type: Single-class object detection
To learn more about the dataset, see Global Wheat Dataset.
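The x_min, y_min, x_max, and y_max columns describe a bounding box by two opposite corners. As a rough sketch of what that encoding implies, the snippet below derives the width, height, and area of a single box and checks that its corners are ordered correctly. It assumes pixel coordinates with the origin at the top-left of the image, which this page does not confirm; the Box class is a hypothetical helper, not part of H2O Hydrogen Torch.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One bounding box in corner format (hypothetical helper)."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def width(self) -> float:
        return self.x_max - self.x_min

    def height(self) -> float:
        return self.y_max - self.y_min

    def area(self) -> float:
        return self.width() * self.height()

    def is_valid(self) -> bool:
        # A usable box needs a strictly positive width and height.
        return self.x_max > self.x_min and self.y_max > self.y_min


# Illustrative values only; not taken from the dataset.
box = Box(x_min=10, y_min=20, x_max=50, y_max=80)
print(box.width(), box.height(), box.area(), box.is_valid())
```

A check like is_valid can catch rows where the corners were swapped before a dataset is imported.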
Amazon Review text classification
- File name: amazon_reviews_text_classification.csv
- Description: The dataset contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.
- Dataset columns: text, label
- Problem type: Text classification
To learn more about the dataset, see Amazon product data.
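Because this dataset is a plain CSV file with text and label columns, a quick header check can confirm that a file has the expected shape before it is imported. The sketch below uses only the Python standard library. The missing_columns helper is hypothetical, and the in-memory sample rows (including how the label is encoded) are invented for illustration.

```python
import csv
import io

# Column names taken from the list above.
EXPECTED_COLUMNS = {"text", "label"}


def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the sorted list of expected columns absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    return sorted(set(expected) - set(reader.fieldnames or []))


# Hypothetical in-memory files standing in for a real CSV on disk.
good_file = io.StringIO("text,label\nGreat product. Works as described.,1\n")
bad_file = io.StringIO("text\nGreat product. Works as described.\n")

print(missing_columns(good_file))
print(missing_columns(bad_file))
```

To check a file on disk, pass an open file handle instead, for example `open(path, newline="", encoding="utf-8")`.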
Stanford bicycle image metric learning
- File name: bicycle_image_metric_learning.zip
- Description: The dataset contains images of online bicycle ads. Each ad has multiple images marked by their class ID.
- Dataset columns: image, label, fold
- Problem type: Image metric learning
To learn more about the dataset, see The Stanford Online Products dataset.
Fashion image semantic segmentation
- File name: fashion_image_semantic_segmentation.zip
- Description: The dataset contains images of people wearing various clothing types in multiple poses, annotated with fashion/apparel segmentations.
- Dataset columns: image, class_id, rle_mask
- Problem type: Semantic segmentation
To learn more about the dataset, see Clothing Co-Parsing Dataset.
CNN/Daily Mail text sequence to sequence
- File name: cnn_dailymail_text_sequence_to_sequence.zip
- Description: The dataset contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.
- Dataset columns: text, summary, id
- Problem type: Text sequence to sequence
To learn more about the dataset, see abisee/cnn-dailymail.
Well-formed query text regression
- File name: wellformed_query_text_regression.csv
- Description: The dataset contains a collection of search queries. Each query has a rating between 0 and 1 that indicates how well-formed the query is.
- Dataset columns: text, rating
- Problem type: Text regression
To learn more about the dataset, see Query-wellformedness Dataset.
CoNLL-2003 text token classification
- File name: conll2003_text_token_classification.zip
- Description: The dataset contains a collection of text pieces with their named entities specified. Named entities are abstract or physical objects, such as a person or a product, that can be referred to with a proper name.
- Dataset columns: id, text, pos_tags, chunk_tags, ner_tags
- Problem type: Text token classification
To learn more about the dataset, see Language-Independent Named Entity Recognition (II).
SQuAD text span prediction
- File name: squad_text_span_prediction.zip
- Description: The dataset contains questions with answers and contexts that can be used to answer the questions.
- Dataset columns: question, context, answer
- Problem type: Text span prediction
To learn more about the dataset, see The Stanford Question Answering Dataset.
Ubuntu text metric learning
- File name: ubuntu_text_metric_learning.zip
- Description: The dataset contains a preprocessed collection of questions from AskUbuntu.com. Similar questions are grouped into clusters, identified by the label column.
- Dataset columns: text, label, fold
- Problem type: Text metric learning
- To learn more about the dataset and its use in research, refer to the following arXiv paper: Semi-supervised Question Retrieval with Gated Convolutions, NAACL 2016, Tao Lei et al.
- To view the original dataset from the authors, visit the following GitHub repository: AskUbuntu Question Dataset.
COCO cars image instance segmentation
- File name: coco_image_instance_segmentation.zip
- Description: The dataset contains a subsample of the Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class. In other words, every image contains one or more cars.
- Dataset columns: image_id, class_id, rle_mask
- Problem type: Image instance segmentation
To learn more about the dataset, see COCO Dataset.
Environmental sound audio classification
- File name: esc10_audio_classification.zip
- Description: The dataset contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.
- Dataset columns: filename, fold, label
- Problem type: Audio classification
To learn more about the dataset, see ESC-50: Dataset for Environmental Sound Classification.
MNIST audio regression
- File name: amnist_audio_regression.zip
- Description: The dataset contains a collection of 30,000 audio samples of spoken digits (0-9) from sixty different speakers.
- Dataset columns: audio, label, fold
- Problem type: Audio regression
To learn more about the dataset, see Audio MNIST.