Demo (preprocessed) datasets
Overview
In H2O Hydrogen Torch, you can access demo (preprocessed) datasets to explore supported problem types.
Import a demo (preprocessed) dataset
To import a demo (preprocessed) dataset into H2O Hydrogen Torch, follow these steps:
- In the H2O Hydrogen Torch navigation menu, click Import dataset.
- In the Source box, select AWS S3.
- In the S3 bucket name box, enter: h2o-release/hydrogen-torch/1.3.0
- In the File name list, select one of the Demo (preprocessed) datasets in H2O Hydrogen Torch.
- Click Continue.
- Again, click Continue.
- Again, click Continue.
- After importing a preprocessed dataset, you can use it for an experiment.
- To learn how to preprocess your dataset for a particular supported problem type, see Dataset formats.
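As a rough illustration of where the demo files live, the sketch below composes an S3 URI from the value entered in the S3 bucket name box and a demo file name. It assumes that h2o-release is the bucket and hydrogen-torch/1.3.0 is a key prefix inside it. The helper demo_dataset_uri is hypothetical and is not part of H2O Hydrogen Torch.

```python
# Value entered in the "S3 bucket name" box in the import steps.
S3_LOCATION = "h2o-release/hydrogen-torch/1.3.0"


def demo_dataset_uri(file_name: str, location: str = S3_LOCATION) -> str:
    """Compose an s3:// URI for a demo file (hypothetical helper).

    Assumption: the first path segment is the bucket name and the
    remainder is a key prefix within that bucket.
    """
    bucket, _, prefix = location.partition("/")
    key = f"{prefix}/{file_name}" if prefix else file_name
    return f"s3://{bucket}/{key}"


print(demo_dataset_uri("flower_image_classification.zip"))
```

The same helper can be called with any of the file names listed on this page.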
Demo (preprocessed) datasets in H2O Hydrogen Torch
Below are the demo (preprocessed) datasets in H2O Hydrogen Torch.
Image
flower_image_classification.zip
- Description: The flower_image_classification.zip file is a preprocessed dataset that contains images of dandelions, daisies, roses, tulips, and sunflowers.
- Dataset columns: image, label
- Problem type: Image classification
To learn more about the dataset, see Flowers Dataset.
coins_image_regression.zip
- Description: The coins_image_regression.zip file is a preprocessed dataset that contains a collection of images with one or more coins. Each image has been labeled to indicate the sum of its coins. The currency of the coins is the Brazilian Real (R$).
- Dataset columns: image_path, label, fold
- Problem type: Image regression
To learn more about the dataset, see Brazilian Coins.
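Several demo datasets, including this one, carry a fold column. The sketch below shows one way such a column could be used to hold out a fold for validation. It assumes that fold stores a cross-validation fold index. The sample rows and the split_by_fold helper are made up for illustration and do not come from the actual dataset.

```python
import csv
import io

# Made-up rows that use the column names listed above (image_path, label, fold).
SAMPLE_CSV = """image_path,label,fold
coins_001.jpg,75,0
coins_002.jpg,110,1
coins_003.jpg,5,0
coins_004.jpg,200,2
"""


def split_by_fold(rows, validation_fold):
    """Return (train, validation); rows whose fold matches are held out."""
    train = [row for row in rows if row["fold"] != validation_fold]
    validation = [row for row in rows if row["fold"] == validation_fold]
    return train, validation


# csv.DictReader yields every value as a string, so the fold is compared as "0".
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
train_rows, validation_rows = split_by_fold(rows, validation_fold="0")
print(len(train_rows), len(validation_rows))
```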
global_wheat_image_object_detection.zip
- Description: The global_wheat_image_object_detection.zip file is a preprocessed dataset that contains a collection of images of wheat fields with bounding boxes for each identified wheat head.
- Dataset columns: image, class_id, x_min, y_min, x_max, y_max
- Problem type: Single-class object detection
To learn more about the dataset, see Global Wheat Dataset.
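To make the corner-style columns concrete, here is a minimal sketch of how one row's x_min, y_min, x_max, and y_max values could be turned into a box with a width, height, and area. This page does not state the coordinate units, so the example values are arbitrary, and the BoundingBox class is hypothetical rather than part of H2O Hydrogen Torch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box described by two opposite corners (hypothetical type)."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def is_valid(self) -> bool:
        # A usable box needs a positive extent along both axes.
        return self.x_max > self.x_min and self.y_max > self.y_min

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min

    @property
    def area(self) -> float:
        return self.width * self.height


# Arbitrary example values, not taken from the dataset.
box = BoundingBox(x_min=10, y_min=20, x_max=50, y_max=80)
print(box.is_valid(), box.width, box.height, box.area)
```

A check like is_valid can help catch rows where the minimum and maximum corners were swapped during preprocessing.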
bicycle_image_metric_learning.zip
- Description: The bicycle_image_metric_learning.zip file is a preprocessed dataset that contains images of online bicycle ads. Each ad has multiple images marked by their class ID.
- Dataset columns: image, label, fold
- Problem type: Image metric learning
To learn more about the dataset, see The Stanford Online Products dataset.
fashion_image_semantic_segmentation.zip
- Description: The fashion_image_semantic_segmentation.zip file is a preprocessed dataset that contains images corresponding to fashion and apparel segmentations. The dataset also contains images of people wearing various clothing types in multiple poses.
- Dataset columns: image, class_id, rle_mask
- Problem type: Semantic segmentation
To learn more about the dataset, see Clothing Co-Parsing Dataset.
coco_image_instance_segmentation.zip
- Description: The coco_image_instance_segmentation.zip file is a preprocessed dataset that contains a subsample of the Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class, so every image contains one or more cars.
- Dataset columns: image_id, class_id, rle_mask
- Problem type: Image instance segmentation
To learn more about the dataset, see COCO Dataset.
covid_ct_image_semantic_segmentation_3d.zip
- Description: The covid_ct_image_semantic_segmentation_3d.zip file is a preprocessed dataset that contains a collection of 20 3D computed tomography (CT) images depicting the human chest and lungs.
- Dataset columns: image, class_id, rle_mask
- Problem type: 3D image semantic segmentation
To learn more about the dataset, see COVID-19 CT scans.
mnist_3d_image_regression_3d.zip
- Description: The mnist_3d_image_regression_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
- Dataset columns: image, label
- Problem type: 3D image regression
To learn more about the MNIST database, see The MNIST database of handwritten digits.
mnist_3d_image_classification_3d.zip
- Description: The mnist_3d_image_classification_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
- Dataset columns: image, label
- Problem type: 3D image classification
To learn more about the MNIST database, see The MNIST database of handwritten digits.
Text
amazon_reviews_text_classification.csv
- Description: The amazon_reviews_text_classification.csv file is a preprocessed dataset that contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.
- Dataset columns: text, label
- Problem type: Text classification
To learn more about the dataset, see Amazon product data.
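Before importing a CSV dataset of your own that follows this layout, it can help to confirm that the header contains the expected columns. The sketch below does this with the Python standard library. The sample row, including the label value 1, is invented for illustration, and the missing_columns helper is hypothetical.

```python
import csv
import io

# Column names listed above for amazon_reviews_text_classification.csv.
EXPECTED_COLUMNS = ["text", "label"]

# Invented sample content; the real label encoding is not described on this page.
SAMPLE_CSV = 'text,label\n"Great product. Works as described.",1\n'


def missing_columns(csv_text, expected):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    return [name for name in expected if name not in header]


print(missing_columns(SAMPLE_CSV, EXPECTED_COLUMNS))
```

An empty list means every expected column is present. Any names in the returned list are missing from the header.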
cnn_dailymail_text_sequence_to_sequence.zip
- Description: The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset that contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.
- Dataset columns: text, summary, id
- Problem type: Text sequence to sequence
To learn more about the dataset, see abisee/cnn-dailymail.
wellformed_query_text_regression.csv
- Description: The wellformed_query_text_regression.csv file is a preprocessed dataset that contains a collection of search queries. Each query has a rating between 0 and 1 that indicates how well-formed the query is.
- Dataset columns: text, rating
- Problem type: Text regression
To learn more about the dataset, see Query-wellformedness Dataset.
conll2003_text_token_classification.zip
- Description: The conll2003_text_token_classification.zip file is a preprocessed dataset that contains a collection of text passages with their named entities annotated. Named entities are abstract or physical objects, such as a person or product, that can be referred to by a proper name.
- Dataset columns: id, text, pos_tags, chunk_tags, ner_tags
- Problem type: Text token classification
To learn more about the dataset, see Language-Independent Named Entity Recognition (II).
squad_text_span_prediction.zip
- Description: The squad_text_span_prediction.zip file is a preprocessed dataset that contains questions with answers and contexts that can be used to answer the questions.
- Dataset columns: question, context, answer
- Problem type: Text span prediction
To learn more about the dataset, see The Stanford Question Answering Dataset.
ubuntu_text_metric_learning.zip
- Description: The ubuntu_text_metric_learning.zip file is a preprocessed dataset that contains a collection of questions from AskUbuntu.com. Similar questions are grouped into clusters, identified by the label column.
- Dataset columns: text, label, fold
- Problem type: Text metric learning
- To learn more about the dataset and its use in research, refer to the following arXiv paper: Semi-supervised Question Retrieval with Gated Convolutions, NAACL 2016, Tao Lei et al.
- To view the original dataset from the authors, visit the following GitHub repository: AskUbuntu Question Dataset.
Audio
esc10_audio_classification.zip
- Description: The esc10_audio_classification.zip file is a preprocessed dataset that contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.
- Dataset columns: filename, fold, label
- Problem type: Audio classification
To learn more about the dataset, see ESC-50: Dataset for Environmental Sound Classification.
amnist_audio_regression.zip
- Description: The amnist_audio_regression.zip file is a preprocessed dataset that contains a collection of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
- Dataset columns: audio, label, fold
- Problem type: Audio regression
To learn more about the dataset, see Audio MNIST.
Speech
minds14_US_speech_recognition.zip
- Description: The minds14_US_speech_recognition.zip file is a preprocessed dataset that contains a collection of 558 speech samples (EN-US subset) related to phone banking. The dataset was manually re-annotated because the transcriptions in the original dataset were generated using automated speech models.
- Dataset columns: file, transcript, duration
- Problem type: Speech recognition
To learn more about the dataset, see MInDS-14.