Demo (preprocessed) datasets
Overview
In H2O Hydrogen Torch, you can access demo (preprocessed) datasets to explore supported problem types.
Import a demo (preprocessed) dataset
To import a demo (preprocessed) dataset into H2O Hydrogen Torch, follow these steps:
- In the H2O Hydrogen Torch navigation menu, click Import dataset.
- In the Source box, select AWS S3.
- In the S3 bucket name box, enter h2o-release/hydrogen-torch/1.3.0.
- In the File name list, select one of the Demo (preprocessed) datasets in H2O Hydrogen Torch.
- Click Continue.
- Again, click Continue.
- Again, click Continue.
- After importing a preprocessed dataset, you can use it for an experiment.
- To learn how to preprocess your own dataset for a particular supported problem type, see Dataset formats.
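The value entered in the S3 bucket name box combines a bucket name with a key prefix. As a minimal sketch, assuming the usual `bucket/prefix` convention, the full S3 location of a demo file can be assembled as follows. The helper name `demo_dataset_uri` is hypothetical and is not part of H2O Hydrogen Torch.

```python
def demo_dataset_uri(bucket_path: str, file_name: str) -> str:
    """Build an s3:// URI from a 'bucket/prefix' string and a file name.

    Hypothetical helper for illustration; not an H2O Hydrogen Torch API.
    """
    # The first path segment is the bucket; the remainder is the key prefix.
    bucket, _, prefix = bucket_path.strip("/").partition("/")
    key = f"{prefix}/{file_name}" if prefix else file_name
    return f"s3://{bucket}/{key}"


print(demo_dataset_uri("h2o-release/hydrogen-torch/1.3.0",
                       "flower_image_classification.zip"))
# prints s3://h2o-release/hydrogen-torch/1.3.0/flower_image_classification.zip
```

This only builds the location string; it does not check that the file exists or that the bucket is readable from your environment.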
Demo (preprocessed) datasets in H2O Hydrogen Torch
Below are the demo (preprocessed) datasets in H2O Hydrogen Torch.
Image
flower_image_classification.zip
- Description: The flower_image_classification.zip file is a preprocessed dataset that contains images of dandelions, daisies, roses, tulips, and sunflowers.
- Dataset columns: image,label
- Problem type: Image classification
To learn more about the dataset, see Flowers Dataset.
coins_image_regression.zip
- Description: The coins_image_regression.zip file is a preprocessed dataset that contains a collection of images with one or more coins. Each image has been labeled to indicate the sum of its coins. The currency of the coins is the Brazilian Real (R$).
- Dataset columns: image_path,label,fold
- Problem type: Image regression
To learn more about the dataset, see Brazilian Coins.
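Several demo datasets, including this one, carry a fold column. As a rough sketch, assuming the fold column holds an identifier used to hold out part of the data for validation, rows could be split as shown below. The sample rows and the helper name `split_by_fold` are made up for illustration and do not come from the actual dataset.

```python
def split_by_fold(rows, validation_fold):
    """Split rows into (train, validation) using the 'fold' column."""
    train = [row for row in rows if row["fold"] != validation_fold]
    validation = [row for row in rows if row["fold"] == validation_fold]
    return train, validation


# Made-up rows that only mirror the documented column names.
rows = [
    {"image_path": "img_a.jpg", "label": 1.25, "fold": 0},
    {"image_path": "img_b.jpg", "label": 0.50, "fold": 1},
    {"image_path": "img_c.jpg", "label": 2.00, "fold": 0},
]

train, validation = split_by_fold(rows, validation_fold=1)
print(len(train), len(validation))  # prints 2 1
```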
global_wheat_image_object_detection.zip
- Description: The global_wheat_image_object_detection.zip file is a preprocessed dataset that contains a collection of images of wheat fields with bounding boxes for each identified wheat head.
- Dataset columns: image,class_id,x_min,y_min,x_max,y_max
- Problem type: Single-class object detection
To learn more about the dataset, see Global Wheat Dataset.
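The x_min, y_min, x_max, and y_max columns describe each bounding box by two opposite corners. As an illustrative sketch, assuming the values are coordinates on the same axes with x_max greater than x_min and y_max greater than y_min, the width and height of a box follow directly. The helper name `box_size` and the sample numbers are hypothetical.

```python
def box_size(x_min, y_min, x_max, y_max):
    """Return (width, height) of a box given two opposite corners."""
    # A box with non-positive width or height is treated as invalid here.
    if x_max <= x_min or y_max <= y_min:
        raise ValueError("expected x_max > x_min and y_max > y_min")
    return x_max - x_min, y_max - y_min


# Made-up coordinates, not taken from the dataset.
print(box_size(10, 20, 110, 70))  # prints (100, 50)
```

A check like this can catch swapped corners before a dataset is imported.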
bicycle_image_metric_learning.zip
- Description: The bicycle_image_metric_learning.zip file is a preprocessed dataset that contains images of online bicycle ads. Each ad has multiple images marked by their class ID.
- Dataset columns: image,label,fold
- Problem type: Image metric learning
To learn more about the dataset, see The Stanford Online Products dataset.
fashion_image_semantic_segmentation.zip
- Description: The fashion_image_semantic_segmentation.zip file is a preprocessed dataset that contains images with fashion/apparel segmentations. The images show people wearing various clothing types in multiple poses.
- Dataset columns: image,class_id,rle_mask
- Problem type: Semantic segmentation
To learn more about the dataset, see Clothing Co-Parsing Dataset.
coco_image_instance_segmentation.zip
- Description: The coco_image_instance_segmentation.zip file is a preprocessed dataset that contains a subsample of the Common Objects in Context (COCO) dataset. This subsample includes only a single "Car" class, so every image contains one or more cars.
- Dataset columns: image_id,class_id,rle_mask
- Problem type: Image instance segmentation
To learn more about the dataset, see COCO Dataset.
covid_ct_image_semantic_segmentation_3d.zip
- Description: The covid_ct_image_semantic_segmentation_3d.zip file is a preprocessed dataset that contains a collection of 20 3D computed tomography (CT) images depicting the human chest and lungs.
- Dataset columns: image,class_id,rle_mask
- Problem type: 3D image semantic segmentation
To learn more about the dataset, see COVID-19 CT scans.
mnist_3d_image_regression_3d.zip
- Description: The mnist_3d_image_regression_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
- Dataset columns: image,label
- Problem type: 3D image regression
To learn more about the MNIST database, see The MNIST database of handwritten digits.
mnist_3d_image_classification_3d.zip
- Description: The mnist_3d_image_classification_3d.zip file is a preprocessed dataset that contains 60,000 3D digital images of numbers ranging from 0 to 9. The dataset was constructed by extracting images from the MNIST database.
- Dataset columns: image,label
- Problem type: 3D image classification
To learn more about the MNIST database, see The MNIST database of handwritten digits.
Text
amazon_reviews_text_classification.csv
- Description: The amazon_reviews_text_classification.csv file is a preprocessed dataset that contains a collection of reviews from Amazon. Each review (in text form) includes the title of the review and the review itself. The dataset has been labeled to indicate whether a review is positive or negative.
- Dataset columns: text,label
- Problem type: Text classification
To learn more about the dataset, see Amazon product data.
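Before importing a CSV dataset, it can help to confirm that its header contains the documented columns. The following is a minimal sketch using Python's standard csv module. The helper name `missing_columns` is hypothetical, and the sample row (including its label value) is made up rather than taken from the actual file.

```python
import csv
import io

# Column names as listed under "Dataset columns" for this dataset.
EXPECTED_COLUMNS = ["text", "label"]


def missing_columns(csv_file, expected):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    return [name for name in expected if name not in header]


# Made-up in-memory sample standing in for the real file.
sample = io.StringIO('text,label\n"Great product. Works as described.",1\n')
print(missing_columns(sample, EXPECTED_COLUMNS))  # prints []
```

For a file on disk, pass an object opened with `open(path, newline="")` instead of the in-memory sample.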
cnn_dailymail_text_sequence_to_sequence.zip
- Description: The cnn_dailymail_text_sequence_to_sequence.zip file is a preprocessed dataset that contains human-generated abstract summaries from news stories published on the CNN and Daily Mail websites.
- Dataset columns: text,summary,id
- Problem type: Text sequence to sequence
To learn more about the dataset, see abisee/cnn-dailymail.
wellformed_query_text_regression.csv
- Description: The wellformed_query_text_regression.csv file is a preprocessed dataset that contains a collection of search queries. Each query is rated between 0 and 1 to indicate whether the query is well-formed.
- Dataset columns: text,rating
- Problem type: Text regression
To learn more about the dataset, see Query-wellformedness Dataset.
conll2003_text_token_classification.zip
- Description: The conll2003_text_token_classification.zip file is a preprocessed dataset that contains a collection of text pieces that have their named entities specified. Named entities refer to abstract or physical objects such as a person, product, etc., that can be indicated with a proper name.
- Dataset columns: id,text,pos_tags,chunk_tags,ner_tags
- Problem type: Text token classification
To learn more about the dataset, see Language-Independent Named Entity Recognition (II).
squad_text_span_prediction.zip
- Description: The squad_text_span_prediction.zip file is a preprocessed dataset that contains questions with answers and contexts that can be used to answer the questions.
- Dataset columns: question,context,answer
- Problem type: Text span prediction
To learn more about the dataset, see The Stanford Question Answering Dataset.
ubuntu_text_metric_learning.zip
- Description: The ubuntu_text_metric_learning.zip file is a preprocessed dataset that contains a collection of questions from AskUbuntu.com. Similar questions are grouped into clusters (label).
- Dataset columns: text,label,fold
- Problem type: Text metric learning
- To learn more about the dataset and its use in research, refer to the following arXiv paper: Semi-supervised Question Retrieval with Gated Convolutions, NAACL 2016, Tao Lei et al.
- To view the original dataset from the authors, visit the following GitHub repository: AskUbuntu Question Dataset.
Audio
esc10_audio_classification.zip
- Description: The esc10_audio_classification.zip file is a preprocessed dataset that contains 5-second-long recordings organized into ten classes (with 40 examples per class). Clips in this dataset have been manually extracted from public field recordings gathered by the Freesound.org project.
- Dataset columns: filename,fold,label
- Problem type: Audio classification
To learn more about the dataset, see ESC-50: Dataset for Environmental Sound Classification.
amnist_audio_regression.zip
- Description: The amnist_audio_regression.zip file is a preprocessed dataset that contains a collection of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.
- Dataset columns: audio,label,fold
- Problem type: Audio regression
To learn more about the dataset, see Audio MNIST.
Speech
minds14_US_speech_recognition.zip
- Description: The minds14_US_speech_recognition.zip file is a preprocessed dataset that contains a collection of 558 speech samples (EN-US subset) related to phone banking. The dataset was manually re-annotated because the transcriptions in the original dataset were generated using automated speech models.
- Dataset columns: file,transcript,duration
- Problem type: Speech recognition
To learn more about the dataset, see MInDS-14.