Dataset formats
Overview
Each supported problem type requires a dataset in a specific format. The sections below explain how to format your dataset for each supported problem type.
With H2O Label Genie (a Wave application in H2O AI Cloud), you can label image, text, and audio data to generate annotated datasets that H2O Hydrogen Torch supports. To learn more, see the H2O Label Genie documentation.
To learn how to import a formatted (preprocessed) dataset, see Import a dataset.
- Image
  - Dataset format: Image regression
  - Dataset format: 3D image regression
  - Dataset format: Image classification
  - Dataset format: 3D image classification
  - Dataset format: Image metric learning
  - Dataset format: Image object detection
  - Dataset format: Image semantic segmentation
  - Dataset format: 3D image semantic segmentation
  - Dataset format: Image instance segmentation
- Text
- Image and text
- Audio
- Speech
- Graph
Data collection
Example 1: Amazon S3
The following Python script walks the folder structure of an Amazon S3 bucket, downloads the train CSV and the image files, packs them into a single dataset (ZIP file), and uploads that ZIP file back to S3.
# Import libraries
# We use `boto3` to connect to S3
# Optionally `tqdm` can be used to show download progress
# We use pandas for data manipulation
from boto3.session import Session
from tqdm import tqdm
import pandas as pd
import shutil
import os
# Set AWS credentials
# You can set them directly or use the environment variables, if those are set
aws_access_key = os.environ["AWS_ACCESS_KEY_ID"]
aws_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]
# Set list of bucket paths, that contain image files
images_bucket_subfolders = ["h2o-release/hydrogen-torch/data-prep"]
# Set path to the train CSV
csv_path = "h2o-release/hydrogen-torch/data-prep/csvs/train.csv"
# Set allowed file extensions
allowed_extensions = [".jpg", ".jpeg", ".png"]
# Files will be downloaded to data folder
data_folder = "data"
image_folder = f"{data_folder}/images"
os.makedirs(image_folder, exist_ok=True)
# Connect to S3
s3 = Session(aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key).resource("s3")
# Download train.csv
bucket, csv_path = csv_path.split("/", 1)
output_csv_path = f"{data_folder}/{os.path.basename(csv_path)}"
s3.Bucket(bucket).download_file(csv_path, output_csv_path)
# Make sure the "Image Column" contains only file names, not full paths
image_col = "image"
data = pd.read_csv(output_csv_path)
data[image_col] = data[image_col].map(os.path.basename)
# to_csv returns None when given a path, so do not reassign `data`
data.to_csv(output_csv_path, index=False)
# Download all image files
for images_bucket_subfolder in images_bucket_subfolders:
    if "/" in images_bucket_subfolder:
        bucket, subfolder = images_bucket_subfolder.split("/", 1)
    else:
        bucket, subfolder = images_bucket_subfolder, ""
    s3_bucket = s3.Bucket(bucket)
    files = s3_bucket.objects
    if subfolder:
        files = files.filter(Prefix=f"{subfolder}/")
    files = list(files)
    for file in tqdm(files):
        if any(file.key.endswith(ext) for ext in allowed_extensions):
            s3_bucket.download_file(file.key, f"{image_folder}/{os.path.basename(file.key)}")
# Create ZIP file that can be imported to H2O Hydrogen Torch
zip_file_name = "flowers_image_classification"
full_zip_file_name = shutil.make_archive(zip_file_name, 'zip', data_folder)
# Set desired S3 path where to upload the ZIP file in format "bucket_name" or "bucket_name/subfolder_1/.../subfolder_n"
upload_bucket_path = "YOUR_BUCKET_NAME/SUB_FOLDER"
# Upload the ZIP file
rel_zip_file_name = os.path.basename(full_zip_file_name)
upload_path = f"{upload_bucket_path}/{rel_zip_file_name}"
upload_bucket, upload_zip_path = upload_path.split("/", 1)
# Pass the path returned by make_archive so the upload does not depend on the working directory
s3.Bucket(upload_bucket).upload_file(full_zip_file_name, upload_zip_path)
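Before creating the ZIP file, it can help to confirm that every file name in the image column has a matching file in the images folder. The following is a minimal sketch using only the standard library. It assumes the `data/train.csv` and `data/images` layout and the `image` column name from the script above; `find_missing_images` is a hypothetical helper, not part of H2O Hydrogen Torch.

```python
import csv
import os


def find_missing_images(csv_path, image_folder, image_col="image"):
    """Return image file names listed in the CSV that are absent from image_folder."""
    available = set(os.listdir(image_folder))
    missing = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Compare by file name only, mirroring the basename step in the script
            name = os.path.basename(row[image_col])
            if name not in available:
                missing.append(name)
    return missing
```

For example, `find_missing_images("data/train.csv", "data/images")` returns an empty list when every referenced image was downloaded.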
Supported audio extensions for speech recognition
For speech recognition, H2O Hydrogen Torch supports the following audio extension:
- Uncompressed (WAV)
Supported audio extensions for audio processing
The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:
- Uncompressed: WAV, AIFF
- Lossless compressed: FLAC
- Lossy compressed: MP3, OGG
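The two audio lists above can be turned into a quick pre-upload check. This is a sketch under stated assumptions: the helper name `is_supported_audio` is hypothetical, extensions are assumed to appear as dotted file suffixes, and case-insensitive matching is a choice made here, not documented H2O Hydrogen Torch behavior.

```python
import os

# Extensions copied from the lists in this section
SPEECH_EXTENSIONS = {".wav"}
AUDIO_EXTENSIONS = {".wav", ".aiff", ".flac", ".mp3", ".ogg"}


def is_supported_audio(file_name, speech_recognition=False):
    """Check a file name's extension against the supported lists, ignoring case."""
    ext = os.path.splitext(file_name)[1].lower()
    allowed = SPEECH_EXTENSIONS if speech_recognition else AUDIO_EXTENSIONS
    return ext in allowed
```

With this helper, `is_supported_audio("clip.mp3")` is true for audio processing, while `is_supported_audio("clip.mp3", speech_recognition=True)` is false because only WAV is listed for speech recognition.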
Supported image extensions for image processing
The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:
- Windows bitmaps: BMP
- JPEG files: JPEG, JPG, JPE
- JPEG 2000 files: JP2
- Portable Network Graphics: PNG
- WebP: WEBP
- Portable image format: PBM, PGM, PPM, PNM
- TIFF files: TIFF, TIF
- Radiance HDR: HDR
- NumPy data array: NumPy
Note: For 2D image processing, the data must be of shape [height, width, channels].
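The shape requirement for NumPy images can be checked when the array is written. The sketch below assumes the array is stored with `np.save` as a `.npy` file and that a grayscale image should carry a channel axis of size 1; both are assumptions, and `save_2d_image` is a hypothetical helper name.

```python
import numpy as np


def save_2d_image(array, path):
    """Save an image array after checking it has shape [height, width, channels]."""
    array = np.asarray(array)
    if array.ndim == 2:
        # Grayscale [height, width]: append a channel axis of size 1 (assumption)
        array = array[:, :, np.newaxis]
    if array.ndim != 3:
        raise ValueError(f"expected [height, width, channels], got shape {array.shape}")
    np.save(path, array)
    return array.shape
```

Calling the helper with a 64x48 grayscale array stores it with shape `(64, 48, 1)`, while an array with four dimensions raises a `ValueError`.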
Supported 3D image extensions for 3D image processing
For 3D image problem types, H2O Hydrogen Torch supports the following 3D image extension:
- NumPy data array: NumPy
Note: For 3D image processing, the data must be of shape [height, width, depth, channels].
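When a 3D image starts out as a series of 2D slices, the slices can be stacked along a new depth axis to reach the required layout. This is a minimal sketch, assuming each slice already has shape [height, width, channels]; `stack_slices` and the example sizes are hypothetical.

```python
import numpy as np


def stack_slices(slices):
    """Stack 2D slices of shape [height, width, channels] along a new depth axis."""
    # axis=2 inserts depth between width and channels
    volume = np.stack(slices, axis=2)
    if volume.ndim != 4:
        raise ValueError(f"expected [height, width, depth, channels], got {volume.shape}")
    return volume


# Hypothetical scan: 10 single-channel slices of 32x32 pixels
slices = [np.zeros((32, 32, 1), dtype=np.float32) for _ in range(10)]
volume = stack_slices(slices)
print(volume.shape)  # (32, 32, 10, 1)
```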