Version: Next

Dataset formats

Overview

Each supported problem type expects your dataset to be formatted in a specific way. Below, you can find instructions on formatting your dataset for a particular supported problem type.


With H2O Label Genie (a Wave application in H2O AI Cloud), you can label your image, text, and audio data to generate annotated datasets supported in H2O Hydrogen Torch. To learn more, see H2O Label Genie | Docs.

note

To learn how to import a formatted (preprocessed) dataset, see Import a dataset.

Data collection

Example 1: Amazon S3

The following Python script walks the folder structure of an Amazon S3 bucket and collects images into a single dataset. In other words, the script demonstrates how to create a new dataset (ZIP file) from several files in S3 and then upload it back to S3.

# Import libraries

# We use `boto3` to connect to S3
# Optionally `tqdm` can be used to show download progress
# We use pandas for data manipulation
from boto3.session import Session
from tqdm import tqdm
import pandas as pd
import shutil
import os


# Set AWS credentials
# You can set them directly or use the environment variables, if those are set
aws_access_key = os.environ["AWS_ACCESS_KEY_ID"]
aws_secret_key = os.environ["AWS_SECRET_ACCESS_KEY"]

# Set the list of bucket paths that contain image files
images_bucket_subfolders = ["h2o-release/hydrogen-torch/data-prep"]

# Set path to the train CSV
csv_path = "h2o-release/hydrogen-torch/data-prep/csvs/train.csv"

# Set allowed file extensions
allowed_extensions = [".jpg", ".jpeg", ".png"]

# Files will be downloaded to data folder
data_folder = "data"
image_folder = f"{data_folder}/images"
os.makedirs(image_folder, exist_ok=True)

# Connect to S3
s3 = Session(aws_access_key_id=aws_access_key, aws_secret_access_key=aws_secret_key).resource("s3")


# Download train.csv
bucket, csv_path = csv_path.split("/", 1)
output_csv_path = f"{data_folder}/{os.path.basename(csv_path)}"
s3.Bucket(bucket).download_file(csv_path, output_csv_path)


# Make sure the "Image Column" contains only file names, not full paths
image_col = "image"

data = pd.read_csv(output_csv_path)
data[image_col] = data[image_col].map(os.path.basename)
# to_csv returns None when given a path, so do not reassign `data`
data.to_csv(output_csv_path, index=False)


# Download all image files
for images_bucket_subfolder in images_bucket_subfolders:

    # Split "bucket/sub/folder" into the bucket name and the key prefix
    if "/" in images_bucket_subfolder:
        bucket, subfolder = images_bucket_subfolder.split("/", 1)
    else:
        bucket, subfolder = images_bucket_subfolder, ""

    s3_bucket = s3.Bucket(bucket)
    files = s3_bucket.objects

    # Restrict the listing to the subfolder, if one was given
    if subfolder:
        files = files.filter(Prefix=f"{subfolder}/")

    files = list(files)

    # Download only files with an allowed extension
    for file in tqdm(files):
        if any(file.key.endswith(ext) for ext in allowed_extensions):
            s3_bucket.download_file(file.key, f"{image_folder}/{os.path.basename(file.key)}")


# Create ZIP file that can be imported to H2O Hydrogen Torch
zip_file_name = "flowers_image_classification"
full_zip_file_name = shutil.make_archive(zip_file_name, 'zip', data_folder)

# Set desired S3 path where to upload the ZIP file in format "bucket_name" or "bucket_name/subfolder_1/.../subfolder_n"
upload_bucket_path = "YOUR_BUCKET_NAME/SUB_FOLDER"

# Upload the ZIP file
rel_zip_file_name = os.path.basename(full_zip_file_name)
upload_path = f"{upload_bucket_path}/{rel_zip_file_name}"
upload_bucket, upload_zip_path = upload_path.split("/", 1)
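# Use the full local path returned by make_archive so the upload does not depend on the working directory
s3.Bucket(upload_bucket).upload_file(full_zip_file_name, upload_zip_path)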
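
Before importing the ZIP file, it can help to confirm that every file name listed in the CSV exists in the images folder. The following is a minimal sketch and is not part of the script above. The helper name `find_missing_images` is hypothetical, and the sketch assumes the same layout as the script (a `train.csv` with an `image` column next to an `images` folder).

```python
import csv
import os
import tempfile


def find_missing_images(csv_file, image_dir, image_col="image"):
    """Return file names listed in the CSV that are absent from image_dir."""
    available = set(os.listdir(image_dir))
    with open(csv_file, newline="") as f:
        rows = csv.DictReader(f)
        return [row[image_col] for row in rows if row[image_col] not in available]


# Small self-contained demonstration on a temporary folder
demo_dir = tempfile.mkdtemp()
demo_images = os.path.join(demo_dir, "images")
os.makedirs(demo_images)
open(os.path.join(demo_images, "a.jpg"), "wb").close()

demo_csv = os.path.join(demo_dir, "train.csv")
with open(demo_csv, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["image", "label"], ["a.jpg", "rose"], ["b.jpg", "tulip"]])

missing = find_missing_images(demo_csv, demo_images)
print(missing)
```

With the real dataset, the same helper could be called as `find_missing_images("data/train.csv", "data/images")` right before the archive is created.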

Supported audio extensions for speech recognition

For speech recognition, H2O Hydrogen Torch supports the following audio extension:

  • Uncompressed (WAV)
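
As a rough illustration, a file can be checked for being a readable uncompressed WAV with Python's standard `wave` module. This is a sketch under assumptions: the helper name `describe_wav` is hypothetical, and the 16 kHz mono demo file is an arbitrary example, not a stated H2O Hydrogen Torch requirement.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file, or None if it cannot be read as one."""
    try:
        with wave.open(path, "rb") as wav:
            return {
                "channels": wav.getnchannels(),
                "sample_rate": wav.getframerate(),
                "seconds": wav.getnframes() / wav.getframerate(),
            }
    except (wave.Error, EOFError):
        return None


# Demonstration: write one second of 16-bit mono silence, then inspect it
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info)
```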

Supported audio extensions for audio processing

The following is a list of supported audio extensions for audio processing in H2O Hydrogen Torch:

  • Uncompressed: WAV, AIFF
  • Lossless compressed: FLAC
  • Lossy compressed: MP3, OGG
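
When collecting audio files, one way to apply the list above is to filter file names by extension before building the dataset. The sketch below is illustrative only: `split_by_support` is a hypothetical helper, and matching extensions case-insensitively is an assumption rather than documented behavior.

```python
from pathlib import Path

# Extensions taken from the list above, lowercased with a leading dot
AUDIO_EXTENSIONS = {".wav", ".aiff", ".flac", ".mp3", ".ogg"}


def split_by_support(file_names):
    """Split file names into (supported, unsupported) lists by extension."""
    supported, unsupported = [], []
    for name in file_names:
        if Path(name).suffix.lower() in AUDIO_EXTENSIONS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


ok, rejected = split_by_support(["clip.WAV", "song.mp3", "voice.m4a", "notes.txt"])
print(ok)
print(rejected)
```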

Supported image extensions for image processing

The following is a list of supported image extensions for image processing in H2O Hydrogen Torch:

  • Windows bitmaps: BMP
  • JPEG files: JPEG, JPG, JPE
  • JPEG 2000 files: JP2
  • Portable Network Graphics: PNG
  • WebP: WEBP
  • Portable image format: PBM, PGM, PPM, PNM
  • TIFF files: TIFF, TIF
  • Radiance HDR: HDR
  • NumPy data array: NumPy
    note

    For 2D image processing, the data must be of shape [height, width, channels].
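
The shape requirement for NumPy arrays can be checked before a dataset is assembled. The following is a minimal sketch, assuming the arrays are stored as `.npy` files written with `np.save`; the helper name `load_2d_image` is hypothetical. It also shows one way to add a channel axis to a grayscale array stored as [height, width].

```python
import os
import tempfile

import numpy as np


def load_2d_image(path):
    """Load a .npy file and check that it has three dimensions: [height, width, channels]."""
    array = np.load(path)
    if array.ndim != 3:
        raise ValueError(f"expected 3 dimensions, got shape {array.shape}")
    return array


# Demonstration: a grayscale image stored as [height, width] needs a channel axis
gray = np.zeros((32, 48), dtype=np.uint8)
with_channel = gray[:, :, np.newaxis]  # shape becomes (32, 48, 1)

demo_path = os.path.join(tempfile.mkdtemp(), "image.npy")
np.save(demo_path, with_channel)
loaded = load_2d_image(demo_path)
print(loaded.shape)
```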

Supported 3D image extensions for 3D image processing

For 3D image problem types, H2O Hydrogen Torch supports the following 3D image extension:

  • NumPy data array: NumPy
    note

    For 3D image processing, the data must be of shape [height, width, depth, channels].
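
Volumes are often stored with a different axis order than the one required here. As a hedged example, the sketch below converts a volume assumed to be stored as [depth, height, width] with a single channel into [height, width, depth, channels]. The source layout and the helper name `to_hwdc` are assumptions made for illustration; adjust the axis order to match your own data.

```python
import numpy as np


def to_hwdc(volume_dhw):
    """Reorder a [depth, height, width] volume to [height, width, depth, channels]."""
    if volume_dhw.ndim != 3:
        raise ValueError(f"expected 3 dimensions, got shape {volume_dhw.shape}")
    hwd = np.transpose(volume_dhw, (1, 2, 0))  # move depth to the third axis
    return hwd[..., np.newaxis]  # append a single channel axis


volume = np.zeros((10, 64, 80), dtype=np.float32)  # depth=10, height=64, width=80
converted = to_hwdc(volume)
print(converted.shape)
```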

