Version: v1.4.0

Dataset connectors

Overview

H2O Hydrogen Torch provides a number of data connectors to access external data sources.

Note
  • Each dataset connector requires the data to be either a single CSV file or a ZIP file for a successful import.
  • The format of a dataset differs for different problem types. For more information, see Dataset formats.
  • Before a dataset can be used in an experiment, you need to specify a set of dataset settings during import. The required dataset settings depend on the structure and content of the dataset. For more information, see Import dataset settings.
  • For the S3 and Kaggle connectors, you can save your AWS and Kaggle credentials in your H2O Hydrogen Torch instance to avoid re-entering frequently used credentials. For more information, see App settings.

Supported dataset connectors

Upload (Standard upload feature)

The upload dataset connector requires the following parameters:

  • File location

AWS S3 (Amazon AWS S3)

The AWS S3 dataset connector requires the following parameters:

  • S3 bucket name
  • AWS access key
  • AWS secret key
  • File name
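
As a hypothetical illustration (not part of the product UI), the S3 bucket name and file name together identify an object in the same way an s3:// URI does. A minimal stdlib sketch that joins and splits such a URI, with placeholder names:

```python
def s3_uri(bucket: str, key: str) -> str:
    """Join an S3 bucket name and a file name (object key) into an s3:// URI."""
    return f"s3://{bucket}/{key.lstrip('/')}"

def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI back into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

# Hypothetical bucket and file names:
uri = s3_uri("my-datasets", "train/images.zip")
```

The AWS access key and secret key are credentials and do not appear in the URI; they are supplied separately when the connector authenticates.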

Azure Data Lake (Microsoft Azure Data Lake Gen2)

The Azure Data Lake dataset connector requires the following parameters:

  • Data lake connection string
  • Data lake container name
  • File name
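
For illustration, an Azure storage connection string is a semicolon-separated list of key=value pairs. A minimal stdlib sketch that parses one into its components; the account name and key below are placeholders, not real credentials:

```python
def parse_connection_string(conn_str: str) -> dict:
    """Parse an Azure storage connection string
    (semicolon-separated key=value pairs) into a dict."""
    parts = {}
    for pair in conn_str.strip().split(";"):
        if not pair:
            continue
        # Split on the first '=' only, so base64 keys ending in '=' survive.
        key, _, value = pair.partition("=")
        parts[key] = value
    return parts

# Placeholder values, not real credentials:
example = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=myaccount;"
    "AccountKey=abc123==;"
    "EndpointSuffix=core.windows.net"
)
fields = parse_connection_string(example)
```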

Google Cloud Storage

The Google Cloud Storage dataset connector requires the following parameters:

  • GCS bucket name
    Note

    The Google Cloud Storage connector supports specifying a bucket name with subdirectories, enabling you to list files within a specific subdirectory.

  • GCS service account JSON (GCS service account key)
    Note
    • You need to create a GCS service account before you can obtain a GCS service account key (GCS service account JSON).
      • To learn how to create a GCS service account, see Create service accounts
      • To learn how to create a GCS service account key, see Create and delete service account keys
        • The role (custom role) selected when creating a GCS service account key needs to have the following permissions; H2O Hydrogen Torch requires them to access your Google Cloud Storage:
          • storage.buckets.get
          • storage.buckets.list
          • storage.objects.get
          • storage.objects.list
        • To learn how to create a custom role with the above permissions, see Create a custom role
      • The downloaded (obtained) GCS service account key (GCS service account JSON) has the following format:
      {
        "type": "service_account",
        "project_id": "...",
        "private_key_id": "...",
        "private_key": "-----BEGIN PRIVATE KEY-----...-----END PRIVATE KEY-----\n",
        "client_email": "...",
        "client_id": "...",
        "auth_uri": "https://accounts.google.com/o/oauth2/auth",
        "token_uri": "https://oauth2.googleapis.com/token",
        "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
        "client_x509_cert_url": "..."
      }
    • For a self-paced tutorial on how to obtain a GCS service account key, see Google Cloud Storage Service Account Connectivity
  • File name
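
The key format above can be sanity-checked before import. A minimal stdlib sketch (the field names are taken from the format shown above) that reports which required entries a service account JSON is missing:

```python
import json

# Fields present in a GCS service account key, per the format above.
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
    "auth_provider_x509_cert_url", "client_x509_cert_url",
}

def missing_service_account_fields(text: str) -> list:
    """Return a sorted list of required fields absent from a
    GCS service account JSON string (empty list means all present)."""
    data = json.loads(text)
    return sorted(REQUIRED_FIELDS - data.keys())
```

This only checks the JSON's shape; whether the key actually grants the storage.buckets/storage.objects permissions listed above is determined by the role attached to the service account.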

Kaggle (Kaggle datasets)

The Kaggle dataset connector requires the following parameters:

  • Kaggle API command
  • Kaggle username
  • Kaggle secret key
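
For illustration, a Kaggle API command is the download command shown on a dataset's Kaggle page; the dataset and competition slugs below are placeholders:

```shell
# Hypothetical Kaggle API command for a dataset (placeholder slug):
kaggle datasets download -d some-user/some-dataset
# Competitions use a similar form:
kaggle competitions download -c some-competition
```

The Kaggle username and secret key authenticate the command; they are the same credentials found in a Kaggle account's kaggle.json API token.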

H2O Drive (H2O.ai's data storage)

The H2O Drive dataset connector requires the following parameters:

  • File name
