Extend a dataset with new data
Overview
H2O Hydrogen Torch enables you to extend a dataset with new data, for example, to increase your dataset size.
Extending a dataset does not combine rows or remove duplicate rows. Instead, extending refers to adding new dataset files to a dataset that already contains other dataset files.
Consider the following two datasets (dataset one and dataset two):
dataset_one.zip
│   └───csv_one.csv
│
│   └───image_folder_one
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...

dataset_two.zip
│   └───csv_two.csv
│
│   └───image_folder_two
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
After extending dataset one with dataset two:
extended_dataset_one.zip
│   └───csv_two.csv
│   └───csv_one.csv
│
│   └───image_folder_two
│   │   └───name_of_image.image_extension
│   │   └───name_of_image.image_extension
│   │   └───name_of_image.image_extension
│   │   ...
│   └───image_folder_one
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
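The file-level behavior shown above can be sketched locally in Python. This is only an illustration of the idea (files from both archives end up side by side, and no CSV rows are combined), not how H2O Hydrogen Torch implements the merge. The function name extend_archive is hypothetical, and raising an error when both archives contain a file with the same name is an assumption of this sketch; this page does not state how such collisions are handled.

```python
import zipfile


def extend_archive(base_zip, new_zip, out_zip):
    """Copy every file from base_zip and new_zip, unchanged, into out_zip."""
    with zipfile.ZipFile(base_zip) as base, zipfile.ZipFile(new_zip) as new:
        # Ignore directory entries; only real files are compared and copied.
        base_files = [n for n in base.namelist() if not n.endswith("/")]
        new_files = [n for n in new.namelist() if not n.endswith("/")]

        # Assumption of this sketch: identical file names are treated as an error.
        clash = set(base_files) & set(new_files)
        if clash:
            raise ValueError(f"files present in both archives: {sorted(clash)}")

        with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as out:
            for archive, names in ((base, base_files), (new, new_files)):
                for name in names:
                    # File contents are copied byte for byte; rows are never merged.
                    out.writestr(name, archive.read(name))
    return out_zip
```

Calling extend_archive("dataset_one.zip", "dataset_two.zip", "extended_dataset_one.zip") would produce an archive that contains both CSV files and both image folders, matching the layout above.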
Instructions
To extend a dataset with new data, follow these steps:
1. In the H2O Hydrogen Torch navigation menu, click Import dataset.
2. In the Source list, select the source (data connector) that you want to use (for example, AWS S3), and then complete the fields for that source:
   AWS S3
   - In the S3 bucket name box, enter the name of the S3 bucket.
   - In the AWS access key box, enter the AWS access key.
     Note: You don't need to enter the AWS access key if the S3 bucket is public.
   - In the AWS secret key box, enter the AWS secret key.
     Note: You don't need to enter the AWS secret key if the S3 bucket is public.
   - In the File name list, select the file you want to use.
   Google Cloud Storage
   - In the GCS bucket name box, enter the name of the Google Cloud Storage bucket.
   - In the GCS Service Account JSON box, enter the contents of the Google Cloud service account JSON file.
   - In the File name list, select the file you want to use.
   Kaggle
   - In the Kaggle API command box, enter a Kaggle API command.
   - In the Kaggle username box, enter your username.
   - In the Kaggle secret key box, enter your Kaggle secret key.
   Azure Datalake
   - In the Datalake connection string box, enter the Datalake connection string.
   - In the Datalake container name box, enter the Datalake container name.
   - In the File name box, enter the file name.
   Upload
   - Click Browse, or drag and drop the file (dataset).
   - Click Upload.
   - Skip step 3.
3. Click Continue.
4. Click Merge with existing dataset.
5. In the Dataset list, select the dataset you want to extend with the dataset imported above.
6. Click Merge.
7. Configure the dataset settings for the dataset being extended.
   Note: To learn about the import dataset settings, see Import dataset settings.
8. Click Continue.
9. Click Continue again.
   Note: Before you click Continue, review the dataset preview.
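Before importing a new archive, it can help to confirm locally that it follows the layout described in the overview (a CSV file alongside one or more image folders). The following is a minimal sketch using only the Python standard library; the function name summarize_archive and the shape of its return value are hypothetical and are not part of H2O Hydrogen Torch.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(zip_path):
    """Return the CSV files and per-folder file counts found in a dataset zip."""
    csv_files = []
    folder_counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            path = PurePosixPath(name)
            if path.suffix.lower() == ".csv":
                csv_files.append(name)
            else:
                # Group non-CSV files by parent folder ("." means the archive root).
                folder_counts[str(path.parent)] += 1
    return {"csv_files": sorted(csv_files), "folder_counts": dict(folder_counts)}
```

For an archive shaped like dataset_two.zip, the result lists csv_two.csv under csv_files and reports how many files sit in image_folder_two.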