Extend a dataset with new data
H2O Hydrogen Torch enables you to extend a dataset with new data, for example, to increase your dataset size.
Extending a dataset does not combine rows or remove duplicate rows. Instead, extending adds new dataset files to a dataset that already contains its own dataset files.
Consider the following two datasets (dataset one and dataset two):
dataset_one.zip                                 dataset_two.zip
│   └───csv_one.csv                             │   └───csv_two.csv
│   │                                           │   │
│   └───image_folder_one                        │   └───image_folder_two
│       └───name_of_image.image_extension       │       └───name_of_image.image_extension
│       └───name_of_image.image_extension       │       └───name_of_image.image_extension
│       └───name_of_image.image_extension       │       └───name_of_image.image_extension
│       ...                                     │       ...
After extending dataset one with dataset two:
extended_dataset_one.zip
│   └───csv_two.csv
│   └───csv_one.csv
│   │
│   └───image_folder_two
│   │   └───name_of_image.image_extension
│   │   └───name_of_image.image_extension
│   │   └───name_of_image.image_extension
│   │   ...
│   └───image_folder_one
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
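The file-level meaning of "extend" can be illustrated with a minimal Python sketch. It is not how H2O Hydrogen Torch implements the merge; it only mimics the outcome shown above with the standard `zipfile` module. The names `extend_archive` and `make_zip`, the CSV columns, and the choice to raise an error on a duplicate file name are all assumptions made for this illustration.

```python
import io
import zipfile


def extend_archive(base_zip, new_zip, out_zip):
    """Copy every file from base_zip and new_zip into out_zip, unchanged.

    Files are added side by side; CSV rows are never combined or deduplicated.
    """
    seen = set()
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as out:
        for source in (base_zip, new_zip):
            with zipfile.ZipFile(source) as src:
                for info in src.infolist():
                    if info.is_dir():
                        continue
                    # Assumption for this sketch: a repeated file name is an error.
                    if info.filename in seen:
                        raise ValueError(f"duplicate file name: {info.filename}")
                    seen.add(info.filename)
                    out.writestr(info, src.read(info.filename))
    return sorted(seen)


def make_zip(files):
    """Build an in-memory zip from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    buffer.seek(0)
    return buffer


# Placeholder contents standing in for dataset one and dataset two.
dataset_one = make_zip({
    "csv_one.csv": b"image,label\na.jpg,0\n",
    "image_folder_one/a.jpg": b"placeholder",
})
dataset_two = make_zip({
    "csv_two.csv": b"image,label\nb.jpg,1\n",
    "image_folder_two/b.jpg": b"placeholder",
})

extended = io.BytesIO()
names = extend_archive(dataset_one, dataset_two, extended)
print(names)
```

Both CSV files remain separate in the result, which matches the tree above: the extended dataset holds `csv_one.csv` and `csv_two.csv` next to both image folders.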
Instructions
To extend a dataset with new data, follow these steps:
1. In the H2O Hydrogen Torch navigation menu, click Import dataset.
2. In the Source list, select the source (data connector) that you want to use (for example, AWS S3), and then complete the fields for that source:
   - AWS S3
     - In the S3 bucket name box, enter the name of the S3 bucket.
     - In the AWS access key box, enter the AWS access key.
       Note: You don't need to enter the AWS access key if the S3 bucket is public.
     - In the AWS secret key box, enter the AWS secret key.
       Note: You don't need to enter the AWS secret key if the S3 bucket is public.
     - In the File name list, select the file you want to use.
   - Kaggle
     - In the Kaggle API command box, enter a Kaggle API command.
     - In the Kaggle username box, enter your username.
     - In the Kaggle secret key box, enter your Kaggle secret key.
   - Azure Datalake
     - In the Datalake connection string box, enter the Datalake connection string.
     - In the Datalake container name box, enter the Datalake container name.
     - In the File name box, enter the file name.
   - Upload
     - Click Browse, or drag and drop the file (dataset).
     - Click Upload.
     - Skip step 3.
3. Click Continue.
4. Click Merge with existing dataset.
5. In the Dataset list, select the dataset you want to extend with the dataset imported above.
6. Click Merge.
7. Configure the dataset settings for the dataset being extended.
   Note: To learn about the import dataset settings, see Import dataset settings.
8. Click Continue.
9. Click Continue again.
   Note: Before you click Continue, review the dataset preview visualizations.
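Before importing through the Upload source, the new data needs to exist as a single archive. The following is a minimal sketch, assuming the new data is one CSV file plus one image folder that should be laid out like `dataset_two.zip` above. The function name `package_dataset`, the CSV columns, and the file contents are hypothetical; the format that H2O Hydrogen Torch expects may differ by problem type.

```python
import tempfile
import zipfile
from pathlib import Path


def package_dataset(csv_path, image_dir, zip_path):
    """Write the CSV and the image folder into a zip, keeping the folder name."""
    csv_path, image_dir = Path(csv_path), Path(image_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The CSV sits at the top level of the archive.
        archive.write(csv_path, arcname=csv_path.name)
        # Images keep their path relative to the image folder.
        for image in sorted(image_dir.rglob("*")):
            if image.is_file():
                relative = image.relative_to(image_dir).as_posix()
                archive.write(image, arcname=f"{image_dir.name}/{relative}")
    return zip_path


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "image_folder_two").mkdir()
    (root / "image_folder_two" / "a.jpg").write_bytes(b"placeholder")
    (root / "csv_two.csv").write_text("image,label\na.jpg,1\n")

    archive_path = package_dataset(
        root / "csv_two.csv",
        root / "image_folder_two",
        root / "dataset_two.zip",
    )
    with zipfile.ZipFile(archive_path) as archive:
        packaged_names = archive.namelist()

print(packaged_names)
```

The resulting archive can then be selected with Browse, or dragged and dropped, when the Upload source is chosen.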