Extend a dataset with new data
Overview
H2O Hydrogen Torch enables you to extend a dataset with new data, for example, to increase your dataset size.
Extending a dataset in H2O Hydrogen Torch does not combine rows or remove duplicate rows. Instead, it adds the files of a new dataset to a dataset that already contains its own files.
Consider the following two datasets (dataset one and dataset two):
dataset_one.zip                                 dataset_two.zip 
│   └───csv_one.csv                             │   └───csv_two.csv 
│   │                                           │   │
│   └───image_folder_one                        │   └───image_folder_two 
│       └───name_of_image.image_extension       │       └───name_of_image.image_extension
│       └───name_of_image.image_extension       │       └───name_of_image.image_extension
│       └───name_of_image.image_extension       │       └───name_of_image.image_extension
│       ...                                     │       ...
After extending dataset one with dataset two:
extended_dataset_one.zip 
│   └───csv_two.csv
│   └───csv_one.csv
│   │
│   └───image_folder_two
│   │   └───name_of_image.image_extension
│   │   └───name_of_image.image_extension
│   │   └───name_of_image.image_extension
│   │   ...
│   └───image_folder_one
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       └───name_of_image.image_extension
│       ...
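The file-level behavior shown above can be sketched in a few lines of Python. This is only an illustration of the idea (files from both archives end up side by side, and CSV rows are left untouched), not how H2O Hydrogen Torch performs the merge internally. The function names `extend_archive` and `make_zip` are hypothetical, and keeping the first copy when two files share a path is an assumption made for this sketch.

```python
import io
import zipfile


def make_zip(files: dict) -> bytes:
    """Build an in-memory ZIP from a {path: content} mapping (helper for the demo)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def extend_archive(base_zip: bytes, new_zip: bytes) -> bytes:
    """Return a new ZIP that contains every file from both archives.

    Files are copied as-is: CSV rows are neither combined nor de-duplicated.
    If both archives contain the same path, the copy from the base archive
    is kept (an arbitrary choice for this sketch).
    """
    out = io.BytesIO()
    seen = set()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as merged:
        for raw in (base_zip, new_zip):
            with zipfile.ZipFile(io.BytesIO(raw)) as src:
                for name in src.namelist():
                    if name in seen:
                        continue  # keep the first copy of a duplicated path
                    seen.add(name)
                    merged.writestr(name, src.read(name))
    return out.getvalue()


if __name__ == "__main__":
    dataset_one = make_zip({
        "csv_one.csv": "id,label\n1,cat\n",
        "image_folder_one/img.png": b"\x00",
    })
    dataset_two = make_zip({
        "csv_two.csv": "id,label\n1,cat\n",  # same row as csv_one.csv, kept anyway
        "image_folder_two/img.png": b"\x01",
    })
    extended = extend_archive(dataset_one, dataset_two)
    with zipfile.ZipFile(io.BytesIO(extended)) as zf:
        print(sorted(zf.namelist()))
```

Both CSV files survive as separate files even though they contain an identical row, which mirrors the point that extending a dataset adds files rather than merging rows.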
Instructions
To extend a dataset with new data, follow these steps:
1. In the H2O Hydrogen Torch navigation menu, click Import dataset.
2. In the Source list, select the source (data connector) that you want to use (for example, AWS S3), and then complete the fields for that source:

   AWS S3
   - In the S3 bucket name box, enter the name of the S3 bucket.
   - In the AWS access key box, enter the AWS access key. Note: You don't need to enter the AWS access key if the S3 bucket is public.
   - In the AWS secret key box, enter the AWS secret key. Note: You don't need to enter the AWS secret key if the S3 bucket is public.
   - In the File name list, select the file you want to use.

   Google Cloud Storage
   - In the GCS bucket name box, enter the name of the Google Cloud Storage bucket.
   - In the GCS Service Account JSON box, enter the content of the Google Cloud Service Account JSON file.
   - In the File name list, select the file you want to use.

   Kaggle
   - In the Kaggle API command box, enter a Kaggle API command.
   - In the Kaggle username box, enter your username.
   - In the Kaggle secret key box, enter your Kaggle secret key.

   Azure Datalake
   - In the Datalake connection string box, enter the Datalake connection string.
   - In the Datalake container name box, enter the Datalake container name.
   - In the File name box, enter the file name.

   Upload
   - Click Browse, or drag and drop the file (dataset).
   - Click Upload.
   - Skip step 3.

3. Click Continue.
4. Click Merge with existing dataset.
5. In the Dataset list, select the dataset you want to extend with the dataset imported above.
6. Click Merge.
7. Configure the dataset settings for the dataset being extended. Note: To learn about the import dataset settings, see Import dataset settings.
8. Click Continue.
9. Review the dataset preview, and then click Continue again.