Driverless AI lets you split a dataset into two subsets that can be used as training and validation/test datasets during modeling. When splitting datasets for modeling, each split should have a similar distribution to avoid overfitting on the training set. Depending on the use case, you can split the dataset randomly, perform stratified sampling based on the target column, perform a fold column-based split to keep rows belonging to the same group together, or perform a time column-based split to train on past data and validate/test on future data.
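To make the difference between these strategies concrete, here is a minimal sketch in plain Python. It is an illustration of the general techniques only, not Driverless AI's internal implementation; the function names, the column names (`day`, `customer`, `churn`), and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, ratio=0.8, seed=1234):
    """Split so each target value keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    first, second = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(len(members) * ratio)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second


def fold_split(rows, fold, ratio=0.8, seed=1234):
    """Split whole groups, so no fold value appears in both parts."""
    rng = random.Random(seed)
    groups = sorted({row[fold] for row in rows})
    rng.shuffle(groups)
    cut = round(len(groups) * ratio)
    first_groups = set(groups[:cut])
    first = [r for r in rows if r[fold] in first_groups]
    second = [r for r in rows if r[fold] not in first_groups]
    return first, second


def time_split(rows, time, ratio=0.8):
    """Sort by time; the earliest rows go to the first part, the latest to the second."""
    ordered = sorted(rows, key=lambda r: r[time])
    cut = round(len(ordered) * ratio)
    return ordered[:cut], ordered[cut:]


# Hypothetical toy data: 100 rows, 10 customers, 20% positive target.
rows = [
    {"day": i, "customer": i % 10, "churn": int(i % 5 == 0)}
    for i in range(100)
]
train, valid = stratified_split(rows, "churn")
```

In this sketch the stratified split preserves the 20% positive rate in both parts, the fold split never places the same `customer` on both sides, and the time split guarantees that every row in the second part is later than every row in the first.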
Perform the following steps to split a dataset:
Click the dataset that you want to split, or click the [Click for Actions] button next to it, and then select Split from the submenu that appears.
The Dataset Splitter form displays. Specify an Output Name 1 and an Output Name 2, one for each segment of the split. (For example, you can name one segment test and the other validation.)
Optionally specify a Target column (for stratified sampling), a Fold column (to keep rows belonging to the same group together), a Time column, and/or a Random Seed (defaults to 1234).
Use the slider to select a split ratio or enter a value in the Train/Valid Split Ratio field.
Click Save when you are done.
When the process completes, the split datasets are available on the Datasets page.
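The roles of the split ratio and the random seed in the steps above can be sketched as follows. This is a hedged illustration in plain Python rather than the product's own code; `random_split` is a hypothetical helper, and the default seed of 1234 simply mirrors the form's default.

```python
import random


def random_split(rows, ratio=0.8, seed=1234):
    """Shuffle a copy of the rows with a fixed seed, then cut at the ratio."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1000))
first_a, second_a = random_split(rows, ratio=0.75)
first_b, second_b = random_split(rows, ratio=0.75)

print(len(first_a), len(second_a))  # 750 250
print(first_a == first_b)           # True: same seed, same split
```

Keeping the seed fixed makes a random split reproducible, which helps when comparing experiments that should train and validate on identical rows.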