Data Sampling¶
Note: Sampling is not performed on Time Series experiments.
Driverless AI does not perform any type of data sampling unless the dataset is big or highly imbalanced (for improved accuracy). What is considered big is dependent on your accuracy setting and the statistical_threshold_data_size_large parameter
in the config.toml or in the Expert Settings. You can see if the data will be sampled by viewing the Experiment Preview when you set up the experiment. In the experiment preview below, I can see that my data was sampled down to 5 million rows.
If Driverless AI decides to sample the data based on these settings and the data size, then Driverless AI will perform the following types of sampling at the start of the experiment:
Random sampling for regression problems
Stratified sampling for classification problems
Imbalanced sampling for binary problems where the data is considered imbalanced
By default, imbalanced is defined as when the majority class is 5 times more common than the minority class. (This is also configurable.)
With imbalanced sampling, there are multiple approaches:
Sample both classes as needed depending on the data (automatic)
Under-sample the majority class to reach class balance
Over-sample the minority class and under-sample the majority class, depending on data
Do not perform any sampling
When imbalanced sampling is enabled, sampling is usually performed with replacement, and repeated multiple times to improve accuracy (bagging). By default, the number of bags is automatically determined, but can be specified in expert settings.