Data Sampling

Note: Sampling is not performed on Time Series experiments.

Driverless AI does not sample the data unless the dataset is large or highly imbalanced, in which case sampling is used to improve accuracy. Whether a dataset counts as large depends on your accuracy setting and on the statistical_threshold_data_size_large parameter, which can be set in config.toml or in the Expert Settings. To see whether the data will be sampled, check the Experiment Preview when you set up the experiment. In the experiment preview below, the data was sampled down to 5 million rows.

Experiment settings summary

If Driverless AI decides to sample the data based on these settings and the data size, it performs the following types of sampling at the start of the experiment, depending on the problem type:

  • Random sampling for regression problems

  • Stratified sampling for classification problems

  • Imbalanced sampling for binary problems where the data is considered imbalanced

    • By default, data is considered imbalanced when the majority class is 5 times more common than the minority class. (This threshold is configurable.)
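The sampling types listed above can be illustrated with a minimal Python sketch. This is not Driverless AI's internal implementation; the function names (is_imbalanced, random_sample, stratified_sample) and the ratio_threshold argument are hypothetical, and treating a ratio of exactly 5 as imbalanced is an assumption.

```python
import random
from collections import Counter, defaultdict


def is_imbalanced(labels, ratio_threshold=5.0):
    """Return True for a binary target whose majority/minority ratio
    reaches the threshold (5 mirrors the documented default; using >=
    is an assumption)."""
    counts = Counter(labels)
    if len(counts) != 2:
        return False  # the imbalance rule applies to binary problems only
    majority, minority = max(counts.values()), min(counts.values())
    return majority / minority >= ratio_threshold


def random_sample(rows, n, seed=0):
    """Random sampling without regard to the target (regression case)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def stratified_sample(rows, labels, n, seed=0):
    """Sample about n rows while keeping each class's share roughly
    unchanged (classification case)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    fraction = min(1.0, n / len(rows))
    sample = []
    for label, members in by_class.items():
        # Keep at least one row per class so no class disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend((row, label) for row in rng.sample(members, k))
    return sample
```

For example, with 900 rows of class 0 and 100 rows of class 1, is_imbalanced reports the target as imbalanced (ratio 9), and stratified_sample with n=100 returns 90 rows of class 0 and 10 rows of class 1.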

Imbalanced sampling can follow one of several approaches:

  • Sample both classes as needed depending on the data (automatic)

  • Under-sample the majority class to reach class balance

  • Over-sample the minority class and under-sample the majority class, depending on data

  • Do not perform any sampling

When imbalanced sampling is enabled, sampling is usually performed with replacement and repeated multiple times to improve accuracy (bagging). By default, the number of bags is determined automatically, but it can be specified in the Expert Settings.
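As a rough illustration of bagging with replacement, the sketch below draws several class-balanced bags from an imbalanced binary dataset. It is an assumption-laden example rather than the product's actual algorithm: the function name balanced_bags, the fixed n_bags default, and the choice to size each bag by the minority class are all hypothetical.

```python
import random


def balanced_bags(majority, minority, n_bags=3, seed=0):
    """Draw n_bags class-balanced samples, each with replacement.

    Every bag under-samples the majority class down to the size of the
    minority class (an illustrative choice, not a documented rule).
    """
    rng = random.Random(seed)
    bag_size = len(minority)
    bags = []
    for _ in range(n_bags):
        # random.choices samples with replacement.
        majority_draw = rng.choices(majority, k=bag_size)
        minority_draw = rng.choices(minority, k=bag_size)
        bags.append((majority_draw, minority_draw))
    return bags
```

A separate model could then be trained on each bag and the predictions averaged, which is the usual way bagging is used to recover accuracy lost by discarding majority-class rows.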