Data Sampling

Driverless AI does not downsample the data unless the dataset is large. What counts as large depends on your accuracy setting and on the statistical_threshold_data_size_large parameter, which can be set in the config.toml file or in the Expert Settings. You can check whether the data will be sampled by viewing the Experiment Preview when you set up the experiment. In the experiment preview below, the data was sampled down to 5 million rows.

Experiment settings summary

If Driverless AI decides to sample the data based on these settings and the data size, it performs one of the following types of sampling, depending on the problem type:

  • Random sampling for regression problems and binary problems that are not considered imbalanced

  • Stratified sampling for multi-class problems

  • Imbalanced sampling for binary problems where the data is considered imbalanced

    • By default, the data is considered imbalanced when the majority class is 5 times more common than the minority class. (This threshold is configurable.)
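
The selection rules above can be sketched as a small function. This is only an illustration, not Driverless AI code: the function name, the problem_type labels, and the imbalance_ratio argument are hypothetical, and treating "5 times more common" as "majority count >= 5 x minority count" is an assumption.

```python
from collections import Counter


def choose_sampling_strategy(targets, problem_type, imbalance_ratio=5.0):
    """Return the sampling type the rules above would select.

    problem_type is one of "regression", "binary", or "multiclass"
    (hypothetical labels used only in this sketch).
    imbalance_ratio mirrors the default of 5 described in the text.
    """
    if problem_type == "regression":
        return "random"
    if problem_type == "multiclass":
        return "stratified"
    if problem_type != "binary":
        raise ValueError(f"unknown problem type: {problem_type!r}")

    counts = Counter(targets)
    if len(counts) != 2:
        raise ValueError("a binary problem needs exactly two classes")

    majority = max(counts.values())
    minority = min(counts.values())
    # Assumption: "5 times more common" means majority >= 5 * minority.
    if majority >= imbalance_ratio * minority:
        return "imbalanced"
    return "random"
```

For example, a binary target with 50 negatives and 10 positives would be routed to imbalanced sampling, while 40 negatives and 10 positives would get plain random sampling.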

With imbalanced sampling, there are two approaches:

  • Undersampling of the majority class
  • Quantile imbalanced sampling

Quantile imbalanced sampling is not turned on by default, but you can enable it in the Expert Settings. It keeps all of the minority class records and only a sample of the majority class records.

The steps for quantile imbalanced sampling are shown below:

  1. Train a preliminary model on a subset of the data to predict the target column.
  2. Use the preliminary model to assign each record in the data a predicted probability.
  3. Bin the predicted probabilities into deciles for the records from the majority class.
  4. Sample the records from the majority class from each decile bin. This ensures that the distribution of the predicted probability in the sampled majority class is smooth.
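
Steps 3 and 4 above can be sketched in plain Python. This is a minimal sketch, not the Driverless AI implementation: it assumes the predicted probabilities from the preliminary model (steps 1 and 2) are already available as a list, and that the same fraction of records is drawn from every decile bin. The function and parameter names are hypothetical.

```python
import random


def quantile_imbalanced_sample(rows, targets, probabilities, majority_label,
                               majority_fraction=0.2, n_bins=10, seed=0):
    """Keep every minority record and sample the majority class per decile.

    probabilities holds one predicted probability per record, assumed to
    come from a preliminary model that is not shown here.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, t in enumerate(targets) if t != majority_label]
    majority_idx = [i for i, t in enumerate(targets) if t == majority_label]

    # Step 3: rank majority records by predicted probability and cut the
    # ranking into n_bins equal-sized bins (deciles when n_bins is 10).
    majority_idx.sort(key=lambda i: probabilities[i])
    n = len(majority_idx)
    bins = [majority_idx[(b * n) // n_bins:((b + 1) * n) // n_bins]
            for b in range(n_bins)]

    # Step 4: draw the same fraction from every bin (an assumption), so the
    # sample covers the whole range of predicted probabilities.
    sampled = []
    for bin_idx in bins:
        if not bin_idx:
            continue
        k = max(1, round(len(bin_idx) * majority_fraction))
        sampled.extend(rng.sample(bin_idx, k))

    keep = sorted(minority_idx + sampled)
    return [rows[i] for i in keep], [targets[i] for i in keep]
```

Sampling within each bin, rather than from the majority class as a whole, guards against a random draw that happens to miss a region of the predicted-probability range.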

Generally, we do not want to perform data sampling unless the dataset is really large. We have found that imbalanced sampling does not necessarily improve the results. If your dataset is not large and you want the minority class to be weighted more heavily in the model, you can always use the weight column instead.
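
One common way to fill a weight column is inverse class frequency, which gives records from rarer classes larger weights. The sketch below shows that scheme only as an illustration. The choice of scheme and the function name are assumptions, not something Driverless AI prescribes. The resulting list would be stored as an extra column in the dataset and selected as the weight column.

```python
from collections import Counter


def class_balance_weights(targets):
    """Return one weight per record, inversely proportional to class frequency.

    Weights are scaled so that every class contributes the same total weight
    and the weights sum to the number of records.
    """
    counts = Counter(targets)
    n_classes = len(counts)
    total = len(targets)
    return [total / (n_classes * counts[t]) for t in targets]
```

With 90 records of one class and 10 of the other, each minority record receives a weight of 5.0 and each majority record a weight of about 0.56.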