Imbalanced modeling in Driverless AI¶
This page describes Driverless AI’s imbalanced modeling capabilities.
Driverless AI offers imbalanced algorithms for use cases with a binary, imbalanced target. These algorithms are enabled by default if the target column is considered imbalanced. Even when they are enabled, Driverless AI may decide not to use them in the final model to avoid poor performance.
While Driverless AI does try imbalanced algorithms by default, they have not generally been found to improve model performance. Note that using imbalanced algorithms also results in a significantly larger final model, because multiple models are combined with different balancing ratios.
Driverless AI provides two types of imbalanced algorithms: ImbalancedXGBoost and ImbalancedLightGBM. These imbalanced algorithms train an XGBoost or LightGBM model multiple times on different samples of data and then combine the predictions of these models together. These models use different samples of data and may use different sampling ratios. (By trying multiple ratios, DAI is more likely to come up with a robust model.)
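The combine-multiple-samples idea can be illustrated with a minimal sketch. This is not Driverless AI's implementation; the `undersample` helper and the toy data are hypothetical, and the actual sampling ratios DAI tries are chosen internally:

```python
import random

def undersample(rows, ratio, seed):
    """Hypothetical helper: keep every minority (positive) row and draw
    `ratio` times as many majority (negative) rows at random."""
    random.seed(seed)
    pos = [r for r in rows if r["target"] == 1]
    neg = [r for r in rows if r["target"] == 0]
    k = min(len(neg), int(len(pos) * ratio))
    return pos + random.sample(neg, k)

# Toy data: 3 positive rows among 100 (a heavily imbalanced target).
rows = [{"id": i, "target": 1 if i < 3 else 0} for i in range(100)]

# One sampled training set per balancing ratio; in the real ensemble,
# a model is trained on each sample and their predictions are averaged.
ratios = [1, 2, 5]
samples = [undersample(rows, r, seed=i) for i, r in enumerate(ratios)]
for r, s in zip(ratios, samples):
    print(f"ratio {r}: {len(s)} rows")
```

Each sampled set keeps all 3 positives but a different number of negatives, so every sub-model sees a differently balanced view of the data.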
Enabling imbalanced algorithms¶
The following steps describe how to enable only imbalanced algorithms:
On the Experiment Setup page, click Expert Settings.
In the Expert Settings window, click on the Training -> Models subtab.
For the Include specific models setting, click the Select Values button.
On the Selected Included Models page, click Uncheck All, and then select only the imbalanced algorithms: ImbalancedXGBoost and ImbalancedLightGBM. Click Done to confirm your selection.
In the Expert Settings window, click the Save button.
The following tips can help when imbalanced algorithms are enabled.
Ensure that you select a scorer that is not biased by imbalanced data. We recommend using the following scorers:
MCC: Uses the proportion of true negatives rather than their absolute count. (In imbalanced use cases, true negatives dominate, which can inflate count-based metrics.)
AUCPR: In this calculation, true negatives are not used at all.
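The following sketch shows why a scorer like MCC matters here. The confusion-matrix counts are made up for illustration; the `mcc` function is a plain implementation of the standard Matthews correlation coefficient formula, not DAI's scorer code:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# A degenerate classifier that predicts the majority (negative) class
# for every one of 1,000 rows, 10 of which are actually positive:
acc = 990 / 1000                         # 99% accuracy looks great...
score = mcc(tp=0, tn=990, fp=0, fn=10)   # ...but MCC is 0.0

# A classifier that actually finds most positives scores much higher,
# even though its raw accuracy (978/1000) is slightly lower:
score_better = mcc(tp=8, tn=970, fp=20, fn=2)
```

Accuracy rewards the trivial all-negative model; MCC does not, which is why it is a safer choice for imbalanced targets.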
A weight column can be used to internally upsample rare events. For example, a weight column with a value of 10 for rows where the target is positive and 1 otherwise tells the algorithms to treat getting the positive class correct as ten times more important.
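Building such a weight column is straightforward. A minimal sketch, assuming the target is encoded as 0/1 and the 10x multiplier is an illustrative choice to tune for your imbalance ratio:

```python
# Toy target column; 1 marks the rare positive class.
targets = [0, 1, 0, 0, 1]

# Weight of 10 for positive rows, 1 otherwise (multiplier is illustrative).
weights = [10 if t == 1 else 1 for t in targets]

# Add `weights` to your dataset as a column and select it as the
# weight column when setting up the experiment.
```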