Data leakage and shift detection in Driverless AI

This page describes data leakage and shift detection in Driverless AI (DAI).

Overview

  • Data leakage: To detect data leakage, DAI runs a model (when available, LightGBM) to get the variable importance table, which determines the predictive power of each feature on the target variable. A simple model is then built on each feature with significant variable importance. The models with a high AUC (for classification) or R2 (for regression) score are reported to the user as potential leak features.

  • Shift detection: To detect a shift in distribution between the training, validation, or testing datasets, Driverless AI trains a binomial model to predict which dataset a row belongs to. For example, if a model built using only a specific feature as a predictor is able to separate the training and testing data with high accuracy (for example, an AUC of 0.9), then the distribution of that feature has shifted between the training and testing data. Shifted features can either be dropped or used to create more meaningful aggregate features by using them as labels or bins.

Note

Shifted features are reported to the user in a notification, and are dropped if a drop threshold has been set.

Enabling leakage detection

To enable leakage detection, set the check_leakage configuration option to on (default). When this option is enabled, Driverless AI runs a model to determine the predictive power of each feature on the target variable.

When leakage detection is enabled, the detect_features_leakage_threshold_auc configuration option sets the trigger for per-feature leakage detection: the per-feature check runs if the AUC (or R2 for regression) on the original, label-encoded data is greater than or equal to the specified value. By default, this option is set to 0.95.

Identifying features responsible for leakage

For significant features (determined by feature importance), a simple model is built on each feature. The models with a high AUC (classification) or R2 (regression) score are reported to the user as potential leaks.

If leakage detection is enabled, then the detect_features_per_feature_leakage_threshold_auc configuration option is used to notify users about features for which the AUC (or R2 for regression) is greater than or equal to the specified value. By default, this option is set to 0.8.
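The per-feature check described above can be sketched in plain Python. This is a minimal illustration, not the Driverless AI implementation: in place of fitting a simple model on each feature, it scores each numeric feature directly with a rank-based AUC against a binary target, which is a simplification. The function names rank_auc and flag_potential_leaks are hypothetical, and the default threshold only mirrors the 0.8 notification default mentioned on this page.

```python
from typing import Dict, Sequence


def rank_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC from the Mann-Whitney rank sum; tied scores share an average rank."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank of the tie group
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present in labels")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def flag_potential_leaks(
    features: Dict[str, Sequence[float]],
    target: Sequence[int],
    threshold: float = 0.8,  # assumption: mirrors the documented notification default
) -> Dict[str, float]:
    """Return features whose single-feature AUC meets or exceeds the threshold."""
    flagged = {}
    for name, values in features.items():
        auc = rank_auc(values, target)
        strength = max(auc, 1.0 - auc)  # a feature can separate classes in either direction
        if strength >= threshold:
            flagged[name] = strength
    return flagged
```

A feature that orders the target perfectly (in either direction) scores 1.0 and is flagged, while a feature unrelated to the target scores near 0.5 and is not.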

Automatically drop features suspected of leakage

A feature is dropped when the performance of its single-feature model exceeds the threshold for dropping features. You can specify this threshold with the drop_features_leakage_threshold_auc configuration option, which has a default value of 0.999. When the AUC (or R2 for regression), GINI, or Spearman correlation is above the specified value, the feature is dropped.
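One possible reading of how the three leakage thresholds interact is sketched below. The function leakage_actions and its parameter names are hypothetical, and the order of the checks is an assumption based only on the option descriptions on this page; the actual control flow inside Driverless AI may differ. The defaults mirror the documented values (0.95, 0.8, and 0.999).

```python
from typing import Dict


def leakage_actions(
    overall_score: float,
    feature_scores: Dict[str, float],
    trigger_threshold: float = 0.95,  # detect_features_leakage_threshold_auc
    notify_threshold: float = 0.8,    # detect_features_per_feature_leakage_threshold_auc
    drop_threshold: float = 0.999,    # drop_features_leakage_threshold_auc
) -> Dict[str, str]:
    """Map each suspicious feature to "notify" or "drop" (assumed logic)."""
    # Assumption: per-feature checks run only when the overall score
    # reaches the trigger threshold.
    if overall_score < trigger_threshold:
        return {}
    actions = {}
    for name, score in feature_scores.items():
        if score > drop_threshold:        # strictly above: drop the feature
            actions[name] = "drop"
        elif score >= notify_threshold:   # at or above: report as a potential leak
            actions[name] = "notify"
    return actions
```

For example, with an overall AUC of 0.97, a feature scoring 1.0 would be dropped, one scoring 0.9 would be reported, and one scoring 0.5 would be left alone.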

Shift detection

Driverless AI can detect data distribution shifts between train/valid/test datasets when they are provided.

Shift is detected by training a model to distinguish between train/validation/test datasets by assigning a unique target label to each of the datasets. If the model turns out to have high accuracy, data shift is reported with a notification. Shifted features can either be dropped or used to create more meaningful aggregate features by using them as labels or bins.

The following is a list of configuration options for shift detection:

  • check_distribution_shift: Specify whether to enable train/valid and train/test distribution shift detection. By default, LightGBMModel is used for shift detection unless it is turned off in the Expert Settings window, in which case only the models selected in the recipe list are used.

  • check_distribution_shift_drop: Specify whether to drop high-shift features. Note that specifying auto disables this option for time series experiments.

  • drop_features_distribution_shift_threshold_auc: Specify the maximum allowed AUC value for a feature, above which the feature is dropped.

Note

When the train and test datasets (or train/valid or valid/test) differ in the distribution of their data, a model can be built that predicts whether each row belongs to train or test. The performance of this model is summarized by an AUC value. If the AUC, GINI, or Spearman correlation of the model is above the specified threshold, then Driverless AI considers the shift strong enough to drop those features. The default AUC threshold is 0.999.
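The idea behind per-feature shift detection can be illustrated with a small, self-contained sketch. It is not the Driverless AI implementation (which trains a model such as LightGBM): here each numeric feature's raw value is used as the score for telling test rows from train rows, and the AUC is computed by direct pairwise comparison. The names shift_auc and detect_shifted_features are hypothetical, and the default threshold only mirrors the documented 0.999 value.

```python
from typing import Dict, Sequence


def shift_auc(train_values: Sequence[float], test_values: Sequence[float]) -> float:
    """AUC for separating test rows from train rows using one feature's values."""
    if not train_values or not test_values:
        raise ValueError("both datasets must contain at least one row")
    wins = 0.0
    for test_value in test_values:
        for train_value in train_values:
            if test_value > train_value:
                wins += 1.0
            elif test_value == train_value:
                wins += 0.5  # ties count as half a win
    auc = wins / (len(train_values) * len(test_values))
    return max(auc, 1.0 - auc)  # the shift may run in either direction


def detect_shifted_features(
    train: Dict[str, Sequence[float]],
    test: Dict[str, Sequence[float]],
    threshold: float = 0.999,  # assumption: mirrors the documented drop threshold
) -> Dict[str, float]:
    """Return features present in both datasets whose shift AUC is above the threshold."""
    shifted = {}
    for name in train:
        if name not in test:
            continue
        auc = shift_auc(train[name], test[name])
        if auc > threshold:
            shifted[name] = auc
    return shifted
```

A feature such as a year column whose test values all lie beyond the training range separates the two datasets perfectly (AUC of 1.0), whereas a feature with overlapping ranges in both datasets scores closer to 0.5.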

Dropped columns

To identify columns dropped by DAI and view information about why specific columns were dropped, view the fitted_model.pickle.meta.json file in the experiment summary zip archive.
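As a rough sketch, the file can be read directly from the archive with the Python standard library. The archive's internal folder layout and the JSON's keys are not described on this page, so the helper below (a hypothetical name, load_meta_json) locates the file by name only and returns the parsed content without assuming any particular structure.

```python
import json
import zipfile
from typing import Any


def load_meta_json(zip_path: str, file_name: str = "fitted_model.pickle.meta.json") -> Any:
    """Parse the first archive member whose base name matches file_name."""
    with zipfile.ZipFile(zip_path) as archive:
        # The member may sit in a subfolder, so compare base names only.
        matches = [m for m in archive.namelist() if m.split("/")[-1] == file_name]
        if not matches:
            raise FileNotFoundError(f"{file_name} not found in {zip_path}")
        with archive.open(matches[0]) as handle:
            return json.load(handle)
```

Calling load_meta_json with the path to a downloaded experiment summary archive returns the parsed JSON, which can then be inspected for the entries describing dropped columns.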