Missing and Unseen Levels Handling

This section describes how missing and unseen levels are handled by each algorithm during training and scoring.

How Does the Algorithm Handle Missing Values During Training?

LightGBM, XGBoost, RuleFit

Driverless AI treats missing values natively; that is, a missing value is treated as a special value rather than being imputed. Experiments rarely benefit from imputation techniques unless the user has a strong understanding of the data.

GLM

Driverless AI automatically performs mean value imputation (equivalent to setting the value to zero after standardization).
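The equivalence between mean imputation and a post-standardization value of zero can be checked with a minimal sketch. The function name below is hypothetical and only illustrates the arithmetic; it is not the Driverless AI implementation, and it assumes the mean and standard deviation are computed over the non-missing values.

```python
import math

def standardize_with_mean_imputation(values):
    """Standardize a numeric column; missing entries (NaN) end up as 0.0."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    # Population standard deviation over the observed values only (assumption).
    std = math.sqrt(sum((v - mean) ** 2 for v in observed) / len(observed))
    # Impute the mean first, then standardize: (mean - mean) / std == 0.
    imputed = [mean if math.isnan(v) else v for v in values]
    return [(v - mean) / std for v in imputed]

column = [1.0, float("nan"), 3.0]
scaled = standardize_with_mean_imputation(column)
```

In this example the missing entry lands exactly on the center of the standardized column, which is why the two descriptions are interchangeable.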

TensorFlow

Driverless AI provides an imputation setting for TensorFlow in the config.toml file: tf_nan_impute_value. The value is applied after normalization, so it is expressed in standard deviations from the mean. Setting it to 0 imputes missing values with the mean; setting it to (for example) +5 imputes a value 5 standard deviations above the mean of the distribution. The default value in Driverless AI is -5, which specifies that TensorFlow will treat missing values as outliers on the negative end of the spectrum.
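As a rough illustration of what a post-normalization imputation value means, the sketch below normalizes a column and then fills missing entries with a constant. Only the setting name tf_nan_impute_value comes from the documentation above; the function and its nan_impute_value argument are hypothetical stand-ins, and the statistics are assumed to be computed over non-missing values.

```python
import math

def normalize_and_impute(values, nan_impute_value=-5.0):
    """Normalize a column, then replace NaN with a fixed post-normalization value."""
    observed = [v for v in values if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    std = math.sqrt(sum((v - mean) ** 2 for v in observed) / len(observed))
    # The fill value is in normalized units, i.e. standard deviations from the mean.
    return [
        nan_impute_value if math.isnan(v) else (v - mean) / std
        for v in values
    ]

column = [1.0, float("nan"), 3.0]
as_outlier = normalize_and_impute(column)     # default: 5 std devs below the mean
as_mean = normalize_and_impute(column, 0.0)   # 0 is equivalent to mean imputation
```

With this column (mean 2, standard deviation 1), the default fill of -5 corresponds to -3 in the original units, well below every observed value.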

FTRL

In FTRL, missing values have their own representation for each datatable column type. These representations are used to hash the missing value, together with its column’s name, to an integer. This means FTRL replaces missing values with special constants that are the same for each column type, and then treats these special constants like a normal data value.
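The idea of hashing a per-type missing-value constant together with the column name can be sketched as follows. Everything named here is an assumption made for illustration: the sentinel strings, the bucket count, and the use of MD5 are not taken from the FTRL implementation, whose actual constants and hash function are internal.

```python
import hashlib

# Hypothetical per-type sentinels standing in for the internal constants.
MISSING_SENTINEL = {"int": "NA_int", "float": "NA_float", "str": "NA_str"}
N_BUCKETS = 2 ** 20  # hypothetical size of the hashed feature space

def hash_feature(column_name, column_type, value):
    """Hash a (column, value) pair to an integer bucket; None means missing."""
    if value is None:
        # A missing value is replaced by the sentinel for its column type...
        value = MISSING_SENTINEL[column_type]
    # ...and from here on is hashed exactly like any other value.
    key = f"{column_name}={value}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "little") % N_BUCKETS

bucket_missing_age = hash_feature("age", "int", None)
bucket_known_age = hash_feature("age", "int", 42)
```

Because the column name is part of the hashed key, a missing value in one column gets its own weight, separate from missing values in other columns of the same type.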

Unsupervised Algorithms

For unsupervised algorithms, standardization in the pre-transformation layer (where it is decided which columns and column encodings are fed in for clustering) is performed by ignoring any missing values. Scikit-learn’s StandardScaler is used internally during the standardization process. Missing values are then replaced with 0 for further calculations or clustering.

How Does the Algorithm Handle Missing Values During Scoring (Production)?

LightGBM, XGBoost, RuleFit

If missing data is present during training, these tree-based algorithms learn the optimal direction for missing data for each split (left or right). This optimal direction is then used for missing values during scoring. If no missing data is present during training (for a particular feature), then the majority path is followed if the value is missing.
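A simplified sketch of this behavior for a single split is shown below. It uses a squared-error criterion as a stand-in for the gradient-based gain that real gradient boosting libraries compute, and the tie-breaking toward the left branch is an assumption; the function names are hypothetical.

```python
import math

def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def learn_default_left(xs, ys, threshold):
    """Return True if missing values should go left at this split."""
    pairs = list(zip(xs, ys))
    left = [y for x, y in pairs if not math.isnan(x) and x < threshold]
    right = [y for x, y in pairs if not math.isnan(x) and x >= threshold]
    missing = [y for x, y in pairs if math.isnan(x)]
    if not missing:
        # No missing values seen in training: follow the majority path.
        return len(left) >= len(right)
    # Otherwise try both directions and keep the one with the lower loss.
    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return loss_if_left <= loss_if_right

def route(x, threshold, default_left):
    """Route one value at scoring time, using the stored default for NaN."""
    if math.isnan(x):
        return "left" if default_left else "right"
    return "left" if x < threshold else "right"

nan = float("nan")
learned = learn_default_left([1.0, 2.0, 8.0, 9.0, nan], [0.0, 0.0, 10.0, 10.0, 10.0], 5.0)
majority = learn_default_left([1.0, 2.0, 3.0, 9.0], [0.0, 0.0, 0.0, 10.0], 5.0)
```

In the first call the row with a missing feature has a target that matches the right branch, so missing values are sent right; in the second call no missing values were seen, so the branch holding most training rows (left) is used.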

GLM

Missing values are replaced by the mean value computed on the training data, the same as during training.

TensorFlow

Missing values are replaced by the same value as specified during training (parameterized by tf_nan_impute_value).

FTRL

To ensure consistency, FTRL treats missing values during scoring in exactly the same way as during training.

Clustering in Transformers

Missing values are replaced with the mean along each column. This is used only on numeric columns.

Isolation Forest Anomaly Score Transformer

Isolation Forest uses out-of-range imputation, which fills missing values with a value beyond the feature’s maximum.
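Out-of-range imputation can be illustrated with a minimal sketch. The margin added to the maximum is a hypothetical choice made for this example; the text above does not state how far beyond the maximum the fill value lies.

```python
import math

def out_of_range_impute(values, margin=1.0):
    """Fill NaN with a value larger than every observed value in the column."""
    observed = [v for v in values if not math.isnan(v)]
    # Hypothetical rule: maximum observed value plus a fixed margin.
    fill_value = max(observed) + margin
    return [fill_value if math.isnan(v) else v for v in values]

filled = out_of_range_impute([1.0, float("nan"), 4.0])
```

Placing missing values outside the observed range lets a tree isolate them with a single split, so they do not blend in with ordinary values.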

What Happens When You Try to Predict on a Categorical Level Not Seen During Training?

XGBoost, LightGBM, RuleFit, TensorFlow, GLM

Driverless AI’s feature engineering pipeline will compute a numeric value for every categorical level present in the data, whether it’s a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For target encoding, the global mean of the target value will be used.
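The fallback behavior for unseen levels can be sketched as below. This is a plain illustration with hypothetical function names; it omits details such as out-of-fold estimation or smoothing that a production target encoder would typically use.

```python
from collections import Counter, defaultdict

def fit_frequency_encoding(levels):
    """Count how often each level occurs in the training data."""
    return Counter(levels)

def frequency_encode(counts, level):
    # Unseen levels fall back to 0.
    return counts.get(level, 0)

def fit_target_encoding(levels, targets):
    """Compute the per-level target mean and the global target mean."""
    sums = defaultdict(float)
    counts = Counter()
    for level, target in zip(levels, targets):
        sums[level] += target
        counts[level] += 1
    per_level = {level: sums[level] / counts[level] for level in counts}
    global_mean = sum(targets) / len(targets)
    return per_level, global_mean

def target_encode(per_level, global_mean, level):
    # Unseen levels fall back to the global mean of the target.
    return per_level.get(level, global_mean)

train_levels = ["a", "a", "b"]
train_targets = [1.0, 0.0, 1.0]
freq = fit_frequency_encoding(train_levels)
per_level, global_mean = fit_target_encoding(train_levels, train_targets)
unseen_freq = frequency_encode(freq, "c")
unseen_target = target_encode(per_level, global_mean, "c")
```

Here the level "c" never appears in training, so it is encoded as 0 by frequency encoding and as the global target mean (2/3) by target encoding.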

FTRL

FTRL models don’t distinguish between categorical and numeric values. Whether or not FTRL saw a particular value during training, it will hash all the data, row by row, to numeric and then make predictions. Because you can think of FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable overlap in terms of unique values with the ones used to make predictions.
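One simple way to check that overlap before scoring is to measure what fraction of the unique values in the scoring data never appeared in training. The helper below is a hypothetical sketch for a single column, not a Driverless AI feature.

```python
def unseen_fraction(train_values, scoring_values):
    """Fraction of unique scoring values that were not seen in training."""
    seen = set(train_values)
    scoring_unique = set(scoring_values)
    if not scoring_unique:
        return 0.0
    return len(scoring_unique - seen) / len(scoring_unique)

fraction = unseen_fraction(["a", "b", "c"], ["a", "c", "d", "d"])
```

In this example one of the three unique scoring values ("d") is new, so a third of the unique values would be hashed to buckets for which no weights were learned.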

What Happens if the Response Has Missing Values?

All algorithms will skip an observation (record) if the response value is missing.
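The effect is the same as filtering the training data before fitting, as in the minimal sketch below. The row format (a list of dictionaries) and the function names are assumptions made for illustration.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_missing_response(rows, response):
    """Keep only the rows whose response column has a value."""
    return [row for row in rows if not is_missing(row[response])]

rows = [
    {"x": 1.0, "y": 0.0},
    {"x": 2.0, "y": None},
    {"x": 3.0, "y": float("nan")},
    {"x": float("nan"), "y": 1.0},  # missing feature, response present: kept
]
usable = drop_missing_response(rows, "y")
```

Only the response column triggers the skip; rows with missing feature values are kept and handled by the per-algorithm rules described above.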