Missing and Unseen Levels Handling

This section describes how missing and unseen levels are handled by each algorithm during training and scoring.

How Does the Algorithm Handle Missing Values During Training?

LightGBM, XGBoost, RuleFit

Driverless AI treats missing values natively; that is, a missing value is treated as a special value. Experiments rarely benefit from imputation techniques unless the user has a strong understanding of the data.

GLM

Driverless AI automatically performs mean value imputation (equivalent to setting the value to zero after standardization).
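The equivalence stated above can be checked with a short sketch. This is an illustration using NumPy, not Driverless AI's internal code; the function names are hypothetical, and the mean and standard deviation are assumed to be computed from the non-missing values only.

```python
import numpy as np

def standardize_then_zero(column):
    """Standardize with non-missing statistics, then set missing entries to 0."""
    mean = np.nanmean(column)
    std = np.nanstd(column)
    z = (column - mean) / std
    return np.where(np.isnan(z), 0.0, z)

def mean_impute_then_standardize(column):
    """Fill missing entries with the mean, then standardize with the same statistics."""
    mean = np.nanmean(column)
    std = np.nanstd(column)
    filled = np.where(np.isnan(column), mean, column)
    return (filled - mean) / std

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
a = standardize_then_zero(x)
b = mean_impute_then_standardize(x)
```

Both routes give the same vector, with the missing entry mapped to 0, because the mean maps to zero under standardization.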

TensorFlow

Driverless AI provides an imputation setting for TensorFlow in the config.toml file: tf_nan_impute_value. The value is applied after normalization, so it is expressed in standard deviations from the mean. Setting this option to 0 imputes missing values with the mean. Setting it to (for example) +5 imputes a value 5 standard deviations above the mean of the distribution. The default value in Driverless AI is -5, which specifies that TensorFlow will treat missing values as outliers on the negative end of the spectrum.
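As a rough sketch of what a post-normalization imputation value means, the following NumPy example normalizes a column and then fills missing entries with a fixed value. The function normalize_and_impute and its nan_impute_value argument are hypothetical names chosen to mirror tf_nan_impute_value; they are not part of Driverless AI.

```python
import numpy as np

def normalize_and_impute(column, nan_impute_value=-5.0):
    """Normalize to zero mean and unit variance, then fill missing entries.

    nan_impute_value is in units of standard deviations because it is
    applied after normalization (assumption: statistics ignore missing values).
    """
    mean = np.nanmean(column)
    std = np.nanstd(column)
    z = (column - mean) / std
    return np.where(np.isnan(z), nan_impute_value, z)

x = np.array([10.0, 12.0, np.nan, 14.0])
default_fill = normalize_and_impute(x)                    # missing -> -5 (outlier)
mean_fill = normalize_and_impute(x, nan_impute_value=0.0) # missing -> mean
```

With the default, the missing entry lands far below every observed value; with 0, it sits exactly at the mean.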

FTRL

In FTRL, missing values have their own representation for each datatable column type. These representations are hashed, together with the column’s name, to an integer. This means FTRL replaces missing values with special constants that are the same for each column type, and then treats these special constants like normal data values.
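The idea of hashing a per-type missing-value constant together with the column name can be sketched as follows. Everything here is an assumption made for illustration: the token strings, the bucket count, and the use of SHA-256 are hypothetical and do not reflect the actual FTRL implementation.

```python
import hashlib

# Hypothetical per-type constants standing in for missing values.
MISSING_TOKENS = {"int": "<NA_int>", "float": "<NA_float>", "str": "<NA_str>"}
NUM_BUCKETS = 2 ** 20  # hypothetical size of the hashed feature space

def hash_feature(column_name, column_type, value):
    """Hash (column name, value) to an integer bucket.

    A missing value (None) is first replaced by the constant for its
    column type, then hashed exactly like any other value.
    """
    if value is None:
        value = MISSING_TOKENS[column_type]
    key = f"{column_name}={value}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "little") % NUM_BUCKETS

missing_age = hash_feature("age", "int", None)
missing_age_again = hash_feature("age", "int", None)
```

Because the mapping is deterministic, a missing value in a given column always lands in the same bucket, so the model can learn a weight for it as it would for any other value.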

How Does the Algorithm Handle Missing Values During Scoring (Production)?

LightGBM, XGBoost, RuleFit

If missing data is present during training, these tree-based algorithms learn the optimal direction for missing data for each split (left or right). This optimal direction is then used for missing values during scoring. If no missing data is present during training (for a particular feature), then the majority path is followed when a value is missing during scoring.
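A simplified sketch of how a single split might route a missing value at scoring time is shown below. The Split class and helper functions are hypothetical; real tree libraries choose the default direction by comparing the gain of each option during training, which is not reproduced here. Only the majority-path fallback is computed explicitly.

```python
import math
from dataclasses import dataclass

@dataclass
class Split:
    threshold: float
    default_left: bool  # direction taken by missing values at this split

def majority_goes_left(training_values, threshold):
    """Fallback when training had no missing values: follow the majority path."""
    left = sum(1 for v in training_values if v < threshold)
    right = len(training_values) - left
    return left >= right

def route(split, value):
    """Return the branch a value takes; missing values use the default direction."""
    if value is None or math.isnan(value):
        return "left" if split.default_left else "right"
    return "left" if value < split.threshold else "right"

train = [1.0, 2.0, 3.0, 10.0]  # no missing values seen in training
split = Split(threshold=5.0, default_left=majority_goes_left(train, 5.0))
branch_for_missing = route(split, float("nan"))
branch_for_seven = route(split, 7.0)
```

Here three of the four training values fall to the left of the threshold, so a missing value at scoring time is sent left.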

GLM

Missing values are replaced by the mean value (from training), same as in training.

TensorFlow

Missing values are replaced by the same value as specified during training (parameterized by tf_nan_impute_value).

FTRL

To ensure consistency, FTRL treats missing values during scoring in exactly the same way as during training.

Clustering in Transformers

Missing values are replaced with the mean of each column. This applies only to numeric columns.

Isolation Forest Anomaly Score Transformer

Isolation Forest uses out-of-range imputation, which fills missing values with values beyond the maximum.
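Out-of-range imputation can be illustrated with a minimal NumPy sketch. The margin parameter is an assumption made for this example; the source does not state how far beyond the maximum the fill value lies.

```python
import numpy as np

def out_of_range_impute(column, margin=1.0):
    """Fill missing entries with a value above the largest observed value.

    margin is a hypothetical offset; any positive value places the
    fill outside the observed range.
    """
    fill = np.nanmax(column) + margin
    return np.where(np.isnan(column), fill, column)

x = np.array([3.0, np.nan, 7.0, 5.0])
filled = out_of_range_impute(x)
```

Placing missing values outside the observed range lets the trees separate them from regular values with a single split.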

What Happens When You Try to Predict on a Categorical Level Not Seen During Training?

XGBoost, LightGBM, RuleFit, TensorFlow, GLM

Driverless AI’s feature engineering pipeline will compute a numeric value for every categorical level present in the data, whether it’s a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For target encoding, the global mean of the target value will be used.
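The two fallbacks described above can be sketched in plain Python. This is a deliberately simplified illustration with hypothetical function names; production target encoders typically add out-of-fold estimation and smoothing, which are omitted here.

```python
from collections import Counter

def fit_frequency_encoder(levels):
    """Count how often each level occurs in the training data."""
    return Counter(levels)

def frequency_encode(counts, level):
    """Unseen levels get a frequency of 0."""
    return counts.get(level, 0)

def fit_target_encoder(levels, targets):
    """Compute the per-level mean target and the global mean target."""
    sums, counts = {}, {}
    for level, y in zip(levels, targets):
        sums[level] = sums.get(level, 0.0) + y
        counts[level] = counts.get(level, 0) + 1
    means = {level: sums[level] / counts[level] for level in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def target_encode(means, global_mean, level):
    """Unseen levels fall back to the global mean of the target."""
    return means.get(level, global_mean)

colors = ["red", "blue", "red", "green"]
y = [1, 0, 1, 0]

freq = fit_frequency_encoder(colors)
means, global_mean = fit_target_encoder(colors, y)

seen_freq = frequency_encode(freq, "red")
unseen_freq = frequency_encode(freq, "purple")
seen_te = target_encode(means, global_mean, "red")
unseen_te = target_encode(means, global_mean, "purple")
```

In this toy data, "purple" never appears in training, so it is encoded as 0 under frequency encoding and as the global target mean of 0.5 under target encoding.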

FTRL

FTRL models don’t distinguish between categorical and numeric values. Whether or not FTRL saw a particular value during training, it hashes all the data, row by row, to numeric values and then makes predictions. Because FTRL effectively memorizes the values it sees in the training dataset, there is no guarantee that it will make accurate predictions for unseen data. Therefore, it is important to ensure that the unique values in the training dataset have a reasonable “overlap” with the unique values in the data used to make predictions.

What Happens if the Response Has Missing Values?

All algorithms skip an observation (that is, a record) if its response value is missing.
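The effect is comparable to filtering the training data before fitting, as in the following sketch. The helper drop_missing_response and the column name "target" are hypothetical; the example only illustrates the behavior, not how Driverless AI implements it.

```python
import math

def drop_missing_response(rows, target):
    """Keep only the rows whose response value is present."""
    kept = []
    for row in rows:
        y = row.get(target)
        if y is None or (isinstance(y, float) and math.isnan(y)):
            continue  # skip the observation: nothing to learn from
        kept.append(row)
    return kept

rows = [
    {"x": 1.0, "target": 0},
    {"x": 2.0, "target": None},
    {"x": 3.0, "target": float("nan")},
    {"x": 4.0, "target": 1},
]
usable = drop_missing_response(rows, "target")
```

Only the first and last rows remain, since the other two have no response value to train against.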