Available in: Deep Learning, GLM, GAM
This option specifies how the algorithm treats missing values. In H2O, the Deep Learning and GLM algorithms either skip or mean-impute rows with NA values. The GLM algorithm can also use plug_values, which lets you supply a single-row frame containing the values used to impute missing values in the training/validation frame. Both algorithms default to MeanImputation. Note that in Deep Learning, unseen categorical variables are imputed by adding an extra “missing” level. In GLM, unseen categorical levels are replaced by the most frequent level present in training (the mode).
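To make the three strategies concrete, here is a minimal plain-Python sketch of what skipping, mean imputation, and plug values do to a small numeric dataset. It is an illustration only and does not call or reproduce H2O's implementation; the helper names (skip, mean_impute, plug) and the sample data are hypothetical.

```python
from statistics import mean

NA = None  # stand-in for a missing value in this sketch


def skip(rows):
    """Drop every row that contains at least one NA."""
    return [r for r in rows if NA not in r.values()]


def mean_impute(rows):
    """Replace each NA with the mean of the observed values in its column."""
    cols = list(rows[0].keys())
    means = {c: mean(r[c] for r in rows if r[c] is not NA) for c in cols}
    return [{c: means[c] if r[c] is NA else r[c] for c in cols} for r in rows]


def plug(rows, plug_values):
    """Replace each NA with the value supplied for its column in a single-row mapping."""
    return [{c: plug_values[c] if v is NA else v for c, v in r.items()} for r in rows]


# Hypothetical data: two of the three rows have one missing value each.
rows = [
    {"age": 30, "income": 50.0},
    {"age": NA, "income": 70.0},
    {"age": 50, "income": NA},
]

skipped = skip(rows)                              # only the complete row survives
imputed = mean_impute(rows)                       # NAs become column means
plugged = plug(rows, {"age": 0, "income": 0.0})   # NAs become user-chosen values
```

Skipping shrinks the dataset, while the two imputation strategies keep every row and differ only in where the replacement values come from.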
The fewer the NA values in your training data, the better. Always check degrees of freedom in the output model. Degrees of freedom is the number of observations used to train the model minus the size of the model (i.e., the number of features). If this number is much smaller than expected, it is likely that too many rows have been excluded due to missing values:
If you have a few columns with many NAs, skipping rows might accidentally discard nearly all of your data, so it's better to exclude (skip) those columns instead.
If you have many columns with a small fraction of uniformly distributed missing values, every row will likely have at least one missing value. In this case, impute the NAs (e.g., substitute the NAs with mean values) before modeling.
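The second case can be checked with quick arithmetic. The sketch below assumes each cell is missing independently at the same rate, which is a simplification, and uses hypothetical numbers (10,000 rows, 100 columns, 5% missing per column). It estimates how many rows survive if rows with any NA are skipped, then applies the degrees-of-freedom definition given above.

```python
def expected_complete_fraction(n_columns, na_rate):
    """Probability that a row has no NA, assuming cells are missing independently."""
    return (1.0 - na_rate) ** n_columns


def degrees_of_freedom(n_rows_used, n_features):
    """Observations used to train the model minus the number of features."""
    return n_rows_used - n_features


# Hypothetical dataset dimensions and missing rate.
n_rows = 10_000
n_columns = 100
na_rate = 0.05

complete_fraction = expected_complete_fraction(n_columns, na_rate)
rows_kept = round(n_rows * complete_fraction)
dof_if_skipping = degrees_of_freedom(rows_kept, n_columns)
dof_if_imputing = degrees_of_freedom(n_rows, n_columns)

print(f"fraction of complete rows: {complete_fraction:.4f}")
print(f"rows kept when skipping:   {rows_kept}")
print(f"degrees of freedom (skip / impute): {dof_if_skipping} / {dof_if_imputing}")
```

Under these assumptions fewer than 1% of rows are complete, so skipping leaves fewer observations than features, while imputing keeps all 10,000 rows available for training.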