Naïve Bayes Classifier

Introduction

Naïve Bayes is a classification algorithm that applies Bayes' theorem under a strong independence assumption: the predictor variables are assumed to be independent of one another conditional on the response. Numeric predictors are additionally assumed to follow a Gaussian distribution, with mean and standard deviation computed from the training dataset.
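
Written out, the conditional independence assumption means the joint likelihood of \(p\) predictors factors into a product of per-predictor terms:

\(p(x_1, \ldots, x_p \mid y) = \prod_{j=1}^{p} p(x_j \mid y)\)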

Naïve Bayes models are commonly used as an alternative to decision trees for classification problems. When building a Naïve Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.

MOJO Support

Naïve Bayes Classifier currently does not support MOJOs.

Defining a Naïve Bayes Model

Parameters are optional unless specified as required.

Algorithm-specific parameters

  • compute_metrics: Enable this option to compute metrics on training data. This option defaults to True (enabled).

  • eps_prob: Specify the cutoff below which a calculated probability is replaced with min_prob (see the sketch after this list).

  • eps_sdev: Specify the cutoff below which a calculated standard deviation is replaced with min_sdev. The value must be positive.

  • laplace: Specify the Laplace smoothing parameter. The option must be an integer \(\geq\) 0 and it defaults to 0.

  • min_prob: Specify the minimum probability to use for observations without enough data. This option defaults to 0.001.

  • min_sdev: Specify the minimum standard deviation to use for observations without enough data. The option must be at least 1e-10 and defaults to 0.001.
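
A minimal Python sketch showing these algorithm-specific parameters in use; the dataset and column choices are placeholders borrowed from the example later in this document, and the parameter values are illustrative rather than recommendations:

import h2o
from h2o.estimators import H2ONaiveBayesEstimator

h2o.init()

# Placeholder frame; any dataset with a categorical response works:
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()

nb = H2ONaiveBayesEstimator(compute_metrics=True,  # compute metrics on training data
                            laplace=1,             # Laplace smoothing (integer >= 0)
                            min_prob=0.001,        # floor for probabilities with little data
                            eps_prob=0.001,        # cutoff below which min_prob is substituted
                            min_sdev=0.001,        # floor for standard deviations
                            eps_sdev=0.001)        # cutoff below which min_sdev is substituted
nb.train(y="CAPSULE", training_frame=prostate)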

Common parameters

  • auc_type: Set the default multinomial AUC type. Must be one of:

    • "AUTO" (default)

    • "NONE"

    • "MACRO_OVR"

    • "WEIGHTED_OVR"

    • "MACRO_OVO"

    • "WEIGHTED_OVO"

  • balance_classes: (Applicable for classification only) Specify whether to oversample the minority classes to balance the class distribution. This option can increase the data frame size. Majority classes can be undersampled to satisfy the max_after_balance_size parameter. This option is set to False (disabled) by default.

  • class_sampling_factors: (Applicable when balance_classes=True only) Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance.

  • export_checkpoints_dir: Specify a directory to which generated models will automatically be exported.

  • fold_assignment: (Applicable only if a value for nfolds is specified and fold_column is not specified) Specify the cross-validation fold assignment scheme. One of:

    • AUTO (default; uses Random)

    • Random

    • Modulo

    • Stratified (which will stratify the folds based on the response variable for classification problems)

  • fold_column: Specify the column that contains the cross-validation fold index assignment per observation.

  • gainslift_bins: The number of bins for a Gains/Lift table. The default value of -1 makes the binning automatic. To disable this feature, set to 0.

  • keep_cross_validation_fold_assignment: Enable this option to preserve the cross-validation fold assignment. This option defaults to False (disabled).

  • keep_cross_validation_models: Specify whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster. This option defaults to True (enabled).

  • keep_cross_validation_predictions: Enable this option to keep the cross-validation predictions. This option defaults to False (disabled).

  • ignore_const_cols: Specify whether to ignore constant training columns, since no information can be gained from them. This option defaults to True (enabled).

  • ignored_columns: (Python and Flow only) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.

  • max_after_balance_size: (Applicable when balance_classes=True only) Specify the maximum relative size of the training data after balancing class counts. The value can be > 1.0 and defaults to 5.0.

  • max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 (default) to disable.

  • model_id: Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

  • nfolds: Specify the number of folds for cross-validation. This option defaults to 0 (no cross-validation); to enable cross-validation, specify a value \(\geq\) 2 (see the cross-validation sketch after this list).

  • score_each_iteration: Specify whether to score during each iteration of the model training. This option defaults to False (disabled).

  • seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations. This option defaults to -1 (time-based random number).

  • training_frame: (Required) Specify the dataset used to build the model.

    NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.

  • validation_frame: Specify the dataset used to evaluate the accuracy of the model.

  • x: Specify a vector containing the names or indices of the predictor variables to use when building the model. If x is missing, then all columns except y are used.

  • y: (Required) Specify the column to use as the dependent variable. The data must be categorical and must contain at least two unique categorical levels.
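
A minimal Python sketch of the cross-validation and class-balancing parameters above; the specific values are illustrative only, and the frame is the prostate dataset used in the examples below:

import h2o
from h2o.estimators import H2ONaiveBayesEstimator

h2o.init()

prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()

nb_cv = H2ONaiveBayesEstimator(model_id="nb_cv_demo",           # custom model name (hypothetical)
                               nfolds=5,                        # 5-fold cross-validation
                               fold_assignment="Stratified",    # stratify folds on the response
                               keep_cross_validation_predictions=True,
                               balance_classes=True,            # oversample minority classes
                               max_after_balance_size=5.0,      # cap on post-balancing frame size
                               seed=1234)                       # reproducible sampling and folds
nb_cv.train(y="CAPSULE", training_frame=prostate)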

Interpreting a Naïve Bayes Model

The output from Naïve Bayes is a list of tables containing the a-priori and conditional probabilities of each class of the response. The a-priori probability is the estimated probability of a particular class before observing any of the predictors. Each conditional probability table corresponds to a predictor column. The row headers are the classes of the response and the column headers are the classes of the predictor, so each entry is the probability of a predictor class given a response class. Thus, in the sample output below, the probability that a person is male (x) given that they survived (y) is 0.51617440.

                Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560
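
To see how such a table feeds into a prediction, suppose (purely for illustration; these priors are invented, not taken from the model output) the a-priori probabilities were \(P(\text{No}) = 0.62\) and \(P(\text{Yes}) = 0.38\). Bayes' theorem then gives the posterior probability of survival for a male passenger:

\(P(\text{Yes} \mid \text{Male}) = \frac{0.516 \times 0.38}{0.516 \times 0.38 + 0.915 \times 0.62} \approx 0.257\)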

When the predictor is numeric, Naïve Bayes assumes it is sampled from a Gaussian distribution given the class of the response. The first column contains the mean and the second column contains the standard deviation of the distribution.
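
That is, for a numeric predictor \(x_j\), the conditional term used in the product is the Gaussian likelihood

\(p(x_j \mid y = c) = \frac{1}{\sqrt{2\pi}\,\sigma_{jc}} \exp\left(-\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2}\right)\)

where \(\mu_{jc}\) and \(\sigma_{jc}\) are the mean and standard deviation of \(x_j\) over the training rows in response class \(c\).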

By default, the following output displays:

  • Output, including model category, model summary, scoring history, training metrics, and validation metrics

  • Y-Levels (levels of the response column)

  • A Priori response probabilities

  • P-conditionals

Examples

Below is a simple example showing how to build a Naïve Bayes Classifier model, first in R and then in Python.

# Load the h2o library and start a local H2O cluster:
library(h2o)
h2o.init()

# Import the prostate dataset into H2O:
prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

# Set the predictors and response; set the response as a factor:
prostate$CAPSULE <- as.factor(prostate$CAPSULE)
predictors <- c("ID", "AGE", "RACE", "DPROS" ,"DCAPS" ,"PSA", "VOL", "GLEASON")
response <- "CAPSULE"

# Build and train the model:
pros_nb <- h2o.naiveBayes(x = predictors,
                          y = response,
                          training_frame = prostate,
                          laplace = 0,
                          nfolds = 5,
                          seed = 1234)

# Eval performance:
perf <- h2o.performance(pros_nb)

# Generate the predictions on a test set (if necessary):
pred <- h2o.predict(pros_nb, newdata = prostate)

The same example in Python:

import h2o
from h2o.estimators import H2ONaiveBayesEstimator
h2o.init()

# Import the prostate dataset into H2O:
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")

# Set predictors and response; set the response as a factor:
prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
predictors = ["ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON"]
response = "CAPSULE"

# Build and train the model:
pros_nb = H2ONaiveBayesEstimator(laplace=0,
                                 nfolds=5,
                                 seed=1234)
pros_nb.train(x=predictors,
              y=response,
              training_frame=prostate)

# Eval performance:
perf = pros_nb.model_performance()

# Generate predictions on a test set (if necessary):
pred = pros_nb.predict(prostate)

FAQ

  • How does the algorithm handle missing values during training?

All rows with one or more missing values (either in the predictors or the response) will be skipped during model building.

  • How does the algorithm handle missing values during testing?

If a predictor is missing, it will be skipped when taking the product of conditional probabilities in calculating the joint probability conditional on the response.

  • What happens if the response domain is different in the training and test datasets?

The response column in the test dataset is not used during scoring, so any response categories absent in the training data will not be predicted.

  • What happens when you try to predict on a categorical level not seen during training?

The conditional probability of that predictor level will be set according to the Laplace smoothing factor. If Laplace smoothing is disabled (laplace = 0), then Naïve Bayes will predict a probability of 0 for any row in the test set that contains a previously unseen categorical level. However, if Laplace smoothing is enabled (e.g. laplace = 1), then the model can make predictions for rows that include previously unseen categorical levels.

Laplace smoothing adjusts the maximum likelihood estimates by adding 1 to the numerator and \(k\) to the denominator to allow for categorical levels that are absent from the training set:

\(\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)} = 1 \cap y^{(i)} = 1) + 1}{\sum_{i=1}^{m} 1(y^{(i)} = 1) + k}\)

\(\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)} = 1 \cap y^{(i)} = 0) + 1}{\sum_{i=1}^{m} 1(y^{(i)} = 0) + k}\)

\(x^{(i)}\) represents the features, \(y^{(i)}\) represents the response column, and \(k\) is the number of levels of the categorical predictor. (Adding \(k\) to the denominator balances the 1 added to each numerator, so the smoothed probabilities still sum to one.)

Laplace smoothing should be used with care; it is generally intended to allow for predictions in rare events. As prediction data becomes increasingly distinct from training data, new models should be trained when possible to account for a broader set of possible feature values.
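
As a worked illustration (the counts here are invented): suppose a predictor with \(k = 3\) levels has a level that never co-occurs with \(y = 1\), and 4 of the training rows have \(y = 1\). Without smoothing, the estimate is \(0/4 = 0\); with laplace = 1, the formula above gives

\(\phi_{j|y=1} = \frac{0 + 1}{4 + 3} = \frac{1}{7} \approx 0.143\)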

  • Does it matter if the data is sorted?

No.

  • Should data be shuffled before training?

This does not affect model building.

  • How does the algorithm handle highly imbalanced data in a response column?

Unbalanced data will not affect the model. However, if one response category has very few observations compared to the total, the conditional probability may be very low. A cutoff (eps_prob) and minimum value (min_prob) are available for the user to set a floor on the calculated probability.

  • What if there are a large number of columns?

More memory will be allocated on each node to store the joint frequency counts and sums.

  • What if there are a large number of categorical factor levels?

More memory will be allocated on each node to store the joint frequency count of each categorical predictor level with the response’s level.

  • When running Naïve Bayes, is it better to create a cluster that uses many smaller nodes or fewer larger nodes?

For Naïve Bayes, we recommend using many smaller nodes because the distributed task doesn’t require intensive computation.

Naïve Bayes Algorithm

The algorithm is presented for the simplified binomial case without loss of generality.

Under the Naïve Bayes assumption of independence, given a training set \(\{(X^{(i)}, y^{(i)}); \ i = 1, \ldots, m\}\) for a set of discrete-valued features \(X\), the joint likelihood of the data can be expressed as:

\(\mathcal{L}(\phi(y), \phi_{i|y=1}, \phi_{i|y=0}) = \prod_{i=1}^{m} p(X^{(i)}, y^{(i)})\)

The model can be parameterized by:

\(\phi_{i|y=0} = p(x_i = 1 \mid y = 0); \quad \phi_{i|y=1} = p(x_i = 1 \mid y = 1); \quad \phi(y)\)

where \(\phi_{i|y=0} = p(x_i = 1 \mid y = 0)\) can be thought of as the fraction of observed instances where feature \(x_i\) appears and the outcome is \(y = 0\), \(\phi_{i|y=1} = p(x_i = 1 \mid y = 1)\) is the fraction of observed instances where feature \(x_i\) appears and the outcome is \(y = 1\), and \(\phi(y)\) is the prior probability of the outcome \(y = 1\).

The objective of the algorithm is to maximize the joint likelihood with respect to \(\phi_{i|y=0}\), \(\phi_{i|y=1}\), and \(\phi(y)\), where the maximum likelihood estimates are:

\(\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)} = 1 \cap y^{(i)} = 1)}{\sum_{i=1}^{m} 1(y^{(i)} = 1)}\)

\(\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)} = 1 \cap y^{(i)} = 0)}{\sum_{i=1}^{m} 1(y^{(i)} = 0)}\)

\(\phi(y) = \frac{\sum_{i=1}^{m} 1(y^{(i)} = 1)}{m}\)

Once all parameters \(\phi_{j|y}\) are fitted, the model can be used to predict new examples with features \(X^{(i^*)}\). This is carried out by calculating:

\(p(y = 1 \mid x) = \frac{\prod_i p(x_i \mid y = 1) \, p(y = 1)}{\prod_i p(x_i \mid y = 1) \, p(y = 1) + \prod_i p(x_i \mid y = 0) \, p(y = 0)}\)

\(p(y = 0 \mid x) = \frac{\prod_i p(x_i \mid y = 0) \, p(y = 0)}{\prod_i p(x_i \mid y = 1) \, p(y = 1) + \prod_i p(x_i \mid y = 0) \, p(y = 0)}\)

and then predicting the class with the highest probability.

It is possible that prediction sets contain features not originally seen in the training set. If this occurs, the maximum likelihood estimates for these features predict a probability of 0 for all cases of \(y\).

Laplace smoothing allows a model to predict on feature values not seen in the training data by adjusting the maximum likelihood estimates to be:

\(\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)} = 1 \cap y^{(i)} = 1) + 1}{\sum_{i=1}^{m} 1(y^{(i)} = 1) + 2}\)

\(\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)} = 1 \cap y^{(i)} = 0) + 1}{\sum_{i=1}^{m} 1(y^{(i)} = 0) + 2}\)

Note that in the general case, where a feature takes on \(k\) distinct values, 1 is added to each of the \(k\) numerators and \(k\) is added to the denominator (rather than 2, as in the two-level classifier shown here), so that the smoothed estimates still sum to one.

Laplace smoothing should be used with care; it is generally intended to allow for predictions in rare events. As prediction data becomes increasingly distinct from training data, train new models when possible to account for a broader set of possible X values.
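
The following minimal Python sketch, written from scratch rather than with H2O, illustrates the estimates above for the two-level case, including the Laplace-smoothed variant; the toy data is invented for demonstration:

import numpy as np

def fit_naive_bayes(X, y, laplace=0):
    """Fit binomial Naive Bayes for binary features X (m x n) and binary response y.

    Returns the class prior phi_y and conditional estimates phi[j, c] = p(x_j = 1 | y = c),
    smoothed per the formulas above (the denominator grows by 2 * laplace for two feature levels).
    """
    n_features = X.shape[1]
    phi_y = y.mean()  # phi(y) = sum 1(y^(i) = 1) / m
    phi = np.empty((n_features, 2))
    for c in (0, 1):
        rows = X[y == c]  # training rows with response class c
        phi[:, c] = (rows.sum(axis=0) + laplace) / (len(rows) + 2 * laplace)
    return phi_y, phi

def predict_proba(x, phi_y, phi):
    """Posterior p(y = 1 | x) via the product rule from the section above."""
    like1 = phi_y * np.prod(np.where(x == 1, phi[:, 1], 1 - phi[:, 1]))
    like0 = (1 - phi_y) * np.prod(np.where(x == 1, phi[:, 0], 1 - phi[:, 0]))
    return like1 / (like1 + like0)

# Toy data: six rows, two binary features.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 1], [0, 1]])
y = np.array([1, 1, 0, 0, 1, 0])

phi_y, phi = fit_naive_bayes(X, y, laplace=1)
print(predict_proba(np.array([1, 0]), phi_y, phi))  # ~0.8 for this toy data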