``nbins_cats`` -------------- - Available in: GBM, DRF, Uplift DRF - Hyperparameter: yes Description ~~~~~~~~~~~ When training models with categorical columns (factors), the ``nbins_cats`` option specifies the number of bins to be included in the histogram and then split at the best point. Because H2O does not perform `one-hot `__ encoding in the tree algorithms, we look at all the factor levels of a categorical predictor up to the resolution of the histogram, and then decide for each histogram bucket whether it goes left or right during splitting. When the training data contains columns with categorical levels (factors), these factors are split by assigning an integer to each distinct categorical level, then binning the ordered integers according to the user-specified number of bins (which defaults to 1024 bins), and then picking the optimal split point among the bins. For example, if you have levels A,B,C,D,E,F,G at a certain node to be split, and you specify ``nbins_cats=4``, then the buckets {A,B},{C,D},{E,F},{G} define the grouping during the first split. Only during the next split of {A,B} (down the tree) will GBM separate {A} and {B}. The value of ``nbins_cats`` for categorical factors has a much greater impact on the generalization error rate than ``nbins`` does for real- or integer-valued columns (where higher values mainly lead to more accurate numerical split points). For columns with many factors, a small ``nbins_cats`` value can add randomness to the split decisions (because the columns are grouped together somewhat arbitrarily), while large values (for example, values as large as the number of factor levels) can lead to perfect splits, resulting in `overfitting `__ on the training set (AUC=1 in certain datasets). So this option is a very important tuning parameter that can make a big difference on the validation set accuracy. The default value for ``nbins_cats`` is 1024. Note that this default value can lead to large communication overhead for deep distributed tree models. ``nbins_cats`` can go up to 65k, which should be enough for most datasets. To make a model more general, decrease ``nbins_top_level`` and ``nbins_cats``. To make a model more specific, increase ``nbins`` and/or ``nbins_top_level`` and ``nbins_cats``. Keep in mind that increasing ``nbins_cats`` can have a dramatic effect on the amount of overfitting. **Note**: Currently in H2O, if the number of categorical values in a dataset exceeds the value specified with ``nbins_cats``, then the values are grouped into bins by `lexical ordering `__. Related Parameters ~~~~~~~~~~~~~~~~~~ - `nbins `__ - `nbins_top_level `__ Example ~~~~~~~ .. tabs:: .. code-tab:: r R library(h2o) h2o.init() # import the airlines dataset: # This dataset is used to classify whether a flight will be delayed 'YES' or not "NO" # original data can be found at http://www.transtats.bts.gov/ airlines <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") # convert columns to factors airlines["Year"] <- as.factor(airlines["Year"]) airlines["Month"] <- as.factor(airlines["Month"]) airlines["DayOfWeek"] <- as.factor(airlines["DayOfWeek"]) airlines["Cancelled"] <- as.factor(airlines["Cancelled"]) airlines['FlightNum'] <- as.factor(airlines['FlightNum']) # set the predictor names and the response column name predictors <- c("Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum") response <- "IsDepDelayed" # split into train and validation airlines_splits <- h2o.splitFrame(data = airlines, ratios = 0.8, seed = 1234) train <- airlines_splits[[1]] valid <- airlines_splits[[2]] # number of factor levels range from 2 to 2439 # ('FlightNum', [2439]) # ('Origin', [132]) # ('Dest', [134]) # ('Year', [22]) # ('UniqueCarrier', [10]) # ('DayOfWeek', [7]) # ('Month', [2]) # try a range of nbins_cats: bin_num = c(8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096) label = c("8", "16" ,"32", "64", "128", "256", "512", "1024", "2048", "4096") lapply(seq_along(1:length(bin_num)),function(num) { airlines_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, validation_frame = valid, nbins_cats = bin_num[num], nfolds = 5, seed = 1234) # print the value used and AUC score for train and valid print(paste(label[num], 'training score', h2o.auc(airlines_gbm, train = TRUE))) print(paste(label[num], 'validation score', h2o.auc(airlines_gbm, valid = TRUE))) }) # Example of values to grid over for `nbins_cats` hyper_params <- list( nbins_cats = c(8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096) ) # this example uses cartesian grid search because the search space is small # and we want to see the performance of all models. For a larger search space use # random grid search instead: list(strategy = "RandomDiscrete") # this GBM uses early stopping once the validation AUC doesn't improve by at least 0.01% for # 5 consecutive scoring events grid <- h2o.grid(x = predictors, y = response, training_frame = train, validation_frame = valid, algorithm = "gbm", grid_id = "air_grid", hyper_params = hyper_params, stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC", search_criteria = list(strategy = "Cartesian"), seed = 1234) ## Sort the grid models by AUC sorted_grid <- h2o.getGrid("air_grid", sort_by = "auc", decreasing = TRUE) sorted_grid .. code-tab:: python import h2o from h2o.estimators.gbm import H2OGradientBoostingEstimator h2o.init() h2o.cluster().show_status() # import the airlines dataset: # This dataset is used to classify whether a flight will be delayed 'YES' or not "NO" # original data can be found at http://www.transtats.bts.gov/ airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip") # convert columns to factors airlines["Year"]= airlines["Year"].asfactor() airlines["Month"]= airlines["Month"].asfactor() airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor() airlines["Cancelled"] = airlines["Cancelled"].asfactor() airlines['FlightNum'] = airlines['FlightNum'].asfactor() # set the predictor names and the response column name predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"] response = "IsDepDelayed" # split into train and validation sets train, valid= airlines.split_frame(ratios = [.8], seed = 1234) # number of factor levels range from 2 to 2439 # ('FlightNum', [2439]) # ('Origin', [132]) # ('Dest', [134]) # ('Year', [22]) # ('UniqueCarrier', [10]) # ('DayOfWeek', [7]) # ('Month', [2]) # try a range of nbins_cats: bin_num = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096] label = ["8", "16", "32", "64", "128", "256", "512", "1024", "2048", "4096"] for key, num in enumerate(bin_num): # initialize the GBM estimator and set a seed for reproducibility airlines_gbm = H2OGradientBoostingEstimator(nbins_cats = num, seed =1234) airlines_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid) # print the value used and AUC score for train and valid print(label[key], 'training score', airlines_gbm.auc(train = True)) print(label[key], 'validation score', airlines_gbm.auc(valid = True)) # Example of values to grid over for `nbins_cats` # import Grid Search from h2o.grid.grid_search import H2OGridSearch # select the values for nbins_cats to grid over hyper_params = {'nbins_cats': [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]} # this example uses cartesian grid search because the search space is small # and we want to see the performance of all models. For a larger search space use # random grid search instead: {'strategy': "RandomDiscrete"} # initialize the GBM estimator # use early stopping once the validation AUC doesn't improve by at least 0.01% for # 5 consecutive scoring events airlines_gbm_2 = H2OGradientBoostingEstimator(seed = 1234, stopping_rounds = 5, stopping_metric = "AUC", stopping_tolerance = 1e-4) # build grid search with previously made GBM and hyper parameters grid = H2OGridSearch(model = airlines_gbm_2, hyper_params = hyper_params, search_criteria = {'strategy': "Cartesian"}) # train using the grid grid.train(x = predictors, y = response, training_frame = train, validation_frame = valid) # sort the grid models by decreasing AUC sorted_grid = grid.get_grid(sort_by = 'auc', decreasing = True) print(sorted_grid)