Build gradient boosted classification or regression trees

Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set. The default distribution function will guess the model type based on the response column type. In order to run properly, the response column must be an numeric for "gaussian" or an enum for "bernoulli" or "multinomial".

h2o.gbm(
  x,
  y,
  training_frame,
  model_id = NULL,
  validation_frame = NULL,
  nfolds = 0,
  keep_cross_validation_models = TRUE,
  keep_cross_validation_predictions = FALSE,
  keep_cross_validation_fold_assignment = FALSE,
  score_each_iteration = FALSE,
  score_tree_interval = 0,
  fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
  fold_column = NULL,
  ignore_const_cols = TRUE,
  offset_column = NULL,
  weights_column = NULL,
  balance_classes = FALSE,
  class_sampling_factors = NULL,
  max_after_balance_size = 5,
  ntrees = 50,
  max_depth = 5,
  min_rows = 10,
  nbins = 20,
  nbins_top_level = 1024,
  nbins_cats = 1024,
  r2_stopping = Inf,
  stopping_rounds = 0,
  stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE",
    "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error",
    "custom", "custom_increasing"),
  stopping_tolerance = 0.001,
  max_runtime_secs = 0,
  seed = -1,
  build_tree_one_node = FALSE,
  learn_rate = 0.1,
  learn_rate_annealing = 1,
  distribution = c("AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian",
    "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom"),
  quantile_alpha = 0.5,
  tweedie_power = 1.5,
  huber_alpha = 0.9,
  checkpoint = NULL,
  sample_rate = 1,
  sample_rate_per_class = NULL,
  col_sample_rate = 1,
  col_sample_rate_change_per_level = 1,
  col_sample_rate_per_tree = 1,
  min_split_improvement = 1e-05,
  histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal",
    "RoundRobin", "UniformRobust"),
  max_abs_leafnode_pred = Inf,
  pred_noise_bandwidth = 0,
  categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary",
    "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
  calibrate_model = FALSE,
  calibration_frame = NULL,
  calibration_method = c("AUTO", "PlattScaling", "IsotonicRegression"),
  custom_metric_func = NULL,
  custom_distribution_func = NULL,
  export_checkpoints_dir = NULL,
  in_training_checkpoints_dir = NULL,
  in_training_checkpoints_tree_interval = 1,
  monotone_constraints = NULL,
  check_constant_response = TRUE,
  gainslift_bins = -1,
  auc_type = c("AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO", "WEIGHTED_OVO"),
  interaction_constraints = NULL,
  auto_rebalance = TRUE,
  verbose = FALSE
)

Arguments

x: (Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.
y: The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, then a regression model will be trained, otherwise it will train a classification model.
training_frame: Id of the training data frame.
model_id: Destination id for this model; auto-generated if not specified.
validation_frame: Id of the validation data frame.
nfolds: Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0.
keep_cross_validation_models: Logical. Whether to keep the cross-validation models. Defaults to TRUE.
keep_cross_validation_predictions: Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.
keep_cross_validation_fold_assignment: Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.
score_each_iteration: Logical. Whether to score during each iteration of model training. Defaults to FALSE.
score_tree_interval: Score the model after every so many trees. Disabled if set to 0. Defaults to 0.
fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.
fold_column: Column with cross-validation fold index assignment per observation.
ignore_const_cols: Logical. Ignore constant columns. Defaults to TRUE.
offset_column: Offset column. This will be added to the combination of columns before applying the link function.
weights_column: Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
balance_classes: Logical. Balance training data class counts via over/under-sampling (for imbalanced data). Defaults to FALSE.
class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
max_after_balance_size: Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Defaults to 5.0.
ntrees: Number of trees. Defaults to 50.
max_depth: Maximum tree depth (0 for unlimited). Defaults to 5.
min_rows: Fewest allowed (weighted) observations in a leaf. Defaults to 10.
nbins: For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point Defaults to 20.
nbins_top_level: For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level Defaults to 1024.
nbins_cats: For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. Defaults to 1024.
r2_stopping: r2_stopping is no longer supported and will be ignored if set - please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous version of H2O would stop making trees when the R^2 metric equals or exceeds this Defaults to 1.797693135e+308.
stopping_rounds: Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable) Defaults to 0.
stopping_metric: Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client. Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing". Defaults to AUTO.
stopping_tolerance: Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much) Defaults to 0.001.
max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.
seed: Seed for random numbers (affects certain parts of the algo that are stochastic and those might or might not be enabled by default). Defaults to -1 (time-based random number).
build_tree_one_node: Logical. Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets. Defaults to FALSE.
learn_rate: Learning rate (from 0.0 to 1.0) Defaults to 0.1.
learn_rate_annealing: Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999) Defaults to 1.
distribution: Distribution function Must be one of: "AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom". Defaults to AUTO.
quantile_alpha: Desired quantile for Quantile regression, must be between 0 and 1. Defaults to 0.5.
tweedie_power: Tweedie power for Tweedie regression, must be between 1 and 2. Defaults to 1.5.
huber_alpha: Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1). Defaults to 0.9.
checkpoint: Model checkpoint to resume training with.
sample_rate: Row sample rate per tree (from 0.0 to 1.0) Defaults to 1.
sample_rate_per_class: A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree
col_sample_rate: Column sample rate (from 0.0 to 1.0) Defaults to 1.
col_sample_rate_change_per_level: Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0) Defaults to 1.
col_sample_rate_per_tree: Column sample rate per tree (from 0.0 to 1.0) Defaults to 1.
min_split_improvement: Minimum relative improvement in squared error reduction for a split to happen Defaults to 1e-05.
histogram_type: What type of histogram to use for finding optimal split points Must be one of: "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin", "UniformRobust". Defaults to AUTO.
max_abs_leafnode_pred: Maximum absolute value of a leaf node prediction Defaults to 1.797693135e+308.
pred_noise_bandwidth: Bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions Defaults to 0.
categorical_encoding: Encoding scheme for categorical features Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.
calibrate_model: Logical. Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities. Defaults to FALSE.
calibration_frame: Data for model calibration
calibration_method: Calibration method to use Must be one of: "AUTO", "PlattScaling", "IsotonicRegression". Defaults to AUTO.
custom_metric_func: Reference to custom evaluation function, format: `language:keyName=funcName`
custom_distribution_func: Reference to custom distribution, format: `language:keyName=funcName`
export_checkpoints_dir: Automatically export generated models to this directory.
in_training_checkpoints_dir: Create checkpoints into defined directory while training process is still running. In case of cluster shutdown, this checkpoint can be used to restart training.
in_training_checkpoints_tree_interval: Checkpoint the model after every so many trees. Parameter is used only when in_training_checkpoints_dir is defined Defaults to 1.
monotone_constraints: A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
check_constant_response: Logical. Check if response column is constant. If enabled, then an exception is thrown if the response column is a constant value.If disabled, then model will train regardless of the response column being a constant value or not. Defaults to TRUE.
gainslift_bins: Gains/Lift table number of bins. 0 means disabled.. Default value -1 means automatic binning. Defaults to -1.
auc_type: Set default multinomial AUC type. Must be one of: "AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO", "WEIGHTED_OVO". Defaults to AUTO.
interaction_constraints: A set of allowed column interactions.
auto_rebalance: Logical. Allow automatic rebalancing of training and validation datasets Defaults to TRUE.
verbose: Logical. Print scoring history to the console (Metrics per tree). Defaults to FALSE.

Examples

if (FALSE) { # \dontrun{
library(h2o)
h2o.init()

# Run regression GBM on australia data
australia_path <- system.file("extdata", "australia.csv", package = "h2o")
australia <- h2o.uploadFile(path = australia_path)
independent <- c("premax", "salmax", "minairtemp", "maxairtemp", "maxsst",
                 "maxsoilmoist", "Max_czcs")
dependent <- "runoffnew"
h2o.gbm(y = dependent, x = independent, training_frame = australia,
        ntrees = 3, max_depth = 3, min_rows = 2)
} # }

Build gradient boosted classification or regression trees

Arguments

See also

Examples