Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set. With the default distribution ("AUTO"), the model type is guessed from the response column type. To run properly, the response column must be numeric for "gaussian" or an enum (factor) for "bernoulli" or "multinomial".
h2o.gbm(
  x, y, training_frame, model_id = NULL, validation_frame = NULL,
  nfolds = 0, keep_cross_validation_models = TRUE,
  keep_cross_validation_predictions = FALSE,
  keep_cross_validation_fold_assignment = FALSE,
  score_each_iteration = FALSE, score_tree_interval = 0,
  fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
  fold_column = NULL, ignore_const_cols = TRUE, offset_column = NULL,
  weights_column = NULL, balance_classes = FALSE,
  class_sampling_factors = NULL, max_after_balance_size = 5,
  max_hit_ratio_k = 0, ntrees = 50, max_depth = 5, min_rows = 10,
  nbins = 20, nbins_top_level = 1024, nbins_cats = 1024,
  r2_stopping = Inf, stopping_rounds = 0,
  stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE",
    "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification",
    "mean_per_class_error", "custom", "custom_increasing"),
  stopping_tolerance = 0.001, max_runtime_secs = 0, seed = -1,
  build_tree_one_node = FALSE, learn_rate = 0.1, learn_rate_annealing = 1,
  distribution = c("AUTO", "bernoulli", "quasibinomial", "multinomial",
    "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile",
    "huber", "custom"),
  quantile_alpha = 0.5, tweedie_power = 1.5, huber_alpha = 0.9,
  checkpoint = NULL, sample_rate = 1, sample_rate_per_class = NULL,
  col_sample_rate = 1, col_sample_rate_change_per_level = 1,
  col_sample_rate_per_tree = 1, min_split_improvement = 1e-05,
  histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal",
    "RoundRobin"),
  max_abs_leafnode_pred = Inf, pred_noise_bandwidth = 0,
  categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit",
    "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
  calibrate_model = FALSE, calibration_frame = NULL,
  custom_metric_func = NULL, custom_distribution_func = NULL,
  export_checkpoints_dir = NULL, monotone_constraints = NULL,
  check_constant_response = TRUE, verbose = FALSE
)
x  (Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used. 

y  The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, then a regression model will be trained, otherwise it will train a classification model. 
training_frame  Id of the training data frame. 
model_id  Destination id for this model; autogenerated if not specified. 
validation_frame  Id of the validation data frame. 
nfolds  Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0.
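keep_cross_validation_models  Logical. Whether to keep the cross-validation models. Defaults to TRUE.
keep_cross_validation_predictions  Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.
keep_cross_validation_fold_assignment  Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.
score_each_iteration  Logical. Whether to score during each iteration of model training. Defaults to FALSE.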
score_tree_interval  Score the model after every so many trees. Disabled if set to 0. Defaults to 0. 
fold_assignment  Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.
fold_column  Column with cross-validation fold index assignment per observation.
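As an illustration of fold_column, here is a minimal base R sketch that builds a fold index column by hand. It assumes that a "Modulo"-style scheme assigns row i to fold (i - 1) mod nfolds; the data frame and the column name `fold` are hypothetical.

```r
# Hypothetical toy data; only the number of rows matters here.
df <- data.frame(x = rnorm(10), y = rnorm(10))
nfolds <- 3

# Modulo-style assignment: rows cycle through folds 0, 1, 2, 0, 1, 2, ...
df$fold <- (seq_len(nrow(df)) - 1) %% nfolds

# Fold sizes differ by at most one row.
table(df$fold)

# With a running H2O cluster the column could then be passed by name, e.g.:
# h2o.gbm(y = "y", x = "x", training_frame = as.h2o(df), fold_column = "fold")
```

When fold_column is supplied, nfolds and fold_assignment are left at their defaults, since the folds are already fully determined by the column.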
ignore_const_cols  Logical. Ignore constant columns. Defaults to TRUE.
offset_column  Offset column. This will be added to the combination of columns before applying the link function. 
weights_column  Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function prefactor.
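The equivalence between weights and repeated rows can be checked on the simplest possible "model", a constant prediction equal to the mean of the response. This base R sketch only illustrates the idea; it is not H2O's internal code, and the values are made up.

```r
y <- c(1, 4, 10)   # hypothetical response values
w <- c(2, 1, 0)    # weight 2 = row counted twice, weight 0 = row excluded

# Weighted mean using the per-row weights
weighted <- weighted.mean(y, w)

# Plain mean after physically repeating (or dropping) rows
expanded <- mean(rep(y, times = w))

all.equal(weighted, expanded)
```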
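balance_classes  Logical. Balance training data class counts via over/under-sampling (for imbalanced data). Defaults to FALSE.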
class_sampling_factors  Desired over/undersampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. 
max_after_balance_size  Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Defaults to 5.0. 
max_hit_ratio_k  Maximum number (top K) of predictions to use for hit ratio computation (for multiclass only, 0 to disable). Defaults to 0.
ntrees  Number of trees. Defaults to 50. 
max_depth  Maximum tree depth. Defaults to 5. 
min_rows  Fewest allowed (weighted) observations in a leaf. Defaults to 10. 
nbins  For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point. Defaults to 20.
nbins_top_level  For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by a factor of two per level. Defaults to 1024.
nbins_cats  For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. Defaults to 1024. 
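Read together, nbins and nbins_top_level suggest a bin count for numerical columns that halves at each tree level until it reaches the nbins floor. The following base R sketch works through that reading with the default values; it is an interpretation of the two descriptions above, and the exact behavior inside H2O may differ in detail.

```r
nbins <- 20
nbins_top_level <- 1024
depth <- 0:7   # 0 is the root level

# Halve the top-level bin count per level, but never go below nbins
bins_per_level <- pmax(nbins, nbins_top_level / 2^depth)

data.frame(depth = depth, bins = bins_per_level)
```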
r2_stopping  r2_stopping is no longer supported and will be ignored if set; please use stopping_rounds, stopping_metric and stopping_tolerance instead. Previous versions of H2O would stop making trees when the R^2 metric equaled or exceeded this value. Defaults to 1.797693135e+308.
stopping_rounds  Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable). Defaults to 0.
stopping_metric  Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client. Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing". Defaults to AUTO.
stopping_tolerance  Relative tolerance for the metric-based stopping criterion (stop if the relative improvement is not at least this much). Defaults to 0.001.
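The interplay of stopping_rounds, stopping_metric and stopping_tolerance can be sketched in base R. This is a simplified reading of the descriptions above, not H2O's exact implementation: it assumes a positive metric where lower is better (such as logloss), and the function name `should_stop` is hypothetical.

```r
# Simplified sketch of moving-average early stopping; NOT H2O's exact code.
# metric: vector of scoring-event values, oldest first (lower is better)
# k:      stopping_rounds
# tol:    stopping_tolerance
should_stop <- function(metric, k, tol) {
  n <- length(metric)
  if (k == 0 || n < 2 * k) return(FALSE)   # disabled, or too few scoring events

  # Simple moving averages of length k over the last 2k scoring events
  recent <- metric[(n - 2 * k + 1):n]
  sma <- sapply(1:(k + 1), function(i) mean(recent[i:(i + k - 1)]))

  reference <- sma[1]          # oldest moving average in the window
  best_recent <- min(sma[-1])  # best of the k most recent moving averages

  # Stop when the relative improvement over the reference is below tol
  (reference - best_recent) / abs(reference) < tol
}

should_stop(c(1, 0.9, 0.8, 0.7, 0.6, 0.5), k = 3, tol = 0.001)        # still improving
should_stop(c(1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), k = 3, tol = 0.001)   # plateau
```

For a metric where higher is better, such as AUC, the comparison would be reversed.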
max_runtime_secs  Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0. 
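seed  Seed for random numbers (affects certain parts of the algorithm that are stochastic, which might or might not be enabled by default). Defaults to -1 (time-based random number).
build_tree_one_node  Logical. Run on one node only; no network overhead but fewer CPUs used. Suitable for small datasets. Defaults to FALSE.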
learn_rate  Learning rate (from 0.0 to 1.0). Defaults to 0.1.
learn_rate_annealing  Scale the learning rate by this factor after each tree (e.g., 0.99 or 0.999). Defaults to 1.
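To see how learn_rate and learn_rate_annealing combine, the per-tree learning rate can be tabulated in base R. The sketch assumes the first tree uses the unscaled learn_rate and each later tree multiplies the previous rate by learn_rate_annealing; the value 0.99 is just one of the examples mentioned above.

```r
learn_rate <- 0.1
learn_rate_annealing <- 0.99
ntrees <- 50

# Rate applied to tree t: learn_rate * learn_rate_annealing^(t - 1)
tree <- seq_len(ntrees)
effective_rate <- learn_rate * learn_rate_annealing^(tree - 1)

# Rates for the first, tenth and last tree
round(effective_rate[c(1, 10, ntrees)], 4)
```

With annealing below 1, later trees contribute progressively less, which is why such settings are usually paired with a larger ntrees.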
distribution  Distribution function. Must be one of: "AUTO", "bernoulli", "quasibinomial", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber", "custom". Defaults to AUTO.
quantile_alpha  Desired quantile for Quantile regression, must be between 0 and 1. Defaults to 0.5. 
tweedie_power  Tweedie power for Tweedie regression, must be between 1 and 2. Defaults to 1.5. 
huber_alpha  Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1). Defaults to 0.9.
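The role of quantile_alpha is easiest to see through the standard quantile ("pinball") loss. The base R sketch below uses the textbook definition of that loss, not code taken from H2O, and shows that with alpha = 0.5 the best constant prediction is the median. The function name `pinball_loss` and the data are hypothetical.

```r
# Textbook quantile (pinball) loss for a desired quantile alpha
pinball_loss <- function(y, pred, alpha) {
  err <- y - pred
  # Under-predictions cost alpha per unit, over-predictions cost (1 - alpha)
  mean(ifelse(err >= 0, alpha * err, (alpha - 1) * err))
}

y <- c(1, 2, 3, 4, 100)   # hypothetical response with one outlier

# Evaluate each observed value as a candidate constant prediction
losses <- sapply(y, function(p) pinball_loss(y, p, alpha = 0.5))
best <- y[which.min(losses)]

c(best = best, median = median(y))
```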
checkpoint  Model checkpoint to resume training with. 
sample_rate  Row sample rate per tree (from 0.0 to 1.0). Defaults to 1.
sample_rate_per_class  A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree.
col_sample_rate  Column sample rate (from 0.0 to 1.0). Defaults to 1.
col_sample_rate_change_per_level  Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0). Defaults to 1.
col_sample_rate_per_tree  Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.
min_split_improvement  Minimum relative improvement in squared error reduction for a split to happen. Defaults to 1e-05.
histogram_type  What type of histogram to use for finding optimal split points. Must be one of: "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin". Defaults to AUTO.
max_abs_leafnode_pred  Maximum absolute value of a leaf node prediction. Defaults to 1.797693135e+308.
pred_noise_bandwidth  Bandwidth (sigma) of Gaussian multiplicative noise ~N(1,sigma) for tree node predictions. Defaults to 0.
categorical_encoding  Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.
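calibrate_model  Logical. Use Platt Scaling to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities. Defaults to FALSE.
calibration_frame  Calibration frame for Platt Scaling.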
custom_metric_func  Reference to custom evaluation function, format: `language:keyName=funcName` 
custom_distribution_func  Reference to custom distribution, format: `language:keyName=funcName` 
export_checkpoints_dir  Automatically export generated models to this directory. 
monotone_constraints  A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.
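In R, such a mapping is naturally written as a named list. The following is a small sketch; the predictor names `age` and `price`, the response name and the frame `train` are all hypothetical.

```r
# Hypothetical predictor names; replace with columns from your own frame.
constraints <- list(age = 1, price = -1)

# Sanity check against the description above:
# every entry is +1 (increasing) or -1 (decreasing).
stopifnot(all(unlist(constraints) %in% c(-1, 1)))

# With a running H2O cluster the mapping could be passed as, e.g.:
# h2o.gbm(y = "response", training_frame = train,
#         monotone_constraints = constraints)
```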
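check_constant_response  Logical. Check if the response column is constant. If enabled, an exception is thrown when the response column is a constant value. If disabled, the model will train regardless of whether the response column is constant. Defaults to TRUE.
verbose  Logical. Print scoring history to the console (metrics per tree). Defaults to FALSE.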
predict.H2OModel for prediction
# NOT RUN {
library(h2o)
h2o.init()

# Run regression GBM on australia data
australia_path <- system.file("extdata", "australia.csv", package = "h2o")
australia <- h2o.uploadFile(path = australia_path)
independent <- c("premax", "salmax", "minairtemp", "maxairtemp", "maxsst",
                 "maxsoilmoist", "Max_czcs")
dependent <- "runoffnew"
h2o.gbm(y = dependent, x = independent, training_frame = australia,
        ntrees = 3, max_depth = 3, min_rows = 2)
# }
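Building on the same australia data, the sketch below shows how a fitted model might be used for prediction. It is wrapped in a function (the name `gbm_fit_and_predict` is hypothetical) so that nothing runs until it is called; calling it assumes the h2o package is installed and a Java runtime is available to start the cluster.

```r
# Nothing is executed until the function is called explicitly.
gbm_fit_and_predict <- function() {
  h2o::h2o.init()

  australia_path <- system.file("extdata", "australia.csv", package = "h2o")
  australia <- h2o::h2o.uploadFile(path = australia_path)

  predictors <- c("premax", "salmax", "minairtemp", "maxairtemp", "maxsst",
                  "maxsoilmoist", "Max_czcs")

  # Same small regression model as in the example above
  model <- h2o::h2o.gbm(y = "runoffnew", x = predictors,
                        training_frame = australia,
                        ntrees = 3, max_depth = 3, min_rows = 2)

  # For a regression model the result is an H2OFrame with a "predict" column
  h2o::h2o.predict(model, newdata = australia)
}

# NOT RUN {
# preds <- gbm_fit_and_predict()
# head(preds)
# }
```

Predicting on the training frame is done here only to keep the sketch short; in practice a held-out frame would be passed as newdata.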