Builds an eXtreme Gradient Boosting model using the native XGBoost backend.

h2o.xgboost(
  x,
  y,
  training_frame,
  model_id = NULL,
  validation_frame = NULL,
  nfolds = 0,
  keep_cross_validation_models = TRUE,
  keep_cross_validation_predictions = FALSE,
  keep_cross_validation_fold_assignment = FALSE,
  score_each_iteration = FALSE,
  fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"),
  fold_column = NULL,
  ignore_const_cols = TRUE,
  offset_column = NULL,
  weights_column = NULL,
  stopping_rounds = 0,
  stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE",
    "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error",
    "custom", "custom_increasing"),
  stopping_tolerance = 0.001,
  max_runtime_secs = 0,
  seed = -1,
  distribution = c("AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma",
    "tweedie", "laplace", "quantile", "huber"),
  tweedie_power = 1.5,
  categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary",
    "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
  quiet_mode = TRUE,
  checkpoint = NULL,
  export_checkpoints_dir = NULL,
  custom_metric_func = NULL,
  ntrees = 50,
  max_depth = 6,
  min_rows = 1,
  min_child_weight = 1,
  learn_rate = 0.3,
  eta = 0.3,
  sample_rate = 1,
  subsample = 1,
  col_sample_rate = 1,
  colsample_bylevel = 1,
  col_sample_rate_per_tree = 1,
  colsample_bytree = 1,
  colsample_bynode = 1,
  max_abs_leafnode_pred = 0,
  max_delta_step = 0,
  monotone_constraints = NULL,
  interaction_constraints = NULL,
  score_tree_interval = 0,
  min_split_improvement = 0,
  gamma = 0,
  nthread = -1,
  save_matrix_directory = NULL,
  build_tree_one_node = FALSE,
  parallelize_cross_validation = TRUE,
  calibrate_model = FALSE,
  calibration_frame = NULL,
  calibration_method = c("AUTO", "PlattScaling", "IsotonicRegression"),
  max_bins = 256,
  max_leaves = 0,
  sample_type = c("uniform", "weighted"),
  normalize_type = c("tree", "forest"),
  rate_drop = 0,
  one_drop = FALSE,
  skip_drop = 0,
  tree_method = c("auto", "exact", "approx", "hist"),
  grow_policy = c("depthwise", "lossguide"),
  booster = c("gbtree", "gblinear", "dart"),
  reg_lambda = 1,
  reg_alpha = 0,
  dmatrix_type = c("auto", "dense", "sparse"),
  backend = c("auto", "gpu", "cpu"),
  gpu_id = NULL,
  gainslift_bins = -1,
  auc_type = c("AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO", "WEIGHTED_OVO"),
  scale_pos_weight = 1,
  eval_metric = NULL,
  score_eval_metric_only = FALSE,
  verbose = FALSE
)

Arguments

x

(Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.

y

The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, a regression model is trained; otherwise, a classification model is trained.

training_frame

Id of the training data frame.

model_id

Destination id for this model; auto-generated if not specified.

validation_frame

Id of the validation data frame.

nfolds

Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0.

keep_cross_validation_models

Logical. Whether to keep the cross-validation models. Defaults to TRUE.

keep_cross_validation_predictions

Logical. Whether to keep the predictions of the cross-validation models. Defaults to FALSE.

keep_cross_validation_fold_assignment

Logical. Whether to keep the cross-validation fold assignment. Defaults to FALSE.

score_each_iteration

Logical. Whether to score during each iteration of model training. Defaults to FALSE.

fold_assignment

Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO.

fold_column

Column with cross-validation fold index assignment per observation.

ignore_const_cols

Logical. Ignore constant columns. Defaults to TRUE.

offset_column

Offset column. This will be added to the combination of columns before applying the link function.

weights_column

Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at that row is zero, which is incorrect. To get an accurate prediction, remove all rows with weight == 0.

stopping_rounds

Early stopping based on convergence of stopping_metric. Training stops if the simple moving average of length k of the stopping_metric does not improve for k scoring events, where k = stopping_rounds (0 to disable). Defaults to 0.

stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client. Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "AUCPR", "lift_top_group", "misclassification", "mean_per_class_error", "custom", "custom_increasing". Defaults to AUTO.

stopping_tolerance

Relative tolerance for the metric-based stopping criterion (stop if the relative improvement is not at least this much). Defaults to 0.001.

max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.

seed

Seed for random numbers (affects the stochastic parts of the algorithm, which may or may not be enabled by default). Defaults to -1 (time-based random number).

distribution

Distribution function. Must be one of: "AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber". Defaults to AUTO.

tweedie_power

Tweedie power for Tweedie regression, must be between 1 and 2. Defaults to 1.5.

categorical_encoding

Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.

quiet_mode

Logical. Enable quiet mode. Defaults to TRUE.

checkpoint

Model checkpoint to resume training with.

export_checkpoints_dir

Automatically export generated models to this directory.

custom_metric_func

Reference to a custom evaluation function, in the format `language:keyName=funcName`.

ntrees

(same as n_estimators) Number of trees. Defaults to 50.

max_depth

Maximum tree depth (0 for unlimited). Defaults to 6.

min_rows

(same as min_child_weight) Fewest allowed (weighted) observations in a leaf. Defaults to 1.

min_child_weight

(same as min_rows) Fewest allowed (weighted) observations in a leaf. Defaults to 1.

learn_rate

(same as eta) Learning rate (from 0.0 to 1.0). Defaults to 0.3.

eta

(same as learn_rate) Learning rate (from 0.0 to 1.0). Defaults to 0.3.

sample_rate

(same as subsample) Row sample rate per tree (from 0.0 to 1.0). Defaults to 1.

subsample

(same as sample_rate) Row sample rate per tree (from 0.0 to 1.0). Defaults to 1.

col_sample_rate

(same as colsample_bylevel) Column sample rate (from 0.0 to 1.0). Defaults to 1.

colsample_bylevel

(same as col_sample_rate) Column sample rate (from 0.0 to 1.0). Defaults to 1.

col_sample_rate_per_tree

(same as colsample_bytree) Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.

colsample_bytree

(same as col_sample_rate_per_tree) Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.

colsample_bynode

Column sample rate per tree node (from 0.0 to 1.0). Defaults to 1.

max_abs_leafnode_pred

(same as max_delta_step) Maximum absolute value of a leaf node prediction. Defaults to 0.0.

max_delta_step

(same as max_abs_leafnode_pred) Maximum absolute value of a leaf node prediction. Defaults to 0.0.

monotone_constraints

A mapping representing monotonic constraints. Use +1 to enforce an increasing constraint and -1 to specify a decreasing constraint.

interaction_constraints

A set of allowed column interactions.

score_tree_interval

Score the model after every so many trees. Disabled if set to 0. Defaults to 0.

min_split_improvement

(same as gamma) Minimum relative improvement in squared error reduction for a split to happen. Defaults to 0.0.

gamma

(same as min_split_improvement) Minimum relative improvement in squared error reduction for a split to happen. Defaults to 0.0.

nthread

Number of parallel threads that can be used to run XGBoost. Cannot exceed H2O cluster limits (-nthreads parameter). Defaults to -1 (maximum available).

save_matrix_directory

Directory in which to save the matrices passed to the XGBoost library. Useful for debugging.

build_tree_one_node

Logical. Run on one node only; no network overhead, but fewer CPUs are used. Suitable for small datasets. Defaults to FALSE.

parallelize_cross_validation

Logical. Allow parallel training of cross-validation models. Defaults to TRUE.

calibrate_model

Logical. Use Platt Scaling (default) or Isotonic Regression to calculate calibrated class probabilities. Calibration can provide more accurate estimates of class probabilities. Defaults to FALSE.

calibration_frame

Data for model calibration.

calibration_method

Calibration method to use. Must be one of: "AUTO", "PlattScaling", "IsotonicRegression". Defaults to AUTO.

max_bins

For tree_method=hist only: maximum number of bins. Defaults to 256.

max_leaves

For tree_method=hist only: maximum number of leaves. Defaults to 0.

sample_type

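For booster=dart only: type of sampling algorithm used to select the trees to drop. Must be one of: "uniform", "weighted". Defaults to uniform.

normalize_type

For booster=dart only: type of normalization algorithm. Must be one of: "tree", "forest". Defaults to tree.

rate_drop

For booster=dart only: dropout rate, the fraction of previous trees to drop (0..1). Defaults to 0.0.

one_drop

Logical. For booster=dart only: when enabled, at least one tree is always dropped during dropout. Defaults to FALSE.

skip_drop

For booster=dart only: probability of skipping the dropout procedure during a boosting iteration (0..1). Defaults to 0.0.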

tree_method

Tree method. Must be one of: "auto", "exact", "approx", "hist". Defaults to auto.

grow_policy

Grow policy: depthwise is standard GBM, lossguide is LightGBM. Must be one of: "depthwise", "lossguide". Defaults to depthwise.

booster

Booster type. Must be one of: "gbtree", "gblinear", "dart". Defaults to gbtree.

reg_lambda

L2 regularization. Defaults to 1.0.

reg_alpha

L1 regularization. Defaults to 0.0.

dmatrix_type

Type of DMatrix. For sparse, NAs and 0 are treated equally. Must be one of: "auto", "dense", "sparse". Defaults to auto.

backend

Backend. By default (auto), a GPU is used if available. Must be one of: "auto", "gpu", "cpu". Defaults to auto.

gpu_id

Which GPU(s) to use.

gainslift_bins

Number of bins in the Gains/Lift table. 0 disables the table; -1 means automatic binning. Defaults to -1.

auc_type

Set default multinomial AUC type. Must be one of: "AUTO", "NONE", "MACRO_OVR", "WEIGHTED_OVR", "MACRO_OVO", "WEIGHTED_OVO". Defaults to AUTO.

scale_pos_weight

Controls the effect of observations with positive labels in relation to the observations with negative labels on gradient calculation. Useful for imbalanced problems. Defaults to 1.0.

eval_metric

Specification of evaluation metric that will be passed to the native XGBoost backend.

score_eval_metric_only

Logical. If enabled, score only the evaluation metric. This can make model training faster if scoring is frequent (e.g., each iteration). Defaults to FALSE.

verbose

Logical. Print the scoring history to the console (metrics per tree). Defaults to FALSE.
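
The way stopping_rounds and stopping_tolerance interact can be illustrated with a small base-R sketch. It is a simplified reading of the rule as worded in the stopping_rounds and stopping_tolerance entries, for a metric where lower is better (such as logloss). The function name should_stop and its arguments are hypothetical and are not part of the h2o package, and H2O's internal implementation may differ in detail.

```r
# Hypothetical helper, not part of h2o: decide whether training would stop,
# given the stopping metric recorded at each scoring event (lower is better).
should_stop <- function(metric, k, tolerance = 0.001) {
  n <- length(metric)
  # k = 0 disables early stopping; at least 2 * k scoring events are needed
  # to compare a reference window against the k most recent windows.
  if (k <= 0 || n < 2 * k) return(FALSE)

  # Simple moving averages of length k ending at events (n - k), ..., n.
  ends <- (n - k):n
  ma <- vapply(ends, function(e) mean(metric[(e - k + 1):e]), numeric(1))

  reference   <- ma[1]        # moving average before the last k events
  best_recent <- min(ma[-1])  # best moving average within the last k events

  # Avoid dividing by zero when the reference is exactly 0.
  if (reference == 0) return(best_recent >= reference)

  # Stop when the relative improvement is smaller than the tolerance.
  (reference - best_recent) / abs(reference) < tolerance
}

# A metric that has flattened out triggers stopping.
should_stop(c(0.70, 0.60, 0.55, 0.55, 0.55, 0.55), k = 2)  # TRUE

# A metric that is still improving does not.
should_stop(c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4), k = 2)        # FALSE
```

Because the rule is evaluated per scoring event, score_tree_interval (or score_each_iteration) determines how many trees each of those k events spans.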

Examples

if (FALSE) {
library(h2o)
h2o.init()

# Import the titanic dataset
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
titanic <- h2o.importFile(f)

# Set the response as a factor; exclude columns 2:3 (the response and the
# passenger name) from the predictors
titanic['survived'] <- as.factor(titanic['survived'])
predictors <- setdiff(colnames(titanic), colnames(titanic)[2:3])
response <- "survived"

# Split the dataset into train and valid
splits <- h2o.splitFrame(data = titanic, ratios = 0.8, seed = 1234)
train <- splits[[1]]
valid <- splits[[2]]

# Train the XGB model
titanic_xgb <- h2o.xgboost(x = predictors, y = response,
                           training_frame = train, validation_frame = valid,
                           booster = "dart", normalize_type = "tree",
                           seed = 1234)
}
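
A second sketch, in the same style as the example above, shows how the cross-validation and early-stopping arguments might be combined. The specific values (5 folds, 500 trees, a learning rate of 0.1, scoring every 10 trees) are illustrative assumptions rather than recommendations, and the code assumes that a local H2O cluster can be started and that XGBoost is available on the platform.

```r
if (FALSE) {
library(h2o)
h2o.init()

# Import the titanic dataset and prepare predictors and response as above
f <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv"
titanic <- h2o.importFile(f)
titanic['survived'] <- as.factor(titanic['survived'])
predictors <- setdiff(colnames(titanic), colnames(titanic)[2:3])
response <- "survived"

# Train with 5-fold cross-validation. The model is scored every 10 trees,
# and training stops early once the moving average of logloss over
# 3 scoring events no longer improves by at least 0.1% (relative).
titanic_xgb_cv <- h2o.xgboost(x = predictors, y = response,
                              training_frame = titanic,
                              nfolds = 5,
                              ntrees = 500,
                              learn_rate = 0.1,
                              score_tree_interval = 10,
                              stopping_rounds = 3,
                              stopping_metric = "logloss",
                              stopping_tolerance = 0.001,
                              seed = 1234)

# Inspect the cross-validated metrics
h2o.logloss(titanic_xgb_cv, xval = TRUE)
h2o.auc(titanic_xgb_cv, xval = TRUE)
}
```

With ntrees set deliberately high, early stopping rather than the tree limit is expected to decide how many trees are built.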