Builds an Uplift Random Forest model on an H2OFrame.

h2o.upliftRandomForest(
  x,
  y,
  training_frame,
  treatment_column,
  model_id = NULL,
  validation_frame = NULL,
  score_each_iteration = FALSE,
  score_tree_interval = 0,
  ignore_const_cols = TRUE,
  ntrees = 50,
  max_depth = 20,
  min_rows = 1,
  nbins = 20,
  nbins_top_level = 1024,
  nbins_cats = 1024,
  max_runtime_secs = 0,
  seed = -1,
  mtries = -2,
  sample_rate = 0.632,
  sample_rate_per_class = NULL,
  col_sample_rate_change_per_level = 1,
  col_sample_rate_per_tree = 1,
  histogram_type = c("AUTO", "UniformAdaptive", "Random", "QuantilesGlobal",
    "RoundRobin", "UniformRobust"),
  categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary",
    "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"),
  distribution = c("AUTO", "bernoulli"),
  check_constant_response = TRUE,
  custom_metric_func = NULL,
  uplift_metric = c("AUTO", "KL", "Euclidean", "ChiSquared"),
  auuc_type = c("AUTO", "qini", "lift", "gain"),
  auuc_nbins = -1,
  stopping_rounds = 0,
  stopping_metric = c("AUTO", "AUUC", "ATE", "ATT", "ATC", "qini"),
  stopping_tolerance = 0.001,
  verbose = FALSE
)

Arguments

x

(Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used.

y

The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, a regression model is trained; otherwise, a classification model is trained.

training_frame

Id of the training data frame.

treatment_column

The column used to compute the uplift gain when selecting the best split for a tree. The column must divide the dataset into treatment (value 1) and control (value 0) groups. Defaults to treatment.

model_id

Destination id for this model; auto-generated if not specified.

validation_frame

Id of the validation data frame.

score_each_iteration

Logical. Whether to score during each iteration of model training. Defaults to FALSE.

score_tree_interval

Score the model after every so many trees. Disabled if set to 0. Defaults to 0.

ignore_const_cols

Logical. Ignore constant columns. Defaults to TRUE.

ntrees

Number of trees. Defaults to 50.

max_depth

Maximum tree depth (0 for unlimited). Defaults to 20.

min_rows

Fewest allowed (weighted) observations in a leaf. Defaults to 1.

nbins

For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point. Defaults to 20.

nbins_top_level

For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by a factor of two per level. Defaults to 1024.
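
To illustrate how nbins and nbins_top_level interact, here is a minimal sketch in plain R. It assumes the halving is floored at nbins (suggested by the "at least" wording for nbins); the helper name bins_at_depth is hypothetical and not part of the h2o package.

```r
# Hypothetical helper (not an h2o function): approximate number of histogram
# bins used for numeric columns at a given tree depth.
# Assumption: the bin count halves per level and never drops below nbins.
bins_at_depth <- function(depth, nbins = 20, nbins_top_level = 1024) {
  pmax(nbins, nbins_top_level %/% 2^depth)
}

# Bin counts for the root (depth 0) through depth 7 with the default settings.
bins_per_level <- bins_at_depth(0:7)
print(bins_per_level)
```

Under this assumption, the default settings use fine-grained histograms near the root and fall back to nbins after a few levels.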

nbins_cats

For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. Defaults to 1024.

max_runtime_secs

Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0.

seed

Seed for random number generation. It affects the stochastic parts of the algorithm, which may or may not be enabled by default. Defaults to -1 (time-based random number).

mtries

Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors). Defaults to -2.
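
The rule for mtries = -1 can be sketched in plain R as follows. The function name default_mtries is hypothetical, and the rounding (floor, with a minimum of 1) is an assumption; the text above only states sqrt(p) and p/3.

```r
# Hypothetical helper (not an h2o function): number of candidate predictors
# per split when mtries = -1.
# Assumption: results are rounded down and at least one predictor is kept.
default_mtries <- function(p, classification = TRUE) {
  m <- if (classification) floor(sqrt(p)) else floor(p / 3)
  max(1, m)
}

default_mtries(16)                          # classification with 16 predictors
default_mtries(30, classification = FALSE)  # regression with 30 predictors
```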

sample_rate

Row sample rate per tree (from 0.0 to 1.0). Defaults to 0.632.

sample_rate_per_class

A list of row sample rates per class (relative fraction for each class, from 0.0 to 1.0), for each tree.

col_sample_rate_change_per_level

Relative change of the column sampling rate for every level (must be > 0.0 and <= 2.0). Defaults to 1.

col_sample_rate_per_tree

Column sample rate per tree (from 0.0 to 1.0). Defaults to 1.

histogram_type

The type of histogram to use for finding optimal split points. Must be one of: "AUTO", "UniformAdaptive", "Random", "QuantilesGlobal", "RoundRobin", "UniformRobust". Defaults to AUTO.

categorical_encoding

Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO.

distribution

Distribution function. Must be one of: "AUTO", "bernoulli". Defaults to AUTO.

check_constant_response

Logical. Check if the response column is constant. If enabled, an exception is thrown when the response column has a constant value. If disabled, the model trains regardless of whether the response column is constant. Defaults to TRUE.

custom_metric_func

Reference to custom evaluation function, format: `language:keyName=funcName`

uplift_metric

Divergence metric used to find best split when building an uplift tree. Must be one of: "AUTO", "KL", "Euclidean", "ChiSquared". Defaults to AUTO.
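
As a rough illustration of what the three divergence options measure, the sketch below computes the textbook definitions of each divergence between the outcome distribution in the treatment group and in the control group. This is an assumption-laden example in plain R: the function names are hypothetical, and the exact formulas, normalization, and zero handling inside H2O may differ.

```r
# Hypothetical helpers (not h2o functions). Each takes two discrete probability
# distributions: p for the treatment group, q for the control group.
# Assumption: all probabilities are strictly between 0 and 1.
kl_divergence          <- function(p, q) sum(p * log(p / q))
euclidean_divergence   <- function(p, q) sum((p - q)^2)
chi_squared_divergence <- function(p, q) sum((p - q)^2 / q)

# Binary outcome: c(P(y = 1), P(y = 0)) in each group.
p_treat <- c(0.50, 0.50)  # 50% response rate under treatment
p_ctrl  <- c(0.25, 0.75)  # 25% response rate under control

kl  <- kl_divergence(p_treat, p_ctrl)
euc <- euclidean_divergence(p_treat, p_ctrl)
chi <- chi_squared_divergence(p_treat, p_ctrl)
print(c(KL = kl, Euclidean = euc, ChiSquared = chi))
```

All three are zero when the treatment and control distributions coincide and grow as they diverge, which is why a split that separates responders to treatment scores well.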

auuc_type

Metric used to calculate Area Under Uplift Curve. Must be one of: "AUTO", "qini", "lift", "gain". Defaults to AUTO.

auuc_nbins

Number of bins to calculate Area Under Uplift Curve. Defaults to -1.

stopping_rounds

Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable). Defaults to 0.
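
The moving-average rule can be sketched in plain R as below. This is a simplified illustration, not H2O's implementation: the function name should_stop is hypothetical, the metric is assumed to be "higher is better" (as with AUUC), and the comparison against the moving average from k events earlier is an assumption about how "does not improve" is evaluated.

```r
# Hypothetical helper (not an h2o function): decide whether to stop early.
# history:   metric value at each scoring event (higher is better, nonzero)
# k:         stopping_rounds
# tolerance: stopping_tolerance (minimum relative improvement)
should_stop <- function(history, k, tolerance = 0.001) {
  n <- length(history)
  # Disabled, or not enough scoring events to compare two windows.
  if (k == 0 || n < 2 * k) return(FALSE)
  # Simple moving averages of length k over the scoring history.
  sma <- sapply(k:n, function(i) mean(history[(i - k + 1):i]))
  m <- length(sma)
  recent_best <- max(sma[(m - k + 1):m])  # best of the last k moving averages
  reference   <- sma[m - k]               # moving average just before them
  improvement <- (recent_best - reference) / abs(reference)
  improvement < tolerance
}

should_stop(c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5), k = 3)  # flat metric
should_stop(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), k = 3)  # still improving
```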

stopping_metric

Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python client. Must be one of: "AUTO", "AUUC", "ATE", "ATT", "ATC", "qini". Defaults to AUTO.

stopping_tolerance

Relative tolerance for the metric-based stopping criterion (stop if the relative improvement is not at least this much). Defaults to 0.001.

verbose

Logical. Print scoring history to the console (Metrics per tree). Defaults to FALSE.

Value

Creates an H2OModel object of the right type.

See also

predict.H2OModel for prediction
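
Examples

The following is a minimal usage sketch, assuming the h2o package is installed and a local H2O cluster can be started. The data is synthetic and the column names (f1, f2, treatment, conversion) are made up for illustration; both the treatment column and the response are converted to factors before training.

```r
library(h2o)
h2o.init()

# Synthetic data (illustrative only): treatment raises the conversion
# probability when f1 is positive.
set.seed(42)
n <- 1000
df <- data.frame(
  f1 = rnorm(n),
  f2 = rnorm(n),
  treatment = rbinom(n, 1, 0.5)
)
p_convert <- 0.2 + 0.2 * df$treatment * (df$f1 > 0)
df$conversion <- rbinom(n, 1, p_convert)

# Upload to H2O and mark treatment and response as categorical.
train <- as.h2o(df)
train$treatment <- as.factor(train$treatment)
train$conversion <- as.factor(train$conversion)

# Train a small uplift forest; the argument values are arbitrary examples.
model <- h2o.upliftRandomForest(
  x = c("f1", "f2"),
  y = "conversion",
  training_frame = train,
  treatment_column = "treatment",
  ntrees = 10,
  max_depth = 5,
  uplift_metric = "KL",
  auuc_type = "qini",
  seed = 1234
)

# Score the training frame and inspect the first rows of the predictions.
preds <- h2o.predict(model, train)
head(preds)
```

Setting seed makes the run repeatable on the same cluster configuration; the small ntrees and max_depth values keep the example fast rather than accurate.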