metalearner_algorithm

  • Available in: Stacked Ensembles

  • Hyperparameter: no

Description

H2O’s Stacked Ensemble method is a supervised ensemble machine learning algorithm that finds the optimal combination of a collection of prediction algorithms using a process called stacking (or Super Learning). The algorithm that learns the optimal combination of the base learners is called the metalearning algorithm, or metalearner. By default, the metalearner is an H2O GLM with non-negative weights. The metalearner_algorithm option allows you to specify a different metalearner algorithm. Options include:

  • "AUTO" (GLM with non negative weights & standardization turned off, and if validation_frame is present, then lambda_search is set to True; may change over time). This is the default.

  • "glm" (GLM with default parameters)

  • "gbm" (GBM with default parameters)

  • "drf" (Random Forest with default parameters)

  • "deeplearning" (Deep Learning with default parameters)

  • "naivebayes" (NaiveBayes with default parameters)

  • "xgboost" (if available, XGBoost with default parameters)

The metalearner will use default hyperparameter values for the specified algorithm, unless you also specify metalearner_params to customize the hyperparameter values.
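
For example, here is a minimal R sketch of customizing a GBM metalearner via metalearner_params. It assumes the my_gbm and my_rf base models trained in the Example section below; the specific hyperparameter values are illustrative, not recommendations.

# Train a stacked ensemble with a GBM metalearner using custom hyperparameters
stack_gbm_tuned <- h2o.stackedEnsemble(x = x,
                                       y = y,
                                       training_frame = train,
                                       base_models = list(my_gbm, my_rf),
                                       metalearner_algorithm = "gbm",
                                       metalearner_params = list(ntrees = 100, max_depth = 3))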

The default parameters for the metalearning algorithms may not be the best choice, so it’s a good idea to experiment with different settings using metalearner_params. The next release of H2O will add an option to grid search over metalearner parameters using the standard H2O Grid interface, which will make tuning the metalearner easier. In the meantime, you can loop over candidate settings manually, as sketched below.
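
A rough R sketch of that manual workaround, again assuming the objects from the Example section (the candidate values are arbitrary placeholders):

# Candidate hyperparameter settings for a GBM metalearner (illustrative values)
gbm_params_list <- list(list(ntrees = 50, max_depth = 2),
                        list(ntrees = 100, max_depth = 3),
                        list(ntrees = 200, max_depth = 4))

# Train one ensemble per candidate setting and record its test AUC
aucs <- sapply(gbm_params_list, function(params) {
  ens <- h2o.stackedEnsemble(x = x,
                             y = y,
                             training_frame = train,
                             base_models = list(my_gbm, my_rf),
                             metalearner_algorithm = "gbm",
                             metalearner_params = params)
  h2o.auc(h2o.performance(ens, test))
})
aucs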

Example

R

library(h2o)
h2o.init()

# Import the higgs train and test datasets
train <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_test_5k.csv")

# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)

# Convert the response column in train and test datasets to a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])


# Set number of folds for base learners
nfolds <- 3

# Train & Cross-validate a GBM model
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train,
                  distribution = "bernoulli",
                  ntrees = 10,
                  nfolds = nfolds,
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)

# Train & Cross-validate an RF model
my_rf <- h2o.randomForest(x = x,
                          y = y,
                          training_frame = train,
                          ntrees = 10,
                          nfolds = nfolds,
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)


# Next we can train a few different ensembles using different metalearners

# Train a stacked ensemble using the default metalearner algorithm
stack <- h2o.stackedEnsemble(x = x,
                             y = y,
                             training_frame = train,
                             base_models = list(my_gbm, my_rf))
h2o.auc(h2o.performance(stack, test))
# 0.7570171

# Train a stacked ensemble using GBM as the metalearner algorithm
# The metalearner will use GBM default values
stack_gbm <- h2o.stackedEnsemble(x = x,
                                 y = y,
                                 training_frame = train,
                                 base_models = list(my_gbm, my_rf),
                                 metalearner_algorithm = "gbm")
h2o.auc(h2o.performance(stack_gbm, test))
# 0.7511055

# Train a stacked ensemble using RF as the metalearner algorithm
# The metalearner will use RF default values
stack_rf <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = train,
                                base_models = list(my_gbm, my_rf),
                                metalearner_algorithm = "drf")
h2o.auc(h2o.performance(stack_rf, test))
# 0.7232461

# Train a stacked ensemble using Deep Learning as the metalearner algorithm
# The metalearner will use Deep Learning default values
stack_dl <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = train,
                                base_models = list(my_gbm, my_rf),
                                metalearner_algorithm = "deeplearning")
h2o.auc(h2o.performance(stack_dl, test))
# 0.7571556

Python

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
h2o.init()

# Import the higgs train and test datasets
train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_train_5k.csv")
test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/testng/higgs_test_5k.csv")

# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)

# Convert the response column in train and test datasets to a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()


# Set number of folds for base learners
nfolds = 3

# Train and cross-validate a GBM model
my_gbm = H2OGradientBoostingEstimator(distribution="bernoulli",
                                      ntrees=10,
                                      nfolds=nfolds,
                                      fold_assignment="Modulo",
                                      keep_cross_validation_predictions=True,
                                      seed=1)
my_gbm.train(x=x, y=y, training_frame=train)

# Train and cross-validate an RF model
my_rf = H2ORandomForestEstimator(ntrees=50,
                                 nfolds=nfolds,
                                 fold_assignment="Modulo",
                                 keep_cross_validation_predictions=True,
                                 seed=1)
my_rf.train(x=x, y=y, training_frame=train)


# Next we can train a few different ensembles using different metalearners

# Train a stacked ensemble using the default metalearner algorithm
stack = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf])
stack.train(x=x, y=y, training_frame=train)
stack.model_performance(test).auc()
# 0.7522591310013634

# Train a stacked ensemble with a GBM metalearner algorithm
# The metalearner will use GBM default values
stack_gbm = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
                                        metalearner_algorithm="gbm")
stack_gbm.train(x=x, y=y, training_frame=train)
stack_gbm.model_performance(test).auc()
# 0.7522591310013634

# Train a stacked ensemble with an RF metalearner algorithm
# The metalearner will use RF default values
stack_rf = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
                                       metalearner_algorithm="drf")
stack_rf.train(x=x, y=y, training_frame=train)
stack_rf.model_performance(test).auc()
# 0.7016302070136065

# Train a stacked ensemble with a Deep Learning metalearner algorithm
# The metalearner will use Deep Learning default values
stack_dl = H2OStackedEnsembleEstimator(base_models=[my_gbm, my_rf],
                                       metalearner_algorithm="deeplearning")
stack_dl.train(x=x, y=y, training_frame=train)
stack_dl.model_performance(test).auc()
# 0.7634122856763638