seed

  • Available in: GBM, DRF, Deep Learning, GLM, GAM, HGLM, PCA, GLRM, Naïve-Bayes, K-Means, AutoML, XGBoost, Stacked Ensembles, Isolation Forest, Target Encoding, Extended Isolation Forest, Uplift DRF, AdaBoost, Decision Tree, ANOVAGLM, ModelSelection

  • Hyperparameter: yes

Description

This option specifies the random number generator (RNG) seed for algorithms that are dependent on randomization. When a seed is defined, the algorithm will behave deterministically.

The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

Note that in Naïve-Bayes, this option is only used for cross-validation and when fold_assignment="Random" or "AUTO".

Example

library(h2o)
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines <-  h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"] <- as.factor(airlines["Year"])
airlines["Month"] <- as.factor(airlines["Month"])
airlines["DayOfWeek"] <- as.factor(airlines["DayOfWeek"])
airlines["Cancelled"] <- as.factor(airlines["Cancelled"])
airlines['FlightNum'] <- as.factor(airlines['FlightNum'])

# set the predictor names and the response column name
predictors <- c("Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum")
response <- "IsDepDelayed"

# split into train and validation
airlines_splits <- h2o.splitFrame(data =  airlines, ratios = 0.8, seed = 1234)
train <- airlines_splits[[1]]
valid <- airlines_splits[[2]]


# try using the `seed` parameter with a stochastic parameter like `col_sample_rate`:
# run this model twice to see if the results are different (they will be the same)
# build your model:
gbm_w_seed_1 <- h2o.gbm(x = predictors, y = response, training_frame = train,
                        validation_frame = valid, col_sample_rate = 0.7 ,
                        seed = 1234)
gbm_w_seed_2 <- h2o.gbm(x = predictors, y = response, training_frame = train,
                        validation_frame = valid, col_sample_rate = 0.7 ,
                        seed = 1234)

# print the auc for the validation data for model with a seed
print(paste('auc for the 1st model built with a seed:',
            h2o.auc(gbm_w_seed_1, valid = TRUE)))
print(paste('auc for the 2nd model built with a seed:',
            h2o.auc(gbm_w_seed_2, valid = TRUE)))

# run the same model but without a seed:
# run this model twice to see if the results are different (they will be different)
# build your model:

gbm_wo_seed_1 <- h2o.gbm(x = predictors, y = response, training_frame = train,
                        validation_frame = valid, col_sample_rate = 0.7)
gbm_wo_seed_2 <- h2o.gbm(x = predictors, y = response, training_frame = train,
                        validation_frame = valid, col_sample_rate = 0.7)

# print the auc for the validation data for model with no seed
print(paste('auc for the 1st model built WITHOUT a seed:',
            h2o.auc(gbm_wo_seed_1, valid = TRUE)))
print(paste('auc for the 2nd model built WITHOUT a seed:',
            h2o.auc(gbm_wo_seed_2, valid = TRUE)))
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"]= airlines["Year"].asfactor()
airlines["Month"]= airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()

# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"

# split into train and validation sets
train, valid= airlines.split_frame(ratios = [.8], seed = 1234)

# try using the `seed` parameter with a stochastic parameter like `col_sample_rate`:
# run this model twice to see if the results are different (they will be the same)
# build your model:
gbm_w_seed_1 = H2OGradientBoostingEstimator(col_sample_rate = .7, seed = 1234)
gbm_w_seed_1.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

gbm_w_seed_2 = H2OGradientBoostingEstimator(col_sample_rate = .7, seed = 1234)
gbm_w_seed_2.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data for model with a seed
print('auc for the 1st model built with a seed:', gbm_w_seed_1.auc(valid=True))
print('auc for the 2nd model built with a seed:', gbm_w_seed_1.auc(valid=True))

# run the same model but without a seed:
# run this model twice to see if the results are different (they will be different)
# build your model:
gbm_wo_seed_1 = H2OGradientBoostingEstimator(col_sample_rate = .7)
gbm_wo_seed_1.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

gbm_wo_seed_2 = H2OGradientBoostingEstimator(col_sample_rate = .7)
gbm_wo_seed_2.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data for model with no seed
print('auc for the 1st model built WITHOUT a seed:', gbm_wo_seed_1.auc(valid=True))
print('auc for the 2nd model built WITHOUT a seed:', gbm_wo_seed_2.auc(valid=True))