remove_offset_effects

  • Available in: GLM

  • Hyperparameter: no

Description

This feature allows you to remove offset effects during scoring and metric calculation.

Model metrics and scoring history are calculated for both the restricted model (with offset effects removed) and the unrestricted model (with offset effect included).

To get the unrestricted model with its own metrics use glm.make_unrestricted_glm_model() / h2o.make_unrestricted_glm_model(glm).

If you set up the remove_offset_effects together with the control_variables model metrics and scoring history are calculated with both features enabled (that is, with both offset and control-variable effects removed during scoring). If you need to get a model with only one feature enabled, you can get it using glm.make_derived_glm_model(remove_control_variables_effects=True) or glm.make_derived_glm_model(remove_offset_effects=True). If both features are enabled and score_each_iteration=True or generate_scoring_history=True, training the model on big data can be slowed down. The complexity is four times higher than the standard GLM metric calculation.

Notes:

  • This option is experimental.

  • This option is not supported for multinomial, ordinal, or custom distributions.

  • This option is not available when cross validation is enabled.

  • This option is not available when Lambda search is enabled.

  • This option is not available when interactions are enabled.

Example

library(h2o)
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines <-  h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"] <- as.factor(airlines["Year"])
airlines["Month"] <- as.factor(airlines["Month"])
airlines["DayOfWeek"] <- as.factor(airlines["DayOfWeek"])
airlines["Cancelled"] <- as.factor(airlines["Cancelled"])
airlines['FlightNum'] <- as.factor(airlines['FlightNum'])

# set the predictor names and the response column name
predictors <- c("Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "FlightNum")
response <- "IsDepDelayed"

# split into train and validation
airlines_splits <- h2o.splitFrame(data =  airlines, ratios = 0.8)
train <- airlines_splits[[1]]
valid <- airlines_splits[[2]]

# try using the `remove_offset_effects` parameter:
airlines_glm <- h2o.glm(family = 'binomial', x = predictors, y = response, training_frame = train,
        validation_frame = valid,
        remove_collinear_columns = TRUE,
        score_each_iteration = TRUE,
        generate_scoring_history = TRUE,
        offset_column = "Distance",
        remove_offset_effects = TRUE)

# print the AUC for the validation data
print(h2o.auc(airlines_glm, valid = TRUE))

# take a look at the learning curve
h2o.learning_curve_plot(airlines_glm)

# get the unrestricted GLM model
unrestricted_airlines_glm <- h2o.make_unrestricted_glm_model(airlines_glm)
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init()

# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"]= airlines["Year"].asfactor()
airlines["Month"]= airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()

# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier", "DayOfWeek", "Month", "FlightNum"]
response = "IsDepDelayed"

# split into train and validation sets
train, valid= airlines.split_frame(ratios = [.8])

# try using the `remove_offset_effects` parameter:
# initialize your estimator
airlines_glm = H2OGeneralizedLinearEstimator(family = 'binomial',
                                             remove_collinear_columns = True,
                                             score_each_iteration = True,
                                             generate_scoring_history = True,
                                             offset_column = "Distance",
                                             remove_offset_effects = True)

# then train your model
airlines_glm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data
print(airlines_glm.auc(valid=True))

# take a look at the learning curve
airlines_glm.learning_curve_plot()

# get the unrestricted GLM model
unrestricted_airlines_glm = airlines_glm.make_unrestricted_glm_model()