H2O’s RuleFit algorithm combines tree ensembles and linear models to take advantage of both methods: the accuracy of a tree ensemble and the interpretability of a linear model.
The general algorithm fits a tree ensemble to the data, builds a rule ensemble by traversing each tree, evaluates the rules on the data to build a rule feature set, and fits a sparse linear model (LASSO) to the rule feature set joined with the original feature set.
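The rule-building and rule-evaluation steps can be sketched in plain Python. This is a minimal illustration, not H2O's implementation: the nested-dict tree layout, the feature names `age` and `income`, and the helper names are all hypothetical. It assumes one rule per non-root node, and the final LASSO fit is left out.

```python
# Toy tree as nested dicts (hypothetical layout): internal nodes split on
# "feature < threshold"; leaves are marked with "leaf".
tree = {
    "feature": "age", "threshold": 30,
    "left": {"leaf": True},
    "right": {
        "feature": "income", "threshold": 50,
        "left": {"leaf": True},
        "right": {"leaf": True},
    },
}


def extract_rules(node, conditions=()):
    """Traverse the tree and return one rule per non-root node.

    A rule is a tuple of (feature, operator, threshold) conditions that
    must all hold, i.e. the path from the root to that node.
    """
    if node.get("leaf"):
        return []
    rules = []
    for branch, op in (("left", "<"), ("right", ">=")):
        rule = conditions + ((node["feature"], op, node["threshold"]),)
        rules.append(rule)
        rules.extend(extract_rules(node[branch], rule))
    return rules


def rule_holds(rule, row):
    """Return True if every condition of the rule is satisfied by the row."""
    return all(
        row[feature] < threshold if op == "<" else row[feature] >= threshold
        for feature, op, threshold in rule
    )


def rule_features(rules, rows):
    """Evaluate each rule on each row to get a 0/1 rule feature set."""
    return [[int(rule_holds(rule, row)) for rule in rules] for row in rows]


rows = [{"age": 25, "income": 40}, {"age": 45, "income": 60}]
rules = extract_rules(tree)
features = rule_features(rules, rows)

# Join the rule features with the original features; a sparse linear
# model would then be fit to this combined matrix.
combined = [
    [row["age"], row["income"]] + rule_row
    for row, rule_row in zip(rows, features)
]
```

With a real ensemble, the same traversal would run over every tree, so the rule feature set grows with the number and depth of the trees.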
Defining a RuleFit Model (Beta API)
Parameters are optional unless specified as required.
algorithm: Specify the algorithm to use to fit a tree ensemble. Must be one of:
lambda: Specify the regularization strength for the LASSO regressor.
max_categorical_levels: RuleFit handles categorical features using the EnumLimited scheme: it automatically reduces categorical levels to the most prevalent ones and keeps only the max_categorical_levels most frequent levels. This option defaults to
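The level reduction can be illustrated with a short sketch. The helper `limit_levels` is hypothetical, and mapping the dropped levels to a single catch-all `"other"` value is an assumption made for this example; the text above only states that the most frequent levels are kept.

```python
from collections import Counter


def limit_levels(values, max_levels, other="other"):
    """Keep the max_levels most frequent levels; replace the rest.

    Replacing rare levels with a catch-all value is an assumption
    of this sketch.
    """
    counts = Counter(values)
    keep = {level for level, _ in counts.most_common(max_levels)}
    return [value if value in keep else other for value in values]


colors = ["red", "red", "red", "blue", "blue", "green", "teal"]
reduced = limit_levels(colors, max_levels=2)
```

Here `red` and `blue` survive because they are the two most frequent levels, while `green` and `teal` are collapsed.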
max_num_rules: The maximum number of rules to return. This option defaults to -1, which means the number of rules is selected by diminishing returns in model deviance.
max_rule_length: Specify the maximal depth of trees to be fit. This option defaults to
min_rule_length: Specify the minimal depth of trees to be fit. This option defaults to
model_type: Specify the type of base learners in the ensemble. Must be one of:
rules_and_linear: The algorithm fits a linear model to the rule feature set joined with the original feature set.
rules: The algorithm fits a linear model only to the rule feature set (no linear terms can become important).
linear: The algorithm fits a linear model only to the original feature set (no rule terms can become important).
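The effect of the three model_type values on the linear model's inputs can be sketched as a column selection. The function `design_columns` and the column names are hypothetical and only mirror the descriptions above.

```python
def design_columns(model_type, original_cols, rule_cols):
    """Return the columns the linear model sees for a given model_type."""
    if model_type == "rules_and_linear":
        return list(rule_cols) + list(original_cols)
    if model_type == "rules":
        return list(rule_cols)
    if model_type == "linear":
        return list(original_cols)
    raise ValueError(f"unknown model_type: {model_type!r}")


original_cols = ["age", "income"]
rule_cols = ["rule_1", "rule_2"]
both = design_columns("rules_and_linear", original_cols, rule_cols)
rules_only = design_columns("rules", original_cols, rule_cols)
linear_only = design_columns("linear", original_cols, rule_cols)
```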
remove_duplicates: Specify whether to remove rules which are identical to an earlier rule. This option defaults to
rule_generation_ntrees: Specify the number of trees for tree ensemble. This option defaults to
auc_type: Set the default multinomial AUC type. Must be one of:
distribution: Specify the distribution (i.e. the loss function). The options are:
bernoulli: response column must be 2-class categorical
multinomial: response column must be categorical
gaussian: response column must be numeric
poisson: response column must be numeric
gamma: response column must be numeric
laplace: response column must be numeric
quantile: response column must be numeric
huber: response column must be numeric
tweedie: response column must be numeric
model_id: Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternate configurations. This option defaults to -1 (time-based random number).
training_frame: (Required) Specify the dataset used to build the model.
Note: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.
validation_frame: Specify the dataset used to evaluate the accuracy of the model.
weights_column: Specify a column to use for the observation weights, which are used for bias correction. The specified weights_column must be included in the specified
Python only: To use a weights column when passing an H2OFrame to x instead of a list of column names, the specified training_frame must contain the specified
Note: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more due to the larger loss function pre-factor.
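The note on weights can be checked with a small worked example. The sketch below uses a squared-error loss purely for illustration (an assumption; the actual loss depends on the chosen distribution) and shows that a row with weight 2 contributes the same loss as that row repeated twice.

```python
def weighted_squared_error(rows, prediction):
    """Sum of weight * (y - prediction)^2 over (y, weight) pairs."""
    return sum(weight * (y - prediction) ** 2 for y, weight in rows)


# One row carries weight 2 ...
weighted = [(1.0, 1), (3.0, 2)]
# ... which matches repeating that row twice with weight 1.
repeated = [(1.0, 1), (3.0, 1), (3.0, 1)]

loss_weighted = weighted_squared_error(weighted, prediction=2.0)
loss_repeated = weighted_squared_error(repeated, prediction=2.0)

# Non-integer weights scale the row's contribution in the same way.
loss_half = weighted_squared_error([(3.0, 0.5)], prediction=2.0)
```

The weight acts as a multiplier on each row's loss term, which is why the data frame itself does not need to grow.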
x: Specify a vector containing the names or indices of the predictor variables to use when building the model. If x is missing, then all columns except
y: (Required) Specify the column to use as the dependent variable.
For a regression model, this column must be numeric (Real or Int).
For a classification model, this column must be categorical (Enum or String). If the distribution is bernoulli, this column cannot contain more than two levels.
Interpreting a RuleFit Model
The output for the RuleFit model includes:
rule importances in tabular form
training and validation metrics of the underlying linear model
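A hedged sketch of how these outputs might be retrieved from the Python client follows. It assumes the `h2o` package is installed, that `h2o.init()` has been called, and that the data is already loaded as an H2OFrame. The parameter values in `RULEFIT_PARAMS` are illustrative choices, not H2O's defaults, and the helper names `train_rulefit` and `report` are hypothetical.

```python
# Illustrative settings only; these values are assumptions, not defaults.
RULEFIT_PARAMS = {
    "model_type": "rules_and_linear",
    "min_rule_length": 1,
    "max_rule_length": 3,
    "max_num_rules": 100,
    "seed": 1234,
}


def train_rulefit(train, x, y, valid=None):
    """Train a RuleFit model on an H2OFrame and return it."""
    # Imported lazily so this module loads without a running H2O cluster.
    from h2o.estimators import H2ORuleFitEstimator

    model = H2ORuleFitEstimator(**RULEFIT_PARAMS)
    model.train(x=x, y=y, training_frame=train, validation_frame=valid)
    return model


def report(model, has_validation=False):
    """Print the rule importance table and the model metrics."""
    print(model.rule_importance())
    print(model.model_performance(train=True))
    if has_validation:
        print(model.model_performance(valid=True))
```

A call might look like `report(train_rulefit(frame, predictors, response))`, where `frame`, `predictors`, and `response` stand in for your own H2OFrame and column names.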