Creating & Configuring H2O AutoDoc

This section includes the code examples for setting up a model, along with basic and advanced H2O AutoDoc configurations. If you want to experiment with a complete end-to-end example, run the Building an H2O Model code example before running one of the H2O AutoDoc-specific examples.

The H2O AutoDoc setup requires:

  • a license key (see H2O AutoDoc License Key for an example of setting your license)

  • a running H2O Cluster

  • a trained model

Building an H2O Model

# import h2o and initialize h2o cluster
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

# import datasets for training and validation
train_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-train.csv"
valid_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-test.csv"

# import the train and valid dataset
train = h2o.import_file(train_path, destination_frame='CreditCard_Cat-train.csv')
valid = h2o.import_file(valid_path, destination_frame='CreditCard_Cat-test.csv')

# set predictors and response
# (exclude the ID column and the response itself from the predictors)
response = "DEFAULT_PAYMENT_NEXT_MONTH"
predictors = [col for col in train.columns if col not in ("ID", response)]

# convert target to factor
train[response] = train[response].asfactor()
valid[response] = valid[response].asfactor()

# assign IDs for later use
h2o.assign(train, "CreditCard_TRAIN")
h2o.assign(valid, "CreditCard_VALID")

# build an H2O-3 GBM Model
gbm = H2OGradientBoostingEstimator(model_id="gbm_model", seed=1234)
gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

Generate a Default H2O AutoDoc

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "full/path/to/your/autodoc/autodoc_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# set your AutoDoc configurations
config = Config(output_path=output_file_path)

# render your AutoDoc
render_autodoc(h2o, config, model)

Push Generated AutoDoc to S3 Bucket

# Parameters the User Must Set: output_file_path
# specify the S3 URI/URL where you want your AutoDoc saved
# the following patterns are supported:
    # s3://<bucket>/<key>
    # https://<bucket>.s3.amazonaws.com/<key>
    # https://s3.amazonaws.com/<bucket>/<key>
# if <key> points to a directory in the bucket, an auto-generated filename is used
    # e.g., if the output path is s3://<bucket>/experiment_test/,
    # the generated report will be s3://<bucket>/experiment_test/Experiment_Report_2021-08-31-13-37-12.docx

# either configure ~/.aws/credentials or set the environment variables below
import os

os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret"

output_file_path = "s3://h2o-datasets/autodoc-examples/autodoc_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# set your AutoDoc configurations
config = Config(output_path=output_file_path)

# render your AutoDoc
render_autodoc(h2o, config, model)

Note: The local copy is erased after a successful upload to the S3 bucket.
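
The three supported URL patterns listed above all carry the same two pieces of information: a bucket and a key. As a rough illustration (not AutoDoc's own implementation), the sketch below uses only the Python standard library to split each pattern into its bucket and key. It assumes that an empty key or a key ending in "/" denotes a directory, matching the trailing-slash example above. The names split_s3_location and is_directory_key are hypothetical helpers introduced for this sketch.

```python
from urllib.parse import urlparse

S3_HOST = "s3.amazonaws.com"


def split_s3_location(url):
    """Return (bucket, key) for one of the three URL patterns listed above."""
    parsed = urlparse(url)
    path = parsed.path.lstrip("/")
    if parsed.scheme == "s3":
        # s3://<bucket>/<key>
        return parsed.netloc, path
    if parsed.scheme == "https" and parsed.netloc == S3_HOST:
        # https://s3.amazonaws.com/<bucket>/<key>
        bucket, _, key = path.partition("/")
        return bucket, key
    if parsed.scheme == "https" and parsed.netloc.endswith("." + S3_HOST):
        # https://<bucket>.s3.amazonaws.com/<key>
        return parsed.netloc[: -len("." + S3_HOST)], path
    raise ValueError(f"unsupported S3 location: {url}")


def is_directory_key(key):
    """Assumption: an empty key or a key ending in '/' denotes a directory."""
    return key == "" or key.endswith("/")


bucket, key = split_s3_location(
    "s3://h2o-datasets/autodoc-examples/autodoc_report.docx"
)
print(bucket, key, is_directory_key(key))
```

A check like this can catch a malformed output path before a long report run starts.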

Push Generated AutoDoc to GitHub Repository

# Parameters the User Must Set: output_file_path
# specify the GitHub URL where you want your AutoDoc saved
# the following pattern is supported:
    # https://github.com/<organization or username>/<repo name>/tree/<branch name>/<path>
# if <path> points to a directory in the repository, an auto-generated filename is used
    # e.g., if the output path is https://github.com/<organization or username>/<repo name>/tree/docs,
    # the generated report will be https://github.com/<organization or username>/<repo name>/tree/docs/Experiment_Report_2021-08-31-13-37-12.docx

# either configure ~/.git_autodoc/credentials or set the environment variable below
import os

os.environ["GITHUB_PAT"] = "your_github_personal_access_token"

output_file_path = "https://github.com/h2oai/h2o-autodoc/tree/master/tests/autodoc_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# set your AutoDoc configurations
config = Config(output_path=output_file_path)

# render your AutoDoc
render_autodoc(h2o, config, model)

Note: The local copy is erased after a successful upload to the GitHub repository.
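
Both upload targets accept credentials either from a credentials file or from environment variables. A minimal pre-flight check might look like the sketch below. It assumes that the credentials file only needs to exist (its format is not described here) and that the environment variables are sufficient whenever all of them are set. credentials_available is a hypothetical helper, not part of h2o_autodoc.

```python
import os
from pathlib import Path


def credentials_available(env_vars, credentials_file, environ=None):
    """Return True if every env var is set, or if the credentials file exists."""
    environ = os.environ if environ is None else environ
    if all(environ.get(name) for name in env_vars):
        return True
    return Path(credentials_file).expanduser().is_file()


# GitHub upload: GITHUB_PAT or ~/.git_autodoc/credentials
github_ready = credentials_available(["GITHUB_PAT"], "~/.git_autodoc/credentials")

# S3 upload: both AWS variables or ~/.aws/credentials
s3_ready = credentials_available(
    ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"], "~/.aws/credentials"
)

print("GitHub credentials found:", github_ready)
print("S3 credentials found:", s3_ready)
```

Running a check like this before render_autodoc() can surface a missing token early, rather than after the report has been generated.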

Set the H2O AutoDoc File Type

H2O AutoDoc can generate a Word document or a markdown file. The default report is a Word document (.docx).

Word Document

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_word_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# only your output_path is required, as the default AutoDoc is a Word document
config = Config(output_path=output_file_path)

# render your AutoDoc
render_autodoc(h2o, config, model)

Markdown File

Note: When main_template_type is set to "md", a zip file is returned. This zip file contains the markdown file and any images that are linked in the markdown file.

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path (make sure to keep the '.md' file extension)
output_file_path = "path/to/your/autodoc/my_markdown_report.md"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# set the exported AutoDoc to markdown ('md')
main_template_type = "md"
config = Config(output_path=output_file_path, main_template_type=main_template_type)

# render your AutoDoc
render_autodoc(h2o, config, model)
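
Because the markdown output arrives as a zip archive, you will usually want to unpack it before viewing the report. The sketch below uses the standard-library zipfile module to extract the archive and separate the markdown file from the linked images. unpack_markdown_report is a hypothetical helper rather than an AutoDoc function, and the paths in the usage comment are placeholders: the exact name and location of the zip file are not specified here, so check your output directory.

```python
import zipfile
from pathlib import Path


def unpack_markdown_report(zip_path, target_dir):
    """Extract the archive and return (markdown_files, other_files) as Paths."""
    target = Path(target_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    # skip directory entries, which end with "/"
    files = [name for name in names if not name.endswith("/")]
    markdown_files = [target / n for n in files if n.lower().endswith(".md")]
    other_files = [target / n for n in files if not n.lower().endswith(".md")]
    return markdown_files, other_files


# Usage (placeholder paths, replace with your own):
# markdown_files, images = unpack_markdown_report(
#     "path/to/your/autodoc/report.zip", "path/to/your/autodoc/unpacked"
# )
```

Keeping the extracted images next to the markdown file preserves the relative links inside the report.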

Model Interpretation Dataset

The H2O AutoDoc report can include partial dependence plots (PDPs) and Shapley value feature importance. By default, these calculations are done on the training frame. You can use the mli_frame (short for machine learning interpretability dataframe) Config parameter to specify a different dataset on which to perform these calculations. In the example below, we will specify that the machine learning interpretability (MLI) calculations are done on our model’s validation dataset, instead of the training dataset.

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_mli_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# specify the H2OFrame on which the partial dependence and Shapley values can be calculated
# here 'valid' was created in the Building an H2O Model code example
mli_frame = valid
config = Config(output_path=output_file_path, mli_frame=mli_frame)

# render your AutoDoc
render_autodoc(h2o, config, model)

Partial Dependence Features

The H2O AutoDoc report includes partial dependence plots (PDPs). By default, PDPs are shown for the top 20 features. This selection is based on the model’s built-in variable importance (referred to as Native Importance in the report). You can override the default behavior with the pdp_feature_list parameter and specify your own list of features to show in the report.

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_pdp_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# specify the features for which you want PDP plots
# these features come from the predictors used in the Building an H2O Model code example
pdp_feature_list = ["EDUCATION", "LIMIT_BAL", "AGE"]
config = Config(output_path=output_file_path, pdp_feature_list=pdp_feature_list)

# render your AutoDoc
render_autodoc(h2o, config, model)
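
To preview which features a top-N rule would pick before overriding it, you can rank features by importance yourself. The sketch below assumes rows shaped like the tuples returned by the H2O-3 model's varimp() method (variable name first, relative importance second). The importance values are made up for illustration, and top_features is a hypothetical helper. How AutoDoc selects features internally may differ in detail.

```python
def top_features(varimp_rows, n=20):
    """Return the names of the n most important features, highest first."""
    ranked = sorted(varimp_rows, key=lambda row: row[1], reverse=True)
    return [row[0] for row in ranked[:n]]


# illustrative (made-up) importances: (variable, relative_importance)
example_varimp = [
    ("AGE", 12.5),
    ("LIMIT_BAL", 48.0),
    ("EDUCATION", 7.25),
]

pdp_feature_list = top_features(example_varimp, n=2)
print(pdp_feature_list)  # ['LIMIT_BAL', 'AGE']
```

With a trained model, model.varimp() could supply the rows, and the resulting list can be passed to Config as pdp_feature_list.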

Specify ICE Records

The H2O AutoDoc can overlay partial dependence plots with individual conditional expectation (ICE) plots. You can specify which observations (aka rows) you’d like to plot (manual selection), or you can let H2O AutoDoc automatically select observations.

Manual Selection

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_manual_ice_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# specify an H2OFrame composed of the records you want shown in the ICE plots
# here 'valid' was created in the Building an H2O Model code example; we use the first 2 rows
ice_frame = valid[:2, :]
config = Config(output_path=output_file_path, ice_frame=ice_frame)

# render your AutoDoc
render_autodoc(h2o, config, model)

Automatic Selection

The num_ice_rows Config parameter controls the number of observations selected for an ICE plot. This feature is disabled by default (i.e., set to 0). Observations are selected by binning the predictions into N quantiles and selecting the first observation in each quantile.

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_auto_ice_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# specify the number of rows you want automatically selected for ICE plots
num_ice_rows = 3
config = Config(output_path=output_file_path, num_ice_rows=num_ice_rows)

# render your AutoDoc
render_autodoc(h2o, config, model)
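
The quantile-based selection described above can be sketched in plain Python. This is an illustration under stated assumptions, not AutoDoc's implementation: rows are sorted by prediction, split into num_ice_rows bins of nearly equal size, and "first observation" is taken to mean the lowest row index within each bin. select_ice_rows and the prediction values are hypothetical.

```python
def select_ice_rows(predictions, num_ice_rows):
    """Pick one row index per prediction quantile (lowest index in each bin)."""
    if num_ice_rows <= 0:
        return []  # mirrors the default: automatic selection disabled
    n = len(predictions)
    # row indices ordered from lowest to highest prediction
    order = sorted(range(n), key=lambda i: predictions[i])
    selected = []
    for b in range(num_ice_rows):
        start = b * n // num_ice_rows
        end = (b + 1) * n // num_ice_rows
        bin_rows = order[start:end]
        if bin_rows:  # a bin can be empty when there are fewer rows than bins
            selected.append(min(bin_rows))
    return selected


# made-up predicted probabilities for six rows
predictions = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
print(select_ice_rows(predictions, 3))  # [1, 2, 0]
```

Each selected row represents a different region of the prediction range, so the ICE curves cover low, middle, and high predictions rather than clustering in one area.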

Enable/Disable Shapley Values

Shapley values are provided for supported H2O-3 algorithms. (For the list of supported algorithms, see the H2O-3 user guide.)

Note: Shapley values are enabled by default. For wide datasets, however, they can take a long time to compute. You can disable the Shapley value calculation to speed up AutoDoc generation.

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_shapley_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get H2O-3 objects required to create an automatic report
model = h2o.get_model("gbm_model")

# enable Shapley values (the default); set use_shapley = False to skip the calculation
use_shapley = True
config = Config(output_path=output_file_path, use_shapley=use_shapley)

# render your AutoDoc
render_autodoc(h2o, config, model)

Provide Additional Testsets

You can provide a list of additional testsets (each of which is an H2OFrame) to the render_autodoc() function. Performance metrics, plots, and tables will be created for each of these additional datasets.

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_additional_testsets_report.docx"

# Example Code:
from h2o_autodoc import Config
from h2o_autodoc import render_autodoc

# get the H2O-3 objects required to create your AutoDoc
model = h2o.get_model("gbm_model")

# set your AutoDoc configurations
config = Config(output_path=output_file_path)

# specify additional testsets
full_test_data = h2o.import_file("https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-test.csv")
test1, test2 = full_test_data.split_frame(ratios=[.5], seed=1234, destination_frames=['mytest1', 'mytest2'])

# render your AutoDoc
render_autodoc(h2o, config, model, additional_testsets=[test1, test2])

Provide Alternative Models

You can provide a list of alternative models to the render_autodoc() function. This adds alternative-model tables to the report that list the parameters a user can grid over (i.e., traditional hyperparameters plus any other parameters that you can grid over).

Code Example

# Parameters the User Must Set: output_file_path
# specify the full path to where you want your AutoDoc saved
# replace the path below with your own path
output_file_path = "path/to/your/autodoc/my_alternative_models_report.docx"

# Example Code:
# run AutoML to create several models
import h2o
from h2o.automl import H2OAutoML

from h2o_autodoc import Config, render_autodoc

# initialize your H2O cluster
h2o.init()

# import the titanic dataset from Amazon S3
titanic = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/"
    "smalldata/gbm_test/titanic.csv",
    destination_frame="titanic_all",
)
# specify the predictors and response
predictors = ["home.dest", "cabin", "embarked", "age"]
response = "survived"
titanic["survived"] = titanic["survived"].asfactor()

# split the titanic dataset into train, valid, and test
train, valid, test = titanic.split_frame(
    ratios=[0.8, 0.1],
    destination_frames=["titanic_train", "titanic_valid", "titanic_test"],
)

# run AutoML
automl = H2OAutoML(max_models=3, seed=1)
automl.train(
    predictors, response, training_frame=train, validation_frame=valid,
)

board = automl.leaderboard.as_data_frame()

# build a report on the best performing model
best_model = automl.leader

# compare the best model to the other models in leaderboard
models = [h2o.get_model(x) for x in board["model_id"][1:]]

# set your AutoDoc configurations
config = Config(output_path=output_file_path)

# render an AutoDoc with your best model and the alternative models
render_autodoc(
    h2o=h2o,
    config=config,
    model=best_model,
    train=train,
    valid=valid,
    test=test,
    alternative_models=models,
)