H2O AutoDoc API Docs

The H2O AutoDoc Python API allows you to automatically generate a report from the following supervised learning H2O-3 models:

  • Deep Learning

  • Distributed Random Forest (including Extremely Randomized Forest)

  • Generalized Linear Model

  • Gradient Boosting Machine

  • Stacked Ensembles

  • XGBoost

Configure H2O AutoDoc

class h2o_autodoc.Config(output_path, tmp_dir=None, template_path=None, template_sections_path=None, template_dirs=[], sub_template_type=None, main_template_type='docx', float_format='{:6.4g}', data_summary_feat_num=-1, num_features=20, plot_num_features=20, min_relative_importance=0, stats_quantiles=20, psi_quantiles=10, response_rate_quantiles=10, pdp_feature_list=None, mli_frame=None, ice_frame=None, num_ice_rows=0, cardinality_limit=25, use_hdfs=False, pdp_out_of_range=3, pdp_num_bins=10, warning_shift_auc_threshold=0.8, include_hist=True, use_shapley=True, document_reproducibility=True, license_file=None, license_text=None, progress_file_path=None, hist_features=10, export_artifacts=False)

This class configures the h2o-autodoc. The only required parameter is the output_path, which specifies where the h2o-autodoc should be saved. Additional parameters provide control over the document file type, plots, and number of features shown, among other options.

Although many options are available, the simplest configuration provides only ‘output_path’. If you specify only a file name, the h2o-autodoc is saved to the current working directory.

Parameters
  • output_path – str: Path or S3 URL that specifies where to save the h2o-autodoc (e.g., ‘User/username/my_report.docx’, ‘s3://<bucket>/…’, ‘https://s3.amazonaws.com/<bucket>/…’, ‘https://<bucket>.s3.amazonaws.com/…’).

  • template_path – str, optional: Path to general or custom template. Defaults to None.

  • template_sections_path – str, optional: Path to general or custom template sections. Defaults to None.

  • template_dirs – list, optional: List of alternative paths to general or custom template sections. Defaults to an empty list.

  • sub_template_type – str, optional: The subtemplate type (e.g., ‘docx’ or ‘md’). Defaults to the main_template_type value.

  • main_template_type – str, optional: The document type (e.g., ‘docx’ or ‘md’). Defaults to ‘docx’.

  • float_format – str: Format string syntax. Defaults to “{:6.4g}”: a minimum total width of 6 with 4 significant digits, using the ‘g’ general format.

  • data_summary_feat_num – int: The number of features to show in the data summary. Values lower than 1 (e.g., 0 or -1) indicate that all columns should be shown. Defaults to -1.

  • num_features – int: The number of top features to display in the document tables. Defaults to 20.

  • plot_num_features – int: The number of top features to display in the document plots. Defaults to 20.

  • min_relative_importance – The minimum relative importance in order for a feature to be displayed in the feature importance table/plot. Defaults to 0.

  • stats_quantiles – int: The number of quantiles to use for prediction statistics computation. Defaults to 20.

  • psi_quantiles – int: The number of quantiles to use for population stability index computation. Defaults to 10.

  • response_rate_quantiles – int: The number of quantiles to use for response rates information computation. Defaults to 10.

  • pdp_feature_list – list or str: A list of feature names (str) for which to create partial dependence plots (pdps). To show pdps for all features set pdp_feature_list to the string “all”.

  • mli_frame – H2OFrame: An H2OFrame on which the partial dependence and Shapley values will be calculated. If no H2OFrame is specified the training frame is used. Defaults to None.

  • ice_frame – H2OFrame, optional: An H2OFrame on which the individual conditional expectation will be calculated. If no H2OFrame is specified then ice rows will be selected automatically.

  • num_ice_rows – int, optional: The number of rows to be automatically selected from the training data for individual conditional expectation. This argument is ignored if the ice_frame argument is provided. Defaults to 0.

  • cardinality_limit – int: The maximum number of categorical levels a feature can have, above which the partial dependence plot will not be generated. Defaults to 25.

  • use_hdfs – bool: Whether to save the document to HDFS. Requires that H2O or Sparkling Water cluster has access to HDFS. Defaults to False.

  • pdp_out_of_range – int: The number of standard deviations, outside of the range of a column, to include in partial dependence plots. This shows how the model will react to data it has not seen before. Defaults to 3.

  • pdp_num_bins – int: The number of bins for the partial dependence plot. Defaults to 10.

  • warning_shift_auc_threshold – float: A warning is shown if the AUC is greater than or equal to this threshold. Defaults to 0.8.

  • use_shapley – bool: Whether to calculate Shapley values, for algorithms where they are available. Note that Shapley value calculations may take a long time for very wide datasets. Defaults to True.

  • document_reproducibility – bool: Whether to include the Model Reproducibility section (only for supported models). Defaults to True.

  • license_file – str: A file system location for the license file. Defaults to None.

  • license_text – str: The license text. Defaults to None.

  • progress_file_path – str, optional: Path where the relative progress (value between 0 and 1) is saved.

  • hist_features – int or list or str: Controls the number of feature distribution plots via one of three options: an integer to show the top N features, a list of feature names to show, or the string “all” to show all features. Defaults to 10.

  • export_artifacts – bool: Whether to include the AutoDoc images (as PNGs) and tables (as CSVs) with the AutoDoc report. If set to True, render_autodoc returns a zip with the report, a directory of the images, and a directory of the tables. Defaults to False.
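
Because float_format uses Python's format string syntax, its effect can be previewed with str.format before rendering a report. The following is a minimal sketch that does not depend on h2o_autodoc; the sample values are made up for illustration.

```python
# Preview how a float_format string renders numbers.
# The sample values below are illustrative only.
float_format = "{:6.4g}"

samples = [0.123456, 3.14159265, 12345.678]
formatted = [float_format.format(value) for value in samples]

for value, text in zip(samples, formatted):
    # repr() makes any padding spaces visible
    print(repr(value), "->", repr(text))
# 0.123456 -> '0.1235'
# 3.14159265 -> ' 3.142'
# 12345.678 -> '1.235e+04'
```

Values shorter than six characters are left-padded with spaces, and values that need more than four significant digits before the decimal point switch to scientific notation.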

Example

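>>> import h2o
>>> from h2o_autodoc import Config
>>> from h2o_autodoc import render_autodoc
>>> # model, train, valid, and test are assumed to be an existing H2O model and H2OFrames
>>> path = "my_autodoc.docx"
>>> config = Config(path)
>>> render_autodoc(h2o, config, model, train, valid, test)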

Sometimes you may want to see more features or fewer quantiles than are shown by default. These can be controlled as follows:

Example

>>> import h2o
>>> from h2o_autodoc import Config
>>> from h2o_autodoc import render_autodoc
>>> # model, train, valid, and test are assumed to be an existing H2O model and H2OFrames
>>> path = "path/to/report/my_autodoc.docx"
>>> config = Config(path, num_features=10, stats_quantiles=10)
>>> render_autodoc(h2o, config, model, train, valid, test)
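
As a rough illustration of what a quantile count controls, the sketch below splits a list of made-up prediction scores into equal-frequency bins using only the standard library. It is not AutoDoc's implementation; the function name quantile_bin_counts and the data are hypothetical.

```python
import statistics
from bisect import bisect_right


def quantile_bin_counts(scores, n_quantiles):
    """Count how many scores fall into each of n_quantiles equal-frequency bins."""
    # statistics.quantiles returns n_quantiles - 1 cut points
    cuts = statistics.quantiles(scores, n=n_quantiles)
    counts = [0] * n_quantiles
    for score in scores:
        # bisect_right gives the index of the bin the score belongs to
        counts[bisect_right(cuts, score)] += 1
    return counts


# Hypothetical prediction scores: 0.00, 0.01, ..., 0.99
scores = [i / 100 for i in range(100)]
counts_10 = quantile_bin_counts(scores, 10)
counts_20 = quantile_bin_counts(scores, 20)
print(len(counts_10), len(counts_20))
```

Fewer quantiles produce coarser bins with more rows each, which generally makes the resulting tables shorter.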

Render the H2O AutoDoc Report

h2o_autodoc.render_autodoc(h2o, config, model, train=None, valid=None, test=None, alternative_models=[], additional_testsets=[], model_json=None)

Renders the h2o-autodoc for a specific H2OModel, using the training and validation H2OFrames that were used to build that model.

Parameters
  • h2o – module: the h2o module object accessed via ‘import h2o’

  • model – H2OModel: the H2O model object for which the h2o-autodoc will render a report document.

  • train – H2OFrame: the training frame used to build the H2O model.

  • valid – H2OFrame: the validation frame used to build the H2O model.

  • test – H2OFrame: additional test dataset (not used during training) which contains the same feature names found in the training dataset.

  • config – Config: the configuration settings for render_autodoc.

  • alternative_models – list: a list of H2O models. These are models that exist in the current running H2O cluster.

  • additional_testsets – list: a list of additional test H2OFrames.

  • model_json – Model-Json of the original model.

Example

>>> # import h2o and initialize h2o cluster
>>> import h2o
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> h2o.init()
>>> # specify the paths to the train and valid datasets
>>> train_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-train.csv"
>>> valid_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-test.csv"
>>> # import the train and valid dataset
>>> train = h2o.import_file(train_path)
>>> valid = h2o.import_file(valid_path)
>>> # specify the predictors and response
>>> predictors = train.columns
>>> predictors.remove('ID')
>>> response = "DEFAULT_PAYMENT_NEXT_MONTH"
>>> # convert target to factor
>>> train[response] = train[response].asfactor()
>>> valid[response] = valid[response].asfactor()
>>> # build an H2O-3 GBM Model
>>> gbm = H2OGradientBoostingEstimator(model_id="gbm_model", seed=1234)
>>> gbm.train(x=predictors, y=response, training_frame=train,
...           validation_frame=valid)
>>> # import h2o_autodoc
>>> from h2o_autodoc import render_autodoc
>>> from h2o_autodoc import Config
>>> # specify the path to the output file
>>> output_file_path = 'report_H2O3.docx'
>>> config = Config(output_path=output_file_path)
>>> # render the h2o-autodoc
>>> render_autodoc(h2o, config=config, model=gbm, train=train, valid=valid)
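
When export_artifacts is set to True, the output is a zip containing the report, a directory of images, and a directory of tables. The sketch below shows one way to list the contents of such an archive grouped by top-level folder, using only the standard library. The helper name group_zip_entries, the archive path, and the folder layout are assumptions for illustration and are not part of the h2o_autodoc API.

```python
import zipfile
from collections import defaultdict


def group_zip_entries(zip_path):
    """Group file names in a zip archive by their top-level directory.

    Files at the archive root are grouped under the empty string.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip explicit directory entries
            top, sep, _rest = name.partition("/")
            groups[top if sep else ""].append(name)
    return dict(groups)
```

For example, calling group_zip_entries on a hypothetical archive such as "report_H2O3.zip" would return a dictionary that maps each top-level folder to the files it contains.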