Using H2O AutoDoc with Steam
============================

Introduction
------------

AutoDoc users can generate docs using H2O Steam through the Steam Client Python API.

**Requirements**

- Steam Client Python package. Refer to the `Enterprise Steam Download page <https://www.h2o.ai/download/#enterprise-steam>`__.
- H2O-3 Python Client. Refer to the `H2O-3 Download page <https://www.h2o.ai/download/#h2o>`__.


Configure H2O AutoDoc
---------------------

Rendering an H2O AutoDoc requires a running H2O Cluster, a trained model, and access to the datasets used to train the model.

This section includes code examples for setting up a model, along with basic and advanced H2O AutoDoc configurations. To experiment with a complete end-to-end example, run the :ref:`steam-build-h2o-model-ref` code example before running one of the H2O-AutoDoc-specific examples.

- Setup:

  - :ref:`steam-connect-h2o3-cluster-ref`
  - :ref:`steam-connect-sparkling-cluster-ref`
  - :ref:`steam-build-h2o-model-ref`

- Basic configurations:

  - :ref:`steam-generate-default-autodoc-ref`
  - :ref:`steam-save-autodoc-to-s3-ref`
  - :ref:`steam-save-autodoc-to-github-ref`
  - :ref:`steam-specify-file-type-ref`

- Advanced configurations:

  - :ref:`steam-specify-mli-frame-ref`
  - :ref:`steam-specify-pdp-features-ref`
  - :ref:`steam-specify-ice-frame-ref`
  - :ref:`steam-enable-shapley-values-ref`

- HDFS Notes:

  - :ref:`steam-save-and-copy-from-hdfs-ref`

.. _steam-connect-h2o3-cluster-ref:

Connecting to H2O-3 Cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Connect to your Steam-launched H2O-3 cluster. After you connect, model training and AutoDoc rendering take place on that cluster.

.. tabs::
   .. code-tab:: python Steam Python

    # import h2o and the Steam client
    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    # log in to the Steam server
    h2osteam.login(url="https://mr-0xg9:9555/", username="username", password="password")

    # get the running cluster by name and connect to it
    # once connected, model training and AutoDoc rendering take place on the H2O-3 cluster on Steam
    session = H2oClient.get_cluster(name="autodoc-test")
    session.connect()
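
The login call above passes the password as a literal string. A minimal sketch, assuming you would rather keep the credentials in environment variables: the helper below only builds the keyword arguments for ``h2osteam.login``. The helper name ``steam_login_kwargs`` and the variable names ``STEAM_URL``, ``STEAM_USERNAME``, and ``STEAM_PASSWORD`` are hypothetical and are not defined by the Steam client.

```python
import os


def steam_login_kwargs(env=None):
    """Collect url/username/password for h2osteam.login from environment variables.

    The STEAM_* variable names are illustrative assumptions, not Steam conventions.
    """
    env = os.environ if env is None else env
    names = {"url": "STEAM_URL", "username": "STEAM_USERNAME", "password": "STEAM_PASSWORD"}
    # fail early with a clear message if any variable is unset or empty
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in names.items()}


# usage (requires the h2osteam package and a reachable Steam server):
# h2osteam.login(**steam_login_kwargs())
```

The helper accepts an optional mapping so that it can be exercised without touching the real environment.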

.. _steam-connect-sparkling-cluster-ref:

Connecting to Sparkling Water Cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Connect to your Steam-launched Sparkling Water cluster. After you connect, model training and AutoDoc rendering take place on that cluster.

.. tabs::
   .. code-tab:: python Steam Python

    # import h2o and the Steam client
    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    # log in to the Steam server
    h2osteam.login(url="https://mr-0xg9:9555/", username="username", password="password")

    # get the running cluster by name and connect to it
    # once connected, model training and AutoDoc rendering take place on the Sparkling Water cluster on Steam
    session = SparklingClient.get_cluster("test-cluster")
    h2o.connect(config=session.get_h2o_config())


.. _steam-build-h2o-model-ref:

Build H2O Model on connected cluster (H2O-3 / Sparkling Water)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Build a model on the connected cluster instance.

.. tabs::
   .. code-tab:: python H2O-3 Python

    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    # URLs of the training and validation datasets
    train_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-train.csv"
    valid_path = "https://s3.amazonaws.com/h2o-training/events/ibm_index/CreditCard_Cat-test.csv"

    # import the train and valid datasets
    train = h2o.import_file(train_path, destination_frame='CreditCard_Cat-train.csv')
    valid = h2o.import_file(valid_path, destination_frame='CreditCard_Cat-test.csv')

    # set predictors and response (exclude the ID column and the response from the predictors)
    response = "DEFAULT_PAYMENT_NEXT_MONTH"
    predictors = train.columns
    predictors.remove('ID')
    predictors.remove(response)

    # convert target to factor
    train[response] = train[response].asfactor()
    valid[response] = valid[response].asfactor()

    # assign IDs for later use
    h2o.assign(train, "CreditCard_TRAIN")
    h2o.assign(valid, "CreditCard_VALID")

    # build an H2O-3 GBM Model
    gbm = H2OGradientBoostingEstimator(model_id="gbm_model", seed=1234)
    gbm.train(x=predictors, y=response, training_frame=train, validation_frame=valid)


.. _steam-generate-default-autodoc-ref:

Generate and Download a Default H2O AutoDoc
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY")
    render_autodoc(h2o, config, gbm)

.. _steam-save-autodoc-to-s3-ref:

Push Generated AutoDoc to S3 Bucket
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. tabs::
   .. code-tab:: python AutoDoc Python

    import os

    # Parameters the User Must Set: output_file_path
    # specify the S3 URI/URL where you want your AutoDoc saved
    # the following patterns are supported:
        # s3://<bucket>/<key>
        # https://<bucket>.s3.amazonaws.com/<key>
        # https://s3.amazonaws.com/<bucket>/<key>
    # if <key> points to a directory in the bucket, an auto-generated filename is used
        # e.g., if the path is s3://<bucket>/experiment_test/,
        # the generated report will be s3://<bucket>/experiment_test/Experiment_Report_2021-08-31-13-37-12.docx

    # either configure ~/.aws/credentials or set the environment variables below
    os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret"

    output_file_path = "s3://h2o-datasets/autodoc-examples/autodoc_report.docx"

    # Example Code:
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    # get the H2O-3 objects required to create your AutoDoc
    model = h2o.get_model("gbm_model")

    # set your AutoDoc configurations
    config = Config(output_path=output_file_path)

    # render your AutoDoc
    render_autodoc(h2o, config, model)

**Note**: The local copy is deleted after a successful upload to the S3 bucket.
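
The three S3 location patterns and the directory rule listed in the example can be checked before rendering. The following is a minimal sketch, not AutoDoc's own parsing logic: ``parse_s3_location`` and ``resolve_key`` are hypothetical helpers, and treating a key that ends in ``/`` as a directory is an assumption made for illustration. The timestamp format is taken from the example filename above.

```python
import re
from datetime import datetime

# the three location patterns listed in the example above
_S3_PATTERNS = [
    re.compile(r"^s3://(?P<bucket>[^/]+)/(?P<key>.*)$"),
    re.compile(r"^https://s3\.amazonaws\.com/(?P<bucket>[^/]+)/(?P<key>.*)$"),
    re.compile(r"^https://(?P<bucket>[^/]+)\.s3\.amazonaws\.com/(?P<key>.*)$"),
]


def parse_s3_location(location):
    """Return (bucket, key) for a supported S3 URI/URL, else raise ValueError."""
    for pattern in _S3_PATTERNS:
        match = pattern.match(location)
        if match:
            return match.group("bucket"), match.group("key")
    raise ValueError("unsupported S3 location: " + location)


def resolve_key(key, now=None):
    """Append a timestamped report name when the key denotes a directory.

    Assumption: a key that is empty or ends with "/" is a directory.
    """
    if key == "" or key.endswith("/"):
        stamp = (now or datetime.now()).strftime("%Y-%m-%d-%H-%M-%S")
        return key + "Experiment_Report_" + stamp + ".docx"
    return key


bucket, key = parse_s3_location("s3://h2o-datasets/autodoc-examples/autodoc_report.docx")
print(bucket, resolve_key(key))
```

Running such a check first surfaces a mistyped URL before the (potentially slow) report rendering starts.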


.. _steam-save-autodoc-to-github-ref:

Push Generated AutoDoc to GitHub Repository
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


.. code-block:: python

    import os

    # Parameters the User Must Set: output_file_path
    # specify the GitHub URL where you want your AutoDoc saved
    # the following pattern is supported:
        # https://github.com/<organization or username>/<repo name>/tree/<branch name>/<path>
    # if <path> points to a directory in the repository, an auto-generated filename is used
        # e.g., if the URL is https://github.com/<organization or username>/<repo name>/tree/docs,
        # the generated report will be https://github.com/<organization or username>/<repo name>/tree/docs/Experiment_Report_2021-08-31-13-37-12.docx

    # either configure ~/.git_autodoc/credentials or set the environment variable below
    os.environ["GITHUB_PAT"] = "your_github_personal_access_token"

    output_file_path = "https://github.com/h2oai/h2o-autodoc/tree/master/tests/autodoc_report.docx"

    # Example Code:
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    # get the H2O-3 objects required to create your AutoDoc
    model = h2o.get_model("gbm_model")

    # set your AutoDoc configurations
    config = Config(output_path=output_file_path)

    # render your AutoDoc
    render_autodoc(h2o, config, model)

**Note**: The local copy is deleted after a successful upload to the GitHub repository.
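
The GitHub URL pattern in the example can be split into its parts to confirm that the organization, repository, branch, and path are the ones you intend. This is a minimal sketch with a hypothetical helper, ``parse_github_location``; it assumes the branch name contains no ``/`` and does not reflect how AutoDoc itself interprets the URL.

```python
import re

# https://github.com/<organization or username>/<repo name>/tree/<branch name>/<path>
_GITHUB_TREE_URL = re.compile(
    r"^https://github\.com/(?P<owner>[^/]+)/(?P<repo>[^/]+)"
    r"/tree/(?P<branch>[^/]+)(?:/(?P<path>.*))?$"
)


def parse_github_location(url):
    """Split a GitHub tree URL into owner, repo, branch, and path.

    Assumption: the branch name contains no "/" characters.
    """
    match = _GITHUB_TREE_URL.match(url)
    if not match:
        raise ValueError("unsupported GitHub location: " + url)
    parts = match.groupdict()
    parts["path"] = parts["path"] or ""  # no path after the branch
    return parts


print(parse_github_location(
    "https://github.com/h2oai/h2o-autodoc/tree/master/tests/autodoc_report.docx"
))
```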


.. _steam-specify-file-type-ref:

Set the H2O AutoDoc File Type
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

H2O AutoDoc can generate a Word document or a Markdown file. The default report is a Word document (.docx).


**Word Document**

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY")
    render_autodoc(h2o, config, gbm)

**Markdown File**

**Note**: When **main_template_type** is set to **"md"**, a zip file is returned. This zip file contains the Markdown file and any images that it links to.

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.md"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY", main_template_type="md")
    render_autodoc(h2o, config, gbm)
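
Because the Markdown report arrives as a zip file, it has to be unpacked before the Markdown file and its images can be viewed together. The following is a minimal sketch using only the Python standard library; ``extract_markdown_report`` is a hypothetical helper, and it assumes the Markdown file inside the archive has a ``.md`` extension. The location and name of the returned zip file are not specified here, so pass in whatever path your run produced.

```python
import zipfile
from pathlib import Path


def extract_markdown_report(zip_path, target_dir):
    """Unpack the archive and return the paths of the Markdown files it contains.

    Assumption: the report inside the archive uses the ".md" extension.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # images keep their relative paths, so links in the Markdown stay valid
        archive.extractall(target)
    return sorted(target.rglob("*.md"))


# usage, with placeholder paths:
# reports = extract_markdown_report("path/to/your/autodoc/report.zip", "path/to/extracted")
```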

.. _steam-specify-mli-frame-ref:

Model Interpretation Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The H2O AutoDoc report can include partial dependence plots (PDPs) and Shapley value feature importance. By default, these calculations are done on the training frame. Use the **mli_frame** (short for machine learning interpretability dataframe) Config parameter to run them on a different dataset. The example below runs the machine learning interpretability (MLI) calculations on the model's validation dataset instead of the training dataset.

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY", mli_frame=valid.frame_id)
    render_autodoc(h2o, config, gbm)


.. _steam-specify-pdp-features-ref:

Partial Dependence Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The H2O AutoDoc report includes partial dependence plots (PDPs). By default, PDPs are shown for the top 20 features. This selection is based on the model's built-in variable importance (referred to as Native Importance in the report). You can override the default behavior with the **pdp_feature_list** parameter and specify your own list of features to show in the report.

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    # specify the features for which you want PDPs
    # these features come from the predictors used in the Build H2O Model code example
    pdp_feature_list = ["EDUCATION", "LIMIT_BAL", "AGE"]

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY", pdp_feature_list=pdp_feature_list)
    render_autodoc(h2o, config, gbm)
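
If you want a shorter list than the default but still want it driven by variable importance, you can build **pdp_feature_list** from the model's importance table. This is a minimal sketch, not AutoDoc's selection code: ``top_features`` is a hypothetical helper, and it assumes rows shaped like the tuples returned by the H2O-3 Python ``varimp()`` method, that is ``(variable, relative_importance, scaled_importance, percentage)``. The sample rows are made up for illustration.

```python
def top_features(varimp_rows, n=20):
    """Return the n variable names with the highest relative importance."""
    ranked = sorted(varimp_rows, key=lambda row: row[1], reverse=True)
    return [row[0] for row in ranked[:n]]


# illustrative rows only, not real model output
sample_rows = [
    ("AGE", 12.0, 0.12, 0.08),
    ("LIMIT_BAL", 40.0, 0.40, 0.27),
    ("EDUCATION", 100.0, 1.00, 0.65),
]
pdp_feature_list = top_features(sample_rows, n=2)
print(pdp_feature_list)
```

With a trained model, the rows would come from a call such as ``gbm.varimp()`` instead of the sample list.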

.. _steam-specify-ice-frame-ref:

Specify ICE Records
~~~~~~~~~~~~~~~~~~~~~

The H2O AutoDoc can overlay partial dependence plots with individual conditional expectation (ICE) plots. You can specify which observations (aka rows) you'd like to plot (manual selection), or you can let H2O AutoDoc automatically select observations.

**Manual Selection**

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    # specify the frame ID of the H2OFrame holding the records you want shown in the ICE plots
    # here 'valid' was created in the Build H2O Model code example; we use its first 2 rows
    # keep a reference to the sliced frame so that it stays available on the cluster
    ice_frame = valid[:2, :]
    ice_frame_id = ice_frame.frame_id

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY", ice_frame=ice_frame_id)
    render_autodoc(h2o, config, gbm)

**Automatic Selection**

The **num_ice_rows** Config parameter controls the number of observations selected for an ICE plot. This feature is disabled by default (i.e., set to 0). Observations are selected by binning the predictions into N quantiles and selecting the first observation in each quantile.

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    # specify the number of rows you want automatically selected for ICE plots
    num_ice_rows = 3

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY", num_ice_rows=num_ice_rows)
    render_autodoc(h2o, config, gbm)
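
To make the automatic selection rule concrete, here is a small pure-Python illustration of "bin the predictions into N quantiles and take the first observation in each". It is a sketch under stated assumptions, not AutoDoc's implementation: ``select_ice_rows`` is a hypothetical function, bins are formed from prediction ranks, and "first" is read as the first row in the original row order.

```python
def select_ice_rows(predictions, num_ice_rows):
    """Pick one row index per quantile bin of the predictions."""
    if num_ice_rows <= 0 or not predictions:
        return []
    n = len(predictions)
    # rank rows by predicted value (ties keep their original order)
    order = sorted(range(n), key=lambda i: predictions[i])
    bin_of = [0] * n
    for rank, row in enumerate(order):
        # equal-count bins: rank 0..n-1 maps to bin 0..num_ice_rows-1
        bin_of[row] = rank * num_ice_rows // n
    # remember the first row (in original order) that falls into each bin
    first = {}
    for row in range(n):
        first.setdefault(bin_of[row], row)
    return [first[b] for b in sorted(first)]


predictions = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
print(select_ice_rows(predictions, 3))
```

For these six predictions and three bins, the low bin holds rows 1 and 5, the middle bin rows 2 and 3, and the high bin rows 0 and 4, so the sketch picks rows 1, 2, and 0.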

.. _steam-enable-shapley-values-ref:

Enable/Disable Shapley Values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Shapley values are provided for supported H2O-3 Algorithms. (For supported algorithms, see the `H2O-3 user guide <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html?#predict-contributions>`_.)

**Note**: Shapley values are enabled by default. They can take a long time, however, to complete for wide datasets. You can disable the Shapley value calculation to speed up your AutoDoc generation.

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    # disable the Shapley value calculation (it is enabled by default)
    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY", use_shapley=False)
    render_autodoc(h2o, config, gbm)

.. _steam-save-and-copy-from-hdfs-ref:

Saving AutoDoc to HDFS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If Steam runs on top of Hadoop, you can save the AutoDoc directly to the Hadoop Distributed File System (HDFS).

.. tabs::
   .. code-tab:: python AutoDoc Python

    # Parameters the User Must Set: output_file_path
    # specify the full path to where you want your AutoDoc saved
    # replace the path below with your own path
    output_file_path = "path/to/your/autodoc/autodoc_report.docx"

    # Example Code:
    # import the Config class and the render_autodoc function
    from h2o_autodoc import Config
    from h2o_autodoc import render_autodoc

    config = Config(output_path=output_file_path, license_text="AUTODOC_LICENSE_KEY", use_hdfs=True)
    render_autodoc(h2o, config, gbm)


**Note**: To copy the AutoDoc from HDFS to the local file system, use the following CLI command.

.. code-block:: console
    
    $ hdfs dfs -copyToLocal "/user/h2o/path/to/your/autodoc/autodoc_report.docx" "/path/on/local/file/system"
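
If you prefer to run the copy from the same Python session that rendered the report, the CLI command above can be wrapped with the standard library. This is a minimal sketch: the helper names ``copy_to_local_command`` and ``copy_autodoc_from_hdfs`` are hypothetical, and the sketch assumes the ``hdfs`` command is installed and on your ``PATH``.

```python
import shutil
import subprocess


def copy_to_local_command(hdfs_path, local_path):
    """Build the argument list for `hdfs dfs -copyToLocal`."""
    return ["hdfs", "dfs", "-copyToLocal", hdfs_path, local_path]


def copy_autodoc_from_hdfs(hdfs_path, local_path):
    """Run the copy; raises if the hdfs CLI is missing or the copy fails."""
    if shutil.which("hdfs") is None:
        raise FileNotFoundError("the hdfs command was not found on PATH")
    subprocess.run(copy_to_local_command(hdfs_path, local_path), check=True)


# usage, with the same placeholder paths as the CLI example:
# copy_autodoc_from_hdfs(
#     "/user/h2o/path/to/your/autodoc/autodoc_report.docx",
#     "/path/on/local/file/system",
# )
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.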