Hadoop Examples
===============

This section provides complete examples of using the Enterprise Steam Python client on Hadoop.

Launching and connecting to H2O cluster
---------------------------------------

This example shows how to log in to Steam and launch an H2O cluster with 4 nodes and 10 GB of memory per node.
The cluster uses H2O version 3.28.0.2 and the profile called ``default-h2o``, and submits to the default YARN queue.
All other H2O parameters are pre-filled according to the selected profile.
Once the cluster is up, we connect to it and start importing data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       nodes=4,
                                       node_memory_gb=10)

    cluster.connect()

    airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
    airlines_df = h2o.import_file(path=airlines)

Providing dataset parameters to preset cluster size
---------------------------------------------------

This example shows how to launch an H2O cluster by providing dataset information.
If you are not sure how to size your cluster exactly, you can provide either ``dataset_size_gb`` (for a raw data source) or the ``dataset_dimension`` tuple (for a compressed data source), and specify whether you are going to use the XGBoost algorithm on your cluster with the ``using_xgboost`` parameter.
Setting these parameters sizes the cluster accordingly.
If your profile does not allow allocating the recommended resources for the cluster, the maximum allowed resources are used.
Any user-specified values of ``nodes``, ``node_memory_gb``, or ``extra_memory_percent`` override the recommended values.

Example using ``dataset_size_gb`` when using a CSV file as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       dataset_size_gb=20,
                                       using_xgboost=True)

Example using ``dataset_dimension``, a tuple of (n_rows, n_cols), when using a compressed file (e.g. Parquet) as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       dataset_dimension=(25000, 1250),
                                       using_xgboost=False)

Connecting to existing H2O cluster
----------------------------------

This example shows how to log in to Steam, connect to an existing H2O cluster called ``test-cluster``, and import data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.get_cluster("test-cluster")
    cluster.connect()

    airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
    airlines_df = h2o.import_file(path=airlines)
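Once connected, the cluster behaves like any other H2O-3 backend, so models are trained with the regular ``h2o`` API rather than through Steam. The following is a minimal sketch (not part of the Steam client) that trains a small GBM on the airlines frame imported above; it assumes the ``IsDepDelayed`` column of this public dataset as the response.

.. code-block:: python

    from h2o.estimators import H2OGradientBoostingEstimator

    # Split the imported frame and train a small GBM on the connected cluster.
    # "IsDepDelayed" is the response column of the public airlines dataset.
    train, valid = airlines_df.split_frame(ratios=[0.8], seed=42)

    model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5, seed=42)
    model.train(y="IsDepDelayed", training_frame=train, validation_frame=valid)

    print(model.auc(valid=True))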
Saving H2O cluster data
-----------------------

This example shows how to save cluster data and restart a cluster called ``test-cluster``.
Setting ``save_cluster_data=True`` makes the cluster save its data when it reaches its idle or uptime limit.
Calling ``cluster.stop(save_cluster_data=True)`` immediately stops the cluster and saves its data.
A saved cluster can be started again, and its saved data will be loaded.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       nodes=4,
                                       node_memory_gb=10,
                                       save_cluster_data=True)

    cluster.connect()

    # Train your models...

    cluster.stop(save_cluster_data=True)
    cluster.start(nodes=2, node_memory_gb=5, save_cluster_data=False)

Launching and connecting to Sparkling Water cluster
---------------------------------------------------

This example shows how to log in to Steam and launch a Sparkling Water cluster with 4 executors and 10 GB of memory per executor.
The Sparkling Water cluster uses Sparkling Water version 3.28.0.2 and the profile called ``default-sparkling-internal``, and submits to the ``default`` YARN queue.
The profile type dictates the cluster backend type; in this case, the cluster starts in internal mode.
All other Sparkling Water parameters are pre-filled according to the selected profile.
Once the cluster is up, we can send statements to the remote Spark session to start importing data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       executors=4,
                                                       executor_memory_gb=10,
                                                       yarn_queue="default")

    cluster.send_statement('airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"')
    cluster.send_statement('airlines_df = h2o.import_file(path=airlines)')
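Model training works the same way: the code is shipped to the remote Spark session as a string and executed there, so any results must be printed or exported from that session. A minimal sketch, reusing ``send_statement`` from the example above together with the standard ``h2o`` estimator API (the ``IsDepDelayed`` response column is assumed from the public airlines dataset):

.. code-block:: python

    # The statement runs in the remote Spark session, where airlines_df
    # was imported by the previous send_statement calls.
    cluster.send_statement('''
    from h2o.estimators import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(ntrees=50, seed=42)
    model.train(y="IsDepDelayed", training_frame=airlines_df)
    print(model.auc(train=True))
    ''')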
Providing dataset parameters to preset Sparkling Water cluster size
-------------------------------------------------------------------

This example shows how to launch a Sparkling Water cluster by providing dataset information.
If you are not sure how to size your cluster exactly, you can provide either ``dataset_size_gb`` (for a raw data source) or the ``dataset_dimension`` tuple (for a compressed data source), and specify whether you are going to use the XGBoost algorithm on your cluster with the ``using_xgboost`` parameter.
Setting these parameters sizes the cluster accordingly.
If your profile does not allow allocating the recommended resources for the cluster, the maximum allowed resources are used.
Any user-specified values of ``executors``, ``executor_memory_gb``, ``h2o_nodes``, ``h2o_node_memory_gb``, or ``h2o_extra_memory_percent`` override the recommended values.

Example using ``dataset_size_gb`` when using a CSV file as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       dataset_size_gb=50,
                                                       using_xgboost=False)

Example using ``dataset_dimension``, a tuple of (n_rows, n_cols), when using a compressed file (e.g. Parquet) as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       dataset_dimension=(25000, 1250),
                                                       using_xgboost=True)

Connecting to existing Sparkling Water cluster
----------------------------------------------

This example shows how to log in to Steam, connect to an existing Sparkling Water cluster called ``test-sparkling-cluster``, and import data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.get_cluster("test-sparkling-cluster")

    multilineStatement = '''
    airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
    airlines_df = h2o.import_file(path=airlines)
    '''

    cluster.send_statement(multilineStatement)

Saving Sparkling Water cluster data
-----------------------------------

This example shows how to save cluster data and restart a cluster called ``test-cluster``.
Setting ``save_cluster_data=True`` makes the cluster save its data when it reaches its idle or uptime limit.
Calling ``cluster.stop(save_cluster_data=True)`` immediately stops the cluster and saves its data.
A saved cluster can be started again, and its saved data will be loaded.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       executors=4,
                                                       executor_memory_gb=10,
                                                       save_cluster_data=True)

    # Train your models...

    cluster.stop(save_cluster_data=True)
    cluster.start(executors=2, executor_memory_gb=5, save_cluster_data=False)
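After the restart, the previously saved frames and models should be loaded back into the cluster. One optional way to check this from the Python client, shown here as an illustrative sketch that assumes the restarted cluster's Spark session is ready to accept statements, is to fetch the cluster again and list the keys in the remote H2O backend with ``h2o.ls()``:

.. code-block:: python

    # Hypothetical check: fetch the restarted cluster and list the keys stored
    # in the remote H2O backend to confirm the saved data was reloaded.
    cluster = SparklingClient.get_cluster("test-cluster")
    cluster.send_statement('print(h2o.ls())')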