Hadoop Examples
===============

This section provides complete examples of using the Enterprise Steam Python client on Hadoop.

Launching and connecting to H2O cluster
---------------------------------------

This example shows how to log in to Steam and launch an H2O cluster with 4 nodes and 10 GB of memory per node.
The cluster uses H2O version 3.28.0.2 and the profile called ``default-h2o``, and submits to the default YARN queue.
All other H2O parameters are pre-filled according to the selected profile.
Once the cluster is up, we connect to it and start importing data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       nodes=4,
                                       node_memory_gb=10)

    cluster.connect()

    airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
    airlines_df = h2o.import_file(path=airlines)

Providing dataset parameters to preset cluster size
---------------------------------------------------

This example shows how to launch an H2O cluster by providing dataset information.
If you are not sure how to size your cluster exactly, you can provide either ``dataset_size_gb`` (for a raw data source) or the ``dataset_dimension`` tuple (for a compressed data source), and specify whether you are going to use the XGBoost algorithm on your cluster with the ``using_xgboost`` parameter.
Setting these parameters sizes the cluster accordingly.
If your profile does not allow allocating the recommended resources for the cluster, the maximum allowed resources are used.
Any user-specified values of ``nodes``, ``node_memory_gb``, or ``extra_memory_percent`` override the recommended values.

Example using ``dataset_size_gb`` when using a CSV file as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       dataset_size_gb=20,
                                       using_xgboost=True)

Example using ``dataset_dimension``, a tuple of (n_rows, n_cols), when using a compressed file (e.g. Parquet) as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       dataset_dimension=(25000, 1250),
                                       using_xgboost=False)

Connecting to existing H2O cluster
----------------------------------

This example shows how to log in to Steam, connect to an existing H2O cluster called ``test-cluster``, and import data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.get_cluster("test-cluster")
    cluster.connect()

    airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
    airlines_df = h2o.import_file(path=airlines)
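Once connected, the cluster behaves like any other H2O-3 backend, so models are trained with the regular ``h2o`` API rather than through Steam. The following is a minimal sketch (not part of the Steam client) that trains a small GBM on the airlines frame imported above; it assumes the ``IsDepDelayed`` column of this public dataset as the response.

.. code-block:: python

    from h2o.estimators import H2OGradientBoostingEstimator

    # Split the imported frame and train a small GBM on the connected cluster.
    # "IsDepDelayed" is the response column of the public airlines dataset.
    train, valid = airlines_df.split_frame(ratios=[0.8], seed=42)

    model = H2OGradientBoostingEstimator(ntrees=50, max_depth=5, seed=42)
    model.train(y="IsDepDelayed", training_frame=train, validation_frame=valid)

    print(model.auc(valid=True))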
Saving H2O cluster data
-----------------------

This example shows how to save cluster data and restart a cluster called ``test-cluster``.
Setting ``save_cluster_data=True`` makes the cluster save its data when it reaches its idle or uptime limit.
Calling ``cluster.stop(save_cluster_data=True)`` immediately stops the cluster and saves its data.
A saved cluster can be started again, and its saved data will be loaded.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import H2oClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = H2oClient.launch_cluster(name="test-cluster",
                                       profile_name="default-h2o",
                                       version="3.28.0.2",
                                       nodes=4,
                                       node_memory_gb=10,
                                       save_cluster_data=True)

    cluster.connect()

    # Train your models...

    cluster.stop(save_cluster_data=True)
    cluster.start(nodes=2, node_memory_gb=5, save_cluster_data=False)

Launching and connecting to Sparkling Water cluster
---------------------------------------------------

This example shows how to log in to Steam and launch a Sparkling Water cluster with 4 executors and 10 GB of memory per executor.
The Sparkling Water cluster uses Sparkling Water version 3.28.0.2 and the profile called ``default-sparkling-internal``, and submits to the ``default`` YARN queue.
The profile type dictates the cluster backend type; in this case, the cluster starts in internal mode.
All other Sparkling Water parameters are pre-filled according to the selected profile.
Once the cluster is up, we can send statements to the remote Spark session to start importing data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       executors=4,
                                                       executor_memory_gb=10,
                                                       yarn_queue="default")

    cluster.send_statement('airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"')
    cluster.send_statement('airlines_df = h2o.import_file(path=airlines)')
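Model training works the same way: the code is shipped to the remote Spark session as a string and executed there, so any results must be printed or exported from that session. A minimal sketch, reusing ``send_statement`` from the example above together with the standard ``h2o`` estimator API (the ``IsDepDelayed`` response column is assumed from the public airlines dataset):

.. code-block:: python

    # The statement runs in the remote Spark session, where airlines_df
    # was imported by the previous send_statement calls.
    cluster.send_statement('''
    from h2o.estimators import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(ntrees=50, seed=42)
    model.train(y="IsDepDelayed", training_frame=airlines_df)
    print(model.auc(train=True))
    ''')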
Providing dataset parameters to preset Sparkling Water cluster size
-------------------------------------------------------------------

This example shows how to launch a Sparkling Water cluster by providing dataset information.
If you are not sure how to size your cluster exactly, you can provide either ``dataset_size_gb`` (for a raw data source) or the ``dataset_dimension`` tuple (for a compressed data source), and specify whether you are going to use the XGBoost algorithm on your cluster with the ``using_xgboost`` parameter.
Setting these parameters sizes the cluster accordingly.
If your profile does not allow allocating the recommended resources for the cluster, the maximum allowed resources are used.
Any user-specified values of ``executors``, ``executor_memory_gb``, ``h2o_nodes``, ``h2o_node_memory_gb``, or ``h2o_extra_memory_percent`` override the recommended values.

Example using ``dataset_size_gb`` when using a CSV file as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       dataset_size_gb=50,
                                                       using_xgboost=False)

Example using ``dataset_dimension``, a tuple of (n_rows, n_cols), when using a compressed file (e.g. Parquet) as a data source:

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       dataset_dimension=(25000, 1250),
                                                       using_xgboost=True)

Connecting to existing Sparkling Water cluster
----------------------------------------------

This example shows how to log in to Steam, connect to an existing Sparkling Water cluster called ``test-sparkling-cluster``, and import data.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.get_cluster("test-sparkling-cluster")

    multilineStatement = '''
    airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
    airlines_df = h2o.import_file(path=airlines)
    '''

    cluster.send_statement(multilineStatement)

Saving Sparkling Water cluster data
-----------------------------------

This example shows how to save cluster data and restart a cluster called ``test-cluster``.
Setting ``save_cluster_data=True`` makes the cluster save its data when it reaches its idle or uptime limit.
Calling ``cluster.stop(save_cluster_data=True)`` immediately stops the cluster and saves its data.
A saved cluster can be started again, and its saved data will be loaded.

.. code-block:: python

    import h2o
    import h2osteam
    from h2osteam.clients import SparklingClient

    h2osteam.login(url="https://steam.h2o.ai:9555",
                   username="user01",
                   password="access-token-here",
                   verify_ssl=True)

    cluster = SparklingClient.launch_sparkling_cluster(name="test-cluster",
                                                       profile_name="default-sparkling-internal",
                                                       version="3.28.0.2",
                                                       executors=4,
                                                       executor_memory_gb=10,
                                                       save_cluster_data=True)

    # Train your models...

    cluster.stop(save_cluster_data=True)
    cluster.start(executors=2, executor_memory_gb=5, save_cluster_data=False)
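After the restart, the previously saved frames and models should be loaded back into the cluster. One optional way to check this from the Python client, shown here as an illustrative sketch that assumes the restarted cluster's Spark session is ready to accept statements, is to fetch the cluster again and list the keys in the remote H2O backend with ``h2o.ls()``:

.. code-block:: python

    # Hypothetical check: fetch the restarted cluster and list the keys stored
    # in the remote H2O backend to confirm the saved data was reloaded.
    cluster = SparklingClient.get_cluster("test-cluster")
    cluster.send_statement('print(h2o.ls())')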