Hadoop Examples

This section provides a complete example for using the Enterprise Steam Python client on Hadoop.

Launching and connecting to H2O cluster

This example shows how to log in to Steam and launch an H2O cluster with 4 nodes and 10 GB of memory per node. The cluster uses H2O version 3.28.0.2 and a profile called default-h2o, and submits to the default YARN queue. All other H2O parameters are pre-filled according to the selected profile. Once the cluster is up, we connect to it and start importing data.

import h2o
import h2osteam
from h2osteam.clients import H2oClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = H2oClient.launch_cluster(name="test-cluster",
                                   profile_name="default-h2o",
                                   version="3.28.0.2",
                                   nodes=4,
                                   node_memory_gb=10)
cluster.connect()
airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
airlines_df = h2o.import_file(path=airlines)

Providing dataset parameters to preset cluster size

This example shows how to launch an H2O cluster by providing dataset information. If you are not sure how exactly to size your cluster, you can provide either dataset_size_gb (for a raw data source) or a dataset_dimension tuple (for a compressed data source), and specify whether you plan to run the XGBoost algorithm on the cluster with the using_xgboost parameter. Setting these parameters sizes the cluster accordingly. If your profile does not allow allocating the recommended resources, the maximum allowed resources are used instead. Any user-specified values of nodes, node_memory_gb, or extra_memory_percent override the recommended values.

Example using dataset_size_gb when using a CSV file as a data source:

import h2o
import h2osteam
from h2osteam.clients import H2oClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = H2oClient.launch_cluster(name="test-cluster",
                                   profile_name="default-h2o",
                                   version="3.28.0.2",
                                   dataset_size_gb=20,
                                   using_xgboost=True)
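Here dataset_size_gb refers to the raw (uncompressed) data size. For a local CSV file, one way to obtain that number is from the file size on disk. The sketch below uses only the Python standard library and is not part of the h2osteam API; the throwaway temporary file stands in for a real CSV path:

```python
import os
import tempfile

def file_size_gb(path):
    # File size on disk, in GiB -- usable as a dataset_size_gb estimate
    # for a raw (uncompressed) CSV source.
    return os.path.getsize(path) / (1024 ** 3)

# Demo with a throwaway file; in practice, point this at your CSV.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 1024)
print(file_size_gb(f.name))
os.remove(f.name)
```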

Example using dataset_dimension, a tuple of (n_rows, n_cols), when using a compressed file (e.g. Parquet) as a data source:

import h2o
import h2osteam
from h2osteam.clients import H2oClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = H2oClient.launch_cluster(name="test-cluster",
                                   profile_name="default-h2o",
                                   version="3.28.0.2",
                                   dataset_dimension=(25000, 1250),
                                   using_xgboost=False)
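As a rough illustration of why dataset_dimension is useful for compressed sources, the uncompressed in-memory footprint of a numeric dataset can be estimated from its shape. The helper below is only a back-of-the-envelope sketch assuming 8 bytes per numeric value; it is not Steam's actual sizing logic:

```python
# Back-of-the-envelope estimate of the uncompressed in-memory size of a
# numeric dataset from its (n_rows, n_cols) shape. This is NOT Steam's
# sizing algorithm -- just an illustration of the scale involved.
def estimate_dataset_size_gb(dataset_dimension, bytes_per_value=8):
    n_rows, n_cols = dataset_dimension
    return (n_rows * n_cols * bytes_per_value) / (1024 ** 3)

# The shape from the example above works out to roughly a quarter of a GiB.
print(round(estimate_dataset_size_gb((25000, 1250)), 4))
```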

Connecting to existing H2O cluster

This example shows how to log in to Steam, connect to an existing H2O cluster called test-cluster, and import data.

import h2o
import h2osteam
from h2osteam.clients import H2oClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = H2oClient.get_cluster("test-cluster")
cluster.connect()
airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
airlines_df = h2o.import_file(path=airlines)

Saving H2O cluster data

This example shows how to save cluster data and restart a cluster called test-cluster. Setting save_cluster_data=True makes the cluster save its data when it reaches its idle or uptime limit. Calling cluster.stop(save_cluster_data=True) stops the cluster immediately and saves its data. When a saved cluster is started again, its saved data is loaded.

import h2o
import h2osteam
from h2osteam.clients import H2oClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = H2oClient.launch_cluster(name="test-cluster",
                                   profile_name="default-h2o",
                                   version="3.28.0.2",
                                   nodes=4,
                                   node_memory_gb=10,
                                   save_cluster_data=True)
cluster.connect()
# Train your models...
cluster.stop(save_cluster_data=True)
cluster.start(nodes=2,
              node_memory_gb=5,
              save_cluster_data=False)

Launching and connecting to Sparkling Water cluster

This example shows how to log in to Steam and launch a Sparkling Water cluster with 4 executors and 10 GB of memory per executor. The Sparkling Water cluster uses Sparkling Water version 3.28.0.2 and a profile called default-sparkling-internal, and submits to the default YARN queue. The profile type dictates the cluster backend type; in this case the cluster starts in internal mode. All other Sparkling Water parameters are pre-filled according to the selected profile. Once the cluster is up, we can send statements to the remote Spark session to start importing data.

import h2o
import h2osteam
from h2osteam.clients import SparklingClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                   profile_name="default-sparkling-internal",
                                                   version="3.28.0.2",
                                                   executors=4,
                                                   executor_memory_gb=10,
                                                   yarn_queue="default")

cluster.send_statement('airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"')
cluster.send_statement('airlines_df = h2o.import_file(path=airlines)')

Providing dataset parameters to preset Sparkling Water cluster size

This example shows how to launch a Sparkling Water cluster by providing dataset information. If you are not sure how exactly to size your cluster, you can provide either dataset_size_gb (for a raw data source) or a dataset_dimension tuple (for a compressed data source), and specify whether you plan to run the XGBoost algorithm on the cluster with the using_xgboost parameter. Setting these parameters sizes the cluster accordingly. If your profile does not allow allocating the recommended resources, the maximum allowed resources are used instead. Any user-specified values of executors, executor_memory_gb, h2o_nodes, h2o_node_memory_gb, or h2o_extra_memory_percent override the recommended values.

Example using dataset_size_gb when using a CSV file as a data source:

import h2o
import h2osteam
from h2osteam.clients import SparklingClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                   profile_name="default-sparkling-internal",
                                                   version="3.28.0.2",
                                                   dataset_size_gb=50,
                                                   using_xgboost=False)

Example using dataset_dimension, a tuple of (n_rows, n_cols), when using a compressed file (e.g. Parquet) as a data source:

import h2o
import h2osteam
from h2osteam.clients import SparklingClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = SparklingClient.launch_sparkling_cluster(name="test-sparkling-cluster",
                                                   profile_name="default-sparkling-internal",
                                                   version="3.28.0.2",
                                                   dataset_dimension=(25000, 1250),
                                                   using_xgboost=True)

Connecting to existing Sparkling Water cluster

This example shows how to log in to Steam, connect to an existing Sparkling Water cluster called test-sparkling-cluster, and import data.

import h2o
import h2osteam
from h2osteam.clients import SparklingClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = SparklingClient.get_cluster("test-sparkling-cluster")

multilineStatement = '''
airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
airlines_df = h2o.import_file(path=airlines)
'''

cluster.send_statement(multilineStatement)
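For longer multi-line statements, the standard-library textwrap.dedent lets you indent the statement naturally inside your local script and still send flush-left code to the remote Spark session. This is only a string-building sketch; no Steam connection is involved:

```python
import textwrap

# Build a multi-line statement for send_statement while keeping it
# indented naturally inside the local script. dedent strips the common
# leading whitespace before the string is sent.
statement = textwrap.dedent('''\
    airlines = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip"
    airlines_df = h2o.import_file(path=airlines)
    ''')
print(statement)
```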

Saving Sparkling Water cluster data

This example shows how to save cluster data and restart a cluster called test-cluster. Setting save_cluster_data=True makes the cluster save its data when it reaches its idle or uptime limit. Calling cluster.stop(save_cluster_data=True) stops the cluster immediately and saves its data. When a saved cluster is started again, its saved data is loaded.

import h2o
import h2osteam
from h2osteam.clients import SparklingClient

h2osteam.login(url="https://steam.h2o.ai:9555", username="user01", password="access-token-here", verify_ssl=True)
cluster = SparklingClient.launch_sparkling_cluster(name="test-cluster",
                                                   profile_name="default-sparkling-internal",
                                                   version="3.28.0.2",
                                                   executors=4,
                                                   executor_memory_gb=10,
                                                   save_cluster_data=True)
# Train your models...
cluster.stop(save_cluster_data=True)
cluster.start(executors=2,
              executor_memory_gb=5,
              save_cluster_data=False)