Using Enterprise Steam with Python

This section describes how to use Enterprise Steam from Python. Note that each Python request results in a warning message; these warnings can safely be ignored.
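
If the warning output is distracting, you can silence it with Python's standard warnings module. This is a generic sketch; it suppresses all warnings, not just those produced by Enterprise Steam requests:

>>> import warnings
>>> warnings.filterwarnings("ignore")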

Downloading and Installing

  1. Go to https://s3.amazonaws.com/steam-release/enterprise-steam/latest-stable.html to retrieve the latest version of Enterprise Steam.
  2. On the Steam API tab, download the Python package.
  3. Open a Terminal window, and navigate to the location where the Python .whl file was downloaded. For example:
cd ~/Downloads
  4. Install Enterprise Steam for Python using pip install <file_name>. For example:
pip install h2osteam-1.4.4-py2.py3-none-any.whl
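
To confirm the installation succeeded, you can inspect the package with pip or attempt an import (a quick sanity check, not part of the official steps):

pip show h2osteam
python -c "import h2osteam"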

login

In Python, use the login function to log in to your Enterprise Steam web server. Note that you must already have a username and a password. The web server URL and your username and password are provided to you by your Enterprise Steam Admin.

$ python
>>> import h2osteam
>>> conn = h2osteam.login(url = "https://steam.0xdata.loc",
                          verify_ssl = False,
                          username="jsmith",
                          password="jsmith")

start_h2o_cluster

Use the start_h2o_cluster function to create a new cluster. This function takes the following parameters:

  • cluster_name: Specify a name for this cluster.
  • profile_name: Specify the profile to use for this cluster.
  • num_nodes: Specify the number of nodes for the cluster.
  • node_memory: Specify the amount of memory that should be available on each node.
  • v_cores: Specify the number of virtual cores.
  • n_threads: Specify the number of threads (CPUs) to use in the cluster. Specify 0 to use all available threads.
  • max_idle_time: Specify the maximum number of hours that the cluster can be idle before gracefully shutting down. Specify 0 to turn off this setting and allow the cluster to remain idle for an unlimited amount of time.
  • max_uptime: Specify the maximum number of hours that the cluster can be running. Specify 0 to turn off this setting and allow the cluster to remain up for an unlimited amount of time.
  • extramempercent: Specify the amount of extra memory for internal JVM use outside of the Java heap. This is a percentage of memory per node. The default (and recommended) value is 10%.
  • h2o_version: The H2O engine version that this cluster will use. Note that the Enterprise Steam Admin is responsible for adding engines to Enterprise Steam.
  • yarn_queue: If your cluster contains queues for allocating cluster resources, specify the queue for this cluster. Note that the YARN Queue cannot contain spaces.
  • callback_ip: Optionally specify the IP address for callback messages from the mapper to the driver (driverif).
>>> cluster_config = conn.start_h2o_cluster(cluster_name = 'first-cluster-from-Python',
                                            profile_name = 'default',
                                            num_nodes = 2,
                                            node_memory = '30g',
                                            h2o_version = "3.22.0.1",
                                            max_idle_time = 1,
                                            max_uptime = 1)

# Display cluster_config to see the cluster ID and connection parameters.
>>> cluster_config
{'id': 107, 'connect_params': {'cookies': [u'first-cluster-from-Python=YW5nZWxhOmdrZm53aGJsdWY='], 'ip': 'steam.0xdata.loc', 'context_path': u'jsmith_first-cluster-from-Python', 'verify_ssl_certificates': False, 'https': True, 'port': 9999}}

Note that after you create a cluster, you can immediately connect to that cluster and begin using H2O. Refer to the following for a complete Python example.

>>> import h2o
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> h2o.connect(config = cluster_config)

# import the cars dataset
# this dataset is used to classify whether or not a car is economical based on
# the car's displacement, power, weight, and acceleration, and the year it was made
>>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

# convert response column to a factor
>>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()

# set the predictor names and the response column name
>>> predictors = ["displacement","power","weight","acceleration","year"]
>>> response = "economy_20mpg"

# split into train and validation sets
>>> train, valid = cars.split_frame(ratios = [.8], seed = 1234)

# initialize your estimator
>>> cars_gbm = H2OGradientBoostingEstimator(seed = 1234)

# train your model, specifying your 'x' predictors,
# your 'y' the response column, training_frame, and validation_frame
>>> cars_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the auc for the validation data
>>> cars_gbm.auc(valid=True)

get_h2o_cluster

Use the get_h2o_cluster function to retrieve information about a specific cluster by name.

>>> conn.get_h2o_cluster('first-cluster-from-Python')
{'id': 108, 'connect_params': {'cookies': [u'first-cluster-from-Python=YW5nZWxhOnA1bHRreHN5amo='], 'ip': 'steam.0xdata.loc', 'context_path': u'jsmith_first-cluster-from-Python', 'verify_ssl_certificates': False, 'https': True, 'port': 9999}}
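
Because get_h2o_cluster returns the same configuration dictionary that start_h2o_cluster produces, you can pass the result directly to h2o.connect to attach to a running cluster. For example, assuming the cluster above is still running:

>>> import h2o
>>> h2o.connect(config = conn.get_h2o_cluster('first-cluster-from-Python'))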

get_h2o_clusters

Use the get_h2o_clusters function to retrieve all running H2O clusters accessible to the current user.

>>> conn.get_h2o_clusters()
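
The return value is not shown above. Assuming it is a list of configuration dictionaries shaped like the get_h2o_cluster output, you could enumerate your clusters as follows (a hypothetical sketch):

>>> for cluster in conn.get_h2o_clusters():
...     print(cluster['id'], cluster['connect_params']['context_path'])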

stop_h2o_cluster

Use the stop_h2o_cluster function to stop a cluster.

>>> conn.stop_h2o_cluster(cluster_config)
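
stop_h2o_cluster accepts the configuration returned by start_h2o_cluster or get_h2o_cluster, so you can also stop a cluster by name without keeping the original handle:

>>> conn.stop_h2o_cluster(conn.get_h2o_cluster('first-cluster-from-Python'))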

show_profiles

Use the show_profiles function to show the profiles available to you.

>>> conn.show_profiles()

start_internal_sparkling_cluster

Use the start_internal_sparkling_cluster function to create a new Sparkling Water cluster using the internal backend. This function takes the following parameters:

  • cluster_name: Specify a name for this cluster.
  • profile_name: Specify the profile to use for this cluster.
  • h2o_version: The H2O engine version that this cluster will use. Note that the Enterprise Steam Admin is responsible for adding engines to Enterprise Steam.
  • driver_cores: Specify the number of Spark driver cores.
  • driver_memory_gb: Specify the amount of Spark driver memory in GB.
  • num_executors: Specify the number of Spark executors.
  • executor_cores: Specify the number of Spark executor cores.
  • executor_memory_gb: Specify the amount of Spark executor memory in GB.
  • h2o_node_threads: Specify the number of threads (CPUs) to use per node. Specify 0 to use all available threads.
  • start_timeout_sec: Specify the start timeout in seconds.
  • yarn_queue: If your cluster contains queues for allocating cluster resources, specify the queue for this cluster. Note that the YARN Queue cannot contain spaces.
  • spark_properties: Specify additional Spark properties as a Python dictionary.
>>> cluster = conn.start_internal_sparkling_cluster(cluster_name="test",
                                                    profile_name="default-sparkling-internal",
                                                    h2o_version="3.22.0.1",
                                                    driver_cores=1,
                                                    driver_memory_gb=1,
                                                    num_executors=1,
                                                    executor_cores=1,
                                                    executor_memory_gb=1,
                                                    h2o_node_threads=0,
                                                    start_timeout_sec=90,
                                                    yarn_queue=None,
                                                    spark_properties={'spark.python.worker.reuse': 'true', 'key': 'val'})

start_external_sparkling_cluster

Use the start_external_sparkling_cluster function to create a new Sparkling Water cluster using the external backend. This function takes the following parameters:

  • cluster_name: Specify a name for this cluster.
  • profile_name: Specify the profile to use for this cluster.
  • h2o_version: The H2O engine version that this cluster will use. Note that the Enterprise Steam Admin is responsible for adding engines to Enterprise Steam.
  • driver_cores: Specify the number of Spark driver cores.
  • driver_memory_gb: Specify the amount of Spark driver memory in GB.
  • num_executors: Specify the number of Spark executors.
  • executor_cores: Specify the number of Spark executor cores.
  • executor_memory_gb: Specify the amount of Spark executor memory in GB.
  • h2o_nodes: Specify the number of H2O nodes for the cluster.
  • h2o_node_memory_gb: Specify the amount of memory that should be available on each H2O node.
  • h2o_node_threads: Specify the number of threads (CPUs) to use per node. Specify 0 to use all available threads.
  • start_timeout_sec: Specify the start timeout in seconds.
  • yarn_queue: If your cluster contains queues for allocating cluster resources, specify the queue for this cluster. Note that the YARN Queue cannot contain spaces.
  • spark_properties: Specify additional Spark properties as a Python dictionary.
>>> cluster = conn.start_external_sparkling_cluster(cluster_name="test",
                                                    profile_name="default-sparkling-external",
                                                    h2o_version="3.22.0.1",
                                                    driver_cores=1,
                                                    driver_memory_gb=1,
                                                    num_executors=1,
                                                    executor_cores=1,
                                                    executor_memory_gb=1,
                                                    h2o_nodes=1,
                                                    h2o_node_memory_gb=1,
                                                    h2o_node_threads=0,
                                                    start_timeout_sec=90,
                                                    yarn_queue=None,
                                                    spark_properties={'spark.python.worker.reuse': 'true', 'key': 'val'})

sparkling_cluster.session

Use the session function of a Sparkling Water cluster to connect to the remote Spark session and issue commands.

>>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
>>> sparkling_cluster.session()

sparkling_cluster.send_statement

Use the send_statement function of a Sparkling Water cluster to send a single statement to the remote Spark session.

>>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
>>> sparkling_cluster.send_statement("f_crimes = h2o.import_file(path='../data/chicagoCrimes10k.csv', col_types=column_type)")
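
Each statement executes in the remote Spark session, so any names it references must already exist there; the example above assumes column_type was defined by an earlier statement. A hypothetical sketch of that pattern (the column types shown are illustrative, not the real schema of the file):

>>> sparkling_cluster.send_statement("column_type = ['numeric', 'string', 'enum']")
>>> sparkling_cluster.send_statement("f_crimes = h2o.import_file(path='../data/chicagoCrimes10k.csv', col_types=column_type)")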

sparkling_cluster.detail

Use the detail function of a Sparkling Water cluster to get information about that cluster.

>>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
>>> sparkling_cluster.detail()

sparkling_cluster.stop

Use the stop function of a Sparkling Water cluster to stop the cluster.

>>> sparkling_cluster = conn.start_internal_sparkling_cluster(.......)
>>> sparkling_cluster.stop()

upload_engine

Use the upload_engine function to upload an H2O engine to Steam.

>>> conn.upload_engine("~/Downloads/h2o-3.22.0.1-hdp2.4.zip")
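
Once uploaded, the engine version becomes available to start_h2o_cluster through its h2o_version parameter. A sketch, assuming the version string matches the uploaded file:

>>> conn.upload_engine("~/Downloads/h2o-3.22.0.1-hdp2.4.zip")
>>> cluster_config = conn.start_h2o_cluster(cluster_name = 'cluster-on-uploaded-engine',
                                            profile_name = 'default',
                                            num_nodes = 2,
                                            node_memory = '30g',
                                            h2o_version = "3.22.0.1")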

upload_sparkling_engine

Use the upload_sparkling_engine function to upload a Sparkling Water engine to Steam.

>>> conn.upload_sparkling_engine("~/Downloads/sparkling-water-2.3.17.zip")