Using Enterprise Steam with R
This section describes how to use Enterprise Steam with R. Note that this requires the “urltools” R package. Refer to https://github.com/Ironholds/urltools/ for more information.
Downloading and Installing
- Go to https://s3.amazonaws.com/steam-release/enterprise-steam/latest-stable.html to retrieve the latest version of Enterprise Steam.
- On the Steam API tab, download the R package.
- Open a Terminal window and navigate to the directory where the Enterprise Steam file was downloaded.
- Install Enterprise Steam for R using R CMD INSTALL <file_name>. For example:
    R CMD INSTALL h2osteam_1.2.0.tar.gz
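Before loading the package, it can help to confirm that both the urltools dependency and h2osteam are visible to your R session. The following is a minimal sketch, not part of Enterprise Steam itself; the helper name missing_packages is hypothetical.

```r
# Return the subset of `pkgs` that cannot be loaded in this R session.
# (Hypothetical helper; it only wraps base R's requireNamespace.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

missing <- missing_packages(c("urltools", "h2osteam"))
if (length(missing) > 0) {
  message("Still missing: ", paste(missing, collapse = ", "))
  # urltools is available from CRAN: install.packages("urltools")
  # h2osteam comes from the downloaded file: R CMD INSTALL <file_name>
}
```

If nothing is reported as missing, library(h2osteam) should load without errors.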
Use the login function to log in to your Enterprise Steam web server. You must already have a username and a password; the web server URL and your credentials are provided by your Enterprise Steam Admin. This function takes the following parameters:
url: The URL of the Enterprise Steam instance
verify_ssl: Specify TRUE or FALSE to enable or disable SSL certificate verification
username: Your username as provided by your Enterprise Steam Admin
password: Your password as provided by your Enterprise Steam Admin
login_file: A login file where user information is stored.
login_file_passphrase: A login file where user passphrase information is stored.
    $ R
    > library(h2osteam)
    > conn <- h2osteam.login(url = "https://steam.0xdata.loc", verify_ssl = FALSE, username = "jsmith", password = "jsmith")
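The example above places the password directly in the call. One possible alternative, sketched here under assumptions, reads the same login parameters from environment variables. The variable names (STEAM_URL, STEAM_USERNAME, STEAM_PASSWORD) and the helper steam_login_args are hypothetical and not part of h2osteam; only the url, username, password, and verify_ssl parameters come from the list above.

```r
# Build the argument list for h2osteam.login from environment variables.
# The variable names are hypothetical; use whatever convention your site prefers.
steam_login_args <- function(getenv = Sys.getenv) {
  args <- list(
    url = getenv("STEAM_URL"),
    username = getenv("STEAM_USERNAME"),
    password = getenv("STEAM_PASSWORD"),
    verify_ssl = TRUE
  )
  # Sys.getenv returns "" for unset variables, so flag those explicitly.
  unset <- names(args)[vapply(args, identical, logical(1), "")]
  if (length(unset) > 0) {
    stop("Missing login settings: ", paste(unset, collapse = ", "))
  }
  args
}

# Only attempt a login when the package and the settings are actually present.
if (requireNamespace("h2osteam", quietly = TRUE) && nzchar(Sys.getenv("STEAM_URL"))) {
  library(h2osteam)
  conn <- do.call(h2osteam.login, steam_login_args())
}
```

Keeping the lookup in a small function makes it easy to fail early with a clear message when a setting is missing, rather than sending an empty username to the server.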
Use the start_h2o_cluster function to create a new cluster. This function takes the following parameters:
cluster_name: Specify a name for this cluster.
profile_name: Specify the profile to use for this cluster.
num_nodes: Specify the number of nodes for the cluster.
node_memory: Specify the amount of memory that should be available on each node.
v_cores: Specify the number of virtual cores.
n_threads: Specify the number of threads (CPUs) to use in the cluster. Specify 0 to use all available threads.
max_idle_time: Specify the maximum number of hours that the cluster can be idle before gracefully shutting down. Specify 0 to turn off this setting and allow the cluster to remain idle for an unlimited amount of time.
max_uptime: Specify the maximum number of hours that the cluster can be running. Specify 0 to turn off this setting and allow the cluster to remain up for an unlimited amount of time.
extramempercent: Specify the amount of extra memory for internal JVM use outside of the Java heap. This is a percentage of memory per node. The default (and recommended) value is 10%.
h2o_engine_id: The H2O engine version that this cluster will use. Note that the Enterprise Steam Admin is responsible for adding engines to Enterprise Steam.
yarn_queue: If your cluster contains queues for allocating cluster resources, specify the queue for this cluster. Note that the YARN Queue cannot contain spaces.
    > cluster_config <- h2osteam.start_h2o_cluster(conn = conn, cluster_name = "first-cluster-from-R", profile_name = "default", num_nodes = 2, node_memory = "30g", h2o_version = "22.214.171.124")

    # Call the cluster to retrieve its ID and configuration params.
    > cluster_config
    $id
    [1] 109

    $connect_params
    $connect_params$ip
    [1] "steam.0xdata.loc"

    $connect_params$port
    [1] 9999

    $connect_params$cookies
    [1] "first-cluster-from-R=YW5nZWxhOnVoYzdyeTNtM3g="

    $connect_params$context_path
    [1] "jsmith_first-cluster-from-R"

    $connect_params$https
    [1] TRUE

    $connect_params$insecure
    [1] TRUE
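The optional settings in the parameter list (max_idle_time, max_uptime, yarn_queue) can be assembled and checked before the cluster is requested. The sketch below is one way to do that; the helper cluster_args is hypothetical, and it assumes that leaving an optional argument out lets the function fall back to its own default. The engine argument (h2o_version in the example above) is omitted here and would be added in the same way.

```r
# Assemble and sanity-check arguments for h2osteam.start_h2o_cluster.
# Only parameters from the list above are used; the helper is hypothetical.
cluster_args <- function(cluster_name, profile_name, num_nodes, node_memory,
                         max_idle_time = NULL, max_uptime = NULL,
                         yarn_queue = NULL) {
  stopifnot(num_nodes >= 1)
  # The YARN queue name cannot contain spaces.
  if (!is.null(yarn_queue) && grepl(" ", yarn_queue, fixed = TRUE)) {
    stop("yarn_queue cannot contain spaces: '", yarn_queue, "'")
  }
  args <- list(cluster_name = cluster_name, profile_name = profile_name,
               num_nodes = num_nodes, node_memory = node_memory,
               max_idle_time = max_idle_time, max_uptime = max_uptime,
               yarn_queue = yarn_queue)
  # Drop optional settings that were not supplied.
  Filter(Negate(is.null), args)
}

# Idle for at most 4 hours, run for at most 12 hours.
args <- cluster_args("first-cluster-from-R", "default", num_nodes = 2,
                     node_memory = "30g", max_idle_time = 4, max_uptime = 12)

# With a live connection `conn` from h2osteam.login:
# cluster_config <- do.call(h2osteam.start_h2o_cluster, c(list(conn = conn), args))
```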
Note that after you create a cluster, you can immediately connect to that cluster and begin using H2O. Refer to the following for a complete R example.
    > library(h2o)
    > h2o.connect(config = cluster_config)

    # import the cars dataset
    # this dataset is used to classify whether or not a car is economical based on
    # the car's displacement, power, weight, and acceleration, and the year it was made
    > cars <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

    # convert response column to a factor
    > cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])

    # set the predictor names and the response column name
    > predictors <- c("displacement", "power", "weight", "acceleration", "year")
    > response <- "economy_20mpg"

    # split into train and validation sets
    > cars.split <- h2o.splitFrame(data = cars, ratios = 0.8, seed = 1234)
    > train <- cars.split[[1]]
    > valid <- cars.split[[2]]

    # train your model, specifying your 'x' predictors,
    # your 'y' the response column, training_frame, and validation_frame
    > cars_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train, validation_frame = valid, seed = 1234)

    # print the auc for your model
    > print(h2o.auc(cars_gbm, valid = TRUE))
Use get_h2o_cluster to retrieve information about a specific cluster by its cluster name.
    > h2osteam.get_h2o_cluster(conn, 'first-cluster-from-R')
    $id
    [1] 109

    $connect_params
    $connect_params$ip
    [1] "steam.0xdata.loc"

    $connect_params$port
    [1] 9999

    $connect_params$cookies
    [1] "first-cluster-from-R=YW5nZWxhOnVoYzdyeTNtM3g="

    $connect_params$context_path
    [1] "jsmith_first-cluster-from-R"

    $connect_params$https
    [1] TRUE

    $connect_params$insecure
    [1] TRUE
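Because the value returned by get_h2o_cluster has the same shape as the cluster_config object shown earlier, it can presumably be passed to h2o.connect to reattach to a running cluster from a new R session. The sketch below checks for the fields seen in the output above before connecting; the helper is_connectable is hypothetical, and the set of fields it checks is an assumption based only on that output.

```r
# Check that a cluster description carries the connection fields shown in
# the output above. Hypothetical helper, not part of h2osteam.
is_connectable <- function(cfg) {
  needed <- c("ip", "port", "context_path", "https")
  is.list(cfg) && is.list(cfg$connect_params) &&
    all(needed %in% names(cfg$connect_params))
}

# With a live connection `conn` from h2osteam.login:
# cfg <- h2osteam.get_h2o_cluster(conn, "first-cluster-from-R")
# if (is_connectable(cfg)) h2o::h2o.connect(config = cfg)
```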
Use get_h2o_clusters to retrieve all running H2O clusters accessible to the current user.
Use the stop_h2o_cluster function to stop a cluster.
    > h2osteam.stop_h2o_cluster(conn, cluster_config)
Use show_profiles to show the available profiles.
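No examples are given for show_profiles or get_h2o_clusters, so the following is only a sketch. It assumes that both functions take the connection object as their first argument, like the other h2osteam calls in this section, and it makes no assumption about the shape of what they return.

```r
# Assumption: both functions accept the connection returned by
# h2osteam.login as their first argument.
if (requireNamespace("h2osteam", quietly = TRUE) && exists("conn")) {
  library(h2osteam)
  # Profiles that can be passed as profile_name to start_h2o_cluster.
  h2osteam.show_profiles(conn)
  # Running clusters accessible to the current user.
  clusters <- h2osteam.get_h2o_clusters(conn)
  # Inspect the result without assuming a particular structure.
  str(clusters, max.level = 1)
}
```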