3.46.0.6-1-3.4
H2O Sparkling Water

Train Sparkling Water Algorithms with Grid Search

Grid Search is used to find the best-performing values for the hyper-parameters of a given H2O/SW algorithm. Grid Search in Sparkling Water can traverse the hyper-space for H2OGBM, H2OXGBoost, H2ODRF, H2OGLM, H2OGAM, H2ODeepLearning, H2OKMeans, H2OCoxPH, and H2OIsolationForest. For details about the hyper-parameters of a specific algorithm, see the H2O-3 documentation.

Sparkling Water provides an API for Grid Search in Scala and Python. The following sections describe how to apply Grid Search to H2ODRF in both languages. See also Parameters of H2OGridSearch.

  • Scala
  • Python

In Scala, first start the Sparkling Shell:

./bin/sparkling-shell

Start the H2O cluster inside the Spark environment:

import ai.h2o.sparkling._
// Create the H2OContext, or reuse an existing one
val hc = H2OContext.getOrCreate()

Load the data into a Spark DataFrame, cast the label column to a string, and split the data into training and testing sets:

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
val rawSparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
val sparkDF = rawSparkDF.withColumn("CAPSULE", $"CAPSULE" cast "string")
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Define the algorithm whose hyper-parameters will be tuned:

import ai.h2o.sparkling.ml.algos.H2ODRF
val algo = new H2ODRF().setLabelCol("CAPSULE")

By default, the H2ODRF algorithm distinguishes between a classification and a regression problem based on the type of the label column of the training dataset. If the label column is a string column, a classification model will be trained. If the label column is a numeric column, a regression model will be trained. If you would rather not depend on column data types, you can state the problem type explicitly by using ai.h2o.sparkling.ml.algos.classification.H2ODRFClassifier or ai.h2o.sparkling.ml.algos.regression.H2ODRFRegressor instead.

Define a hyper-space which will be traversed

import scala.collection.mutable.HashMap
val hyperSpace: HashMap[String, Array[AnyRef]] = HashMap()
hyperSpace += "ntrees" -> Array(1, 10, 30).map(_.asInstanceOf[AnyRef])
hyperSpace += "mtries" -> Array(-1, 5, 10).map(_.asInstanceOf[AnyRef])

Pass the algorithm and the hyper-space to the grid search and set the properties that define how the hyper-space will be traversed.

Sparkling Water supports two strategies for traversing hyperspace:

  • Cartesian - (Default) This strategy tries out every possible combination of hyper-parameter values and finishes after the whole space is traversed.

  • RandomDiscrete - In each iteration, this strategy randomly selects a combination of values from the hyper-space. It can terminate before the whole space is traversed, depending on the stopping criteria (see the parameters maxRuntimeSecs, maxModels, stoppingRounds, stoppingTolerance, and stoppingMetric). For details, see the H2O-3 documentation.

import ai.h2o.sparkling.ml.algos.H2OGridSearch
val grid = new H2OGridSearch()
    .setHyperParameters(hyperSpace)
    .setAlgo(algo)
    .setStrategy("Cartesian")

Fit the grid search to get the best DRF model.

val model = grid.fit(trainingDF)

You can also get raw model details by calling the getModelDetails() method available on the model as:

model.getModelDetails()

Run Predictions

model.transform(testingDF).show(false)

In Python, first start the PySparkling shell:

./bin/pysparkling

Start the H2O cluster inside the Spark environment:

from pysparkling import *
hc = H2OContext.getOrCreate()

Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets:

import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
sparkDF = hc.asSparkFrame(frame)
sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

Define the algorithm whose hyper-parameters will be tuned. You can configure all the available DRF arguments, such as the label column, using the provided setters or constructor parameters.

from pysparkling.ml import H2ODRF
algo = H2ODRF(labelCol = "CAPSULE")

By default, the H2ODRF algorithm distinguishes between a classification and a regression problem based on the type of the label column of the training dataset. If the label column is a string column, a classification model will be trained. If the label column is a numeric column, a regression model will be trained. If you would rather not depend on column data types, you can state the problem type explicitly by using H2ODRFClassifier or H2ODRFRegressor instead.

Define a hyper-space which will be traversed

hyperSpace = {"ntrees": [1, 10, 30], "mtries": [-1, 5, 10]}
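To get a feel for the size of this hyper-space, the following minimal sketch lists every combination of values using only the Python standard library. The helper `enumerate_combinations` is hypothetical and is not part of Sparkling Water. It only illustrates what "every possible combination" means for the dictionary above.

```python
from itertools import product

# Same hyper-space as defined above.
hyperSpace = {"ntrees": [1, 10, 30], "mtries": [-1, 5, 10]}


def enumerate_combinations(space):
    """Hypothetical helper: return every combination of values as a list of dicts."""
    names = list(space)
    value_lists = [space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


combinations = enumerate_combinations(hyperSpace)
print(len(combinations))  # 3 values for ntrees * 3 values for mtries = 9
for combination in combinations:
    print(combination)
```

With three candidate values for each of the two hyper-parameters, an exhaustive traversal would train nine models. The count grows multiplicatively with each hyper-parameter added.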

Pass the algorithm and the hyper-space to the grid search and set the properties that define how the hyper-space will be traversed.

Sparkling Water supports two strategies for traversing hyperspace:

  • Cartesian - (Default) This strategy tries out every possible combination of hyper-parameter values and finishes after the whole space is traversed.

  • RandomDiscrete - In each iteration, this strategy randomly selects a combination of values from the hyper-space. It can terminate before the whole space is traversed, depending on the stopping criteria (see the parameters maxRuntimeSecs, maxModels, stoppingRounds, stoppingTolerance, and stoppingMetric). For details, see the H2O-3 documentation.
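The difference between the two strategies can be illustrated without Sparkling Water. The sketch below is a simplified model and not the H2O implementation. The helper `random_discrete` is hypothetical, its `max_models` argument only mirrors the idea behind the maxModels parameter, and it assumes that a combination is never tried twice. The time-based and metric-based stopping criteria are not modelled.

```python
import random
from itertools import product


def random_discrete(space, max_models, seed=None):
    """Hypothetical helper: pick distinct random combinations from the
    hyper-space, stopping after max_models or when the space is exhausted."""
    names = list(space)
    all_combinations = [
        dict(zip(names, values))
        for values in product(*(space[name] for name in names))
    ]
    rng = random.Random(seed)
    rng.shuffle(all_combinations)  # random visiting order
    return all_combinations[:max_models]  # early termination


space = {"ntrees": [1, 10, 30], "mtries": [-1, 5, 10]}

# A budget of 4 models visits only part of the 9-combination space.
sampled = random_discrete(space, max_models=4, seed=42)
print(len(sampled))

# A budget larger than the space degenerates to a full traversal.
exhaustive = random_discrete(space, max_models=100, seed=42)
print(len(exhaustive))
```

A Cartesian traversal corresponds to the second call, where every combination is visited. The first call shows how a budget cuts the search short, trading coverage of the hyper-space for run time.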

from pysparkling.ml import H2OGridSearch
grid = H2OGridSearch(hyperParameters=hyperSpace, algo=algo, strategy="Cartesian")

Fit the grid search to get the best DRF model.

model = grid.fit(trainingDF)

You can also get raw model details by calling the getModelDetails() method available on the model as:

model.getModelDetails()

Run Predictions

model.transform(testingDF).show(truncate = False)

© Copyright 2016-2020 H2O.ai. Last updated on Nov 19, 2024.
