Train KMeans Model in Sparkling Water

Introduction

K-Means falls into the general category of clustering algorithms. For a more comprehensive description, see the H2O-3 K-Means documentation.

Example

The following section describes how to train a KMeans model in Sparkling Water in Scala and Python, following the same example as the H2O-3 documentation mentioned above. See also Parameters of H2OKMeans and Details of H2OKMeansMOJOModel.

Scala

First, let’s start the Sparkling Shell

./bin/sparkling-shell

Start the H2O cluster inside the Spark environment

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

Download the iris dataset and load it into a Spark DataFrame

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/iris/iris_wheader.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("iris_wheader.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Set the predictors

val predictors = Array("sepal_len", "sepal_wid", "petal_len", "petal_wid")

Build and train the model. You can configure all of the available KMeans arguments using the provided setters.

import ai.h2o.sparkling.ml.algos.H2OKMeans
val estimator = new H2OKMeans()
   .setEstimateK(true)
   .setSeed(1234)
   .setFeaturesCols(predictors)
val model = estimator.fit(trainingDF)
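
For instance, instead of letting the algorithm estimate the number of clusters, a fixed number can be requested. The snippet below is a minimal sketch; the setters setK, setMaxIterations and setStandardize are assumed to mirror the H2O-3 K-Means parameters k, max_iterations and standardize and may differ between Sparkling Water versions.

// A sketch assuming setters that mirror the H2O-3 K-Means parameters.
val fixedKEstimator = new H2OKMeans()
   .setK(3)                // assumed setter for a fixed number of clusters (H2O-3 parameter k)
   .setMaxIterations(20)   // assumed setter for max_iterations
   .setStandardize(true)   // assumed setter for standardize
   .setSeed(1234)
   .setFeaturesCols(predictors)
val fixedKModel = fixedKEstimator.fit(trainingDF)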

Evaluate the model performance

val metrics = model.getTrainingMetrics()
println(metrics)
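
If part of the data is held out for validation, the validation metrics can be read in the same way. The sketch below assumes a setSplitRatio setter for an internal train/validation split and a getValidationMetrics() method analogous to getTrainingMetrics(); both names are assumptions.

// A sketch assuming an internal train/validation split is supported.
val estimatorWithSplit = new H2OKMeans()
   .setEstimateK(true)
   .setSeed(1234)
   .setFeaturesCols(predictors)
   .setSplitRatio(0.8)  // assumed setter: keep 80% of the data for training, 20% for validation
val modelWithSplit = estimatorWithSplit.fit(trainingDF)
println(modelWithSplit.getTrainingMetrics())
println(modelWithSplit.getValidationMetrics())  // assumed getter, analogous to getTrainingMetrics()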

Run predictions

model.transform(testingDF).show(false)
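
The result of transform is a regular Spark DataFrame, so it can be post-processed with standard Spark operations. The following sketch assumes the assigned cluster is returned in a column named prediction; the exact output columns may vary between Sparkling Water versions.

// Keep only two features and the assigned cluster (the column name is an assumption).
val predictions = model.transform(testingDF)
predictions.select("petal_len", "petal_wid", "prediction").show(5, false)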

You can also get model details by calling the methods listed in Details of H2OKMeansMOJOModel.
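
For example, the fitted MOJO model exposes the parameters used for training and a JSON summary of the model. The accessors below are assumptions and may differ between Sparkling Water versions.

println(model.getTrainingParams())  // assumed accessor: map of the H2O parameters used for training
println(model.getModelDetails())    // assumed accessor: JSON description of the trained model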

Python

First, let’s start the PySparkling Shell

./bin/pysparkling

Start the H2O cluster inside the Spark environment

from pysparkling import *
hc = H2OContext.getOrCreate()

Parse the data using H2O and convert it to a Spark DataFrame

import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/iris/iris_wheader.csv")
sparkDF = hc.asSparkFrame(frame)
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

Set the predictors

predictors = ["sepal_len", "sepal_wid", "petal_len", "petal_wid"]

Build and train the model. You can configure all of the available KMeans arguments using the provided setters or constructor parameters.

from pysparkling.ml import H2OKMeans
estimator = H2OKMeans(
    estimateK=True,
    seed=1234,
    featuresCols=predictors)
model = estimator.fit(trainingDF)
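
As in the Scala example, a fixed number of clusters can be requested instead of estimating it. The constructor parameters k, maxIterations and standardize used below are assumptions mirroring the H2O-3 K-Means options and may differ between Sparkling Water versions.

# A sketch assuming constructor parameters that mirror the H2O-3 K-Means options.
fixed_k_estimator = H2OKMeans(
    k=3,               # assumed parameter for a fixed number of clusters (H2O-3 parameter k)
    maxIterations=20,  # assumed parameter for max_iterations
    standardize=True,  # assumed parameter for standardize
    seed=1234,
    featuresCols=predictors)
fixed_k_model = fixed_k_estimator.fit(trainingDF)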

Evaluate the model performance

metrics = model.getTrainingMetrics()
print(metrics)
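
As in the Scala example, validation metrics can be read when part of the data is held out. The splitRatio parameter and the getValidationMetrics() method below are assumptions.

# A sketch assuming an internal train/validation split is supported.
estimator_with_split = H2OKMeans(
    estimateK=True,
    seed=1234,
    featuresCols=predictors,
    splitRatio=0.8)  # assumed parameter: keep 80% of the data for training, 20% for validation
model_with_split = estimator_with_split.fit(trainingDF)
print(model_with_split.getTrainingMetrics())
print(model_with_split.getValidationMetrics())  # assumed getter, analogous to getTrainingMetrics()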

Run predictions

model.transform(testingDF).show(truncate=False)
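
As above, the output of transform is a regular Spark DataFrame; the prediction column name used below is an assumption and may vary between Sparkling Water versions.

# Keep only two features and the assigned cluster (the column name is an assumption).
predictions = model.transform(testingDF)
predictions.select("petal_len", "petal_wid", "prediction").show(5, truncate=False)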

You can also get model details by calling the methods listed in Details of H2OKMeansMOJOModel.
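
As in the Scala example, the accessors below for the training parameters and a JSON model summary are assumptions that may differ between Sparkling Water versions.

print(model.getTrainingParams())  # assumed accessor: dictionary of the H2O parameters used for training
print(model.getModelDetails())    # assumed accessor: JSON description of the trained model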