Train Word2Vec Model in Sparkling Water

Sparkling Water provides an API for H2O Word2Vec in Scala and Python. The following sections describe how to train a Word2Vec model in Sparkling Water in both languages. See also Parameters of H2OWord2Vec.

Scala

First, start the Sparkling Shell:

./bin/sparkling-shell

Start the H2O cluster inside the Spark environment:

import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()

Load the data into a Spark DataFrame and split it into training and testing sets:

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("craigslistJobTitles.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

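Create the pipeline with H2O Word2Vec. The pipeline tokenizes the job title, removes stop words, computes Word2Vec features, and trains an H2O GBM model on those features to predict the category. You can configure all the available Word2Vec arguments using the provided setters.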

import ai.h2o.sparkling.ml.algos.H2OGBM
import ai.h2o.sparkling.ml.features.H2OWord2Vec
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer()
  .setInputCol("jobtitle")
  .setMinTokenLength(2)

val stopWordsRemover = new StopWordsRemover()
  .setInputCol(tokenizer.getOutputCol)

val w2v = new H2OWord2Vec()
  .setSentSampleRate(0)
  .setEpochs(10)
  .setInputCol(stopWordsRemover.getOutputCol)

val gbm = new H2OGBM()
  .setLabelCol("category")
  .setFeaturesCols(w2v.getOutputCol)

val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, w2v, gbm))

Train the pipeline:

val model = pipeline.fit(trainingDF)

Run predictions:

model.transform(testingDF).show(false)

Python

First, start the PySparkling Shell:

./bin/pysparkling

Start the H2O cluster inside the Spark environment:

from pysparkling import *
hc = H2OContext.getOrCreate()

Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets:

import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
sparkDF = hc.asSparkFrame(frame.set_names(['category', 'jobtitle']))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
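
The split above is random, so the two DataFrames hold roughly, not exactly, 80% and 20% of the rows, and the result changes between runs unless a seed is passed (PySpark's `randomSplit` accepts an optional `seed` argument). The following is a minimal pure-Python sketch of that behavior, assuming each row is assigned independently with probability proportional to the weights. The `random_split` helper is hypothetical and is not part of Spark or Sparkling Water.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on [0, 1), e.g. [0.8, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating-point drift
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
training, testing = random_split(rows, [0.8, 0.2], seed=42)
# Sizes are close to 800 and 200, but not guaranteed to match exactly
print(len(training), len(testing))
```

Calling the helper again with the same seed returns the same split, which is useful when comparing models trained with different Word2Vec settings.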

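Create the pipeline with H2O Word2Vec. The pipeline tokenizes the job title, removes stop words, computes Word2Vec features, and trains an H2O GBM model on those features to predict the category. You can configure all the available Word2Vec arguments using the provided setters.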

from pysparkling.ml import H2OGBM, H2OWord2Vec
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

tokenizer = RegexTokenizer(inputCol="jobtitle", minTokenLength=2)
stopWordsRemover = StopWordsRemover(inputCol=tokenizer.getOutputCol())
w2v = H2OWord2Vec(sentSampleRate=0, epochs=10, inputCol=stopWordsRemover.getOutputCol())
gbm = H2OGBM(labelCol="category", featuresCols=[w2v.getOutputCol()])

pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, w2v, gbm])
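
To see what the first two stages hand to Word2Vec, here is a rough pure-Python approximation of the text preparation. It assumes the tokenizer lowercases the text, splits on whitespace, and drops tokens shorter than `minTokenLength`. The stop-word list is a hypothetical, heavily shortened stand-in for Spark's default English list, and the sample job title is made up.

```python
import re

# Hypothetical, heavily shortened stop-word list; Spark's default English list is much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

def tokenize(text, min_token_length=2):
    """Lowercase, split on whitespace, and drop tokens shorter than min_token_length."""
    tokens = re.split(r"\s+", text.lower().strip())
    return [t for t in tokens if len(t) >= min_token_length]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

title = "Assistant to the Regional Manager - P/T"  # made-up example title
tokens = tokenize(title)
filtered = remove_stop_words(tokens)
print(tokens)    # the single-character "-" token is dropped here
print(filtered)  # "to" and "the" are removed here
```

The Word2Vec stage therefore receives an array of strings per row rather than the raw job title.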

Train the pipeline:

model = pipeline.fit(trainingDF)

Run predictions:

model.transform(testingDF).show(truncate=False)
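
The GBM stage needs one fixed-length feature vector per row, while each job title contains a variable number of tokens. The sketch below illustrates one common way to bridge that gap: averaging the per-token embeddings. It is an illustration under the assumption that the Word2Vec stage aggregates by averaging, not a description of the Sparkling Water internals. The embeddings, the 3-dimensional vector size, and the zero-vector fallback for titles with no known tokens are all hypothetical.

```python
# Hypothetical 3-dimensional embeddings; a trained model learns these and uses more dimensions.
EMBEDDINGS = {
    "regional": [0.2, 0.4, -0.1],
    "manager": [0.6, 0.0, 0.3],
}

def average_embedding(tokens, embeddings, size=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption for this sketch only: fall back to a zero vector
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]

# "p/t" has no embedding in this toy vocabulary, so only two vectors are averaged
print(average_embedding(["regional", "manager", "p/t"], EMBEDDINGS))
```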