Train Word2Vec Model in Sparkling Water
---------------------------------------

Sparkling Water provides an API for H2O Word2Vec in Scala and Python.
The following sections describe how to train a Word2Vec model in Sparkling Water in both languages. See also :ref:`parameters_H2OWord2Vec`.

.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        First, start the Sparkling Shell:

        .. code:: shell

            ./bin/sparkling-shell

        Start the H2O cluster inside the Spark environment:

        .. code:: scala

            import ai.h2o.sparkling._
            val hc = H2OContext.getOrCreate()

        Load the data with Spark and split it into training and testing sets:

        .. code:: scala

            import org.apache.spark.SparkFiles
            // Download the CSV so that it is available to Spark under its file name
            spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
            val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("craigslistJobTitles.csv"))
            val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

        Create the pipeline with H2O Word2Vec. The job titles are tokenized, stop words are removed, Word2Vec turns the remaining
        words into features, and H2O GBM is trained on those features to predict the category. You can configure all the
        available Word2Vec arguments using the provided setters.

        .. code:: scala

            import ai.h2o.sparkling.ml.algos.H2OGBM
            import ai.h2o.sparkling.ml.features.H2OWord2Vec
            import org.apache.spark.ml.Pipeline
            import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

            val tokenizer = new RegexTokenizer()
                .setInputCol("jobtitle")
                .setMinTokenLength(2)

            val stopWordsRemover = new StopWordsRemover()
                .setInputCol(tokenizer.getOutputCol)

            val w2v = new H2OWord2Vec()
                .setSentSampleRate(0)
                .setEpochs(10)
                .setInputCol(stopWordsRemover.getOutputCol)

            val gbm = new H2OGBM()
                .setLabelCol("category")
                .setFeaturesCols(w2v.getOutputCol)

            val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, w2v, gbm))

        Train the pipeline:

        .. code:: scala

            val model = pipeline.fit(trainingDF)

        Run predictions:

        .. code:: scala

            model.transform(testingDF).show(false)

    .. tab-container:: Python
        :title: Python

        First, start the PySparkling Shell:

        .. code:: shell

            ./bin/pysparkling

        Start the H2O cluster inside the Spark environment:

        .. code:: python

            from pysparkling import *
            hc = H2OContext.getOrCreate()

        Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets:

        .. code:: python

            import h2o
            frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
            # Name the columns explicitly before converting the H2O frame to a Spark DataFrame
            sparkDF = hc.asSparkFrame(frame.set_names(['category', 'jobtitle']))
            [trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

        Create the pipeline with H2O Word2Vec. The job titles are tokenized, stop words are removed, Word2Vec turns the remaining
        words into features, and H2O GBM is trained on those features to predict the category. You can configure all the
        available Word2Vec arguments using the provided setters.

        .. code:: python

            from pysparkling.ml import H2OGBM, H2OWord2Vec
            from pyspark.ml import Pipeline
            from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

            tokenizer = RegexTokenizer(inputCol="jobtitle", minTokenLength=2)
            stopWordsRemover = StopWordsRemover(inputCol=tokenizer.getOutputCol())
            w2v = H2OWord2Vec(sentSampleRate=0, epochs=10, inputCol=stopWordsRemover.getOutputCol())
            gbm = H2OGBM(labelCol="category", featuresCols=[w2v.getOutputCol()])

            pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, w2v, gbm])

        Train the pipeline:

        .. code:: python

            model = pipeline.fit(trainingDF)

        Run predictions:

        .. code:: python

            model.transform(testingDF).show(truncate=False)
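
To make it easier to see what reaches Word2Vec, the following plain-Python sketch approximates the effect of the first two pipeline stages (``RegexTokenizer`` with ``minTokenLength=2``, followed by ``StopWordsRemover``) on a single job title. It is an illustration only and not the Spark implementation: the helper names ``tokenize``, ``remove_stop_words`` and ``preprocess`` are hypothetical, the sketch assumes the tokenizer's default behaviour of lowercasing and splitting on whitespace, and ``STOP_WORDS`` is a small made-up subset of the much longer English list that Spark uses by default.

.. code:: python

    import re

    # Hypothetical, tiny stop-word list for illustration only; Spark's
    # StopWordsRemover uses a much longer English list by default.
    STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

    def tokenize(text, min_token_length=2):
        """Lowercase, split on whitespace, and drop tokens that are too short."""
        tokens = re.split(r"\s+", text.lower().strip())
        return [t for t in tokens if len(t) >= min_token_length]

    def remove_stop_words(tokens, stop_words=STOP_WORDS):
        """Keep only the tokens that are not in the stop-word set."""
        return [t for t in tokens if t not in stop_words]

    def preprocess(title):
        """Chain both steps, mirroring the order of the pipeline stages."""
        return remove_stop_words(tokenize(title))

    if __name__ == "__main__":
        print(preprocess("Manager of a Restaurant in the City"))

In the real pipeline, the resulting list of words for each row is the input column that ``H2OWord2Vec`` consumes.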