# Using the MOJO Scoring Pipeline with Spark/Sparkling Water

MOJO scoring pipeline artifacts can be used in Spark to carry out predictions in parallel using the Sparkling Water API. This section shows how to load and run predictions on the MOJO scoring pipeline in Spark using Scala and the Python API.

Note: Sparkling Water is backwards compatible with MOJO versions produced by different Driverless AI versions.

## Requirements

- You must have a Spark cluster with the Sparkling Water JAR file passed to Spark.
- To run with PySparkling, you must have the PySparkling zip file.

The H2OContext does not have to be created if you only want to run predictions on MOJOs using Spark, because the scoring is independent of the H2O run-time.

To use the MOJO scoring pipeline, the Driverless AI license has to be passed to Spark. This can be done via the --jars argument of the Spark launcher scripts.

Note: In local Spark mode, use --driver-class-path to specify the path to the license file.
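As a sketch of the local-mode case, a PySpark session might be started as follows; the license path here is a placeholder, not a real location:

```shell
# Local mode: pass the license on the driver classpath (path is a placeholder)
./bin/pyspark --py-files pysparkling.zip --driver-class-path /path/to/license.sig
```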

## PySparkling¶

First, start PySpark with the PySparkling Python package and the Driverless AI license.

If you are starting PySpark directly:

```shell
./bin/pyspark --jars license.sig --py-files pysparkling.zip
```

If you are using the PySparkling launcher, which adds the Python package for you:

```shell
./bin/pysparkling --jars license.sig
```

At this point, you should have a PySpark interactive terminal available where you can try out predictions. To productionalize the scoring process, use the same configuration, but submit your job to a cluster with ./bin/spark-submit instead of starting ./bin/pyspark.
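For example, a batch scoring job could be submitted with the same options; the script name score_mojo.py below is a placeholder for your own scoring script:

```shell
# Submit a scoring job to the cluster (score_mojo.py is a hypothetical script name)
./bin/spark-submit --jars license.sig --py-files pysparkling.zip score_mojo.py
```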

```python
# First, specify the dependencies
from pysparkling.ml import H2OMOJOPipelineModel, H2OMOJOSettings

# The 'namedMojoOutputColumns' option ensures that the output columns are named properly.
# If you want the old behavior, where all output columns were stored inside an array,
# set it to False. However, we strongly encourage you to keep the default value of True.
settings = H2OMOJOSettings(namedMojoOutputColumns=True)

# Load the pipeline. 'settings' is an optional argument; if it is not specified, the default values are used.
mojo = H2OMOJOPipelineModel.createFromMojo("file:///path/to/the/pipeline.mojo", settings)

# Load the data as a Spark DataFrame
dataFrame = spark.read.option("header", "true").csv("file:///path/to/the/data.csv")

# Run the predictions. The output contains all the original columns plus the prediction columns.
predictions = mojo.transform(dataFrame)

# You can easily get the predictions for a desired column using the helper function:
predictions.select(mojo.selectPredictionUDF("AGE")).collect()
```


## Sparkling Water

First, start Spark with the Sparkling Water Scala assembly and the Driverless AI license.

If you are starting spark-shell directly:

```shell
./bin/spark-shell --jars license.sig,sparkling-water-assembly.jar
```

If you are using the Sparkling Water shell launcher, which adds the assembly for you:

```shell
./bin/sparkling-shell --jars license.sig
```

At this point, you should have a Sparkling Water interactive terminal available where you can carry out predictions. To productionalize the scoring process, use the same configuration, but submit your job to a cluster with ./bin/spark-submit instead of starting ./bin/spark-shell.
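As a sketch, a packaged Scala scoring job could be submitted like this; the class name and application JAR below are hypothetical placeholders for your own build artifacts:

```shell
# Submit a Scala scoring job (com.example.ScoreMojo and score-mojo.jar are hypothetical)
./bin/spark-submit --jars license.sig,sparkling-water-assembly.jar --class com.example.ScoreMojo score-mojo.jar
```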

```scala
// First, specify the dependencies
import ai.h2o.sparkling.ml.models.{H2OMOJOPipelineModel, H2OMOJOSettings}

// The 'namedMojoOutputColumns' option ensures that the output columns are named properly.
// If you want the old behavior, where all output columns were stored inside an array,
// set it to false. However, we strongly encourage you to keep the default value of true.
val settings = H2OMOJOSettings(namedMojoOutputColumns = true)

// Load the pipeline. 'settings' is an optional argument; if it is not specified, the default values are used.
val mojo = H2OMOJOPipelineModel.createFromMojo("file:///path/to/the/pipeline.mojo", settings)

// Load the data as a Spark DataFrame
val dataFrame = spark.read.option("header", "true").csv("file:///path/to/the/data.csv")

// Run the predictions. The output contains all the original columns plus the prediction columns.
val predictions = mojo.transform(dataFrame)
```