Deploying PySparkling Pipeline Models
-------------------------------------

This tutorial demonstrates how to export a PySparkling pipeline model and load it again for scoring.

Let's first create and export the model:

.. code:: python

    from pyspark.ml import Pipeline
    from pysparkling import *
    from pysparkling.ml import *

    hc = H2OContext.getOrCreate()

    # Helper method to locate the data file
    def locate(file_name):
        return "examples/smalldata/" + file_name

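    # Prepare the data
    def load():
        # `spark` is the SparkSession provided by the pysparkling shell.
        # Split each line on the first tab and drop rows with an empty label.
        row_rdd = (spark.sparkContext.textFile(locate("smsData.txt"))
                   .map(lambda x: x.split("\t", 1))
                   .filter(lambda r: r[0].strip()))
        return spark.createDataFrame(row_rdd, ["label", "text"])

    # Load the data
    data = load()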

    # Create the H2O GBM pipeline stage
    gbm = H2OGBM(splitRatio=0.8, seed=1, labelCol="label")

    # Create a pipeline with a single GBM step
    pipeline = Pipeline(stages=[gbm])

    # Fit and export the pipeline
    model = pipeline.fit(data)
    model.save("exported_model")

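To make the ``map`` and ``filter`` steps in ``load()`` concrete, here is a minimal sketch that applies the same two
operations to a few lines in plain Python, without Spark. The sample lines are hypothetical and are not taken from
``smsData.txt``; they only assume the ``label<TAB>text`` layout that ``load()`` expects.

.. code:: python

    # Hypothetical sample lines in the "label<TAB>text" layout (not real data)
    sample_lines = [
        "ham\tOk lar... Joking wif u oni...",
        "spam\tFree entry in 2 a wkly comp\tto win FA Cup final tkts",
        "\tline with an empty label",
    ]

    # Same two steps as in load(): split on the first tab only, then drop
    # rows whose label is empty after stripping whitespace
    rows = [line.split("\t", 1) for line in sample_lines]
    rows = [r for r in rows if r[0].strip()]

    for label, text in rows:
        print(label, "->", repr(text))

Because the split is limited to one tab, any further tabs stay inside the text column. A line with no tab at all
would produce a one-element list, so the sketch assumes every line contains at least one tab.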

Once the model is exported, let's start a new ``./pysparkling`` shell. This demonstrates that scoring does not require
an ``H2OContext``: the ``H2OGBM`` stage internally uses a MOJO, which does not need the H2O runtime.

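Before loading the model in the new shell, it can be worth confirming that the export landed where we expect. The
following is a minimal sketch, assuming the model was saved to the local filesystem (on a cluster the path may resolve
to HDFS or another store instead) and relying on the ``metadata`` and ``stages`` sub-directories that Spark ML writes
for a saved pipeline. The helper name ``looks_like_saved_pipeline`` is hypothetical and is not part of PySparkling.

.. code:: python

    import os

    def looks_like_saved_pipeline(path):
        """Return True if `path` contains the sub-directories of a saved Spark ML pipeline."""
        # Assumption: local filesystem path; a saved pipeline has `metadata` and `stages`
        return all(os.path.isdir(os.path.join(path, name))
                   for name in ("metadata", "stages"))

    # Prints True only if the export from the previous step exists at this path
    print(looks_like_saved_pipeline("exported_model"))

This only checks the directory layout; it does not verify that the stored stages can actually be loaded.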

First, we need to ensure that all Java classes bundled in the PySparkling distribution are available across the Spark
cluster. To do that, we run the following code:

.. code:: python

    from pysparkling.initializer import Initializer
    Initializer.load_sparkling_jar(spark)

Once PySparkling is initialized, we can load the model:

.. code:: python

    from pyspark.ml import PipelineModel
    model = PipelineModel.load("exported_model")

We can then run predictions with the model:

.. code:: python

    # Placeholder: a Spark DataFrame with the data to score
    df_for_predictions = ...
    model.transform(df_for_predictions)

If we do not initialize PySparkling using the ``Initializer``, we get a ``class not found`` exception while loading
the model, because Spark does not know about the required classes. But as we can see, we do not need to initialize
``H2OContext`` for scoring tasks.