.. _isolation_forest:

Train Isolation Forest Model in Sparkling Water
-----------------------------------------------

Sparkling Water provides an API for H2O Isolation Forest in Scala and Python.
The following sections describe how to train the Isolation Forest model in Sparkling Water in both languages.
See also :ref:`parameters_H2OIsolationForest` and :ref:`model_details_H2OIsolationForestMOJOModel`.

.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        First, let's start Sparkling Shell:

        .. code:: shell

            ./bin/sparkling-shell

        Start the H2O cluster inside the Spark environment:

        .. code:: scala

            import ai.h2o.sparkling._
            val hc = H2OContext.getOrCreate()

        Load the data into Spark DataFrames:

        .. code:: scala

            import org.apache.spark.SparkFiles
            spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_train.csv")
            spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_test.csv")
            val trainingDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("ecg_discord_train.csv"))
            val testingDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("ecg_discord_test.csv"))

        Train the model. You can configure all the available Isolation Forest arguments using the provided setters.

        .. code:: scala

            import ai.h2o.sparkling.ml.algos.H2OIsolationForest
            val estimator = new H2OIsolationForest()
            val model = estimator.fit(trainingDF)

        Run predictions:

        .. code:: scala

            model.transform(testingDF).show(false)

        You can also get model details by calling the methods listed in :ref:`model_details_H2OIsolationForestMOJOModel`.

    .. tab-container:: Python
        :title: Python

        First, let's start PySparkling Shell:

        .. code:: shell

            ./bin/pysparkling

        Start the H2O cluster inside the Spark environment:

        .. code:: python

            from pysparkling import *
            hc = H2OContext.getOrCreate()

        Parse the data using H2O and convert it to Spark DataFrames:

        .. code:: python

            import h2o
            trainingFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_train.csv")
            trainingDF = hc.asSparkFrame(trainingFrame)
            testingFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/anomaly/ecg_discord_test.csv")
            testingDF = hc.asSparkFrame(testingFrame)

        Train the model. You can configure all the available Isolation Forest arguments using the provided setters or constructor parameters.

        .. code:: python

            from pysparkling.ml import H2OIsolationForest
            estimator = H2OIsolationForest()
            model = estimator.fit(trainingDF)

        Run predictions:

        .. code:: python

            model.transform(testingDF).show(truncate=False)

        You can also get model details by calling the methods listed in :ref:`model_details_H2OIsolationForestMOJOModel`.
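        For example, a minimal sketch of inspecting the trained model, assuming the generic MOJO model accessors ``getModelDetails()`` and ``getTrainingParams()`` are available in your Sparkling Water version:

        .. code:: python

            # Assumed generic MOJO model accessors; see the model details reference for the full list
            print(model.getModelDetails())    # JSON representation of the underlying H2O model
            print(model.getTrainingParams())  # parameters that were effectively used for training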
Train Isolation Forest with H2OGridSearch
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are not sure about the exact values of Isolation Forest hyper-parameters, you can plug ``H2OIsolationForest`` into ``H2OGridSearch``
and define a hyper-parameter space to be searched. Unlike other Sparkling Water algorithms used in ``H2OGridSearch``, ``H2OIsolationForest``
requires the ``validationDataFrame`` parameter so that ``H2OGridSearch`` is able to evaluate the produced models.
The validation data frame must have an extra column identifying whether each row represents an anomaly.
The column can contain only two string values, where the value for the negative case must be alphabetically smaller than the value
for the positive case, e.g. ``"0"``/``"1"``, ``"no"``/``"yes"``, ``"false"``/``"true"``, etc.

.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        First, let's load the training and validation datasets:

        .. code:: scala

            import org.apache.spark.SparkFiles
            spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
            spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate_anomaly_validation.csv")
            val trainingDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
            val validationDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate_anomaly_validation.csv"))

        Create an algorithm instance, pass the validation data frame, and specify the column identifying an anomaly:

        .. code:: scala

            import ai.h2o.sparkling.ml.algos.H2OIsolationForest
            val algorithm = new H2OIsolationForest()
            algorithm.setValidationDataFrame(validationDF)
            algorithm.setValidationLabelCol("isAnomaly")

        Define a hyper-parameter space:

        .. code:: scala

            import scala.collection.mutable
            val hyperParams: mutable.HashMap[String, Array[AnyRef]] = mutable.HashMap()
            hyperParams += "ntrees" -> Array(10, 20, 30).map(_.asInstanceOf[AnyRef])
            hyperParams += "maxDepth" -> Array(5, 10, 20).map(_.asInstanceOf[AnyRef])

        Pass the prepared hyper-parameter space and the algorithm to ``H2OGridSearch`` and run it:

        .. code:: scala

            import ai.h2o.sparkling.ml.algos.H2OGridSearch
            val grid = new H2OGridSearch()
            grid.setAlgo(algorithm)
            grid.setHyperParameters(hyperParams)
            val model = grid.fit(trainingDF)

        ``Logloss`` is the default metric for comparing the models produced by the grid and can be changed via the ``setSelectBestModelBy`` method on ``H2OGridSearch``.

    .. tab-container:: Python
        :title: Python

        First, let's load the training and validation datasets:

        .. code:: python

            import h2o
            trainingFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
            trainingDF = hc.asSparkFrame(trainingFrame)
            validationFrame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate_anomaly_validation.csv")
            validationDF = hc.asSparkFrame(validationFrame)

        Create an algorithm instance, pass the validation data frame, and specify the column identifying an anomaly:

        .. code:: python

            from pysparkling.ml import H2OIsolationForest
            algorithm = H2OIsolationForest(validationDataFrame=validationDF, validationLabelCol="isAnomaly")

        Define a hyper-parameter space:

        .. code:: python

            hyperSpace = {"ntrees": [10, 20, 30], "maxDepth": [5, 10, 20]}

        Pass the prepared hyper-parameter space and the algorithm to ``H2OGridSearch`` and run it:

        .. code:: python

            from pysparkling.ml import H2OGridSearch
            grid = H2OGridSearch(hyperParameters=hyperSpace, algo=algorithm)
            model = grid.fit(trainingDF)

        ``Logloss`` is the default metric for comparing the models produced by the grid and can be changed via the ``setSelectBestModelBy`` method on ``H2OGridSearch``.
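        For example, a minimal sketch of ranking the grid models by a different metric; the metric name ``"AUC"`` is only an illustrative assumption, so use a metric supported by your setup:

        .. code:: python

            # Select the best model by a different metric before running the grid (illustrative metric name)
            grid.setSelectBestModelBy("AUC")
            model = grid.fit(trainingDF)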