Running Sparkling Water on Kerberized Hadoop Cluster

Sparkling Water can run on a Kerberized Hadoop cluster and also supports Kerberos authentication for clients and Flow access. This tutorial shows how to configure Sparkling Water to run on a Kerberized Hadoop cluster. If you are also interested in using Kerberos authentication, please read Enabling Kerberos Authentication.

Sparkling Water supports Kerberized clusters in both the internal and external backends.

Internal Backend

To make Sparkling Water aware of the Kerberized cluster, you can call:

bin/sparkling-shell --conf "spark.yarn.principal=PRINCIPAL" --conf "spark.yarn.keytab=/path/to/keytab"
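
The same Kerberos configuration can also be passed when submitting a Sparkling Water application with spark-submit. The sketch below is only an illustration; the main class and application jar names are hypothetical placeholders:

# com.example.SparklingWaterApp and sparkling-water-app.jar are hypothetical placeholders
bin/spark-submit --conf "spark.yarn.principal=PRINCIPAL" --conf "spark.yarn.keytab=/path/to/keytab" --class com.example.SparklingWaterApp sparkling-water-app.jar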

or you can create the Kerberos ticket beforehand using kinit and just call

./bin/sparkling-shell

In this case, Sparkling Water will use the created ticket, and you don't need to pass the configuration details.
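
For reference, a ticket can typically be obtained from a keytab and verified with the standard Kerberos tools before launching the shell (PRINCIPAL and the keytab path are placeholders):

# Obtain a ticket from the keytab and list the cached tickets to verify it
kinit -kt /path/to/keytab PRINCIPAL
klist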

External Backend

In the external backend, the H2O cluster is also started on YARN, and it needs to be secured as well.

You can start Sparkling Water as:

bin/sparkling-shell --conf "spark.yarn.principal=PRINCIPAL" --conf "spark.yarn.keytab=/path/to/keytab"

In this case, the values of the spark.yarn.principal and spark.yarn.keytab properties will also be used to set spark.ext.h2o.external.kerberos.principal and spark.ext.h2o.external.kerberos.keytab, respectively. These options are used to set up Kerberos on the external H2O cluster via Sparkling Water.

You can also set the spark.ext.h2o.external.kerberos.principal and spark.ext.h2o.external.kerberos.keytab options directly.
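
For example, a minimal sketch showing how these options might be passed explicitly (the principal and keytab path are placeholders):

bin/sparkling-shell --conf "spark.ext.h2o.external.kerberos.principal=PRINCIPAL" --conf "spark.ext.h2o.external.kerberos.keytab=/path/to/keytab"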

The simplest way to start Sparkling Water is:

./bin/sparkling-shell

In this case, we assume that the ticket has been created using kinit, and it will be used for both Spark and the external H2O cluster.

The same configuration is also valid for PySparkling and RSparkling.
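
For instance, PySparkling can be started in a similar way; the command below assumes the bin/pysparkling launcher shipped with the Sparkling Water distribution:

bin/pysparkling --conf "spark.yarn.principal=PRINCIPAL" --conf "spark.yarn.keytab=/path/to/keytab"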