Running Sparkling Water in Kubernetes
-------------------------------------

Sparkling Water can be executed inside a Kubernetes cluster. Sparkling Water supports Kubernetes since Spark version 2.4.

Before we start, please check the following:

1. Make sure we are familiar with how to run Spark on Kubernetes; see the
   `Spark on Kubernetes documentation <https://spark.apache.org/docs/latest/running-on-kubernetes.html>`__.
2. Ensure that we have a working Kubernetes cluster and ``kubectl`` installed.
3. Ensure that ``SPARK_HOME`` points to the home directory of our Spark distribution of version SUBST_SPARK_VERSION.
4. Run ``kubectl cluster-info`` to obtain the Kubernetes master URL.
5. Have an internet connection so that Kubernetes can download the Sparkling Water Docker images.
6. If non-default network policies are applied to the namespace where Sparkling Water is supposed to run, make sure
   that the following ports are exposed: all Spark ports and ports 54321 and 54322, as H2O also needs these two
   ports to communicate.

The examples below use the default Kubernetes namespace, which we enable for Spark as:

.. code:: bash

    kubectl create clusterrolebinding default --clusterrole=edit --serviceaccount=default:default --namespace=default

We can also use a different namespace setup for Spark. In that case, please don't forget to pass
``--conf spark.kubernetes.authenticate.driver.serviceAccountName=serviceName`` to our Spark commands.

Internal Backend
~~~~~~~~~~~~~~~~

In the internal backend of Sparkling Water, we need to pass the option ``spark.scheduler.minRegisteredResourcesRatio=1``
to our Spark job invocation. This ensures that Spark waits for all resources and therefore Sparkling Water will start
H2O on all requested executors.

Dynamic allocation must be disabled in Spark.
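Although dynamic allocation is disabled in Spark by default, it may be switched on by a shared
``spark-defaults.conf``. A minimal sketch of how to pin both settings explicitly on any of the
``spark-submit`` or ``spark-shell`` invocations below (the resource-ratio option is required; the
dynamic allocation flag only restates the Spark default as a safeguard):

.. code:: bash

    # Wait until 100% of the requested executors have registered before
    # scheduling work, so H2O can start on every executor
    --conf spark.scheduler.minRegisteredResourcesRatio=1 \
    # Keep the executor count fixed; H2O nodes cannot be added or removed at runtime
    --conf spark.dynamicAllocation.enabled=false \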
.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        Both cluster and client deployment modes of Kubernetes are supported.

        **To submit a Scala job in cluster mode, run:**

        .. code:: bash

            $SPARK_HOME/bin/spark-submit \
            --master "k8s://KUBERNETES_ENDPOINT" \
            --deploy-mode cluster \
            --conf spark.scheduler.minRegisteredResourcesRatio=1 \
            --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:SUBST_SW_VERSION \
            --conf spark.executor.instances=3 \
            --conf spark.driver.host=sparkling-water-app \
            --conf spark.kubernetes.driver.pod.name=sparkling-water-app \
            --class ai.h2o.sparkling.KubernetesTest \
            local:///opt/sparkling-water/tests/kubernetesTest.jar

        **To start an interactive shell in client mode:**

        1. Create a headless service so Spark executors can reach the driver node:

           .. code:: bash

               cat <<EOF | kubectl apply -f -
               apiVersion: v1
               kind: Service
               metadata:
                 name: sparkling-water-app
               spec:
                 clusterIP: "None"
                 selector:
                   spark-driver-selector: sparkling-water-app
               EOF

        2. Start a pod from which we run the shell, labeled so that the service above selects it:

           .. code:: bash

               kubectl run -n default -i --tty sparkling-water-app --restart=Never \
                   --labels spark-driver-selector=sparkling-water-app \
                   --image=h2oai/sparkling-water-scala:SUBST_SW_VERSION -- /bin/bash

        3. Inside the container, start the shell:

           .. code:: bash

               $SPARK_HOME/bin/spark-shell \
               --master "k8s://KUBERNETES_ENDPOINT" \
               --deploy-mode client \
               --conf spark.scheduler.minRegisteredResourcesRatio=1 \
               --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:SUBST_SW_VERSION \
               --conf spark.executor.instances=3 \
               --conf spark.driver.host=sparkling-water-app

External Backend
~~~~~~~~~~~~~~~~

The Sparkling Water external backend can also be used in Kubernetes. First, we need to start an external H2O backend
on Kubernetes. To achieve this, please follow the H2O documentation on running an H2O cluster on Kubernetes,
with **one important exception**. The image to be used needs to be
``h2oai/sparkling-water-external-backend:SUBST_SW_VERSION`` and not the base H2O image mentioned in the H2O
documentation, as Sparkling Water enhances the H2O image with additional dependencies.

In order for Sparkling Water to be able to connect to the H2O cluster, we need to get the address of the leader node
of the H2O cluster. If we followed the H2O documentation on how to start an H2O cluster on Kubernetes, the address is
``h2o-service.default.svc.cluster.local:54321``, where the first part is the H2O headless service name and the second
part is the name of the namespace.

After we have created the external H2O backend, we can connect to it from Sparkling Water clients as:

.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        Both cluster and client deployment modes of Kubernetes are supported.

        **To submit a Scala job in cluster mode, run:**

        .. code:: bash

            $SPARK_HOME/bin/spark-submit \
            --master "k8s://KUBERNETES_ENDPOINT" \
            --deploy-mode cluster \
            --conf spark.scheduler.minRegisteredResourcesRatio=1 \
            --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:SUBST_SW_VERSION \
            --conf spark.executor.instances=2 \
            --conf spark.driver.host=sparkling-water-app \
            --conf spark.kubernetes.driver.pod.name=sparkling-water-app \
            --conf spark.ext.h2o.backend.cluster.mode=external \
            --conf spark.ext.h2o.external.start.mode=manual \
            --conf spark.ext.h2o.external.memory=2G \
            --conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
            --conf spark.ext.h2o.cloud.name=root \
            --class ai.h2o.sparkling.KubernetesTest \
            local:///opt/sparkling-water/tests/kubernetesTest.jar

        **To start an interactive shell in client mode:**

        1. Create a headless service so Spark executors can reach the driver node:

           .. code:: bash

               cat <<EOF | kubectl apply -f -
               apiVersion: v1
               kind: Service
               metadata:
                 name: sparkling-water-app
               spec:
                 clusterIP: "None"
                 selector:
                   spark-driver-selector: sparkling-water-app
               EOF

        2. Start a pod from which we run the shell, labeled so that the service above selects it:

           .. code:: bash

               kubectl run -n default -i --tty sparkling-water-app --restart=Never \
                   --labels spark-driver-selector=sparkling-water-app \
                   --image=h2oai/sparkling-water-scala:SUBST_SW_VERSION -- /bin/bash

        3. Inside the container, start the shell with the same external backend options as in cluster mode:

           .. code:: bash

               $SPARK_HOME/bin/spark-shell \
               --master "k8s://KUBERNETES_ENDPOINT" \
               --deploy-mode client \
               --conf spark.scheduler.minRegisteredResourcesRatio=1 \
               --conf spark.kubernetes.container.image=h2oai/sparkling-water-scala:SUBST_SW_VERSION \
               --conf spark.executor.instances=2 \
               --conf spark.driver.host=sparkling-water-app \
               --conf spark.ext.h2o.backend.cluster.mode=external \
               --conf spark.ext.h2o.external.start.mode=manual \
               --conf spark.ext.h2o.external.memory=2G \
               --conf spark.ext.h2o.cloud.representative=h2o-service.default.svc.cluster.local:54321 \
               --conf spark.ext.h2o.cloud.name=root
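If the client fails to connect to the external backend, it can help to verify from inside the driver pod
that the H2O leader node is reachable. A quick sanity check, assuming ``curl`` is available in the image
and the default service address from the H2O documentation:

.. code:: bash

    # Query H2O's REST API for the cloud status; a JSON response with
    # "cloud_healthy": true means the external cluster is up and fully formed
    curl http://h2o-service.default.svc.cluster.local:54321/3/Cloud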
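Once an interactive session is finished, the driver pod and the headless service created in the client-mode
examples are not removed automatically. A minimal cleanup sketch, assuming the resource names used on this page:

.. code:: bash

    # Remove the interactive driver pod and the headless service pointing at it
    kubectl delete pod sparkling-water-app
    kubectl delete service sparkling-water-app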