HDFS Setup

Driverless AI lets you explore HDFS data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with HDFS.

Note: For Docker 19.03 and later, use the --gpus all flag with docker run to enable GPU support. The older nvidia-docker wrapper is deprecated and no longer recommended. Ensure that the NVIDIA Container Toolkit is installed. To check your Docker version, run docker version.

Description of Configuration Attributes

hdfs_config_path (Required): The path to the HDFS config folder. This folder can contain multiple config files.
hdfs_auth_type (Required): Specifies the HDFS authentication type. Available values are:
- principal: Authenticate with HDFS with a principal user.
- keytab: Authenticate with a keytab (recommended). If running DAI as a service, then the Kerberos keytab needs to be owned by the DAI user.
- keytabimpersonation: Login with impersonation using a keytab.
- noauth: No authentication needed.
key_tab_path: The path of the principal key tab file. This is required when hdfs_auth_type='principal'.
hdfs_app_principal_user: The Kerberos application principal user. This is required when hdfs_auth_type='keytab'.
hdfs_app_jvm_args: JVM args for HDFS distributions. Separate each argument with spaces.
- -Djava.security.krb5.conf
- -Dsun.security.krb5.debug
- -Dlog4j.configuration
hdfs_app_classpath: The HDFS classpath.
hdfs_app_supported_schemes: The list of DFS schemes that are used to check whether a valid input to the connector has been established. For example:
```
hdfs_app_supported_schemes = ['hdfs://', 'maprfs://', 'custom://']
```
The following are the default values for this option. To support additional schemes, add them to the list.
- hdfs://
- maprfs://
- swift://
hdfs_max_files_listed: Specifies the maximum number of files that are viewable in the connector UI. Defaults to 100 files. To view more files, increase the default value.
hdfs_init_path: Specifies the starting HDFS path displayed in the UI of the HDFS browser.
enabled_file_systems: The file systems you want to enable. This must be configured in order for data connectors to function properly.
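
The Docker examples in this section pass these options as environment variables whose names are the config.toml option name in upper case with a DRIVERLESS_AI_ prefix (for example, hdfs_auth_type becomes DRIVERLESS_AI_HDFS_AUTH_TYPE). The following is a minimal sketch of that naming pattern as it appears in the examples. The helper name dai_env_name is hypothetical and is not part of Driverless AI.

```shell
# Hypothetical helper (not part of Driverless AI): derive the environment
# variable name used in the Docker examples from a config.toml option name.
dai_env_name() {
  # Upper-case the option name and add the DRIVERLESS_AI_ prefix.
  printf 'DRIVERLESS_AI_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

dai_env_name hdfs_auth_type   # prints DRIVERLESS_AI_HDFS_AUTH_TYPE
dai_env_name key_tab_path     # prints DRIVERLESS_AI_KEY_TAB_PATH
```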

Example 1: Enable HDFS with No Authentication

This example enables the HDFS data connector and disables HDFS authentication. It does not pass any HDFS configuration file; however, it configures Docker DNS by passing the name and IP address of the HDFS name node. This lets you reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv.

 docker run --gpus all \
   --pid=host \
   --init \
   --rm \
   --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
   --add-host name.node:172.16.2.186 \
   -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
   -e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth'  \
   -e DRIVERLESS_AI_PROCSY_PORT=8080 \
   -p 12345:12345 \
   -v /etc/passwd:/etc/passwd:ro \
   -v /etc/group:/etc/group:ro \
   -v /tmp/dtmp/:/tmp \
   -v /tmp/dlog/:/log \
   -v /tmp/dlicense/:/license \
   -v /tmp/ddata/:/data \
   -u $(id -u):$(id -g) \
   h2oai/dai-ubi8-x86_64:2.3.0-cuda11.8.0.xx
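
Before entering a path such as hdfs://name.node/datasets/iris.csv in the connector, it can help to confirm that the path starts with one of the schemes listed in hdfs_app_supported_schemes. The following is a minimal sketch of such a prefix check, assuming the default scheme list described above. It is not the connector's own validation logic, and has_supported_scheme is a hypothetical helper.

```shell
# Hypothetical helper: succeeds if the path ($1) starts with one of the
# schemes passed as the remaining arguments.
has_supported_scheme() {
  path=$1
  shift
  for scheme in "$@"; do
    case "$path" in
      "$scheme"*) return 0 ;;
    esac
  done
  return 1
}

# Check the example path against the default schemes.
if has_supported_scheme "hdfs://name.node/datasets/iris.csv" "hdfs://" "maprfs://" "swift://"; then
  echo "scheme is supported"
else
  echo "scheme is not in the list"
fi
```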

This example shows how to configure HDFS options in the config.toml file, and then specify that file when starting Driverless AI in Docker. Note that this example enables HDFS with no authentication.

Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed.

enabled_file_systems = "file, upload, hdfs"

procsy_ip = "127.0.0.1"

procsy_port = 8080

Mount the config.toml file into the Docker container.

 docker run --gpus all \
    --pid=host \
    --init \
    --rm \
    --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
    -p 12345:12345 \
    -v /local/path/to/config.toml:/path/in/docker/config.toml \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
    -u $(id -u):$(id -g) \
   h2oai/dai-ubi8-x86_64:2.3.0-cuda11.8.0.xx

This example enables the HDFS data connector and disables HDFS authentication in the config.toml file. This allows users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv.

Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file. Note that the procsy port, which defaults to 12347, also has to be changed.

# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs"

Save the changes when you are done, then stop/restart Driverless AI.
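
Because the connector only works when hdfs appears in enabled_file_systems, a quick check of the edited file before restarting can catch a missed setting. The following is a minimal sketch that assumes the option is written on a single line, as in the snippets above. The helper name hdfs_enabled is hypothetical, and the fallback path /etc/dai/config.toml is the DEB and RPM location shown earlier.

```shell
# Hypothetical helper: succeeds if the enabled_file_systems line in the
# config file at $1 lists hdfs. Comment lines (starting with #) are ignored.
hdfs_enabled() {
  grep -E '^[[:space:]]*enabled_file_systems[[:space:]]*=' "$1" 2>/dev/null \
    | tr -d ' "' \
    | cut -d= -f2 \
    | tr ',' '\n' \
    | grep -qx 'hdfs'
}

# Use the exported config path if set, otherwise the DEB/RPM default.
if hdfs_enabled "${DRIVERLESS_AI_CONFIG_FILE:-/etc/dai/config.toml}"; then
  echo "hdfs is enabled"
else
  echo "hdfs is not enabled (or the config file was not found)"
fi
```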

Example 2: Enable HDFS with Keytab-Based Authentication

Notes:

If you use Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server. If the time difference between clients and DCs is 5 minutes or more, Kerberos authentication fails.
If you run Driverless AI as a service, the Kerberos keytab must be owned by the Driverless AI user. Otherwise, Driverless AI cannot read the keytab, falls back to simple authentication, and the connection fails.
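
The two notes above can be turned into simple preflight checks. The following is a minimal sketch: clock_skew_ok and owned_by are hypothetical helpers, and how you obtain the Kerberos server's time (the second timestamp) depends on your environment and is not covered here.

```shell
# Hypothetical helper: succeeds when two epoch timestamps (in seconds)
# differ by less than 5 minutes (300 seconds).
clock_skew_ok() {
  skew=$(( $1 - $2 ))
  if [ "$skew" -lt 0 ]; then
    skew=$(( 0 - skew ))
  fi
  [ "$skew" -lt 300 ]
}

# Hypothetical helper: succeeds when the file at $1 is owned by user $2.
owned_by() {
  [ "$(ls -ld "$1" | awk '{print $3}')" = "$2" ]
}

# Example with literal timestamps that are 120 seconds apart.
if clock_skew_ok 1700000000 1700000120; then
  echo "clock skew is within tolerance"
fi
```

For the ownership check, call owned_by with the keytab path and the name of the user that runs the Driverless AI service.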

This example:

Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the environment variable DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER to reference a user for whom the keytab was created (usually in the form of user@realm).

 docker run --gpus all \
     --pid=host \
     --init \
     --rm \
     --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
     -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
     -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab'  \
     -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
     -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
     -e DRIVERLESS_AI_PROCSY_PORT=8080 \
     -p 12345:12345 \
     -v /etc/passwd:/etc/passwd:ro \
     -v /etc/group:/etc/group:ro \
     -v /tmp/dtmp/:/tmp \
     -v /tmp/dlog/:/log \
     -v /tmp/dlicense/:/license \
     -v /tmp/ddata/:/data \
     -u $(id -u):$(id -g) \
     h2oai/dai-ubi8-x86_64:2.3.0-cuda11.8.0.xx

This example:

Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the option hdfs_app_principal_user to reference a user for whom the keytab was created (usually in the form of user@realm).

Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed.

enabled_file_systems = "file, upload, hdfs"

procsy_ip = "127.0.0.1"

procsy_port = 8080

hdfs_auth_type = "keytab"

key_tab_path = "/tmp/<keytabname>"

hdfs_app_principal_user = "<user@kerberosrealm>"

Mount the config.toml file into the Docker container.

docker run --gpus all \
  --pid=host \
  --init \
  --rm \
  --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-ubi8-x86_64:2.3.0-cuda11.8.0.xx

This example:

Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the option hdfs_app_principal_user to reference a user for whom the keytab was created (usually in the form of user@realm).

Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file.

# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs"

# HDFS connector
# Auth type can be Principal/keytab/keytabPrincipal
# Specify HDFS Auth Type, allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a Key tab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytab"

# Path of the principal key tab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"

Save the changes when you are done, then stop/restart Driverless AI.

Example 3: Enable HDFS with Keytab-Based Impersonation

Notes:

If using Kerberos, be sure that the time on the Driverless AI server is synced with the Kerberos server.
If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
Logins are case sensitive when keytab-based impersonation is configured.

This example:

Sets the authentication type to keytabimpersonation.
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER variable, which references a user for whom the keytab was created (usually in the form of user@realm).

 docker run --gpus all \
     --pid=host \
     --init \
     --rm \
     --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
     -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
     -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation'  \
     -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
     -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
     -e DRIVERLESS_AI_PROCSY_PORT=8080 \
     -p 12345:12345 \
     -v /etc/passwd:/etc/passwd:ro \
     -v /etc/group:/etc/group:ro \
     -v /tmp/dtmp/:/tmp \
     -v /tmp/dlog/:/log \
     -v /tmp/dlicense/:/license \
     -v /tmp/ddata/:/data \
     -u $(id -u):$(id -g) \
     h2oai/dai-ubi8-x86_64:2.3.0-cuda11.8.0.xx

This example:

Sets the authentication type to keytabimpersonation.
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was created (usually in the form of user@realm).

Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed.

enabled_file_systems = "file, upload, hdfs"

procsy_ip = "127.0.0.1"

procsy_port = 8080

hdfs_auth_type = "keytabimpersonation"

key_tab_path = "/tmp/<keytabname>"

hdfs_app_principal_user = "<user@kerberosrealm>"

Mount the config.toml file into the Docker container.

docker run --gpus all \
  --pid=host \
  --init \
  --rm \
  --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-ubi8-x86_64:2.3.0-cuda11.8.0.xx

This example:

Sets the authentication type to keytabimpersonation.
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was created (usually in the form of user@realm).

Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file.

# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs"

# HDFS connector
# Auth type can be Principal/keytab/keytabPrincipal
# Specify HDFS Auth Type, allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a Key tab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytabimpersonation"

# Path of the principal key tab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"

Save the changes when you are done, then stop/restart Driverless AI.

Specifying a Hadoop Platform

The following example shows how to build an H2O-3 Hadoop image and run Driverless AI. This example uses CDH 6.0. Change the H2O_TARGET to specify a different platform.

Clone and then build H2O-3 for CDH 6.0.

git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew clean build -x test
export H2O_TARGET=cdh6.0
export BUILD_HADOOP=true
./gradlew clean build -x test

Start H2O.

docker run -it --rm \
  -v `pwd`:`pwd` \
  -w `pwd` \
  --entrypoint bash \
  --network=host \
  -p 8020:8020  \
  docker.h2o.ai/cdh-6-w-hive \
  -c 'sudo -E startup.sh && \
  source /envs/h2o_env_python3.11/bin/activate && \
  hadoop jar h2o-hadoop-3/h2o-cdh6.0-assembly/build/libs/h2odriver.jar -libjars "$(cat /opt/hive-jars/hive-libjars)" -n 1 -mapperXmx 2g -baseport 54445 -notify h2o_one_node -ea -disown && \
  export CLOUD_IP=localhost && \
  export CLOUD_PORT=54445 && \
  make -f scripts/jenkins/Makefile.jenkins test-hadoop-smoke; \
  bash'

Run the Driverless AI HDFS connector.

java -cp connectors/hdfs.jar ai.h2o.dai.connectors.HdfsConnector

Verify that the ls and cp commands work. For example, the following request runs cp:

{"coreSiteXmlPath": "/etc/hadoop/conf", "keyTabPath": "", "authType": "noauth", "srcPath": "hdfs://localhost/user/jenkins/", "dstPath": "/tmp/xxx", "command": "cp", "user": "", "appUser": ""}
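
To avoid hand-editing the request for each command, the same fields can be filled in by a small helper. The following is a minimal sketch: hdfs_noauth_request is a hypothetical helper, the field names are copied from the request above, and passing an empty dstPath for ls is an assumption. The helper does not escape special characters, so it is only suitable for simple paths.

```shell
# Hypothetical helper: print a noauth request in the shape shown above.
# $1 = command (ls or cp), $2 = source path, $3 = destination path
hdfs_noauth_request() {
  printf '{"coreSiteXmlPath": "/etc/hadoop/conf", "keyTabPath": "", "authType": "noauth", "srcPath": "%s", "dstPath": "%s", "command": "%s", "user": "", "appUser": ""}\n' "$2" "$3" "$1"
}

# Build a cp request and an ls request for the same source directory.
hdfs_noauth_request cp "hdfs://localhost/user/jenkins/" "/tmp/xxx"
hdfs_noauth_request ls "hdfs://localhost/user/jenkins/" ""
```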