HDFS Setup
Driverless AI allows you to explore HDFS data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with HDFS.
Description of Configuration Attributes
hdfs_config_path (Required): The location of the HDFS config folder. This folder can contain multiple config files.
hdfs_auth_type (Required): Specifies the HDFS authentication type. Available values are:
  noauth (Default): No authentication needed.
  keytab: Authenticate with a keytab (recommended). If you are running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
  keytabimpersonation: Log in with impersonation using a keytab.
key_tab_path: The path of the principal key tab file. This is required when hdfs_auth_type='principal'.
hdfs_app_principal_user: The Kerberos application principal user. This is required when hdfs_auth_type='keytab'.
hdfs_app_jvm_args: JVM args for HDFS distributions. Separate each argument with spaces.
  -Djava.security.krb5.conf
  -Dsun.security.krb5.debug
  -Dlog4j.configuration
hdfs_app_classpath: The HDFS classpath.
hdfs_app_supported_schemes: Specifies the list of supported DFS schemes, which is used to validate input to the connector. For example:
  "['hdfs://', 'maprfs://', 'swift://']"
hdfs_max_files_listed: Specifies the maximum number of files that are viewable in the connector UI. Defaults to 100 files. To view more files, increase the default value.
hdfs_init_path: Specifies the starting HDFS path displayed in the UI of the HDFS browser.
enabled_file_systems: The file systems you want to enable. This must be configured in order for data connectors to function properly.
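To illustrate what the hdfs_app_supported_schemes setting implies, the following is a minimal sketch of a prefix check against the example scheme list. The helper name has_supported_scheme and the SUPPORTED_SCHEMES variable are hypothetical and are not part of Driverless AI; the connector's actual validation logic may differ.

```shell
# Hypothetical illustration only: mirrors the example value of
# hdfs_app_supported_schemes as a space-separated list.
SUPPORTED_SCHEMES="hdfs:// maprfs:// swift://"

# Returns 0 if the given path starts with one of the supported schemes,
# and 1 otherwise.
has_supported_scheme() {
  path="$1"
  for scheme in $SUPPORTED_SCHEMES; do
    case "$path" in
      "$scheme"*) return 0 ;;
    esac
  done
  return 1
}
```

For example, has_supported_scheme "hdfs://name.node/datasets/iris.csv" succeeds, while a path such as "/local/file.csv" is rejected.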
Start Driverless AI
This section describes how to enable the HDFS data connector when starting Driverless AI in Docker. This can be done by specifying each environment variable in the nvidia-docker run command or by editing the configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.
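The examples on this page suggest a naming pattern: each configuration attribute becomes an environment variable by upper-casing its name and adding the DRIVERLESS_AI_ prefix (for example, hdfs_auth_type and DRIVERLESS_AI_HDFS_AUTH_TYPE). The sketch below encodes that pattern. The helper name dai_env_name is hypothetical, and the pattern is inferred from the examples here rather than from a stated rule.

```shell
# Hypothetical helper: derive the environment variable name for a
# config.toml option, following the pattern seen in this page's examples.
dai_env_name() {
  printf 'DRIVERLESS_AI_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

# Example: print the variable name for the key_tab_path option.
dai_env_name key_tab_path
```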
Enable HDFS with No Authentication
This example enables the HDFS data connector and disables HDFS authentication. It does not pass any HDFS configuration file; however, it configures Docker DNS by passing the name and IP of the HDFS name node. This allows users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv. Replace TAG below with the image tag.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
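With the name node registered through --add-host, dataset locations take the form hdfs://<name node>/<path>. As a small sketch, the hypothetical helper below (hdfs_uri is not a Driverless AI command) assembles such an address from the host name and a path, tolerating a leading slash on the path.

```shell
# Hypothetical helper: build an HDFS address from a name node host
# (as passed to --add-host) and a path inside HDFS.
hdfs_uri() {
  # ${2#/} strips a single leading slash so the result has exactly one
  # separator between the host and the path.
  printf 'hdfs://%s/%s\n' "$1" "${2#/}"
}

# Example using the name node from the command above.
hdfs_uri name.node /datasets/iris.csv
```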
Enable HDFS with Keytab-Based Authentication
Notes:
If using Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server. If the time difference between clients and DCs is 5 minutes or more, Kerberos failures will occur.
If running Driverless AI as a service, the Kerberos keytab must be owned by the Driverless AI user. Otherwise, Driverless AI cannot read the keytab, falls back to simple authentication, and fails.
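The ownership requirement above can be checked before starting Driverless AI. The following is a minimal sketch, assuming it is run as the same user that runs Driverless AI. The helper name check_keytab and its return codes are hypothetical.

```shell
# Hypothetical pre-flight check for a keytab file.
# Run as the user that runs Driverless AI.
# Returns 0 if the file exists, is readable, and is owned by the current user.
check_keytab() {
  keytab="$1"
  if [ ! -f "$keytab" ]; then
    echo "keytab not found: $keytab" >&2
    return 1
  fi
  if [ ! -r "$keytab" ]; then
    echo "keytab is not readable by the current user: $keytab" >&2
    return 2
  fi
  # -O is true when the file is owned by the effective user ID.
  if [ ! -O "$keytab" ]; then
    echo "keytab is not owned by the current user: $keytab" >&2
    return 3
  fi
  return 0
}
```

A call such as check_keytab /tmp/dtmp/<<keytabname>> returns a non-zero status when any of the three conditions fails. If the MIT Kerberos client tools are installed, klist -k -t followed by the keytab path lists the principals stored in the file, which can help confirm the value to use for the principal user.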
This example:
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the environment variable DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER to reference a user for whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab' \
-e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Enable HDFS with Keytab-Based Impersonation
Notes:
If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server.
If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user.
Logins are case sensitive when keytab-based impersonation is configured.
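The time-sync requirement can be checked before starting the container. The following is a minimal sketch, assuming both timestamps are already available as epoch seconds; how the Kerberos server's time is obtained is left open. The helper name skew_within_limit is hypothetical, and the 300-second threshold reflects the 5-minute limit mentioned in the keytab authentication notes.

```shell
# Hypothetical check: clock skew must stay below 5 minutes (300 seconds).
MAX_SKEW_SECONDS=300

# $1 = local time in epoch seconds, $2 = Kerberos server time in epoch seconds.
# Returns 0 when the absolute difference is below MAX_SKEW_SECONDS.
skew_within_limit() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  [ "$diff" -lt "$MAX_SKEW_SECONDS" ]
}
```

For example, skew_within_limit "$(date +%s)" "$kdc_epoch" succeeds only when the two clocks differ by less than 5 minutes, where kdc_epoch is a placeholder for the server time you obtained.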
The example:
Sets the authentication type to keytabimpersonation.
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER variable, which references a user for whom the keytab was created (usually in the form of user@realm).
Replace TAG below with the image tag.
# Docker instructions
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
-e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation' \
-e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
-e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
-e DRIVERLESS_AI_PROCSY_PORT=8080 \
-p 12345:12345 \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
Start DAI by Updating the config.toml File
This example shows how to configure HDFS options in the config.toml file, and then specify that file when starting Driverless AI in Docker. Note that this example enables HDFS with no authentication.
Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed.
enabled_file_systems = "file, upload, hdfs"
procsy_ip = "127.0.0.1"
procsy_port = 8080
Mount the config.toml file into the Docker container.
nvidia-docker run \
--pid=host \
--init \
--rm \
--shm-size=256m \
--add-host name.node:172.16.2.186 \
-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
-p 12345:12345 \
-v /local/path/to/config.toml:/path/in/docker/config.toml \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /tmp/dtmp/:/tmp \
-v /tmp/dlog/:/log \
-v /tmp/dlicense/:/license \
-v /tmp/ddata/:/data \
-u $(id -u):$(id -g) \
h2oai/dai-centos7-x86_64:TAG
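The same config.toml approach could presumably carry the keytab settings described in the configuration attributes. The sketch below writes such a fragment to a file. It assumes the attribute names listed on this page are accepted in config.toml as written; the helper name write_keytab_config is hypothetical, and the <<keytabname>> and <<user@kerberosrealm>> placeholders must be replaced with real values.

```shell
# Hypothetical helper: write a config.toml fragment for keytab
# authentication to the file given as $1.
# Assumption: the attribute names from this page are valid config.toml keys.
write_keytab_config() {
  cat > "$1" <<'EOF'
enabled_file_systems = "file, upload, hdfs"
hdfs_auth_type = "keytab"
key_tab_path = "/tmp/<<keytabname>>"
hdfs_app_principal_user = "<<user@kerberosrealm>>"
procsy_ip = "127.0.0.1"
procsy_port = 8080
EOF
}
```

After calling write_keytab_config /local/path/to/config.toml and filling in the placeholders, the file can be mounted into the container as in the command above.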
Specifying a Hadoop Platform
The following example shows how to build an H2O-3 Hadoop image and run Driverless AI on that image. This example uses CDH 6.0. Change the H2O_TARGET to specify a different platform.
Clone and then build H2O-3 for CDH 6.0.
git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew clean build -x test
export H2O_TARGET=cdh6.0
export BUILD_HADOOP=true
./gradlew clean build -x test
Start Driverless AI.
docker run -it --rm \
-v `pwd`:`pwd` \
-w `pwd` \
--entrypoint bash \
--network=host \
-p 8020:8020 \
docker.h2o.ai/cdh-6-w-hive \
-c 'sudo -E startup.sh && \
source /envs/h2o_env_python3.6/bin/activate && \
hadoop jar h2o-hadoop-3/h2o-cdh6.0-assembly/build/libs/h2odriver.jar -libjars "$(cat /opt/hive-jars/hive-libjars)" -n 1 -mapperXmx 2g -baseport 54445 -notify h2o_one_node -ea -disown && \
export CLOUD_IP=localhost && \
export CLOUD_PORT=54445 && \
make -f scripts/jenkins/Makefile.jenkins test-hadoop-smoke; \
bash'
Run the Driverless AI HDFS connector.
java -cp h2oai-dai-connectors.jar ai.h2o.dai.connectors.HdfsConnector
Verify the commands for ls and cp, for example.
{"coreSiteXmlPath": "/etc/hadoop/conf", "keyTabPath": "", "authType": "noauth", "srcPath": "hdfs://localhost/user/jenkins/", "dstPath": "/tmp/xxx", "command": "cp", "user": "", "appUser": ""}