HDFS Setup

Driverless AI lets you explore HDFS data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with HDFS.

Note: Depending on your Docker install version, use either the docker run --runtime=nvidia (>= Docker 19.03) or nvidia-docker (< Docker 19.03) command when starting the Driverless AI Docker image. Use docker version to check which version of Docker you are using.
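
The version rule in the note can be sketched as a small shell helper. This is only an illustration: the function name dai_docker_cmd is hypothetical, and it takes a version string (for example, one copied from the output of docker version) as its argument instead of querying Docker itself.

```shell
# Hypothetical helper: print the launch command that matches a Docker version.
# Usage: dai_docker_cmd 19.03.5
dai_docker_cmd() {
  version=$1
  major=${version%%.*}   # text before the first dot, e.g. "19"
  rest=${version#*.}     # text after the first dot, e.g. "03.5"
  minor=${rest%%.*}      # e.g. "03"
  minor=${minor#0}       # drop one leading zero so "09" is compared as 9
  minor=${minor:-0}      # a bare "0" becomes empty above; restore it
  if [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "$minor" -ge 3 ]; }; then
    echo "docker run --runtime=nvidia"
  else
    echo "nvidia-docker run"
  fi
}

dai_docker_cmd 19.03.5
dai_docker_cmd 18.09.1
```

The examples in this section are written with nvidia-docker run; on Docker 19.03 or later, the same arguments would follow docker run --runtime=nvidia instead.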

Description of Configuration Attributes

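hdfs_config_path (required): The location of the HDFS config folder path. This folder can contain multiple config files.
hdfs_auth_type (required): Specifies the HDFS authentication. Available values are:
- principal: Authenticate with HDFS with a principal user.
- keytab: Authenticate with a keytab (recommended). If running DAI as a service, the Kerberos keytab needs to be owned by the DAI user.
- keytabimpersonation: Log in with impersonation using a keytab.
- noauth: No authentication needed.
key_tab_path: The path of the principal keytab file. Required when hdfs_auth_type='principal'.
hdfs_app_principal_user: The Kerberos application principal user. Required when hdfs_auth_type='keytab'.
hdfs_app_jvm_args: JVM args for HDFS distributions. Separate each argument with spaces.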
- -Djava.security.krb5.conf
- -Dsun.security.krb5.debug
- -Dlog4j.configuration
hdfs_app_classpath: The HDFS classpath.
hdfs_app_supported_schemes: The list of DFS schemas used to check whether a valid input to the connector has been established. For example:
```
hdfs_app_supported_schemes = ['hdfs://', 'maprfs://', 'custom://']
```
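The default values for this option are listed below. To support additional schemas, add values that are not selected by default to the list.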
- hdfs://
- maprfs://
- swift://
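hdfs_max_files_listed: Specifies the maximum number of files that are viewable in the connector UI. Defaults to 100 files. To view more files, increase the default value.
hdfs_init_path: Specifies the starting HDFS path displayed in the UI of the HDFS browser.
enabled_file_systems: The file systems you want to enable. This must be configured in order for data connectors to function properly.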
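
To make the role of hdfs_app_supported_schemes concrete, the following is a minimal sketch of a prefix check against the default schemes listed above. It is not the connector's actual validation logic; the variable SUPPORTED_SCHEMES and the function is_supported_path are hypothetical names used only for illustration.

```shell
# Hypothetical check: does a path start with one of the supported schemes?
# The list mirrors the defaults named in this section.
SUPPORTED_SCHEMES="hdfs:// maprfs:// swift://"

is_supported_path() {
  for scheme in $SUPPORTED_SCHEMES; do
    case $1 in
      "$scheme"*) return 0 ;;   # path begins with this scheme
    esac
  done
  return 1
}

is_supported_path "hdfs://name.node/datasets/iris.csv" && echo "supported"
is_supported_path "custom://bucket/file.csv" || echo "not supported until added to the list"
```

In this sketch, a path with a scheme such as custom:// is rejected until that scheme is appended to the list, which parallels adding a value to hdfs_app_supported_schemes.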

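Example 1: Enable HDFS with No Authentication

This example enables the HDFS data connector and disables HDFS authentication. It does not pass any HDFS configuration file; however, it configures Docker DNS by passing the name and IP of the HDFS name node. This lets you reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv.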

 nvidia-docker run \
   --pid=host \
   --init \
   --rm \
   --shm-size=256m \
   --add-host name.node:172.16.2.186 \
   -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
   -e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth'  \
   -e DRIVERLESS_AI_PROCSY_PORT=8080 \
   -p 12345:12345 \
   -v /etc/passwd:/etc/passwd:ro \
   -v /etc/group:/etc/group:ro \
   -v /tmp/dtmp/:/tmp \
   -v /tmp/dlog/:/log \
   -v /tmp/dlicense/:/license \
   -v /tmp/ddata/:/data \
   -u $(id -u):$(id -g) \
   h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx
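
The -e options in the command above carry the same settings as config.toml options such as enabled_file_systems, hdfs_auth_type, and procsy_port. Judging from these examples, the environment variable name appears to be the option name in upper case with a DRIVERLESS_AI_ prefix. The sketch below encodes that observed pattern; the helper name dai_env_name is hypothetical, and the pattern is an inference from this section, not a documented guarantee.

```shell
# Hypothetical helper: derive the environment variable name for a config.toml option.
# Assumes the DRIVERLESS_AI_ + UPPERCASE pattern seen in the commands in this section.
dai_env_name() {
  printf 'DRIVERLESS_AI_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

dai_env_name enabled_file_systems   # prints DRIVERLESS_AI_ENABLED_FILE_SYSTEMS
dai_env_name hdfs_auth_type         # prints DRIVERLESS_AI_HDFS_AUTH_TYPE
```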

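This example shows how to configure HDFS options in the config.toml file, and then specify that file when starting Driverless AI in Docker. Note that this example enables HDFS with no authentication.

Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed.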

enabled_file_systems = "file, upload, hdfs"

procsy_ip = "127.0.0.1"

procsy_port = 8080

Mount the config.toml file into the Docker container.

 nvidia-docker run \
    --pid=host \
    --init \
    --rm \
    --shm-size=256m \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
    -p 12345:12345 \
    -v /local/path/to/config.toml:/path/in/docker/config.toml \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
    -u $(id -u):$(id -g) \
   h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx

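This example enables the HDFS data connector and disables HDFS authentication in the config.toml file. This allows users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv.

Export the Driverless AI config.toml file or add it to ~/.bashrc. For example: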

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file. Note that the procsy port, which defaults to 12347, also has to be changed.

# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs"

Save the changes when you are done, then stop/restart Driverless AI.
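
Before restarting, it can help to confirm that the active enabled_file_systems line really includes hdfs, because the file also mentions hdfs in comments. The following is a rough sketch under stated assumptions: the function name hdfs_enabled_in_config is hypothetical, the check is a simple text match (not a TOML parser), and the fallback path is the DEB and RPM location shown above.

```shell
# Hypothetical check: does the active enabled_file_systems line include hdfs?
# Comment lines are ignored because the pattern is anchored to the option name.
hdfs_enabled_in_config() {
  grep -Eq '^[[:space:]]*enabled_file_systems[[:space:]]*=.*hdfs' "$1"
}

config_file="${DRIVERLESS_AI_CONFIG_FILE:-/etc/dai/config.toml}"
if [ -f "$config_file" ] && hdfs_enabled_in_config "$config_file"; then
  echo "hdfs is enabled in $config_file"
else
  echo "hdfs is not enabled in $config_file (or the file is missing)"
fi
```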

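Example 2: Enable HDFS with Keytab-Based Authentication

Notes:

If using Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server. If the time difference between clients and DCs is 5 minutes or more, Kerberos failures will occur.
If running Driverless AI as a service, the Kerberos keytab needs to be owned by the Driverless AI user. Otherwise, Driverless AI cannot read or access the keytab and falls back to simple authentication, which then fails.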
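
The 5-minute rule in the first note can be expressed as a simple comparison of two timestamps. This is a minimal sketch: the names MAX_SKEW_SECONDS and clock_skew_ok are hypothetical, and how the Kerberos server's time is obtained is deliberately left open (it is passed in as an argument).

```shell
# Hypothetical check: compare two Unix timestamps (seconds) against a 5-minute limit.
# A difference of 300 seconds or more is treated as a failure, matching the note above.
MAX_SKEW_SECONDS=300

clock_skew_ok() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))   # absolute value
  fi
  [ "$diff" -lt "$MAX_SKEW_SECONDS" ]
}

local_now=$(date +%s)
clock_skew_ok "$local_now" "$(( local_now - 120 ))" && echo "within tolerance"
clock_skew_ok "$local_now" "$(( local_now - 300 ))" || echo "skew too large"
```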

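This example:

Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the environment variable DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER to reference a user for whom the keytab was created (usually in the form of user@realm).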

 nvidia-docker run \
     --pid=host \
     --init \
     --rm \
     --shm-size=256m \
     -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
     -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab'  \
     -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
     -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
     -e DRIVERLESS_AI_PROCSY_PORT=8080 \
     -p 12345:12345 \
     -v /etc/passwd:/etc/passwd:ro \
     -v /etc/group:/etc/group:ro \
     -v /tmp/dtmp/:/tmp \
     -v /tmp/dlog/:/log \
     -v /tmp/dlicense/:/license \
     -v /tmp/ddata/:/data \
     -u $(id -u):$(id -g) \
     h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx
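
Because an unreadable keytab leads to a fallback to simple authentication and a failure, a quick pre-flight check on the host can save a debugging round trip. The sketch below is an assumption-laden illustration: check_keytab is a hypothetical helper, it only tests that the file exists and is readable by the current user, and it should be run as the same user that will run Driverless AI.

```shell
# Hypothetical pre-flight check: is the keytab present and readable by the current user?
check_keytab() {
  if [ ! -f "$1" ]; then
    echo "keytab not found: $1" >&2
    return 1
  fi
  if [ ! -r "$1" ]; then
    echo "keytab not readable by $(id -un): $1" >&2
    return 1
  fi
  echo "keytab looks readable: $1"
}

# Example call; <keytabname> is a placeholder for the real file name.
check_keytab "/tmp/dtmp/<keytabname>" || echo "fix the path or ownership before starting Driverless AI"
```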

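This example:

Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the option hdfs_app_principal_user to reference a user for whom the keytab was created (usually in the form of user@realm).

Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed.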

enabled_file_systems = "file, upload, hdfs"

procsy_ip = "127.0.0.1"

procsy_port = 8080

hdfs_auth_type = "keytab"

key_tab_path = "/tmp/<keytabname>"

hdfs_app_principal_user = "<user@kerberosrealm>"

Mount the config.toml file into the Docker container.

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx

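This example:

Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the option hdfs_app_principal_user to reference a user for whom the keytab was created (usually in the form of user@realm).

Export the Driverless AI config.toml file or add it to ~/.bashrc. For example: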

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file.

# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs"

# HDFS connector
# Auth type can be Principal/keytab/keytabPrincipal
# Specify HDFS Auth Type, allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a Key tab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytab"

# Path of the principal key tab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"

Save the changes when you are done, then stop/restart Driverless AI.

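Example 3: Enable HDFS with Keytab-Based Impersonation

Notes:

If using Kerberos, be sure that the Driverless AI time is synced with the Kerberos server.
If running Driverless AI as a service, the Kerberos keytab needs to be owned by the Driverless AI user.
Logins are case sensitive when keytab-based impersonation is configured.

This example:

Sets the authentication type to keytabimpersonation.
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER variable, which references a user for whom the keytab was created (usually in the form of user@realm).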

 nvidia-docker run \
     --pid=host \
     --init \
     --rm \
     --shm-size=256m \
     -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
     -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation'  \
     -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
     -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
     -e DRIVERLESS_AI_PROCSY_PORT=8080 \
     -p 12345:12345 \
     -v /etc/passwd:/etc/passwd:ro \
     -v /etc/group:/etc/group:ro \
     -v /tmp/dtmp/:/tmp \
     -v /tmp/dlog/:/log \
     -v /tmp/dlicense/:/license \
     -v /tmp/ddata/:/data \
     -u $(id -u):$(id -g) \
     h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx

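This example:

Sets the authentication type to keytabimpersonation.
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was created (usually in the form of user@realm).

Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed.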

enabled_file_systems = "file, upload, hdfs"

procsy_ip = "127.0.0.1"

procsy_port = 8080

hdfs_auth_type = "keytabimpersonation"

key_tab_path = "/tmp/<keytabname>"

hdfs_app_principal_user = "<user@kerberosrealm>"

Mount the config.toml file into the Docker container.

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx

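This example:

Sets the authentication type to keytabimpersonation.
Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was created (usually in the form of user@realm).

Export the Driverless AI config.toml file or add it to ~/.bashrc. For example: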

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file.

# IP address and port of procsy process.
procsy_ip = "127.0.0.1"
procsy_port = 8080

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs"

# HDFS connector
# Auth type can be Principal/keytab/keytabPrincipal
# Specify HDFS Auth Type, allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a Key tab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytabimpersonation"

# Path of the principal key tab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"

Save the changes when you are done, then stop/restart Driverless AI.

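Specifying a Hadoop Platform

The following example shows how to build an H2O-3 Hadoop image and run Driverless AI. This example uses CDH 6.0. Change the H2O_TARGET to specify a different platform.

Clone and then build H2O-3 for CDH 6.0.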

git clone https://github.com/h2oai/h2o-3.git
cd h2o-3
./gradlew clean build -x test
export H2O_TARGET=cdh6.0
export BUILD_HADOOP=true
./gradlew clean build -x test

Start H2O.

docker run -it --rm \
  -v `pwd`:`pwd` \
  -w `pwd` \
  --entrypoint bash \
  --network=host \
  -p 8020:8020  \
  docker.h2o.ai/cdh-6-w-hive \
  -c 'sudo -E startup.sh && \
  source /envs/h2o_env_python3.8/bin/activate && \
  hadoop jar h2o-hadoop-3/h2o-cdh6.0-assembly/build/libs/h2odriver.jar -libjars "$(cat /opt/hive-jars/hive-libjars)" -n 1 -mapperXmx 2g -baseport 54445 -notify h2o_one_node -ea -disown && \
  export CLOUD_IP=localhost && \
  export CLOUD_PORT=54445 && \
  make -f scripts/jenkins/Makefile.jenkins test-hadoop-smoke; \
  bash'

Run the Driverless AI HDFS connector.

java -cp connectors/hdfs.jar ai.h2o.dai.connectors.HdfsConnector

Verify the commands for ls and cp, for example:

{"coreSiteXmlPath": "/etc/hadoop/conf", "keyTabPath": "", "authType": "noauth", "srcPath": "hdfs://localhost/user/jenkins/", "dstPath": "/tmp/xxx", "command": "cp", "user": "", "appUser": ""}
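
When trying several source and destination paths, a request like the one above can be generated by a small helper. This is a sketch only: hdfs_cp_request is a hypothetical name, the remaining fields are copied from the noauth example, and every key is quoted so that the output is valid JSON. How the request is handed to the connector is not covered here, so the helper only prints it.

```shell
# Hypothetical helper: print a "cp" request with the same fields as the example above.
# Arguments: source path, destination path. Other fields keep the example's values.
# The paths are inserted as-is, so they must not contain double quotes or backslashes.
hdfs_cp_request() {
  printf '{"coreSiteXmlPath": "/etc/hadoop/conf", "keyTabPath": "", "authType": "noauth", "srcPath": "%s", "dstPath": "%s", "command": "cp", "user": "", "appUser": ""}\n' "$1" "$2"
}

hdfs_cp_request "hdfs://localhost/user/jenkins/" "/tmp/xxx"
```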