Hive 设置¶

Driverless AI 让您能从 Driverless AI 应用程序内搜索 Hive 数据源。本节介绍如何配置 Driverless AI 与 Hive 配合使用。

请注意：根据您所安装的 Docker 版本，在启动 Driverless AI Docker 映像时，使用 docker run --runtime=nvidia (>= Docker 19.03) 或 nvidia-docker (< Docker 19.03) 命令。使用 docker version 检查所使用的 Docker 版本。

配置属性说明¶

enabled_file_systems: 您要启用的文件系统。为使数据连接器正常运行，必须进行此项配置。
hive_app_configs: Hive 连接器配置。输入与配置 HDFS 连接器相似。重要密钥包括：
- hive_conf_path: Hive 配置路径。这可以有多个文件（例如，hive-site.xml、hdfs-site.xml 等。）
- auth_type : 将 noauth、keytab 或 keytabimpersonation 中的一种指定用于 Kerberos 身份验证。
- keytab_path : 指定用于身份验证的 Kerberos 密钥表的路径（如果使用 auth_type="noauth" ，则可以为 ""）
- Principal_user : 指定 Kerberos 应用程序主用户（使用 auth_type="keytab" 或 auth_type="keytabimpersonation" 时需要指定）

请注意：

使用 Hive 连接器时，假设 DAI 正在边缘节点上运行。但如果使用的是非边缘节点，则可能会出现问题（例如缺少类、依赖项、授权错误）。
- 确保 core-site.xml 文件（如 Hadoop conf）与其余文件（hive-site.xml、hdfs-site.xml 等）也都存在于 Hive conf 中。core-site.xml 文件应配置了代理用户（如 hadoop.proxyuser.hive.hosts 和 hadoop.proxyuser.hive.groups）。
- 如果您将 tez 用作 Hive 执行引擎，确保所需 tez 依赖项（classpath、jar 等）在 DAI 节点上可用。或者，您可以通过将 hive-site.xml 文件中的 hive.execution.engine``值更改为 ``mr 或 spark 来使用 DAI 附带的内部引擎。

此配置应为具有多个密钥的 JSON/字典字符串。例如：

"""{
  "hive_connection_1": {
   "hive_conf_path": "/path/to/hive/conf",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename>.keytab",
   "principal_user": "hive/node1.example.com@EXAMPLE.COM",
  },
  "hive_connection_2": {
   "hive_conf_path": "/path/to/hive/conf_2",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename_2>.keytab",
   "principal_user": "hive/node2.example.com@EXAMPLE.COM",
  }
}"""
请注意：hive_app_configs 的预期输入为 JSON 字符串. 必须使用双引号 ("...") 来表示 JSON 字典内的密钥和值，外引号必须为 ""\”、''' 或 ' 的格式。根据配置值的应用方式，可能需要不同形式的外引号。以下示例展示了两种应用外引号的独特方法。
通过 config.toml 文件应用的配置值：
hive_app_configs = """{"my_json_string": "value", "json_key_2": "value2"}"""
通过环境变量应用的配置值：
DRIVERLESS_AI_HIVE_APP_CONFIGS='{"my_json_string": "value", "json_key_2": "value2"}'

Hive_app_jvm_args ：可选择为 Hive 连接器指定其他 Java 虚拟机 (JVM) 参数。每个参数必须用空格隔开。

请注意：

如果您的 Kerberos 设置需要自定义 JAAS configuration file ，使用 hive_app_jvm_args 指定相应文件：

hive_app_jvm_args = "-Xmx20g -Djava.security.auth.login.config=/etc/dai/jaas.conf"

jaas.conf 文件示例：

com.sun.security.jgss.initiate {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 useTicketCache=false
 principal="hive/localhost@EXAMPLE.COM" [Replace this line]
 doNotPrompt=true
 keyTab="/path/to/hive.keytab" [Replace this line]
 debug=true;
};

hive_app_classpath ：可选择为 Hive 连接器指定替代 classpath。

启用有身份验证的 Hive¶

本节介绍如何在于 Docker 中启动 Driverless AI 时启用 Hive。通过在 nvidia-docker run 命令中指定每个环境变量，或通过在 config.toml 文件中编辑此配置选项，然后在 nvidia-docker run 命令中指定此文件即可。

启动 Driverless AI Docker 映像。

  nvidia-docker run \
    --pid=host \
    --init \
    --rm \
    --shm-size=256m \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs,hive" \
    -e DRIVERLESS_AI_HIVE_APP_CONFIGS='{"hive_connection_2: {"hive_conf_path":"/etc/hadoop/conf",
                                                         "auth_type":"keytabimpersonation",
                                                         "keytab_path":"/etc/dai/steam.keytab",
                                                         "principal_user":"steam/mr-0xg9.0xdata.loc@H2OAI.LOC"}}' \
    -p 12345:12345 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
    -v /path/to/hive/conf:/path/to/hive/conf/in/docker \
    -v /path/to/hive.keytab:/path/in/docker/hive.keytab \
    -u $(id -u):${id -g) \
    h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx

本示例展示如何在 config.toml 文件中配置 Hive 选项并在于 Docker 中启动 Driverless AI 时指定此文件。

在 Driverless AI config.toml 文件中启用并配置 Hive 连接器。Hive 连接器配置必须为带有多个密钥的 JSON/字典字符串。

enabled_file_systems = "file, hdfs, s3, hive"
hive_app_configs = """{"hive_1": {"auth_type": "keytab",
                                  "key_tab_path": "/path/to/hive.keytab",
                                  "hive_conf_path": "/path/to/hive-resources",
                                  "principal_user": "hive/localhost@EXAMPLE.COM"}}"""

将 config.toml 文件挂载至 Docker 容器。

  nvidia-docker run \
    --pid=host \
    --init \
    --rm \
    --shm-size=256m \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
    -p 12345:12345 \
    -v /local/path/to/config.toml:/path/in/docker/config.toml \
    -v /etc/passwd:/etc/passwd:ro /
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
   -v /path/to/hive/conf:/path/to/hive/conf/in/docker \
   -v /path/to/hive.keytab:/path/in/docker/hive.keytab \
   -u $(id -u):$(id -g) \
   h2oai/dai-centos7-x86_64:1.10.1-cuda11.2.2.xx

此操作可启用 Hive 连接器。

导出 Driverless AI config.toml或将其添加至 ~/.bashrc。

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

在 config.toml 文件中指定以下配置选项。

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs, s3, hive"


# Configuration for Hive Connector
# Note that inputs are similar to configuring HDFS connectivity
# Important keys:
# * hive_conf_path - path to hive configuration, may have multiple files. Typically: hive-site.xml, hdfs-site.xml, etc
# * auth_type - one of `noauth`, `keytab`, `keytabimpersonation` for kerberos authentication
# * keytab_path - path to the kerberos keytab to use for authentication, can be "" if using `noauth` auth_type
# * principal_user = Kerberos app principal user. Required when using auth_type `keytab` or `keytabimpersonation`
# JSON/Dictionary String with multiple keys. Example:
# """{
# "hive_connection_1": {
# "hive_conf_path": "/path/to/hive/conf",
# "auth_type": "one of ['noauth', 'keytab', 'keytabimpersonation']",
# "keytab_path": "/path/to/<filename>.keytab",
# principal_user": "hive/localhost@EXAMPLE.COM",
# }
# }"""
#
hive_app_configs = """{"hive_1": {"auth_type": "keytab",
                                  "key_tab_path": "/path/to/hive.keytab",
                                  "hive_conf_path": "/path/to/hive-resources",
                                  "principal_user": "hive/localhost@EXAMPLE.COM"}}"""

完成后，保存更改，然后停止/重启 Driverless AI。

使用 Hive 添加数据集¶

启用 Hive 连接器后，您可以通过在 添加数据集（或拖放） 下拉菜单中选择 Hive 来添加数据集。

选择您想要使用的 Hive 配置。

指定以下信息以添加数据集。

Hive 数据库：指定您要查询的 Hive 数据库名称。

Hadoop 配置路径：指定 Hive 配置文件的路径。

Hive Kerberos 密钥表路径：指定 Hive Kerberos 密钥表的路径。

Hive Kerberos 主体：指定 Hive Kerberos 主体。如果 Hive 身份验证类型为 keytabimpersonation，则需要指定。

Hive 身份验证类型：指定身份验证类型。类型可以为 noauth、密钥表或 keytabimpersonation。

输入要保存为的数据集名称：可选择为您所上传的数据集指定新名称。

SQL Query：指定您想要执行的 Hive 查询。