Hive Setup

Driverless AI lets you explore Hive data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Hive.

Note: For Docker 19.03 and later, use the --gpus all flag with docker run to enable GPU support. The older nvidia-docker wrapper is deprecated and no longer recommended. Ensure that the NVIDIA Container Toolkit is installed. To check your Docker version, run docker version.

Description of Configuration Attributes

enabled_file_systems: The file systems you want to enable. This must be configured in order for data connectors to function properly.
hive_app_configs: Configuration for Hive Connector. Inputs are similar to configuring the HDFS connector. Important keys include:
- hive_conf_path: The path to Hive configuration. This can have multiple files (e.g. hive-site.xml, hdfs-site.xml, etc.)
- auth_type: Specify one of noauth, keytab, or keytabimpersonation for Kerberos authentication
- keytab_path: Specify the path to Kerberos keytab to use for authentication (this can be "" if using auth_type="noauth")
- principal_user: Specify the Kerberos app principal user (required when using auth_type="keytab" or auth_type="keytabimpersonation")

Notes:

With Hive connectors, it is assumed that DAI is running on the edge node. However, if a non-edge node is being used, then there could be issues (e.g. missing classes, dependencies, authorization errors).
- Ensure the core-site.xml file (from e.g Hadoop conf) is also present in the Hive conf with the rest of the files (hive-site.xml, hdfs-site.xml, etc.). The core-site.xml file should have proxyuser configured (e.g. hadoop.proxyuser.hive.hosts & hadoop.proxyuser.hive.groups).
- If you have tez as the Hive execution engine, make sure that the required tez dependencies (classpaths, jars, etc.) are available on the DAI node. Alternatively, you can use internal engines that come with DAI by changing your hive.execution.engine value in the hive-site.xml file to mr or spark.

The configuration should be JSON/Dictionary String with multiple keys. For example:

"""{
  "hive_connection_1": {
   "hive_conf_path": "/path/to/hive/conf",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename>.keytab",
   "principal_user": "hive/node1.example.com@EXAMPLE.COM",
  },
  "hive_connection_2": {
   "hive_conf_path": "/path/to/hive/conf_2",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename_2>.keytab",
   "principal_user": "hive/node2.example.com@EXAMPLE.COM",
  }
}"""
Note: The expected input of hive_app_configs is a JSON string. Double quotation marks ("...") must be used to denote keys and values within the JSON dictionary, and outer quotations must be formatted as either """, ''', or '. Depending on how the configuration value is applied, different forms of outer quotations may be required. The following examples show two unique methods for applying outer quotations.
Configuration value applied with the config.toml file:
hive_app_configs = """{"my_json_string": "value", "json_key_2": "value2"}"""
Configuration value applied with an environment variable:
DRIVERLESS_AI_HIVE_APP_CONFIGS='{"my_json_string": "value", "json_key_2": "value2"}'

hive_app_jvm_args: Optionally specify additional Java Virtual Machine (JVM) args for the Hive connector. Each arg must be separated by a space.

Notes:

If a custom JAAS configuration file is needed for your Kerberos setup, use hive_app_jvm_args to specify the appropriate file:

hive_app_jvm_args = "-Xmx20g -Djava.security.auth.login.config=/etc/dai/jaas.conf"

Sample jaas.conf file:

com.sun.security.jgss.initiate {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 useTicketCache=false
 principal="hive/localhost@EXAMPLE.COM" [Replace this line]
 doNotPrompt=true
 keyTab="/path/to/hive.keytab" [Replace this line]
 debug=true;
};

hive_app_classpath: Optionally specify an alternative classpath for the Hive connector.

Troubleshooting

The following section contains important troubleshooting information.

Supported Java system properties

You can use the following supported Java system properties in the DAI Hive connector for troubleshooting purposes.

dai.connectors.hive.save_as: You can use this system property to change the data format that is used for saving queried data. The possible values are parquet (default), csv, json, and orc. The following is an example of how you can set this system property in the config.toml file: hive_app_jvm_args="-Xmx4g -Ddai.connectors.hive.save_as=orc"

Enable Hive with Authentication

This section describes how to enable Hive when starting Driverless AI in Docker. This can be done by specifying each environment variable in the docker run --gpus all command or by editing the configuration options in the config.toml file and then specifying that file in the docker run --gpus all command.

Start the Driverless AI Docker Image.

  docker run --gpus all \
    --pid=host \
    --init \
    --rm \
    --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs,hive" \
    -e DRIVERLESS_AI_HIVE_APP_CONFIGS='{"hive_connection_2: {"hive_conf_path":"/etc/hadoop/conf",
                                                         "auth_type":"keytabimpersonation",
                                                         "keytab_path":"/etc/dai/steam.keytab",
                                                         "principal_user":"steam/mr-0xg9.0xdata.loc@H2OAI.LOC"}}' \
    -p 12345:12345 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
    -v /path/to/hive/conf:/path/to/hive/conf/in/docker \
    -v /path/to/hive.keytab:/path/in/docker/hive.keytab \
    -u $(id -u):${id -g) \
    h2oai/dai-ubi8-x86_64:2.2.0-cuda11.8.0.xx

This example shows how to configure Hive options in the config.toml file, and then specify that file when starting Driverless AI in Docker.

Enable and configure the Hive connector in the Driverless AI config.toml file. The Hive connector configuration must be a JSON/Dictionary string with multiple keys.

enabled_file_systems = "file, hdfs, s3, hive"
hive_app_configs = """{"hive_1": {"auth_type": "keytab",
                                  "key_tab_path": "/path/to/hive.keytab",
                                  "hive_conf_path": "/path/to/hive-resources",
                                  "principal_user": "hive/localhost@EXAMPLE.COM"}}"""

Mount the config.toml file into the Docker container.

  docker run --gpus all \
    --pid=host \
    --init \
    --rm \
    --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
    -p 12345:12345 \
    -v /local/path/to/config.toml:/path/in/docker/config.toml \
    -v /etc/passwd:/etc/passwd:ro /
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
   -v /path/to/hive/conf:/path/to/hive/conf/in/docker \
   -v /path/to/hive.keytab:/path/in/docker/hive.keytab \
   -u $(id -u):$(id -g) \
   h2oai/dai-ubi8-x86_64:2.2.0-cuda11.8.0.xx

This enables the Hive connector.

Export the Driverless AI config.toml file or add it to ~/.bashrc.

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file.

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, hdfs, s3, hive"


# Configuration for Hive Connector
# Note that inputs are similar to configuring HDFS connectivity
# Important keys:
# * hive_conf_path - path to hive configuration, may have multiple files. Typically: hive-site.xml, hdfs-site.xml, etc
# * auth_type - one of `noauth`, `keytab`, `keytabimpersonation` for kerberos authentication
# * keytab_path - path to the kerberos keytab to use for authentication, can be "" if using `noauth` auth_type
# * principal_user = Kerberos app principal user. Required when using auth_type `keytab` or `keytabimpersonation`
# JSON/Dictionary String with multiple keys. Example:
# """{
# "hive_connection_1": {
# "hive_conf_path": "/path/to/hive/conf",
# "auth_type": "one of ['noauth', 'keytab', 'keytabimpersonation']",
# "keytab_path": "/path/to/<filename>.keytab",
# principal_user": "hive/localhost@EXAMPLE.COM",
# }
# }"""
#
hive_app_configs = """{"hive_1": {"auth_type": "keytab",
                                  "key_tab_path": "/path/to/hive.keytab",
                                  "hive_conf_path": "/path/to/hive-resources",
                                  "principal_user": "hive/localhost@EXAMPLE.COM"}}"""

Save the changes when you are done, then stop/restart Driverless AI.

Adding Datasets Using Hive

After the Hive connector is enabled, you can add datasets by selecting Hive from the Add Dataset (or Drag and Drop) drop-down menu.

Select the Hive configuraton that you want to use.

Specify the following information to add your dataset.

Hive Database: Specify the name of the Hive database that you are querying.

Hadoop Configuration Path: Specify the path to your Hive configuration file.

Hive Kerberos Keytab Path: Specify the path for the Hive Kerberos keytab.

Hive Kerberos Principal: Specify the Hive Kerberos principal. This is required if the Hive Authentication Type is keytabimpersonation.

Hive Authentication Type: Specify the authentication type. This can be noauth, keytab, or keytabimpersonation.

Enter Name for Dataset to be saved as: Optionally specify a new name for the dataset that you are uploading.

SQL Query: Specify the Hive query that you want to execute.