Hive Setup

Driverless AI lets you explore Hive data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Hive.

Note: Depending on your Docker install version, use either the docker run --runtime=nvidia (>= Docker 19.03) or nvidia-docker (< Docker 19.03) command when starting the Driverless AI Docker image. Use docker version to check which version of Docker you are using.

Description of Configuration Attributes

  • enabled_file_systems: The file systems you want to enable. This must be configured in order for data connectors to function properly.

  • hive_app_configs: Configuration for Hive Connector. Inputs are similar to configuring the HDFS connector. Important keys include:

    • hive_conf_path: The path to Hive configuration. This can have multiple files (e.g. hive-site.xml, hdfs-site.xml, etc.)

    • auth_type: Specify one of noauth, keytab, or keytabimpersonation for Kerberos authentication

    • keytab_path: Specify the path to Kerberos keytab to use for authentication (this can be "" if using auth_type="noauth")

    • principal_user: Specify the Kerberos app principal user (required when using auth_type="keytab" or auth_type="keytabimpersonation")

Notes:

  • With Hive connectors, it is assumed that DAI is running on the edge node. However, if a non-edge node is being used, then there could be issues (e.g. missing classes, dependencies, authorization errors).

    • Ensure the core-site.xml file (from e.g Hadoop conf) is also present in the Hive conf with the rest of the files (hive-site.xml, hdfs-site.xml, etc.). The core-site.xml file should have proxyuser configured (e.g. hadoop.proxyuser.hive.hosts & hadoop.proxyuser.hive.groups).

    • If you have tez as the Hive execution engine, make sure that the required tez dependencies (classpaths, jars, etc.) are available on the DAI node. Alternatively, you can use internal engines that come with DAI by changing your hive.execution.engine value in the hive-site.xml file to mr or spark.

The configuration should be JSON/Dictionary String with multiple keys. For example:

"""{
  "hive_connection_1": {
   "hive_conf_path": "/path/to/hive/conf",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename>.keytab",
   "principal_user": "hive/node1.example.com@EXAMPLE.COM",
  },
  "hive_connection_2": {
   "hive_conf_path": "/path/to/hive/conf_2",
   "auth_type": "one of ['noauth', 'keytab',
   'keytabimpersonation']",
   "keytab_path": "/path/to/<filename_2>.keytab",
   "principal_user": "hive/node2.example.com@EXAMPLE.COM",
  }
}"""

Note: The expected input of hive_app_configs is a JSON string. Double quotation marks ("...") must be used to denote keys and values within the JSON dictionary, and outer quotations must be formatted as either """, ''', or '. Depending on how the configuration value is applied, different forms of outer quotations may be required. The following examples show two unique methods for applying outer quotations.

  • Configuration value applied with the config.toml file:

hive_app_configs = """{"my_json_string": "value", "json_key_2": "value2"}"""
  • Configuration value applied with an environment variable:

DRIVERLESS_AI_HIVE_APP_CONFIGS='{"my_json_string": "value", "json_key_2": "value2"}'
  • hive_app_jvm_args: Optionally specify additional Java Virtual Machine (JVM) args for the Hive connector. Each arg must be separated by a space.

Notes:

  • If a custom JAAS configuration file is needed for your Kerberos setup, use hive_app_jvm_args to specify the appropriate file:

hive_app_jvm_args = "-Xmx20g -Djava.security.auth.login.config=/etc/dai/jaas.conf"

Sample jaas.conf file:

com.sun.security.jgss.initiate {
 com.sun.security.auth.module.Krb5LoginModule required
 useKeyTab=true
 useTicketCache=false
 principal="hive/localhost@EXAMPLE.COM" [Replace this line]
 doNotPrompt=true
 keyTab="/path/to/hive.keytab" [Replace this line]
 debug=true;
};
  • hive_app_classpath: Optionally specify an alternative classpath for the Hive connector.

Troubleshooting

The following section contains important troubleshooting information.

Supported Java system properties

You can use the following supported Java system properties in the DAI Hive connector for troubleshooting purposes.

  • dai.connectors.hive.save_as: You can use this system property to change the data format that is used for saving queried data. The possible values are parquet (default), csv, json, and orc. The following is an example of how you can set this system property in the config.toml file: hive_app_jvm_args="-Xmx4g -Ddai.connectors.hive.save_as=orc"

Enable Hive with Authentication

This section describes how to enable Hive when starting Driverless AI in Docker. This can be done by specifying each environment variable in the nvidia-docker run command or by editing the configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

  1. Start the Driverless AI Docker Image.

  nvidia-docker run \
    --pid=host \
    --init \
    --rm \
    --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs,hive" \
    -e DRIVERLESS_AI_HIVE_APP_CONFIGS='{"hive_connection_2: {"hive_conf_path":"/etc/hadoop/conf",
                                                         "auth_type":"keytabimpersonation",
                                                         "keytab_path":"/etc/dai/steam.keytab",
                                                         "principal_user":"steam/mr-0xg9.0xdata.loc@H2OAI.LOC"}}' \
    -p 12345:12345 \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
    -v /path/to/hive/conf:/path/to/hive/conf/in/docker \
    -v /path/to/hive.keytab:/path/in/docker/hive.keytab \
    -u $(id -u):${id -g) \
    h2oai/dai-ubi8-x86_64:1.11.1.1-cuda11.8.0.xx

Adding Datasets Using Hive

After the Hive connector is enabled, you can add datasets by selecting Hive from the Add Dataset (or Drag and Drop) drop-down menu.

  1. Select the Hive configuraton that you want to use.

Select Hive configuration
  1. Specify the following information to add your dataset.

  • Hive Database: Specify the name of the Hive database that you are querying.

  • Hadoop Configuration Path: Specify the path to your Hive configuration file.

  • Hive Kerberos Keytab Path: Specify the path for the Hive Kerberos keytab.

  • Hive Kerberos Principal: Specify the Hive Kerberos principal. This is required if the Hive Authentication Type is keytabimpersonation.

  • Hive Authentication Type: Specify the authentication type. This can be noauth, keytab, or keytabimpersonation.

  • Enter Name for Dataset to be saved as: Optionally specify a new name for the dataset that you are uploading.

  • SQL Query: Specify the Hive query that you want to execute.

Configure Hive query