Using Data Connectors with the Docker Image

Available file systems can be configured via the enabled_file_systems property. Note that when a property is set as a Docker environment variable, the property name must be uppercased and prepended with DRIVERLESS_AI_. For example:

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  -u `id -u`:`id -g` \
  -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs,gcs,gbq" \
  -v `pwd`/data:/data \
  -v `pwd`/log:/log \
  -v `pwd`/license:/license \
  -v `pwd`/tmp:/tmp \
  opsh2oai/h2oai-runtime
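
For reference, here is a minimal sketch of how the same property looks in the config.toml file used by native (non-Docker) installs versus as a Docker environment variable; the values shown are placeholders:

# In config.toml (native installs):
enabled_file_systems = "file,s3,hdfs,gcs,gbq"

# As a Docker environment variable (uppercased and prefixed with DRIVERLESS_AI_):
-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs,gcs,gbq"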

The sections that follow show examples of using environment variables to enable the HDFS, S3, Google Cloud Storage, and Google BigQuery data sources.

HDFS Setup

Driverless AI allows you to explore HDFS data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with HDFS.

Description of Configuration Attributes

  • hdfs_auth_type: Selects the HDFS authentication method. Available values are:

    • principal
    • keytab
    • keytabimpersonation
    • noauth
  • hdfs_core_site_xml_path: The location of the core-site.xml configuration file (a sketch of mounting this file into the container follows this list).
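
None of the examples below exercise hdfs_core_site_xml_path. As a minimal sketch, assuming the cluster's core-site.xml lives at /etc/hadoop/conf/core-site.xml on the host (both paths here are illustrative), you could mount it into the container and point Driverless AI at it:

nvidia-docker run \
 -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
 -e DRIVERLESS_AI_HDFS_CORE_SITE_XML_PATH='/hadoop/core-site.xml' \
 -v /etc/hadoop/conf/core-site.xml:/hadoop/core-site.xml \
 -p 12345:12345 \
 --init -it --rm \
 -v /tmp/dtmp/:/tmp \
 -v /tmp/dlog/:/log \
 -u $(id -u):$(id -g) \
 opsh2oai/h2oai-runtime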

HDFS with No Authentication

This example enables the HDFS data connector and disables HDFS authentication. It does not pass any HDFS configuration file; however, it configures Docker DNS by passing the name and IP address of the HDFS name node. This allows users to reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv.

nvidia-docker run \
 --add-host name.node:172.16.2.186 \
 -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
 -e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth'  \
 -p 12345:12345 \
 --init -it --rm \
 -v /tmp/dtmp/:/tmp \
 -v /tmp/dlog/:/log \
 -u $(id -u):$(id -g) \
 opsh2oai/h2oai-runtime
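
Once the container is running, you can sanity-check the DNS entry added by --add-host from inside the container (this assumes getent is available in the image, which is typical for Linux base images):

docker exec <container_id> getent hosts name.node
# Expected output: 172.16.2.186   name.node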

HDFS with Keytab-Based Authentication

This example:

  • Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below (a sketch of creating a keytab follows this list).
  • Configures the environment variable DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER to reference a user for whom the keytab was created (usually in the form of user@realm).
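
If you need to create the keytab yourself, here is a hedged sketch using MIT Kerberos ktutil (sites using kadmin or Active Directory tooling will differ; the principal, encryption type, and file name are placeholders):

ktutil
ktutil:  addent -password -p user@EXAMPLE.COM -k 1 -e aes256-cts-hmac-sha1-96
Password for user@EXAMPLE.COM:
ktutil:  wkt /tmp/dtmp/user.keytab
ktutil:  quit
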
# Docker instructions
nvidia-docker run \
 -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
 -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab'  \
 -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
 -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<user@kerberosrealm>>' \
 -p 12345:12345 \
 --init -it --rm \
 -v /tmp/dtmp/:/tmp \
 -v /tmp/dlog/:/log \
 -u $(id -u):$(id -g) \
 opsh2oai/h2oai-runtime
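
Before starting the container, you can list the principals stored in the keytab with klist (this assumes the Kerberos client tools are installed on the host):

klist -kt /tmp/dtmp/<<keytabname>>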

HDFS with Keytab-Based Impersonation

This example:

  • Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
  • Configures the DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER variable, which references a user for whom the keytab was created (usually in the form of user@realm).
  • Configures the DRIVERLESS_AI_HDFS_APP_LOGIN_USER variable, which references a user who is being impersonated (usually in the form of user@realm). The Hadoop cluster must also allow this impersonation; see the core-site.xml sketch after this list.
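
Keytab-based impersonation requires the Hadoop cluster itself to permit the proxying. A minimal core-site.xml sketch, assuming the application principal's short name is appuser (tighten hosts and groups to your site's policy):

<property>
  <name>hadoop.proxyuser.appuser.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.appuser.groups</name>
  <value>*</value>
</property>
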
# Docker instructions
nvidia-docker run \
 -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs" \
 -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation' \
 -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<<keytabname>>' \
 -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<<appuser@kerberosrealm>>' \
 -e DRIVERLESS_AI_HDFS_APP_LOGIN_USER='<<thisuser@kerberosrealm>>' \
 -p 12345:12345 \
 --init -it --rm \
 -v /tmp/dtmp/:/tmp \
 -v /tmp/dlog/:/log \
 -u $(id -u):$(id -g) \
 opsh2oai/h2oai-runtime

S3 Setup

Driverless AI allows you to explore S3 data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with S3.

Description of Configuration Attributes

  • aws_access_key_id: The S3 access key ID
  • aws_secret_access_key: The S3 secret access key (a quick credential sanity check follows this list)
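
As an optional sanity check before starting the container, you can confirm the key pair from the host with the AWS CLI (not part of Driverless AI; the bucket name is a placeholder):

AWS_ACCESS_KEY_ID="<access_key_id>" \
AWS_SECRET_ACCESS_KEY="<access_key>" \
aws s3 ls s3://<your_bucket>/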

S3 with No Authentication

This example enables the S3 data connector and disables authentication. It does not pass any S3 access key or secret; however, it configures Docker DNS by passing the name and IP address of the S3 name node. This allows users to reference data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv.

nvidia-docker run \
 --add-host name.node:172.16.2.186 \
 -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3" \
 -p 12345:12345 \
 --init -it --rm \
 -v /tmp/dtmp/:/tmp \
 -v /tmp/dlog/:/log \
 -u $(id -u):$(id -g) \
 opsh2oai/h2oai-runtime

S3 with Authentication

This example enables the S3 data connector with authentication by passing an S3 access key ID and a secret access key. It also configures Docker DNS by passing the name and IP address of the S3 name node. This allows users to reference data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv.

nvidia-docker run \
 --add-host name.node:172.16.2.186 \
 -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3" \
 -e DRIVERLESS_AI_AWS_AUTH="True" \
 -e DRIVERLESS_AI_AWS_ACCESS_KEY_ID="<access_key_id>" \
 -e DRIVERLESS_AI_AWS_SECRET_ACCESS_KEY="<access_key>" \
 -p 12345:12345 \
 --init -it --rm \
 -v /tmp/dtmp/:/tmp \
 -v /tmp/dlog/:/log \
 -u $(id -u):$(id -g) \
 opsh2oai/h2oai-runtime

Google Cloud Storage Setup

Driverless AI allows you to explore Google Cloud Storage data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup requires you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you will not be able to use them without authentication.

In order to enable the GCS data connector with authentication, you must:

  1. Retrieve a JSON authentication file from GCP (a gcloud sketch follows this list).
  2. Mount the JSON file to the Docker instance.
  3. Specify the in-container path to the JSON file (for example, /json_auth_file.json) in the GCS_PATH_TO_SERVICE_ACCOUNT_JSON environment variable.
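
For step 1, your administrator may supply the key file, or you can create one yourself. Here is a hedged sketch using the gcloud CLI (the service account email is a placeholder, and this assumes you have permission to create keys):

gcloud iam service-accounts keys create service_account_json.json \
    --iam-account=<service_account_name>@<project_id>.iam.gserviceaccount.com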

Note: The service account JSON contains the credentials provided by the system administrator. You may be given a JSON file that contains credentials for both Google Cloud Storage and Google BigQuery, for just one of them, or for neither.

GCS with Authentication

This example enables the GCS data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google Cloud Storage authentications.

nvidia-docker run \
    --rm \
    -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gcs" \
    -e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON="/service_account_json.json" \
    -u `id -u`:`id -g` \
    -p 12345:12345 \
    -p 54321:54321 \
    -p 9090:9090 \
    -v `pwd`/data:/data \
    -v `pwd`/log:/log \
    -v `pwd`/license:/license \
    -v `pwd`/tmp:/tmp \
    -v `pwd`/service_account_json.json:/service_account_json.json \
    opsh2oai/h2oai-runtime

Google BigQuery Setup

Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you will not be able to use them without authentication.

In order to enable the GBQ data connector with authentication, you must:

  1. Retrieve a JSON authentication file from GCP.
  2. Mount the JSON file to the Docker instance.
  3. Specify the in-container path to the JSON file (for example, /json_auth_file.json) in the GCS_PATH_TO_SERVICE_ACCOUNT_JSON environment variable.

Notes:

  • The service account JSON contains the credentials provided by the system administrator. You may be given a JSON file that contains credentials for both Google Cloud Storage and Google BigQuery, for just one of them, or for neither.
  • Google BigQuery APIs limit the amount of data that can be extracted into a single file to 1 GB; queries with results larger than this will fail (a dry-run sketch for estimating a query's size follows this list).
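
One rough way to gauge whether a query will approach that limit is a dry run with the bq CLI (not part of Driverless AI), which reports how many bytes the query would process; note that bytes processed is an upper bound on, not the same as, the size of the extracted result. The table reference is a placeholder:

bq query --dry_run --use_legacy_sql=false \
    'SELECT * FROM `<project>.<dataset>.<table>`'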

GBQ with Authentication

This example enables the GBQ data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google BigQuery authentications.

nvidia-docker run \
    --rm \
    -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gbq" \
    -e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON="/service_account_json.json" \
    -u `id -u`:`id -g` \
    -p 12345:12345 \
    -p 54321:54321 \
    -p 9090:9090 \
    -v `pwd`/data:/data \
    -v `pwd`/log:/log \
    -v `pwd`/license:/license \
    -v `pwd`/tmp:/tmp \
    -v `pwd`/service_account_json.json:/service_account_json.json \
    opsh2oai/h2oai-runtime