Using Data Connectors with Native Installs

The config.toml file is available in the etc/dai folder after the RPM, DEB, or TAR SH install. Edit the following sections to enable access to HDFS, S3, Google Cloud Storage, Google BigQuery, Minio, and Snowflake.

Before starting any of the procedures below, be sure to export the location of the Driverless AI config.toml file or add the export to ~/.bashrc. For example:

export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"
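
To make this setting persist across shell sessions, you can append the same export to ~/.bashrc. The following is a minimal sketch, assuming the config file lives at /etc/dai/config.toml as in the example above:

# Persist the config.toml location for future shell sessions
echo 'export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"' >> ~/.bashrc
# Reload ~/.bashrc so the variable is available in the current shell
source ~/.bashrc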

HDFS Setup

This section provides instructions for configuring Driverless AI to work with HDFS.

HDFS with No Authentication

This example enables the HDFS data connector and disables HDFS authentication in the config.toml file. It does not pass any HDFS configuration file; it simply enables the connector so that users can reference data stored in HDFS directly using the name node address, for example: hdfs://name.node/datasets/iris.csv.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, hdfs"

Save the changes when you are done, then stop/restart Driverless AI.
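
The exact stop/restart commands depend on the install type. As a rough sketch (service and script names may differ in your environment):

# RPM and DEB installs typically run Driverless AI as a systemd service
sudo systemctl stop dai
sudo systemctl start dai

# TAR SH installs are typically stopped and started with the bundled scripts
./kill-dai.sh
./run-dai.sh

The same stop/restart step applies after each of the configuration changes in the sections below.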

HDFS with Keytab-Based Authentication

This example:

  • Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
  • Configures the hdfs_app_principal_user variable to reference a user for whom the keytab was created (usually in the form of user@realm).
# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, hdfs"

# HDFS connector
# Specify HDFS Auth Type, allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a Key tab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytab"

# Path of the principal key tab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"
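
Before restarting, you can sanity-check the keytab with the standard Kerberos client tools. A minimal sketch, using the placeholder names from the example above and assuming the MIT Kerberos utilities are installed:

# List the principals stored in the keytab; the output should include the
# principal configured as hdfs_app_principal_user
klist -kt /tmp/<keytabname>

# Optionally obtain a ticket from the keytab to confirm that it is valid
kinit -kt /tmp/<keytabname> <user@kerberosrealm>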

Save the changes when you are done, then stop/restart Driverless AI.

HDFS with Keytab-Based Impersonation

This example:

  • Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
  • Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was created (usually in the form of user@realm).
  • Configures the hdfs_app_login_user variable, which references a user who is being impersonated (usually in the form of user@realm).
# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, hdfs"

# HDFS connector
# Specify HDFS Auth Type, allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a Key tab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytab"

# Path of the principal key tab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"

# Specify the user id of the current user here as user@realm
hdfs_app_login_user = "<user@realm>"
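
For impersonation to work, the Hadoop cluster must also allow the principal to act as a proxy user. You can inspect the relevant core-site.xml settings from any node with the Hadoop client installed; a minimal sketch, substituting the short name of your principal for <user>:

# Show which hosts and groups the principal is allowed to impersonate from
hdfs getconf -confKey hadoop.proxyuser.<user>.hosts
hdfs getconf -confKey hadoop.proxyuser.<user>.groups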

Save the changes when you are done, then stop/restart Driverless AI.

S3 Setup

This section provides instructions for configuring Driverless AI to work with S3.

S3 with No Authentication

This example enables the S3 data connector and disables authentication. It does not pass any S3 access key or secret; it simply enables the connector so that users can reference data stored in S3 directly using the bucket address, for example: s3://<bucket>/datasets/iris.csv.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, s3"

Save the changes when you are done, then stop/restart Driverless AI.

S3 with Authentication

This example enables the S3 data connector with authentication by passing an S3 access key ID and a secret access key. Users can then reference data stored in S3 directly using the bucket address, for example: s3://<bucket>/datasets/iris.csv.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, s3"

# AWS authentication settings
#   True : Authenticated connection
#   False : Unverified connection
aws_auth = "True"

# S3 Connector credentials
aws_access_key_id = "<access_key_id>"
aws_secret_access_key = "<access_key>"
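
If the AWS CLI is installed, you can confirm that the key pair works outside of Driverless AI before restarting. A minimal sketch (the bucket name is a placeholder):

# List a bucket using the same credentials configured above
AWS_ACCESS_KEY_ID="<access_key_id>" \
AWS_SECRET_ACCESS_KEY="<access_key>" \
aws s3 ls s3://<bucket>/datasets/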

Save the changes when you are done, then stop/restart Driverless AI.

Google Cloud Storage Setup

This section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup requires you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication.

In order to enable the GCS data connector with authentication, you must:

  1. Obtain a JSON authentication file from GCP.
  2. Copy the JSON file to the machine that is running Driverless AI.
  3. Specify the path to the JSON file in the gcs_path_to_service_account_json configuration option.

Note: The service account JSON includes the authentications provided by the system administrator. You may be given a JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or none at all.
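
If you are not sure what the file contains, you can inspect a few of the standard service account fields. A minimal sketch, assuming jq is installed and using the placeholder file name from the example below:

# A GCP service account key normally includes these fields
jq '.type, .project_id, .client_email' /service_account_json.json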

GCS with Authentication

This example enables the GCS data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google Cloud Storage authentications.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, gcs"

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"
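
To confirm that the service account can actually reach your buckets, you can test it with the Google Cloud SDK, if it is available. A minimal sketch (the bucket name is a placeholder):

# Authenticate as the service account using the same JSON key
gcloud auth activate-service-account --key-file="/service_account_json.json"
# List a bucket the account should have access to
gsutil ls gs://<bucket>/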

Save the changes when you are done, then stop/restart Driverless AI.

Google BigQuery Setup

Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires you to enable authentication. If you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication.

In order to enable the GBQ data connector with authentication, you must:

  1. Obtain a JSON authentication file from GCP.
  2. Copy the JSON file to the machine that is running Driverless AI.
  3. Specify the path to the JSON file in the gcs_path_to_service_account_json configuration option.

Notes:

  • The service account JSON includes the authentications provided by the system administrator. You may be given a JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or none at all.
  • The Google BigQuery API limits the amount of data that can be extracted to a single file to 1 GB. Queries that return more than 1 GB of data will fail.

Google BigQuery with Authentication

This example enables the GBQ data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google BigQuery authentications.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, gbq"

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"
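
You can similarly verify BigQuery access with the bq command-line tool from the Google Cloud SDK, if it is available. A minimal sketch:

# Authenticate as the service account using the same JSON key
gcloud auth activate-service-account --key-file="/service_account_json.json"
# Run a trivial query to confirm the account can reach BigQuery
bq query --use_legacy_sql=false 'SELECT 1'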

Save the changes when you are done, then stop/restart Driverless AI.

Minio Setup

This section provides instructions for configuring Driverless AI to work with Minio. Note that, unlike S3, authentication must be configured when the Minio data connector is enabled.

Minio with Authentication

This example enables the Minio data connector with authentication by passing an endpoint URL, an access key ID, and a secret access key. This allows users to reference data stored in Minio directly using the endpoint URL, for example: http://<endpoint_url>/<bucket>/datasets/iris.csv.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, minio"

# Minio Connector credentials
minio_endpoint_url = "<endpoint_url>"
minio_access_key_id = "<access_key_id>"
minio_secret_access_key = "<access_key>"
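
You can check the endpoint and key pair with the MinIO client (mc), if it is installed. A minimal sketch (the alias and bucket names are placeholders; older mc releases use mc config host add instead of mc alias set):

# Register the endpoint and credentials under a local alias
mc alias set dai-minio "<endpoint_url>" "<access_key_id>" "<access_key>"
# List a bucket to confirm that the credentials work
mc ls dai-minio/<bucket>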

Save the changes when you are done, then stop/restart Driverless AI.

Snowflake Setup

Driverless AI allows you to explore Snowflake data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Snowflake. This setup requires you to enable authentication. If you enable the Snowflake connector, it will be available in the UI, but you will not be able to use it without authentication.

Snowflake with Authentication

This example enables the Snowflake data connector with authentication by passing the account, user, and password variables.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
enabled_file_systems = "file, snow"

# Snowflake Connector credentials
snowflake_account = "<account_id>"
snowflake_user = "<username>"
snowflake_password = "<password>"
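
You can verify the account, user, and password with the SnowSQL command-line client, if it is installed. A minimal sketch:

# Connect with the same credentials and run a trivial query; SnowSQL prompts
# for the password, or it can be supplied via the SNOWSQL_PWD environment variable
snowsql -a "<account_id>" -u "<username>" -q "SELECT CURRENT_VERSION();"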

Save the changes when you are done, then stop/restart Driverless AI.