Using Data Connectors with Native Installs

The config.toml file is available in the etc/dai folder after the RPM or DEB package is installed. Edit the following sections to enable access to HDFS, S3, Google Cloud Storage, and Google BigQuery.
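
The enabled_file_systems setting shown in the examples below accepts a comma-separated list, so several connectors can be enabled at once. A minimal sketch, using only the connector keys documented in the comments of those examples:

# Enable the local file system plus all four connectors covered in this guide
enabled_file_systems = "file, hdfs, s3, gcs, gbq"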

HDFS Setup

This section provides instructions for configuring Driverless AI to work with HDFS.

HDFS with No Authentication

This example enables the HDFS data connector and disables HDFS authentication in the config.toml file. It does not pass any HDFS configuration file; it simply enables the connector so that users can reference data stored in HDFS directly by name node address, for example: hdfs://name.node/datasets/iris.csv.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, hdfs"
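
If your config.toml also exposes the hdfs_auth_type option shown in the keytab examples below, you can set it to noauth to make the no-authentication choice explicit; a minimal sketch:

# HDFS connector
# Explicitly disable HDFS authentication
hdfs_auth_type = "noauth"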

Save the changes when you are done, then stop/restart Driverless AI.

HDFS with Keytab-Based Authentication

This example:

  • Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
  • Configures the hdfs_app_principal_user variable to reference a user for whom the keytab was created (usually in the form of user@realm).
# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, hdfs"

# HDFS connector
# Specify the HDFS authentication type. Allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a keytab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytab"

# Path to the principal keytab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"
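
For reference, a filled-in sketch with hypothetical values (the keytab file name, user, and realm below are illustrative placeholders, not values from your environment):

hdfs_auth_type = "keytab"
key_tab_path = "/tmp/dtmp/dai.keytab"
hdfs_app_principal_user = "dai_user@EXAMPLE.COM"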

Save the changes when you are done, then stop/restart Driverless AI.

HDFS with Keytab-Based Impersonation

This example:

  • Places keytabs in the /tmp/dtmp folder on your machine and provides the file path as described below.
  • Configures the hdfs_app_principal_user variable, which references a user for whom the keytab was created (usually in the form of user@realm).
  • Configures the hdfs_app_login_user variable, which references a user who is being impersonated (usually in the form of user@realm).
# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, hdfs"

# HDFS connector
# Specify the HDFS authentication type. Allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a keytab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
hdfs_auth_type = "keytabimpersonation"

# Path to the principal keytab file
key_tab_path = "/tmp/<keytabname>"

# Kerberos app principal user (recommended)
hdfs_app_principal_user = "<user@kerberosrealm>"

# Specify the user being impersonated, as user@realm
hdfs_app_login_user = "<user@realm>"
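
For reference, a filled-in sketch with hypothetical values, showing the principal user alongside the login user being impersonated (all names and the realm are illustrative placeholders):

hdfs_auth_type = "keytabimpersonation"
key_tab_path = "/tmp/dtmp/dai.keytab"
hdfs_app_principal_user = "dai_service@EXAMPLE.COM"
hdfs_app_login_user = "analyst_user@EXAMPLE.COM"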

Save the changes when you are done, then stop/restart Driverless AI.

S3 Setup

This section provides instructions for configuring Driverless AI to work with S3.

S3 with No Authentication

This example enables the S3 data connector without authentication. It does not pass any S3 access key or secret; users can reference data stored in S3 directly, for example: s3://bucket/datasets/iris.csv.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, s3"
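
The authenticated example below also shows an aws_auth setting. If that setting is present in your config.toml, leaving it at "False" keeps the connection unverified, which matches this no-authentication setup; a minimal sketch:

# AWS authentication settings
#   True : Authenticated connection
#   False : Unverified connection
aws_auth = "False"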

Save the changes when you are done, then stop/restart Driverless AI.

S3 with Authentication

This example enables the S3 data connector with authentication by passing an S3 access key ID and a secret access key. Users can then reference data stored in S3, for example: s3://bucket/datasets/iris.csv.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, s3"

# AWS authentication settings
#   True : Authenticated connection
#   False : Unverified connection
aws_auth = "True"

# S3 Connector credentials
aws_access_key_id = "<access_key_id>"
aws_secret_access_key = "<secret_access_key>"
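
For reference, a filled-in sketch using AWS's published documentation example credentials (these are not valid keys; replace them with your own):

aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"
aws_secret_access_key = "wJalrXEJRnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"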

Save the changes when you are done, then stop/restart Driverless AI.

Google Cloud Storage Setup

This section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup requires authentication: if you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you will not be able to use them until authentication is configured.

In order to enable the GCS data connector with authentication, you must:

  1. Retrieve a JSON authentication file from GCP.
  2. Copy the JSON file to a location on the machine where Driverless AI is installed.
  3. Set the gcs_path_to_service_account_json option in config.toml to the path of that JSON file (for example, /json_auth_file.json).

Note: The service account JSON includes authentications as provided by the system administrator. You may be given a JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one of them, or none at all.

GCS with Authentication

This example enables the GCS data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google Cloud Storage authentications.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, gcs"

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"

Save the changes when you are done, then stop/restart Driverless AI.

Google BigQuery Setup

Driverless AI allows you to explore Google BigQuery data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Google BigQuery. This setup requires authentication: if you enable the GCS or GBQ connectors, those file systems will be available in the UI, but you will not be able to use them until authentication is configured.

In order to enable the GBQ data connector with authentication, you must:

  1. Retrieve a JSON authentication file from GCP.
  2. Copy the JSON file to a location on the machine where Driverless AI is installed.
  3. Set the gcs_path_to_service_account_json option in config.toml to the path of that JSON file (for example, /json_auth_file.json).

Notes:

  • The service account JSON includes authentications as provided by the system administrator. You may be given a JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one of them, or none at all.
  • Google BigQuery APIs limit the amount of data that can be extracted to a single file to 1 GB. Queries larger than this will fail.

GBQ with Authentication

This example enables the GBQ data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google BigQuery authentications.

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file, gbq"

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"

Save the changes when you are done, then stop/restart Driverless AI.
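
As noted above, a single service account JSON file may contain both Google Cloud Storage and Google BigQuery authentications. In that case, both connectors can be enabled against the same file; a minimal sketch reusing the path placeholder from the examples above:

# File System Support
enabled_file_systems = "file, gcs, gbq"

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
gcs_path_to_service_account_json = "/service_account_json.json"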