Azure Blob Storage Setup

Driverless AI lets you explore Azure Blob Storage data sources from within the Driverless AI application.

Note: For Docker 19.03 and later, use the --gpus all flag with docker run to enable GPU support. The older nvidia-docker wrapper is deprecated and no longer recommended. Ensure that the NVIDIA Container Toolkit is installed. To check your Docker version, run docker version.

Supported Data Sources Using the Azure Blob Storage Connector

The following data sources can be used with the Azure Blob Storage connector.

Azure Blob Storage (general purpose v1)
Blob Storage
Azure Files (File Storage)
Azure Data Lake Storage Gen 2 (Storage V2)

The following data sources can be used with the Azure Blob Storage connector when also using the HDFS connector.

Azure Data Lake Gen 1 (HDFS connector required)
Azure Data Lake Gen 2 (HDFS connector optional)

Description of Configuration Attributes

The following configuration attributes are specific to enabling Azure Blob Storage.

azure_blob_account_name: The Microsoft Azure Storage account name. This should be the DNS prefix chosen when the account was created (for example, “mystorage”).
azure_blob_account_key: Specify the account key that maps to your account name.
azure_connection_string: Optionally specify a new connection string. With this option, you can include an override for a host, port, and/or account name. For example,
```
azure_connection_string = "DefaultEndpointsProtocol=http;AccountName=<account_name>;AccountKey=<account_key>;BlobEndpoint=http://<host>:<port>/<account_name>;"
```
azure_blob_init_path: Specifies the starting Azure Blob Storage path displayed in the UI of the Azure Blob Storage browser.
enabled_file_systems: The file systems you want to enable. This must be configured in order for data connectors to function properly.
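
As an illustration of the azure_connection_string format shown above, the following minimal Python sketch assembles the string from its parts. The function name build_connection_string and its parameters are hypothetical helpers for this example, not part of Driverless AI; the field names come from the example string above.

```python
def build_connection_string(account_name, account_key,
                            host=None, port=None, protocol="https"):
    """Assemble a connection string in the format shown above.

    Hypothetical helper for illustration; not a Driverless AI API.
    """
    parts = [
        f"DefaultEndpointsProtocol={protocol}",
        f"AccountName={account_name}",
        f"AccountKey={account_key}",
    ]
    # Only add a BlobEndpoint override when a custom host is given.
    if host is not None:
        endpoint = f"{protocol}://{host}"
        if port is not None:
            endpoint += f":{port}"
        parts.append(f"BlobEndpoint={endpoint}/{account_name}")
    # Each field, including the last one, ends with a semicolon.
    return ";".join(parts) + ";"


if __name__ == "__main__":
    # Placeholder values; substitute your own host, port, and key.
    print(build_connection_string("mystorage", "<account_key>",
                                  host="storage.example.internal",
                                  port=10000, protocol="http"))
```

The printed value can then be pasted into config.toml as the value of azure_connection_string.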

The following additional configuration attributes can be used to enable the HDFS connector to connect to Azure Data Lake Gen 1 (and, optionally, to Azure Data Lake Gen 2).

hdfs_config_path: The path to the HDFS config folder. This folder can contain multiple config files.
hdfs_app_classpath: The HDFS classpath.
hdfs_app_supported_schemes: The list of supported schemes. This list is used as an initial check to ensure that input to the connector is valid.
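
To make the idea of an initial scheme check concrete, here is a minimal Python sketch. It assumes the setting is written as a quoted Python-style list, as in the examples later in this section (for instance "['adl://']"). The helper names are hypothetical, and this is not the validation code that Driverless AI itself runs.

```python
import ast


def parse_supported_schemes(setting):
    """Parse a value such as "['adl://']" into a list of scheme prefixes."""
    schemes = ast.literal_eval(setting)
    if not isinstance(schemes, list) or not all(isinstance(s, str) for s in schemes):
        raise ValueError("expected a list of strings, for example \"['adl://']\"")
    return schemes


def is_supported(uri, schemes):
    """Return True if the URI starts with one of the supported scheme prefixes."""
    return any(uri.startswith(scheme) for scheme in schemes)


if __name__ == "__main__":
    schemes = parse_supported_schemes("['adl://', 'abfs://']")
    for uri in ("adl://myadl.azuredatalakestore.net/data.csv", "s3://bucket/data.csv"):
        print(uri, "->", "accepted" if is_supported(uri, schemes) else "rejected")
```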

Example 1: Enabling the Azure Blob Storage Data Connector

This example enables the Azure Blob Storage data connector by specifying environment variables when starting the Driverless AI Docker image. This lets users reference data stored on your Azure storage account using the account name, for example: https://mystorage.blob.core.windows.net.

 docker run --gpus all \
   --pid=host \
   --init \
   --rm \
   --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
   -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,azrbs" \
   -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_NAME="mystorage" \
   -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_KEY="<access_key>" \
   -p 12345:12345 \
   -v /tmp/dtmp/:/tmp \
   -v /tmp/dlog/:/log \
   -v /tmp/dlicense/:/license \
   -v /tmp/ddata/:/data \
   -u $(id -u):$(id -g) \
   h2oai/dai-ubi8-x86_64:2.1.0-cuda11.8.0.xx
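
The command above suggests a naming pattern: each config.toml option appears as an environment variable with the DRIVERLESS_AI_ prefix and the option name in upper case. The following Python sketch applies that pattern to produce the -e flags. The pattern is inferred from this example only, and the helper names are hypothetical.

```python
def to_env_var(option):
    """Map a config.toml option name to the environment variable form used above."""
    return "DRIVERLESS_AI_" + option.upper()


def docker_env_flags(options):
    """Render options as `-e NAME="value"` flags for docker run."""
    return [f'-e {to_env_var(name)}="{value}"' for name, value in options.items()]


if __name__ == "__main__":
    flags = docker_env_flags({
        "enabled_file_systems": "file,azrbs",
        "azure_blob_account_name": "mystorage",
        "azure_blob_account_key": "<access_key>",  # placeholder
    })
    print(" \\\n".join(flags))
```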

This example shows how to configure Azure Blob Storage options in the config.toml file, and then specify that file when starting Driverless AI in Docker.

Configure the Driverless AI config.toml file. Set the following configuration options:

enabled_file_systems = "file, upload, azrbs"

azure_blob_account_name = "mystorage"

azure_blob_account_key = "<account_key>"

Mount the config.toml file into the Docker container.

 docker run --gpus all \
  --pid=host \
  --init \
  --rm \
  --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-ubi8-x86_64:2.1.0-cuda11.8.0.xx

This example shows how to enable the Azure Blob Storage data connector in the config.toml file when starting Driverless AI in native installs. This lets users reference data stored on your Azure storage account using the account name, for example: https://mystorage.blob.core.windows.net.

Export the location of the Driverless AI config.toml file, or add the export line to ~/.bashrc. For example:

# DEB and RPM
export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

# TAR SH
export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

Specify the following configuration options in the config.toml file.

# File System Support
# upload : standard upload feature
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# dtap : Blue Data Tap file system, remember to configure the DTap section below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
# jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
# hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
# recipe_url: load custom recipe from URL
# recipe_file: load custom recipe from local file system
enabled_file_systems = "file, azrbs"

# Azure Blob Storage Connector credentials
azure_blob_account_name = "mystorage"
azure_blob_account_key = "<account_key>"

Save the changes when you are done, then stop/restart Driverless AI.
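
Before restarting, it can help to confirm that the three options above are actually set. The sketch below checks a dictionary of settings; the function name check_azure_blob_settings is hypothetical, and the checks reflect only the options used in this example, not every combination that Driverless AI may accept.

```python
def check_azure_blob_settings(settings):
    """Return a list of problems found in config.toml-style settings."""
    problems = []
    # enabled_file_systems is a comma-separated string, with optional spaces.
    enabled = [fs.strip() for fs in settings.get("enabled_file_systems", "").split(",")]
    if "azrbs" not in enabled:
        problems.append("azrbs is not listed in enabled_file_systems")
    for key in ("azure_blob_account_name", "azure_blob_account_key"):
        if not settings.get(key):
            problems.append(f"{key} is not set")
    return problems


if __name__ == "__main__":
    settings = {
        "enabled_file_systems": "file, azrbs",
        "azure_blob_account_name": "mystorage",
        "azure_blob_account_key": "<account_key>",  # placeholder
    }
    print(check_azure_blob_settings(settings) or "settings look complete")
```

On Python 3.11 or later, the dictionary could be loaded from the file itself with the standard tomllib module instead of being typed by hand.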

Example 2: Mount Azure File Shares to the Local File System

Supported Data Sources Using the Local File System

Azure Files (File Storage)

Mounting Azure File Shares

Azure file shares can be mounted into the local file system of Driverless AI. To mount an Azure file share, follow the steps listed on https://docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-linux.

Example 3: Enable HDFS Connector to Connect to Azure Data Lake Gen 1

This example enables the HDFS Connector to connect to Azure Data Lake Gen 1. This lets users reference data stored on your Azure Data Lake using an adl URI, for example: adl://myadl.azuredatalakestore.net.

Create an Azure AD web application for service-to-service authentication: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory
Add the information from your web application to the Hadoop core-site.xml configuration file:

<configuration>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>Token endpoint created in step 1.</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>Client ID created in step 1</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <value>Client Secret created in step 1</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>ADL URI</value>
  </property>
</configuration>

Take note of the Hadoop classpath and add the azure-datalake-store.jar file. This file can be found on any Hadoop version in: $HADOOP_HOME/share/hadoop/tools/lib/*.

echo "$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"

Configure the Driverless AI config.toml file. Set the following configuration options:

enabled_file_systems = "upload, file, hdfs, azrbs, recipe_file, recipe_url"
hdfs_config_path = "/path/to/hadoop/conf"
hdfs_app_classpath = "/hadoop/classpath/"
hdfs_app_supported_schemes = "['adl://']"

Mount the config.toml file into the Docker container.

 docker run --gpus all \
  --pid=host \
  --init \
  --rm \
  --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-ubi8-x86_64:2.1.0-cuda11.8.0.xx

The following steps apply to native installs.

Create an Azure AD web application for service-to-service authentication: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory
Add the information from your web application to the Hadoop core-site.xml configuration file:

<configuration>
  <property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>fs.adl.oauth2.refresh.url</name>
    <value>Token endpoint created in step 1.</value>
  </property>
  <property>
    <name>fs.adl.oauth2.client.id</name>
    <value>Client ID created in step 1</value>
  </property>
  <property>
    <name>fs.adl.oauth2.credential</name>
    <value>Client Secret created in step 1</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>ADL URI</value>
  </property>
</configuration>

Take note of the Hadoop classpath and add the azure-datalake-store.jar file. This file can be found on any Hadoop version in: $HADOOP_HOME/share/hadoop/tools/lib/*.

echo "$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"

Configure the Driverless AI config.toml file. Set the following configuration options:

enabled_file_systems = "upload, file, hdfs, azrbs, recipe_file, recipe_url"
hdfs_config_path = "/path/to/hadoop/conf"
hdfs_app_classpath = "/hadoop/classpath/"
hdfs_app_supported_schemes = "['adl://']"

Save the changes when you are done, then stop/restart Driverless AI.
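
The core-site.xml file used in this example can also be generated rather than edited by hand. The following Python sketch writes the same property names with the standard library; build_core_site is a hypothetical helper, the upper-case values are placeholders to replace with the details of your Azure AD web application, and ET.indent requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Render a dict of Hadoop properties as core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root, space="  ")  # pretty-print; available in Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; fill these in from the web application created in step 1.
    adl_gen1 = {
        "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
        "fs.adl.oauth2.refresh.url": "TOKEN_ENDPOINT",
        "fs.adl.oauth2.client.id": "CLIENT_ID",
        "fs.adl.oauth2.credential": "CLIENT_SECRET",
        "fs.defaultFS": "ADL_URI",
    }
    print(build_core_site(adl_gen1))
```

Generating the file keeps the secret out of hand-edited XML and makes it easier to produce one file per environment.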

Example 4: Enable HDFS Connector to Connect to Azure Data Lake Gen 2

This example enables the HDFS Connector to connect to Azure Data Lake Gen 2. This lets users reference data stored on your Azure Data Lake using the Azure Blob File System Driver, for example: abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>.
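
To show how the parts of such a URI fit together, the following Python sketch splits an ABFS URI into its file system, account name, and path. The helper parse_abfs_uri and the sample values are hypothetical; the sketch treats abfss as the TLS variant of the scheme.

```python
from urllib.parse import urlsplit

DFS_SUFFIX = ".dfs.core.windows.net"


def parse_abfs_uri(uri):
    """Split abfs[s]://file_system@account_name.dfs.core.windows.net/<path>."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    host = parts.hostname or ""
    # The file system name sits where a user name would in an ordinary URL.
    if not parts.username or not host.endswith(DFS_SUFFIX):
        raise ValueError("expected file_system@account_name" + DFS_SUFFIX)
    return {
        "secure": parts.scheme == "abfss",
        "file_system": parts.username,
        "account_name": host[: -len(DFS_SUFFIX)],
        "path": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    print(parse_abfs_uri("abfss://myfs@myaccount.dfs.core.windows.net/raw/2024/data.csv"))
```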

Create an Azure Service Principal: https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal
Grant permissions to the Service Principal created in step 1 to access blobs: https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad
Add the information from your web application to the Hadoop core-site.xml configuration file:

<configuration>
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>Token endpoint created in step 1.</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>Client ID created in step 1</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>Client Secret created in step 1</value>
  </property>
</configuration>

Take note of the Hadoop classpath and add the required jar files. These files can be found on any Hadoop version 3.2 or higher at: $HADOOP_HOME/share/hadoop/tools/lib/*

echo "$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: ABFS is only supported for Hadoop version 3.2 or higher.

Configure the Driverless AI config.toml file. Set the following configuration options:

enabled_file_systems = "upload, file, hdfs, azrbs, recipe_file, recipe_url"
hdfs_config_path = "/path/to/hadoop/conf"
hdfs_app_classpath = "/hadoop/classpath/"
hdfs_app_supported_schemes = "['abfs://']"

Mount the config.toml file into the Docker container.

  docker run --gpus all \
    --pid=host \
    --init \
    --rm \
    --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
    --add-host name.node:172.16.2.186 \
    -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
    -p 12345:12345 \
    -v /local/path/to/config.toml:/path/in/docker/config.toml \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -v /tmp/dtmp/:/tmp \
    -v /tmp/dlog/:/log \
    -v /tmp/dlicense/:/license \
    -v /tmp/ddata/:/data \
    -u $(id -u):$(id -g) \
    h2oai/dai-ubi8-x86_64:2.1.0-cuda11.8.0.xx

The following steps apply to native installs.

Create an Azure Service Principal: https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal
Grant permissions to the Service Principal created in step 1 to access blobs: https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad
Add the information from your web application to the Hadoop core-site.xml configuration file:

<configuration>
  <property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
  </property>
  <property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.endpoint</name>
    <value>Token endpoint created in step 1.</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.id</name>
    <value>Client ID created in step 1</value>
  </property>
  <property>
    <name>fs.azure.account.oauth2.client.secret</name>
    <value>Client Secret created in step 1</value>
  </property>
</configuration>

Take note of the Hadoop classpath and add the required jar files. These files can be found on any Hadoop version 3.2 or higher at: $HADOOP_HOME/share/hadoop/tools/lib/*

echo "$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"
Note: ABFS is only supported for Hadoop version 3.2 or higher.

Configure the Driverless AI config.toml file. Set the following configuration options:

enabled_file_systems = "upload, file, hdfs, azrbs, recipe_file, recipe_url"
hdfs_config_path = "/path/to/hadoop/conf"
hdfs_app_classpath = "/hadoop/classpath/"
hdfs_app_supported_schemes = "['abfs://']"

Save the changes when you are done, then stop/restart Driverless AI.
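
A quick way to catch a misconfigured core-site.xml before restarting is to check that every property from this example is present and non-empty. The Python sketch below does that with the standard library. The function name missing_properties is hypothetical, and the required list covers only the client-credential setup shown in this example.

```python
import xml.etree.ElementTree as ET

# Property names taken from the Gen 2 core-site.xml example above.
REQUIRED_GEN2_PROPERTIES = (
    "fs.azure.account.auth.type",
    "fs.azure.account.oauth.provider.type",
    "fs.azure.account.oauth2.client.endpoint",
    "fs.azure.account.oauth2.client.id",
    "fs.azure.account.oauth2.client.secret",
)


def missing_properties(core_site_xml, required=REQUIRED_GEN2_PROPERTIES):
    """Return the required property names that are absent or have an empty value."""
    root = ET.fromstring(core_site_xml)
    present = set()
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name and value:
            present.add(name)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    sample = """
    <configuration>
      <property>
        <name>fs.azure.account.auth.type</name>
        <value>OAuth</value>
      </property>
    </configuration>
    """
    for name in missing_properties(sample):
        print("missing:", name)
```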

Export MOJO artifact to Azure Blob Storage

To export the MOJO artifact to Azure Blob Storage, you must enable support for shared access signature (SAS) tokens. You can do this by setting the following variables in the config.toml file:

enable_artifacts_upload=true
artifacts_store="azure"
artifacts_azure_sas_token="token"
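
Because a SAS token normally carries its own expiry time, an expired token is a common reason for an export to fail. The following Python sketch reads the expiry from a token before it is placed in config.toml. It assumes the usual SAS query-string layout, in which se holds the expiry time and sig holds the signature; the helper sas_expiry and the sample token are hypothetical.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_expiry(token):
    """Return the expiry (`se` field) of a SAS token as an aware datetime, or None."""
    fields = parse_qs(token.lstrip("?"))
    if "sig" not in fields:
        raise ValueError("not a SAS token: missing 'sig' field")
    if "se" not in fields:
        return None  # for example, when expiry comes from a stored access policy
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    text = fields["se"][0].replace("Z", "+00:00")
    expiry = datetime.fromisoformat(text)
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)  # SAS times are in UTC
    return expiry


if __name__ == "__main__":
    token = "sv=2022-11-02&se=2030-01-01T00%3A00%3A00Z&sp=rw&sig=PLACEHOLDER"
    expiry = sas_expiry(token)
    print("expires:", expiry, "| expired:", expiry < datetime.now(timezone.utc))
```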

For instructions on exporting artifacts, see Exporting Artifacts.

FAQ

Can I connect to my storage account using Private Endpoints?

Yes. Driverless AI can use private endpoints if Driverless AI is located in the allowed VNET.

Does Driverless AI support secure transfer?

Yes. The Azure Blob Storage Connector makes all connections over HTTPS.

Does Driverless AI support hierarchical namespaces?

Yes.

Can I use Azure Managed Identities (MSI) to access the Data Lake?

Yes, if Driverless AI is running on an Azure VM with managed identities. To enable the HDFS Connector to authenticate with MSI, add the following to core-site.xml:

For Gen1:

<property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>MSI</value>
</property>

For Gen2:

<property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
</property>
<property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
</property>