Azure Blob Store Setup

Driverless AI allows you to explore Azure Blob Store data sources from within the Driverless AI application. This section describes how to enable the Azure Blob Store data connector in Docker environments.

Supported Data Sources Using the Azure Blob Store Connector

The following data sources can be used with the Azure Blob Store connector.

  • Blob Storage

The following data sources can be used with the Azure Blob Store connector when also using the HDFS connector.

  • Azure Data Lake Gen 1
  • Azure Data Lake Gen 2

Description of Configuration Attributes

The following configuration attributes are specific to enabling Azure Blob Storage.

  • azure_blob_account_name: The Microsoft Azure Storage account name. This should be the DNS prefix created when the account was created (for example, “mystorage”).

  • azure_blob_account_key: Specify the account key that maps to your account name.

  • azure_connection_string: Optionally specify a new connection string. With this option, you can include an override for a host, port, and/or account name. For example,

azure_connection_string = "DefaultEndpointsProtocol=http;AccountName=<account_name>;AccountKey=<account_key>;BlobEndpoint=http://<host>:<port>/<account_name>;"
  • azure_blob_init_path: Specifies the starting Azure Blob store path displayed in the UI of the Azure Blob store browser.

  • enabled_file_systems: The file systems you want to enable. This must be configured in order for data connectors to function properly.
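
For reference, a minimal config.toml sketch combining these attributes might look like the following; the account name, key, endpoint, and initial path shown are placeholders, and the two optional overrides are commented out:

enabled_file_systems = "file, upload, azrbs"
azure_blob_account_name = "mystorage"
azure_blob_account_key = "<account_key>"
# azure_connection_string = "DefaultEndpointsProtocol=http;AccountName=mystorage;AccountKey=<account_key>;BlobEndpoint=http://<host>:<port>/mystorage;"
# azure_blob_init_path = "<starting/path/in/blob/store/browser>"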

The following additional configuration attributes can be used to enable the HDFS connector to connect to Azure Data Lake Gen 1 (and, optionally, Azure Data Lake Gen 2).

  • hdfs_config_path: The location of the HDFS configuration folder. This folder can contain multiple configuration files.

  • hdfs_app_classpath: The HDFS classpath.

  • hdfs_app_supported_schemes: The list of supported schemes (for example, adl:// or abfs://) that is used as an initial check to ensure valid input to the connector.

Example 1: Enabling the Azure Blob Store Data Connector

This section describes how to enable the Azure Blob Store data connector when starting Driverless AI in Docker. This can be done by specifying each environment variable in the nvidia-docker run command or by editing the configuration options in the config.toml file and then specifying that file in the nvidia-docker run command.

Start DAI Using Environment Variables

This example enables the Azure Blob Store data connector. This allows users to reference data stored on your Azure storage account using the account name, for example: https://mystorage.blob.core.windows.net. Replace TAG below with the image tag.

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,azrbs" \
  -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_NAME="mystorage" \
  -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_KEY="<access_key>" \
  -p 12345:12345 \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-centos7-x86_64:TAG

Start DAI by Updating the config.toml File

This example shows how to configure Azure Blob Store options in the config.toml file, and then specify that file when starting Driverless AI in Docker.

  1. Configure the Driverless AI config.toml file. Set the following configuration options:

  • enabled_file_systems = "file, upload, azrbs"

  • azure_blob_account_name = "mystorage"

  • azure_blob_account_key = "<account_key>"

  2. Mount the config.toml file into the Docker container.

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-centos7-x86_64:TAG

Example 2: Mount Azure File Shares to the Local File System

Supported Data Sources Using the Local File System

  • Azure Files (File Storage)

Mounting Azure File Shares

Azure file shares can be mounted into the local file system of Driverless AI. To mount an Azure file share, follow the steps listed on https://docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-linux.
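
The exact command depends on your share; as a minimal sketch, assuming a file share named myshare on the mystorage account and the cifs-utils package installed on the Driverless AI host, the mount looks like this:

# Create a mount point and mount the Azure file share over SMB 3.0.
sudo mkdir -p /mnt/myshare
sudo mount -t cifs //mystorage.file.core.windows.net/myshare /mnt/myshare \
  -o vers=3.0,username=mystorage,password=<storage_account_key>,dir_mode=0777,file_mode=0777,serverino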

Example 3: Enable HDFS Connector to Connect to Azure Data Lake Gen 1

This example enables the HDFS Connector to connect to Azure Data Lake Gen1. This allows users to reference data stored on your Azure Data Lake using the adl URI, for example: adl://myadl.azuredatalakestore.net.

Start DAI

  1. Create an Azure AD web application for service-to-service authentication: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory

  2. Add the information from your web application to the Hadoop core-site.xml configuration file:

<configuration>
        <property>
                <name>fs.adl.oauth2.access.token.provider.type</name>
                <value>ClientCredential</value>
        </property>
        <property>
                <name>fs.adl.oauth2.refresh.url</name>
                <value>Token endpoint created in step 1.</value>
        </property>
        <property>
                <name>fs.adl.oauth2.client.id</name>
                <value>Client ID created in step 1</value>
        </property>
        <property>
                <name>fs.adl.oauth2.credential</name>
                <value>Client Secret created in step 1</value>
        </property>
        <property>
                <name>fs.defaultFS</name>
                <value>ADL URI</value>
        </property>
</configuration>
  3. Take note of the Hadoop classpath and add the azure-datalake-store.jar file. This file can be found in any Hadoop version at: $HADOOP_HOME/share/hadoop/tools/lib/*.

echo "$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"
  4. Configure the Driverless AI config.toml file. Set the following configuration options:

enabled_file_systems = "upload, file, hdfs, azrbs, recipe_file, recipe_url"
hdfs_config_path = "/path/to/hadoop/conf"
hdfs_app_classpath = "/hadoop/classpath/"
hdfs_app_supported_schemes = "['adl://']"
  5. Mount the config.toml file into the Docker container.

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-centos7-x86_64:TAG

Example 4: Enable HDFS Connector to Connect to Azure Data Lake Gen 2

This example enables the HDFS Connector to connect to Azure Data Lake Gen2. This allows users to reference data stored on your Azure Data Lake using the Azure Blob File System (ABFS) driver, for example: abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>.

Start DAI

  1. Create an Azure Service Principal: https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal

  2. Grant permissions to the Service Principal created in step 1 to access blobs: https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad

  3. Add the information from your web application to the Hadoop core-site.xml configuration file:

<configuration>
        <property>
                <name>fs.azure.account.auth.type</name>
                <value>OAuth</value>
        </property>
        <property>
                <name>fs.azure.account.oauth.provider.type</name>
                <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>
        </property>
        <property>
                <name>fs.azure.account.oauth2.client.endpoint</name>
                <value>Token endpoint created in step 1.</value>
        </property>
        <property>
                <name>fs.azure.account.oauth2.client.id</name>
                <value>Client ID created in step 1</value>
        </property>
        <property>
                <name>fs.azure.account.oauth2.client.secret</name>
                <value>Client Secret created in step 1</value>
        </property>
</configuration>
  4. Take note of the Hadoop classpath and add the required JAR files. These files can be found in any Hadoop version 3.2 or later at: $HADOOP_HOME/share/hadoop/tools/lib/*

echo "$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"

Note: ABFS is only supported for Hadoop version 3.2 or higher.

  5. Configure the Driverless AI config.toml file. Set the following configuration options:

enabled_file_systems = "upload, file, hdfs, azrbs, recipe_file, recipe_url"
hdfs_config_path = "/path/to/hadoop/conf"
hdfs_app_classpath = "/hadoop/classpath/"
hdfs_app_supported_schemes = "['abfs://']"
  6. Mount the config.toml file into the Docker container.

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  --shm-size=256m \
  --add-host name.node:172.16.2.186 \
  -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
  -p 12345:12345 \
  -v /local/path/to/config.toml:/path/in/docker/config.toml \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /tmp/dtmp/:/tmp \
  -v /tmp/dlog/:/log \
  -v /tmp/dlicense/:/license \
  -v /tmp/ddata/:/data \
  -u $(id -u):$(id -g) \
  h2oai/dai-centos7-x86_64:TAG

FAQ

Can I connect to my storage account using Private Endpoints?

Yes. Driverless AI can use private endpoints if Driverless AI is located in the allowed VNet.

Does Driverless AI support secure transfer?

Yes. The Azure Blob Store connector makes all connections over HTTPS.

Does Driverless AI support hierarchical namespaces?

Yes.

Can I use Azure Managed Identities (MSI) to access the DataLake?

Yes, if Driverless AI is running on an Azure VM with managed identities. To enable the HDFS connector to authenticate using MSI, add the following to the core-site.xml file:

For Gen1:

<property>
    <name>fs.adl.oauth2.access.token.provider.type</name>
    <value>MSI</value>
</property>

For Gen2:

<property>
    <name>fs.azure.account.auth.type</name>
    <value>OAuth</value>
</property>
<property>
    <name>fs.azure.account.oauth.provider.type</name>
    <value>org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider</value>
</property>