Google BigQuery Setup

Driverless AI lets you explore Google BigQuery (GBQ) data sources from within the Driverless AI application. This page provides instructions for configuring Driverless AI to work with GBQ.

Note

The setup described on this page requires you to enable authentication. Enabling the GCS and/or GBQ connectors causes those file systems to be displayed in the UI, but the GCS and GBQ connectors cannot be used without first enabling authentication.

Before you enable the GBQ data connector with authentication, complete the following steps:

  1. In the Google Cloud Platform (GCP), create a private key for your service account. To create a private key, click Service Accounts > Keys, and then click the Add Key button. When the Create private key dialog appears, select JSON as the key type. To finish creating the JSON private key and download it to your local file system, click Create.

  2. Mount the downloaded JSON file to the Docker instance.

  3. Specify the path to the mounted JSON key file with the gcs_path_to_service_account_json config option. Despite its gcs_ prefix, this option is also used by the GBQ connector.
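Before mounting the key, it can help to confirm that the downloaded file looks like a service account key. The following is a minimal sketch and is not part of Driverless AI. The function name looks_like_sa_key is hypothetical, and the check assumes only that a JSON service account key contains the type, client_email, and private_key fields.

```shell
# Return 0 if the file at $1 looks like a Google service account JSON key.
# This is a plain-text check, not a full JSON validation.
looks_like_sa_key() {
    key_file="$1"
    # The file must exist and be readable by the current user.
    [ -r "$key_file" ] || return 1
    # Service account keys declare "type": "service_account".
    grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file" || return 1
    # They also carry the account email and the private key itself.
    grep -q '"client_email"' "$key_file" || return 1
    grep -q '"private_key"' "$key_file" || return 1
    return 0
}
```

For example, looks_like_sa_key ./service_account_json.json && echo "key file looks valid" prints the message only when all three fields are found.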

Note

Depending on your Docker install version, use either the docker run --runtime=nvidia (>= Docker 19.03) or nvidia-docker (< Docker 19.03) command when starting the Driverless AI Docker image. Use docker version to check which version of Docker you are using.
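The version rule in the preceding note can be sketched as a small helper. This is an illustration only: the function name pick_docker_cmd is hypothetical, and it assumes a version string of the form MAJOR.MINOR.PATCH such as 19.03.5.

```shell
# Print the launch command that matches a Docker version string such as "19.03.5".
pick_docker_cmd() {
    version="$1"
    major="${version%%.*}"   # text before the first dot
    rest="${version#*.}"     # text after the first dot
    minor="${rest%%.*}"      # text between the first and second dots
    # Docker 19.03 and later: use docker run with the nvidia runtime.
    if [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "$minor" -ge 3 ]; }; then
        echo "docker run --runtime=nvidia"
    else
        echo "nvidia-docker run"
    fi
}
```

To apply it to the local installation, pass the server version reported by Docker, for example pick_docker_cmd "$(docker version --format '{{.Server.Version}}')".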

The following sections describe how to enable the GBQ data connector:

Enabling GBQ with the config.toml file

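This example enables the GBQ data connector with authentication by mounting the JSON key file into the container and setting the corresponding configuration options through DRIVERLESS_AI_-prefixed environment variables. The example assumes that the JSON file contains credentials for Google BigQuery.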

 nvidia-docker run \
     --pid=host \
     --rm \
     --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
     -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gbq" \
     -e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON="/service_account_json.json" \
     -u `id -u`:`id -g` \
     -p 12345:12345 \
     -v `pwd`/data:/data \
     -v `pwd`/log:/log \
     -v `pwd`/license:/license \
     -v `pwd`/tmp:/tmp \
     -v `pwd`/service_account_json.json:/service_account_json.json \
     h2oai/dai-ubi8-x86_64:1.11.0-cuda11.8.0.xx
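The same two settings can also be written to a config.toml file. The following sketch assumes that the config.toml option names mirror the environment variables above without the DRIVERLESS_AI_ prefix, as gcs_path_to_service_account_json does. The helper name write_gbq_config is hypothetical.

```shell
# Append GBQ connector settings to a config.toml file.
# $1: path to config.toml
# $2: path to the JSON key file as seen from inside the container
write_gbq_config() {
    config_file="$1"
    key_path="$2"
    cat >> "$config_file" <<EOF
enabled_file_systems = "file, gbq"
gcs_path_to_service_account_json = "$key_path"
EOF
}
```

For example, write_gbq_config ./config.toml /service_account_json.json appends both options to ./config.toml. Review the file for existing entries with the same names before using it.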

Enabling GBQ by setting an environment variable

The GBQ data connector can be configured by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable as follows:

export GOOGLE_APPLICATION_CREDENTIALS="SERVICE_ACCOUNT_KEY_PATH"

In the preceding example, replace SERVICE_ACCOUNT_KEY_PATH with the path of the JSON file that contains your service account key. The following is an example of how this might look:

export GOOGLE_APPLICATION_CREDENTIALS="/etc/dai/service-account.json"
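A quick sanity check before starting Driverless AI is to confirm that the variable is set and points to a readable file. The following is a minimal sketch under that assumption. The function name check_gbq_credentials is hypothetical, and the check does not validate the contents of the key.

```shell
# Return 0 if GOOGLE_APPLICATION_CREDENTIALS is set and names a readable file.
check_gbq_credentials() {
    if [ -z "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
        echo "GOOGLE_APPLICATION_CREDENTIALS is not set" >&2
        return 1
    fi
    if [ ! -r "$GOOGLE_APPLICATION_CREDENTIALS" ]; then
        echo "cannot read $GOOGLE_APPLICATION_CREDENTIALS" >&2
        return 1
    fi
    return 0
}
```

Running check_gbq_credentials in the same shell session as the export command reports which of the two conditions failed, if any.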

To see how to set this environment variable with Docker, refer to the following example:

nvidia-docker run \
    --pid=host \
    --rm \
    --shm-size=2g --cap-add=SYS_NICE --ulimit nofile=131071:131071 --ulimit nproc=16384:16384 \
    -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gbq" \
    -e GOOGLE_APPLICATION_CREDENTIALS="/service_account_json.json" \
    -u `id -u`:`id -g` \
    -p 12345:12345 \
    -v `pwd`/data:/data \
    -v `pwd`/log:/log \
    -v `pwd`/license:/license \
    -v `pwd`/tmp:/tmp \
    -v `pwd`/service_account_json.json:/service_account_json.json \
    h2oai/dai-ubi8-x86_64:1.11.0-cuda11.8.0.xx

For more information on setting the GOOGLE_APPLICATION_CREDENTIALS environment variable, refer to the official Google Cloud documentation.

Enabling GBQ by enabling Workload Identity for your GKE cluster

The GBQ data connector can be configured by enabling Workload Identity for your Google Kubernetes Engine (GKE) cluster. For information on how to enable Workload Identity, refer to the official documentation on enabling Workload Identity on a GKE cluster.

Note

If Workload Identity is enabled, then the GOOGLE_APPLICATION_CREDENTIALS environment variable does not need to be set.
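Workload Identity links a Kubernetes service account to a Google service account through an IAM member string and a service account annotation. The following sketch only builds those two strings, following the pattern in Google's Workload Identity documentation. The function names and all argument values are placeholders, and the exact setup for your cluster should be taken from the official documentation.

```shell
# Print the IAM member string for a Kubernetes service account (KSA).
# $1: GCP project ID, $2: Kubernetes namespace, $3: KSA name
wi_member() {
    echo "serviceAccount:${1}.svc.id.goog[${2}/${3}]"
}

# Print the annotation that points a KSA at a Google service account.
# $1: Google service account name, $2: GCP project ID
wi_annotation() {
    echo "iam.gke.io/gcp-service-account=${1}@${2}.iam.gserviceaccount.com"
}
```

The first string is the kind of value passed as --member to gcloud iam service-accounts add-iam-policy-binding with the roles/iam.workloadIdentityUser role, and the second is the kind of value passed to kubectl annotate serviceaccount.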

Adding Datasets Using GBQ

After Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or Drag and Drop) drop-down menu.

Note

To run a BigQuery query with Driverless AI, the associated service account must have the following Identity and Access Management (IAM) permissions:

bigquery.jobs.create
bigquery.tables.create
bigquery.tables.delete
bigquery.tables.export
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.list
bigquery.tables.update
bigquery.tables.updateData
storage.buckets.get
storage.objects.create
storage.objects.delete
storage.objects.list
storage.objects.update

For a list of all Identity and Access Management permissions, refer to the IAM permissions reference from the official Google Cloud documentation.
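One way to grant exactly the permissions listed above is a custom IAM role. The following sketch joins the list into the comma-separated form that the --permissions flag of gcloud iam roles create accepts. The variable and function names are hypothetical, and the sketch does not call gcloud itself.

```shell
# Permissions required by the GBQ connector, one per line.
GBQ_PERMISSIONS="bigquery.jobs.create
bigquery.tables.create
bigquery.tables.delete
bigquery.tables.export
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.list
bigquery.tables.update
bigquery.tables.updateData
storage.buckets.get
storage.objects.create
storage.objects.delete
storage.objects.list
storage.objects.update"

# Join the permissions into a single comma-separated string.
gbq_permissions_csv() {
    echo "$GBQ_PERMISSIONS" | paste -s -d, -
}
```

The result could then be used as gcloud iam roles create ROLE_ID --project=PROJECT_ID --permissions="$(gbq_permissions_csv)", where ROLE_ID and PROJECT_ID are placeholders for your own values.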

Add Dataset

Specify the following information to add your dataset:

  1. Enter BQ Dataset ID with write access to create temporary table: Enter a dataset ID in Google BigQuery that this user has read/write access to. BigQuery uses this dataset as the location for the new table generated by the query.

Note: Driverless AI’s connection to GBQ will inherit the top-level directory from the service JSON file. So if a dataset named “my-dataset” is in a top-level directory named “dai-gbq”, then the value for the dataset ID input field would be “my-dataset” and not “dai-gbq:my-dataset”.

  2. Enter Google Storage destination bucket: Specify the name of the Google Cloud Storage destination bucket. Note that the user must have write access to this bucket.

  3. Enter Name for Dataset to be saved as: Specify a name for the dataset, for example, my_file.

  4. Enter BigQuery Query (Use StandardSQL): Enter a standard SQL query that you want BigQuery to execute. For example: SELECT * FROM <my_dataset>.<my_table>.

  5. (Optional) Specify a project to use with the GBQ connector. This is equivalent to providing --project when using a command-line interface.

  6. When you are finished, select the Click to Make Query button to add the dataset.
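As the note on the dataset ID explains, the dataset ID field expects the bare dataset name without a qualifier in front of it. The following sketch strips such a qualifier from a value before it is entered. The function name strip_project_prefix is hypothetical, and the sketch assumes the qualifier is separated from the dataset name by a colon, as in the note's example.

```shell
# Print a dataset ID without any leading "name:" qualifier.
# "dai-gbq:my-dataset" becomes "my-dataset"; "my-dataset" is unchanged.
strip_project_prefix() {
    echo "${1##*:}"
}
```

For example, strip_project_prefix "dai-gbq:my-dataset" prints my-dataset, which is the value to enter in the dataset ID field.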
