Setting Environment Variables

Driverless AI provides a number of configuration options that can be set as environment variables when starting Driverless AI or specified in a config.toml file. The complete list of variables is in the Using the config.toml File section. The steps for specifying variables vary depending on whether you are running the Docker image or a native (RPM or DEB) install.

Setting Variables in Docker Images

Each property must be prepended with DRIVERLESS_AI_. The example below starts Driverless AI with environment variables that enable S3 and HDFS access (without authentication) and configure local password authentication:

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  -u `id -u`:`id -g` \
  -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,s3,hdfs" \
  -e DRIVERLESS_AI_AUTHENTICATION_METHOD="local" \
  -e DRIVERLESS_AI_LOCAL_HTPASSWD_FILE="<htpasswd_file_location>" \
  -v `pwd`/data:/data \
  -v `pwd`/log:/log \
  -v `pwd`/license:/license \
  -v `pwd`/tmp:/tmp \
  opsh2oai/h2oai-runtime
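
Because this example enables local authentication, the htpasswd file referenced by DRIVERLESS_AI_LOCAL_HTPASSWD_FILE must exist on a path that is visible inside the container. A minimal sketch, assuming the htpasswd utility (from apache2-utils) is available on the host; the user name and mount point below are hypothetical:

# Create a new htpasswd file with a bcrypt-hashed password
# (-c creates the file; -B forces bcrypt)
htpasswd -B -c `pwd`/config/htpasswd jsmith

# Then mount the folder and point the variable at the in-container path, e.g.:
#   -v `pwd`/config:/config \
#   -e DRIVERLESS_AI_LOCAL_HTPASSWD_FILE="/config/htpasswd" \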

Setting Variables in Native Installs

The config.toml file is available in the /etc/dai folder after the RPM or DEB is installed. Edit the desired variables in this file, and then restart Driverless AI.

The example below shows the variables to set in the config.toml file to enable S3 and HDFS access (without authentication) and local password authentication:

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
enabled_file_systems = "file,s3,hdfs"

# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam :  Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, validates against an LDAP server; look
#        for additional settings under LDAP settings
# local : Accepts a user id and password, validated against a htpasswd file provided in local_htpasswd_file
authentication_method = "local"

# Local password file
# Generating a htpasswd file: see syntax below
# htpasswd -B "<location_to_place_htpasswd_file>" "<username>"
# note: -B forces use of bcrypt, a secure hashing method
local_htpasswd_file = "<htpasswd_file_location>"
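
After saving these changes, restart Driverless AI so that they take effect. A minimal sketch for a systemd-managed RPM/DEB install, assuming the service is registered under the name dai (the service name can vary by package and version):

# Stop the service, edit the configuration, then start it again
sudo systemctl stop dai
sudo vi /etc/dai/config.toml    # set enabled_file_systems, authentication_method, local_htpasswd_file
sudo systemctl start dai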

Using the config.toml File

Rather than passing individual parameters, admins can instead reference a config.toml file when starting the Driverless AI Docker image. The config.toml file includes all possible configuration options that would otherwise be specified in the nvidia-docker run command. The file is included inside the Docker image; you can copy it out, update configuration options directly in the copy, and then specify that file when starting the Driverless AI Docker image.

  1. Copy the config.toml file from inside the Docker image to your local filesystem. (Note that the example below assumes that you are running a version of Driverless AI that is tagged as “latest.”)
mkdir config
nvidia-docker run \
  --pid=host \
  --rm \
  --init \
  -u `id -u`:`id -g` \
  -v `pwd`/config:/config \
  --entrypoint bash \
  opsh2oai/h2oai-runtime \
  -c "cp config.toml /config"
  2. Edit the desired variables in the config.toml file. Save your changes when you are done.
  3. Start Driverless AI with the DRIVERLESS_AI_CONFIG_FILE environment variable, making sure that it points to the location of the edited config.toml file:
nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  -u `id -u`:`id -g` \
  -e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
  -v `pwd`/config:/config \
  -v `pwd`/data:/data \
  -v `pwd`/log:/log \
  -v `pwd`/license:/license \
  -v `pwd`/tmp:/tmp \
  opsh2oai/h2oai-runtime
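
As described in the Config Override Chain comments of the file below, environment variables are read after the config.toml referenced by DRIVERLESS_AI_CONFIG_FILE, so a single option can be overridden without editing the mounted file. A sketch (the specific override shown here is illustrative):

nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  -u `id -u`:`id -g` \
  -e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
  -e DRIVERLESS_AI_AUTHENTICATION_METHOD="unvalidated" \
  -v `pwd`/config:/config \
  -v `pwd`/tmp:/tmp \
  opsh2oai/h2oai-runtime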

For reference, below is a copy of the standard config.toml file included with this version of Driverless AI:

##############################################################################
##                        DRIVERLESS AI CONFIGURATION FILE
#
# Comments:
# This file is authored in TOML (see https://github.com/toml-lang/toml)
#
# Config Override Chain
# Configuration variables for Driverless AI can be provided in several ways,
# the config engine reads and overrides variables in the following order
#
# 1. h2oai/config/config.toml
# [internal not visible to users]
#
# 2. config.toml
# [place file in a folder/mount file in docker container and provide path
# in "DRIVERLESS_AI_CONFIG_FILE" environment variable]
#
# 3. Environment variable
# [configuration variables can also be provided as environment variables
# they must have the prefix "DRIVERLESS_AI_" followed by
# variable name in caps, e.g. "authentication_method" can be provided as
# "DRIVERLESS_AI_AUTHENTICATION_METHOD"]


##############################################################################
## Setup : Configure application server here (ip, ports, authentication, file
# types etc)

# IP address and port of process proxy.
#process_server_ip = "127.0.0.1"
#process_server_port = 8080

# IP address and port of H2O instance.
#h2o_ip = "127.0.0.1"
#h2o_port = 54321

# IP address and port for Driverless AI HTTP server.
#ip = "127.0.0.1"
#port = 12345

# https settings
#
# You can make a self-signed certificate for testing with the following commands:
#
#     sudo openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 3650 -nodes -subj "/O=Driverless AI"
#     sudo chown dai:dai cert.pem private_key.pem
#     sudo chmod 600 cert.pem private_key.pem
#     sudo mv cert.pem private_key.pem /etc/dai
#
#enable_https = false
#ssl_key_file = "/etc/dai/private_key.pem"
#ssl_crt_file = "/etc/dai/cert.pem"

# Data directory. All application data and files related datasets and
# experiments are stored in this directory.

#data_directory = "./tmp"

# Whether to run quick performance benchmark at start of application and each
# experiment
#enable_benchmark = false

# Whether to run quick startup checks at start of application
#enable_startup_checks = true

# Whether to opt in to usage statistics and bug reporting
#usage_stats_opt_in = false

# Whether to verbosely log datatable calls
#datatable_verbose_log = false

# Whether to create the Python scoring pipeline at the end of each experiment
#make_python_scoring_pipeline = true

# Whether to create the MOJO scoring pipeline at the end of each experiment
# Note: Not all transformers or main models are available for MOJO (e.g. no gblinear main model)
#make_mojo_scoring_pipeline = false

# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam :  Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, validates against an LDAP server; look
#        for additional settings under LDAP settings
# local : Accepts a user id and password, validated against a htpasswd file provided in local_htpasswd_file
#authentication_method = "unvalidated"

# LDAP Settings - Either Recipe 0 or Recipe 1 can be used to configure LDAP
# Recipe 0 : Simple authentication using user ID and password; does not use an SSL cert
# Recipe 1 : Use SSL and a global credential to connect to LDAP and search for the user in a group; if the user is found, authenticate using the user's credentials

# Recipe 0
#ldap_server = ""
#ldap_port = ""
#ldap_dc = ""

# Recipe 1
# Use this recipe when the 3-step approach below is required
# Step 1: Use a machine/global credential to connect to LDAP over SSL
# Step 2: Using the above connection, search for the user in LDAP to authorize
# Step 3: Authenticate using the user's own credentials

#ldap_server = ""
#ldap_port = ""
#ldap_recipe = ""          # When using this recipe, needs to be set to "1"
#ldap_tls_file = ""        # Provide Cert file location
#ldap_ou_dn = ""           # DN with OU where user needs to be found in search
#ldap_base_dn = ""         # Base DN where user needs to be found in search
#ldap_search_filter = ""   # Search Filter for finding user
#ldap_search_user_id = ""
#ldap_search_password = ""
#ldap_use_ssl = ""
#ldap_user_prefix = ""  # user='ldap_user_prefix={},{}'.format(ldap_app_user_id, ldap_ou_dn) for step 1

# Local password file
# Generating a htpasswd file: see syntax below
# htpasswd -B "<location_to_place_htpasswd_file>" "<username>"
# note: -B forces use of bcrypt, a secure hashing method
#local_htpasswd_file = ""

# Supported file formats (file name endings must match for files to show up in file browser)
#supported_file_types = "csv, tsv, txt, dat, tgz, gz, bz2, zip, xz, xls, xlsx, nff, feather, bin, arff, parquet"

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the hadoop coresite and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
#enabled_file_systems = "file, hdfs, s3"

# do_not_log_list : add configurations that you do not wish to be recorded in logs here
#do_not_log_list = "local_htpasswd_file, aws_access_key_id, aws_secret_access_key"

##############################################################################
## Hardware: Configure hardware settings here (GPUs, CPUs, Memory, etc.)

# Max number of CPU cores to use per experiment. Set to <= 0 to use all cores.
#max_cores = 0

# Number of GPUs to use per model training task.  Set to -1 for all GPUs.
# Currently num_gpus != 1 disables GPU locking, so it is only recommended for single
# experiments and single users.
# Ignored if GPUs disabled or no GPUs on system.
#num_gpus = 1

# Which gpu_id to start with
# If using CUDA_VISIBLE_DEVICES=... to control GPUs, gpu_id_start=0 still refers to
# the first device in that list.
# E.g. if CUDA_VISIBLE_DEVICES="4,5", then gpu_id_start=0 will refer to
# device #4.
#gpu_id_start = 0

# Maximum number of workers for DriverlessAI server pool (only 1 needed
# currently)
#max_workers = 1

# Period (in seconds) of ping by DriverlessAI server to each experiment
# (in order to get logger info like disk space and memory usage)
# 0 means don't print anything
#ping_period = 60

# Minimum amount of disk space in GB needed to run experiments.
# Experiments will fail if this limit is crossed.
#disk_limit_gb = 5

# Minimum amount of system memory in GB needed to start experiments
#memory_limit_gb = 5

# Minimum number of rows needed to run experiments (values lower than 100
# might not work)
#min_num_rows = 100

# Precision of how data is stored and precision that most computations are performed at
# "float32" best for speed, "float64" best for accuracy or very large input values
# "float32" allows numbers up to about +-3E38 with relative error of about 1E-7
# "float64" allows numbers up to about +-1E308 with relative error of about 1E-16
#data_precision = "float64"

# Precision of some transformers, like TruncatedSVD.
# (Same options and notes as data_precision)
# Useful for higher precision in transformers with numerous operations that can accumulate error
# Also useful if want faster performance for transformers but otherwise want data stored in high precision
#transformer_precision = "auto"


##############################################################################
## Machine Learning : Configure machine learning configurations here
# (Data, Feature Engineering, Modelling etc)

# Seed for random number generator to make experiments reproducible (on same hardware), only active if 'reproducible' mode is enabled
#seed = 1234

# List of values that should be interpreted as missing values during data import. Applies both to numeric and string columns. Note that 'nan' is always interpreted as a missing value for numeric columns.
#missing_values = "['', '?', 'None', 'nan', 'NA', 'N/A', 'inf', '-inf', '1.7976931348623157e+308', '-1.7976931348623157e+308']"

# Internal threshold for number of rows x number of columns to trigger certain statistical
# techniques to increase statistical fidelity
#statistical_threshold_data_size_small = 100000

# Internal threshold for number of rows x number of columns to trigger certain statistical
# techniques that can speed up modeling
#statistical_threshold_data_size_large = 100000000

# Maximum number of columns
#max_cols = 10000

# Maximum allowed ratio of uniques for categorical columns (otherwise will mark this column as ID)
#max_relative_cardinality = 0.95

# Maximum allowed number of uniques for categorical columns (otherwise will mark this column as ID)
#max_absolute_cardinality = 10000

# Maximum number of uniques allowed in fold column
#max_fold_uniques = 100000

# Whether to treat some numerical features as categorical
#num_as_cat = true

# Max number of uniques for integer/real/bool valued columns to be treated as categoricals too (test applies to first statistical_threshold_data_size_small rows only)
#max_int_as_cat_uniques = 50

# Number of folds for feature evolution models
# Increasing this will put a lower fraction of data into validation and more into training
# E.g. num_folds=3 means 67%/33% training/validation splits
# Actual value will vary for small or big data cases
#num_folds = 3

# Accuracy setting equal and above which enables full cross-validation
#full_cv_accuracy_switch = 8

# Accuracy setting equal and above which enables stacked ensemble as final model
#ensemble_accuracy_switch = 5

# Number of fold splits to use for ensemble >= 2
# Actual value will vary for small or big data cases
#num_ensemble_folds = 5

# Number of repeats for each fold
# (modified slightly for small or big data cases)
#fold_reps = 1

# For binary classification: ratio of majority to minority class equal and above which to enable undersampling
#imbalance_ratio_undersampling_threshold = 5

# Smart sampling method for imbalanced binary classification (only if class ratio is above the threshold provided above)
#smart_imbalanced_sampling = true

# Maximum number of classes
#max_num_classes = 100

# Whether to enable early stopping
#early_stopping = true

# Normalized probability of choosing to lag non-targets relative to targets
#prob_lag_non_targets = 0.1

# Unnormalized probability of choosing other lag based time-series transformers
#prob_lagsinteraction = 0.1
#prob_lagsaggregates = 0.1

# Whether to explore unused genes (true) or to explore evenly (false)
#explore_more_unused_genes = false

# Whether to anneal so that exploits instead of explores as mutations are done on individual
#explore_gene_anneal = false

# Whether to anneal so that less rapid growth of gene addition (IMP) and more random addition (RND)
#explore_grow_anneal = false

# Threshold for average string-is-text score as determined by internal heuristics
# Higher values will favor string columns as categoricals, lower values will favor string columns as text
#string_col_as_text_threshold = 0.3

# Interpretability setting equal and above which will use monotonicity constraints in GBM
#monotonicity_constraints_interpretability_switch = 7

# When parameter tuning, choose 2**(parameter_tune_level + parameter_tuning_offset) models to tune
# Can make this lower to avoid excessive tuning, or make higher to do
# enhanced tuning
#parameter_tuning_offset = 2

# Accuracy setting equal and above which enables tuning of target transform for regression
#tune_target_transform_accuracy_switch = 3

# Accuracy setting equal and above which enables tuning of model parameters
#tune_parameters_accuracy_switch = 3

# Probability of adding genes
#prob_add_genes = 0.5

# Probability of pruning genes
#prob_prune_genes = 0.5

# Probability of evolving xgboost parameters
#prob_perturb_xgb = 0.5

# Given probability of adding genes, this is probability that will add best genes when adding
#prob_addbest_genes = 0.5

# Tournament style
# "uniform" : all individuals in population compete to win as best
# "model" : individuals with same model type compete
# "feature" : individuals with similar feature types compete (NOT IMPLEMENTED YET)
# "model" and "feature" styles preserve at least one winner for each type (and so 2 total indivs of each type after mutation)
#tournament_style = "model"

# number of individuals at accuracy 1 (i.e. models built per iteration, which compete in feature evolution)
# 4 times this default for accuracy 10
# If using GPUs, restricted so always 1 model scored per GPU per iteration
# (modified slightly for small or big data cases)
#num_individuals = 2

# set fixed number of individuals (if > 0) - useful to compare different hardware configurations
#fixed_num_individuals = 0

# Black list of transformers (i.e. transformers to not use, independent of
# the interpretability setting)
# for multi-class: "['NumCatTETransformer', 'TextLinModelTransformer',
# 'FrequentTransformer', 'CVTargetEncodeF', 'ClusterDistTransformer',
# 'WeightOfEvidenceTransformer', 'TruncSVDNumTransformer', 'CVCatNumEncodeF',
# 'DatesTransformer', 'TextTransformer', 'FilterTransformer',
# 'NumToCatWoETransformer', 'NumToCatTETransformer', 'ClusterTETransformer',
# 'BulkInteractionsTransformer']"
#
# for regression/binary: "['TextTransformer', 'ClusterDistTransformer',
# 'FilterTransformer', 'TextLinModelTransformer', 'NumToCatTETransformer',
# 'DatesTransformer', 'WeightOfEvidenceTransformer', 'BulkInteractionsTransformer',
# 'FrequentTransformer', 'CVTargetEncodeF', 'NumCatTETransformer',
# 'NumToCatWoETransformer', 'TruncSVDNumTransformer', 'ClusterTETransformer',
# 'CVCatNumEncodeF']"
#
# This list appears in the experiment logs (search for "Transformers used")
# e.g. to disable all Target Encoding: black_list_transformers =
# "['NumCatTETransformer', 'CVTargetEncodeF', 'NumToCatTETransformer',
# 'ClusterTETransformer']"
#black_list_transformers = ""

# Black list of genes (i.e. genes (built on top of transformers) to not use,
# independent of the interpretability setting)
#
# for multi-class: "['BulkInteractionsGene', 'WeightOfEvidenceGene',
# 'NumToCatTargetEncodeSingleGene', 'FilterGene', 'TextGene', 'FrequentGene',
# 'NumToCatWeightOfEvidenceGene', 'NumToCatWeightOfEvidenceMonotonicGene',
# 'CvTargetEncodeSingleGene', 'DateGene', 'NumToCatTargetEncodeMultiGene',
# 'DateTimeGene', 'TextLinRegressorGene', 'ClusterIDTargetEncodeSingleGene',
# 'CvCatNumEncodeGene', 'TruncSvdNumGene', 'ClusterIDTargetEncodeMultiGene',
# 'NumCatTargetEncodeMultiGene', 'CvTargetEncodeMultiGene', 'TextLinClassifierGene',
# 'NumCatTargetEncodeSingleGene', 'ClusterDistGene']"
#
# for regression/binary: "['CvTargetEncodeSingleGene', 'NumToCatTargetEncodeSingleGene',
# 'CvCatNumEncodeGene', 'ClusterIDTargetEncodeSingleGene', 'TextLinRegressorGene',
# 'CvTargetEncodeMultiGene', 'ClusterDistGene', 'FilterGene', 'DateGene',
# 'ClusterIDTargetEncodeMultiGene', 'NumToCatTargetEncodeMultiGene',
# 'NumCatTargetEncodeMultiGene', 'TextLinClassifierGene', 'WeightOfEvidenceGene',
# 'FrequentGene', 'TruncSvdNumGene', 'BulkInteractionsGene', 'TextGene',
# 'DateTimeGene', 'NumToCatWeightOfEvidenceGene',
# 'NumToCatWeightOfEvidenceMonotonicGene', 'NumCatTargetEncodeSingleGene']"
#
# This list appears in the experiment logs (search for "Genes used")
# e.g. to disable bulk interaction gene, use:  black_list_genes =
#"['BulkInteractionsGene']"
#black_list_genes = ""

# Whether to enable GBM models
#enable_gbm = true

# Upper limit for interpretability settings to enable GBM models (for tuning and feature evolution)
#gbm_interpretability_switch = 10

# Lower limit for accuracy settings to enable GBM models (for tuning and feature evolution)
#gbm_accuracy_switch = 1

# Upper limit for number of classes to use GBM for multiclass problems, will fall back to other models
#gbm_num_classes_limit = 5

# Whether to enable TensorFlow models (alpha)
#enable_tensorflow = false

# Upper limit for interpretability settings to enable TensorFlow models (for tuning and feature evolution)
#tensorflow_interpretability_switch = 6

# Lower limit for accuracy settings to enable Tensorflow models (for tuning and feature evolution)
#tensorflow_accuracy_switch = 5

# Max. number of epochs for TensorFlow models
#tensorflow_max_epochs = 100

# Whether to enable GLM models
#enable_glm = true

# Lower limit for interpretability settings to enable GLM models (for tuning and feature evolution)
#glm_interpretability_switch = 6

# Upper limit for accuracy settings to enable GLM models (for tuning and feature evolution)
#glm_accuracy_switch = 5

# Whether to enable RuleFit support (alpha)
#enable_rulefit = false

# Lower limit for interpretability settings to enable RuleFit models (for tuning and feature evolution)
#rulefit_interpretability_switch = 4

# Upper limit for accuracy settings to enable RuleFit models (for tuning and feature evolution)
#rulefit_accuracy_switch = 8

# Max number of rules to be used for RuleFit models (-1 for all)
#rulefit_max_num_rules = 100

# Max tree depth for RuleFit models
#rulefit_max_tree_depth = 6

# Max number of trees for RuleFit models
#rulefit_max_num_trees = 50

# Enable time series recipe
#time_series_recipe = true

# Max. sample size for automatic determination of time series train/valid split properties, only if time column is selected
#max_time_series_properties_sample_size = 1000000

# Whether to enable train/valid and train/test distribution shift detection
#check_distribution_shift = true

# Maximum number of GBM trees or GLM iterations
# Early-stopping usually chooses less
#max_nestimators = 3000

# Upper limit on learning rate for feature engineering GBM models
#max_learning_rate = 0.05

# Lower limit on learning rate for final ensemble GBM models
#min_learning_rate = 0.01

# Whether to speed up predictions used inside MLI with a fast approximation
#mli_fast_approx = true

# When the number of rows is above this limit, sample for MLI when scoring UI data
#mli_sample_above_for_scoring = 1000000

# When the number of rows is above this limit, sample for MLI when training surrogate models
#mli_sample_above_for_training = 100000

# How many rows to sample when sampling for MLI
#mli_sample_size = 100000

# how many bins to do quantile binning
#mli_num_quantiles = 10

# mli random forest number of trees
#mli_drf_num_trees = 100

# mli random forest max depth
#mli_drf_max_depth = 20

# not only sample training, but also sample scoring
#mli_sample_training = true

# regularization strength for k-LIME GLMs
#klime_lambda = [1e-6, 1e-8]
#klime_alpha = 0.0

# mli converts numeric columns to enum when cardinality is <= this value
#mli_max_numeric_enum_cardinality = 25

##############################################################################
## Machine Learning Output : What kinds of files are written related to the machine learning process

# Whether to dump every scored individual's variable importance (both derived and original) to csv/tabulated/json file
# produces files like: individual_id%d.iter%d*features*
#dump_varimp_every_scored_indiv = false

# Whether to dump every scored individual's model parameters to csv/tabulated file
# produces files like: individual_id%d.iter%d*params*
#dump_modelparams_every_scored_indiv = false

##############################################################################
## Connectors : Configure connector specifications here

# Configurations for a HDFS data source
# Path of hdfs coresite.xml
#core_site_xml_path = ""
# Path of the principal key tab file
#key_tab_path = ""

# HDFS connector
# Auth type can be noauth/principal/keytab/keytabimpersonation
# Specify HDFS Auth Type, allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a Key tab (recommended)
#   keytabimpersonation : Login with impersonation using a keytab
#hdfs_auth_type = "noauth"

# Kerberos app principal user (recommended)
#hdfs_app_principal_user = ""
# Specify the user id of the current user here as user@realm
#hdfs_app_login_user = ""
# JVM args for HDFS distributions
#hdfs_app_jvm_args = ""
# hdfs class path
#hdfs_app_classpath = ""

# AWS authentication settings
#   True : Authenticated connection
#   False : Unverified connection
#aws_auth = "False"

# S3 Connector credentials
#aws_access_key_id = ""
#aws_secret_access_key = ""

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
#gcs_path_to_service_account_json = ""

# Minio Connector credentials
#minio_endpoint_url = ""
#minio_access_key_id = ""
#minio_secret_access_key = ""

# Snowflake Connector credentials
#snowflake_account = ""
#snowflake_user = ""
#snowflake_password = ""

##############################################################################
## END