Using the config.toml File

Admins can edit a config.toml file to configure Driverless AI. The config.toml file includes all possible configuration options that would otherwise be specified in the nvidia-docker run command. In the Docker image, this file is located in a folder on the container, and you can update configuration variables directly in the file. Native installs use the updated config.toml file when Driverless AI starts. Docker users can specify the updated config.toml file when starting the Driverless AI Docker image.

Docker Image Users

  1. Copy the config.toml file from inside the Docker image to your local filesystem.
# Make a config directory
mkdir config

# Copy the config.toml file to the new config directory.
nvidia-docker run \
  --pid=host \
  --rm \
  --init \
  -u `id -u`:`id -g` \
  -v `pwd`/config:/config \
  --entrypoint bash \
  h2oai/dai-centos7-x86_64:1.4.0-9.0 \
  -c "cp /etc/dai/config.toml /config"
  2. Edit the desired variables in the config.toml file. Save your changes when you are done.
  3. Start Driverless AI with the DRIVERLESS_AI_CONFIG_FILE environment variable set to the location of the edited config.toml file inside the container, so that the software finds the configuration file.
nvidia-docker run \
  --pid=host \
  --init \
  --rm \
  -u `id -u`:`id -g` \
  -e DRIVERLESS_AI_CONFIG_FILE="/config/config.toml" \
  -v `pwd`/config:/config \
  -v `pwd`/data:/data \
  -v `pwd`/log:/log \
  -v `pwd`/license:/license \
  -v `pwd`/tmp:/tmp \
  h2oai/dai-centos7-x86_64:1.4.0-9.0
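
Step 2 can also be scripted instead of done in an editor. The following is a minimal sketch, not part of Driverless AI: set_toml_value is a hypothetical helper that uncomments a "key = value" line and rewrites its value using POSIX sed. It runs on a scratch file containing lines taken from the sample file later in this section. It assumes the key already appears at the start of a line (commented out or not) and that the value contains no "|", "&", or backslash characters.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Driverless AI): set KEY to VALUE in a
# config.toml-style FILE, uncommenting the line if it starts with '#'.
# VALUE is written verbatim, so pass TOML strings with their quotes.
set_toml_value() {
  file=$1 key=$2 value=$3
  sed "s|^#\{0,1\}[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" \
    "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demonstration on a scratch copy, so no real configuration is touched.
demo_dir=$(mktemp -d)
cat > "$demo_dir/config.toml" <<'EOF'
# Verbosity of logging
#log_level = 1
#authentication_method = "unvalidated"
EOF

set_toml_value "$demo_dir/config.toml" log_level 2
set_toml_value "$demo_dir/config.toml" authentication_method '"local"'

# Show the edited file.
cat "$demo_dir/config.toml"
```

In the Docker workflow above, the file to edit would be config/config.toml in the directory created in step 1.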

Native Install Users

Native installs include DEBs, RPMs, and TAR SH installs.

  1. Export the DRIVERLESS_AI_CONFIG_FILE environment variable so that it points to the Driverless AI config.toml file, or add the export line to ~/.bashrc. For example:
export DRIVERLESS_AI_CONFIG_FILE="/config/config.toml"
  2. Edit the desired variables in the config.toml file. Save your changes when you are done.
  3. Start Driverless AI. Note that the command used to start Driverless AI varies depending on your install type.
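
Before starting Driverless AI, it can help to confirm that the variable from step 1 is exported and points to a readable file. This is a minimal sketch: check_dai_config is a hypothetical helper, not a Driverless AI command, and the scratch path only stands in for the real config.toml location.

```shell
#!/bin/sh
# Hypothetical pre-start check (not a Driverless AI command): succeed only if
# DRIVERLESS_AI_CONFIG_FILE is set and names a readable file.
check_dai_config() {
  if [ -z "${DRIVERLESS_AI_CONFIG_FILE:-}" ]; then
    echo "DRIVERLESS_AI_CONFIG_FILE is not set" >&2
    return 1
  fi
  if [ ! -r "$DRIVERLESS_AI_CONFIG_FILE" ]; then
    echo "cannot read $DRIVERLESS_AI_CONFIG_FILE" >&2
    return 1
  fi
  echo "using $DRIVERLESS_AI_CONFIG_FILE"
}

# Demonstration with an empty scratch file standing in for the edited config.toml.
scratch_dir=$(mktemp -d)
: > "$scratch_dir/config.toml"
export DRIVERLESS_AI_CONFIG_FILE="$scratch_dir/config.toml"
check_dai_config
```

A non-zero return status means the start command for your install type should not be run yet.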

For reference, below is a copy of the standard config.toml file included with this version of Driverless AI. The rest of this section describes some examples showing how to set different environment variables.
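
The header of the sample file notes that any configuration variable can also be supplied as an environment variable: the prefix DRIVERLESS_AI_ followed by the variable name in capitals. A minimal sketch of that naming rule follows; dai_env_name is a hypothetical helper written for illustration, not part of Driverless AI.

```shell
#!/bin/sh
# Hypothetical helper: map a config.toml variable name to the environment
# variable name described in the sample file's header comments.
dai_env_name() {
  printf 'DRIVERLESS_AI_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

# Print the environment variable names for two variables from the sample file.
dai_env_name authentication_method
dai_env_name max_cores
```

According to the override chain in that header, environment variables are read after config.toml. With Docker, such a name could be passed with an additional -e flag alongside DRIVERLESS_AI_CONFIG_FILE in the run command shown earlier.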

Sample Config.toml File

##############################################################################
##                        DRIVERLESS AI CONFIGURATION FILE
#
# Comments:
# This file is authored in TOML (see https://github.com/toml-lang/toml)
#
# Config Override Chain
# Configuration variables for Driverless AI can be provided in several ways,
# the config engine reads and overrides variables in the following order
#
# 1. h2oai/config/config.toml
# [internal not visible to users]
#
# 2. config.toml
# [place file in a folder/mount file in docker container and provide path
# in "DRIVERLESS_AI_CONFIG_FILE" environment variable]
#
# 3. Environment variable
# [configuration variables can also be provided as environment variables
# they must have the prefix "DRIVERLESS_AI_" followed by
# variable name in caps e.g "authentication_method" can be provided as
# "DRIVERLESS_AI_AUTHENTICATION_METHOD"]


##############################################################################
## Setup : Configure application server here (ip, ports, authentication, file
# types etc)

# IP address and port of autoviz process.
#vis_server_ip = "127.0.0.1"
#vis_server_port = 12346

# IP address and port of procsy process.
#procsy_ip = "127.0.0.1"
#procsy_port = 12347

# IP address and port of H2O instance.
#h2o_ip = "127.0.0.1"
#h2o_port = 54321

# IP address and port for Driverless AI HTTP server.
#ip = "127.0.0.1"
#port = 12345

# File upload limit (default 100GB)
#max_file_upload_size = 104857600000

# Verbosity of logging
# 0: quiet   (CRITICAL, ERROR, WARNING)
# 1: default (CRITICAL, ERROR, WARNING, INFO, DATA)
# 2: verbose (CRITICAL, ERROR, WARNING, INFO, DATA, DEBUG)
#log_level = 1

# https settings
#
# You can make a self-signed certificate for testing with the following commands:
#
#     sudo openssl req -x509 -newkey rsa:4096 -keyout private_key.pem -out cert.pem -days 3650 -nodes -subj "/O=Driverless AI"
#     sudo chown dai:dai cert.pem private_key.pem
#     sudo chmod 600 cert.pem private_key.pem
#     sudo mv cert.pem private_key.pem /etc/dai
#
#enable_https = false
#ssl_key_file = "/etc/dai/private_key.pem"
#ssl_crt_file = "/etc/dai/cert.pem"

# SSL TLS
#ssl_no_sslv2 = true
#ssl_no_sslv3 = true
#ssl_no_tlsv1 = true
#ssl_no_tlsv1_1 = true
#ssl_no_tlsv1_2 = false
#ssl_no_tlsv1_3 = false

# Data directory. All application data and files related datasets and
# experiments are stored in this directory.

#data_directory = "./tmp"

# Whether to run quick performance benchmark at start of application
#enable_benchmark = true

# Whether to run quick performance benchmark at start of each experiment
#enable_benchmark_each_experiment = false

# Whether to run quick startup checks at start of application
#enable_startup_checks = true

# Whether to opt in to usage statistics and bug reporting
#usage_stats_opt_in = false

# Whether to verbosely log datatable calls
#datatable_verbose_log = false

# Whether to create the Python scoring pipeline at the end of each experiment
#make_python_scoring_pipeline = true

# Whether to create the MOJO scoring pipeline at the end of each experiment
# Note: Not all transformers or main models are available for MOJO (e.g. no gblinear main model)
#make_mojo_scoring_pipeline = false

# authentication_method
# unvalidated : Accepts user id and password, does not validate password
# none : Does not ask for user id or password, authenticated as admin
# pam :  Accepts user id and password, Validates user with operating system
# ldap : Accepts user id and password, Validates against an ldap server, look
# for additional settings under LDAP settings
# local: Accepts a user id and password, Validated against a htpasswd file provided in local_htpasswd_file
# ibm_spectrum_conductor: Authenticate with IBM conductor auth api
#authentication_method = "unvalidated"

# LDAP Settings - Either the Recipe 0 or Recipe 1 can be used to configure LDAP
# Recipe 0 : Simple authentication using UserID and Password, Does not use SSL Cert
# Recipe 1 : Use SSL and global credential to connect to ldap and search user in group, If user is found Authenticate using User Credentials

# LDAP Credentials
#ldap_recipe = "0"          # When using this recipe, needs to be set to "1"

#ldap_server = ""
#ldap_port = ""
#ldap_bind_dn = ""
#ldap_bind_password = ""
#ldap_dc = ""               # Deprecated, to be removed in future releases; use ldap_bind_dn and ldap_base_dn instead

#ldap_base_dn = ""
#ldap_base_filter = ""
#ldap_user_name_attribute = ""

# Recipe 1
# Use this recipe when the 3-step approach below is required
# Step 1: Use a machine/global credential to connect to ldap using SSL
# Step 2: Using the above connection, search for the user in ldap to authorize
# Step 3: Authenticate using the user's own credentials
#ldap_tls_file = ""        # Provide Cert file location
#ldap_ou_dn = ""           # DN with OU where user needs to be found in search
#ldap_search_filter = ""   # Search Filter for finding user
#ldap_search_user_id = ""
#ldap_search_password = ""
#ldap_use_ssl = ""
#ldap_user_prefix = ""  # user='ldap_user_prefix={},{}'.format(ldap_app_user_id, ldap_ou_dn) for step 1

# Local password file
# Generating a htpasswd file: see syntax below
# htpasswd -B "<location_to_place_htpasswd_file>" "<username>"
# note: -B forces use of bcrypt, a secure password hashing method
#local_htpasswd_file = ""

# Supported file formats (file name endings must match for files to show up in file browser)
#supported_file_types = "csv, tsv, txt, dat, tgz, gz, bz2, zip, xz, xls, xlsx, nff, jay, feather, bin, arff, parquet"

# File System Support
# file : local file system/server file system
# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
# s3 : Amazon S3, optionally configure secret and access key below
# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
# minio : Minio Cloud Storage, remember to configure secret and access key below
# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
#enabled_file_systems = "file, hdfs, s3"

# do_not_log_list : add configurations that you do not wish to be recorded in logs here
#do_not_log_list = "local_htpasswd_file, aws_access_key_id, aws_secret_access_key, snowflake_password, snowflake_url, snowflake_user, snowflake_account, minio_endpoint_url, minio_access_key_id, minio_secret_access_key, kdb_user, kdb_password, ldap_bind_password, gcs_path_to_service_account_json"

##############################################################################
## Hardware: Configure hardware settings here (GPUs, CPUs, Memory, etc.)

# Max number of CPU cores to use per experiment. Set to <= 0 to use all cores.
# One can also set environment variable "OMP_NUM_THREADS" to number of cores to use for OpenMP
# e.g. In bash: export OMP_NUM_THREADS=32
#max_cores = 0

# Minimum number of threads for datatable during data munging
# Not 1, so that if imbalance of work, older tasks will still use this minimum number of cores
#min_dt_threads_munging = 4

# Like min_dt_threads_munging but for final pipeline munging
#min_dt_threads_final_munging = 4

# Number of GPUs to use per model training task.  Set to -1 for all GPUs.
# Currently num_gpus_per_model!=1 disables GPU locking, so is only recommended for single
# experiments and single users.
# Ignored if GPUs disabled or no GPUs on system.
# More info at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation
#num_gpus_per_model = 1

# Number of GPUs to use per experiment for training task.  Set to -1 for all GPUs.
# Currently num_gpus_per_experiment!=-1 disables GPU locking, so is only recommended for single
# experiments and single users.
# Ignored if GPUs disabled or no GPUs on system.
# More info at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation
#num_gpus_per_experiment = -1

# Which gpu_id to start with
# If using CUDA_VISIBLE_DEVICES=... to control GPUs (preferred method), gpu_id=0 is the
# first in that restricted list of devices.
# E.g. if CUDA_VISIBLE_DEVICES="4,5" then gpu_id_start=0 will refer to the
# device #4.
# E.g. from expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs:
# Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0
# Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1
# E.g. from expert mode, to run 2 experiments, each on a distinct set of 4 GPUs out of 8 GPUs:
# Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0
# Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4
# E.g. Like just above, but now run on all 4 GPUs/model
# Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0
# Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4
# If num_gpus_per_model!=1, global GPU locking is disabled
# (because underlying algorithms don't support arbitrary gpu ids, only sequential ids),
# so must setup above correctly to avoid overlap across all experiments by all users
# More info at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation
#gpu_id_start = 0

# Maximum number of workers for DriverlessAI server pool (only 1 needed
# currently)
#max_workers = 1

# Period (in seconds) of ping by DriverlessAI server to each experiment
# (in order to get logger info like disk space and memory usage)
# 0 means don't print anything
#ping_period = 60

# Minimum amount of disk space in GB needed to run experiments.
# Experiments will fail if this limit is crossed.
#disk_limit_gb = 5

# Minimum amount of system memory in GB needed to start experiments
#memory_limit_gb = 5

# Minimum number of rows needed to run experiments (values lower than 100
# might not work)
#min_num_rows = 100

# Precision of how data is stored and precision that most computations are performed at
# "float32" best for speed, "float64" best for accuracy or very large input values
# "float32" allows numbers up to about +-3E38 with relative error of about 1E-7
# "float64" allows numbers up to about +-1E308 with relative error of about 1E-16
#data_precision = "float64"

# Precision of some transformers, like TruncatedSVD.
# (Same options and notes as data_precision)
# Useful for higher precision in transformers with numerous operations that can accumulate error
# Also useful if want faster performance for transformers but otherwise want data stored in high precision
#transformer_precision = "auto"


##############################################################################
## Machine Learning : Configure machine learning configurations here
# (Data, Feature Engineering, Modelling etc)

# Seed for random number generator to make experiments reproducible (on same hardware), only active if 'reproducible' mode is enabled
#seed = 1234

# List of values that should be interpreted as missing values during data import. Applies both to numeric and string columns. Note that 'nan' is always interpreted as a missing value for numeric columns.
#missing_values = "['', '?', 'None', 'nan', 'NA', 'N/A', 'unknown', 'inf', '-inf', '1.7976931348623157e+308', '-1.7976931348623157e+308']"

# For tensorflow, what numerical value to give to missing values, where numeric values are standardized
# So 0 is center of distribution, and if Normal distribution then +-5 is 5 standard deviations away from the center.
# In many cases, an out of bounds value is a good way to represent missings, but in some cases the mean (0) may be better.
#tf_nan_impute_value = -5

# Internal threshold for number of rows x number of columns to trigger certain statistical
# techniques to increase statistical fidelity
#statistical_threshold_data_size_small = 100000

# Internal threshold for number of rows x number of columns to trigger certain statistical
# techniques that can speed up modeling
#statistical_threshold_data_size_large = 100000000

# Upper limit on the number of rows for feature evolution (applies to both training and validation/holdout splits)
# Depending on accuracy settings, a fraction of this value will be used
#max_rows_feature_evolution = 1000000

# Maximum number of columns
#max_cols = 10000

# Maximum number of columns selected out of original set of original columns, using feature selection
# The selection is based upon how well target encoding (or frequency encoding if not available) on categoricals and numerics treated as categoricals
#max_orig_cols_selected = 10000

# Maximum number of numeric columns selected, above which will do feature selection
#max_orig_numeric_cols_selected = 10000

# Maximum number of non-numeric columns selected, above which will do feature selection on all features and avoid num_as_cat
#max_orig_nonnumeric_cols_selected = 500

# Factor times max_orig_cols_selected by which selection is based upon no target encoding and no num_as_cat
#max_orig_cols_selected_simple_factor = 2

# Maximum allowed fraction of uniques for integer and categorical columns (otherwise will treat column as ID and drop)
#max_relative_cardinality = 0.95

# Maximum allowed number of uniques for integer and categorical columns (otherwise will treat column as ID and drop)
#max_absolute_cardinality = 1000000

# Whether to treat some numerical features as categorical
#num_as_cat = true

# Max number of uniques for integer/real/bool valued columns to be treated as categoricals too (test applies to first statistical_threshold_data_size_small rows only)
#max_int_as_cat_uniques = 50

# Number of folds for feature evolution models
# Increasing this will put a lower fraction of data into validation and more into training
# E.g. num_folds=3 means 67%/33% training/validation splits
# Actual value will vary for small or big data cases
#num_folds = 3

# Accuracy setting equal and above which enables full cross-validation during feature evolution
#full_cv_accuracy_switch = 8

# Accuracy setting equal and above which enables stacked ensemble as final model
#ensemble_accuracy_switch = 5

# Number of fold splits to use for ensemble >= 2
# Actual value will vary for small or big data cases
#num_ensemble_folds = 5

# Number of repeats for each fold
# (modified slightly for small or big data cases)
#fold_reps = 1

# For binary classification: ratio of majority to minority class equal and above which to enable undersampling
#imbalance_ratio_undersampling_threshold = 5

# Smart sampling method for imbalanced binary classification (only if class ratio is above the threshold provided above)
#smart_imbalanced_sampling = false

# Maximum number of classes
#max_num_classes = 100

# Number of actuals vs. predicted to generate
#num_actuals_vs_predicted = 100

# Whether to use H2O.ai brain, the local caching and smart re-use of prior models to generate features for new models
#  Will use H2O.ai brain cache if cache file has no extra column names per column type,
#  cache exactly matches classes, class labels, and time series options,
#  interpretability of cache is equal or lower,
#  main model (booster) is allowed by new experiment
# Level of brain to use (for chosen level, where higher levels will also do all lower level operations automatically)
# -1 = Don't use any brain cache and don't write any cache
#  0 = Don't use any brain cache but still write cache
#      Use case: Want to save model for later use, but want current model to be built without any brain models
#  1 = smart checkpoint if passed in old experiment_id to pull from (via GUI, running "restart from checkpoint" or chose which experiment to resume from)
#      Use case: From GUI select prior experiment using the right-hand panel, and select "RESTART FROM LAST CHECKPOINT" to use specific experiment's model to build new models from
#  2 = smart checkpoint from H2O.ai brain cache of individual best models
#      Use case: No need to select a particular prior experiment, we scan through H2O.ai brain cache for best models to restart from
#  3 = smart checkpoint like level #1, but for entire population.  Tune only if brain population insufficient size
#      (will re-score entire population in single iteration, so appears to take longer to complete first iteration)
#  4 = smart checkpoint like level #2, but for entire population.  Tune only if brain population insufficient size
#      (will re-score entire population in single iteration, so appears to take longer to complete first iteration)
#  5 = like #4, but will scan over entire brain cache of populations to get best scored individuals, starting from resumed experiment if chosen.
#      (can be slower due to brain cache scanning if big cache)
# Other use cases:
# a) Re-build on different data: Use same column names and fewer or more rows (applicable to 1 - 5)
# b) Re-fit only final pipeline: Like (a), but choose time=1 and feature_brain_level=3 - 5
# c) Re-build on more columns: Add columns, so model builds upon old model built from old column names (1 - 5)
#feature_brain_level = 2

# Maximum number of brain individuals pulled from H2O.ai brain cache for feature_brain_level=1, 2
#max_num_brain_indivs = 3

# Directory, relative to data_directory, to store H2O.ai brain meta model files
#brain_rel_dir = "H2O.ai_brain"

# Maximum size in GB the brain will store
# -1: unlimited
# >=0 number of GB to limit brain to
#brain_max_size_GB = 20

# Whether to enable early stopping
#early_stopping = true

# Minimum number of DAI iterations
# Can be used for restarting when know want to continue for longer despite score not improving.
#min_dai_iterations = 0

# Maximum features per model (and each model within the final model if ensemble) kept just after scoring them
# Keeps top variable importance features, prunes rest away, after each scoring.
# Final ensemble will exclude any pruned-away features and only train on kept features,
#   but may contain a few new features due to fitting on different data view
# Final scoring pipeline will exclude any pruned-away features,
#   but may contain a few new features due to fitting on different data view
# -1 means no restrictions except internally-determined memory restrictions
#nfeatures_max = -1

# How much effort to spend on feature engineering (0...10)
# Heuristic combination of various developer-level toml parameters
# 0   : keep only numeric features, only model tuning during evolution
# 1   : keep only numeric features and frequency-encoded categoricals, only model tuning during evolution
# 2-3 : Like #1 but some model and feature tuning during evolution.  No Text features.
# 4   : Like #5, but slightly more focused on model tuning
# 5   : Default.  Balanced feature-model tuning
# 6-7 : Like #5, but slightly more focused on feature engineering
# 8   : Like #6-7, but even more focused on feature engineering with high feature generation rate, no feature dropping even if high interpretability
# 9-10: Like #8, but no model tuning during feature evolution
#feature_engineering_effort = 5

# Threshold for average string-is-text score as determined by internal heuristics
# Higher values will favor string columns as categoricals, lower values will favor string columns as text
#string_col_as_text_threshold = 0.3

# Minimum fraction of uniques for string columns to be considered as possible text (otherwise categorical)
#string_col_as_text_min_relative_cardinality = 0.1

# Minimum number of uniques for string columns to be considered as possible text (otherwise categorical)
#string_col_as_text_min_absolute_cardinality = 100

# Interpretability setting equal and above which will use monotonicity constraints in GBM
#monotonicity_constraints_interpretability_switch = 7

# Maximum number of input columns to use to generate new features
#max_feature_interaction_depth = 8

# When parameter tuning, choose 2**(parameter_tune_level + parameter_tuning_offset) models to tune
# Can make this lower to avoid excessive tuning, or make higher to do
# enhanced tuning
#parameter_tuning_offset = 2

# Accuracy setting equal and above which enables tuning of target transform for regression
#tune_target_transform_accuracy_switch = 3

# Whether to automatically select target transformation for regression problems
# Can choose: 'identity' to disable any transformation
# Use tune_target_transform_accuracy_switch=11 to force to always use this choice
#target_transformer = 'auto'

# Accuracy setting equal and above which enables tuning of model parameters
#tune_parameters_accuracy_switch = 3

# Tournament style
# "auto" : Choose based upon accuracy, etc.
# "fullstack" : Choose among optimal model and feature types
# "uniform" : all individuals in population compete to win as best
# "model" : individuals with same model type compete
# "feature" : individuals with similar feature types compete
# "model" and "feature" styles preserve at least one winner for each type (and so 2 total indivs of each type after mutation)
#tournament_style = "auto"

# Interpretability above which will use "uniform" tournament style
#tournament_style_interpretability_switch = 6

# Accuracy equal and above which uses model style if tournament_style = "auto"
#tournament_style_accuracy_switch = 5

# Accuracy equal and above which uses feature style if tournament_style = "auto"
#tournament_feature_style_accuracy_switch = 6

# Accuracy equal and above which uses fullstack style if tournament_style = "auto"
#tournament_fullstack_style_accuracy_switch = 7

# number of individuals at accuracy 1 (i.e. models built per iteration, which compete in feature evolution)
# 4 times this default for accuracy 10
# If using GPUs, restricted so always 1 model scored per GPU per iteration
# (modified slightly for small or big data cases)
#num_individuals = 2

# set fixed number of individuals (if > 0) - useful to compare different hardware configurations
#fixed_num_individuals = 0

# set fixed number of folds (if > 0) - useful for quick runs regardless of data
#fixed_num_folds = 0

# set fixed number of fold reps (if > 0) - useful for quick runs regardless of data
#fixed_fold_reps = 0

# set true to force only first fold for models - useful for quick runs regardless of data
#fixed_only_first_fold_model = false

# number of unique targets or folds counts after which switch to faster/simpler non-natural sorting and print outs
#sanitize_natural_sort_limit = 1000

# Whether target encoding is generally enabled
#enable_target_encoding = true

# Black list of transformers (i.e. transformers to not use, independent of
# the interpretability setting)
# for multi-class: "['NumCatTETransformer', 'TextLinModelTransformer',
# 'FrequentTransformer', 'CVTargetEncodeF', 'ClusterDistTransformer',
# 'WeightOfEvidenceTransformer', 'TruncSVDNumTransformer', 'CVCatNumEncodeF',
# 'DatesTransformer', 'TextTransformer', 'FilterTransformer',
# 'NumToCatWoETransformer', 'NumToCatTETransformer', 'ClusterTETransformer',
# 'BulkInteractionsTransformer']"
#
# for regression/binary: "['TextTransformer', 'ClusterDistTransformer',
# 'FilterTransformer', 'TextLinModelTransformer', 'NumToCatTETransformer',
# 'DatesTransformer', 'WeightOfEvidenceTransformer', 'BulkInteractionsTransformer',
# 'FrequentTransformer', 'CVTargetEncodeF', 'NumCatTETransformer',
# 'NumToCatWoETransformer', 'TruncSVDNumTransformer', 'ClusterTETransformer',
# 'CVCatNumEncodeF']"
#
# This list appears in the experiment logs (search for "Transformers used")
# e.g. to disable all Target Encoding: black_list_transformers =
# "['NumCatTETransformer', 'CVTargetEncodeF', 'NumToCatTETransformer',
# 'ClusterTETransformer']"
#black_list_transformers = ""

# Black list of genes (i.e. genes (built on top of transformers) to not use,
# independent of the interpretability setting)
#
# for multi-class: "['BulkInteractionsGene', 'WeightOfEvidenceGene',
# 'NumToCatTargetEncodeSingleGene', 'FilterGene', 'TextGene', 'FrequentGene',
# 'NumToCatWeightOfEvidenceGene', 'NumToCatWeightOfEvidenceMonotonicGene',
# 'CvTargetEncodeSingleGene', 'DateGene', 'NumToCatTargetEncodeMultiGene',
# 'DateTimeGene', 'TextLinRegressorGene', 'ClusterIDTargetEncodeSingleGene',
# 'CvCatNumEncodeGene', 'TruncSvdNumGene', 'ClusterIDTargetEncodeMultiGene',
# 'NumCatTargetEncodeMultiGene', 'CvTargetEncodeMultiGene', 'TextLinClassifierGene',
# 'NumCatTargetEncodeSingleGene', 'ClusterDistGene']"
#
# for regression/binary: "['CvTargetEncodeSingleGene', 'NumToCatTargetEncodeSingleGene',
# 'CvCatNumEncodeGene', 'ClusterIDTargetEncodeSingleGene', 'TextLinRegressorGene',
# 'CvTargetEncodeMultiGene', 'ClusterDistGene', 'FilterGene', 'DateGene',
# 'ClusterIDTargetEncodeMultiGene', 'NumToCatTargetEncodeMultiGene',
# 'NumCatTargetEncodeMultiGene', 'TextLinClassifierGene', 'WeightOfEvidenceGene',
# 'FrequentGene', 'TruncSvdNumGene', 'BulkInteractionsGene', 'TextGene',
# 'DateTimeGene', 'NumToCatWeightOfEvidenceGene',
# 'NumToCatWeightOfEvidenceMonotonicGene', 'NumCatTargetEncodeSingleGene']"
#
# This list appears in the experiment logs (search for "Genes used")
# e.g. to disable bulk interaction gene, use:  black_list_genes =
# "['BulkInteractionsGene']"
#black_list_genes = ""

# Parameters for LightGBM to override DAI parameters
# parameters should be given as XGBoost equivalent unless unique LightGBM parameter
# e.g. params_lightgbm = "{'objective': 'binary:logistic', 'n_estimators': 100, 'max_leaves': 64, 'random_state': 1234}"
# e.g. params_lightgbm = "{'n_estimators': 600, 'learning_rate': 0.1, 'reg_alpha': 0.0, 'reg_lambda': 0.5, 'gamma': 0, 'max_depth': 0, 'max_bin': 128, 'max_leaves': 256, 'scale_pos_weight': 1.0, 'max_delta_step': 3.469919910597877, 'min_child_weight': 1, 'subsample': 0.9, 'colsample_bytree': 0.3, 'tree_method': 'gpu_hist', 'grow_policy': 'lossguide', 'min_data_in_bin': 3, 'min_child_samples': 5, 'early_stopping_rounds': 20, 'num_classes': 2, 'objective': 'binary:logistic', 'eval_metric': 'logloss', 'random_state': 987654, 'early_stopping_threshold': 0.01, 'monotonicity_constraints': False, 'silent': True, 'debug_verbose': 0, 'subsample_freq': 1}"
# avoid including "system"-level parameters like 'n_gpus': 1, 'gpu_id': 0, 'n_jobs': 1, 'booster': 'lightgbm'
# also likely should avoid parameters like: 'objective': 'binary:logistic', unless one really knows what one is doing (e.g. alternative objectives)
# See: https://xgboost.readthedocs.io/en/latest/parameter.html
# And see: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst
#params_lightgbm = "{}"

# Parameters for XGBoost to override DAI parameters
# similar parameters as lightgbm since lightgbm parameters are transcribed from xgboost equivalent versions
# e.g. params_xgboost = "{'n_estimators': 100, 'max_leaves': 64, 'max_depth': 0, 'random_state': 1234}"
# See: https://xgboost.readthedocs.io/en/latest/parameter.html
#params_xgboost = "{}"

# Parameters for Tensorflow to override DAI parameters
# e.g. params_tensorflow = "{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30, 'layers': (100, 100), 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3, 'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}"
# See: https://keras.io/ , e.g. for activations: https://keras.io/activations/
# Example layers: (500, 500, 500), (100, 100, 100), (100, 100), (50, 50)
# Strategies: '1cycle' or 'one_shot', See: https://github.com/fastai/fastai
# normalize_type: 'streaming' or 'global' (using sklearn StandardScaler)
#params_tensorflow = "{}"

# Parameters for XGBoost's gblinear to override DAI parameters
# e.g. params_gblinear = "{'n_estimators': 100}"
# See: https://xgboost.readthedocs.io/en/latest/parameter.html
#params_gblinear = "{}"

# Parameters for Rulefit to override DAI parameters
# e.g. params_rulefit = "{'max_leaves': 64}"
# See: https://xgboost.readthedocs.io/en/latest/parameter.html
#params_rulefit = "{}"

# Whether to enable XGBoost models (auto/on/off)
#enable_xgboost = "auto"

# Internal threshold for number of rows x number of columns to trigger no xgboost models due to high memory use
#xgboost_threshold_data_size_large = 100000000

# Internal threshold for number of rows x number of columns to trigger no xgboost models due to limits on GPU memory capability
#xgboost_gpu_threshold_data_size_large = 30000000

# Whether to enable GLM models (auto/on/off)
#enable_glm = "auto"

# Whether to enable LightGBM models (auto/on/off)
#enable_lightgbm = "auto"

# Maximum number of GBM trees or GLM iterations
# Early stopping usually chooses fewer
#max_nestimators = 3000

# Maximum tree depth (and corresponding max max_leaves as 2**max_max_depth)
#max_max_depth = 12

# Maximum max_bin for any tree
#max_max_bin = 256

# Minimum max_bin for any tree
#min_max_bin = 32

# Factor by which rf gets more depth than gbdt
#factor_rf = 1.5

# Upper limit on learning rate for GBM models
#max_learning_rate = 0.5

# Lower limit on learning rate for feature engineering GBM models
#min_learning_rate = 0.05

# Lower limit on learning rate for final ensemble GBM models
#min_learning_rate_final = 0.01

# Whether to enable TensorFlow models (alpha) (auto/on/off)
#enable_tensorflow = "off"

# Max. number of epochs for TensorFlow models
#tensorflow_max_epochs = 100

# Max. number of epochs for TensorFlow NLP feature models
#tensorflow_max_epochs_nlp = 2

# Whether to force tensorflow on no matter any conditions
#enable_tensorflow_force = false

# Whether to use NLP recipe if tensorflow enabled
#enable_tensorflow_nlp = true

# Whether to enable RuleFit support (alpha) (auto/on/off)
#enable_rulefit = "off"

# Max number of rules to be used for RuleFit models (-1 for all)
#rulefit_max_num_rules = 100

# Max tree depth for RuleFit models
#rulefit_max_tree_depth = 6

# Max number of trees for RuleFit models
#rulefit_max_num_trees = 50

# Internal threshold on (number of rows x number of columns) above which RuleFit models are not used because they are currently too slow
#rulefit_threshold_data_size_large = 1000000

# Enable time series recipe
#time_series_recipe = true

# Earliest datetime for automatic conversion of integers in %Y%m%d format to a time column during parsing
#min_ymd_timestamp = 19700101

# Latest datetime for automatic conversion of integers in %Y%m%d format to a time column during parsing
#max_ymd_timestamp = 20300101

# maximum number of data samples (randomly selected rows) for date/datetime format detection
#max_rows_datetime_format_detection = 100000

# Whether to enable train/valid and train/test distribution shift detection
#check_distribution_shift = true

# Whether to only check certain features based upon the value of shift_key_features_varimp
#check_reduced_features = true

# Number of trees in the model trained to check for distribution shift
# Must be no larger than max_nestimators
#shift_trees = 100

# Value of max_bin for trees in the model trained to check for distribution shift
#shift_max_bin = 256

# Value of max_depth for trees in the model trained to check for distribution shift
#shift_max_depth = 4

# Normalized training variable importance above which to check the feature for shift
# Useful to avoid checking likely unimportant features
#shift_key_features_varimp = 0.01

# If distribution shift detection is enabled, show features for which shift AUC is above this value
# (AUC of a binary classifier that predicts whether given feature value belongs to train or test data)
#detect_features_distribution_shift_threshold_auc = 0.55

# If distribution shift detection is enabled, drop features for which shift AUC is above this value
# (AUC of a binary classifier that predicts whether given feature value belongs to train or test data)
#drop_features_distribution_shift_threshold_auc = 0.6

# Minimum number of features to keep; if 1, at least the least-shifted feature is kept
#drop_features_distribution_shift_min_features = 1
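
# e.g. (illustrative values only, not recommendations) to report features whose shift AUC
# exceeds 0.6, drop only those above 0.8, and always keep at least 5 features:
#   detect_features_distribution_shift_threshold_auc = 0.6
#   drop_features_distribution_shift_threshold_auc = 0.8
#   drop_features_distribution_shift_min_features = 5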

# Whether to trace detect_types call for each batch
#trace_detect_types = false

# Whether to trace fit_transform during scoring of population for each transformer
#trace_fit_transform = false

# Whether to trace final model fit_transforms and any pipeline calls for each transformer
#trace_final_fit_transform = false

# Whether to trace final model transforms and any pipeline calls
#trace_final_transform = false

# How close to the optimal value (usually 1 or 0) does the validation score need to be to be considered perfect (to stop the experiment)?
#abs_tol_for_perfect_score = 1e-4

# When the number of rows is above this limit, sample for MLI when scoring UI data
#mli_sample_above_for_scoring = 1000000

# When the number of rows is above this limit, sample for MLI when training surrogate models
#mli_sample_above_for_training = 100000

# Number of rows to sample when sampling for MLI
#mli_sample_size = 100000

# Number of bins to use for quantile binning
#mli_num_quantiles = 10

# mli random forest number of trees
#mli_drf_num_trees = 100

# mli random forest max depth
#mli_drf_max_depth = 20

# not only sample training, but also sample scoring
#mli_sample_training = true

# Regularization strength for k-LIME GLMs
#klime_lambda = [1e-6, 1e-8]
#klime_alpha = 0.0

# mli converts numeric columns to enum when cardinality is <= this value
#mli_max_numeric_enum_cardinality = 25

# Maximum number of features allowed for k-LIME k-means clustering
#mli_max_number_cluster_vars = 6

# Use all columns for k-LIME k-means clustering (this will override `mli_max_number_cluster_vars` if set to `true`)
#use_all_columns_klime_kmeans = false

# Strict version check for MLI
#mli_strict_version_check = true

# MLI cloud name
#mli_cloud_name = ""

##############################################################################
## Machine Learning Output : What kinds of files are written related to the machine learning process

# Whether to dump every scored individual's variable importance (both derived and original) to csv/tabulated/json file
# produces files like: individual_id%d.iter%d*features*
#dump_varimp_every_scored_indiv = false

# Whether to dump every scored individual's model parameters to csv/tabulated file
# produces files like: individual_id%d.iter%d*params*
#dump_modelparams_every_scored_indiv = false

# Location of the AutoDoc template
#autodoc_template = "report_template.md"

##############################################################################
## Connectors : Configure connector specifications here
## Note that if using Kerberos, be sure that the DAI time
## is synched with the Kerberos server.

# Configurations for a HDFS data source
# Path of the HDFS core-site.xml file
# core_site_xml_path is deprecated, please use hdfs_config_path
#core_site_xml_path = ""

# HDFS config folder path; can contain multiple config files
#hdfs_config_path = ""

# Path of the principal keytab file
#key_tab_path = ""

# HDFS connector
# Specify the HDFS auth type. Allowed options are:
#   noauth : No authentication needed
#   principal : Authenticate with HDFS with a principal user
#   keytab : Authenticate with a keytab (recommended). If running
#             DAI as a service, then the Kerberos keytab needs to
#             be owned by the DAI user.
#   keytabimpersonation : Login with impersonation using a keytab
#hdfs_auth_type = "noauth"

# Kerberos app principal user (recommended)
#hdfs_app_principal_user = ""
# Specify the user id of the current user here as user@realm
#hdfs_app_login_user = ""

# JVM args for HDFS distributions; provide args separated by spaces
# -Djava.security.krb5.conf=<path>/krb5.conf
# -Dsun.security.krb5.debug=true
# -Dlog4j.configuration=file:///<path>log4j.properties
#hdfs_app_jvm_args = ""
# hdfs class path
#hdfs_app_classpath = ""
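
# e.g. a keytab-based setup might look like the sketch below. The paths and principal are
# hypothetical placeholders, and which keys are required is an assumption to verify for your cluster:
#   hdfs_config_path = "/etc/hadoop/conf"
#   hdfs_auth_type = "keytab"
#   key_tab_path = "/path/to/dai.keytab"
#   hdfs_app_principal_user = "dai@EXAMPLE.COM"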

# AWS authentication settings
#   True : Authenticated connection
#   False : Unverified connection
#aws_auth = "False"

# S3 Connector credentials
#aws_access_key_id = ""
#aws_secret_access_key = ""

# Starting S3 path displayed in UI S3 browser
#s3_init_path = "s3://h2o-public-test-data/smalldata/"

# GCS Connector credentials
# example (suggested) -- "/licenses/my_service_account_json.json"
#gcs_path_to_service_account_json = ""

# Minio Connector credentials
#minio_endpoint_url = ""
#minio_access_key_id = ""
#minio_secret_access_key = ""

# Snowflake Connector credentials
# Recommended: provide url, user, and password
# Optionally: provide account, user, and password
# Example URL: https://<snowflake_account>.<region>.snowflakecomputing.com
#snowflake_url = ""
#snowflake_user = ""
#snowflake_password = ""
#snowflake_account = ""
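# e.g. using the recommended url/user/password combination (placeholder values):
#   snowflake_url = "https://<snowflake_account>.<region>.snowflakecomputing.com"
#   snowflake_user = "<user>"
#   snowflake_password = "<password>"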

# KDB Connector credentials
#kdb_user = ""
#kdb_password = ""
#kdb_hostname = ""
#kdb_port = ""
#kdb_app_classpath = ""
#kdb_app_jvm_args = ""

# Notification scripts
# - each variable points to the location of a script that is executed at a given event in the experiment lifecycle
# - the script should have the executable flag enabled
# - use of an absolute path is suggested
# Location of the notification script run when an experiment starts
#listeners_experiment_start = ""
# Location of the notification script run when an experiment finishes
#listeners_experiment_done = ""
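# e.g. (hypothetical script paths; the scripts must exist and be executable):
#   listeners_experiment_start = "/opt/dai/scripts/on_experiment_start.sh"
#   listeners_experiment_done = "/opt/dai/scripts/on_experiment_done.sh"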

##############################################################################
## END