System Settings¶
exclusive_mode¶
Exclusive level of access to node resources
There are three levels of access:
safe: this level assumes that there might be another experiment also running on the same node.
moderate: this level assumes that there are no other experiments or tasks running on the same node, but still only uses physical core counts.
max: this level assumes that there is absolutely nothing else running on the node except the experiment.
The default level is “safe”, and the equivalent config.toml parameter is exclusive_mode. If multinode is enabled, this option has no effect, except when worker_remote_processors=1, in which case it is still applied. Each exclusive mode can be chosen and then fine-tuned using individual expert settings. Changing the exclusive mode resets all exclusive-mode-related options back to their defaults and then re-applies the rules for the new mode, which undoes any fine-tuning of the expert options covered by the mode rules. If you run a new, continued, refitted, or retrained experiment from a parent experiment, the mode rules are not re-applied and any fine-tuning is preserved. To reset the mode behavior, switch between ‘safe’ and the desired mode; the new child experiment will then use the default system resources for the chosen mode.
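A minimal config.toml sketch for a node dedicated entirely to Driverless AI (the chosen level is illustrative):

    # config.toml: level of exclusive access to node resources
    # one of "safe" (default), "moderate", or "max"
    exclusive_mode = "max"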
max_cores¶
Number of Cores to Use
Specify the number of cores to use per experiment. Note that if you specify 0, all available cores will be used. Lower values can reduce memory usage but might slow down the experiment. This value defaults to 0 (all). You can also set it using the environment variable OMP_NUM_THREADS or OPENBLAS_NUM_THREADS (e.g., in bash: ‘export OMP_NUM_THREADS=32’ or ‘export OPENBLAS_NUM_THREADS=32’).
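For example, a config.toml sketch that caps each experiment at 32 cores (the value is illustrative):

    # config.toml: number of cores per experiment (0 = use all available cores)
    max_cores = 32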
max_fit_cores¶
Maximum Number of Cores to Use for Model Fit
Specify the maximum number of cores to use for a model’s fit call. Note that if you specify 0, all available cores will be used. This value defaults to 10.
use_dask_cluster¶
If full dask cluster is enabled, use full cluster
Specify whether to use the full multinode distributed cluster (True) or single-node Dask (False). In some cases, using the entire cluster can be inefficient; for example, for medium-sized data, several DGX nodes can be more efficient when used one DGX at a time. The equivalent config.toml parameter is use_dask_cluster.
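For example, to keep Dask work on a single node even when a full cluster is available (a minimal sketch):

    # config.toml: use single-node Dask instead of the full multinode cluster
    use_dask_cluster = false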
max_predict_cores¶
Maximum Number of Cores to Use for Model Predict
Specify the maximum number of cores to use for a model’s predict call. Note that if you specify 0, all available cores will be used. This value defaults to 0 (all).
max_predict_cores_in_dai¶
Maximum Number of Cores to Use for Model Transform and Predict When Doing MLI, AutoDoc
Specify the maximum number of cores to use for a model’s transform and predict call when doing operations in the Driverless AI MLI GUI and the Driverless AI R and Python clients. Note that if you specify 0, all available cores will be used. This value defaults to 4.
batch_cpu_tuning_max_workers¶
Tuning Workers per Batch for CPU
Specify the number of workers used in CPU mode for tuning. A value of 0 uses the socket count, while a value of -1 uses all physical cores. This value defaults to 0 (socket count).
cpu_max_workers¶
Number of Workers for CPU Training
Specify the number of workers used in CPU mode for training:
0: Use socket count (default)
-1: Use all physical cores
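For example, in config.toml (a sketch; the value shown is the default):

    # config.toml: CPU training workers (0 = socket count, -1 = all physical cores)
    cpu_max_workers = 0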
num_gpus_per_experiment¶
#GPUs/Experiment
Specify the number of GPUs to use per experiment. A value of -1 (default) uses all available GPUs. This value must be at least as large as the number of GPUs used per model (or -1). In a multinode context with Dask, this refers to the per-node value.
min_num_cores_per_gpu¶
Num Cores/GPU
Specify the number of CPU cores per GPU. In order to have a sufficient number of cores per GPU, this setting limits the number of GPUs used. This value defaults to 2.
num_gpus_per_model¶
#GPUs/Model
Specify the number of GPUs to use per model. The equivalent config.toml parameter is num_gpus_per_model, and the default value is 1. Currently, a num_gpus_per_model value other than 1 disables GPU locking, so it is only recommended for single experiments and single users. Setting this parameter to -1 means use all GPUs per model. In all cases, XGBoost tree and linear models use the specified number of GPUs per model, while LightGBM and TensorFlow revert to using 1 GPU per model and run multiple models on multiple GPUs. FTRL does not use GPUs. RuleFit uses GPUs only for the parts that obtain trees via LightGBM. In a multinode context with Dask, this parameter refers to the per-node value.
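A config.toml sketch that gives each experiment two GPUs with one GPU per model (values illustrative):

    # config.toml: per-experiment and per-model GPU budgets
    num_gpus_per_experiment = 2   # -1 = use all available GPUs
    num_gpus_per_model = 1        # values other than 1 disable GPU locking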
num_gpus_for_prediction¶
Num. of GPUs for Isolated Prediction/Transform
Specify the number of GPUs to use for predict for models and transform for transformers when running outside of fit/fit_transform. If predict or transform are called in the same process as fit/fit_transform, the number of GPUs will match. New processes will use this count for applicable models and transformers. Note that enabling tensorflow_nlp_have_gpus_in_production will override this setting for relevant TensorFlow NLP transformers. The equivalent config.toml parameter is num_gpus_for_prediction, and the default value is 0.
Note: When GPUs are used, TensorFlow and PyTorch models and transformers, as well as RAPIDS, always predict on GPU, and RAPIDS requires the Driverless AI Python scoring package to also be run on GPUs. In a multinode context with Dask, this refers to the per-node value.
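For example, to allow standalone prediction and transform to use one GPU (a sketch; the value is illustrative):

    # config.toml: GPUs for predict/transform outside of fit/fit_transform (default 0)
    num_gpus_for_prediction = 1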
gpu_id_start¶
GPU Starting ID
Specify which gpu_id to start with. If using CUDA_VISIBLE_DEVICES=… to control GPUs (the preferred method), gpu_id=0 is the first in that restricted list of devices. For example, if CUDA_VISIBLE_DEVICES='4,5', then gpu_id_start=0 will refer to device #4.
From expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs:
Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0
Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1
From expert mode, to run 2 experiments, each using 4 distinct GPUs out of 8 GPUs:
Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0
Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4
To run 2 experiments with all 4 GPUs per model:
Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0
Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4
If num_gpus_per_model!=1, global GPU locking is disabled. This is because the underlying algorithms do not support arbitrary GPU IDs, only sequential IDs, so be sure to set this value correctly to avoid overlap across all experiments by all users.
More information is available at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation. Note that GPU selection does not wrap, so gpu_id_start + num_gpus_per_model must not exceed the number of visible GPUs.
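As a sketch of the device mapping described above, assuming Driverless AI was started with CUDA_VISIBLE_DEVICES='4,5':

    # config.toml: gpu_id_start indexes into the CUDA_VISIBLE_DEVICES list,
    # so gpu_id_start=0 here maps to physical GPU #4
    num_gpus_per_model = 1
    num_gpus_per_experiment = 1
    gpu_id_start = 0   # a second experiment on this node would set gpu_id_start = 1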
assumed_simultaneous_dt_forks_munging¶
Assumed/Expected number of munging forks
Expected maximum number of forks, used to ensure datatable doesn’t overload the system. If actual usage goes beyond this value, the system will start to slow down. The default value is 3.
max_max_dt_threads_munging¶
Maximum of threads for datatable for munging
Maximum number of threads that datatable can use for munging.
max_dt_threads_munging¶
Max Number of Threads to Use for datatable and OpenBLAS for Munging and Model Training
Specify the maximum number of threads to use for datatable and OpenBLAS during data munging (applied on a per process basis):
0 = Use all threads
-1 = Automatically select number of threads (Default)
max_dt_threads_readwrite¶
Max Number of Threads to Use for datatable Read and Write of Files
Specify the maximum number of threads to use for datatable during data reading and writing (applied on a per process basis):
0 = Use all threads
-1 = Automatically select number of threads (Default)
max_dt_threads_stats_openblas¶
Max Number of Threads to Use for datatable Stats and OpenBLAS
Specify the maximum number of threads to use for datatable stats and OpenBLAS (applied on a per process basis):
0 = Use all threads
-1 = Automatically select number of threads (Default)
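The three datatable thread limits can be set together in config.toml (a sketch showing the automatic defaults):

    # config.toml: per-process thread limits for datatable and OpenBLAS
    max_dt_threads_munging = -1          # -1 = choose automatically (default)
    max_dt_threads_readwrite = -1        # datatable file reading and writing
    max_dt_threads_stats_openblas = -1   # datatable stats and OpenBLAS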
allow_reduce_features_when_failure¶
Whether to reduce features when model fails (GPU OOM Protection)
Big models (on big data or with a lot of features) can run out of memory on GPUs. This option is primarily useful for avoiding model building failures due to GPU out-of-memory (OOM) errors. It currently applies to all non-Dask XGBoost models (i.e., GLMModel, XGBoostGBMModel, XGBoostDartModel, XGBoostRFModel), during normal fit or when using Optuna.
This is achieved by reducing features until the model no longer fails. For example, if XGBoost runs out of GPU memory, this is detected, and (regardless of the setting of skip_model_failures) feature selection is performed using XGBoost on subsets of features. The feature set is progressively split by factors of 2, with more sub-models to cover all features, and this splitting continues until no failure occurs. All sub-models are then used to estimate variable importance by absolute information gain, in order to decide which features to include. Finally, a single model is built using the most important features, at the feature count that did not lead to OOM.
Note:
This option is set to ‘auto’ by default, which behaves as ‘on’ whenever the conditions are favorable.
Reproducibility is not guaranteed when this option is turned on. Hence, if the user enables reproducibility for the experiment, ‘auto’ automatically sets this option to ‘off’. This is because the out-of-memory condition can change between runs with the same experiment seed.
Reduction is only done on features, not on rows, for the feature selection step.
Also see reduce_repeats_when_failure and fraction_anchor_reduce_features_when_failure.
reduce_repeats_when_failure¶
Number of repeats for models used for feature selection during failure recovery
With allow_reduce_features_when_failure, this controls how many repeats of sub-models are used for feature selection. A single repeat has each sub-model consider only a single subset of features, while repeats shuffle which features are considered, allowing a greater chance of finding important interactions. More repeats can lead to higher accuracy. The cost of this option is proportional to the repeat count. The default value is 1.
fraction_anchor_reduce_features_when_failure¶
Fraction of features treated as anchor for feature selection during failure recovery
With allow_reduce_features_when_failure, this controls the fraction of features treated as anchors, which are fixed across all sub-models. Each repeat gets new anchors. For tuning and evolution, the anchor probability depends on any prior importance (if present) from other individuals, while the final model uses uniform probability for anchor features. The default fraction is 0.1.
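The failure-recovery behavior can be tuned together in config.toml (a sketch using the defaults described above; value types are assumed):

    # config.toml: GPU OOM protection via feature reduction
    allow_reduce_features_when_failure = "auto"          # 'auto' acts as 'on' when conditions allow
    reduce_repeats_when_failure = 1                      # sub-model repeats for feature selection
    fraction_anchor_reduce_features_when_failure = 0.1   # fraction of features fixed as anchors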
xgboost_reduce_on_errors_list¶
Errors From XGBoost That Trigger Reduction of Features
Error strings from XGBoost that are used to trigger re-fit on reduced sub-models. See allow_reduce_features_when_failure.
lightgbm_reduce_on_errors_list¶
Errors From LightGBM That Trigger Reduction of Features
Error strings from LightGBM that are used to trigger re-fit on reduced sub-models. See allow_reduce_features_when_failure.
num_gpus_per_hyperopt_dask¶
GPUs / HyperOptDask
Specify the number of GPUs to use per model hyperopt training task. To use all GPUs, set this to -1. For example, when this is set to -1 and there are 4 GPUs available, all of them can be used for the training of a single model across a Dask cluster. This is ignored if GPUs are disabled or if there are no GPUs on the system. In a multinode context, this refers to the per-node value.
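For example, in config.toml (a sketch; -1 is shown for using all GPUs):

    # config.toml: GPUs per model hyperopt training task (-1 = use all GPUs)
    num_gpus_per_hyperopt_dask = -1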
detailed_traces¶
Enable Detailed Traces
Specify whether to enable detailed tracing in Driverless AI trace when running an experiment. This is disabled by default.
debug_log¶
Enable Debug Log Level
If enabled, the log files will also include debug logs. This is disabled by default.
log_system_info_per_experiment¶
Enable Logging of System Information for Each Experiment
Specify whether to include system information such as CPU, GPU, and disk space at the start of each experiment log. Note that this information is already included in system logs. This is enabled by default.
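The three logging toggles can be combined in config.toml (a sketch showing the defaults described above):

    # config.toml: logging verbosity
    detailed_traces = false                  # detailed tracing (disabled by default)
    debug_log = false                        # include debug-level messages in log files
    log_system_info_per_experiment = true    # CPU/GPU/disk info at start of each experiment log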