Experiment configuration
max_runtime_minutes
Max. runtime in minutes before triggering the 〈Finish〉 button. Approximately enforced. (0 = disabled) (Number) (Expert Setting)
Default value 1440
If the experiment is not done after this many minutes, stop feature engineering and model tuning as soon as possible and proceed with building the final modeling pipeline and deployment artifacts, independent of model score convergence or pre-determined number of iterations. Only active if not in reproducible mode. Depending on the data and experiment settings, overall experiment runtime can differ significantly from this setting.
min_auto_runtime_minutes
Min. runtime in minutes for automatic runtime control (0 = disabled) (Number) (Expert Setting)
Default value 60
If non-zero, then max_runtime_minutes is automatically set to min(max_runtime_minutes, max(min_auto_runtime_minutes, runtime estimate)) when enable_preview_time_estimate is true, so that the preview performs a best estimate of the runtime. Set to zero to prevent the runtime estimate from being used to constrain the runtime of the experiment.
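The rule above can be sketched in Python. The function name and the runtime_estimate_minutes argument are illustrative assumptions and not part of Driverless AI; only the min/max expression comes from the description. The case max_runtime_minutes = 0 (disabled) is not covered by the description and is not handled here.

```python
def effective_max_runtime_minutes(
    max_runtime_minutes: float,
    min_auto_runtime_minutes: float,
    runtime_estimate_minutes: float,
    enable_preview_time_estimate: bool = True,
) -> float:
    """Illustrative only: apply the automatic runtime-control rule."""
    # 0 disables automatic control; the preview estimate must also be enabled.
    if min_auto_runtime_minutes == 0 or not enable_preview_time_estimate:
        return max_runtime_minutes
    return min(
        max_runtime_minutes,
        max(min_auto_runtime_minutes, runtime_estimate_minutes),
    )


# Defaults from this page: max_runtime_minutes=1440, min_auto_runtime_minutes=60.
print(effective_max_runtime_minutes(1440, 60, 25))    # estimate below the floor -> 60
print(effective_max_runtime_minutes(1440, 60, 300))   # estimate used directly -> 300
print(effective_max_runtime_minutes(1440, 60, 5000))  # capped by the maximum -> 1440
```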
max_runtime_minutes_smart
Smart runtime mode (Boolean) (Expert Setting)
Default value True
Whether to tune max_runtime_minutes based upon the final number of base models, so as to trigger the start of the final model early enough to better ensure the entire experiment stops before max_runtime_minutes. Note: If the time given is short enough that tuning models are reduced below final model expectations, the final model may be shorter than expected, leading to an overall shorter experiment time.
max_runtime_minutes_until_abort
Max. runtime in minutes before triggering the 〈Abort〉 button. (0 = disabled) (Number) (Expert Setting)
Default value 10080
If the experiment is not done after this many minutes, push the abort button. Preserves experiment artifacts made so far for summary and log zip files, but no further artifacts are made.
strict_reproducible_for_max_runtime
Whether to disable time-based limits when reproducible is set (Boolean) (Expert Setting)
Default value True
If reproducible is set, then the experiment and all artifacts are reproducible; however, experiments may then take arbitrarily long for a given choice of dials, features, and models.
Setting this to False allows the experiment to complete after a fixed time. All aspects of model and feature building remain reproducible and seeded, but the overall experiment behavior will not necessarily be reproducible if later iterations would have been used in final model building. This should be set to True if every seeded experiment with the exact same setup needs to generate the exact same final model, regardless of duration.
enable_preview_time_estimate
Whether to have preview estimate runtime (Boolean) (Expert Setting)
Default value True
Uses a model built on a large number of experiments to estimate runtime. It can be inaccurate for cases it was not trained on.
enable_preview_mojo_size_estimate
Whether to have preview estimate mojo size (Boolean) (Expert Setting)
Default value True
Uses a model built on a large number of experiments to estimate mojo size. It can be inaccurate for cases it was not trained on.
enable_preview_cpu_memory_estimate
Whether to have preview estimate max cpu memory (Boolean) (Expert Setting)
Default value True
Uses a model built on a large number of experiments to estimate max cpu memory. It can be inaccurate for cases it was not trained on.
enable_preview_time_estimate_rough
enable_preview_time_estimate_rough (Boolean)
Default value False
time_abort
Time to trigger the 〈Abort〉 button. (String) (Expert Setting)
Default value ''
If the experiment is not done by this time, push the abort button. Accepts time in the format given by time_abort_format (defaults to %Y-%m-%d %H:%M:%S), assuming a time zone set by time_abort_timezone (defaults to UTC). One can also give integer seconds since 1970-01-01 00:00:00 UTC. Applies to time on a DAI worker that runs experiments. Preserves experiment artifacts made so far for summary and log zip files, but no further artifacts are made. NOTE: If one starts a new experiment with the same parameters, restarts, or refits, this absolute time will apply to such experiments or set of leaderboard experiments.
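As a minimal sketch of how such a value could be interpreted, the following Python turns either a formatted string or integer epoch seconds into a UTC deadline. The helper names parse_time_abort and abort_is_due are hypothetical; only the default format, the default time zone, and the epoch-seconds alternative come from the description above.

```python
from datetime import datetime, timezone


def _resolve_timezone(name: str):
    """Return a tzinfo object for a time zone name."""
    if name.upper() == "UTC":
        return timezone.utc
    from zoneinfo import ZoneInfo  # standard library since Python 3.9
    return ZoneInfo(name)


def parse_time_abort(value,
                     time_abort_format: str = "%Y-%m-%d %H:%M:%S",
                     time_abort_timezone: str = "UTC") -> datetime:
    """Illustrative only: convert a time_abort value to an aware UTC datetime."""
    text = str(value).strip()
    if text.isdigit():
        # Integer seconds since 1970-01-01 00:00:00 UTC.
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    naive = datetime.strptime(text, time_abort_format)
    local = naive.replace(tzinfo=_resolve_timezone(time_abort_timezone))
    return local.astimezone(timezone.utc)


def abort_is_due(deadline: datetime, now: datetime = None) -> bool:
    """True once the current time has reached the absolute deadline."""
    now = now or datetime.now(timezone.utc)
    return now >= deadline


deadline = parse_time_abort("2030-01-02 03:04:05")
print(deadline.isoformat(), abort_is_due(deadline))
```

Because the deadline is absolute rather than relative to the experiment start, the same check gives the same answer for any restarted or refitted experiment that inherits the setting.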
delete_model_dirs_and_files
delete_model_dirs_and_files (Boolean)
Default value True
Whether to delete all directories and files matching the experiment pattern when do_delete_model is called (True), or whether to just delete directories (False). False can be used to preserve experiment logs that do not take up much space.
delete_data_dirs_and_files
delete_data_dirs_and_files (Boolean)
Default value True
Whether to delete all directories and files matching the dataset pattern when do_delete_dataset is called (True), or whether to just delete directories (False). False can be used to preserve dataset logs that do not take up much space.
recipe
Pipeline Building Recipe (String) (Expert Setting)
Default value 'auto'
Recipe type. Recipes override any GUI settings.
- 〈auto〉: all models and features automatically determined by experiment settings, toml settings, and feature_engineering_effort
- 〈compliant〉: like 〈auto〉 except:
  - interpretability=10 (to avoid complexity; overrides the GUI or Python client choice for interpretability)
  - enable_glm=〈on〉 (rest 〈off〉, to avoid complexity and be compatible with algorithms supported by MLI)
  - fixed_ensemble_level=0: Don’t use any ensemble
  - feature_brain_level=0: No feature brain used (to ensure every restart is identical)
  - max_feature_interaction_depth=1: interaction depth is set to 1 (no multi-feature interactions, to avoid complexity)
  - target_transformer=〈identity〉: for regression (to avoid complexity)
  - check_distribution_shift_drop=〈off〉: Don’t use distribution shift between train, valid, and test to drop features (bit risky without fine-tuning)
- 〈monotonic_gbm〉: like 〈auto〉 except:
  - monotonicity_constraints_interpretability_switch=1: enable monotonicity constraints
  - monotonicity_constraints_correlation_threshold=0.01: see below
  - monotonicity_constraints_drop_low_correlation_features=true: drop features that aren’t correlated with target by at least 0.01 (specified by parameter above)
  - fixed_ensemble_level=0: Don’t use any ensemble (to avoid complexity)
  - included_models=[〈LightGBMModel〉]
  - included_transformers=[〈OriginalTransformer〉]: only original (numeric) features will be used
  - feature_brain_level=0: No feature brain used (to ensure every restart is identical)
  - monotonicity_constraints_log_level=〈high〉
  - autodoc_pd_max_runtime=-1: no timeout for PDP creation in AutoDoc
- 〈kaggle〉: like 〈auto〉 except:
  - external validation set is concatenated with train set, with target marked as missing
  - test set is concatenated with train set, with target marked as missing
  - transformers that do not use the target are allowed to fit_transform across entire train + validation + test
  - several config toml expert options open up limits (e.g. more numerics are treated as categoricals)
  - Note: If memory is plentiful, one can:
    - choose kaggle mode and then change fixed_feature_interaction_depth to a large negative number; otherwise the number of features given to a transformer is limited to 50 by default
    - choose mutation_mode = 《full》, so even more types of transformations are done at once per transformer
- 〈nlp_model〉: Only enables NLP models that process pure text
- 〈nlp_transformer〉: Only enables NLP transformers that process pure text, while any model type is allowed
- 〈image_model〉: Only enables Image models that process pure images
- 〈image_transformer〉: Only enables Image transformers that process pure images, while any model type is allowed
- 〈unsupervised〉: Only enables unsupervised transformers, models and scorers
- 〈gpus_max〉: Maximize use of GPUs (e.g. use XGBoost, rapids, Optuna hyperparameter search, etc.)
- 〈more_overfit_protection〉: Potentially reduce overfitting, esp. for small data, by disabling target encoding and making GA behave like final model for tree counts and learning rate
- 〈feature_store_mojo〉: Creates a MOJO to be used as transformer in the H2O Feature Store, to augment data on a row-by-row level based on Driverless AI’s feature engineering. Only includes transformers that don’t depend on the target, since features like target encoding need to be created at model fitting time to avoid data leakage. Features like lags need to be created from the raw data; they can’t be computed with a row-by-row MOJO transformer.
Each pipeline building recipe mode can be chosen and then fine-tuned using the expert settings. Changing the pipeline building recipe will reset all pipeline building recipe options back to default and then re-apply the specific rules for the new mode, which will undo any fine-tuning of expert options that are part of pipeline building recipe rules.
If one chooses to run a new/continued/refitted/retrained experiment from a parent experiment, the recipe rules are not re-applied and any fine-tuning is preserved. To reset recipe behavior, one can switch between 〈auto〉 and the desired mode. This way the new child experiment will use the default settings for the chosen recipe.
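The reset-then-reapply behavior described above can be modeled roughly as follows. This is a sketch under stated assumptions: the switch_recipe function is hypothetical, the values in DEFAULTS are placeholders rather than the real Driverless AI defaults, and only a subset of the 〈compliant〉 overrides listed on this page is included.

```python
# Placeholder defaults (hypothetical values, for illustration only).
DEFAULTS = {
    "interpretability": 5,
    "enable_glm": "auto",
    "fixed_ensemble_level": -1,
    "max_feature_interaction_depth": -1,
}

# A subset of the overrides this page lists for the 'compliant' recipe.
RECIPE_RULES = {
    "auto": {},
    "compliant": {
        "interpretability": 10,
        "enable_glm": "on",
        "fixed_ensemble_level": 0,
        "max_feature_interaction_depth": 1,
    },
}


def switch_recipe(current: dict, new_recipe: str) -> dict:
    """Reset recipe-governed options to defaults, then apply the new rules."""
    governed = {key for rules in RECIPE_RULES.values() for key in rules}
    settings = dict(current)
    for key in governed:
        settings[key] = DEFAULTS[key]  # undoes fine-tuning of governed options
    settings.update(RECIPE_RULES[new_recipe])
    settings["recipe"] = new_recipe
    return settings


# The user fine-tuned one governed option and one unrelated option.
tuned = dict(DEFAULTS, interpretability=7, num_folds=5, recipe="auto")
compliant = switch_recipe(tuned, "compliant")
print(compliant["interpretability"], compliant["num_folds"])
```

In this model, options outside the recipe rules (num_folds here) keep their fine-tuned values, while governed options are overwritten on every recipe change.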
custom_unsupervised_expert_mode
Whether to treat custom unsupervised model like UnsupervisedModel (Boolean) (Expert Setting)
Default value False
Whether to treat the model like UnsupervisedModel, so that one specifies each scorer, pretransformer, and transformer in the expert panel like one would do for supervised experiments.
Otherwise (False), custom unsupervised models will assume the model itself specifies these. If the unsupervised model chosen has _included_transformers, _included_pretransformers, and _included_scorers selected, this should be set to False (default); otherwise it should be set to True. Then, if one wants the unsupervised model to only produce 1 gene-transformer, the custom unsupervised model can have:
_ngenes_max = 1
_ngenes_max_by_layer = [1000, 1]
The 1000 for the pretransformer layer just means that layer can have any number of genes. Choose 1 if you expect a single instance of the pretransformer to be all one needs, e.g. it consumes input features fully and produces complete useful output features.
enable_genetic_algorithm
Enable genetic algorithm for selection and tuning of features and models (String) (Expert Setting)
Default value 'auto'
Whether to enable genetic algorithm for selection and hyper-parameter tuning of features and models.
- If disabled (〈off〉), will go directly to final pipeline training (using default feature engineering and feature selection).
- 〈auto〉 is same as 〈on〉 unless pure NLP or Image experiment.
- 《Optuna》: Uses DAI genetic algorithm for feature engineering, but model hyperparameters are tuned with Optuna.
  - In the Optuna case, the scores shown in the iteration panel are the best score and trial scores.
  - Optuna mode currently only uses Optuna for XGBoost, LightGBM, and CatBoost (custom recipe).
  - If Pruner is enabled, as is default, Optuna mode disables mutations of eval_metric so pruning uses the same metric across trials to compare properly.
Currently not supported when pre_transformers or a multi-layer pipeline is used, which must go through at least one round of tuning or evolution.
make_python_scoring_pipeline
Make Python scoring pipeline (String) (Expert Setting)
Default value 'auto'
Whether to create the Python scoring pipeline at the end of each experiment.
make_mojo_scoring_pipeline
Make MOJO scoring pipeline (String) (Expert Setting)
Default value 'auto'
Whether to create the MOJO scoring pipeline at the end of each experiment. If set to 《auto》, will attempt to create it if possible (without dropping capabilities). If set to 《on》, might need to drop some models, transformers or custom recipes.
make_triton_scoring_pipeline
Make Triton scoring pipeline (String) (Expert Setting)
Default value 'off'
Whether to create a C++ MOJO based Triton scoring pipeline at the end of each experiment. If set to 《auto》, will attempt to create it if possible (without dropping capabilities). If set to 《on》, might need to drop some models, transformers or custom recipes. Requires make_mojo_scoring_pipeline != 《off》.
auto_deploy_triton_scoring_pipeline
Whether to automatically deploy every model to built-in or remote Triton inference server. (String) (Expert Setting)
Default value 'off'
Whether to automatically deploy the model to the Triton inference server at the end of each experiment. 《local》 will deploy to the local (built-in) Triton inference server to location specified by triton_model_repository_dir_local. 《remote》 will deploy to the remote Triton inference server to location provided by triton_host_remote (and optionally, triton_model_repository_dir_remote). 《off》 requires manual action (Deploy wizard or Python client or manual transfer of exported Triton directory from Deploy wizard) to deploy the model to Triton.
triton_dedup_local_tmp
triton_dedup_local_tmp (Boolean)
Default value True
Replace duplicate files inside the Triton tmp directory with hard links, to significantly reduce the used disk space for local Triton deployments.
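How Driverless AI performs this deduplication is not described here. The general technique of replacing byte-identical files with hard links can be sketched as below; the function names are hypothetical, and the approach (SHA-256 of file contents, then os.link plus an atomic os.replace) is an assumption chosen for illustration. Hard links require that all files live on the same filesystem.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedup_with_hard_links(root: str) -> int:
    """Replace duplicate regular files under root with hard links.

    Returns the number of files that were replaced by a link.
    """
    first_seen = {}
    replaced = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.is_symlink() or not path.is_file():
                continue
            key = (path.stat().st_size, file_digest(path))
            original = first_seen.get(key)
            if original is None:
                first_seen[key] = path
                continue
            if os.path.samefile(original, path):
                continue  # already a hard link to the same data
            # Link under a temporary name, then atomically swap it in.
            tmp = path.with_name(path.name + ".dedup_tmp")
            os.link(original, tmp)
            os.replace(tmp, path)
            replaced += 1
    return replaced
```

After deduplication, each group of identical files shares one copy of the data on disk, which is where the space saving comes from.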
triton_dedup_local_tmp_timeout
triton_dedup_local_tmp_timeout (Number)
Default value 30
triton_mini_acceptance_test_local
Test local Triton deployments during creation of MOJO pipeline. (Boolean) (Expert Setting)
Default value True
Test local Triton deployments during creation of MOJO pipeline. Requires enable_triton_server_local and make_triton_scoring_pipeline to be enabled.
triton_mini_acceptance_test_remote
Test remote Triton deployments during creation of MOJO pipeline. (Boolean) (Expert Setting)
Default value True
Test remote Triton deployments during creation of MOJO pipeline. Requires triton_host_remote to be configured and make_triton_scoring_pipeline to be enabled.
triton_client_timeout_testing
triton_client_timeout_testing (Number)
Default value 300
test_triton_when_making_mojo_pipeline_only
test_triton_when_making_mojo_pipeline_only (Boolean)
Default value False
triton_push_bytes_local_singlenode
triton_push_bytes_local_singlenode (Boolean)
Default value False
mojo_for_predictions_benchmark
mojo_for_predictions_benchmark (Boolean)
Default value True
Perform timing and accuracy benchmarks for Injected MOJO scoring vs Python scoring. This is for full scoring data, and can be slow. This also requires hard asserts. Doesn’t force MOJO scoring by itself, so depends on mojo_for_predictions=〈on〉 if full coverage is wanted.
mojo_for_predictions_benchmark_slower_than_python_threshold
mojo_for_predictions_benchmark_slower_than_python_threshold (Number)
Default value 10
Fail hard if MOJO scoring is this many times slower than Python scoring.
mojo_for_predictions_benchmark_slower_than_python_min_rows
mojo_for_predictions_benchmark_slower_than_python_min_rows (Number)
Default value 100
Fail hard if MOJO scoring is slower than Python scoring by a factor specified by mojo_for_predictions_benchmark_slower_than_python_threshold, but only if there are at least this many rows. This reduces false positives.
mojo_for_predictions_benchmark_slower_than_python_min_seconds
mojo_for_predictions_benchmark_slower_than_python_min_seconds (Float)
Default value 2.0
Fail hard if MOJO scoring is slower than Python scoring by a factor specified by mojo_for_predictions_benchmark_slower_than_python_threshold, but only if it takes at least this many seconds. This reduces false positives.
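Taken together, the three benchmark settings describe a guarded comparison. The sketch below is one plausible reading, not the actual implementation: it assumes that both guards must hold at once, that the seconds guard applies to the MOJO scoring time, and that "this many times slower" means greater than or equal to the factor. The function name is hypothetical.

```python
def mojo_benchmark_should_fail(
    mojo_seconds: float,
    python_seconds: float,
    n_rows: int,
    slower_than_python_threshold: float = 10,  # default from this page
    min_rows: int = 100,                       # default from this page
    min_seconds: float = 2.0,                  # default from this page
) -> bool:
    """Illustrative only: decide whether the benchmark counts as a hard failure."""
    if n_rows < min_rows:
        return False  # too few rows for a meaningful timing comparison
    if mojo_seconds < min_seconds:
        return False  # too short a run for a meaningful timing comparison
    return mojo_seconds >= slower_than_python_threshold * python_seconds


print(mojo_benchmark_should_fail(30.0, 2.0, 1000))  # slow, enough rows and time
print(mojo_benchmark_should_fail(30.0, 2.0, 50))    # guarded by min_rows
print(mojo_benchmark_should_fail(1.5, 0.1, 1000))   # guarded by min_seconds
```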
mojo_for_predictions
Allow use of MOJO for making predictions (String) (Expert Setting)
Default value 'auto'
Use MOJO for making fast low-latency predictions after experiment has finished (when applicable, for AutoDoc/Diagnostics/Predictions/MLI and standalone Python scoring via scorer.zip). For 〈auto〉, only use MOJO if number of rows is equal or below mojo_for_predictions_max_rows. For larger frames, it can be faster to use the Python backend since used libraries are more likely already vectorized.
mojo_for_predictions_max_rows
Max number of rows for C++ MOJO predictions (Number) (Expert Setting)
Default value 10000
For smaller datasets, the single-threaded but low latency C++ MOJO runtime can lead to significantly faster scoring times than the regular in-Driverless AI Python scoring environment. If enable_mojo=True is passed to the predict API, and the MOJO exists and is applicable, then use the MOJO runtime for datasets that have fewer or equal number of rows than this threshold. MLI/AutoDoc set enable_mojo=True by default, so this setting applies. This setting is only used if mojo_for_predictions is 〈auto〉.
mojo_for_predictions_batch_size
Batch size for C++ MOJO predictions. (Number) (Expert Setting)
Default value 100
Batch size (in rows) for C++ MOJO predictions. Only when enable_mojo=True is passed to the predict API, and when the MOJO is applicable (e.g., fewer rows than mojo_for_predictions_max_rows). Larger values can lead to faster scoring, but use more memory.
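The interaction between mojo_for_predictions, mojo_for_predictions_max_rows, and mojo_for_predictions_batch_size can be sketched as follows. The function names are hypothetical, and treating 〈off〉 as a valid value that always selects the Python backend is an assumption; the row threshold and batch size defaults come from this page.

```python
from typing import Iterator, Tuple


def use_mojo_runtime(
    n_rows: int,
    mojo_for_predictions: str = "auto",
    mojo_for_predictions_max_rows: int = 10000,
    mojo_available: bool = True,
) -> bool:
    """Illustrative only: choose between the MOJO and Python scoring backends."""
    if not mojo_available or mojo_for_predictions == "off":
        return False
    if mojo_for_predictions == "on":
        return True
    # 'auto': use the MOJO only for frames at or below the row threshold.
    return n_rows <= mojo_for_predictions_max_rows


def batch_ranges(n_rows: int, batch_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) row ranges of at most batch_size rows."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


if use_mojo_runtime(250):
    for start, stop in batch_ranges(250):
        print(f"score rows [{start}, {stop}) with the MOJO runtime")
```

Larger batches mean fewer calls into the runtime at the cost of holding more rows in memory at once, which matches the trade-off stated above.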
reduce_mojo_size
Attempt to reduce the size of the MOJO (Boolean) (Expert Setting)
Default value False
Whether to attempt to reduce the size of the MOJO scoring pipeline. A smaller MOJO also leads to a smaller memory footprint during scoring. This is achieved by reducing some other settings like interaction depth, and hence can affect the predictive accuracy of the model.
make_pipeline_visualization
Make pipeline visualization (String) (Expert Setting)
Default value 'auto'
Whether to create the pipeline visualization at the end of each experiment. Uses MOJO to show pipeline, input features, transformers, model, and outputs of model. MOJO-capable tree models show first tree.
make_python_pipeline_visualization
Make python pipeline visualization (String) (Expert Setting)
Default value 'auto'
Whether to create the python pipeline visualization at the end of each experiment. Each feature and transformer includes a variable importance at end in brackets. Only done when forced on, and artifacts as png files will appear in summary zip. Each experiment has files per individual in final population:
1) preprune_False_0.0 : Before final pruning, without any additional variable importance threshold pruning
2) preprune_True_0.0 : Before final pruning, with additional variable importance <=0.0 pruning
3) postprune_False_0.0 : After final pruning, without any additional variable importance threshold pruning
4) postprune_True_0.0 : After final pruning, with additional variable importance <=0.0 pruning
5) posttournament_False_0.0 : After final pruning and tournament, without any additional variable importance threshold pruning
6) posttournament_True_0.0 : After final pruning and tournament, with additional variable importance <=0.0 pruning
1-5 are done with 〈on〉 while 〈auto〉 only does 6 corresponding to the final post-pruned individuals. Even post pruning, some features have zero importance, because only those genes that have value+variance in variable importance of value=0.0 get pruned. GA can have many folds with positive variance for a gene, and those are not removed in case they are useful features for final model. If small mojo option is chosen (reduce_mojo_size True), then the variance of feature gain is ignored for which genes and features are pruned as well as for what appears in the graph.
pass_env_to_deprecated_python_scoring
Pass environment variables to deprecated python scoring package (Boolean) (Expert Setting)
Default value False
Pass environment variables from the running Driverless AI instance to the Python scoring pipeline for deprecated models, when they are used to make predictions. Use with caution.
If config.toml overrides are set by env vars, and they differ from what the experiment’s env looked like when it was trained, then unexpected consequences can occur. Enable this only to override certain well-controlled settings like the port for H2O-3 custom recipe server.
benchmark_mojo_latency
Measure MOJO scoring latency (String) (Expert Setting)
Default value 'auto'
Whether to measure the MOJO scoring latency at the time of MOJO creation.
mojo_building_timeout
Timeout in seconds to wait for MOJO creation at end of experiment. (Float) (Expert Setting)
Default value 1800.0
If MOJO creation times out at end of experiment, can still make MOJO from the GUI or from the R/Py clients (timeout doesn’t apply there).
mojo_vis_building_timeout
Timeout in seconds to wait for MOJO visualization creation at end of experiment. (Float) (Expert Setting)
Default value 600.0
If MOJO visualization creation times out at end of experiment, MOJO is still created if possible within the time limit specified by mojo_building_timeout.
mojo_building_parallelism
Number of parallel workers to use during MOJO creation (-1 = all cores) (Number) (Expert Setting)
Default value -1
If MOJO creation is too slow, increase this value. Higher values can finish faster, but use more memory. If MOJO creation fails due to an out-of-memory error, reduce this value to 1. Set to -1 for all physical cores.
mojo_building_parallelism_base_model_size_limit
Size of base models to allow mojo_building_parallelism (Number) (Expert Setting)
Default value 100000000
Size in bytes that all pickled and compressed base models have to satisfy to use parallel MOJO building.
For large base models, parallel MOJO building can use too much memory. Only used if final_fitted_model_per_model_fold_files is true.
show_pipeline_sizes
Whether to show model and pipeline sizes in logs (String) (Expert Setting)
Default value 'auto'
Whether to show model and pipeline sizes in logs.
If 〈auto〉, then not done if there are more than 10 base models+folds, because size is then not expected to be a concern.
max_workers
max_workers (Number)
Default value 1
Maximum number of workers for Driverless AI server pool (only 1 needed currently)
max_cores_dai
max_cores_dai (Number)
Default value -1
Max number of CPU cores to use across all DAI experiments and tasks. -1 means all available cores; with stall_subprocess_submission_dai_fork_threshold_count=0, usage is restricted to the core count.
virtual_cores_per_physical_core
virtual_cores_per_physical_core (Number)
Default value 0
Number of virtual cores per physical core (0: auto mode, >=1 use that integer value). If >=1, the reported physical cores in logs will match the virtual cores divided by this value.
min_virtual_cores_per_physical_core_if_unequal
min_virtual_cores_per_physical_core_if_unequal (Number)
Default value 2
Minimum number of virtual cores per physical core. Only applies if virtual cores != physical cores. Can help in situations like the Intel i9 13900 with 24 physical cores and only 32 virtual cores, where it is better to limit physical cores to 16.
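The arithmetic behind these two settings can be illustrated with a short sketch. The function name is hypothetical and the auto-mode branch is an assumption about how the minimum ratio might be applied; the page only states the i9 13900 example (24 physical, 32 virtual, limited to 16) and the division rule for values >= 1.

```python
def reported_physical_cores(
    virtual_cores: int,
    detected_physical_cores: int,
    virtual_cores_per_physical_core: int = 0,
    min_virtual_cores_per_physical_core_if_unequal: int = 2,
) -> int:
    """Illustrative only: derive the physical core count to report."""
    if virtual_cores_per_physical_core >= 1:
        # Explicit ratio: physical cores = virtual cores / ratio.
        return max(1, virtual_cores // virtual_cores_per_physical_core)
    if virtual_cores != detected_physical_cores:
        # Auto mode (assumed): enforce at least the minimum ratio.
        cap = virtual_cores // min_virtual_cores_per_physical_core_if_unequal
        return max(1, min(detected_physical_cores, cap))
    return detected_physical_cores


# Example from the text: 24 physical cores but only 32 virtual cores.
print(reported_physical_cores(32, 24))  # limited to 32 // 2 = 16
```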
override_physical_cores
override_physical_cores (Number)
Default value 0
Number of physical cores to assume are present (0: auto, >=1 use that integer value).
If for some reason DAI does not automatically figure out physical cores correctly, one can override with this value. Some systems, especially virtualized, do not always provide correct information about the virtual cores, physical cores, sockets, etc.
override_virtual_cores
override_virtual_cores (Number)
Default value 0
Number of virtual cores to assume are present (0: auto, >=1 use that integer value).
If for some reason DAI does not automatically figure out virtual cores correctly, or only a portion of the system is to be used, one can override with this value. Some systems, especially virtualized, do not always provide correct information about the virtual cores, physical cores, sockets, etc.
stall_subprocess_submission_dai_fork_threshold_count
stall_subprocess_submission_dai_fork_threshold_count (Number)
Default value 0
Stall submission of tasks if the total DAI fork count exceeds this count (-1 to disable, 0 to set it automatically from max_cores_dai)
stall_subprocess_submission_mem_threshold_pct
stall_subprocess_submission_mem_threshold_pct (Number)
Default value 2
Stall submission of tasks if system memory available is less than this threshold in percent (set to 0 to disable). Above this threshold, the number of workers in any pool of workers is linearly reduced down to 1 once hitting this threshold.
max_cores_by_physical
max_cores_by_physical (Boolean)
Default value True
Whether to set automatic number of cores by physical (True) or logical (False) count. Using all logical cores can lead to poor performance due to cache thrashing.
max_cores_limit
max_cores_limit (Number)
Default value 200
Absolute limit to core count
max_predict_cores_in_dai_reduce_factor
max_predict_cores_in_dai_reduce_factor (Number)
Default value 4
Factor by which to reduce physical cores, to use for post-model experiment tasks like autoreport, MLI, etc.
max_max_predict_cores_in_dai
max_max_predict_cores_in_dai (Number)
Default value 10
Maximum number of cores to use for post-model experiment tasks like autoreport, MLI, etc.
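One plausible way the reduce factor and the cap combine is shown below. This is an assumption for illustration: the page does not state how the two settings interact, and the function name is hypothetical.

```python
def post_model_task_cores(
    physical_cores: int,
    max_predict_cores_in_dai_reduce_factor: int = 4,  # default from this page
    max_max_predict_cores_in_dai: int = 10,           # default from this page
) -> int:
    """Illustrative only: cores for tasks such as autoreport or MLI."""
    reduced = physical_cores // max_predict_cores_in_dai_reduce_factor
    # Never drop below one core and never exceed the absolute cap.
    return max(1, min(reduced, max_max_predict_cores_in_dai))


for cores in (2, 16, 64):
    print(cores, "->", post_model_task_cores(cores))
```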
assumed_simultaneous_dt_forks_stats_openblas
assumed_simultaneous_dt_forks_stats_openblas (Number)
Default value 1
Expected maximum number of forks computing statistics during ingestion, used to ensure datatable doesn’t overload the system
max_max_dt_threads_stats_openblas
max_max_dt_threads_stats_openblas (Number)
Default value 8
Expected maximum number of threads for datatable, no matter how many more cores are available
kaggle_username
Kaggle username (String) (Expert Setting)
Default value ''
Kaggle username for automatic submission and scoring of test set predictions. See https://github.com/Kaggle/kaggle-api#api-credentials for details on how to obtain Kaggle API credentials.
kaggle_key
Kaggle key (String) (Expert Setting)
Default value ''
Kaggle key for automatic submission and scoring of test set predictions. See https://github.com/Kaggle/kaggle-api#api-credentials for details on how to obtain Kaggle API credentials.
kaggle_timeout
Kaggle submission timeout in seconds (Number) (Expert Setting)
Default value 120
Max. number of seconds to wait for Kaggle API call to return scores for given predictions
disk_limit_gb
disk_limit_gb (Number)
Default value 5
Minimum amount of disk space in GB needed to run experiments. Experiments will fail if this limit is crossed. This limit exists because Driverless AI needs to generate data for model training, feature engineering, documentation and other such processes.
memory_limit_gb
memory_limit_gb (Number)
Default value 5
Minimum amount of system memory in GB needed to start experiments. As with disk space, a certain amount of system memory is needed to run some basic operations.
min_num_rows
Min. number of rows needed to run experiment (Number) (Expert Setting)
Default value 100
Minimum number of rows needed to run experiments (values lower than 100 might not work). A minimum threshold is set to ensure there is enough data to create a statistically reliable model and avoid other small-data related failures.
reproducibility_level
Reproducibility Level (Number) (Expert Setting)
Default value 1
Level of reproducibility desired (for same data and same inputs). Only active if 〈reproducible〉 mode is enabled (GUI button enabled or a seed is set from the client API). Supported levels are:
- reproducibility_level = 1 for same experiment results as long as same O/S, same CPU(s) and same GPU(s)
- reproducibility_level = 2 for same experiment results as long as same O/S, same CPU architecture and same GPU architecture
- reproducibility_level = 3 for same experiment results as long as same O/S, same CPU architecture, not using GPUs
- reproducibility_level = 4 for same experiment results as long as same O/S (best effort)
seed
Random seed (Number) (Expert Setting)
Default value 1234
Seed for random number generator to make experiments reproducible, to a certain reproducibility level (see above). Only active if 〈reproducible〉 mode is enabled (GUI button enabled or a seed is set from the client API).
glm_nan_impute_training_data
glm_nan_impute_training_data (Boolean)
Default value False
Whether to impute (to mean) for GLM on training data.
glm_nan_impute_validation_data
glm_nan_impute_validation_data (Boolean)
Default value False
Whether to impute (to mean) for GLM on validation data.
glm_nan_impute_prediction_data
glm_nan_impute_prediction_data (Boolean)
Default value True
Whether to impute (to mean) for GLM on prediction data (required for consistency with MOJO).
max_cols
max_cols (Number)
Default value 10000000
Maximum number of columns to start an experiment. This threshold exists to constrain the complexity and the length of Driverless AI’s processes.
max_rows_cv_in_cv_gini
max_rows_cv_in_cv_gini (Number)
Default value 100000
Largest number of rows to use for cv in cv for target encoding when doing gini scoring test
max_rows_constant_model
max_rows_constant_model (Number)
Default value 1000000
Largest number of rows to use for constant model fit, otherwise sample randomly
max_rows_final_ensemble_base_model_fold_scores
max_rows_final_ensemble_base_model_fold_scores (Number)
Default value 1000000
Largest number of rows to use for final ensemble base model fold scores, otherwise sample randomly
num_folds
num_folds (Number) (Expert Setting)
Default value 3
Number of folds for models used during the feature engineering process. Increasing this will put a lower fraction of data into validation and more into training (e.g., num_folds=3 means 67%/33% training/validation splits). Actual value will vary for small or big data cases.
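The relationship between num_folds and the split sizes in the example above is simple arithmetic: with k folds, each fold holds out 1/k of the data for validation. The helper name below is hypothetical, and as noted, the actual value used may differ for small or big data.

```python
def fold_fractions(num_folds: int) -> tuple:
    """Return (training fraction, validation fraction) for k-fold splitting."""
    if num_folds < 2:
        raise ValueError("at least 2 folds are needed for a train/validation split")
    validation = 1.0 / num_folds
    return 1.0 - validation, validation


for k in (3, 5, 10):
    train, valid = fold_fractions(k)
    print(f"num_folds={k}: {train:.0%} training / {valid:.0%} validation")
```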
fold_balancing_repeats_times_rows
fold_balancing_repeats_times_rows (Float)
Default value 100000000.0
max_fold_balancing_repeats
max_fold_balancing_repeats (Number)
Default value 10
fixed_split_seed
fixed_split_seed (Number)
Default value 0
show_fold_stats
show_fold_stats (Boolean)
Default value True
allow_different_classes_across_fold_splits
Allow different sets of classes across all train/validation fold splits (Boolean) (Expert Setting)
Default value True
For multiclass problems only. Whether to allow different sets of target classes across (cross-)validation fold splits. Especially important when passing a fold column that isn’t balanced w.r.t class distribution.
save_validation_splits
Store internal validation split row indices (Boolean) (Expert Setting)
Default value False
Includes pickles of (train_idx, valid_idx) tuples (numpy row indices for original training data) for all internal validation folds in the experiment summary zip. For debugging.
max_num_classes
Max. number of classes for classification problems (Number) (Expert Setting)
Default value 1000
Maximum number of classes to allow for a classification problem. High number of classes may make certain processes of Driverless AI time-consuming. Memory requirements also increase with higher number of classes
max_num_classes_compute_roc
Max. number of classes to compute ROC and confusion matrix for classification problems (Number) (Expert Setting)
Default value 200
Maximum number of classes to compute ROC and CM for, beyond which roc_reduce_type choice for reduction is applied. Too many classes can take much longer than model building time.
max_num_classes_client_and_gui
Max. number of classes to show in GUI for confusion matrix (Number) (Expert Setting)
Default value 10
Maximum number of classes to show in GUI for confusion matrix, showing first max_num_classes_client_and_gui labels. Beyond 6 classes the diagnostics launched from GUI are visually truncated. This will only modify client-GUI launched diagnostics if changed in config.toml and server is restarted, while this value can be changed in expert settings to control experiment plots.
roc_reduce_type
ROC/CM reduction technique for large class counts (String) (Expert Setting)
Default value 'rows'
If there are too many classes when computing ROC, reduce by 《rows》 by randomly sampling rows, or reduce by truncating classes to no more than max_num_classes_compute_roc. If there are sufficient rows for the class count, one can reduce by rows.
max_rows_cm_ga
Maximum number of rows to obtain confusion matrix related plots during feature evolution (Number) (Expert Setting)
Default value 500000
Maximum number of rows to obtain confusion matrix related plots during feature evolution. Does not limit final model calculation.
num_actuals_vs_predicted
¶
num_actuals_vs_predicted (Number)
Default value 100
Number of actual vs. predicted data points used to generate the corresponding plot, which is shown on the right side of the screen within an experiment.
use_feature_brain_new_experiments
¶
Whether to use Feature Brain for new experiments. (Boolean) (Expert Setting)
Default value False
Whether to use feature_brain results even when running new experiments. Feature brain can be risky with some types of changes to experiment setup. Even rescoring may be insufficient, so by default this is False. For example, one experiment may by accident use the external validation data as training data and get a high score. Although feature_brain_reset_score=〈on〉 means the model will be rescored, it will already have seen the external validation data during training and will leak that data as part of what it learned. If this is False, feature_brain_level just sets possible models to use and logs/notifies, but does not use these feature brain cached models.
resume_data_schema
¶
Whether to reuse dataset schema. (String) (Expert Setting)
Default value 'auto'
Whether to reuse the dataset schema, such as data types set in the UI for each column, from the parent experiment (〈on〉) or to ignore the original dataset schema and only use the new schema (〈off〉). Resuming the data schema is a basic form of data lineage, but it may not be desirable if data column names changed to incompatible data types, such as int to string.
- 〈auto〉: for restart, retrain final pipeline, or refit best models, the default is to resume the data schema, but new experiments do not reuse the old schema by default.
- 〈on〉: force reuse of the data schema from the parent experiment if possible.
- 〈off〉: don’t reuse the data schema under any circumstances.
The reuse of the column schema can also be disabled:
- in the UI, by selecting Parent Experiment as None
- in the client, by setting resume_experiment_id to None
resume_data_schema_old_logic
¶
resume_data_schema_old_logic (Boolean)
Default value False
feature_brain_level
¶
Model/Feature Brain Level (0..10) (Number) (Expert Setting)
Default value 2
Whether to show (or use) results from H2O.ai brain: the local caching and smart re-use of prior experiments, in order to generate more useful features and models for new experiments. See use_feature_brain_new_experiments for how new experiments by default do not use brain cache. It can also be used to control checkpointing for experiments that have been paused or interrupted.
DAI will use H2O.ai brain cache if cache file
a) has any matching column names and types for a similar experiment type
b) exactly matches classes
c) exactly matches class labels
d) matches basic time series choices
e) interpretability of cache is equal or lower
f) main model (booster) is allowed by new experiment
Level of brain to use (for chosen level, where higher levels will also do all lower level operations automatically):
- -1 = Don’t use any brain cache and don’t write any cache
- 0 = Don’t use any brain cache but still write cache
Use case: Want to save model for later use, but want current model to be built without any brain models
- 1 = smart checkpoint from latest best individual model
Use case: Want to use latest matching model, but match can be loose, so needs caution
- 2 = smart checkpoint from H2O.ai brain cache of individual best models
Use case: DAI scans through H2O.ai brain cache for best models to restart from
- 3 = smart checkpoint like level #1, but for entire population. Tune only if brain population insufficient size
(will re-score entire population in single iteration, so appears to take longer to complete first iteration)
- 4 = smart checkpoint like level #2, but for entire population. Tune only if brain population insufficient size
(will re-score entire population in single iteration, so appears to take longer to complete first iteration)
- 5 = like #4, but will scan over entire brain cache of populations to get best scored individuals
(can be slower due to brain cache scanning if big cache)
- 1000 + feature_brain_level (above positive values) = use resumed_experiment_id and actual feature_brain_level, to use other specific experiment as base for individuals or population, instead of sampling from any old experiments
GUI has 3 options and corresponding settings:
1) New Experiment: Uses feature brain level default of 2
2) New Experiment With Same Settings: Re-uses the same feature brain level as parent experiment
3) Restart From Last Checkpoint: Resets feature brain level to 1003 and sets experiment ID to resume from (continued genetic algorithm iterations)
Retrain Final Pipeline: Like Restart but also time=0 so skips any tuning and heads straight to final model (assumes had at least one tuning iteration in parent experiment)
Other use cases:
a) Restart on different data: Use same column names and fewer or more rows (applicable to 1 - 5)
b) Re-fit only final pipeline: Like (a), but choose time=1 and feature_brain_level=3 - 5
c) Restart with more columns: Add columns, so model builds upon old model built from old column names (1 - 5)
d) Restart with focus on model tuning: Restart, then select feature_engineering_effort = 3 in expert settings
e) Can retrain final model but ignore any original features except those in final pipeline (normal retrain but set brain_add_features_for_new_columns=false)
Notes:
1) In all cases, we first check the resumed experiment id if given, and then the brain cache
2) For Restart cases, may want to set min_dai_iterations to non-zero to force delayed early stopping, else may not be enough iterations to find better model.
3) A 《New experiment with Same Settings》 of a Restart will use feature_brain_level=1003 for default Restart mode (revert to 2, or even 0 if want to start a fresh experiment otherwise)
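The "1000 + feature_brain_level" convention described above can be read as an offset encoding. The helper below is a hypothetical illustration of that reading, not part of the Driverless AI client.

```python
def decode_feature_brain_level(value):
    """Split a configured value into (base_level, uses_resumed_experiment).

    Values of 1000 or more are interpreted as 1000 + base level, meaning
    the resumed experiment is used as the base instead of the general cache.
    """
    if value >= 1000:
        return value - 1000, True
    return value, False

restart_default = decode_feature_brain_level(1003)   # Restart From Last Checkpoint
new_experiment = decode_feature_brain_level(2)        # New Experiment default
```

Under this reading, the Restart default of 1003 corresponds to base level 3 applied to the resumed experiment.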
feature_brain_reset_score
¶
Whether to re-score models from brain cache (String) (Expert Setting)
Default value 'auto'
Whether to smartly keep the score to avoid re-munging/re-training/re-scoring steps for brain models (〈auto〉), always force all steps for all brain imports (〈on〉), or never rescore (〈off〉). 〈auto〉 only re-scores if a difference between the current and prior experiment warrants re-scoring, such as column changes, metric changes, etc. 〈on〉 is useful when smart similarity checking is not reliable enough. 〈off〉 is useful when you want to keep the exact same features and model for the final model refit, despite changes in seed or other behaviors in features that might change the outcome if re-scored before reaching the final model. If set to 〈off〉, no limits are applied to features during brain ingestion, and brain_add_features_for_new_columns can be set to false to ignore any new columns in the data. In addition, any unscored individuals loaded from the parent experiment are not rescored when doing refit or retrain. You can also set refit_same_best_individual to True if you want the exact same best individual (highest scored model+features) to be used regardless of any scoring changes.
max_num_brain_indivs
¶
max_num_brain_indivs (Number)
Default value 3
Maximum number of brain individuals pulled from H2O.ai brain cache for feature_brain_level=1, 2
feature_brain_save_every_iteration
¶
Feature Brain Save every which iteration (0 = disable) (Number) (Expert Setting)
Default value 0
Save feature brain iterations whenever iter_num % feature_brain_save_every_iteration == 0, to be able to restart/refit with which_iteration_brain >= 0. 0 means disable.
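As a rough illustration of the modulo rule, the sketch below lists which iterations would be saved for a given interval. The function name is hypothetical, and whether iteration 0 is actually saved in practice is an assumption that follows only from the formula.

```python
def should_save_brain(iter_num, save_every):
    """Apply the iter_num % save_every == 0 rule; 0 (or less) disables saving."""
    if save_every <= 0:
        return False
    return iter_num % save_every == 0

saved_iterations = [i for i in range(10) if should_save_brain(i, 3)]
```

Any of the saved iteration numbers could then be used as which_iteration_brain when restarting or refitting.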
which_iteration_brain
¶
Feature Brain Restart from which iteration (-1 = auto) (Number) (Expert Setting)
Default value -1
When doing a restart or re-fit type of feature_brain_level with resumed_experiment_id, choose which iteration to start from, instead of only the last best. -1 means just use the last best.
Usage:
1) Run one experiment with feature_brain_save_every_iteration=1 or some other number
2) Identify which iteration brain dump one wants to restart/refit from
3) Restart/Refit from the original experiment, setting which_iteration_brain to that number in expert settings
Note: If restarting from a tuning iteration, this will pull in the entire scored tuning population and use that for feature evolution.
refit_same_best_individual
¶
Feature Brain refit uses same best individual (Boolean) (Expert Setting)
Default value False
When doing a re-fit from feature brain, if columns or features change, the population of individuals used to refit from may change the order of which was best, leading to a better result being chosen (False case). But sometimes one wants to see the exact same model/features with only one feature added, and then this needs to be set to True. E.g. if refitting with just 1 extra column and interpretability=1, then the final model will have the same features, with one more engineered feature applied to that new original feature.
restart_refit_redo_origfs_shift_leak
¶
For restart-refit, select which steps to do (List) (Expert Setting)
Default value []
When doing a restart or re-fit of an experiment from feature brain, sometimes the user might change the data significantly, which then warrants redoing the reduction of original features by feature selection, shift detection, and leakage detection. However, in other cases, if data and all options are nearly (or exactly) identical, then these steps might change the features slightly (e.g. due to random seed if not setting reproducible mode), leading to changes in the features and model that are refitted. By default, restart and refit avoid these steps, assuming data and experiment setup have not changed significantly. If check_distribution_shift is forced to on (instead of auto), then this option is ignored. In order to ensure the exact same final pipeline is fitted, one should also set:
1) brain_add_features_for_new_columns false
2) refit_same_best_individual true
3) feature_brain_reset_score 〈off〉
4) force_model_restart_to_defaults false
The score will still be reset if the experiment metric chosen changes, but changes to the scored model and features will be more frozen in place.
brain_rel_dir
¶
brain_rel_dir (String)
Default value 'H2O.ai_brain'
Directory, relative to data_directory, to store H2O.ai brain meta model files
brain_add_features_for_new_columns
¶
Feature Brain adds features with new columns even during retraining final model (Boolean) (Expert Setting)
Default value True
Whether to take any new columns and add additional features to pipeline, even if doing retrain final model. In some cases, one might have a new dataset but only want to keep same pipeline regardless of new columns, in which case one sets this to False. For example, new data might lead to new dropped features, due to shift or leak detection. To avoid change of feature set, one can disable all dropping of columns, but set this to False to avoid adding any columns as new features, so pipeline is perfectly preserved when changing data.
force_model_restart_to_defaults
¶
Restart-refit use default model settings if model switches (Boolean) (Expert Setting)
Default value True
If restart/refit and no longer have the original model class available, be conservative and go back to defaults for that model class. If False, then try to keep original hyperparameters, which can fail to work in general.
min_dai_iterations
¶
Min. DAI iterations (Number) (Expert Setting)
Default value 0
Minimum number of Driverless AI iterations before the feature evolution/engineering process is allowed to stop, even if the score is not improving. Driverless AI needs to run for at least that many iterations before deciding to stop. It can be seen as a safeguard against suboptimal (early) convergence.
tensorflow_nlp_have_gpus_in_production
¶
tensorflow_nlp_have_gpus_in_production (Boolean)
Default value False
bert_migration_timeout_secs
¶
bert_migration_timeout_secs (Number)
Default value 600
enable_bert_transformer_acceptance_test
¶
enable_bert_transformer_acceptance_test (Boolean)
Default value False
enable_bert_model_acceptance_test
¶
enable_bert_model_acceptance_test (Boolean)
Default value False
target_transformer
¶
Select target transformation of the target for regression problems (String) (Expert Setting)
Default value 'auto'
Select a target transformation for regression problems. Must be one of: [〈auto〉, 〈identity〉, 〈identity_noclip〉, 〈center〉, 〈standardize〉, 〈unit_box〉, 〈log〉, 〈log_noclip〉, 〈square〉, 〈sqrt〉, 〈double_sqrt〉, 〈inverse〉, 〈anscombe〉, 〈logit〉, 〈sigmoid〉]. If set to 〈auto〉, will automatically pick the best target transformer (if accuracy is set to tune_target_transform_accuracy_switch or larger, considering interpretability level of each target transformer), otherwise will fall back to 〈identity_noclip〉 (easiest to interpret, Shapley values are in original space, etc.). All transformers except for 〈center〉, 〈standardize〉, 〈identity_noclip〉 and 〈log_noclip〉 perform clipping to constrain the predictions to the domain of the target in the training data. Use 〈center〉, 〈standardize〉, 〈identity_noclip〉 or 〈log_noclip〉 to disable clipping and to allow predictions outside of the target domain observed in the training data (for parametric models or custom models that support extrapolation).
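To make the clipping behavior concrete, here is a minimal sketch of a log-style target transformer with and without clipping. It is not the Driverless AI implementation: the class name is hypothetical, a plain natural log is assumed, and targets are assumed strictly positive.

```python
import math

class LogTargetSketch:
    """Log transform whose inverse can clip to the training target domain."""

    def fit(self, y):
        # Remember the target range seen in the training data.
        self.lo, self.hi = min(y), max(y)
        return self

    def transform(self, y):
        return [math.log(v) for v in y]

    def inverse(self, z, clip=True):
        y = [math.exp(v) for v in z]
        if clip:
            # Constrain predictions to the observed training domain.
            y = [min(max(v, self.lo), self.hi) for v in y]
        return y

t = LogTargetSketch().fit([1.0, 10.0, 100.0])
clipped = t.inverse([math.log(1000.0)])               # behaves like a clipping transformer
unclipped = t.inverse([math.log(1000.0)], clip=False)  # behaves like a *_noclip variant
```

The unclipped variant lets a model that extrapolates predict beyond the largest training target, which is the case the noclip options are described as serving.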
target_transformer_tuning_choices
¶
Select all allowed target transformations of the target for regression problems when doing target transformer tuning (List) (Expert Setting)
Default value ['identity', 'identity_noclip', 'center', 'standardize', 'unit_box', 'log', 'square', 'sqrt', 'double_sqrt', 'anscombe', 'logit', 'sigmoid']
Select list of target transformers to use for tuning. Only for target_transformer=〈auto〉 and accuracy >= tune_target_transform_accuracy_switch.
tournament_style
¶
Tournament model for genetic algorithm (String) (Expert Setting)
Default value 'auto'
Tournament style (method to decide which models are best at each iteration):
- 〈auto〉: Choose based upon accuracy and interpretability
- 〈uniform〉: all individuals in population compete to win as best (can lead to all, e.g., LightGBM models in final ensemble, which may not improve ensemble performance due to lack of diversity)
- 〈model〉: individuals with same model type compete (good if multiple models do well but some models that do not do as well still contribute to improving ensemble)
- 〈feature〉: individuals with similar feature types compete (good if target encoding, frequency encoding, and other feature sets lead to good results)
- 〈fullstack〉: Choose among optimal model and feature types
〈model〉 and 〈feature〉 styles preserve at least one winner for each type (and so 2 total indivs of each type after mutation). For each case, a round robin approach is used to choose best scores among type of models to choose from. If enable_genetic_algorithm==〈Optuna〉, then every individual is self-mutated without any tournament during the genetic algorithm. The tournament is only used to prune-down individuals for, e.g., tuning -> evolution and evolution -> final model.
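The difference between the uniform and model styles can be sketched on a toy population. This only illustrates the selection idea; the data layout, function names, and the assumption that a higher score is better are all made up for the example.

```python
def uniform_winners(population, k):
    """All individuals compete together; keep the k best overall."""
    return sorted(population, key=lambda ind: ind["score"], reverse=True)[:k]

def model_winners(population):
    """Individuals compete only within their model type; keep one winner each."""
    best = {}
    for ind in population:
        current = best.get(ind["model"])
        if current is None or ind["score"] > current["score"]:
            best[ind["model"]] = ind
    return sorted(best.values(), key=lambda ind: ind["score"], reverse=True)

population = [
    {"name": "a", "model": "LightGBM", "score": 0.91},
    {"name": "b", "model": "LightGBM", "score": 0.90},
    {"name": "c", "model": "GLM", "score": 0.85},
    {"name": "d", "model": "XGBoost", "score": 0.89},
]
uniform_names = [ind["name"] for ind in uniform_winners(population, 2)]
model_names = [ind["name"] for ind in model_winners(population)]
```

Here the uniform style keeps two LightGBM individuals, while the model style keeps one individual per model type and so preserves diversity.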
max_fold_reps_hard_limit
¶
max_fold_reps_hard_limit (Number)
Default value 20
sanitize_natural_sort_limit
¶
sanitize_natural_sort_limit (Number)
Default value 1000
Number of unique targets or fold counts after which to switch to faster/simpler non-natural sorting and printouts.
head_tail_fold_id_report_length
¶
head_tail_fold_id_report_length (Number)
Default value 30
number of fold ids to report cardinality for, both most common (head) and least common (tail)
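A head/tail cardinality report of this kind might look like the following sketch. The function name and the exact report format are assumptions; only the idea of listing the most and least common fold ids comes from the description.

```python
from collections import Counter

def head_tail_cardinality(fold_ids, report_length=30):
    """Return (head, tail): the most and least common fold ids with counts."""
    counts = Counter(fold_ids).most_common()  # sorted by count, descending
    if report_length <= 0:
        return [], []
    return counts[:report_length], counts[-report_length:]

fold_ids = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]
head, tail = head_tail_cardinality(fold_ids, report_length=2)
```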
cvte_cv_in_cv_use_model
¶
cvte_cv_in_cv_use_model (Boolean)
Default value False
For target encoding, whether a model is used to compute Ginis for checking sanity of transformer. Requires cvte_cv_in_cv to be enabled. If enabled, CV-in-CV isn’t done in case the check fails.
include_all_as_pretransformers_if_none_selected
¶
include_all_as_pretransformers_if_none_selected (Boolean)
Default value False
force_include_all_as_pretransformers_if_none_selected
¶
force_include_all_as_pretransformers_if_none_selected (Boolean)
Default value False
dask_retrials_allreduce_empty_issue
¶
dask_retrials_allreduce_empty_issue (Number)
Default value 5
Number of retrials for dask fit to protect against known xgboost issues: https://github.com/dmlc/xgboost/issues/6272 and https://github.com/dmlc/xgboost/issues/6551
use_xgboost_xgbfi
¶
use_xgboost_xgbfi (Boolean)
Default value False
Whether to use (and expect exists) xgbfi feature interactions for xgboost.
scale_mem_for_max_bin
¶
scale_mem_for_max_bin (Number)
Default value 10737418240
Reference amount of memory that can handle max_bin = 256 for 125 columns and max_bin = 32 for 1000 columns. As available memory on the system goes higher than this scale, proportionally more columns can be handled at higher max_bin. Currently set to 10GB.
factor_rf
¶
factor_rf (Float)
Default value 1.25
Factor by which rf gets more depth than gbdt
fixed_num_folds_evolution
¶
Number of cross-validation folds for feature evolution (-1 = auto) (Number) (Expert Setting)
Default value -1
Specify the fixed number of cross-validation folds (if >= 2) for feature evolution. (The actual number of splits allowed can be less and is determined at experiment run-time).
fixed_num_folds
¶
Number of cross-validation folds for final model (-1 = auto) (Number) (Expert Setting)
Default value -1
Specify the fixed number of cross-validation folds (if >= 2) for the final model. (The actual number of splits allowed can be less and is determined at experiment run-time).
fixed_only_first_fold_model
¶
Force only first fold for models (String) (Expert Setting)
Default value 'auto'
Set 《on》 to force only the first fold for models. Useful for quick runs regardless of data.
fixed_fold_reps
¶
Number of repeated cross-validation folds. 0 is auto. (Number) (Expert Setting)
Default value 0
Set the number of repeated cross-validation folds for feature evolution and final models (if > 0), 0 is default. Only for ensembles that do cross-validation (so no external validation and not time-series), not for single final models.
feature_evolution_data_size
¶
Max. num. of rows x num. of columns for feature evolution data splits (not for final pipeline) (Number) (Expert Setting)
Default value 300000000
Upper limit on the number of rows x number of columns for feature evolution (applies to both training and validation/holdout splits). Feature evolution is the process that determines which features will be derived. Depending on accuracy settings, a fraction of this value will be used.
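As a back-of-the-envelope illustration, a rows x columns budget translates into a row cap for a dataset of a given width. The helper and its `fraction` parameter are hypothetical; how Driverless AI derives the fraction from the accuracy setting is not specified here.

```python
def max_rows_for_budget(num_rows, num_cols, data_size_limit, fraction=1.0):
    """Largest row count such that rows * cols stays within the budget."""
    budget = int(data_size_limit * fraction)
    row_cap = max(1, budget // max(1, num_cols))
    return min(num_rows, row_cap)

# Default feature_evolution_data_size of 300000000 with 100 columns.
wide_cap = max_rows_for_budget(10_000_000, 100, 300_000_000)
half_cap = max_rows_for_budget(10_000_000, 100, 300_000_000, fraction=0.5)
small_data = max_rows_for_budget(1_000, 100, 300_000_000)
```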
final_pipeline_data_size
¶
Max. num. of rows x num. of columns for reducing training data set (for final pipeline) (Number) (Expert Setting)
Default value 1000000000
Upper limit on the number of rows x number of columns for training final pipeline.
limit_validation_size
¶
Limit validation size (Boolean) (Expert Setting)
Default value True
Whether to automatically limit validation data size using feature_evolution_data_size (giving max_rows_feature_evolution shown in logs) for tuning-evolution, and using final_pipeline_data_size, max_validation_to_training_size_ratio_for_final_ensemble for final model.
max_validation_to_training_size_ratio_for_final_ensemble
¶
Max. size of validation data relative to training data (for final pipeline), otherwise will sample (Float) (Expert Setting)
Default value 2.0
Smaller values can speed up final pipeline model training, as validation data is only used for early stopping. Note that final model predictions and scores will always be provided on the full dataset provided.
force_stratified_splits_for_imbalanced_threshold_binary
¶
Perform stratified sampling for binary classification if the target is more imbalanced than this. (Float) (Expert Setting)
Default value 0.01
Ratio of minority to majority class of the target column beyond which stratified sampling is done for binary classification. Otherwise perform random sampling. Set to 0 to always do random sampling. Set to 1 to always do stratified sampling.
force_stratified_splits_for_binary_max_rows
¶
Perform stratified sampling for binary classification if the dataset has fewer rows than this. (Number) (Expert Setting)
Default value 1000000
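The two binary-classification stratification settings above can be read as two independent triggers. The sketch below encodes that reading; combining them with a logical OR is an assumption, as are the function names.

```python
def is_too_imbalanced(n_minority, n_majority, threshold=0.01):
    """True if the minority/majority ratio is below the threshold."""
    if threshold <= 0:
        return False  # 0 means always random sampling
    if threshold >= 1:
        return True   # 1 means always stratified sampling
    return n_minority / n_majority < threshold

def is_small_dataset(n_rows, max_rows=1_000_000):
    """True if the dataset has fewer rows than the configured limit."""
    return n_rows < max_rows

def use_stratified_split(n_minority, n_majority, threshold=0.01, max_rows=1_000_000):
    # Assumption: either condition alone is enough to stratify.
    return (is_small_dataset(n_minority + n_majority, max_rows)
            or is_too_imbalanced(n_minority, n_majority, threshold))
```

For example, 5,000 positives against 2,000,000 negatives gives a ratio of 0.0025, which is below the default 0.01 and so would trigger stratification under this reading.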
stratify_for_regression
¶
Perform stratified sampling for regression problems (using binning). (Boolean) (Expert Setting)
Default value True
Specify whether to do stratified sampling for validation fold creation for iid regression problems. Otherwise perform random sampling.
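Stratifying a continuous target via binning can be sketched as follows: sort rows by target, cut them into equal-frequency bins, and spread each bin's rows across the folds. This is a generic illustration of the technique, not Driverless AI's implementation; the function name, bin count, and seed handling are assumptions.

```python
import random

def stratified_regression_folds(y, n_folds=3, n_bins=4, seed=0):
    """Assign row indices to folds so each fold samples every target bin."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    n = len(order)
    folds = [[] for _ in range(n_folds)]
    slot = 0
    for b in range(n_bins):
        # Equal-frequency bin over the sorted targets.
        rows = order[b * n // n_bins:(b + 1) * n // n_bins]
        rng.shuffle(rows)  # randomize which fold gets which row of the bin
        for i in rows:
            folds[slot % n_folds].append(i)
            slot += 1
    return [sorted(f) for f in folds]

targets = [float(v) for v in range(12)]
folds = stratified_regression_folds(targets)
```

With 12 rows, 4 bins, and 3 folds, every fold receives exactly one row from each bin, so each fold covers the full range of the target.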
cols_to_drop_sanitized
¶
cols_to_drop_sanitized (List)
Default value []
cols_to_group_by_sanitized
¶
cols_to_group_by_sanitized (List)
Default value []
leaderboard_mode
¶
Control the automatic leaderboard mode (String) (Expert Setting)
Default value 'baseline'
- 〈baseline〉: Explore exemplar set of models with baselines as reference.
- 〈random〉: Explore 10 random seeds for same setup. Useful since nature of genetic algorithm is noisy and repeats might get better results, or one can ensemble the custom individuals from such repeats.
- 〈line〉: Explore good model with all features and original features with all models. Useful as first exploration.
- 〈line_all〉: Like 〈line〉, but enable all models and transformers possible instead of only what base experiment setup would have inferred.
- 〈product〉: Explore one-by-one Cartesian product of each model and transformer. Useful for exhaustive exploration.
leaderboard_off
¶
leaderboard_off (Boolean)
Default value False
Controls whether users can launch an experiment in Leaderboard mode from the UI.
default_knob_offset_accuracy
¶
Offset for default accuracy knob (Number) (Expert Setting)
Default value 0
Allows control over the default accuracy knob setting. If default models are too complex, set to -1 or -2, etc. If default models are not accurate enough, set to 1 or 2, etc.
default_knob_offset_time
¶
Offset for default time knob (Number) (Expert Setting)
Default value 0
Allows control over the default time knob setting. If default experiments are too slow, set to -1 or -2, etc. If default experiments finish too fast, set to 1 or 2, etc.
default_knob_offset_interpretability
¶
Offset for default interpretability knob (Number) (Expert Setting)
Default value 0
Allows control over the default interpretability knob setting. If default models are too simple, set to -1 or -2, etc. If default models are too complex, set to 1 or 2, etc.
drop_features_distribution_shift_min_features
¶
drop_features_distribution_shift_min_features (Number)
Default value 1
Minimum number of features to keep. If 1, at least the least shifted feature is kept.
shift_high_notification_level
¶
shift_high_notification_level (Float)
Default value 0.8
Shift beyond which shows HIGH notification, else MEDIUM
leakage_key_features_varimp_if_no_early_stopping
¶
leakage_key_features_varimp_if_no_early_stopping (Float)
Default value 0.05
Like leakage_key_features_varimp, but applies when early stopping is disabled, in which case multiple leaks can be trusted to get uniform variable importance.
drop_features_leakage_min_features
¶
drop_features_leakage_min_features (Number)
Default value 1
Minimum number of features to keep. If 1, at least the feature with the least leakage is kept.
check_system
¶
Whether to check system installation on server startup (Boolean)
Default value True
mutate_timeout
¶
mutate_timeout (Number)
Default value 600
How many seconds to allow mutate to take. Nominally it takes only a few seconds at most, but on a busy system handling many individuals it might take longer. Optuna sometimes hangs in a livelock in the scipy random distribution maker.
gpu_locking_trust_pool_submission
¶
gpu_locking_trust_pool_submission (Boolean)
Default value True
Whether to trust GPU locking for submission of GPU jobs to limit memory usage. If False, then wait for the number of GPU submissions to be less than the number of GPUs, even if later jobs could be purely CPU jobs that did not need to wait. Only applicable if not restricting the number of GPUs via num_gpus_per_experiment; otherwise resources have to be used instead of relying upon locking.
gpu_locking_free_dead
¶
gpu_locking_free_dead (Boolean)
Default value True
Whether to steal GPU locks when process is neither on GPU PID list nor using CPU resources at all (e.g. sleeping). Only steal from multi-GPU locks that are incomplete. Prevents deadlocks in case multi-GPU model hangs.
tensorflow_allow_cpu_only
¶
tensorflow_allow_cpu_only (Boolean)
Default value False
check_pred_contribs_sum
¶
check_pred_contribs_sum (Boolean)
Default value False
debug_debug_xgboost_splits
¶
debug_debug_xgboost_splits (Boolean)
Default value False
stalled_time_kill_ref
¶
stalled_time_kill_ref (Float)
Default value 440.0
Amount of time to stall (in seconds) before killing the job (assumes it hung). The reference time is scaled by the train data shape (rows * cols) to get the stalled_time_kill that is actually used.
long_time_psdump
¶
long_time_psdump (Number)
Default value 1800
Amount of time between checks for a process taking a long time. Every cycle, the full process list is dumped to the console or experiment logs if possible.
do_psdump
¶
do_psdump (Boolean)
Default value False
Whether to dump ps every long_time_psdump
livelock_signal
¶
livelock_signal (Boolean)
Default value False
Whether to check every long_time_psdump seconds and send SIGUSR1 to all children to see where they may be stuck or taking a long time.
num_cpu_sockets_override
¶
num_cpu_sockets_override (Number)
Default value 0
Value to override the number of sockets, in case DAI's determination is wrong, for non-trivial systems. 0 means auto.
num_gpus_override
¶
num_gpus_override (Number)
Default value -1
Value to override the number of GPUs, in case DAI's determination is wrong, for non-trivial systems. -1 means auto. Can also set min_num_cores_per_gpu=-1 to allow any number of GPUs for each experiment regardless of number of cores.
show_gpu_usage_only_if_locked
¶
show_gpu_usage_only_if_locked (String)
Default value 'auto'
Whether to show GPU usage only when locking. 〈auto〉 means 〈on〉 if num_gpus_override is different than actual total visible GPUs, else it means 〈off〉
meta_weight_allowed_for_reference
¶
Min. weight of meta learner for reference models during ensembling. If 1.0, then reference model must be the clear winner to be kept. Set to 0.0 to never drop reference models (Float) (Expert Setting)
Default value 1.0
show_full_pipeline_details
¶
Whether to show full pipeline details (Boolean) (Expert Setting)
Default value False
num_transformed_features_per_pipeline_show
¶
Number of features to show when logging size of fitted transformers (Number) (Expert Setting)
Default value 10
many_columns_count
¶
Number of columns beyond which reduce expensive tasks at cost of some accuracy. (Number)
Default value 400
columns_count_interpretable
¶
Number of columns beyond which do not set default knobs to high interpretability even if bigger data. (Number)
Default value 200
fast_approx_num_trees
¶
fast_approx_num_trees (Number) (Expert Setting)
Default value 250
Max. number of trees to use for fast_approx=True (e.g., for AutoDoc/MLI).
fast_approx_do_one_fold
¶
fast_approx_do_one_fold (Boolean) (Expert Setting)
Default value True
Whether to speed up fast_approx=True further, by using only one fold out of all cross-validation folds (e.g., for AutoDoc/MLI).
fast_approx_do_one_model
¶
fast_approx_do_one_model (Boolean) (Expert Setting)
Default value False
Whether to speed up fast_approx=True further, by using only one model out of all ensemble models (e.g., for AutoDoc/MLI).
fast_approx_contribs_num_trees
¶
fast_approx_contribs_num_trees (Number) (Expert Setting)
Default value 50
Max. number of trees to use for fast_approx_contribs=True (e.g., for 〈Fast Approximation〉 in GUI when making Shapley predictions, and for AutoDoc/MLI).
fast_approx_contribs_do_one_fold
¶
fast_approx_contribs_do_one_fold (Boolean) (Expert Setting)
Default value True
Whether to speed up fast_approx_contribs=True further, by using only one fold out of all cross-validation folds (e.g., for 〈Fast Approximation〉 in GUI when making Shapley predictions, and for AutoDoc/MLI).
fast_approx_contribs_do_one_model
¶
fast_approx_contribs_do_one_model (Boolean) (Expert Setting)
Default value True
Whether to speed up fast_approx_contribs=True further, by using only one model out of all ensemble models (e.g., for 〈Fast Approximation〉 in GUI when making Shapley predictions, and for AutoDoc/MLI).
prediction_logging_interval
¶
prediction_logging_interval (Number)
Default value 300
Approximate interval between logging of progress updates when making predictions. >=0 to enable, -1 to disable.
use_187_prob_logic
¶
use_187_prob_logic (Boolean)
Default value True
Whether to use exploit-explore logic like DAI 1.8.x. False will explore more.
enable_ohe_linear
¶
enable_ohe_linear (Boolean)
Default value False
Whether to enable cross-validated OneHotEncoding+LinearModel transformer
num_as_cat_false_if_ohe
¶
num_as_cat_false_if_ohe (Boolean)
Default value True
no_ohe_try
¶
no_ohe_try (Boolean)
Default value True
tensorflow_num_classes_small_data_factor
¶
tensorflow_num_classes_small_data_factor (Number)
Default value 3
tensorflow_num_classes_big_data_reduction_factor
¶
tensorflow_num_classes_big_data_reduction_factor (Number)
Default value 6
max_varimp_to_save
¶
max_varimp_to_save (Number)
Default value 100
Max. number of top variable importances to save per iteration (GUI can only display a max. of 14)
config_overrides
¶
Add to config.toml via toml string (String) (Expert Setting)
Default value ''
Instructions for 〈Add to config.toml via toml string〉 in the GUI expert page. Self-referential toml parameter, for setting any other toml parameters as a string of tomls separated by \n (spaces around \n are ok).
Useful when a toml parameter is not in expert mode but per-experiment control is wanted. Setting this will override all other choices. In the expert page, each time expert options are saved, the new state is set without memory of any prior settings. The entered item is a fully compliant toml string that would be processed directly by toml.load(). One should include 2 double quotes around the entire setting, or double quotes need to be escaped. One enters into the expert page text as follows:
e.g. enable_glm="off" \n enable_xgboost_gbm="off" \n enable_lightgbm="on"
e.g. ""enable_glm="off" \n enable_xgboost_gbm="off" \n enable_lightgbm="off" \n enable_tensorflow="on"""
e.g. fixed_num_individuals=4
e.g. params_lightgbm="{'objective':'poisson'}"
e.g. ""params_lightgbm="{'objective':'poisson'}"""
e.g. max_cores=10 \n data_precision="float32" \n max_rows_feature_evolution=50000000000 \n ensemble_accuracy_switch=11 \n feature_engineering_effort=1 \n target_transformer="identity" \n tournament_feature_style_accuracy_switch=5 \n params_tensorflow="{'layers': (100, 100, 100, 100, 100, 100)}"
e.g. ""max_cores=10 \n data_precision="float32" \n max_rows_feature_evolution=50000000000 \n ensemble_accuracy_switch=11 \n feature_engineering_effort=1 \n target_transformer="identity" \n tournament_feature_style_accuracy_switch=5 \n params_tensorflow="{'layers': (100, 100, 100, 100, 100, 100)}"""
If you see 《toml.TomlDecodeError》, ensure the toml is set correctly. When set in the expert page of an experiment, these changes only affect experiments and not the server. Usually this should be kept as an empty string in this toml file.
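When many overrides are needed, it may be easier to build the string programmatically than to type it by hand. The helper below is a hypothetical sketch that renders a dict as newline-separated toml assignments; it emits real newline characters, and whether a given entry point expects those or a typed separator is not stated here, so treat that as an assumption.

```python
def build_config_overrides(settings):
    """Render a dict as toml assignments separated by newlines."""
    parts = []
    for key, value in settings.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"  # toml booleans are lowercase
        elif isinstance(value, str):
            # Escape backslashes and double quotes for a toml basic string.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        else:
            rendered = str(value)
        parts.append(f"{key}={rendered}")
    return " \n ".join(parts)

overrides = build_config_overrides({
    "enable_glm": "off",
    "enable_lightgbm": "on",
    "fixed_num_individuals": 4,
})
```

String values such as params_lightgbm can be passed the same way, since single quotes inside the value need no escaping in a double-quoted toml string.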
delete_preview_trans_timings
¶
delete_preview_trans_timings (Boolean)
Default value True
Whether to delete preview timings if transformer timings were written.
use_random_text_file
¶
use_random_text_file (Boolean)
Default value False
runtime_estimation_train_frame
¶
runtime_estimation_train_frame (String)
Default value ''
enable_bad_scorer
¶
enable_bad_scorer (Boolean)
Default value False
debug_col_dict_prefix
¶
debug_col_dict_prefix (String)
Default value ''
return_early_debug_col_dict_prefix
¶
return_early_debug_col_dict_prefix (Boolean)
Default value False
return_early_debug_preview
¶
return_early_debug_preview (Boolean)
Default value False
wizard_random_attack
¶
wizard_random_attack (Boolean)
Default value False
wizard_enable_back_button
¶
wizard_enable_back_button (Boolean)
Default value True
wizard_deployment
¶
Global preset of deployment option for Experiment Wizard. Set to non-empty string to enable. (String)
Default value ''
wizard_repro_level
¶
Global preset of repro level option for Experiment Wizard. Set to 1, 2, 3 to enable. (Number)
Default value -1
wizard_sample_size
¶
Max. number of rows for experiment wizard dataset samples. 0 to disable sampling. (Number)
Default value 100000
wizard_model
¶
Type of model for experiment wizard to compute variable importances and leakage checks. (String)
Default value 'rf'
wizard_max_cols
¶
wizard_max_cols (Number)
Default value 100000
Maximum number of columns allowed to start an experiment. This threshold exists to constrain the complexity and length of Driverless AI's processes.
wizard_timeout_preview
¶
wizard_timeout_preview (Number)
Default value 30
How many seconds to allow preview to take for Wizard.
wizard_timeout_leakage
¶
wizard_timeout_leakage (Number)
Default value 60
How many seconds to allow leakage detection to take for Wizard.
wizard_timeout_dups
¶
wizard_timeout_dups (Number)
Default value 30
How many seconds to allow duplicate row detection to take for Wizard.
wizard_timeout_varimp
¶
wizard_timeout_varimp (Number)
Default value 30
How many seconds to allow variable importance calculation to take for Wizard.
wizard_timeout_schema
¶
wizard_timeout_schema (Number)
Default value 60
How many seconds to allow dataframe schema calculation to take for Wizard.
autoviz_enable_recommendations
¶
Autoviz Use Recommended Transformations (Boolean)
Default value True
When enabled, experiment will try to use feature transformations recommended by Autoviz
autoviz_recommended_transformation
¶
Autoviz Recommended Transformations (Dict) (Expert Setting)
Default value {}
Key-value pairs of column names and the transformations that Autoviz recommended for them.
last_recipe
¶
last_recipe (String) (Expert Setting)
Default value ''
Internal helper that remembers whether the recipe was changed.
mojo_acceptance_test_mojo_types
¶
MOJO types to test at end of experiment (List) (Expert Setting)
Default value ['C++', 'Java']
Which MOJO runtimes should be tested as part of the mini acceptance tests
make_mojo_scoring_pipeline_for_features_only
¶
Create MOJO for feature engineering pipeline only (no predictions) (Boolean) (Expert Setting)
Default value False
Create MOJO for feature engineering pipeline only (no predictions)
mojo_replace_target_encoding_with_grouped_input_cols
¶
Replaces target encoding features with concatenated input features. (Boolean) (Expert Setting)
Default value False
Replaces target encoding features with their input columns. Instead of CVTE_Age:Income:Zip, this will create Age:Income:Zip. Only applies when make_mojo_scoring_pipeline_for_features_only is enabled.
predictions_as_transform_only
¶
Generate transformation when making predictions (Boolean) (Expert Setting)
Default value False
Use pipeline to generate transformed features, when making predictions, bypassing the model that usually converts transformed features into predictions.
time_series_causal_split_recipe
¶
Whether causal recipe is used for non-lag-based recipe (Boolean)
Default value False
Whether causal splits are used when time_series_recipe is false, or whether to use the same train-gap-test splits when lag transformers are disabled (default behavior). For train-test gap, period, etc. to be used when the lag-based recipe is disabled, this must be false.
use_lags_if_causal_recipe
¶
Use lag transformers when using causal time-series recipe (Boolean)
Default value False
Whether to use lag transformers when using causal-split for validation (as occurs when not using the time-based lag recipe). If there are no time group columns, lag transformers will still use the time column as the sole time group column.
min_ymd_timestamp
¶
min_ymd_timestamp (Number)
Default value 19000101
Earliest allowed datetime (in %Y%m%d format) for which to allow automatic conversion of integers to a time column during parsing. For example, 2010 or 201004 or 20100402 or 201004022312 can be converted to a valid date/datetime, but 1000 or 100004 or 10000402 or 10004022313 can not, and neither can 201000 or 20100500 etc.
max_ymd_timestamp
¶
max_ymd_timestamp (Number)
Default value 21000101
Latest allowed datetime (in %Y%m%d format) for which to allow automatic conversion of integers to a time column during parsing. For example, 2010 or 201004 or 20100402 can be converted to a valid date/datetime, but 3000 or 300004 or 30000402 or 30004022313 can not, and neither can 201000 or 20100500 etc.
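The examples given for min_ymd_timestamp and max_ymd_timestamp can be reproduced with a short sketch. This is an illustration only, not the parser Driverless AI uses: the mapping from digit count to format, the inclusive bounds, and the function name `is_convertible` are all assumptions made for the example.

```python
from datetime import datetime

# Assumed candidate formats, chosen by the number of digits in the integer.
FORMATS_BY_LENGTH = {4: "%Y", 6: "%Y%m", 8: "%Y%m%d", 12: "%Y%m%d%H%M"}


def is_convertible(value: int,
                   min_ymd: int = 19000101,
                   max_ymd: int = 21000101) -> bool:
    """Return True if `value` parses as a date/datetime within [min_ymd, max_ymd]."""
    text = str(value)
    fmt = FORMATS_BY_LENGTH.get(len(text))
    if fmt is None:
        return False  # digit count matches no candidate format
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return False  # e.g. month 00 or day 00
    lower = datetime.strptime(str(min_ymd), "%Y%m%d")
    upper = datetime.strptime(str(max_ymd), "%Y%m%d")
    return lower <= parsed <= upper


accepted = [2010, 201004, 20100402, 201004022312]
too_early = [1000, 100004, 10000402]
too_late = [3000, 300004, 30000402]
malformed = [201000, 20100500]

for v in accepted + too_early + too_late + malformed:
    print(v, is_convertible(v))
```

With the default bounds, only the values in `accepted` return True; the others fail either the range check or the format check.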
max_rows_datetime_format_detection
¶
max_rows_datetime_format_detection (Number)
Default value 100000
Maximum number of data samples (randomly selected rows) used for date/datetime format detection.
disallowed_datetime_formats
¶
List of disallowed datetime formats. (List)
Default value ['%y']
Manually disables certain datetime formats during data ingest and experiments. For example, ['%y'] will avoid parsing columns that contain '00', '01', '02' string values as a date column.
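To see why '%y' is disallowed by default, consider how a two-digit year format treats short numeric strings. The sketch below uses Python's `datetime.strptime` for illustration; the `usable_formats` helper is hypothetical and only mimics the idea of filtering candidate formats, not the product's actual detection logic.

```python
from datetime import datetime


def usable_formats(candidates, disallowed=("%y",)):
    """Hypothetical filter: drop any candidate format listed as disallowed."""
    return [fmt for fmt in candidates if fmt not in disallowed]


values = ["00", "01", "02"]

# With '%y' allowed, these plain codes would all parse as valid years.
parsed_years = [datetime.strptime(v, "%y").year for v in values]
print(parsed_years)

# With '%y' disallowed, only the remaining formats are tried.
remaining = usable_formats(["%y", "%Y%m%d", "%Y-%m-%d"])
print(remaining)
```

A column of category codes such as '00', '01', '02' would therefore look like a date column under '%y', which is usually not what is intended.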
use_datetime_cache
¶
use_datetime_cache (Boolean)
Default value True
Whether to use datetime cache
datetime_cache_min_rows
¶
datetime_cache_min_rows (Number)
Default value 10000
Minimum number of rows required to use the datetime cache
holiday_country
¶
holiday_country (String)
Default value ''
blend_in_link_space
¶
Whether to blend ensembles in link space (applies to classification only) (Boolean) (Expert Setting)
Default value True
Whether to blend ensembles in link space, so that the inverse link function can be applied to get predictions after blending. This allows Shapley values to sum up to the final predictions after applying the inverse link function:
preds = inverse_link(blend(base learner predictions in link space)) = inverse_link(sum(blend(base learner shapley values in link space))) = inverse_link(sum(ensemble shapley values in link space))
For binary classification, this is only supported if inverse_link = logistic = 1/(1+exp(-x)).
For multiclass classification, this is only supported if inverse_link = softmax = exp(x)/sum(exp(x)).
For regression, this behavior happens naturally if all base learners use the identity link function; otherwise it is not possible.
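The identity behind blend_in_link_space can be checked numerically for the binary case. The sketch below is a toy illustration with made-up numbers, assuming a linear (weighted-sum) blender and two base learners whose Shapley values, including the bias term, are expressed in link (logit) space. All variable names are hypothetical.

```python
import math


def logistic(x: float) -> float:
    """Inverse link for binary classification: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Per-feature Shapley values in link space for one row; last entry is the bias term.
shap_a = [1.5, 0.7, -0.2]   # base learner A, link-space prediction = 2.0
shap_b = [-0.6, -0.9, 0.5]  # base learner B, link-space prediction = -1.0
weights = [0.5, 0.5]        # assumed blending weights

link_a, link_b = sum(shap_a), sum(shap_b)

# Blend in link space, then apply the inverse link once.
pred_link_blend = logistic(weights[0] * link_a + weights[1] * link_b)

# Ensemble Shapley values are the same weighted blend of base Shapley values.
ensemble_shap = [weights[0] * a + weights[1] * b for a, b in zip(shap_a, shap_b)]
pred_from_shap = logistic(sum(ensemble_shap))

# For contrast: blending probabilities loses the additive Shapley property.
pred_prob_blend = weights[0] * logistic(link_a) + weights[1] * logistic(link_b)

print(pred_link_blend, pred_from_shap, pred_prob_blend)
```

The first two values agree, while the probability-space blend differs, so the blended Shapley values would no longer map to the final prediction through the inverse link.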
tgc_via_ui_max_ncols
¶
tgc_via_ui_max_ncols (Number)
Default value 10
Maximum number of columns sent from the UI to the backend in order to auto-detect time group columns (TGC)
tgc_dup_tolerance
¶
tgc_dup_tolerance (Float)
Default value 0.01
Maximum frequency of duplicated timestamps for TGC detection