Experiment Settings¶
This section includes settings that can be used to customize the experiment, such as total runtime, reproducibility level, pipeline building, feature brain control, additional config.toml settings, and more.
max_runtime_minutes
¶
Max Runtime in Minutes Before Triggering the Finish Button
Specify the maximum runtime in minutes for an experiment. This is equivalent to pushing the Finish button once half of the specified time value has elapsed. Note that the overall enforced runtime is only an approximation.
This value defaults to 1440, which is the equivalent of a 24 hour approximate overall runtime. The Finish button will be automatically selected once 12 hours have elapsed, and Driverless AI will subsequently attempt to complete the overall experiment in the remaining 12 hours. Set this value to 0 to disable this setting.
Note that this setting applies per experiment, so when building n leaderboard models it applies to each experiment separately (i.e., the total allowed runtime is n×24 hours; this estimate assumes the experiments run sequentially, one at a time).
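The per-experiment budget multiplies across a leaderboard. A minimal sketch of that arithmetic (plain Python, not DAI code):

```python
# Sketch: estimated total runtime budget for a leaderboard when each
# experiment gets the full max_runtime_minutes budget. Assumes the
# experiments run sequentially, as noted above.
def total_leaderboard_minutes(n_experiments, max_runtime_minutes=1440):
    return n_experiments * max_runtime_minutes

print(total_leaderboard_minutes(3))  # 4320 minutes, i.e. 3 days at the default budget
```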
max_runtime_minutes_until_abort
¶
Max Runtime in Minutes Before Triggering the Abort Button
Specify the maximum runtime in minutes for an experiment before triggering the abort button. This option preserves experiment artifacts that have been generated for the summary and log zip files while continuing to generate additional artifacts. This value defaults to 10080 mins (7 days).
Note that this setting applies per experiment, so when building n leaderboard models it applies to each experiment separately (i.e., the total allowed runtime is n×7 days; this estimate assumes the experiments run sequentially, one at a time). Also see time_abort.
time_abort
¶
Time to Trigger the ‘Abort’ Button
If the experiment is not done by this time, push the abort button. Note that this applies to the leaderboard as well: if all leaderboard experiments are not done by this time, the entire leaderboard is aborted. Also see max_runtime_minutes_until_abort for control over per-experiment abort times.
This accepts a time in the format given by time_abort_format (defaults to %Y-%m-%d %H:%M:%S) and assumes the timezone set by time_abort_timezone in config.toml (defaults to UTC). You can also specify integer seconds since 1970-01-01 00:00:00 UTC.
This applies to the time on the DAI worker that runs the experiments. Similar to max_runtime_minutes_until_abort, time_abort preserves the experiment artifacts generated so far for the summary and log zip files. If you clone this experiment to rerun/refit/restart it, this absolute time also applies to the new experiment or set of leaderboard experiments.
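The two accepted forms of time_abort can be illustrated with Python's datetime module; the format string below is the documented default time_abort_format, and UTC is the documented default time_abort_timezone:

```python
from datetime import datetime, timezone

abort_str = "2024-07-01 00:00:00"  # example value in the default format
fmt = "%Y-%m-%d %H:%M:%S"          # default time_abort_format

# The equivalent integer seconds since 1970-01-01 00:00:00 UTC,
# which time_abort also accepts directly.
abort_epoch = int(datetime.strptime(abort_str, fmt)
                  .replace(tzinfo=timezone.utc)
                  .timestamp())
print(abort_epoch)  # 1719792000
```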
pipeline-building-recipe
¶
Pipeline Building Recipe
Specify the Pipeline Building recipe type (overrides GUI settings). Select from the following:
Auto: Specifies that all models and features are automatically determined by experiment settings, config.toml settings, and the feature engineering effort. (Default)
Compliant: Similar to Auto except for the following:
Interpretability is set to 10.
Only uses GLM or booster as ‘gblinear’.
Fixed ensemble level is set to 0.
Feature brain level is set to 0.
Max feature interaction depth is set to 1 (i.e., no interactions).
Target transformer is set to ‘identity’ for regression.
Does not use distribution shift detection.
monotonicity_constraints_correlation_threshold is set to 0.
monotonic_gbm: Similar to Auto except for the following:
Enables monotonicity constraints
Only uses LightGBM model.
Drops features that are not correlated with target by at least 0.01. See monotonicity-constraints-drop-low-correlation-features and monotonicity-constraints-correlation-threshold.
Does not build an ensemble model (i.e., sets fixed_ensemble_level=0).
No feature brain is used to ensure every restart is identical.
Interaction depth is set to 1 (i.e., no multi-feature interactions are performed, to avoid complexity).
No target transformations are applied for regression problems (i.e., sets target_transformer to ‘identity’). The equivalent config.toml parameter is recipe=['monotonic_gbm'].
The num_as_cat feature transformation is disabled.
List of included_transformers:
‘OriginalTransformer’,  # numeric (no clustering, no interactions, no num->cat)
‘CatOriginalTransformer’, ‘RawTransformer’, ‘CVTargetEncodeTransformer’, ‘FrequentTransformer’, ‘WeightOfEvidenceTransformer’, ‘OneHotEncodingTransformer’,  # categorical (but no num->cat)
‘CatTransformer’, ‘StringConcatTransformer’,  # big data only
‘DateOriginalTransformer’, ‘DateTimeOriginalTransformer’, ‘DatesTransformer’, ‘DateTimeDiffTransformer’, ‘IsHolidayTransformer’, ‘LagsTransformer’, ‘EwmaLagsTransformer’, ‘LagsInteractionTransformer’, ‘LagsAggregatesTransformer’,  # dates/time
‘TextOriginalTransformer’, ‘TextTransformer’, ‘StrFeatureTransformer’, ‘TextCNNTransformer’, ‘TextBiGRUTransformer’, ‘TextCharCNNTransformer’, ‘BERTTransformer’,  # text
‘ImageOriginalTransformer’, ‘ImageVectorizerTransformer’  # image
For reference also see Monotonicity Constraints in Driverless AI.
Kaggle: Similar to Auto except for the following:
Any external validation set is concatenated with the train set, with the target marked as missing.
The test set is concatenated with the train set, with the target marked as missing.
Transformers that do not use the target are allowed to fit_transform across the entirety of the train, validation, and test sets.
Several config.toml expert options are opened up to their limits.
nlp_model: Only enable NLP BERT models based on PyTorch to process pure text. To avoid slowdown when using this recipe, enabling one or more GPUs is strongly recommended. For more information, see NLP in Driverless AI.
included_models = [‘TextBERTModel’, ‘TextMultilingualBERTModel’, ‘TextXLNETModel’, ‘TextXLMModel’,’TextRoBERTaModel’, ‘TextDistilBERTModel’, ‘TextALBERTModel’, ‘TextCamemBERTModel’, ‘TextXLMRobertaModel’]
enable_pytorch_nlp_transformer = ‘off’
enable_pytorch_nlp_model = ‘on’
nlp_transformer: Only enable PyTorch based BERT transformers that process pure text. To avoid slowdown when using this recipe, enabling one or more GPUs is strongly recommended. For more information, see NLP in Driverless AI.
included_transformers = [‘BERTTransformer’]
excluded_models = [‘TextBERTModel’, ‘TextMultilingualBERTModel’, ‘TextXLNETModel’, ‘TextXLMModel’,’TextRoBERTaModel’, ‘TextDistilBERTModel’, ‘TextALBERTModel’, ‘TextCamemBERTModel’, ‘TextXLMRobertaModel’]
enable_pytorch_nlp_transformer = ‘on’
enable_pytorch_nlp_model = ‘off’
image_model: Only enable image models that process pure images (ImageAutoModel). To avoid slowdown when using this recipe, enabling one or more GPUs is strongly recommended. For more information, see Automatic Image Model.
Notes:
This option disables the Genetic Algorithm (GA).
Image insights are only available when this option is selected.
image_transformer: Only enable the ImageVectorizer transformer, which processes pure images. For more information, see Embeddings Transformer (Image Vectorizer).
unsupervised: Only enable unsupervised transformers, models and scorers. See for reference.
gpus_max: Maximize use of GPUs (e.g. use XGBoost, RAPIDS, Optuna hyperparameter search, etc. that run on GPUs).
Each pipeline building recipe mode can be chosen and then fine-tuned using expert settings. Changing the pipeline building recipe resets all pipeline building recipe options back to their defaults and then re-applies the specific rules for the new mode, which undoes any fine-tuning of expert options that are part of the pipeline building recipe rules.
If you choose to run a new/continued/refitted/retrained experiment from a parent experiment, the recipe rules are not re-applied and any fine-tuning is preserved. To reset the recipe behavior, switch between ‘auto’ and the desired mode; the new child experiment will then use the default settings for the chosen recipe.
enable_genetic_algorithm
¶
Enable Genetic Algorithm for Selection and Tuning of Features and Models
Specify whether to enable genetic algorithm for selection and hyper-parameter tuning of features and models:
auto: The default value. This is the same as ‘on’ unless it is a pure NLP or image experiment.
on: Driverless AI genetic algorithm is used for feature engineering and model tuning and selection.
Optuna: When ‘Optuna’ is selected, model hyperparameters are tuned with Optuna, and the Driverless AI genetic algorithm is used for feature engineering. In the Optuna case, the scores shown in the iteration panel are the best score and the trial scores. Optuna mode currently only uses Optuna for XGBoost, LightGBM, and CatBoost (custom recipe). If the pruner is enabled (the default), Optuna mode disables mutations of the evaluation metric (eval_metric) so that pruning uses the same metric across trials for comparison.
off: When set to ‘off’, the final pipeline is trained using the default feature engineering and feature selection.
The equivalent config.toml parameter is enable_genetic_algorithm.
tournament_style
¶
Tournament Model for Genetic Algorithm
Select a method to decide which models are best at each iteration. This is set to Auto by default. Choose from the following:
auto: Choose based upon accuracy and interpretability.
uniform: All individuals in the population compete to win as best (this can lead to all, e.g., LightGBM models in the final ensemble, which may not improve ensemble performance due to lack of diversity).
fullstack: Choose from optimal model and feature types.
feature: Individuals with similar feature types compete (good if target encoding, frequency encoding, and other feature sets lead to good results).
model: Individuals with the same model type compete (good if multiple models do well but some models that do not do as well still contribute to improving the ensemble).
For each case, a round robin approach is used to choose best scores among type of models to choose from.
If enable_genetic_algorithm==’Optuna’, then every individual is self-mutated without any tournament during the genetic algorithm. The tournament is only used to prune-down individuals for, e.g., tuning -> evolution and evolution -> final model.
make_python_scoring_pipeline
¶
Make Python Scoring Pipeline
Specify whether to automatically build a Python Scoring Pipeline for the experiment. Select On or Auto (default) to make the Python Scoring Pipeline immediately available for download when the experiment is finished. Select Off to disable the automatic creation of the Python Scoring Pipeline.
make_mojo_scoring_pipeline
¶
Make MOJO Scoring Pipeline
Specify whether to automatically build a MOJO (Java) Scoring Pipeline for the experiment. Select On to make the MOJO Scoring Pipeline immediately available for download when the experiment is finished. With this option, any capabilities that prevent the creation of the pipeline are dropped. Select Off to disable the automatic creation of the MOJO Scoring Pipeline. Select Auto (default) to attempt to create the MOJO Scoring Pipeline without dropping any capabilities.
mojo_for_predictions
¶
Allow Use of MOJO for Making Predictions
Specify whether to use MOJO for making fast, low-latency predictions after the experiment has finished. When this is set to Auto (default), the MOJO is only used if the number of rows is equal to or below the value specified by mojo_for_predictions_max_rows.
reduce_mojo_size
¶
Attempt to Reduce the Size of the MOJO (Small MOJO)
Specify whether to attempt to create a small MOJO scoring pipeline when the experiment is being built. A smaller MOJO has a smaller memory footprint during scoring. This setting attempts to reduce the MOJO size by limiting the experiment’s maximum interaction depth to 3, setting the ensemble level to 0 (i.e., no ensemble model for the final pipeline), and limiting the maximum number of features in the model to 200. Note that these settings can in some cases affect the overall model’s predictive accuracy, as they limit the complexity of the feature engineering and model building space.
This is disabled by default. The equivalent config.toml setting is reduce_mojo_size.
make_pipeline_visualization
¶
Make Pipeline Visualization
Specify whether to create a visualization of the scoring pipeline at the end of an experiment. This is set to Auto by default. Note that the Visualize Scoring Pipeline feature is experimental and is not available for deprecated models. Visualizations are available for all newly created experiments.
benchmark_mojo_latency
¶
Measure MOJO Scoring Latency
Specify whether to measure the MOJO scoring latency at the time of MOJO creation. This is set to Auto by default. In this case, MOJO scoring latency will be measured if the pipeline.mojo file size is less than 100 MB.
mojo_building_timeout
¶
Timeout in Seconds to Wait for MOJO Creation at End of Experiment
Specify the amount of time in seconds to wait for MOJO creation at the end of an experiment. If the MOJO creation process times out, a MOJO can still be made from the GUI or the R and Python clients (the timeout constraint is not applied to these). This value defaults to 1800 sec (30 minutes).
mojo_building_parallelism
¶
Number of Parallel Workers to Use During MOJO Creation
Specify the number of parallel workers to use during MOJO creation. Higher values can speed up MOJO creation but use more memory. Set this value to -1 (default) to use all physical cores.
kaggle_username
¶
Kaggle Username
Optionally specify your Kaggle username to enable automatic submission and scoring of test set predictions. If this option is specified, then you must also specify a value for the Kaggle Key option. If you don’t have a Kaggle account, you can sign up at https://www.kaggle.com.
kaggle_key
¶
Kaggle Key
Specify your Kaggle API key to enable automatic submission and scoring of test set predictions. If this option is specified, then you must also specify a value for the Kaggle Username option. For more information on obtaining Kaggle API credentials, see https://github.com/Kaggle/kaggle-api#api-credentials.
kaggle_timeout
¶
Kaggle Submission Timeout in Seconds
Specify the Kaggle submission timeout in seconds. This value defaults to 120 sec.
min_num_rows
¶
Min Number of Rows Needed to Run an Experiment
Specify the minimum number of rows that a dataset must contain in order to run an experiment. This value defaults to 100.
reproducibility_level
¶
Reproducibility Level
Specify one of the following levels of reproducibility. Note that this setting is only used when the Reproducible option is enabled in the experiment:
1 = Same experiment results for same O/S, same CPU(s), and same GPU(s) (Default)
2 = Same experiment results for same O/S, same CPU architecture, and same GPU architecture
3 = Same experiment results for same O/S, same CPU architecture (excludes GPUs)
4 = Same experiment results for same O/S (best approximation)
This value defaults to 1.
seed
¶
Random Seed
Specify a random seed for the experiment. When a seed is defined and the reproducible button is enabled (it is not by default), the algorithm behaves deterministically.
allow_different_classes_across_fold_splits
¶
Allow Different Sets of Classes Across All Train/Validation Fold Splits
(Note: Applicable for multiclass problems only.) Specify whether to allow different sets of classes across all train/validation fold splits. This is enabled by default.
save_validation_splits
¶
Store Internal Validation Split Row Indices
Specify whether to store internal validation split row indices. This includes pickles of (train_idx, valid_idx) tuples (numpy row indices for original training data) for all internal validation folds in the experiment summary ZIP file. Enable this setting for debugging purposes. This setting is disabled by default.
max_num_classes
¶
Max Number of Classes for Classification Problems
Specify the maximum number of classes to allow for a classification problem. A higher number of classes may make certain processes more time-consuming. Memory requirements also increase with a higher number of classes. This value defaults to 200.
max_num_classes_compute_roc
¶
Max Number of Classes to Compute ROC and Confusion Matrix for Classification Problems
Specify the maximum number of classes to use when computing the ROC and CM. When this value is exceeded, the reduction type specified by roc_reduce_type
is applied. This value defaults to 200 and cannot be lower than 2.
max_num_classes_client_and_gui
¶
Max Number of Classes to Show in GUI for Confusion Matrix
Specify the maximum number of classes to show in the GUI for CM, showing first max_num_classes_client_and_gui
labels. This value defaults to 10, but any value beyond 6 will result in visually truncated diagnostics. Note that if this value is changed in the config.toml and the server is restarted, then this setting will only modify client-GUI launched diagnostics. To control experiment plots, this value must be changed in the expert settings panel.
roc_reduce_type
¶
ROC/CM Reduction Technique for Large Class Counts
Specify the ROC confusion matrix reduction technique used for large class counts:
Rows (Default): Reduce by randomly sampling rows
Classes: Reduce by truncating classes to no more than the value specified by
max_num_classes_compute_roc
max_rows_cm_ga
¶
Maximum Number of Rows to Obtain Confusion Matrix Related Plots During Feature Evolution
Specify the maximum number of rows to obtain confusion matrix related plots during feature evolution. Note that this doesn’t limit final model calculation.
use_feature_brain_new_experiments
¶
Whether to Use Feature Brain for New Experiments
Specify whether to use feature_brain results even when running new experiments. The feature brain can be risky with some types of changes to the experiment setup, and even rescoring may be insufficient, so this is disabled (False) by default. For example, one experiment may accidentally use its external validation set as training data and get a high score; although feature_brain_reset_score='on' means the model will be rescored, it will already have seen the external validation data during training and leaked it into what it learned. When this is False, feature_brain_level only sets the possible models to use and logs/notifies, but does not use the cached feature brain models.
feature_brain_level
¶
Model/Feature Brain Level
Specify whether to use H2O.ai brain, which enables local caching and smart re-use (checkpointing) of prior experiments to generate useful features and models for new experiments. It can also be used to control checkpointing for experiments that have been paused or interrupted.
When enabled, this will use the H2O.ai brain cache if the cache file:
has any matching column names and types for a similar experiment type
has classes that match exactly
has class labels that match exactly
has basic time series choices that match
the interpretability of the cache is equal or lower
the main model (booster) is allowed by the new experiment
-1: Don’t use any brain cache (default)
0: Don’t use any brain cache but still write to cache. Use case: Want to save the model for later use, but we want the current model to be built without any brain models.
1: Smart checkpoint from the latest best individual model. Use case: Want to use the latest matching model. The match may not be precise, so use with caution.
2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time series options identically. Use case: Driverless AI scans through the H2O.ai brain cache for the best models to restart from.
3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete first iteration.
4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete first iteration.
5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations to get the best scored individuals. Note that this can be slower due to brain cache scanning if the cache is large.
When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain. In addition, the default maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file. This value defaults to 2.
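The cache-eligibility conditions above can be sketched as a simple predicate. The dictionary keys here are hypothetical and only illustrate the documented checks; they are not DAI's internal representation:

```python
# Hypothetical sketch of the documented brain-cache eligibility checks.
def cache_usable(cache, exp):
    return (
        any(col in exp["columns"] for col in cache["columns"])    # some matching column names/types
        and cache["classes"] == exp["classes"]                    # classes match exactly
        and cache["class_labels"] == exp["class_labels"]          # class labels match exactly
        and cache["time_series"] == exp["time_series"]            # basic time series choices match
        and cache["interpretability"] <= exp["interpretability"]  # cache interpretability equal or lower
        and cache["booster"] in exp["allowed_boosters"]           # main model allowed by new experiment
    )

old = {"columns": ["age", "income"], "classes": 2, "class_labels": [0, 1],
       "time_series": None, "interpretability": 5, "booster": "lightgbm"}
new = {"columns": ["age", "income", "zip"], "classes": 2, "class_labels": [0, 1],
       "time_series": None, "interpretability": 7,
       "allowed_boosters": ["lightgbm", "xgboost"]}
print(cache_usable(old, new))  # True
```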
feature_brain2
¶
Feature Brain Save Every Which Iteration
Save feature brain iterations whenever iter_num % feature_brain_iterations_save_every_iteration == 0, so that you can restart/refit with which_iteration_brain >= 0. This is disabled (0) by default.
-1: Don’t use any brain cache.
0: Don’t use any brain cache but still write to cache.
1: Smart checkpoint if an old experiment_id is passed in (for example, via running “resume one like this” in the GUI).
2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time series options identically. (default)
3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size.
4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient size.
5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations (starting from resumed experiment if chosen) in order to get the best scored individuals.
When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain. In addition, the default maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file.
feature_brain3
¶
Feature Brain Restart from Which Iteration
When performing a restart or re-fit of type feature_brain_level with a resumed ID, specify which iteration to start from instead of only the last best; -1 means use the last best. Usage:
Run one experiment with feature_brain_iterations_save_every_iteration=1 (or some other number).
Identify which iteration brain dump you want to restart/refit from.
Restart/refit from the original experiment, setting which_iteration_brain to that number here in expert settings.
Note: If restarting from a tuning iteration, this will pull in the entire scored tuning population and use it for feature evolution. This value defaults to -1.
feature_brain4
¶
Feature Brain Refit Uses Same Best Individual
Specify whether to use the same best individual when performing a refit. Disabling this setting allows the order of best individuals to be rearranged, leading to a better final result. Enabling this setting lets you view the exact same model or feature with only one new feature added. This is disabled by default.
feature_brain5
¶
Feature Brain Adds Features with New Columns Even During Retraining of Final Model
Specify whether to add additional features from new columns to the pipeline, even when performing a retrain of the final model. Use this option if you want to keep the same pipeline regardless of new columns from a new dataset. New data may lead to new dropped features due to shift or leak detection. Disable this to avoid adding any columns as new features so that the pipeline is perfectly preserved when changing data. This is enabled by default.
force_model_restart_to_defaults
¶
Restart-Refit Use Default Model Settings If Model Switches
When restarting or refitting, specify whether to use the model class’s default settings if the original model class is no longer available. If this is disabled, the original hyperparameters will be used instead. (Note that this may result in errors.) This is enabled by default.
min_dai_iterations
¶
Min DAI Iterations
Specify the minimum number of Driverless AI iterations for an experiment. This can be used during restarting, when you want to continue for longer despite a score not improving. This value defaults to 0.
target_transformer
¶
Select Target Transformation of the Target for Regression Problems
Specify whether to automatically select target transformation for regression problems. Available options include:
auto
identity
identity_noclip
center
standardize
unit_box
log
log_noclip
square
sqrt
double_sqrt
inverse
logit
sigmoid
If set to auto (default), Driverless AI automatically picks the best target transformer when the Accuracy setting is at or above the value of the tune_target_transform_accuracy_switch configuration option (defaults to 5). Selecting identity_noclip turns off all target transformations. All transformers except center, standardize, identity_noclip, and log_noclip clip predictions to the domain of the target in the training data, so avoid them if you want to enable extrapolation.
The equivalent config.toml setting is target_transformer.
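The clipping behavior mentioned above can be illustrated with a small sketch (not DAI internals): a clipping target transformer constrains predictions to the range of targets seen in training, which prevents extrapolation beyond that range:

```python
# Sketch: clip predictions to the domain of the target seen in training.
def clip_to_train_domain(preds, y_train):
    lo, hi = min(y_train), max(y_train)
    return [min(max(p, lo), hi) for p in preds]

y_train = [1.0, 5.0, 10.0]
# 0.2 is raised to the training minimum, 42.0 is lowered to the maximum.
print(clip_to_train_domain([0.2, 7.0, 42.0], y_train))  # [1.0, 7.0, 10.0]
```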
fixed_num_folds_evolution
¶
Number of Cross-Validation Folds for Feature Evolution
Specify the fixed number of cross-validation folds (if >= 2) for feature evolution. Note that the actual number of allowed folds can be less than the specified value, and that the number of allowed folds is determined at the time an experiment is run. This value defaults to -1 (auto).
fixed_num_folds
¶
Number of Cross-Validation Folds for Final Model
Specify the fixed number of cross-validation folds (if >= 2) for the final model. Note that the actual number of allowed folds can be less than the specified value, and that the number of allowed folds is determined at the time an experiment is run. This value defaults to -1 (auto).
fixed_only_first_fold_model
¶
Force Only First Fold for Models
Specify whether to force only the first fold for models. Select from Auto (default), On, or Off. Set this to On to force only the first fold for models; this is useful for quick runs regardless of the data.
feature_evolution_data_size
¶
Max Number of Rows Times Number of Columns for Feature Evolution Data Splits
Specify the maximum number of rows allowed for feature evolution data splits (not for the final pipeline). This value defaults to 100,000,000.
final_pipeline_data_size
¶
Max Number of Rows Times Number of Columns for Reducing Training Dataset
Specify the upper limit on the number of rows times the number of columns for training the final pipeline. This value defaults to 500,000,000.
max_validation_to_training_size_ratio_for_final_ensemble
¶
Maximum Size of Validation Data Relative to Training Data
Specify the maximum size of the validation data relative to the training data. Smaller values can make the final pipeline model training process quicker. Note that final model predictions and scores will always be provided on the full dataset provided. This value defaults to 2.0.
force_stratified_splits_for_imbalanced_threshold_binary
¶
Perform Stratified Sampling for Binary Classification If the Target Is More Imbalanced Than This
For binary classification experiments, specify a threshold ratio of minority to majority class for the target column; if the target is more imbalanced than this ratio, stratified sampling is performed, otherwise random sampling is performed. This value defaults to 0.01. You can choose to always perform random sampling by setting this value to 0, or to always perform stratified sampling by setting this value to 1.
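A sketch of the documented decision rule (not DAI's implementation): stratify when the minority-to-majority ratio is at or below the threshold, which also reproduces the documented endpoints of 0 (always random) and 1 (always stratified):

```python
# Sketch of the documented threshold rule for binary targets.
def sampling_strategy(n_minority, n_majority, threshold=0.01):
    ratio = n_minority / n_majority
    return "stratified" if ratio <= threshold else "random"

print(sampling_strategy(50, 100_000))     # heavily imbalanced -> "stratified"
print(sampling_strategy(40_000, 60_000))  # mild imbalance -> "random"
```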
config_overrides
¶
Add to config.toml via TOML String
Specify any additional configuration overrides from the config.toml file that you want to include in the experiment. (Refer to the Sample config.toml File section to view options that can be overridden during an experiment.) Setting this will override all other settings. Separate multiple config overrides with \n. For example, the following enables the Poisson distribution for LightGBM and disables target transformer tuning. Note that in this example the double quotes are escaped (\" \").
params_lightgbm=\"{'objective':'poisson'}\" \n target_transformer=identity
Or you can specify config overrides similar to the following without having to escape double quotes:
""enable_glm="off" \n enable_xgboost_gbm="off" \n enable_lightgbm="off" \n enable_tensorflow="on"""
""max_cores=10 \n data_precision="float32" \n max_rows_feature_evolution=50000000000 \n ensemble_accuracy_switch=11 \n feature_engineering_effort=1 \n target_transformer="identity" \n tournament_feature_style_accuracy_switch=5 \n params_tensorflow="{'layers': [100, 100, 100, 100, 100, 100]}"""
When running the Python client, config overrides would be set as follows:
model = h2o.start_experiment_sync(
    dataset_key=train.key,
    target_col='target',
    is_classification=True,
    accuracy=7,
    time=5,
    interpretability=1,
    config_overrides="""
    feature_brain_level=0
    enable_lightgbm="off"
    enable_xgboost_gbm="off"
    enable_ftrl="off"
    """
)
last_recipe
¶
last_recipe
Internal helper that remembers whether the recipe was changed.
feature_brain_reset_score
¶
Whether to re-score models from brain cache
Specify whether to smartly keep scores to avoid re-munging/re-training/re-scoring steps for brain models (‘auto’), to always force all steps for all brain imports (‘on’), or to never rescore (‘off’). ‘auto’ only re-scores if a difference between the current and prior experiment warrants re-scoring, such as column or metric changes. ‘on’ is useful when the smart similarity checking is not reliable enough. ‘off’ is useful when you know you want to keep exactly the same features and model for the final model refit, despite changes in seed or other behaviors in features that might change the outcome if re-scored before reaching the final model. If set to ‘off’, no limits are applied to features during brain ingestion; you can also set brain_add_features_for_new_columns to false if you want to ignore any new columns in the data, and set refit_same_best_individual to true if you want exactly the same best individual (the highest-scored model+features) to be used regardless of any scoring changes.
feature_brain_save_every_iteration
¶
Feature Brain Save every which iteration
Specify whether to save feature brain iterations whenever iter_num % feature_brain_iterations_save_every_iteration == 0, so that you can restart/refit with which_iteration_brain >= 0. Set this to 0 to disable saving.
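The modulo condition above means the brain state is written at every multiple of the configured interval; a sketch:

```python
# Sketch: iterations at which the brain state is saved under the documented
# rule iter_num % feature_brain_iterations_save_every_iteration == 0.
def saved_iterations(total_iters, save_every):
    if save_every == 0:  # 0 disables saving
        return []
    return [i for i in range(total_iters) if i % save_every == 0]

print(saved_iterations(10, 3))  # [0, 3, 6, 9]
```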
which_iteration_brain
¶
Feature Brain Restart from which iteration
When performing a restart or re-fit of type feature_brain_level with resumed_experiment_id, choose which iteration to start from instead of only the last best; -1 means use the last best.
Usage:
Run one experiment with feature_brain_iterations_save_every_iteration=1 (or some other number).
Identify which iteration brain dump you want to restart/refit from.
Restart/refit from the original experiment, setting which_iteration_brain to that number in expert settings.
Note: If restarting from a tuning iteration, this will pull in the entire scored tuning population and use it for feature evolution.
refit_same_best_individual
¶
Feature Brain refit uses same best individual
When doing a re-fit from the feature brain, if you change columns or features, the population of individuals used to refit may change the order of which individual was best, leading to a better chosen result (the False case). But sometimes you want to see exactly the same model/features with only one feature added, in which case you need to set this to True. That is, if you refit with just one extra column and have interpretability=1, the final model will use the same features, with one more engineered feature applied to that new original feature.
restart_refit_redo_origfs_shift_leak
¶
For restart-refit, select which steps to do
When doing a restart or re-fit of an experiment from the feature brain, you might sometimes change the data significantly, which warrants redoing the reduction of original features by feature selection, shift detection, and leakage detection. In other cases, if the data and all options are nearly (or exactly) identical, these steps might change the features slightly (e.g., due to the random seed if not running in reproducible mode), leading to changes in the features and the refitted model. By default, restart and refit skip these steps, assuming the data and experiment setup have not changed significantly. If check_distribution_shift is forced to on (instead of auto), this option is ignored. To ensure that the exact same final pipeline is fitted, you should also set:
brain_add_features_for_new_columns false
refit_same_best_individual true
feature_brain_reset_score ‘off’
force_model_restart_to_defaults false
The score will still be reset if the chosen experiment metric changes, but the scored model and features will otherwise be held more firmly in place.
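The four settings listed above can be combined in config.toml as, for example:

```toml
# Keep the refitted pipeline as close as possible to the original
brain_add_features_for_new_columns = false
refit_same_best_individual = true
feature_brain_reset_score = "off"
force_model_restart_to_defaults = false
```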
brain_add_features_for_new_columns
¶
Feature Brain adds features with new columns even during retraining final model
Whether to take any new columns and add additional features to the pipeline, even when retraining the final model. In some cases, one might have a new dataset but want to keep the same pipeline regardless of new columns; in that case, set this to False. For example, new data might lead to new dropped features due to shift or leakage detection. To avoid changing the feature set, one can disable all dropping of columns and set this to False to avoid adding any columns as new features, so the pipeline is perfectly preserved when changing data.
force_model_restart_to_defaults
¶
Restart-refit use default model settings if model switches
If, on restart/refit, the original model class is no longer available, be conservative and fall back to the defaults for that model class. If False, try to keep the original hyperparameters, which can fail to work in general.
dump_modelparams_every_scored_indiv
¶
Enable detailed scored model info
Whether to dump every scored individual's model parameters to a CSV/tabulated/JSON file. This produces files such as: individual_scored.params.[txt, csv, json]
fast_approx_num_trees
¶
Max number of trees to use for fast approximation
When fast_approx=True, specify the maximum number of trees to use. By default, this value is 250.
Note
By default, fast_approx is enabled for MLI and AutoDoc and disabled for Experiment predictions.
fast_approx_do_one_fold
¶
Whether to use only one fold for fast approximation
When fast_approx=True, specify whether to speed up fast approximation further by using only one fold out of all cross-validation folds. By default, this setting is enabled.
Note
By default, fast_approx is enabled for MLI and AutoDoc and disabled for Experiment predictions.
fast_approx_do_one_model
¶
Whether to use only one model for fast approximation
When fast_approx=True, specify whether to speed up fast approximation further by using only one model out of all ensemble models. By default, this setting is disabled.
Note
By default, fast_approx is enabled for MLI and AutoDoc and disabled for Experiment predictions.
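Taken together, the three fast_approx settings above can be sketched in config.toml with their stated defaults:

```toml
# Fast approximation settings (defaults as documented above)
fast_approx_num_trees = 250       # cap on trees used for fast approximation
fast_approx_do_one_fold = true    # use only one cross-validation fold
fast_approx_do_one_model = false  # use all ensemble models
```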
fast_approx_contribs_num_trees
¶
Maximum number of trees to use for fast approximation when making Shapley predictions
When fast_approx_contribs=True, specify the maximum number of trees to use for 'Fast Approximation' in the GUI when making Shapley predictions and for AutoDoc/MLI. By default, this value is 50.
Note
By default, fast_approx_contribs is enabled for MLI and AutoDoc.
fast_approx_contribs_do_one_fold
¶
Whether to use only one fold for fast approximation when making Shapley predictions
When fast_approx_contribs=True, specify whether to speed up fast_approx_contribs further by using only one fold out of all cross-validation folds for 'Fast Approximation' in the GUI when making Shapley predictions and for AutoDoc/MLI. By default, this setting is enabled.
Note
By default, fast_approx_contribs is enabled for MLI and AutoDoc.
fast_approx_contribs_do_one_model
¶
Whether to use only one model for fast approximation when making Shapley predictions
When fast_approx_contribs=True, specify whether to speed up fast_approx_contribs further by using only one model out of all ensemble models for 'Fast Approximation' in the GUI when making Shapley predictions and for AutoDoc/MLI. By default, this setting is enabled.
Note
By default, fast_approx_contribs is enabled for MLI and AutoDoc.
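The three fast_approx_contribs settings above can likewise be sketched in config.toml with their stated defaults:

```toml
# Fast approximation for Shapley predictions (defaults as documented above)
fast_approx_contribs_num_trees = 50      # cap on trees for Shapley fast approximation
fast_approx_contribs_do_one_fold = true  # use only one cross-validation fold
fast_approx_contribs_do_one_model = true # use only one ensemble model
```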
autoviz_recommended_transformation
¶
Autoviz Recommended Transformations
Key-value pairs of column names and the transformations that Autoviz recommended for them. Also see Autoviz Recommendation Transformer.
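As a hypothetical sketch (the column name and transformation below are placeholders, and the exact dictionary syntax may vary by release):

```toml
# Placeholder example: map a column name to an Autoviz-recommended transformation
autoviz_recommended_transformation = "{'income': 'log'}"
```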