Features configuration
feature_engineering_effort
Feature engineering effort (0..10) (Number) (Expert Setting)
Default value -1
How much effort to spend on feature engineering (-1…10). A heuristic combination of various developer-level toml parameters:
-1: auto (5, except 1 for wide data, in order to limit engineering)
0: keep only numeric features; only model tuning during evolution
1: keep only numeric features and frequency-encoded categoricals; only model tuning during evolution
2: like 1, but instead only Text features are excluded; some feature tuning before evolution
3: like 5, but only tuning during evolution; mixed tuning of features and model parameters
4: like 5, but slightly more focused on model tuning
5: default; balanced feature-model tuning
6-7: like 5, but slightly more focused on feature engineering
8: like 6-7, but even more focused on feature engineering, with a high feature generation rate and no feature dropping even at high interpretability
9-10: like 8, but no model tuning during feature evolution
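As an illustration, a minimal config.toml sketch (the value 8 is illustrative, not a recommendation):

```toml
# Bias the experiment toward feature engineering: level 8 uses a high
# feature generation rate and drops no features even at high
# interpretability (see the level descriptions above).
feature_engineering_effort = 8
```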
check_distribution_shift
Data distribution shift detection (String) (Expert Setting)
Default value 'auto'
Whether to enable train/valid and train/test distribution shift detection ('auto'/'on'/'off'). By default, LightGBMModel is used for shift detection if possible, unless it is turned off on the Model expert panel, in which case only the models selected in the recipe list are used.
check_distribution_shift_transformed
Data distribution shift detection on transformed features (String) (Expert Setting)
Default value 'auto'
Whether to enable train/test distribution shift detection ('auto'/'on'/'off') for final-model transformed features. By default, LightGBMModel is used for shift detection if possible, unless it is turned off on the Model expert panel, in which case only the models selected in the recipe list are used.
check_distribution_shift_drop
Data distribution shift detection drop of features (String) (Expert Setting)
Default value 'auto'
Whether to drop high-shift features (‘auto’/’on’/’off’). Auto disables for time series.
drop_features_distribution_shift_threshold_auc
Max allowed feature shift (AUC) before dropping feature (Float) (Expert Setting)
Default value 0.999
If distribution shift detection is enabled, drop features (except ID, text, date/datetime, time, weight) for which shift AUC, GINI, or Spearman correlation is above this value (e.g. AUC of a binary classifier that predicts whether given feature value belongs to train or test data)
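Taken together, a config.toml sketch of the shift-detection settings above (the 0.8 threshold is illustrative, stricter than the 0.999 default):

```toml
# Force shift detection on, including on final-model transformed
# features, and drop features whose shift AUC/GINI/Spearman is above 0.8.
check_distribution_shift = "on"
check_distribution_shift_transformed = "on"
check_distribution_shift_drop = "on"
drop_features_distribution_shift_threshold_auc = 0.8
```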
check_leakage
Leakage detection (String) (Expert Setting)
Default value 'auto'
Specify whether to check each feature for leakage ('auto'/'on'/'off'). If a fold column is used, this option checks for leakage without using the fold column. By default, LightGBMModel is used for leakage detection when possible, unless it is turned off in the Model Expert Settings tab, in which case only the models selected with the included_models option are used. Note that this option is always disabled for time series experiments.
drop_features_leakage_threshold_auc
Leakage detection dropping AUC/R2 threshold (Float) (Expert Setting)
Default value 0.999
If leakage detection is enabled, drop features for which AUC (R2 for regression), GINI, or Spearman correlation is above this value. If a fold column is present, features are not dropped, because the leakage test applies without the fold column used.
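For example, a config.toml sketch enabling leakage checks with a slightly looser drop threshold (0.95 is illustrative):

```toml
# Check each feature for leakage and drop features that alone predict
# the target too well: AUC (R2 for regression), GINI, or Spearman >= 0.95.
check_leakage = "on"
drop_features_leakage_threshold_auc = 0.95
```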
leakage_max_data_size
Max rows x columns for leakage (Number) (Expert Setting)
Default value 10000000
Max number of rows x number of columns to trigger (stratified) sampling for leakage checks
max_features_importance
Max. num. features for variable importance (Number) (Expert Setting)
Default value 100000
Specify the maximum number of features to use and show in importance tables. When Interpretability is set higher than 1, transformed or original features with lower importance than the top max_features_importance features are always removed, and their feature importances are pruned accordingly. Higher values can lead to lower performance and larger disk space usage for datasets with more than 100k columns.
enable_wide_rules
Enable Wide Rules (String) (Expert Setting)
Default value 'auto'
Enable various rules to handle wide datasets (number of columns > number of rows) ('auto'/'on'/'off'). Setting this to 'on' forces the rules to be enabled regardless of the number of columns.
wide_factor
Wide rules factor (Float) (Expert Setting)
Default value 5.0
If the number of columns > wide_factor * the number of rows, then wide rules are enabled when set to 'auto'. For columns > rows, random forest is always enabled.
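As a worked example under the default wide_factor = 5.0: with 1,000 rows, 'auto' enables wide rules once the dataset has more than 5.0 * 1,000 = 5,000 columns. Lowering the factor makes 'auto' treat moderately wide data as wide:

```toml
enable_wide_rules = "auto"
wide_factor = 2.0   # with 1,000 rows, > 2,000 columns now counts as wide
```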
orig_features_fs_report
Report permutation importance on original features (Boolean) (Expert Setting)
Default value False
Whether to obtain permutation feature importance on original features for reporting in logs and the summary zip file (as files with pattern fs_*.json or fs_*.tab.txt). This computes feature importance on a single un-tuned model (typically LightGBM with pre-defined un-tuned hyperparameters) and a simple set of features (encoding is typically frequency encoding or target encoding). Features with low importance are automatically dropped if there are many original features, or a model with feature selection by permutation importance is created if interpretability is high enough, in order to see if it gives a better score. One can manually drop low-importance features, but this can be risky, as transformers or hyperparameters might recover their usefulness. Permutation importance is obtained by:
1) Transforming categoricals to frequency- or target-encoded features.
2) Fitting that model on many folds, different data sizes, and slightly varying hyperparameters.
3) Predicting with that model for each feature, where each feature has its data shuffled.
4) Computing the score on each shuffled prediction.
5) Computing the difference between the unshuffled score and the shuffled score to arrive at a delta score.
6) Normalizing the delta score by the maximum to arrive at the variable importance.
Positive delta scores indicate the feature helped the model score, while negative delta scores indicate the feature hurt the model score. The normalized scores are stored in the fs_normalized_* files in the summary zip. The unnormalized scores (actual delta scores) are stored in the fs_unnormalized_* files in the summary zip.
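In symbols (notation ours, matching steps 5 and 6 above): if $s$ is the model's score on unshuffled data and $s_j$ its score with feature $j$'s values shuffled, then

$$\Delta_j = s - s_j, \qquad I_j = \frac{\Delta_j}{\max_k \Delta_k},$$

so the raw $\Delta_j$ are what the fs_unnormalized_* files contain, and the normalized $I_j$ are what the fs_normalized_* files contain.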
AutoDoc provides similar functionality for permutation importance on original features: it takes the specific final model of an experiment and runs the training data set through permutation importance to get original-feature importance, i.e., shuffling of original features is performed and the full pipeline is computed on each shuffled set of original features.
max_rows_fs
Maximum number of rows to perform permutation-based feature selection (Number) (Expert Setting)
Default value 500000
Maximum number of rows when doing permutation feature importance, reduced by (stratified) random sampling.
max_orig_cols_selected
Max. number of original features used (Number) (Expert Setting)
Default value 10000000
Maximum number of columns selected out of the original set of columns, using feature selection. The selection is based upon how well target encoding (or frequency encoding if not available) performs on categoricals and on numerics treated as categoricals. This is useful to reduce the final model complexity. First the best [max_orig_cols_selected] columns are found through feature selection methods, and then these features are used in feature evolution (to derive other features) and in modelling.
max_orig_numeric_cols_selected
max_orig_numeric_cols_selected (Number)
Default value 10000000
Maximum number of numeric columns selected, above which feature selection is performed; same as max_orig_cols_selected but for numeric columns.
max_orig_nonnumeric_cols_selected
Max. number of original non-numeric features (Number) (Expert Setting)
Default value -1
Maximum number of non-numeric columns selected, above which feature selection is performed on all features; same as max_orig_numeric_cols_selected but for categorical columns. If set to -1, auto mode uses max_orig_nonnumeric_cols_selected_default, which for small data can be increased up to 10x.
max_orig_cols_selected_simple_factor
max_orig_cols_selected_simple_factor (Number) (Expert Setting)
Default value 2
Factor times max_orig_cols_selected above which column selection is based upon no target encoding and no treating numerical as categorical, in order to limit the performance cost of feature engineering.
fs_orig_cols_selected
Max. number of original features used for FS individual (Number) (Expert Setting)
Default value 10000000
Like max_orig_cols_selected, but the number of columns above which a special individual with reduced original columns is added.
fs_orig_numeric_cols_selected
Num. of original numeric features to trigger feature selection model type (Number) (Expert Setting)
Default value 10000000
Like max_orig_numeric_cols_selected, but applicable to the special individual with reduced original columns. A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features.
fs_orig_nonnumeric_cols_selected
Num. of original non-numeric features to trigger feature selection model type (Number) (Expert Setting)
Default value 200
Like max_orig_nonnumeric_cols_selected, but applicable to the special individual with reduced original columns. A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features.
fs_orig_cols_selected_simple_factor
fs_orig_cols_selected_simple_factor (Number) (Expert Setting)
Default value 2
Like max_orig_cols_selected_simple_factor, but applicable to the special individual with reduced original columns.
predict_shuffle_inside_model
Allow supported models to do feature selection by permutation importance within model itself (Boolean) (Expert Setting)
Default value True
use_native_cats_for_lgbm_fs
Whether to use native categorical handling (CPU only) for LightGBM when doing feature selection by permutation (Boolean) (Expert Setting)
Default value True
orig_stddev_max_cols
Maximum number of original columns up to which will compute standard deviation of original feature importance. Can be expensive if many features. (Number) (Expert Setting)
Default value 1000
max_relative_cardinality
Max. allowed fraction of uniques for integer and categorical cols (Float) (Expert Setting)
Default value 0.95
Maximum allowed fraction of unique values for integer and categorical columns (otherwise will treat column as ID and drop)
max_absolute_cardinality
max_absolute_cardinality (Number)
Default value 1000000
Maximum allowed number of unique values for integer and categorical columns (otherwise will treat column as ID and drop)
num_as_cat
Allow treating numerical as categorical (Boolean) (Expert Setting)
Default value True
Whether to treat some numerical features as categorical. For instance, an integer column may not represent a numerical quantity but rather distinct numerical codes. Disabling this is very restrictive, since then even columns with few categorical levels that happen to be numerical in value will not be encoded like a categorical.
max_int_as_cat_uniques
Max. number of unique values for int/float to be categoricals (Number) (Expert Setting)
Default value 50
Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only)
max_int_as_cat_uniques_if_not_benford
Max. number of unique values for int/float to be categoricals if violates Benford’s Law (Number) (Expert Setting)
Default value 10000
Max number of unique values for integer/real columns to be treated as categoricals (test applies to first statistical_threshold_data_size_small rows only). Applies to integer or real numerical feature that violates Benford’s law, and so is ID-like but not entirely an ID.
max_fraction_invalid_numeric
Max. fraction of numeric values to be non-numeric (and not missing) for a column to still be considered numeric (Float) (Expert Setting)
Default value 0.0
When the fraction of non-numeric (and non-missing) values is less than or equal to this value, consider the column numeric. Can help with minor data quality issues during experimentation; > 0 is not recommended for production, since type inconsistencies can occur. Note: non-numeric values are replaced with missing values at the start of the experiment, so some information is lost, but the column is then treated as numeric, which can help. If < 0, disabled. If == 0 and the number of rows is <= max_rows_col_stats, any column of strings of numbers is converted to a numeric type.
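For example, a config.toml sketch that lets a column with occasional stray strings still be parsed as numeric during experimentation (0.02 is illustrative):

```toml
# Tolerate up to 2% non-numeric, non-missing values per column; those
# values are replaced with missing and the column is treated as numeric.
# Values > 0 are not recommended for production.
max_fraction_invalid_numeric = 0.02
```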
nfeatures_max
Max. number of engineered features (-1 = auto) (Number) (Expert Setting)
Default value -1
Maximum features per model (and per model within the final model, if an ensemble) kept. Keeps the top variable-importance features and prunes the rest away after each scoring. The final ensemble will exclude any pruned-away features and only train on kept features, but may contain a few new features due to fitting on a different data view (e.g. new clusters). The final scoring pipeline will likewise exclude any pruned-away features, but may contain a few new features due to fitting on a different data view (e.g. new clusters). -1 means no restrictions except internally-determined memory and interpretability restrictions. Notes:
* If interpretability > remove_scored_0gain_genes_in_postprocessing_above_interpretability, then every GA iteration post-processes features down to this value just after scoring them. Otherwise, only mutations of scored individuals are pruned (until the final model, where limits are strictly applied).
* If ngenes_max is not also limited, then some individuals will have more genes and features until pruned by mutation or by preparation for the final model.
* E.g. to generally limit every iteration to exactly 1 feature, one must set nfeatures_max=ngenes_max=1 and remove_scored_0gain_genes_in_postprocessing_above_interpretability=0 (as sketched after this list), but the genetic algorithm will have a harder time finding good features.
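That last note corresponds to the following config.toml sketch (an extreme setting, shown only to make the mechanics concrete):

```toml
# Force every iteration down to exactly one gene and one feature;
# post-processing then prunes right after scoring on every GA iteration.
nfeatures_max = 1
ngenes_max = 1
remove_scored_0gain_genes_in_postprocessing_above_interpretability = 0
```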
ngenes_max
Max. number of genes (transformer instances) (-1 = auto) (Number) (Expert Setting)
Default value -1
Maximum genes (transformer instances) per model (and per model within the final model, if an ensemble) kept. Controls the number of genes before features are scored, so genes are just randomly sampled if pruning occurs. If the restriction occurs after scoring features, then aggregated gene importances are used for pruning genes. Instances include all possible transformers, including the original transformer for numeric features. -1 means no restrictions except internally-determined memory and interpretability restrictions.
ngenes_min
Min. number of genes (transformer instances) (-1 = auto) (Number) (Expert Setting)
Default value -1
Like ngenes_max, but controls the minimum number of genes. Useful when DAI is by default making too few genes and many more are wanted. This can happen when one has few input features, so DAI remains conservative and does not make many transformed features, even though the user knows some transformed features may be useful. E.g. only the target encoding transformer might have been chosen, and one wants DAI to explore many more possible input features at once.
nfeatures_min
Min. number of features (-1 = auto) (Number) (Expert Setting)
Default value -1
Minimum features (transformer instances) per model (and per model within the final model, if an ensemble) kept. Instances include all possible transformers, including the original transformer for numeric features. -1 means no restrictions except internally-determined memory and interpretability restrictions.
limit_features_by_interpretability
Limit features by interpretability (Boolean) (Expert Setting)
Default value True
Whether to limit feature counts by interpretability setting via features_allowed_by_interpretability
monotonicity_constraints_interpretability_switch
Threshold for interpretability above which to enable automatic monotonicity constraints for tree models (Number) (Expert Setting)
Default value 7
Interpretability setting equal and above which will use automatic monotonicity constraints in XGBoostGBM/LightGBM/DecisionTree models.
monotonicity_constraints_log_level
Control amount of logging when calculating automatic monotonicity constraints (if enabled) (String) (Expert Setting)
Default value 'medium'
For models that support monotonicity constraints, and if enabled, show the automatically determined monotonicity constraint for each feature going into the model, based on its correlation with the target. 'low' shows only the monotonicity constraint direction. 'medium' shows the correlation of positively and negatively constrained features. 'high' shows all correlation values.
monotonicity_constraints_correlation_threshold
Correlation beyond which triggers monotonicity constraints (if enabled) (Float) (Expert Setting)
Default value 0.1
Threshold, of Pearson product-moment correlation coefficient between numerical or encoded transformed feature and target, above (below negative for) which will enforce positive (negative) monotonicity for XGBoostGBM, LightGBM and DecisionTree models. Enabled when interpretability >= monotonicity_constraints_interpretability_switch config toml value. Only if monotonicity_constraints_dict is not provided.
monotonicity_constraints_drop_low_correlation_features
Whether to drop features that have no monotonicity constraint applied (e.g., due to low correlation with target). (Boolean) (Expert Setting)
Default value False
If enabled, only monotonic features with +1/-1 constraints will be passed to the model(s), and features without monotonicity constraints (0, as set by monotonicity_constraints_dict or determined automatically) will be dropped. Otherwise all features will be in the model. Only active when interpretability >= monotonicity_constraints_interpretability_switch or monotonicity_constraints_dict is provided.
monotonicity_constraints_dict
Manual override for monotonicity constraints (Dict) (Expert Setting)
Default value {}
Manual override for monotonicity constraints. Mapping of original numeric features to the desired constraint (1 for positive, -1 for negative, or 0 to disable; True can be set for automatic handling, False is the same as 0). Features that are not listed here are treated automatically: they get no constraint (i.e., 0) if interpretability < monotonicity_constraints_interpretability_switch, and otherwise the constraint is automatically determined from the correlation between each feature and the target. Example: {'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}
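The example above, written as a config.toml override (feature names come from the example itself; dict-valued settings are shown here as a quoted literal, and the exact quoting may vary by interface):

```toml
# 1 = increasing, -1 = decreasing, 0 = unconstrained; unlisted
# features are handled automatically.
monotonicity_constraints_dict = "{'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}"
```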
max_feature_interaction_depth
Max. feature interaction depth (Number) (Expert Setting)
Default value -1
Exploring feature interactions can be important for gaining better predictive performance. Interactions can take multiple forms (e.g. feature1 + feature2, or feature1 * feature2 + … featureN). Although certain machine learning algorithms (like tree-based methods) can do well at capturing these interactions as part of their training process, generating them explicitly may still help them (or other algorithms) yield better performance. The depth of the interaction level (as in "up to" how many features may be combined at once to create one single feature) can be specified to control the complexity of the feature engineering process. For transformers that use both numeric and categorical features, this constrains the number of each type, not the total number. Higher values might produce more predictive models, at the expense of time (-1 means automatic).
fixed_feature_interaction_depth
Fixed feature interaction depth (Number) (Expert Setting)
Default value 0
Instead of sampling between the minimum and maximum number of columns allowed for each transformer (up to max_feature_interaction_depth, unless all columns are specified), which is the behavior at 0, choose a fixed non-zero number of columns to use. Can be set equal to the number of columns to use all columns for each transformer, if allowed by that transformer. A value of -n does a 50/50 mix of sampling and a fixed n features.
fixed_num_individuals
fixed_num_individuals (Number) (Expert Setting)
Default value 0
Set a fixed number of individuals (if > 0); useful to compare different hardware configurations. If you want 3 individuals in the GA race to be preserved, choose 6, since one mutatable loser is needed per surviving individual.
enable_target_encoding
Enable Target Encoding (auto disables for time series) (String) (Expert Setting)
Default value 'auto'
Whether target encoding (CV target encoding, weight of evidence, etc.) could be enabled. Target encoding refers to several different feature transformations (primarily focused on categorical data) that aim to represent the feature using information from the actual target variable. A simple example is to use the mean of the target to replace each unique category of a categorical feature. These types of features can be very predictive, but are prone to overfitting and require more memory, as they need to store mappings of the unique categories to the target values.
cvte_cv_in_cv
Enable outer CV for Target Encoding (Boolean) (Expert Setting)
Default value True
For target encoding, whether an outer level of cross-fold validation is performed, in cases when GINI is detected to flip sign (or have inconsistent sign for weight of evidence) between fit_transform on training, transform on training, and transform on validation data. The degree to which GINI is poor is also used to perform fold-averaging of look-up tables instead of using global look-up tables.
cv_in_cv_overconfidence_protection
Enable outer CV for Target Encoding with overconfidence protection (String) (Expert Setting)
Default value 'auto'
For target encoding, when an outer level of cross-fold validation is performed, increase the number of outer folds or abort target encoding when the GINI between feature and target is not close between fit_transform on training, transform on training, and transform on validation data.
enable_lexilabel_encoding
Enable Lexicographical Label Encoding (String) (Expert Setting)
Default value 'off'
enable_isolation_forest
Enable Isolation Forest Anomaly Score Encoding (String) (Expert Setting)
Default value 'off'
enable_one_hot_encoding
Enable One-Hot Encoding (auto enables only for GLM) (String) (Expert Setting)
Default value 'auto'
Whether one-hot encoding could be enabled. If auto, then only applied for small data and GLM.
binner_cardinality_limiter
binner_cardinality_limiter (Number) (Expert Setting)
Default value 50
Limit number of output features (total number of bins) created by all BinnerTransformers based on this value, scaled by accuracy, interpretability and dataset size. 0 means unlimited.
enable_binning
Enable BinnerTransformer for simple numeric binning (auto enables only for GLM/FTRL/TensorFlow/GrowNet) (String) (Expert Setting)
Default value 'auto'
Whether simple binning of numeric features should be enabled by default. If auto, then only for GLM/FTRL/TensorFlow/GrowNet for time-series or for interpretability >= 6. Binning can help linear (or simple) models by exposing more signal for features that are not linearly correlated with the target. Note that NumCatTransformer and NumToCatTransformer already do binning, but also perform target encoding, which makes them less interpretable. The BinnerTransformer is more interpretable, and also works for time series.
binner_bin_method
Select methods used to find bins for Binner Transformer (List) (Expert Setting)
Default value ['tree']
Tree uses XGBoost to find optimal split points for binning of numeric features. Quantile uses quantile-based binning. The tree method might fall back to quantile-based binning if there are too many classes or not enough unique values.
binner_minimize_bins
Enable automatic reduction of number of bins for Binner Transformer (Boolean) (Expert Setting)
Default value True
If enabled, will attempt to reduce the number of bins during binning of numeric features. Applies to both tree-based and quantile-based bins.
binner_encoding
Select encoding schemes for Binner Transformer (List) (Expert Setting)
Default value ['piecewise_linear', 'binary']
Given a set of bins (cut points along min…max), the encoding scheme converts the original numeric feature values into the values of the output columns (one column per bin, plus one extra bin for missing values, if any). Piecewise linear is 0 left of the bin, 1 right of the bin, and grows linearly from 0 to 1 inside the bin. Binary is 1 inside the bin and 0 outside the bin. Missing-value bin encoding is always binary, either 0 or 1. If there are no missing values in the data, there is no missing-value bin. Piecewise linear helps to encode growing values and keeps smooth transitions across the bin boundaries, while binary is best suited for detecting specific values in the data. Both are great at providing features to models that otherwise lack non-linear pattern detection.
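In formulas (notation ours): for a bin with cut points $b_{\mathrm{lo}} \le x < b_{\mathrm{hi}}$, the two encodings of an input value $x$ are

$$e_{\mathrm{piecewise}}(x) = \min\!\Big(1,\ \max\!\Big(0,\ \frac{x - b_{\mathrm{lo}}}{b_{\mathrm{hi}} - b_{\mathrm{lo}}}\Big)\Big), \qquad e_{\mathrm{binary}}(x) = \mathbf{1}\big[\,b_{\mathrm{lo}} \le x < b_{\mathrm{hi}}\,\big],$$

so the piecewise-linear output is 0 left of the bin, ramps linearly to 1 across it, and stays 1 to its right, as described above.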
binner_include_original
Include Original feature value as part of output of Binner Transformer (Boolean) (Expert Setting)
Default value True
If enabled (default), include the original feature value as an output feature of the BinnerTransformer. This ensures that the BinnerTransformer never has less signal than the OriginalTransformer, since the two can be chosen exclusively of each other.
isolation_forest_nestimators
Num. Estimators for Isolation Forest Encoding (Number) (Expert Setting)
Default value 200
one_hot_encoding_cardinality_threshold
one_hot_encoding_cardinality_threshold (Number) (Expert Setting)
Default value 50
Enable one-hot encoding (which bins anyway, limiting the number of bins to no more than 100) for categorical columns with fewer than this many unique values. Set to 0 to disable.
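A config.toml sketch widening one-hot encoding to higher-cardinality categoricals (80 is illustrative):

```toml
# Allow OHE for categoricals with fewer than 80 unique values (outputs
# are still binned to at most 100 bins); 0 would disable OHE entirely.
one_hot_encoding_cardinality_threshold = 80
```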
one_hot_encoding_cardinality_threshold_default_use
one_hot_encoding_cardinality_threshold_default_use (Number) (Expert Setting)
Default value 40
How many levels, at most, for one-hot encoding to be chosen by default instead of other encodings; restricted to 10x less (down to 2 levels) when the number of columns eligible for OHE exceeds 500. Note that the total number of bins is reduced for bigger data, independently of this setting.
text_as_categorical_cardinality_threshold
text_as_categorical_cardinality_threshold (Number) (Expert Setting)
Default value 1000
Treat text columns also as categorical columns if the cardinality is <= this value. Set to 0 to treat text columns only as text.
numeric_as_categorical_cardinality_threshold
numeric_as_categorical_cardinality_threshold (Number) (Expert Setting)
Default value 2
If num_as_cat is true, then treat numeric columns also as categorical columns if the cardinality is > this value. Setting to 0 allows all numeric to be treated as categorical if num_as_cat is True.
numeric_as_ohe_categorical_cardinality_threshold
numeric_as_ohe_categorical_cardinality_threshold (Number) (Expert Setting)
Default value 2
If num_as_cat is true, then treat numeric columns also as categorical columns to possibly one-hot encode if the cardinality is > this value. Setting to 0 allows all numeric columns to be treated as categorical for possible one-hot encoding if num_as_cat is True.
drop_redundant_columns_limit
Max number of columns to check for redundancy in training dataset. (Number) (Expert Setting)
Default value 1000
If the dataset has more columns than this, only the first this-many columns are checked. Set to 0 to disable.
drop_constant_columns
Drop constant columns (Boolean) (Expert Setting)
Default value True
Whether to drop columns with constant values
detect_duplicate_rows
Detect duplicate rows (Boolean) (Expert Setting)
Default value True
Whether to detect duplicate rows in the training, validation, and testing datasets. Done after type detection and dropping of redundant or missing columns across datasets, just before the experiment starts, and still before leakage detection. Any further dropping of columns can change the number of duplicate rows. This is informative only; to drop rows in training data, check the drop_duplicate_rows setting. Uses a sample size given by detect_duplicate_rows_max_rows_x_cols.
drop_duplicate_rows
Drop duplicate rows in training data (String) (Expert Setting)
Default value 'auto'
Whether to drop duplicate rows in the training data. Done at the start of Driverless AI, only considering the columns to drop as given by the user, not considering validation or training datasets or leakage or redundant columns. Any further dropping of columns can change the number of duplicate rows. Time limited by drop_duplicate_rows_timeout seconds.
'auto': same as 'off'.
'weight': If there are duplicates, convert the dropped duplicates into a weight column for training. Useful when duplicates were added to preserve some expected distribution of instances. Only allowed if no weight column is present; otherwise duplicates are just dropped.
'drop': Drop any duplicates, keeping only the first instances.
'off': Do not drop any duplicates. This may lead to over-estimation of accuracy.
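For example, to de-duplicate while preserving the training distribution, the 'weight' mode can be selected (a sketch; only valid when no weight column is present, and presumably a row occurring n times then trains once with weight n):

```toml
# Convert dropped duplicates into a weight column instead of
# discarding the information they carry about the distribution.
drop_duplicate_rows = "weight"
```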
detect_duplicate_rows_max_rows_x_cols
Limit of dataset size in rows x cols for data when detecting duplicate rows (Number) (Expert Setting)
Default value 10000000
If > 0, then acts as sampling size for informative duplicate row detection. If set to 0, will do checks for all dataset sizes.
drop_id_columns
Drop ID columns (Boolean) (Expert Setting)
Default value True
Whether to drop columns that appear to be an ID
no_drop_features
Don’t drop any columns (Boolean) (Expert Setting)
Default value False
Whether to avoid dropping any columns (original or derived)
cols_to_drop
Features to drop, e.g. ["V1", "V2", "V3"] (List) (Expert Setting)
Default value []
Direct control over which columns to drop, in bulk, so that large lists can be copy-pasted instead of selecting each column separately in the GUI.
cols_to_group_by
Features to group by, e.g. ["G1", "G2", "G3"] (List) (Expert Setting)
Default value []
Control over the columns to group by for the CVCatNumEncode Transformer. The default is an empty list, which means DAI automatically searches all columns, selected randomly or by top variable importance. The CVCatNumEncode Transformer takes a list of categoricals (or these cols_to_group_by) and uses those columns as the groups on which to perform aggregations (agg_funcs_for_group_by). See the sketch below.
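A config.toml sketch combining this with the related group-by controls documented below (column names hypothetical):

```toml
# Aggregate numerics within each (state, product) group, out of fold.
cols_to_group_by = ["state", "product"]
sample_cols_to_group_by = false          # always group by all listed columns
agg_funcs_for_group_by = ["mean", "max"] # subset of the default functions
folds_for_group_by = 5
```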
sample_cols_to_group_by
Sample from features to group by (Boolean) (Expert Setting)
Default value False
Whether to sample from given features to group by (True) or to always group by all features (False) when using cols_to_group_by.
agg_funcs_for_group_by
Aggregation functions (non-time-series) for group by operations (List) (Expert Setting)
Default value ['mean', 'sd', 'min', 'max', 'count']
Aggregation functions to use for groupby operations for CVCatNumEncode Transformer, see also cols_to_group_by and sample_cols_to_group_by.
folds_for_group_by
Number of folds to obtain aggregation when grouping (Number) (Expert Setting)
Default value 5
Out-of-fold aggregations reduce overfitting, but see less data in each fold. Controls how many folds are used by the CVCatNumEncode Transformer.
cols_to_force_in
Features to force in, e.g. ["G1", "G2", "G3"] (List) (Expert Setting)
Default value []
Control over columns to force in. Forced-in features are handled by the most interpretable transformer allowed by the experiment options, and they are never removed (although the model may still assign them 0 importance). Transformers used by default include: OriginalTransformer for numeric, CatOriginalTransformer or FrequencyTransformer for categorical, TextOriginalTransformer for text, DateTimeOriginalTransformer for date-times, DateOriginalTransformer for dates, ImageOriginalTransformer or ImageVectorizerTransformer for images, etc.
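For instance (column names hypothetical):

```toml
# Never drop these columns; each is handled by its most interpretable
# allowed transformer (e.g., OriginalTransformer for numerics).
cols_to_force_in = ["age", "income", "signup_date"]
```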
cols_to_force_in_sanitized
cols_to_force_in_sanitized (List)
Default value []
mutation_mode
Type of mutation strategy (String) (Expert Setting)
Default value 'sample'
Strategy to apply when doing mutations on transformers. Sample mode is the default, with a tendency to sample transformer parameters. Batched mode tends to do multiple types of the same transformation together. Full mode does even more types of the same transformation together.
detect_features_leakage_threshold_auc
Leakage feature detection AUC/R2 threshold (Float) (Expert Setting)
Default value 0.95
When leakage detection is enabled, if AUC (R2 for regression) on original data (label-encoded) is above or equal to this value, then trigger per-feature leakage detection
detect_features_per_feature_leakage_threshold_auc
Leakage features per feature detection AUC/R2 threshold (Float) (Expert Setting)
Default value 0.8
When leakage detection is enabled, show features for which AUC (R2 for regression, for whether that predictor/feature alone predicts the target) is above or equal to this value. Feature is dropped if AUC/R2 is above or equal to drop_features_leakage_threshold_auc
interaction_finder_gini_rel_improvement_threshold
Required GINI relative improvement for Interactions (Float) (Expert Setting)
Default value 0.5
Required GINI relative improvement for the InteractionTransformer. If GINI is not better by this relative improvement compared to the original features considered in the interaction, then the interaction is not returned. For noisy data with no clear signal in interactions, where interactions are still wanted, this number can be decreased.
interaction_finder_return_limit
Number of transformed Interactions to make (Number) (Expert Setting)
Default value 5
Number of transformed Interactions to make as best out of many generated trial interactions.
varimp_threshold_at_interpretability_10
Lowest allowed variable importance at interpretability 10 (Float) (Expert Setting)
Default value 0.001
Variable importance below which a feature is dropped (with a possible better replacement found). This also sets the overall scale for lower interpretability settings. Set to a lower value if many weak features are acceptable despite choosing high interpretability, or if a drop in performance is observed due to the need for weak features.
allow_stabilize_varimp_for_ts
Whether to allow stabilization of features using variable importance for time-series (Boolean) (Expert Setting)
Default value False
Whether to avoid setting stabilize_varimp=false and stabilize_fs=false for time series experiments.
stabilize_varimp
Whether to take minimum (True) or mean (False) of variable importance when there are multiple folds/repeats. (Boolean) (Expert Setting)
Default value True
Variable importance is used by the genetic algorithm to decide which features are useful, so this can stabilize feature selection by the genetic algorithm. It is disabled by default for time series experiments, which can have truly diverse behavior in each split. But in some cases feature selection is improved in the presence of highly shifted variables that are not handled by lag transformers, and one can set allow_stabilize_varimp_for_ts=true.
stabilize_fs
Whether to take minimum (True) or mean (False) of delta improvement in score when aggregating feature selection scores across multiple folds/depths. (Boolean) (Expert Setting)
Default value True
Delta improvement of score corresponds to the original metric minus the metric of the shuffled-feature frame if maximizing the metric, and to the negative of that score difference if minimizing. Feature selection by permutation importance considers the change in score after shuffling a feature, and using the minimum operation ignores optimistic scores in favor of pessimistic scores when aggregating over folds. Note that if using tree methods, multiple depths may be fitted, in which case, regardless of this toml setting, only features that are kept for all depths are kept by feature selection. If interpretability >= the config toml value of fs_data_vary_for_interpretability, then half the data (or the setting of fs_data_frac) is used as another fit, in which case, regardless of this toml setting, only features that are kept for all data sizes are kept by feature selection. Note: this is disabled for small data, since arbitrary slices of small data can lead to disjoint features being important, and only aggregated average behavior has signal.
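In symbols (notation ours): let $m^{(k)}$ be the metric on fold (or depth/data-size fit) $k$ and $m_f^{(k)}$ the metric after shuffling feature $f$. When maximizing, the per-fold delta improvement is $\delta_f^{(k)} = m^{(k)} - m_f^{(k)}$ (negated when minimizing), and this setting selects the aggregate

$$\delta_f = \min_k \delta_f^{(k)} \ \text{(True: pessimistic)} \qquad \text{or} \qquad \delta_f = \frac{1}{K}\sum_{k=1}^{K} \delta_f^{(k)} \ \text{(False: mean)}.$$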
features_allowed_by_interpretability
features_allowed_by_interpretability (String) (Expert Setting)
Default value '{1: 10000000, 2: 10000, 3: 1000, 4: 500, 5: 300, 6: 200, 7: 150, 8: 100, 9: 80, 10: 50, 11: 50, 12: 50, 13: 50}'
nfeatures_max_threshold
nfeatures_max_threshold (Number) (Expert Setting)
Default value 200
feature_cost_mean_interp_for_penalty
feature_cost_mean_interp_for_penalty (Number) (Expert Setting)
Default value 5
features_cost_per_interp
features_cost_per_interp (Float) (Expert Setting)
Default value 0.25
varimp_threshold_shift_report
varimp_threshold_shift_report (Float) (Expert Setting)
Default value 0.3
apply_featuregene_limits_after_tuning
apply_featuregene_limits_after_tuning (Boolean) (Expert Setting)
Default value True
remove_scored_0gain_genes_in_postprocessing_above_interpretability
remove_scored_0gain_genes_in_postprocessing_above_interpretability (Number) (Expert Setting)
Default value 13
remove_scored_0gain_genes_in_postprocessing_above_interpretability_final_population
remove_scored_0gain_genes_in_postprocessing_above_interpretability_final_population (Number) (Expert Setting)
Default value 2
remove_scored_by_threshold_genes_in_postprocessing_above_interpretability_final_population
remove_scored_by_threshold_genes_in_postprocessing_above_interpretability_final_population (Number) (Expert Setting)
Default value 7
dump_varimp_every_scored_indiv
Enable detailed scored features info (Boolean) (Expert Setting)
Default value False
Whether to dump every scored individual's variable importance to csv/tabulated/json files. Produces files like:
individual_scored_id%d.iter%d.<hash>.features.txt for transformed features.
individual_scored_id%d.iter%d.<hash>.features_orig.txt for original features.
individual_scored_id%d.iter%d.<hash>.coefs.txt for absolute importance of transformed features.
There are txt, tab.txt, and json formats for some files, and the "best_" prefix means it is the best individual for that iteration. The hash in the name matches the hash in the files produced by dump_modelparams_every_scored_indiv=true, which can be used to track mutation history.
dump_trans_timings
Enable detailed logs for timing and types of features produced (Boolean) (Expert Setting)
Default value False
Whether to dump every scored fold’s timing and feature info to a timings.txt file
unsupervised_aggregator_n_exemplars
Max. number of exemplars for unsupervised Aggregator experiments (Number) (Expert Setting)
Default value 100
Attempt to create at most this many exemplars (actual rows behaving like cluster centroids) for the Aggregator algorithm in unsupervised experiment mode.
unsupervised_clustering_min_clusters
Min. number of clusters for unsupervised clustering experiments (Number) (Expert Setting)
Default value 2
Attempt to create at least this many clusters for clustering algorithm in unsupervised experiment mode.
unsupervised_clustering_max_clusters
Max. number of clusters for unsupervised clustering experiments (Number) (Expert Setting)
Default value 10
Attempt to create no more than this many clusters for clustering algorithm in unsupervised experiment mode.
compute_correlation
Compute correlation matrix (Boolean) (Expert Setting)
Default value False
Whether to compute the training, validation, and test correlation matrices (table and heatmap PDF) and save them to disk. Alpha feature. WARNING: currently single-threaded and quadratically slow for many columns.