Features Settings
feature_engineering_effort
Feature Engineering Effort
Specify a value from 0 to 10 for the Driverless AI feature engineering effort. Higher values generally lead to more time (and memory) spent in feature engineering. This value defaults to 5.
0: Keep only numeric features. Only model tuning during evolution.
1: Keep only numeric features and frequency-encoded categoricals. Only model tuning during evolution.
2: Keep all feature types except Text features. Some feature tuning before evolution.
3: Similar to 5 but only tuning during evolution. Mixed tuning of features and model parameters.
4: Similar to 5 but slightly more focused on model tuning.
5: Balanced feature-model tuning. (Default)
6-7: Similar to 5 but slightly more focused on feature engineering.
8: Similar to 6-7 but even more focused on feature engineering with high feature generation rate and no feature dropping even if high interpretability.
9-10: Similar to 8 but no model tuning during feature evolution.
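For example, a minimal config.toml override that shifts the experiment toward feature engineering might look like the following (the value 8 is illustrative, not a recommendation):
# config.toml override (illustrative value)
feature_engineering_effort = 8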
check_distribution_shift
Data Distribution Shift Detection
Specify whether Driverless AI should detect data distribution shifts between train/valid/test datasets (if provided). When the train and test datasets differ (or train/valid or valid/test) in terms of data distribution, a model can be built with high accuracy that tells, for each row, whether the row is in train or test. Currently, this information is only presented to the user and not acted upon.
Shifted features should either be dropped, or more meaningful aggregate features should be created by using them as labels or bins.
check_distribution_shift_drop
Data Distribution Shift Detection Drop of Features
Specify whether to drop high-shift features. This defaults to Auto. Note that for time series experiments, Auto turns this feature off.
Also see drop_features_distribution_shift_threshold_auc and check_distribution_shift.
drop_features_distribution_shift_threshold_auc
Max Allowed Feature Shift (AUC) Before Dropping Feature
Specify the maximum allowed AUC value for a feature before dropping the feature.
When the train and test datasets differ (or train/valid or valid/test) in terms of data distribution, a model can be built that tells, for each row, whether the row is in train or test. This model includes an AUC value. If the AUC, GINI, or Spearman correlation of this model is above the specified threshold, then Driverless AI considers the shift strong enough to drop those features.
The default AUC threshold is 0.999.
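As a sketch, a stricter shift-handling policy could be expressed via config.toml overrides such as the following (the threshold value is illustrative; lower values drop shifted features more aggressively):
# config.toml overrides (illustrative values)
check_distribution_shift = "on"
check_distribution_shift_drop = "on"
drop_features_distribution_shift_threshold_auc = 0.6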
check_leakage
Data Leakage Detection
Specify whether to check each feature for data leakage. Some features may carry over-predictive power with respect to the target column, which can affect model generalization. Driverless AI runs a model to determine the predictive power of each feature on the target variable. A simple model is then built on each feature with significant variable importance, and models with a high AUC (for classification) or R2 score (for regression) are reported to the user as potential leaks.
Note that this option is always disabled if the experiment is a time series experiment. This is set to Auto by default.
The equivalent config.toml parameter is check_leakage. Also see drop_features_leakage_threshold_auc.
drop_features_leakage_threshold_auc
Data Leakage Detection Dropping AUC/R2 Threshold
If Leakage Detection is enabled, specify the threshold for dropping features. When the AUC (or R2 for regression), GINI, or Spearman correlation is above this value, the feature is dropped. This value defaults to 0.999.
The equivalent config.toml parameter is drop_features_leakage_threshold_auc.
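For instance, leakage detection can be forced on with a slightly stricter dropping threshold (illustrative values):
# config.toml overrides (illustrative values)
check_leakage = "on"
drop_features_leakage_threshold_auc = 0.95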
leakage_max_data_size
Max Rows X Columns for Leakage
Specify the maximum number of (rows x columns) to trigger sampling for leakage checks. This value defaults to 10,000,000.
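A sketch of an override that halves the sampling trigger, with the rows x columns arithmetic shown in the comment:
# 50,000 rows x 200 columns = 10,000,000 cells, which would exceed this lowered trigger
leakage_max_data_size = 5000000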
max_features_importance
Max. num. features for variable importance
Specify the maximum number of features to use and show in importance tables. For any interpretability setting higher than 1, transformed or original features with lower importance than the top max_features_importance features are always removed, and their feature importances are pruned accordingly. Higher values can lead to lower performance and larger disk space usage for datasets with more than 100k columns.
enable_wide_rules
Enable Wide Rules
Enable various rules to handle wide datasets (i.e., where the number of columns exceeds the number of rows). The default value is “auto”, which automatically enables wide rules when the number of columns is detected to be greater than the number of rows.
Setting this to “on” forces the rules to be enabled regardless of any conditions. Enabling wide data rules sets all max_cols, max_orig_*col, and fs_orig* tomls to large values, and disables monotonicity unless monotonicity_constraints_dict is set or the default value of monotonicity_constraints_interpretability_switch is changed. It also disables shift detection and data leakage checks, and enables the XGBoost Random Forest model for modeling.
To disable wide rules, set enable_wide_rules to “off”. For mostly or entirely numeric datasets, selecting only ‘OriginalTransformer’ for faster speed is recommended (see included_transformers).
Also see Wide Datasets in Driverless AI for a quick model run.
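For example, to force wide rules on and, for a mostly numeric dataset, restrict the experiment to the original columns as suggested above (the quoted-list syntax follows the style used elsewhere on this page and should be verified against your config.toml):
# config.toml overrides (illustrative)
enable_wide_rules = "on"
included_transformers = "['OriginalTransformer']"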
orig_features_fs_report
Report Permutation Importance on Original Features
Specify whether Driverless AI reports permutation importance on original features (represented as normalized change in the chosen metric) in logs and the report file. This is disabled by default.
max_rows_fs
Maximum Number of Rows to Perform Permutation-Based Feature Selection
Specify the maximum number of rows when performing permutation feature importance, reduced by (stratified) random sampling. This value defaults to 500,000.
max_orig_cols_selected
Max Number of Original Features Used
Specify the maximum number of columns to be selected from an existing set of columns using feature selection. This value defaults to 10,000,000. For categorical columns, the selection is based upon how well target encoding (or frequency encoding if not available) on categoricals and on numerics treated as categoricals helps. This is useful for reducing the final model complexity. First, the best max_orig_cols_selected columns are found through feature selection methods, and then these features are used in feature evolution (to derive other features) and in modeling.
max_orig_nonnumeric_cols_selected
Max Number of Original Non-Numeric Features
Specify the maximum number of non-numeric columns to be selected. This is the same as max_orig_numeric_cols_selected, but for categorical columns. Feature selection is performed on all features when this value is exceeded. This value defaults to 300.
fs_orig_cols_selected
Max Number of Original Features Used for FS Individual
Specify the maximum number of features you want to be selected in an experiment. This value defaults to 10,000,000. Above this value, a special individual with a reduced set of original columns is added to the genetic algorithm.
fs_orig_numeric_cols_selected
Number of Original Numeric Features to Trigger Feature Selection Model Type
The maximum number of original numeric columns, above which Driverless AI will do feature selection. Note that this is applicable only to special individuals with original columns reduced. A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features. This value defaults to 10,000,000.
fs_orig_nonnumeric_cols_selected
Number of Original Non-Numeric Features to Trigger Feature Selection Model Type
The maximum number of original non-numeric columns, above which Driverless AI will do feature selection on all features. Note that this is applicable only to special individuals with original columns reduced. A separate individual in the genetic algorithm is created by doing feature selection by permutation importance on original features. This value defaults to 200.
max_relative_cardinality
Max Allowed Fraction of Uniques for Integer and Categorical Columns
Specify the maximum fraction of unique values for integer and categorical columns. If the column has a larger fraction of unique values than that, it will be considered an ID column and ignored. This value defaults to 0.95.
num_as_cat
Allow Treating Numerical as Categorical
Specify whether to allow some numerical features to be treated as categorical features. This is enabled by default.
The equivalent config.toml parameter is num_as_cat.
max_int_as_cat_uniques
Max Number of Unique Values for Int/Float to be Categoricals
Specify the maximum number of unique values for integer or real columns to be treated as categoricals. This value defaults to 50.
max_fraction_invalid_numeric
Max. fraction of numeric values to be non-numeric (and not missing) for a column to still be considered numeric
When the fraction of non-numeric (and non-missing) values is less than or equal to this value, the column is considered numeric. This can help with minor data quality issues during experimentation, but it is not recommended for production, since type inconsistencies can occur. Note: Non-numeric values are replaced with missing values at the start of the experiment, so some information is lost, but the column is then treated as numeric, which can help. Disabled if < 0.
nfeatures_max
Max Number of Engineered Features
Specify the maximum number of features to be included per model (and in each model within the final model if an ensemble). After each scoring, based on this parameter value, the top variable-importance features are kept and the rest are pruned away. The final ensemble and the final scoring pipeline will exclude any pruned-away features and only train on kept features, but they may contain a few new features due to fitting on a different data view (e.g. new clusters).
The default value of -1 means no restrictions are applied for this parameter except internally-determined memory and interpretability restrictions.
Notes:
If interpretability > remove_scored_0gain_genes_in_postprocessing_above_interpretability (see config.toml for reference), then every GA (genetic algorithm) iteration post-processes features down to this value just after scoring them. Otherwise, only mutations of scored individuals will be pruned (until the final model, where limits are strictly applied).
If ngenes_max is also not limited, then some individuals will have more genes and features until pruned by mutation or by preparation for the final model.
E.g., to generally limit every iteration to exactly 1 feature, one must set nfeatures_max=ngenes_max=1 and remove_scored_0gain_genes_in_postprocessing_above_interpretability=0, but the genetic algorithm will have a harder time finding good features.
The equivalent config.toml parameter is nfeatures_max (also see nfeatures_max_threshold in config.toml).
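Following the example in the notes above, a config.toml sketch that limits every iteration to exactly one feature would be:
# config.toml overrides (from the example in the notes)
nfeatures_max = 1
ngenes_max = 1
remove_scored_0gain_genes_in_postprocessing_above_interpretability = 0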
ngenes_max
Max Number of Genes
Specify the maximum number of genes (transformer instances) kept per model (and per each model within the final model for ensembles). This controls the number of genes before features are scored, so Driverless AI will just randomly sample genes if pruning occurs. If restriction occurs after scoring features, then aggregated gene importances are used for pruning genes. Instances include all possible transformers, including the original transformer for numeric features. A value of -1 means no restrictions except internally-determined memory and interpretability restrictions.
The equivalent config.toml parameter is ngenes_max.
features_allowed_by_interpretability
Limit Features by Interpretability
Specify whether to limit feature counts according to the Interpretability training setting, as specified by the features_allowed_by_interpretability config.toml setting.
monotonicity_constraints_interpretability_switch
Threshold for Interpretability Above Which to Enable Automatic Monotonicity Constraints for Tree Models
Specify an Interpretability setting value at or above which to use automatic monotonicity constraints in XGBoostGBM, LightGBM, or Decision Tree models. This value defaults to 7.
Also see monotonic gbm recipe and Monotonicity Constraints in Driverless AI for reference.
monotonicity_constraints_correlation_threshold
Correlation Beyond Which to Trigger Monotonicity Constraints (if enabled)
Specify the threshold on the Pearson product-moment correlation coefficient between a numerical or encoded transformed feature and the target: above this value, positive monotonicity is used, and below its negative, negative monotonicity is used for XGBoostGBM, LightGBM, and Decision Tree models. This value defaults to 0.1.
Note: This setting is only enabled when Interpretability is greater than or equal to the value specified by the monotonicity_constraints_interpretability_switch setting and when the monotonicity_constraints_dict setting is not specified.
Also see monotonic gbm recipe and Monotonicity Constraints in Driverless AI for reference.
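As an illustrative sketch, the following overrides would enable automatic monotonicity constraints at every Interpretability setting and require a stronger correlation before constraining a feature:
# config.toml overrides (illustrative values)
monotonicity_constraints_interpretability_switch = 1
monotonicity_constraints_correlation_threshold = 0.3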
monotonicity_constraints_log_level
Control amount of logging when calculating automatic monotonicity constraints (if enabled)
For models that support monotonicity constraints, and if enabled, show the automatically determined monotonicity constraints for each feature going into the model based on its correlation with the target. ‘low’ shows only the monotonicity constraint direction. ‘medium’ shows the correlation of positively and negatively constrained features. ‘high’ shows all correlation values.
Also see monotonic gbm recipe and Monotonicity Constraints in Driverless AI for reference.
monotonicity_constraints_drop_low_correlation_features
Whether to drop features that have no monotonicity constraint applied (e.g., due to low correlation with target)
If enabled, only monotonic features with +1/-1 constraints will be passed to the model(s), and features without monotonicity constraints (0) will be dropped. Otherwise all features will be in the model. Only active when interpretability >= monotonicity_constraints_interpretability_switch or monotonicity_constraints_dict is provided.
Also see monotonic gbm recipe and Monotonicity Constraints in Driverless AI for reference.
monotonicity_constraints_dict
Manual Override for Monotonicity Constraints
Specify a list of features to which monotonicity constraints are applied. Original numeric features are mapped to the desired constraint:
1: Positive constraint
-1: Negative constraint
0: Constraint disabled
Constraint is automatically disabled (set to 0) for features that are not in this list.
The following is an example of how this list can be specified:
"{'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}"
Note: If a list is not provided, then the automatic correlation-based method is used when monotonicity constraints are enabled at high enough interpretability settings.
See Monotonicity Constraints in Driverless AI for reference.
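In config.toml form, the same example is written as a quoted string value:
monotonicity_constraints_dict = "{'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}"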
max_feature_interaction_depth
Max Feature Interaction Depth
Specify the maximum number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates.
Exploring feature interactions can be important for gaining better predictive performance. The interaction can take multiple forms (e.g. feature1 + feature2 or feature1 * feature2 + … featureN). Although certain machine learning algorithms (like tree-based methods) can do well at capturing these interactions as part of their training process, generating them explicitly may still help these (or other) algorithms yield better performance.
The depth of the interaction level (as in “up to” how many features may be combined at once to create one single feature) can be specified to control the complexity of the feature engineering process. Higher values might be able to make more predictive models at the expense of time. This value defaults to 8.
Set Max Feature Interaction Depth to 1 (max_feature_interaction_depth=1) to disable any feature interactions.
fixed_feature_interaction_depth
Fixed Feature Interaction Depth
Specify a fixed non-zero number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates. To use all features for each transformer, set this to be equal to the number of columns. To do a 50/50 sample and a fixed feature interaction depth of \(n\) features, set this to -\(n\).
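A sketch of the sign convention described above, assuming a desired depth of 4 features:
fixed_feature_interaction_depth = 4    # always combine exactly 4 features
# fixed_feature_interaction_depth = -4 would instead do a 50/50 sample with a fixed depth of 4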
enable_target_encoding
Enable Target Encoding
Specify whether to use Target Encoding when building the model. Target encoding refers to several different feature transformations (primarily focused on categorical data) that aim to represent the feature using information from the actual target variable. A simple example is using the mean of the target to replace each unique category of a categorical feature. These types of features can be very predictive but are prone to overfitting and require more memory, as they need to store mappings of the unique categories and the target values.
cvte_cv_in_cv
Enable Outer CV for Target Encoding
For target encoding, specify whether an outer level of cross-fold validation is performed in cases where GINI is detected to flip sign or have an inconsistent sign for weight of evidence between fit_transform (on training data) and transform (on training and validation data). The degree to which GINI is inaccurate is also used to perform fold-averaging of look-up tables instead of using global look-up tables. This is enabled by default.
enable_lexilabel_encoding
Enable Lexicographical Label Encoding
Specify whether to enable lexicographical label encoding. This is disabled by default.
enable_isolation_forest
Enable Isolation Forest Anomaly Score Encoding
Isolation Forest is useful for identifying anomalies or outliers in data. Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that selected feature. This split depends on how long it takes to separate the points. Random partitioning produces noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.
This option lets you specify whether to return the anomaly score of each sample. This is disabled by default.
enable_one_hot_encoding
Enable One-Hot Encoding
Specify whether one-hot encoding is enabled. The default Auto setting is only applicable for small datasets and GLMs.
isolation_forest_nestimators
Number of Estimators for Isolation Forest Encoding
Specify the number of estimators for Isolation Forest encoding. This value defaults to 200.
drop_constant_columns
Drop Constant Columns
Specify whether to drop columns with constant values. This is enabled by default.
drop_id_columns
Drop ID Columns
Specify whether to drop columns that appear to be an ID. This is enabled by default.
no_drop_features
Don’t Drop Any Columns
Specify whether to avoid dropping any columns (original or derived). This is disabled by default.
cols_to_drop
Features to Drop
Specify which features to drop. This setting allows you to select many features at once by copying and pasting a list of column names (in quotes) separated by commas.
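A hypothetical example (the column names are placeholders, and the quoted-list syntax follows the style used elsewhere on this page):
cols_to_drop = "['CustomerID', 'PhoneNumber', 'RecordDate']"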
cols_to_force_in
Features to always keep or force in, e.g. “G1”, “G2”, “G3”
Control over columns to force in. Forced-in features are handled by the most interpretable transformers allowed by the experiment options, and they are never removed (even if the model assigns 0 importance to them). The transformers used by default include:
OriginalTransformer for numeric,
CatOriginalTransformer or FrequencyTransformer for categorical,
TextOriginalTransformer for text,
DateTimeOriginalTransformer for date-times,
DateOriginalTransformer for dates,
ImageOriginalTransformer or ImageVectorizerTransformer for images, etc.
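Using the example names from this setting's title, a config.toml sketch would be:
cols_to_force_in = "['G1', 'G2', 'G3']"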
cols_to_group_by
Features to Group By
Specify which features to group columns by. When this field is left empty (default), Driverless AI automatically searches all columns (either at random or based on which columns have high variable importance).
sample_cols_to_group_by
Sample from Features to Group By
Specify whether to sample from given features to group by or to always group all features. This is disabled by default.
agg_funcs_for_group_by
Aggregation Functions (Non-Time-Series) for Group By Operations
Specify whether to enable aggregation functions to use for group by operations. Choose from the following (all are selected by default):
mean
sd
min
max
count
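For example, to restrict aggregations to a subset of the defaults listed above (illustrative choice):
agg_funcs_for_group_by = "['mean', 'max']"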
folds_for_group_by
Number of Folds to Obtain Aggregation When Grouping
Specify the number of folds to obtain aggregation when grouping. Out-of-fold aggregations will result in less overfitting, but they analyze less data in each fold. The default value is 5.
mutation_mode
Type of Mutation Strategy
Specify which strategy to apply when performing mutations on transformers. Select from the following:
sample: Sample transformer parameters (Default)
batched: Perform multiple types of the same transformation together
full: Perform more types of the same transformation together than the above strategy
dump_varimp_every_scored_indiv
Enable Detailed Scored Features Info
Specify whether to dump every scored individual’s variable importance (both derived and original) to a csv/tabulated/json file. If enabled, Driverless AI produces files such as “individual_scored_id%d.iter%d*features*”. This is disabled by default.
dump_trans_timings
Enable Detailed Logs for Timing and Types of Features Produced
Specify whether to dump every scored fold’s timing and feature info to a timings.txt file. This is disabled by default.
compute_correlation
Compute Correlation Matrix
Specify whether to compute training, validation, and test correlation matrices. When enabled, this setting creates table and heatmap PDF files that are saved to disk. Note that this setting is currently a single-threaded process that may be slow for experiments with many columns. This is disabled by default.
interaction_finder_gini_rel_improvement_threshold
Required GINI Relative Improvement for Interactions
Specify the required GINI relative improvement value for the InteractionTransformer. If the GINI coefficient is not better than the specified relative improvement value in comparison to the original features considered in the interaction, then the interaction is not returned. If the data is noisy and there is no clear signal in interactions, this value can be decreased to return interactions. This value defaults to 0.5.
interaction_finder_return_limit
Number of Transformed Interactions to Make
Specify the number of transformed interactions to make from generated trial interactions. (The best transformed interactions are selected from the group of generated trial interactions.) This value defaults to 5.
enable_rapids_transformers
Whether to enable RAPIDS cuML GPU transformers (no mojo)
Specify whether to enable GPU-based RAPIDS cuML transformers. Note that no MOJO support for deployment is available for this selection at this time, but Python scoring is supported; this feature is in beta testing status.
The equivalent config.toml parameter is enable_rapids_transformers, and the default value is False.
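A config.toml sketch that enables these beta transformers (keeping in mind that MOJO deployment is not supported for this selection):
enable_rapids_transformers = true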
varimp_threshold_at_interpretability_10
Lowest allowed variable importance at interpretability 10
Specify the variable importance below which features are dropped (with the possibility of a replacement being found that’s better). This setting also sets the overall scale for lower interpretability settings. Set this to a lower value if you’re content with having many weak features despite choosing high interpretability, or if you see a drop in performance due to the need for weak features.
stabilize_fs
Whether to take minimum (True) or mean (False) of delta improvement in score when aggregating feature selection scores across multiple folds/depths
Whether to take the minimum (True) or mean (False) of the delta improvement in score when aggregating feature selection scores across multiple folds/depths. Delta improvement of score corresponds to the original metric minus the metric of the shuffled-feature frame if maximizing the metric, and corresponds to the negative of that score difference if minimizing. Feature selection by permutation importance considers the change in score after shuffling a feature, and using the minimum operation ignores optimistic scores in favor of pessimistic scores when aggregating over folds. Note that if using tree methods, multiple depths may be fitted, in which case, regardless of this setting, only features that are kept for all depths are kept by feature selection. If interpretability >= the config.toml value of fs_data_vary_for_interpretability, then half the data (or the setting of fs_data_frac) is used as another fit, in which case, regardless of this setting, only features that are kept for all data sizes are kept by feature selection. Note: This is disabled for small data, since arbitrary slices of small data can lead to disjoint features being important, and only aggregated average behavior has signal.