Search Expert Settings
First, we'll initialize a client with our server credentials and store it in the variable dai.
In [1]:
import driverlessai
dai = driverlessai.Client(address='http://localhost:12345', username='py', password='py')
Let's say we're interested in natural language processing. We can see which experiment settings pertain to NLP.
In [2]:
dai.experiments.search_expert_settings('nlp')
num_gpus_for_prediction | default_value: 0
tensorflow_max_epochs_nlp | default_value: 2
enable_tensorflow_nlp_accuracy_switch | default_value: 5
enable_tensorflow_textcnn | default_value: auto
enable_tensorflow_textbigru | default_value: auto
enable_tensorflow_charcnn | default_value: auto
tensorflow_nlp_pretrained_embeddings_file_path | default_value:
tensorflow_nlp_pretrained_embeddings_trainable | default_value: False
tensorflow_nlp_have_gpus_in_production | default_value: False
text_fraction_for_text_dominated_problem | default_value: 0.3
text_transformer_fraction_for_text_dominated_problem | default_value: 0.3
string_col_as_text_threshold | default_value: 0.3
This gives us the parameter names we can pass when creating an experiment. We can also see a detailed description of each setting.
In [3]:
dai.experiments.search_expert_settings('nlp', show_description=True)
num_gpus_for_prediction | default_value: 0 | Num. of GPUs for isolated prediction/transform
Number of GPUs to use for predict for models and transform for transformers when running outside of fit/fit_transform. If predict/transform are called in same process as fit/fit_transform, number of GPUs will match, while new processes will use this count for number of GPUs for applicable models/transformers. If tensorflow_nlp_have_gpus_in_production=true, then that overrides this setting for relevant TensorFlow NLP transformers.

tensorflow_max_epochs_nlp | default_value: 2 | Max. TensorFlow epochs for NLP
Max. number of epochs for TensorFlow models for making NLP features

enable_tensorflow_nlp_accuracy_switch | default_value: 5 | Accuracy above enable TensorFlow NLP by default for all models
Accuracy setting equal and above which will add all enabled TensorFlow NLP models below at start of experiment for text dominated problems when TensorFlow NLP transformers are set to auto. If set to on, this parameter is ignored. Otherwise, at lower accuracy, TensorFlow NLP transformations will only be created as a mutation.

enable_tensorflow_textcnn | default_value: auto | Enable word-based CNN TensorFlow models for NLP
Whether to use Word-based CNN TensorFlow models for NLP if TensorFlow enabled

enable_tensorflow_textbigru | default_value: auto | Enable word-based BiGRU TensorFlow models for NLP
Whether to use Word-based Bi-GRU TensorFlow models for NLP if TensorFlow enabled

enable_tensorflow_charcnn | default_value: auto | Enable character-based CNN TensorFlow models for NLP
Whether to use Character-level CNN TensorFlow models for NLP if TensorFlow enabled

tensorflow_nlp_pretrained_embeddings_file_path | default_value: | Path to pretrained embeddings for TensorFlow NLP models. If empty, will train from scratch.
Path to pretrained embeddings for TensorFlow NLP models. For example, download and unzip https://nlp.stanford.edu/data/glove.6B.zip and set tensorflow_nlp_pretrained_embeddings_file_path = /path/on/server/to/glove.6B.300d.txt

tensorflow_nlp_pretrained_embeddings_trainable | default_value: False | Allow training of unfrozen pretrained embeddings (in addition to fine-tuning of the rest of the graph)
Allow training of all weights of the neural network graph, including the pretrained embedding layer weights. If disabled, then the embedding layer is frozen, but all other weights are still fine-tuned.

tensorflow_nlp_have_gpus_in_production | default_value: False | Whether Python/MOJO scoring runtime will have GPUs (otherwise BiGRU will fail in production if this is enabled)
Whether Python/MOJO scoring runtime will have GPUs (otherwise BiGRU will fail in production if this is enabled). Enabling this can speed up training for BiGRU, but will require GPUs and CuDNN in production.

text_fraction_for_text_dominated_problem | default_value: 0.3 | Fraction of text columns out of all features to be considered a text-dominated problem
Fraction of text columns out of all features to be considered a text-dominated problem

text_transformer_fraction_for_text_dominated_problem | default_value: 0.3 | Fraction of text per all transformers to trigger that text dominated
Fraction of text transformers to all transformers above which to trigger that text dominated problem

string_col_as_text_threshold | default_value: 0.3 | Threshold for string columns to be treated as text (0.0 - text, 1.0 - string)
Threshold for average string-is-text score as determined by internal heuristics. It decides when a string column will be treated as text (for an NLP problem) or just as a standard categorical variable. Higher values will favor string columns as categoricals, lower values will favor string columns as text.
Let's define a couple of settings to change.
In [4]:
settings = {
'tensorflow_max_epochs_nlp': 5,
'enable_tensorflow_textbigru': 'on'
}
Then, upload a dataset with natural language features.
In [5]:
ds = dai.datasets.create('airline_sentiment.csv')
Complete 100% - [4/4] Computing column statistics
Now, we can preview the experiment behavior for the dataset and settings.
In [6]:
dai.experiments.preview(
train_dataset=ds,
target_column='airline_sentiment',
task='classification',
**settings
)
ACCURACY [6/10]:
- Training data size: *11,712 rows, 20 cols*
- Feature evolution: *[Constant, DecisionTree, LightGBM, XGBoostGBM]*, *3-fold CV**, 2 reps*
- Final pipeline: *Ensemble (9 models), 3-fold CV*

TIME [3/10]:
- Feature evolution: *4 individuals*, up to *54 iterations*
- Early stopping: After *5* iterations of no improvement

INTERPRETABILITY [7/10]:
- Feature pre-pruning strategy: Permutation Importance FS
- Monotonicity constraints: enabled
- Feature engineering search space: [CVCatNumEncode, CVTargetEncode, CatOriginal, Cat, DateOriginal, DateTimeOriginal, Dates, Frequent, Interactions, IsHoliday, NumCatTE, NumToCatTE, Original, TextBiGRU, Text]

[Constant, DecisionTree, LightGBM, XGBoostGBM] models to train:
- Model and feature tuning: *144*
- Feature evolution: *384*
- Final pipeline: *9*

Estimated runtime: *minutes*
Auto-click Finish/Abort if not done in: *1 day*/*7 days*
Note that TextBiGRU has been added to the feature engineering search space.
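If the preview looks right, the same dataset and settings can be used to launch the experiment itself. The snippet below is a minimal sketch, assuming that experiments.create() accepts the same arguments as preview(); the name argument is illustrative.

# Launch the experiment with the same dataset and expert settings used in the preview.
# Assumes experiments.create() mirrors the preview() arguments; 'name' is an illustrative label.
ex = dai.experiments.create(
    train_dataset=ds,
    target_column='airline_sentiment',
    task='classification',
    name='nlp-expert-settings-demo',
    **settings
)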