NLP Settings
enable_tensorflow_textcnn
Enable Word-Based CNN TensorFlow Models for NLP
Specify whether to use out-of-fold predictions from Word-based CNN TensorFlow models as transformers for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs.
enable_tensorflow_textbigru
Enable Word-Based BiGRU TensorFlow Models for NLP
Specify whether to use out-of-fold predictions from Word-based BiGRU TensorFlow models as transformers for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs.
enable_tensorflow_charcnn
Enable Character-Based CNN TensorFlow Models for NLP
Specify whether to use out-of-fold predictions from Character-level CNN TensorFlow models as transformers for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs.
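As a minimal sketch of the GPU recommendation for the three TensorFlow NLP options above, the following hypothetical helper renders one override line per option. The helper name, the `key = "value"` layout, and the assumption that each option accepts the string values "auto" and "off" are illustrative and are not confirmed by this page; check the configuration reference for your installation before relying on them.

```python
# Hypothetical helper: not part of Driverless AI.
TF_NLP_SETTINGS = (
    "enable_tensorflow_textcnn",
    "enable_tensorflow_textbigru",
    "enable_tensorflow_charcnn",
)


def tf_nlp_overrides(has_gpu: bool) -> str:
    """Render one override line per TensorFlow NLP setting.

    Assumption: each setting accepts the string values "auto" and "off".
    """
    value = "auto" if has_gpu else "off"
    return "\n".join(f'{name} = "{value}"' for name in TF_NLP_SETTINGS)


# On a CPU-only system, all three options are switched off.
print(tf_nlp_overrides(has_gpu=False))
```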
enable_pytorch_nlp_model
Enable PyTorch Models for NLP
Specify whether to enable pretrained PyTorch models and fine-tune them for NLP tasks. This is set to Auto by default. You need to set this to On if you want to use the PyTorch models like BERT for modeling. Only the first text column will be used for modeling with these models. We recommend that you disable this option on systems that do not use GPUs.
enable_pytorch_nlp_transformer
Enable pre-trained PyTorch Transformers for NLP
Specify whether to enable pretrained PyTorch models for NLP tasks. This is set to Auto by default, and is enabled for text-dominated problems only. You need to set this to On if you want to use the PyTorch models like BERT for feature engineering (via fitting a linear model on top of pretrained embeddings). We recommend that you disable this option on systems that do not use GPUs.
Notes:
This setting requires an Internet connection.
pytorch_nlp_pretrained_models
Select Which Pretrained PyTorch NLP Models to Use
Specify one or more pretrained PyTorch NLP models to use. Select from the following:
bert-base-uncased (Default)
distilbert-base-uncased (Default)
xlnet-base-cased
xlm-mlm-enfr-1024
roberta-base
albert-base-v2
camembert-base
xlm-roberta-base
Notes:
This setting requires an Internet connection.
Models that are not selected by default may not have MOJO support.
Using BERT-like models may result in a longer experiment completion time.
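The notes above can be turned into a quick pre-flight check. This is a sketch only: the model names are copied from the list above, while the `check_selection` helper is hypothetical and does not call any Driverless AI API.

```python
# Model names copied from the list above; the helper itself is hypothetical.
AVAILABLE_MODELS = [
    "bert-base-uncased",
    "distilbert-base-uncased",
    "xlnet-base-cased",
    "xlm-mlm-enfr-1024",
    "roberta-base",
    "albert-base-v2",
    "camembert-base",
    "xlm-roberta-base",
]
DEFAULT_MODELS = {"bert-base-uncased", "distilbert-base-uncased"}


def check_selection(selected):
    """Return (unknown, non_default) for a list of model names.

    unknown: names that do not appear in the list above.
    non_default: listed names that are not selected by default and
    therefore may not have MOJO support.
    """
    unknown = [m for m in selected if m not in AVAILABLE_MODELS]
    non_default = [
        m for m in selected if m in AVAILABLE_MODELS and m not in DEFAULT_MODELS
    ]
    return unknown, non_default


unknown, non_default = check_selection(["bert-base-uncased", "roberta-base", "gpt2"])
print("Not in the list:", unknown)
print("May lack MOJO support:", non_default)
```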
tensorflow_max_epochs_nlp
Max TensorFlow Epochs for NLP
When building TensorFlow NLP features (for text data), specify the maximum number of epochs for training the feature engineering models (training might stop earlier). The higher the number of epochs, the longer the run time. This value defaults to 2 and is ignored if TensorFlow models are disabled.
enable_tensorflow_nlp_accuracy_switch
Accuracy Above Enable TensorFlow NLP by Default for All Models
Specify the accuracy threshold. At accuracy values equal to or above this threshold, all enabled TensorFlow NLP models are added at the start of the experiment for text-dominated problems, provided that the following NLP expert settings are set to Auto:
Enable word-based CNN TensorFlow models for NLP
Enable word-based BiGRU TensorFlow models for NLP
Enable character-based CNN TensorFlow models for NLP
If the above settings are set to On, this parameter is ignored.
At lower accuracy, TensorFlow NLP transformations will only be created as a mutation. This value defaults to 5.
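The rule described above can be summarized in a few lines. This is a sketch for text-dominated problems only, since the page does not describe other cases. The function name, the lowercase value names "auto", "on", and "off", and the returned labels are hypothetical.

```python
def tf_nlp_mode(accuracy: int, setting: str = "auto", threshold: int = 5) -> str:
    """Describe how a TensorFlow NLP transformer is handled.

    Covers text-dominated problems only. `setting` is the value of one of
    the three TensorFlow NLP options (assumed names: "auto", "on", "off").
    """
    if setting == "off":
        return "disabled"
    if setting == "on":
        # The accuracy threshold is ignored when the option is set to On.
        return "threshold ignored"
    if accuracy >= threshold:
        return "added at start"
    return "mutation only"


for accuracy in (3, 5, 8):
    print(accuracy, tf_nlp_mode(accuracy))
```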
pytorch_nlp_fine_tuning_num_epochs
Number of Epochs for Fine-Tuning of PyTorch NLP Models
Specify the number of epochs used when fine-tuning PyTorch NLP models. This value defaults to 2.
pytorch_nlp_fine_tuning_batch_size
Batch Size for PyTorch NLP Models
Specify the batch size for PyTorch NLP models. This value defaults to 10.
Note: Large models and batch sizes require more memory.
pytorch_nlp_fine_tuning_padding_length
Maximum Sequence Length for PyTorch NLP Models
Specify the maximum sequence length (padding length) for PyTorch NLP models. This value defaults to 100.
Note: Large models and padding lengths require more memory.
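To see why batch size and padding length drive memory use together, consider the following sketch. It assumes that every sequence is truncated or right-padded to exactly the padding length and that padding uses token ID 0. Both are illustrative assumptions, and the function names are hypothetical, so this is not a description of how Driverless AI tokenizes text internally.

```python
def pad_or_truncate(token_ids, padding_length=100, pad_id=0):
    """Return exactly `padding_length` IDs: truncate long inputs, pad short ones."""
    clipped = token_ids[:padding_length]
    return clipped + [pad_id] * (padding_length - len(clipped))


def token_slots_per_batch(batch_size=10, padding_length=100):
    """Number of token positions the model processes per batch."""
    return batch_size * padding_length


print(len(pad_or_truncate(list(range(150)))))  # 100
print(token_slots_per_batch())                 # 1000 with the defaults
print(token_slots_per_batch(32, 512))          # 16384
```

Because each batch holds batch size times padding length token positions, raising either value increases the work per batch proportionally, on top of the memory needed by the model itself.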
pytorch_nlp_pretrained_models_dir
Path to Pretrained PyTorch NLP Models
Specify a path to pretrained PyTorch NLP models. To get all available models, download http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/bert_models.zip, then extract the folder and store it in a directory on the instance where Driverless AI is installed:
pytorch_nlp_pretrained_models_dir = /path/on/server/to/bert_models_folder
tensorflow_nlp_pretrained_embeddings_file_path
Path to Pretrained Embeddings for TensorFlow NLP Models
Specify a path to pretrained embeddings that will be used for the TensorFlow NLP models. This can be either a path in the local file system (/path/on/server/to/file.txt) or an S3 location (s3://).
Notes:
If an S3 location is specified, an S3 access key ID and S3 secret access key can also be specified with the tensorflow_nlp_pretrained_s3_access_key_id and tensorflow_nlp_pretrained_s3_secret_access_key expert settings respectively.
You can download pretrained GloVe embeddings and specify the local path in this setting.
You can download pretrained fastText embeddings and specify the local path in this setting.
You can also train your own custom embeddings. Please refer to this code sample for creating custom embeddings that can be passed on to this option.
If this field is left empty, embeddings will be trained from scratch.
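Before pointing this setting at a local file, it can help to sanity-check the file. The sketch below assumes the plain-text layout used by the published GloVe vectors, where each line holds a token followed by its space-separated vector components. Whether Driverless AI requires exactly this layout is an assumption, and `read_text_embeddings` is a hypothetical helper. Note that fastText `.vec` files start with a header line (token count and dimension) that this sketch does not handle.

```python
import io


def read_text_embeddings(lines):
    """Parse 'token v1 v2 ... vn' lines into a dict, checking dimensions."""
    vectors = {}
    dim = None
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        token, values = parts[0], [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(values)  # the first vector fixes the expected dimension
        elif len(values) != dim:
            raise ValueError(f"{token!r} has {len(values)} values, expected {dim}")
        vectors[token] = values
    return vectors


# Tiny in-memory stand-in for an embeddings file.
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
embeddings = read_text_embeddings(sample)
print(len(embeddings), len(embeddings["the"]))
```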
tensorflow_nlp_pretrained_s3_access_key_id
S3 access key ID to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location
Specify an S3 access key ID to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location. For more information, see the entry on the tensorflow_nlp_pretrained_embeddings_file_path expert setting.
tensorflow_nlp_pretrained_s3_secret_access_key
S3 secret access key to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location
Specify an S3 secret access key to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location. For more information, see the entry on the tensorflow_nlp_pretrained_embeddings_file_path expert setting.
tensorflow_nlp_pretrained_embeddings_trainable
For TensorFlow NLP, Allow Training of Unfrozen Pretrained Embeddings
Specify whether to allow training of all weights of the neural network graph, including the pretrained embedding layer weights. If this is disabled, the embedding layer will be frozen. All other weights, however, will still be fine-tuned. This is disabled by default.
text_fraction_for_text_dominated_problem
Fraction of Text Columns Out of All Features to be Considered a Text-Dominated Problem
Specify the fraction of text columns out of all features to be considered as a text-dominated problem. This value defaults to 0.3.
Specify when a string column will be treated as text (for an NLP problem) or just as a standard categorical variable. Higher values will favor string columns as categoricals, while lower values will favor string columns as text. This value defaults to 0.3.
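The fraction described above is simple arithmetic. In the sketch below, the function name is hypothetical, and treating a fraction equal to the threshold as text-dominated is an assumption, because this page does not say whether the comparison is inclusive.

```python
def is_text_dominated(n_text_columns: int, n_features: int, threshold: float = 0.3) -> bool:
    """Compare the share of text columns against the threshold.

    Assumption: a fraction at or above the threshold counts as text-dominated.
    """
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_text_columns / n_features >= threshold


print(is_text_dominated(2, 5))   # 2 of 5 columns are text: fraction 0.4
print(is_text_dominated(1, 10))  # 1 of 10 columns is text: fraction 0.1
```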
text_transformer_fraction_for_text_dominated_problem
Fraction of Text per All Transformers to Trigger That Text Dominated
Specify the fraction of text transformers out of all transformers at which a problem is considered text-dominated. This value defaults to 0.3.
string_col_as_text_threshold
Threshold for String Columns to be Treated as Text
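Specify the threshold value (from 0 to 1) for string columns to be treated as text. Values closer to 0.0 favor treating string columns as text, while values closer to 1.0 favor treating them as plain strings (categoricals). This value defaults to 0.3.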
text_transformers_max_vocabulary_size
Max Size of the Vocabulary for Text Transformers
Specify the maximum number of tokens created when fitting TF-IDF and count-based text transformers. If multiple values are provided, the first value is used for the initial models and the remaining values are used during parameter tuning and feature evolution. The default value is [1000, 5000]. Values smaller than 10000 are recommended for speed.
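As a rough illustration of what a vocabulary cap does, the sketch below keeps only the most frequent tokens and splits a list of sizes into the value for the initial models and the values left for tuning. The `build_vocabulary` helper, the whitespace tokenization, and the choice to keep the most frequent tokens are assumptions made for the example; this page does not describe how the transformers select tokens.

```python
from collections import Counter


def build_vocabulary(documents, max_size):
    """Keep at most `max_size` tokens, most frequent first (assumed rule)."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return [token for token, _ in counts.most_common(max_size)]


# First value for the initial models, the rest for tuning and evolution.
max_vocabulary_sizes = [1000, 5000]
initial_size, *tuning_sizes = max_vocabulary_sizes

docs = ["the cat sat on the mat", "the dog sat"]
print(build_vocabulary(docs, max_size=2))
print(initial_size, tuning_sizes)
```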