NLP configuration
enable_tensorflow_textcnn
Enable word-based CNN TensorFlow transformers for NLP (String) (Expert Setting)
Default value 'auto'
Whether to use out-of-fold predictions of word-based CNN TensorFlow models as transformers for NLP, if TensorFlow is enabled.
enable_tensorflow_textbigru
Enable word-based BiGRU TensorFlow transformers for NLP (String) (Expert Setting)
Default value 'auto'
Whether to use out-of-fold predictions of word-based Bi-GRU TensorFlow models as transformers for NLP, if TensorFlow is enabled.
enable_tensorflow_charcnn
Enable character-based CNN TensorFlow transformers for NLP (String) (Expert Setting)
Default value 'auto'
Whether to use out-of-fold predictions of character-level CNN TensorFlow models as transformers for NLP, if TensorFlow is enabled.
enable_pytorch_nlp_transformer
Enable PyTorch transformers for NLP (String) (Expert Setting)
Default value 'auto'
Whether to use pretrained PyTorch models as transformers for NLP tasks. Fits a linear model on top of pretrained embeddings. Requires an internet connection. The default of 'auto' means disabled. To enable, set to 'on'. GPU(s) are highly recommended. Reduce string_col_as_text_min_relative_cardinality and string_col_as_text_threshold toward 0.0 to force a string column to be treated as text despite a low number of unique values.
pytorch_nlp_transformer_max_rows_linear_model
Max number of rows to use for fitting the linear model on top of the pretrained embeddings. (Number) (Expert Setting)
Default value 50000
More rows can slow down the fitting process. Recommended values are less than 100000.
enable_pytorch_nlp_model
Enable PyTorch models for NLP (String) (Expert Setting)
Default value 'auto'
Whether to use pretrained PyTorch models and fine-tune them for NLP tasks. Requires an internet connection. The default of 'auto' means disabled. To enable, set to 'on'. These models use only the first text column and can be slow to train. GPU(s) are highly recommended. Set string_col_as_text_min_relative_cardinality=0.0 to force a string column to be treated as text despite a low number of unique values.
pytorch_nlp_pretrained_models
Select which pretrained PyTorch NLP model(s) to use. (List) (Expert Setting)
Default value ['bert-base-uncased', 'distilbert-base-uncased', 'bert-base-multilingual-cased']
Select which pretrained PyTorch NLP model(s) to use. Non-default ones might have no MOJO support. Requires an internet connection. Only applies if PyTorch models or transformers for NLP are set to 'on'.
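The settings above combine naturally: the PyTorch models must be set to 'on' for the model list to take effect. The following is a minimal sketch of how such overrides could be written out as `key = value` lines, assuming they are stored in a TOML-style configuration file. The helper `render_overrides` is hypothetical and not part of the product; only the setting names and values come from this page.

```python
import json


def render_overrides(overrides):
    """Render a dict of settings as TOML-style `key = value` lines.

    Hypothetical helper for illustration only.
    """
    lines = []
    for key, value in overrides.items():
        if isinstance(value, bool):
            # TOML booleans are lowercase
            rendered = "true" if value else "false"
        else:
            # json.dumps quotes strings and formats numbers and simple
            # lists in a form that is also valid TOML for these values
            rendered = json.dumps(value)
        lines.append(f"{key} = {rendered}")
    return "\n".join(lines)


overrides = {
    "enable_pytorch_nlp_model": "on",
    "pytorch_nlp_pretrained_models": ["bert-base-uncased", "distilbert-base-uncased"],
    "pytorch_nlp_fine_tuning_batch_size": -1,
}
print(render_overrides(overrides))
```

Keeping the overrides in a dict makes it easy to review which expert settings differ from their defaults before applying them.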
tensorflow_max_epochs_nlp
Max. TensorFlow epochs for NLP (Number) (Expert Setting)
Default value 2
Maximum number of epochs for the TensorFlow models used to create NLP features.
enable_tensorflow_nlp_accuracy_switch
Accuracy at or above which TensorFlow NLP is enabled by default for all models (Number) (Expert Setting)
Default value 5
Accuracy setting at or above which all enabled TensorFlow NLP models are added at the start of the experiment for text-dominated problems, when TensorFlow NLP transformers are set to 'auto'. If they are set to 'on', this parameter is ignored. At lower accuracy settings, TensorFlow NLP transformations are only created as a mutation.
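The rule described for this setting can be read as a small decision function. The sketch below is one interpretation, not the product's implementation: the function name, the return labels, the handling of 'off', and the treatment of problems that are not text-dominated are all assumptions made for illustration.

```python
def tensorflow_nlp_schedule(transformer_setting, accuracy, text_dominated,
                            accuracy_switch=5):
    """Sketch of when TensorFlow NLP transformers might be introduced.

    Hypothetical function; labels are illustrative only.
    """
    if transformer_setting == "off":
        # Assumption: 'off' disables the transformers entirely
        return "disabled"
    if transformer_setting == "on":
        # The accuracy switch is ignored when the setting is 'on'
        return "enabled_switch_ignored"
    # Remaining case: 'auto'
    if text_dominated and accuracy >= accuracy_switch:
        return "added_at_start"
    # Assumption: every other 'auto' case only sees mutations
    return "mutation_only"


print(tensorflow_nlp_schedule("auto", accuracy=7, text_dominated=True))
print(tensorflow_nlp_schedule("auto", accuracy=3, text_dominated=True))
```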
tensorflow_nlp_pretrained_embeddings_file_path
Path to pretrained embeddings for TensorFlow NLP models. If empty, will train from scratch. (String) (Expert Setting)
Default value ''
Path to pretrained embeddings for TensorFlow NLP models. This can be a path in the local file system or an S3 location (s3://).
For example, download and unzip https://nlp.stanford.edu/data/glove.6B.zip, then set:
tensorflow_nlp_pretrained_embeddings_file_path = /path/on/server/to/glove.6B.300d.txt
tensorflow_nlp_pretrained_s3_access_key_id
S3 access key Id to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location. (String) (Expert Setting)
Default value ''
tensorflow_nlp_pretrained_s3_secret_access_key
S3 secret access key to use when tensorflow_nlp_pretrained_embeddings_file_path is set to an S3 location. (String) (Expert Setting)
Default value ''
tensorflow_nlp_pretrained_embeddings_trainable
For TensorFlow NLP, allow training of unfrozen pretrained embeddings (in addition to fine-tuning of the rest of the graph) (Boolean) (Expert Setting)
Default value False
Allow training of all weights of the neural network graph, including the pretrained embedding layer weights. If disabled, then the embedding layer is frozen, but all other weights are still fine-tuned.
pytorch_tokenizer_parallel
pytorch_tokenizer_parallel (Boolean)
Default value True
Whether to parallelize tokenization for BERT Models/Transformers.
pytorch_nlp_fine_tuning_num_epochs
Number of epochs for fine-tuning of PyTorch NLP models. (Number) (Expert Setting)
Default value -1
Number of epochs for fine-tuning of PyTorch NLP models. Larger values can increase accuracy but take longer to train.
pytorch_nlp_fine_tuning_batch_size
Batch size for PyTorch NLP models. -1 for automatic. (Number) (Expert Setting)
Default value -1
Batch size for PyTorch NLP models. Larger models and larger batch sizes will use more memory.
pytorch_nlp_fine_tuning_padding_length
Maximum sequence length (padding length) for PyTorch NLP models. -1 for automatic. (Number) (Expert Setting)
Default value -1
Maximum sequence length (padding length) for PyTorch NLP models. Larger models and larger padding lengths will use more memory.
pytorch_nlp_pretrained_models_dir
Path to pretrained PyTorch NLP models. If empty, will get models from S3 (String) (Expert Setting)
Default value ''
Path to pretrained PyTorch NLP models. Note that this can be either a path in the local file system
(/path/on/server/to/bert_models_folder), a URL, or an S3 location (s3://).
To get all models, download http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/bert_models.zip,
unzip it, and store it in a directory on the instance where DAI is installed.
pytorch_nlp_pretrained_models_dir=/path/on/server/to/bert_models_folder
pytorch_nlp_pretrained_s3_access_key_id
S3 access key Id to use when pytorch_nlp_pretrained_models_dir is set to an S3 location. (String) (Expert Setting)
Default value ''
pytorch_nlp_pretrained_s3_secret_access_key
S3 secret access key to use when pytorch_nlp_pretrained_models_dir is set to an S3 location. (String) (Expert Setting)
Default value ''
text_fraction_for_text_dominated_problem
Fraction of text columns out of all features to be considered a text-dominated problem (Float) (Expert Setting)
Default value 0.3
Fraction of text columns out of all features at which a problem is considered text-dominated.
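As a rough illustration of this threshold, the check below compares the share of text columns against the default of 0.3. It is a sketch only: the function name is hypothetical, and whether the product compares with "greater than or equal" or strictly "greater than" is an assumption.

```python
def is_text_dominated(n_text_columns, n_features, threshold=0.3):
    """Return True if the share of text columns reaches the threshold.

    Hypothetical helper; the inclusive comparison is an assumption.
    """
    if n_features <= 0:
        # No features: nothing can dominate
        return False
    return n_text_columns / n_features >= threshold


# 4 text columns out of 10 features is a 0.4 share, above the 0.3 default
print(is_text_dominated(4, 10))
# 2 text columns out of 10 features is a 0.2 share, below the default
print(is_text_dominated(2, 10))
```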
text_transformer_fraction_for_text_dominated_problem
Fraction of text transformers out of all transformers that triggers a text-dominated problem (Float) (Expert Setting)
Default value 0.3
Fraction of text transformers out of all transformers above which the problem is treated as text-dominated.
string_col_as_text_threshold
Threshold for string columns to be treated as text (0.0 - text, 1.0 - string) (Float) (Expert Setting)
Default value 0.3
Threshold for the average string-is-text score, as determined by internal heuristics. It decides whether a string column is treated as text (for an NLP problem) or as a standard categorical variable. Higher values favor treating string columns as categoricals; lower values favor treating them as text. Set string_col_as_text_min_relative_cardinality=0.0 to force a string column to be treated as text despite a low number of unique values.
string_col_as_text_threshold_preview
string_col_as_text_threshold_preview (Float)
Default value 0.1
Threshold for string columns to be treated as text during preview (0.0 - text, 1.0 - string). It should be lower than string_col_as_text_threshold so that data whose first 20 rows don't look like text still works with text-only transformers.
string_col_as_text_min_relative_cardinality
string_col_as_text_min_relative_cardinality (Float) (Expert Setting)
Default value 0.1
Minimum fraction of unique values for string columns to be considered as possible text (otherwise categorical).
string_col_as_text_min_absolute_cardinality
string_col_as_text_min_absolute_cardinality (Number) (Expert Setting)
Default value 10000
Minimum number of unique values for string columns to be considered as possible text (if not already).
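The two cardinality settings can be pictured as a pre-filter that runs before the string-is-text score is applied. The sketch below is an interpretation rather than the actual heuristic: the function name is hypothetical, and combining the two limits with "or" is an assumption based on the "(if not already)" wording.

```python
def is_text_candidate(values, min_relative_cardinality=0.1,
                      min_absolute_cardinality=10000):
    """Decide whether a string column could be considered as text.

    Hypothetical helper; combining both limits with `or` is an assumption.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False
    n_unique = len(set(non_null))
    relative = n_unique / len(non_null)
    return (relative >= min_relative_cardinality
            or n_unique >= min_absolute_cardinality)


low_cardinality = ["red", "blue"] * 50           # 2 unique values in 100 rows
reviews = [f"review number {i}" for i in range(100)]  # all rows unique

print(is_text_candidate(low_cardinality))
print(is_text_candidate(reviews))
# Setting the relative limit to 0.0 lets the low-cardinality column through
print(is_text_candidate(low_cardinality, min_relative_cardinality=0.0))
```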
tokenize_single_chars
Tokenize single characters. (Boolean) (Expert Setting)
Default value True
If disabled, require 2 or more alphanumeric characters for a token in Text (Count and TF/IDF) transformers; otherwise, create tokens out of single alphanumeric characters as well. True means that 'Street 3' is tokenized into 'Street' and '3', while False means that it is tokenized into 'Street' only.
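The 'Street 3' example can be reproduced with two regular expressions. This is a minimal sketch using Python's standard `re` module; the patterns and the `tokenize` function are assumptions chosen to match the described behavior and are not taken from the product. Note that `\w` also matches underscores, which is slightly broader than "alphanumeric".

```python
import re

# One or more word characters: single-character tokens are kept
SINGLE_CHARS = re.compile(r"\b\w+\b")
# Two or more word characters: single-character tokens are dropped
MULTI_CHARS = re.compile(r"\b\w\w+\b")


def tokenize(text, tokenize_single_chars=True):
    """Split text into tokens, optionally keeping single-character tokens."""
    pattern = SINGLE_CHARS if tokenize_single_chars else MULTI_CHARS
    return pattern.findall(text)


print(tokenize("Street 3", tokenize_single_chars=True))   # ['Street', '3']
print(tokenize("Street 3", tokenize_single_chars=False))  # ['Street']
```

Keeping single characters matters for data such as addresses or product codes, where a lone digit or letter can carry signal.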
text_transformers_max_vocabulary_size
Max size of the vocabulary for text transformers. (List) (Expert Setting)
Default value [1000, 5000]
Max size (in tokens) of the vocabulary created during fitting of Tfidf/Count based text transformers (not CNN/BERT). If multiple values are provided, the first one is used for initial models, and the remaining values are used during parameter tuning and feature evolution. Values smaller than 10000 are recommended for speed. A reasonable set of choices includes: 100, 1000, 5000, 10000, 50000, 100000, 500000.
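To make the effect of a vocabulary cap concrete, the sketch below keeps only the most frequent tokens and splits a list of sizes into the value for initial models and the values left for tuning. Both helpers are hypothetical, and selecting tokens by raw frequency with whitespace tokenization is an assumption, not the product's documented method.

```python
from collections import Counter


def build_vocabulary(documents, max_size):
    """Keep the `max_size` most frequent whitespace-separated tokens."""
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    return [token for token, _ in counts.most_common(max_size)]


def split_vocabulary_sizes(sizes):
    """First value is for initial models, the rest for later tuning."""
    return sizes[0], sizes[1:]


docs = ["good good good service", "good service slow", "slow"]
print(build_vocabulary(docs, max_size=2))

initial, tuning = split_vocabulary_sizes([1000, 5000])
print(initial, tuning)
```

A smaller cap discards rare tokens, which reduces the number of features and speeds up fitting at the cost of some coverage.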