Experiment settings: Text classification
The settings for a text classification experiment are listed and described below.
General settings
Dataset
This setting defines the dataset for the experiment.
Problem category
This setting defines the general problem category of the experiment (for example, image).
- The selected problem category (for example, image) determines the options in the Problem type setting.
- The From experiment option lets you reuse the settings of another experiment.
- The From experiment option is unavailable when you select AutoDL as the experience level.
Experiment
This setting defines the experiment H2O Hydrogen Torch references to initialize the experiment settings. H2O Hydrogen Torch initializes the experiment settings with the values from the selected (built) experiment.
This setting is available only if From experiment is selected in the Problem category setting.
Problem type
This setting defines the problem type of the experiment, which also defines the settings H2O Hydrogen Torch displays for the experiment.
- The selected problem category (in the Problem category setting) determines the available problem types.
- The selected problem type and experience level determine the settings H2O Hydrogen Torch displays for the experiment.
Import config from YAML
This setting specifies the YML (YAML) file that contains the experiment settings.
- H2O Hydrogen Torch supports a YML file import and export functionality. You can download the config settings of finished experiments, make changes, and re-upload them when starting a new experiment in any instance of H2O Hydrogen Torch.
- To learn how to download the YML file (configuration file) of a completed experiment, see Download an experiment's logs/config file.
Use previous experiment weights
This setting determines whether to initialize the model weights with the weights from the experiment specified in the Experiment setting.
Weights can be reused only from an experiment (model) with the same problem type and backbone.
This setting is useful when you want to continue training from a built experiment.
The Use previous experiment weights setting is available only if From experiment is selected in the Problem category setting.
Experiment name
This setting defines the name of the experiment.
Dataset settings
Train dataframe
This setting specifies the path to a file containing a dataframe of training records that H2O Hydrogen Torch uses to train the model. The file must follow the dataset format required by the experiment's problem type. To learn more, see Dataset formats.
- The records are combined into mini-batches when training the model.
- If a validation dataframe is provided, a fold column is not needed in the train dataframe.
- To import datasets for inference only, when defining the settings for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).
Validation strategy
This setting specifies the validation strategy H2O Hydrogen Torch uses for the experiment.
To properly assess the performance of your trained models, it is common practice to evaluate them on separate holdout data that the models have not seen during training.
Options
- K-fold cross validation
- This option splits the data using the provided optional fold column in the train data or performs an automatic 5-fold cross-validation in the absence of a fold column.
- Grouped k-fold cross-validation
- This option allows you to specify a group column based on which the data is split into folds.
- Custom holdout validation
- This option specifies a separate holdout dataframe.
- Automatic holdout validation
- This option lets you specify the size of a holdout validation sample that H2O Hydrogen Torch generates automatically.
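As a rough illustration of the first two options, the sketch below assigns a fold index to each training record in plain Python. The helper names, the seed, and the round-robin assignment are assumptions made for this example; they do not describe how H2O Hydrogen Torch splits data internally.

```python
import random


def assign_kfold(n_records, n_folds=5, seed=0):
    """Return a fold index (0..n_folds-1) for each record after shuffling."""
    order = list(range(n_records))
    random.Random(seed).shuffle(order)
    folds = [0] * n_records
    for position, record_idx in enumerate(order):
        folds[record_idx] = position % n_folds
    return folds


def assign_grouped_kfold(groups, n_folds=5, seed=0):
    """Assign folds so that all records sharing a group value land in the same fold."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    group_to_fold = {g: i % n_folds for i, g in enumerate(unique_groups)}
    return [group_to_fold[g] for g in groups]


# Hypothetical data: ten reviews written by four authors.
authors = ["a", "a", "b", "b", "b", "c", "c", "d", "d", "d"]
fold_column = assign_kfold(len(authors), n_folds=5)
grouped_fold_column = assign_grouped_kfold(authors, n_folds=2)
```

Grouping matters when records from the same source (here, the same author) are correlated: keeping a group inside one fold prevents near-duplicates from appearing on both sides of the split.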
Validation dataframe
This setting defines a file containing a dataframe with validation records that H2O Hydrogen Torch uses to evaluate the model during training.
- You can set a validation dataframe only when the Validation strategy setting is set to Custom holdout validation. When you provide a validation dataframe, H2O Hydrogen Torch does not perform any internal cross-validation: the model is trained on the full train dataframe, and its performance is evaluated on the provided validation dataframe.
- The validation dataframe should have the same format as the train dataframe but does not require a fold column.
The Validation dataframe setting is available only when Custom holdout validation is selected in the Validation strategy setting.
Selected folds
This setting defines the selected validation fold(s) in case of cross-validation; a separate model is trained for each value selected. Each model utilizes the corresponding part of the data as a holdout sample to assess performance while the model is fitted to the rest of the records from the training dataframe. As a result, folds estimate how the model performs in general when used to make predictions on data not used during model training.
H2O Hydrogen Torch lets you run an experiment on a single selected fold for faster experimentation, or on multiple selected folds to gain more trust in the model's ability to generalize.
This setting is available only when the Validation strategy setting is not set to Custom holdout validation or Automatic holdout validation.
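The relationship between selected folds and trained models can be sketched as follows. The fold values and the `selected_folds` list are hypothetical; the loop only shows which records would be held out for each selected fold, with one model trained per entry.

```python
def split_for_fold(fold_column, holdout_fold):
    """Return (train_indices, holdout_indices) for one selected fold."""
    train_idx = [i for i, f in enumerate(fold_column) if f != holdout_fold]
    holdout_idx = [i for i, f in enumerate(fold_column) if f == holdout_fold]
    return train_idx, holdout_idx


# Hypothetical fold column of a ten-record train dataframe.
fold_column = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
selected_folds = [0, 2]

# One split (and therefore one model) per selected fold.
splits = {fold: split_for_fold(fold_column, fold) for fold in selected_folds}
for fold, (train_idx, holdout_idx) in splits.items():
    print(f"fold {fold}: train on {len(train_idx)} records, validate on {holdout_idx}")
```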
Test dataframe
This setting defines a file containing a dataframe with test records that H2O Hydrogen Torch uses to test the model.
- The test dataframe should have the same format as the train dataframe but does not require a label column.
- To import datasets for inference only, when defining the setting for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).
Unlabeled dataframe
Defines a separate CSV or Parquet file (depending on the problem type) containing a dataframe with unlabeled records that H2O Hydrogen Torch uses to generate pseudo labels. H2O Hydrogen Torch first trains the model with the provided labeled data (Train dataframe). The model then predicts pseudo labels for the provided unlabeled dataframe, and a second training run combines the original labels with the pseudo labels.
- Image regression | Image classification | Image object detection
- The unlabeled dataframe just needs to contain a single image column
- Text regression | Text classification
- The unlabeled dataframe just needs to contain a single text column
- Audio regression | Audio classification | Speech recognition
- The unlabeled dataframe just needs to contain a single audio column
- Image regression | Image classification | Image object detection | Audio regression | Audio classification | Speech recognition
- Assets (images or audios) need to be located in the Data folder (setting)
- All supported problem types
- The training time can significantly increase depending on the size of the unlabeled data
Because labeling can be expensive, additional unlabeled data is often available. When you provide this unlabeled data, H2O Hydrogen Torch trains the model in a semi-supervised manner, which can improve model quality compared to training on labeled data only.
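The three steps described in this section (train, predict pseudo labels, retrain on the combined data) can be sketched with a toy word-count classifier. The `fit` and `predict` functions are hypothetical stand-ins for real model training and are not part of H2O Hydrogen Torch; only the order of the steps is taken from the description above.

```python
from collections import Counter, defaultdict


def fit(texts, labels):
    """Toy stand-in for model training: count how often each word appears per label."""
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    return word_counts


def predict(model, texts):
    """Predict the label whose training vocabulary overlaps most with each text."""
    predictions = []
    for text in texts:
        words = text.lower().split()
        scores = {label: sum(counts[w] for w in words) for label, counts in model.items()}
        predictions.append(max(sorted(scores), key=scores.get))
    return predictions


train_texts = ["great movie", "loved it", "terrible plot", "boring and bad"]
train_labels = ["pos", "pos", "neg", "neg"]
unlabeled_texts = ["great acting", "bad ending"]

# 1. Train on the labeled data only.
model = fit(train_texts, train_labels)
# 2. Predict pseudo labels for the unlabeled records.
pseudo_labels = predict(model, unlabeled_texts)
# 3. Retrain on the original labels combined with the pseudo labels.
final_model = fit(train_texts + unlabeled_texts, train_labels + pseudo_labels)
```

Because step 3 trains on more records than step 1, the overall training time grows with the size of the unlabeled data.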
Label columns
This setting defines the name(s) of the dataframe column(s) that refer to the target value(s) an H2O Hydrogen Torch experiment can aim to predict.
Text column
Defines the dataset column(s) containing the input text H2O Hydrogen Torch uses during model training.
H2O Hydrogen Torch concatenates multiple text columns with a specific separator token.
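A minimal sketch of this concatenation is shown below. The column names, the `[SEP]` token (the separator used by BERT-style tokenizers), and the spaces placed around it are assumptions for illustration; the actual separator depends on the tokenizer and on the Separator setting.

```python
def join_text_columns(record, text_columns, separator="[SEP]"):
    """Concatenate the selected text columns of one record with a separator token."""
    parts = [str(record[col]).strip() for col in text_columns]
    return f" {separator} ".join(parts)


# Hypothetical record with two text columns and one label column.
record = {"title": "Battery life", "body": "Lasts two days on one charge.", "label": "positive"}
combined = join_text_columns(record, ["title", "body"])
print(combined)  # Battery life [SEP] Lasts two days on one charge.
```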
Data sample
This setting defines the percentage of the data to use for the experiment. The default percentage is 100%.
Lowering the value can significantly speed up training, but it can also substantially reduce accuracy. Using 100% of the data for final models is highly recommended.
Data sample choice
This setting specifies the data H2O Hydrogen Torch samples according to the percentage set in the Data sample setting. H2O Hydrogen Torch does not sample the unselected data.
The Data sample choice setting is available only if the value in the Data sample setting is less than 1.0 (that is, less than 100%).
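The effect of sampling a fraction of the data can be sketched as follows. The function name, the fixed seed, and the uniform random sampling are assumptions for this example; the source does not state how H2O Hydrogen Torch draws the sample.

```python
import random


def sample_records(records, fraction, seed=0):
    """Return a random subset containing roughly `fraction` of the records."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if fraction == 1.0:
        return list(records)
    n_keep = max(1, round(len(records) * fraction))
    return random.Random(seed).sample(list(records), n_keep)


# Hypothetical train records; keep 10% for a quick exploratory run.
train_records = [f"text {i}" for i in range(1000)]
quick_subset = sample_records(train_records, fraction=0.1)
```

A small fraction is convenient for checking that an experiment runs end to end before committing to a full training run on all records.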
Separator
This setting defines the separator to use when combining text fields. Leave it empty to use the tokenizer's default separator.
Tokenizer settings
Lowercase
Grid search hyperparameter
Determines whether to lowercase the text that H2O Hydrogen Torch observes during the experiment. This setting is turned Off by default.
When turned On, the observed text is always lowercased before training and prediction. Tuning this setting can potentially lead to a higher accuracy value for certain types of datasets.
Max length
Grid search hyperparameter
Specify the maximum length of the token input sequence that is used for model training. The following example describes how you can use this setting to truncate a given token input sequence.
Consider the following text:
I'd like to read the H2O Hydrogen Torch documentation today.
The preceding text is tokenized by bert-base as follows:
['I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O', 'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']
A [CLS] (classification) token is subsequently added to the input sequence at position 0. (The manner in which this token is represented as a string depends on the model.)
['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O', 'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']
If the maximum length is set to 8, the preceding input sequence is truncated after 8 tokens. Therefore, the model is provided with the following input sequence:
['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the']
A higher maximum length increases memory usage and slows down training, but it can also lead to higher accuracy.
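The truncation in the preceding example can be reproduced with a few lines of Python. The `build_input` helper is hypothetical and follows the simplified description above; real tokenizers typically also append an end-of-sequence token, which is left out here.

```python
def build_input(tokens, max_length, cls_token="[CLS]"):
    """Prepend the classification token, then keep at most max_length tokens."""
    return ([cls_token] + tokens)[:max_length]


tokens = ['I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O',
          'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']
truncated = build_input(tokens, max_length=8)
print(truncated)
# ['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the']
```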
Padding quantile
Defines the padding quantile H2O Hydrogen Torch uses to select the maximum token length per batch. H2O Hydrogen Torch performs padding of shorter sequences up to the specified padding quantile instead of the selected Max length. H2O Hydrogen Torch truncates longer sequences.
- Lowering the quantile can significantly reduce training runtime and memory usage when sequence lengths are unevenly distributed, but it can hurt performance
- The setting depends on the batch size and should be adjusted accordingly
- No padding is done in inference, and the selected Max length is guaranteed
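The idea of choosing a per-batch length from a quantile of the token lengths can be sketched as follows. The linear-interpolation quantile, the `[PAD]` token, and the helper names are assumptions for illustration; the exact computation in H2O Hydrogen Torch may differ.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def batch_length(token_lengths, padding_quantile, max_length):
    """Target sequence length for one batch, capped at max_length."""
    target = math.ceil(quantile(token_lengths, padding_quantile))
    return min(target, max_length)


def pad_or_truncate(tokens, length, pad_token="[PAD]"):
    """Pad shorter sequences and truncate longer ones to exactly `length` tokens."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))


lengths = [12, 15, 18, 20, 200]  # one unusually long sequence in the batch
full = batch_length(lengths, padding_quantile=1.0, max_length=128)    # 128
lower = batch_length(lengths, padding_quantile=0.75, max_length=128)  # 20
```

In this hypothetical batch, a quantile of 0.75 shrinks the padded length from 128 to 20 tokens at the cost of truncating the single long sequence, which is the trade-off between speed and performance noted above.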
Augmentation settings
Token mask probability
Defines the probability with which each input text token is randomly masked during training.
- Increasing this setting can be helpful to avoid overfitting and apply regularization
- Each token is randomly replaced by a masking token based on the specified probability
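A minimal sketch of this augmentation is shown below. The `[MASK]` token string and the choice to leave special tokens such as `[CLS]` untouched are assumptions; the source only states that each token is replaced by a masking token with the specified probability.

```python
import random


def mask_tokens(tokens, mask_probability, mask_token="[MASK]",
                special_tokens=("[CLS]", "[SEP]", "[PAD]"), rng=None):
    """Replace each non-special token with mask_token with the given probability."""
    rng = rng or random.Random()
    return [
        mask_token if tok not in special_tokens and rng.random() < mask_probability else tok
        for tok in tokens
    ]


tokens = ["[CLS]", "I", "like", "to", "read", "the", "documentation", "."]
# A seeded generator makes the example reproducible; training would draw new masks each epoch.
augmented = mask_tokens(tokens, mask_probability=0.25, rng=random.Random(0))
```

Because a different subset of tokens is hidden each time a record is seen, the model cannot rely on any single token, which is the regularization effect described above.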
Architecture settings
Pretrained
Grid search hyperparameter
Defines whether the neural network should start with pre-trained weights. When this setting is On, training starts from a model that was pre-trained on a generic task. When this setting is Off, the weights of the neural network are initialized randomly.
Backbone
Grid search hyperparameter
Defines the backbone neural network architecture to train the model.
- Image regression | Image classification | Image metric learning | Audio regression | Audio classification
- H2O Hydrogen Torch accepts backbone neural network architectures from the timm library (select or enter the architecture name)
- Image object detection
- H2O Hydrogen Torch provides several state-of-the-art backbone neural network architectures for model training. When you select Faster RCnn or Fcos as the model type for the experiment, you can input any architecture name from the timm library. When you select Efficientdet as the model type for the experiment, you can input any architecture name from the efficientdet-pytorch library.
- Image semantic segmentation | Image instance segmentation
- H2O Hydrogen Torch accepts backbone neural network architectures from the segmentation-models-pytorch library (select or enter the architecture name).
- 3D image regression | 3D image classification
- H2O Hydrogen Torch accepts backbone (encoder) neural network architectures from a subset (resnet and efficientnet) of the timm library (select or enter the architecture name).
- Text regression | Text classification | Text token classification | Text span prediction | Text sequence to sequence | Text metric learning
- H2O Hydrogen Torch accepts backbone neural network architectures from the Hugging Face library (select or enter the architecture name)
- Speech recognition
- Hugging Face Wav2Vec2 CTC models are supported
- All problem types
- Usually, it is good to use simpler architectures for quicker experiments and larger models when aiming for the highest accuracy
- Speech recognition
- If possible, use backbones that were pre-trained on data close to your use case (for example, noisy audio or casual speech)