Experiment settings: Multi-modal causal language modeling
The settings available for a multi-modal causal language modeling experiment are listed below.
General settings
Dataset
This setting defines the dataset for the experiment.
Problem category
This setting defines the general problem category of the experiment, for example, image.
- The selected problem category (for example, image) determines the options in the Problem type setting.
- The From experiment option enables you to reuse the settings of another experiment.
- The From experiment option is unavailable when you select AutoDL as the experience level.
Problem type
This setting defines the problem type of the experiment, which also defines the settings H2O Hydrogen Torch displays for the experiment.
- The selected problem category (in the Problem category setting) determines the available problem types.
- The selected problem type and experience level determine the settings H2O Hydrogen Torch displays for the experiment.
Experiment name
This setting defines the experiment that H2O Hydrogen Torch uses to initialize the experiment settings. The settings are initialized with the values from the selected (built) experiment.
This setting is available only if From experiment is selected in the Problem category setting.
Dataset settings
Train dataframe
This setting specifies the path to a file containing a dataframe with the training records that H2O Hydrogen Torch uses for model training. The file must follow the dataset format for the problem type of the experiment. To learn more, see Dataset formats.
- The records are combined into mini-batches when training the model.
- If a validation dataframe is provided, a fold column is not needed in the train dataframe.
- To import datasets for inference only, when defining the settings for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).
Validation dataframe
This setting defines a file containing a dataframe with validation records that H2O Hydrogen Torch uses to evaluate the model during training.
- Setting a validation dataframe requires the Validation strategy to be set to Custom holdout validation. When a validation dataframe is provided, H2O Hydrogen Torch uses it as is and does not perform any internal cross-validation. In other words, the model is trained on the full provided train dataframe, and model performance is evaluated on the provided validation dataframe.
- The validation dataframe should have the same format as the train dataframe but does not require a fold column.
The Validation dataframe setting is only available when you select Custom holdout validation in the Validation strategy setting.
Test dataframe
This setting defines a file containing a dataframe with test records that H2O Hydrogen Torch uses to test the model.
- The test dataframe should have the same format as the train dataframe but does not require a label column.
- To import datasets for inference only, when defining the settings for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).
Architecture settings
Pretrained multimodal model
In the finetuning mode, specify your pretrained multimodal model. This can be a model name from the Hugging Face Hub, a local model path, or a Hydrogen Torch experiment name.
Freeze vision model
This setting allows freezing the parameters of the "Vision Model" component in the multimodal model. Freezing these parameters can reduce memory usage and speed up training. Note that this option is only applicable when LoRA is disabled (set to false).
Freeze language model
This setting allows freezing the parameters of the "Language Model" component in the multimodal model. Freezing these parameters can reduce memory usage and speed up training. Note that this option is only applicable when LoRA is disabled (set to false).
Lora
This setting turns on or off the use of Low-Rank Adaptation (LoRA) in H2O Hydrogen Torch during model training.
LoRA is a technique that keeps the weight matrices of a large pretrained model frozen and trains small low-rank update matrices instead. This reduces the number of trainable parameters, which makes fine-tuning more memory-efficient and faster. In NLP or multimodal machine learning models, LoRA can substantially reduce the computational cost of fine-tuning, often with little loss in model quality.
Enabling this setting can lead to faster training times and lower memory usage, making it particularly useful when working with large-scale NLP or multimodal tasks. However, it may result in a slight decrease in model accuracy compared to using full-rank matrices. Turning off this setting will ensure that full-rank matrices are used during training but at the cost of longer training times and higher memory requirements.
For most NLP or multimodal tasks, we recommend enabling LoRA during training unless you require the highest possible level of accuracy and have sufficient computational resources available. In such cases, turning off LoRA may improve model performance at the expense of increased training time and memory usage.
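To make the memory savings concrete, the following sketch compares the number of trainable parameters for a full-rank update of a single weight matrix with a low-rank update of the same matrix. The layer dimensions and the rank are illustrative assumptions, not H2O Hydrogen Torch defaults, and the helper names are hypothetical.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)

print(f"full-rank update: {full:,} trainable parameters")
print(f"low-rank update:  {lora:,} trainable parameters")
print(f"ratio:            {lora / full:.4%}")
```

Under these assumptions, the low-rank update trains well under one percent of the parameters of the full update, which is where the lower memory usage comes from.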
Backbone Dtype
Bfloat16 is highly recommended for newer GPUs. For older GPUs that do not support Bfloat16, use float16 instead. However, pure float16 computation can be unstable during full backbone fine-tuning, potentially leading to NaN errors. To mitigate this, enabling LoRA is usually necessary.
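As a rough illustration of why float16 is more fragile than bfloat16, the sketch below checks how each format handles a value just above the largest finite float16 number (65504). It uses only the Python standard library: float16 is encoded with the struct format 'e', and bfloat16 is emulated by keeping the top 16 bits of the float32 encoding. The emulation truncates instead of rounding, which is an assumption made for simplicity and is sufficient for a range check.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.
    This truncates rather than rounds, which is enough for a range check."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


value = 70000.0  # larger than the largest finite float16 value (65504)

try:
    half = to_float16(value)
except (OverflowError, struct.error):
    half = float("inf")  # the value does not fit into float16

brain = to_bfloat16(value)

print("float16: ", half)
print("bfloat16:", brain)
```

The value overflows in float16 but stays finite in bfloat16, at the price of fewer significant digits. Overflows of this kind are one way NaN values can appear during pure float16 training.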
Image settings
Use default image settings
This setting applies default image settings when enabled. Turning it off reveals all image settings for customization.
- In the finetuning mode, the default settings are directly loaded from the configuration file of the pretrained multimodal model.
- In the pretraining mode, the default settings follow the standard Hydrogen Torch experiment configurations, except for the Image or Patch Size setting, which uses the default image size specified in the vision model's configuration file.
We strongly recommend keeping this option enabled in the finetuning mode.
Tokenizer settings
Max length
Grid search hyperparameter
Specify the maximum length of the token input sequence that is used for model training. The following example describes how you can use this setting to truncate a given token input sequence.
Consider the following text:
I'd like to read the H2O Hydrogen Torch documentation today.
The preceding text is tokenized by bert-base as follows:
['I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O', 'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']
A [CLS] (classification) token is subsequently added to the input sequence at position 0. (The manner in which this token is represented as a string depends on the model.)
['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O', 'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']
If the maximum length is set to 8, the preceding input sequence is truncated after 8 tokens. Therefore, the model is provided with the following input sequence:
['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the']
A higher maximum length leads to higher memory usage and slower training, but it can improve accuracy because less of the input is truncated.
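The truncation step from the example can be sketched in a few lines of Python. The token list is copied from the example above; the truncate helper is hypothetical and only mimics the cut-off, not the full behavior of a real tokenizer (which may, for instance, also handle trailing special tokens).

```python
# Token sequence from the example, including the leading [CLS] token.
tokens = ['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O',
          'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']


def truncate(sequence, max_length):
    """Keep only the first max_length tokens of the sequence."""
    return sequence[:max_length]


truncated = truncate(tokens, max_length=8)
print(truncated)
print(f"kept {len(truncated)} of {len(tokens)} tokens")
```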
Training settings
Optimizer
Grid search hyperparameter
This setting defines the algorithm (optimizer) to use for model training. The selected optimizer determines how the attributes of the neural network, such as its weights, are updated during training to reduce the training loss.
Options
- Adadelta
- To learn about Adadelta, see ADADELTA: An Adaptive Learning Rate Method.
- Adam
- To learn about Adam, see Adam: A Method for Stochastic Optimization.
- AdamW
- To learn about AdamW, see Decoupled Weight Decay Regularization.
- RMSprop
- To learn about RMSprop, see Neural Networks for Machine Learning.
- SGD
- H2O Hydrogen Torch uses a stochastic gradient descent optimizer.
Learning rate
Grid search hyperparameter
This setting defines the learning rate H2O Hydrogen Torch uses when training the model, specifically when updating the neural network's weights. The learning rate controls the size of the step the model takes when it updates its weights after processing each mini-batch of data.
- The learning rate is an important setting to tune as it balances under- and overfitting.
- The number of epochs highly impacts the optimal value of the learning rate.
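The effect of the learning rate can be illustrated with plain gradient descent on a toy loss. This is a minimal sketch, assuming the loss (w - 3)^2 and a fixed number of update steps; it does not reproduce any of the optimizers H2O Hydrogen Torch offers.

```python
def gradient(w: float) -> float:
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


def train(learning_rate: float, steps: int = 20, w: float = 0.0) -> float:
    """Run plain gradient descent and return the final weight."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w


for lr in (0.01, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {train(lr):.4f}")
```

With the small learning rate the weight has barely moved toward 3 after 20 steps, with the medium one it is close to 3, and with the large one the updates overshoot and diverge. This also hints at why the number of steps (and therefore epochs) and the learning rate need to be tuned together.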
Batch size
Grid search hyperparameter
This setting defines the number of training examples in a mini-batch, that is, the number of examples the model processes in one training iteration to estimate the error gradient before the model weights are updated. The batch size applies per GPU.
During model training, the training data is packed into mini-batches of a fixed size.
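A minimal sketch of how records can be packed into mini-batches is shown below. The make_mini_batches helper is hypothetical. Whether a final, smaller batch is kept or dropped is an implementation detail not stated here; this sketch keeps it.

```python
def make_mini_batches(records, batch_size):
    """Split records into consecutive mini-batches of at most batch_size items."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


records = list(range(10))  # stand-in for ten training records
batches = make_mini_batches(records, batch_size=4)

for index, batch in enumerate(batches):
    print(f"mini-batch {index}: {batch}")
```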
Epochs
Grid search hyperparameter
This setting defines the number of epochs to train the model. In other words, it specifies the number of times the learning algorithm goes through the entire training dataset.
- The Epochs setting is an important setting to tune because it balances under- and overfitting.
- The learning rate highly impacts the optimal number of epochs.
- For the following supported problem types, H2O Hydrogen Torch enables you to deploy a pretrained model with zero epochs of training (H2O Hydrogen Torch does not train the model, and the pretrained model (experiment) can be deployed as-is):
- Speech recognition
- Text sequence to sequence
- Text span prediction
Schedule
Grid search hyperparameter
This setting defines the learning rate schedule H2O Hydrogen Torch utilizes during model training. A learning rate schedule (other than Constant) changes the learning rate over iterations, typically decreasing it to achieve better model performance and training convergence.
Options
- Constant
- H2O Hydrogen Torch applies a constant learning rate during the training process.
- Cosine
- H2O Hydrogen Torch applies a cosine learning rate schedule, where the learning rate follows the values of the cosine function.
- Linear
- H2O Hydrogen Torch applies a linear learning rate schedule, where the learning rate decreases linearly.
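The three options can be sketched as simple functions of the training step. The exact formulas H2O Hydrogen Torch uses are not given here, so the sketch assumes common textbook forms: no warmup, and a decay to zero at the final step for the cosine and linear schedules. The function names are hypothetical.

```python
import math


def constant_schedule(step: int, total_steps: int, base_lr: float) -> float:
    """The learning rate stays at base_lr for the whole run."""
    return base_lr


def cosine_schedule(step: int, total_steps: int, base_lr: float) -> float:
    """The learning rate follows half a cosine wave from base_lr down to 0."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def linear_schedule(step: int, total_steps: int, base_lr: float) -> float:
    """The learning rate decreases linearly from base_lr down to 0."""
    return base_lr * (1.0 - step / total_steps)


total_steps, base_lr = 100, 1e-4  # illustrative values
for step in (0, 25, 50, 75, 100):
    print(
        f"step {step:3d}: "
        f"constant={constant_schedule(step, total_steps, base_lr):.2e}  "
        f"cosine={cosine_schedule(step, total_steps, base_lr):.2e}  "
        f"linear={linear_schedule(step, total_steps, base_lr):.2e}"
    )
```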
Weight decay
Grid search hyperparameter
This setting defines the weight decay that H2O Hydrogen Torch uses for the optimizer during model training.
Weight decay is a regularization technique that adds a penalty based on the L2 norm of all model weights to the loss function. Penalizing large weights can improve model generalization.
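The following sketch shows one common way an L2 penalty enters a gradient descent update: the penalty 0.5 * weight_decay * sum(w^2) contributes weight_decay * w to each gradient, which shrinks every weight slightly at each step. This is an assumption made for illustration; the exact form depends on the optimizer (AdamW, for example, applies the decay separately from the gradient), and the helper names are hypothetical.

```python
def l2_penalty(weights, weight_decay):
    """L2 penalty term that is added to the loss."""
    return 0.5 * weight_decay * sum(w * w for w in weights)


def sgd_step(weights, grads, learning_rate, weight_decay):
    """One gradient descent step where the L2 penalty adds weight_decay * w
    to the gradient of each weight."""
    return [
        w - learning_rate * (g + weight_decay * w)
        for w, g in zip(weights, grads)
    ]


weights = [1.0, -2.0, 0.5]
grads = [0.0, 0.0, 0.0]  # zero data gradient, so only the decay acts

penalty = l2_penalty(weights, weight_decay=0.01)
updated = sgd_step(weights, grads, learning_rate=0.1, weight_decay=0.01)

print("penalty:", penalty)
print("updated weights:", updated)
```

With a zero data gradient, every weight moves a small step toward zero, which is the regularizing effect of weight decay.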
Grad accumulation
Grid search hyperparameter
This setting defines the number of gradient accumulations before H2O Hydrogen Torch updates the neural network weights during model training.
- Grad accumulation can be beneficial if only small batches are selected for training. With gradient accumulation, the loss and gradients are calculated after each batch, but H2O Hydrogen Torch waits for the selected number of accumulations before updating the model weights. You can control the batch size through the Batch size setting.
- Changing the default value of Grad Accumulation might require adjusting the learning rate and batch size.
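The idea behind gradient accumulation can be sketched with a toy one-parameter model. The sketch assumes a mean squared error loss for the model y = w * x, equally sized mini-batches, and averaging of the accumulated gradients; the data and variable names are made up for illustration.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # toy (x, y) pairs
w = 0.0
batch_size = 2
grad_accumulation = 2

mini_batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

accumulated = 0.0
accumulated_gradient = None
for step, batch in enumerate(mini_batches, start=1):
    # Gradients are computed after every mini-batch ...
    accumulated += grad_mse(w, batch) / grad_accumulation
    # ... but they are only used for an update after grad_accumulation batches.
    if step % grad_accumulation == 0:
        accumulated_gradient = accumulated
        accumulated = 0.0

full_batch_gradient = grad_mse(w, data)

print("accumulated gradient:", accumulated_gradient)
print("full-batch gradient: ", full_batch_gradient)
```

Under these assumptions, accumulating over two mini-batches of size 2 gives the same gradient as one batch of size 4, so the effective batch size is the batch size multiplied by the number of accumulations.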
Prediction settings
Metric
This setting defines the metric to evaluate the model's performance.
Max new tokens
This setting defines the maximum number of new tokens that can be generated in the output text.
Do sample
This setting determines whether to sample from the next-token distribution instead of choosing the token with the highest probability. If turned On, the next token in a predicted sequence is sampled according to the token probabilities. If turned Off, the token with the highest probability is always chosen.
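The difference between the two modes can be sketched with a made-up next-token distribution. The tokens, probabilities, and the pick_next_token helper are hypothetical and only illustrate the selection rule.

```python
import random

# Hypothetical next-token probabilities produced by a model.
next_token_probs = {"cat": 0.6, "dog": 0.3, "bird": 0.1}


def pick_next_token(probs, do_sample, rng):
    """Sample from the distribution if do_sample is on,
    otherwise return the most probable token."""
    if do_sample:
        tokens = list(probs)
        weights = [probs[token] for token in tokens]
        return rng.choices(tokens, weights=weights, k=1)[0]
    return max(probs, key=probs.get)


rng = random.Random(0)
greedy = [pick_next_token(next_token_probs, False, rng) for _ in range(5)]
sampled = [pick_next_token(next_token_probs, True, rng) for _ in range(1000)]

print("Do sample off:", greedy)
print("Do sample on, token counts:",
      {token: sampled.count(token) for token in next_token_probs})
```

With sampling turned off, the same token is chosen every time. With sampling turned on, less likely tokens also appear, roughly in proportion to their probabilities.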
Environment settings
GPUs
This setting determines the list of GPUs H2O Hydrogen Torch can use for the experiment. GPUs are listed by name, referring to their system ID (starting from 1). If no GPUs are selected, H2O Hydrogen Torch utilizes the CPU for model training.
Seed
This setting defines the random seed value that H2O Hydrogen Torch uses during model training. It defaults to -1, which means that no fixed seed is set and random numbers differ at every invocation. Setting any other value fixes the random seed, which makes results reproducible and repeatable across runs.
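The behavior described above can be mimicked with Python's standard random module. The make_rng and draw helpers are hypothetical and only illustrate the convention that -1 means "no fixed seed"; they are not how H2O Hydrogen Torch seeds its training internally.

```python
import random


def make_rng(seed: int) -> random.Random:
    """Return a random generator; -1 means no fixed seed is set."""
    if seed == -1:
        return random.Random()  # seeded from system entropy, differs per call
    return random.Random(seed)


def draw(seed: int, n: int = 3):
    """Draw n random numbers from a freshly created generator."""
    rng = make_rng(seed)
    return [rng.random() for _ in range(n)]


print("seed 42, run 1:", draw(42))
print("seed 42, run 2:", draw(42))  # identical to run 1
print("seed -1, run 1:", draw(-1))
print("seed -1, run 2:", draw(-1))  # differs from run 1
```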
Logging settings
Logger
This setting defines the logger type that H2O Hydrogen Torch uses for model training.
Options
- None
- This option does not use any logger.
- Neptune
- This option utilizes Neptune as a logger to track the experiment. To use Neptune, you must define the following settings: Neptune API token and Neptune project.
Neptune API token
This setting defines the Neptune API token to validate all subsequent Neptune API calls.
This setting is available if you select Neptune in the Logger setting.
Neptune project
This setting defines the Neptune project.
This setting is available if you select Neptune in the Logger setting.
Log grad norm
This setting determines whether to log the total grad norm before and after clipping.
This setting adds a small overhead during the experiment runtime but can help determine if the gradients are exploding or unstable.
Turn this setting on if you suspect unstable gradients; as a result, you may then choose a value for the gradient clip to prevent exploding gradients.
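What "total grad norm before and after clipping" means can be sketched in plain Python. The sketch assumes clipping by the global L2 norm across all gradients, which is a common approach; the gradient values, the clip value, and the helper names are illustrative.

```python
import math


def total_grad_norm(grads):
    """L2 norm over all gradient values of all layers."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def clip_by_total_norm(grads, max_norm):
    """Scale all gradients down so that their total norm is at most max_norm."""
    norm = total_grad_norm(grads)
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [[g * scale for g in layer] for layer in grads]


grads = [[3.0, 4.0], [12.0]]  # toy gradients of two layers
norm_before = total_grad_norm(grads)
clipped = clip_by_total_norm(grads, max_norm=1.0)
norm_after = total_grad_norm(clipped)

print("grad norm before clipping:", norm_before)
print("grad norm after clipping: ", norm_after)
```

A norm before clipping that is repeatedly much larger than the norm after clipping is a sign that gradients are large or unstable.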