Version: Next

Experiment settings: Multi-modal causal language modeling

The settings available for a multi-modal causal language modeling experiment are listed below.

General settings

Dataset

This setting defines the dataset for the experiment.

Problem category

This setting defines the general problem category of the experiment, for example, image.

note

The selected problem category (for example, image) determines the options in the Problem type setting.
The From experiment option enables you to reuse the settings of another experiment.
- The From experiment option is unavailable when you select AutoDL as the experience level.

Problem type

This setting defines the problem type of the experiment, which also defines the settings H2O Hydrogen Torch displays for the experiment.

Note

The selected problem category (in the Problem category setting) determines the available problem types.
The selected problem type and experience level determine the settings H2O Hydrogen Torch displays for the experiment.

Experiment name

This setting defines the existing (built) experiment that H2O Hydrogen Torch references to initialize the experiment settings. The settings are initialized with the values from the selected experiment.

Setting dependency

This setting is available only if From experiment is selected in the Problem category setting.

Dataset settings

Train dataframe

This setting specifies the path to a file containing a dataframe of training records that H2O Hydrogen Torch uses to train the model in the experiment. The file must follow the dataset format required by the problem type of the experiment. To learn more, see Dataset formats.

note

The records are combined into mini-batches when training the model.
If a validation dataframe is provided, a fold column is not needed in the train dataframe.
To import a dataset for inference only, set the Train dataframe setting to None and the Test dataframe setting to the relevant dataframe when defining the experiment settings. H2O Hydrogen Torch then uses that dataset for predictions and not for training.

Validation dataframe

This setting defines a file containing a dataframe with validation records that H2O Hydrogen Torch uses to evaluate the model during training.

Note

Setting a validation dataframe requires the Validation strategy setting to be set to Custom holdout validation. When a validation dataframe is provided, H2O Hydrogen Torch respects that choice and does not perform any internal cross-validation. In other words, the model is trained on the full train dataframe, and model performance is evaluated on the provided validation dataframe.
The validation dataframe should have the same format as the train dataframe but does not require a fold column.

Setting dependency

The Validation dataframe setting is available only when you select Custom holdout validation in the Validation strategy setting.

Test dataframe

This setting defines a file containing a dataframe with test records that H2O Hydrogen Torch uses to test the model.

note

The test dataframe should have the same format as the train dataframe but does not require a label column.
To import a dataset for inference only, set the Train dataframe setting to None and the Test dataframe setting to the relevant dataframe when defining the experiment settings. H2O Hydrogen Torch then uses that dataset for predictions and not for training.

Architecture settings

Pretrained multimodal model

In the finetuning mode, specify your pretrained multimodal model. This can be a model name from the Hugging Face Hub, a local model path, or a Hydrogen Torch experiment name.

Freeze vision model

This setting allows freezing the parameters of the "Vision Model" component in the multimodal model. Freezing these parameters can reduce memory usage and speed up training. Note that this option is only applicable when LoRA is disabled (set to false).

Freeze language model

This setting allows freezing the parameters of the "Language Model" component in the multimodal model. Freezing these parameters can reduce memory usage and speed up training. Note that this option is only applicable when LoRA is disabled (set to false).

Lora

This setting turns on or off the use of Low-Rank Adaptation (LoRA) in H2O Hydrogen Torch during model training.

LoRA (Low-Rank Adaptation) is a technique that keeps the weight matrices of a large pretrained model frozen and trains small low-rank update matrices instead, which makes fine-tuning more memory-efficient and faster. In NLP or multimodal machine learning models, LoRA can substantially reduce the computational cost of training while retaining most of the model's performance.

Enabling this setting can lead to faster training times and lower memory usage, making it particularly useful when working with large-scale NLP or multimodal tasks. However, it may result in a slight decrease in model accuracy compared to updating the full-rank weight matrices. Turning off this setting ensures that the full-rank matrices are trained, but at the cost of longer training times and higher memory requirements.

tip

For most NLP or multimodal tasks, we recommend enabling LoRA during training unless you require the highest possible level of accuracy and have sufficient computational resources available. In such cases, turning off LoRA may improve model performance at the expense of increased training time and memory usage.
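
To make the memory argument concrete, the following sketch counts the trainable parameters for a single weight matrix with and without a low-rank update. It assumes the standard LoRA formulation (a frozen d x k weight plus two trainable factors of shape d x r and r x k). The function names, matrix size, and rank are illustrative values, not H2O Hydrogen Torch defaults.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for two low-rank factors (d x r and r x k)."""
    return r * (d + k)


# Illustrative sizes only (assumption): one 4096 x 4096 projection, rank 16.
d, k, r = 4096, 4096, 16
full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {r}):   {lora:,} trainable parameters")
```

With these example sizes, the low-rank factors hold well under 1% of the parameters of the full matrix, which is where the memory savings come from.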

Backbone Dtype

Bfloat16 is highly recommended for newer GPUs. For older GPUs that do not support Bfloat16, use float16 instead. However, pure float16 computation can be unstable during full backbone fine-tuning, potentially leading to NaN errors. To mitigate this, enabling LoRA is usually necessary.
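
One reason float16 can be unstable is its narrow value range: the largest finite float16 value is 65504, while bfloat16 keeps the exponent range of float32 at reduced precision. The sketch below illustrates this with the Python standard library only. The bfloat16 conversion is a simplified assumption (it truncates the float32 encoding to its top 16 bits instead of rounding), and the helper names are hypothetical.

```python
import struct

FLOAT16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value: float) -> bool:
    """Return True if the value can be packed as a half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False


def to_bfloat16(value: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


activation = 70000.0  # illustrative value just above the float16 range
print("fits in float16:", fits_in_float16(activation))
print("as bfloat16:    ", to_bfloat16(activation))
```

The value overflows in float16 but stays finite in bfloat16, at the price of a coarser approximation.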

Image settings

Use default image settings

This setting applies default image settings when enabled. Turning it off reveals all image settings for customization.

Note

In the finetuning mode, the default settings are directly loaded from the configuration file of the pretrained multimodal model.
In the pretraining mode, the default settings follow the standard Hydrogen Torch experiment configurations, except for the Image or Patch Size setting, which uses the default image size specified in the vision model's configuration file.

tip

We strongly recommend keeping this option enabled in the finetuning mode.

Tokenizer settings

Max length

Grid search hyperparameter

Specify the maximum length of the token input sequence that is used for model training. The following example describes how you can use this setting to truncate a given token input sequence.

Consider the following text:

I'd like to read the H2O Hydrogen Torch documentation today.

The preceding text is tokenized by bert-base as follows:

['I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O', 'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']

A [CLS] (classification) token is subsequently added to the input sequence at position 0. (The manner in which this token is represented as a string depends on the model.)

['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the', 'H', '##2', '##O', 'Hydrogen', 'Torch', 'document', '##ation', 'today', '.']

If the maximum length is set to 8, the preceding input sequence is truncated after 8 tokens. Therefore, the model is provided with the following input sequence:

['[CLS]', 'I', "'", 'd', 'like', 'to', 'read', 'the']

note

A higher token count leads to higher memory usage and slower training, but it increases the probability of obtaining a higher accuracy value.
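
The truncation in the example above can be reproduced with a plain list slice. This is a minimal sketch: the token list is copied from the example, and the `truncate` helper is a hypothetical name, not part of H2O Hydrogen Torch or of any tokenizer library.

```python
tokens = [
    "[CLS]", "I", "'", "d", "like", "to", "read", "the",
    "H", "##2", "##O", "Hydrogen", "Torch", "document", "##ation", "today", ".",
]


def truncate(sequence, max_length):
    """Keep only the first max_length tokens of the input sequence."""
    return sequence[:max_length]


print(truncate(tokens, 8))
```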

Training settings

Optimizer

Grid search hyperparameter

This setting defines the algorithm or method (optimizer) to use for model training. The optimizer determines how the attributes of the neural network, such as its weights and the effective learning rate, are updated during training in order to reduce the loss.

Details

Options

Adadelta
- To learn about Adadelta, see ADADELTA: An Adaptive Learning Rate Method.
Adam
- To learn about Adam, see Adam: A Method for Stochastic Optimization.
AdamW
- To learn about AdamW, see Decoupled Weight Decay Regularization.
RMSprop
- To learn about RMSprop, see Neural Networks for Machine Learning.
SGD
- H2O Hydrogen Torch uses a stochastic gradient descent optimizer.

Learning rate

Grid search hyperparameter

This setting defines the learning rate H2O Hydrogen Torch uses when training the model, specifically when updating the neural network's weights. The learning rate controls the size of the step by which the model updates its weights after processing each mini-batch of data.

note

The learning rate is an important setting to tune because it balances under- and overfitting.
The number of epochs strongly affects the optimal value of the learning rate.
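
The effect of the learning rate can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional loss (w - 3)^2. The loss function, step count, and learning rate values are assumptions chosen for illustration and do not reflect how H2O Hydrogen Torch trains a model.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimize (w - 3)^2 with a fixed learning rate and return the final w."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)      # derivative of (w - 3)^2
        w -= learning_rate * grad   # weight update scaled by the learning rate
    return w


for lr in (0.01, 0.4, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {gradient_descent(lr):.4f}")
```

A learning rate that is too small leaves the weight far from the optimum at 3 after the same number of steps, while one that is too large overshoots and diverges.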

Batch size

Grid search hyperparameter

This setting defines the number of training examples in a mini-batch, which is used during one training iteration to estimate the error gradient before the model weights are updated. In other words, this setting defines the batch size used per GPU.

note

During model training, the training data is packed into mini-batches of a fixed size.
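
As a rough illustration of packing records into mini-batches, the sketch below splits a list of records into consecutive chunks. The `make_mini_batches` helper is hypothetical, and the handling of a final, smaller batch is an assumption of this sketch rather than documented H2O Hydrogen Torch behavior.

```python
def make_mini_batches(records, batch_size):
    """Split records into consecutive mini-batches of at most batch_size items."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


records = list(range(10))  # stand-in for ten training records
for batch in make_mini_batches(records, batch_size=4):
    print(batch)
```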

Epochs

Grid search hyperparameter

This setting defines the number of epochs to train the model. In other words, it specifies the number of times the learning algorithm goes through the entire training dataset.

note

The Epochs setting is an important setting to tune because it balances under- and overfitting.
The learning rate strongly affects the optimal number of epochs.
For the following supported problem types, H2O Hydrogen Torch enables you to deploy a pretrained model with zero epochs of training (H2O Hydrogen Torch does not train the model, and the pretrained model (experiment) can be deployed as-is):
- Speech recognition
- Text sequence to sequence
- Text span prediction

Schedule

Grid search hyperparameter

This setting defines the learning rate schedule H2O Hydrogen Torch utilizes during model training. A learning rate schedule controls how the learning rate changes over training iterations, typically decreasing it to achieve better model performance and training convergence.

Details

Options

Constant
- H2O Hydrogen Torch applies a constant learning rate during the training process.
Cosine
- H2O Hydrogen Torch applies a cosine learning rate schedule, in which the learning rate follows the values of the cosine function.
Linear
- H2O Hydrogen Torch applies a linear learning rate schedule, in which the learning rate decreases linearly.
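
The three schedule options can be sketched as simple functions of the training progress. The formulas below are common textbook forms and are an assumption: the exact schedules used by H2O Hydrogen Torch (for example, whether they include warmup or a non-zero final value) are not specified here, and the `learning_rate` helper is hypothetical.

```python
import math


def learning_rate(schedule, step, total_steps, base_lr):
    """Return the learning rate at a given step for a named schedule."""
    progress = step / max(1, total_steps)  # fraction of training completed
    if schedule == "constant":
        return base_lr
    if schedule == "cosine":
        # Half a cosine wave: starts at base_lr and decays smoothly to 0.
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    if schedule == "linear":
        # Straight line from base_lr down to 0.
        return base_lr * (1.0 - progress)
    raise ValueError(f"unknown schedule: {schedule}")


for name in ("constant", "cosine", "linear"):
    values = [learning_rate(name, s, 100, 0.001) for s in (0, 50, 100)]
    print(name, [f"{v:.6f}" for v in values])
```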

Weight decay

Grid search hyperparameter

This setting defines the weight decay that H2O Hydrogen Torch uses for the optimizer during model training.

note

Weight decay is a regularization technique that adds a penalty based on the L2 norm of all model weights to the loss function, which discourages large weights and increases the probability of improving model generalization.
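
The penalty described in the note can be written out in a few lines. This is a minimal sketch of the classic L2 form, in which the weight decay value scales the sum of squared weights; the function names are hypothetical. Some optimizers, such as AdamW, instead apply the decay directly to the weights in the update step, so the sketch illustrates the idea rather than any specific optimizer.

```python
def l2_penalty(weights, weight_decay):
    """Weight decay value times the sum of squared weights."""
    return weight_decay * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, weight_decay):
    """Loss on the data plus the L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, weight_decay)


weights = [1.0, 2.0]
print(regularized_loss(1.0, weights, weight_decay=0.0))  # no regularization
print(regularized_loss(1.0, weights, weight_decay=0.1))  # penalized loss
```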

Grad accumulation

Grid search hyperparameter

This setting defines the number of gradient accumulations before H2O Hydrogen Torch updates the neural network weights during model training.

note

Grad accumulation can be beneficial if only small batches can be used for training. With gradient accumulation, the loss and gradients are calculated after each batch, but H2O Hydrogen Torch waits for the selected number of accumulations before updating the model weights. You can control the batch size through the Batch size setting.
Changing the default value of Grad accumulation might require adjusting the learning rate and batch size.
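
The following sketch shows the accumulate-then-update pattern on a single scalar weight. It assumes that the gradients of the accumulated batches are averaged and that leftover gradients at the end are discarded; both are simplifications made for this example. The helper names are hypothetical.

```python
def effective_batch_size(batch_size, accumulation_steps):
    """Number of examples contributing to one weight update (per GPU)."""
    return batch_size * accumulation_steps


def train_with_accumulation(batch_grads, learning_rate, accumulation_steps, w=0.0):
    """Update w only after every accumulation_steps mini-batch gradients."""
    accumulated = 0.0
    updates = 0
    for i, grad in enumerate(batch_grads, start=1):
        accumulated += grad / accumulation_steps  # running average of gradients
        if i % accumulation_steps == 0:
            w -= learning_rate * accumulated      # one optimizer step
            accumulated = 0.0
            updates += 1
    return w, updates


print(effective_batch_size(batch_size=8, accumulation_steps=4))
print(train_with_accumulation([1.0, 1.0, 1.0, 1.0], 0.5, accumulation_steps=2))
```

Because each update then reflects more examples, the learning rate that worked without accumulation may no longer be the best choice.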

Prediction settings

Metric

This setting defines the metric to evaluate the model's performance.

Max new tokens

This setting defines the maximum number of new tokens that can be generated in the output text.

Do sample

This setting determines whether to sample from the next token distribution instead of choosing the token with the highest probability. If turned On, the next token in a predicted sequence is sampled based on the probabilities. If turned Off, the token with the highest probability is always chosen.
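
The difference between the two modes can be sketched with a toy next-token distribution. The probabilities and the `pick_next_token` helper are made up for illustration; a real model produces a distribution over its whole vocabulary.

```python
import random


def pick_next_token(probs, do_sample, rng=None):
    """Pick the next token greedily or by sampling from the distribution."""
    if not do_sample:
        # Off: always take the most probable token.
        return max(probs, key=probs.get)
    if rng is None:
        rng = random.Random()
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    # On: draw one token at random, weighted by its probability.
    return rng.choices(tokens, weights=weights, k=1)[0]


next_token_probs = {"cat": 0.6, "dog": 0.3, "bird": 0.1}  # toy distribution
print(pick_next_token(next_token_probs, do_sample=False))
print(pick_next_token(next_token_probs, do_sample=True))
```

With sampling turned off, the output is deterministic; with sampling turned on, repeated calls can return different tokens.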

Environment settings

GPUs

This setting determines the list of GPUs H2O Hydrogen Torch can use for the experiment. GPUs are listed by name, referring to their system ID (starting from 1). If no GPUs are selected, H2O Hydrogen Torch utilizes the CPU for model training.

Seed

This setting defines the random seed value that H2O Hydrogen Torch uses during model training. It defaults to -1. When the value is set to anything other than -1, the seed makes results reproducible: defining a seed helps obtain predictable and repeatable results on every run. Leaving the default value (-1) leads to different random numbers at every invocation.
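
The behavior described above can be mimicked with Python's standard random number generator. The sketch treats -1 as "no fixed seed", mirroring the description of this setting; the `draw` helper is hypothetical and does not show how H2O Hydrogen Torch seeds its own libraries.

```python
import random


def draw(seed, n=3):
    """Return n random numbers, using a fixed seed unless seed is -1."""
    rng = random.Random() if seed == -1 else random.Random(seed)
    return [rng.random() for _ in range(n)]


print(draw(42) == draw(42))  # fixed seed: identical sequences
print(draw(-1))              # default: a fresh sequence on every call
```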

Logging settings

Logger

This setting defines the logger type that H2O Hydrogen Torch uses for model training.

Details

Options

None
- This option does not use any logger.
Neptune
- This option utilizes Neptune as a logger to track the experiment. To use Neptune, you must define the following settings: Neptune API token and Neptune project.

Neptune API token

This setting defines the Neptune API token to validate all subsequent Neptune API calls.

setting dependency

This setting is available if you select Neptune in the Logger setting.

Neptune project

This setting defines the Neptune project.

setting dependency

This setting is available if you select Neptune in the Logger setting.

Log grad norm

This setting determines whether to log the total grad norm before and after clipping.

note

This setting adds a small overhead during the experiment runtime but can help determine if the gradients are exploding or unstable.

tip

Turn this setting on if you suspect unstable gradients. Based on the logged values, you can then choose a gradient clipping value to prevent exploding gradients.
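
To show what "total grad norm before and after clipping" refers to, the sketch below computes the L2 norm over a flat list of gradient values and rescales them when the norm exceeds a threshold. It assumes clipping by global norm; the function names and the threshold are illustrative, not H2O Hydrogen Torch settings.

```python
import math


def total_grad_norm(grads):
    """L2 norm over all gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_norm(grads, max_norm):
    """Rescale the gradients so that their total norm is at most max_norm."""
    norm = total_grad_norm(grads)
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


grads = [3.0, 4.0]  # toy gradient values
clipped = clip_by_norm(grads, max_norm=1.0)
print("norm before clipping:", total_grad_norm(grads))
print("norm after clipping: ", total_grad_norm(clipped))
```

A logged norm that repeatedly sits far above the clipping threshold is the kind of signal that points to exploding gradients.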

