Experiment settings: Image metric learning
The settings for an image metric learning experiment are listed and described below.
General settings
Dataset
Defines the dataset for the experiment.
Problem category
Defines the general problem type category of the experiment, for example, image.
- The selected problem category (for example, image) determines the options in the Problem type setting.
- The From experiment option is also available. It lets you reuse the settings of another experiment.
Experiment
Defines the experiment H2O Hydrogen Torch references to initialize the experiment settings. H2O Hydrogen Torch initializes the experiment settings with the values from the selected (built) experiment.
This setting is available only if From experiment is selected in the Problem category setting.
Problem type
Defines the problem type of the experiment, which also defines the settings H2O Hydrogen Torch displays for the experiment.
- The selected problem category (in the Problem category setting) determines the available problem types.
- The selected problem type and experience level determine the settings H2O Hydrogen Torch displays for the experiment.
Import config from YAML
Defines the .yml file that defines the experiment settings.
- H2O Hydrogen Torch supports .yml file import and export functionality. You can download the config settings of finished experiments, make changes, and re-upload them when starting a new experiment in any instance of H2O Hydrogen Torch.
- To learn how to download the .yml file (configuration file) of a completed experiment, see Download an experiment's logs/config file.
Use previous experiment weights
Defines whether to initialize the model weights with the weights from the experiment specified in the Experiment setting.
- This setting is available only if From experiment is selected in the Problem category setting.
- Model weights can be reused only by an experiment of the same problem type and backbone.
- This setting is useful when you want to continue training from a built experiment.
Experiment name
Defines the name of the experiment.
Dataset settings
Train dataframe
Defines a .csv or .pq file containing a dataframe with training records that H2O Hydrogen Torch uses to train the model.
- The records are combined into mini-batches when training the model.
- If a validation dataframe is provided, a fold column is not needed in the train dataframe.
- You can now import datasets for inference only. To do so, when defining the setting for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).
Data folder
Defines the location of the folder containing assets (for example, images or audio clips) the model utilizes for training. H2O Hydrogen Torch loads assets from this folder during training.
Validation strategy
Specifies the validation strategy H2O Hydrogen Torch uses for the experiment.
To properly assess the performance of your trained models, it is common practice to evaluate them on separate holdout data that the models have not seen during training. H2O Hydrogen Torch lets you choose among several strategies for this task to fit your needs.
Options
- All supported problem types
- K-fold cross validation
- Splits the data using the provided optional fold column in the train data or performs an automatic 5-fold cross-validation.
- Grouped k-fold cross validation
- Lets you specify a group column based on which the data is split into folds.
- Custom holdout validation
- Specifies a separate holdout dataframe.
- Automatic holdout validation
- Lets you specify the size of a holdout validation sample that is generated automatically.
Validation dataframe
Defines a .csv or .pq file containing a dataframe with validation records that H2O Hydrogen Torch uses to evaluate the model during training.
- Setting a Validation dataframe requires the Validation strategy to be set to Custom holdout validation. In this case, H2O Hydrogen Torch fully respects the choice of a separate validation dataframe and does not perform any internal cross-validation. In other words, the model is trained on the full provided train dataframe, and model performance is evaluated on the provided validation dataframe.
- The validation dataframe should have the same format as the train dataframe but does not require a fold column.
Selected folds
Defines the selected validation fold(s) in case of cross-validation; a separate model is trained for each value selected. Each model utilizes the corresponding part of the data as a holdout sample to assess performance while the model is fitted to the rest of the records from the training dataframe. As a result, folds estimate how the model performs in general when used to make predictions on data not used during model training.
- H2O Hydrogen Torch allows running experiments on a single selected fold for faster experimenting and multiple selected folds to gain more trust in the model's generalization and performance capabilities.
- The Selected folds setting is available only if Custom holdout validation is not selected as the Validation strategy.
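As a rough illustration of how selected folds behave, the sketch below assigns records to five folds and, for each selected fold, holds that fold out while the remaining records form the training split. The function names and the round-robin fold assignment are assumptions made for illustration and are not H2O Hydrogen Torch internals.

```python
def assign_folds(n_records, n_folds=5):
    """Assign each record index to a fold in round-robin order (illustrative only)."""
    return [i % n_folds for i in range(n_records)]


def split_for_fold(folds, holdout_fold):
    """Return (train_indices, holdout_indices) for one selected fold."""
    train = [i for i, f in enumerate(folds) if f != holdout_fold]
    holdout = [i for i, f in enumerate(folds) if f == holdout_fold]
    return train, holdout


folds = assign_folds(n_records=10, n_folds=5)
selected_folds = [0, 3]  # one model would be trained per selected fold
splits = {fold: split_for_fold(folds, fold) for fold in selected_folds}
```

Selecting a single fold gives one train/holdout split and a faster experiment; selecting several folds gives several splits, each with its own model.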
Test dataframe
Defines a .csv or .pq file containing a dataframe with test records that H2O Hydrogen Torch uses to test the model.
- The test dataframe should have the same format as the train dataframe but does not require a label column.
- You can now import datasets for inference only. To do so, when defining the setting for an experiment, set the Train dataframe setting to None while setting the Test dataframe setting to the relevant dataframe (as a result, H2O Hydrogen Torch utilizes the relevant dataset for predictions and not for training).
Data folder test
Defines the location of the folder containing assets (for example, images, texts, or audio clips) H2O Hydrogen Torch utilizes to test the model. H2O Hydrogen Torch loads the assets from this folder when testing the model.
The Data folder test setting appears only when you specify a test dataframe in the Test dataframe setting.
Label columns
Defines the name(s) of the dataframe column(s) that refer to the target value(s) H2O Hydrogen Torch aims to predict.
Image column
Defines the dataframe column storing the names of images that H2O Hydrogen Torch loads from the data folder and data folder test when training and testing the model.
Data sample
Defines the percentage of the data to use for the experiment. The default percentage is 100% (1).
Lowering the default value can significantly speed up training, but it might substantially reduce accuracy. Using 100% (1) of the data for final models is highly recommended.
Image settings
Image width
Defines the width H2O Hydrogen Torch uses to rescale the images for training and predictions.
Depending on the original image size, a bigger width can generate a higher accuracy value.
Image height
Defines the height H2O Hydrogen Torch uses to rescale the images for training and predictions.
Depending on the original image size, a bigger height can generate a higher accuracy value.
Image channels
Defines the number of channels the train images contain.
- Typically images have three input channels (red, green, and blue (RGB)), but grayscale images have only 1. When you provide image data in a NumPy data format, any number of channels is allowed. For this reason, data scientists can specify the number of channels.
- The defined number of channels also refers to the provided validation and test datasets.
Image normalization
Grid search hyperparameter
Defines the transformer to normalize the image data before training the model.
Usually, state-of-the-art image models normalize the training images by scaling values of each of the input channels to predefined means and standard deviations.
Options
Image regression | Image classification | Image object detection | Image semantic segmentation | Image instance segmentation | Image metric learning
- Channel
- Calculates mean and standard deviation per channel in all the images in the batch and then applies per channel normalization: subtracts mean and divides by standard deviation.
- Image
- Calculates mean and standard deviation per image and then applies normalization.
- ImageNet
- Divides input images by 255 and normalizes with mean and standard deviation equal to (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225) per channel, respectively.
- Inception
- Divides input images by 255 and normalizes with mean and standard deviation equal to 0.5.
- Min_Max
- Calculates minimum and maximum values in all the images in the batch and then applies min-max normalization: subtracts min and divides by the max and min difference.
- No
- No normalization is applied to the input images.
- Simple
- Divides input images by 255.
3D image classification | 3D image regression | 3D image semantic segmentation
- No
- No normalization is applied to the input images.
- Simple
- Divides input images by 255.
- Min_Max
- Calculates minimum and maximum values in all the images in the batch and then applies min-max normalization: subtracts min and divides by the max and min difference.
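The normalization options described above can be sketched with NumPy as follows. This is a minimal illustration based only on the option descriptions. The function names, the (batch, height, width, channels) array layout, and the random test batch are assumptions and do not reflect how H2O Hydrogen Torch implements the transformers.

```python
import numpy as np

# Per-channel statistics quoted in the ImageNet option above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def normalize_imagenet(images):
    """images: uint8 array of shape (batch, height, width, 3) (assumed layout)."""
    scaled = images.astype(np.float64) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD


def normalize_inception(images):
    """Divide by 255, then normalize with mean 0.5 and standard deviation 0.5."""
    return (images.astype(np.float64) / 255.0 - 0.5) / 0.5


def normalize_min_max(images):
    """Min-max normalization over all images in the batch (assumes max > min)."""
    images = images.astype(np.float64)
    lo, hi = images.min(), images.max()
    return (images - lo) / (hi - lo)


def normalize_channel(images):
    """Per-channel mean and standard deviation computed over the whole batch."""
    images = images.astype(np.float64)
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return (images - mean) / std


# Hypothetical batch of four 8x8 RGB images.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)
```

The Inception variant maps pixel values into the range [-1, 1], while the Min_Max variant maps them into [0, 1].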
Augmentation settings
Augmentations strategy
Grid search hyperparameter
Defines the augmentation strategy to apply to the input images. Soft, Medium, and Hard values correspond to the strength of the augmentations to apply.
Options
Image regression | Image classification | Image object detection | Image semantic segmentation | Image instance segmentation | Image metric learning
- Soft
- The Soft strategy applies image Resize and random HorizontalFlip during model training while applying image Resize during model inference.
- Medium
- The Medium strategy adds ShiftScaleRotate and CoarseDropout to the list of the train augmentations.
- Hard
- The Hard strategy applies RandomResizedCrop (instead of Resize) during model training while adding RandomBrightnessContrast to the list of train augmentations.
- Custom
- The Custom strategy allows users to use their own augmentations that can be defined in the following two settings:
3D image classification | 3D image regression | 3D image semantic segmentation
- Soft
- The Soft strategy applies image Resize and random HorizontalFlip during model training while applying image Resize during model inference.
- Medium
- The Medium strategy adds ShiftScaleRotate and CoarseDropout to the list of the train augmentations.
- Hard
- The Hard strategy applies RandomResizedCrop (instead of Resize) during model training while adding RandomBrightnessContrast to the list of train augmentations.
Augmentations are ways to modify train images while keeping the target values valid, such as flipping the image or adding noise. Distorting training images does not influence the expected prediction of the model but enriches the training data. Augmentations help the model generalize better and improve its accuracy.
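To make the idea concrete, here is a small NumPy sketch of a random horizontal flip, the simplest of the augmentations named above. The function name, the (height, width, channels) layout, and the example label are assumptions for illustration; H2O Hydrogen Torch applies its own augmentation pipeline.

```python
import numpy as np


def random_horizontal_flip(image, rng, p=0.5):
    """Flip an image of shape (height, width, channels) left to right with probability p."""
    if rng.random() < p:
        return image[:, ::-1, :]
    return image


rng = np.random.default_rng(42)
image = np.arange(2 * 3 * 1).reshape(2, 3, 1)  # tiny hypothetical 2x3 grayscale image
label = "cat"  # hypothetical target; it stays valid whether or not the image is flipped
flipped = random_horizontal_flip(image, rng, p=1.0)
```

The pixel values are rearranged, but the label attached to the image is unchanged, which is what makes the transformation a valid augmentation.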
Architecture settings
Pretrained
Grid search hyperparameter
Defines whether the neural network should start with pre-trained weights. When this setting is On, training starts from a model pre-trained on a generic task. When this setting is Off, training starts from randomly initialized weights.
Embedding size
Grid search hyperparameter
Defines the dimensionality H2O Hydrogen Torch uses for the embedding vector representing one sample during model training.
- The embedding size affects the granularity of the embeddings calculated for individual records and, in turn, the cosine similarity calculation that follows the embedding calculation.
- A smaller embedding size typically leads to more general embeddings and larger ones to more specific ones.
- Tuning the size of the embedding can impact overfitting and underfitting.
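The cosine similarity calculation mentioned above can be sketched in plain Python. The vectors below are hypothetical 4-dimensional embeddings (that is, an embedding size of 4) chosen only to show the two extremes of the measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, not produced by any real model.
query = [1.0, 0.0, 1.0, 0.0]
same_direction = [2.0, 0.0, 2.0, 0.0]
orthogonal = [0.0, 1.0, 0.0, 1.0]
```

Embeddings that point in the same direction have a similarity close to 1 regardless of their length, and orthogonal embeddings have a similarity of 0.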
Backbone
Grid search hyperparameter
Defines the backbone neural network architecture to train the model.
- Image regression | Image classification | Image metric learning | Audio regression | Audio classification
- H2O Hydrogen Torch accepts backbone neural network architectures from the timm library (select or enter the architecture name).
- Image object detection
- H2O Hydrogen Torch provides several state-of-the-art backbone neural network architectures for model training. When you select Faster R-CNN or FCOS as the model type for the experiment, you can input any architecture name from the timm library. When you select EfficientDet as the model type for the experiment, you can input any architecture name from the efficientdet-pytorch library.
- Image semantic segmentation | Image instance segmentation
- H2O Hydrogen Torch accepts backbone neural network architectures from the segmentation-models-pytorch library (select or enter the architecture name).
- 3D image regression | 3D image classification
- H2O Hydrogen Torch accepts backbone (encoder) neural network architectures from a subset (resnet and efficientnet) of the timm library (select or enter the architecture name).
- Text regression | Text classification | Text token classification | Text span prediction | Text sequence to sequence | Text metric learning
- H2O Hydrogen Torch accepts backbone neural network architectures from the Hugging Face library (select or enter the architecture name).
- Speech recognition
- Hugging Face Wav2Vec2 CTC models are supported.
- All problem types
- Usually, it is good to use simpler architectures for quicker experiments and larger models when aiming for the highest accuracy.
- Speech recognition
- If possible, leverage backbones pre-trained on data close to your use case (for example, noisy audio or casual speech).
Pool
Grid search hyperparameter
Defines the global pooling method H2O Hydrogen Torch uses in the model architecture before the final fully connected layer. Instead of adding a fully connected layer on top of the feature maps, global pooling is applied to each feature map beforehand.
Certain backbones (for example, ViT) do not require pooling. For such backbones, H2O Hydrogen Torch does not display this setting.
Options
Image regression
- Average
- H2O Hydrogen Torch applies global average pooling.
- CatAverageMax
- H2O Hydrogen Torch concatenates global average and max poolings.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- MeanAverageMax
- H2O Hydrogen Torch calculates the mean between global average and max poolings.
Image classification
- Average
- H2O Hydrogen Torch applies global average pooling.
- CatAverageMax
- H2O Hydrogen Torch concatenates global average and max poolings.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- MeanAverageMax
- H2O Hydrogen Torch calculates the mean between global average and max poolings.
Image metric learning
- Average
- H2O Hydrogen Torch applies global average pooling.
- CatAverageMax
- H2O Hydrogen Torch concatenates global average and max poolings.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- MeanAverageMax
- H2O Hydrogen Torch calculates the mean between global average and max poolings.
Text regression
- Average
- H2O Hydrogen Torch applies global average pooling.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- [CLS] token
- H2O Hydrogen Torch uses the output of the first [CLS] token.
Text classification
- Average
- H2O Hydrogen Torch applies global average pooling.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- [CLS] token
- H2O Hydrogen Torch uses the output of the first [CLS] token.
Text metric learning
- Average
- H2O Hydrogen Torch applies global average pooling.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- [CLS] token
- H2O Hydrogen Torch uses the output of the first [CLS] token.
Audio regression
- Average
- H2O Hydrogen Torch applies global average pooling.
- CatAverageMax
- H2O Hydrogen Torch concatenates global average and max poolings.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- MeanAverageMax
- H2O Hydrogen Torch calculates the mean between global average and max poolings.
Audio classification
- Average
- H2O Hydrogen Torch applies global average pooling.
- CatAverageMax
- H2O Hydrogen Torch concatenates global average and max poolings.
- GeM
- H2O Hydrogen Torch applies a Generalized Mean Pooling (GeM) introduced in the following paper: Fine-tuning CNN Image Retrieval with No Human Annotation.
- Max
- H2O Hydrogen Torch applies a global max pooling.
- MeanAverageMax
- H2O Hydrogen Torch calculates the mean between global average and max poolings.
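The pooling options listed above can be sketched with NumPy on a single (channels, height, width) feature tensor. This is an illustration of the general techniques only. The function names, the GeM exponent p=3, and the clipping constant are assumptions and may differ from what H2O Hydrogen Torch uses.

```python
import numpy as np


def average_pool(features):
    """Global average pooling: (channels, height, width) -> (channels,)."""
    return features.mean(axis=(1, 2))


def max_pool(features):
    """Global max pooling: (channels, height, width) -> (channels,)."""
    return features.max(axis=(1, 2))


def gem_pool(features, p=3.0, eps=1e-6):
    """Generalized mean pooling: p=1 equals average pooling, large p approaches max pooling."""
    clipped = np.clip(features, eps, None)  # keep values positive before the power
    return (clipped ** p).mean(axis=(1, 2)) ** (1.0 / p)


def cat_average_max(features):
    """Concatenate global average and max poolings (doubles the vector length)."""
    return np.concatenate([average_pool(features), max_pool(features)])


def mean_average_max(features):
    """Mean of global average and max poolings."""
    return (average_pool(features) + max_pool(features)) / 2.0


# Hypothetical feature maps: 2 channels, each 2x2.
features = np.array([[[1.0, 2.0], [3.0, 4.0]],
                     [[0.5, 0.5], [0.5, 0.5]]])
```

Note that the concatenating variant changes the size of the pooled vector, while the other variants keep one value per channel.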
Dropout
Grid search hyperparameter
Defines the dropout rate that H2O Hydrogen Torch applies between the backbone and neck of the model (before the final fully connected layer) during model training. The dropout rate helps the model generalize better by randomly dropping a share of the neural network connections.
Training settings
Loss function
Grid search hyperparameter
Defines the loss function H2O Hydrogen Torch utilizes during model training. The loss function is a differentiable function measuring the prediction error. The model utilizes gradients of the loss function to update the model weights during training.
Options
Image regression | 3D image regression | Text regression | Audio regression
- MAE
- H2O Hydrogen Torch utilizes the mean absolute error (L1 norm) as the loss function.
- MSE
- H2O Hydrogen Torch utilizes the mean squared error (squared L2 norm) as the loss function.
- RMSE
- H2O Hydrogen Torch utilizes the root mean squared error (L2 norm) as the loss function.
Image classification | 3D image classification | Text classification | Audio classification
- BCE
- H2O Hydrogen Torch uses binary cross entropy loss.
- Classification
- This default classification loss automatically chooses between BCE (multi-label) and CrossEntropy (multi-class) for classification.
- CrossEntropy
- H2O Hydrogen Torch utilizes multi-class cross entropy loss as a loss function.
- SigmoidFocal
- H2O Hydrogen Torch uses the sigmoid Focal loss (gamma=2.0) for classification introduced in the following paper: Focal Loss for Dense Object Detection
- SoftmaxFocal
- H2O Hydrogen Torch uses the softmax Focal loss (gamma=2.0) for classification introduced in the following paper: Focal Loss for Dense Object Detection
Image semantic segmentation | 3D image semantic segmentation | Image instance segmentation
- BCE
- H2O Hydrogen Torch uses binary cross entropy loss.
- BCEDice
- H2O Hydrogen Torch uses binary cross entropy loss and Dice loss with weights 2 and 1, respectively.
- BCELovasz
- H2O Hydrogen Torch uses binary cross entropy loss and Lovasz loss with equal weights.
- Dice
- H2O Hydrogen Torch uses Dice loss.
- Focal
- H2O Hydrogen Torch uses the Focal loss for semantic segmentation introduced in the following paper: Focal Loss for Dense Object Detection
- FocalDice
- H2O Hydrogen Torch uses Focal loss and Dice loss with weights 2 and 1, respectively.
- Jaccard
- H2O Hydrogen Torch uses Jaccard loss.
Image metric learning | Text metric learning
- ArcFace
- H2O Hydrogen Torch utilizes an Additive Angular Margin Loss for Deep Face Recognition (ArcFace).
- CrossEntropy
- H2O Hydrogen Torch utilizes multi-class cross entropy loss as a loss function.
Text token classification | Text span prediction | Text sequence to sequence
- CrossEntropy
- H2O Hydrogen Torch utilizes multi-class cross entropy loss as a loss function.
Speech recognition
- CTC Loss
- H2O Hydrogen Torch utilizes Connectionist Temporal Classification loss as a loss function.
Arcface margin
Grid search hyperparameter
Defines the margin for ArcFace loss; higher values result in a bigger separation of samples.
- Tuning this setting can impact the training and quality of embeddings.
- This setting can be an important setting to tune and specifically depends on the dataset at hand.
Arcface scale
Grid search hyperparameter
Defines the ArcFace loss scale value that changes the shape of logits and impacts gradients.
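The roles of the margin and scale can be sketched in plain Python. In the additive angular margin formulation, the margin is added to the angle between an embedding and its true class center, and the resulting cosines are multiplied by the scale before cross entropy is applied. The function name and the example values (margin 0.5, scale 30) are assumptions for illustration, and the sketch omits the special handling that full implementations use when the angle plus margin exceeds pi.

```python
import math


def arcface_logits(cosines, target_index, margin=0.5, scale=30.0):
    """cosines: cosine similarity between one embedding and each class center."""
    logits = []
    for i, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard acos against rounding error
        theta = math.acos(c)
        if i == target_index:
            theta += margin  # the margin is added to the angle of the true class only
        logits.append(scale * math.cos(theta))
    return logits


# Hypothetical cosine similarities to three class centers; class 0 is the true class.
cosines = [0.8, 0.1, -0.3]
logits = arcface_logits(cosines, target_index=0, margin=0.5, scale=30.0)
```

A larger margin lowers the true-class logit further, so the model must pull same-class embeddings closer together to keep the loss low. A larger scale stretches all logits and therefore sharpens the softmax and changes the gradients.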
Optimizer
Grid search hyperparameter
Defines the algorithm or method (optimizer) to use for model training. The selected algorithm or method defines how the model should change the attributes of the neural network, such as weights and learning rate. Optimizers solve optimization problems and make more accurate updates to attributes to reduce learning losses.
Options
- All supported problem types
- Adadelta
- To learn about Adadelta, see ADADELTA: An Adaptive Learning Rate Method.
- Adam
- To learn about Adam, see Adam: A Method for Stochastic Optimization.
- AdamW
- To learn about AdamW, see Decoupled Weight Decay Regularization.
- RMSprop
- To learn about RMSprop, see Neural Networks for Machine Learning.
- SGD
- H2O Hydrogen Torch uses a stochastic gradient descent optimizer.
Learning rate
Grid search hyperparameter
Defines the learning rate H2O Hydrogen Torch uses when training the model, specifically when updating the neural network's weights. The learning rate is the speed at which the model updates its weights after processing each mini-batch of data.
- Learning rate is an important setting to tune as it balances under- and overfitting.
- The number of epochs highly impacts the optimal value of the learning rate.
Differential learning rate layers
Defines the learning rate to apply to certain layers of a model. H2O Hydrogen Torch applies the regular learning rate to layers without a specified learning rate.
Options
Image regression | Image classification | Text regression | Text classification | Text token classification | Audio regression | Audio classification
- Backbone
- H2O Hydrogen Torch applies a different learning rate to a body of the neural network architecture.
- Head
- H2O Hydrogen Torch applies a different learning rate to a head of the neural network architecture.
Image object detection
The options for an image object detection experiment depend on the selected Model type setting.
If you select EfficientDet as the experiment's Model type (setting), the following options are available:
Options
- Backbone
- H2O Hydrogen Torch applies a different learning rate to a body of the EfficientDet architecture.
- FPN
- H2O Hydrogen Torch applies a different learning rate to a Feature Pyramid Network (FPN) block of the EfficientDet architecture.
- class_net
- H2O Hydrogen Torch applies a different learning rate to a classification head of the EfficientDet architecture.
- box_net
- H2O Hydrogen Torch applies a different learning rate to a box regression head of the EfficientDet architecture.
If you select Faster R-CNN as the experiment's Model type (setting), the following options are available:
Options
- Body
- H2O Hydrogen Torch applies a different learning rate to a body of the Faster R-CNN architecture.
- FPN
- H2O Hydrogen Torch applies a different learning rate to a Feature Pyramid Network (FPN) block in the Faster R-CNN architecture.
- RPN
- H2O Hydrogen Torch applies a different learning rate to a Region Proposal block of the Faster R-CNN architecture.
- ROI heads
- H2O Hydrogen Torch applies a different learning rate to the Faster R-CNN architecture proposal heads.
If you select FCOS as the experiment's Model type (setting), the following options are available:
Options
- Body
- H2O Hydrogen Torch applies a different learning rate to a body of the FCOS architecture.
- FPN
- H2O Hydrogen Torch applies a different learning rate to a Feature Pyramid Network (FPN) block of the FCOS architecture.
- classification_head
- H2O Hydrogen Torch applies a different learning rate to the classification head of the FCOS architecture.
- regression_head
- H2O Hydrogen Torch applies a different learning rate to a box regression head of the FCOS architecture.
Image semantic segmentation
- Encoder
- H2O Hydrogen Torch applies a different learning rate to the encoder of the neural network architecture.
- Decoder
- H2O Hydrogen Torch applies a different learning rate to the decoder of the neural network architecture.
- Segmentation head
- H2O Hydrogen Torch applies a different learning rate to the head of the neural network architecture.
3D image semantic segmentation | Text sequence to sequence
- Encoder
- H2O Hydrogen Torch applies a different learning rate to the encoder of the neural network architecture.
- Decoder
- H2O Hydrogen Torch applies a different learning rate to the decoder of the neural network architecture.
Image instance segmentation
- Encoder
- H2O Hydrogen Torch applies a different learning rate to the encoder of the neural network architecture.
- Decoder
- H2O Hydrogen Torch applies a different learning rate to the decoder of the neural network architecture.
- Segmentation head
- H2O Hydrogen Torch applies a different learning rate to the head of the neural network architecture.
Image metric learning | Text metric learning
- Backbone
- H2O Hydrogen Torch applies a different learning rate to a body of the neural network architecture.
- Neck
- H2O Hydrogen Torch applies a different learning rate to a neck of the neural network architecture.
- Loss
- H2O Hydrogen Torch applies a different learning rate to an ArcFace block of the neural network architecture.
Text regression
- Backbone
- H2O Hydrogen Torch applies a different learning rate to a body of the neural network architecture.
Text span prediction
- qa_outputs
A common strategy is to apply a lower learning rate to the backbone of a model for better convergence and training stability.
Different layers are available for different problem types.
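One way to picture differential learning rates is as two groups of parameters, each with its own learning rate. The plain-Python sketch below groups parameter names by prefix. The function name, the parameter names, and the learning rate values are all hypothetical, and the grouping logic is an assumption rather than the H2O Hydrogen Torch implementation.

```python
def build_param_groups(param_names, base_lr, differential_lr, differential_layers):
    """Split parameter names into a regular group and a differential group."""
    regular, differential = [], []
    for name in param_names:
        if any(name.startswith(layer + ".") for layer in differential_layers):
            differential.append(name)
        else:
            regular.append(name)
    return [
        {"params": regular, "lr": base_lr},
        {"params": differential, "lr": differential_lr},
    ]


# Hypothetical parameter names for a metric learning model.
names = ["backbone.conv1.weight", "backbone.layer1.weight", "neck.fc.weight", "loss.weight"]
groups = build_param_groups(
    names, base_lr=1e-3, differential_lr=1e-5, differential_layers=["backbone"]
)
```

Here the backbone receives the lower learning rate, while the neck and loss layers keep the regular one.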
Batch size
Grid search hyperparameter
Defines the number of training examples in a mini-batch, which H2O Hydrogen Torch uses in each training iteration to estimate the error gradient before updating the model weights. The batch size applies per GPU.
During model training, the training data is packed into mini-batches of a fixed size.
Automatically adjust batch size
If this setting is turned On, H2O Hydrogen Torch checks whether the specified Batch size fits into the GPU memory. If a GPU out-of-memory (OOM) error occurs, H2O Hydrogen Torch automatically decreases the Batch size by a factor of 2 until it fits into the GPU memory or the Batch size equals 1.
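The halving behavior described above can be sketched as a simple loop. The function name and the memory check passed in as a callback are hypothetical; a real check would attempt a training step and catch the out-of-memory error.

```python
def find_fitting_batch_size(batch_size, fits_in_memory):
    """Halve the batch size until fits_in_memory(batch_size) is True or it reaches 1."""
    while batch_size > 1 and not fits_in_memory(batch_size):
        batch_size = max(1, batch_size // 2)
    return batch_size


# Hypothetical memory check: pretend anything above 12 samples runs out of GPU memory.
chosen = find_fitting_batch_size(64, fits_in_memory=lambda b: b <= 12)
```

Starting from 64, the loop tries 32 and 16 before settling on 8, the first value that passes the check.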
Drop last batch
H2O Hydrogen Torch drops the last incomplete batch during model training when this setting is turned On.
H2O Hydrogen Torch groups the train data into mini-batches of equal size during the training process, but the last batch can have fewer records than the others. Not dropping the last batch can lead to a less robust gradient estimation while causing a more volatile training step.
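A short sketch of the effect, using a hypothetical helper that lists the size of every mini-batch in one epoch:

```python
def batch_sizes(n_records, batch_size, drop_last):
    """Return the size of each mini-batch in one pass over n_records."""
    full, remainder = divmod(n_records, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # the last, incomplete batch
    return sizes
```

With 10 records and a batch size of 4, keeping the last batch yields sizes 4, 4, and 2, while dropping it yields 4 and 4, so two records are skipped in that epoch.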
Epochs
Grid search hyperparameter
Defines the number of epochs to train the model. In other words, it specifies the number of times the learning algorithm goes through the entire training dataset.
- The Epochs setting is an important setting to tune because it balances under- and overfitting.
- The learning rate highly impacts the optimal value of the epochs.
- For the following supported problem types, H2O Hydrogen Torch lets you deploy a pretrained model with zero epochs of training (H2O Hydrogen Torch does not train the model, and the pretrained model can be deployed as-is):
- Speech recognition
- Text sequence to sequence
- Text span prediction
Schedule
Grid search hyperparameter
Defines the learning rate schedule H2O Hydrogen Torch utilizes during model training. Specifying a learning rate schedule prevents the learning rate from staying the same. Instead, a learning rate schedule causes the learning rate to change over iterations, typically decreasing the learning rate to achieve a better model performance and training convergence.
Options
- All supported problem types
- Constant
- H2O Hydrogen Torch applies a constant learning rate during the training process.
- Cosine
- H2O Hydrogen Torch applies a cosine learning rate that follows the values of the cosine function.
- Linear
- H2O Hydrogen Torch applies a linear learning rate that decreases the learning rate linearly.
Warmup epochs
Grid search hyperparameter
Defines the number of epochs to warm up the learning rate where the learning rate should increase linearly from 0 to the desired learning rate.
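The interaction between warmup and the learning rate schedule can be sketched as a single function of the training step. This is a minimal illustration: the function name, the step-based granularity, and the decay to exactly zero at the end of training are assumptions, not documented H2O Hydrogen Torch behavior.

```python
import math


def learning_rate(step, total_steps, warmup_steps, base_lr, schedule="cosine"):
    """Learning rate at a given step: linear warmup from 0, then the chosen schedule."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup training that has been completed (0.0 to 1.0).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if schedule == "constant":
        return base_lr
    if schedule == "linear":
        return base_lr * (1.0 - progress)
    if schedule == "cosine":
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {schedule}")
```

For example, with 100 total steps and 10 warmup steps, the learning rate rises linearly during the first 10 steps, reaches the base value at step 10, and then follows the selected schedule for the remaining 90 steps.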
Weight decay
Grid search hyperparameter
Defines the weight decay that H2O Hydrogen Torch uses for the optimizer during model training.
Weight decay is a regularization technique that adds a penalty based on the L2 norm of all model weights to the loss function, which can improve model generalization.
Gradient clip
Grid search hyperparameter
Defines the maximum norm of the gradients H2O Hydrogen Torch specifies during model training. Defaults to 0, no clipping. When a value greater than 0 is specified, H2O Hydrogen Torch modifies the gradients during model training. H2O Hydrogen Torch uses the specified value as an upper limit for the norm of the gradients, calculated using the Euclidean norm over all gradients per batch.
This setting can help model convergence when extreme gradient values cause high volatility of weight updates.
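Clipping by the Euclidean norm can be sketched in plain Python on a flat list of gradient values. The function name and the flat-list representation are assumptions for illustration; in practice the norm is computed over the gradients of all model parameters.

```python
import math


def clip_gradients(gradients, max_norm):
    """Scale gradient values so their Euclidean norm is at most max_norm.

    A max_norm of 0 disables clipping, mirroring the default described above.
    """
    if max_norm <= 0:
        return list(gradients)
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients)
    scale = max_norm / total_norm
    return [g * scale for g in gradients]
```

Gradients with norm 5, such as [3.0, 4.0], are scaled down to norm 1 when the limit is 1, while their direction is preserved. Gradients already below the limit are left unchanged.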
Grad accumulation
Grid search hyperparameter
Defines the number of gradient accumulations before H2O Hydrogen Torch updates the neural network weights during model training.
- Grad accumulation can be beneficial if only small batches are selected for training. With gradient accumulation, the loss and gradients are calculated after each batch, but H2O Hydrogen Torch waits for the selected number of accumulations before updating the model weights. You can control the batch size through the Batch size setting.
- Changing the default value of Grad accumulation might require adjusting the learning rate and batch size.
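The arithmetic behind gradient accumulation can be sketched as follows. The helper names are hypothetical, and the multiplication by the number of GPUs assumes that the batch size applies per GPU and that gradients are combined across GPUs.

```python
def effective_batch_size(batch_size, grad_accumulation, n_gpus=1):
    """Number of records that contribute to one weight update."""
    return batch_size * grad_accumulation * n_gpus


def optimizer_steps(n_batches, grad_accumulation):
    """Number of weight updates when updating once every grad_accumulation batches."""
    return n_batches // grad_accumulation
```

For instance, a batch size of 8 with 4 accumulations behaves like a batch of 32 records per update, and 100 batches then produce 25 weight updates instead of 100. This is why the learning rate may need to be adjusted together with this setting.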
Save best checkpoint
Determines if H2O Hydrogen Torch should save the model weights of the epoch exhibiting the best validation metric. When turned On, H2O Hydrogen Torch saves the model weights for the epoch exhibiting the best validation metric. When turned Off, H2O Hydrogen Torch saves the model weights after the last epoch is executed.
- This setting should be turned On with care as it has the potential to lead to overfitting of the validation data.
- The default goal should be to attempt to tune models so that the last or very last epoch is the best epoch.
- If the logs show an evident decline in the validation metric for later epochs, it is usually better to adjust hyperparameters, such as reducing the number of epochs or increasing regularization, instead of turning this setting On.
Evaluation epochs
Defines the number of epochs H2O Hydrogen Torch uses before each validation loop for model training. In other words, it determines the frequency (in a number of epochs) to run the model evaluation on the validation data.
- Increasing the number of Evaluation Epochs can speed up an experiment.
- The Evaluation epochs setting is available only if the following setting is turned Off: Save Best Checkpoint.
Evaluate before training
Determines whether to perform a validation run before training. This setting is potentially helpful for assessing the performance of zero-shot pretrained backbones and checking the modeling pipeline.
The following problem types support externally pretrained zero-shot models (problem types without this support fit a new head on top of a backbone):
- Text span prediction
- Text sequence to sequence
- Speech recognition
Calculate train metric
Determines whether the model metric should also be calculated for the training data at the end of the training. When On, the model metric is calculated for the training data. The resulting values do not indicate the true model performance because they are computed on the same data records H2O Hydrogen Torch used for model training, but they can give insights into over/underfitting.
Train validation data
Defines whether the model should use the entire train and validation dataset during model training. When turned On, H2O Hydrogen Torch uses the whole train dataset and validation data to train the model.
- H2O Hydrogen Torch also evaluates the model on the provided validation fold. Validation is always only on the provided validation fold.
- H2O Hydrogen Torch uses both datasets for model training if you provide a train and validation dataset.
- To define a training dataset, use the Train dataframe setting. For more information, see Train dataframe.
- To define a validation dataset, use the Validation dataframe setting. For more information, see Validation dataframe.
- The Train validation data setting is only available if you turned the Save best checkpoint setting Off.
- See Save best checkpoint to learn more about the Save best checkpoint setting.
- Turning On the Train validation data setting should produce a model that you can expect to perform better because H2O Hydrogen Torch trained the model on more data. Note, though, that using the entire train dataset and out-of-fold validation dataset generally causes the model's accuracy to be overstated, as information from the validation data is incorporated into the model during the training process.
If you have five folds and set fold 0 as validation, H2O Hydrogen Torch usually trains on folds 1-4 and reports on fold 0. With Train validation data turned On, H2O Hydrogen Torch adds fold 0 to the training but still reports its accuracy on fold 0. As a result, the accuracy is overstated for fold 0 but should be better for any unseen (test) data/production scenarios. For that reason, you usually want to consider this setting after running your experiments and deciding on models.
Build scoring pipelines
Determines whether the experiment (model) automatically generates an H2O MLOps pipeline and Python scoring pipeline at the end of the experiment. If turned Off, you can still create scoring pipelines on demand when the experiment is complete (e.g., when you click Download scoring or Download MLOps).
Prediction settings
Metric
Defines the metric to evaluate the model's performance.
Options
Image regression | 3D image regression | Text regression | Audio regression
- MAE: Mean absolute error
- The Mean Absolute Error (MAE) is an average of the absolute errors. The MAE units are the same as the predicted target, which is useful for understanding whether the size of the error is of concern or not. The smaller the MAE the better the model’s performance.
- MSE: Mean squared error
- The MSE metric measures the average of the squares of the errors or deviations. MSE takes the distances from the points to the regression line (these distances are the “errors”) and squares them to remove any negative signs. MSE incorporates both the variance and the bias of the predictor.
- MSE also gives more weight to larger differences. The bigger the error, the more it is penalized. For example, if your correct answers are 2,3,4 and the algorithm guesses 1,4,3, then the absolute error on each one is exactly 1, so squared error is also 1, and the MSE is 1. But if the algorithm guesses 2,3,6, then the errors are 0,0,2, the squared errors are 0,0,4, and the MSE is a higher 1.333. The smaller the MSE, the better the model’s performance.
- RMSE: Root mean squared error
- The Root Mean Squared Error (RMSE) metric evaluates how well a model can predict a continuous value. The RMSE units are the same as the predicted target, which is useful for understanding if the size of the error is of concern or not. The smaller the RMSE, the better the model’s performance.
- RMSE penalizes outliers more, as compared to MAE, so it is useful if we want to avoid having large errors.
- MAPE: Mean absolute percentage error
- Mean Absolute Percentage Error (MAPE) measures the size of the error in percentage terms. It is calculated as the average of the unsigned percentage error.
- MAPE is useful when target values are across different scales.
- SMAPE: Symmetric mean absolute percentage error
- Unlike the MAPE, which divides the absolute errors by the absolute actual values, the SMAPE divides by the mean of the absolute actual and the absolute predicted values. This is important when the actual values can be 0 or near 0. Actual values near 0 cause the MAPE value to become infinitely high. Because SMAPE includes both the actual and the predicted values, the SMAPE value can never be greater than 200%.
- R2: R squared
- The R2 value represents the degree that the predicted value and the actual value move in unison. The R2 value varies between 0 and 1 where 0 represents no correlation between the predicted and actual value and 1 represents complete correlation.
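The regression metrics above can be checked with a short worked example. This sketch reuses the numbers from the MSE description (actual values 2, 3, 4); the function name `regression_metrics` is an assumption for illustration and is not part of H2O Hydrogen Torch.

```python
import math

def regression_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE, MAPE, and SMAPE for two equal-length lists."""
    n = len(actual)
    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in abs_errors) / n
    return {
        "mae": sum(abs_errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        # MAPE divides by the actual value, so it is undefined when an actual is 0.
        "mape": 100 * sum(e / abs(a) for e, a in zip(abs_errors, actual)) / n,
        # SMAPE divides by the mean of |actual| and |predicted| instead.
        "smape": 100 * sum(
            e / ((abs(a) + abs(p)) / 2)
            for e, a, p in zip(abs_errors, actual, predicted)
        ) / n,
    }

actual = [2, 3, 4]
close_guess = regression_metrics(actual, [1, 4, 3])   # every guess is off by 1
one_big_miss = regression_metrics(actual, [2, 3, 6])  # two exact, one off by 2

print(close_guess)
print(one_big_miss)
```

The second set of guesses has the lower MAE but the higher MSE, which shows how squaring gives more weight to a single large error.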
Image classification | 3D image classification | Text classification | Audio classification
- LogLoss: Logarithmic loss
- The logarithmic loss metric can be used to evaluate the performance of a binomial or multinomial classifier. Unlike AUC which looks at how well a model can classify a binary target, logloss evaluates how close a model’s predicted values (uncalibrated probability estimates) are to the actual target value. For example, does a model tend to assign a high predicted value like .80 for the positive class, or does it show a poor ability to recognize the positive class and assign a lower predicted value like .50? Logloss can be any value greater than or equal to 0, with 0 meaning that the model correctly assigns a probability of 0% or 100%.
- ROC_AUC: Area under the receiver operating characteristic curve
- This model metric is used to evaluate how well a binary classification model is able to distinguish between true positives and false positives. For multi-class problems, this score is computed by micro-averaging the ROC curves for each class.
- An Area Under the Curve (AUC) of 1 indicates a perfect classifier, while an AUC of .5 indicates a poor classifier whose performance is no better than random guessing.
- F1
- The F1 score is calculated from the harmonic mean of the precision and recall. An F1 score of 1 means both precision and recall are perfect, and the model correctly identified all the positive cases and didn’t mark a negative case as a positive case. If either precision or recall is very low, it is reflected with an F1 score closer to 0.
- Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
- Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives).
- Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
- Micro-averaging: H2O Hydrogen Torch micro-averages the F1 metric (score).
- Multi-class: For multi-class classification experiments utilizing an F1 metric, the derived micro-average F1 metric might look suspicious; in that case, the micro-average F1 metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing an F1 metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using a one-hot encoder method resulting in the experiment being treated as a multi-class classification experiment while leading to an incorrect calculation of the F1 metric.
- F2
- The F2 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F2 score gives more weight to recall than to precision. More weight should be given to recall for cases where False Negatives are considered worse than False Positives. For example, if your use case is to predict which customers will churn, you may consider False Negatives worse than False Positives. In this case, you want your predictions to capture all of the customers that will churn. Some of these customers may not be at risk for churning, but the extra attention they receive is not harmful. More importantly, no customers actually at risk of churning have been missed.
- Formula: F2 = 5 * (Precision * Recall) / ((4 * Precision) + Recall)
- Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives).
- Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
- Micro-averaging: H2O Hydrogen Torch micro-averages the F2 metric (score).
- Multi-class: For multi-class classification experiments utilizing an F2 metric, the derived micro-average F2 metric might look suspicious; in that case, the micro-average F2 metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing an F2 metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using a one-hot encoder method resulting in the experiment being treated as a multi-class classification experiment while leading to an incorrect calculation of the F2 metric.
- F05
- The F05 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F05 score gives more weight to precision than to recall. More weight should be given to precision for cases where False Positives are considered worse than False Negatives. For example, if your use case is to predict which products you will run out of, you may consider False Positives worse than False Negatives. In this case, you want your predictions to be very precise and only capture the products that will definitely run out. If you predict a product will need to be restocked when it actually doesn’t, you incur cost by having purchased more inventory than you actually need.
- Formula: F05 = 1.25 * (Precision * Recall) / ((0.25 * Precision) + Recall)
- Precision is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives).
- Recall is the positive observations (true positives) the model correctly identified from all the actual positive cases (the true positives + the false negatives).
- Micro-averaging: H2O Hydrogen Torch micro-averages the F05 metric (score).
- Multi-class: For multi-class classification experiments utilizing an F05 metric, the derived micro-average F05 metric might look suspicious; in that case, the micro-average F05 metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing an F05 metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using a one-hot encoder method resulting in the experiment being treated as a multi-class classification experiment while leading to an incorrect calculation of the F05 metric.
- Precision
- The precision metric measures the ratio of true positives among all predicted positives.
- Formula: Precision = True Positive / (True Positive + False Positive)
- Micro-averaging: H2O Hydrogen Torch micro-averages the precision metric (score).
- Multi-class: For multi-class classification experiments utilizing a precision metric, the derived micro-average precision metric might look suspicious; in that case, the micro-average precision metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing a precision metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using a one-hot encoder method resulting in the experiment being treated as a multi-class classification experiment while leading to an incorrect calculation of the precision metric.
- Recall
- The recall metric measures the ratio of actual positive cases that the model predicted correctly.
- Formula: Recall = True Positive / (True Positive + False Negative)
- Micro-averaging: H2O Hydrogen Torch micro-averages the recall metric (score).
- Multi-class: For multi-class classification experiments utilizing a recall metric, the derived micro-average recall metric might look suspicious; in that case, the micro-average recall metric is numerically equivalent to the accuracy score.
- Binary: For binary classification experiments utilizing a recall metric, the label column needs to contain 0/1 values. If the column contains string values, the column is transformed into multiple columns using a one-hot encoder method resulting in the experiment being treated as a multi-class classification experiment while leading to an incorrect calculation of the recall metric.
- Accuracy
- In binary classification, Accuracy is the number of correct predictions made as a ratio of all predictions made. In multiclass classification, the set of labels predicted for a sample must exactly match the corresponding set of labels in target values.
- MCC: Matthews correlation coefficient
- The goal of the Matthews Correlation Coefficient (MCC) metric is to represent the confusion matrix of a model as a single number. The MCC metric combines the true positives, false positives, true negatives, and false negatives using the following MCC equation: MCC = (TP * TN - FP * FN) / sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)).
- Unlike metrics like Accuracy, MCC is a good scorer to use when the target variable is imbalanced. In the case of imbalanced data, high Accuracy can be found by predicting the majority class. Metrics like Accuracy and F1 can be misleading, especially in the case of imbalanced data, because they do not consider the relative size of the four confusion matrix categories. MCC, on the other hand, takes the proportion of each class into account. The MCC value ranges from -1 to 1 where -1 indicates a classifier that predicts the opposite class from the actual value, 0 means the classifier does no better than random guessing, and 1 indicates a perfect classifier.
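The F-score and MCC definitions above can be illustrated with a small sketch. The helper names (`f_beta`, `micro_counts`, `mcc`) and the toy labels are assumptions for the example, not H2O Hydrogen Torch functions. The sketch shows two points made in the text: a micro-averaged F1 on a single-label multi-class problem equals the accuracy, and MCC exposes a majority-class predictor that accuracy rewards.

```python
import math

def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def micro_counts(y_true, y_pred, classes):
    """Sum true positives, false positives, and false negatives over all classes."""
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            tp += (t == c and p == c)
            fp += (t != c and p == c)
            fn += (t == c and p != c)
    return tp, fp, fn

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Multi-class: the micro-averaged F1 matches the accuracy.
y_true = ["cat", "dog", "bird", "cat", "dog"]
y_pred = ["cat", "dog", "cat", "cat", "bird"]
tp, fp, fn = micro_counts(y_true, y_pred, classes=["cat", "dog", "bird"])
micro_f1 = f_beta(tp / (tp + fp), tp / (tp + fn), beta=1)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced binary data: always predicting the majority (negative) class.
majority_accuracy = 95 / 100
majority_mcc = mcc(tp=0, fp=0, tn=95, fn=5)

print(micro_f1, accuracy, majority_accuracy, majority_mcc)
```

With `beta=2` and `beta=0.5`, `f_beta` reduces to the F2 and F05 formulas listed above.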
Image object detection
- mAP: Mean average precision
Image semantic segmentation | 3D image semantic segmentation
- IoU: Intersection over union
- Dice
Image instance segmentation
- COCO_mAP: COCO (Common Objects in Context) mean average precision
Image metric learning | Text metric learning
- mAP: Mean average precision
Text token classification
- CONLL_MICRO_F1_SCORE
- Micro F1 score calculated in CoNLL style
- CONLL_MACRO_F1_SCORE
- Macro F1 score calculated in CoNLL style
- MICRO_F1_SCORE: Micro F1 score
- MACRO_F1_SCORE: Macro F1 score
Text span prediction
- Jaccard
- F1
- Accuracy
- Top_2_Accuracy
- Top_3_Accuracy
- Top_4_Accuracy
- Top_5_Accuracy
Text sequence to sequence
- BLEU
- Computes the BLEU metric given hypotheses and references
- CHRF
- Computes the chrF(++) metric given hypotheses and references
- TER
- Computes the translation edit rate metric given hypotheses and references
Speech recognition
- WER: Word error rate
- CER: Character error rate
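As a brief illustration of the two speech recognition metrics, word error rate and character error rate are commonly defined as the edit distance between the reference and the predicted transcript, divided by the reference length. The sketch below follows that common definition; the function names are hypothetical, and any text normalization H2O Hydrogen Torch applies before scoring is not covered.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

Here one word is substituted and one is dropped, so two of the six reference words are wrong. Lower values are better for both metrics.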
Top K similar
Defines the number (k) of similar predictions to keep for each record during model training.
Defining this setting impacts output predictions and metrics (metrics that rely on some top-k selection) but not the training process.
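To show how a top-k selection feeds into a metric such as mean average precision, the sketch below ranks records by cosine similarity of their embeddings and scores the ranking. Everything here is an illustrative assumption: the toy embeddings, the function names, and the choice to average precision only over the matches found within the top k. The exact mAP definition H2O Hydrogen Torch uses is not specified in this section.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_similar(query_idx, embeddings, k):
    """Return indices of the k records most similar to the query (query excluded)."""
    scores = [(cosine(embeddings[query_idx], e), i)
              for i, e in enumerate(embeddings) if i != query_idx]
    scores.sort(key=lambda s: -s[0])
    return [i for _, i in scores[:k]]

def average_precision(query_label, retrieved_labels):
    """Average of precision-at-rank over the ranks where a matching label appears."""
    hits, precisions = 0, []
    for rank, label in enumerate(retrieved_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [0.1, 0.9]]
labels = ["a", "a", "b", "a", "b"]
k = 3

ap_scores = []
for q in range(len(embeddings)):
    neighbors = top_k_similar(q, embeddings, k)
    ap_scores.append(average_precision(labels[q], [labels[i] for i in neighbors]))
mean_ap = sum(ap_scores) / len(ap_scores)
print(top_k_similar(0, embeddings, 2), mean_ap)
```

Changing `k` changes which neighbors are kept and therefore the reported score, while the embeddings themselves (the trained model) stay the same.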
Test time augmentations
Defines the test time augmentation(s) to apply during inference. Test time augmentations are applied when the model makes predictions on new data. The final prediction is an average of the predictions for all the augmented versions of an image.
Options
Image regression | 3D image regression | Image classification | 3D image classification | Image semantic segmentation | 3D image semantic segmentation | Image instance segmentation | Image metric learning
- Horizontal flip
- H2O Hydrogen Torch applies a horizontal flip as the test time augmentation(s).
- Vertical flip
- H2O Hydrogen Torch applies a vertical flip as the test time augmentation(s).
This technique can improve the model accuracy.
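The averaging step can be sketched as follows. The `toy_model` function is a hypothetical stand-in for a trained network, images are represented as lists of pixel rows, and including the unmodified image in the average is an assumption of this sketch.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Mirror an image top to bottom."""
    return image[::-1]

def predict_with_tta(model, image, augmentations):
    """Average the model output over the original image and each augmented copy."""
    views = [image] + [aug(image) for aug in augmentations]
    outputs = [model(view) for view in views]
    return [sum(vals) / len(vals) for vals in zip(*outputs)]

def toy_model(image):
    """Hypothetical model: returns the mean of the left column and of the top row."""
    left = sum(row[0] for row in image) / len(image)
    top = sum(image[0]) / len(image[0])
    return [left, top]

image = [[1.0, 2.0], [3.0, 4.0]]
plain = toy_model(image)
averaged = predict_with_tta(toy_model, image, [horizontal_flip, vertical_flip])
print(plain, averaged)
```

Each selected augmentation adds one more forward pass per image, so inference time grows with the number of augmentations.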
Environment settings
GPUs
Determines the list of GPUs H2O Hydrogen Torch can use for the experiment. GPUs are listed by name, referring to their system ID (starting from 1). If no GPUs are selected, the CPU is used for model training.
Number of seeds per run
Defines the number of seeds to use for a single run. If more than one seed is selected, each experiment runs multiple times.
- Deep learning models can sometimes exhibit a certain randomness in individual runs. Running an experiment multiple times with multiple seeds can give insights into the stability of results.
- In case of high randomness, better judgement can be made about the performance of a model with certain hyperparameter settings, by comparing the average results across seeds, for example in a grid search scenario.
Number of GPUs per run
Defines the number of GPUs to use for a single run when training the model. A single run might represent a single fold, a single seed run or a single grid search run.
If 5 GPUs are available, it is possible to run a 5-fold cross-validation in parallel using a single GPU per fold.
- The available GPUs are the ones that can be enabled using the GPUs setting.
- If the number of GPUs is less than or equal to 1, this setting (Number of GPUs per run) is not available.
Mixed precision training
Determines whether to use mixed-precision during model training. When turned Off, H2O Hydrogen Torch does not use mixed-precision for training.
Mixed-precision is a technique that helps decrease memory consumption and increases training speed.
Mixed precision inference
Determines whether to use mixed-precision during model inference.
Mixed-precision is a technique that helps decrease memory consumption and increases inference speed.
Sync batch normalization
Determines whether to synchronize batch normalization across GPUs in a distributed data-parallel (DDP) mode. In other words, when turned On, multi-GPU training synchronizes the batch normalization layers of the model across GPUs. In multi-GPU training, H2O Hydrogen Torch splits the batch across GPUs, and therefore, when a normalization layer normalizes data, it has access only to the part of the batch stored on its own device. Training works without synchronization but can give better results if the data from all GPUs is collected to normalize over the entire batch.
When turned On, data scientists can expect the training speed to drop slightly while the model's accuracy may improve. However, an accuracy improvement is rare in practice and only occurs for specific problem types and batch sizes.
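The difference between per-device and synchronized statistics can be seen with a few numbers. This is a plain Python sketch with a made-up batch split across two hypothetical GPUs; in PyTorch, the layer that performs the synchronization is `torch.nn.SyncBatchNorm`.

```python
def mean_and_var(values):
    """Return the mean and (population) variance used to normalize a batch."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / len(values)
    return m, v

# One batch of 8 activations, split evenly across two devices.
batch = [1.0, 2.0, 3.0, 4.0, 10.0, 11.0, 12.0, 13.0]
gpu0, gpu1 = batch[:4], batch[4:]

# Without synchronization, each device normalizes with its own statistics.
local_stats = [mean_and_var(gpu0), mean_and_var(gpu1)]

# With synchronization, every device uses the statistics of the whole batch.
synced_stats = mean_and_var(batch)

print(local_stats, synced_stats)
```

The per-device means and variances differ substantially from the whole-batch values when the split is uneven in content, and the effect is stronger with small per-device batch sizes.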
Number of workers
Defines the number of workers H2O Hydrogen Torch uses for the DataLoader. In other words, it defines the number of CPU processes to use when reading and loading data to GPUs during model training.
Seed
Defines the random seed value that H2O Hydrogen Torch uses during model training. It defaults to -1, in which case an arbitrary seed is used. When the value is modified (not -1), the random seed makes results reproducible: defining a seed helps obtain predictable and repeatable results every time. Leaving the default seed value (-1) leads to different random numbers at every invocation.
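The effect of a fixed seed can be sketched with Python's standard `random` module. The `sample` helper and the handling of -1 as "do not fix the seed" are assumptions that mirror the description above, not H2O Hydrogen Torch code.

```python
import random

def sample(seed, n=3):
    """Draw n random numbers; a seed of -1 leaves the generator unseeded."""
    rng = random.Random() if seed == -1 else random.Random(seed)
    return [rng.random() for _ in range(n)]

# A fixed seed reproduces the same sequence on every call.
first = sample(42)
second = sample(42)

# The default (-1) draws from system entropy, so repeated calls differ.
unseeded_a = sample(-1)
unseeded_b = sample(-1)

print(first == second, unseeded_a == unseeded_b)
```

In a training run, the same idea applies to weight initialization, data shuffling, and augmentation, which is why two runs with the same seed and settings are expected to be comparable.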
Logging settings
Logger
Defines the logger type that H2O Hydrogen Torch uses for model training.
Options
- All supported problem types
- None
- H2O Hydrogen Torch does not use any logger.
- Neptune
- H2O Hydrogen Torch uses Neptune as a logger to track the experiment. To use Neptune, you must specify a Neptune API token and a Neptune project.
Neptune API token
Defines the Neptune API token to validate all subsequent Neptune API calls.
Neptune project
Defines the Neptune project to access if you selected Neptune in the Logger setting.
Log grad norm
Determines whether to log the total grad norm before and after clipping.
This setting adds a small overhead during the experiment runtime but can help determine if the gradients are exploding or unstable.
Turn this setting on if you suspect unstable gradients; as a result, you may then choose a value for the gradient clip to prevent exploding gradients.
Number of images
This setting defines the number of images to show in the experiment Insights tab.