Experiment settings

The settings for creating an experiment are grouped into sections. The settings under each section are listed and described below.

General settings

Dataset

Defines the dataset for the experiment.

Problem type

Defines the problem type of the experiment, which also defines the settings H2O LLM Studio displays for the experiment.

  • Causal Language Modeling: Used to fine-tune large language models

  • DPO Modeling: Used to fine-tune large language models using Direct Preference Optimization

  • Sequence To Sequence Modeling: Used to fine-tune large sequence to sequence models

  • Causal Classification Modeling: Used to fine-tune causal classification models

Import config from YAML

Specifies a .yml file that defines the experiment settings.

  • H2O LLM Studio supports a .yml file import and export functionality. You can download the config settings of finished experiments, make changes, and re-upload them when starting a new experiment in any instance of H2O LLM Studio.
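
Below is a minimal sketch of how a downloaded config could be edited with Python before re-uploading it. It assumes PyYAML is installed; the key path shown in the comment (training / learning_rate) is only illustrative, as the exact structure of the exported cfg.yaml depends on the problem type.

  # Edit a downloaded experiment config before re-uploading it.
  import yaml

  with open("cfg.yaml") as f:           # config exported from a finished experiment
      cfg = yaml.safe_load(f)

  print(cfg.keys())                     # inspect the available setting groups

  # Example change (hypothetical key path, check your exported file):
  # cfg["training"]["learning_rate"] = 1e-4

  with open("cfg_modified.yaml", "w") as f:
      yaml.safe_dump(cfg, f)            # upload this file when starting a new experiment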

Experiment name

Defines the name of the experiment.

LLM backbone

The LLM Backbone option is the most important setting, as it defines the pretrained model weights the experiment starts from.

  • Usually, it is good to use smaller architectures for quicker experiments and larger models when aiming for the highest accuracy
  • If possible, leverage backbones pre-trained closely to your use case
  • Any Hugging Face model can be used here (not limited to the ones in the dropdown list)

Dataset settings

Train dataframe

Defines a .csv or .pq file containing a dataframe with training records that H2O LLM Studio uses to train the model.

  • The records are combined into mini-batches when training the model.

Validation strategy

Specifies the validation strategy H2O LLM Studio uses for the experiment.

To properly assess the performance of your trained models, it is common practice to evaluate them on separate holdout data that the model has not seen during training. H2O LLM Studio allows you to specify different strategies for this task to fit your needs.

Options

  • Custom holdout validation
    • Specifies a separate holdout dataframe.
  • Automatic holdout validation
    • Lets you specify a holdout validation sample size that is automatically generated.

Validation size

Defines an optional relative size of the holdout validation set. H2O LLM Studio automatically samples the selected percentage from the full training data and builds a holdout dataset that the model is validated on.

Data sample

Defines the percentage of the data to use for the experiment. The default percentage is 100% (1).

Changing the default value can significantly increase the training speed. Still, it might lead to a substantially lower accuracy value. Using 100% (1) of the data for final models is highly recommended.

System column

The column in the dataset containing the system input which is always prepended for a full sample.

Prompt column

The column in the dataset containing the user prompt.

Answer column

The column in the dataset containing the expected output.

For classification, this needs to be an integer column containing the class label.

Parent ID column

An optional column specifying the parent ID used for chained conversations. The values in this column need to match values in an additional column named id. If provided, the prompt is concatenated after the preceding parent rows, as sketched below.
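
The following rough sketch shows how such a parent chain can be resolved from a dataframe; the column names (id, parent_id, prompt, answer) and the concatenation format are illustrative only, and the logic H2O LLM Studio uses internally may differ.

  import pandas as pd

  df = pd.DataFrame(
      {
          "id": ["a", "b"],
          "parent_id": [None, "a"],          # row "b" continues conversation "a"
          "prompt": ["What is MNIST?", "How many classes does it have?"],
          "answer": ["A handwritten digit dataset.", "Ten."],
      }
  ).set_index("id")

  def build_chain(row_id: str) -> str:
      """Concatenate all parent prompt/answer pairs before the current prompt."""
      row = df.loc[row_id]
      text = f"{row['prompt']} {row['answer']}"
      parent = row["parent_id"]
      if parent is None or pd.isna(parent):
          return text
      return f"{build_chain(parent)} {text}"

  print(build_chain("b"))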

Text prompt start

Optional text to prepend to each prompt.

Text answer separator

Optional text to append to each prompt / prepend to each answer.
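
As a rough illustration of how these two settings combine with the dataset columns, the sketch below assembles a full text sample; the default token strings and the exact template are assumptions, not the authoritative format H2O LLM Studio uses internally.

  def build_sample(system: str, prompt: str, answer: str,
                   text_prompt_start: str = "<|prompt|>",
                   text_answer_separator: str = "<|answer|>") -> str:
      # the system input is always prepended, then the prompt, then the answer
      return f"{system}{text_prompt_start}{prompt}{text_answer_separator}{answer}"

  print(build_sample("You are a helpful assistant.",
                     "Summarize the text.",
                     "Here is a summary..."))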

Add EOS token to prompt

Adds an EOS token at the end of the prompt.

Add EOS token to answer

Adds an EOS token at the end of the answer.

Mask prompt labels

Whether to mask the prompt labels during training and only train on the loss of the answer.
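
A minimal sketch of what prompt-label masking looks like in practice: prompt positions receive the ignore index (-100 in PyTorch cross-entropy), so only the answer tokens contribute to the loss. The token IDs below are made up for illustration.

  import torch

  prompt_ids = torch.tensor([101, 2054, 2003, 1996])   # tokens of the prompt
  answer_ids = torch.tensor([3437, 2944, 102])          # tokens of the answer

  input_ids = torch.cat([prompt_ids, answer_ids])
  labels = input_ids.clone()
  labels[: len(prompt_ids)] = -100                      # mask the prompt labels

  print(labels)  # tensor([-100, -100, -100, -100, 3437, 2944, 102])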

Tokenizer settings

Max length prompt

The maximum sequence length of the prompt to use during training. In case of chained samples, this max length refers to a single prompt length in the chain.

Max length answer

The maximum sequence length of the answer to use during training. In case of chained samples, this max length refers to a single answer length in the chain.

Max length

Defines the maximum length of the input sequence H2O LLM Studio uses during model training. In other words, this setting specifies the maximum number of tokens an input text is transformed into for model training.

A higher token count leads to higher memory usage, which slows down training, while increasing the probability of obtaining a higher accuracy value.

In case of Causal Language Modeling, this includes both prompt and answer, or all prompts and answers in case of chained samples.

In Sequence to Sequence Modeling, this refers to the length of the prompt, or the length of a full chained sample.

Add prompt answer tokens

Adds system, prompt and answer tokens as new tokens to the tokenizer. It is recommended to also set Force Embedding Gradients in this case.

Padding quantile

Defines the padding quantile H2O LLM Studio uses to select the maximum token length per batch. H2O LLM Studio performs padding of shorter sequences up to the specified padding quantile instead of the selected Max length. H2O LLM Studio truncates longer sequences.

  • Lowering the quantile can significantly speed up training and reduce memory usage when sequence lengths are unevenly distributed, but it can hurt performance
  • The setting depends on the batch size and should be adjusted accordingly
  • No padding is done in inference, and the selected Max Length is guaranteed
  • Setting to 0 disables padding
  • In case of distributed training, the quantile will be calculated across all GPUs
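
The sketch below illustrates the idea on a toy mini-batch: instead of padding every sample to Max length, the batch is padded to the chosen quantile of its own sequence lengths (and longer sequences are truncated). The numbers are made up, and the actual implementation in H2O LLM Studio may differ in detail.

  import numpy as np

  seq_lengths = np.array([35, 48, 52, 60, 512])   # token counts in one mini-batch
  padding_quantile = 0.75
  max_length = 256                                 # the configured Max length

  batch_len = int(np.quantile(seq_lengths, padding_quantile))
  batch_len = min(batch_len, max_length)           # never exceed Max length
  print(batch_len)                                 # sequences are padded/truncated to this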

Use fast

Whether or not to use a Fast tokenizer if possible. Some LLM backbones only offer certain types of tokenizers and changing this setting might be needed.

Architecture settings

Backbone Dtype

The datatype of the weights in the LLM backbone.

Gradient Checkpointing

Determines whether H2O LLM Studio activates gradient checkpointing (GC) when training the model. Enabling GC reduces the video random access memory (VRAM) footprint at the cost of a longer runtime (an additional forward pass).

Caution Gradient checkpointing is an experimental setting that is not compatible with all backbones or all other settings.

Activating GC comes at the cost of a longer training time; for that reason, try training without GC first and only activate when experiencing GPU out-of-memory (OOM) errors.
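
For reference, enabling gradient checkpointing on a Hugging Face backbone boils down to the transformers call shown below; H2O LLM Studio toggles this for you via the setting above, and the model name is only an example.

  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
  model.gradient_checkpointing_enable()   # recompute activations in backward: lower VRAM, longer runtime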

Force Embedding Gradients

Whether to force the computation of gradients for the input embeddings during training. Useful for LoRA.

Intermediate dropout

Defines the custom dropout rate H2O LLM Studio uses for intermediate layers in the transformer model.

Pretrained weights

Allows you to specify a local path to the pretrained weights.

Training settings

Loss function

Defines the loss function H2O LLM Studio utilizes during model training. The loss function is a differentiable function measuring the prediction error. The model utilizes gradients of the loss function to update the model weights during training. The options depend on the selected Problem Type.

Optimizer

Defines the algorithm or method (optimizer) to use for model training. The selected algorithm or method defines how the model should change the attributes of the neural network, such as weights and learning rate. Optimizers solve optimization problems and make more accurate updates to attributes to reduce learning losses.

Learning rate

Defines the learning rate H2O LLM Studio uses when training the model, specifically when updating the neural network's weights. The learning rate is the speed at which the model updates its weights after processing each mini-batch of data.

  • Learning rate is an important setting to tune as it balances under- and overfitting.
  • The number of epochs highly impacts the optimal value of the learning rate.

Use Flash Attention 2

If enabled, Flash Attention 2 will be used to compute the attention. Otherwise, the attention will be computed using the standard attention mechanism.

Flash Attention 2 is a new attention mechanism that is faster and more memory efficient than the standard attention mechanism. Only newer GPUs support this feature.

See https://arxiv.org/abs/2205.14135 for more details.

Batch size

Defines the number of training examples in a mini-batch used during an iteration of model training to estimate the error gradient before updating the model weights. Batch size defines the batch size used per single GPU.

During model training, the training data is packed into mini-batches of a fixed size.

Epochs

Defines the number of epochs to train the model. In other words, it specifies the number of times the learning algorithm goes through the entire training dataset.

  • The Epochs setting is an important setting to tune because it balances under- and overfitting.
  • The learning rate highly impacts the optimal value of the epochs.
  • H2O LLM Studio enables you to utilize a pre-trained model trained on zero epochs (where H2O LLM Studio does not train the model and the pretrained model (experiment) can be evaluated as-is).

Schedule

Defines the learning rate schedule H2O LLM Studio utilizes during model training. Specifying a learning rate schedule prevents the learning rate from staying the same. Instead, a learning rate schedule causes the learning rate to change over iterations, typically decreasing the learning rate to achieve a better model performance and training convergence.

Options

  • Constant
    • H2O LLM Studio applies a constant learning rate during the training process.
  • Cosine
    • H2O LLM Studio applies a cosine learning rate that follows the values of the cosine function.
  • Linear
    • H2O LLM Studio applies a linear learning rate that decreases the learning rate linearly.

Warmup epochs

Defines the number of epochs to warm up the learning rate where the learning rate should increase linearly from 0 to the desired learning rate. Can be a fraction of an epoch.
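
A small sketch of linear warmup, assuming a hypothetical helper that computes the learning rate per step: during the first warmup fraction of training, the learning rate rises linearly from 0 to the configured value.

  def warmup_lr(step: int, warmup_epochs: float, steps_per_epoch: int, base_lr: float) -> float:
      warmup_steps = int(warmup_epochs * steps_per_epoch)
      if warmup_steps > 0 and step < warmup_steps:
          return base_lr * (step + 1) / warmup_steps
      return base_lr

  # warmup over half an epoch with 10 steps per epoch
  print([round(warmup_lr(s, 0.5, 10, 1e-4), 6) for s in range(8)])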

Weight decay

Defines the weight decay that H2O LLM Studio uses for the optimizer during model training.

Weight decay is a regularization technique that adds the L2 norm of all model weights to the loss function, which can improve model generalization.

Gradient clip

Defines the maximum norm of the gradients during model training. Defaults to 0 (no clipping). When a value greater than 0 is specified, H2O LLM Studio clips the gradients during model training, using the specified value as an upper limit for the norm of the gradients, calculated using the Euclidean norm over all gradients per batch.

This setting can help model convergence when extreme gradient values cause high volatility of weight updates.
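
A minimal, self-contained PyTorch sketch of gradient-norm clipping: after the backward pass, the Euclidean norm over all gradients is capped at the configured value before the optimizer step. The tiny model and data are placeholders.

  import torch

  model = torch.nn.Linear(4, 1)
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
  max_grad_norm = 1.0                      # the Gradient clip value; 0 means no clipping

  x, y = torch.randn(8, 4), torch.randn(8, 1)
  loss = torch.nn.functional.mse_loss(model(x), y)
  loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
  optimizer.step()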

Grad accumulation

Defines the number of gradient accumulations before H2O LLM Studio updates the neural network weights during model training.

  • Grad accumulation can be beneficial if only small batches are selected for training. With gradient accumulation, the loss and gradients are calculated after each batch, but it waits for the selected accumulations before updating the model weights. You can control the batch size through the Batch size setting.
  • Changing the default value of Grad Accumulation might require adjusting the learning rate and batch size.
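
The loop below is a minimal sketch of the accumulation idea: gradients from several mini-batches are summed before a single optimizer step, emulating a larger effective batch size. The model and data are placeholders, not H2O LLM Studio internals.

  import torch

  model = torch.nn.Linear(4, 1)
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
  grad_accumulation = 4                                     # the Grad accumulation setting

  for step in range(16):
      x, y = torch.randn(8, 4), torch.randn(8, 1)           # one mini-batch (Batch size = 8)
      loss = torch.nn.functional.mse_loss(model(x), y)
      (loss / grad_accumulation).backward()                 # scale so accumulated gradients average out
      if (step + 1) % grad_accumulation == 0:
          optimizer.step()                                  # update weights every 4 mini-batches
          optimizer.zero_grad()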

Lora

Whether to use Low-Rank Adaptation (LoRA) during training.

Lora R

The dimension of the matrix decomposition used in LoRA.

Lora Alpha

The scaling factor for the LoRA weights.

Lora dropout

The probability of applying dropout to the LoRA weights during training.

Lora target modules

The modules in the model to apply the LoRA approximation to. Defaults to all linear layers.
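
As an illustration of how these four LoRA settings relate to each other, the sketch below maps them onto a PEFT configuration. This is not the exact code H2O LLM Studio runs, the backbone is only an example, and the target module names depend on the chosen backbone.

  from peft import LoraConfig, get_peft_model
  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
  lora_config = LoraConfig(
      r=4,                                   # Lora R: rank of the matrix decomposition
      lora_alpha=16,                         # Lora Alpha: scaling factor
      lora_dropout=0.05,                     # Lora dropout
      target_modules=["q_proj", "v_proj"],   # Lora target modules (backbone-specific)
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()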

Save checkpoint

Specifies how H2O LLM Studio should save the model checkpoints.

When set to Last, H2O LLM Studio always saves the last checkpoint; this is the recommended setting.

When set to Best, it saves the model weights for the epoch exhibiting the best validation metric.

  • This setting should be turned on with care as it has the potential to lead to overfitting of the validation data.
  • The default goal should be to attempt to tune models so that the last epoch is the best epoch.
  • If an evident decline in later epochs is observed in the logs, it is usually better to adjust hyperparameters, such as reducing the number of epochs or increasing regularization, instead of turning this setting on.

When set to Disable, no checkpoint is saved at all. This can be useful for debugging and experimenting in order to save disk space, but it disables certain functionalities such as chatting or pushing the model to Hugging Face.

Evaluation epochs

Defines the number of epochs H2O LLM Studio uses before each validation loop for model training. In other words, it determines the frequency (in a number of epochs) to run the model evaluation on the validation data.

  • Increasing the number of Evaluation Epochs can speed up an experiment.
  • The Evaluation epochs setting is available only if Save checkpoint is not set to Best.
  • Can be a fraction of an epoch

Evaluate before training

This option lets you evaluate the model before training, which can help you judge the quality of the LLM backbone before fine-tuning.

Train validation data

Defines whether the model should use the entire train and validation dataset during model training. When turned On, H2O LLM Studio uses the whole train dataset and validation data to train the model.

  • H2O LLM Studio still evaluates the model on the provided validation fold; validation is always performed only on this fold.
  • H2O LLM Studio uses both datasets for model training if you provide a train and validation dataset.
    • To define a training dataset, use the Train dataframe setting.
    • To define a validation dataset, use the Validation dataframe setting.
  • Turning On the Train validation data setting should produce a model that you can expect to perform better because H2O LLM Studio trained the model on more data. However, note that using the entire train dataset and validation dataset generally causes the model's accuracy to be overstated, as information from the validation data is incorporated into the model during the training process.

Use RLHF

Toggle to enable Reinforcement Learning with Human Feedback.

Reward model

The Reward Model option gives control over the model weights used to score the active LLM during RLHF training.

  • Any suitable Hugging Face model can be used here (not limited to the ones in the dropdown list)

Adaptive KL control

Use adaptive KL control, otherwise linear.

Initial KL coefficient

Initial KL penalty coefficient (used for adaptive and linear control).

KL target

Target KL value for adaptive KL control.

KL Horizon

Horizon for adaptive KL control.

Advantages gamma

Gamma parameter for advantage calculation.

Advantages Lambda

Lambda parameter for advantage calculation.

PPO clip policy

Range for clipping in PPO policy gradient loss.

PPO clip value

Range for clipping values in loss calculation.

Scaling factor value loss

Scaling factor for value loss.

PPO epochs

Number of optimization epochs per batch of samples.

PPO Batch Size

Number of samples optimized inside PPO together.

PPO generate temperature

This is the temperature that is used in the generate function during the PPO Rollout.

Offload reward model

When enabled, this will offload the reward model weights to CPU when not in use. This can be useful when training on a GPU with limited memory. The weights will be moved back to the GPU when needed.

Augmentation settings

Token mask probability

Defines the probability of input text tokens being randomly masked during training.

  • Increasing this setting can be helpful to avoid overfitting and apply regularization
  • Each token is randomly replaced by a masking token based on the specified probability
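
A small sketch of the idea on a toy sequence: each token is replaced by a mask token with the given probability. The token IDs and the mask token ID are made up, and the actual masking logic in H2O LLM Studio may differ.

  import torch

  input_ids = torch.tensor([101, 2023, 2003, 1037, 7099, 102])
  token_mask_probability = 0.1
  mask_token_id = 0                                         # illustrative mask token ID

  mask = torch.rand(input_ids.shape) < token_mask_probability
  augmented_ids = torch.where(mask, torch.tensor(mask_token_id), input_ids)
  print(augmented_ids)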

Skip parent probability

If a Parent ID column is set, this augmentation randomly skips the parent concatenation during training at each parent with the specified probability.

Random parent probability

During training, each sample is concatenated with another random sample with the specified probability, simulating unrelated chained conversations. Can be used without a Parent ID column.

Neftune noise alpha

Adds noise to the input embeddings, as proposed by https://arxiv.org/abs/2310.05914 (NEFTune: Noisy Embeddings Improve Instruction Finetuning).

Prediction settings

Metric

Defines the metric to evaluate the model's performance.

We provide several metric options for evaluating the performance of your model. In addition to the BLEU and the Perplexity score, we offer GPT metrics that utilize the OpenAI API to determine whether the predicted answer is more favorable than the ground truth answer. To use these metrics, you can either export your OpenAI API key as an environment variable before starting LLM Studio, or you can specify it in the Settings Menu within the UI.

Metric GPT model

Defines the OpenAI model endpoint for the GPT metric.

Metric GPT template

The template to use for GPT-based evaluation. Note that for mt-bench, the validation dataset will be replaced accordingly; to approximate the original implementation as closely as possible, we suggest using gpt-4-0613 as the GPT judge model and 1024 for the max length inference.

Min length inference

Defines the min length value H2O LLM Studio uses for the generated text.

  • This setting impacts the evaluation metrics and should depend on the dataset and average output sequence length that is expected to be predicted.

Max length inference

Defines the max length value H2O LLM Studio uses for the generated text.

  • Similar to the Max Length setting in the tokenizer settings section, this setting specifies the maximum number of tokens to predict for a given prediction sample.
  • This setting impacts the evaluation metrics and should depend on the dataset and average output sequence length that is expected to be predicted.

Batch size inference

Defines the size of the mini-batch used during an iteration of inference. Batch size defines the batch size used per GPU.

Do sample

Determines whether to sample from the next token distribution instead of choosing the token with the highest probability. If turned On, the next token in a predicted sequence is sampled based on the probabilities. If turned Off, the highest probability is always chosen.

Num beams

Defines the number of beams to use for beam search. The default value is 1 (a single beam), which means no beam search is performed.

A higher Num Beams value can increase prediction runtime while potentially improving accuracy.

Temperature

Defines the temperature to use for sampling from the next token distribution during validation and inference. In other words, the defined temperature controls the randomness of predictions by scaling the logits before applying softmax. A higher temperature makes the distribution more random.
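
The toy example below shows the effect of temperature scaling: logits are divided by the temperature before softmax, so higher temperatures flatten the next-token distribution and lower temperatures sharpen it. The logits are made up for illustration.

  import torch

  logits = torch.tensor([2.0, 1.0, 0.1])           # made-up next-token logits
  for temperature in (0.5, 1.0, 2.0):
      probs = torch.softmax(logits / temperature, dim=-1)
      print(temperature, probs.tolist())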

Repetition penalty

The parameter for repetition penalty. 1.0 means no penalty. See https://arxiv.org/pdf/1909.05858.pdf for more details.

Stop tokens

Generation stops at the occurrence of any of these additional tokens; multiple tokens should be separated by a comma (,).

Top K

If > 0, only keep the top k tokens with the highest probability (top-k filtering).

Top P

If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering).
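
The sketch below illustrates top-k and top-p (nucleus) filtering on a toy distribution; the actual generation in H2O LLM Studio relies on the transformers library, and the logits here are made up.

  import torch

  logits = torch.tensor([2.0, 1.5, 0.2, -1.0, -3.0])
  probs = torch.softmax(logits, dim=-1)

  # Top K: keep only the k highest-probability tokens
  top_k = 2
  kth_value = torch.topk(probs, top_k).values[-1]
  top_k_mask = probs >= kth_value

  # Top P: keep the smallest set of tokens whose cumulative probability covers top_p
  top_p = 0.9
  sorted_probs, sorted_idx = torch.sort(probs, descending=True)
  cumulative = torch.cumsum(sorted_probs, dim=-1)
  keep_sorted = (cumulative - sorted_probs) < top_p     # keep tokens until the nucleus is covered
  top_p_mask = torch.zeros_like(probs, dtype=torch.bool)
  top_p_mask[sorted_idx[keep_sorted]] = True

  print(top_k_mask.tolist(), top_p_mask.tolist())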

Environment settings

GPUs

Determines the list of GPUs H2O LLM Studio can use for the experiment. GPUs are listed by name, referring to their system ID (starting from 1).

Mixed precision

Determines whether to use mixed-precision. When turned Off, H2O LLM Studio does not use mixed-precision.

Mixed-precision is a technique that helps decrease memory consumption and increases training speed.

Compile model

Compiles the model with Torch. Experimental!

Find unused parameters

In Distributed Data Parallel (DDP) mode, prepare_for_backward() is called at the end of the DDP forward pass. It traverses the autograd graph to find unused parameters when find_unused_parameters is set to True in the DDP constructor.

Note that traversing the autograd graph introduces extra overhead, so applications should only set this to True when necessary.

Trust remote code

Trust remote code. This can be necessary for some models that use code which is not (yet) part of the transformers package. You should always first try running with this option switched Off.

Huggingface branch

The Huggingface branch setting defines which branch to use in a Hugging Face repository. The default value is "main".

Number of workers

Defines the number of workers H2O LLM Studio uses for the DataLoader. In other words, it defines the number of CPU processes to use when reading and loading data to GPUs during model training.

Seed

Defines the random seed value that H2O LLM Studio uses during model training. It defaults to -1, an arbitrary value. When the value is modified (not -1), the random seed allows results to be reproducible—defining a seed aids in obtaining predictable and repeatable results every time. Otherwise, not modifying the default seed value (-1) leads to random numbers at every invocation.

Logging settings

Logger

Defines the logger type that H2O LLM Studio uses for model training.

Options

  • None
    • H2O LLM Studio does not use any logger.
  • Neptune
    • H2O LLM Studio uses Neptune as a logger to track the experiment. To use Neptune, you must specify a Neptune API token and a Neptune project.

Neptune project

Defines the Neptune project to access if you selected Neptune in the Logger setting.

