Experiment settings

The settings for creating an experiment are grouped into sections. The settings under each section are listed and described below.

General settings

Dataset

Defines the dataset for the experiment.

Problem type

Defines the problem type of the experiment, which also defines the settings H2O LLM Studio displays for the experiment.

  • Causal Language Modeling: Used to fine-tune large language models

  • DPO Modeling: Used to fine-tune large language models using Direct Preference Optimization

  • Sequence To Sequence Modeling: Used to fine-tune large sequence to sequence models

  • Causal Classification Modeling: Used to fine-tune causal classification models

Import config from YAML

Specifies a .yml file that defines the experiment settings.

  • H2O LLM Studio supports a .yml file import and export functionality. You can download the config settings of finished experiments, make changes, and re-upload them when starting a new experiment in any instance of H2O LLM Studio.
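
Below is a minimal sketch of how a downloaded config could be edited with Python before re-uploading it. It assumes PyYAML is installed; the key path shown in the comment (training / learning_rate) is only illustrative, as the exact structure of the exported cfg.yaml depends on the problem type.

  # Edit a downloaded experiment config before re-uploading it.
  import yaml

  with open("cfg.yaml") as f:           # config exported from a finished experiment
      cfg = yaml.safe_load(f)

  print(cfg.keys())                     # inspect the available setting groups

  # Example change (hypothetical key path, check your exported file):
  # cfg["training"]["learning_rate"] = 1e-4

  with open("cfg_modified.yaml", "w") as f:
      yaml.safe_dump(cfg, f)            # upload this file when starting a new experiment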

Experiment name

Defines the name of the experiment.

LLM backbone

The LLM Backbone option is the most important setting, as it defines the pretrained model weights the experiment starts from.

  • Usually, it is good to use smaller architectures for quicker experiments and larger models when aiming for the highest accuracy
  • If possible, leverage backbones pre-trained closely to your use case
  • Any Hugging Face model can be used here (not limited to the ones in the dropdown list)

Dataset settings

Train dataframe

Defines a .csv or .pq file containing a dataframe with training records that H2O LLM Studio uses to train the model.

  • The records are combined into mini-batches when training the model.

Validation strategy

Specifies the validation strategy H2O LLM Studio uses for the experiment.

To properly assess the performance of your trained models, it is common practice to evaluate them on separate holdout data that the model has not seen during training. H2O LLM Studio allows you to specify different strategies for this task to fit your needs.

Options

  • Custom holdout validation
    • Specifies a separate holdout dataframe.
  • Automatic holdout validation
    • Lets you specify a holdout validation sample size that is automatically generated.

Validation size

Defines an optional relative size of the holdout validation set. H2O LLM Studio automatically samples the selected percentage from the full training data and builds a holdout dataset that the model is validated on.

Data sample

Defines the percentage of the data to use for the experiment. The default percentage is 100% (1).

Changing the default value can significantly increase the training speed. Still, it might lead to a substantially lower accuracy value. Using 100% (1) of the data for final models is highly recommended.

System column

The column in the dataset containing the system input which is always prepended for a full sample.

Prompt column

The column in the dataset containing the user prompt.

Answer column

The column in the dataset containing the expected output.

For classification, this needs to be an integer column containing the class label.

Parent ID column

An optional column specifying the parent ID used for chained conversations. The values in this column need to match values in an additional column named id. If provided, the prompt is concatenated after the preceding parent rows, as sketched below.
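
The following rough sketch shows how such a parent chain can be resolved from a dataframe; the column names (id, parent_id, prompt, answer) and the concatenation format are illustrative only, and the logic H2O LLM Studio uses internally may differ.

  import pandas as pd

  df = pd.DataFrame(
      {
          "id": ["a", "b"],
          "parent_id": [None, "a"],          # row "b" continues conversation "a"
          "prompt": ["What is MNIST?", "How many classes does it have?"],
          "answer": ["A handwritten digit dataset.", "Ten."],
      }
  ).set_index("id")

  def build_chain(row_id: str) -> str:
      """Concatenate all parent prompt/answer pairs before the current prompt."""
      row = df.loc[row_id]
      text = f"{row['prompt']} {row['answer']}"
      parent = row["parent_id"]
      if parent is None or pd.isna(parent):
          return text
      return f"{build_chain(parent)} {text}"

  print(build_chain("b"))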

Text prompt start

Optional text to prepend to each prompt.

Text answer separator

Optional text to append to each prompt / prepend to each answer.
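
As a rough illustration of how these two settings combine with the dataset columns, the sketch below assembles a full text sample; the default token strings and the exact template are assumptions, not the authoritative format H2O LLM Studio uses internally.

  def build_sample(system: str, prompt: str, answer: str,
                   text_prompt_start: str = "<|prompt|>",
                   text_answer_separator: str = "<|answer|>") -> str:
      # the system input is always prepended, then the prompt, then the answer
      return f"{system}{text_prompt_start}{prompt}{text_answer_separator}{answer}"

  print(build_sample("You are a helpful assistant.",
                     "Summarize the text.",
                     "Here is a summary..."))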

Add EOS token to prompt

Adds an EOS token at the end of the prompt.

Add EOS token to answer

Adds an EOS token at the end of the answer.

Mask prompt labels

Whether to mask the prompt labels during training and only train on the loss of the answer.
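
A minimal sketch of what prompt-label masking looks like in practice: prompt positions receive the ignore index (-100 in PyTorch cross-entropy), so only the answer tokens contribute to the loss. The token IDs below are made up for illustration.

  import torch

  prompt_ids = torch.tensor([101, 2054, 2003, 1996])   # tokens of the prompt
  answer_ids = torch.tensor([3437, 2944, 102])          # tokens of the answer

  input_ids = torch.cat([prompt_ids, answer_ids])
  labels = input_ids.clone()
  labels[: len(prompt_ids)] = -100                      # mask the prompt labels

  print(labels)  # tensor([-100, -100, -100, -100, 3437, 2944, 102])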

Tokenizer settings

Max length prompt

The maximum sequence length of the prompt to use during training. In case of chained samples, this max length refers to a single prompt length in the chain.

Max length answer

The maximum sequence length of the answer to use during training. In case of chained samples, this max length refers to a single answer length in the chain.

Max length

Defines the maximum length of the input sequence H2O LLM Studio uses during model training. In other words, this setting specifies the maximum number of tokens an input text is transformed into for model training.

A higher token count leads to higher memory usage, which slows down training, while increasing the probability of obtaining a higher accuracy value.

In case of Causal Language Modeling, this includes both prompt and answer, or all prompts and answers in case of chained samples.

In Sequence to Sequence Modeling, this refers to the length of the prompt, or the length of a full chained sample.

Add prompt answer tokens

Adds system, prompt and answer tokens as new tokens to the tokenizer. It is recommended to also set Force Embedding Gradients in this case.

Padding quantile

Defines the padding quantile H2O LLM Studio uses to select the maximum token length per batch. H2O LLM Studio performs padding of shorter sequences up to the specified padding quantile instead of the selected Max length. H2O LLM Studio truncates longer sequences.

  • Lowering the quantile can significantly speed up training and reduce memory usage when sequence lengths are unevenly distributed, but it can hurt performance
  • The setting depends on the batch size and should be adjusted accordingly
  • No padding is done in inference, and the selected Max Length is guaranteed
  • Setting to 0 disables padding
  • In case of distributed training, the quantile will be calculated across all GPUs
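
The sketch below illustrates the idea on a toy mini-batch: instead of padding every sample to Max length, the batch is padded to the chosen quantile of its own sequence lengths (and longer sequences are truncated). The numbers are made up, and the actual implementation in H2O LLM Studio may differ in detail.

  import numpy as np

  seq_lengths = np.array([35, 48, 52, 60, 512])   # token counts in one mini-batch
  padding_quantile = 0.75
  max_length = 256                                 # the configured Max length

  batch_len = int(np.quantile(seq_lengths, padding_quantile))
  batch_len = min(batch_len, max_length)           # never exceed Max length
  print(batch_len)                                 # sequences are padded/truncated to this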

Use fast

Whether or not to use a Fast tokenizer if possible. Some LLM backbones only offer certain types of tokenizers and changing this setting might be needed.

Architecture settings

Backbone Dtype

The datatype of the weights in the LLM backbone.

Gradient Checkpointing

Determines whether H2O LLM Studio activates gradient checkpointing (GC) when training the model. Enabling GC reduces the video random access memory (VRAM) footprint at the cost of a longer runtime (an additional forward pass).

Caution Gradient checkpointing is an experimental setting that is not compatible with all backbones or all other settings.

Activating GC comes at the cost of a longer training time; for that reason, try training without GC first and only activate when experiencing GPU out-of-memory (OOM) errors.
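
For reference, enabling gradient checkpointing on a Hugging Face backbone boils down to the transformers call shown below; H2O LLM Studio toggles this for you via the setting above, and the model name is only an example.

  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
  model.gradient_checkpointing_enable()   # recompute activations in backward: lower VRAM, longer runtime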

Force Embedding Gradients

Whether to force the computation of gradients for the input embeddings during training. Useful for LoRA.

Intermediate dropout

Defines the custom dropout rate H2O LLM Studio uses for intermediate layers in the transformer model.

Pretrained weights

Allows you to specify a local path to the pretrained weights.

Training settings

Loss function

Defines the loss function H2O LLM Studio utilizes during model training. The loss function is a differentiable function measuring the prediction error. The model utilizes gradients of the loss function to update the model weights during training. The options depend on the selected Problem Type.

Optimizer

Defines the algorithm or method (optimizer) to use for model training. The selected algorithm or method defines how the model should change the attributes of the neural network, such as weights and learning rate. Optimizers solve optimization problems and make more accurate updates to attributes to reduce learning losses.

Learning rate

Defines the learning rate H2O LLM Studio uses when training the model, specifically when updating the neural network's weights. The learning rate is the speed at which the model updates its weights after processing each mini-batch of data.

  • Learning rate is an important setting to tune as it balances under- and overfitting.
  • The number of epochs highly impacts the optimal value of the learning rate.

Use Flash Attention 2

If enabled, Flash Attention 2 will be used to compute the attention. Otherwise, the attention will be computed using the standard attention mechanism.

Flash Attention 2 is a new attention mechanism that is faster and more memory efficient than the standard attention mechanism. Only newer GPUs support this feature.

See https://arxiv.org/abs/2205.14135 for more details.

Batch size

Defines the number of training examples in a mini-batch used during an iteration of model training to estimate the error gradient before updating the model weights. Batch size defines the batch size used per single GPU.

During model training, the training data is packed into mini-batches of a fixed size.

Epochs

Defines the number of epochs to train the model. In other words, it specifies the number of times the learning algorithm goes through the entire training dataset.

  • The Epochs setting is an important setting to tune because it balances under- and overfitting.
  • The learning rate highly impacts the optimal value of the epochs.
  • H2O LLM Studio enables you to utilize a pre-trained model trained on zero epochs (where H2O LLM Studio does not train the model and the pretrained model (experiment) can be evaluated as-is).

Schedule

Defines the learning rate schedule H2O LLM Studio utilizes during model training. Specifying a learning rate schedule prevents the learning rate from staying the same. Instead, a learning rate schedule causes the learning rate to change over iterations, typically decreasing the learning rate to achieve a better model performance and training convergence.

Options

  • Constant
    • H2O LLM Studio applies a constant learning rate during the training process.
  • Cosine
    • H2O LLM Studio applies a cosine learning rate that follows the values of the cosine function.
  • Linear
    • H2O LLM Studio applies a linear learning rate that decreases the learning rate linearly.

Warmup epochs

Defines the number of epochs to warm up the learning rate where the learning rate should increase linearly from 0 to the desired learning rate. Can be a fraction of an epoch.
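
A small sketch of linear warmup, assuming a hypothetical helper that computes the learning rate per step: during the first warmup fraction of training, the learning rate rises linearly from 0 to the configured value.

  def warmup_lr(step: int, warmup_epochs: float, steps_per_epoch: int, base_lr: float) -> float:
      warmup_steps = int(warmup_epochs * steps_per_epoch)
      if warmup_steps > 0 and step < warmup_steps:
          return base_lr * (step + 1) / warmup_steps
      return base_lr

  # warmup over half an epoch with 10 steps per epoch
  print([round(warmup_lr(s, 0.5, 10, 1e-4), 6) for s in range(8)])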

Weight decay

Defines the weight decay that H2O LLM Studio uses for the optimizer during model training.

Weight decay is a regularization technique that adds the L2 norm of all model weights to the loss function, which can improve model generalization.

Gradient clip

Defines the maximum norm of the gradients during model training. Defaults to 0 (no clipping). When a value greater than 0 is specified, H2O LLM Studio clips the gradients during model training, using the specified value as an upper limit for the norm of the gradients, calculated using the Euclidean norm over all gradients per batch.

This setting can help model convergence when extreme gradient values cause high volatility of weight updates.
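
A minimal, self-contained PyTorch sketch of gradient-norm clipping: after the backward pass, the Euclidean norm over all gradients is capped at the configured value before the optimizer step. The tiny model and data are placeholders.

  import torch

  model = torch.nn.Linear(4, 1)
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
  max_grad_norm = 1.0                      # the Gradient clip value; 0 means no clipping

  x, y = torch.randn(8, 4), torch.randn(8, 1)
  loss = torch.nn.functional.mse_loss(model(x), y)
  loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
  optimizer.step()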

Grad accumulation

Defines the number of gradient accumulations before H2O LLM Studio updates the neural network weights during model training.

  • Grad accumulation can be beneficial if only small batches are selected for training. With gradient accumulation, the loss and gradients are calculated after each batch, but it waits for the selected accumulations before updating the model weights. You can control the batch size through the Batch size setting.
  • Changing the default value of Grad Accumulation might require adjusting the learning rate and batch size.
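
The loop below is a minimal sketch of the accumulation idea: gradients from several mini-batches are summed before a single optimizer step, emulating a larger effective batch size. The model and data are placeholders, not H2O LLM Studio internals.

  import torch

  model = torch.nn.Linear(4, 1)
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
  grad_accumulation = 4                                     # the Grad accumulation setting

  for step in range(16):
      x, y = torch.randn(8, 4), torch.randn(8, 1)           # one mini-batch (Batch size = 8)
      loss = torch.nn.functional.mse_loss(model(x), y)
      (loss / grad_accumulation).backward()                 # scale so accumulated gradients average out
      if (step + 1) % grad_accumulation == 0:
          optimizer.step()                                  # update weights every 4 mini-batches
          optimizer.zero_grad()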

Lora

Whether to use Low-Rank Adaptation (LoRA) during training.

Lora R

The dimension of the matrix decomposition used in LoRA.

Lora Alpha

The scaling factor for the LoRA weights.

Lora dropout

The probability of applying dropout to the LoRA weights during training.

Lora target modules

The modules in the model to apply the LoRA approximation to. Defaults to all linear layers.
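
As an illustration of how these four LoRA settings relate to each other, the sketch below maps them onto a PEFT configuration. This is not the exact code H2O LLM Studio runs, the backbone is only an example, and the target module names depend on the chosen backbone.

  from peft import LoraConfig, get_peft_model
  from transformers import AutoModelForCausalLM

  model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
  lora_config = LoraConfig(
      r=4,                                   # Lora R: rank of the matrix decomposition
      lora_alpha=16,                         # Lora Alpha: scaling factor
      lora_dropout=0.05,                     # Lora dropout
      target_modules=["q_proj", "v_proj"],   # Lora target modules (backbone-specific)
      task_type="CAUSAL_LM",
  )
  model = get_peft_model(model, lora_config)
  model.print_trainable_parameters()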

Save checkpoint

Specifies how H2O LLM Studio should save the model checkpoints.

When set to Last, H2O LLM Studio always saves the last checkpoint; this is the recommended setting.

When set to Best, it saves the model weights for the epoch exhibiting the best validation metric.

  • This setting should be turned on with care as it has the potential to lead to overfitting of the validation data.
  • The default goal should be to attempt to tune models so that the last epoch is the best epoch.
  • If an evident decline in later epochs is observed in the logs, it is usually better to adjust hyperparameters, such as reducing the number of epochs or increasing regularization, instead of turning this setting on.

When set to Disable, no checkpoint is saved at all. This can be useful for debugging and experimenting in order to save disk space, but it disables certain functionalities such as chatting or pushing the model to Hugging Face.

Evaluation epochs

Defines the number of epochs H2O LLM Studio uses before each validation loop for model training. In other words, it determines the frequency (in a number of epochs) to run the model evaluation on the validation data.

  • Increasing the number of Evaluation Epochs can speed up an experiment.
  • The Evaluation epochs setting is available only if Save checkpoint is not set to Best.
  • Can be a fraction of an epoch

Evaluate before training

This option lets you evaluate the model before training, which can help you judge the quality of the LLM backbone before fine-tuning.

Train validation data

Defines whether the model should use the entire train and validation dataset during model training. When turned On, H2O LLM Studio uses the whole train dataset and validation data to train the model.

  • H2O LLM Studio still evaluates the model on the provided validation fold; validation is always performed only on this fold.
  • H2O LLM Studio uses both datasets for model training if you provide a train and validation dataset.
    • To define a training dataset, use the Train dataframe setting.
    • To define a validation dataset, use the Validation dataframe setting.
  • Turning On the Train validation data setting should produce a model that you can expect to perform better because H2O LLM Studio trained the model on more data. However, note that using the entire train dataset and validation dataset generally causes the model's accuracy to be overstated, as information from the validation data is incorporated into the model during the training process.

Use RLHF

Toggle to enable Reinforcement Learning with Human Feedback.

Reward model

The Reward Model option gives control over the model weights used to score the active LLM during RLHF training.

  • Any suitable Hugging Face model can be used here (not limited to the ones in the dropdown list)

Adaptive KL control

Use adaptive KL control, otherwise linear.

Initial KL coefficient

Initial KL penalty coefficient (used for adaptive and linear control).

KL target

Target KL value for adaptive KL control.

KL Horizon

Horizon for adaptive KL control.

Advantages gamma

Gamma parameter for advantage calculation.

Advantages Lambda

Lambda parameter for advantage calculation.

PPO clip policy

Range for clipping in PPO policy gradient loss.

PPO clip value

Range for clipping values in loss calculation.

Scaling factor value loss

Scaling factor for value loss.

PPO epochs

Number of optimization epochs per batch of samples.

PPO Batch Size

Number of samples optimized inside PPO together.

PPO generate temperature

This is the temperature that is used in the generate function during the PPO Rollout.

Offload reward model

When enabled, this will offload the reward model weights to CPU when not in use. This can be useful when training on a GPU with limited memory. The weights will be moved back to the GPU when needed.

Augmentation settings

Token mask probability

Defines the probability of input text tokens being randomly masked during training.

  • Increasing this setting can be helpful to avoid overfitting and apply regularization
  • Each token is randomly replaced by a masking token based on the specified probability
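
A small sketch of the idea on a toy sequence: each token is replaced by a mask token with the given probability. The token IDs and the mask token ID are made up, and the actual masking logic in H2O LLM Studio may differ.

  import torch

  input_ids = torch.tensor([101, 2023, 2003, 1037, 7099, 102])
  token_mask_probability = 0.1
  mask_token_id = 0                                         # illustrative mask token ID

  mask = torch.rand(input_ids.shape) < token_mask_probability
  augmented_ids = torch.where(mask, torch.tensor(mask_token_id), input_ids)
  print(augmented_ids)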

Skip parent probability

If a Parent ID column is set, this augmentation randomly skips the parent concatenation during training at each parent with the specified probability.

Random parent probability

During training, each sample is concatenated with another random sample with the specified probability, simulating unrelated chained conversations. Can be used without a Parent ID column.

Neftune noise alpha

Adds noise to the input embeddings, as proposed by https://arxiv.org/abs/2310.05914 (NEFTune: Noisy Embeddings Improve Instruction Finetuning).

Prediction settings

Metric

Defines the metric to evaluate the model's performance.

We provide several metric options for evaluating the performance of your model. In addition to the BLEU and the Perplexity score, we offer GPT metrics that utilize the OpenAI API to determine whether the predicted answer is more favorable than the ground truth answer. To use these metrics, you can either export your OpenAI API key as an environment variable before starting LLM Studio, or you can specify it in the Settings Menu within the UI.

Metric GPT model

Defines the OpenAI model endpoint for the GPT metric.

Metric GPT template

The template to use for GPT-based evaluation. Note that for mt-bench, the validation dataset will be replaced accordingly; to approximate the original implementation as closely as possible, we suggest using gpt-4-0613 as the GPT judge model and 1024 for the max length inference.

Min length inference

Defines the min length value H2O LLM Studio uses for the generated text.

  • This setting impacts the evaluation metrics and should depend on the dataset and average output sequence length that is expected to be predicted.

Max length inference

Defines the max length value H2O LLM Studio uses for the generated text.

  • Similar to the Max Length setting in the tokenizer settings section, this setting specifies the maximum number of tokens to predict for a given prediction sample.
  • This setting impacts the evaluation metrics and should depend on the dataset and average output sequence length that is expected to be predicted.

Batch size inference

Defines the size of the mini-batch used during an iteration of inference. Batch size defines the batch size used per GPU.

Do sample

Determines whether to sample from the next token distribution instead of choosing the token with the highest probability. If turned On, the next token in a predicted sequence is sampled based on the probabilities. If turned Off, the highest probability is always chosen.

Num beams

Defines the number of beams to use for beam search. The default value is 1 (a single beam), which means no beam search is performed.

A higher Num Beams value can increase prediction runtime while potentially improving accuracy.

Temperature

Defines the temperature to use for sampling from the next token distribution during validation and inference. In other words, the defined temperature controls the randomness of predictions by scaling the logits before applying softmax. A higher temperature makes the distribution more random.
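
The toy example below shows the effect of temperature scaling: logits are divided by the temperature before softmax, so higher temperatures flatten the next-token distribution and lower temperatures sharpen it. The logits are made up for illustration.

  import torch

  logits = torch.tensor([2.0, 1.0, 0.1])           # made-up next-token logits
  for temperature in (0.5, 1.0, 2.0):
      probs = torch.softmax(logits / temperature, dim=-1)
      print(temperature, probs.tolist())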

Repetition penalty

The parameter for repetition penalty. 1.0 means no penalty. See https://arxiv.org/pdf/1909.05858.pdf for more details.

Stop tokens

Generation stops at the occurrence of any of these additional tokens; multiple tokens should be separated by a comma (,).

Top K

If > 0, only keep the top k tokens with the highest probability (top-k filtering).

Top P

If < 1.0, only keep the top tokens with cumulative probability >= top_p (nucleus filtering).
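
The sketch below illustrates top-k and top-p (nucleus) filtering on a toy distribution; the actual generation in H2O LLM Studio relies on the transformers library, and the logits here are made up.

  import torch

  logits = torch.tensor([2.0, 1.5, 0.2, -1.0, -3.0])
  probs = torch.softmax(logits, dim=-1)

  # Top K: keep only the k highest-probability tokens
  top_k = 2
  kth_value = torch.topk(probs, top_k).values[-1]
  top_k_mask = probs >= kth_value

  # Top P: keep the smallest set of tokens whose cumulative probability covers top_p
  top_p = 0.9
  sorted_probs, sorted_idx = torch.sort(probs, descending=True)
  cumulative = torch.cumsum(sorted_probs, dim=-1)
  keep_sorted = (cumulative - sorted_probs) < top_p     # keep tokens until the nucleus is covered
  top_p_mask = torch.zeros_like(probs, dtype=torch.bool)
  top_p_mask[sorted_idx[keep_sorted]] = True

  print(top_k_mask.tolist(), top_p_mask.tolist())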

Environment settings

GPUs

Determines the list of GPUs H2O LLM Studio can use for the experiment. GPUs are listed by name, referring to their system ID (starting from 1).

Mixed precision

Determines whether to use mixed-precision. When turned Off, H2O LLM Studio does not use mixed-precision.

Mixed-precision is a technique that helps decrease memory consumption and increases training speed.

Compile model

Compiles the model with Torch. Experimental!

Find unused parameters

In Distributed Data Parallel (DDP) mode, prepare_for_backward() is called at the end of the DDP forward pass. It traverses the autograd graph to find unused parameters when find_unused_parameters is set to True in the DDP constructor.

Note that traversing the autograd graph introduces extra overhead, so applications should only set this to True when necessary.

Trust remote code

Trust remote code. This can be necessary for some models that use code which is not (yet) part of the transformers package. You should always first try running with this option switched Off.

Huggingface branch

The Huggingface branch setting defines which branch to use in a Hugging Face repository. The default value is "main".

Number of workers

Defines the number of workers H2O LLM Studio uses for the DataLoader. In other words, it defines the number of CPU processes to use when reading and loading data to GPUs during model training.

Seed

Defines the random seed value that H2O LLM Studio uses during model training. It defaults to -1, an arbitrary value. When the value is modified (not -1), the random seed allows results to be reproducible—defining a seed aids in obtaining predictable and repeatable results every time. Otherwise, not modifying the default seed value (-1) leads to random numbers at every invocation.

Logging settings

Logger

Defines the logger type that H2O LLM Studio uses for model training.

Options

  • None
    • H2O LLM Studio does not use any logger.
  • Neptune
    • H2O LLM Studio uses Neptune as a logger to track the experiment. To use Neptune, you must specify a Neptune API token and a Neptune project.

Neptune project

Defines the Neptune project to access if you selected Neptune in the Logger setting.

