Learning rate
The learning rate is a scalar that determines the step size at each iteration while moving toward a minimum of the loss function. A larger learning rate makes fine-tuning faster, but this must be balanced against the risk of overshooting the optimal solution or destabilizing training.
The appropriate value depends on the optimizer you choose. The default for the popular Adam optimizer is 0.001.
Learning rates are often written in scientific notation, such as 1e-5 or 5e-4.
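As a minimal sketch in PyTorch (the Linear layer stands in for a real model), the learning rate is passed directly to the optimizer:

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for a real model

# Adam defaults to lr=0.001 (1e-3); fine-tuning often uses a smaller
# value such as 1e-5 or 5e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```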
Learning rate (LR) scheduler
Instead of manually adjusting the learning rate, you can choose to use a learning rate (LR) scheduler that dynamically adjusts the learning rate during fine-tuning.
Linear decay is a common choice: the learning rate is reduced steadily over training to facilitate smoother model convergence. Cosine annealing instead follows a cosine curve, starting with a high learning rate that decreases to a minimum; in the warm-restart variant, the rate then jumps back up. This simulates a restart of the learning process while reusing the current (good) weights as the starting point, known as a “warm restart.” It differs from a “cold restart,” where a new set of small random numbers is used as the starting point.
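Both schedules are available as built-in PyTorch schedulers. Here is a rough sketch (the model and step counts are placeholders; in practice you would pick one scheduler):

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for a real model
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

# Option 1 -- linear decay: the LR shrinks steadily toward (near) zero.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.01, total_iters=1_000
)

# Option 2 -- cosine annealing with warm restarts: the LR follows a
# cosine curve down to eta_min, then "warm restarts" back to the
# initial LR every T_0 steps, reusing the current weights.
# scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
#     optimizer, T_0=1_000, eta_min=1e-6
# )

for step in range(1_000):
    ...  # forward pass, loss.backward(), optimizer.step()
    scheduler.step()  # advance the schedule once per training step
```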
Optimizer
The optimizer is the algorithm responsible for adjusting a model’s parameters to minimize the loss function.
AdamW is the most popular optimizer for deep learning training and fine-tuning thanks to its simplicity, efficiency, and robustness. It modifies the Adam optimizer by decoupling weight decay from the gradient update, which helps control overfitting and improves model generalization.
AdamW comes in 32-bit, 8-bit, and paged versions. The 32-bit version is resource-intensive, requiring additional memory for optimizer states. AdamW 8-bit performs comparably while using noticeably less GPU memory, making it a recommended choice. The paged version offloads optimizer states to CPU memory when GPU memory runs short, which helps avoid out-of-memory errors during memory spikes.
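Assuming the bitsandbytes library is installed and a CUDA GPU is available, the three variants can be swapped in with one line (a sketch; the Linear layer stands in for a real model):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(768, 2).cuda()  # stand-in for a real model

# 32-bit AdamW: keeps two full-precision state tensors per parameter.
opt_32bit = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

# 8-bit AdamW: quantizes optimizer states, with similar results at a
# fraction of the GPU memory.
opt_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.01)

# Paged 8-bit AdamW: pages optimizer states to CPU memory under pressure.
opt_paged = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-5, weight_decay=0.01)
```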
Batch size
The number of training examples used in one iteration. Larger batch sizes can lead to faster training but may require more memory. The default batch size is often set to 32.
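In PyTorch, the batch size is set on the DataLoader (a toy sketch with random data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 256 examples of 768-dim features with binary labels.
dataset = TensorDataset(torch.randn(256, 768), torch.randint(0, 2, (256,)))

# batch_size controls how many examples are processed per iteration.
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```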
Number of epochs
An epoch is one complete pass through the entire training dataset. The number of epochs determines how many times the model will see the entire dataset during training. The default number of epochs is often set to 3. Increasing it can allow the model to see the data more times, which can improve performance. However, it can also increase the risk of overfitting.
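Putting batch size and epochs together, a bare-bones PyTorch training loop looks like this (toy data and model; the loop structure is what matters):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 768), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

num_epochs = 3  # each epoch is one complete pass over the dataset
for epoch in range(num_epochs):
    for features, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
```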
Warmup steps
The number of training steps for which the learning rate is gradually increased from a small value to the initial learning rate. This can help stabilize training in the early stages. The default number of warmup steps is often set to 0.1 x total training steps.
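With Hugging Face transformers, for example, you can derive the warmup steps from the total step count and pass both to a scheduler (a sketch with placeholder numbers):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 2)  # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)

total_steps = 1_000
warmup_steps = int(0.1 * total_steps)  # 10% of total training steps

# The LR ramps linearly from 0 to 5e-4 over the first 100 steps, then
# decays linearly back to 0 over the remaining 900.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
```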
Weight decay
A regularization technique that adds a penalty term to the loss function to prevent overfitting by keeping the weights small. The default weight decay is often set to 0.01. Increasing it can improve generalization, especially for large models or complex tasks. However, it can also slow down learning.
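Weight decay is passed to the optimizer. A common refinement, sketched below, is to exclude biases (and often LayerNorm parameters) from the penalty, since they generally don't benefit from it:

```python
import torch

model = torch.nn.Linear(768, 2)  # stand-in for a real model

# Split parameters into those that get weight decay and those that don't.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if "bias" in name else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 0.01},  # common default
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-5,
)
```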
Packing
Training batches have a pre-defined sequence length. Instead of padding each short sample up to that length, packing concatenates multiple short samples into a single sequence, wasting fewer tokens on padding and increasing training efficiency.
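Here is a minimal sketch of greedy packing over already-tokenized samples; real implementations (for example, TRL's SFTTrainer with packing enabled) also insert separator tokens between samples and handle attention masks:

```python
def pack_samples(tokenized_samples, seq_len, pad_token_id):
    """Concatenate samples into one stream, then cut fixed-length sequences."""
    stream = [tok for sample in tokenized_samples for tok in sample]
    packed = [stream[i : i + seq_len] for i in range(0, len(stream), seq_len)]
    packed[-1] += [pad_token_id] * (seq_len - len(packed[-1]))  # pad the tail
    return packed

# Three short samples fill two sequences instead of three mostly-padded ones.
print(pack_samples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=5, pad_token_id=0))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]
```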
Tips for hyperparameter tuning
Start with a small learning rate and gradually increase: A smaller learning rate, like 0.0001, can help with convergence, especially at the beginning of training.
Use hyperparameter search: Employ techniques like grid search, random search, or Bayesian optimization to find optimal hyperparameter combinations (see the sketch after this list). Modal makes it fast and easy to run massively parallel hyperparameter searches on thousands of containers.
Monitor validation performance: Keep an eye on the model’s performance on a held-out validation set to avoid overfitting. You can use tools like Weights & Biases to visualize the performance of different permutations of hyperparameters on the held-out set.
Experiment with different techniques: Try combining different fine-tuning methods, such as LoRA with QAT, to achieve better results.
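Below is a minimal grid-search sketch; train_and_evaluate is a hypothetical placeholder for your own fine-tuning routine, which should return a validation metric:

```python
import itertools
import random

def train_and_evaluate(lr, batch_size, weight_decay):
    # Hypothetical placeholder: fine-tune with these hyperparameters and
    # return the validation loss. random.random() just makes the sketch run.
    return random.random()

grid = {
    "lr": [1e-5, 5e-5, 1e-4],
    "batch_size": [16, 32],
    "weight_decay": [0.0, 0.01],
}

# Try every combination and keep the one with the lowest validation loss.
results = {
    combo: train_and_evaluate(*combo)
    for combo in itertools.product(*grid.values())
}
best = min(results, key=results.get)
print(best)
```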
For practical examples of LLM fine-tuning using Modal, check out our LLM fine-tuning guide.