Transformer Training Details
Optimizer, Scheduler, Loss Function
The previous article discussed the implementation of a data loader for training a model based on the transformer architecture from Attention Is All You Need by Ashish Vaswani et al.
This article covers the following aspects of Transformer training:
- Adam Optimizer
- Learning Rate Scheduler
- Cross-Entropy Loss With Label Smoothing
- Transformer Training Loop & Results
1 Adam Optimizer
In section 5.3 of the paper, they mentioned that they used the Adam optimizer with the following parameters:
\[ \begin{aligned} \beta_1 &= 0.9 \\ \beta_2 &= 0.98 \\ \epsilon &= 10^{-9} \end{aligned} \]
from torch.optim import Adam

optimizer = Adam(model.parameters(),
                 betas = (0.9, 0.98),
                 eps   = 1.0e-9)
There is no surprise here except that we didn’t explicitly specify the learning rate (the default is 0.001).
2 Learning Rate Scheduler
In section 5.3 of the paper, they explained how to vary the learning rate throughout training:
\[ \text{learning\_rate} = \dfrac{1}{\sqrt{d_\text{model}}} \cdot \min \left( \dfrac{1}{\sqrt{\text{step\_num}}},\ \text{step\_num} \cdot \dfrac{1}{\text{warmup\_steps}^{\frac{3}{2}}} \right) \]
The first observation is that the larger the number of embedding vector dimensions, the lower the learning rate. It makes sense to reduce the learning rate when adjusting more parameters.
The second observation is that the two terms inside the brackets become the same value when the training step number step_num reaches the warmup steps warmup_steps:
\[ \begin{aligned} \text{step\_num} &\rightarrow \text{warmup\_steps} \\ \dfrac{1}{\sqrt{\text{step\_num}}} &\rightarrow \dfrac{1}{\sqrt{\text{warmup\_steps}}} \\ \text{step\_num} \cdot \dfrac{1}{\text{warmup\_steps}^\frac{3}{2}} &\rightarrow \dfrac{\text{warmup\_steps}}{\text{warmup\_steps}^\frac{3}{2}} = \dfrac{1}{\sqrt{\text{warmup\_steps}}} \end{aligned} \]
So, the learning rate linearly increases until the training step hits the warmup steps (the second term). Then, it decreases due to the inverse square root of the step number (the first term).
We can use a Python function to calculate the learning rate:
# Learning rate calculation: step_num starts with 1
def calc_lr(step, dim_embed, warmup_steps):
    return dim_embed**(-0.5) * min(step**(-0.5), step * warmup_steps**(-1.5))
As we can see, the learning rate is lower as the number of embedding vector dimensions dim_embed is larger. As expected, the learning rate peaks when step_num is at warmup_steps, and the larger warmup_steps is, the lower the learning rate at the peak.
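To see this concretely, here is a small sketch using the calc_lr function above; the dim_embed and warmup_steps values are just illustrative choices, and the printed numbers are approximate.

# The peak learning rate is dim_embed**(-0.5) * warmup_steps**(-0.5),
# reached when step == warmup_steps.
for w in (4000, 10000):
    peak_lr = calc_lr(step=w, dim_embed=128, warmup_steps=w)
    print(f'warmup_steps={w}: peak lr ~ {peak_lr:.6f}')

# With dim_embed=128 this gives roughly 0.0014 for 4000 warmup steps
# and roughly 0.0009 for 10000 warmup steps: the larger warmup_steps is,
# the lower the peak learning rate.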
The learning rate starts very small during the warmup period and increases linearly. The paper doesn’t mention the reason for this learning rate schedule. Still, I guess they found the training unstable during the initial steps and empirically decided to use warmup_steps=4000 for the base transformer training.
I implemented a learning rate scheduler as follows:
from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler

class Scheduler(_LRScheduler):
    def __init__(self,
                 optimizer: Optimizer,
                 dim_embed: int,
                 warmup_steps: int,
                 last_epoch: int=-1,
                 verbose: bool=False) -> None:
        self.dim_embed = dim_embed
        self.warmup_steps = warmup_steps
        self.num_param_groups = len(optimizer.param_groups)
        super().__init__(optimizer, last_epoch, verbose)

    def get_lr(self) -> list:
        lr = calc_lr(self._step_count, self.dim_embed, self.warmup_steps)
        return [lr] * self.num_param_groups

def calc_lr(step, dim_embed, warmup_steps):
    return dim_embed**(-0.5) * min(step**(-0.5), step * warmup_steps**(-1.5))
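A minimal usage sketch: the scheduler is stepped once per batch, right after the optimizer. The dim_embed and warmup_steps values here are placeholders; the actual configuration is shown later in this article.

# Hypothetical values for illustration.
optimizer = Adam(model.parameters(), betas=(0.9, 0.98), eps=1.0e-9)
scheduler = Scheduler(optimizer, dim_embed=128, warmup_steps=4000)

for source, target, labels, source_mask, target_mask in loader:
    ...                      # forward pass, loss calculation, loss.backward()
    optimizer.step()
    scheduler.step()         # updates the learning rate for the next step
    current_lr = scheduler.get_last_lr()[0]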
3 Cross-Entropy Loss With Label Smoothing
We use the cross-entropy loss to calculate the loss value since predicting the next token ID is a classification problem.
import torch.nn as nn

loss_func = nn.CrossEntropyLoss(ignore_index = PAD_IDX,
                                label_smoothing = 0.1)
ignore_index = PAD_IDX means the loss calculation ignores positions where the label token index is the padding index, regardless of what the model predicts.

label_smoothing = 0.1 means we are using label smoothing, which is a way to prevent a model from becoming too confident about its predictions:

- Cross-entropy loss without label smoothing assumes there is only one correct choice of the token. The loss is the negative log-likelihood nll, where the label token index has 100% weight, like one-hot encoding.
- However, there could be multiple token choices with different probabilities. So, instead of 100% weight, we assign the weight 1.0 - label_smoothing to the label token index and distribute the remaining weight label_smoothing uniformly across all the token indices: distribution = label_smoothing / vocab_size. In other words, we add a small probability of being the correct label to every token index.
- We calculate the sum of the negative log-softmax across all the token indices, -log_softmax, and multiply it by the distribution.
So, we calculate a loss with label smoothing as follows:
distribution = label_smoothing / vocab_size

loss = (1.0 - label_smoothing) * nll - distribution * log_softmax
For label_smoothing = 0.1, the loss becomes:

loss = 0.9 * nll - 0.1 / vocab_size * log_softmax
Note: for both nll and log_softmax, we ignore the loss where the label is PAD_IDX.
Thankfully, PyTorch’s nn.CrossEntropyLoss supports both ignoring padding and label smoothing.
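As a quick sanity check, the following sketch compares the built-in label smoothing against the manual formula above on a few random tokens; the two values should agree up to floating-point error. The sizes and labels are arbitrary examples.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, label_smoothing = 5, 0.1
logits = torch.randn(3, vocab_size)      # 3 tokens, vocabulary of 5
labels = torch.tensor([1, 4, 2])

# Built-in label smoothing
builtin = F.cross_entropy(logits, labels, label_smoothing=label_smoothing)

# Manual calculation following the formula above
log_softmax = F.log_softmax(logits, dim=-1)
nll = -log_softmax[torch.arange(3), labels]                     # per-token negative log-likelihood
smooth = -(label_smoothing / vocab_size) * log_softmax.sum(dim=-1)
manual = ((1.0 - label_smoothing) * nll + smooth).mean()

print(builtin.item(), manual.item())     # the two values should match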
When we use nn.CrossEntropyLoss, we need to flatten the model outputs and the label token indices. So, I wrote the following wrapper module:
import torch
import torch.nn as nn
from torch import Tensor

class TranslationLoss(nn.Module):
    def __init__(self, label_smoothing: float=0.0) -> None:
        super().__init__()
        self.loss_func = nn.CrossEntropyLoss(ignore_index = PAD_IDX,
                                             label_smoothing = label_smoothing)

    def forward(self, logits: Tensor, labels: Tensor) -> Tensor:
        vocab_size = logits.shape[-1]
        logits = logits.reshape(-1, vocab_size)
        labels = labels.reshape(-1).long()
        return self.loss_func(logits, labels)
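A minimal usage sketch with dummy shapes: logits come from the model with shape (batch, sequence length, vocab size), and labels have shape (batch, sequence length). PAD_IDX, the batch size, sequence length, and vocabulary size are placeholder assumptions for illustration.

PAD_IDX = 1                                   # assumed padding token index
loss_func = TranslationLoss(label_smoothing=0.1)

logits = torch.randn(32, 27, 10000)           # (batch, seq_len, vocab_size) from the model
labels = torch.randint(0, 10000, (32, 27))    # (batch, seq_len) label token indices
loss = loss_func(logits, labels)              # scalar loss tensor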
According to the paper, the use of label smoothing improved the BLEU score:
“During training, we employed label smoothing of value ls = 0.1 [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.” (Attention Is All You Need)
For details of the BLEU score, please look at this article.
4 Transformer Training Loop
The following code handles one epoch during training:
from torch.utils.data import DataLoader

def train(model: nn.Module,
          loader: DataLoader,
          loss_func: torch.nn.Module,
          optimizer: torch.optim.Optimizer,
          scheduler: torch.optim.lr_scheduler._LRScheduler) -> float:
    # train mode
    model.train()

    total_loss = 0
    num_batches = len(loader)

    for source, target, labels, source_mask, target_mask in loader:
        # feed forward
        logits = model(source, target, source_mask, target_mask)

        # loss calculation
        loss = loss_func(logits, labels)
        total_loss += loss.item()

        # back-prop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # learning rate scheduler
        if scheduler is not None:
            scheduler.step()

    # average training loss
    avg_loss = total_loss / num_batches
    return avg_loss
We load the training dataset with split = 'train' and pass it to the train function. Please look at this article for the details of the data loader.
For evaluation, we set the model to eval mode with model.eval(). We also don’t need gradients since there is no optimization step.
We load the validation dataset with split = 'valid' and pass it to the validate function.
def validate(model: nn.Module,
             loader: DataLoader,
             loss_func: torch.nn.Module) -> float:
    # eval mode
    model.eval()

    total_loss = 0
    num_batches = len(loader)

    for source, target, labels, source_mask, target_mask in loader:
        with torch.no_grad():
            # feed forward
            logits = model(source, target, source_mask, target_mask)

            # loss calculation
            loss = loss_func(logits, labels)
            total_loss += loss.item()

    # average validation loss
    avg_loss = total_loss / num_batches
    return avg_loss
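Putting the two functions together, the outer epoch loop might look like the following sketch. num_epochs, train_loader, valid_loader, and val_loss_func are assumed names for the objects described in this article.

# Assumed names: num_epochs, train_loader, valid_loader, val_loss_func.
for epoch in range(num_epochs):
    train_loss = train(model, train_loader, loss_func, optimizer, scheduler)
    valid_loss = validate(model, valid_loader, val_loss_func)
    print(f'epoch {epoch + 1}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}')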
5 Transformer Training Results
I set up a smaller version of the Transformer than the original base model.
model:
  name: Transformer
  max_positions: 5000   # Positional encoding
  num_blocks: 2         # Encoder and decoder layers
  num_heads: 8          # Multi-head attention
  dim_embed: 128        # Embedding vector dimensions
  dim_pffn: 512         # Position-wise feed-forward
  drop_prob: 0.3        # Drop out
Since the dataset (Multi30k for German-to-English translation) is relatively small, I reduced the network parameters and used a higher drop probability to prevent over-fitting.
And I did a Transformer training with the following setup:
epochs: 20
batch_size: 32

optimizer:
  name: torch.optim.Adam
  betas:
    - 0.9
    - 0.98
  eps: 1.0e-9

scheduler:
  name: Scheduler
  dim_embed: 128
  warmup_steps: 10000

loss:
  name: TranslationLoss
  label_smoothing: 0.1

val_loss:
  name: TranslationLoss
  label_smoothing: 0.0   # no label smoothing for validation
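In code, this setup roughly corresponds to the following construction sketch, using the Adam optimizer, Scheduler, and TranslationLoss from the earlier sections; how the Transformer model itself is built is covered in the previous articles.

# Wiring the configured objects together (model construction not shown).
optimizer = Adam(model.parameters(), betas=(0.9, 0.98), eps=1.0e-9)
scheduler = Scheduler(optimizer, dim_embed=128, warmup_steps=10000)
loss_func = TranslationLoss(label_smoothing=0.1)       # training loss
val_loss_func = TranslationLoss(label_smoothing=0.0)   # validation loss, no smoothing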
I used relatively large warmup steps to keep the learning rate lower.
The Transformer training finished in less than 6 hours on a Linux machine with 4 CPUs (Intel Core i7-7700K @ 4.20GHz) and two GPUs (NVIDIA GeForce GTX 1080 Ti).
I could’ve run it longer, but it was enough to prove the model is learning.
I implemented a translator class and tested it with the test dataset from Multi30k. Some good examples are shown below (Input, Label, and the model’s prediction):
German     : Die Person im gestreiften Hemd klettert auf einen Berg.
English    : The person in the striped shirt is mountain climbing.
Translation: The person in the striped shirt is climbing a mountain.

German     : Ein junges Mädchen schwimmt in einem Pool
English    : A young girl swimming in a pool
Translation: A young girl swimming in a pool.

German     : Eine Frau, die in einer Küche eine Schale mit Essen hält.
English    : A woman holding a bowl of food in a kitchen.
Translation: A woman holding a bowl of food in a kitchen.
Below are some examples of bad Translations:
German     : Drei Leute sitzen in einer Höhle.
English    : Three people sit in a cave.
Translation: Three people are sitting in an indoor pool.

German     : Leute Reparieren das Dach eines Hauses.
English    : People are fixing the roof of a house.
Translation: People riding the roof of a house.

German     : Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun.
English    : A Boston Terrier is running on lush green grass in front of a white fence.
Translation: A rodeo athlete runs across the grass.
I’ll write about the translator in the next article.
6 References
- The Annotated Transformer, Harvard NLP
- Rethinking the Inception Architecture for Computer Vision, Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna
- Training Tips for the Transformer Model, Martin Popel, Ondřej Bojar
- Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin