# Transformer Training Details

Optimizer, Scheduler, Loss Function

The previous article discussed the implementation of a data loader for training a model based on the transformer architecture from Attention Is All You Need by Ashish Vaswani et al.

This article covers the following Transformer training details:

- Adam Optimizer
- Learning Rate Scheduler
- Cross-Entropy Loss With Label Smoothing
- Transformer Training Loop & Results

## 1 Adam Optimizer

In section 5.3 of the paper, they mentioned that they used the Adam optimizer with the following parameters:

\[ \begin{aligned} \beta_1 &= 0.9 \\ \beta_2 &= 0.98 \\ \epsilon &= 10^{-9} \end{aligned} \]

```
from torch.optim import Adam

optimizer = Adam(model.parameters(),
                 betas = (0.9, 0.98),
                 eps = 1.0e-9)
```

There is no surprise here except that we didn’t explicitly specify the learning rate (the default is 0.001).

## 2 Learning Rate Scheduler

In section 5.3 of the paper, they explained how to vary the learning rate throughout training:

\[ \text{learning\_rate} = \dfrac{1}{\sqrt{d_\text{model}}} \cdot \min \left( \dfrac{1}{\sqrt{\text{step\_num}}},\ \text{step\_num} \cdot \dfrac{1}{\text{warmup\_steps}^{\frac{3}{2}}} \right) \]

The first observation is that the larger the number of embedding vector dimensions, the lower the learning rate. It makes sense to reduce the learning rate when adjusting more parameters.

The second observation is that the two terms within the brackets become the same value when the training step number `step_num` reaches the warmup steps `warmup_steps`.

\[ \begin{aligned} \text{step\_num} &\rightarrow \text{warmup\_steps} \\ \dfrac{1}{\sqrt{\text{step\_num}}} &\rightarrow \dfrac{1}{\sqrt{\text{warmup\_steps}}} \\ \text{step\_num} \cdot \dfrac{1}{\text{warmup\_steps}^\frac{3}{2}} &\rightarrow \dfrac{\text{warmup\_steps}}{\text{warmup\_steps}^\frac{3}{2}} = \dfrac{1}{\sqrt{\text{warmup\_steps}}} \end{aligned} \]

So, the learning rate linearly increases until the training step hits the warmup steps (the second term). Then, it decreases due to the inverse square root of the step number (the first term).
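As a quick worked check of this derivation, substituting `step_num = warmup_steps` into the schedule gives the peak learning rate. The numbers below assume the base model's embedding size of 512 together with `warmup_steps = 4000`; treat them as an illustration rather than a value quoted from the paper.

\[ \text{learning\_rate}_\text{peak} = \dfrac{1}{\sqrt{d_\text{model}}} \cdot \dfrac{1}{\sqrt{\text{warmup\_steps}}} = \dfrac{1}{\sqrt{512 \cdot 4000}} \approx 7.0 \times 10^{-4} \]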

We can use a Python function to calculate the learning rate:

```
# Learning rate calculation: step_num starts with 1
def calc_lr(step, dim_embed, warmup_steps):
    return dim_embed**(-0.5) * min(step**(-0.5),
                                   step * warmup_steps**(-1.5))
```

The function makes the behavior of the schedule easy to check: the larger the number of embedding vector dimensions `dim_embed`, the lower the learning rate. As expected, the learning rate peaks when `step_num` is at `warmup_steps`, and the larger `warmup_steps` is, the lower the learning rate at the peak.
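A small sketch to check these three properties numerically. It repeats the schedule function so that it runs on its own; `peak_step` is a hypothetical helper, and the `(dim_embed, warmup_steps)` pairs are illustrative settings rather than results from the article.

```python
def calc_lr(step: int, dim_embed: int, warmup_steps: int) -> float:
    """Learning rate from the schedule above; step starts at 1."""
    return dim_embed ** (-0.5) * min(step ** (-0.5), step * warmup_steps ** (-1.5))


def peak_step(dim_embed: int, warmup_steps: int, max_step: int) -> int:
    """Step in [1, max_step] with the highest learning rate (brute-force search)."""
    return max(range(1, max_step + 1),
               key=lambda s: calc_lr(s, dim_embed, warmup_steps))


# Illustrative settings only
for dim_embed, warmup_steps in [(512, 4000), (128, 4000), (128, 10000)]:
    peak = peak_step(dim_embed, warmup_steps, max_step=20000)
    lr = calc_lr(peak, dim_embed, warmup_steps)
    print(f"dim_embed={dim_embed:4d} warmup_steps={warmup_steps:6d} "
          f"peak at step {peak:6d}, lr={lr:.2e}")
```

Each printed row should show the peak landing exactly on `warmup_steps`, with a lower peak for the larger `dim_embed` and for the larger `warmup_steps`.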

The learning rate starts very small during the warmup period and increases linearly. The paper doesn’t mention the reason for this learning rate schedule. Still, I guess they found the training unstable during the initial steps and empirically decided to use `warmup_steps=4000` for the base transformer training.

I implemented a learning rate scheduler as follows:

```
from typing import List

from torch.optim import Optimizer
from torch.optim.lr_scheduler import _LRScheduler


class Scheduler(_LRScheduler):
    def __init__(self,
                 optimizer: Optimizer,
                 dim_embed: int,
                 warmup_steps: int,
                 last_epoch: int=-1,
                 verbose: bool=False) -> None:

        # Set these before super().__init__, which already calls get_lr once
        self.dim_embed = dim_embed
        self.warmup_steps = warmup_steps
        self.num_param_groups = len(optimizer.param_groups)

        # Note: recent PyTorch versions deprecate the verbose argument
        super().__init__(optimizer, last_epoch, verbose)

    def get_lr(self) -> List[float]:
        # Same learning rate for every parameter group
        lr = calc_lr(self._step_count, self.dim_embed, self.warmup_steps)
        return [lr] * self.num_param_groups


def calc_lr(step, dim_embed, warmup_steps):
    return dim_embed**(-0.5) * min(step**(-0.5), step * warmup_steps**(-1.5))
```
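As a hedged alternative sketch, the same schedule can also be expressed with PyTorch's built-in `LambdaLR`, which multiplies the optimizer's base learning rate by a user-supplied factor. This is not the article's `Scheduler` class; the toy `nn.Linear` model, the base learning rate of 1.0, and the tiny `warmup_steps` value are assumptions chosen only to make the example quick to run.

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR


def calc_lr(step, dim_embed, warmup_steps):
    return dim_embed**(-0.5) * min(step**(-0.5), step * warmup_steps**(-1.5))


# Toy stand-in for the Transformer (assumption for illustration)
model = nn.Linear(4, 2)
dim_embed, warmup_steps = 128, 5

# Base lr of 1.0 so that the LambdaLR factor becomes the learning rate itself
optimizer = Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1.0e-9)

# LambdaLR counts from 0, while the schedule expects step_num >= 1
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda step: calc_lr(step + 1, dim_embed, warmup_steps))

lrs = []
for _ in range(10):
    lrs.append(scheduler.get_last_lr()[0])  # rate used for this step
    loss = model(torch.randn(3, 4)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()

print(lrs)
```

The recorded rates should rise for the first five steps and fall afterwards, mirroring the warmup-then-decay shape of the schedule.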

## 3 Cross-Entropy Loss With Label Smoothing

We use the cross-entropy loss to calculate the loss value since predicting the next token ID is a classification problem.

```
import torch.nn as nn

loss_func = nn.CrossEntropyLoss(ignore_index = PAD_IDX,
                                label_smoothing = 0.1)
```

`ignore_index = PAD_IDX` means the loss calculation ignores positions where the label token index is the padding index, regardless of what the model predicts.

`label_smoothing = 0.1` means we are using label smoothing, which is a way to prevent a model from being too confident about its prediction:

- Cross-entropy loss without label smoothing assumes there is only one correct choice of the token. The loss is the negative log-likelihood `nll`, where the label token index has 100% weight like one-hot encoding.
- However, there could be multiple token choices with different probabilities. So, instead of 100% weight, we assign the weight `1.0 - label_smoothing` to the label token index and distribute the remaining weight `label_smoothing` across all the token indices: `distribution = label_smoothing / vocab_size`. In other words, we add a small possibility (of being a correct label) to all token indices.
- We calculate the sum of the negative log-softmax values across all the token indices, `-log_softmax`, and multiply it by the distribution.

So, we calculate a loss with label smoothing as follows:

```
distribution = label_smoothing / vocab_size

loss = (1.0 - label_smoothing) * nll - distribution * log_softmax
```

For `label_smoothing = 0.1`, the loss becomes:

`loss = 0.9 * nll - 0.1 / vocab_size * log_softmax`

Note: for both `nll` and `log_softmax`, we ignore the loss where the label is `PAD_IDX`.
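To make the formula concrete, here is a minimal pure-Python sketch for a single token position. It checks that the expression above gives the same value as the cross-entropy against the smoothed target weights. The function names and the example logits are made up for illustration; the sketch does not handle padding and is not meant to describe how PyTorch computes the loss internally.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def smoothed_loss(logits: List[float], label: int, label_smoothing: float) -> float:
    """(1 - ls) * nll - (ls / vocab_size) * sum of log-softmax values."""
    log_probs = log_softmax(logits)
    vocab_size = len(logits)
    nll = -log_probs[label]
    distribution = label_smoothing / vocab_size
    return (1.0 - label_smoothing) * nll - distribution * sum(log_probs)


def soft_target_loss(logits: List[float], label: int, label_smoothing: float) -> float:
    """Cross-entropy against the smoothed target weights, for comparison."""
    log_probs = log_softmax(logits)
    vocab_size = len(logits)
    weights = [label_smoothing / vocab_size] * vocab_size
    weights[label] += 1.0 - label_smoothing
    return -sum(w * lp for w, lp in zip(weights, log_probs))


logits = [2.0, 0.5, -1.0, 0.1]  # hypothetical scores for a 4-token vocabulary
print(smoothed_loss(logits, label=0, label_smoothing=0.1))
print(soft_target_loss(logits, label=0, label_smoothing=0.1))
```

The two printed values should agree up to floating-point rounding, and setting `label_smoothing=0.0` reduces the loss to the plain negative log-likelihood.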

Thankfully, PyTorch's `nn.CrossEntropyLoss` supports both ignoring padding and handling label smoothing.

When we use `nn.CrossEntropyLoss`, we need to flatten the model outputs and label token indices. So, I wrote the following wrapper module:

```
import torch
import torch.nn as nn
from torch import Tensor


class TranslationLoss(nn.Module):
    def __init__(self, label_smoothing: float=0.0) -> None:
        super().__init__()
        # PAD_IDX (the padding token index) must be defined elsewhere
        self.loss_func = nn.CrossEntropyLoss(ignore_index = PAD_IDX,
                                             label_smoothing = label_smoothing)

    def forward(self, logits: Tensor, labels: Tensor) -> Tensor:
        # Flatten to (num_tokens, vocab_size) and (num_tokens,)
        vocab_size = logits.shape[-1]
        logits = logits.reshape(-1, vocab_size)
        labels = labels.reshape(-1).long()
        return self.loss_func(logits, labels)
```
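As a usage sketch of the flattening step: `nn.CrossEntropyLoss` accepts logits of shape `(N, vocab_size)` with labels of shape `(N,)`, so a `(batch, sequence, vocab_size)` tensor is reshaped first. The tensor shapes and `PAD_IDX = 0` below are assumptions for illustration, and the example calls `nn.CrossEntropyLoss` directly instead of going through the wrapper.

```python
import torch
import torch.nn as nn

PAD_IDX = 0  # assumed padding index, for illustration only

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 5, 11

logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(1, vocab_size, (batch_size, seq_len))
labels[:, -2:] = PAD_IDX  # pretend the last two positions are padding

loss_func = nn.CrossEntropyLoss(ignore_index=PAD_IDX, label_smoothing=0.1)

# Flatten to (batch_size * seq_len, vocab_size) and (batch_size * seq_len,)
flat_logits = logits.reshape(-1, vocab_size)
flat_labels = labels.reshape(-1).long()
loss = loss_func(flat_logits, flat_labels)

# Changing the logits at padded positions should leave the loss unchanged
noisy_logits = logits.clone()
noisy_logits[:, -2:, :] += 100.0 * torch.randn(batch_size, 2, vocab_size)
noisy_loss = loss_func(noisy_logits.reshape(-1, vocab_size), flat_labels)

print(loss.item(), noisy_loss.item())
```

Both printed values should match, which illustrates that positions labeled with the padding index contribute nothing to the loss, whatever the model predicts there.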

According to the paper, the use of label smoothing improved the BLEU score:

"During training, we employed label smoothing of value \(\epsilon_{ls} = 0.1\) [36]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score." (Attention Is All You Need)

For details of the BLEU score, please look at this article.

## 4 Transformer Training Loop

The following code handles one epoch during training:

```
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train(model: nn.Module,
          loader: DataLoader,
          loss_func: torch.nn.Module,
          optimizer: torch.optim.Optimizer,
          scheduler: torch.optim.lr_scheduler._LRScheduler) -> float:
    # train mode
    model.train()

    total_loss = 0
    num_batches = len(loader)

    for source, target, labels, source_mask, target_mask in loader:
        # feed forward
        logits = model(source, target, source_mask, target_mask)

        # loss calculation
        loss = loss_func(logits, labels)
        total_loss += loss.item()

        # back-prop
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # learning rate scheduler
        if scheduler is not None:
            scheduler.step()

    # average training loss
    avg_loss = total_loss / num_batches
    return avg_loss
```

We load the training dataset with `split = 'train'` and pass it to the train function. Please look at this article for the details of the data loader.

For evaluation, we set the model to the eval mode by `model.eval()`. We also don’t need gradients, as there is no optimization step.

We load the validation dataset with `split = 'valid'` and pass it to the validate function.

```
def validate(model: nn.Module,
             loader: DataLoader,
             loss_func: torch.nn.Module) -> float:
    # eval mode
    model.eval()

    total_loss = 0
    num_batches = len(loader)

    for source, target, labels, source_mask, target_mask in loader:
        with torch.no_grad():
            # feed forward
            logits = model(source, target, source_mask, target_mask)

            # loss calculation
            loss = loss_func(logits, labels)
            total_loss += loss.item()

    # average validation loss
    avg_loss = total_loss / num_batches
    return avg_loss
```
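The two functions above each handle a single pass over a dataset. A minimal sketch of how they might be combined across epochs follows; `run_epochs` and its callable arguments are hypothetical names, and the stand-in lambdas return made-up loss values instead of calling a real model.

```python
from typing import Callable, List, Tuple


def run_epochs(num_epochs: int,
               train_one_epoch: Callable[[], float],
               validate_one_epoch: Callable[[], float]) -> List[Tuple[float, float]]:
    """Run training then validation once per epoch and collect the average losses."""
    history = []
    for epoch in range(1, num_epochs + 1):
        train_loss = train_one_epoch()
        val_loss = validate_one_epoch()
        history.append((train_loss, val_loss))
        print(f"epoch {epoch:2d}: train_loss={train_loss:.4f} val_loss={val_loss:.4f}")
    return history


# Stand-in callables with made-up losses. In practice they would wrap calls to
# the train and validate functions with the model, loaders, and loss modules.
fake_train = iter([4.0, 3.0, 2.5])
fake_val = iter([3.8, 3.1, 2.9])
history = run_epochs(3, lambda: next(fake_train), lambda: next(fake_val))
```

Passing callables keeps the loop independent of how the model, data loaders, optimizer, and scheduler are constructed.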

## 5 Transformer Training Results

I set up a smaller version of the Transformer than the original base model.

```
model:
  name: Transformer
  max_positions: 5000  # Positional encoding
  num_blocks: 2        # Encoder and decoder layers
  num_heads: 8         # Multi-head attention
  dim_embed: 128       # Embedding vector dimensions
  dim_pffn: 512        # Position-wise feed-forward
  drop_prob: 0.3       # Drop out
```

Since the dataset (`Multi30k` for German-to-English translation) is relatively small, I reduced the network parameters and used a higher drop probability to prevent over-fitting.

I trained the Transformer with the following setup:

```
epochs: 20
batch_size: 32

optimizer:
  name: torch.optim.Adam
  betas:
    - 0.9
    - 0.98
  eps: 1.0e-9

scheduler:
  name: Scheduler
  dim_embed: 128
  warmup_steps: 10000

loss:
  name: TranslationLoss
  label_smoothing: 0.1

val_loss:
  name: TranslationLoss
  label_smoothing: 0.0 # no label smoothing for validation
```

I used a relatively large number of warmup steps (10,000 instead of the paper's 4,000) to keep the peak learning rate lower.
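As a rough check of that choice, the peak of the schedule (reached when `step_num` equals `warmup_steps`) can be compared for the two warmup settings with `dim_embed = 128`. The 4000-step case is included only for comparison; it was not part of this training run.

\[ \begin{aligned} \text{warmup\_steps} = 4000: \quad & \dfrac{1}{\sqrt{128 \cdot 4000}} \approx 1.4 \times 10^{-3} \\ \text{warmup\_steps} = 10000: \quad & \dfrac{1}{\sqrt{128 \cdot 10000}} \approx 8.8 \times 10^{-4} \end{aligned} \]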

The Transformer training finished in less than 6 hours on a Linux machine with 4 CPUs (Intel Core i7-7700K @ 4.20GHz) and two GPUs (NVIDIA GeForce GTX 1080 Ti).

I could’ve run it longer, but it was enough to show that the model is learning.

I implemented a translator class and tested it with the test dataset from `Multi30k`. Some good examples are shown below (input, label, and the model’s prediction):

German: Die Person im gestreiften Hemd klettert auf einen Berg.
English: The person in the striped shirt is mountain climbing.
Translation: The person in the striped shirt is climbing a mountain.

German: Ein junges Mädchen schwimmt in einem Pool
English: A young girl swimming in a pool
Translation: A young girl swimming in a pool.

German: Eine Frau, die in einer Küche eine Schale mit Essen hält.
English: A woman holding a bowl of food in a kitchen.
Translation: A woman holding a bowl of food in a kitchen.

Below are some examples of bad translations:

German: Drei Leute sitzen in einer Höhle.
English: Three people sit in a cave.
Translation: Three people are sitting in an indoor pool.

German: Leute Reparieren das Dach eines Hauses.
English: People are fixing the roof of a house.
Translation: People riding the roof of a house.

German: Ein Boston Terrier läuft über saftig-grünes Gras vor einem weißen Zaun.
English: A Boston Terrier is running on lush green grass in front of a white fence.
Translation: A rodeo athlete runs across the grass.

I’ll write about the translator in the next article.

## 6 References

- The Annotated Transformer
  Harvard NLP
- Rethinking the Inception Architecture for Computer Vision
  Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna
- Training Tips for the Transformer Model
  Martin Popel, Ondřej Bojar
- Attention Is All You Need
  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin