Transformer’s Self-Attention: Why Is Attention All You Need?

Published: November 14, 2021

In 2017, Vaswani et al. published a paper titled Attention Is All You Need at the NeurIPS conference. The transformer architecture uses neither recurrence nor convolution; it relies solely on attention mechanisms.

In this article, we discuss the attention mechanisms in the transformer:

1 Dot-Product And Word Embedding

The dot-product takes two equal-length vectors and returns a single number.

$$\mathbf{a} = [a_1, a_2, \dots, a_n]$$
$$\mathbf{b} = [b_1, b_2, \dots, b_n]$$
$$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \dots + a_n b_n = \sum_{i=1}^{n} a_i b_i$$

We use the dot operator to express the dot-product operation. We also call it the inner product, as we multiply the two vectors element-wise and sum those products together.
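As a quick sanity check, here is a minimal NumPy sketch (the vector values are arbitrary) showing that summing the element-wise products gives the same number as np.dot:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Element-wise multiplication followed by a sum ...
manual = (a * b).sum()
# ... gives the same single number as the built-in dot-product.
print(manual, np.dot(a, b))  # 32.0 32.0
```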

The geometric definition of the dot product is as follows:

$$\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$

$\|\mathbf{a}\|$ is the magnitude of vector $\mathbf{a}$, $\|\mathbf{b}\|$ is the magnitude of vector $\mathbf{b}$, and $\theta$ is the angle between the two vectors.

We can also project vector $\mathbf{a}$ onto vector $\mathbf{b}$ and visualize the product of $\|\mathbf{a}\|\cos\theta$ and $\|\mathbf{b}\|$. The calculated value is the same either way.

The dot-product $\|\mathbf{a}\|\|\mathbf{b}\|\cos\theta$ is at its maximum when $\theta = 0$ ($\cos 0 = 1$) and at its minimum when $\theta = \pi$ ($\cos \pi = -1$). In other words, the dot-product is larger when two vectors point in similar directions than otherwise.
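A small sketch (again with arbitrary vectors) confirms that the geometric definition reproduces the algebraic one, and that flipping a vector's direction minimizes the dot-product:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# |a| * |b| * cos(theta) reproduces the algebraic dot-product.
print(np.linalg.norm(a) * np.linalg.norm(b) * cos_theta)  # approximately 32.0

# A vector pointing in the opposite direction gives the minimum (negative) value.
print(np.dot(a, -a))  # -14.0
```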

Before discussing dot-product attention, we should talk about word embeddings, as they give us some intuition on how dot-product attention works.

Word embeddings use distributed representations of words (tokens), which are more efficient than one-hot vector representations. word2vec and GloVe (Global Vectors for Word Representation) are well-known pre-trained word embeddings. Such word embedding vectors allow us to decompose a word into analogies (for example, king - man + woman ≈ queen).

The following diagram shows how we can decompose king and queen using word embedding vectors.

In other words, a word vector may contain multiple semantics. Hypothetically, we could query whether a word vector contains a “royal” component by projecting the vector with some matrix operation.
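The following toy sketch illustrates the idea with made-up two-dimensional embeddings (a hypothetical “royal” axis and a hypothetical “gender” axis). Real embeddings such as word2vec have hundreds of learned dimensions, so this is only an intuition aid:

```python
import numpy as np

# Hypothetical 2-d embeddings: [royal-ness, gender]
king  = np.array([0.9,  0.7])
queen = np.array([0.9, -0.7])
man   = np.array([0.1,  0.7])
woman = np.array([0.1, -0.7])

# The classic analogy: king - man + woman lands on queen (in this toy space).
print(king - man + woman)  # [ 0.9 -0.7]

# Projecting onto the hypothetical "royal" axis acts like a query for royalty.
royal_axis = np.array([1.0, 0.0])
print(np.dot(king, royal_axis), np.dot(man, royal_axis))  # 0.9 0.1
```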

The question is: can a language model automatically learn to query different meanings and functions of each word in a sentence?

The transformer learns word embeddings from scratch, learns matrix weights for word vector projections, and calculates the dot-products for attention mechanisms. In the next section, we discuss how those vector mathematics work together.

2 Scaled Dot-Product Attention

Say we have a model that translates an English sentence X into a French sentence Y. In the output sentence, the first word vector would be a BOS (beginning-of-sentence) marker, which is just another word vector (albeit a fixed one); we call it vector y1. The model needs to extract a context from the input sentence X for the first word vector y1 in the output sentence Y.

For simplicity, we express a word vector as a row vector with four dimensions.

We use a 4x3 matrix WQ to project word vector y1:

It’s a simple matrix multiplication on the word vector y1.

$$\mathbf{q}_1 = \mathbf{y}_1 W^Q$$

As you can see, the dimension of q1 does not have to be the same as that of y1.

In this example, the dimension of the resulting vector became three.

We also project word vectors in X using another matrix WK. For example, we project word vector x1 in X as follows:

$$\mathbf{k}_1 = \mathbf{x}_1 W^K$$

Again, the dimension of k1 does not have to be the same as the dimension of x1.

However, the dimension of k1 must be the same as the dimension of q1. Only then can we apply the dot-product to them.

$$\dim(\mathbf{q}_1) = \dim(\mathbf{k}_1)$$

We apply the dot-product to see how strong these projected features (we call them query q1 and key k1) relate to each other:

$$s_{11} = \mathbf{q}_1 \cdot \mathbf{k}_1$$

We can calculate the dot-product by transposing vector k1 and performing matrix multiplication: $s_{11} = \mathbf{q}_1 \mathbf{k}_1^\top$.

The result is a scalar value. Let’s call it the “score”. As noted above, we also call vector q1 the “query” and vector k1 the “key”. In other words, we extract a query from vector y1, extract a key from vector x1, and check how well the query matches the key via the score s11. So, we are finding out how two word vectors relate in terms of their extracted features (the query and the key).
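Here is a minimal NumPy sketch of this step, assuming arbitrary 4-dimensional word vectors and randomly initialized 4x3 projection matrices (in the real model these weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

y1 = rng.normal(size=(1, 4))   # output-side word vector (row vector)
x1 = rng.normal(size=(1, 4))   # input-side word vector (row vector)

W_Q = rng.normal(size=(4, 3))  # query projection
W_K = rng.normal(size=(4, 3))  # key projection

q1 = y1 @ W_Q                  # query: shape (1, 3)
k1 = x1 @ W_K                  # key:   shape (1, 3)

# Score: dot-product via transposing the key, a single scalar value.
s11 = (q1 @ k1.T).item()
print(q1.shape, k1.shape, s11)
```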

Since we are using a linear transformation, we can calculate scores between multiple word-vectors by a simple matrix operation.

Sentence X consists of n word vectors (in reality, an input sentence has a pre-defined number of tokens, and unused positions are padded with a filler token that is effectively ignored):

$$X = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix}$$

We can apply the following matrix operation to extract keys from sentence X:

$$X W^K = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix} W^K = \begin{bmatrix} \mathbf{x}_1 W^K \\ \mathbf{x}_2 W^K \\ \vdots \\ \mathbf{x}_n W^K \end{bmatrix} = \begin{bmatrix} \mathbf{k}_1 \\ \mathbf{k}_2 \\ \vdots \\ \mathbf{k}_n \end{bmatrix} = K$$

We can apply the dot-product between the query q1 and all of the keys in K:

$$\mathbf{q}_1 K^\top = \mathbf{q}_1 \begin{bmatrix} \mathbf{k}_1^\top & \mathbf{k}_2^\top & \cdots & \mathbf{k}_n^\top \end{bmatrix} = \begin{bmatrix} \mathbf{q}_1 \cdot \mathbf{k}_1 & \mathbf{q}_1 \cdot \mathbf{k}_2 & \cdots & \mathbf{q}_1 \cdot \mathbf{k}_n \end{bmatrix} = \begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \end{bmatrix}$$

We now have a list of scores that tells us how the query q1 relates to each key in sentence X. We can apply the softmax function to convert the scores into weights:

$$\mathrm{softmax}(\mathbf{q}_1 K^\top) = \mathrm{softmax}\left(\begin{bmatrix} s_{11} & s_{12} & \cdots & s_{1n} \end{bmatrix}\right) = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \end{bmatrix}$$

The weights indicate how much the model should pay attention to each word in sentence X regarding the query q1. So, let’s call the weights “attention weights”.

We use the attention weights to extract a context from sentence X, but we need to handle one more step. Since we are dealing with a specific query, we should also extract particular values from sentence X. We use another matrix WV to project sentence X to extract values.

$$X W^V = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix} W^V = \begin{bmatrix} \mathbf{x}_1 W^V \\ \mathbf{x}_2 W^V \\ \vdots \\ \mathbf{x}_n W^V \end{bmatrix} = \begin{bmatrix} \mathbf{v}_1 \\ \mathbf{v}_2 \\ \vdots \\ \mathbf{v}_n \end{bmatrix} = V$$

The dimension of vectors in V may be different from the dimension of vectors in X.

We calculate the weighted sum of value vectors using the attention weights:

$$\mathrm{softmax}(\mathbf{q}_1 K^\top)\, V = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \end{bmatrix} V = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \end{bmatrix} \begin{bmatrix} \mathbf{v}_1 \\ \mathbf{v}_2 \\ \vdots \\ \mathbf{v}_n \end{bmatrix} = \sum_{i=1}^{n} w_{1i} \mathbf{v}_i$$

The weighted sum of value vectors represents a context from sentence X regarding the query q1 from word vector y1.
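Putting these pieces together for a single query, here is a small sketch, assuming a toy input sentence of n = 5 word vectors and random (untrained) projection matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k, d_v = 5, 4, 3, 3

X = rng.normal(size=(n, d_model))    # input sentence: n word vectors
y1 = rng.normal(size=(1, d_model))   # first output-side word vector

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

q1 = y1 @ W_Q                        # query  (1, d_k)
K = X @ W_K                          # keys   (n, d_k)
V = X @ W_V                          # values (n, d_v)

scores = q1 @ K.T                    # (1, n): scores s_11 ... s_1n
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
context = weights @ V                # weighted sum of value vectors (1, d_v)

print(weights.round(3), context.round(3))
```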

We can extend the logic to multiple words $(\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_m)$ in sentence Y (in reality, an output sentence has a pre-defined number of tokens, and unpredicted word positions are masked so they won’t affect the attention mechanism):

$$Y = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_m \end{bmatrix}$$

We represent all queries in a matrix:

$$Q = \begin{bmatrix} \mathbf{q}_1 \\ \mathbf{q}_2 \\ \vdots \\ \mathbf{q}_m \end{bmatrix} = \begin{bmatrix} \mathbf{y}_1 W^Q \\ \mathbf{y}_2 W^Q \\ \vdots \\ \mathbf{y}_m W^Q \end{bmatrix} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_m \end{bmatrix} W^Q = Y W^Q$$

So, we can calculate the attention vectors for all tokens in sentence Y:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \quad \text{where } Q = Y W^Q,\ K = X W^K,\ V = X W^V$$
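The whole formula fits in a few lines of NumPy. The following is only a sketch, with arbitrary sentence lengths, small dimensions, and random weights; it omits the masking and batching a real implementation needs:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot-product scores
    return softmax(scores) @ V       # attention-weighted sum of values

rng = np.random.default_rng(0)
m, n, d_model, d_k, d_v = 2, 5, 4, 3, 3

Y = rng.normal(size=(m, d_model))    # output sentence (queries come from here)
X = rng.normal(size=(n, d_model))    # input sentence (keys and values)

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

out = attention(Y @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                     # (2, 3): one context vector per query
```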

The scaling factor $\frac{1}{\sqrt{d_k}}$ deserves some explanation. $d_k$ is the dimension of the query and key vectors. When $d_k$ is large, the dot-product tends to be large in magnitude.

The paper explains the reason by assuming a scenario where the elements of the query and key vectors are independent random variables with mean 0 and variance 1. Under that assumption, the dot-product of vectors $\mathbf{q}$ and $\mathbf{k}$ has mean 0 and variance $d_k$, which follows from the variance of a product of two independent random variables:

$$\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i$$
$$E[\mathbf{q} \cdot \mathbf{k}] = E\!\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} E[q_i k_i] = \sum_{i=1}^{d_k} E[q_i]\,E[k_i] = 0$$
$$\mathrm{Var}(\mathbf{q} \cdot \mathbf{k}) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = \sum_{i=1}^{d_k} \left[(\sigma_{q_i}^2 + \mu_{q_i}^2)(\sigma_{k_i}^2 + \mu_{k_i}^2) - \mu_{q_i}^2 \mu_{k_i}^2\right] = \sum_{i=1}^{d_k} \left[(1+0)(1+0) - 0 \cdot 0\right] = d_k$$
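We can verify this result numerically with a quick Monte Carlo sketch (standard-normal queries and keys, matching the paper’s assumption; the dimensions and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 64, 100_000

# Independent standard-normal query and key elements, as in the paper's assumption.
q = rng.normal(size=(trials, d_k))
k = rng.normal(size=(trials, d_k))

dots = (q * k).sum(axis=1)
print(dots.mean().round(2), dots.var().round(2))  # close to 0 and 64
```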

In short, more dimensions tend to produce larger scores, which is problematic because the softmax function uses the exponential function, pushing large values even larger.

The following example Python script should clarify the reason:

import numpy as np

def softmax(x):
    # Exponentiate each score and normalize so the outputs sum to 1.
    return np.exp(x) / np.exp(x).sum()

s1 = np.array([1.0, 2.0, 3.0, 4.0])
s2 = s1 * 10  # the same scores scaled up by a factor of 10

print(f's1: {softmax(s1)}')
print(f's2: {softmax(s2)}')

The output is as follows:

s1: [0.0320586 0.08714432 0.23688282 0.64391426]
s2: [9.35719813e-14 2.06106005e-09 4.53978686e-05 9.99954600e-01]

In s2, all the weights except the last element are almost zero, which makes the gradients very small.

Let me explain why the gradients become small.

Let the softmax probability of class $i \in \{1, 2, \dots, N\}$ be:

$$p_i = \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$

Then, the partial derivatives of the softmax with respect to the variable zj are as follows:

$$\frac{\partial p_i}{\partial z_j} = p_i (\delta_{ij} - p_j), \quad \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

When the probabilities saturate like this, each $p_i$ is close to either 0 or 1. If $p_i \approx 0$, the factor $p_i$ makes the derivative vanish; if $p_i \approx 1$, then $(\delta_{ij} - p_j) \approx 0$ for every $j$. Therefore, all of the partial derivatives will be almost 0.
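We can check this numerically by building the full softmax Jacobian for the scaled-up scores s2 from the earlier script:

```python
import numpy as np

def softmax(x):
    return np.exp(x) / np.exp(x).sum()

s2 = np.array([10.0, 20.0, 30.0, 40.0])
p = softmax(s2)

# Jacobian entries: dp_i/dz_j = p_i * (delta_ij - p_j)
jacobian = np.diag(p) - np.outer(p, p)
print(np.abs(jacobian).max())  # about 4.5e-05, so every gradient entry is tiny
```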

As such, Vaswani et al. introduced the scaling factor, and they call the attention mechanism “scaled dot-product attention”.

3 Multi-Head Attention

A word could have a different meaning or function depending on the context. So, we should use multiple queries per word rather than just one.

In the paper, they use eight parallel attention calculations and call each attention function a “head”. In other words, they use eight heads (h = 8).

The base transformer uses 512-dimensional word vectors, projected into eight vectors of 64 (= 512/8) dimensions, yielding eight representation subspaces. The scaled dot-product attention processes each of the eight representations using a different set of projection matrices:

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) \quad (i = 1, \dots, h) \quad \text{where } Q_i = Y W_i^Q,\ K_i = X W_i^K,\ V_i = X W_i^V$$

Even though we have eight sets of matrix operations, we can perform them in parallel. So, it’s very fast.

The next step of the multi-head attention is to concatenate all eight heads and apply one more matrix operation WO.

$$\mathrm{MultiHead}(Y, X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$

Note: I’m using slightly different notation from the paper to keep the usage of Q, K, and V consistent with the scaled dot-product attention section.

Since we use the same number of dimensions (64) for the value vectors, the concatenated vectors restore the original dimension (64 x 8 = 512). But they could also use a different number of dimensions for the value vectors because WO can adjust the final vector dimension.
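Here is a minimal NumPy sketch of multi-head attention. To keep it short it uses h = 2 heads and small dimensions instead of the paper’s h = 8 and 512, and the projection matrices are random rather than learned:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Y, X, W_Q, W_K, W_V, W_O):
    # One scaled dot-product attention per head, then concatenate and project.
    heads = [attention(Y @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
h, m, n, d_model, d_head = 2, 2, 5, 8, 4   # d_head = d_model / h

Y = rng.normal(size=(m, d_model))
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_O = rng.normal(size=(h * d_head, d_model))

print(multi_head(Y, X, W_Q, W_K, W_V, W_O).shape)  # (2, 8)
```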

The transformer uses multi-head attention in multiple ways. One is for encoder-decoder (source-target) attention where Y and X are different language sentences. Another use of multi-head attention is for self-attention, where Y and X are the same sentences.

4 Self-Attention

A word can have a different meaning or function depending on the word itself and the words around it.

For example, in the below two sentences, the word “second” has different meanings:

Give me a second, please.

I came second in the exam.

But there is only one word embedding for the word “second”, even though its meaning depends on the context. So, we must treat the word “second” together with its context.

Let’s look at another example.

My dog chases after the thief.

The above sentence shows a functional relationship between “dog” and “chases”. We also see another relationship between “after” and “thief”.

In summary, we need to extract word contexts and relationships within a sentence, where self-attention comes into play.

With self-attention, all queries, keys, and values originate from the same sentence. So, we use multi-head attention like MultiHead(X, X) to extract contexts for each word in a sentence.

Self-attention handles long-range dependencies well compared with RNNs and CNNs. RNNs suffer from vanishing gradients that even LSTMs cannot fully eliminate, and CNNs can only associate data within their kernel size. Self-attention has neither of those issues. Moreover, self-attention operations can run in parallel, much faster than sequential processing like RNN cells.

Also, we can visually inspect self-attention, making it more interpretable. The paper has a few visualizations of the attention mechanism. For example, the following is a self-attention visualization for the word “making” in layer 5 of the encoder.

Figure 3 of the paper

There are eight different colors with various intensities, representing the eight attention heads.

It is clear there is a strong relationship between “making” and “more difficult”.

The below image shows the word “its” and the referred words in layer 5 of the encoder (isolating only attention head 5 and attention head 6).

Figure 4 (bottom) of the paper

The word “its” has strong attention to “Law” and “application”.

The image below shows two heads separately (green and red). Each head seems to have learned to perform a different task based on the structure of the sentence.

Figure 5 of the paper

5 References