This post is a summary of chapter 25.4 in this book. The figures are from the book.
Self-attention is part of the transformer architecture. In my previous post, I talked about how attention is applied in sequence-to-sequence models from the target recurrent neural network (RNN) to the source RNN. Self-attention applies the same mechanism from each sequence of hidden states to itself: the source to the source, and the target to the target.
However, the first step of the attention algorithm computes dot products between a hidden state and all the hidden states. Among these dot products, the one between a hidden state and itself will usually be much higher than the rest, so each position is biased towards attending to itself. The transformer solves this by projecting each input vector into three different representations using three different weight matrices:
The query vector
The key vector
The value vector
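The self-bias and the three projections can be sketched in NumPy. This is a minimal illustration rather than the book's code: the names `X`, `W_q`, `W_k`, `W_v` and all sizes are assumptions, the weights are random stand-ins for learned ones, and the hidden states are normalised to unit length so that the self dot product is guaranteed to be the largest in its row.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d = 5, 8, 4  # assumed sizes: 5 words, 8-dim hidden states, 4-dim projections

# Hypothetical hidden states, one row per word, normalised to unit length.
X = rng.normal(size=(n, d_model))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Raw dot products between every pair of hidden states.
raw_scores = X @ X.T
# Each row peaks on the diagonal: every word scores highest against itself.
self_is_max = np.argmax(raw_scores, axis=1) == np.arange(n)

# Three separate weight matrices (random here, learned in a real model).
W_q = rng.normal(size=(d_model, d))
W_k = rng.normal(size=(d_model, d))
W_v = rng.normal(size=(d_model, d))

Q = X @ W_q  # query vectors, one row per word
K = X @ W_k  # key vectors
V = X @ W_v  # value vectors

# Scores are now query-key products, so the diagonal is no longer forced to win.
projected_scores = Q @ K.T
```

Because the query and key of the same word come from different matrices, nothing in `projected_scores` ties a word to itself any more; whether it attends to itself is left to the learned weights.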
In the standard attention mechanism, the key and value vectors are the same: both are hidden states from the source RNN, and the query vector is the previous hidden state from the target RNN. In self-attention, the encoding of the ith word, c_i, is calculated by applying an attention mechanism to the projected vectors, which in the standard scaled dot-product form reads:
r_ij = (q_i · k_j) / √d
a_ij = exp(r_ij) / Σ_l exp(r_il)
c_i = Σ_j a_ij · v_j
where d is the dimension of k and q. Note that i and j are indexes in the same sentence when we encode the context using self-attention. The scale factor √d is added to improve numerical stability. The encodings for all words in a sentence can be calculated simultaneously, because the computation can be expressed as matrix operations that run efficiently in parallel.
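As a rough sketch of the matrix form, the whole sentence can be encoded in a few NumPy calls. It assumes the standard scaled dot-product formulation (scores q_i · k_j / √d, a softmax over j, then a weighted sum of the value vectors); the function names, the sizes and the random weights are illustrative assumptions.

```python
import numpy as np

def softmax(r):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Encode all n words at once; X has shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = Q.shape[-1]              # dimension of q and k
    R = (Q @ K.T) / np.sqrt(d)   # r_ij = q_i . k_j / sqrt(d), for every i and j
    A = softmax(R)               # a_ij: attention weights, each row sums to 1
    C = A @ V                    # c_i = sum_j a_ij * v_j
    return C, A

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))  # 6 words, 8-dim inputs (assumed sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
C, A = self_attention(X, W_q, W_k, W_v)
```

Row i of `C` is the encoding of word i, and no Python loop over positions is needed, which is what makes the computation easy to parallelise.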
The context-based summarisation, ci, is a weighted sum over the positions in the sentence. In theory any information from the sentence can appear in ci, but in practice some of it gets lost because it is averaged out over the whole sentence. One way to address this issue is called multi-headed attention. We divide the sentence up into m equal pieces and apply the attention model to each of the m pieces. Each piece has its own set of weights. Then the results are concatenated together to form ci.
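A minimal sketch of multi-headed attention follows. It takes one possible reading of the description above, the one most implementations use: each of the m pieces is a head with its own weight matrices that project the input into a smaller space of size d_model / m, and the head outputs are concatenated. The function names, sizes and random weights are assumptions.

```python
import numpy as np

def softmax(r):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One attention head with its own set of weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

def multi_head_attention(X, heads):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    outputs = [attention_head(X, *weights) for weights in heads]
    # Concatenate the per-head results along the feature axis to form c_i.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
n, d_model, m = 6, 8, 2  # assumed sizes: 6 words, 8-dim inputs, 2 heads
d_head = d_model // m
X = rng.normal(size=(n, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(m)
]
C = multi_head_attention(X, heads)
```

Each head computes its own attention weights, so one head can focus on a nearby word while another focuses on a distant one, instead of both being averaged into a single distribution.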
Self-attention is only one component of the transformer model. One transformer layer contains several sub-layers. The input vectors first go through a self-attention layer. Its output is then fed to a feedforward layer, followed by a ReLU activation function, to generate the output vectors. This process is prone to the vanishing gradient problem, so two residual connections are added to the transformer layer.
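The layer described above can be sketched as follows. This is a simplified illustration under stated assumptions: the feedforward part is taken to be two linear maps with a ReLU in between, applied with the same weights at every position; the value projection keeps the input dimension so the residual sum is well defined; and layer normalisation, which real implementations usually include, is left out. The parameter names in `p` are hypothetical.

```python
import numpy as np

def softmax(r):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relu(x):
    return np.maximum(x, 0.0)

def transformer_layer(X, p):
    """One simplified transformer layer; p maps hypothetical names to weights."""
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    H = X + A @ V                      # first residual connection, around self-attention
    F = relu(H @ p["W_1"]) @ p["W_2"]  # feedforward sub-layer with ReLU
    return H + F                       # second residual connection, around feedforward

rng = np.random.default_rng(0)
n, d_model, d, d_ff = 6, 8, 4, 16  # assumed sizes
p = {
    "W_q": rng.normal(size=(d_model, d)),
    "W_k": rng.normal(size=(d_model, d)),
    "W_v": rng.normal(size=(d_model, d_model)),
    "W_1": rng.normal(size=(d_model, d_ff)),
    "W_2": rng.normal(size=(d_ff, d_model)),
}
X = rng.normal(size=(n, d_model))
Y = transformer_layer(X, p)
```

The two `+` operations are the residual connections: even if both sub-layers contributed nothing, the input would still pass through unchanged, which gives gradients a direct path back through the layer.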
The full transformer architecture normally stacks several transformer layers, as shown in the following figure. It can be used for POS tagging (mentioned in the previous post).
The transformer architecture described so far does not capture the order of words in the sequence, because self-attention is agnostic to word order. To capture the ordering of the words, we add positional embeddings. If our input sequence has a maximum length of n, then we learn n new embedding vectors, one for each word position. The input to the first transformer layer is the sum of the word embedding at position t and the positional embedding corresponding to position t.
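Summing the two embeddings can be sketched like this. The table names, the sizes and the `embed` helper are hypothetical, and both tables are filled with random numbers here where a real model would learn them.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_max, d_model = 100, 16, 8  # assumed sizes

word_embedding = rng.normal(size=(vocab_size, d_model))  # one row per word id
pos_embedding = rng.normal(size=(n_max, d_model))        # one row per position t

def embed(token_ids):
    """Input to the first transformer layer: word embedding plus positional embedding."""
    token_ids = np.asarray(token_ids)
    positions = np.arange(len(token_ids))
    return word_embedding[token_ids] + pos_embedding[positions]

# The same word (id 5) appears at positions 0 and 2.
inputs = embed([5, 7, 5])
```

Rows 0 and 2 of `inputs` share a word embedding but differ by their positional embeddings, so the layers above can tell the two occurrences apart.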
Transformer encoder and decoder
When a transformer is applied to POS tagging, we only need an architecture that gives us a probability distribution over the tags. This part is called the transformer encoder. Tasks such as machine translation need a sequence-to-sequence model, so in addition to the encoder the architecture also includes a transformer decoder. Both the encoder and the decoder contain self-attention and are nearly identical, except that the decoder uses a version of self-attention in which each word can only attend to the words before it.
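The decoder's restricted self-attention is usually implemented with a mask, which can be sketched as below. One assumption to flag: this mask lets position i attend to positions up to and including i, which is the common convention, whereas the text above only mentions the words before it. The function names and sizes are illustrative.

```python
import numpy as np

def softmax(r):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, W_q, W_k, W_v):
    """Self-attention in which position i cannot see any position j > i."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    n = X.shape[0]
    R = Q @ K.T / np.sqrt(Q.shape[-1])
    # True strictly above the diagonal, i.e. wherever j > i (a future word).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    # A score of -inf becomes a weight of exactly 0 after the softmax.
    R = np.where(future, -np.inf, R)
    A = softmax(R)
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 words, 8-dim inputs (assumed sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
C, A = causal_self_attention(X, W_q, W_k, W_v)

# Changing the last word leaves the encodings of all earlier words untouched.
X_changed = X.copy()
X_changed[-1] += 1.0
C_changed, _ = causal_self_attention(X_changed, W_q, W_k, W_v)
```

The encoder would run the same computation without the mask, so every word can attend to the whole sentence.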