Multi-headed Self-attention
In a translation model, the following sentence is an input we want to translate:
“The animal didn’t cross the street because it was too tired.”
As the model processes each word (each position in the input sequence), self-attention allows the model to look at other positions in the input sequence for clues that can help lead to a better encoding for this word. For example, self-attention allows it to associate “it” with “animal”.
Self-attention Steps
Query, Key, and Value Vectors
For each word, we create a Query vector, a Key vector, and a Value vector.
These vectors are created by multiplying the embedding (or the output of a lower encoder) by three weight matrices (WQ, WK, WV) that are learned during training.
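A minimal NumPy sketch of this step (the toy dimensions and variable names below are assumptions for illustration, not values from the original):

```python
import numpy as np

np.random.seed(0)
d_model, d_k = 8, 4                    # toy sizes, assumed for illustration

x = np.random.randn(d_model)           # embedding of one word (or output of a lower encoder)
W_Q = np.random.randn(d_model, d_k)    # weight matrices learned during training
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

q = x @ W_Q                            # Query vector for this word
k = x @ W_K                            # Key vector
v = x @ W_V                            # Value vector
print(q.shape, k.shape, v.shape)       # (4,) (4,) (4,)
```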
Scoring
The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring.
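A small sketch of the scoring step, assuming the query and key vectors from above (toy sizes):

```python
import numpy as np

np.random.seed(0)
d_k, seq_len = 4, 5                    # toy sizes
q = np.random.randn(d_k)               # query vector of the word we are encoding
K = np.random.randn(seq_len, d_k)      # key vectors of every word in the sentence

scores = K @ q                         # one dot-product score per word in the sentence
print(scores.shape)                    # (5,)
```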
Stable Gradients and Normalization
Divide the scores by the square root of the dimension of the key vectors; this leads to more stable gradients.
Softmax normalization of the scores so they’re all positive and add up to 1. The softmax score determines how much each word will be expressed at this position.
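A sketch of the scaling and softmax steps, continuing with the assumed toy scores from the previous step:

```python
import numpy as np

np.random.seed(0)
d_k, seq_len = 4, 5
scores = np.random.randn(seq_len)                # raw dot-product scores from the previous step

scaled = scores / np.sqrt(d_k)                   # divide by sqrt of the key dimension
weights = np.exp(scaled) / np.exp(scaled).sum()  # softmax: all positive, sums to 1
print(weights, weights.sum())                    # weights sum to 1.0
```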
Self-attention Weights
Multiply each value vector by the softmax score. This keeps intact the values of the word(s) we want to focus on and drowns out irrelevant words.
Sum up the weighted value vectors. This produces the output of the self-attention layer at a particular position.
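A sketch of the weighting and summing steps (the softmax scores and value vectors are illustrative placeholders):

```python
import numpy as np

np.random.seed(0)
seq_len, d_v = 5, 4
weights = np.array([0.7, 0.1, 0.1, 0.05, 0.05])  # softmax scores from the previous step (assumed)
V = np.random.randn(seq_len, d_v)                # value vector of every word

z = weights @ V    # weighted sum of value vectors = self-attention output at this position
print(z.shape)     # (4,)
```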
Matrix Formulation
Calculate the Query, Key, and Value matrices by packing our embeddings into a matrix X and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).
The rest of the steps can be condensed into one formula to calculate the outputs of the self-attention layer: softmax(QKᵀ / √d_k)·V.
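A condensed sketch of the matrix form under the same toy assumptions (all sizes and names are illustrative):

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    # Z = softmax(Q K^T / sqrt(d_k)) V, computed for every position at once
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

np.random.seed(0)
seq_len, d_model, d_k = 5, 8, 4                         # toy sizes
X = np.random.randn(seq_len, d_model)                   # one row per word embedding
W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                                          # (5, 4)
```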
Multi-headed
With multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices. This expands the model’s ability to focus on different positions and gives the attention layer multiple “representation subspaces”.
Here, we do the same self-attention calculation multiple times with different weight matrices, so we end up with different Z matrices. The feed-forward layer is expecting a single matrix (a vector for each word), so we concatenate the matrices and then multiply them by an additional weight matrix WO.
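A sketch of multi-headed attention under the same toy assumptions (two heads here; head count, sizes, and names are illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

np.random.seed(0)
num_heads, seq_len, d_model, d_k = 2, 5, 8, 4           # toy sizes
X = np.random.randn(seq_len, d_model)

Z_heads = []
for _ in range(num_heads):                              # one Q/K/V weight set per head
    W_Q, W_K, W_V = (np.random.randn(d_model, d_k) for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    Z_heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

W_O = np.random.randn(num_heads * d_k, d_model)         # additional trained weight matrix
Z = np.concatenate(Z_heads, axis=-1) @ W_O              # single matrix for the feed-forward layer
print(Z.shape)                                          # (5, 8)
```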