The Transformer model uses a “Scaled Dot-Product” attention mechanism. It also uses what is called “Multi-Head Attention”: instead of calculating just one set of attention scores for a given input, it calculates several, each using a different set of learned weights. This allows the model to attend to different “representation subspaces” at different positions, akin to using different filters to create different feature maps in a single layer of a CNN.
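To make the two ideas concrete, here is a minimal NumPy sketch of scaled dot-product attention and of multi-head self-attention built on top of it. It is an illustration under stated assumptions, not a reference implementation: the weight matrices are random rather than learned, there is no batching, masking, or bias, and the names `w_q`, `w_k`, `w_v`, `w_o` and `split_heads` are chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq_len, d_k). Scores are scaled by sqrt(d_k).
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (seq_len, d_model). Assumes d_model is divisible by num_heads.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    # Each head runs its own attention over its own slice of the projections.
    heads, weights = scaled_dot_product_attention(q, k, v)
    # Concatenate the heads back to (seq_len, d_model) and mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy input: random values stand in for token embeddings and learned weights.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, attn = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
```

Here `attn` holds one `seq_len × seq_len` matrix of attention weights per head, which is the "multiple attention scores" idea: each head sees the same input through a different projection.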
Reading through various attention-related papers gives an interesting perspective on how researchers have used attention mechanisms for various tasks and how the thinking has evolved. This is a quick overview of such a study, meant to give a sense of how we could tweak and use attention-based architectures for our own tasks.
The longer the input sequence (i.e. the sentence in NLP), the more difficult it is for the hidden vector in an RNN to capture the context. The more updates are made to the same vector, the higher the chance that earlier inputs and updates are lost. How could we solve this? Perhaps if we stop using just the last hidden state as a proxy for the entire sentence and instead build an architecture that consumes all hidden states, we won’t have to deal with the weakening context. Well, that is what “attention” mechanisms do.
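A small sketch can show the difference between using only the last hidden state and attending over all of them. The assumptions are loud here: the hidden states are random stand-ins for what an RNN encoder would produce, the `query` vector stands in for something like a decoder state, and plain dot-product scoring is just one of several scoring functions used in the literature. The function name `attend` is hypothetical.

```python
import numpy as np

def attend(hidden_states, query):
    # hidden_states: (seq_len, hidden_dim), one row per input position.
    # query: (hidden_dim,), what we are currently looking for.
    scores = hidden_states @ query            # one score per position
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    # Context vector: weighted sum over ALL hidden states.
    context = weights @ hidden_states
    return context, weights

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(6, 4))  # stand-in for RNN encoder outputs
query = rng.normal(size=4)               # stand-in for a decoder state

# Without attention: the whole sentence is squeezed into the last state.
last_state_only = hidden_states[-1]

# With attention: every position contributes, weighted by relevance.
context, weights = attend(hidden_states, query)
```

Because `weights` sums to 1 across positions, early inputs can still contribute strongly to `context` no matter how long the sequence is, which is the point of consuming all hidden states.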