From Attention to Self Attention to Transformers

The Transformer model uses a “Scaled Dot Product” attention mechanism. It also uses what is called “Multi-Head Attention”: instead of calculating just one attention score for a given input, multiple attention scores are calculated, each using a different set of weights. This allows the model to attend to different “representation subspaces” at different positions, akin to using different filters to create different feature maps in a single layer of a CNN.
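The two ideas above can be sketched in NumPy: scaled dot-product attention computes softmax(QKᵀ/√d_k)V, and multi-head attention runs that computation once per head with separate (here, randomly initialized and purely illustrative) projection weights, then concatenates the results. This is a minimal sketch, not a full Transformer layer (it omits the final output projection, masking, and batching).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V

# Multi-head attention: project the input once per head, attend,
# then concatenate the per-head outputs.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads  # per-head dimension

x = rng.standard_normal((seq_len, d_model))
# One Q/K/V projection matrix per head (illustrative random weights).
W_q = rng.standard_normal((n_heads, d_model, d_k))
W_k = rng.standard_normal((n_heads, d_model, d_k))
W_v = rng.standard_normal((n_heads, d_model, d_k))

heads = [scaled_dot_product_attention(x @ W_q[h], x @ W_k[h], x @ W_v[h])
         for h in range(n_heads)]
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (seq_len, d_model) = (4, 8)
```

Because every head has its own projections, each one can learn to focus on a different relationship between positions, which is the "representation subspaces" intuition described above.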