From Attention to Self-Attention to Transformers

The Transformer model uses a “Scaled Dot-Product” attention mechanism. The Transformer also uses what is called “Multi-Head Attention”: instead of calculating just one attention output for a given input, multiple attention outputs are calculated in parallel, each using a different set of weights. This allows the model to attend to different “representation subspaces” at different positions, akin to using different filters to create different feature maps in a single layer of a CNN.
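The two mechanisms above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names, shapes, and the per-head weight layout are my own choices, and batching, masking, and bias terms are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Scores are divided by sqrt(d_k) so the softmax does not saturate
    # when the key dimension is large.
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    # Each head projects X with its own weights, i.e. attends within its
    # own "representation subspace"; the head outputs are concatenated
    # and mixed back to d_model by the output projection Wo.
    heads = [scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)[0]
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Toy shapes: 4 tokens, model dim 16, 2 heads of dim 8 each.
rng = np.random.default_rng(0)
T, d_model, n_heads, d_head = 4, 16, 2, 8
X = rng.normal(size=(T, d_model))
Wq = rng.normal(size=(n_heads, d_model, d_head))
Wk = rng.normal(size=(n_heads, d_model, d_head))
Wv = rng.normal(size=(n_heads, d_model, d_head))
Wo = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
```

Note that each row of the attention-weight matrix is a probability distribution over the input positions, which is exactly the “which positions should I look at” interpretation of attention.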

Different types of Attention in Neural Networks

Reading through various attention-related papers gives an interesting perspective on how researchers have used attention mechanisms for different tasks and how the thinking has evolved. This is a quick overview of such a study, meant to give a sense of how we could tweak and use attention-based architectures for our own tasks.

An introduction to Attention – the why and the what

The longer the input sequence (i.e. the sentence length, in NLP), the more difficult it is for the hidden vector in an RNN to capture the context. The more updates are made to the same vector, the higher the chance that earlier inputs and updates are lost. How could we solve this? Perhaps if we stop using just the last hidden state as a proxy for the entire sentence, and instead build an architecture that consumes all hidden states, we won’t have to deal with the weakening context. That is exactly what “attention” mechanisms do.
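The idea of consuming all hidden states can be sketched as follows. This is a hedged, minimal version using simple dot-product scoring; real models often use learned (e.g. additive) scoring functions, and the names `attend` and `query` here are my own.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(hidden_states, query):
    # hidden_states: (T, d), one vector per input step.
    # query: (d,), e.g. the current decoder state.
    scores = hidden_states @ query            # alignment score per step
    weights = softmax(scores)                 # how much each step contributes
    context = weights @ hidden_states         # weighted sum of ALL states
    return context, weights

# Toy example: 3 encoder states; the query points strongly at dimension 0.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
q = np.array([10.0, 0.0])
ctx, w = attend(H, q)
```

Instead of a single fixed-size summary, the context vector is rebuilt for every query, so no input step is ever “forgotten”; it is merely down-weighted.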

An overview of Generative Adversarial Networks

A GAN is a network in which two models, a generative model G and a discriminative model D, are trained simultaneously. The generative model is trained to produce new bedroom images by capturing the data distribution of the training dataset. The discriminative model is trained to correctly classify a given input image as real (i.e. coming from the training dataset) or fake (i.e. a synthetic image produced by the generative model). Simply put, the discriminative model is a typical CNN image classifier, or, more specifically, a binary image classifier.
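Since D is just a binary classifier, both training objectives reduce to binary cross-entropy. The sketch below shows the losses only (no networks, no optimizer); the function names are my own, and the generator loss shown is the common non-saturating variant rather than the original minimax form.

```python
import numpy as np

def bce(p, y):
    # Binary cross-entropy: p = D(x) in (0, 1), y = 1 for real, 0 for fake.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def d_loss(d_real, d_fake):
    # The discriminator wants D(real) -> 1 and D(fake) -> 0.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def g_loss(d_fake):
    # Non-saturating generator loss: push D(fake) -> 1, i.e. fool D.
    return bce(d_fake, np.ones_like(d_fake))
```

In each training step, D is updated to decrease `d_loss` on a batch of real and generated images, then G is updated to decrease `g_loss`; the two pull against each other, which is the “adversarial” part.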

Understanding Backpropagation – detailed review of the backprop function

If we were to write a standalone backprop function for a layer, it would take the derivative of the loss with respect to the output activation as input and would have to calculate two values from it. First, the derivative of the loss with respect to the weights; this is used in the gradient descent step to update the weights. Second, the derivative of the loss with respect to the input activation; this must be returned so that backpropagation can continue, since the input activation of this layer is nothing but the output activation of the previous layer.
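For a concrete case, here is that two-output backprop function for a linear layer `out = x @ W`. This is a minimal sketch with my own names and a (batch, features) weight layout; bias terms and activation functions are left out.

```python
import numpy as np

def linear_backward(d_out, x, W):
    # d_out: dL/d(output), shape (batch, out_features)
    # x:     input activation, shape (batch, in_features)
    # W:     weights, shape (in_features, out_features)
    dW = x.T @ d_out   # dL/dW: used by gradient descent to update W
    dx = d_out @ W.T   # dL/dx: returned to the previous layer, whose
                       # output activation is exactly this x
    return dW, dx

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 3))
W = rng.normal(size=(3, 4))
d_out = np.ones((2, 4))   # pretend the loss is simply (x @ W).sum()
dW, dx = linear_backward(d_out, x, W)
```

A quick finite-difference check on one weight confirms the analytic gradient, which is the standard way to sanity-test a hand-written backward pass.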