Image classification has many uses, but most real-world applications require more than classifying images that contain a single object. We need the ability to detect multiple objects in a given image.
Q-learning is one of the most popular reinforcement learning algorithms, and it is far easier to learn by implementing it on toy problems than by wading through piles of papers and articles.
Entropy represents how much “information content” is present in an outcome, however it is communicated to us. It is a quantitative measure of the information content or, in other words, the uncertainty associated with an event.
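To make the link between uncertainty and information concrete, here is a minimal sketch. It assumes the standard Shannon definition, H = -sum(p * log2(p)), measured in bits; the function name and the example distributions are illustrative only.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    # Outcomes with zero probability contribute nothing, so skip them
    # (this also avoids evaluating log2(0)).
    return sum(-p * math.log2(p) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])    # maximum uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])  # more predictable, so lower entropy
certain = entropy([1.0, 0.0])      # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The fair coin comes out at exactly one bit, the biased coin at less than one bit, and the certain outcome at zero: the more predictable the event, the less information its outcome carries.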
The Transformer model uses a “Scaled Dot Product” attention mechanism. It also uses what is called “Multi-Head Attention”: instead of calculating just one attention score for a given input, multiple attention scores are calculated, each using a different set of weights. This allows the model to attend to different “representation sub-spaces” at different positions, akin to using different filters to create different feature maps in a single layer of a CNN.
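The two ideas can be sketched in a few lines of NumPy. This is a simplified illustration, not a full Transformer layer: there is no masking, dropout, or bias, and the weight shapes, random initialisation, and function names are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v, applied over the last two axes."""
    d_k = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Self-attention with one set of projection weights per head.

    x: (seq, d_model); w_q, w_k, w_v: (heads, d_model, d_head);
    w_o: (heads * d_head, d_model).
    """
    # Broadcasting projects x separately for every head: (heads, seq, d_head).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    out, weights = scaled_dot_product_attention(q, k, v)
    # Concatenate the heads side by side, then mix them with w_o.
    concat = np.concatenate(list(out), axis=-1)  # (seq, heads * d_head)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 4, 8, 2, 4
x = rng.normal(size=(seq, d_model))
w_q, w_k, w_v = (rng.normal(size=(heads, d_model, d_head)) for _ in range(3))
w_o = rng.normal(size=(heads * d_head, d_model))

y, attn = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(y.shape, attn.shape)
```

Each head gets its own attention matrix over the same input, which is what lets different heads focus on different positions.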
Reading through various attention-related papers gives an interesting perspective on how researchers have used attention mechanisms for different tasks and how the thinking has evolved. This is a quick overview of such a study, meant to give a sense of how we could tweak and use attention-based architectures for our own tasks.
The longer the input sequence (i.e. the sentence in NLP), the more difficult it is for the hidden vector in an RNN to capture the context. The more updates are made to the same vector, the higher the chance that earlier inputs and updates are lost. How could we solve this? Perhaps if we stop using just the last hidden state as a proxy for the entire sentence and instead build an architecture that consumes all hidden states, then we won’t have to deal with the weakening context. Well, that is what “attention” mechanisms do.
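As a rough sketch of “consuming all hidden states”, the snippet below builds a context vector as a weighted average of every hidden state rather than taking only the last one. The dot-product scoring against a query vector is an assumption made for illustration (other scoring functions exist), and the names and sizes are hypothetical.

```python
import numpy as np

def attention_context(hidden_states, query):
    """Weighted sum of all hidden states, weighted by similarity to `query`.

    hidden_states: (T, hidden) array, one row per time step.
    query: (hidden,) vector, e.g. the current decoder state.
    """
    scores = hidden_states @ query        # one score per time step, shape (T,)
    scores = scores - scores.max()        # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over time steps
    context = weights @ hidden_states     # (hidden,)
    return context, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(6, 5))  # 6 time steps, hidden size 5
query = rng.normal(size=5)

context, weights = attention_context(states, query)
print(weights.round(3), context.shape)
```

Because every time step contributes directly to the context vector, early inputs no longer have to survive a long chain of updates to a single hidden vector.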
Recurrent Neural Networks (RNNs) add an interesting twist to basic neural networks. A vanilla neural network takes a fixed-size vector as input, which limits its use in situations that involve a ‘series’ type input with no predetermined size. RNNs, by contrast, are designed to take a series of inputs with no predetermined limit on size.
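A minimal sketch of why sequence length does not matter to an RNN: the same weights are applied at every step, and only the hidden state is carried forward. The tanh cell, the weight names, and the sizes here are assumptions for illustration.

```python
import numpy as np

def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a simple tanh RNN over a sequence and return the final hidden state.

    inputs: (T, input_size) for any T; w_xh: (input_size, hidden);
    w_hh: (hidden, hidden); b_h: (hidden,).
    """
    h = np.zeros(w_hh.shape[0])
    for x_t in inputs:
        # The same weights are reused at every time step.
        h = np.tanh(x_t @ w_xh + h @ w_hh + b_h)
    return h

rng = np.random.default_rng(0)
input_size, hidden = 3, 4
w_xh = rng.normal(size=(input_size, hidden))
w_hh = rng.normal(size=(hidden, hidden))
b_h = np.zeros(hidden)

h_short = rnn_forward(rng.normal(size=(2, input_size)), w_xh, w_hh, b_h)
h_long = rnn_forward(rng.normal(size=(10, input_size)), w_xh, w_hh, b_h)
print(h_short.shape, h_long.shape)
```

Both the 2-step and the 10-step sequence produce a hidden state of the same size, using the same parameters.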
A GAN is a network in which two models, a generative model G and a discriminative model D, are trained simultaneously. The generative model is trained to produce new bedroom images by capturing the data distribution of the training dataset. The discriminative model is trained to correctly classify a given input image as real (i.e. coming from the training dataset) or fake (i.e. a synthetic image produced by the generative model). Simply put, the discriminative model is a typical CNN image classifier or, more specifically, a binary image classifier.
If we were to write a standalone backprop function, it would take the derivative of the loss with respect to the output activation as input and would have to calculate two values from it. The first is the derivative of the loss with respect to the weights, which is used in the gradient descent step to update the weights. The second is the derivative of the loss with respect to the input activation. This has to be returned so that backpropagation can continue, as the input activation for this layer is nothing but the output activation of the previous layer.
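Such a function might look like the sketch below. It assumes the simplest possible layer, output = input @ weights, with no bias and no nonlinearity; the function name, the shapes, and the toy loss are illustrative choices only.

```python
import numpy as np

def linear_backward(grad_output, layer_input, weights):
    """Backward pass for a layer whose forward pass is layer_input @ weights.

    grad_output: dLoss/dOutput, shape (batch, out)
    layer_input: shape (batch, in); weights: shape (in, out)
    """
    # 1) Gradient with respect to the weights, used to update this layer.
    grad_weights = layer_input.T @ grad_output   # (in, out)
    # 2) Gradient with respect to the input, handed back to the previous layer.
    grad_input = grad_output @ weights.T         # (batch, in)
    return grad_weights, grad_input

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
w = rng.normal(size=(3, 2))

out = x @ w
grad_out = out  # dLoss/dOutput for the toy loss 0.5 * sum(out ** 2)
grad_w, grad_x = linear_backward(grad_out, x, w)

w_updated = w - 0.1 * grad_w  # one gradient descent step
print(grad_w.shape, grad_x.shape)
```

In a multi-layer network, grad_x would be passed as grad_output to the backward function of the previous layer.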