The longer the input sequence length (i.e. sentence length in NLP) the more difficult it is for the hidden vector in RNNs to capture the context. The more updates are made to the same vector, the higher the chances are the earlier inputs and updates are lost. How could we solve this? Perhaps if we get rid of using just the last hidden state as a proxy for the entire sentence and instead build an architecture that consumes all hidden states, then we won’t have to deal with the weakening context. Well, that is what “attention” mechanisms do.