This article is a digest/note of Denny Britz's blog post: Attention and Memory in Deep Learning and NLP.
In a Neural Machine Translation system, we map the meaning of a sentence into a fixed-length vector representation and then generate a translation based on it.
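A minimal numpy sketch of what "fixed-length vector" means in practice (the parameter names, sizes, and random weights here are purely illustrative, not from the original post): the encoder's final hidden state is the only thing the decoder ever sees, no matter how long the source sentence is.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, embed_size, vocab_size, src_len = 16, 8, 50, 6

# Toy, untrained parameters for an Elman-style RNN encoder/decoder.
W_xh = rng.normal(scale=0.1, size=(embed_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_size))

def encode(src_ids):
    """Run the encoder and keep ONLY the final hidden state:
    the whole source sentence is squeezed into one fixed-length vector."""
    h = np.zeros(hidden_size)
    for tok in src_ids:
        h = np.tanh(embeddings[tok] @ W_xh + h @ W_hh)
    return h  # shape (hidden_size,), regardless of sentence length

def decode(context, max_len=5):
    """Greedy decoding conditioned only on that single context vector."""
    h, out = context, []
    for _ in range(max_len):
        tok = int(np.argmax(h @ W_hy))
        out.append(tok)
        h = np.tanh(embeddings[tok] @ W_xh + h @ W_hh)
    return out

src = rng.integers(0, vocab_size, size=src_len)
print(decode(encode(src)))
```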
The first word of the English translation is probably highly correlated with the first word of the source sentence. Researchers have found that reversing the source sequence (feeding it backwards into the encoder) produces significantly better results, because it shortens the path from the decoder to the relevant parts of the encoder. Similarly, feeding an input twice also seems to help a network memorize things better. But these hacks are not always useful.
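As a toy illustration of the two hacks just mentioned (the example sentence is made up):

```python
# Source-reversal hack: feed the source tokens backwards, so early target
# words end up closer to the encoder states they depend on most.
src = ["the", "cat", "sat"]
reversed_src = src[::-1]   # ["sat", "cat", "the"]

# Feeding the input twice, another occasional trick for better memorization.
doubled_src = src + src    # ["the", "cat", "sat", "the", "cat", "sat"]
```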
With an attention mechanism, we no longer try to encode the full source sentence into a fixed-length vector. Rather, we allow the decoder to "attend" to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far. The key point is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state.
The weight values a are the "attention" the decoder pays to the input. The a's are typically normalized to sum to 1 (so they form a distribution over the input states).
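A minimal numpy sketch of that weighted combination, using dot-product scoring for brevity (the original formulation scores each input state with a small learned network, but the normalization and the weighted sum work the same way; all values here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, src_len = 16, 6

# Stand-ins: one encoder hidden state per source word,
# plus the decoder's current hidden state at output step t.
encoder_states = rng.normal(size=(src_len, hidden_size))
decoder_state = rng.normal(size=hidden_size)

# Score each input state against the decoder state.
scores = encoder_states @ decoder_state        # shape (src_len,)

# Softmax-normalize so the weights a form a distribution over input states.
a = np.exp(scores - scores.max())
a /= a.sum()
assert np.isclose(a.sum(), 1.0)

# Context for output word y_t: a weighted combination of ALL input states,
# not just the last one.
context = a @ encoder_states                   # shape (hidden_size,)
print(np.round(a, 3), context.shape)
```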
A big advantage of attention is that it gives us the ability to interpret and visualize what the model is doing.
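For example, plotting the attention weights as a heatmap shows which source words the decoder looked at while emitting each translated word (the sentence pair and weight values below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical attention weights: rows = generated target words,
# columns = source words; each row sums to 1.
src_words = ["the", "cat", "sat", "down"]
tgt_words = ["le", "chat", "s'est", "assis"]
attn = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.10, 0.70, 0.15],
    [0.05, 0.05, 0.20, 0.70],
])

fig, ax = plt.subplots()
ax.imshow(attn, cmap="gray_r")
ax.set_xticks(range(len(src_words)))
ax.set_xticklabels(src_words)
ax.set_yticks(range(len(tgt_words)))
ax.set_yticklabels(tgt_words)
ax.set_xlabel("source")
ax.set_ylabel("translation")
plt.show()
```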
An alternative approach to attention is to use Reinforcement Learning to predict an approximate location to focus on.
The basic problem that the attention mechanism solves is that it allows the network to refer back to the input sequence, instead of forcing it to encode all the information into a fixed-length vector. Interpreted another way, the attention mechanism simply gives the network access to its internal memory, which is the hidden states of the encoder.