Advantage of RNN over fixed window
In a simple task like part-of-speech (POS) tagging, a fixed-size window such as an n-gram offers enough information. A more complex task such as question answering or reference resolution, however, may require context covering many more words. For example, in the sentence "Jack told me that his passport did not arrive in time so he had to cancel the flight," knowing that "he" refers to Jack requires context that spans from the first word to the last.
Hence, we want to create a language model with sufficient context. A language model is a probability distribution over sequences of words; it allows us to predict the next word in a text given all the previous words. Both the n-gram model and the feedforward network use a fixed window of n words, which restricts the available context. Moreover, a feedforward network has the problem of asymmetry: a word such as "he" learned at one position in the window has to be relearned when it appears at another position.
Recurrent Neural Network
A recurrent neural network (RNN) is designed to process time-series data, which makes it a natural fit for language. In an RNN model, each input word is encoded as a word embedding vector x_t. The hidden layer z_t passes information from one time step to the next. The model then performs multi-class classification, where each class represents a word in the vocabulary. The output y_t is a softmax probability distribution over the possible values of the next word in the sentence. This RNN architecture also solves the asymmetry problem because the weights are the same for every word position.
The equations defining the model refer to the values of the variables indexed by time step t:

z_t = g(W_zz z_{t-1} + W_xz x_t)
y_t = softmax(W_zy z_t)

where g is an activation function such as tanh, and the weight matrices W_zz, W_xz, and W_zy are shared across all time steps.
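To make the recurrence concrete, here is a minimal NumPy sketch of these equations; the dimensions, weight names, and random initialization are illustrative choices, not taken from any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_emb, d_hid, vocab = 32, 64, 1000           # illustrative sizes
W_zz = rng.normal(0, 0.1, (d_hid, d_hid))    # hidden-to-hidden weights
W_xz = rng.normal(0, 0.1, (d_hid, d_emb))    # input-to-hidden weights
W_zy = rng.normal(0, 0.1, (vocab, d_hid))    # hidden-to-output weights

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def rnn_step(z_prev, x_t):
    """One time step: update the hidden state and predict the next word."""
    z_t = np.tanh(W_zz @ z_prev + W_xz @ x_t)   # z_t = g(W_zz z_{t-1} + W_xz x_t)
    y_t = softmax(W_zy @ z_t)                   # distribution over the vocabulary
    return z_t, y_t

# Run the recurrence over a sequence of (random stand-in) word embeddings.
xs = [rng.normal(size=d_emb) for _ in range(5)]
z = np.zeros(d_hid)
for x in xs:
    z, y = rnn_step(z, x)   # the same weights are applied at every position
```

The loop makes the symmetry explicit: the same three weight matrices are reused at every time step, so nothing about a word has to be relearned for a new position.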
In a language model, we want to predict the next word given the previous words. For classification, however, there is no reason to limit ourselves to looking only at the previous words. To capture the context on the right, we can use a bidirectional RNN, which concatenates a separate right-to-left model onto the left-to-right model.
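A rough NumPy sketch of the bidirectional idea, with illustrative sizes and two independent parameter sets, one per direction:

```python
import numpy as np

rng = np.random.default_rng(1)
d_emb, d_hid = 32, 64                         # illustrative sizes

def run_rnn(xs, W_zz, W_xz):
    """Run a simple tanh RNN over a list of embeddings, returning all hidden states."""
    z = np.zeros(d_hid)
    states = []
    for x in xs:
        z = np.tanh(W_zz @ z + W_xz @ x)
        states.append(z)
    return states

# Separate parameters for the left-to-right and right-to-left models.
Wf_zz, Wf_xz = rng.normal(0, 0.1, (d_hid, d_hid)), rng.normal(0, 0.1, (d_hid, d_emb))
Wb_zz, Wb_xz = rng.normal(0, 0.1, (d_hid, d_hid)), rng.normal(0, 0.1, (d_hid, d_emb))

xs = [rng.normal(size=d_emb) for _ in range(6)]        # a 6-word "sentence"
forward  = run_rnn(xs, Wf_zz, Wf_xz)                   # left-to-right pass
backward = run_rnn(xs[::-1], Wb_zz, Wb_xz)[::-1]       # right-to-left pass, re-aligned

# Each word's representation now sees context on both sides.
bi_states = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
assert bi_states[0].shape == (2 * d_hid,)
```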
Beyond word-level classification tasks such as POS tagging, an RNN can also be used for sentence-level or document-level classification. For example, in sentiment analysis it can classify a sentence as having positive or negative sentiment. Instead of making a classification for each word, we need to obtain an aggregate whole-sentence representation y from the per-word hidden states of the RNN. The simplest method is to use the RNN hidden state corresponding to the last word of the input; however, this tends to overweight the end of the sentence. Another method is to pool all of the hidden vectors, for example with average pooling:

y = (1/s) (z_1 + z_2 + ... + z_s)

where s is the number of words in the input. The pooled d-dimensional vector can then be fed into a feedforward layer before the output layer.
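A minimal sketch of average pooling followed by a feedforward layer and an output layer, assuming the per-word hidden vectors have already been computed; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d_hid, d_ff, n_classes = 64, 32, 2            # 2 classes = positive / negative sentiment

# Stand-ins for the per-word hidden vectors z_1..z_s produced by the RNN.
s = 10
zs = [rng.normal(size=d_hid) for _ in range(s)]

# Average pooling: y = (1/s) * sum_t z_t
pooled = np.mean(zs, axis=0)

# Feedforward layer, then an output layer over the sentiment classes.
W1, b1 = rng.normal(0, 0.1, (d_ff, d_hid)), np.zeros(d_ff)
W2, b2 = rng.normal(0, 0.1, (n_classes, d_ff)), np.zeros(n_classes)

hidden = np.tanh(W1 @ pooled + b1)
logits = W2 @ hidden + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over {positive, negative}
```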
LSTMs for NLP
Over many RNN time steps, information from far back in the sequence tends to get lost, while recent information is preserved better. This is related to the vanishing gradient problem. Hence, we introduce the Long Short-Term Memory (LSTM) model, which remembers the important parts of the input and forgets the unimportant parts. For example, in the unfinished sentence
The athletes, who all won their local qualifiers and advanced to the finals in Tokyo, now ...
we would expect it to pick "compete" over "competes" because "athletes" is plural. An LSTM can learn to create a latent feature recording that the subject is plural and copy that feature forward without alteration until it is needed to make a choice like this. A regular RNN or an n-gram model often gets confused in long sentences with many intervening words between the subject and the verb.
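For concreteness, here is a sketch of a single LSTM time step with the usual forget, input, and output gates (biases omitted for brevity; weight names and sizes are illustrative). The memory cell c_t is what allows a feature such as "the subject is plural" to be carried forward unchanged across many steps.

```python
import numpy as np

rng = np.random.default_rng(3)
d_emb, d_hid = 32, 64                          # illustrative sizes

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# One weight matrix per gate, acting on the concatenation [z_{t-1}; x_t].
W_f, W_i, W_o, W_c = (rng.normal(0, 0.1, (d_hid, d_hid + d_emb)) for _ in range(4))

def lstm_step(c_prev, z_prev, x_t):
    """One LSTM time step: the gates decide what to forget, write, and expose."""
    u = np.concatenate([z_prev, x_t])
    f = sigmoid(W_f @ u)                # forget gate: keep or drop parts of the memory
    i = sigmoid(W_i @ u)                # input gate: which new information to write
    o = sigmoid(W_o @ u)                # output gate: what to expose in the hidden state
    c_tilde = np.tanh(W_c @ u)          # candidate memory content
    c_t = f * c_prev + i * c_tilde      # memory cell: can copy a feature forward
                                        # unchanged when f is near 1 and i is near 0
    z_t = o * np.tanh(c_t)              # hidden state passed to the next step
    return c_t, z_t

c, z = np.zeros(d_hid), np.zeros(d_hid)
for x in (rng.normal(size=d_emb) for _ in range(12)):
    c, z = lstm_step(c, z, x)
```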