#### What is a sequence-to-sequence model?

One application of Natural Language Processing (NLP) is generating one sentence from another sentence, rather than generating one word at a time. Machine translation (MT) is a typical example: we want to translate a sentence from a source language into a target language. In this case, each target word is conditioned on the entire source sentence and on all previously generated target words, because word order differs between languages.

The straightforward method is to use two Recurrent Neural Networks (RNNs). A source RNN goes through the source sentence and produces a final hidden state. This hidden state is then used as the starting input for the target RNN. This neural network architecture is called a basic sequence-to-sequence model.
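The two-RNN setup can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the weights are random, the `tanh` cell, the sizes `d` and `emb`, and the helper names `make_rnn`, `rnn_step`, `encode`, and `decode` are all assumptions made for the example, and the final source state is assumed to be passed in as the target RNN's initial hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d, emb = 8, 4  # hidden size and word-embedding size (illustrative values)

def make_rnn(in_dim, hid_dim):
    """Create random weights for a simple tanh RNN cell (hypothetical helper)."""
    return {
        "W_x": rng.normal(scale=0.1, size=(hid_dim, in_dim)),
        "W_h": rng.normal(scale=0.1, size=(hid_dim, hid_dim)),
        "b": np.zeros(hid_dim),
    }

def rnn_step(p, h_prev, x):
    """One time step: combine the previous hidden state with the current input."""
    return np.tanh(p["W_x"] @ x + p["W_h"] @ h_prev + p["b"])

def encode(p, source):
    """Run the source RNN over the whole sentence and return its final hidden state."""
    h = np.zeros(d)
    for x in source:
        h = rnn_step(p, h, x)
    return h

def decode(p, h0, target_inputs):
    """Run the target RNN, starting from the source RNN's final hidden state."""
    h = h0
    states = []
    for x in target_inputs:
        h = rnn_step(p, h, x)
        states.append(h)
    return np.stack(states)

source_rnn = make_rnn(emb, d)
target_rnn = make_rnn(emb, d)

source = rng.normal(size=(5, emb))         # 5 source word vectors (random stand-ins)
target_inputs = rng.normal(size=(3, emb))  # 3 target word vectors (random stand-ins)

final_state = encode(source_rnn, source)
target_states = decode(target_rnn, final_state, target_inputs)
print(final_state.shape, target_states.shape)
```

Everything the target RNN knows about the source sentence has to pass through the single vector `final_state`, which is where the limitations discussed next come from.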

This basic method has some limitations:

Nearby context bias: As mentioned in the previous post, a basic RNN loses information as the number of time steps grows. The final hidden state reflects the last words far more than the first ones. This problem persists even with a Long Short-Term Memory (LSTM) architecture.

Fixed context size limit: The hidden state produced by the source RNN has a fixed dimension, which limits the amount of information it can store. Increasing the size of the hidden state vector can lead to slow training and overfitting.

Slower sequential processing: Neural networks (NNs) gain efficiency when trained in batches on modern GPUs. An RNN, on the other hand, is constrained to process one word at a time.

#### Attention

Average pooling (mentioned in NLP 2) would mitigate the shortcomings of nearby context bias and the fixed context size limit. However, it also comes with a huge increase in the number of weights.

The good news is that not every word in the source sentence is equally important; that is, not every hidden state of the source RNN has the same importance. We can learn to produce a context vector $c_i$ that contains the most relevant information for generating the next target word in the target RNN. This concept is called attention, and a sequence-to-sequence model that uses it is called an attentional sequence-to-sequence model.

If a standard target RNN is written as:

$$h_i = \mathrm{RNN}(h_{i-1}, x_i)$$

the target RNN for an attentional sequence-to-sequence model can be written as:

$$h_i = \mathrm{RNN}(h_{i-1}, [x_i; c_i])$$

where $[x_i; c_i]$ is the concatenation of the input vector $x_i$ and the context vector $c_i$, defined as:

$$r_{ij} = h_{i-1} \cdot s_j$$

$$a_{ij} = \frac{e^{r_{ij}}}{\sum_k e^{r_{ik}}}$$

$$c_i = \sum_j a_{ij} \, s_j$$

where $h_{i-1}$ is the target RNN vector that is going to be used to predict the word at time step $i$, and $s_j$ is the output vector of the source RNN for source word $j$. Both $h_{i-1}$ and $s_j$ are $d$-dimensional vectors, where $d$ is the hidden size. The value $r_{ij}$ is the "attention score" between the current target state $i$ and the source word $j$. These scores are normalised into probabilities $a_{ij}$ using a softmax over all source words. The probabilities are then used to form a weighted average of the source RNN vectors, $c_i$, which is another $d$-dimensional vector.
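The three steps in the paragraph above (score, softmax, weighted average) can be sketched in NumPy as follows. This is a minimal example under stated assumptions: the attention score is taken to be the dot product of the target state with each source vector, the inputs are random stand-ins, and the function name `attention_context` is made up for the illustration.

```python
import numpy as np

def attention_context(h_prev, S):
    """Compute attention weights and the context vector for one target step.

    h_prev: target RNN vector h_{i-1}, shape (d,)
    S:      source RNN vectors s_j stacked row-wise, shape (n, d)
    """
    r = S @ h_prev                    # attention scores r_ij (assumed dot product)
    r = r - r.max()                   # shift for numerical stability; softmax is unchanged
    a = np.exp(r) / np.exp(r).sum()   # softmax over all source words gives a_ij
    c = a @ S                         # weighted average of the source vectors gives c_i
    return a, c

rng = np.random.default_rng(0)
S = rng.normal(size=(6, 4))    # 6 source words, hidden size d = 4 (illustrative)
h_prev = rng.normal(size=4)

a, c = attention_context(h_prev, S)
print(a.round(3), a.sum())     # weights are positive and sum to 1
print(c.shape)                 # the context vector is again d-dimensional
```

Note that the context vector has the same size no matter how long the source sentence is, while still drawing on every source position.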

The probabilistic softmax formulation for attention serves three purposes. First, it makes attention differentiable. Second, it allows the model to capture certain types of long-distance contextualisation. Third, it allows the network to represent uncertainty.

#### Decoding

At training time, a sequence-to-sequence model attempts to maximise the probability of each word in the target training sentence, conditioned on the source and all of the previous target words. Once training is complete, we generate the target sentence one word at a time, feeding each generated word back in as input at the next time step. This procedure is called decoding.
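The training objective can be written out as a short formula. The notation is assumed here rather than taken from the post: $x$ stands for the source sentence and $y_1, \dots, y_n$ for the words of the target training sentence.

```latex
\log P(y_1, \dots, y_n \mid x) = \sum_{i=1}^{n} \log P(y_i \mid y_1, \dots, y_{i-1}, x)
```

Maximising each term on the right maximises the probability of each target word given the source and the preceding target words, and by the chain rule the sum is the log-probability of the whole target sentence.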

If we choose the word with the highest probability at each time step and then feed this word as input to the next time step, the approach is called greedy decoding. This method cannot guarantee that the probability of the entire target sequence is maximised.
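A small made-up example shows how greedy decoding can miss the most probable sequence. The table `TOY_MODEL` and the helpers `greedy_decode` and `exhaustive_decode` are hypothetical; a real model would compute the next-word distribution with the target RNN instead of looking it up.

```python
# Hypothetical toy model: P(next word | previous words) for sequences of length 2.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"X": 0.55, "Y": 0.45},
    ("B",): {"X": 0.9, "Y": 0.1},
}

def greedy_decode(model, length):
    """Pick the single most probable word at every step."""
    prefix, prob = (), 1.0
    for _ in range(length):
        dist = model[prefix]
        word = max(dist, key=dist.get)
        prob *= dist[word]
        prefix += (word,)
    return prefix, prob

def exhaustive_decode(model, length, prefix=(), prob=1.0):
    """Score every complete sequence and return the best one (only feasible for toy sizes)."""
    if len(prefix) == length:
        return prefix, prob
    candidates = [
        exhaustive_decode(model, length, prefix + (word,), prob * p)
        for word, p in model[prefix].items()
    ]
    return max(candidates, key=lambda c: c[1])

greedy_seq, greedy_prob = greedy_decode(TOY_MODEL, 2)
best_seq, best_prob = exhaustive_decode(TOY_MODEL, 2)
print(greedy_seq, round(greedy_prob, 2))  # ('A', 'X') 0.33
print(best_seq, round(best_prob, 2))      # ('B', 'X') 0.36
```

Greedy decoding commits to "A" because it is the best first word, but the best full sequence starts with the less likely "B".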

A better approach is to search for an optimal decoding with a search algorithm such as beam search. In the context of MT decoding, beam search typically keeps the top $k$ hypotheses at each stage, extends each by one word using the top $k$ choices of words, and then chooses the best $k$ of the resulting $k^2$ new hypotheses. The output is the highest-scoring hypothesis. Current state-of-the-art neural MT models use a beam size of 4 to 8.
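The procedure described above can be sketched as follows. This is a simplified illustration, not a production decoder: the function `toy_next_word_probs` is a made-up stand-in for the target RNN, the end-of-sentence token `</s>` is an assumption, and scores are summed log-probabilities without the length normalisation that real systems often add.

```python
import math

def toy_next_word_probs(prefix):
    """Hypothetical model: P(next word | prefix). Unknown prefixes end the sentence."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.55, "dog": 0.45},
        ("a",): {"cat": 0.9, "dog": 0.1},
    }
    return table.get(prefix, {"</s>": 1.0})

def beam_search(next_word_probs, k, max_len, eos="</s>"):
    """Keep the k best hypotheses, extend each with its top k words, keep the best k."""
    beams = [(0.0, ())]  # each hypothesis is (log-probability, words so far)
    for _ in range(max_len):
        candidates = []
        for logp, words in beams:
            if words and words[-1] == eos:
                candidates.append((logp, words))  # finished hypotheses are carried over
                continue
            dist = next_word_probs(words)
            top_k = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
            for word, p in top_k:
                # Assumes p > 0; a real decoder would work with log-probabilities directly.
                candidates.append((logp + math.log(p), words + (word,)))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(words[-1] == eos for _, words in beams):
            break
    return beams[0]

logp, words = beam_search(toy_next_word_probs, k=2, max_len=5)
print(words, round(math.exp(logp), 2))                # ('a', 'cat', '</s>') 0.36

greedy_logp, greedy_words = beam_search(toy_next_word_probs, k=1, max_len=5)
print(greedy_words, round(math.exp(greedy_logp), 2))  # ('the', 'cat', '</s>') 0.33
```

With `k=1` the search reduces to greedy decoding, while `k=2` keeps the hypothesis starting with "a" alive long enough to find the higher-scoring sentence.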