This is a summary of section 25.5 of "Artificial Intelligence: A Modern Approach".
Building a robust language model requires a large amount of data, and getting enough of it can be difficult. Most natural language data is unlabelled. On the other hand, unlabelled text is plentiful: over 100 billion words of text are added to the internet every day. Many resources give us access to this data, such as the Common Crawl project and the many FAQ sites with question-answer pairs. It would be a lot of trouble to create a new dataset every time we want a new Natural Language Processing (NLP) model. So we can use pretraining, a form of transfer learning, to train an initial version of an NLP model on general language data. From there, we can use a smaller amount of domain-specific data to refine the model.
Pretrained word embedding
Here, I am going to introduce the GloVe model. It counts how many times a word appears within a window of another word, then calculates the probability that one word appears in the context of another word:
$$P_{i,j} = \frac{X_{i,j}}{X_i}$$
where $X_{i,j}$ is the number of times word i co-occurs with word j, and $X_i$ is the number of times word i co-occurs with any word. Let $E_i$ be the word embedding for word i.
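The counting step described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names, the toy sentence, and the window size of 2 are made up, and the probability is taken to be the count for the pair divided by the total count for word i.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often word j appears within `window` positions of word i."""
    counts = defaultdict(lambda: defaultdict(int))
    for pos, word in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:  # a word is not its own context
                counts[word][tokens[ctx_pos]] += 1
    return counts

def cooccurrence_probability(counts, i, j):
    """Probability that word j appears in the context of word i."""
    total = sum(counts[i].values())  # co-occurrences of i with any word
    return counts[i][j] / total if total else 0.0

# Toy corpus (hypothetical, for illustration only).
tokens = "ice is cold and steam is hot".split()
X = cooccurrence_counts(tokens, window=2)
p_ice_cold = cooccurrence_probability(X, "ice", "cold")
```

On a real corpus the counts would be collected over billions of words, and the table would be stored sparsely because most word pairs never co-occur.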
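The GloVe model converts ratios of probabilities into vector differences and dot products, arriving at the constraint:
$$E_i \cdot E'_j = \log P_{i,j}$$
Here $P_{i,j}$ is the co-occurrence probability of words i and j, and $E'_j$ is the embedding of word j when it acts as a context word. In other words, the dot product of two word vectors is equal to the log probability of their co-occurrence. Training a GloVe model is much less expensive than training a standard neural network.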
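To make the dot-product relation concrete, here is a minimal sketch that fits two small embedding tables so that their dot products approximate given log co-occurrence probabilities. It is not the real GloVe training procedure (which, as far as I know, also uses bias terms and a weighting function). The function names, the toy probabilities, the embedding size, and the learning rate are all my own assumptions.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def total_squared_error(log_probs, E, C):
    """Sum of squared gaps between dot products and log probabilities."""
    return sum((dot(E[i], C[j]) - target) ** 2
               for (i, j), target in log_probs.items())

def fit_embeddings(log_probs, dim=4, epochs=200, lr=0.05, seed=0):
    """Fit word vectors E and context vectors C by plain gradient steps."""
    rng = random.Random(seed)
    words = sorted({w for pair in log_probs for w in pair})
    E = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in words}
    C = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in words}
    for _ in range(epochs):
        for (i, j), target in log_probs.items():
            err = dot(E[i], C[j]) - target
            # Gradients of the squared error (constant factor folded into lr).
            grad_e = [err * c for c in C[j]]
            grad_c = [err * e for e in E[i]]
            E[i] = [e - lr * g for e, g in zip(E[i], grad_e)]
            C[j] = [c - lr * g for c, g in zip(C[j], grad_c)]
    return E, C

# Toy co-occurrence probabilities (made-up numbers, for illustration only).
probs = {("ice", "cold"): 0.5, ("ice", "water"): 0.5,
         ("steam", "hot"): 0.5, ("steam", "water"): 0.5}
log_probs = {pair: math.log(p) for pair, p in probs.items()}

E_start, C_start = fit_embeddings(log_probs, epochs=0)  # untrained baseline
E, C = fit_embeddings(log_probs, epochs=200)
error_before = total_squared_error(log_probs, E_start, C_start)
error_after = total_squared_error(log_probs, E, C)
```

Comparing error_before with error_after shows how far training moves the dot products toward the log probabilities. Because the model only needs the co-occurrence table and not a pass through a deep network, each update is cheap.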
Word embeddings can also be trained on a specific domain. Tshitoyan et al. (2019) trained a word embedding model on scientific abstracts, and it was even able to recover knowledge in materials science.
Pretrained contextual representations
Word embeddings are better representations than atomic word tokens. But there is an issue with polysemous words, whose meaning can be totally different depending on the context. For example, "rose" can be a flower or the past tense of "rise". Therefore, instead of just learning a word-to-embedding table, we want to train a model to generate contextual representations of each word in a sentence. We could use a Recurrent Neural Network (RNN) to create contextual word embeddings. We feed in one word at a time and ask the model to predict the next word. At each time step, the RNN node receives two inputs: the noncontextual word embedding for the current word and the encoded information from the previous words.
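The two-input update can be sketched as a forward pass in plain Python. This is a rough illustration, not a trained model: the weights and embeddings are random, the sizes and names (W_x, W_h, contextual_representations) are my own assumptions, and the output layer that would predict the next word is left out.

```python
import math
import random

def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def contextual_representations(tokens, embeddings, W_x, W_h):
    """Return one hidden vector per token, each depending on the words before it."""
    hidden = [0.0] * len(W_h)  # no history before the first word
    outputs = []
    for tok in tokens:
        from_word = matvec(W_x, embeddings[tok])  # noncontextual embedding
        from_history = matvec(W_h, hidden)        # encoded previous words
        hidden = [math.tanh(a + b) for a, b in zip(from_word, from_history)]
        outputs.append(hidden)
    return outputs

rng = random.Random(0)
dim_embed, dim_hidden = 3, 4
vocab = ["the", "rose", "smelled", "sweet", "prices", "sharply"]
embeddings = {w: [rng.uniform(-1, 1) for _ in range(dim_embed)] for w in vocab}
W_x = make_matrix(dim_hidden, dim_embed, rng)
W_h = make_matrix(dim_hidden, dim_hidden, rng)

flower = contextual_representations(["the", "rose", "smelled", "sweet"],
                                    embeddings, W_x, W_h)
verb = contextual_representations(["prices", "rose", "sharply"],
                                  embeddings, W_x, W_h)
```

Both sentences look up the same table entry for "rose", but flower[1] and verb[1] differ because the hidden state carries different preceding words.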
This model is similar to the one for part-of-speech (POS) tagging mentioned in a previous post, with two important differences. First, this model is unidirectional (left-to-right). Second, instead of predicting a POS tag for the current word, it predicts the next word using the prior context. The drawback of a unidirectional model is that some of the context a word needs might come from the words later in the sentence. A workaround is to train a separate right-to-left language model that contextualises each word using the right side of the sentence. However, such a model fails to combine evidence from both directions. Why is that?
Masked language model
Instead of combining two unidirectional RNNs, we can use a masked language model (MLM). It is trained by masking individual words in the input and asking the model to predict the masked words. For this task, one can use a deep bidirectional RNN or a transformer on top of the masked sentence. The final hidden vectors that correspond to the masked tokens are then used to predict the words that were masked. The beauty of this approach is that it requires no labeled data: the sentence provides its own label for the masked words.
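Here is a small sketch of how a sentence can provide its own labels. It only builds the training example (masked input plus targets) and does not include the model. The "[MASK]" placeholder, the 15% masking rate, and the function name are assumptions of mine, not something the chapter summary specifies.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels), where labels maps position -> original word."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = {}
    for pos, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[pos] = MASK
            labels[pos] = tok
    # Make sure a non-empty sentence yields at least one word to predict.
    if tokens and not labels:
        pos = rng.randrange(len(tokens))
        masked[pos] = MASK
        labels[pos] = tokens[pos]
    return masked, labels

tokens = "the river rose five feet".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
```

A model would then read masked in full, with context from both sides of each gap, and be scored on how well it recovers the words stored in labels.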