Natural Language Processing (NLP) has a long history, tracing back to the beginnings of machine learning in the mid-20th century. Modern NLP techniques, however, are built on work from the past thirty-odd years.
image source: here
Deerwester et al. (1990) projected words into low-dimensional vectors by decomposing the co-occurrence matrix formed by words and the documents the words appear in. Around the same time, another approach treated the surrounding words (say, a 5-word window) as context: Brown et al. (1992) grouped words into hierarchical clusters according to their bigram contexts, which proved effective in named-entity recognition (Turian et al., 2010).
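As a toy illustration of the matrix-decomposition idea, here is a minimal sketch using a truncated SVD of a made-up word–document count matrix (the words, documents, and counts are invented for the example, not Deerwester et al.'s data):

```python
import numpy as np

# Toy word-document count matrix: rows are words, columns are documents.
# The words and counts are invented for illustration only.
words = ["cat", "dog", "car", "truck"]
X = np.array([
    [2.0, 1.0, 0.0],   # "cat" appears in docs 1 and 2
    [1.0, 2.0, 0.0],   # "dog" appears in docs 1 and 2
    [0.0, 0.0, 3.0],   # "car" appears in doc 3
    [0.0, 1.0, 2.0],   # "truck" appears in docs 2 and 3
])

# Truncated SVD keeps only the top-k singular directions, giving each
# word a low-dimensional vector (its row of U_k scaled by the singular values).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
embeddings = U[:, :k] * S[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that share documents end up with more similar vectors.
sim_cat_dog = cosine(embeddings[0], embeddings[1])
sim_cat_car = cosine(embeddings[0], embeddings[2])
```

A real system would apply the same decomposition to a matrix built from millions of documents, where the low-dimensional vectors capture much subtler word similarities.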
In 2013, Mikolov et al. introduced the WORD2VEC system, in which word embeddings are obtained by training neural networks. In 2014, Pennington et al. introduced the GloVe word embeddings, which are obtained by operating directly on a word co-occurrence matrix built from billions of words of text. In 2018, Peters et al. emphasised contextual representations of words.
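The skip-gram variant of WORD2VEC frames its training data as (center word, context word) pairs drawn from a sliding window. A minimal sketch of that pair extraction (the window size and sentence are illustrative assumptions, not Mikolov et al.'s actual configuration) might look like:

```python
# Skip-gram training pairs: each center word is paired with every word
# inside a small surrounding window. A neural network is then trained to
# predict the context word from the center word (omitted here).
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens)
```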
Apart from word embeddings, progress in language modeling has been fast as well. Mikolov et al. (2010) introduced the idea of using Recurrent Neural Networks (RNNs) for NLP, and Sutskever et al. (2015) introduced the idea of sequence-to-sequence learning with deep networks. Jozefowicz et al. (2016) showed how an RNN trained on a billion words can outperform carefully hand-crafted n-gram models. Zhu et al. (2017) and Liu et al. (2018b) showed that an unsupervised approach works and makes data collection much easier. It was soon found that these kinds of models could perform surprisingly well at a variety of tasks.
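For contrast with the neural models above, a hand-crafted n-gram model is little more than counting. Here is a minimal bigram model with add-one smoothing over a toy corpus (the corpus and the smoothing choice are illustrative assumptions):

```python
from collections import Counter

# A minimal bigram language model with add-one (Laplace) smoothing:
# the kind of count-based baseline that neural LMs later outperformed.
corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def bigram_prob(prev, word):
    """Estimate P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

# Observed continuations get higher probability than unseen ones.
p_seen = bigram_prob("sat", "on")      # "sat on" occurs twice
p_unseen = bigram_prob("sat", "rug")   # "sat rug" never occurs
```

The hand-crafted part of real n-gram systems lies in the smoothing and backoff schemes; an RNN instead learns a distributed representation of the history and needs no such engineering.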
Since 2018, new NLP projects typically start with a pretrained transformer model. Howard and Ruder (2018) describe the ULMFIT (Universal Language Model Fine-tuning) framework, which makes it easier to fine-tune a pretrained language model without requiring a vast corpus of target-domain documents. Ruder et al. (2019) give a tutorial on transfer learning for NLP.
Devlin et al. (2018) showed that transformer models pretrained with the masked language modeling objective can be directly used for multiple tasks. The model was called BERT (Bidirectional Encoder Representations from Transformers).
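The masked language modeling objective itself is simple to sketch: hide a random subset of tokens and train the model to recover them from context on both sides. The sketch below shows only the masking step (the transformer that fills in the blanks is omitted); the 15% default follows Devlin et al., while the sentence and the higher demo rate are invented for illustration:

```python
import random

# Masked language modeling (MLM): randomly hide tokens and train a model
# to predict them from context on both sides.
def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # 15% is the masking rate used by Devlin et al. (2018).
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append("[MASK]")
            targets.append(tok)    # loss is computed on masked positions
        else:
            inputs.append(tok)
            targets.append(None)   # no loss on unmasked positions
    return inputs, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
# A higher rate is used here so the effect is visible on a short sentence.
inputs, targets = mask_tokens(sentence, mask_rate=0.5)
```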
The XLNet system (Yang et al., 2019) improves on BERT by eliminating a discrepancy between pretraining and fine-tuning.
The ERNIE 2.0 framework (Sun et al., 2019) extracts more from the training data by considering sentence order and the presence of named entities, rather than just co-occurrence of words, and was shown to outperform BERT and XLNet.
The ROBERTA system (Liu et al., 2019b), an improved BERT, used more data with different hyperparameters and training procedures, and was found to match XLNet.
Another BERT variant, ALBERT (A Lite BERT), went in the other direction, reducing the number of parameters from 108 million to 12 million (so as to fit on mobile devices) while maintaining high accuracy.
The XLM system (Lample and Conneau, 2019) is a transformer model with training data from multiple languages. This is useful for machine translation, but also provides more robust representations for monolingual tasks.
ARISTO (Clark et al., 2019) achieved a score of 91.6% on an 8th grade multiple-choice science exam. It consists of an ensemble of solvers: some use information retrieval (similar to a web search engine), some do textual entailment and qualitative reasoning, and some use large transformer language models. But it deals only with multiple-choice questions, not essay questions, and it can neither read nor generate diagrams.
GPT-2 (OpenAI), a transformer-like language model with 1.5 billion parameters trained on 40 GB of Internet text, achieves good results on tasks as diverse as translation between French and English, finding referents of long-distance dependencies, and general-knowledge question answering, all without fine-tuning for the particular task. The GPT series has since advanced to GPT-4; this field is growing so fast that it is really hard to keep track of.
T5 (the Text-to-Text Transfer Transformer, Google) is designed to produce textual responses to various kinds of textual input. It includes a standard encoder–decoder transformer model, pretrained on 35 billion words from the 750 GB Colossal Clean Crawled Corpus (C4).
You might find an updated comparison of Large Language Models (LLMs) here.
In 2018, the GLUE (General Language Understanding Evaluation) benchmark, a collection of tasks and tools for evaluating NLP systems, was introduced by Wang et al. It soon became the standard metric for comparing language models. You can find the leaderboard here.
Tasks include question answering, sentiment analysis, textual entailment, translation, and parsing.
Transformer models have so dominated the leaderboard (the human baseline is way down at ninth place) that a new version, SUPERGLUE (Wang et al., 2019), was introduced with tasks designed to be harder for computers but still easy for humans.