This is a section summary of “Artificial Intelligence: A Modern Approach”.
A one-hot vector is one of the most basic ways to encode a word: we encode the ith word in the dictionary with a 1 bit in the ith input position and a 0 in all the other positions. However, this encoding does not capture the similarity between words.
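As a minimal sketch of why one-hot vectors carry no similarity information, the snippet below encodes a hypothetical three-word dictionary (the words and the helper names `one_hot` and `dot` are illustrative, not from the book). Any two distinct words have a dot product of 0, so "cat" is no closer to "dog" than it is to "rope".

```python
def one_hot(word, vocab):
    """Return a list with a 1 at the word's dictionary index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


vocab = ["cat", "dog", "rope"]  # hypothetical three-word dictionary
cat, dog, rope = (one_hot(word, vocab) for word in vocab)

print(cat)             # [1, 0, 0]
print(dot(cat, dog))   # 0
print(dot(cat, rope))  # 0
```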
n-gram counts are better at capturing the context around a word, because they record all the phrases that the word appears in. However, with a 100,000-word vocabulary there are (10⁵)⁵ = 10²⁵ possible 5-grams to keep track of. If we can reduce this to a smaller vector with just a few hundred dimensions, it will generalise better. This low-dimensional vector representing a word is called a word embedding.
Each word is just a vector of numbers, where the individual dimensions and their numeric values do not have discernible physical meanings:
"physical" = [-0.7, +0.2, -3.2, ...]
"meanings" = [+0.5, +0.9, -1.3, ...]
The feature space has the property that similar words have similar vectors. It turns out that word embedding vectors have additional properties beyond mere proximity for similar words: we can use the vector difference from one word to another to represent the relationship between these two words.
Hence, word embeddings are a good representation for downstream language tasks such as question answering, translation, or summarisation. They are not guaranteed to answer analogy questions on their own.
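To make the vector-difference idea concrete, here is a small sketch using hand-made two-dimensional vectors. The words, the numbers, and the helper names `cosine` and `analogy` are all assumptions chosen for illustration; real embeddings have hundreds of dimensions and, as noted above, are not guaranteed to behave this cleanly.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, invented for this example only.
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the word nearest to c + (b - a)."""
    target = [vc + vb - va
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


print(analogy("man", "woman", "king", toy))  # queen
```

The difference vector from "man" to "woman" plays the role of the relationship; adding it to "king" lands near "queen" in this toy space.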
In practice, word embedding vectors have been found to be more helpful than one-hot encodings of words when applying deep learning to NLP tasks. Most of the time, we can use generic pretrained vectors. Commonly used vector dictionaries include WORD2VEC, GloVe, and FastText, which has embeddings for 157 languages.
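Pretrained vectors are often distributed as plain text, one word per line followed by its values. The sketch below parses that layout under stated assumptions: the values are space-separated and there is no header line (some distributions start with a "count dimension" header that would need to be skipped first). The function name `parse_vectors` and the sample numbers are hypothetical.

```python
def parse_vectors(lines):
    """Parse lines of the form 'word x1 x2 ... xd' into a dict of float lists."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


# Made-up sample lines standing in for a real vector file.
sample = [
    "the 0.418 0.24968 -0.41242",
    "rope -0.1 0.7 0.25",
]
vectors = parse_vectors(sample)
print(vectors["rope"])  # [-0.1, 0.7, 0.25]
```

For a real file, the same function could be given an open file object, since iterating over a text file yields its lines.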
We can also train our own word vectors. This is usually done at the same time as training a network for a particular task. Unlike generic pretrained embeddings, word embeddings produced for a specific task can be trained on a carefully selected corpus and tend to emphasise aspects of words that are useful for the task. Suppose, for instance, that the task is part-of-speech (POS) tagging: we need to predict the correct part of speech for each word in a sentence. Although this is a simple task, it is nontrivial because many words can be tagged in multiple ways. The word "cut", for example, can be a present-tense verb, a past-tense verb, an infinitive verb, a past participle, an adjective, or a noun. If a nearby temporal adverb refers to the past, that suggests this occurrence of "cut" is a past-tense verb, so we would hope that the embedding captures the past-referring aspect of adverbs.
How do we do POS tagging with word embeddings?
Given a corpus of sentences with POS tags, we learn the parameters for the word embeddings and the POS tagger simultaneously. The process works as follows:
Choose the width w (an odd number of words) for the prediction window used to tag each word. A typical value is w = 5, meaning that the tag is predicted based on the word plus the two words to the left and the two words to the right. Split every sentence in the corpus into overlapping windows. Each window produces one training example consisting of the w words as input and the POS category of the middle word as output.
Create a vocabulary of all of the unique word tokens that occur more than, say, 3 times in the training data. Denote the total number of words in the vocabulary as v.
Sort this vocabulary in any arbitrary order (perhaps alphabetically).
Choose a value d as the size of each word embedding vector.
Create a new v-by-d weight matrix called E. This is the word embedding matrix. Row i of E is the word embedding of the ith word in the vocabulary. Initialise E randomly.
Set up a neural network that outputs a part-of-speech label. The first layer will consist of w copies of the embedding matrix. We might use two additional hidden layers, z1 and z2 (with weight matrices W1 and W2 respectively), followed by a softmax layer yielding an output probability distribution y over the possible part-of-speech categories for the middle word.
To encode a sequence of w words into an input vector, simply look up the embedding for each word and concatenate the embedding vectors. The result is a real-valued input vector x of length wd. Even though a given word will have the same embedding vector whether it occurs in the first position, the last, or somewhere in between, each embedding will be multiplied by a different part of the first hidden layer; therefore we are implicitly encoding the relative position of each word.
Train the weights E and the other weight matrices W1, W2, and Woutput using gradient descent. If all goes well, then for a window such as "yesterday they cut the rope", the middle word, cut, will be labelled as a past-tense verb, based on the evidence in the window, which includes the temporal past word "yesterday", the third-person subject pronoun "they" immediately before cut, and so on.
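The steps above can be sketched in plain Python as follows. This is an illustrative outline rather than the book's implementation: the `<pad>` and `<unk>` tokens, the tag names, the hidden-layer size, the tanh activation, and the omission of bias terms are all assumptions, and the gradient-descent training step is left out, so the output distribution comes from random, untrained weights.

```python
import math
import random

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not from the source


def make_windows(tagged_sentence, w=5):
    """Split a tagged sentence into (window of w words, middle-word tag) pairs."""
    half = w // 2
    words = [PAD] * half + [word for word, _ in tagged_sentence] + [PAD] * half
    return [(words[i:i + w], tag) for i, (_, tag) in enumerate(tagged_sentence)]


def build_vocab(tagged_sentences, min_count=3):
    """Map each token occurring more than min_count times to a row index."""
    counts = {}
    for sentence in tagged_sentences:
        for word, _ in sentence:
            counts[word] = counts.get(word, 0) + 1
    kept = sorted(word for word, c in counts.items() if c > min_count)
    return {word: i for i, word in enumerate([PAD, UNK] + kept)}


def random_matrix(rows, cols, rng):
    """Create a rows-by-cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode(window_words, vocab, E):
    """Look up each word's row of E and concatenate into a vector of length w*d."""
    x = []
    for word in window_words:
        x.extend(E[vocab.get(word, vocab[UNK])])
    return x


def dense(W, x, activation=None):
    """Multiply the weight matrix W (a list of rows) by the vector x."""
    out = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    return [activation(o) for o in out] if activation else out


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, W2, W_out):
    """Two hidden layers z1 and z2, then a softmax over POS categories."""
    z1 = dense(W1, x, math.tanh)
    z2 = dense(W2, z1, math.tanh)
    return softmax(dense(W_out, z2))


# Toy one-sentence corpus with made-up tag names.
corpus = [[("yesterday", "ADV"), ("they", "PRON"), ("cut", "VERB_PAST"),
           ("the", "DET"), ("rope", "NOUN")]]
tags = sorted({tag for sentence in corpus for _, tag in sentence})

w, d, hidden = 5, 4, 8             # window width, embedding size, hidden units
rng = random.Random(0)
vocab = build_vocab(corpus, min_count=0)  # keep every word in this tiny corpus
v = len(vocab)
E = random_matrix(v, d, rng)       # v-by-d embedding matrix
W1 = random_matrix(hidden, w * d, rng)
W2 = random_matrix(hidden, hidden, rng)
W_out = random_matrix(len(tags), hidden, rng)

windows = make_windows(corpus[0], w)
x = encode(windows[2][0], vocab, E)  # window centred on "cut"
y = forward(x, W1, W2, W_out)

print(windows[2])        # (['yesterday', 'they', 'cut', 'the', 'rope'], 'VERB_PAST')
print(len(x))            # 20
print(round(sum(y), 6))  # 1.0
```

Because each position in the window feeds a different slice of W1, the same embedding row contributes differently depending on where the word sits, which is how relative position is implicitly encoded.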