Evolution of Word to Vector

Limitations of Word2Vec:
- Inability to handle unknown or out-of-vocabulary (OOV) words.
- No shared representations at sub-word levels.
- Scaling to new languages requires new embedding matrices.
- Word2Vec relies only on local statistics (the local context of each word) and does not incorporate global statistics (corpus-wide word co-occurrence counts) when learning word vectors.
- Indifference to word order and inability to represent idiomatic phrases.
Improvements over Vanilla Word2Vec:
When the number of output classes is very large, such as in the case of the skip-gram model, computing the softmax becomes very expensive.
In their follow-up paper, Mikolov et al. [Mikolov et al., 2013b] suggested using one of two approximate cost functions to make the computation more efficient:
Hierarchical softmax [Morin and Bengio, 2005]
Hierarchical softmax builds a Huffman binary tree where leaves are all the words from the vocabulary. In order to estimate the probability of a given word, one traverses the tree from the root to a leaf.
To evaluate the probability of a given word, take the product of the probabilities of each edge on the path from the root to that word's leaf.
How does it improve computational efficiency?
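Because the tree is binary, the length of a path grows only logarithmically with the vocabulary size, so the cost per word drops from one term per vocabulary word to roughly one term per tree level, which is an exponential speedup. For a vocabulary of 1 million words, a balanced tree needs only about log2(1,000,000) ≈ 20 multiplications!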
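The path-product idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-word vocabulary, the balanced tree (a real Huffman tree gives frequent words shorter paths), and the node scores are all made up, and each score stands in for the dot product between an inner-node vector and the input word vector.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical vocabulary arranged in a balanced binary tree.
# Each word is described by its path from the root: a list of
# (inner_node_id, direction) pairs, where +1 means "go left" and -1 "go right".
PATHS = {
    "cat": [(0, +1), (1, +1)],
    "dog": [(0, +1), (1, -1)],
    "fish": [(0, -1), (2, +1)],
    "bird": [(0, -1), (2, -1)],
}

# Made-up scores, one per inner node (assumption: in a real model each score
# is the dot product of the node's vector with the input word vector).
NODE_SCORES = {0: 0.3, 1: -1.2, 2: 0.8}


def word_probability(word):
    """Multiply the branch probabilities along the path from root to leaf."""
    p = 1.0
    for node, direction in PATHS[word]:
        # sigmoid(s) + sigmoid(-s) == 1, so the two branches at every
        # inner node form a valid probability distribution.
        p *= sigmoid(direction * NODE_SCORES[node])
    return p


probs = {word: word_probability(word) for word in PATHS}
total = sum(probs.values())

# Depth of a balanced binary tree over one million leaves.
depth = math.ceil(math.log2(1_000_000))
```

Since the two branches at every inner node sum to one, the leaf probabilities sum to one without ever computing a normalizer over the whole vocabulary.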
Google deployed this technique in its Allo smart replies, where prediction time decreased from around half a second to nearly instant. Many neural language models nowadays use either hierarchical softmax or other softmax approximation techniques.
Negative Sampling and Noise Contrastive Estimation
Negative sampling, on the other hand, is a simplification of the Noise Contrastive Estimation [Gutmann and Hyvärinen, 2010] technique.
Multinomial softmax regression is expensive when we are computing softmax across many different classes (each word essentially denotes a separate class). The core idea of Noise Contrastive Estimation (NCE) is to convert a multiclass classification problem into one of binary classification via logistic regression, while still retaining the quality of word vectors learned. With NCE, word vectors are no longer learned by attempting to predict the context words from the target word. Instead, we learn word vectors by learning how to distinguish true pairs of (target, context) words from corrupted (target, a random word from vocabulary) pairs. The idea is that if a model can distinguish between actual pairs of target and context words from random noise, then good word vectors will be learned.
Specifically, for each positive sample (i.e., true target/context pair) we present the model with k negative samples drawn from a noise distribution. For small to average size training datasets, a value for k between 5 and 20 was recommended, while for very large datasets a smaller value of k between 2 and 5 suffices. Our model only has a single output node, which predicts whether the pair was just random noise or actually a valid target/context pair. The noise distribution itself is a free parameter, but the paper found that the unigram distribution raised to the power 3/4 worked better than other distributions, such as the unigram and uniform distributions.
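As a rough sketch of the two ingredients described above, the snippet below builds a unigram distribution raised to the power 3/4, draws k negatives from it, and evaluates the binary logistic loss for one positive pair plus its negatives. The toy corpus, the two-dimensional vectors, and the function names are all invented for illustration; a real implementation would also update the vectors by gradient descent.

```python
import math
import random
from collections import Counter


def noise_distribution(tokens, power=0.75):
    """Unigram counts raised to `power`, renormalized to sum to 1."""
    counts = Counter(tokens)
    weights = {word: count ** power for word, count in counts.items()}
    z = sum(weights.values())
    return {word: weight / z for word, weight in weights.items()}


def sample_negatives(dist, k, exclude, rng):
    """Draw k noise words, skipping the words of the true pair."""
    words = [word for word in dist if word not in exclude]
    weights = [dist[word] for word in words]
    return rng.choices(words, weights=weights, k=k)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(target_vec, context_vec, negative_vecs):
    """Logistic loss: true pair labeled 1, each noise pair labeled 0."""
    loss = -math.log(sigmoid(dot(target_vec, context_vec)))
    for negative_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(target_vec, negative_vec)))
    return loss


# Toy corpus (hypothetical): "the" appears 3 times out of 8 tokens.
tokens = "the cat sat on the mat the end".split()
dist = noise_distribution(tokens)

# True pair is (target="cat", context="sat"); draw k=5 negatives.
negatives = sample_negatives(dist, k=5, exclude={"cat", "sat"}, rng=random.Random(0))

# Made-up 2-d vectors for the target, the true context, and two noise words.
loss = negative_sampling_loss([0.5, 0.1], [0.4, 0.2], [[-0.3, 0.1], [0.2, -0.6]])
```

Raising counts to the 3/4 power flattens the distribution: here "the" ends up with less probability mass than its raw unigram share of 3/8, so rarer words are sampled as negatives a little more often.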
As mentioned above, the Word2Vec model has some limitations, and ignoring morphology (the internal structure of words, such as shared roots, prefixes, and suffixes) is a major drawback.
For example, you and I might encounter a new word that ends in “less”, and from our knowledge of words that end similarly, we can guess that it’s probably an adjective indicating a lack of something, like flawless or careless.
Word2vec represents every word as an independent vector, even though many words are morphologically similar, just like our two examples above.
This can also become a challenge in morphologically rich languages such as Arabic, German, or Turkish.
FastText (By Facebook)
As a solution to the problem mentioned above, Bojanowski et al. [Bojanowski et al., 2017] enriched skip-gram with subword information, that is, information at the character level. Instead of conditioning the probability of context words on the center word vector alone, they condition it on the sum of the center word vector and the vectors of its subwords (character n-grams), which addresses the morphological weakness of the skip-gram model.
In their experiments, they consider character n-grams of sizes 3, 4, 5, and 6.
Example:
Taking the word where and n = 3 as an example, and adding the boundary symbols < and > to mark the start and end of the word, it will be represented by the character n-grams:
<wh, whe, her, ere, re> and the special sequence <where>.
Since the number of all possible character n-grams is huge, the authors place them in some fixed-size hash table (e.g. 10^6 elements) in order to bound the memory requirements, and embeddings are learned for hashes instead of n-grams.
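The n-gram extraction and the hashing trick can be sketched as follows. Note the assumptions: the function names are made up, the bucket count follows the 10^6 example figure above, and a plain 32-bit FNV-1a hash is used as a stand-in, so bucket ids will not necessarily match those of the actual fastText library.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in < and > boundary symbols."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def fnv1a_32(text):
    """32-bit FNV-1a hash (stand-in hash function, an assumption here)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def bucket(ngram, num_buckets=10**6):
    """Map an n-gram to a row of a fixed-size embedding table."""
    return fnv1a_32(ngram) % num_buckets


trigrams = char_ngrams("where", n_min=3, n_max=3)
# Rows of the (hypothetical) subword embedding table used for "where".
rows = [bucket(gram) for gram in char_ngrams("where")]
```

Different n-grams may collide in the same bucket and then share one embedding; that is the price paid for a memory footprint that no longer depends on how many distinct n-grams the corpus contains.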
Bojanowski et al. report results superior to the original skip-gram on both word similarity and analogy tasks.
GloVe (By Stanford)
GloVe is an unsupervised learning algorithm for obtaining vector representations for words, developed by Pennington et al. (2014) at Stanford.
Rather than relying only on a local context window, GloVe constructs an explicit word-context (word co-occurrence) matrix using statistics across the whole text corpus. The resulting model may produce generally better word embeddings.
The name GloVe stands for Global Vectors because the model is based on capturing global corpus statistics. GloVe aims to combine count-based matrix factorization and the context-based skip-gram model.
Pre-trained GloVe word vectors are available for several training corpora. They are made available under the Public Domain Dedication and License.
- Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
- Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip
- Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d vectors, 822 MB download): glove.6B.zip
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 200d vectors, 1.42 GB download): glove.twitter.27B.zip
Each cell X_ij of the matrix is the number of times word j occurs in the context of word i. Let X_i be the sum of row i, that is, the number of times any word appears in the context of word i. The probability that word j appears in the context of word i is then:
P_ij = P(j | i) = X_ij / X_i
Non-zero elements of this sparse co-occurrence matrix are passed as input to the GloVe learning algorithm.
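To make the matrix concrete, here is a minimal sketch that counts co-occurrences in a toy corpus and turns one row into conditional probabilities. It rests on a few assumptions: context is taken to mean "within a symmetric window of two tokens", every co-occurrence counts as 1 (GloVe itself down-weights distant pairs), X_i is taken to be the row sum of the matrix as in the GloVe paper, and the sentences and function names are invented.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Sparse co-occurrence matrix X stored as {(word_i, word_j): count}."""
    counts = defaultdict(float)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sentence[j])] += 1.0
    return counts


def cooccurrence_probability(counts, word_i, word_j):
    """P(j | i) = X_ij / X_i, where X_i is the sum of row i."""
    x_i = sum(c for (row_word, _), c in counts.items() if row_word == word_i)
    if x_i == 0:
        return 0.0
    return counts.get((word_i, word_j), 0.0) / x_i


# Hypothetical two-sentence corpus.
sentences = [["ice", "is", "solid"], ["steam", "is", "gas"]]
X = cooccurrence_counts(sentences)

p_solid_given_ice = cooccurrence_probability(X, "ice", "solid")
p_gas_given_ice = cooccurrence_probability(X, "ice", "gas")
```

Only pairs that actually co-occur are stored, which is why the dictionary stays sparse even for a large vocabulary.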
Let’s understand this whole mathematical concept with an example. Consider two words i and j that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take i = ice and j = steam.
The table shows the co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus. Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k.
For words k related to ice but not steam, say k = solid, we expect the ratio P_ik / P_jk to be large.
Similarly, for words k related to steam but not ice, say k=gas, the ratio should be small.
For words like water or fashion, that are either related to both ice and steam, or to neither, the ratio should be close to 1.
Compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid & gas) from irrelevant words (water & fashion) and it is also better able to discriminate between the two relevant words.
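The behaviour of the ratio is easy to check numerically. The counts below are invented purely for illustration (they are not the figures from the GloVe paper), and the row totals are assumed to be equal for both target words to keep the arithmetic simple.

```python
# Hypothetical co-occurrence counts X[target][probe], invented for illustration.
X = {
    "ice": {"solid": 80, "gas": 5, "water": 300, "fashion": 2},
    "steam": {"solid": 6, "gas": 90, "water": 290, "fashion": 2},
}

# Hypothetical row totals X_i (all context words seen around each target).
ROW_TOTALS = {"ice": 10_000, "steam": 10_000}


def prob(target, probe):
    """P(probe | target) = X[target][probe] / X_target."""
    return X[target][probe] / ROW_TOTALS[target]


def ratio(probe):
    """P(probe | ice) / P(probe | steam)."""
    return prob("ice", probe) / prob("steam", probe)


ratios = {probe: ratio(probe) for probe in X["ice"]}
```

With these made-up numbers, solid gives a ratio well above 1, gas a ratio well below 1, and both water and fashion land close to 1, even though water has by far the largest raw probabilities and fashion the smallest.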
GloVe therefore proposes that word embeddings be learned from ratios of co-occurrence probabilities rather than from the probabilities themselves.
For more details, you can also look at the GloVe documentation.
Introduction to Contextual Embeddings
ELMo — Deep contextualized word representations (By Allen NLP)
In 2018, Matthew E. Peters and colleagues introduced ELMo.
ELMo, short for Embeddings from Language Models, uses a bidirectional language model (biLM) pre-trained on a large corpus to produce multi-layered word embeddings.
What’s the key difference between ELMo embeddings and traditional embeddings such as Word2Vec and GloVe?
Let’s have a look at the below image:
GloVe and Word2Vec assign each word a single numeric representation, so a word has the same embedding regardless of the context it appears in.
What’s the problem here?
The word "present" is used in different contexts, and ELMo assigns a different embedding to each occurrence based on its context.
ELMo vectors are assigned to a token or word and are actually a function of the entire sentence containing that word. Therefore, the same word can have different word vectors under different contexts.
Consider the below two sentences:
I know how it feels
It feels hot here
Take a moment to grasp the difference between these two sentences. In the first sentence, the verb “feels” expresses empathy with someone else's experience, while in the second it describes a direct physical sensation. This is a case of polysemy, in which a word has multiple meanings or senses.
ELMo word vectors successfully address this issue. ELMo word representations take the entire input sentence into the equation for calculating the word embeddings. Hence, the term “feel” would have different ELMo vectors under different contexts.
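The contrast between a static lookup and a context-dependent vector can be shown with a deliberately crude toy. To be clear about the assumptions: this is not ELMo and involves no language model. The two-dimensional vectors are made up, and the "contextual" function simply mixes a word's vector with the average of its sentence, which is just enough to show why the same word can end up with different vectors.

```python
# Made-up static embeddings, one fixed vector per word (hypothetical values).
STATIC = {
    "i": [0.1, 0.9],
    "know": [0.4, 0.6],
    "how": [0.2, 0.2],
    "it": [0.5, 0.1],
    "feels": [0.7, 0.7],
    "hot": [0.9, 0.0],
    "here": [0.3, 0.4],
}


def static_vector(tokens, index):
    """Word2Vec/GloVe style: the vector depends on the word only."""
    return STATIC[tokens[index]]


def toy_contextual_vector(tokens, index):
    """Toy stand-in for a contextual model: the vector is a function of the
    whole sentence (here, half the word vector plus half the sentence mean)."""
    vectors = [STATIC[token] for token in tokens]
    dims = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    own = vectors[index]
    return [0.5 * own[d] + 0.5 * mean[d] for d in range(dims)]


first = "i know how it feels".split()
second = "it feels hot here".split()

static_first = static_vector(first, first.index("feels"))
static_second = static_vector(second, second.index("feels"))
contextual_first = toy_contextual_vector(first, first.index("feels"))
contextual_second = toy_contextual_vector(second, second.index("feels"))
```

The static lookup returns the identical vector for "feels" in both sentences, while the sentence-dependent function returns two different ones. ELMo replaces the naive averaging with the internal states of a pre-trained bidirectional language model.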
So how does it understand the polysemy?
Let’s look at how word senses can be told apart. We will start with a plain word similarity question as an example:
How similar is a cup to a mug?
We can think of WSD or Word Sense Disambiguation as a kind of contextualized similarity task, since our goal is to be able to distinguish the meaning of a word like bass in one context (playing music) from another context (fishing).
What’s the goal of the algorithm?
Here the algorithm has been given two sentences, each with the same target word but in a different sentential context. The system must decide whether the target words are used in the same sense in the two sentences or in a different sense.
Consider the below two sentences:
There’s a lot of trash on the bed of the river.
I keep a glass of water next to my bed when I sleep.
The task above is known as the word-in-context task, which lies somewhere between word similarity and WSD.
ELMo learns contextualized word representations that capture syntax, semantics, and word sense, which helps with tasks such as Word Sense Disambiguation (WSD).
ELMo can be combined with existing deep learning approaches to build supervised models for a diverse range of complex NLP tasks and improve their performance significantly.
BERT (By Google)
After ELMo, in the same year, another revolutionary model was introduced: BERT, or Bidirectional Encoder Representations from Transformers. It builds on the bidirectional idea of ELMo but uses a Transformer architecture instead of LSTMs. Internally, it uses the attention mechanism to model the relationships between all the words in a sentence.
Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in a paper by Google.
As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
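A stripped-down version of self-attention shows what "reading the entire sequence at once" means in practice. This sketch makes simplifying assumptions: the learned query, key, and value projection matrices of a real Transformer are omitted (the input vectors play all three roles), there is a single attention head, and the input vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position, left and right alike.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of all input vectors.
        output = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(d)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up token vectors standing in for an embedded sentence.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(inputs)
```

Every row of `weights` has a non-zero entry for every position, so the first token already "sees" the last one. That is the non-directional behaviour described above, in contrast to a model that processes tokens strictly left to right.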
The chart below is a high-level description of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to an input token with the same index.
When training language models, there is a challenge of defining a prediction goal. Many models predict the next word in a sequence (e.g. “The child came home from ___”), a directional approach that inherently limits context learning. To overcome this challenge, BERT uses two training strategies:
Masked LM (MLM)
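Before feeding word sequences into BERT, 15% of the tokens in each sequence are selected for prediction. Most of the selected tokens (80%) are replaced with a [MASK] token, while the remainder are either replaced with a random token or left unchanged. The model then attempts to predict the original value of the selected tokens, based on the context provided by the other tokens in the sequence.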
Next Sentence Prediction (NSP)
In the BERT training process, the model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
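Building the training pairs for Next Sentence Prediction might look like the sketch below. The function name, the placeholder sentences, and the rule for picking the random sentence (any sentence in the corpus other than the first sentence and the true next one) are assumptions made for this illustration.

```python
import random


def make_nsp_pairs(documents, seed=0):
    """Return (first_sentence, second_sentence, is_next) training triples."""
    rng = random.Random(seed)
    all_sentences = [sentence for doc in documents for sentence in doc]
    pairs = []
    for doc in documents:
        for first, actual_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                # Positive example: the true subsequent sentence.
                pairs.append((first, actual_next, True))
            else:
                # Negative example: a random sentence from the corpus.
                candidates = [
                    s for s in all_sentences if s != actual_next and s != first
                ]
                pairs.append((first, rng.choice(candidates), False))
    return pairs


# Hypothetical corpus: two documents with three placeholder sentences each.
documents = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
pairs = make_nsp_pairs(documents)
```

Each triple then becomes one binary classification example: the model sees both sentences together and predicts the `is_next` label.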
When training the BERT model, Masked LM and Next Sentence Prediction are trained together, with the goal of minimizing the combined loss function of the two strategies.
We hope this overview of word embeddings has helped to highlight some fantastic research that sheds light on the relationship between traditional distributional semantic models and state-of-the-art dynamic embedding models.
via WordPress https://ramseyelbasheer.io/2021/06/19/evolution-of-word-to-vector/