Word Embeddings and Pre-training for Large Language Models (BERT, GPT)
Around 2017: Attention and beyond!
Attention in language models (LMs), introduced by Vaswani et al. (2017) with the Transformer architecture, offered a way of capturing this context in language that outperformed previous SOTA benchmarks on various downstream NLP tasks. Since then, language models have grown significantly in size. For context (all of the following are Transformer-based, by the way; a parameter-counting sketch follows the list):
BERT-base (Transformer Encoder) has ~110M parameters
GPT-1 (Transformer Decoder) has ~117M parameters
BERT-large has ~340M parameters
GPT-2 has ~1.5B parameters
GPT-3 has ~175B parameters
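As a quick sanity check on the sizes above, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (assumptions on my part, not something the post specifies), that counts a pre-trained model's parameters:

```python
# Minimal sketch: count the parameters of a pre-trained checkpoint.
# Assumes the Hugging Face `transformers` library is installed; the checkpoint
# name is an example, not mandated by the post.
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # BERT-base (encoder-only)
num_params = sum(p.numel() for p in model.parameters())
print(f"~{num_params / 1e6:.0f}M parameters")  # roughly 110M, in line with the list above
```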
The pre-training objective of these large language models is to predict held-out text: the next word for GPT-style decoders, and a masked word plus a next-sentence judgment for BERT. This turns out to be a good pre-training objective for learning complex word interactions, and it is useful for different downstream tasks like question answering.
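To make the next-word objective concrete, here is a toy PyTorch sketch (not BERT's or GPT's actual training code; a single linear layer stands in for the Transformer, and all sizes are illustrative). The model's prediction at position t is scored against the true token at position t+1 with cross-entropy:

```python
import torch
import torch.nn.functional as F

# Toy next-word prediction objective (illustrative only; a real LM would use a
# Transformer decoder in place of the single Linear layer below).
vocab_size, d_model, seq_len = 1000, 64, 16
token_ids = torch.randint(0, vocab_size, (1, seq_len))  # a batch of one "sentence"

embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

hidden = embed(token_ids)   # (1, seq_len, d_model)
logits = lm_head(hidden)    # (1, seq_len, vocab_size)

# Score the prediction at position t against the true token at position t+1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    token_ids[:, 1:].reshape(-1),
)
print(loss)  # this scalar is what pre-training minimizes
```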
The datasets these models are trained on are non-trivial; for example, BERT was trained on English Wikipedia (2.5B words) plus BookCorpus (800M words), roughly 3.3B words in total.
And why do we care about their size? Because these models can capture complex interactions between the words in their context, more than pre-trained word2vec has to offer. These large models therefore learn word representations independent of word2vec or any other pre-trained word embedding, i.e., we pre-train whole models, illustrated below.
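As a rough illustration of the contrast with static embeddings, here is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is prescribed by the post). The same surface word gets different contextual vectors in different sentences, whereas word2vec would assign it a single fixed vector:

```python
# Contrast a static embedding with a contextual one: the same word receives a
# different vector depending on its surrounding context.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for(word, sentence):
    """Return the contextual vector of `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_river = vector_for("bank", "She sat on the bank of the river.")
v_money = vector_for("bank", "He deposited cash at the bank.")

# word2vec would give "bank" one fixed vector; here the cosine similarity is
# noticeably below 1 because the surrounding contexts differ.
print(torch.cosine_similarity(v_river, v_money, dim=0))
```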
Note: Before 2017 you could also throw away word2vec embeddings and just have your model (say, an LSTM) learn word embeddings from scratch end-to-end, but this was not very useful: we did not yet have such large models trained on huge collections of data, and earlier architectures like the LSTM had limiting factors, one being that context is effectively lost beyond ~100 words, another being their inherently sequential computation.
But this works well for large Transformer-based models, which learn “strong” representations of language. One might ask whether pre-training these large models overfits the pre-training data; it turns out from the results that BERT and its peers are in fact underfitting it, so we need even larger models to express the complex interactions that allow us to fit these massive datasets better. Hence more and more large models keep showing up, such as Google’s T5 (Text-to-Text Transfer Transformer).
The text these models were trained on is of all sorts; hypothetical examples: “This pizza is taaaaasty!” or “We should transformify this and laern the word representations better.”
There can be variations of words like “taaaaasty”, misspellings like “laern”, and novel items like “transformify”, all of which would get an UNK token from a word-level tokenizer. Trying to learn a separate meaning for each of these forms wouldn’t help. And while English has comparatively simple morphology, other languages have much richer morphology, meaning they have a massive number of word variations compared to English.
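Here is a toy sketch of that UNK problem (the vocabulary and function name are made up for illustration, not taken from any real tokenizer): a fixed word-level vocabulary collapses every unseen variant, misspelling, or novel word into the same token, so their meaning is lost.

```python
# Toy word-level tokenizer with a tiny, hypothetical vocabulary.
vocab = {"this", "pizza", "is", "tasty", "we", "should", "and", "learn",
         "the", "word", "representations", "better"}

def word_level_tokenize(text):
    # Any word not in the vocabulary is replaced by <UNK>.
    return [w if w in vocab else "<UNK>" for w in text.lower().rstrip("!").split()]

print(word_level_tokenize("This pizza is taaaaasty!"))
# ['this', 'pizza', 'is', '<UNK>']
print(word_level_tokenize("We should transformify this and laern the word representations better"))
# ['we', 'should', '<UNK>', 'this', 'and', '<UNK>', 'the', 'word', 'representations', 'better']
```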