From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough

: With the recent advances in deep learning, different approaches to improving pre-trained language models (PLMs) have been proposed. PLMs have advanced state-of-the-art (SOTA) performance on various natural language processing (NLP) tasks such as machine translation, text classiﬁcation, question answering, text summarization, information retrieval, recommendation systems, named entity recognition, etc. In this paper, we provide a comprehensive review of prior embedding models as well as current breakthroughs in the ﬁeld of PLMs. Then, we analyse and contrast the various models and provide an analysis of the way they have been built (number of parameters, compression techniques, etc.). Finally, we discuss the major issues and future directions for each of the main points.


Introduction
Word embeddings (i.e., distributed representations, word vectors) are dense feature vector representations of words in a specific dimensional space [1], which are usually learned by an unsupervised algorithm when fed with large amounts of tokens (t 1 , t 2 , . . ., t n ) [2].Numerous variations have been proposed: context-free embeddings representation [3][4][5], and contextual representation (pre-trained language models) [6].Traditional context-free word embedding approaches aim to learn a global word embedding matrix E ∈ R V * d , where V is the vocabulary size and d is the number of dimensions.In fact, each row e i of E corresponds to the embedding of the token t i in the vocabulary V.Alternatively, methods that learn contextual embeddings can associate to a token t i different vectors (different meanings) depending on the context in which the token appears [7].
The above studies have tried to improve word representations, but they still have a key limitation, which is producing one embedding vector for each word in almost all word embeddings, despite the fact that a word may have different meanings.With the large amounts of available data (e.g., Wikipedia dumps, news collections, Web Crawl, The Pile, etc.) and the recent advances in deep learning over recent years such as transformers, model capacity, and computational efficiency, a different number of methods have been introduced to overcome drawbacks of word embeddings [8][9][10][11].
The rest of this paper is organized as follows.Section 2 provides background concepts and distinguishes between the different categorizations of word embeddings (frequencybased and context-free word embeddings).In Section 3, we give a deep review of the different SOTA pre-trained language models.Section 4 describes the application ways, parameters, and different compression methods used to create PLMs.Section 5 discusses the current challenges and suggests future directions.Finally, we draw the conclusion of this research paper.

Language Models and Word Embeddings
There are a variety of pre-training general language representations.We distinguish two generations of word embeddings: frequency-based word vectors and context-free embedding representation.An overview of the most popular approaches is presented here.
Latent Semantic Analysis (LSA) is a representation method of word meaning introduced in [12].It is a fully automated statistical method to extract relationships between words based on their contexts of use in documents or sentences.LSA is an unsupervised learning technique (Mathematical details: [13]).Fundamentally, the proposed model is concretized through four main steps: (1) Term-Document Matrix calculation also known as the bag of words model; (2) Transformed Term-Document Matrix; (3) Dimension Reduction [14]; and (4) Retrieval in Reduced Space.

Neural Network Language Model (NNLM)
The Neural Network Language Model (NNLM) [2,15] jointly learns a word vector representation and a statistical language model with a feedforward neural network that contains a linear projection layer and a non-linear hidden layer.N-dimensional one-hot vector that represents the word is used as the input, where N is the size of the vocabulary.The input is first projected onto the projection layer.Afterwards, a softmax operation is used to compute the probability distribution over all words in the vocabulary.As a result of its non-linear hidden layers, the NNLM model is very computationally complex.To lower the complexity, an NNLM is first trained using continuous word vectors learned from simple models.Then, another N-gram NNLM is trained from the word vectors.
Max-Margin Loss (MML) Collobert and Westion [16] proposed word embeddings by training a model on a large dataset.They consider a window approach network.Max margin or hinge loss is a technique that tries to get a higher score for words that are correct than for words that are not.
Word2Vec is a model, developed by Google, to build word embeddings [3].An intrinsic difference with previously developed methods is that Word2vec is a predictive model, while LSA and MML are statistical.The Word2vec architecture is based on a feedforward neural network language model [2].Word2vec can be obtained from two different models (both involving Neural Networks): Skip-Gram and Common Bag Of Words (CBOW).The Skip-Gram model trains by trying to predict the context words from the target, while the CBOW model aims at predicting the current word from the sum of all the word vectors in its surrounding context.Using a contextual window, Word2Vec is able to learn semantic meaning and similarity between words in a fully unsupervised manner.This means that words with similar meanings (e.g., king, queen) tend to occur in a semantic space in close proximity to each other.The CBOW model is quicker as it treats the entire context as one entity, while the skip-gram model produces various training pairs for each context word.The Skip-gram model does a great job of capturing rare words due to the way it manages the context.
After the emergence of Skip-Gram and CBOW-based word embedding models, two other models based on these methods were announced.In the rest of this section, we will present global vectors and fastText.
Global vectors (Glove) is an unsupervised learning algorithm, developed by Stanford University, based on a word-word co-occurence matrix that is used to create the embeddings [4].Each row of the matrix represents a word, while each column represents the contexts that words can appear in.The matrix values represent the frequency with which a word appears in a given context.Finally, an embedding matrix where each row will be a word's embedding vector for the corresponding word is created by applying a dimensionality reduction [14].
FastText is a method for learning word embeddings for large datasets [17,18].It can be seen as an improvement of Word2Vec.Instead of considering words as distinct entities, FastText aims to learn vector representations for each word and to exploit the character and morphological structure of the word by considering every word as a composition of n-grams character (e.g., "unbalanced" = "un" + "balance" + "ed").Thus, words can be represented by averaging the embeddings of their n-grams (sub-words).While this will increase the training cost, the sub-word representation permits fastText to represent words more efficiently, allowing the estimation of out-of-vocabulary (OOV) and rare words since their character-based n-grams should occur across other words in the training dataset.
CoVe (Context Vectors model), introduced by McCann et al. [19] from SalesForce Research, learns contextual vectors using a neural machine language translation system.It is so viewed as a sequence-to-sequence model.CoVe shows that the learned representations are transferable to other NLP tasks.In order to obtain contextual embeddings, CoVe uses a deep LSTM encoder from a sequence-to-sequence model trained for machine translation.Empirical results revealed that boosting non-contextualized word representations (e.g., Word2Vec, Glove) with CoVe embeddings (LM + CoVe) improves performance over a variety of common NLP tasks.
FLAIR embeddings were introduced by Zalando research [20].FLAIR embeddings are character-level language models.FLAIR proposed embeddings have the distinct properties of being trained without any explicit concept of words and thus fundamentally modeling words as sequences of characters using the BPEmb technique presented by Heinzerling and Strube [21].In FLAIR, the same word may have different embeddings depending on the context.

List of Recent Pretrained Language Models
With the large amounts of available data (e.g., Wikipedia dumps, news collections, Web Crawl, and The Pile) and the recent advances in deep learning over recent years, such as transformers (Figure 1), model capacity, and computational efficiency, a different number of methods have been introduced to overcome the drawbacks of word embeddings.Language pre-training has been highly successful, with state-of-the-art models such as GPT, BERT, XLNet, ERNIE, ELECTRA, and T5, among many others [8,19,20,[23][24][25].These methods share the same idea, pre-training the language model in an unsupervised fashion on vast amounts of datasets, and then using this pre-trained model for fine-tuning for various downstream NLP tasks.Fine-tuning [8,[23][24][25][26] typically adds an extra layer(s) for the specific task and further trains the model using a task-specific annotated dataset, starting from the pre-trained back-bone weights (Figure 2).Recently, different unsupervised pre-training objectives have been explored in the literature (autoregressive (AR) language modelling, autoencoding (AE), Language Model (LM), Masked Language Model (MLM), Next Sentence Prediction (NSP), Replaced Text Detection (RTD), etc.).In this section, we will present recent pre-trained language models that have significantly advanced the SOTA over a variety of NLP tasks.
ELMO, short for the Embeddings from Language Model [27], comes to overcome the limitations of previous word embeddings.ELMo representations are a function of the internal layers of the bi-directional Language Model (biLM).While the input is a sequence of tokens, the language model learns to predict the probability of the next token given the previous ones (a task called "Language Modeling").In the forward pass, the sequence contains tokens before the target token.In the backward pass, the history contains words after the target token.The deep BiLSTM architecture helps ELMo learn more contextdependent aspects of the meanings of words in the top layers along with syntax aspects in the lower layers.This leads to better word embeddings and different representations of a word depending on the context in which it appears.Peters et al. [27] demonstrated that the biLM-based representations outperformed CoVe in all considered tasks.ULMfit (Universal Language Model Fine-tuning).Howard and Ruder [28], from Fas-tAI, have suggested ULMfit.The core concept behind this model is that a single model architecture can be used for language pre-training and fine-tuned for supervised downstream classification tasks by using the weights acquired through pre-training.Howard and Ruder [28] demonstrate the importance of a number of novel techniques, including discriminative fine-tuning, slanted triangular learning rate, and gradual unfreezing, for maintaining prior knowledge and ensuring stable fine-tuning.
OpenAI GPT Transformer, short for Generative Pre-training Transformer [24], is a multi-layer transformer decoder [22].Its approach is a combination of two existing ideas: transformers [22] and unsupervised pretraining [28].GPT is a large auto-regressive language model pretrained with language modelling (LM), a simple architecture that can be trained faster than an LSTM-based model.GPT uses the BookCorpus dataset [29], which contains more than 7000 books from various genres.It is able to learn complex patterns in the data by using the attention mechanism.The total number of trained parameters is 110M parameters.
BERT (Bidirectional Encoder Representations from Transformers) introduced by Devlin et al. [8] is built on a number of clever ideas that have been emerging in the NLP recently including Semi-supervised Sequence Learning [26], ELMo [27], ULMFiT [28], the Transformer [22], and the OpenAI transformer [24].BERT is the first deeply bidirectional, unsupervised language representation, pretrained using only a large text corpus and jointly conditioning both the left and right context of each token.BERT is pretrained with a linear combination of two objectives: (1) masked language modelling (MLM) objective, which randomly masks 15% of the input sequence and tends to predict these masked tokens.(2) Next Sentence Prediction (NSP) which is given two sentences, does the second sentence truly follow the first in the original text?BERT comes in different versions: a 12-layer BERT-base model (with 12 attention heads and 110M parameters) and a 24-layer BERT-large model (with 16 attention heads and 340M parameters) [8].
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context [30].The problem of long-distance dependence is a challenging problem, and it is a task where RNN fails.Transformer-XL is an auto-regressive pre-trained language model developed by the Google AI team that introduces architectural modifications enabling transformers to understand context beyond that fixed-length limitation without disrupting temporal coherence via segment-level recurrence and relative positional encoding schemes.With its 257M parameters, Transformer-XL is more than 1800 times faster than a typical transformer.
The XL here refers to extra long, which means that it has very good performance on the longdistance dependency problem in language modelling (80% longer than RNN).At the same time, it also implies that it was born to solve the problem of long-distance dependence.
GPT2 [31], OpenAI GPT's successor, is a large auto-regressive language model.This large transformer-based language model has 1.5 billion parameters.GPT-2 was trained on a dataset of 8 million web pages (40 GB of Internet text) called WebText (10 × larger than GPT), and it was trained simply to predict the next word (Language Model objective).GPT2 showcased zero-shot task transfer capabilities for various tasks such as machine translation and reading comprehension.ERNIE 2.0 [32], a continual pre-training framework for language understanding, is a pre-trained language model based on Baidu's own deep learning framework, PaddlePaddle, which can incorporate lexical, syntactic, and semantic information at the same time.We distinguish two general pre-training tasks: word-aware pre-training tasks (which include the Knowledge Masking Task, Capitalization Prediction Task and the Token-Document Relation Prediction Task) and semantic-aware pre-training tasks (which include Sentence Reordering, Sentence Distance, Discourse Relation Task, and IR Relevance Task).ERNIE was trained for English and Chinese languages.The experimental results show that significant improvements have been made in different knowledge-driven tasks, and they are comparable to the existing BERT and XLNET models on other common tasks.ERNIE 2.0 version ranks first in the GLUE rankings (as of January 2020).
XLNet is proposed by researchers at Google Brain and CMU [33].It borrows ideas from Transformer-XL autoregressive language modeling [33] and autoencoding (e.g., BERT).XLNet avoids the drawbacks of previous techniques by introducing a variant of language modeling called permutation language modeling (PLM) during the training process, where the order of the next token prediction can be right to left or left to right (randomly).This drives the model to capture the bidirectional dependencies (bidirectional contexts) and make it a generalized order-aware autoregressive language model.With its 340M parameters, XLNet beats BERT on 20 NLP downstream tasks and succeeds in the SOTA results on 18 tasks.
RoBERTa: Robustly Optimized BERT Pretraining Approach [34], developed by Facebook, is built on BERT's language masking strategy and makes some changes related to the key hyperparameters in BERT.RoBERTa uses 160 GB of training data instead of the 16 GB dataset originally used with BERT.To improve the training procedure [34], RoBERTa eliminates the Next Sentence Prediction (NSP) task.It also introduces dynamic masking so that a new masking pattern is generated each time a sentence is fed into training, whereas BERT use a fixed masked token during training.It was also trained more iteratively (more epochs and larger mini-batch sizes).The total number of parameters reached 355M parameters.
XLM (Cross-lingual Language Model Pretraining) was developed by Facebook [35].XLM uses a preprocessing technique known as Byte Pair Encoding (BPE) and a duallanguage training mechanism with BERT in order to learn relations between words in different languages without any cross-lingual supervision.Various objectives were used to pretrain XLM, including the causal language modeling (CLM) objective (next token prediction), a masked language modeling (MLM) objective (BERT-like), or a Translation Language Modeling (TLM) object (extension of BERT's MLM to multiple language inputs).The model outperforms previous models in a multi-lingual classification task and significantly improves machine translation when a pretrained model is used for the initialization of the translation model.XLM shows the effectiveness of pretrained representations for cross-lingual language modeling (both on monolingual data and parallel data) and cross-lingual language understanding (XLU).
CTRL (Conditional Transformer Language) is a 1.63B parameter PLM [36].A research team at Salesforce trained CTRL on a very large corpus of around 140 GB of text data with a causal language modeling (CLM) objective.It has a powerful and controllable artificial text generation function, allowing it to generate syntactically coherent writing (predicting the next token in a sequence).CTRL can also improve other NLP applications by fine-tuning specific tasks or transferring the learned representation of the model.
Megatron-ML was proposed by NVIDIA in "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" [37].Megatron-LM is an NLP model released by NVIDIA.They train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT.It is a huge model that is over 24× the size of BERT and approximately 6× the size of OpenAI's GPT-2.Increasing the size of both GTP-2 and BERT's models made them faster and more accurate.
ALBERT: In conjunction with the trend of larger and larger models (as with the huge Megatron, trained on 512 GPUs), it is fascinating to observe how new architecture enhancements, such as ALBERT (a lite version of BERT), provide higher accuracy with substantially fewer parameters [38].It is the outcome of a collaboration between Google's research team and Chicago's Toyota Technological Institute.The goal is to introduce two parameter-reduction techniques (decomposed parameterized embedding and cross-layer parameter sharing) to address the problems of training length and memory restrictions.When compared to the original version of BERT, this model scales substantially better.ALBERT has 18× fewer parameters and can be trained about 1.7× faster than BERT-large.For pretraining, ALBERT utilizes its own training method called Sentence-Order Prediction (SOP), which differs from the NSP objective for BERT.The problem with NSP, according to the authors, is that it conflates topic prediction and coherence prediction.
DistilBERT: Knowledge distillation method, which was initially described in 2006 [39] and 2015 [40], is a compression technique in which a compact model is trained to replicate the behaviour of a larger model or an ensemble of models.Developed by HuggingFace [41], DistilBERT learns a distilled (approximate) version of BERT, retains 95% performance on GLUE while employing half the number of parameters (only 66 million parameters, instead of 110M/340M).The idea is that when a large neural network has been trained, its whole output distributions of the network can be approximated by using a smaller network.DistilBERT is smaller, faster, cheaper, and lighter compared to BERT, yet its performance is about equivalent to that of BERT.Sharing the same tokenizer as BERT's bert-base-uncased, DistilBERT processes the sentence and passes along some information to the next model.DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace.
DistilGPT2 is a model that has been pretrained under the supervision of GPT2 (the smallest of the GPT-2 variants) using the OpenWebTextCorpus dataset.With 82M parameters, DistilGPT-2 is smaller than even the smallest version of GPT-2 (124M parameters).GPT2 has been compressed into DistilGPT2 using the same approach as the knowledge distillation technique.DistilGPT2 is approximately twice as fast as GPT2 OpenAI.A primary goal is to keep the weight down as much as possible.
T5 is a large pre-trained encoder-decoder model that converts all NLP problems, whether they are unsupervised or supervised, into a text-to-text format [46].It is trained using teacher forcing.This means that when we train, we always need an input sequence and a target sequence that goes with it.T5 performs well on a wide range of tasks out-of-the-box by prepending a different prefix to the input corresponding to each task (e.g., translation: translate, English to Arabic, summarization: summarize, etc.).Various T5 variants (T5-small, T5-base, T5-large, T5-3b, T5-11b.) can be trained/fine-tuned both in a supervised and unsupervised fashion.
BART is a type of PLM.Facebook first described it in the paper "BART: Denoising Sequence-to-Sequence Natural Language Generation, Translation, and Comprehension Pre-training" [47].BART is a transformer encoder-encoder (seq2seq) model.It comes with two main parts: a bidirectional encoder (BERT-like) and an autoregressive decoder (GPT-like).BART is learned by first corrupting text using an arbitrary noise function and developing a model to restore the original text.BART is most effective when finetuned for text generation (e.g., summarization, translation), but it is equally effective for comprehension tasks (e.g., text classification, and question answering).
XLM-R is a new cross-lingual language model from Facebook AI, with the letter "R" standing for Roberta [48].XLM-R is a transformer-based multilingual masked language model that has been pre-trained on a dataset in 100 languages (2.5 TB of newly created clean Common Crawl data).It outperforms prior released multi-lingual models, such as mBERT or XLM, on the GLUE and XNLI benchmarks (classification, sequence labelling, and question answering).
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization.Based on Transformer, Google's PEGASUS is a generative language model capable of generating words to complete open text tasks [49].With a new self-supervised objective called gap-sentence generation (GSG) for Transformer encoder-decoder models, the finetuning performance on abstractive summarization was improved, achieving state-of-the-art results across a wide range of summarization datasets.
Reformer architecture [50] extends the boundaries of long sequence modeling by processing up to 500,000 tokens at once.The Reformer authors came up with four new features (Reformer Self-Attention Layer, Chunked Feed Forward Layers, Reversible Residual Layers and Axial Positional Encodings).Many NLP tasks, e.g., summarization and question answering, can benefit from this model as it is designed to efficiently handle very long sequences of data.[9], is intended to enable quick customisation for a broad range of natural language understanding tasks, utilizing a variety of algorithms (classification, regression, structured prediction) and text encoders (e.g., RNNs, UniLM, RoBERTa, BERT).The adversarial multi-task learning paradigm is a key aspect of MT-DNN that enables robust and transferable learning.MT-DNN enables multi-task knowledge distillation, which compresses a DNN model significantly without sacrificing performance.

MT-DNN (Multi-Task Deep Neural Network), introduced by Microsoft
REALM model was proposed in REALM: Retrieval-Augmented Language Model Pre-Training [51].In this retrieval-augmented language model, documents are retrieved from a textual knowledge corpus and used to process question-answer tasks.
T-NLG (Turing Natural Language Generation) is a Transformer-based generative language model developed by the Microsoft Turing team.It is a type of generative language model [52].Using its 17B learned parameters (78 transformer layers with a hidden size of 4256 and 28 attention heads), T-NLG can generate words to perform open-ended textual tasks, enhancing incomplete sentences, producing answers to questions and summarizing documents.In a nutshell, the goal is to answer as correctly and smoothly as possible, as humans are capable of doing in any particular situation.T-NLG was the biggest model ever issued (in early 2020).This research would not have been possible without the use of the DeepSpeed library and the ZeRO optimizer.
ELECTRA is an acronym that stands for "Efficiently Learning an Encoder that Classifies Token Replacements Accurately" [53].ELECTRA uses a new pretraining task, called "replaced token detection" (RTD), to train a bidirectional model (similar to a masked language model (MLM)) that learns from all input positions (like a LM).Instead of training a model to recover masked tokens "[MASK]" such as BERT, ELECTRA trains a discriminator model to distinguish between "real" and "fake" input tokens (corrupted tokens were replaced by a generator network: GANs).
TAPAS: Weakly Supervised Table Parsing via Pre-training, is a BERT-like transformer model using just raw tables and their related texts [54].More precisely, it was pretrained with two particular objectives: Masked Language Model (MLM) and Intermediate pretraining to predict (classify) whether a sentence is supported or refuted by the contents of a table.In this manner, the model learns an inner representation of the English language used in tables and associated texts, which can then be used to extract features useful for downstream tasks such as answering questions about a table or determining whether a sentence is entailed or refuted by the contents of a table.To fine-tune the pre-trained model, one or more classification heads were added.
LongFormer model (The Long-Document Transformer) was proposed by Allen AI [55].Previous models had a significant disadvantage in that they could not handle longer sequences (e.g., BERT, which is limited to a max of 512 tokens at a time).LongFormer is suggested to overcome these long sequence issues.Self-attention computing is reformulated in this transformer-based design in order to decrease model complexity.The long-former model is better than the Transformer-XL model and better than RoBERTa when it comes to long document tasks.
MPNet (Masked and Permuted Pre-training for Language Understanding) [56] is a revolutionary pre-training approach that retains the strengths of BERT and XLNet while avoiding the drawbacks (limitations of Masked Language Modeling in BERT and Permuted Language Modeling in XLNet), achieving a better level of accuracy.
RAG, short of Retrieval-augmented generation [57], models combine the capabilities of pre-trained dense retrieval (Facebook AI's DPR system) with seq2seq (BART model) generator models to produce a more powerful model.These models perform well on QA tasks and language generation tasks.
GPT-3, an autoregressive model, developed by researchers at OpenAI [10].About 3 trillion words were utilized to train GPT-3, including data from the Common Crawl corpus, as well as data from the web, books, and Wikipedia.In GPT-3, there are 175 billion parameters.There are ten times more parameters than the MT-NLG language model and 100 times more parameters than GPT-2.GPT-3 performs effectively in downstream NLP tasks under zero-shot and few-shot settings due to the huge number of parameters and large dataset training.
DeBERTa (Decoding-enhanced BERT with disentangled attention) was pre-trained on vast volumes of raw text using self-supervised learning [58].As with previous PLMs, the goal of DeBERTa is to provide generic representations of language that may be used in a variety of downstream NLU applications.With the help of three advanced techniques, DeBERTa enhances SOTA PLMs (including BERT, RoBERTa, and UniLM).These techniques are a disentangled attention mechanism, an improved mask decoder, and a fine-tuning virtual adversarial training method.
MARGE is a seq2seq model that has been trained with an unsupervised paraphrasing objective as an alternative to the MLM objective [59].With a little fine-tuning, this leads to very good results on a wide range of discriminative and generative tasks in a lot of different languages.
GShared: introduced in "Scaling Giant Models with Conditional Computation and Automatic Sharding" [60].Gshared (a sequence-to-sequence Transformer model with 600B parameters) attempts to improve the training efficiency of large-scale models using sparsely-gated Mixture-of-Experts (MoE) layers.The Gshared model aids in scaling up multilingual neural machine translation.
MT5 by Google [61] is a variation of T5 (multilingual T5).It is a massively multilingual, pretrained text-to-text transformer model.MT5 is trained in a unsupervised manner on the mC4 corpus [61] following a similar way as T5.MT5 supports a total of 101 !languages.MT5 demonstrated that the T5 recipe is easily adaptable to a multilingual environment and achieved strong performance on a diverse set of benchmarks.
KEPLER is an acronym for Knowledge Embedding (KE) and Pre-trained LanguagE Representation [62], and it is used to enhance the integration of factual knowledge into PLMs as well as to develop effective text-enhanced KE with strong PLMs.Experimental results show that KEPLER achieves state-of-the-art performances on various NLP tasks and also works remarkably well as an inductive KE model on knowledge graphs (KG) link prediction.The authors also built Wikidata5M for pretraining and evaluation.KEPLER does well as an inductive KE model when it comes to predicting KG links.It also does well on different NLP tasks.
XLM-E [63], released by Microsoft, is an across-lingual language model that has been pre-trained by ELECTRA-style tasks [53].It is pretrained with the masked LM, translation LM, multilingual replaced token detection (MRTD), and translation placed token detection (TRTD).The XLM-E model beats the baseline models on a variety of cross-lingual comprehension tests while consuming significantly less computing time than the baseline models.Furthermore, the results of the investigation reveal that XLM-E has a higher likelihood of achieving superior cross-lingual transferability.
The Megatron-Turing NLG (MT-NLG) pre-trained language model, a joint effort from NVIDIA and Microsoft, has 530 billion paremeters [64].It is trained using Megatron and DeepSpeed on DGX SuperPod.In comparison to the existing largest model, it has three times the number of parameters as the current biggest model of this type (Turing-NLG 17B and Megatron-LM).In zero-, one-, and few-shot situations, the 105-layer transformer-based MT-NLG outperformed previous state-of-the-art models and established a new benchmark for large-scale language models in both model scale and quality across a wide range of natural language tasks (e.g., completion prediction, reading comprehension, commonsense reasoning, NL inferences, and word sense disambiguation).
T0: Multitask Prompted training Enables Zero-Shot Task Generalization [65].It is a variation of the T5 encoder-decoder model that receives textual inputs and generates target answers.It is trained on a multitask combination of natural language processing datasets that have been divided into distinct tasks.Each dataset has numerous prompt templates that are used to structure example instances as input/target pairs.After training on a varied set of tasks, the T0 model is evaluated on zero-shot generalization to previously unseen tasks.This method is an excellent alternative to unsupervised language model pretraining since it allows the T0 model to achieve exceptional zero-shot performance on numerous standard datasets and outperform models many times its size.
KnGPT2 [66] is a compressed version of GPT-2 using Kronecker decomposition followed by an extremely little pre-training on a tiny percentage of the training data with intermediate layer knowledge distillation (ILKD).A comparison of KnGPT2's performance on benchmark tasks for language modeling and the General Language Understanding Evaluation reveals that it outperforms its competitor (DistilGPT2).
GLaM (Generalist Language Model), is a series of language models from Google AI [67].The largest GLaM has 1.2 trillion parameters (7× larger than GPT-3).GaLM scales model capacity via a sparsely active mixture-of-experts (MoE) architecture, resulting in better overall zero-shot and one-shot performance across 29 natural language processing tasks.
Gopher: DeepMind trained a series of transformer language models of various sizes, spanning from 44 million parameters to 280 billion parameters.The goal is to identify which model size (scale) substantially enhances performance the most (the largest model named Gopher [68]).
XGLM is a multilingual language models [69] developed by Meta AI.It is trained using a large-scale multilingual dataset with tokens from 30 different languages, for a total of 500 billion tokens.XGLM largest release has 7.5B parameters.It is demonstrated that XGLM7.5Boutperforms a GPT-3 model of equivalent size (6.7Bparameters), especially in the less-resourced languages.A variety of reasoning and natural language inference tests show that XGLM7.5B has high zero and few-shot learning capabilities.
LaMDA is a family of neural language models based on transformers that are optimized for dialog applications [70].It contains around 137B parameters and has been pre-trained on 1.56T words of public conversation data and online content.Quality, safety, and groundedness are the three main factors (metrics) used to evaluate LaMDA.Their findings indicate that model scaling enhances the quality (sensibility, specificity, and interestingness), safety, and groundedness metrics to some extent.On the other hand, combining scaling with fine-tuning yields significant gains in performance on all metrics.
GPT-NeoX-20B, a collaborative initiative between EleutherAI and CoreWeave in early 2022 [11].According to its architecture, it is an autoregressive transformer decoder model that was developed in the same manner as the Generative Pre-trained Transformer 3 (GPT-3).With its 20 billion parameters trained on a curated dataset of 825 GB named the Pile [71], EleutherAI aims to make GPT-NeoX-20B the largest publicly available pretrained generalpurpose autoregressive LM.An evaluation of the model was performed using Gao et al. [71], an open source codebase for language model evaluation, which includes a variety of model APIs.GPT-NeoX-20B excelled in knowledge-based and mathematical tasks.
Chinchilla (70B Parameter), DeepMind's new pre-trained language model, contributes to the development of an effective training paradigm for large auto-regressive language models with low computational complexity [72].In order to determine the optimal model size and training length, three techniques have been presented: experimenting with different model sizes and training token numbers, as well as IsoFLOP profiles and parametric loss functions to fit a model.Chinchilla significantly outperforms Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG (530B).
PaLM: is a Pathways Language Model recently unveiled by Google AI [73].It is a 540-billion parameter transformer language model trained using Pathways.Pathways is a novel ML approach recently presented by Barham et al. [74].It enables extremely efficient training of very large neural networks.This pre-trained language model showcases the impact of scale in the context of few-shot learning.
OPT: Open Pre-trained Transformer Language Models [75].Proposed by Meta AI, it is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.OPT-175B is trained on public datasets (CCNewsV2, a subset of the Pile, PushShift.ioReddit, etc.).Results show that OPT-175B (96 layers, 96 attention heads, 12,288 embedding size, 2M global battch size) performance is comparable to OpenAI's GPT-3, while requiring only 1/7th of the carbon footprint to develop.To enable researchers to study the effect of scale alone, Meta AI released a smaller-scale baseline models (125M, 350M, 1.3B, 2.7B, 13B, 30B), trained on the same dataset and using similar settings as OPT-175B (weight initialization: Megatron-LM, optimizer: AdamW, and a dropout of 0.1 throughout).

Summary and Comparison
Recent breakthroughs in neural architectures, such as the Transformer, coupled with the increased availability of large-scale datasets, have revolutionized the field of large language models.This advanced the state-of-the-art in a variety of NLP tasks and increased the use of PLMs (BERT, GPT-3, XLNet, GPT-NEO, T5, etc.) in production contexts.Different architectures with different settings have been proposed.Table 1 lists the settings for training selected pre-trained language models.As previously stated, the availability of public datasets has risen in recent years.As a result, the size of PLMs training datasets increased.Table 2 contains an overview of selected PLMs.We have provided all of the datasets used for training, including (Wikipedia, the Pile, Common Crawl, etc.) of selected language models.At the end of this section, we presented a full list of the current state-of-the-art of key pre-trained language models.Table 3 provides a summary of the PLMs, architectures and training parameters for each.

PLMs Applications
With the recent advances in deep learning and the large number of pre-trained language models, a wide range of NLP tasks can be solved efficiently.PLMs can be taken advantage of by fine-tuning them for the task, prompting them to execute the intended task, or reshaping the task as a text generation issue and applying PLMs to solve it appropriately.Text generation: reducing an NLP challenge to a text generation task in order to exploit knowledge encoded in generative language models such as GPT-3 [31], T5 [46], and GPT-Neox-20B [11].

PLMs Parameters
PLMs have advanced rapidly in recent years due to the availability of large-scale computers and datasets.At the same time, recent research has demonstrated that large language models are excellent few-shot learners, achieving high accuracy on a wide range of NLP datasets.As a result, the number of cutting-edge NLP models has increased at an exponential rate (e.g., T5: 11B parameters, GPT-3: 175B parameters, LaMDA: 137B parameters, MT-NLG: 530B parameters, GShard: 600B parameters, etc.). Figure 3 shows some of the most popular large pre-trained language models.

Compression Methods
State-of-the-art PLMs are often made up of millions or even billions of parameters.This requires a substantial amount of memory and compute (GPUs and TPUs infrastructure).The topic of model compression emerges as a result of the obvious need to minimize the size and complexity of neural networks for deployment.Variant strategies have been used to compress models and determine the ideal mix of model sizes versus accuracy.Pruning [80] (weights, neurons, heads and layers, blocks), quantization, knowledge distillation [81][82][83][84] (DistilBERT, TinyBERT), parameter sharing (ALBERT), matrix (tensor) decomposition, weight squeezing [85], and dynamic inference acceleration are among the emerged techniques.For further information, please see details here [86,87].

Challenges, and Future Directions
Though PLMs have proven their power for various NLP tasks, challenges still exist due to complexity of languages and the need of computation.In this section, we discuss some challenges and expose future research directions.

More Data, More Parameters
Currently, PTMs have not yet reached their upper bounds.The introduction of the GPT-3, or Megatron-Turing NLG, has proved that the more data a model is trained on, the better the representations are at transferring to other NLP problems.We think that in the future, big companies such as Nvidia, Google, and Open-AI will keep working on the huge model as a research goal, but that it will be out of reach for most people.

Others Compression Alternatives
Model compression (in size and/or latency) has been widely adopted to obtain lightweighted models without significantly diminished accuracy.There are a number of well-known methods, including pruning, quantization, low-rank approximation, knowledge distillation, and neural architecture search (NAS).This line of research should be looked into more deeply to come up with powerful ways to make light-weight models from heavier ones.

New Architectures of PLMs
For pre-training purposes, the transformer has been shown to be a highly successful architectural solution in capturing contextualised representations and encode relations.The primary restriction on the transformer, on the other hand, is the complexity of its computations.
Most transformer models are inefficient at processing long sequences.The limit is derived from the transformer architecture's positional embeddings, for which a maximum length must be set (The magnitude of such a size is related to the amount of memory needed to handle texts; most PLMs are trained to handle sequences up to 512 tokens).However, the model is requested to analyze lengthier sequences for a variety of NLP tasks.The model will only process 512 tokens at a time, truncating anything longer to be processed later or in parallel (if memory allows.A Titan RTX with 24 GB of GPU RAM, for example, can barely fit 24 samples of 512 tokens in length at the same time).This issue arises due to the computation and memory complexity of the self-attention module: attention layers scale quadratically with the sequence length, posing a problem with long texts.As a result, the use of more GPUs and TPUs is compulsory in order to have enough memory to speedup training (Table 4).Large PLMs, such as GPT-3, T5, Megatron-LM, and Turing-NLG, for example, require roughly 10,000 GPUs for training and may potentially cost tens of millions of dollars (to my knowledge, exact numbers are not available).
To overcome this restriction, the transformer's architecture must be improved in upcoming PLMs.Moreover, it is critical to investigate more efficient non-transformer architectures for PLMs.

Responsible Compute and GreenAI
In the field of PLMs, significant progress has been made.With the increased availability of large-scale datasets (Wikipedia, the Pile, Common Crawl, etc.) and computational capabilities (GPUs, TPUs), the energy consumed by large PLMs (GPT-3, Megatron-LM, T5, GlaM, LaMDA, PaLM, etc.) is becoming a growing concern [88].An empirical study [89] has estimated the carbon footprint of several NLP models and concluded that this trend is both environmentally unfriendly and prohibitively expensive [90].To summarize, it is critical to use data-centric techniques to make AI greener, more inclusive, and to democratize Green AI [88][89][90].

Conclusions
This paper presents an overview of the recent advances achieved in pre-trained language models for NLP.For the most part, we compiled a thorough list of current PLMs, including their architectures, number of parameters, training data, training objectives, compression methods, application strategies, etc.Additionally, we reviewed several challenges and suggested several possible future research directions for PLMs.

Table 1 .
Model architecture details in terms of number of layers, number of attention heads, dimension of contextual embedding and trained parameters.

Table 2 .
Summary of major datasets used in pre-training language models.

Table 4 .
Training parameters vs Training duration.