A Systematic Review of Transformer-Based Pre-Trained Language Models through Self-Supervised Learning

Abstract: Transfer learning is a technique used in deep learning applications to transfer learned inference to a different target domain. The approach mainly addresses the problem of small training datasets, which causes model overfitting and degrades model performance. The study was carried out on publications retrieved from various digital libraries such as SCOPUS, ScienceDirect, IEEE Xplore, ACM Digital Library, and Google Scholar, which formed the Primary studies. Secondary studies were retrieved from Primary articles using the backward and forward snowballing approach. Based on set inclusion and exclusion parameters, relevant publications were selected for review. The study focused on transfer learning with pretrained NLP models based on the deep transformer network. BERT and GPT were the two elite pretrained models trained to capture global and local representations from larger unlabeled text datasets through self-supervised learning. Pretrained transformer models offer numerous advantages to natural language processing models, such as knowledge transfer to downstream tasks that deals with the drawbacks associated with training a model from scratch. This review gives a comprehensive view of transformer architecture, self-supervised learning and pretraining concepts in language models, and their adaptation to downstream tasks. Finally, we present future directions for further improvement in pretrained transformer-based language models.


Introduction
The transformer network is a novel architecture that produces optimal performance in language processing applications. Its success rests on its ability to capture long-range dependencies from large datasets. Unlike classical machine-learning models, the transformer does not require hand-crafted features, which are a well-known bottleneck. The advancements in computer hardware, the availability of larger datasets, and advanced word embedding algorithms have increased the adoption of DL models for vision tasks [1] and for solving NLP problems [2]. Transfer learning techniques provide optimal results in NLP tasks through pretraining. Some of these include language representation [3], natural language generation [4], language understanding [5][6][7][8], text reconstruction [9], and abstractive text summarization [10,11]. Such learning techniques require large amounts of labelled and annotated data to yield good performance, which is a drawback because much time is required to generate these annotated datasets. Transfer learning [12] is one technique used to address this drawback. In transfer learning, models trained on larger datasets, such as ImageNet [13], are used as the base model to train target models with few datasets. Classification and detection problems in vision processing are mostly solved through transfer learning [14][15][16][17]. Transfer learning (TL) has likewise shown outstanding performance in image registration and segmentation [18,19].
For example, a base model such as VGG-16 [20] extracts general features, and the knowledge gained during training is transferred to downstream tasks through fine-tuning [21] with optimal performance [22][23][24]. The approach eliminates the drawbacks associated with training a model from scratch. The main contributions of this review are:
• An overview of the transformer network architecture and its core concepts.
• Self-supervised learning based on unlabeled datasets for transformer-based pretrained models.
• The fundamental principles of pretraining techniques and their adaptation to downstream tasks.
• Future trends for pretrained transformer-based language models.
The study is structured as follows: Section 2 describes the retrieval of publications for this study following PRISMA reporting standards. Section 3 describes the core structure of the transformer network. Section 4 explains self-supervised learning and its application in pretraining. Section 5 discusses the various pretrained models proposed in the literature. Section 6 covers pretraining downstream tasks with the transformer network. Section 7 presents challenges and future directions for efficient pretrained models. Section 8 concludes the review.

Materials and Methods
This section describes methods and techniques employed in writing this paper, following PRISMA reporting standards.

Review Planning
This review was planned and executed by, first, formulating research questions that address the set objectives of the study. Based on the research objectives, we set up a search strategy and criteria, which served as a guide to include or reject papers or publications.

Objectives and Research Questions
Deep TL transfers knowledge gained in one domain to another target domain to deal with the problem of overfitting caused by small training datasets. Pretraining is a transfer learning technique widely used in language processing (LP) tasks. The BERT pretrained model has given birth to other variants that handle different language tasks in place of traditional deep learning algorithms such as RNNs, owing to its efficacy in dealing with long-range sequences. This review seeks to understand the various pretrained language models proposed in the literature. The following research questions aid in achieving the aim of this paper:
RQ1: What are the various transformer-based pretrained models available for NLP processing?
RQ2: What are the various pretraining techniques available?
RQ3: What datasets or corpora are used for pretraining language models?
RQ4: What are the challenges associated with transformer-based language model pretraining based on self-supervised learning?
RQ5: How and when should a pretraining model be chosen for an NLP task?

Search Strategy
We searched for relevant publications about NLP applications based on transformer networks for pretrained language models. The search strings for article retrieval were formulated based on the study objectives and research questions. Three main categories of keywords were formulated: "transformer-based natural language processing", "pretrained language models", and "transfer learning approaches for natural language processing". The selected set of keywords used in the publication search is in Table 1. The keywords were linked using Boolean operators such as "OR" and "AND" to complete the search string for retrieving articles. The search strings had to be modified based on individual database requirements without compromising the selected keywords. This review considered publications on transformer networks proposed for NLP applications from 2018 to 2022. Electronic databases such as SCOPUS, Google Scholar, SpringerLink, IEEE Xplore, ACM Digital Library, and ScienceDirect were the sources of articles used in this study.

Snowballing Approach
The snowballing technique [36] was used to retrieve research articles in conjunction with database searches. Articles retrieved from the various digital databases (Primary studies) assisted in getting additional publications using the reference list (backward snowballing, BSB) and citations (forward snowballing, FSB). The approach helped cover all the relevant articles needed for this study without missing key publications.

Screening Criteria
The publications retrieved from databases and through the snowballing approach were screened by two authors (Ramkumar T. and Evans Kotei). Only transformer-based publications were considered during the screening process.

Exclusion Criteria
This review does not include publications with less than four pages, symposium papers, conference keynotes, and tutorials. Publications downloaded multiple times due to articles having multiple database indexing were identified and removed accordingly. Only relevant articles were selected for the study to form the Primary studies. Figure 1 is the PRISMA flow diagram to explain the search process.

Transformer Network
The transformer network has two parts (the encoder and the decoder), with self-attention for neural sequence transduction [37,38]. The encoder maps an input sequence of symbol representations (x1, . . . , xn) to a sequence of continuous representations z = (z1, . . . , zn). The decoder then generates an output sequence (y1, . . . , ym) one element at a time. Each step is auto-regressive, consuming the previously generated output as supplementary input for the next word. Figure 3 shows the transformer model (an input sequence is converted into a series of continuous representations by the encoder component of the transformer's architecture before being supplied to the decoder; the decoder combines the encoder's output with the decoder's output from the preceding time step to produce an output sequence).

Encoder and Decoder Stacks
A position-wise fully connected feed-forward network and multi-head self-attention form part of the encoder/decoder layers. Additionally, there is a residual layer with a normalization function to ensure the models' optimal performance [25].

Attention
Attention models produce good results through their query (Q), key (K), and value (V) vectors. Predictions are based on these three variables, as shown in Figure 3. Attention uses the scaled dot-product attention function for value localization based on two input pairs (queries and keys). The dimensions are denoted dk (key dimension) and dv (value dimension). Figure 4 shows scaled dot-product attention and multi-head attention (the multi-head attention mechanism concurrently implements several single attention functions, optionally masking the output of the scaled multiplication of the Q and K matrices; the multi-head self-attention is comparable to the encoder's first sublayer; on the decoder side, this multi-head mechanism receives the keys and values from the encoder's output and the queries from the preceding decoder sublayer, so the decoder can focus on every word in the input sequence).
The attention weights are computed as the dot-product of Q and K, scaled by 1/√dk, and passed through a SoftMax function:

Attention(Q, K, V) = SoftMax(QK^T/√dk)V

There are two kinds of attention (additive attention and dot-product attention) [39]. Comparatively, the dot-product attention mechanism is faster and more space-efficient because it can be implemented with highly optimized matrix multiplication. For small dk values the two mechanisms perform similarly, but for larger dk values additive attention outperforms unscaled dot-product attention [40], because large dot products push the SoftMax into regions with very small gradients; the 1/√dk scaling counteracts this effect. Multi-head attention within the transformer architecture is defined by:

MultiHead(Q, K, V) = Concat(h_1, . . . , h_n)W^h, where h_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

The outputs h_i from the individual attention heads are concatenated and projected back to the model dimension by multiplying with W^h, such that W^h ∈ R^((n×dv)×dmodel) and the output O ∈ R^(L×dmodel).
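The scaled dot-product attention described above can be sketched in a few lines of pure Python. This is a minimal illustration with toy matrices, not a transformer implementation; real systems use optimized tensor libraries.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d_k)) V,
    with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # each output row is a weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# toy example: 2 queries, 3 key/value pairs, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

Because the SoftMax weights sum to one, each output row is a convex combination of the value vectors; multi-head attention simply runs several such functions in parallel on learned projections of Q, K, and V.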

Self-Supervised Learning (SSL)
This is a novel technique for acquiring collective information or knowledge from unlabeled datasets through pseudo-supervision. Even though self-supervised learning is new, it has been adopted across several disciplines, such as NLP, computer vision, speech recognition, and robotics [44][45][46][47][48].

Why Self-Supervised Learning?
Most deep learning applications are trained with supervised learning, which requires human-annotated instances to learn. Supervised learning depends on labelled data, but good-quality data are hard to come by, specifically for complex problems such as object detection [49,50] and image segmentation [51,52], where detailed annotation is required. Meanwhile, unlabeled data are readily available in abundance. The advantage of a supervised learning application is that models perform very well on the specific datasets they are trained on. However, generating human-annotated labels is a cumbersome process and requires domain experts, who are scarce and not readily available, especially in the medical sector. Models trained through supervised learning also suffer from generalization errors and spurious correlations because the model only knows the training pattern and struggles with unseen data. Despite the supervised learning approach being dominant in deep learning applications, it has some drawbacks, summarized below:
• Supervised learning requires a human-annotated dataset, which is expensive to generate, especially for domain-specific tasks.
• Poor generalization, because the model tends to memorize the training data and struggles with unseen data during classification.
• Limited applicability of deep learning in domains where labelled data are scarce, for example, the medical health sector.
Based on these drawbacks, some solutions have been provided through extensive research. One is self-supervised learning, which eliminates the requirement for human-annotated labels: the labels are generated automatically by the algorithm. The intuition behind self-supervised learning is to learn representations from a given unlabeled dataset using self-supervision and then fine-tune them with a few labelled examples for supervised downstream tasks such as classification, segmentation, or object detection.

Self-Supervised Learning-Explained
In this type of learning, a model learns from part of the input dataset and evaluates itself with the other part of the dataset. The basic idea of SSL is to transform the unsupervised problem into a supervised problem by generating auxiliary pretext tasks from the input data, such that, while solving them, the model learns the underlying structure of the data. Transformer models such as BERT [5], ELECTRA [9], and T5 [11] produce optimal results in NLP tasks. The models are first trained on larger datasets and later fine-tuned with a few labelled examples.
Self-supervised learning for pretraining models comes in multiple forms. For example, the models presented in [5,6] employed masked language modelling (MLM) with cross entropy as a loss function and next sentence prediction (NSP) using sigmoid loss. Through pretraining on the larger unlabeled dataset, the model extracts general language representations, enabling downstream tasks to achieve better performance with less labelled data. Pretraining over larger unlabeled datasets through SSL provides low-level information or background knowledge, which optimizes model performance even on smaller labelled datasets.
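The MLM pretext task mentioned above can be sketched as follows: hide a random fraction of tokens and ask the model to recover them, so the labels come from the data itself. This is a toy illustration (in BERT, roughly 15% of tokens are selected, and some are replaced with random tokens or kept unchanged rather than always masked).

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    """Create a masked-language-modelling training pair: randomly hide
    ~15% of tokens and record the originals as prediction targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

sentence = "the transformer learns representations from unlabeled text".split()
masked, targets = mask_tokens(sentence)
```

No human annotation is involved: the corrupted input and the recorded targets form a supervised training pair generated entirely from raw text.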
The paradigm of self-supervised learning shares similarities with both supervised and unsupervised learning. Like unsupervised learning, SSL does not require human-annotated data. The difference is that SSL learns meaningful representations from the unlabeled dataset, whereas unsupervised learning finds hidden patterns. On the other hand, SSL resembles supervised learning because both train with explicit supervision, although in SSL the labels are generated from the data itself. SSL offers general language representations for downstream models through transfer learning, and it achieves better generalization by learning from unlabeled text data.

Self-Supervised Applications in NLP Applications
This learning approach began with NLP tasks in language models such as document processing applications, text suggestion, and sentence completion. The narrative changed after the Word2Vec paper [53] was introduced. The BERT (Bidirectional Encoder Representations from the Transformers) [5] model and its variants are the widely used language models based on SSL. Most of the variants of the BERT model were developed through modification in the last layers to handle a variety of NLP scenarios.

Pretrained Language Models Based on Transformer Network
The intuition of TL has become a standard method in NLP applications. Typical examples of NLP pretrained models include BERT [5], RoBERTa [6], ELECTRA [9], T5 [11], and XLNet [7]. Pretrained models present several opportunities:
• Pretrained models extract low-level information from unlabeled text datasets to enhance downstream tasks for performance optimization.
• Transfer learning eliminates the disadvantages of building models from scratch with minimal datasets.
• Fast convergence with optimized performance, even on smaller datasets.
• Transfer learning mitigates the overfitting problem in deep learning applications caused by limited training datasets [54].

Transformer-Based Language Model Pretraining Process
Transferring knowledge from a pretrained model to a downstream application in natural language processing follows the steps listed below. Corpus identification: Identifying the best corpus to train the model is vital in any pretraining context. A corpus is an unlabeled benchmark dataset, usually adopted to train a model for better performance, similar to BERT [5], which is pretrained on English Wikipedia and BooksCorpus. For a model to perform well, it must train on different text corpora [6,7].
Create vocabulary: The next step is creating or generating the vocabulary, mostly with tokenizers such as Google's Neural Machine Translation (GNMT) [55], byte pair encoding [56], and SentencePiece [57]. A tokenizer generates the vocabulary based on a selected corpus. Table 2 lists the tokenizers and vocabulary sizes used in the literature. Pretrained models such as XLM [59] and mBART [61] had larger vocabulary sizes because they modelled various languages. Pretraining a language model on big data increases the model size but ensures optimal performance. Character-level models such as CharacterBERT, CANINE, ByT5, and Charformer [65][66][67][68] do not use the WordPiece system; rather, a Character-CNN module makes the model lighter and more efficient, especially in specialized domains such as biomedicine.
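The byte pair encoding step mentioned above can be illustrated with a toy vocabulary builder: starting from single characters, the most frequent adjacent symbol pair is merged repeatedly to grow a subword vocabulary. This is a minimal sketch over a made-up five-word corpus, not a production tokenizer.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair to grow a subword vocabulary."""
    # each word is a tuple of symbols; start from single characters
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["lower", "lowest", "low", "low", "slow"]
merges, corpus = bpe_merges(words, num_merges=3)
```

On this corpus the first merges build "lo" and then "low", showing how frequent character sequences become reusable subword units in the vocabulary.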
Construct the learning framework: The learning models learn by minimizing a loss function until convergence. Pretrained models such as [3,6,63] extract sentence semantics that must transfer to the downstream task for optimal performance. For example, SpanBERT [64] is a variant of BERT proposed for masked-span prediction without using individual masked token representations, as performed in [69].
Pretraining approach: One approach to pretraining a language model is to start from scratch. The method works well but is computationally expensive and requires a larger dataset, which limits its use in language model training since it is unaffordable for many. The authors in [70,71] proposed a pretraining framework known as "knowledge inheritance" (KI) that aids the development of new pretrained models from already existing pretrained models. Based on this framework, less computational power and less time are required to pretrain the new model through self-supervised learning.
Parameter and hyperparameter settings: Model parameters and hyper-parameters such as learning rate, batch size [72] mask, and input sequence must be carefully set for quicker convergence and improved performance.
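The hyperparameter settings mentioned above are often paired with a learning-rate warmup schedule, which is standard in transformer pretraining. The sketch below uses illustrative values (the parameter names and numbers are examples, not the settings of any specific model).

```python
def lr_schedule(step, base_lr=1e-4, warmup_steps=10000, total_steps=1000000):
    """Linear warmup followed by linear decay, a common learning-rate
    schedule for transformer pretraining (values are illustrative)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up from 0
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)  # decay to 0

# illustrative pretraining hyperparameters (not from any one paper)
config = {
    "batch_size": 256,    # sequences per optimization step
    "max_seq_len": 512,   # input sequence length
    "mask_prob": 0.15,    # fraction of tokens masked for MLM
    "base_lr": 1e-4,
}
```

Warmup avoids large, destabilizing updates while the randomly initialized layers settle; the subsequent decay helps convergence late in training.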

Dataset
Pretraining language models based on self-supervised learning requires a larger unlabeled training dataset. For NLP tasks, the training dataset can fall into general, social media, language-based, and domain-specific categories. Each category has different text characteristics that make it suitable for a particular language task. A dataset in the general category is clean text written by experts. A dataset from the social media category is noisy and unstructured because it comes from the public, not experts. Text datasets in the domain-specific category, for example, biomedical, finance, and law, contain texts not used in the general domain category. Domain-specific datasets are few in quantity, which makes it challenging to develop an NLP model for domain-specific tasks such as BioBERT [33], ClinicalBERT [34], BLUE [73], and DAPT [74]. Pretraining on a larger dataset offers performance optimization, the BERT model being an example. To affirm this point, the models developed in [6,7] with 32.89 B texts produced good performances. Based on this notion, larger datasets emerged for pretraining language models; a typical example is the CommonCrawl corpus [75]. Models such as IndoNLG [76], MuRIL [77], IndicNLPSuite [78], mT5 [79], mT6 [80], XLM-R [81], XLM-E [82], and INFOXLM [83] are multilingual pretrained models trained on larger datasets, producing optimal performance. A summary of pretraining models with their datasets is in Table 3, including, among others:
• T2T Transformer [11] — pretrained on the Colossal Clean Crawled Corpus (C4) [11]; developed a common framework to convert a variety of text-based language problems into a text-to-text format; evaluated on GLUE [84], RACE, and SQuAD.
• Social media: HateBERT [85] — pretrained on RAL-E; developed to analyze offensive language singularities in English; evaluated with Macro-F1 and Class-F1. SentiX [86] — pretrained on the Amazon review [87] and Yelp 2020 datasets; analysis of consumer sentiments from different domains; evaluated with accuracy.
• Multi-lingual: mT5 [79] — pretrained on mC4, derived from the Common Crawl corpus [75]; a multilingual variant of the T5 model covering 101 languages; evaluated with accuracy and F1 score. mT6 [80] — pretrained on CCNet [62]; an improved version of mT5. INFOXLM [83] — an information-theoretic model for cross-lingual language modelling that maximizes the mutual information between multi-granularity texts; evaluated with accuracy.

Transformer-Based Language Model Pretraining Techniques
This section introduces various pretraining techniques based on transformer networks proposed in the literature for NLP tasks using SSL.

Pretraining from Scratch
Pretraining a language model from scratch was used in elite models such as BERT [5], RoBERTa [6], and ELECTRA [9] for language processing tasks. The method is data driven because the training process runs through self-supervised learning on a larger unlabeled text dataset. Pretraining from scratch is computationally intensive and expensive because computers with high-performance hardware, such as graphical processing units (GPUs), are required.

Incessant Pretraining
In this method (also called continual pretraining), a new language model is initialized from an existing pretrained language model for further pretraining. The initialized weights are not learned from scratch, as in pretraining-from-scratch models. Figure 5 illustrates the transfer of pre-existing weights or parameters from a base pretrained model to a target domain for tuning. This approach is common in developing models for domain-specific tasks. Transformer-based language models such as ALeaseBERT [27], BioBERT [33], infoXLM [83], and TOD-BERT [29] are examples of models initialized from existing pretrained models and later fine-tuned for specific NLP tasks. A key observation of this pretraining method is that it is cost-effective in terms of computational power since training starts from already pretrained parameters. Additionally, less training time is required compared to training from scratch. Figure 5. Incessant pretraining process (in this case, the pretraining task is progressively constructed, and the models are pretrained and fine-tuned to respond to different language understanding tasks).
The BioBERT model was initialized with BERT's weights, which were pretrained on general-domain corpora (English Wikipedia and BooksCorpus). BioBERT was then fine-tuned on corpora from the biomedical domain (PubMed abstracts and PMC full-text articles).
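The initialization step described above can be sketched as follows. This is a toy illustration with made-up parameter names and short lists standing in for weight tensors; real checkpoints hold large tensors, and the "random init" is simplified to zeros.

```python
def init_from_pretrained(base_weights, target_layers):
    """Initialize a target model from a base pretrained checkpoint:
    copy every parameter that exists in the base model and initialize
    the rest fresh (zeros stand in for a random init here)."""
    target, reused = {}, []
    for name in target_layers:
        if name in base_weights:
            target[name] = list(base_weights[name])  # inherit pretrained values
            reused.append(name)
        else:
            target[name] = [0.0] * 4                 # new task-specific layer
    return target, reused

# hypothetical layer names; real models have hundreds of parameters
base = {"embeddings": [0.1, 0.2, 0.3, 0.4],
        "encoder.layer0": [0.5, 0.6, 0.7, 0.8]}
target, reused = init_from_pretrained(
    base, ["embeddings", "encoder.layer0", "classifier"])
```

Only the task-specific head starts fresh; everything the base model learned from the general-domain corpus is carried over and then refined on the in-domain text.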

Pretraining Based on Knowledge Inheritance
As previously indicated, pretraining a language model based on self-supervised learning requires a larger dataset, which makes the method computationally expensive and time-consuming. Just as human knowledge builds on the experience of others, the same phenomenon applies to language model training. The authors in [70] proposed a model known as "knowledge inheritance pretrained transformer" (KIPT), which is similar to knowledge distillation (KD). Refer to Figure 6 for the training process. During pretraining, knowledge distillation provides additional supervision from an existing pretrained model, so the new language model learns from both self-supervision and the inherited knowledge. The learning process combines the two objectives:

L_KIPT = α · L_SSL + (1 − α) · L_KD

where L_SSL and L_KD are the losses from self-supervised learning and knowledge distillation, respectively, α balances the two terms, and L_KIPT is the model's overall loss function.
The proposed knowledge inheritance model operates on the "teacher and student" scenario, where the "student" learns from the "teacher" by encoding the knowledge acquired from the "teacher". The student model extracts knowledge through SSL and from the "teacher" to enhance model efficiency. The approach requires fewer data, making it less computationally expensive with minimal training time compared to purely self-supervised pretraining methods. The CPM-2 model introduced in [71] is a Chinese-English bilingual model developed based on knowledge inheritance with optimized performance.
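The combined objective can be sketched as a simple weighted sum. The weighting with α is an assumption for illustration (the student typically leans on the teacher early in training and on self-supervision later); the numeric loss values below are made up.

```python
def kipt_loss(loss_ssl, loss_kd, alpha):
    """Knowledge-inheritance objective: a weighted sum of the
    self-supervised loss and the distillation loss from the teacher.
    alpha balances self-learning against inherited knowledge."""
    return alpha * loss_ssl + (1.0 - alpha) * loss_kd

# early in training: lean mostly on the teacher's supervision
early = kipt_loss(loss_ssl=2.0, loss_kd=1.0, alpha=0.2)
# later in training: rely mostly on self-supervision
late = kipt_loss(loss_ssl=2.0, loss_kd=1.0, alpha=0.9)
```

Annealing α toward 1 lets the student gradually outgrow the teacher once it has absorbed the inherited knowledge.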

Multi-Task Pre-Training
With this technique, a model extracts relevant information across multiple tasks concurrently to minimize the need for a labelled dataset in a specific target task. The authors in [11] utilized a multi-task pretraining approach to optimize model performance. The Multi-Task Deep Neural Network (MT-DNN) was used for learning representations across several natural language understanding (NLU) tasks. The model depends on a significant quantity of cross-task data, with a regularization effect that results in more general representations to aid in adapting to new domains [96]. Two steps make up the MT-DNN training process: pretraining and multi-task learning. The pretraining phase is the same as in the BERT model. The parameters of all shared and task-specific layers were learned during the multi-task learning stage using mini-batch-based stochastic gradient descent (SGD). In the health domain, finding a single training dataset that includes all the necessary slot types, such as domain classification, intent categorization, and slot tagging for named entity identification, is challenging. A multi-task transformer-based neural architecture for slot tagging solves this issue [97]. Framed as a multi-task learning problem, the slot taggers were trained on many datasets encompassing various slot types. In terms of time, memory, efficiency, and effectiveness, the experimental results in the biomedical domain were superior to earlier state-of-the-art slot-tagging systems on the various benchmark biomedical datasets. The multi-task approach was used in [98] to learn eight different tasks in the biomedical field. The Clinical STS [99] dataset was subjected to multi-task fine-tuning, and the authors repeatedly selected the optimal subset of related datasets to produce the best results. To improve performance after multi-task fine-tuning, the model can be further fine-tuned on the particular target dataset.
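The interleaved training loop of MT-DNN-style multi-task learning can be sketched as follows: at each step, a task is sampled (here proportionally to its size, one common heuristic) and a mini-batch is drawn from it, so the shared encoder sees examples from every task. The task names and examples below are hypothetical.

```python
import random

def multitask_batches(task_datasets, num_batches, seed=0):
    """Build a training schedule that interleaves mini-batches from
    several tasks, sampling each task proportionally to its size."""
    rng = random.Random(seed)
    names = list(task_datasets)
    weights = [len(task_datasets[n]) for n in names]  # size-proportional
    schedule = []
    for _ in range(num_batches):
        task = rng.choices(names, weights=weights, k=1)[0]
        batch = rng.sample(task_datasets[task],
                           k=min(2, len(task_datasets[task])))
        schedule.append((task, batch))   # shared encoder + task head train here
    return schedule

# hypothetical tasks and examples
tasks = {
    "nli": ["premise/hyp 1", "premise/hyp 2", "premise/hyp 3"],
    "sentiment": ["review 1", "review 2"],
}
schedule = multitask_batches(tasks, num_batches=4)
```

Interleaving batches this way is what gives the shared layers their regularizing, cross-task signal; only the small task-specific heads see a single task's data exclusively.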
The outstanding performance of the Multi-task Learning (MTL) model [100] represents the pinnacle of the multi-task pretraining technique in NLP applications. Table 4 summarizes the various pretraining techniques employed in the development of language models for different NLP tasks. Multi-task pretraining is appropriate for domain-specific applications with outstanding performance. Knowledge inheritance performs comparably and is suitable for edge devices since it is computationally less expensive. This information and the suggested literature support researchers in selecting the appropriate pretraining technique for new applications.

For example, CPM-2 [71] (Table 4) is a cost-effective pipeline for large-scale pretrained language models based on KI. The framework is memory-efficient for quick tuning, achieving outstanding performance relative to full-model tuning, although the model needs further optimization.

Word Embedding Types in Transformer-Based Pretraining Models
Word embedding converts character-based datasets into matrix form for a language model to process. There are two major embedding types: primary embeddings and secondary embeddings. Primary embeddings are character, sub-word, or word embeddings that form the vocabulary fed as input to the NLP model. The word-embedding vocabulary consists of every word selected from the pretraining dataset, whereas the character-embedding vocabulary contains only the characters that form the pretraining corpus. Secondary embeddings contain auxiliary information, such as the position and language of the input to the pretraining model. The model size grows with the vocabulary, which comprises both the primary and secondary embeddings [101].

Text/Character Embeddings
The input dataset for most NLP models is a sequence of characters, a combination of characters (sub-word), numbers, and symbols. The CharacterBERT [65], CHAR-FORMER [68], and AlphaBERT [102] are typical examples of character-based embedding pretrained models that utilize characters instead of words for pretraining. On the other hand, the novel BERT model [5], BART [4], RoBERTa [6], and XLNet [7] are pretrained on sub-word embeddings, even though they have varying tokenizers for vocabulary generation. The generated vocabulary consists of letters, symbols, punctuation, and numbers mapped to a dense low-dimensional vector. The learning process is through the random initialization of each character in the vocabulary.

Code Embeddings
This type of embedding is domain-specific, for example, in the medical sector, where special codes represent cases or concepts such as disease, drug prescription, prognosis, therapy, and surgery. Patient information is stored in codes instead of plain text so that only clinical professionals can interpret it. The authors in [103] proposed a transformer-based bidirectional representation learning model on EHR sequences to diagnose depression. The input dataset for the model was code embeddings extracted from an electronic health record (EHR). Med-BERT [104] and BEHRT [105] also use code embeddings as input vocabulary for pretraining through random initialization.

Sub-Word Embeddings
Byte Pair Encoding (BPE), Byte-Level BPE (bBPE), Unigram, and SentencePiece are employed to generate the vocabulary, which serves as the input data for pretraining the language model. A summary of tokenizers used in the literature is in Table 2. Choosing the vocabulary size is critical when using sub-word embeddings: a smaller vocabulary can generate long sequences because words split into many sub-word pieces. The case is somewhat different for models such as IndoNLG [76], MuRIL [77], and IndicNLPSuite [78], which are developed for multilingual language processing, because such models require a large vocabulary to handle different kinds of languages.
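The effect of vocabulary size on sequence length can be shown with a toy greedy longest-match tokenizer (in the spirit of WordPiece-style matching; the vocabularies below are tiny, made-up examples).

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match subword tokenization: repeatedly take the
    longest vocabulary entry that prefixes the remaining text, falling
    back to single characters when nothing matches."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single chars always fall back
                tokens.append(piece)
                i = j
                break
    return tokens

small_vocab = {"un", "real"}                           # few subwords
large_vocab = {"un", "break", "able", "unbreakable"}   # richer vocabulary
word = "unbreakable"

toks_small = greedy_tokenize(word, small_vocab)  # many short pieces
toks_large = greedy_tokenize(word, large_vocab)  # one whole-word token
```

With the small vocabulary the word shatters into mostly single characters, producing a far longer input sequence for the model than the single token the large vocabulary yields.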

Secondary Embeddings
Secondary embedding contains specific information with a purpose about the pretrained model. Positional embedding and sectional embedding are examples of secondary embeddings used in general models to describe the position and also differentiate tokens forming various sentences, especially in language models such as RoBERTa-tiny-clue [94], Chinese-Transformer-XL [95], and XLM-E [82]. There are specific secondary embedding types used in domain-specific transformer-based language models. A few are below.

Positional Embeddings
Transformer-based language models require positional information about the input text to make predictions, because the architecture itself is agnostic to token order. The situation differs for CNN and RNN models: an RNN processes tokens consecutively, one after the other, so its sequential processing does not need explicit positional information. Transformer networks do not process information sequentially, hence the need for order and positional details of tokens during prediction. The positional information is sometimes learned together with other parameters during pretraining [5,9].
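Besides learned positional embeddings, the original transformer used fixed sinusoidal encodings, PE(pos, 2i) = sin(pos/10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos/10000^(2i/dmodel)). A minimal sketch:

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal positional encoding: even dimensions use sine, odd
    dimensions use cosine, at wavelengths that grow geometrically
    with the dimension index."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

pe0 = positional_encoding(0, 8)  # position 0: alternating sin(0), cos(0)
pe5 = positional_encoding(5, 8)  # every position gets a distinct pattern
```

Each position maps to a unique vector, and because the encodings are smooth functions of position, nearby positions receive similar vectors, giving the order-agnostic attention layers a usable notion of token order.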

Sectional Embeddings
In sentence-pair models, both sentence tokens are taken as input simultaneously and differentiated with sectional embedding. Positional embedding varies with tokens in the input sentences, but sectional embedding remains constant.
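How the primary, positional, and sectional embeddings combine into one input representation can be sketched as an element-wise sum, as in BERT-style models. The 2-dimensional tables and token names below are made-up illustrative values.

```python
def embed_pair(tokens_a, tokens_b, tok_emb, pos_emb, seg_emb):
    """Input representation for a sentence pair: each token's vector is
    the sum of its token, positional, and sectional (segment)
    embeddings. The segment embedding is constant within a sentence."""
    tokens = tokens_a + tokens_b
    segments = [0] * len(tokens_a) + [1] * len(tokens_b)
    out = []
    for pos, (tok, seg) in enumerate(zip(tokens, segments)):
        vec = [t + p + s for t, p, s in
               zip(tok_emb[tok], pos_emb[pos], seg_emb[seg])]
        out.append(vec)
    return out, segments

# toy 2-dimensional embedding tables (illustrative values only)
tok_emb = {"hello": [1.0, 0.0], "world": [0.0, 1.0], "bye": [1.0, 1.0]}
pos_emb = [[0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
seg_emb = [[0.0, 0.0], [10.0, 10.0]]

vectors, segments = embed_pair(["hello", "world"], ["bye"],
                               tok_emb, pos_emb, seg_emb)
```

Note how the positional component changes at every token while the sectional component only changes at the sentence boundary, which is exactly the distinction drawn above.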

Language Embeddings
This type of secondary embedding is used in cross-lingual pretrained language models [106,107] to inform the model of the language of the input sentence. For instance, the XLM model is pretrained with the MLM objective on monolingual text data in 100 languages. Each MLM input consists of sentences from a single language, so the language embedding is constant for all tokens of the input sentence.

Knowledge Transfer Techniques for Downstream Tasks
The techniques employed to transfer knowledge, parameters, and pretrained corpora to a downstream natural language processing task include word feature-based transfer, fine-tuning, and prompt-based tuning.

Word Feature Transfer
In traditional natural language architectures such as RNNs, embedding models such as Word2Vec [53] generate the input set of word features. Transformer-based pretrained models such as BERT [5] generate contextual word vectors (word features) analogous to Word2Vec's. The BERT model encodes more information in its word vectors owing to the depth of the transformer architecture with stacked attention layers. As a result, downstream tasks can benefit from word vectors taken from any layer of the network.
The process involves training the downstream model from scratch on top of the extracted word features, without labelled instances being needed for the embeddings themselves. The innovative BERT model is improved upon by the DeBERTa model proposed in [108]. The difference between the two is that DeBERTa uses a disentangled attention mechanism in which each word is represented by two vectors (content and position). The second distinctive technique of DeBERTa is a mask decoder for prediction during pretraining. Combined, these make the model superior to BERT and RoBERTa. The ConvBERT model [105] is also an advancement of BERT because it uses less memory and is more computationally efficient.
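The idea of reading word features out of any layer can be sketched with a toy frozen encoder stack; the layers below are simple placeholders standing in for real attention blocks, and the numbers are invented.

```python
def toy_layer(vectors, scale):
    """Placeholder for one frozen pretrained encoder layer."""
    return [[scale * v for v in vec] for vec in vectors]

def extract_features(token_vectors, num_layers=4, layer=-1):
    """Feature-based transfer: run the frozen stack once and read the
    word vectors out of ANY hidden layer; the pretrained parameters
    (here the per-layer scales) are never updated."""
    states = [token_vectors]
    for depth in range(1, num_layers + 1):
        states.append(toy_layer(states[-1], 1.0 + 0.1 * depth))
    return states[layer]

# Read features from the second hidden layer of the frozen toy stack;
# a downstream classifier would then train on these fixed vectors.
features = extract_features([[1.0, 0.5], [0.2, 0.1]], layer=2)
```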

Fine-Tuning
Recent work has shown that fine-tuning a pretrained base model produces better performance on target tasks than training with only target-task data [109]. The advantage of pretraining is that it provides universal inference about a language [110]. The work in [111] showed that fine-tuning yields optimal performance by examining the English BERT variants. It is also evident that fine-tuning does not fundamentally change the representation but rather adapts it to a downstream task. Fine-tuning was used in [112] to understand how the representation space changes during fine-tuning for downstream tasks.
The study focused on three NLP tasks: dependency parsing, natural language inference, and reading comprehension. Fine-tuning changes the representations of in-domain instances substantially, while out-of-domain instances remain similar to those of the pretrained model. To evaluate the effects of fine-tuning on the representations learned by pretrained language models, the authors in [113] proposed a sentence-level probing model to ascertain the changes. BERT was fine-tuned based on two indicators [114]. The first indicator evaluated the attention mode of the transformer network based on the Jensen-Shannon divergence during fine-tuning of the BERT model. The second indicator measured feature-extraction changes during model fine-tuning based on Singular Vector Canonical Correlation Analysis (SVCCA) [115].
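The distinction from feature-based transfer can be sketched with a toy model: fine-tuning starts from pretrained parameters and continues updating all of them on target-task data. The logistic-regression stand-in and the "pretrained" values below are invented for illustration.

```python
import math

def fine_tune(w, b, data, epochs=50, lr=0.5):
    """One possible sketch of a fine-tuning stage: ALL parameters
    (here just w and b) continue to be updated on target-task data,
    unlike feature-based transfer, where they stay frozen."""
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # logistic prediction
            g = p - y                        # gradient of log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

# Hypothetical "pretrained" parameters standing in for a source model...
w_pre, b_pre = [0.8, -0.1], 0.0
# ...adapted to a tiny target task instead of training from scratch.
target_data = [([1.0, 0.0], 1), ([-1.0, 0.0], 0)]
w_ft, b_ft = fine_tune(w_pre, b_pre, target_data)
```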

Intermediate-Task Transfer Learning
Compared with more established deep learning techniques such as RNNs, the top pretrained networks BERT and RoBERTa perform extraordinarily well. Their performance can be optimized further by training the model on a curated dataset for an intermediate task through fine-tuning. The work proposed in [116] employed an intermediate fine-tuning approach to improve the performance of the RoBERTa pretrained language model across 110 intermediate-target task combinations.
Intermediate fine-tuning of a semi-supervised pretrained language model performs well on domain-specific tasks such as medical question-answer pairs [117] for extracting medical question resemblances. Figure 7 depicts the training process. In medical-domain NLP applications, a model pretrained on a different problem in a similar domain beats models pretrained on an analogous task in a dissimilar field. Pretraining a language model on a biomedical dataset produces optimal performance for domain-specific languages [118]. For example, optimized performance was achieved on the biomedical question answering (QA) task by transferring knowledge from BioBERT, based on natural language inference (NLI) [119]. Fine-tuning works well when the source and target datasets come from the same domain but different tasks; in this case, fine-tuning happens on general domain datasets before transferring to in-domain datasets. In [118], the authors showed that training a domain-specific language model on rich biological corpora has a considerable impact. Training the model on larger NLI datasets such as MultiNLI [120] and SNLI [121] aids efficient task-specific reasoning with optimized performance. Fine-tuning is also possible when the source and target datasets are from the same task and domain but the target dataset is more specialized, whereas the source dataset is more general [122]. Fine-tuning is likewise feasible across tasks and domains where the source and target datasets come from different fields. The BioBERT model was tuned on the generic MultiNLI dataset before biomedical question answering (QA) [119], with outstanding performance in learning to reason at the phrase level for biomedical QA.
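The three-stage pipeline (general pretraining, intermediate-task fine-tuning, target-task fine-tuning) can be sketched abstractly. The `adapt` function below is a hypothetical stand-in for a full fine-tuning stage, and both the scalar parameter and the datasets are invented for illustration.

```python
def adapt(param, dataset, lr=0.3, epochs=40):
    """Hypothetical stand-in for one full fine-tuning stage: a single
    scalar parameter is nudged toward the labels by least squares."""
    for _ in range(epochs):
        for x, y in dataset:
            param += lr * (y - param * x) * x
    return param

# Stage 1: a "pretrained" parameter from a broad general corpus.
param = 0.1
# Stage 2: intermediate-task fine-tuning (e.g., a large NLI-style set).
param = adapt(param, [(1.0, 0.9), (1.0, 0.9)])
# Stage 3: target-task fine-tuning (e.g., a small biomedical QA-style set).
param = adapt(param, [(1.0, 1.0)])
```

The point of the sketch is the ordering: each stage starts from the parameters the previous stage produced, so the small target dataset only has to make a final adjustment.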

Discussion, Open Challenges, and Future Directions
This section highlights findings from the literature on transfer learning techniques, such as pretraining transformer networks for natural language models. We also shed light on some future directions vital to the progression of the field.

Optimized Pretraining Techniques
Because billions of parameters are involved, pretraining transformer-based language models on unlabeled datasets through SSL is costly, making it impractical to train a language model from scratch. According to the literature, models such as [69,70] acquired knowledge from language models that had already undergone pretraining by using a knowledge distillation technique. Compared to models created for equivalent tasks, the efficiency of the newly designed KPIT was exceptional. The KPIT model possesses rich features such as a faster convergence rate and lower pretraining time requirements, making it appropriate for downstream tasks.

Domain Specific Pretraining
Mixed-domain pretraining is a popular strategy frequently used in the literature to handle domain-specific tasks.
The method relies on a larger domain-specific dataset, which consequently demands more computing capacity. Despite its efficacy, such pretraining is unaffordable due to hardware requirements and energy usage. Task Adaptive Transfer Learning (TATL) was proposed in [118] to address this. Another technique is pseudo-labelling in-domain data with iterative training [123], which keeps the distribution of pseudo-labelled instances close to that of the in-domain data to achieve optimal performance.

Dataset/Corpus
Pretrained models require a large volume of labelled datasets or text corpora for optimal performance. Labelled datasets are expensive to generate in large quantities. Self-supervised learning is one approach that utilizes voluminous unlabeled datasets for contemporary NLP tasks. Language models such as ALBERT [3], BERT [5], and RoBERTa [6] were pretrained on benchmark general corpora such as English Wikipedia and the Books Corpus [58]. On the other hand, developing task- or domain-specific language models is challenging since the datasets in specific domains are too scanty for the transformer model to produce good results. For example, training models for the biomedical field requires domain-specific datasets, which are not readily available in large quantities.

Model Efficacy
The cost of pretraining on unlabeled text data is high in terms of hardware and dataset acquisition. The second issue is that datasets for domain-specific areas (e.g., biomedical) are few, even though unlabeled datasets in general are abundantly available. The DeBERTa [108] and ConvBERT [124] models are examples that produce good performance compared to earlier pretrained models such as BERT [5], RoBERTa [6], and ELECTRA [9]. For instance, DeBERTa is pretrained on fewer data than BERT, reducing computational cost while improving performance. Moreover, the ConvBERT model, built with a mixed attention mechanism, outperforms ELECTRA while using just a quarter of the dataset used to pretrain ELECTRA. Modern applications require pretrained language models that operate on edge devices with less processing power while retaining optimal performance.

Model Adaptation
Through continual pretraining, knowledge gained in general pretrained models has been adapted to specific domains such as biomedical and multilingual settings. Despite the success of continual pretraining on domain-specific tasks, there are some performance issues due to inadequate domain-specific datasets. Models such as ALeaseBERT [27], BioBERT [33], infoXLM [83], and TOD-BERT [29] are examples of continually pretrained models whose main aim is to reduce computational cost and provide optimal performance for domain-specific models. There is a need to research novel adaptation methods for pretrained language models.

Benchmarks
Evaluating a transformer-based pretrained model is vital, as model efficacy is paramount to its adoption. Some benchmarking frameworks have been proposed in this regard for general [84] and domain-specific models [35,73]. In [76,78], there are benchmarks to evaluate monolingual and multilingual language models. Despite these benchmarks being available, they do not adequately cover all domains. Most of them are built around literature-based datasets, hence the need for benchmarks for electronic health records and other domain-specific corpora.

Security Concerns
Security is a major concern for pretrained transformer models, since there are identified risks such as data leakage occurring during pretraining. This usually happens with datasets containing confidential information about people. Training a model over a long period can cause it to memorize and reveal vital information, such as personally identifiable information. Due to this drawback, models pretrained on datasets containing confidential information are not released into the public domain. A typical example is the work presented in [125], which extracted verbatim text sequences of personal information from the GPT-2 model's training data. We recommend that the KART (Knowledge, Anonymization, Resource, and Target) framework [126], which deals with real-world privacy leakages, be adapted and improved for better performance.

Conclusions
This review follows the PRISMA reporting standards to retrieve the relevant publications forming the Primary studies. Additionally, backward and forward snowballing was employed to retrieve additional publications from the Primary studies. This study reviews transfer learning-based pretrained models for NLP based on deep transformer networks. The study shows the recent trend of transformer networks in solving language problems compared to traditional deep learning algorithms such as RNNs. The paper explains the transformer model and the various core concepts behind its operation. The work focused on self-supervised learning, which pretrains models on unlabeled datasets and reserves labelled data for downstream tasks. The study also examined several benchmarking systems for assessing the effectiveness of pretrained models. Some challenges identified in the literature on transformer-based pretrained models have been discussed, with possible recommendations to deal with them. We also provide future directions to help researchers focus on developing improved NLP applications using transformer networks and self-supervised learning.