A Historical Survey of Advances in Transformer Architectures

Abstract: In recent times, transformer-based deep learning models have risen in prominence in the field of machine learning for a variety of tasks such as computer vision and text generation. Given this increased interest, a historical look at the development and rapid progression of transformer-based models becomes imperative in order to gain an understanding of the rise of this key architecture. This paper presents a survey of key works related to the early development and implementation of transformer models in various domains such as generative deep learning and as backbones of large language models. Previous works are classified based on their historical approaches, followed by key works in the domain of text-based applications, image-based applications, and miscellaneous applications. A quantitative and qualitative analysis of the various approaches is presented. Additionally, recent directions of transformer-related research such as those in the biomedical and time-series domains are discussed. Finally, future research opportunities, especially regarding multi-modality and the optimization of the transformer training process, are identified.


Introduction
Ever since the introduction of the transformer model in June 2017 by Vaswani et al. [1], the world of deep learning has seen a rapid adoption of the model in pushing the state of the art in a number of previously challenging tasks. Due to its prowess in sequence modeling and machine translation, the transformer architecture was initially widely implemented and emerged as the predominant deep learning model for natural language processing (NLP) and generative deep-learning tasks [2]. The introduction of transformers has also been a key factor in the development of large language models such as GPT-3 and GPT-4, which are the basis of culturally significant tools such as ChatGPT [3]. However, owing to the success of the self-attention mechanism in transformers, the architecture has since been implemented in various other application domains such as images, audio, and time series data [4]. In recent times, transformers have been touted as a potential replacement for Convolutional Neural Networks (CNNs) in vision applications [5], with the introduction of the Vision Transformer (ViT) opening a new realm of architectures which build upon it. Considering the rapid increase in interest in the transformer architecture, it becomes pertinent to examine in detail the architecture of the transformer as well as its historical progression from being introduced as an alternative to RNN-like architectures for sequence-to-sequence mapping to being one of the most impactful architectures in the current realm of deep learning. Finally, it may be beneficial to examine the various prevalent transformer architectures applicable to the different data domains.
Prior to the introduction of transformers to the deep learning space, the established state of the art in sequence modeling had long been Long Short-Term Memory (LSTM) networks [6] and other forms of Recurrent Neural Networks (RNNs) [7]. These were especially prevalent for transduction problems such as language modeling and machine translation due to their recurrence, which allows recent information to be accounted for in order to maintain sequential information [1]. However, these established models had numerous drawbacks, particularly that the sequential computation involved in the training process prevents parallelization, leading to slower training times [8] for long sentences, as they are processed word by word. Furthermore, RNNs were susceptible to encoder-decoder bottlenecks, particularly in sequence-to-sequence tasks, because the encoder had to read the entire sequence before producing a hidden state of fixed length which the decoder then decoded [9]. Transformers emerged as a solution to these drawbacks thanks to the self-attention mechanism, which disregards the distance between elements in the input or output sequences when accounting for dependencies [10] and which further allows for parallelization and therefore faster training. The following sections take an in-depth look at the initial architecture of early transformers. This gives insight into what makes transformers unique and which features of this architecture contribute to the success seen by this kind of model.

Transformers
In order to investigate the success seen by the transformer model, it is imperative to examine, in detail, the architecture and workings of the solution proposed by Vaswani et al. [1]. Compared with previously proposed sequence transduction models such as [11] and [12], transformers maintain the encoder-decoder structure, as seen in Figure 1, but discard the recurrence and convolution aspects. This is made possible by the novel multi-head attention mechanism, in addition to the point-wise feedforward networks built into the transformer model. Figure 1 shows the overall transformer architecture as proposed by Vaswani et al. [1]. The following sections describe the various blocks contributing to this architecture in further detail.
Appl. Sci. 2024, 14, x FOR PEER REVIEW

Self Attention
The first and most important component of the transformer architecture is the self-attention mechanism seen in Figure 2, which allows the model to learn the relationships between the elements of a sequence [13]. In the context of an LLM such as BERT, this would mean that in a sentence such as "The bank of the river is overflowing", the model would use self-attention to conclude that "bank" in this case refers to the side of a river as opposed to a financial organization. In the encoder version of this layer, the inputs consist of queries, keys, and values. The attention function is then applied to these matrices as seen in (1).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{1}$$
where
- $Q$ is the matrix of the queries;
- $K$ is the matrix of the keys;
- $V$ is the matrix of the values;
- $d_k$ is the dimension of the keys.
The equation is applied in such a way that the dot product between the queries and the keys is first computed to form the scores $S = Q K^{\top}$. These scores are important as they determine how much attention is given to other words when encoding the word at the current position. The scores are then normalized in order to ensure the stability of the gradient during training, giving the normalized scores $S_n = S / \sqrt{d_k}$. The softmax function is then applied to the normalized scores in order to translate them into probabilities $P = \mathrm{softmax}(S_n)$. These probabilities are then applied to the value matrix to obtain $Z = P V$. This means that vectors with larger probabilities receive a greater focus from the subsequent layers [5].
In transformers, a multi-head attention system is used wherein the original queries, keys, and values are passed through $H$ different sets of learned projections. For each projection, the attention equation from (1) is applied to formulate the output. The outputs across the $H$ projections are then concatenated to form the multi-head output. The formulation for this process can be found in (2).
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{2}$$
Here, following the notation of Vaswani et al. [1], $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learned projection matrices of head $i$ and $W^{O}$ is a learned output projection.
This process improves upon the performance of a single attention layer, as it allows the model to focus on multiple equally important words based on different criteria instead of attributing attention to a single word per input. This allows multiple complex relationships among different elements in a sequence to be captured effectively by the model [13] and therefore enhances the diversity of the learned subspaces. The original transformer model uses eight heads; however, subsequent works have experimented with pruning the heads to retain only the ones which provide the most important information [14].
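The attention steps described above can be traced in a few lines of code. The following is a minimal sketch in plain Python, assuming matrices stored as lists of rows; the helper names (`matmul`, `attention`, `multi_head_attention`) are hypothetical and chosen for illustration, and the projection matrices are passed in explicitly rather than learned.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)                                          # S = Q K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]   # S_n
    probs = [softmax(row) for row in scaled]                         # P
    return matmul(probs, v), probs                                   # Z = P V


def multi_head_attention(x, heads, w_o):
    """Self-attention with several heads.

    heads is a list of (W_q, W_k, W_v) projection triples, one per head;
    w_o is the output projection applied to the concatenated head outputs.
    """
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))[0]
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the per-head outputs for each position.
    concat = [sum((out[i] for out in head_outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

In this sketch the same input `x` supplies the queries, keys, and values, which corresponds to the self-attention case; the number of heads is simply the length of `heads`.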

Feedforward Networks
Another important component in the functioning of transformers is the feedforward network, which is applied after the self-attention layers in the encoder and decoder. This network consists of two linear transformations with a non-linear ReLU activation function between them and is applied to each position separately and identically. Because the same network is applied at every position, each token is processed in isolation, which allows the model to learn complex transformations of the data at each position. Going back to the example mentioned in the previous section, the feedforward network in BERT would refine the embeddings by adding additional layers of abstraction and complexity. So, for an example sentence like "The bank of the river is slippery", the self-attention would help give context and recognize that the bank is not a financial organization, as discussed previously, while the feedforward network would capture the nuance about the bank being slippery due to it being close to water. The formulation for this network can be seen in Equation (3).
$$\mathrm{FFN}(x) = \max(0,\, x W_1 + b_1)\, W_2 + b_2 \tag{3}$$
where $W_1$, $b_1$ and $W_2$, $b_2$ are the weights and biases of the two linear transformations.
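As an illustration of the position-wise behavior described above, the following minimal sketch applies the same two-layer ReLU network to every token independently. The function names are hypothetical, weights are stored as lists of rows with shape (inputs x outputs), and the toy weights in the usage note are chosen by hand rather than learned.

```python
def linear(x, w, b):
    """Compute x W + b for a single vector x."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]


def feed_forward(x, w1, b1, w2, b2):
    """Two linear transformations with a ReLU in between."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def position_wise_ffn(sequence, w1, b1, w2, b2):
    """Apply the same network to each position separately and identically."""
    return [feed_forward(token, w1, b1, w2, b2) for token in sequence]
```

For example, with `w1 = [[1.0, -1.0]]`, `w2 = [[1.0], [1.0]]`, and zero biases, the network computes the absolute value of a one-dimensional token, and every position in the sequence is transformed by exactly the same rule.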

Residual Connections
The transformer also implements residual connections [15] around each module, followed by layer normalization [16], which normalizes the output of each layer. This helps mitigate the vanishing gradient problem by allowing gradients to flow directly through the network, bypassing several layers. We can therefore represent each transformer block using the formulation seen in Equation (4).
$$\mathrm{LayerNorm}\left(x + \mathrm{Sublayer}(x)\right) \tag{4}$$
where $\mathrm{Sublayer}(x)$ denotes the function implemented by the module itself, i.e., self-attention or the feedforward network.
This residual connection boosts the flow of data by relaying the input forward and therefore serves to enhance the model's performance. The '+' operator in this equation refers to element-wise addition. In the context of the example discussed, these residual connections would make sure essential characteristics of the word "bank" are not lost in the depth of the model's layers.
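A minimal sketch of a residual connection followed by layer normalization is given below. It assumes a single token vector and omits the learned gain and bias that layer normalization normally carries; `residual_block` and `sublayer` are hypothetical names, with `sublayer` standing in for self-attention or the feedforward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def residual_block(x, sublayer):
    """Add the sublayer output to its input element-wise, then normalize."""
    summed = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(summed)
```

Even if `sublayer` returns all zeros, the input still reaches the output through the addition, which is the property that keeps information and gradients flowing through deep stacks.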

Position Encodings
As the self-attention process of the transformer dispenses with the sequential way in which RNNs or LSTMs handle input embeddings and instead treats all inputs simultaneously and identically, the self-attention layer is not able to account for the position of words in a sentence. However, since the words are sequential, a mechanism is needed which maintains the positions of the words within the encoded information; the transformer model therefore makes use of position encodings which are added to the input embedding. In the context of the example, the position encoding helps maintain the sequential context that the word "bank" is related to "river" and "slippery". The formulation for the added encodings is seen in Equation (5).
$$PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{5}$$
where $pos$ is the position of a word within a sentence, $d_{model}$ is the dimension of the embedding, and $i$ indexes the dimensions of the position encoding. Using this, each dimension of the positional encoding corresponds to a sinusoid, thereby allowing the transformer model to learn to attend based on relative positions as well and, potentially, to extrapolate to longer sequences. These encodings have been a focal point of subsequent research aiming to optimize the learning process. Numerous works have proposed modifications such as learned position encodings [17,18] or a relative form of position encoding [19].
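The sinusoidal encoding can be computed directly from its definition. The sketch below is a plain-Python illustration; the function name `positional_encoding` and the parameter `max_len` are hypothetical, and the loop variable steps over even dimensions so that it plays the role of 2i in the formula.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            table[pos][even_dim] = math.sin(angle)          # even dimensions use sine
            if even_dim + 1 < d_model:
                table[pos][even_dim + 1] = math.cos(angle)  # odd dimensions use cosine
    return table
```

Each row of the table would be added element-wise to the embedding of the token at that position before the first attention layer.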
Having discussed the importance and the workings of the transformer architecture, and given the rapid advances in the field of deep learning brought about by this model, it is worthwhile to examine the historical progression since its introduction in 2017 leading up to transformers taking over many of the state-of-the-art techniques. While there exist surveys on the various types of transformer architectures that have been proposed, there seems to be a gap in the analysis from a historical viewpoint. Therefore, the rest of the paper takes a historical perspective on the progression of notable transformer architectures, in addition to discussing the state-of-the-art techniques and architectures for data of different types.

Survey Methodology
The search for sources for this work was done following the PRISMA checklist [20]. The following subsections illustrate the points focused on for the survey's methodology.

Information Sources
Impactful works to be added to the survey were identified by searching online databases and scanning through the list of references within the main papers. The search was applied mainly to Google Scholar, OpenAI, Papers with Code, and arXiv, as it was found that the majority of the works on transformers were published through arXiv. As the survey is based on the history of transformers, the search was not limited by year, but it was found that works were present only from the year 2017 to the present. The last search for sources was done on 29 September 2023.

Search
The following search terms were used through all the above-mentioned databases: Transformers, State-of-The-Art Transformers, Key Transformer Architectures, Transformer Deep Learning, Transformer Vision, Transformer NLP, BERT.

Study Selection
The works were first shortlisted by their impact factor and number of citations. They were then further filtered based on their usefulness to the subject of this survey.

Data Collection and Data Items
A data extraction Excel spreadsheet was created that consisted of the following columns: Name of paper, Author, Date, Proposed Model, Datasets, Models Benchmarked Against, Results, and Key notes. This Excel spreadsheet was connected via a paper serial number to a Word document that consisted of further key points summarized from the papers.

Introductory Works
Since the introduction of the aforementioned transformer model in 2017, a vast array of works have aimed to build upon its novel architecture in order to optimize its performance for a variety of domains. Indeed, the work proposing the transformer model has been cited more than 90,500 times as of 29 September 2023, according to Google Scholar [21]. Among the thousands of subsequent works, a few emerge as notable models which have contributed to pushing the overall state of the art and have established themselves as standards in their fields. Figure 3 displays a timeline of these notable works arranged chronologically and coded according to the domain of implementation. To benchmark these works, a number of datasets have been utilized. A few of the commonly used datasets are BookCorpus [48], WMT 2014 [49], Wikipedia [50], C4 [22], ImageNet [51], and COCO [52].
An early work building upon the transformer model was that of Shaw et al. [19], which extended the self-attention mechanism of transformers to efficiently consider representations of the relative positions or distances between sequence elements. This is done by modeling the input as a labeled, fully connected graph with the edges between input elements $x_i$ and $x_j$ represented by vectors $a_{ij}^{V}, a_{ij}^{K} \in \mathbb{R}^{d_a}$. A modification is then made to the self-attention equation wherein edge information is propagated to the sublayer output as seen in Equation (6).
$$z_i = \sum_{j=1}^{n} \alpha_{ij} \left(x_j W^{V} + a_{ij}^{V}\right) \tag{6}$$
where $\alpha_{ij}$ is the attention weight between elements $i$ and $j$ and $W^{V}$ is the value projection; $a_{ij}^{K}$ is added to the keys in the same way when the attention weights are computed.
Using these improved embeddings, the authors were able to report improvements in both the EN-DE and EN-FR translation tasks over the vanilla transformer architecture.
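To make the role of the edge vectors concrete, the following is a minimal sketch of self-attention with relative position representations. It is an illustration under simplifying assumptions rather than the authors' implementation: the query, key, and value projections are omitted (treated as identity), the edge vectors are looked up by the clipped offset j - i from hypothetical dictionaries `rel_k` and `rel_v`, and a single head is used.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def relative_attention(x, rel_k, rel_v, max_dist):
    """Self-attention where edge vectors are added to the keys and values.

    x: list of token vectors (projections omitted for brevity).
    rel_k, rel_v: dicts mapping a clipped offset (j - i) to an edge vector.
    """
    n, d = len(x), len(x[0])
    outputs = []
    for i in range(n):
        offsets = [max(-max_dist, min(max_dist, j - i)) for j in range(n)]
        scores = []
        for j in range(n):
            key = [xj + a for xj, a in zip(x[j], rel_k[offsets[j]])]
            scores.append(sum(q * k for q, k in zip(x[i], key)) / math.sqrt(d))
        alpha = softmax(scores)
        z = [0.0] * d
        for j in range(n):
            value = [xj + a for xj, a in zip(x[j], rel_v[offsets[j]])]
            z = [zi + alpha[j] * v for zi, v in zip(z, value)]
        outputs.append(z)
    return outputs
```

With all edge vectors set to zero the function reduces to ordinary scaled dot-product self-attention, which makes it easy to see exactly where the relative position information enters.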
Another early and highly consequential work was that of Radford et al. [23], who proposed the Generative Pre-Training (GPT) model. The base model used for the work was the transformer architecture, as it allowed the authors to capture long-range linguistic structures. The idea proposed by the authors was that a model can perform better with small amounts of labeled text data when it is first generatively trained in an unsupervised manner on a large unlabeled text corpus consisting of diverse samples and then discriminatively fine-tuned on the specific task at hand. They do this by utilizing a multi-layer transformer-decoder [53] architecture which applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over the target tokens. These trained weights can then be fine-tuned for classification tasks, with language modeling retained as an auxiliary objective. The architecture used by the model can be seen in Figure 4.
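In a decoder-style model of this kind, self-attention is restricted so that each token attends only to itself and to earlier tokens. The sketch below illustrates that restriction on a matrix of raw attention scores; the function name `causal_softmax` is hypothetical, and the scores are assumed to have been computed already.

```python
import math


def causal_softmax(scores):
    """Turn an n x n score matrix into attention weights with a left-to-right mask.

    Position i only receives weight on positions 0..i; later positions get zero.
    """
    n = len(scores)
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights
```

With uniform scores, the first token places all of its weight on itself, the second splits its weight over two positions, and so on, which shows how the visible context grows from left to right.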
A similar approach to that of GPT was taken by the Bidirectional Encoder Representations from Transformers (BERT) model proposed by Devlin et al. [17], where unlabeled data is used to pre-train the transformer model in an unsupervised fashion before the model is fine-tuned using representative samples from the problem at hand. The major improvement proposed by the authors is the use of bidirectional encoder representations, unlike previous solutions, which involved unidirectional models being used in the learning process, such as GPT using a left-to-right architecture where each token in the self-attention layer was only able to attend to previous tokens. The BERT model achieves bidirectional learning by using a masked language model (MLM) pre-training objective which the authors adapted from the Cloze task [54]. This objective randomly masks some of the tokens from the input, with the goal of predicting the original vocabulary ID of the masked word based on the context. This allows the representation to join the left and right context, thereby allowing a bidirectional training process. To complement the MLM objective, the authors also implement a next-sentence prediction task which jointly pre-trains text-pair representations. The authors thereby outline two distinct stages in training the model: pre-training and fine-tuning. During pre-training, the model is given various tasks when training on unlabeled data, whereas for fine-tuning, the model is initialized with the parameters from the pre-training and all of the parameters are fine-tuned using labeled data from the downstream tasks. Each of these tasks has a separate fine-tuned model; however, in general, there is no architectural difference between the pre-training and the fine-tuning process except for the output layers. Figure 5, adapted from [17], shows the pre-training and fine-tuning procedures. Using this relatively simple conceptual approach, the BERT model was able to obtain state-of-the-art results on eleven natural language processing (NLP) tasks, thereby establishing it as a notable work which numerous subsequent models have been built upon.
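The masked language model objective can be illustrated with a short data-preparation sketch. This is a simplified illustration rather than the BERT preprocessing pipeline: it works on whole words instead of subword IDs, always substitutes a `[MASK]` placeholder for a selected token, and uses hypothetical names (`mask_tokens`, `mask_prob`). The default of 15% is the masking rate reported in the BERT paper.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly hide tokens and record the originals as prediction targets.

    Returns (masked_tokens, labels), where labels[i] is the original token
    at a masked position and None everywhere else.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model must recover this token from context
        else:
            masked.append(token)
            labels.append(None)    # position is not scored
    return masked, labels
```

Because the hidden token can sit anywhere in the sentence, the model has to use both the words to its left and the words to its right to recover it, which is what makes the training bidirectional.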
It was soon after that, in the beginning of 2019, that Radford et al. followed up their proposed GPT model with a model they called GPT-2, which followed a similar philosophy of multi-task learning based on a framework proposed by Caruana [53]. In their work, Radford et al. aimed to unify the two dominant approaches, namely, pre-training followed by supervised fine-tuning, and unsupervised approaches towards specific tasks such as commonsense reasoning [55] and sentiment analysis [24]. They achieve this by performing language modeling where, in addition to conditioning the model on the input, it is also conditioned on the task. They train their model in an unsupervised manner on a dataset consisting of millions of web pages, called WebText, producing GPT-2, a 1.5-billion-parameter model which achieved state-of-the-art results on seven language modeling tasks in a zero-shot setting. The authors hypothesized that a large enough model would learn tasks embedded within language without requiring explicit, supervised training, a hypothesis which was supported by their results.
Meanwhile, Wang et al. [25], in 2019, proposed a direct improvement upon the transformer model itself by formulating a deep transformer model which they claimed would surpass the prevalent big transformer counterpart. They achieved this using a dual approach: first, they implemented the proper use of layer normalization, and second, they introduced a novel way to pass combinations of previous layers to the next ones. Furthermore, they trained a 30-layer encoder, which they claim was the deepest at the time. Using this approach, the authors were able to outperform the results of both the shallow and the big transformers on the WMT'16 EN-DE, the NIST OpenMT'12 Chinese-English, and the WMT'18 Chinese-English tasks.
Liu et al.'s Robustly Optimized BERT Pre-training Approach (RoBERTa) model [26] was introduced with the idea of addressing the limitations of the BERT model which were caused by significant undertraining. The authors achieved this by training the model over a larger dataset, which consisted of CC-News and OpenWebText in addition to the two datasets used to train the original BERT model, and by training on longer sequences. The performance was further improved by making the following changes to the original model: dynamically changing the masking pattern that was applied to the training data and removing the Next Sentence Prediction (NSP) objective. Unlike in the BERT model, where the mask was generated only once during the data preprocessing stage, for RoBERTa, the authors generate a masking pattern every time a sequence is fed into the model. After comparing the training of their model with and without NSP, the authors came to the conclusion that removing NSP matched or slightly improved the downstream task performance. Throughout their experimentation, for a more accurate comparison, the original optimization hyperparameters of the BERT model were initially maintained. The model was able to achieve state-of-the-art results on GLUE [56], RACE [57], and the Stanford Question-Answering Dataset (SQuAD) [58], which are notable NLP benchmarks.
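The difference between static and dynamic masking can be shown with a small sketch. It is an illustration only, with hypothetical function names and word-level tokens: static masking draws one pattern during preprocessing and reuses it, whereas dynamic masking draws a new pattern each time the sequence is presented.

```python
import random

MASK = "[MASK]"


def apply_mask(tokens, mask_prob, rng):
    """Replace each token with the mask placeholder with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else token for token in tokens]


def static_masking(tokens, epochs, mask_prob=0.15, seed=0):
    """Mask once during preprocessing and reuse the same pattern every epoch."""
    fixed = apply_mask(tokens, mask_prob, random.Random(seed))
    return [fixed for _ in range(epochs)]


def dynamic_masking(tokens, epochs, mask_prob=0.15, seed=0):
    """Draw a fresh masking pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [apply_mask(tokens, mask_prob, rng) for _ in range(epochs)]
```

Over many epochs, the dynamic variant exposes the model to many different masked views of the same sentence, which is the motivation given for the change.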
Another notable proposed modification of the transformer model is that outlined by Sukhbaatar et al. [27], which suggests removing the feedforward layer from the transformer architecture and solely using the attention layers. This is done by augmenting the attention layers with persistent memory vectors which serve the same purpose as the feedforward layers. They first show that a feedforward sublayer can be viewed as an attention layer. This argument is then used to merge the two into a single layer which performs both functions by applying the attention mechanism simultaneously to the sequence of input vectors, as in the attention layer, and to a set of vectors not conditioned on the input. Using this approach, they report outperforming models of similar sizes on the enwik8 and WikiText-103 datasets.
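The idea of attending jointly over the input and a set of input-independent vectors can be sketched as follows. This is a simplified single-head illustration under stated assumptions, not the authors' implementation: the persistent vectors `mem_keys` and `mem_values` (hypothetical names) are passed in as fixed lists rather than learned, and positional information is omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def all_attention(queries, keys, values, mem_keys, mem_values):
    """Attend over context keys/values and persistent vectors in one softmax."""
    all_keys = keys + mem_keys          # persistent keys do not depend on the input
    all_values = values + mem_values
    d = len(all_keys[0])
    width = len(all_values[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in all_keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, all_values))
                        for c in range(width)])
    return outputs, weights
```

Passing empty lists for the persistent vectors recovers ordinary attention, so the extra vectors can be read as a replacement for what the feedforward sublayer would otherwise contribute.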
An interesting work published in late 2019 that explored the NLP landscape is the T5 model of Raffel et al. [22]. The researchers followed a transfer learning approach in introducing a unified framework which converts all text-based language problems into a text-to-text format. They experiment with a variety of pre-training objectives, architectures, datasets, and transfer approaches, in addition to developing a new dataset they call the Colossal Clean Crawled Corpus (C4). Using this pre-training regime, they report having achieved state-of-the-art results on a number of prevalent challenges in summarization, question answering, and text classification.
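The text-to-text framing amounts to expressing every task as an input string and a target string. The sketch below illustrates this with task prefixes of the kind shown in the T5 paper; the function name `format_example` and the task keys are hypothetical and chosen for illustration.

```python
PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
    "cola": "cola sentence: ",
}


def format_example(task, text, target):
    """Cast a task instance as an (input text, target text) pair."""
    return {"input": PREFIXES[task] + text, "target": target}
```

Note that even a classification task such as linguistic acceptability is answered with a text target (for example, the string "acceptable") rather than a class index, so the same model, loss, and decoding procedure can be used for every task.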

Further Progression
In early 2020, Shazeer [28] proposed an improvement to the transformer model which involved variants of Gated Linear Units (GLUs) [59] being applied to the feedforward sublayers of the transformer model. These variations were implemented using different linear and non-linear activation functions in place of the sigmoid, and the author reports an improvement in performance over the generally used ReLU activation function when evaluating on the SQuAD, GLUE, and SuperGLUE [60] tasks.
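A gated feedforward sublayer multiplies an activated projection of the input with a second, linear projection before the output projection. The following minimal sketch shows this for a single token vector, with the activation swapped to obtain the different variants; the function names and the `variant` keys are hypothetical labels, biases are omitted, and weights are lists of rows with shape (inputs x outputs).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


ACTIVATIONS = {
    "glu": sigmoid,                                                       # original gate
    "bilinear": lambda x: x,                                              # no non-linearity
    "reglu": lambda x: max(0.0, x),                                       # ReLU gate
    "geglu": lambda x: 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0))),    # GELU gate
    "swiglu": lambda x: x * sigmoid(x),                                   # Swish gate
}


def vecmat(x, w):
    """Compute x W for a single vector x."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def glu_ffn(x, w, v, w2, variant="swiglu"):
    """Gated feedforward sublayer: (activation(x W) * (x V)) W2."""
    activation = ACTIVATIONS[variant]
    gate = [activation(g) for g in vecmat(x, w)]
    linear_part = vecmat(x, v)
    hidden = [g * l for g, l in zip(gate, linear_part)]
    return vecmat(hidden, w2)
```

Compared with the standard ReLU feedforward network, this design uses three weight matrices instead of two, with the gate deciding element-wise how much of the linear projection passes through.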
In April 2020, a key architecture, the Lite Transformer, was introduced by Wu et al. [29]. The authors argued that transformers require an enormous amount of computation to achieve high performance and are therefore unsuitable for mobile applications constrained by hardware and battery resources. They proposed the Lite Transformer specifically for performing NLP on mobile devices. They introduce Long-Short Range Attention (LSRA), where one group of heads specializes in local context modeling using convolution while the other specializes in long-distance relationship modeling using attention. They report that this approach improves over the vanilla transformer on three established language tasks, namely, machine translation, abstractive summarization, and language modeling. The Lite Transformer block can be seen in Figure 6. Using this approach, the proposed model reduces the computation of the transformer base model by 2.5× with only a 0.3 BLEU score degradation. Furthermore, the authors report using pruning and quantization to compress the model size by 18.2×.
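The two-branch idea behind LSRA can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions, not the authors' block: the channels of each token are split in half, the first half passes through single-head self-attention with identity projections (the long-range branch), and the second half passes through a depthwise one-dimensional convolution with a fixed, odd-length smoothing kernel (the local branch). The function names and the kernel values are hypothetical.

```python
import math

def self_attention(seq):
    """Global branch: single-head dot-product self-attention with identity projections."""
    d = len(seq[0])
    out = []
    for query in seq:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in seq]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        out.append([sum(w / total * value[j] for w, value in zip(weights, seq))
                    for j in range(d)])
    return out

def local_conv(seq, kernel):
    """Local branch: depthwise 1-D convolution over the sequence with zero padding."""
    half = len(kernel) // 2
    d = len(seq[0])
    out = []
    for i in range(len(seq)):
        row = [0.0] * d
        for offset, k in enumerate(kernel):
            j = i + offset - half
            if 0 <= j < len(seq):
                for c in range(d):
                    row[c] += k * seq[j][c]
        out.append(row)
    return out

def lsra(seq, kernel=(0.25, 0.5, 0.25)):
    """Split channels in half, run one branch on each half, and concatenate."""
    half = len(seq[0]) // 2
    global_part = self_attention([token[:half] for token in seq])
    local_part = local_conv([token[half:] for token in seq], kernel)
    return [g + l for g, l in zip(global_part, local_part)]

tokens = [[1.0, 0.0, 1.0, 2.0], [0.0, 1.0, 3.0, 4.0], [1.0, 1.0, 5.0, 6.0]]
for token in lsra(tokens):
    print(token)
```

Because each branch sees only half of the channels, neither branch pays for the full model width, which is the source of the computational saving the sketch is meant to convey.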
Carion et al. [30] propose a ground-breaking object detection transformer named DETR that views object detection as a direct set prediction problem. The main components of the model are a set-based loss that forces unique predictions via bipartite matching and a transformer encoder and decoder. The overall architecture of the model is illustrated in Figure 7.
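The bipartite matching step at the heart of such a set-based loss can be sketched as follows. The sketch makes two simplifying assumptions: the matching cost is only an L1 distance between box coordinates (a full detection cost would also include a classification term), and the optimal assignment is found by brute force over permutations, which is only practical for a handful of boxes. The function names are hypothetical.

```python
from itertools import permutations

def l1_box_cost(pred, target):
    """L1 distance between two boxes given as (cx, cy, w, h) tuples."""
    return sum(abs(p - t) for p, t in zip(pred, target))

def match_predictions(pred_boxes, target_boxes):
    """Return (total_cost, assignment); assignment[t] is the prediction matched to target t.

    Each target receives a distinct prediction, so no two targets share one.
    Assumes there are at least as many predictions as targets.
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(pred_boxes)), len(target_boxes)):
        cost = sum(l1_box_cost(pred_boxes[p], target_boxes[t])
                   for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment

predictions = [(0.9, 0.9, 0.1, 0.1), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
targets = [(0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
cost, assignment = match_predictions(predictions, targets)
print(cost, assignment)
```

Predictions left unmatched (index 0 in this toy example) would be trained towards a "no object" label; a practical implementation would replace the brute-force search with the Hungarian algorithm.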
Brown et al. [31] advanced the state-of-the-art NLP transformer model by proposing their improved GPT-3 model. The authors scale up the model to 175 billion parameters, which results in a model that can perform a variety of tasks without requiring task-specific gradient updates or fine-tuning, unlike the previous generations of the model. The other variation from the architecture of GPT-2 is the use of alternating dense and locally banded sparse attention patterns in the layers of the transformer. The model performs well, and even achieves SOTA results, on well-known NLP benchmarks with few-shot demonstrations that are specified purely via text interactions with the model.
Dosovitskiy et al. [32] introduced the Vision Transformer (ViT) in late 2020, which caused a shift in the research field. To adapt the transformer for image tasks, the authors applied a standard transformer to images by splitting an image into patches and providing the sequence of linear embeddings of the patches as the input to the transformer. The overview of the ViT model can be seen in Figure 8. The image is first broken down into patches, which are passed through a trainable linear projection, resulting in a D-dimensional latent vector, where D is the latent vector size used by the transformer in its layers. An additional learnable embedding, which serves as a class token, is added at position 0. A classification head consisting of simple dense layers is added, with one hidden layer during pre-training and a single linear layer during fine-tuning. The authors report improvements on the state-of-the-art results achieved by CNN-based models for a range of benchmark datasets such as ImageNet [51], CIFAR-10, CIFAR-100 [62], and Oxford-IIIT Pets [63].
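The patch-embedding step described for the ViT can be sketched in pure Python. The sketch assumes an image stored as nested lists of shape H x W x C, a patch size that divides both dimensions, and a toy projection matrix; position embeddings are omitted for brevity, and all names are illustrative rather than taken from the original implementation.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                channel
                for row in range(top, top + patch)
                for col in range(left, left + patch)
                for channel in image[row][col]
            ])
    return patches

def embed_patches(patches, projection, class_token):
    """Project each flattened patch to D dimensions and prepend the class token at position 0."""
    d_model = len(projection[0])
    projected = [
        [sum(p * row[j] for p, row in zip(patch, projection)) for j in range(d_model)]
        for patch in patches
    ]
    return [list(class_token)] + projected

# Toy 4 x 4 single-channel image and a 4 -> 2 projection (both assumptions).
image = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
tokens = embed_patches(patchify(image, 2), projection, class_token=[0.0, 0.0])
print(len(tokens))  # four patch tokens plus one class token
```

The resulting token sequence is what a standard transformer encoder would consume; in a trained model the projection and the class token are learned parameters rather than fixed values.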
An interesting implementation using transformer architecture was that created by Zheng et al. [33], who proposed a segmentation model named the Segmentation Transformer (SETR). They implement a solution wherein semantic segmentation is treated as a sequence-to-sequence prediction task, with a transformer being deployed to encode an image as a sequence of patches. They combine the encoder with a single decoder by modeling the global context in each layer of the transformer. Figure 9 shows the architecture of their proposed system. In the system, the image is first split into fixed patches, which are linearly embedded with position encodings added. The resulting sequence of vectors is then fed into a standard transformer encoder. They propose two different decoder designs for pixel-wise segmentation, as can be seen in parts (b) and (c) of Figure 9. They then put these features together through a multi-level feature aggregation system, as seen in part (c) of Figure 9. Using this methodology, they were able to achieve state-of-the-art results on the ADE20K [64] and Pascal Context [65] challenges.
To study low-level vision tasks such as denoising, super-resolution, and deraining, Chen et al. [34] developed a new pre-trained model using transformer architecture, called the image processing transformer (IPT). The entire network is composed of multiple pairs of heads and tails corresponding to different tasks and a single shared body, so the pre-trained model becomes more compatible with different image processing tasks. Multiple corrupted counterparts were generated for each image in the well-known ImageNet benchmark dataset using several carefully designed operations. The model was then trained on the dataset's original images in addition to the newly generated images, and it outperformed the then state-of-the-art methods on several low-level benchmarks.
Touvron et al. [35] proposed a major non-convolutional transformer model, called DeiT, that has fewer parameters than the ResNet model, which makes it trainable on a single computer in less than 3 days. Furthermore, a teacher-student strategy that relies on a distillation token was used to ensure that the student learns from the teacher through attention. Using the distillation technique enables image transformers to learn more from a convnet than from another comparably performing transformer. A combination of these techniques results in a top-1 accuracy of 85.2% on ImageNet with no external data. Furthermore, transferring these models to different downstream tasks, such as fine-grained classification on popular benchmark datasets like CIFAR-10, Oxford-102 Flowers, and Stanford Cars, achieved competitive results.

Recent Advancements
Fedus et al. [36], in early 2021, observed that the widespread adoption of the mixture of experts (MoE) model had been obstructed by its complexity, communication costs, and training instability. As a result, they introduced the Switch Transformer to simplify the MoE routing algorithm and reduce the communication and computational costs. They additionally distill the sparse pre-trained and specialized fine-tuned models into small, dense models while preserving 30% of the quality gains. To increase the scale of the neural language model, data, model, and expert parallelism were combined to build models with a trillion parameters, which yielded a fourfold pre-training speedup over a strongly tuned T5-XXL baseline model.
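The simplified routing can be illustrated with a small sketch in which every token is sent to exactly one expert, namely the one with the highest router probability, and the expert output is scaled by that probability. The router weights, the two toy experts, and the function names below are assumptions made for illustration; load balancing and capacity limits are left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(token, router_weights, experts):
    """Route a token to its single highest-probability expert (top-1 routing).

    router_weights has one row per input feature and one column per expert.
    Returns the chosen expert index and the expert output scaled by the gate value.
    """
    logits = [sum(t * row[e] for t, row in zip(token, router_weights))
              for e in range(len(experts))]
    probs = softmax(logits)
    chosen = max(range(len(probs)), key=probs.__getitem__)
    output = experts[chosen](token)
    return chosen, [probs[chosen] * value for value in output]

# Two toy experts (assumptions): one doubles the token, one negates it.
experts = [lambda t: [2.0 * v for v in t], lambda t: [-v for v in t]]
router_weights = [[2.0, 0.0], [0.0, 2.0]]
print(switch_route([1.0, 0.0], router_weights, experts))
```

Only one expert is evaluated per token, so the compute per token stays roughly constant while the total parameter count grows with the number of experts.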
Radford et al. [37], meanwhile, aimed to leverage a much broader source of supervision by using raw text about images to train a model, instead of training it on a fixed set of predetermined object categories. They showed that SOTA image representations can be learned efficiently from scratch by pre-training a model to match captions with their corresponding images. Following the pre-training phase, natural language is used to reference the learned visual concepts or describe new ones, which, in turn, enables the zero-shot transfer of the model to downstream tasks. This approach was tested on 30 different existing computer vision datasets and proved competitive with fully supervised baseline models without the need for any dataset-specific training.
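Zero-shot transfer of this kind can be sketched as a nearest-caption lookup in a shared embedding space. In the sketch below, the embeddings are hand-written placeholder vectors standing in for the outputs of trained image and text encoders, and the prompt strings and function names are illustrative assumptions.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def zero_shot_classify(image_embedding, text_embeddings):
    """Pick the caption whose embedding has the highest cosine similarity to the image."""
    image = normalize(image_embedding)
    scores = {
        caption: sum(a * b for a, b in zip(image, normalize(embedding)))
        for caption, embedding in text_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Placeholder embeddings (assumptions); a real system would obtain them from encoders.
captions = {
    "a photo of a dog": [1.0, 0.0],
    "a photo of a cat": [0.0, 1.0],
}
label, scores = zero_shot_classify([0.9, 0.1], captions)
print(label)
```

New classes can be added simply by writing new captions, which is why no dataset-specific training is needed at transfer time.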
In a recent implementation, Zhai et al. [38] aimed to scale up the original ViT model to achieve better results, producing a model that they named ViT-G. Through the improvements made, the authors were able to train their model using data parallelism alone and were able to fit the entire model on a single TPUv3 core. The model was scaled up to two billion parameters. The authors removed the class token to save memory and additionally set the number of heads in the multi-head attention pooling equal to the number of attention heads in the model. Finally, they removed the final nonlinear projection before the final prediction layer, which was present in the original ViT model. The authors also scaled up the data by using a larger version of the JFT-300M dataset, namely, the JFT-3B dataset. Using this model, the authors were able to achieve a new state-of-the-art result on the ImageNet dataset with a top-1 accuracy of 90.45%. They also report an accuracy of 84.86% with few-shot learning, using only 10 examples per class from the ImageNet dataset for fine-tuning.
To make large-scale language models more accessible, less complex, and less expensive in terms of resources, Zhang et al. [39] propose a suite of eight decoder-only pre-trained transformers ranging from 125 million to 175 billion parameters, namely, Open Pre-trained Transformers (OPTs). Their model is comparable to the state-of-the-art GPT-3 model with only 1/7th of the carbon footprint. The model is directly developed from the GPT-3 model with a change in the number of layers and attention heads to vary the parameter size. The smallest model, with 125M parameters, consists of 12 layers and 12 attention heads, while the biggest model, with 175B parameters, consists of 96 layers with 96 attention heads. The batch size is varied from the original model to increase computational efficiency. While training the OPT-175B model, the authors faced an issue of loss divergence, which they fixed by lowering the learning rate and restarting the training from an earlier checkpoint. The authors noticed a correlation between the loss divergence, the dynamic loss scalar crashing to zero, and the L2-norm of the activations of the final layer spiking. From this, the authors concluded that restart points should be picked where the dynamic loss scalar was still in a healthy state, that is, greater than 1. The models were also trained with a larger set of data, including the datasets used to train RoBERTa, the Pile, and the PushShift.io Reddit dataset. The models were evaluated across 14 NLP tasks, and it was seen that, in the zero-shot setting, the average performance follows the trend of GPT-3 on 10 of the tasks.
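The restart rule described above can be expressed as a small selection function. This is a hedged sketch: the checkpoint records, the choice of the latest healthy checkpoint (rather than any healthy one), and the function name are assumptions made for illustration, while the threshold of 1 follows the description in the text.

```python
def pick_restart_checkpoint(checkpoints, divergence_step, healthy_scale=1.0):
    """Return the latest step before divergence whose loss scalar is still healthy.

    checkpoints is a list of (step, dynamic_loss_scalar) pairs; None is returned
    when no checkpoint qualifies.
    """
    candidates = [
        step for step, scale in checkpoints
        if step < divergence_step and scale > healthy_scale
    ]
    return max(candidates, default=None)

# Hypothetical training log: the loss scalar collapses shortly before divergence.
log = [(1000, 8.0), (2000, 4.0), (3000, 0.5), (4000, 0.0)]
print(pick_restart_checkpoint(log, divergence_step=4000))
```

In this toy log the checkpoint at step 3000 is skipped even though it precedes the divergence, because its loss scalar has already dropped below the healthy threshold.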

Text-Based Applications
Transformers have revolutionized the realm of text-based applications and natural language processing (NLP) by providing solutions to a variety of problems such as text classification, question answering, text summarization, machine translation, and text generation [66]. The first prevalent model in text-based applications is one already analyzed previously: the BERT model proposed in 2018 by Devlin et al. [17]. Despite being a number of years old, this architecture is still relevant to this day due to how groundbreaking it was when it was proposed. Indeed, the BERT model has been the base for various other prevalent models, such as RoBERTa [26] in 2019, which achieved excellent results by proposing a variation of the BERT model; ETC [67] in 2020, which reported high performance when building upon the BERT model and using the weights provided by RoBERTa; and Big Bird [68] in 2021, which was proposed as a variation of the BERT model for longer sequences. Another notable implementation of transformers for the text domain is TENER [69], proposed in 2019 as a solution for using transformers for the named entity recognition task, which is the task of finding the start and end of an entity in a sentence and assigning a class to this entity. This is especially useful in applications such as question generation [70], relation extraction [71], and coreference resolution [72]. This model adapts the transformer encoder to model character-level features and word-level features.

Image-Based Applications
In the realm of image-based applications, an early implementation was the Image Transformer proposed by Parmar et al. in 2018 [73]. This model restricted the transformer's self-attention to attend to local neighborhoods. However, in the domain of images, the most influential model is the Vision Transformer introduced by Dosovitskiy et al. in 2020 [32], which was discussed earlier in this work. Numerous consequential works have been derived from this proposed model. A work based on ViTs, which outperformed it, was that of Touvron et al. [35], which has also been previously described in this paper. An alternative framework based on ViTs is the Feature Fusion Vision Transformer (FFVT) proposed by Wang et al. [74] in 2021, which adopts the patch generation process employed by ViTs but modifies it to avoid overlap. A recent solution making use of transformers in the field of vision is the Unsupervised Semantic Segmentation Transformer (STEGO), proposed in March 2022 by Hamilton et al. [75]. This model makes use of transformers to localize semantically meaningful categories within image corpora without any form of annotation. This is done by using a novel loss function that encourages features to form compact clusters while preserving their relationships across the corpora.
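The restriction of self-attention to local neighborhoods, mentioned above for the Image Transformer, can be illustrated with an attention mask. The sketch uses a simplified one-dimensional window of fixed radius as an assumption; the cited model's actual neighborhood layout may differ, and the function name is hypothetical.

```python
def local_attention_mask(length, radius):
    """Boolean mask where position i may attend to position j only if |i - j| <= radius."""
    return [[abs(i - j) <= radius for j in range(length)] for i in range(length)]

mask = local_attention_mask(5, 1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# The number of allowed pairs grows linearly with the sequence length for a fixed radius.
allowed_pairs = sum(sum(row) for row in mask)
print(allowed_pairs)
```

With a fixed radius, each position attends to a bounded number of neighbors, so the attention cost scales with the sequence length rather than with its square.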

Miscellaneous Applications
In addition to the previously identified domains, a couple of miscellaneous approaches are discussed. The first of them is audio classification, for which a number of audio transformers have been proposed over the years. The first of these was proposed by Dong et al. in 2018 [76] with the idea of applying a two-dimensional attention block in the proposed audio transformer model. A consequential model for audio captioning was the Audio Captioning Transformer (TRACKE) proposed by Koizumi et al. [77] in 2020. TRACKE estimates keywords, which comprise a word set corresponding to audio events/scenes in the input audio, and generates the caption while referring to the estimated keywords to reduce word-selection indeterminacy. Following this, in 2021, the Audio Spectrogram Transformer was proposed by Gong et al. [78] as a convolution-free, purely attention-based model for audio classification.
The second miscellaneous set of approaches is that of time series modeling, first introduced through the work proposed by Liu et al. in 2021 [79]. This is done by adding gating to the vanilla transformer in an approach they call Gated Transformer Networks. Another model proposed in 2021, for time series forecasting, was that of Zhou et al., which they call the Frequency Enhanced Decomposed Transformer (FEDformer) [80]. An interesting time-series-based implementation using transformers is TranAD, proposed by Tuli et al. in 2022 for anomaly detection in time series data [81]. TranAD uses focused score-based self-conditioning to enable robust multi-modal feature extraction and adversarial training to gain stability. The results obtained by these models are highlighted in the next section.

Recent Directions
The recent increase in the relevance of transformers and the work being conducted in exploring the uses of these versatile models has resulted in transformers becoming more accessible for implementation in various real-world applications. Indeed, one of the applications that has seen extensive use of transformers is medical image analysis [82]. While numerous works have previously aimed at applying a variety of artificial intelligence algorithms towards solving key issues within the realm of medicine, such as COVID-19 detection [83] and the extraction and detection of a fetal electrocardiogram [84,85], the introduction of transformers for vision has unlocked a large number of techniques such as image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. Indeed, as Li et al. [86] discuss, the ability of transformers to capture long-range dependencies as well as the scalability of self-attention enables their diverse usage within the medical field. In addition to the capabilities of transformers within medical imaging, Shamshad et al. [87] discuss their implementations in various other medical applications, such as leveraging their text generation ability to generate medical reports as well as using them for regression tasks such as survival outcome prediction.
With the increase in the general depth and complexity of transformers, a number of researchers have chosen to focus on the stability of extremely deep transformers. One such approach relying on scaling is DeepNet, proposed by Wang et al. [40], which introduces a new normalization function to modify the residual connections in transformers along with a theoretically derived initialization process. Using this technique, they report being able to successfully scale transformers up to 1000 layers. Furthermore, with the rise in the adoption and use of transformers, an increased focus on developing lighter versions of transformers has been noted. This is because, while transformers have produced revolutionary results, this has come at a huge computation cost, thereby preventing the models from being as easily adopted as earlier deep-learning techniques such as CNNs [88]. To this end, numerous researchers have proposed works aiming to scale or slim the weights of a traditional transformer. A notable attempt is that of the EfficientFormer and EfficientFormerV2 proposed by Li et al. [41,42]. These models make use of a process called latency-driven slimming to reduce the time taken for inference using the trained transformers. The EfficientFormerV2 work further introduces a fine-grained joint-search strategy that can find efficient architectures by optimizing the latency and the number of parameters simultaneously. A similar work aiming to achieve efficient image recognition was AdaViT, proposed by Meng et al. [43], a computational framework that learns to derive policies on which patches, self-attention heads, and transformer blocks to use throughout the backbone on a per-input basis. This is done by attaching a lightweight decision network to the backbone to produce on-the-fly decisions. A similar thought process was seen in the case of the A-ViT method proposed by Yin et al. [44], which adaptively adjusts the inference cost for images of different complexities. This is done by reducing the number of tokens in the ViT as the inference proceeds. The proposed method requires no extra parameters or sub-networks, unlike AdaViT, as the learning of the adaptive halting is based on the original network parameters. A recent work aiming to improve the efficiency of transformer inference is that of Pope et al. [45], who develop an analytical model for inference efficiency to select the best multi-dimensional partitioning techniques. These are combined with low-level optimizations to achieve a Pareto frontier on latency and FLOPS utilization tradeoffs.
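The token-reduction idea described for A-ViT can be sketched as an accumulation of per-token halting scores across layers. The sketch assumes that each layer emits a halting score in [0, 1] for every token and that a token is dropped from later layers once its accumulated score reaches 1 minus a small epsilon; the score values, the epsilon, and the function name are illustrative assumptions, and how the scores are produced by the network is not modeled.

```python
def active_tokens_per_layer(halting_scores, epsilon=0.01):
    """Return, for each layer, the indices of tokens still processed by that layer.

    halting_scores[layer][token] is the halting score the layer assigns to the token.
    A token halts once its accumulated score reaches 1 - epsilon.
    """
    num_tokens = len(halting_scores[0])
    cumulative = [0.0] * num_tokens
    active = list(range(num_tokens))
    schedule = []
    for layer_scores in halting_scores:
        schedule.append(list(active))
        still_active = []
        for token in active:
            cumulative[token] += layer_scores[token]
            if cumulative[token] < 1.0 - epsilon:
                still_active.append(token)
        active = still_active
    return schedule

# Hypothetical scores for three tokens over three layers.
scores = [
    [0.2, 1.0, 0.5],
    [0.2, 0.0, 0.5],
    [0.2, 0.0, 0.0],
]
print(active_tokens_per_layer(scores))
```

Tokens that halt early stop contributing to the cost of deeper layers, so inputs whose tokens halt quickly are cheaper to process than inputs whose tokens stay active.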
Another key work was that of Zhang et al. in the introduction of the MiniViT model [46], which applies weight multiplexing to reduce the complexity of the traditionally immense vision transformer. This is done by multiplexing the weights of consecutive transformer blocks, wherein weights are shared across layers, while imposing a transformation on the weights to increase diversity. Furthermore, weight distillation over self-attention is also applied to transfer knowledge from the large ViT models to the weight-multiplexed compact models.
Yu and Wu [47] proposed a pruning framework to be applied to ViTs in order to simplify all components in a transformer without altering the structure. This framework, called UP-ViT, estimates the importance score of each filter in a pre-trained ViT model before removing redundant channels. Furthermore, they propose a progressive block-pruning method that removes the least important block and proposes new hybrid blocks for ViTs.
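The channel-removal step of such a pruning framework can be sketched generically as keeping the highest-scoring channels. The importance scores are taken as given inputs here, since how they are estimated is specific to the cited framework; the keep ratio, the toy weights, and the function name are assumptions for illustration.

```python
def prune_channels(weight_rows, importance, keep_ratio):
    """Keep the most important channels, preserving their original order.

    weight_rows holds one row of weights per channel, and importance holds one
    score per channel. Returns the kept channel indices and their weight rows.
    """
    num_keep = max(1, round(len(importance) * keep_ratio))
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:num_keep])
    return keep, [weight_rows[i] for i in keep]

# Hypothetical scores for four channels; half of the channels are kept.
weights = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.1, 0.9, 0.5, 0.05]
print(prune_channels(weights, scores, keep_ratio=0.5))
```

Because only whole channels are removed, the layer keeps its structure and merely becomes narrower, which matches the goal of simplifying components without altering the architecture.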
An interesting area of recent work has been in making the training of transformers a more data-efficient process. An early work in this space was the previously discussed DeiT model proposed by Touvron et al. [35], who proposed using what they called a distillation token to effectively learn from a teacher in a teacher-student method employed to train transformers. This distillation token is learned through backpropagation, by interacting with the class and patch tokens through the self-attention layers. A more recent approach towards achieving data-efficient training is proposed by Wang et al. [89], who argue that sparse feature sampling from local image areas is key and therefore propose a procedure in which they alternate how key and value sequences are constructed in the cross-attention layer. Furthermore, they also introduce a label augmentation method that provides richer supervision, in turn achieving greater data efficiency.

Historical Insight
Table 1 summarizes the historical works discussed in the previous section. The works are color-coded in the timeline, wherein the works targeted towards text and NLP tasks are color-coded in blue and the works targeted at image-related tasks are color-coded in orange.

The table above summarizes key information from the history of the studies discussed in the previous section. In addition to the name of the study, the author, and the date, the table also outlines the approach presented as well as the datasets evaluated upon, the models benchmarked against, and the obtained results. Finally, the number of citations attained by each paper as of the writing of this paper is also listed in order to emphasize the importance of some of the presented studies.
In general, it can be seen that a number of works have chosen to add or modify layers of the base transformer models, which has overall been seen to achieve good performance. Indeed, such an approach is seen in works such as those of Shaw et al. [19], Wang et al. [25], Sukhbaatar et al. [27], and Shazeer [28].
Another common approach for NLP tasks that has been shown to work well is to increase the size of the model to a very large number of parameters and to pre-train it in an unsupervised fashion on a large corpus of data. This has been seen in numerous state-of-the-art models such as the GPT [23], BERT [17], GPT-2 [90], RoBERTa [26], T5 [22], and GPT-3 [31] models, in the work of Radford et al. [37], and in the OPT model [39].
Yet another class of approaches involves the addition or modification of the loss functions associated with the transformer model. Such an approach was seen in the case of the work performed by Carion et al. [30].
When it comes to images, the general procedure followed by the previous studies was to split images into patches and apply position embeddings on these patches, much like what is done for texts. This was indeed the process followed by the Vision Transformer (ViT) [32]. Other vision models implemented varied decoders, such as the work proposed by Zheng et al. [33]. Similarly, studies such as that by Chen et al. [34] make use of multiple pairs of heads and tails corresponding to different low-level vision tasks. The ViT-G model proposed by Zhai et al. [38] followed a procedure where the class token was removed and the non-linear projection before the final layer was removed.

Text-Based Applications
As can be seen from the types of models used for these applications, an important aspect in the implementation of text-based transformers is the encoder and the encoding of the input. Indeed, the model achieving the most widespread usage has been the BERT model [17], which involves modifying the input encodings to make them bidirectional. RoBERTa [26] builds upon this by adding an optimized pre-training process. Indeed, most of the other NLP-solving approaches have involved modifications to input encodings, such as the TENER [69], ETC [67], and Big Bird [68] models, thereby demonstrating the importance of encodings to the NLP process.

Image-Based Applications
Table 3 below illustrates that the vision domain of transformers is very extensive and is used for many different kinds of applications such as image classification and segmentation. To date, the most influential model is the ViT [32], and many other significant models are based on improving its performance by tweaking its architecture, such as the study by Wang et al. [74], where they modify the patch generation by avoiding overlap. The recent introduction of the work by Hamilton et al. [75] opens the door to unsupervised segmentation, demonstrated through their reported accuracy of 76.1%. This solution could address many real-world applications, as their datasets are often unbalanced or have smaller amounts of labeled data. A concrete quantitative analysis across the previous studies is difficult to achieve because the authors report results on different datasets and also report different evaluation metrics.

Miscellaneous Applications
Table 4 below illustrates that the contributions towards transformer models are not limited to the domains of NLP and images; they have also recently been used in the audio and time series domains. Here, too, it is difficult to do a concrete quantitative analysis, as the specific application domains of the works summarized above are all different. An interesting work to note is that of Koizumi et al. [77], which merges NLP analysis within the audio domain and is quite successful in outperforming the results of the traditional LSTM model that is usually used for such an application, with a best BLEU-1 score of 52.1. Dong et al. [76] achieve a WER score of 10.9 on the eval92 subset of the Wall Street Journal dataset, and Gong et al. [78] achieve their best results on the Speech Commands V2 dataset with an accuracy of 98.11% without adding additional audio data while training. The second half of the table demonstrates different areas in the domain of time series using transformers. The domains illustrated are time series classification by Liu et al. [79], who were able to beat the state-of-the-art results on 7 out of 13 competitive datasets; time series forecasting by Zhou et al. [80], who achieved SOTA results on all 6 datasets; and time series anomaly detection by Tuli et al. [81], who also beat the SOTA results in their domain on 7 out of 10 competitive datasets.

Gaps and Future Work
As the above discussion illustrates, the realm of transformer architectures is one that has exploded, with new works being rapidly proposed ever since Vaswani et al.'s revolutionary publication [1]. Figure 10 is presented to highlight the progression in the architecture and the complexity of transformer models ever since, with the architecture of notable transformer implementations visualized in order to give the reader a perspective of the rapid rise. However, despite this rapid progression, certain gaps in the field remain. One major gap seen in contemporary research is that transformers generally have a quadratic computation and memory complexity in the sequence length, because self-attention is required to model arbitrarily long dependencies [91]. This has presented a major issue in the accessibility of the use of transformers and has led to a promising avenue of research aimed at simplifying the training process of transformer models [92]. Indeed, the Lite Transformer [29] discussed earlier was introduced with the intention of addressing this very issue, as were implementations such as the Longformer [93], Reformer [94], Linformer [95], Performer [96], and OPT [39]. However, these models are only a start to what is a vast potential research space in optimizing transformer-training procedures. This is a pressing issue, as many of the state-of-the-art models aim to simply increase a model's size (GPT-4, for instance) [97], and, therefore, make it impractical for that model to be used in many real-world applications.
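The quadratic growth mentioned above can be made concrete with a short calculation of the memory needed to hold the attention score matrix alone. The sketch assumes one score per pair of positions, 4-byte values, and a single head by default; the function name and these defaults are illustrative assumptions, and real implementations may avoid materializing the full matrix.

```python
def attention_score_bytes(seq_len, num_heads=1, bytes_per_value=4):
    """Memory needed to store one seq_len x seq_len score matrix per head."""
    return num_heads * seq_len * seq_len * bytes_per_value

for length in (1024, 2048, 4096):
    mebibytes = attention_score_bytes(length) / 2**20
    print(f"{length} tokens: {mebibytes:.0f} MiB per head")
```

Doubling the sequence length quadruples the size of the score matrix, which is the scaling behavior that the efficient-attention models listed above try to avoid.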
Another interesting research issue is the problem of integrating all modalities without changing the architecture towards a single modality.Early implementations of this have been seen in models such as the Perceiver [98], which accepts all kinds of input but can only generate fixed outputs such as class probabilities, and the Perceiver IO, which has flexible inputs and outputs but still relies on the specifics of the modalities, such as augmentation or position encoding, to properly learn [99].This research area is ripe for expansion, as a model that is truly adaptable to anything would lead to massive progress in the field of deep learning and would broaden the scope of the real-world applications that could be improved with artificial intelligence.However, despite this rapid progression, certain gaps in the field remain.One major gap seen in contemporary research is that transformers generally have a quadratic computation and memory complexity due to their being required to model arbitrary long dependencies [91].This has presented a major issue in the accessibility of the use of transformers and has led to a promising avenue of research aimed at simplifying the training process of transformer models [92].Indeed, the Lite Transformer [29] discussed earlier was introduced with the intention of addressing this very issue, as were implementations such as the Longformer [93], Reformer [94], Linformer [95], Performer [96], and the OPT [39].However, these models are a start to what is a vast potential research space in optimizing transformer-training procedures.This is a pressing issue, as many of the state-of-the-art models aim to simply increase a model's size (GPT-4, for instance) [97], and, therefore, make it impractical for that model to be used in many real-world applications.
Another interesting research issue is the problem of integrating all modalities without tailoring the architecture to a single modality. Early implementations of this have been seen in models such as the Perceiver [98], which accepts all kinds of input but can only generate fixed outputs such as class probabilities, and the Perceiver IO, which has flexible inputs and outputs but still relies on modality-specific details, such as augmentation or position encoding, to learn properly [99]. This research area is ripe for expansion, as a model that is truly adaptable to any modality would lead to massive progress in the field of deep learning and would broaden the scope of the real-world applications that could be improved with artificial intelligence.
A final research area concerns the large amounts of data that are generally needed to train a good transformer. This is less than ideal, as many real-world applications do not have adequate amounts of labeled data and therefore cannot leverage this powerful model. Promising research in this direction is that of the ViT-G [38], which reports having achieved few-shot learning by training with just 10 examples per class in the ImageNet dataset. More work needs to be done in this realm to truly make transformers accessible for wide implementation. A possible avenue to achieve this could be exploring ways to train transformers in a semi-supervised fashion [100]. With the successful exploration of these avenues of research, it might be possible to leverage the great power and achievements attained by transformers in real-world applications which would affect our daily lives.

Figure 1. A depiction of transformer architecture.

Figure 2. The structure of the attention layer. Left: Scaled Dot-Product Attention. Right: a multi-head attention mechanism.

Figure 5. The pre-training and fine-tuning process of the BERT model.

Figure 7. The DETR's architecture. The CNN is used to extract a compact feature representation of the input image by generating a low-resolution activation map. The transformer's encoder and decoder follow the model architecture of Vaswani et al. [1]. The decoder output encodings are decoded into box coordinates and class labels by the feedforward network. The object detection set prediction loss produces a bipartite matching between the predicted and the ground truth objects and then optimizes the object-specific losses.

This model is on par with the state-of-the-art Faster R-CNN baseline on the famous COCO object detection dataset. The Faster R-CNN was a model proposed by Ren et al. which used a Region Proposal Network to generate region proposals which were then used by a Fast R-CNN for detection [61].

Around mid-2020, Brown et al. [31] improved on the state-of-the-art NLP transformer model by proposing their GPT-3 model. The authors scaled up the model to 175 billion parameters, which resulted in a model which can perform a variety of tasks without requiring task-specific gradient updates or fine-tuning, unlike the previous generations of the model. The other variation from the architecture of GPT-2 is the use of alternating dense and locally banded sparse attention patterns in the layers of the transformer. The model is able to perform well and even achieve SOTA results on famous NLP dataset tasks with few-shot demonstrations which are specified purely via text interactions with the model.

Figure 8. An overview of the ViT model's architecture.

Author Contributions: Conceptualization, A.R.S. and I.Z.; methodology, A.R.S., I.Z. and D.S.; investigation, A.R.S. and I.Z.; resources, I.Z.; writing-original draft preparation, A.R.S. and D.S.; writing-review and editing, I.Z.; visualization, D.S.; supervision, I.Z. All authors have read and agreed to the published version of the manuscript.

Funding: The work in this paper was supported, in part, by the Open Access Program from the American University of Sharjah [grant number: OAPCEN-1410-E00291].

Table 1. A summary of the history of transformer studies.
Table 2 below displays a summary of notable transformer studies in the domain of NLP.

Table 2. A summary of the transformer studies in the domain of NLP.

Table 3. A summary of the transformer-related works in the domain of computer vision.

Table 4. A summary of the transformer-related works in the audio and time series domains.