Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review

: Transformers are models that implement a mechanism of self-attention, individually weighting the importance of each part of the input data. Their use in image classification tasks is still somewhat limited since researchers have so far chosen Convolutional Neural Networks for image classification and transformers were more targeted to Natural Language Processing (NLP) tasks. Therefore, this paper presents a literature review that shows the differences between Vision Transformers (ViT) and Convolutional Neural Networks. The state of the art that used the two architectures for image classification was reviewed and an attempt was made to understand what factors may influence the performance of the two deep learning architectures based on the datasets used, image size, number of target classes (for the classification problems), hardware, and evaluated architectures and top results. The objective of this work is to identify which of the architectures is the best for image classification and under what conditions. This paper also describes the importance of the Multi-Head Attention mechanism for improving the performance of ViT in image classification.


Introduction
Nowadays, transformers have become the preferred models for performing Natural Language Processing (NLP) tasks.They offer scalability and computational efficiency, allowing models to be trained with more than a hundred billion parameters without saturating model performance.Inspired by the success of the transformers applied to NLP and assuming that the self-attention mechanism could also be beneficial for image classification tasks, it was proposed to use the same architecture, with few modifications, to perform image classification [1].The author's proposal was an architecture, called Vision Transformers (ViT), which consists of breaking the image into 2D patches and providing this linear sequence of patches as input to the model.Figure 1 presents the architecture proposed by the authors.
In contrast to this deep learning architecture, there is another very popular tool for processing large volumes of data called Convolutional Neural Networks (CNN).The CNN is an architecture that consists of multiple layers and has demonstrated good performance in various computer vision tasks such as object detection or image segmentation, as well as NLP problems [2].The typical CNN architecture starts with convolutional layers that pass through the kernels or filters, from left to right of the image, extracting computationally interpretable features.The first layer extracts low-level features (e.g., colours, gradient orientation, edges, etc.), and subsequent layers extract high-level features.Next, the pooling layers reduce the information extracted by the convolutional layers, preserving the most important features.Finally, the fully-connected layers are fed with the flattened output of the convolutional and pooling layers and perform the classification.Its architecture is shown in Figure 2.
With the increasing interest in Vision Transformers as a novel architecture for image recognition tasks, and the established success of CNNs in image classification, this work aims to review the state of the art in comparing Vision Transformers (ViT) and Convolutional Neural Networks (CNN) for image classification.Transformers offer advantages such as the ability to model long-range dependencies, adapt to different input sizes, and the potential for parallel processing, making them suitable for image tasks.However, Vision Transformers also face challenges such as computational complexity, model size, scalability to large datasets, interpretability, robustness to adversarial attacks, and generalization performance.These points highlight the importance of comparing ViTs with older and established CNN models.tics differ between the two architectures, that allow them to perform differently for the same objective.Some of the aspects that will be compared include datasets considerations, robustness, performance, evaluation, interpretability, and architecture.Specifically, we aim to answer the following research questions: RQ1-Can the ViT architecture have a better performance than the CNN architecture, regardless of the characteristics of the dataset?RQ2-What influences CNNs that do not to perform as well as ViTs?RQ3-How does the Multi-Head Attention mechanism, which is a key component of ViTs, influence the performance of these models in image classification?
In order to address these research questions, a literature review was conducted by searching various databases such as Google Scholar, Scopus, Web of Science, ACM Digital Library, and Science Direct using specific search terms.This paper presents the results of this review and analyses the methodologies and findings from the selected papers.
The rest of this paper is structured as follows.Section 2 describes the research methodology and search results.Section 3 presents the knowledge, methodology, and results found in the selected documents.Section 4 provides a brief overview of the reviewed papers and attempts to answer the three research questions.Section 5 discusses threats to the validity of the research.Section 6 overviews the strengths and weaknesses of each architecture and suggests future research directions, and Section 7 presents the main conclusions of this work.

Research Methodology
The purpose of a literature review is to evaluate, analyse and summarize the existing literature on a specific research topic, in order to facilitate the emergence of theoretical frameworks [3].In this literature review, the aim is to synthesize the knowledge base, critically evaluate the methods used and analyze the results obtained in order to identify the shortcomings and improve the two aforementioned deep learning architectures for image classification.The methodology for conducting this literature review is based on the guidelines presented in [3,4].

Data Sources
ACM Digital Library, Google Scholar, Science Direct, Scopus, and Web of Science, were chosen as the data sources to extract the primary studies.The number of results found after searching papers in each of the data sources is shown in Table 1.

Search String
The research questions developed for this paper served as the basis for the search strings utilized in each of the data sources.Table 2 provides a list of the search strings used in each electronic database.

Inclusion Criteria
The inclusion criteria set to select the papers were that the studies were recent, had been written in English, and were published between January 2021 and December 2022.This choice of publication dates is based on the fact that ViTs were not proposed until the end of 2020 [1].In addition, the studies had to demonstrate a comparison between CNNs and ViTs for image classification and could use any pre-trained model of the two architectures.Studies that presented a proposal for a hybrid architecture, where they combined the two architectures into one, were also considered.The dataset used during the studies did not have to be a specific one, but it had to be a dataset of images that allowed classification using both deep learning architectures.

Exclusion Criteria
Studies that oriented their research on using only one of the two deep learning architectures (i.e., Vision Transforms, or Convolutional Neural Networks) were excluded.Additionally, papers that were discovered to be redundant when searches were conducted throughout the chosen databases were eliminated.It was also defined that one of the exclusion criteria will be that the papers would have more than seven citations.
In summary, with the application of these criteria, 10,690 papers were excluded from the Google Scholar database, 89 papers from Web of Science, 53 papers from Scopus, 19,158 papers from ACM Digital Library, and 1,434 papers from Science Direct.

Results
After applying the inclusion and exclusion criteria to the papers obtained in each of the electronic databases, seventeen (17) papers were selected for the literature review.Table 3 lists all the papers selected for this work, the year of publication and the type of publication.Figure 3 shows the distribution of the selected papers by year of publication.Figure 4 shows the distribution of the selected studies by application area.In the figure, most of the papers are generic in their application area.In these papers without a specific application area, the authors try to better understand the characteristics of the two architectures.For example, between CNNs and ViTs, the authors have tried to understand which of the architectures is more transferable.If architectures based on transformers are more robust than CNNs.And if the ViT will be able to see the same information as CNN with a different architecture.Within the health domain, some studies have been developed in different sub-areas, such as breast cancer, to show that ViT can be better than CNNs.The figure also shows that some work has been done, albeit to a lesser extent, in other application areas.Agriculture stands out with two papers comparing ViTs with CNNs.

Findings
An overview of the studies selected through the research methodology is shown in Table 4.This information summarizes the authors' approach, the findings, and other architectures that were used to build a comparative study.Therefore, to address the research questions, this section will offer an overview of the data found in the collected papers.
In the study developed in [12], the authors aimed to compare the two architectures (i.e., ViT and CNN), as well as the creation of a hybrid model that corresponded to the combination of the two.The experiment was conducted using the ImageNet dataset and perturbations were applied to the dataset images.It was concluded that ViT can perform better and be more resilient on images with natural or adverse disturbances than CNN.It was also found in this work that the combination of the two architectures results in a 10% improvement in accuracy (Acc).
The work done in [14] aimed to compare vision transformers (ViT) with Convolutional Neural Networks (CNN) for digital holography, where the goal was to reconstruct amplitude and phase by extracting the distance of the object from the hologram.In this work, DenseNet201, DenseNet169, EfficientNetB4, EfficientNetB7, ViT-B/16, ViT-B32 and ViT-L/16 architectures were compared with a total of 3400 images.They were divided into four datasets, original images with or without filters, and negative images with or without filters.The authors concluded that ViT despite having an accuracy like CNN, was more robust because, due to the self-attention mechanism, it can learn the entire hologram rather than a specific area.
The authors in [7] studied the performance of ViT in comparison with other architectures to detect pneumonia, through chest X-ray images.Therefore, a ViT model, a CNN network developed by the authors and the VGG-16 network were used for the study which focussed on a dataset with 5856 images.After the experiments performed, the authors concluded that ViT was better than CNN with 96.45% accuracy, 86.38% validation accuracy, 10.8% loss and 18.25% validation loss.In this work, it was highlighted that ViT has a self-attention mechanism that allows splitting the image into small patches that are trainable, and each part of the image can be given an importance.However, the attention mechanism as opposed to the convolutional layers makes ViT's performance saturate fast when the goal is scalability.
In the study [21] the goal was to compare ViT with state-of-art CNN networks to classify UAV images to monitor crops and weeds.The authors compared the influence of the size of the training dataset on the performance of the architectures and found that ViT performed better with fewer images than CNN networks in terms of F1-Score.They concluded that ViT-B/16 was the best model to do crop and weed monitoring.In com-parison with CNN networks, ViT could better learn the patterns of images in small datasets due to the self-attention mechanism.
In the scope of lung diseases, the authors in [11] investigated the performance of ViT models to automatically classify emphysema subtypes through Computed Tomography (CT) images in comparison with CNN networks.In this study, they performed a comparative study between the two architectures using a dataset collected by the authors (3192 patches) and a public dataset of 168 patches taken from 115 HRCT slides.In addition to this, they also verified the importance of pre-trained models.They concluded that ViT failed to generalize when trained with fewer images, because when comparing the pre-training accuracy with 91.27% on the training and 70.59% on the test.
In the work in [9], a comparison between state-of-the-art CNNs and ViT models for Breast ultrasound image classification was developed.The study was performed with two different datasets: the first containing 780 images and the second containing 163 images.The following architectures were selected for the study: ResNet50, VGG-16, Inception, NASNET, ViT-S/32, ViT-B/32, ViT-Ti/16, R + ViT-Ti/16 and R26 + ViT-S/16.ViT models were found to perform better than CNN networks for image classification.The authors also highlighted that ViT models could perform better when they were trained with a small dataset, because via the attention mechanism, it was possible to collect more information from different patches, instead of collecting information from the image.
Benz et al., in [5], compared ViT models, with the MLP-Mixer architecture and with CNNs.The goal was to evaluate which architecture was more robust in image classification.The study consisted of generating perturbations and adverse examples in the images and understanding which of the architectures was most robust.However, this study did not aim to analyse the causes.Therefore, the authors concluded that ViT were more robust than CNNs to adversarial attacks and from a features perspective CNN networks were more sensitive to high-frequency features.It was also described that the shiftvariance property of convolutional layers may be at the origin of the lack of robustness of the network in the classification of images that have been transformed.
The authors in [15] performed an analysis between ViT and CNN models aimed at detecting deepfake images.The experiment consisted in using the ForgeryNet dataset with 2.9 million images and 220 thousand video clips, together with three different image manipulation techniques, where they tried to train the models with real and manipulated images.By training the ViT-B model and the EfficientNetV2 network the authors demonstrated that the CNN network could generalize better and obtain higher training accuracy.However, ViT could have better generalization, reducing the bias in the identification of anomalies introduced by one or more different techniques to introduce anomalies.
Chao Xin et al. [17] aimed to compare their ViT model with CNN networks and with another ViT model to perform image classification to detect skin cancer.The experiment conducted by the authors used a public HAM10000 dataset with dermatoscopic skin cancer images and a clinical dataset collected through dermoscopy.In this study, a multi-scale image and the overlapping sliding window were used to serialize the images.They also used contrastive learning to improve the similarity of different labels and minimize the similarity in the same label.Thus, the ViT model developed was better for skin cancer classification using these techniques.However, the authors also demonstrated the effectiveness of balancing the dataset on the model performance, but they did not present the F1-Score values before the dataset is balanced to verify the improvement.
The authors in [19] aimed to study if ViT models could be an alternative to CNNs in time-critical applications.That is, for edge computing instances and IoT networks, applications using deep learning models consume multiple computational resources.The experiment used pre-trained networks such as ResNet152, DenseNet201, InceptionV3, and SL-ViT with three different datasets in the scope of images, audio, and video.They concluded that the ViT model introduced less overhead and performed better than the ar-chitectures used.It was also shown that increasing the kernel size of convolutional layers and using dilated convolutional caused a reduction in the accuracy of a CNN network.
In a study carried out in [20], the authors tried to find in ViTs an alternative solution to CNN networks for asphalt and concrete crack detection.The authors concluded that ViTs, due to the self-attention mechanism, had better performance in crack detection images with intense noise.CNN networks in the same images suffered from a high number of false negative rates, as well as the presence of biases in image classification.
Haolan Wang in [16] aimed to analyse eight different Vision Transformers and compare them with the performance of a pre-trained CNN network and without the pretrained parameters to perform traffic signal recognition in autonomous driving systems.In this study, three different datasets with images of real-world traffic signals were used.This allowed the authors to conclude that the pre-trained DenseNet161 network had a higher accuracy than the ViT models to do traffic sign recognition.However, it was found in this work that ViT models performed better than the DenseNet161 network without the pre-trained parameters.From this work, it was also possible to conclude that the ViT models with a total number of parameters equal to or greater than the CNN networks, used during the experiment, had a shorter training time.
The work done in [13] compared CNN networks with Vision Transformers models for the classification of Diabetic Foot Ulcer images.For the study, the authors decided to use the following architectures: Big Image Transfer (BiT), EfficientNet, ViT-base and Data-efficient Image Transformers (DeIT) upon a dataset composed of 15,683 images.A further aim of this study to compare the performance of deep learning models using Stochastic Gradient Descent (SGD) [22] with Sharpness-Aware Optimization (SAM) [23,24].These two tools are optimizers that seek to minimize the value of the loss function, improving the generalization ability of the model.However, SAM minimizes the value of the loss function and the sharpness loss, looking for parameters in the neighbourhood with a low loss.Therefore, this work concluded that the SAM optimizer originated an improvement in the values of F1-Score, AUC, Recall and Precision in all the architectures used.However, the authors did not present the training and test values that allow for evaluating the improvement in the generalization of the models.Therefore, the BiT-ResNetX50 model with the SAM optimizer obtained the best performance for the classification of Diabetic Foot Ulcer images with F1-Score = 57.71%,AUC = 87.68%,Recall = 61.88%,and Precision = 57.74%.
The authors in [18] performed a comparative study between ViT models and CNN networks used in the state of the art with a model developed by them, where they combined CNN and transformers to perform insect pest recognition to protect agriculture worldwide.This study involved three public datasets: the IP102 dataset, the D0 dataset and Li's dataset.The algorithm created by the authors consisted of using the sequence of inputs formed by the CNN feature maps to make the model more efficient, and a flexible attention-based classification head was implemented to use the spatial information.Comparing the results obtained, the proposed model obtained a better performance in insect pest recognition with an accuracy of 74.897%.This work demonstrated that finetuning worked better on vision transformers than CNN, but on the other hand, this caused the number of parameters, the size, and the inference time of the model to increase significantly with respect to CNN networks.Through their experiments, the authors also demonstrated the advantage of using decoder layers in the proposed model to perform image classification.The greater the number of decoder layers, the greater the accuracy value of the model.However, this increase in the number of decoder layers increased the number of parameters, the size, and the inference time of the model.In other words, the architecture to process the images consumes far greater computational resources, which may not compensate for the increase in accuracy value with few layers.In the case of this study, the increase from one layer to three decoder layers represented only an increase of 0.478% in the accuracy value.
Several authors in [6,8,10] went deeper into the investigation and aimed to understand how the learning process of vision transformers works if ViT could be more transferable and better understand if the transform-based architecture were more robust than CNNs.In this sense, the authors in [8] intended to analyse the internal representations of ViT and CNN structures in image classification benchmarks and found differences between them.One of the differences was that ViT has greater similarity between high and low layers, while CNN architecture needs more low layers to compute similar representations in smaller datasets.This is due to the self-attention layers implemented in ViT, which allows it to aggregate information from other spatial locations, vastly different from the fixed field sizes in CNNs.They also observed that ViTs in the lower, selfattention layers can access information from local heads (small distances) and global heads (large distances).Whereas CNNs have access to information locally in the lower layers.On the other hand, the authors in [10] systematically analysed the transfer learning capacity in the two architectures.The study was conducted by comparing the performance of the two architectures on single-task and multi-task learning problems, using the ImageNet dataset.Through this study, the authors concluded that the transformbased architecture contained more transferable representations compared to convolutional networks for fine-tuning, presenting better performance and robustness in multitask learning problems.
In another study carried out in [6], the goal was to prove if ViT were more robust than CNN as the most recent studies have shown.The authors developed their work comparing the robustness of the two architectures using two different types of perturbations: adversarial samples, which consists in evaluating the robustness of deep learning architectures in images with human-caused perturbations (i.e., data augmentation) and out-of-distribution samples, which consists in evaluating the robustness of the architectures in benchmarks of classification images.Through this experiment, it was demonstrated that by replacing the activation function ReLU by the activation function of transformer-based architecture (i.e., GELU) the CNN network was more robust than ViT in adversarial samples.In this study, it was also demonstrated that CNN networks were more robust than ViT in patch-based attacks.However, the authors concluded that the self-attention mechanism was the key to the robustness of the transformer-based architecture in most of the experiments performed.

Discussion
The results can be summarized as follows.In [12], ViTs were found to perform better and be more resilient to images with natural or adverse disturbances compared to CNNs.Another study [14] concluded that ViTs are more robust in digital holography because they can access the entire hologram rather than just a specific area, giving them an advantage.ViTs have also been found to outperform CNNs in detecting pneumonia in chest X-ray images [7] and in classifying UAV images for crop and weed monitoring with small datasets [21].However, it has been noted that ViT performance may saturate if scalability is the goal [7].In a study on classifying emphysema subtypes in CT images [11], ViTs were found to struggle with generalization when trained on fewer images.Nevertheless, ViTs were found to outperform CNNs in breast ultrasound image classification, especially with small datasets [9].Another study [5] found that ViTs are more robust to adversarial attacks and that CNNs are more sensitive to high-frequency features.The authors in [15] found that CNNs had higher training accuracy and better generalization, but ViTs showed potential to reduce bias in anomaly detection.In [17], the authors claimed that the ViT model showed better performance for skin cancer classification.ViTs have also been shown to introduce less overhead and perform better for timecritical applications in edge computing and IoT networks [19].In [20], the authors investigated the use of ViTs for asphalt and concrete crack detection and found that ViTs performed better due to the self-attention mechanism, especially in images with intense noise and biases.Wang [16] found that a pre-trained CNN network had higher accuracy, but the ViT models performed better than the non-pre-trained CNN network and had a shorter training time.The authors in [13] used several models for diabetic foot ulcer image classification and compared SGD and SAM optimizers, concluding that the SAM optimizer improved several evaluation metrics.In [18], the authors showed that fine-tuning performed better on ViT models than CNNs for insect pest recognition.
Therefore, based on the information gathered from the selected papers, we attempt to answer the research questions posed in Section 1: RQ1-Can the ViT architecture have a better performance than the CNN architecture, regardless of the characteristics of the dataset?
The literature review shows that ViT in image processing can be more efficient in smaller datasets due to the increase of relations created between images through the selfattention mechanism.However, it is also shown that if ViT trained with little data will have less generalization ability and worse performance compared to CNN's.
RQ2-What influences the CNNs that do not allow them to perform as well as the ViTs?Shift-invariance is a limitation of CNN that makes the same architecture not have a satisfactory performance because the introduction of noise in the input images makes the same architecture unable to get the maximum information from the central pixels.However, the authors in [27] propose the addition of an anti-aliasing filter which combines blurring with subsampling in the Convolutional, MaxPooling and AveragePooling layers.Demonstrating through the experiment carried out that the application of this filter originates a greater generalization capacity and an increase in the accuracy of CNN.Furthermore, increasing the kernel size in convolutional layers and using dilated convolution have been shown as limitations that deteriorate the performance of CNNs against ViTs.

RQ3-How does the Multi-Head Attention mechanism, which is a key component of
ViTs, influence the performance of these models in image classification?
The Attention mechanism is described as the mapping of a query and a set of keyvalue pairs to an output, the output being the result of a weighted sum of the values, in which the weight given is calculated, through the query with the corresponding key by a compatibility function.Multi-head Attention mechanism instead of performing a single attention function will perform multiple projections of attention functions [28].This mechanism improves the ViT architecture because it allows it to extract more information from each pixel of the images that have been placed inside the embedding.In addition, this mechanism can have better performance if the images have more secondary elements that illustrate the central element.And since this mechanism performs several computations in parallel, it reduces the computational cost [29].
Overall, ViTs have shown promising performance compared to CNNs in various applications, but there are limitations and factors that can affect their performance, such as dataset size, scalability, and pre-training accuracy.

Threats to Validity
This section discusses internal and external validity threats.The validity of the entire process performed in this study is demonstrated and how the results of this study can be replicated in other future experiments.
In this literature review, different search strings were used in each of the selected data sources, resulting in different results from each source.This approach may introduce a bias into the validation of the study, as it makes it difficult to draw conclusions about the diversity of studies obtained by replicating the same search.In addition, the maturity of the work was identified as an internal threat to validity, as the ViT architecture is relatively new and only a limited number of research projects have been conducted using it.In order to draw more comprehensive conclusions about the robustness of ViT compared to CNN, it is imperative that this architecture is further disseminated and deployed, thereby making more research available for analysis.
In addition to these threats, this study did not use methods that would allow to quantitatively and qualitatively analyse the results obtained in the selected papers.This may bias the validity of this review in demonstrating which of the deep learning architectures is more efficient in image processing.
The findings obtained in this study could be replicated in other future research in image classification.However, the results obtained may not be the same as those described by the selected papers because it has been proven that for different problems and different methodologies used, the results are different.In addition, the authors do not describe in sufficient detail all the methodologies they used, nor the conditions under which the experiment was performed.

Strengths, Limitations, and Future Research Directions
The review made it possible to identify not only the strengths of each architecture (outlined in Section 6.1), but also their potential for improvement (described in Section 6.2).Future research directions were also derived from this and are presented in Section 6.3.

Strengths
Both CNNs and ViTs have their own advantages, and some common ones.This section will explore these in more detail, including considerations on the Datasets, Robustness, Performance optimization, Evaluation, Explainability and Interpretability, and Architectures.

Dataset Considerations
CNNs have been widely used and extensively studied for image-related tasks, resulting in a rich literature, established architectures, and pre-trained models, making them accessible and convenient for many datasets.On the other hand, ViTs can process patches in parallel, which can lead to efficient computation, especially for large-scale datasets, and allow faster training and inference.ViTs can also handle images of different sizes and aspect ratios without losing resolution, making them more scalable and adaptable to different datasets and applications.

Robustness
CNNs are inherently translation-invariant, making them robust to small changes in object position or orientation within an image.The main advantage of ViTs is their ability to effectively capture global contextual information through the self-attention mechanism, enabling them to model long-range dependencies and contextual relationships, which can improve robustness in tasks that require understanding global context.ViTs can also adaptively adjust the receptive fields of the self-attention mechanism based on input data, allowing them to better capture both local and global features, making them more robust to changes in scale, rotation, or perspective of objects.
Both architectures can be trained using data augmentation techniques, such as random cropping, flipping, and rotation, which can help improve robustness to changes in input data and reduce overfitting.Another technique is known as adversarial training, where they are trained on adversarial examples: perturbed images designed to confuse the model, to improve its ability to handle input data with adversarial perturbations.Combining models, using ensemble methods, such as bagging or boosting, can also improve robustness by exploiting the diversity of multiple models, which can help mitigate the effects of individual model weaknesses.

Performance Optimization
CNNs can be effectively compressed using techniques such as pruning, quantization, and knowledge distillation, reducing model size and improving inference efficiency without significant loss of performance.They are also well-suited for hardware optimization, with specialized hardware accelerators (e.g., GPUs, TPUs) designed to perform convolutional operations efficiently, leading to optimized performance in terms of speed and energy consumption.
ViTs can efficiently scale to handle high-resolution images or large-scale datasets because they operate on the entire image at once and do not require processing of local receptive fields at multiple spatial scales, potentially resulting in improved performance in terms of scalability.
Transfer Learning for pre-train on large-scale datasets and fine-tune on smaller datasets can potentially lead to improved performance with limited available data and can be used with both architectures.

Evaluation
CNNs have been widely used in image classification tasks for many years, resulting in well-established benchmarks and evaluation metrics that allow meaningful comparison and evaluation of model performance.The standardized evaluation protocols, such as cross-validation or hold-out validation, which provide a consistent framework for evaluating and comparing model performance across different datasets and tasks, are applicable for both architectures.

Explainability and Interpretability
CNNs produce feature maps that can be visualized, making it possible to interpret the behaviour of the model by visualizing the learned features or activations in different layers.They capture local features in images, such as edges or textures, which can lead to interpretable features that are visually meaningful and can provide insight into how the model is processing the input images.ViTs, on the other hand, are designed to capture global contextual information, making them potentially more interpretable in tasks that require an understanding of long-range dependencies or global context.They have a hierarchical structure with self-attention heads that can be visualized and interpreted individually, providing insights into how different heads attend to different features or regions in the input images.

Architecture
CNNs have a wide range of established architecture variants, such as VGG, ResNet, and Inception, with proven effectiveness in various image classification tasks.These architectures are well-tested and widely used in the deep learning community.ViTs can be easily modified to accommodate different input sizes, patch sizes, and depth, providing flexibility in architecture design and optimization.

Limitations
Despite their many advantages and the breakthroughs made over the years.There are still some drawbacks to the architectures studied.This section focuses on these.

Dataset Considerations
CNNs can be susceptible to biases present in training datasets, such as biased sampling or label noise, which can affect the validity of training results.They typically operate on fixed input spatial resolutions, which may not be optimal for images of varying size or aspect ratio, resulting in information loss or distortion.While pre-trained models for CNNs are well-established, pre-trained models for ViT are (still) less common for some datasets, which may affect the ease of use in some situations.

Robustness
CNNs may struggle to capture long-range contextual information, as they focus primarily on local feature extraction, which may limit the ability to understand global context, leading to reduced robustness in tasks that require global context, such as scene understanding, image captioning or fine-grained recognition.
Both architectures can be prone to overfitting, especially when the training data is limited or noisy, which can lead to reduced robustness to input data outside the training distribution.Adversarial attacks can also pose a challenge to the robustness of both architectures.In particular, ViTs do not have an inherent spatial inductive bias like CNNs, which are specifically designed to exploit the spatial locality of images.This can make them more vulnerable to certain types of adversarial attacks that rely on spatial information, such as spatially transformed adversarial examples.

Performance Optimization
CNNs can suffer from reduced performance and increased memory consumption when applied to high-resolution images or large-scale datasets, as they require processing of local receptive fields at multiple spatial scales, leading to increased computational requirements.Compared to CNNs, ViTs are computationally expensive, especially as the image size increases or model depth increases, which may limit their use in certain resource-constrained environments.Reduced computational complexity can sometimes result in decreased robustness, as models may not have the ability to learn complex features that can help generalize well to adversarial examples.

Evaluation
As mentioned above, CNNs are primarily designed for local feature extraction and may struggle to capture long-range contextual dependencies, which can limit the evaluation performance in tasks that require understanding of global context or long-term dependencies.ViTs are relatively newer than CNNs, and as such, may lack wellestablished benchmarks or evaluation metrics for specific tasks or datasets, which can make performance evaluation difficult and less standardized.

Explainability and Interpretability
Despite well-established methods for model interpretation, CNNs still lack full interpretability because the complex interactions between layers and neurons can make it difficult to fully understand the model's decision-making process, particularly in deeper layers of the network.
While ViTs produce attention maps for interpretability, the complex interactions between self-attention heads can still present challenges in accurately interpreting the model's behaviour.ViTs can have multiple heads attending to different regions, which can make it difficult to interpret the interactions between different attention heads and to understand the reasoning behind model predictions.

Architecture
CNNs typically have fixed, predefined model architectures, which may limit the flexibility to adapt to specific task requirements or incorporate domain-specific knowledge, potentially affecting performance optimization.For ViTs, the availability of established architecture variants is still limited, which may require more experimentation and exploration to find optimal architectures for specific tasks.

Future Research Directions
As future research, meta-analysis or systematic reviews should be conducted within the scope of this review to provide the scientific community with more detail on which of the architectures is more effective at image classification, in addition to specifying under what conditions a particular architecture stands out from the others.It is therefore necessary to facilitate the choice of the deep learning architecture to be used in future image classification problems.This section aims to provide guidelines for future research in this area.

Dataset considerations
The datasets used in most studies may not be representative of real-world scenarios.Future research should consider using more diverse datasets that better reflect the complexity and variability of real-world images.As an example, it would be interesting to study the impact that image resolution might have on the performance of deep learning architectures.That is, it would be important to find out in which of the architectures (i.e., ViT, CNN, and MLP-Mixer) the resolution of the images will influence their performance, as well as what impact it will have on the processing time of the deep learning architectures.

Robustness
As documented in [5], deep learning models are typically vulnerable to adversarial attacks, where small perturbations to an input image can cause the model to misclassify it.Future research should focus on developing architectures that are more robust to adversarial attacks (for example by further augmenting the robustness of ViTs), as well as exploring ways to detect and defend against these attacks.
Beyond that, most studies (as the ones reviewed in this work) have focused on the performance of deep learning architectures on image classification tasks, but there are many other image processing tasks (such as object detection, segmentation, and captioning) that could benefit from the use of these architectures.Future research should further explore the effectiveness of these architectures on these tasks.

Performance Optimization
Deep learning architectures require substantial amounts of labelled data to achieve high performance.However, labelling data is time-consuming and expensive.Future research should explore ways to improve the efficiency of deep learning models, such as developing semi-supervised learning methods or transfer learning (following up on the finding in [10]) that can leverage pre-trained models.
In addition, the necessity of large amounts of labelled data requires significant computational resources, which limits the deployment on resource-constrained devices.Future research should focus on developing architectures that are optimized for deployment on these devices, as well as exploring ways to reduce the computational cost of existing architectures.It should explore the advantages of the implementation of the knowledge distillation of deep learning architectures to reduce computational resources.

Evaluation
The adequacy of the metrics to the task and problem at hand is also another suggested line of future research.Most studies have used standard performance metrics (such as accuracy and F1-score) to evaluate the performance of deep learning architectures.Future research should consider using more diverse metrics that better capture the strengths and weaknesses of different architectures.

Explainability and Interpretability
Deep learning models are often considered as black boxes because they do not provide insight into the decision-making process.This may prevent the usage of the models in certain areas such as justice and healthcare [30], among others.Future research should focus on making these models more interpretable and explainable.For example, by designing transformer architectures that provide visual explanations of their decisions or by developing methods for extracting features that are easily interpretable.

Architecture
In future investigations, it will be necessary to study the impact of the MLP-Mixer deep learning architecture in image processing, what are the characteristics that allow it to have a performance superior to CNNs, but inferior to the performance obtained by the ViT architecture [5].Future research should also focus on developing novel architectures that can achieve high performance with fewer parameters or that are more efficient in terms of computation and memory usage.

Conclusions
This work has reviewed recent studies done in image processing to give more information about the performance of the two architectures and what distinguishes them.A common feature across all papers is that transformer-based architecture or the combination of ViTs with CNN allows for better accuracy compared to CNN networks.It has also been shown that this new architecture, even with hyperparameters fine-tuning, can be lighter than the CNN, consuming fewer computational resources and taking less training time as demonstrated in the works [16,19].
In summary, the ViT architecture is more robust than CNN networks for images that have noise or are augmented.It manages to perform better compared to CNN due to the self-attention mechanism because it makes the overall image information accessible from the highest to the lowest layers [12].On the other hand, CNN's can generalize better with smaller datasets and get better accuracy than ViTs, but in contrast, ViTs have the advantage of learning information better with fewer images.This is because the images are divided into small patches, so there is a greater diversity of relationships between them.

Figure 1 .
Figure 1.Example of an architecture of the ViT, based on [1].

Figure 2 .
Figure 2. Example of an architecture of a CNN, based on [2].

Figure 3 .
Figure 3. Distribution of the selected studies by years.

Figure 4 .
Figure 4. Distribution of the selected studies by application area.

Table 1 .
Data sources and the number of obtained results.

Table 2 .
Data sources and used search string.

Table 3 .
List of selected studies.

Table 4 .
Overview of selected studies.