Comparison of Convolutional Neural Networks and Transformers for the Classification of Images of COVID-19, Pneumonia and Healthy Individuals as Observed with Computed Tomography

In this work, the performance of five deep learning architectures in classifying COVID-19 in a multi-class set-up is evaluated. The classifiers were built on pretrained ResNet-50, ResNet-50r (with a kernel size of 5×5 in the first convolutional layer), DenseNet-121, MobileNet-v3 and the state-of-the-art CaiT-24-XXS-224 (CaiT) transformer. The cross entropy and weighted cross entropy losses were minimised with Adam and AdamW. In total, 20 experiments were conducted, each with 10 repetitions, and the following metrics were obtained: accuracy (Acc), balanced accuracy (BA), F1 and F2 from the general Fβ macro score, Matthews Correlation Coefficient (MCC), sensitivity (Sens) and specificity (Spec), followed by bootstrapping. The performance of the classifiers was compared using the Friedman–Nemenyi test. The results show that less complex architectures such as ResNet-50, ResNet-50r and DenseNet-121 achieved better generalisation, with MCC rankings of 1.53, 1.71 and 3.05, respectively, while MobileNet-v3 and CaiT obtained rankings of 3.72 and 5.0, respectively.


Introduction
Coronavirus 2019 (COVID-19) is an infectious disease caused by the Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1], which has led to a pandemic with millions of cases and deaths confirmed all over the world. The outbreak has triggered not only a health crisis but has also had a severe psychological, social and economic impact worldwide [2,3]. Attempts to improve the diagnosis, contain, and reduce the spread of the disease have led COVID-19 to become one of the most researched topics in the world. At the time of writing (June 2022), PubMed reported 269,111 COVID-19-related entries (https://pubmed.ncbi.nlm.nih.gov/?term=covid-19 (accessed on 1 June 2022)), a significant number accumulated in just two years, as prior to 2019 there were only 56 entries. Diagnosis of COVID-19 with Reverse Transcriptase Polymerase Chain Reaction (RT-PCR) tests is widespread; however, its sensitivity is only moderate [4][5][6][7][8]. For this reason, medical imaging diagnosis with radiographs and Computed Tomography (CT) has been widely used [9][10][11], as it is generally considered more reliable for the identification of COVID-19 hallmarks, which include ground glass opacity with or without consolidation in the posterior and peripheral lung, linear opacity, "crazy-paving" pattern, "reversed halo" sign and vascular enlargement in the lungs [12][13][14][15].
Deep Learning (DL) is a sub-field of Artificial Intelligence (AI) that enables algorithms to automatically extract features from data, learn patterns and characteristics, and generate predictions on unseen data. Some principles and ideas behind DL and AI have been known for decades, e.g., the WISARD architecture from 1984 [16] or the Self-Organising Maps from 1982 [17]. Interestingly, discussions about the hype or reality of Neural Computing [18] and how difficult Artificial Intelligence really is [19] have remained. With the addition of a large number of processing layers (hence the term deep), the availability of large data sets and increased computational power, these techniques have recently shown excellent results in many areas [20]. These multiple layers allow a large number of nonlinear modules to convert the representation of the input data, which can be an image or text, to a more abstract representation [21]. The breakthrough of deep learning is sometimes linked to the outstanding results presented in the classification task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [22]. In medical applications, deep learning architectures have provided excellent results, for instance, in the classification of skin cancer images [23].
Recent studies on image analysis of radiographs and CT scans have shown that DL-based methods are capable of detecting, quantifying and monitoring COVID-19 with high accuracy [24][25][26][27][28][29]. For conventional radiographs or 2D X-ray images, deep learning approaches have combined Convolutional Neural Networks with Long Short-Term Memory (CNN-LSTM) models [30], Generative Adversarial Networks with Long Short-Term Memory [31], or self-augmentation mechanisms [11], with the objective of distinguishing between different cases such as healthy against disease, COVID against pneumonia, etc. Some studies have focused on the segmentation of regions of interest within the lung region [32][33][34] using modifications of the popular U-Net architecture [35]. The present study focused on classification of the images and not on segmentation, in part because there was no ground truth available, but also because the focus was the methodology to compare the classification with non-parametric statistics. CT scanners, as well as Magnetic Resonance Imaging and other medical imaging devices, can generate volumetric data, which in some cases can provide better results than analysis on a per-slice basis [36]. As such, some COVID studies have focused on the volumetric analysis of the data. For instance, Bougourzi analysed the percentage of COVID infection to infer the state of patients (e.g., Normal, Moderate, Severe, etc.) [37][38][39]. It is also possible to combine slice-level decisions with tools such as Long Short-Term Memory models [40]. When data other than images are present, i.e., medical notes, electronic health records or audio recordings, it is possible to perform multimodal diagnosis [41][42][43]. However, it is not always the case that researchers have access to multi-modal data, and thus thorough evaluation of a single type of data is important.
Ensemble techniques, in which several deep learning models are trained and a decision is taken based on votes from all the models, are popular [44][45][46] and can provide good results. However, the energy consumption of training many models, several of which may provide suboptimal results, should be considered in the present world, where the carbon footprint of computational processes is not negligible [47]. For further information about approaches to COVID, the reader is referred to recent review papers, e.g., [48][49][50][51]. Many of the reported COVID studies consist mainly of fine-tuning pre-trained convolutional networks [24,28,52,53], or of ensembles of methods together with optimisation techniques [54,55]. Although these studies have shown very promising results, a large proportion of them do not provide sufficient information about how and from where the data are sourced, how the data are handled and pre-processed (if at all), the training configurations, or statistical grounds to support why a proposed model is significantly better than another; it has been shown that, because of these limitations, such studies still present little value in clinical settings [56].
To address these issues, in this work, 20 experiments were built based on five DL architectures: CaiT-24-XXS-224, DenseNet-121, MobileNet-v3-large, ResNet-50 and ResNet-50r. The first selection was the ResNet architectures, which have had a major influence on the design of deep neural convolutional and sequential networks, followed by DenseNet and MobileNet; these are based on convolutions, and the latter is designed for mobile applications. Meanwhile, CaiT is a state-of-the-art transformer that does not use convolutions. Each of these DL architectures was developed at intervals of roughly two years from 2015 to 2021. Table 1 outlines the parameters, the inference time required in GPU/CPU (running in GPU), the operations utilising more resources when processing images, and the year in which each architecture was published. Whilst there are numerous architecture options, for the present work it was considered that the choice of architectures was representative of the fast developments in artificial neural networks for application in computer vision. For each architecture, two loss functions (cross entropy and weighted cross entropy) and two optimisers (Adam and AdamW) were applied, and the architectures were evaluated with seven metrics. In addition, the results for each metric were bootstrapped for 1000 cycles, and their predictive power was further compared with non-parametric statistics for robustness in different scenarios. Thus, the main contributions of this work are summarised as follows:
• A well-structured experimental setup for the evaluation and unbiased comparison of the performance of five representative deep learning architectures (CaiT-24-XXS-224, DenseNet-121, MobileNet-v3-large, ResNet-50 and ResNet-50r) for the classification of COVID-19 as observed with Computed Tomography was proposed.
• The ResNet-50r architecture, which is based on ResNet-50 but with filters of size 5 × 5 in the first convolutional layer (Conv1), was used to observe the effect of the kernel (filter) size on the classification of COVID-19.
• A bootstrapping technique was applied to derive a very large number of samples, which compensates for outliers in non-normally distributed data.
• The results of each deep network architecture and experiment with different optimisers and loss functions were compared using non-parametric statistical tests.

• The results show that less resource-demanding networks can outperform more complex architectures. This is a significant consideration related to the energy consumption necessary to train deep learning architectures in the light of the current climate emergency [57][58][59].

Data Collection
To build the classification dataset, data from two public repositories were collated. The first dataset was sourced from Kaggle [60,61]. This dataset comprises a multi-nation collection of curated COVID-19 CT scans from 7 public sources [62]. It contained 7593 curated images from 466 patients diagnosed with COVID-19, 6893 images from 604 patients considered healthy with normal lungs, and 2618 images from 60 patients diagnosed with community-acquired pneumonia (CAP). The second dataset was sourced from Mendeley Data [63] and contained COVID-19 and common pneumonia (CP) CT scans. From this dataset, only 328 CP images were used; these were merged with the CAP images. Figure 1 shows some representative images of COVID-19, CAP and Non-COVID (normal lungs) from the collated dataset. The dataset was split by class stratification to ensure that all subsets preserved the proportion of images from the minority class "CAP". The split ratios were 80:10:10 for training (13,945 images), validation (1743 images) and test (1744 images), respectively.
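A class-stratified 80:10:10 split such as the one described above can be sketched as follows (a minimal, self-contained version; the file names and class labels are illustrative placeholders, not the authors' loading code):

```python
import random
from collections import defaultdict

def stratified_split(items, labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/val/test, preserving each class's proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    train, val, test = [], [], []
    for label, group in sorted(by_class.items()):
        rng.shuffle(group)
        n_train = int(len(group) * ratios[0])
        n_val = int(len(group) * ratios[1])
        train += [(x, label) for x in group[:n_train]]
        val += [(x, label) for x in group[n_train:n_train + n_val]]
        test += [(x, label) for x in group[n_train + n_val:]]  # remainder
    return train, val, test
```

Splitting per class before concatenating guarantees that the minority class keeps its proportion in every subset, which is what the stratification above is meant to achieve.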

Deep Learning Architectures
To build the classification experiments (models), four pre-trained out-of-the-box deep convolutional architectures with different levels of complexity and one transformer were proposed. The prediction power of CaiT-24-XXS-224, DenseNet-121, MobileNet-v3-large, ResNet-50 and ResNet-50r on the collated multi-class COVID-19 dataset was then evaluated and compared.
ResNet-50 is a 50-layer deep convolutional architecture that features 3-layer skip connections (bottleneck blocks). The skip connections enable the network to copy activations from bottleneck block to bottleneck block [64]. To build ResNet-50r, the kernels of the first convolutional layer (Conv1) of ResNet-50 were resized from 7 × 7 to 5 × 5; the suffix 'r' indicates that the network has resized kernels. DenseNet-121 (121 layers deep) consists of multiple dense blocks (small convolutional layers, batch normalisation and ReLU activation) and transition layers. Each layer in a dense block is forward-connected to every other layer by using concatenation shortcuts [65]. MobileNet-v3-large (MobileNet-v3) is a convolutional neural network designed for mobile and embedded vision applications. It is based on a streamlined architecture that uses depth-wise separable convolutions, with an efficient last stage at the end of the network that further reduces latency [66]. Transformers were initially used in the field of Natural Language Processing [67] and have recently been adapted for large-scale image classification, demonstrating that convolutional networks are not strictly necessary for image processing. Class Attention in Image Transformers (CaiT) is a transformer for computer vision applications created to optimise the performance of transformers with a large number of layers [68]. CaiT-24-XXS-224 indicates that the architecture has a depth of 24 class-attention layers, a working dimensionality of 192, and was trained at resolution 224. The parameters of each architecture are shown in Table 1.

Experimental Setup
The experimental setup for this work is described in Table 3. The experiments (classifiers) were designed by combining three factors: the neural network architecture, the loss function (objective) and the optimiser used to minimise the loss. To detect the impact of class imbalance on the predictions, the weighted version of the cross entropy (wCE) loss function was minimised with Adam and AdamW. The wCE handles class imbalance by penalising misclassification of the minority class with a higher cost (weight). The Adam optimiser (derived from adaptive moment estimation) is an optimiser in which the learning rate is adaptive, and it can handle sparse gradients on noisy problems [69]. AdamW is Adam with decoupled weight decay and the primary choice for training transformer models. Whilst there are many other optimiser options, such as the adaptive gradient algorithm (Adagrad), stochastic gradient descent (SGD), SGD with Momentum (SGDM) and Root Mean Square Propagation (RMSprop), Adam and AdamW were selected because these adaptive gradient methods do not underperform momentum or gradient-descent optimisers [70]. Adam is probably the most popular optimiser option: a search for "Adam optimiser" in Google Scholar (https://scholar.google.co.uk/scholar?q=adam+optimizer accessed on 4 August 2022) returned 103,000 entries, whereas equivalent searches returned fewer entries (Adagrad: 10,200, SGD: 36,300, SGDM: 1500, RMSProp: 20,700, AdamW: 5950). On the other hand, AdamW has been shown to improve the performance of Adam by decoupling the weight decay, outperforming SGD with Momentum in image classification tasks [71]. With these, 20 experiments (5 × 2 × 2) were built and evaluated by training CaiT, DenseNet, MobileNet-v3, ResNet-50 and ResNet-50r to minimise CE and wCE with Adam or AdamW (Table 3). Figure 2 illustrates the pipeline approach to classification and comparison of the models in our experimental setup. All experiments were implemented in the PyTorch framework and run on Google Colab Pro and Pro+.
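The loss and optimiser factors above can be combined as in the following sketch. The inverse-frequency class weighting for wCE is a common choice assumed here for illustration; the paper does not state the exact weighting scheme.

```python
import torch
import torch.nn as nn
from collections import Counter

def build_loss(train_labels, weighted=False):
    """train_labels: integer class ids of the training set."""
    if not weighted:
        return nn.CrossEntropyLoss()        # plain CE
    counts = Counter(train_labels)
    n, k = len(train_labels), len(counts)
    # Rarer classes get a higher cost: w_c = n / (k * n_c).
    weight = torch.tensor([n / (k * counts[c]) for c in sorted(counts)])
    return nn.CrossEntropyLoss(weight=weight)

def build_optimizer(model, name="adam", lr=2e-5):
    if name == "adamw":
        # AdamW: Adam with decoupled weight decay [71].
        return torch.optim.AdamW(model.parameters(), lr=lr)
    return torch.optim.Adam(model.parameters(), lr=lr)
```

With the collated dataset, the minority CAP class would receive the largest weight, which is precisely how wCE penalises its misclassification more heavily.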
The code is available on an as-is basis in the following GitHub repository: https://github.com/ace-aitech/COVID-19-classification (accessed on 4 August 2022).

Figure 2.
Graphical illustration of the pipeline steps used for the training, evaluation and comparison of the deep neural network models for the classification of COVID-19 in a multi-class setup. Training, validation and test were run 10 times; the test and validation results were then bootstrapped for 1000 cycles. * Indicates the bootstrapped outputs from the validation phase.

Training and Validation
All network architectures described in Section 2.2 were pre-trained on the ImageNet dataset; the images were therefore normalised and resized to match the pre-training setup. On-line random horizontal flip augmentation was also applied during training. The approach to training was transfer learning, fine-tuning each experiment for 8 epochs with a learning rate of 2 × 10−5 in batches of 8 images. Training and validation classification loss and accuracy were calculated after each epoch. The validation accuracy was the primary indicator of how well the classifiers were performing.

Performance Metrics for Evaluation
The training procedure outlined in Section 2.4 and the evaluation on the test dataset were repeated 10 times per experiment. To make an unbiased comparison of the classifiers, the weights from the last epoch of training were used for evaluation on the test set. For this study, True/False Positives/Negatives (TP/TN/FP/FN) were defined by the correct or incorrect prediction of the class for the whole image. Accuracy (Acc), Balanced Accuracy (BA), F1 and F2 from the general Fβ macro score, Matthews Correlation Coefficient (MCC), Sensitivity (Sens) and Specificity (Spec) were used to evaluate the performance of the models by experiment and network. Despite being widely used to evaluate the performance of classifiers, accuracy is biased towards the majority class in imbalanced datasets. On the other hand, Precision, Recall, the Fβ macro score and the MCC have been widely used to overcome the imbalance problem [72]. The BA provides an average measure of how likely an instance of a class is to be correctly classified across the different classes. It consists of the arithmetic mean of the recall of each class; it is "balanced" because every class has the same weight and the same importance [73]. The macro Fβ score is a weighted harmonic mean of the macro-precision and the macro-recall. For the multi-class setup, F1 and F2 (where β takes the values of 1 and 2, respectively) were used. The F1 score weights recall and precision equally [73], whereas the F2 score weights recall twice as much as precision and thus severely penalises false negatives. The MCC is a measure of the correlation between the true and predicted classes, and it is regarded as a good indicator of the overall balance of the prediction model. Recent work [72] demonstrated that the MCC is a well-suited metric for imbalanced multi-class domains.
The sensitivity (or recall) is the number of true positives divided by the number of all samples that should have been identified as positive. The specificity is the number of true negatives divided by the total number of instances that are actually negative.
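The seven metrics can be computed from the multi-class confusion matrix with macro averaging (each class weighted equally), following the standard definitions; the multi-class MCC below uses the generalised (Gorodkin) formulation adopted by common libraries. This is a sketch, not the authors' implementation:

```python
import numpy as np

def evaluate(y_true, y_pred, n_classes=3):
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true class, cols: predicted
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp               # predicted as c, but not c
    fn = cm.sum(axis=1) - tp               # true c, missed
    tn = cm.sum() - tp - fp - fn
    recall = tp / (tp + fn)                # per-class sensitivity
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    def f_beta(b):                         # macro F-beta score
        return np.mean((1 + b**2) * precision * recall
                       / (b**2 * precision + recall))
    # Multi-class MCC (generalised formulation over the confusion matrix).
    c, s = tp.sum(), cm.sum()
    t_k, p_k = cm.sum(axis=1), cm.sum(axis=0)
    mcc = (c * s - t_k @ p_k) / np.sqrt((s**2 - p_k @ p_k) * (s**2 - t_k @ t_k))
    return {"Acc": c / s,
            "BA": recall.mean(),           # balanced accuracy = mean recall
            "F1": f_beta(1), "F2": f_beta(2),
            "MCC": mcc,
            "Sens": recall.mean(), "Spec": specificity.mean()}
```

With macro averaging, BA and macro sensitivity coincide by definition, which is consistent with the very close BA/Sens values reported later in the results.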

Statistical Comparison
Non-parametric statistics do not require assumptions about the distribution of the data. Parametric comparison tests such as ANOVA and MANOVA not only assume that the samples come from a normal distribution but, most importantly, that all variables have equal variance (sphericity). For the comparison of intelligent algorithms this cannot be assumed, and its violation can also have a detrimental impact on the post hoc test [74]. Therefore, the non-parametric Friedman omnibus test [75] and the Nemenyi post hoc pairwise comparison [76] were used. In this work, a hold-out approach was used to evaluate the performance of the classifiers (experiments) with the metrics defined in Section 2.5. Bootstrapping is a non-parametric method that consists of sampling, with replacement, from a single original sample; this allows an approximation of the sampling distribution of statistics from the original data [77]. To build the initial sample, the hold-out process was run and evaluated 10 times per experiment. Then, 1000 bootstrap samples per experiment were generated, and the average ranking and confidence interval (CI) of each evaluation metric by network and experiment were calculated. Bootstrapped BA has been used to compare traditional machine learning classifiers with DL methods for stress recognition in drivers [78]. Statistical differences among the performance of the experiments (classifiers) were determined using the Friedman test followed by the Nemenyi post hoc test at α = 0.05 for the Acc, BA, F1, F2, MCC, Sens and Spec. These tests have been used to compare the performance of time-series classification algorithms for gravitational waves [79]. The Friedman test indicates whether the ranked classifiers are significantly different amongst themselves, while the Nemenyi test applies pairwise comparisons to the ranked classifiers [74,80]. The statistical tests were applied using the SciPy and scikit-posthocs libraries. The accuracies obtained by experiment at training, validation and test are summarised in Table 5.
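The statistical pipeline described above can be sketched as follows: bootstrap the 10 hold-out scores per classifier, then run the Friedman omnibus test across classifiers on paired scores. The Nemenyi post hoc step (e.g., scikit-posthocs' `posthoc_nemenyi_friedman`) is noted in a comment but omitted to keep the sketch self-contained.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def bootstrap_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the mean of a small sample (e.g., 10 repetitions)."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    resamples = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    means = resamples.mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

def friedman_omnibus(*paired_scores):
    """Friedman test over k classifiers measured on the same resamples.
    A small p-value justifies a Nemenyi post hoc pairwise comparison
    (available as scikit_posthocs.posthoc_nemenyi_friedman)."""
    return friedmanchisquare(*paired_scores)
```

Because the bootstrap resamples are paired across classifiers (same resample index for each), the Friedman ranks are computed within each resample, as the test requires.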
At the architecture level, CaiT models showed the highest standard deviation in all three stages, in particular Exp-04 at training and test. It should be noted that, in all phases and for all models, the average accuracy was greater than 98.0%. Table 6 summarises the performance metrics during the test phase by network, prior to bootstrapping. The average performance of each network across all evaluation metrics was in the following descending order: ResNet-50, ResNet-50r, DenseNet-121, MobileNet-v3 and CaiT (Figure 4). The five networks reached an average performance of over 98% for the seven evaluation metrics, except for CaiT, which showed an average MCC of 97.64%. ResNet-50 achieved values over 99.0% in six of the seven evaluation metrics, followed by ResNet-50r, which reached the highest F1 and F2 scores and the same average performance as ResNet-50 in Sens and Spec. It was observed that the BA, F1, F2 and Sens presented values very close to each other when evaluating the performance of the architectures (Table 6). The MCC suggests that there is a high correlation between the predictions and their real classes, and it reflects the impact of the class imbalance [81]. In addition, from the metrics by experiment it was noted that Exp-18 and Exp-20 obtained the highest values of all experiments for all metrics, followed by Exp-05 and Exp-15 (Table 7). Exp-03 and Exp-04 showed the largest standard deviations in three and four of the evaluation metrics, respectively. Tables 8–11 summarise the medians, rankings and confidence intervals with α = 0.05 of the architectures and experiments by metric. The rank of the networks from the best to the poorest performance was as follows: ResNet-50, ResNet-50r, DenseNet-121, MobileNet-v3 and CaiT (Table 8). ResNet-50 models outperformed the other architectures in five of the seven evaluation metrics (Acc, BA, MCC, Sens and Spec); ResNet-50r surpassed ResNet-50 in the F1 and F2 scores.
It also obtained the second-best rank for the rest of the metrics. Although the medians obtained for each metric by network were almost identical to their respective means prior to bootstrapping, the confidence intervals were tighter (Tables 6–11). In general, all networks showed wider CIs for specificity than for sensitivity (Table 9). The MCC showed the lowest confidence bounds for all networks. The CaiT confidence intervals by architecture were the lowest, ranging from 97.36% to 97.88%. For models based on ResNet-50 and ResNet-50r, the performance lower bounds were greater than 99.0% in five of the seven metrics and over 98.0% for the MCC and accuracy. The ranking of the bootstrap samples by experiment for each metric is shown in Table 11 (note, in particular, Exp-03 and Exp-04).

Maximum Training Epochs
The training of the networks has a critical effect on the evaluation of the models on unseen data (generalisation performance). To gain insight into the number of training cycles required for each classifier to achieve its best validation accuracy, the epoch at which each experiment obtained the highest validation accuracy, together with the accuracy achieved, was recorded. Following this, 1000 bootstrap samples of the maximum validation accuracy and the number of epochs required to reach it were obtained simultaneously. Tables 12 and 13 provide the ranking, the median of the maximum accuracy and the number of epochs required to reach it during validation, by architecture and experiment. Figures 5 and 6 show the medians and distributions of the data for the maximum accuracy by architecture and experiment. The best ranks for the number of epochs were obtained by models based on ResNet-50 and DenseNet-121, which required six and seven training epochs, respectively. Conversely, ResNet-50r and CaiT mostly required 8 training epochs to reach their best performance. The information in Tables 1 and 12 provides a wider overview of how the GPU/CPU/memory resources were utilised and their impact on the training and validation process. The operation that required the most memory for CaiT was the matrix multiplication for attention. A large reduction in ATTE (over 56%) was observed for CaiT when training on Colab Pro+; this reduction was not as apparent for the other architectures. Nevertheless, CaiT models required more resources and longer training times than all other networks, in spite of having fewer parameters than the ResNet-50 architecture. Although ResNet-50 and ResNet-50r have more parameters than the other three architectures, the only architecture that trained faster than these two DL networks was MobileNet-v3.
However, with a median of 6 epochs, ResNet-50 outperformed not only MobileNet-v3 but all other networks in training time. By experiment, Exp-14 and Exp-15 achieved the best ranking for the number of training epochs, with 6, followed by Exp-05.
ResNet-50 and ResNet-50r reached the top ranks for the maximum accuracy during validation, while the top ranks by experiment were obtained by Exp-16 to Exp-18; Exp-17 obtained the highest rank for the maximum validation accuracy. At their best validation accuracy, all networks achieved lower accuracy bounds greater than 98.73%.
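The maximum-accuracy/epoch analysis above can be sketched as follows (the validation curves are illustrative; the paired bootstrap keeps each run's accuracy and epoch together, matching the simultaneous resampling described):

```python
import numpy as np

def best_validation(val_curves):
    """val_curves: (n_runs, n_epochs) validation accuracies per epoch.
    Returns each run's best accuracy and the (1-based) epoch at which
    it was reached."""
    curves = np.asarray(val_curves, dtype=float)
    best_epoch = curves.argmax(axis=1) + 1
    return curves.max(axis=1), best_epoch

def paired_bootstrap_medians(acc, epoch, n_boot=1000, seed=0):
    """Resample runs with replacement, keeping (accuracy, epoch) pairs."""
    rng = np.random.default_rng(seed)
    acc, epoch = np.asarray(acc), np.asarray(epoch)
    idx = rng.integers(0, acc.size, size=(n_boot, acc.size))
    return np.median(acc[idx], axis=1), np.median(epoch[idx], axis=1)
```

Resampling run indices (rather than accuracies and epochs independently) preserves the within-run association between the two quantities, which is what makes the joint bootstrap meaningful.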

Non-Parametric Ranks Comparisons
The results of the Friedman test suggested significant differences among the average ranks of the networks and experiments (p-value < 0.001). Therefore, pairwise Friedman–Nemenyi multiple comparisons by network and experiment were performed for all performance metrics. Figure 7 shows that there was no significant difference in the performance of ResNet-50 and ResNet-50r models for the MCC, Sens and Spec. The remaining architectures performed significantly differently from each other for all metrics (p-value < 0.01). For the maximum accuracy and the number of epochs, all architectures performed significantly differently (p-value < 0.01), which suggests that ResNet-50 networks train faster than the other networks (Table 12). Figure 8 shows the Nemenyi test results by experiment for all the performance metrics. Exp-18 and Exp-20 obtained the highest ranks for all metrics (Table 10); these two experiments showed no significant difference in the F1, F2, MCC, Sens and Spec. Likewise, the two experiments showed no statistical difference for the maximum validation accuracy. The Nemenyi test also determined that there was no significant difference in the number of training epochs required by Exp-14 and Exp-15 to achieve the maximum validation accuracy (top ranks for the number of epochs in Table 13).
In addition, Exp-05 showed no significant difference from Exp-14 and Exp-15 for the F1 and MCC. It can be noted that Exp-19, the lowest-performing model based on ResNet-50r, had predictive power similar to that of Exp-02 for the BA, F1, Sens and Spec.
The post hoc tests confirm with 95.0% confidence that, in general, ResNet-50-based models have significantly higher performance in the classification of COVID-19 than the other architectures. The ResNet-50 architecture not only outperformed the other networks in all metrics but also made more effective use of resources by requiring less training time. ResNet-50r models optimised with AdamW outperformed all model configurations in all metrics; however, they required longer training times, which may be due to the kernel size generating more convolution operations and thereby increasing the number of trainable parameters (Table 1).
Kernels are small learnable filters that convolve across an image, producing a feature map [82]. It can be observed that the resized kernel of ResNet-50r had a positive effect on experiments Exp-18 and Exp-20, which are the counterparts of Exp-14 and Exp-16. It is possible to attribute this to the smaller kernel being able to generate feature maps at greater detail for images that might have only small or occluded areas of infection. ResNet-50 is pretrained on the ImageNet dataset, which has 1000 object classes of regular size. Whilst identifying a large number of classes is a challenge in itself, in the medical field the ability to identify small areas of concern is critical for diagnosis. The scope of this work was limited to evaluating the performance of the models by network and experiment after eight training epochs. Nevertheless, from the study of the maximum validation accuracy and the number of epochs by experiment, the post hoc test shows that there is potential for improvement in the performance of DenseNet-121, MobileNet-v3, ResNet-50 and ResNet-50r: the performance of Exp-06, Exp-12 and Exp-17 could be evaluated after seven training epochs, while the performance of Exp-16 could be measured after six.

Limitations of the Present Work
The methodology described in this paper has several limitations. First, the number of deep learning architectures compared was limited. This was a deliberate choice, as the main objective was not a thorough comparison of all possible architectures, but rather to present a methodology through which architectures can be compared with non-parametric statistical tests. Still, a variety of architectures, including some of the most recent at the time of writing, was selected. Second, the data considered for this work were 2D images and not the 3D datasets that can be obtained directly from CT scanners; the authors did not have access to such datasets. Third, this work considered classification of the images but did not extend to segmentation [34], localisation [83], or assessment of the severity or evolution of the disease [37]; as previously mentioned, one objective was to present a methodology for comparison. Finally, this work is not ready to be deployed in a clinical setting. It is hoped that the methodology described here will help the comparison of future works, and that when a methodology is to be deployed clinically, a thorough and fair comparison such as the one suggested here will be performed.

Conclusions
In this work, public datasets of chest CT scans were collated and analysed with five AI techniques capable of distinguishing between positive cases of COVID-19, community-acquired pneumonia and healthy individuals. All the deep learning models were trained and their performance was evaluated with different metrics: accuracy, balanced accuracy, F1 and F2 scores, MCC, sensitivity and specificity. Non-parametric statistics were applied, starting with bootstrapping to obtain confidence intervals, followed by the comparison of the models using the Friedman test and the Nemenyi pairwise post hoc test. It can be concluded with statistical confidence that the ResNet-50 architectures are robust for classifying COVID-19 in a multi-class set-up. ResNet-50 models achieved performances over 98% in all metrics and outperformed MobileNet-v3, DenseNet-121 and CaiT. In particular, ResNet-50r, a modified version of ResNet-50, was shown to be the best classifier when optimising either CE or wCE (Exp-18 and Exp-20) with AdamW. In these conditions, confidence intervals of 99.24% to 99.41%, 99.23% to 99.41%, and 98.48% to 98.86% were obtained for the BA, F1 and MCC, respectively. Whilst the metrics of most experiments were high, the rankings after thousands of bootstrap repetitions were more discriminatory and placed ResNet-50r with AdamW in the top place; the CaiT architectures, on the other hand, had the lowest rankings. One important observation is that the results suggest that less complex architectures can outperform more complex network architectures in the detection of COVID-19 in a multi-class setup. In general, ResNet-50 was more robust to changes, achieving the top ranks in all metrics. With the exception of Exp-05, Exp-06 and Exp-19, it was observed from Table 10 that Exp-13 to Exp-20 ranked better than experiments Exp-01 to Exp-12 (i.e., those not using ResNet).
This study did not aim to provide causal inference about why the ResNet-50 and ResNet-50r networks achieved better results; however, it can be assumed that there was a positive interaction between the hyper-parameter selection and the experimental setup.

Data Availability Statement: Datasets are available at: https://www.kaggle.com/datasets/maedemaftouni/large-covid19-ct-slice-dataset (accessed on 3 August 2022) and https://data.mendeley.com/datasets/3y55vgckg6/2 (accessed on 3 August 2022).

Conflicts of Interest:
The authors declare no conflict of interest.