Deep Learning for COVID-19 Diagnosis from CT Images

Featured Application: The investigation proposed in this work is a preliminary step towards the realisation of an automated COVID-19 detection system from CT images. In this medical task, it can be crucial in aiding clinicians to obtain valuable information from CT images for their classification or general information about the status of the disease.
Abstract: COVID-19, an infectious coronavirus disease, caused a pandemic with countless deaths. From the outset, clinical institutes have explored computed tomography as an effective and complementary screening tool alongside the reverse transcriptase-polymerase chain reaction. Deep learning techniques have shown promising results in similar medical tasks and, hence, may provide solutions to COVID-19 based on medical images of patients. We aim to contribute to the research in this field by: (i) Comparing different architectures on a public and extended reference dataset to find the most suitable; (ii) Proposing a patient-oriented investigation of the best performing networks; and (iii) Evaluating their robustness in a real-world scenario, represented by cross-dataset experiments. We exploited ten well-known convolutional neural networks on two public datasets. The results show that, on the reference dataset, the most suitable architecture is VGG19, which (i) achieved 98.87% accuracy in the network comparison and (ii) obtained 95.91% accuracy on the patient status classification, even though it misclassifies some patients that other networks classify correctly. In addition, (iii) the cross-dataset experiments, with 70.15% accuracy, exhibit the limitations of deep learning approaches in a real-world scenario, which need further investigation to improve robustness. Thus, the VGG19 architecture showed promising performance in the classification of COVID-19 cases. Nonetheless, this architecture leaves room for extensive improvements through its modification or the addition of a preprocessing step. Finally, the cross-dataset experiments exposed the critical weakness of classifying images from heterogeneous data sources, compatible with a real-world scenario.


Introduction
COVID-19 is a disease caused by the SARS-CoV-2 virus, declared a pandemic by the World Health Organisation on 11 March 2020. At the time of writing, COVID-19 has more than one hundred and eighty million confirmed cases and has caused more than three million deaths, with a mortality rate of 2.1% [1]. As hospitals have been shown to have limited availability of adequate equipment, a rapid diagnosis was, and still is, essential to control the spread of the disease and to increase the effectiveness of medical treatment and, consequently, the chances of survival without intensive care. The reverse transcriptase-polymerase chain reaction (RT-PCR) method is the primary screening tool for COVID-19, in which SARS-CoV-2 ribonucleic acid (RNA) is detected within an upper respiratory tract sputum sample [2]. However, many countries are unable to provide sufficient testing; in any case, only people with apparent symptoms are tested, and it takes hours to provide an accurate result. Therefore, there is a need for faster and more reliable screening techniques, such as imaging-based methods, that could further confirm the RT-PCR test or replace it entirely. They may complement its use to achieve greater diagnostic certainty or even substitute for it in countries where RT-PCR is not readily available. In some cases, chest X-ray (CXR) abnormalities are seen in patients who initially had a negative RT-PCR test, and several studies have shown that chest computed tomography (CT) has greater sensitivity for COVID-19 than RT-PCR and could be considered a primary tool for diagnosis [3][4][5][6]. In response to the pandemic, researchers have rushed to develop models using artificial intelligence (AI), particularly machine learning, to support clinicians [7].
Computed tomography is already a widely explored medical imaging technique that allows non-invasive visualisation of the interior of an object [8][9][10][11][12][13] and is widely used in many applications, such as medical imaging for clinical purposes [14][15][16][17][18]. For this reason, clinical institutions have used CT as an effective and complementary screening tool alongside RT-PCR [5,6], with a higher sensitivity of up to 98% compared to 71% for RT-PCR [19,20]. In particular, several studies have shown that CT has excellent utility in detecting COVID-19 infections during routine CT examinations for reasons unrelated to COVID-19, such as monitoring of elective surgical procedures and neurological examinations [21]. Other scenarios where CT imaging has been exploited include cases where patients have worsening respiratory complications and cases where patients with negative RT-PCR test results are suspected to be COVID-19 positive due to other factors. Early studies have shown that chest CT images of patients may contain some potential indicators of COVID-19 infection [2,5,6,22], but these indicators may also be present in non-COVID-19 infections. This issue can make it challenging for radiologists to distinguish COVID-19 infections from non-COVID-19 infections using chest CT [23,24]. Moreover, the duration of diagnosis is the main limitation of CT scan tests: even experienced radiologists need about 21.5 min to analyse the test results of each case [25], and during the emergency, a large number of CT images have to be evaluated in a very short time, thus increasing the probability of misclassification. For this reason, intelligent diagnosis systems that automatically classify chest CT images can help to improve speed and to rapidly confirm the test result.
In recent years, deep learning workflows have emerged from the proposed AlexNet convolutional neural network (CNN) in 2012 [26]. CNNs do not follow the typical workflow of image analysis because they can extract features independently without the need for feature descriptors or specific feature extraction techniques. Therefore, they differ from conventional machine learning methods because they require little or no image preprocessing and can automatically infer an optimal data representation from raw images without requiring prior feature selection, resulting in a more objective and less biased process. Furthermore, they achieved optimal results in many domains, such as computer vision devoted to medical analysis, with images coming from magnetic resonance imaging (MRI) [27], microscopy [28], CT [29], ultrasound [30], X-ray [31], and mammography [32]. They have been successfully applied to various different problems, like classification or segmentation [33][34][35][36]. Deep learning-based methods have also made significant progress in the analysis of lung diseases, which is a comparable scenario to COVID-19 [37][38][39]. However, the scenario of CT images of lungs referred to COVID-19 and non-COVID-19 patients can be particularly problematic to classify, especially when the damage due to pneumonia of different causes is present simultaneously. The main findings of chest CT scans of COVID-19 positive patients indicate traces of ground-glass opacity (GGO) [40]. Two CT scans of COVID-19 and non-COVID-19 are shown in Figure 1. The overall objective of this study is to investigate the behaviour of the main existing off-the-shelf CNNs for the classification of patients' CTs. 
This work is a preliminary investigation for the future development of a tool that confirms the viral test result or provides more details about the ongoing infection, also considering that, according to the Centers for Disease Control and Prevention (CDC), even if a chest CT or X-ray suggests COVID-19, the viral test is the only specific method for diagnosis [42]. Specifically, we propose a comprehensive investigation of the problem of COVID-19 classification from chest CT images from different perspectives:

1. We present a comparative study of several off-the-shelf CNN architectures in order to select a suitable deep learning model to perform a three-class classification on the public COVIDx CT-2A dataset, specifically divided into COVID-19, pneumonia and healthy cases;
2. On the same dataset, we performed a patient-oriented experiment by grouping all the CT images of the patients, in which the aim was to provide a diagnosis;
3. We investigated the robustness of the methods by performing two cross-dataset experiments and evaluating the performance of CNNs previously trained on COVIDx CT-2A. In particular, we performed a two-class classification between COVID-19 and healthy cases, on the COVID-CT dataset, without fine-tuning;
4. We repeated the experiment just described, by fine-tuning the most promising CNNs, demonstrating that it is still problematic to integrate automatic methods in the clinical diagnosis of COVID-19.
We both demonstrate how off-the-shelf deep learning architectures can be utilised to classify CT images representing COVID-19 affected patients and how transfer learning capabilities are still far from offering a concrete contribution in a real-world scenario, as demonstrated by our cross-dataset experiments, without addressing it with different techniques. The experiments are not intended to provide an exhaustive comparison of the performance of these methods; instead, we wanted to select the most suitable for our classification of CT images without, for the time being, investigating possible parametric improvements. The purpose is to create a concrete baseline with the potential to be modified and developed further. Moreover, several works in the context of COVID-19 diagnostics have considered small or private datasets or lacked rigorous experimental methods, potentially leading to over-fitting and overestimation of performance [7,43]. For this reason, we:

1. Carefully selected the two datasets on which to conduct the experiments described. In fact, Roberts et al. [7] have recently shown that most of the datasets used in the literature for the diagnosis or prognosis of COVID-19 suffer from duplication and quality problems;
2. Selected COVIDx CT-2A, a public reference dataset specifically proposed for COVID-19 detection from CT imaging, because of the high risks of bias due to source problems and datasets created from unsupervised public online repositories. It has already been provided with train, validation, and testing splits.
We verified the robustness of the solution on both the public COVIDx CT-2A and COVID-CT datasets. Our proposed approach achieves promising results on COVID-19 identification, although it does not show satisfactory performance on cross-dataset experiments.
The rest of the article is organised as follows. The following paragraph presents a review of deep learning approaches for COVID-19 detection. Section 2 describes the datasets used in our experiments and presents the metrics adopted to evaluate the experimental results illustrated in Section 3. In Section 4, we analyse and discuss the experimental results and give a comparison with the state of the art. Finally, conclusions and future directions are drawn in Section 5.
Among the CT-based methods, Jin et al. [48] proposed a deep learning-based system for COVID-19 diagnosis, performing lung segmentation, COVID-19 diagnosis, and COVID-infectious slices location. In contrast, Hu et al. [51] proposed a weakly supervised multiscale deep learning framework for COVID-19 detection, inspired by the VGG architecture [57], which assimilates different scales of lesion information using CT data of the chest. Polsinelli et al. [52] implemented a lightweight CNN, based on the SqueezeNet model [58] for efficient discrimination of COVID-19 CT images against other community-acquired pneumonia or healthy CT images. Biswas et al. [53] used a transfer learning strategy on the three pretrained models of VGG-16 [57], ResNet50 [59], and Xception [60], combining them with the ensemble stacking strategy and tested the method on CT images of the chest. Zhao et al. [55] adopted the ResNet-v2, a modified version of ResNet [59]. Moreover, they added group normalisation instead of batch normalisation and conducted a weight standardisation for all convolutional layers. Lastly, they also incorporated the pre-training data from CIFAR-10 [61], ILSVRC-2012 [62], and ImageNet-21k [63] as the parameters for initialisation.
On the subject of CXR-based works, Minaee et al. [50] employed four pretrained models (ResNet18 [59], ResNet50 [59], SqueezeNet [58] and DenseNet-121 [64]) on CXR data, and analysed their performance for COVID-19 detection. On the other hand, Signoroni et al. [43] developed BS-Net, a multi-block deep learning-based architecture designed for the assessment of pneumonia severity on CXRs. More recently, Oyelade et al. [56] proposed CovFrameNet, a novel deep learning-based framework based on a substantial image pre-processing step and a CNN architecture for detecting the presence of COVID-19 on CXRs.
Thanks to the powerful discriminative ability of CNNs, several authors tried to propose CNN-based frameworks for the diagnosis or prognosis of COVID-19, even though CNNs typically require large scale datasets to perform a correct classification. However, most of the existing CT scan datasets for COVID-19 contain at most hundreds of CT images [65][66][67]. Therefore, we exploited COVIDx CT-2A [68], composed of 194,922 CT images, as described in Section 2.1.1, to propose a baseline classification approach, and we evaluated it on the external dataset COVID-CT, described in Section 2.1.2, to assess the generalisability of the proposal. In general, we aim to avoid the following drawbacks:
1. Using small scale datasets;
2. Using non-robust or multiple unsupervised source datasets;
3. Testing the method without external validation.
Regarding the works that employed the datasets used in our study, Zhao et al. [41] worked on COVID-CT, while Gunraj et al. [54] on COVIDx CT-2A. The former is based on a transfer learning approach on the DenseNet network, while the latter proposed COVID-Net CT [54], a deep convolutional neural network tailored for detection of COVID-19 cases from chest CT images.
This work differs from those described above because: We propose an extensive comparison between different off-the-shelf CNN architectures, in order to obtain the most suitable for the task, using a large and public dataset; We avoid the high risks of errors due to datasets created from unsupervised online public repositories, using two public reference datasets, to try to validate our approach; We introduce a preliminary solution based on learning by sampling, showing how CNNs need further improvements to generalise the detection of COVID-19 in heterogeneous datasets.

Materials and Methods
In this work, we exploited two publicly available datasets, as described in Section 2.1. Then, in Section 2.2 we give a detailed description of the metrics adopted to evaluate the experimental results.

Datasets
The datasets exploited in this work are COVIDx CT-2A and COVID-CT, both of which are publicly available. We describe them as follows.

COVIDx CT-2A
COVIDx CT-2A [68] is an open-access dataset. At the time of writing, it is composed of 194,922 CT images from 3745 patients from 15 different countries, between 0 and 93 years old (median age of 51), with strongly clinically verified findings. Every image belongs to a particular class verified by expert pathologists. In particular, the classes are COVID-19, indicating CT images of COVID-19 positive patients, pneumonia indicating CT images of patients with pneumonia not caused by COVID-19, and normal, indicating CT images of patients in normal conditions.
The countries involved are part of a multinational cohort that consists of patient cases collected by the following organisations and initiatives from around the world: Radiopaedia collection [74] (Iran, Italy, Australia, Afghanistan, Scotland, Lebanon, England, Algeria, Peru, Azerbaijan, some countries unknown).

Metrics
The performance measures have been evaluated by averaging five different simulations for all the networks. The measures used to quantify the performance of a network are accuracy (Acc), precision (Pre), specificity (Spec), recall (Rec), and F1-score (F1), defined as follows. Precision measures the number of correctly labelled items belonging to the positive class divided by the number of items correctly or incorrectly labelled as belonging to that class. Specificity measures the proportion of correctly identified negatives (also called the true negative rate), while recall, or sensitivity, measures the proportion of correctly identified positives (also called the true positive rate). Accuracy is defined as the ratio of correctly labelled instances to the entire pool of instances. Finally, the F1-score conveys the balance between precision and recall.
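For reference, the standard forms of these five measures are written out below for a single class treated as positive. The notation (TP, TN, FP, and FN for true positives, true negatives, false positives, and false negatives) is an assumption introduced here, since the original equations are not reproduced in this text.

```latex
\begin{align}
\mathrm{Pre}  &= \frac{TP}{TP + FP}, &
\mathrm{Rec}  &= \frac{TP}{TP + FN}, \\
\mathrm{Spec} &= \frac{TN}{TN + FP}, &
\mathrm{Acc}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\mathrm{F1}   &= \frac{2 \cdot \mathrm{Pre} \cdot \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}.
\end{align}
```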
Furthermore, since we are facing a multiclass imbalance problem, we adopted two global metrics for multiclass imbalance learning to evaluate the performance of the networks [75]. They are the macro geometric average (MAvG), defined as the geometric mean of the partial accuracy of each class, and the macro arithmetic average (MAvA), defined as the arithmetic mean of the partial accuracies of each class.
We describe them as follows:
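A minimal sketch of the two global metrics is given here. It assumes that the partial accuracy of a class is the fraction of that class's images that are classified correctly, computed from a confusion matrix; the function names and the example matrix are hypothetical and are not results from this study.

```python
import math


def per_class_accuracy(confusion):
    """Return the partial accuracy of each class.

    confusion[i][j] is the number of samples of true class i predicted as
    class j (assumed layout). The partial accuracy of class i is taken to be
    the share of its samples that were predicted correctly.
    """
    accuracies = []
    for i, row in enumerate(confusion):
        total = sum(row)
        accuracies.append(row[i] / total if total else 0.0)
    return accuracies


def mavg(accuracies):
    """Macro geometric average: geometric mean of the partial accuracies."""
    return math.prod(accuracies) ** (1.0 / len(accuracies))


def mava(accuracies):
    """Macro arithmetic average: arithmetic mean of the partial accuracies."""
    return sum(accuracies) / len(accuracies)


# Hypothetical three-class confusion matrix (rows: true class, columns: predicted).
confusion = [
    [90, 5, 5],
    [0, 100, 0],
    [10, 10, 80],
]
accuracies = per_class_accuracy(confusion)
print(accuracies, mavg(accuracies), mava(accuracies))
```

Because the geometric mean is pulled down by the weakest class, MAvG penalises a network that does well on two classes but poorly on the third more strongly than MAvA does.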

Results
We now describe the experimentation conducted in this work. In detail, in Section 3.1 we first describe the experimental setup adopted for the classification tasks. Then, in Section 3.2 we report the results of the experiments performed on both datasets.

Experimental Setup
The images to be classified are lung CTs. They are organised into classes, as described below. Considering this work as a baseline for further investigation, the images are not subject to any preprocessing or augmentation process. In order to make the experiments reproducible, we kept the dataset splits provided by the authors and did not apply any randomisation strategy. We employed two different training strategies: Fine-tuning the previously trained networks.
The experiments were performed using the hyperparameters setting described in Table 1 for all networks to assess potential performance variations. In particular, after empirical evaluation, we adopted Adam, which performed better than the other solvers. In addition, the maximum number of epochs was set to 20 due to a large number of images. Since COVIDx CT-2A is the largest dataset, we employed it for model training. Its images were divided by the authors according to the following percentages: 70%, 20%, and 10% for training, validation, and testing, respectively. As for the COVID-CT dataset, we used it in two ways: the first time, it was taken as a whole as a test set, while the second time, it was divided in the same way as COVIDx CT-2A to be used for a fine-tuning strategy.
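The 70%/20%/10% partition mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the use of listing order without shuffling, and the integer rounding are all hypothetical choices, and the text does not state whether the split was made per image or per patient.

```python
def split_by_percent(items, train_pct=70, val_pct=20):
    """Split items, in their given order, into train/validation/test parts.

    The test part receives whatever remains after the training and
    validation parts are taken. Integer arithmetic avoids rounding surprises.
    """
    if train_pct < 0 or val_pct < 0 or train_pct + val_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    items = list(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Hypothetical file names standing in for CT images.
images = [f"img_{i:03d}.png" for i in range(50)]
train, val, test = split_by_percent(images)
print(len(train), len(val), len(test))
```

No random shuffling is applied in this sketch, in line with the statement that no randomisation strategy was used.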

CT Image Classification via Deep Learning
Several types of experiments were designed in this work in order to assess the feasibility of the deep learning approach and its robustness. In particular, on the COVIDx CT-2A dataset, we performed:

1. A three-class classification as reported in Section 3.2.1;
2. A patient-oriented classification, described in Section 3.2.2.
On the other hand, on the COVID-CT dataset, we realised:

1. A two-class classification using the four best-performing networks from the previous experiments on the entire dataset;
2. A two-class classification using the same four networks, fine-tuning them on this dataset.
Both are reported in Section 3.2.3.

Three-Class Classification on COVIDx CT-2A
In this experiment, we trained each network used in this work, using the split proposed by the authors, in order to obtain a baseline result. Table 2 shows the results obtained with each architecture employed, while Figure 4 shows the relationship between the MAvG metric and the three classes included in the dataset.

Patient-Oriented Classification on COVIDx CT-2A
For this experiment, the models obtained from the experiments described in Section 3.2.1 were used. We proceeded as follows: the test set (consisting of 25,658 images) was used, according to the subdivision provided by the authors, to have one set of images for each of the 426 different patients in the test set. We ensured that each patient only had images belonging to one class because otherwise, this would invalidate the test. Once the images of each patient had been examined, the model would produce results similar to those of the ternary classification. With this in mind, it was decided to use class accuracy as a critical metric: if it was above 50%, the patient belonged to the class. Otherwise, it would be classified as incorrect. In this way, it was possible to see how each model behaved with each class, and finally, average accuracy was calculated to describe the level of accuracy of the network, as shown in Table 3.
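The patient-level decision rule described above can be sketched as follows, assuming that each image-level prediction is available as a (patient, true class, predicted class) triple. The function names, the record layout, and the example data are hypothetical, and averaging over classes to obtain the final figure is one possible reading of the text.

```python
from collections import Counter, defaultdict


def classify_patients(records, threshold=0.5):
    """Map each patient to (true_label, is_correct).

    records: iterable of (patient_id, true_label, predicted_label), one per image.
    A patient counts as correctly classified when the fraction of their images
    predicted as the true class is strictly above the threshold.
    """
    by_patient = defaultdict(list)
    for patient_id, true_label, predicted_label in records:
        by_patient[patient_id].append((true_label, predicted_label))

    outcome = {}
    for patient_id, items in by_patient.items():
        true_labels = {true for true, _ in items}
        if len(true_labels) != 1:
            # Each patient must have images from exactly one class.
            raise ValueError(f"patient {patient_id} has images from several classes")
        true_label = true_labels.pop()
        correct = sum(1 for true, pred in items if pred == true)
        outcome[patient_id] = (true_label, correct / len(items) > threshold)
    return outcome


def patient_accuracy_per_class(outcome):
    """Fraction of correctly classified patients within each true class."""
    totals, hits = Counter(), Counter()
    for true_label, is_correct in outcome.values():
        totals[true_label] += 1
        hits[true_label] += int(is_correct)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical image-level predictions for three patients.
records = [
    ("p1", "COVID-19", "COVID-19"),
    ("p1", "COVID-19", "COVID-19"),
    ("p1", "COVID-19", "Pneumonia"),
    ("p2", "COVID-19", "Pneumonia"),
    ("p2", "COVID-19", "COVID-19"),
    ("p3", "Normal", "Normal"),
    ("p3", "Normal", "Normal"),
]
outcome = classify_patients(records)
per_class = patient_accuracy_per_class(outcome)
average_accuracy = sum(per_class.values()) / len(per_class)
print(outcome, per_class, average_accuracy)
```

In this example, patient p2 has exactly half of the images classified correctly, which is not above 50%, so the patient is counted as misclassified.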

Two-Class Classification on COVID-CT
For this experiment, we proceeded in two steps: initially, we used the entire COVID-CT dataset as a test dataset for the top four models obtained from the experiments in Section 3.2.1. Subsequently, we fine-tuned the same models. In particular, we chose VGG19, given its results in both previous experiments; MobileNetV2, one of the most lightweight networks, with good results in classifying patients of the normal and pneumonia classes; and, finally, VGG16 and ResNet18, the two networks with the best results after VGG19. For fine-tuning, the dataset was divided into training, validation, and testing sets, according to the percentages provided by the authors. Table 4 shows the results on the whole dataset, while Table 5 shows the results after the fine-tuning strategy.

Discussion
In this section, we give a detailed discussion of the results obtained, shown in Section 3.

On the Three-Class Classification on COVIDx CT-2A
As Table 2 shows, it can be seen that GoogLeNet, still the worst performing network, scored consistently below 90%, except for specificity and accuracy. Even ResNet50, which reports quite different values, drops to 89.53% in recall, indicating that the network had some difficulty in accurately distinguishing true positives. As for the rest of the models, they all show consistent results, as was the case during the validation phase. VGG19 presents exceptional results, reaching 99.08% in specificity, while the other metrics exceed 97%. VGG16 and ResNet18 obtained similar results. In general, it can be seen that the network that produced the best results was VGG19, having consistently high values in every metric. It is followed by VGG16 and ResNet18, with excellent results narrowly below those reached by VGG19. The fine-tuning strategy, which will be carried out to prepare the models for the classification on the COVID-CT dataset, will concern these three models, plus MobileNet V2, to deduce possible adaptability of specific networks in the mobile environment for this task.
Some specific insights come from Figure 4, which shows the relationship between the MAvG metric and the class accuracies for each classifier used. In particular, it can be seen that in addition to having by far the highest accuracy, VGG19 is also the only one to have a uniform and acceptable score for all three classes. Concerning the problem of unbalanced and multiclass classification, it is fundamental to underline that all the networks manage to achieve high accuracies essentially thanks to the scores obtained on the normal and pneumonia classes. For example, ResNet50 obtained 99.21% accuracy on the normal class and only 66.08% on the COVID-19 class, being the best and the worst, respectively. Furthermore, MobileNetV2 obtained 100.00% accuracy on the pneumonia class, but only 77.78% on the COVID-19 class, being the best and the fourth-worst, respectively.

On the Patient-Oriented Classification on COVIDx CT-2A
The misclassification of CT images of patients belonging to the COVID-19 class affects every network, in some cases in a more pronounced form and in others less so. A deeper analysis of the results showed that many of the misclassified patients were classified as belonging to the pneumonia class. This issue is likely due to the similarity of the images, as both classes represent pneumonia, although of different origins.
On the other hand, in the case of AlexNet, for the normal class, a relatively high result was obtained, a sign that few patients were misclassified; a similar situation for the pneumonia class where some patients were classified as belonging to both the COVID-19 and the normal class. In general, the model obtained a balanced average accuracy.
Additionally, GoogLeNet's results obtained for patients of the class COVID-19 are much lower than AlexNet, and even if the accuracy of the other classes is average, we can see how it affects the average accuracy of the network.
As for InceptionV3, the accuracy of the COVID-19 class is higher than that of GoogLeNet but remains low when compared to the results of the other classes. On the other hand, the average accuracy of the network is positive.
VGG16 achieves 100% in the classification of patients in the class pneumonia, the highest result so far and the optimal one. The class normal also obtained a very high result, while the class COVID-19, although obtaining outstanding results, had some minor difficulties in classification. As for VGG19, although it did not reach 100% in the pneumonia class compared to VGG16, this model achieves high and uniform results: 95.91% in the COVID-19 class is the highest achieved so far. The other classes also produced satisfactory results, and the average accuracy of the network is 97.31%.
ShuffleNet performs as well as the average, leading to a relatively low result for the COVID-19 class compared to the other two classes, which perform very well. However, network accuracy is still average.
MobileNetV2 has excellent performance in the pneumonia and normal classes, while the poor result obtained in the COVID-19 class affects the average accuracy of the network.
With regard to the residual networks, ResNet18 obtained high results in the normal and pneumonia classes and remained average in the COVID-19 class. The network's final accuracy is also average. Although both ResNet50 and ResNet101 achieved very high results in the pneumonia and normal classes, as ResNet18 did, those obtained in the COVID-19 class are drastically low, the lowest so far. This makes ResNet50 the network with the lowest average accuracy, caused mainly by the results obtained from the classification of patients in the COVID-19 class; ResNet101 is slightly higher, but still with unsatisfactory results.
To sum up, since VGG19 is the network with the lowest number of misclassified COVID-19 patients, we examined those patients individually to try to draw specific conclusions. VGG19 misclassified seven patients and, of these, not all were misclassified by the other networks, indicating that a hybrid approach could improve results.

On the Two-Class Classification on COVID-CT
Although VGG19 was the best in the previous tests, it did not produce the same results with this dataset, as it can be seen from Table 4. As for VGG16, it scored lower than VGG19 in the previous tests and the binary classification of COVID-CT. In this case, ResNet18 scores higher and closer to VGG19. The same applies to MobileNetV2, which achieves results just below ResNet18.
As a result of the fine-tuning, the results have improved significantly, although they remain below those obtained with the previous dataset, as shown in Table 5. These lower-than-average results may be due to the fact that the original COVID-CT dataset, dating back to early 2020, has been slowly modified over the months with the addition of new CT images of poor quality or compromised by overlaps. This may explain why the networks cannot classify the images correctly: having been trained with the high-quality images from the COVIDx CT-2A dataset, they struggle to accurately classify these new elements.
To sum up, the results of the two tables do not differ as much as one would expect from a fine-tuning strategy. Finally, it can be said that fine-tuned ResNet18 was able to outperform the other CNNs, with metric values that always hover around 70%.

Comparison with the State-of-the-Art
Table 6 shows a brief but effective comparison with the state of the art works on COVIDx CT-2A. As it can be seen, the works of Gunraj et al. [54] produced significant early results with their proposed COVID-Net CT convolutional neural network. However, recently, the work of Zhao et al. [55] outperformed state of the art, reaching 99.2% using ResNet-v2, a modified version of ResNet and several improvements, such as the group normalisation or the weight standardisation for all convolutional layers. Finally, our baseline work demonstrates that even pre-existing architectures can reach outstanding results. As reported in Table 6, we reached 98.87% accuracy with VGG-19 without any improvement, such as preprocessing, data standardisation, group normalisation, etc. This opens the field to further investigations and improvements, as detailed in the following section.

Limitations of This Work
Although interesting results have been shown, our work suffers from some limitations. First, the best-performing solution on the COVID-19 class relies entirely on the VGG19 architecture, even though other networks showed excellent results in the other two classes. Considering the properties of these networks, combining their features could improve the results, particularly by increasing the capacity to distinguish the different classes more specifically. Second, every experimental condition assumed no preprocessing step. However, in the context of proposing a complete framework in the future, preprocessing the images (e.g., with denoising) could be crucial. Third, the patient-oriented experiments confirmed the excellent results obtained by VGG19, even though it misclassified certain patients that other CNNs classified correctly. Efforts should be made in this direction in order to understand more clearly the sections of the CT scan that are discriminative in this critical scenario. Fourth, as represented by the classwise performance, the COVID-19 class is generally harder to distinguish from the others because of its structure. For this reason, handcrafted descriptors and, potentially, the combination of heterogeneous descriptors could help recognise the most challenging cases, as already shown in similar tasks [80][81][82].

Conclusions
The objective of this work was to propose a classification methodology for the diagnosis of COVID-19 through deep learning techniques applied on CT images. To achieve this goal, an extensive comparative study of the main existing CNN architectures was carried out.
The tests carried out on the two datasets showed very different results. Those obtained with the COVIDx CT-2A dataset are excellent for all the models used; in particular, VGG19 stands out for the high values obtained in specificity, precision, and recall. No other network has achieved these results. However, it is important to say that networks such as VGG16 and ResNet18 also achieved more than satisfactory results. As far as the other networks are concerned, GoogLeNet and ResNet50 seem the least suitable, as they always deviated considerably from the average values obtained. In addition, the results obtained with VGG-19 are comparable with those of the state-of-the-art networks that work on COVIDx CT-2A.
The patient-oriented classification also brought outstanding results, with high accuracy values for the class COVID-19, and, in some cases, 100% accuracy for the class pneumonia. The best network remains, in any case, VGG19, being the one with the highest average accuracy and, therefore, misclassifying a few patients compared to the other networks. Through the analysis of the misclassified patients, it was deduced that it is probably necessary to create an ad hoc network exploiting the existing CNNs to improve the results.
About the COVID-CT dataset, however, the results do not match the previous ones and, on the contrary, there was a drop in performance of almost 50%. Only fine-tuning was able to remedy this, increasing the values obtained by 20%. Nevertheless, this does not compensate for the difference in performance. The problem could be mainly due to the quality of the images of the COVID-CT dataset, which are often compromised or of very poor quality.
This work highlighted some limitations. First of all, cross-dataset experiments showed that existing CNNs, even after a fine-tuning procedure, really suffer from limited dataset scenarios. Second, patient-oriented experiments show that some networks misclassified some COVID-19 patients as normal pneumonia cases, while others did not. This clearly motivates further investigation on the models and, also, possible modifications. Third, the absence of defined standards in the acquisition of these images and, in addition, the problem of building affordable COVID-19 datasets from heterogeneous sources, especially during the early months of the pandemic [7] can be considered a limitation and also a future direction, as it clearly appears that the distinctive COVID-19 features need to be further studied.
The indications emerging from this work are that: (i) In addition to fine-tuning, some preprocessing steps oriented to the enhancement of CT images could be helpful for the networks to produce more discriminative features; and (ii) Considering the results of the patient-oriented experiments, a hybrid approach, even involving ad hoc handcrafted features, could improve the results.
In future directions, we certainly aim to discover other valuable features from CT images to recognise COVID-19, extending the investigation to include handcrafted features and even combining them with deep features. In addition, we also want to consider assessing the severity of COVID-19.
We will conduct further experiments to identify key features in CT images and facilitate screening by medical doctors. We want to stress again that this work is still at the stage of theoretical research, and the models have not been validated in real clinical routines. Our contribution is to offer a baseline with some public benchmark datasets to be extended with new investigations.
Therefore, we would like to:
1. Modify VGG19 to investigate the best accuracy density (accuracy divided by the number of parameters) and the best inference time;
2. Optimise the hyperparameters, for example with a Bayesian method;
3. Use class activation maps (CAM) to understand which parts of the image are relevant in the misclassification cases obtained by VGG19 but not by the other networks;
4. Test our system in the clinical routine and communicate with doctors to understand how such a system can be integrated into the clinical routine.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: All the codes and the data used in this study are available at the following url: GitHub repository (accessed on 12 August 2021).

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: