Transfer Learning for the Detection and Diagnosis of Types of Pneumonia including Pneumonia Induced by COVID-19 from Chest X-ray Images

Accurate early diagnosis of COVID-19 viral pneumonia, primarily in asymptomatic people, is essential to reduce the spread of the disease, the burden on healthcare capacity, and the overall death rate. It is equally important to design affordable and accessible solutions to distinguish pneumonia caused by COVID-19 from other types of pneumonia. In this work, we propose a reliable approach based on deep transfer learning that requires few computations and converges faster. Experimental results demonstrate that our proposed transfer learning framework is an effective approach for detecting and diagnosing types of pneumonia from chest X-ray images, achieving a test accuracy of 94.0%.


Introduction
COVID-19 was declared the most lethal pandemic the world has had to grapple with in recent human history [1][2][3]. The origin of the first transmission remains unknown; the first reported cases were in December 2019 in Wuhan, and the disease has led to massive loss of life and stagnation of the global economy. The virus is primarily transmitted in small fluid particles from an infected person's mouth or nostrils through coughing, sneezing, speaking, singing, or heavy breathing. These fluid particles vary significantly in size, ranging from larger respiratory droplets to smaller aerosols. Droplets cannot spread person-to-person without close contact because they are too heavy to travel more than a meter. It has been indicated that COVID-19 can remain in the air for up to 3 h, on copper surfaces for 4 h, and on plastic and stainless steel for nearly 72 h [3]. Nonetheless, identifying the exact nature of the virus remains an open problem in the medical research community. Early and accurate detection of this viral pneumonia in asymptomatic cases is vital to reducing the transmissibility of the viral infection, the burden on healthcare capacity and the overall mortality rate. Machine Learning (ML) is increasingly being integrated into healthcare systems, ranging from medical image acquisition to reconstruction, outcome analytics, and prediction. Thus, the use of ML to detect and distinguish traditional pneumonias from the pneumonia induced by COVID-19 is vital to providing an early, fast and efficient diagnostic mechanism [4]. Although mass vaccination campaigns are being carried out worldwide, coronavirus cases have been rising, largely due to emerging variants of the viral disease. The global death toll of COVID-19 has been markedly increasing. India has surpassed Brazil as the country with the second-highest number of infections and is currently experiencing a third wave of infections and COVID-19-related deaths.
Virologists around the world have been working extensively to develop COVID-19 vaccines. Several COVID-19 vaccines have been approved and are being used in many countries for COVID-19 immunization. Pfizer-BioNTech is an mRNA-based COVID-19 vaccine developed in Germany by the biotechnology company BioNTech in collaboration with the American company Pfizer. Moderna is also an mRNA-based coronavirus disease vaccine, developed by Moderna and the United States National Institute of Allergy and Infectious Diseases (NIAID) in association with the Biomedical Advanced Research and Development Authority (BARDA). Oxford-AstraZeneca belongs to the class of viral vector vaccines and was jointly developed in the United Kingdom by Oxford University and AstraZeneca. CoronaVac (the Sinovac COVID-19 vaccine) is an inactivated-virus coronavirus vaccine developed by Sinovac Biotech, a company in China. Sputnik V (Gam-COVID-Vac) is an adenovirus viral vector vaccine for coronavirus disease developed by the Gamaleya Research Institute of Epidemiology and Microbiology in Russia. Large-scale clinical trials on the safety and efficacy of these vaccines have indicated their effectiveness, with minor side effects on tested demographics [5]. Clinical trials of Sinovac in Brazil, Chile, Indonesia, the Philippines, and Turkey have shown 67% effectiveness against symptoms while reducing hospitalisations, intensive care visits and deaths by 85%, 89%, and 80%, respectively [6]. The vaccine was developed using conventional technology similar to BBIBP-CorV and BBV152, other inactivated-virus COVID-19 vaccines. The primary advantage of Sinovac is that it does not need to be frozen, unlike Moderna's vaccine, which must be stored at −20 °C; the AstraZeneca vaccine requires regular refrigerator temperatures, and Pfizer's vaccine must be stored at −70 °C [5,7].
Secondly, both the vaccine and the raw material for formulating new doses can be transported and refrigerated at temperatures of 2–8 °C, at which flu vaccines are also kept. This compounded advantage makes the vaccine suitable for developing countries with limited public health infrastructure. Sputnik V was developed by the Gamaleya Institute in Russia and is currently being used in Belarus, the United Arab Emirates, Saudi Arabia, India, and Iran [8]. However, mass vaccine adoption in many countries remains a public health logistical challenge (e.g., the logistics of manufacturing, storing and distributing the vaccine, and mass vaccination) as well as a leadership challenge. The problem is attributed to many reasons, ranging from citizens' vaccine resistance to vaccine nationalism [9][10][11]. With the development and usage of vaccines, further research is needed to address open questions such as: will the new vaccines be able to control the COVID-19 pandemic? What is the efficacy of current vaccines on the emerging variants of COVID-19 from the UK, South Africa, Brazil, Portugal, and India, identified as more contagious and lethal? What will be the long-term efficacy and side effects of current vaccines that have been researched, developed and trialled at break-neck speed on different population demographics? Given the upsurge in vaccine nationalism, will it be possible to surmount both financial and political challenges for equitable distribution of vaccines, especially to low- and middle-income countries?

Problem Statement
One of the critical steps in containing viral spread is the timely detection of positive cases in the community. Clinical laboratories have been developing, validating, and implementing various molecular and serologic assays to test for SARS-CoV-2 nucleic acid [12]. Reverse Transcriptase-Polymerase Chain Reaction (RT-PCR) is a laboratory testing method that combines reverse transcription of RNA into DNA (called complementary DNA or cDNA) with amplification of specific DNA targets using the standard Polymerase Chain Reaction (PCR). RT-PCR diagnostics have been identified as effective in detecting the SARS-CoV-2 virus. This technique, however, has inherent limitations: long delays in obtaining test results; patients with high clinical suspicion testing falsely negative on the initial RT-PCR test, often requiring multiple test runs to validate the result; and a slew of other laboratory logistical challenges [12,13]. Low test sensitivity may arise from sub-optimal clinical sampling approaches, variations in viral load, and manufacturer test kit sensitivity. In communities with a high surge in caseloads, managing these RT-PCR-negative patients is overwhelmingly cumbersome. Procedural adherence requirements in the laboratory and a multitude of testing characteristics could be attributed to these limitations [12][13][14][15]. Laboratories and virology research centres are working towards overcoming the current limitations of RT-PCR testing to enable more accurate detection of the coronavirus. According to the World Health Organization recommendations of October 2020, chest imaging examination is a useful and effective approach for detecting clinical symptoms of COVID-19 in suspected and recovered cases [16,17]. These imaging modalities include ultrasound, X-rays, MRI of the chest, computed tomography (CT) and needle biopsy of the lung.
Among these modalities, the chest X-ray is primarily used to detect coronavirus, in contrast to CT, MRI and other medical imaging modalities. CT imaging takes longer, and CT scanners are sparsely available in low-income countries. Additionally, CT imaging is costly, and it may pose health risks to pregnant women and children due to its high ionizing radiation [18]. In stark contrast, X-ray imaging has a multiplicity of use cases in many medical and epidemiological applications because it is readily available around the world [19,20]. Thus, the chest X-ray is a well-suited modality for examining and diagnosing cases due to its speed of operation, lower cost and ease of use by radiologists. However, prior research has indicated some degree of inconsistency in chest X-ray images of COVID-19 patients [21].

Objectives
This work aims to detect and classify three types of pneumonia (lung opacity pneumonia, COVID-19 pneumonia, and viral pneumonia) and distinguish these types of pneumonia from a healthy chest X-ray scan to aid safe, accurate, less cumbersome and timely diagnosis. We aim to use domain-invariant representations from a source domain, transferred onto the chest X-ray target domain, to improve model prediction performance given limited target domain sample data, without overfitting. Moreover, we provide a framework for end-to-end learning using a dataset collected from multiple locations and periods to study the transferable properties of latent representations across domains and tasks using transfer learning.

Contributions
Our main contribution in this work is a novel end-to-end Deep Transfer Learning framework using a deep convolutional neural network that detects and classifies three types of pneumonia from chest X-ray scans. This study used the public COVID-19 Radiography dataset collected from more than 20 hospitals across the world [22]. The dataset contains a total of 21,165 chest X-ray images, of which 3616 show pneumonia induced by COVID-19, 6012 show lung opacity infection, 1345 show viral pneumonia, and 10,192 are normal, uninfected images. We review the 16 most recent papers applying ML to classify the different types of pneumonia from chest X-ray images. We found that most published ML models deal with limited data covering mostly two classes (COVID-19 pneumonia and uninfected chest X-ray) or three classes (COVID-19 pneumonia, all the other types of pneumonia regrouped into a single class, and uninfected chest X-ray). Among the 16 most recent papers reviewed, we found two papers dealing with four classes, both reporting an accuracy of less than 90%. These two works are based on limited data, i.e., one used 1251 X-ray images and the other 5941 X-ray images. We propose a ResNet50 CNN architecture built to detect and classify four classes (lung opacity infection, viral pneumonia, pneumonia induced by COVID-19, and uninfected chest X-ray) with an accuracy of 94.0%, using 21,165 chest X-ray images and a well-adopted methodology for dealing with class imbalance. The convergence and generalization performance of ResNet50 is contrasted with AlexNet, VGG, and ResNet34.

Outline
This paper is organized as follows: in Section 2, the reader is introduced to the problem background and related literature; in Section 3, the research methodology is described.
In Section 4, experimental results are presented and a comparative survey of the performance of existing literature is given. Section 5 discusses the results of the paper, and conclusions and future research directions are outlined in Section 6.

Related Literature
Machine learning techniques have been used extensively in health informatics, from drug discovery to epidemiological modeling and the diagnosis of diseases from medical images [23][24][25]. The use of machine learning in healthcare analysis has been enhanced by the increased availability of datasets to the research community and by advancements in architectures for modeling scientific events. Since the outbreak of the COVID-19 pandemic, machine learning, deep learning and hybrid methods have been used to model the non-linear and complex nature of the spread of the SARS-CoV-2 virus with higher generalization capacity and predictive reliability over longer time windows [26][27][28][29].
There has been an extensive body of scholarly work on detecting COVID-19 from chest X-ray (XCR) and CT images. These methods vary in their pipelines and ML techniques, from feature preprocessing to the choice of architecture under different contexts and considerations, thus yielding different performance results. Ahammed et al. [30], in their comparative survey of ML and deep learning approaches for the detection of COVID-19 using a dataset of all publicly available chest X-ray images of COVID-19 patients, reported 94.03% accuracy, 95.52% AUC, and 94.03% sensitivity. Ng et al. [21] created a massive dataset of 131,975 XCR images and used a deep neural network model to classify the images, which resulted in an accuracy of 93.30%. Abbas et al. [31] developed a convolutional neural network (CNN) model named Decompose-Transfer and Compose (DeTraC) to classify chest X-ray images, reporting an accuracy of 93.1% and 100% sensitivity. In their study, Apostolopoulos et al. [32], using a dataset of 1427 XCR images and a transfer learning approach, reported an accuracy of 96.78%, a sensitivity of 98.66% and a specificity of 96.46%.
El-Din Hemdan et al. [33] proposed a Computer-Aided Diagnosis system (COVIDX-Net) to classify positive and negative COVID-19 cases, with reported F1-Scores of 89% and 91% for normal and COVID-19, respectively. Karar et al. [34] developed a cascaded architecture of VGG-16, ResNet-50V2 and DenseNet-169 which achieved an accuracy of 99.9%. Minaee et al. [35] used radiograms of 5000 chest X-ray images to perform transfer learning using ResNet-18, ResNet-50, SqueezeNet and DenseNet-121, with a reported best sensitivity of 98% and specificity of 90%. Heidari et al. [36] used histogram equalization and bilateral filtering techniques for preprocessing, resulting in filtered images. These features were used to train a VGG-16 network, obtaining a best accuracy of 94.5% on all image classes and 98.1% accuracy on COVID-19. Khan et al. [37] proposed CoroNet, based on the Inception architecture, and used the model to classify Normal, Pneumonia-bacterial, Pneumonia-viral and COVID-19 from chest X-ray images. Their model achieved an overall accuracy of 89.6%, with 93% precision and 98.2% recall for the COVID-19 class. In their work, Chandra et al. [38] proposed an Automatic COVID-19 Screening (ACoS) system using a two-staged majority voting scheme over an ensemble of models. They reported a validation accuracy of 98.062% in the first stage and 91.329% in the second stage. Ismael et al. [39] used pre-trained ResNet-18, ResNet-50, ResNet-101, VGG-16 and VGG-19 models for feature extraction from XCR images. Using a Support Vector Machine (SVM) classifier with different kernels, they obtained an accuracy of 94.7% with ResNet features and an SVM with a linear kernel.
Karthik et al. [40] developed a custom CNN model which learns latent feature filters, with a reported F1-Score of 97.20% and an accuracy of 99.80%. Ohata et al. [41] used a pre-trained MobileNet model with a linear SVM classifier and DenseNet-201 with a Multi-Layer Perceptron to detect COVID-19. They reported an accuracy and F1-Score of 95.6% for the MobileNet model and 95.6% for the DenseNet-201 model. De Moura et al. [42] demonstrated three end-to-end models for the classification of chest X-ray images from portable equipment using a dataset of 1616 images. Their proposed DenseNet-201 CNN models have reported accuracies of 79.62%, 90.62%, and 79.86%, respectively. Duran-Lopez et al. [43] proposed a deep learning model (COVID-XNet) trained using five-fold cross-validation, obtaining an accuracy of 94.43% and an AUC of 98.8%. Shorfuzzaman et al. [44] used a pre-trained ConvNet encoder with a contrastive loss function to learn representations of XCR image features. The learned features were then classified using a Siamese neural network, with a reported accuracy of 95.6% and an AUC of 97%. Shankar et al. [45] introduced a hand-crafted feature extraction method (FM-HCF-DLF) and a CNN based on the Inception-V3 architecture for the classification of XCR images. Their model yielded a sensitivity of 93.61%, specificity of 94.56%, accuracy of 94.08%, precision of 94.85%, F1-Score of 93.2%, and a Kappa value of 93.5%.
The 16 most recent papers using ML models reviewed above are compared in Table 1 in terms of the total number of images used, number of classes, adopted method, and reported accuracy. Table 1 shows important observations that warrant a detailed explanation. Out of the 16 papers reviewed, four addressed binary classification, with images ranging from 50 to 1531 and reported accuracy ranging from 95% to 99.0%; the average number of images and average accuracy in these four papers are 725 images and 96.81%, respectively. Ten papers built their ML model on a dataset with three classes; the reported accuracy ranges from 87% to 99.7% and the data size from 542 to 16,995 images, with an average dataset size of 4623 images and an average accuracy of 94.417%. Only two papers attempted the task with a four-class dataset: the first reported an accuracy of 89.0% using 1251 images, while the second reported 89.92% with 5941 images. From this survey, it can be noted that classification problems with fewer classes tend to achieve greater accuracy on small datasets. However, accuracy measured on a small amount of data is generally unreliable, since the model may generalize poorly. In addition, most of the datasets discussed are imbalanced, with the number of COVID-19 cases ranging from 25 to 1968 images, and some of the papers did not report the number of images per class. We reported an accuracy of 94.0% using a larger dataset that includes 3616 COVID-19 images, thus supporting better generalization than the four-class papers above. Table 1. Presents a survey of proposed COVID-19 chest X-ray classification and/or detection methodologies, the dataset(s) used in the research, and the reported accuracy for each proposed approach.

[Table 1 columns: Study; Number of Images, Classes and Size; Method(s); Reported Accuracy.]

Methods
This section discusses the transfer learning framework and the related deep transfer learning setting. We elucidate the conceptual framework of transfer learning in medical image analysis, and an exploratory data analysis is carried out to understand the dataset and its inherent characteristics. In addition, we state the performance criteria used to assess the generalization of the proposed framework.

Transfer Learning
We provide the formal notation for transfer learning. Consider the source domain D_S defined as

D_S = {𝒳_S, P(X_S)},

where 𝒳_S is the input space and P(X_S) is the marginal probability of the input. The source input X_S ⊂ 𝒳_S is defined as

X_S = {x_S1, x_S2, . . . , x_Sm}, with x_Si ∈ R^n,

where m is the number of vectors x_Si of size n. In D_S, a source task T_S is defined as

T_S = {𝒴_S, P(Y_S | X_S)},

where 𝒴_S is the label space and P(Y_S | X_S) is the conditional probability of the output given the input. This suggests that if Y_S ⊂ 𝒴_S, then the source output is

Y_S = {y_S1, y_S2, . . . , y_Sm}.

Each of the y_Si ∈ {c_1, c_2, . . . , c_k}, with c_k a given class. A target domain D_T is defined as

D_T = {𝒳_T, P(X_T)},

where 𝒳_T is the input space and P(X_T) is the marginal probability of the input. The target input X_T ⊂ 𝒳_T is given by

X_T = {x_T1, x_T2, . . . , x_Tm'}.

For the target domain D_T, a target task T_T is defined as

T_T = {𝒴_T, P(Y_T | X_T)},

where 𝒴_T is the label space and P(Y_T | X_T) is the conditional probability of the output given the input. The output Y_T ⊂ 𝒴_T is defined as

Y_T = {y_T1, y_T2, . . . , y_Tm'}.

The goal of transfer learning is to learn a representation X̃ such that the target predictive function f_T minimizes the expected loss E_(x,y)~D_T [L(f_T(x), y)], where L is the loss function, here the cross-entropy

L(f(x), y) = − Σ_{j=1..k} y_j log f_j(x).

The classification empirical risk R_D_T is measured by

R_D_T(f) = (1/m') Σ_{i=1..m'} L(f(x_Ti), y_Ti),

such that f_T = argmin_f R_D_T(f).

System Architecture
CNNs have increasingly been used in vision tasks such as detection, classification, and segmentation [55][56][57][58][59]. CNNs take biological inspiration from the visual cortex, a region of the brain containing cells that are sensitive to visual perception. The adoption of CNNs in ML stems from research conducted by Hubel and Wiesel in 1962, in which they demonstrated that some individual neuronal cells in the brain fire or activate only in the presence of certain edges in specific orientations [60]. A CNN architecture consists of convolutional layers, pooling layers used for down-sampling, non-linear activations, and fully connected layers. A convolutional layer has filters that act as feature detectors and selectors; the convolved region is known as the receptive field. The output of a convolution is a feature (or activation) map, which serves as input to deeper layers of the network. The role of the filter (a concatenation of kernels) is to detect low-level features such as edges, colors, curves, virtual lines and boundaries, and high-level features such as orientations, local surfaces and discontinuities, as first proposed by Marr [61]. In this paper, ImageNet (a dataset with over one million images and a thousand categories of objects) was used as the source domain for transfer learning to chest X-ray images. Figure 1 shows the proposed architectural framework for this representational transfer learning task. Transfer learning is a well-suited framework for healthcare computer vision tasks, where target domain datasets are significantly small and model generalization is a key consideration.
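The convolution, activation, then pooling pipeline described above can be sketched minimally in PyTorch; the channel counts and kernel sizes below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# A minimal convolutional block: convolution (a filter bank acting as a
# feature detector), non-linear activation, then pooling (down-sampling).
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),                    # non-linear activation
    nn.MaxPool2d(kernel_size=2),  # halves the spatial resolution
)

x = torch.randn(1, 3, 299, 299)   # one tensor at the dataset's 299x299 size
feature_map = block(x)            # feature/activation map fed to deeper layers
print(feature_map.shape)          # torch.Size([1, 16, 149, 149])
```

Each of the 16 output channels corresponds to one learned filter's response over the receptive fields of the input.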

System Model and Assumptions
A high-level overview of our proposed transfer learning framework is presented in Figure 2. The initial stage is dataset setup, which includes loading both the chest X-ray image scans and the associated class labels, followed by batch normalization and a cross-validation split into train, validation and test sets. We used various data augmentation approaches, such as zooming, flipping, rotation and mirroring, to make the model generalize better. Afterwards, we used the ResNet50 CNN architecture with Stochastic Gradient Descent for the transfer learning framework.

Dataset
The COVID-19 Radiography Database was used to carry out the experiments in this research [22]. This dataset was collected from multiple international sources at different times [62][63][64][65]. A summary of the dataset is presented in Table 2. All the X-ray images are in Portable Network Graphics (PNG) file format at a 299 × 299 pixel resolution. To understand the underlying signal distribution in the chest X-ray image dataset, we performed exploratory data analysis. Figure 3 gives a histogram of the dataset class distributions. Healthy and Lung Opacity samples compose about 80% of the dataset. Given that our primary objective is to recognise COVID-19 patients to aid early diagnosis and thus preemptive medical care, the figure indicates a problem of class imbalance. Viral Pneumonia is the least represented class at 6.4% of the total dataset; in such a situation, Precision, Recall or F1-Score are better suited as metrics than accuracy.

The dataset contains a substantial number of images compared to existing literature [4,66,67]. However, due to privacy concerns, additional clinical information about patients is not available. Thus, we proceed with the investigation of image patterns and relationships between the classes. The data are unbalanced, with almost 50% of samples belonging to the "Healthy" class, which may bias the model towards this class in terms of performance. The X-ray images are rank-three tensors representing height, width, and number of channels. We proceed by examining the inherent pattern between the image colour values and their class. Figures 4 and 5 show the RGB colour intensity distributions for the four classes, scaled between 0 and 255 pixel intensities for the individual image classes. The distributions illustrate how the minimum, mean, and maximum colour values are represented in the dataset.
We continue by observing the relationship between a sample X-ray image mean, Equation (1), and its standard deviation, Equation (2):

μ = (1 / (I_c I_h I_w)) Σ_{c=1..I_c} Σ_{h=1..I_h} Σ_{w=1..I_w} I(c, h, w),   (1)

σ = sqrt( (1 / (I_c I_h I_w)) Σ_{c=1..I_c} Σ_{h=1..I_h} Σ_{w=1..I_w} (I(c, h, w) − μ)² ),   (2)

where I_c is the number of colour channels, I_h is the height of the image and I_w is the width of the image.

The distribution of the whole dataset is very similar to that of the individual healthy and lung opacity patient images, because these two classes contribute the most to the dataset relative to the remaining classes of viral pneumonia and COVID-19. Separating by class, we can see that the mean, maximum, and minimum values vary according to the image class. Amongst all classes, viral pneumonia shows a Gaussian-like distribution across the three different distributions, while COVID-19 shows a nearly normal distribution. The maximum value possible for an image is 255, and most classes peak around it. From Figure 4, viral pneumonia is the class with the most samples having lower maximum values; most samples' RGB colour intensities are within the 200–225 range. Normal (Healthy) and lung opacity images show a very similar distribution of their mean values. This may be due to the two classes having the most sample X-ray images in the dataset. Moreover, the different peaks in the distribution could be attributed to the image source (e.g., two different hospitals where instrument noise, compression error or some other phenomena may have contributed to the veracity of the underlying signals). Figure 4 shows a similar distribution with regard to the maximum values, as indicated by the local minimum between intensities of 220 and 240. At the same time, normal patients have a peak at 150 and another around 250. Although the images are gray-scale, they have three channels containing repeated RGB values. Sample images from the dataset are visualised in Figure 6 in a rainbow colour map. Figure 7 depicts the dataset projected onto a two-dimensional plane and shows that most of the data points are clustered in the central region of the graph, implying that inter-class pixel intensity variability is low. However, COVID-19 samples show some clusters with a high channel colour mean and pixel intensities lying between 150 and 200.
This can be visualized in Figure 7 (bottom panel), a 10% zoom into the centre of Figure 7 (top panel). It can be observed that the samples with a lower mean and a low standard deviation are in the lower corner, while the upper corner has the converse situation. Because of the dense clustering of data points, Figure 7 gives a high-level overview of the dataset but not fine-grained information, so class-level depictions of the dataset are plotted in Figure 8. We observe that the classes COVID-19, Lung Opacity, and Viral Pneumonia have high intra-class variability, with outliers that are distant from the centroid. Figure 8 shows that Normal (Healthy) and Lung Opacity have similar data point cluster formations. This similarity in distributional pattern between the two classes is characterised by the spread of the respective graphs, where a majority of the samples lie in regions of high standard deviation and within 100 to 180 mean pixel intensities. Viral Pneumonia images, on the other hand, show a denser scatter plot, owing to the samples having higher in-class similarity. The graph of COVID-19 does not indicate any semblance to the other three classes: it has a higher variance and more outliers compared to the other classes, and the data points are scattered across all regions of the graph.
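The per-image statistics of Equations (1) and (2) underlying these scatter plots can be computed directly; below is a minimal NumPy sketch on a toy image whose pixel values are purely illustrative.

```python
import numpy as np

def image_mean_std(img):
    """Per-image mean and standard deviation over all channels and pixels,
    as in Equations (1) and (2); img has shape (I_c, I_h, I_w)."""
    n = img.size                                  # I_c * I_h * I_w
    mu = img.sum() / n                            # Equation (1)
    sigma = np.sqrt(((img - mu) ** 2).sum() / n)  # Equation (2)
    return mu, sigma

# Toy 2x2 gray-scale image replicated over 3 channels, mimicking the
# dataset's repeated RGB values.
img = np.tile(np.array([[0.0, 128.0], [128.0, 255.0]]), (3, 1, 1))
mu, sigma = image_mean_std(img)
print(round(float(mu), 2), round(float(sigma), 2))  # 127.75 90.16
```

Each image then contributes one (μ, σ) point to the two-dimensional projections of Figures 7 and 8.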

Learning Setting
The CNN architectures comprise varied hyperparameter configurations and tuning techniques. Due to the highly imbalanced nature of the dataset under study, an inverse class-weighting scheme, Equation (3), was used to balance the class weights and help avoid a biased model that performs well on the majority class and poorly on minority classes:

W_j = m / (k n_j),   (3)

where W_j is the weight of class j, m is the total number of training examples, k is the total number of classes, and n_j is the number of examples belonging to class j. Another useful technique that we use to help combat overfitting is time-based learning rate decay, which varies the learning rate over each iteration on the training batch to aid faster convergence, as shown in Equation (4):

η_{t+1} = η_t / (1 + ρ e_t),   (4)

where η_{t+1} is the new learning rate, η_t is the current learning rate, ρ is the decay rate hyperparameter and e_t is the epoch number at time t. A further technique used to obtain high performance is Stochastic Gradient Descent (SGD) with Momentum. Vanilla SGD is shown in Equation (5):

θ_j := θ_j − η ∇_{θ_j} L(θ_j),   (5)

where θ_j is the weight at batch j, updated with respect to the gradient at θ_j, which gives the direction of optimization across the loss landscape; η ∈ [0, 1] is the step size, and L(θ) is the loss function we seek to minimize.
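As a concrete illustration, the inverse class weights of Equation (3) can be computed from the class counts reported for this dataset, and the time-based decay of Equation (4) is a one-line function; the starting learning rate and decay rate below are toy choices.

```python
# Inverse class weighting (Equation (3)) for the reported class counts.
counts = {"COVID-19": 3616, "Lung Opacity": 6012,
          "Viral Pneumonia": 1345, "Normal": 10192}
m = sum(counts.values())  # 21,165 images in total
k = len(counts)           # 4 classes

weights = {c: m / (k * n_j) for c, n_j in counts.items()}  # W_j = m / (k * n_j)
print({c: round(w, 2) for c, w in weights.items()})
# Minority classes (e.g., Viral Pneumonia) receive the largest weights.

def decayed_lr(eta_t, rho, e_t):
    """Time-based decay (Equation (4)): eta_{t+1} = eta_t / (1 + rho * e_t)."""
    return eta_t / (1 + rho * e_t)

print(round(decayed_lr(0.1, 0.01, 10), 4))  # learning rate after 10 epochs
```

These weights can then be passed to a weighted loss so that misclassifying a minority-class image costs proportionally more.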
Stochastic Gradient Descent uses an iterative search to find the optimal parameters (weights) that minimize the objective (loss) function, thus obtaining a model that generalizes to OOD examples. However, finding the global minimum (or minima), which is the ideal objective of optimization, is a hard problem, largely due to saddle points or convergence of the optimization algorithm to a local minimum. This optimization challenge is especially common in deep learning problems, which are high-dimensional and suffer from the curse of dimensionality. Thus, Stochastic Gradient Descent with Momentum (SGDM), Equation (6), is a handy optimization technique to overcome the problems of saddle points and local minima:

v_{t+1} = ρ v_t − η ∇_θ L(θ_t),   θ_{t+1} = θ_t + v_{t+1}.   (6)
In SGDM, the term v is often considered the velocity and ρ the frictional force controlling the velocity as the weight parameter θ is updated in each iteration. Acting as a memory unit, the velocity v, which accumulates previous gradient information, addresses the problem of convergence to saddle points or local minima in the loss landscape, giving SGDM better generalization and performance guarantees than SGD.
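The velocity update can be sketched in a few lines of plain Python on a one-dimensional quadratic loss; the loss, step size and momentum values are toy choices for illustration only.

```python
# Minimal sketch of SGD with momentum on the toy loss L(theta) = theta**2,
# whose gradient is 2 * theta. eta is the step size, rho the momentum term.
def sgdm(theta0, eta=0.1, rho=0.9, steps=200):
    theta, v = theta0, 0.0
    for _ in range(steps):
        grad = 2 * theta          # gradient of L at the current iterate
        v = rho * v - eta * grad  # velocity accumulates past gradients
        theta = theta + v         # parameter update
    return theta

print(sgdm(theta0=5.0))  # converges close to the minimizer theta = 0
```

On a real network the same two update lines are applied element-wise to every weight tensor, with the gradient supplied by backpropagation.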

Simulation Environment
Experiments were carried out on the Google Colaboratory (Colab) platform. Colab provides free access to a high-performance Virtual Machine (VM) that dynamically allocates NVIDIA K80, T4, P4 and P100 Graphics Processing Units (GPUs), Random Access Memory (RAM) and GDrive storage for high-end computing.

Performance Metrics
Model performance evaluation is a key constituent of the pipeline for building any ML system. Given that the primary focus of such a model is to perform well on unseen future data, evaluating the train, validation and test sets gives a good indication of the generalization bounds of the model. In that regard, the confusion matrix is a useful tool for evaluating a classification model: it is an intuitive cross-tabulation of actual class values against predicted class values, counting every observation that falls in each category. • F1-Score: the harmonic mean of recall and precision. This is a good measure of performance when the classes are imbalanced.
• False Positive Rate (FPR): the fraction of negative examples incorrectly classified out of the total number of negative samples. It is also regarded as the complement of specificity.
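The metrics above can all be derived per class from a multi-class confusion matrix; the following NumPy sketch uses a toy 3-class matrix, not results from this work.

```python
import numpy as np

# Toy confusion matrix: rows = actual class, columns = predicted class.
cm = np.array([[50, 2, 3],
               [4, 40, 6],
               [1, 5, 44]])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp   # predicted as class j but actually another class
fn = cm.sum(axis=1) - tp   # actually class j but predicted as another class
tn = cm.sum() - tp - fp - fn

precision = tp / (tp + fp)
recall = tp / (tp + fn)                         # sensitivity / true positive rate
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                            # complement of specificity
print(np.round(f1, 3))                          # [0.909 0.825 0.854]
```

Extending the matrix to 4 × 4 gives the per-class metrics for the four pneumonia classes studied here.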

Results
This section presents the results of all experiments conducted in this work. Three CNN architectures, Visual Geometric Group-19 (VGG-19), Densely Connected Convolutional Network-121 (DenseNet-121), and Deep Residual Network-50 (ResNet-50), were used to carry out the experiments. The results of the models show some performance similarities as well as marked contrasts, highlighted in the illustrations (loss and accuracy curves, confusion matrices and Receiver Operating Characteristic curves). The goal is a model that shows a robust bias-variance trade-off and, ideally, out-of-distribution (OOD) generalization. The hyperparameter choices and configurations in the present work were empirically driven and are indicative of best practices in modeling with deep neural networks, since deep network models are highly opaque and black-box in nature.
Using a 5-fold cross validation technique, Figure 9 shows the train and validation loss and Figure 10 shows the train and validation accuracy for the VGG-19 model. With this technique, the VGG-19 model obtained a best train and validation accuracy of 96.4% and 93.4%, respectively, and a test accuracy of 93.99%, surpassing the other two models with respect to test accuracy. This is supported by the side-by-side comparison of the test image classification summary in Table 3. However, accuracy is not very informative when a dataset is highly imbalanced, as in the current work. Moreover, the objective of developing a model strongly influences the choice of performance metric: there is a trade-off, and a model should have high sensitivity towards certain classes. In the current context, a COVID-19 case that is misclassified and goes undetected can lead to a high community infection rate, hindering efforts to contain the virus. The confusion matrix in Figure 11 and the Receiver Operating Characteristic (ROC) curve in Figure 12 therefore give a deeper understanding of the classifier's performance across the four classes; the ROC curve characterizes the AUC, which indicates the classifier's performance per class on the test dataset. From the ROC curve, the VGG-19 model showed a remarkable AUC score of 1.0 on the COVID-19 images. The advantage of this measure is that it contrasts the True Positive Rate with the False Positive Rate, so a higher AUC indicates better class-level performance, and the reverse holds true. In summary, the model was able to learn disentangled representations of COVID-19 images and classify them better than the other three classes.
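The 5-fold protocol can be sketched with NumPy alone; `train_and_evaluate` in the usage line is a hypothetical stand-in for fitting and scoring one model on a fold:

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)        # shuffle once before splitting
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# usage (train_and_evaluate is hypothetical):
# scores = [train_and_evaluate(tr, va) for tr, va in kfold_indices(len(dataset))]
```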
In the next experiment, a DenseNet-121 model was trained for 100 epochs, using the same test size used to test the VGG-19 model. The goal of the experiment was to compare the models' generalization performance so as to select the best model. From Figures 13 and 14, one can notice that the model converged to the optimum with a high degree of instability, in both the train and validation loss and the train and validation accuracy graphs. The train loss remained consistently lower than the validation loss, even though weight decay, class weighting, learning rate scheduling, checkpointing and regularization techniques were used. A similar phenomenon was noticed with the train and validation accuracy curves. However, the DenseNet-121 model shows a greater degree of convergence stability in the last 50 epochs than the VGG-19 model, indicated by a lower spread (upper and lower standard deviations) of the curves. Nonetheless, the model achieved an overall train, validation and test accuracy of 97.4%, 93.58% and 93.24%, respectively. To better understand and analyze the class-level performance of the model, Figures 15 and 16 show the test confusion matrix and the ROC curve, respectively. From the confusion matrix, most of the images were classified correctly; however, the same problem of high misclassification between the Normal and Lung Opacity classes persists. This clearly warrants further inquiry into the data generation process (at least for the classes in question). It is worth noting that the dataset was aggregated from multiple sources, which can inevitably introduce a range of problems, from measurement errors to wrong class labelling. Given that the field of radiology requires dedicated training and specialization, annotating medical images relies on domain knowledge and tracing labelling problems is painstakingly difficult. Nonetheless, the four classes showed a remarkable AUC performance relative to the other two models.
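The class weighting mentioned above is one common way to counter the dataset's imbalance. A minimal inverse-frequency sketch follows; the per-class counts passed in the usage line are illustrative:

```python
import numpy as np

def class_weights(counts):
    """Weight each class inversely to its frequency, normalised so the mean weight is 1.
    Rare classes (e.g. COVID-19) then contribute more to the loss per sample."""
    counts = np.asarray(counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

# Illustrative per-class image counts: COVID-19, Lung Opacity, Normal, Viral Pneumonia
w = class_weights([3616, 6012, 10192, 1345])
```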
The COVID-19 class shows a consistent 1.0 AUC score in the DenseNet-121 model, while Lung Opacity has the lowest AUC score of 0.982. Finally, a ResNet-50 pre-trained CNN model was used for transfer learning on the XCR dataset. After fine-tuning, the model obtained an overall test accuracy of 93.85%. One can discern from the loss curves in Figure 17 and the accuracy curves in Figure 18 that the model converged nearly smoothly to the optimum during training, indicated by the close fit between the train and validation loss and accuracy curves across all 100 epochs. This phenomenon can be explained by the robust representational power of ResNet-50 in tackling the vanishing gradient problem through the residual connections in its architectural formulation. The residual connection in ResNet allows not only the building of deep representational stacks of hidden layers but also the preservation of information and gradient flow through the layer-wise transformations of a CNN.
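The role of the residual connection can be illustrated with a toy NumPy block: because the input is added back to the transformed signal, the identity path survives even when the learned transformation contributes nothing. The shapes and the two-layer linear form of F here are illustrative; ResNet's actual blocks use convolutions and batch normalization:

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = x + F(x), where F is two linear maps with a ReLU in between.
    The identity shortcut lets information and gradients bypass F entirely."""
    h = np.maximum(0.0, x @ w1)   # ReLU non-linearity
    return x + h @ w2             # skip connection: add the input back

x = np.ones((1, 4))
# With zero weights F(x) = 0, so the block reduces to the identity map.
y = residual_block(x, np.zeros((4, 4)), np.zeros((4, 4)))
```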
The test confusion matrix is shown in Figure 19 to comparatively assess the class-level performance of the model. In this figure, as with the former models, Lung Opacity has the highest misclassification with a total of 68 images, followed by the Normal class with 31 misclassified images. A subset of images in these two classes appears to be entangled, so each is predicted as the other by the model. Overall, VGG-19 has the lowest number of misclassifications with a total of 137, followed by ResNet-50 with 140 and DenseNet-121 with 153 misclassified XCR images. The ROC curve in Figure 20 depicts the False Positive Rate (FPR) against the True Positive Rate (TPR) for the four classes under study. The results of the experiments indicate that ResNet-50 achieved a better AUC performance than the VGG-19 and DenseNet-121 models. Based on the analysis of the performance of the three models, we observe that ResNet-50 is well suited for the detection and classification of traditional pneumonia and pneumonia induced by COVID-19 from chest X-ray images. To understand the nature of the learnt representation in the fine-tuned ResNet-50 model, the activation map for layer 48 is shown in Figure 21. In this layer, the model learned the latent factors of variation in the chest X-ray images, as shown by the first four rows in the 16 × 16 grid of filters. The specific firing patterns of the filters for separate image classes are a striking feature of the hierarchical, disentangled representations learnt by the CNN model, showing that the network has learnt inherent invariant regularities in the XCR dataset.
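Activation maps like the one in Figure 21 can be read out by building a probe model on an intermediate layer. The tiny network and layer names below are illustrative stand-ins for layer 48 of the fine-tuned ResNet-50:

```python
import numpy as np
import tensorflow as tf

# Illustrative small CNN; in our experiments the model is the fine-tuned ResNet-50.
inputs = tf.keras.Input(shape=(64, 64, 3))
x = tf.keras.layers.Conv2D(16, 3, activation="relu", name="conv_a")(inputs)
x = tf.keras.layers.Conv2D(32, 3, activation="relu", name="conv_b")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(4, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# Probe model: same input, but its output is the chosen layer's activation.
probe = tf.keras.Model(inputs, model.get_layer("conv_b").output)
maps = probe(np.zeros((1, 64, 64, 3), dtype=np.float32))  # one dummy image
```

Each channel of `maps` is one filter's response; tiling the channels into a grid yields a visualization like the 16 × 16 grid in Figure 21.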

Discussion
The maturation of the machine learning and computer vision fields has generated remarkable research interest in their application to medical image analysis. Traditionally, medical image analysis is done by radiologists, a process that is laborious and time-consuming. Thus, extensive scholarly work has been done on medical image analysis with computer vision and machine learning to help improve healthcare outcomes. The outbreak of COVID-19, caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has led to extensive research into the application of deep learning for fast and accurate detection of the disease. Deep learning for the detection of COVID-19 from the chest X-ray modality has primarily been used in prior work as a viable complementary test to the Reverse Transcriptase-Polymerase Chain Reaction (RT-PCR). The existing literature on deep learning models for the detection of COVID-19 mostly utilizes unrealistic experimental setups (small, overly handcrafted and augmented datasets; performance scored on the validation rather than the test set; and ensembles of models that compound compute cost, thus introducing reproducibility inequity). On that ground, we carried out experiments using a noisy dataset of chest X-ray images collected from 20 health centres across the world. This was done primarily to tackle the problems of distribution shift and concept drift. Distribution shift (covariate shift) happens when the distribution of the independent variables shifts, potentially due to spatio-temporal variability in latent processes.
Concept drift, by contrast, is a change in the statistical properties of the target variables over time in unforeseen ways. A survey of recent work on detecting and classifying COVID-19 from chest X-ray images is presented in Table 1. Our results show an overall accuracy of 94.0% using the VGG-19 model. However, upon fine-grained performance analysis, we showed that ResNet-50 has the best test performance with respect to ROC curve analysis.

Conclusions and Future Directions
In this present work, we proposed a framework for the diagnosis and classification of traditional pneumonia and pneumonia induced by COVID-19 from chest X-ray images.
We demonstrated that transfer learning is a promising direction for training medical diagnostic deep learning models where access to annotated datasets is limited, as manual labelling in such a setting is laborious and expensive because it requires domain expertise. Thus, using 5-fold cross validation, our work indicates the potential of transfer learning to aid fast and accurate early detection of COVID-19, especially in asymptomatic patients. Nonetheless, further inquiry is required into hyperparameter optimization techniques, such as Grid Search, Bayesian Optimization or Evolutionary Optimization, to find the set of hyperparameters that yields better test performance. This was a limitation of our study, owing to the limited compute power available to carry out such a hyperparameter space search in a deep learning setting. Additionally, we compared our work with the literature related to the detection of COVID-19 from chest X-ray images. Future research may compare the computed tomography modality with chest X-ray to determine which is better suited for diagnosing COVID-19.

Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://www.kaggle.com/tawsifurrahman/covid19-radiography-database (accessed on 5 August 2021).

Conflicts of Interest:
The authors declare no conflict of interest.