A Novel Transfer Learning Based Approach for Pneumonia Detection in Chest X-ray Images

Abstract: Pneumonia is among the top diseases causing most deaths all over the world. Viruses, bacteria and fungi can all cause pneumonia. However, it is difficult to diagnose pneumonia just by looking at chest X-rays. The aim of this study is to simplify the pneumonia detection process for experts as well as for novices. We suggest a novel deep learning framework for the detection of pneumonia using the concept of transfer learning. In this approach, features from images are extracted using different neural network models pretrained on ImageNet, which are then fed into a classifier for prediction. We prepared five different models and analyzed their performance. Thereafter, we proposed an ensemble model that combines the outputs from all pretrained models, which outperformed the individual models, reaching state-of-the-art performance in pneumonia recognition. Our ensemble model reached an accuracy of 96.4% with a recall of 99.62% on unseen data from the Guangzhou Women and Children's Medical Center dataset.


Introduction
Today's deep learning models can reach human-level accuracy in analyzing and segmenting an image [1]. The medical industry is one of the most prominent fields where deep learning can play a significant role, especially when it comes to imaging. Deep learning can be used in a wide variety of areas, like the detection of tumors and lesions in medical images [2,3], computer-aided diagnostics [4,5], the analysis of electronic health-related data [6], the planning of treatment and drug intake [7], environment recognition [8] and brain-computer interfaces [9], aiming to provide decision support for the evaluation of a person's health. The key element of the success of deep learning is the capability of neural networks to learn high-level abstractions from raw input data through a general-purpose learning procedure [10].
Although deep learning currently cannot replace doctors/clinicians in medical diagnosis, it can support experts in the medical domain in performing time-consuming work, such as examining chest radiographs for the signs of pneumonia.
Pneumonia is an inflammation of the lungs that may be caused by pathogens like bacteria, viruses and fungi [11]. It can occur in anyone, even young or healthy people. It becomes life-threatening for infants, people having other diseases, people with an impaired immune system, elderly people, people who are hospitalized and have been placed on a ventilator, people with a chronic disease like asthma, and people who smoke cigarettes. The cause of pneumonia also determines its severity. Viral pneumonia is milder, and its symptoms occur gradually. However, it can become complicated to diagnose if a bacterial infection develops at the same time as viral pneumonia. On the other hand, bacterial pneumonia is more severe, and its symptoms can occur gradually or even suddenly, especially among children [12]. This type of pneumonia affects a large part of the lungs and can spread to multiple lobes. When multiple lobes of the lungs are affected, a person needs to be hospitalized [13]. Another form of pneumonia is fungal pneumonia, which can occur in people with weak immune systems. This type of pneumonia can also be dangerous, and requires time for the patient to regain health. Therefore, there is an urgent need to perform research and to develop new methods of computer-aided diagnosis to reduce pneumonia-related mortality, especially child mortality, in the developing world [14].
The analysis of chest radiographs has a crucial role in medical diagnostics and the treatment of the disease. The Centers for Disease Control and Prevention (CDC, Atlanta, GA, USA) reported that about 1.7 million adults in the US seek care in a hospital due to pneumonia every year, and about 50,000 people died from pneumonia in the United States in 2015 [15]. Chronic obstructive pulmonary disease (COPD) is a primary cause of mortality in the United States, and it is projected to increase by 2020 [16]. The World Health Organization (WHO, Geneva, CH) reported that pneumonia is one of the leading causes of death for children under 5 years of age all over the world, killing an estimated 1.4 million, which is about 18% of all deaths of children under the age of 5 years worldwide [17]. More than 90% of new diagnoses of children with pneumonia happen in underdeveloped countries with few medical resources available. Therefore, the development of cheap and accurate pneumonia diagnostics is required.
Recently, a number of researchers have proposed different artificial intelligence (AI)-based solutions for different medical problems. Convolutional neural networks (CNNs) have allowed researchers to obtain successful results in a wide range of medical problems, like breast cancer detection, brain tumor detection and segmentation, disease classification in X-ray images, etc. [18].
For example, Wang et al. [19] contributed a new dataset called ChestX-ray8, which contains 108,948 frontal X-ray images of 32,717 different patients. They achieved promising results using a deep CNN. They further stated that this dataset can be extended using more disease labels. Beyond chest X-rays, Ronneberger et al. [20] used the power of data augmentation and CNNs, and achieved great results even when training on a small sample of images. In another study, Roth et al. [21] showed how deep CNNs can be used to detect lymph nodes. They obtained promising results even with images of adverse quality. Shin et al. [22] worked on different CNN architectures and addressed the problem of lymph node and lung disease detection.
Rajpurkar et al. [23] suggested CheXNeXt, a very deep CNN with 121 layers, to detect 14 different pathologies, including pneumonia, in frontal-view chest X-rays. First, neural networks were trained to forecast the probability of 14 abnormalities in the X-ray image. Then, an ensemble of those networks was used to issue predictions by calculating the mean of the individual networks' predictions. Woźniak et al. [24] suggested a novel algorithm for training probabilistic neural networks (PNNs) that allows one to construct smaller networks. Gu et al. [25] used a 3D deep CNN and a multiscale forecasting strategy, with cube clustering and multiscale cube prediction, for lung nodule detection. Ho and Gwak [26] used a pretrained DenseNet-121 together with four types of local and deep features, acquired using the scale-invariant feature transform (SIFT), GIST, local binary patterns (LBP), a histogram of oriented gradients (HOG) and CNN features, for the classification of 14 thoracic diseases.
Jaiswal et al. [27] used Mask R-CNN, a deep neural network that utilizes both global and local features for pulmonary image segmentation, combined with image augmentation, dropout and L2 regularization, for pneumonia identification. Jung et al. [28] used a 3D deep CNN (3D DCNN) with shortcut connections and a 3D DCNN with dense connections. The shortcut and dense connections solve the vanishing gradient problem and allow deep networks to capture both the general and specific features of lung nodules. Lakhani and Sundaram [29] used AlexNet and GoogLeNet neural networks with data augmentation and without any pre-training to obtain an Area Under the Curve (AUC) of 0.94-0.95. An improved architecture using transfer learning and a two-network ensemble achieved an AUC of 0.99.
Li et al. [30] used a CNN-based approach combined with rib suppression and lung field segmentation. For pixel patches acquired in the lung area, three CNNs were trained on images of different resolutions, and feature fusion was applied to merge all information. Liang et al. [31] designed a novel network with residual structures, which has 49 convolutional layers, only one global average pooling layer and two densely connected layers, for pediatric pneumonia diagnosis. Nam et al. [32] used a deep CNN with 25 layers and eight residual connections. The outputs of three networks trained with different hyperparameter values were averaged for the prediction of malignant pulmonary nodules in chest radiographs. Nasrullah et al. [33] used two 3D customized mixed link network (CMixNet) architectures. Lung nodule recognition was performed with Faster R-CNN on features learned from CMixNet and a U-Net-like encoder-decoder, while a gradient boosting machine (GBM) was used for classification. Pasa et al. [34] proposed a custom neural network consisting of five convolutional blocks (each having two 3 × 3 convolutions with Rectified Linear Units (ReLUs), followed by a max-pooling operation), a global average pooling layer and a fully connected softmax layer with two outputs.
Pezeshk et al. [35] suggested a 3D fully convolutional network for the fast screening and generation of candidate suspicious regions. Next, an ensemble of 3D CNNs was trained using extensive data augmentations of the positive and negative patches. The classifiers were trained on false positive patches using different thresholds and data augmentation types. Finally, the outputs of the second-stage networks were averaged to produce the final prediction. Sirazitdinov et al. [36] suggested an ensemble of RetinaNet and Mask R-CNN networks for pneumonia localization. First, the networks recognized the pneumonia-affected regions; then, non-maximum suppression was applied to the predicted lung regions.
Souza et al. [37] used an AlexNet-based CNN for lung patch classification. Then, a second CNN model, based on ResNet18, was employed to reconstruct the missing parts of the lung area. The output is obtained by the ensemble combination of the initial segmentation and reconstruction. Taylor et al. [38] used standard network architectures (VGG16/19, Xception, Inception and ResNet) and explored the parameter space of pooling and flattening operations. Final results were obtained by fully connected layers followed by a sigmoid function. Wang et al. [39] split X-ray images into six types of patches and used ResNet to classify the six patches and recognize lung nodules. They used rotation, translation and scaling techniques for data augmentation. Xiao et al. [40] proposed a multi-scale heterogeneous 3D CNN (MSH-CNN), which uses (1) multiscale 3D nodule blocks with contextual data as input; (2) two 3D CNNs to acquire expression features; and (3) a set of weights defined by backpropagation for feature fusion. Xie et al. [41] suggested a semi-supervised adversarial classification (SSAC) framework that is trained using both labeled and unlabeled data. They used an adversarial autoencoder-based unsupervised network for reconstruction, a supervised network for classification, and transition layers that map from the reconstruction network to the classification network.
Xu et al. [42] designed a hierarchical CNN, CXNet-m1, which used a novel sin-loss function to learn from misclassified and indistinguishable images for anomaly identification on chest X-rays. Yates et al. [43] retrained the final layer of the Inception v3 CNN model and performed binary normality classification. In addition, da Nóbrega et al. [44] used a ResNet50 feature extractor combined with an SVM RBF classifier for the recognition of the malignancy of a lung nodule. Ke et al. [45] used the spatial distribution of hue, saturation and brightness in the X-ray images as image descriptors, and an ANN with heuristic algorithms (Moth-Flame and Ant Lion optimization) to detect diseased lung tissues. Finally, Behzadi-khormouji et al. [46] used transfer learning with DCNNs pretrained on the ImageNet dataset, combined with a problem-based ChestNet architecture with unnecessary pooling layers removed, and a three-step pre-processing to support model generalization. Zhang et al. [9] systematically investigated brain signal types and related deep learning concepts for brain signal analysis, and covered open challenges and future directions with the help of 230 contributions by researchers. Zhou et al. [5] proposed a three-stage deep feature learning and fusion framework for identifying Alzheimer's disease and its prodromal status. Fan et al. [47] proposed a fully convolutional network deep learning approach for image registration that predicts the deformation field and is insensitive to parameter tuning, using a hierarchical loss. Liu et al. [1] exploited a deep convolutional network to explore the uses of local description and feature encoding for image representation and registration.
Apart from these significant achievements, CNNs work very well on large datasets. However, most of the time they fail on small datasets if proper care is not taken. To reach the same level of performance even on a small dataset, and to distinguish pneumonia from normal chest X-rays, we conducted this study to propose a novel transfer learning approach using architectures pretrained on ImageNet. We used five different pretrained models and analyzed their performance. Finally, the combination of all five models was taken, forming a large ensemble architecture and achieving promising results.
Our main contribution in this paper is to provide a novel ensemble approach based on the transfer learning of five different neural network architectures.

Outline of Methodology
Our methodology is summarized in Figure 1. It includes the following steps: chest X-ray image preprocessing, data augmentation, transfer learning using the AlexNet, DenseNet121, InceptionV3, ResNet18 and GoogLeNet neural networks, feature extraction and ensemble classification. These steps are explained in more detail in the subsequent subsections.
Appl. Sci. 2020, 10, x FOR PEER REVIEW

Data Pre-Processing and Augmentation
All of the pre-trained models were quite large relative to this dataset, and each model could easily overfit. To prevent this, some noise was added to the dataset; it is well known that adding noise to the inputs of a neural network can, in some situations, lead to a significant improvement in generalization. Moreover, adding noise acts as a form of dataset augmentation. Other augmentation techniques were also used. Since not all augmentation approaches are suitable for X-ray images, we processed the images in four steps. First, we resized images to 224 × 224 × 3; then three augmentation techniques were used: Random Horizontal Flip (to deal with pneumonia symptoms on either side of the chest), Random Resized Crop (to capture deeper relations among pixels), and finally varying the intensity of the images.

Convolutional Neural Networks and the Use of Transfer Learning
Currently, modern deep learning models in computer vision use convolutional neural networks (CNNs). Convolutional layers make the explicit assumption that their input is an image. Early convolutional layers in a network process an image and detect low-level features such as edges. These layers are successful in capturing the spatial and temporal dependencies in an image. This is done with the help of filters. Unlike normal feed-forward layers, these layers have a much lower number of parameters and use a weight-sharing technique, thus reducing computational effort. The learnable parameters of each layer consist of filters (or kernels) that extend through the full depth of the input volume but have a small receptive field. During a forward pass, each kernel is convolved across the height and width of the input volume, creating a 2D activation map of that filter. If N filters are used, then stacking those N activation maps along the depth forms the full output of the convolutional layer [48] (see Figure 2).
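The stacking of N activation maps along the depth can be illustrated with a small PyTorch sketch; the filter count of 8 is arbitrary:

```python
import torch
import torch.nn as nn

# A convolutional layer with N = 8 filters: each filter extends through the
# full input depth (3 channels) but has a small 3 x 3 receptive field.
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)

x = torch.randn(1, 3, 224, 224)  # one image-sized input volume
maps = conv(x)                   # 8 activation maps stacked along the depth
```

Each of the 8 output channels is the 2D activation map of one filter convolved across the height and width of the input.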


The activation layer is very useful, as it helps to approximate almost any nonlinear function [49]. The feature map from the convolutional layer is taken as input to the activation layer.
Pooling layers are used to reduce the spatial size of the representation generated by the preceding convolutions. This helps in reducing the number of parameters, and thus the computational work. These layers extract dominant features that are invariant to position and rotation. It is common practice to include a pooling layer between two convolutional layers. The most common pooling layer is the max pooling layer; it separates the input into squares of a given size and outputs the maximum value of each square. An average pooling layer, on the other hand, outputs the average of each square. Both methods reduce the dimensionality and the computational effort [50].
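Both pooling variants can be sketched in PyTorch; the tensor sizes are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 224, 224)

# Max pooling: split the input into 2 x 2 squares, keep the maximum of each.
max_out = nn.MaxPool2d(kernel_size=2)(x)

# Average pooling: keep the mean of each 2 x 2 square instead.
avg_out = nn.AvgPool2d(kernel_size=2)(x)
```

Both halve the spatial size (224 → 112) while keeping the channel count; the maximum of each square is, by definition, at least its mean.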
When working on a similar computer vision problem, we can use such pre-trained models instead of going through the long process of training models from scratch. This method of transferring learning from one predefined and trained model to some new domain by reusing the network layer weights is called transfer learning (see Figure 3). Transfer learning is a very useful technique and has achieved significant results in computer vision and other areas as well [51][52][53][54].

AlexNet Architecture
AlexNet is a CNN that is similar to LeNet [60], but deeper. This network replaced the tanh function with the Rectified Linear Unit (ReLU) to add non-linearity. It used dropout layers instead of regularization to deal with overfitting. Overlapping pooling was also used to reduce the size of the network. We used a pre-trained AlexNet as one of our models and applied transfer learning (Figure 4) by freezing the convolutional layers and training the classifier only.


DenseNet121 Architecture
The DenseNet architecture requires fewer parameters than a traditional CNN. DenseNet layers use only 12 filters, producing a small set of new feature maps (Figure 5). A potential problem with DenseNet is the training time, because every layer takes its input from all previous layers. However, DenseNet addresses this by giving each layer direct access to the gradients from the loss function and to the input image. This significantly reduces the computation cost and makes this model a better choice.
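The dense connectivity pattern can be sketched with a toy layer; this is not the actual DenseNet121 implementation, only an illustration of concatenating each layer's 12 new feature maps (the growth rate mentioned above) with everything before it:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """Toy DenseNet-style layer: produce 12 new feature maps and
    concatenate them with the layer's input along the channel axis."""
    def __init__(self, in_channels, growth_rate=12):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.cat([x, torch.relu(self.conv(x))], dim=1)

# Each layer sees all preceding feature maps; channels grow by 12 per layer.
x = torch.randn(1, 16, 32, 32)
x = DenseLayer(16)(x)  # 16 + 12 = 28 channels
x = DenseLayer(28)(x)  # 28 + 12 = 40 channels
```

Because each layer only adds 12 channels, the per-layer parameter count stays small, which is why DenseNet needs fewer parameters overall.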

ResNet18 Architecture
The ResNet model comes with a residual learning framework to simplify the training of deeper networks. The architecture is based on the reformulation of network layers as learning residual functions with respect to the layer inputs. The residual network is eight times deeper than VGG nets [61], but its complexity is lower. Figure 6 shows the architecture used.
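A minimal sketch of a residual function with an identity shortcut; this is simplified, as the real ResNet18 blocks also use batch normalization and strided downsampling:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Toy ResNet-style block: the layers learn a residual F(x) with
    respect to the input, and the shortcut adds x back to the output."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(torch.relu(self.conv1(x)))
        return torch.relu(residual + x)  # identity shortcut

out = ResidualBlock(16)(torch.randn(1, 16, 56, 56))
```

The shortcut gives gradients a direct path to earlier layers, which is what makes very deep networks trainable.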


InceptionV3 Architecture
The Inception V3 model allows for increasing the depth and width of the deep learning network while keeping the computational cost constant. This model was trained on the original ImageNet dataset with over 1 million training images. It works as a multi-level feature generator by computing 1 × 1, 3 × 3 and 5 × 5 convolutions. This allows the model to apply all of these kernels to the image and to gather the results from all of them. All such outputs are stacked along the channel dimension and used as input to the next layer. This model achieved top performance in computer vision tasks by using some advanced techniques. The architecture we use is shown in Figure 7.
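The multi-branch computation can be sketched as a toy module; the branch widths are arbitrary, and the real Inception V3 modules are more elaborate, with factorized convolutions and pooling branches:

```python
import torch
import torch.nn as nn

class InceptionBranches(nn.Module):
    """Toy inception-style module: apply 1x1, 3x3 and 5x5 convolutions to
    the same input and stack all outputs along the channel dimension."""
    def __init__(self, in_channels, branch_channels=16):
        super().__init__()
        self.b1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.b3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_channels, branch_channels, kernel_size=5, padding=2)

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

# Three 16-channel branches stack into 48 output channels.
out = InceptionBranches(32)(torch.randn(1, 32, 28, 28))
```

Padding is chosen so all branches keep the same spatial size, which is what makes channel-wise stacking possible.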

GoogLeNet Architecture
The architecture of GoogLeNet is completely different from that of AlexNet. This model uses global average pooling. It also contains inception modules, which apply convolutions with different kernels to the same input; all outputs are then stacked as the final output of that layer. For instance, the inception module performs 1 × 1, 3 × 3 and 5 × 5 convolutions and stacks all the outputs together. This model uses 1 × 1 convolutions to reduce the dimensionality and computation cost. The architecture used, showing the trainable and "frozen" layers, is presented in Figure 8.
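The parameter savings from the 1 × 1 reduction can be checked directly; the channel counts here are illustrative, not GoogLeNet's exact ones:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Direct 5x5 convolution from 192 to 32 channels: ~154k parameters.
direct = nn.Conv2d(192, 32, kernel_size=5, padding=2)

# GoogLeNet-style: a 1x1 convolution first reduces 192 channels to 16, so
# the expensive 5x5 convolution operates on far fewer channels: ~16k
# parameters, an almost tenfold reduction.
reduced = nn.Sequential(
    nn.Conv2d(192, 16, kernel_size=1),
    nn.Conv2d(16, 32, kernel_size=5, padding=2),
)
```

The same trick also cuts the multiply-add count proportionally, since each output position reuses the same (much smaller) kernels.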


Ensemble Classification
To combine the predictions of the five pre-trained neural networks, we use the ensemble classification approach shown in Figure 9. The outputs of the pre-trained neural networks are combined into a prediction vector, and majority voting is used to reach a final prediction.
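A minimal sketch of majority voting over the prediction vector:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model predictions into a final label by majority voting.
    `predictions` is the prediction vector: one class label per model."""
    return Counter(predictions).most_common(1)[0][0]

# Five pre-trained models vote on one chest X-ray; with an odd number of
# binary classifiers, a tie is impossible.
votes = ["pneumonia", "normal", "pneumonia", "pneumonia", "normal"]
final = majority_vote(votes)
```

Here three of the five models vote "pneumonia", so the ensemble outputs that label even though two models disagree.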

Dataset
For evaluation, we used a dataset from the Guangzhou Women and Children's Medical Center [62]. The dataset contains a total of 5232 images, where 1346 images belong to the Normal category and 3883 images depict Pneumonia, out of which 2538 images depict Bacterial Pneumonia and 1345 images depict Virus Pneumonia. Sample images for each class are presented in Figure 10. The partition of this dataset for cross-validation is summarized in Table 1. After the network model is trained using the training set, the test set of images is used to evaluate the accuracy of the model.


Figure 10 panels: Normal, Bacterial Pneumonia, Virus Pneumonia.

Dataset
For evaluation we used a dataset from the Guangzhou Women and Children's Medical Center [62].The dataset contains a total of 5232 images, where 1346 images belong to the Normal category, 3883 images depicted Pneumonia, out of which 2538 images belongs to Bacterial Pneumonia and 1345 images depicted Virus Pneumonia.The sample images for each class are represented in Figure 10.The partition of this dataset for cross-validation is summarized in Table 1.After the network model is trained using the training test, the test set of images is used to evaluate the accuracy of the model.

Normal
Bacterial Pneumonia Virus Pneumonia

Results
The primary goal of our transfer learning approach was to correctly distinguish pneumonia from normal chest X-ray images. For this, we prepared all of the models as described above and trained them separately. We performed training and testing on a computer with an Intel(R) Core(TM) i7-6700 CPU at 3.30 GHz (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce GTX 1070 8 GB GPU (NVIDIA Corporation, Santa Clara, CA, USA) and 48 GB of RAM. For training, we used the Adam optimizer [63] and the cross-entropy loss function. The learning rate starts at 0.001 and is halved every three epochs. AlexNet was trained for 200 iterations at a learning rate of 0.001 and then trained at a very low learning rate of 0.00001; it achieved a test accuracy of 92.86% and a train accuracy of 93.0%. Its AUC value was 97.83%. ResNet18 did better than AlexNet and the other models; it achieved an area under the ROC curve of 99.36% and a test accuracy of 94.23%.
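The learning-rate schedule described above (start at 0.001, halve every three epochs) can be written as a small step-decay function:

```python
def learning_rate(epoch, base_lr=1e-3, drop_every=3, factor=0.5):
    # Start at base_lr and multiply by `factor` every `drop_every` epochs.
    return base_lr * factor ** (epoch // drop_every)
```

In PyTorch the same schedule is available as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.5)`.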
The average computation time across all models is 0.332 s on CPU and 0.043 s on GPU, whereas the ensemble model took 0.161 s.
Figure 11 shows plots of accuracy and loss against epochs.The best results were obtained by the ResNet18 network both in terms of accuracy and loss values.
Figure 12 shows the plot of sensitivity against specificity for each model and the ensemble model on the test set. Note that the ensemble model performed best.
Figure 13 shows the activation maps in early convolutional layers for each of the models. Note that the networks can capture and learn similar activation maps, albeit at different layers. For example, the Conv1_4 layer activation map of ResNet18 is very similar to the Conv1_1 layer activation map of Inception V3. The activation maps show that, in most cases, the network correctly differentiates between bacterial pneumonia, which is characterized by activations in the regions of lobar consolidation, and viral pneumonia, which typically shows diffuse interstitial patterns across the lungs. Table 2 shows the results for each neural network model. After analyzing the predictions of all models, we combined the results of the 5 models: AlexNet, DenseNet121, Inception V3, GoogLeNet and ResNet18, and predicted the class most frequently predicted across models. This combination gave by far the best performance, achieving a test accuracy of 96.39%, an area under the ROC curve of 99.34% and a sensitivity of 99.62%.
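The majority-voting step reduces to picking the most frequent label among the five per-model predictions; a minimal sketch (the labels shown are illustrative):

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one predicted label per model, e.g. five labels from
    # AlexNet, DenseNet121, Inception V3, GoogLeNet and ResNet18.
    return Counter(predictions).most_common(1)[0][0]

vote = majority_vote(["PNEUMONIA", "PNEUMONIA", "NORMAL", "PNEUMONIA", "NORMAL"])
```

With five voters and two classes, a tie is impossible, so the vote is always well defined.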

Comparison
We compare our results with those of other authors using the same dataset; the comparison is given in Table 3. For pneumonia versus normal, Kermany et al. [62] used TensorFlow and adapted an Inception V3 architecture pretrained on the ImageNet dataset; retraining consisted of initializing the convolutional layers with the loaded pretrained weights. They obtained a sensitivity of 93.2% and a specificity of 90.1%, with an accuracy of 92.8%. Cohen et al. [64] used an implementation of the CheXNet model [23], which is based on DenseNet-121 [50]. It was trained using Adam optimization with default parameter values (b1 = 0.9 and b2 = 0.999), a learning rate of 0.001 and a learning rate decay of 0.1, and achieved an AUC of 98.4% (unfortunately, no other metric values were provided in their paper for this dataset). Although the other authors did not provide full evaluation data, our model outperformed the other solutions in recall (sensitivity), accuracy and Area Under the Curve (AUC).
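The sensitivity, specificity and accuracy values compared here follow the standard confusion-matrix definitions; a minimal helper (the counts below are illustrative, not taken from our experiments):

```python
def diagnostic_metrics(tp, fp, tn, fn):
    sensitivity = tp / (tp + fn)                 # recall on the Pneumonia class
    specificity = tn / (tn + fp)                 # recall on the Normal class
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy

sens, spec, acc = diagnostic_metrics(tp=90, fp=5, tn=95, fn=10)
```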

Discussion
Correct diagnostics requires a deeper understanding of the radiological features visible in chest X-rays. Unfortunately, deep neural networks are known for providing no explanation of how the final decision is made. To make the decision support process more useful, the deep network should also provide explanations beyond plain decisions [65]. Such explanations can take the form of semantic segmentation, with explanations in natural language assigned to each identified segment of a chest X-ray image. Correct implementation of such tasks may require the use of additional eHealth data from the patients, as well as high-quality annotated datasets.
Another limitation is introduced by the scarcity of image data representing all types of pneumonia pathologies, which prevents achieving a higher accuracy or using a deeper network with more parameters. Successful deep learning models such as AlexNet, GoogLeNet and ResNet have been trained on more than a million images, which are hardly available in the medical domain. Training deep neural networks with limited data may also lead to over-fitting and prevent good generalization.
The accuracy results reported in this paper could still be improved by adding more sophisticated deep networks to the ensemble and training the networks with a larger dataset.
Future research directions will include the exploration of image data augmentation techniques [66] to further improve accuracy while avoiding overfitting.

Conclusions
In this article, our goal was to propose a deep learning-based approach for classifying pneumonia from chest X-ray images using transfer learning. In this framework, we adopted the transfer learning approach and used the pretrained architectures AlexNet, DenseNet121, Inception V3, GoogLeNet and ResNet18, trained on the ImageNet dataset, to extract features. These features were passed to the classifiers of the respective models, and the outputs were collected from the individual architectures. Finally, we employed an ensemble model that used all five pretrained models and outperformed every individual model. We observed that performance could be improved further in the future by increasing the dataset size, using a data augmentation approach, and using hand-crafted features.
Our findings support the notion that deep learning methods can be used to simplify the diagnostic process and improve disease management.While pneumonia diagnoses are commonly confirmed by a single doctor, allowing for the possibility of error, deep learning methods can be regarded as a two-way confirmation system.In this case, the decision support system provides a diagnosis based on chest X-ray images, which can then be confirmed by the attending physician, drastically minimizing both human and computer error.Our results suggest that deep learning methods can be used to improve diagnosis relative to traditional methods, which may improve the quality of treatment.When compared with the previous state-of-the-art methods, our approach can effectively detect the inflammatory region in chest X-ray images of children.

Figure 1. Outline of the methodology.

Figure 3. Use of a general architecture in transfer learning. The last layers are generally retrained on the new domain, while knowledge from the previous domain is used in the form of pretrained weights.
The Inception model allows for increasing the depth and width of the deep learning network while keeping the computational cost constant. This model was trained on the original ImageNet dataset with over 1 million training images. It works as a multi-level feature generator by computing 1 × 1, 3 × 3 and 5 × 5 convolutions.

Figure 9. Ensemble model using five different pretrained architectures and majority voting.

Figure 11. Accuracy against epoch (a) and cross-entropy loss against epoch (b) for all trained models. DenseNet121 and InceptionV3 were trained for 100 epochs, and GoogLeNet was trained for 50 epochs; this was done to prevent overfitting and obtain better generalization. Plots are drawn for the training set.


Figure 12. Sensitivity and specificity for each model and the ensemble model on the testing set.

Figure 13. Plots of activation maps for each model in early convolutional layers. Activation maps are calculated for the positive class and are shown for the last layer of each convolutional block before pooling.

Table 1. Dataset partition for training and testing.
Category | Training Set (No. of Images) | Test Set (No. of Images)


Table 2. Comparative results for each model on the test set.

Table 3. Comparison of our results with other works using the same dataset.