Classification and Predictions of Lung Diseases from Chest X-rays Using MobileNet V2

Featured Application: The method presented in this paper can be applied in medical computer systems to support medical diagnosis.


Introduction
Different types of lung disease affect many people around the world. The lungs are vulnerable to disease brought on by physical problems and air pollution, and as a result lung function is impaired. Some lung diseases, such as emphysema, asthma, pleural effusion, tuberculosis, and others including aspiration fibrosis, pneumonia, and lung malignancies, lead to a loss of versatility in the lungs, which causes a decrease in the total volume of air [1]. Lung disease spreads easily, so it is important to identify the problem and provide the patient with the appropriate treatment. To detect and diagnose lung diseases, radiologists mainly rely on chest X-rays. Radiologists can use chest X-rays to diagnose and detect many diseases and other conditions, including bronchitis, infiltrations, atelectasis, pericarditis, fractures, and many others [2]. Chest radiography is considered the most-used type of disease examination in the world, with more than 2 million procedures performed each year [3].
This technology is essential and critical for the examination, diagnosis, and treatment of chest diseases, which are among the main causes of death worldwide [4]. Therefore, to improve workflow prioritization and clinical decision support in large-scale screening and global population health programs, computer systems must be able to interpret chest radiographs as effectively as radiologists. In practice, this could play an important role in many clinical environments. Much work has been done using AI technology to detect and diagnose lung diseases [5,6]. Probabilistic neural networks, multilayer perceptrons, learning vector quantization [7], and recurrent neural networks (RNNs) have been used to diagnose lung disease [8]. For the detection of lung diseases such as tuberculosis, Sahlol et al. [9] proposed a new hybrid method for tuberculosis X-ray image classification that extracts features from X-ray images with transfer learning and then filters the resulting huge number of features using the recently proposed Artificial Ecosystem-based Optimization (AEO); this method surpassed 90% accuracy on two datasets. Neural networks have also been used to diagnose Chronic Obstructive Pulmonary Disease (COPD) and pneumonia [10]. He-xuan Hu et al. [11] exceeded the state of the art in lung tumor segmentation by proposing a parallel deep learning model combining a hybrid attention mechanism with DenseNet, which achieved 94% accuracy.
The advances in deep learning and large datasets allow algorithms to match the performance of medical professionals in various medical imaging tasks, including the detection of diabetic retinopathy [12], the classification of skin cancer [13], and the detection of metastases in lymph nodes [14]. The automatic diagnosis of chest imaging has attracted increasing attention [15,16], and special algorithms have been developed for the detection of pulmonary nodules [17]; however, to detect other diseases, such as pneumonia and pneumothorax, a method that classifies multiple pathologies at the same time is needed. Dey et al. [18] developed a new method to detect pneumonia by computing handcrafted features from the chest X-ray with a modified VGGNet19 (Visual Geometry Group Network) [19]. This method achieved a 97.94% classification accuracy (see Table 1). Only recently have computing power and the availability of large datasets allowed this approach to flourish. The National Institutes of Health published ChestX-ray14 [20], which has led to many studies using deep learning for diagnosis from chest X-rays [21][22][23][24]. Convolutional neural networks, popularized by the ImageNet challenge [25], are the most widely used methods. We want to deliver high-accuracy results while keeping the number of parameters and mathematical operations as low as possible, in order to bring deep neural networks to mobile devices. The MobileNet V2 architecture [26] was designed for small, low-latency applications such as computer vision and IoT in many fields, for example fruit image classification for higher fruit production rates [27] and an artificial-intelligence diagnostic system for cholelithiasis [28].
In this work, we propose a new method combining transfer learning with feature selection to improve on state-of-the-art medical classification methods. After this section, we list some of the previous state-of-the-art lung disease classification methods in the related work Section 2. The dataset and disease distribution are discussed in Section 3. We present our model architecture and evaluation procedure in Section 4. Experimental results and comparisons with other works are presented in Section 5, and the conclusions are presented in Section 6.

Related Work
As the spread of pulmonary disease increases, new, automatic, and reliable methods of accurate detection are essential to reduce the exposure of medical experts to an outbreak.
A surge of interest was triggered by the work of Wang et al. [20], who released ChestX-ray14, consisting of 112,120 frontal chest X-ray images of 30,805 unique patients. It has attracted huge attention in the deep learning community, which uses convolutional neural networks (CNNs) in computer vision, and several research groups have started to focus on the application of CNNs to chest X-ray classification. Yao et al. [21] offered a combination of a CNN and an RNN (recurrent neural network) to exploit the dependencies between labels. As a CNN backbone, they used a DenseNet [22] model that was fully adapted and trained on X-ray data. Li et al. [23] provided a framework for classifying and detecting pathologies using CNNs. More recently, Rajpurkar et al. [24] proposed refined transfer learning using a DenseNet-121 [26], which further increased the AUC results on ChestX-ray14 for multi-label classification. In [2], various lung abnormalities, such as pulmonary nodules and diffuse lung disease, are detected and classified using a CNN for feature extraction and an R-CNN for the detection of lung disease. Bar et al. [29] examined the power of deep learning methods on chest radiography data for the detection of pathologies. The features were extracted using a convolutional neural network trained on the ImageNet [25] competition data, mainly the DeCAF and PiCoDes (image codes) descriptors.

Dataset
The ChestX-ray14 database [20] was used to develop the deep learning algorithms. This database is currently one of the largest public X-ray databases, containing 112,120 posteroanterior and anteroposterior thoracic films from 30,805 unique patients. Each image was annotated with as many as 14 different thoracic pathologies, which were selected based on the frequency of observation and diagnosis in clinical practice. The label of each image was obtained with an automatic extraction method from the radiological report, and each image received 14 binary values (Atelectasis, Cardiomegaly, Consolidation, Edema, Effusion, Emphysema, Fibrosis, Hernia, Infiltration, Mass, Nodule, Pleural Thickening, Pneumonia, Pneumothorax). Some of them are provided in Figure 1, where 0 means that the pathology does not exist and 1 means that it is present; each image may carry several pathologies. We show the distribution of each class in Figure 2. The prevalence of individual pathologies is generally low and varies between 0.2% and 17.74%, as shown in Table 2a, while the distributions of patient gender and view position are fairly balanced, with ratios of 1.3 and 1.5, respectively (see Table 2b). Most of the image samples were biased: "Infiltration" was the most frequent, with 9546 image samples, while the least frequent was "Hernia", with just 227 samples. "Emphysema", "Fibrosis", "Edema", and "Pneumonia" were close to each other but lower, with between 322 and 727 samples each. For the other diseases, as shown in Figure 2a, the class frequency was over 1093 and under 4214. This class imbalance issue has a huge impact on the results. Next, we create the data generator to perform data augmentation and split the dataset into training, validation, and testing sets. Figure 2b,c illustrates the distribution of the least frequent disease, "Hernia", and the most frequent, "Infiltration", with respect to the meta-information (age, gender, negative case, or positive case).
For the split, we used the built-in function "train_test_split" from the "sklearn" library, obtaining 38,819 image samples for training and 12,940 each for validation and testing.
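The split can be sketched with two chained calls to scikit-learn's `train_test_split`; the helper name and the exact fractions below are illustrative, not the paper's code:

```python
from sklearn.model_selection import train_test_split

def split_dataset(samples, val_frac=0.2, test_frac=0.2, seed=42):
    """Split samples into train/validation/test subsets."""
    # First carve off the combined validation + test portion ...
    train, rest = train_test_split(
        samples, test_size=val_frac + test_frac, random_state=seed)
    # ... then divide that portion between validation and test.
    val, test = train_test_split(
        rest, test_size=test_frac / (val_frac + test_frac), random_state=seed)
    return train, val, test
```

With a 60/20/20 split this reproduces the proportions reported above (38,819 training images against 12,940 each for validation and testing).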
The dataset was imbalanced and biased, as the difference between "Hernia" (227 samples) and "Infiltration" (more than 17,000 samples) shows. We therefore focused on the positive and negative frequencies. As we can see in Table 2b, the false-class samples greatly outnumbered the true-class samples (e.g., Edema: 2303 true, 109,817 false, prevalence 2.05%). To mitigate the class imbalance problem, we included meta-information: using patient age as the reference, we balanced the positive and negative samples for each disease by multiplying the weight of the positive cases by the negative sample frequency, and did the same for the negative weight. The result is illustrated in Figures 3 and 4 for both the least and most frequent diseases.
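A minimal sketch of this weighting scheme, assuming the labels are available as a binary matrix (function and variable names are ours):

```python
import numpy as np

def balanced_class_weights(labels):
    """Per-disease positive/negative weights from a binary label matrix.

    labels: array of shape (n_samples, n_diseases) with 0/1 entries.
    """
    pos_freq = labels.mean(axis=0)   # fraction of positive samples per disease
    neg_freq = 1.0 - pos_freq        # fraction of negative samples per disease
    pos_weights = neg_freq           # positive cases weighted by negative frequency
    neg_weights = pos_freq           # negative cases weighted by positive frequency
    return pos_weights, neg_weights
```

Weighting each positive sample by the negative frequency (and vice versa) makes the total weighted contribution of the two classes equal for every disease, since pos_weights * pos_freq == neg_weights * neg_freq.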
(a) "Infiltration" after applying the true/false frequency balance (dark blue is true samples; light blue is false).
(b) "Hernia" after applying the true/false frequency balance (dark blue is true samples; light blue is false).

Methodology
In this section, we propose the model shown in Figures 5 and 6 to achieve the classification and prediction of 14 different lung diseases in chest X-rays using the Keras framework (an open-source software library that provides a Python interface for artificial neural networks) [30]. In this work, we identified and classified 112,120 chest X-ray images from the NIH dataset. The MobileNet V2 [26] model, with additional CNN layers, is employed to predict and classify the thoracic diseases in the chest X-ray images. MobileNet V2 is an architecture suited to basic mobile and on-board vision applications. Nowadays, deep learning methods are used not only for computer vision but also in various application fields such as robotics, the Internet of Things (IoT), Natural Language Processing (NLP), and medical image processing.
The architecture of the customized MobileNet V2 has a group of hidden layers based on a bottleneck residual block with depth-wise separable convolutions, which considerably reduces the number of parameters and leads to a lightweight neural network, unlike normal convolution. The normal convolution is replaced by a depth-wise convolution with a single filter per channel, followed by a pointwise convolution; together these are termed a depth-wise separable convolution [31]. Figure 6. The convolutional neural network architecture: constructed from an ImageNet [25] pretrained MobileNet V2 with an additional global average pooling layer and one dense layer.

MobileNet V2 Architecture
The bottleneck residual block was mainly composed of three convolutional layers. As listed in Figure 6, the last two are the ones that already existed in the first generation of MobileNet: a depth-wise convolution that filters the inputs, followed by a 1 × 1 pointwise convolution layer. However, this 1 × 1 layer now has a different job.
MobileNet V1 [32] is one of the most popular mobile-centric deep learning architectures; it is not only small in size but also computationally efficient while achieving a high performance. The main idea of MobileNet is that instead of using regular 3 × 3 convolution filters, the operation is split into depth-wise separable 3 × 3 convolution filters followed by 1 × 1 convolutions. While achieving the same filtering and combination process as a regular convolution, the new architecture requires fewer operations and parameters. In MobileNet V1, the pointwise convolution either doubled the number of channels or kept them the same. In MobileNet V2, the pointwise convolution does the opposite: it makes the number of channels smaller. This layer is therefore known as the projection layer, because it projects data with a high number of dimensions (channels) into a tensor with a much lower number of dimensions, as illustrated in Table 3.
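The parameter savings can be checked with simple arithmetic (a sketch; bias terms are ignored):

```python
def conv_params(k, c_in, c_out):
    """Weights in a regular k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution that mixes the channels."""
    return k * k * c_in + c_in * c_out
```

For a 3 × 3 convolution mapping 32 channels to 64, the regular version needs 18,432 weights while the depth-wise separable version needs only 2,336, roughly an 8× reduction.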
The first new feature that came with MobileNet V2 is the expansion layer. The expansion layer is a 1 × 1 convolution. Its role is to expand the number of channels in the image data before going into the depth-wise convolution. Hence, this expansion layer always has more output channels than input channels, as it does the opposite of the projection layer.
The second new feature in MobileNet V2's building block is the residual connection presented in Figure 7. This works as in ResNet [31] and helps the flow of gradients through the network. The feature channels are increased by an expansion factor t. In our experiments, we used MobileNet V2 with 0.5× and 1× channel multipliers and an input size of 224 × 224 for testing. MobileNet V2 focuses on optimizing latency, but at the same time it enables small networks to operate efficiently and supports any input size; it derives a better performance from the use of ReLU6 as the activation function in each layer, along with batch normalization. In the experimental results on the ImageNet [25] classification task, MobileNet V2 outperforms MobileNet V1 and ShuffleNet [33] (width multiplier 1.5) with a comparable model size and computational cost. With a width multiplier of 1.4, MobileNet V2 also outperforms ShuffleNet [33] (width multiplier 2) and NASNet [34], with a faster inference time. In the MS COCO object detection task [35], MobileNetV2 + SSDLite [36] is 20× more efficient and 10× smaller while still outperforming YOLOv2 (You Only Look Once version 2, a state-of-the-art real-time object detection system) [37] on the COCO dataset. In PASCAL VOC 2012 [38] semantic segmentation, MobileNet V2 outperforms ResNet-101 [31], obtaining an mIoU (mean intersection over union) of 75.32% with a lower model size and computational cost (2.11M parameters, a fraction of the ResNet-101 [31] parameter count). The full MobileNet V2 architecture, as shown in Figure 6, consists of 17 of these building blocks in a row, followed by a regular 1 × 1 convolution, a global average pooling layer, and a classification layer (the very first block is slightly different: it uses a regular 3 × 3 convolution with 32 channels instead of the expansion layer).
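The expansion/depthwise/projection block with its residual connection can be sketched in Keras roughly as follows; this is an illustrative sketch, not the exact implementation used in the paper, and the function name and default expansion factor of 6 are our assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual_block(x, expansion=6, filters=None, stride=1):
    """Expansion (1x1) -> depthwise (3x3) -> projection (1x1) block."""
    c_in = x.shape[-1]
    filters = filters or c_in
    # Expansion layer: widen the channel dimension by factor t = expansion.
    y = layers.Conv2D(expansion * c_in, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(6.0)(y)                       # ReLU6 activation
    # Depthwise 3x3 convolution filters each channel independently.
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(6.0)(y)
    # Projection layer: linear 1x1 conv back down to a narrow tensor.
    y = layers.Conv2D(filters, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Residual connection only when input and output shapes match.
    if stride == 1 and c_in == filters:
        y = layers.Add()([x, y])
    return y
```

Note that the projection output has no ReLU: keeping it linear avoids destroying information in the low-dimensional bottleneck.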

Preparation for Training the Data
In this work, the MobileNet deep neural network was used with a supervised learning algorithm. After balancing the disease samples, we used a data generator function to divide the entire dataset into three groups: training (60%), validation (20%), and testing (20%).
We also used one-hot encoding to generate a column for each disease; the column is set to "1" if the image in question has the disease and "0" otherwise.
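A sketch of this encoding step, assuming the NIH metadata stores findings in a "|"-separated "Finding Labels" column (the column name and disease spellings below follow our reading of the dataset's CSV and should be checked against the actual file):

```python
import pandas as pd

DISEASES = ["Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
            "Effusion", "Emphysema", "Fibrosis", "Hernia", "Infiltration",
            "Mass", "Nodule", "Pleural_Thickening", "Pneumonia",
            "Pneumothorax"]

def add_disease_columns(df, label_col="Finding Labels"):
    """Add one 0/1 column per disease from a '|'-separated label string."""
    for disease in DISEASES:
        df[disease] = df[label_col].map(
            lambda labels: 1 if disease in labels.split("|") else 0)
    return df
```

An image labeled "Cardiomegaly|Effusion" then gets a 1 in those two columns and 0 elsewhere, matching the 14 binary values described in Section 3.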
In the processing steps, the datasets are loaded and the images randomly transformed using the TensorFlow and Keras packages. The image batch size is 32 for all three stages.

The proposed model was trained in TensorFlow and Keras:
• First, we feed the inputs (size 224 × 224) to the MobileNet V2 base; a global average pooling (GAP) layer is then applied to the spatial data to reduce its dimension, and finally a dense, fully connected (FC) layer connects all neurons to the output nodes. As the activation function for the fully connected layer, we used the sigmoid function because of its robustness in classification models.
For the training, we applied several stages, fine-tuning the number of epochs and the batch size:
• At first, we trained the model with the steps per epoch set to the floor division (floor division is a normal division operation except that it returns the largest possible integer) of the total number of training samples by the batch size. We used early stopping to avoid the overfitting caused by the large number of steps per epoch.
• In the second phase, we increased the number of epochs to 15 and halved the steps per epoch. To improve the accuracy, we reduced the learning rate.
• Lastly, we added 10 epochs and returned the steps per epoch to the initial value; to minimize overfitting, we both reduced the learning rate and applied early stopping, configuring the patience of the early stopping to force it to finish the 10 epochs.
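The model head and the early-stopping/learning-rate schedule can be sketched with Keras as follows. This is a sketch under our assumptions: `weights=None` avoids the pretrained-weight download here, whereas the paper starts from ImageNet-pretrained weights, and the monitored quantity and patience values are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=14, input_shape=(224, 224, 3)):
    """MobileNet V2 base + global average pooling + sigmoid dense head."""
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=None)
    return models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),  # collapse H x W to 1 value/channel
        layers.Dense(num_classes, activation="sigmoid"),  # one prob./disease
    ])

# Callbacks mirroring the staged schedule: stop when validation loss stalls,
# and reduce the learning rate to squeeze out additional accuracy.
callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.1, patience=3),
]
```

The sigmoid head emits 14 independent probabilities rather than a softmax distribution, which is what the multi-label setting requires: an image may carry several pathologies at once.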
We evaluated the proposed model under different hyper-parameters, namely filter size, number of kernels over all input channels, number of weights, and pooling. The activation function translates the input range, and the loss function is calculated with the help of forward propagation on the training images. Batch normalization (BN) and ReLU were applied after each convolution layer to prevent overfitting, early stopping across epochs was used, and data augmentation techniques were implemented (ImageDataGenerator). The GAP layer acts as a deeper form of dimensionality reduction on a tensor of size H_im × W_im × D_im, where H_im is the height, W_im the width, and D_im the depth. By averaging all H × W values, the global average pooling (GAP) layer reduces each H × W feature map to a single number, yielding a 1 × 1 × D output. The features extracted by the convolution layers and down-sampled by the GAP layer are passed to the dense layer. In a fully connected (dense) layer, every input node is connected to every output node.
The Adam optimizer algorithm adapts an individual learning rate for every parameter; the loss was calculated with binary cross-entropy and the MAE (mean absolute error). We measured the performance of the classification, with prediction probability values between 0 and 1.
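The two loss measures can be computed by hand for a vector of predicted probabilities (a minimal numpy sketch):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over all labels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return float(np.mean(-(y_true * np.log(y_pred)
                           + (1 - y_true) * np.log(1 - y_pred))))

def mean_absolute_error(y_true, y_pred):
    """Mean absolute difference between labels and predicted probabilities."""
    return float(np.mean(np.abs(y_true - y_pred)))
```

Both measures approach zero as the sigmoid outputs approach the 0/1 ground-truth labels, which is why they can serve as the training loss and a secondary error metric, respectively.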

Evaluation
We used accuracy (Acc), sensitivity (Sens), specificity (Spec), AUC, F1-score, and time consumption as fitness measures. These are defined as follows: Acc = (TP + TN)/(TP + TN + FP + FN), Sens = TP/(TP + FN), Spec = TN/(TN + FP), and F1 = 2TP/(2TP + FP + FN), where TP (true positives) is the number of disease samples that were labeled correctly, TN (true negatives) is the number of samples with none of the 14 diseases that were labeled correctly, FP (false positives) is the number of samples incorrectly labeled as diseased although they contain no disease, and FN (false negatives) is the number of disease samples that were misclassified as disease-free.
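These fitness measures translate directly into code from the confusion counts (AUC, which needs the full score distribution rather than counts, is omitted):

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and F1 from confusion counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)            # true positive rate (recall)
    spec = tn / (tn + fp)            # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, sens, spec, f1
```

In the multi-label setting these counts are accumulated per disease class, giving the per-class values reported in Table 4.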
The proposed approach was implemented in Python 3.7 on Windows 10 using an Intel Xeon CPU and 16 GB of RAM, as well as Google Colaboratory and the "Kaggle" cloud. The model was developed using the Keras library [30] with a TensorFlow backend [39].

Experimental Results
The lack of computing power played a big role in limiting the experiment. In this work, we used the free "Kaggle" cloud to edit and run our model. Although this cloud offers a free GPU and fifteen gigabytes of RAM, it was not enough to train our model for over ten epochs. However, we managed to get a good result (see Figures 8 and 9). We started by examining the accuracy of the model: in the first epoch we had a training accuracy of 0.951 and a validation accuracy of 0.913, a difference of 3.74% between training and validation. By the fourth epoch, the gap between training and validation accuracy became much smaller, with a training accuracy of 0.953 and a validation accuracy of 0.941; the difference decreased to 1.18%. In the last epoch, we observed some signs of overfitting, with the difference between training and validation accuracy rising again to more than 2%. The same holds for the other metric, the loss function: in the first epoch, the difference between the training loss and validation loss was about 4%; in the middle epochs, the gap widened past 6%; and in the last epoch, it settled back at around 4%.
The specificity showed promising results, averaging 97%. The sensitivity, on the other hand, did not provide a good result, averaging 45%. The F1-score reached 55% (68% at best and 40% at worst). The details for the 14 disease classes are provided in Table 4. We used the ROC curve to extract more detail about the results of our model. The ROC curve illustrated in Figure 10 plots the true positive rate against the false positive rate of the predictions.
For "Emphysema", the AUC is 0.891, while "Effusion" and "Edema" reached 0.876 and 0.884; "Pneumothorax" and "Cardiomegaly" also had good AUCs of 0.880 and 0.885, and "Mass", "Hernia", and "Atelectasis" followed with 0.826, 0.811, and 0.794. For the other diseases, "Consolidation", "Fibrosis", and "Pleural Thickening", the AUC was similar (0.790, 0.762, 0.763). The last three diseases got the lowest AUCs: "Nodule" with 0.743, "Pneumonia" with 0.733, and "Infiltration" with 0.701. As we can see for the image with index zero in Figure 9, the ground-truth label was "Effusion"; given that the model had an AUC of 0.89 for this class, it is not surprising that it reached an 89.22% prediction. For the second image, we can see that the model begins to struggle: the true disease in this picture was "Mass", yet the model predicted two diseases with rates higher than 97%.

Comparison to Other Approaches
In our assessment, a significant spread of results relative to AUC values was observed. In addition to the data split used, this might stem from the random initialization of the models and the stochastic nature of the optimizer.
When ChestX-ray14 was made publicly available, only the images were released, with no official data split. As a result, researchers began training and testing their proposed methods on their own splits of the database. With splits different from our re-sampling, we found a wide diversity in performance. Therefore, as shown in Table 5, direct comparisons with other results may be misleading. For example, Rajpurkar et al. [24] reported excellent results for the 14 categories on their own split. We compare the best performance of our model architecture (i.e., MobileNet V2) with experimental sampling against Li et al. [23] and Baltruschat et al. [40]. Baltruschat et al. compared three state-of-the-art models in their paper; the ResNet-38-large-meta had the best performance, and we include only this model in the comparison. For our model, we record the minimum and maximum AUC.
We present the outcomes for our best-performing architecture in this table and compare them per category with the other works. In addition, the average AUC over all pathologies is provided in the last row, and we emphasize the overall highest AUC value in bold.
We report results for our best-performing architecture on the official split released by Wang et al. [20] in order to provide a fair comparison across all classes. First, we compared our results to Wang et al. [20] and Yao et al. [21]. The MobileNet V2 already has a higher average AUC, at 0.810, and a higher individual AUC in 6 out of 14 classes, with an improvement of more than 30 percent on more than three disease classes over the ResNet-38-large-meta of Baltruschat et al. [40] and the model of Yao et al. [21]. Li et al. [23] also reported state-of-the-art results for the official split in all 14 classes, with an average AUC of 0.812. While our MobileNet V2 is trained with fewer images, it still achieved state-of-the-art results for "Effusion", "Edema", "Infiltration", "Pneumonia", and "Pneumothorax", with a slightly lower average AUC of 0.810.

Conclusions
A convolutional neural network (CNN) was developed for the diagnosis of thoracic (pulmonary) diseases in X-ray images. The performance achieved is mainly due to the deep structure of the CNN, which mines features at different levels and results in a better generalization ability (Figure 11 shows some prediction results on the test set). CNN models such as the MobileNet V2 network have superior generalization capabilities and precision compared to other networks, and the results obtained demonstrate the high recognition rates of the proposed CNN, showing that it gives very precise results in the diagnosis of disease in chest X-rays and its subsequent classification. Figure 11. Result of the prediction attempt on the testing set: two further samples with different index labels. Each one may have one or more diseases.
Our optimized MobileNet V2 architecture achieves state-of-the-art results in six out of fourteen classes compared to Baltruschat et al. [40] (who had state-of-the-art results in five out of fourteen classes on the official split). For other approaches, even higher scores are reported in the literature. However, comparing the different CNN methods with respect to their performance is inherently difficult, as most evaluations have been performed on individual, random partitions of the datasets. The model also showed some weaknesses in the F1-score, which opens the way to modifying the model or implementing cross-validation.
In this work, the proposed model opens the way for light neural network architectures to be implemented in the medical field, which is very demanding and depends on high-quality resources; the model performs well under these conditions (the AUC result was good, averaging 0.811, with 95% accuracy) and can be developed further in future experiments. However, the sensitivity of the model is rather low, which could raise the rate of false-negative classifications in the predictions. The low sensitivity rate is the result of the biased dataset (the difference in the class distribution).
While the results obtained suggest that training deep neural networks in the medical field is a viable choice as more and more public databases become available, the practical application of deep learning in clinical practice remains an open topic. Especially for the ChestX-ray14 database, the rather high label noise of about 10% makes it difficult to estimate the real network performance. Therefore, a test set with clean, noise-free labels is necessary for a clinical efficiency evaluation.
Future work may include the investigation of other model architectures, new architectures for leveraging label dependencies and the incorporation of segmentation information.