Evaluation of Scalability and Degree of Fine-Tuning of Deep Convolutional Neural Networks for COVID-19 Screening on Chest X-ray Images Using Explainable Deep-Learning Algorithm

According to recent studies, patients with COVID-19 have different feature characteristics on chest X-ray (CXR) than those with other lung diseases. This study aimed to evaluate the effects of layer depth and degree of fine-tuning in transfer learning with a deep convolutional neural network (CNN) for COVID-19 screening on CXR images, in order to identify efficient transfer learning strategies. The CXR images used in this study were collected from publicly available repositories, and the collected images were classified into three classes: COVID-19, pneumonia, and normal. To evaluate the effect of layer depth within the same CNN architecture, VGG-16 and VGG-19 were used as backbone networks. Then, each backbone network was trained with different degrees of fine-tuning and comparatively evaluated. The experimental results showed that the highest AUC value for COVID-19 classification, 0.950, was obtained in the experimental group in which only 2 of the 5 blocks of the VGG-16 backbone network were fine-tuned. In conclusion, in the classification of medical images with a limited amount of data, a deeper network may not guarantee better results. In addition, even if the same pre-trained CNN architecture is used, an appropriate degree of fine-tuning can help to build an efficient deep learning model.


Introduction
Coronavirus disease (COVID-19) has quickly become a global pandemic since it was first reported in December 2019, reaching approximately 21.3 million confirmed cases and 761,799 deaths as of 16 August 2020 [1]. Due to the highly infectious nature of the virus and the unavailability of appropriate treatments and vaccines, early screening of COVID-19 is crucial to prevent the spread of the disease through the timely isolation of susceptive individuals and the proper allocation of limited medical resources.
Currently, reverse transcription polymerase chain reaction (RT-PCR) is used as the gold standard screening method for COVID-19 [2]. However, since the overall positive rate of RT-PCR, using nasal and throat swabs, is reported to be 60-70% [3], there is a risk that a false-negative patient may [...]

Figure 1. The experiment consists of a total of 12 experimental subgroups. It is divided into two main groups according to the layer depth, and each convolutional neural network (CNN) group is divided into 6 subgroups according to the degree of fine-tuning.

Datasets
The datasets used for classification are described in Table 1. Several publicly available image data repositories were used to collect COVID-19 chest X-ray images. Normal and pneumonia samples were extracted from the open source NIH chest X-ray dataset used for the Radiological Society of North America (RSNA) pneumonia detection challenge [20]. The total dataset was curated into three classes: normal, pneumonia, and COVID-19. Since class balance is a very important factor in classification analysis, this study collected as many COVID-19 images as could be obtained and randomly extracted the same number of images for each of the other classes. The entire dataset combined 607 COVID-19 images publicly shared at the time of the study with 607 normal and 607 pneumonia chest radiographs randomly extracted from the RSNA Pneumonia Detection Challenge dataset, giving 1821 images in total. In the case of the COVID-19 dataset, four public datasets were used, and only one copy of an image was kept when the same image appeared in more than one source. In the public datasets used in the experiment, patient information was de-identified or not provided.
The entire collected dataset was randomly divided into training and test sets at a ratio of 80:20 for each class, and the training data were further randomly divided into training and validation sets at a ratio of 80:20 for use in the 5-fold cross validation.
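The split described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function names, the fixed seed, and the choice to round the test share down per class are assumptions, the last one chosen because it reproduces the 1458 training images reported in the Training section. The folds here are dealt out at random without per-class stratification, which is also an assumption.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train/test, handling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: the test share is rounded down (607 -> 121 test images).
        n_test = int(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


def k_folds(indices, k=5, seed=0):
    """Shuffle the indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


# 607 images per class, as stated in the Datasets section.
labels = ["covid19"] * 607 + ["pneumonia"] * 607 + ["normal"] * 607
train_idx, test_idx = stratified_split(labels)
folds = k_folds(train_idx, k=5)

# In cross-validation round r, folds[r] is the validation set (about 20% of
# the training data) and the remaining four folds form the training set.
print(len(train_idx), len(test_idx), [len(f) for f in folds])
```

Rotating the validation fold over the five rounds gives each round the 80:20 training-to-validation ratio mentioned in the text.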

Image Preprocessing
Because the image data used in this experiment were collected from multiple centers, most of the images have different contrast and dimensions. Therefore, all images used in this study required contrast correction through the histogram equalization technique and resizing to a uniform size before the experiment. In this study, preprocessing was performed using the contrast limited adaptive histogram equalization (CLAHE) technique [25], which has been adopted in previous studies related to lung segmentation and pneumonia classification [26][27][28]. Figure 2 shows sample images with CXR contrast corrected using the CLAHE technique. For the consistency of image analysis, each image was resized to a uniform size of 800 × 800.

Convolutional Neural Networks
This study employed two different deep CNNs as backbone networks: VGG-16 and VGG-19. VGG [29] is a pre-trained CNN from the Visual Geometry Group, Department of Engineering Science, University of Oxford. The numbers 16 and 19 represent the number of layers with trainable weights in each VGG network. The VGG architecture has been widely adopted and recognized as state of the art in both general and medical image classification tasks [30]. Since VGG-16 and VGG-19 have the same neural network architecture but different layer depths, the effect of layer depth on performance can be evaluated comparatively under the same architectural conditions.

Figure 2. Sample images after applying contrast correction by contrast limited adaptive histogram equalization (CLAHE) and the semantic segmentation of lung on original chest X-ray (CXR) images.

Fine-Tuning
When the training dataset is relatively small, transferring a network pre-trained on a large annotated dataset and fine-tuning it for a specific task can be an efficient way to achieve acceptable accuracy with less training time [31]. Although the classification of diseases from CXR images differs from the classification of objects in natural images, the two tasks can share similar learned features [32]. During fine-tuning, the model weights were initialized from pre-training on a general image dataset and kept frozen, except that some of the last blocks were unfrozen so that their weights were updated at each training step. VGG-16 and VGG-19, the backbone networks used in this study, both consist of 5 convolutional blocks regardless of the layer depth. Therefore, fine-tuning was performed at 6 levels, in which 0 to 5 blocks were unfrozen sequentially, starting from the last block. As a result, each of the two backbone networks was divided into 6 subgroups according to the degree of fine-tuning. Figure 3 shows the schematic diagrams of the layer composition and the degree of fine-tuning of VGG-16 and VGG-19.
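The block-wise unfreezing scheme can be made concrete with a small, framework-free sketch. It assumes the layer naming used by the Keras reference VGG models (`block1_conv1`, ..., `block5_pool`); the helper names `layer_names` and `trainable_flags` are hypothetical and introduced only for illustration. The fully connected classifier on top, which is trained in every subgroup, is left out.

```python
# Number of convolutional layers in each of the 5 blocks.
VGG16_CONVS_PER_BLOCK = [2, 2, 3, 3, 3]  # 13 conv + 3 fully connected = 16
VGG19_CONVS_PER_BLOCK = [2, 2, 4, 4, 4]  # 16 conv + 3 fully connected = 19


def layer_names(convs_per_block):
    """List the convolutional-base layer names in order (assumed convention)."""
    names = []
    for block, n_convs in enumerate(convs_per_block, start=1):
        names += [f"block{block}_conv{i}" for i in range(1, n_convs + 1)]
        names.append(f"block{block}_pool")
    return names


def trainable_flags(names, n_unfrozen_blocks):
    """Map each layer name to True if it lies in one of the last N blocks."""
    if not 0 <= n_unfrozen_blocks <= 5:
        raise ValueError("n_unfrozen_blocks must be between 0 and 5")
    first_unfrozen = 5 - n_unfrozen_blocks + 1
    flags = {}
    for name in names:
        block = int(name.split("_")[0][len("block"):])
        flags[name] = block >= first_unfrozen
    return flags


# The FT2 setting: only blocks 4 and 5 are updated during training.
ft2_vgg16 = trainable_flags(layer_names(VGG16_CONVS_PER_BLOCK), 2)
ft2_vgg19 = trainable_flags(layer_names(VGG19_CONVS_PER_BLOCK), 2)
for name, flag in ft2_vgg16.items():
    print(f"{name:14s} {'trainable' if flag else 'frozen'}")
```

With a Keras model, flags of this kind would typically be applied by setting `layer.trainable` for each layer of the backbone before compiling. Note that the same FT2 setting updates 6 convolutional layers in VGG-16 but 8 in VGG-19, so an equal number of unfrozen blocks does not mean an equal number of updated parameters.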

Training
The 1458 images selected as the training dataset were randomly divided into five folds. This was done to perform 5-fold cross validation to evaluate the model training, while avoiding overfitting or bias [33][34][35]. Within each fold, the dataset was partitioned into independent training and validation sets using an 80:20 split. The selected validation set was a completely independent fold from the other training folds and was used to evaluate the training status during training. After one round of model training was completed, another independent fold was used as the validation set, and the previous validation set was returned to the training set. An overview of the 5-fold cross validation performed in this study is presented in Figure 4. As an additional measure to prevent overfitting, dropout was applied to the last fully connected layers, and early stopping was applied by monitoring the validation loss at each epoch. The above training process was repeated for all 24 experimental groups (Figure 1). All deep CNN models were trained and evaluated on an NVIDIA DGX Station™ (NVIDIA Corp., Santa Clara, CA, USA) with an Ubuntu 18 operating system, 256 GB system memory, and four NVIDIA Tesla V100 GPUs. The building, training, validation, and prediction of DL models were performed using the Keras [36] library with the TensorFlow [37] backend engine. The initial learning rate of each model was 0.00001. The ReduceLROnPlateau method was employed, which reduces the learning rate when the monitored training performance stops improving. The RMSprop algorithm was used as the solver.
After training all the 5-fold deep CNN models, the best model was identified by testing with the test dataset.
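The interplay between reducing the learning rate on a plateau and early stopping can be illustrated with a simplified simulation. This is not the Keras implementation, which has further options such as a cooldown period and a minimum learning rate. Only the initial learning rate of 0.00001 comes from the text; the reduction factor, both patience values, and the validation losses below are assumptions made for illustration.

```python
def simulate_schedule(val_losses, initial_lr=1e-5, factor=0.5,
                      lr_patience=2, stop_patience=5):
    """Return (learning rate used per epoch, epoch of early stop or None)."""
    lr = initial_lr
    best_loss = float("inf")
    epochs_since_best = 0  # drives early stopping
    lr_wait = 0            # drives the learning-rate reduction
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_since_best = 0
            lr_wait = 0
        else:
            epochs_since_best += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor  # plateau detected: shrink the step size
                lr_wait = 0
        history.append(lr)
        if epochs_since_best >= stop_patience:
            return history, epoch  # validation loss has stalled too long
    return history, None


# Made-up validation losses: improvement for two epochs, then a plateau.
losses = [1.00, 0.80, 0.90, 0.85, 0.82, 0.81, 0.90, 0.95]
lr_history, stopped_at = simulate_schedule(losses)
print(lr_history, stopped_at)
```

In this toy run the learning rate is halved twice before training stops at epoch index 6, so the last listed loss is never reached.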

Performance Evaluation
To comprehensively evaluate the screening performance on the test dataset, the accuracy, sensitivity, specificity, receiver operating characteristic (ROC) curve, and precision recall (PR) curve were calculated. The accuracy, sensitivity, and specificity scores can be calculated as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{Specificity} = \frac{TN}{TN + FP}$$

TP and FP are the numbers of images correctly and incorrectly predicted as positive, respectively. Similarly, TN and FN are the numbers of images correctly and incorrectly predicted as negative, respectively. The area under the ROC curve (AUC) was also calculated in this study.
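For a three-class problem, the accuracy, sensitivity, and specificity of one class are usually obtained by treating that class as positive and the other two as negative. The sketch below shows this one-vs-rest reading; whether the authors computed their per-class scores in exactly this way is an assumption, and the confusion matrix contains toy counts, not results from this study.

```python
def one_vs_rest_metrics(confusion, positive):
    """Accuracy, sensitivity, and specificity for one class against the rest.

    confusion[i][j] is the number of images of true class i that were
    predicted as class j.
    """
    n = len(confusion)
    tp = confusion[positive][positive]
    fn = sum(confusion[positive][j] for j in range(n) if j != positive)
    fp = sum(confusion[i][positive] for i in range(n) if i != positive)
    tn = sum(confusion[i][j]
             for i in range(n) for j in range(n)
             if i != positive and j != positive)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Toy counts for illustration only. Rows are true classes, columns are
# predictions, in the order normal (N), pneumonia (P), COVID-19 (C).
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
covid_metrics = one_vs_rest_metrics(toy_confusion, positive=2)
print(covid_metrics)
```

Under this reading, a pneumonia image predicted as normal still counts as a true negative for the COVID-19 class, which is why per-class accuracy can exceed the overall three-class accuracy.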

Interpretation of Model Prediction
Because it is difficult to know how deep CNNs make predictions, DL models have often been referred to as non-interpretable black boxes. To determine the decision-making process of the model, and which features are most important for the model to screen COVID-19 in CXR images, this study employed the gradient-weighted class activation mapping technique (Grad-CAM) [18,19] so that the most significant regions for screening COVID-19 in CXR images were highlighted.

Results
Table 2 summarizes the classification performance of the three classes, normal (N), pneumonia (P), and COVID-19 (C), for each experimental group. Among all the tested deep CNN models, the VGG-16 model with its last two blocks fine-tuned (VGG16-FT2) achieved the highest performance for COVID-19 classification in terms of accuracy (95.9%), specificity (97.5%), sensitivity (92.5%), and AUC (0.950). For all the tested deep CNNs, fine-tuning the last two convolutional blocks gave a higher classification performance than fine-tuning any other number of convolutional blocks. In addition, the models in which all convolutional blocks were frozen, without any fine-tuning, showed the lowest classification performance regardless of the scalability of the backbone network. Generally, the fine-tuned models using VGG16 as a backbone architecture performed better than those using VGG19. Figure 5 shows how the number of fine-tuned deep CNN blocks influences the classification performance in terms of the accuracy of COVID-19 screening. In this figure, the classification performance was not proportional to the degree of fine-tuning of the base model. For all deep CNNs, classification accuracy decreased when more than three convolutional blocks were fine-tuned. In addition, regardless of the number of fine-tuned blocks, the VGG19 models with more convolutional layers had lower classification accuracy than the VGG16 models.
The confusion matrix and ROC curve of VGG16-FT2, which achieved the highest performance in multi-class classification, are presented in Figures 6 and 7.
The Grad-CAM results in Figure 8 make it possible to identify the significant regions in which the CXR image features of the three classes differ. Figures 9 and 10 show representative examples of incorrect classifications and of correct classifications made for the wrong reasons, respectively. In most cases where classification was based on the wrong reason, there was a foreign body in the chest cavity of the CXR image.

Discussion
In addition to the prolonged duration of the COVID-19 pandemic and the similarity of its symptoms to those of other pneumonia diseases, limited medical resources and a shortage of expert radiologists have greatly increased the importance of screening for COVID-19 from CXR images, both to allocate medical resources appropriately and to isolate potential patients. To overcome these limitations, various cutting-edge artificial intelligence (AI) technologies have been applied to screen COVID-19 from various medical data. Accordingly, numerous new DL models, such as COVID-Net [10], Deep-COVID [16], CVDNet [38], and Covid-resnet [13], have recently been proposed to classify COVID-19 from publicly shared CXR images, and comparative studies based on the transfer learning of various pre-trained DL models have been presented [39,40]. These previous papers reported high accuracies of more than 95%. However, most of them performed transfer learning without specifying the degree of fine-tuning, and qualitative evaluation was rare. As a result, it is often difficult to reproduce a similar degree of accuracy with the same pre-trained DL model. Therefore, in the present study, the effects of the degree of fine-tuning and layer depth of deep CNNs on the screening performance for COVID-19 from CXR images were evaluated. Furthermore, these influences were visually interpreted using the Grad-CAM technique.

Scalability of Deep CNN
It is known that the VGG architecture used as the deep CNN backbone network in this experiment does not leverage residual principles and has a lightweight design with low architectural diversity, so it is convenient to fine-tune [10]. In particular, the VGG-16 and VGG-19 used in this study have the same architecture with five convolutional blocks; however, VGG-19 has more layers than VGG-16 (Figure 3).
According to Table 2 and Figure 5, the overall classification performance of VGG-16 was higher than that of VGG-19, regardless of the degree of fine-tuning. These results are consistent with previous research reporting that the latest deep neural networks do not guarantee higher accuracy in the classification of medical images such as CXR images [39]. It can be considered that for medical images requiring fewer than 10 classes, deep CNNs with low scalability can show better performance, unlike the classification of general objects, which requires more than 1000 classes.

Degree of Fine-Tuning of Deep CNN
In general, a deep CNN model pre-trained on a large natural image dataset can be used to classify common images but cannot be readily applied to specific medical image classification tasks. However, according to previous studies that described the effects and mechanisms of fine-tuning on deep CNNs, when certain convolutional blocks of a deep CNN model are fine-tuned, the model can be further specialized for a specific classification task [32,41]. More specifically, the earlier layers of a deep CNN contain generic features that should be useful for many classification tasks, whereas later layers contain features that are progressively more specific to the classes in the original dataset. Using this property, when the parameters of the early layers are preserved and those of the later layers are updated during training on a new dataset, the deep CNN model can be effectively used in new classification tasks. In short, fine-tuning starts from the parameters learned by training the network on a large dataset and then adjusts the parameters of the later layers on the new dataset, improving the performance and accuracy in the new classification task.
As far as the authors know, there has been no previous research paper evaluating the accuracy of COVID-19 screening according to the degree of fine-tuning. According to Figure 5, regardless of the scalability of VGG, classification accuracy increases as the degree of fine-tuning increases; however, fine-tuning beyond a certain number of convolutional blocks (more than 3 blocks in this experiment) decreases the classification accuracy. Therefore, it seems necessary to search for the appropriate degree of fine-tuning in transfer learning by treating it as a hyperparameter, like the batch size or learning rate in DL.

Visual Interpretation Using Grad-CAM
Grad-CAM uses the gradient information flowing into the last convolutional layer of the deep CNN to understand the significance of each neuron for making decisions [18]. In this experiment, a qualitative evaluation of classification adequacy was performed using the Grad-CAM technique. For the deep CNN model that showed the best classification performance, the image features highlighted for each class were located within the lung cavity in CXR images, as shown in Figure 8. However, as shown in Figure 9, if there is a foreign substance in the lung cavity in a CXR image, the image can be classified incorrectly. Moreover, even if a CXR image is correctly classified, it can be classified for an incorrect reason, as shown in Figure 10. In this CXR image analysis, implanted port catheters and pacemaker or defibrillator generators interfered with the performance of the DL algorithm by causing false positives or false negatives, which is consistent with previous studies [42]. This demonstrates the usefulness of the Grad-CAM technique: it suggests candidate regions or foreign bodies that affect classification accuracy and could be excluded through image preprocessing.
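The core Grad-CAM computation is short enough to write out by hand. Each feature map of the last convolutional layer is weighted by the spatial average of the class-score gradient with respect to that map, the weighted maps are summed, and negative values are clipped by a ReLU. The sketch below works on plain nested lists with toy numbers; in practice the feature maps and gradients would come from the trained network, and the final scaling to [0, 1] is a common visualization step assumed here rather than taken from the paper.

```python
def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap from feature maps A[k][i][j] and the
    gradients of the class score with respect to them, G[k][i][j]."""
    n_channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])

    # Channel weights: global average of the gradients over each map.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum over channels followed by ReLU.
    heatmap = [
        [
            max(0.0, sum(alphas[k] * feature_maps[k][i][j]
                         for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] for display (assumed visualization step).
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Toy example: two 2x2 feature maps. The first has positive gradients and
# supports the class; the second has negative gradients and counts against it.
A = [[[1.0, 0.0], [0.0, 2.0]],
     [[0.0, 3.0], [1.0, 0.0]]]
G = [[[1.0, 1.0], [1.0, 1.0]],
     [[-1.0, -1.0], [-1.0, -1.0]]]
toy_heatmap = grad_cam(A, G)
print(toy_heatmap)
```

The resulting low-resolution map is normally upsampled to the input image size and overlaid on the CXR image, which is how regions such as a pacemaker generator become visible as the basis of a prediction.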

Conclusions
This experiment identified the following points regarding an appropriate transfer learning strategy for a deep CNN to screen for COVID-19 in CXR images. When deep CNNs are used for COVID-19 screening in CXR images, increasing their complexity and layer depth does not always guarantee cutting-edge results. In addition, when applying transfer learning to a deep CNN for classification, an appropriate degree of fine-tuning is required, and this must also be treated as an important hyperparameter that affects the accuracy of DL. In particular, in the case of image classification using DL, it is also necessary to evaluate qualitatively whether a classification was made for the correct reason, using visual interpretation methods such as the Grad-CAM technique.