CSGBBNet: An Explainable Deep Learning Framework for COVID-19 Detection

Since the end of 2019, COVID-19 has swept the world, bringing great impact to various fields and gaining wide attention from all walks of life. At present, although the global epidemic situation is leveling off and vaccine doses have been administered in large amounts, confirmed cases are still emerging around the world. To make up for missed diagnoses caused by the uncertainty of the nucleic acid polymerase chain reaction (PCR) test, utilizing lung CT examination as a combined detection method to improve the diagnostic rate has become a necessity. Considering the time-consuming and labor-intensive nature of the traditional CT analysis process, we developed an efficient deep learning framework named CSGBBNet to solve the binary classification task of COVID-19 images, based on a COVID-Seg model for image preprocessing and a GBBNet for classification. Five runs with random seeds on the test set showed that our novel framework can rapidly analyze CT scan images and give effective results for assisting COVID-19 detection, with a mean accuracy of 98.49 ± 1.23%, sensitivity of 99.00 ± 2.00%, specificity of 97.95 ± 2.51%, precision of 98.10 ± 2.61%, and F1 score of 98.51 ± 1.22%. Moreover, CSGBBNet performs better than seven previous state-of-the-art methods. With this research, we aim to link biomedical research and artificial intelligence together and provide insights into the field of COVID-19 detection.


Introduction
In February 2020, the World Health Organization (WHO) named the cause of Coronavirus Disease 2019 (COVID-19) as SARS-CoV-2 [1], a novel coronavirus first found in the human body. This virus shares a structural similarity of 87.1% [2] with the SARS-related virus found in bats, and 79.5% [3] with the SARS virus, which means the novel coronavirus and the SARS coronavirus belong to the same family of viruses but are not the same.
According to the latest real-time statistics reported by the WHO in Figure 1, as of 10:37 a.m. Central European Summer Time (CEST) on 9 August 2021, there have been 202,296,216 confirmed cases of COVID-19 and 4,288,134 confirmed deaths over 225 countries, areas, or territories. Although vaccine doses have been administered in significant numbers, newly reported confirmed cases and deaths are still emerging every day.
From the current epidemiological investigation of COVID-19 [4], patients in the latent period and some confirmed patients show no difference in appearance from healthy people but are still infectious. Therefore, it is urgent to expand the coverage of COVID-19 inspection, accelerate the speed of detection, and improve the accuracy of detecting infected persons, including asymptomatic carriers, as early as possible to avoid secondary transmission [5]. However, at present, based on a variety of practical

[Table 1: Reference, Number of Scans in the Dataset, Method(s), Reported Accuracy; first row begins: Wang et al. [13], COVID-19]

GoogLeNet is a classic deep learning structure proposed by Szegedy, Ioffe [16]. Our novel model was proposed based on transfer learning from GoogLeNet. In addition, to prevent gradient explosion or dispersion and to keep most activation functions away from their saturated regions, more batch normalization relevant blocks were added to the newly designed layers. In this way, the robustness of the model to different hyperparameters during training is improved and the optimization process becomes smoother. This smoothness makes the behavior of the gradient more predictable and stable, allowing faster training [17].
When dealing with deep neural networks, the selection of hyperparameters is always a problem. Manual parameter tuning is time-consuming and may not achieve optimal results. The advantage of Bayesian Optimization (BO) in parameter tuning is sample efficiency: if we consider one step as training the neural network with one set of hyperparameters, BO can find a better choice of hyperparameters within very few steps. In addition, BO does not require the gradient, which is difficult to obtain for the hyperparameters of a neural network under general circumstances. These two benefits made BO the algorithm we chose to tune the neural network in this experiment.
In this study, we linked biomedical research and artificial intelligence together to propose a novel, rapid, stable deep learning framework called CSGBBNet with excellent accuracy for distinguishing between COVID-19 patients and healthy people. There are four main improvements in our study. The first is that we proposed a novel image preprocessing model, COVID-Seg, based on the maximum entropy threshold segmentation method, which can accurately outline the lung area. The 'COVID' here stands for 'COVID-19 detection', and the 'Seg' signifies 'Segmentation process'. The second is that, in this COVID-Seg model, we introduced an 'erosion and dilation'-based process to refine the output segmented image. This is the first time this combination of techniques has been applied to the COVID-19 detection field, and the results show that the COVID-Seg model has a very significant effect on improving model performance. Thirdly, we presented a novel GBBNet by restructuring and retraining the GoogLeNet model and adding more batch normalization relevant blocks to it. The fourth is that, in our GBBNet, we used Bayesian Optimization as the hyperparameter tuning algorithm to find a better choice of hyperparameters rapidly. In the name 'GBBNet', the 'G' represents 'GoogLeNet'; the first 'B' stands for 'Batch normalization relevant blocks'; the second 'B' stands for 'Bayesian Optimization'. These four improvements help enrich the performance of our model, which is specially built for COVID-19 detection. Sections 2-5 introduce the dataset, methodology, experimental results, and conclusions of our study, respectively.

Dataset
The dataset named 'COVID Academic' is from [18]. It consists of 196 slice images and has been carefully labelled by human experts from the Fourth People's Hospital of Huai'an City. It can be divided into two parts: 98 CT scan images from patients with novel coronavirus pneumonia and 98 CT scan images from healthy controls (HC). Example images are displayed in Figure 2. The instrument and settings for CT acquisition were: Philips Ingenuity 64-row spiral CT machine, mAs: 240, kV: 120, layer spacing 3 mm, layer thickness 3 mm, mediastinum window (width, level: 350, 60 in Hounsfield units), thin-layer reconstruction according to the lesion display. The slice images were chosen using the agreed slice level selection (SLS) method: (i) for patients with COVID-19, selecting the slice with the largest size and number of lesions, and (ii) for the normal cases, any level of slice can be chosen. Using this SLS method, 196 slice images with a resolution of 1024 × 1024 were extracted from COVID-19 patients and healthy people.

Methodology
The abbreviation list and symbol list for this section are shown in Tables 2 and 3.

Table 3. Symbol list (excerpt).
F_resize: function of resizing the images to the same size
F_mets: function of utilizing the maximum entropy threshold segmentation-based method
F_ed: function of the 'erosion and dilation-based process'


Preprocessing
Preprocessing has shown many benefits in medical image analysis [9,19]. We presented a novel model, COVID-Seg, for segmenting the lung regions in each CT image. Tissue regions other than the lungs, such as the heart, ribs, and thoracic vertebrae, were removed, because no changes appear in the CT image even if these regions are infected by the COVID-19 virus. Taking the Figure 2 images as input examples, the general process of the COVID-Seg model is illustrated in Figure 3.

In this model, we first utilized the maximum entropy threshold segmentation method to obtain the preliminary mask of lung regions by classifying each pixel in the image [20,21], as discussed in Section 3.1.1. After that, we refined the mask via appropriate image processing techniques such as the 'erosion and dilation-based process', discussed in Section 3.1.2.
In this study, the COVID-Seg model helps remove background interference factors such as the heart, ribs, and thoracic vertebrae, and thus improves the classification performance [22]. The basic steps of the COVID-Seg model are as follows:
Step 1: Import the raw image set and store it in the variable T_R.
Step 2: Resize the images to the same size. Since the neural network requires inputs of the same size [23], all images need to be resized to a fixed size before being input into the network; 224 × 224 is a suitable choice for the pre-trained model in our research.
T_1 = F_resize(T_R), where F_resize refers to a function that resizes the images to the same size.
Step 3: Use the Maximum entropy threshold segmentation-based method to obtain the preliminary mask set.
T_2 = F_mets(T_1), where F_mets refers to a function utilizing the maximum entropy threshold segmentation-based method.
Step 4: To refine the preliminary mask, we set the different sizes of erosion and dilation squares and apply 'erosion and dilation-based process' to the mask.
T_3 = F_ed(T_2), where F_ed refers to a function applying the 'erosion and dilation-based process'.
Step 5: Remove unwanted connectivity areas of the mask and then invert the values in the mask (change all '0's into '1's and all '1's into '0's) to obtain the final refined mask set. This inversion is needed so that the areas we are interested in and want to retain in the mask are set to '1', and the areas we want to remove are set to '0'.
T_4 = F_crwc(T_3), where F_crwc refers to a function of choosing and retaining the wanted connectivity areas.
Step 6: Conduct element-wise multiplication between the final refined mask set and the original image set to output the segmented image set.
T_Seg = F_comb(T_4, T_R), where F_comb refers to a function conducting an element-wise multiplication (.*) between the final refined mask and the original image to obtain the final segmented image. For easier understanding, the pseudocode of our proposed COVID-Seg model is listed in Algorithm 1.
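To make Steps 5 and 6 concrete, the mask post-processing can be sketched as follows. This is a minimal illustration using NumPy and SciPy; the helper names and the use of scipy.ndimage.label for connectivity areas are our own choices, not the paper's implementation:

```python
import numpy as np
from scipy import ndimage

def retain_wanted_components(mask, n_keep=2):
    # Keep only the n_keep largest connected areas of a binary mask
    # (e.g., the two lung regions), discarding smaller spurious blobs.
    labeled, n = ndimage.label(mask)
    if n == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, index=np.arange(1, n + 1))
    keep = 1 + np.argsort(sizes)[::-1][:n_keep]  # labels of the largest blobs
    return np.isin(labeled, keep).astype(mask.dtype)

def invert_mask(mask):
    # Step 5: swap 0s and 1s so the regions to retain are marked with 1.
    return 1 - mask

def apply_mask(mask, image):
    # Step 6: element-wise multiplication (.*) of mask and image.
    return mask * image
```

Whether the inversion is needed depends on whether the preliminary mask marks the lungs or the background with '1', as described in Step 5.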

Algorithm 1 Pseudocode of our proposed COVID-Seg model
Input: Original image set T_R
Phase I: Resize
  Resize the images to 224 × 224: T_1 = F_resize(T_R);
Phase II: Apply Segmentation
  Apply the maximum entropy threshold segmentation-based method to obtain the preliminary mask set: T_2 = F_mets(T_1);
Phase III: Refine
  Erode and dilate the mask: T_3 = F_ed(T_2);
  Remove unwanted connectivity areas of the mask and invert the values in the mask to obtain the final refined mask set: T_4 = F_crwc(T_3);
Phase IV: Combine
  Combine the final refined mask set and the original image set to obtain the segmented image set: T_Seg = F_comb(T_4, T_R).

3.1.1. Improvement 1: Using Maximum Entropy Threshold Segmentation to Obtain the Preliminary Mask
Maximum entropy threshold segmentation [21] was utilized to classify each pixel in the given image and remove background interference factors, thus improving the classification performance. An illustration of the histogram for Figure 1a is shown in Figure 4.
Assuming an image I contains N pixels, the probability p(i) of the occurrence of pixel value i in the image can be defined as

p(i) = h(i) / N, i = 0, 1, . . . , k − 1,

where h(i) is the y-axis value in Figure 4 and refers to the frequency of pixel value i in the image; i stands for the x-axis value in Figure 4 and typically takes integer values from 0 to 255; and k stands for the limited range of pixel values (e.g., 256). The cumulative distribution function P(i) can be defined as

P(i) = Σ_{j=0}^{i} p(j) = H(i) / N,

where H stands for the cumulative histogram corresponding to the cumulative probability.
The probability of the occurrence of each gray level x (with the abscissa value of I as u and the ordinate value of I as v) can be expressed as

p(x) = p(I(u, v) = x).

Because all probabilities need to be known in advance, they are also called prior probabilities, which can be estimated by observing the frequency of the corresponding gray value in one image or more images. The probability vector for the k different gray values x = 0, . . . , k − 1 can be expressed as p(0), p(1), . . . , p(k − 1). The probability distribution p(x) of the image is then

p(x) = h(x) / N,

and the corresponding cumulative probability distribution function is

P(x) = Σ_{j=0}^{x} p(j).

Given an estimated probability density function p(x) in the digital image, the entropy of the image can be defined as

H = − Σ_{x=0}^{k−1} p(x) log p(x).

Given a specific threshold y (0 ≤ y < k − 1), for the two image regions Y_0 and Y_1 segmented by this threshold, the estimated probability density functions can be expressed as

p_0(x) = p(x) / P_0(y), 0 ≤ x ≤ y,
p_1(x) = p(x) / P_1(y), y < x ≤ k − 1,

where P_0(y) and P_1(y) respectively represent the cumulative probability of the background and foreground pixels segmented by threshold y, and their sum is 1. The corresponding entropies of the background and foreground are

H_0(y) = − Σ_{x=0}^{y} p_0(x) log p_0(x),
H_1(y) = − Σ_{x=y+1}^{k−1} p_1(x) log p_1(x).

Under this threshold, the total entropy of the image is

H(y) = H_0(y) + H_1(y).

Subsequently, we calculated the total entropy of the image under every candidate segmentation threshold and found the maximum. The final threshold is the one corresponding to the maximum entropy: pixels whose gray value is larger than this threshold act as the foreground, and pixels whose gray value is smaller act as the background [24].
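This maximum-entropy threshold selection can be sketched in NumPy. The function name and the use of the natural logarithm are our own choices; the paper's implementation may differ:

```python
import numpy as np

def max_entropy_threshold(image, k=256):
    # Kapur-style maximum-entropy threshold: choose y maximizing the sum of
    # background and foreground entropies H0(y) + H1(y).
    hist = np.bincount(image.ravel(), minlength=k).astype(float)
    p = hist / hist.sum()                 # p(i) = h(i) / N
    P = np.cumsum(p)                      # cumulative distribution P(i)
    best_y, best_H = 0, -np.inf
    for y in range(k - 1):
        P0, P1 = P[y], 1.0 - P[y]
        if P0 <= 0 or P1 <= 0:
            continue                      # one region would be empty
        p0 = p[:y + 1] / P0               # background density p0(x)
        p1 = p[y + 1:] / P1               # foreground density p1(x)
        H0 = -np.sum(p0[p0 > 0] * np.log(p0[p0 > 0]))
        H1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))
        if H0 + H1 > best_H:
            best_H, best_y = H0 + H1, y
    return best_y
```

On a clearly bimodal image, the selected threshold falls between the two modes, so thresholding splits the pixels into the two intensity classes.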
3.1.2. Improvement 2: Using Erosion and Dilation-Based Technique to Refine the Mask
However, in the mask set produced by maximum entropy threshold segmentation, there is a severe problem: some lesion regions that contain useful information are eliminated together with the heart, ribs, and thoracic vertebrae. Hence, to refine the mask, we utilized a method based on the erosion and dilation processing technique, a collection of nonlinear operations related to the morphology of objects in an image. This method relies only on the relative ordering of pixel values, rather than their numerical values, and is thus especially suited to processing binary images such as our mask images [25,26]. Using this technique, we probed each mask by positioning a small structuring element at all possible locations in the mask, compared that element with the corresponding neighborhood of pixels, and finally conducted the refining process.
Erosion with small (e.g., 2 × 2 to 5 × 5) square structuring elements shrinks an object by stripping away a layer of pixels from both the inner and outer boundaries of regions. Through erosion, the holes and gaps between different regions become larger, which eliminates useless small details in the mask. Dilation has the opposite effect: it adds a layer of pixels to both the inner and outer boundaries of regions. Through dilation, the holes enclosed by a single region and the gaps between different regions become smaller, and small intrusions into a region's boundaries are filled in [27,28]. The results of dilation or erosion are influenced by both the size of the structuring element and the number of times it is applied: a larger structuring element and more applications bring a more pronounced effect. Furthermore, the result of erosion or dilation with a larger structuring element is similar to the result obtained by iterated erosion or dilation with a smaller structuring element of the same shape. In other words, if s_1 and s_2 are a pair of structuring elements identical in shape, with s_2 twice the size of s_1, then

f ⊖ s_2 ≈ (f ⊖ s_1) ⊖ s_1, (16)

where f ⊖ s denotes a new binary image produced by the erosion of a binary image f by a structuring element s. We adjusted the size and number of applications of the structuring element to achieve our desired refined mask. For 'COVID Academic', after many attempts, we found the most suitable refining method is: where r represents the refined mask and p represents the preliminary mask.
Since dilation and erosion are dual operations with opposite effects, when we found the mask had been excessively eroded, we added a dilation step to recover the useful information in the image by recovering the mask: where q is the size of the square structuring element, decided according to the degree of excessive erosion, and f ⊕ s denotes a new binary image produced by the dilation of a binary image f by a structuring element s.
In the following steps, we repeatedly used the algorithm of selecting the largest connected areas and inverted the values in the mask to retain the areas we are interested in. A series of representative outputs from the erosion and dilation-based process is displayed in Figure 5.
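The behavior of erosion and dilation, including the relation in Equation (16) for square structuring elements, can be checked with a small NumPy sketch. The helper names are our own, and odd-sized square elements are assumed so the windows are centered:

```python
import numpy as np

def erode(f, size):
    # Binary erosion by a size x size square: a pixel survives only if the
    # whole square neighborhood around it is foreground.
    pad = size // 2
    g = np.pad(f, pad, constant_values=0)
    out = np.ones_like(f)
    for di in range(size):
        for dj in range(size):
            out &= g[di:di + f.shape[0], dj:dj + f.shape[1]]
    return out

def dilate(f, size):
    # Binary dilation (the dual of erosion): a pixel becomes foreground if
    # any pixel of the square neighborhood around it is foreground.
    pad = size // 2
    g = np.pad(f, pad, constant_values=0)
    out = np.zeros_like(f)
    for di in range(size):
        for dj in range(size):
            out |= g[di:di + f.shape[0], dj:dj + f.shape[1]]
    return out
```

For square elements the relation in Equation (16) is in fact exact: eroding with a 5 × 5 square equals eroding twice with a 3 × 3 square, and dilating an eroded square mask recovers it (an "opening").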


Transfer Learning Model
Many neural networks increase their depth to achieve better training performance. However, deepening the network brings several negative effects, e.g., overfitting, gradient disappearance, and gradient explosion. In contrast, GoogLeNet improves training results by making more efficient use of computing resources [29,30].
We restructured and retrained GoogLeNet by replacing the layers after Inception Block 5b with our newly designed layers containing the Conv Block, and incorporated the framework with the Bayesian Optimization algorithm to propose a new network named GBBNet for the COVID-19 classification task.
The basic structure of GBBNet is illustrated in Figure 6. The numbers in the convolutional layer block and the max-pooling layer block represent the exact information of the layer. For instance, Conv 1 × 1 + 1 (V) represents a convolutional layer with 1 × 1 filter size, stride 1, and 'valid' padding; MaxPool 3 × 3 + 2 (S) represents a max-pooling layer with 3 × 3 filter size, stride 2, and 'same' padding.
The Inception Block structure, which can be seen in Figure 7a, is retained for two advantages. On the one hand, a 1 × 1 convolution block can raise and lower the dimension. On the other hand, convolution and reintegration on multiple scales can be conducted at the same time, which saves much time [16,31]. After the inception structure, two new Conv Blocks with batch normalization layers, as illustrated in Figure 7b, were added. The reason is discussed in Section 3.2.1.
Global average pooling then performed feature map reduction, converting a feature map of w × w × c to 1 × 1 × c. Subsequently, the original fully connected layer (FCL) was replaced with a new, randomly initialized FCL of two neurons, since the original one conducted 1000-category classification for ImageNet, and those categories are not associated with the main classification task of our research. The new classification layer likewise has only two classes (COVID-19 patient and healthy person).
Next, we utilized a precomputation strategy for retraining [32]. We first set the cognate learning factor of the specified layers to zero to freeze those layers, then calculated the activation maps at the last frozen layer, 'inception_5b-output', for all the images. Finally, we saved the feature maps to local disk and used them as input for training the trainable layers.
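The precomputation strategy can be mimicked in a toy example: freeze a backbone, compute its features once, and train only the new two-neuron head on the cached features. Everything below is illustrative, not the paper's code; a random-projection 'backbone' stands in for the frozen GoogLeNet layers, and a logistic-regression head stands in for the new FCL:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for the frozen layers up to 'inception_5b-output': any fixed,
# non-trainable mapping (here a random projection followed by ReLU).
W_frozen = rng.normal(size=(10, 32))
def frozen_backbone(X):
    return np.maximum(X @ W_frozen, 0.0)

X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # toy binary labels

# Precompute the activation maps ONCE for the whole dataset
# (in the paper these feature maps are saved to local disk) ...
F = frozen_backbone(X)

# ... then train only the new two-class head via gradient descent.
w, b = np.zeros(F.shape[1]), 0.0
def loss():
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

loss_start = loss()
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # sigmoid of the head's logits
    w -= 0.05 * (F.T @ (p - y)) / len(y)
    b -= 0.05 * float(np.mean(p - y))
loss_end = loss()
acc = float(np.mean(((F @ w + b) > 0) == (y == 1)))
```

The backbone's forward pass runs once rather than once per epoch, which is the entire point of caching the activations of the frozen layers.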

Improvement 3: Adding More Batch Normalization Relevant Blocks to Offset the Impact of Different Scales
In our research, if we tolerate the existence of covariate shift, i.e., the difference in characteristic distribution between different parts of the dataset, then the different distributions of input values lead to differences in input feature values. After multiplication by the weight matrix, difference values with large deviations are generated. Because our deep learning network is updated and improved through training, small changes in these difference values have a deep impact on the later layers; the greater the deviation, the more obvious the impact. Therefore, to prevent these phenomena from causing gradient divergence during backpropagation, we decided to add more batch normalization (BN) relevant blocks to the network to offset the impact of different scales and, at the same time, improve the performance of our framework. BN standardizes the input values and reduces the scale differences to the same range, which brings two main benefits. On the one hand, it improves the convergence of the gradient and speeds up training. On the other hand, each layer faces input values of the same characteristic distribution as far as possible, which reduces the uncertainty brought by changes and the influence on the back-end network, making each layer relatively independent [33,34].
Assume BN is used to normalize the layer's input X = {x_1, x_2, . . . , x_n} to guarantee the output Y = {y_1, y_2, . . . , y_n} has a uniform distribution [35]. The mini-batch mean µ can be calculated as

µ = (1/n) Σ_{i=1}^{n} x_i.

The mini-batch variance σ² can be calculated as

σ² = (1/n) Σ_{i=1}^{n} (x_i − µ)².

The input x_i ∈ X is normalized to x̂_i according to

x̂_i = (x_i − µ) / sqrt(σ² + ϕ),

where ϕ is a constant added to the mini-batch variance to improve numerical stability, set to 10⁻⁴ in our experiment. The transformation for obtaining y_i ∈ Y is

y_i = ξ x̂_i + λ,

where ξ and λ are two parameters learned during training. Afterwards, the output y_i is sent to the next layer, while the normalized x̂_i remains internal to the current layer.
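The BN forward pass amounts to the following minimal NumPy sketch; xi, lam, and phi mirror the ξ, λ, and ϕ above, and applying the statistics per feature column is our assumption:

```python
import numpy as np

def batch_norm(X, xi, lam, phi=1e-4):
    # Batch-normalize a mini-batch X (n samples x d features):
    # per-feature mean and variance, then learned scale xi and shift lam.
    mu = X.mean(axis=0)                    # mini-batch mean
    var = X.var(axis=0)                    # mini-batch variance
    X_hat = (X - mu) / np.sqrt(var + phi)  # normalized activations
    return xi * X_hat + lam                # scale-and-shift transformation
```

With xi = 1 and lam = 0 the output has zero mean and (near-)unit variance regardless of the input's scale, which is exactly the "offsetting different scales" effect described above.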

Improvement 4: Incorporating Bayesian Optimization in Network
Tuning hyperparameters is an essential step that has a significant impact on model performance in machine learning. Traditional manual tuning is time-consuming and may not obtain optimal results, especially when the network is complex and determined by many hyperparameters. Subsequently emerging tuning methods such as Grid Search and Random Search, though free of manual effort, still take a long time and are prone to local optima when dealing with non-convex functions [36-39].
In this research, we incorporated Bayesian Optimization for two main reasons. Firstly, it saves time because it requires fewer iterations. Secondly, it is robust when dealing with non-convex problems. Figure 8 portrays the pipeline for Bayesian Optimization in our experiment.
In our experiment, the acquisition (AC) function we chose is the Expected Improvement (EI) function, as it addresses the shortcoming of the Probability of Improvement (PI) function by considering how much larger an unknown point is than the known maximum point. The equation [40] is as below:

EI(x) = (µ − f(x⁺)) Φ(Z) + σ φ(Z), with Z = (µ − f(x⁺)) / σ,

where x is an observed point; µ is the mean of all the observation points; f(x⁺) represents the current maximum value; σ is the standard deviation of all observation points; and Φ() and φ() denote the cumulative distribution function (CDF) and probability density function (PDF) of the standard normal distribution, respectively. Besides, we set the ranges of the following hyperparameters, as displayed in Table 4, to fine-tune the model.

InitialLearnRate
(1 × 10 −5 , 1) Momentum (0.8, 0.98) L2Regularization (1 × 10 −10 , 1 × 10 −2 ) LearnRateDropFactor (0, 1) The best learning rate and learning drop factor would depend on their based dataset and the network. Momentum adds inertia to the parameter updates by having the current update contain a contribution proportional to the update in the previous iteration. L2Regulization is usually used to prevent the circumstance of overfitting. Thus, they are all good choices of variables for us to search the space for finding a suitable value.
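The EI acquisition function described above can be sketched as follows. This is an illustrative, stdlib-only Python version (Φ is implemented via math.erf); the function and variable names are ours, not part of any Bayesian Optimization library.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal PDF, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(mu, sigma, f_best):
    """EI = (mu - f_best) * Phi(Z) + sigma * phi(Z), Z = (mu - f_best) / sigma.

    mu and sigma are the surrogate's posterior mean and standard deviation at
    a candidate point; f_best is the current maximum observed value.
    """
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)  # no uncertainty: improvement is deterministic
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)

ei_at_best = expected_improvement(mu=0.0, sigma=1.0, f_best=0.0)
```

Note that EI stays positive even where the predicted mean equals the current best, as long as σ > 0; this is what drives exploration of uncertain regions.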

K-Fold Cross-Validation
The main reason for adopting K-fold cross-validation in this experiment is that it performs well on small datasets. The original dataset needs to be split, with one part as a training set and one part as a testing set. In the traditional machine learning process, the testing set does not participate in the training process, which wastes part of the data. This limits the optimization of the model, especially when dealing with a small dataset, because data determines the upper limit of program performance [41][42][43][44].
Thus, it is important to make good use of the data, especially for this experiment. K-fold cross-validation randomly divides the training data into K folds, runs K rounds of training, and obtains K testing results, which reports unbiased performance. In this way, all the data can be well utilized, and the model performance can be expressed reasonably by averaging [45]. Considering overfitting and running time, we chose five as the K value for our experiment. An illustration of five-fold cross-validation is shown in Figure 9. The D in the figure represents 'Dataset'.
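The fold assignment can be sketched as follows. This is an illustrative Python version (the paper's experiments used MATLAB); in practice the indices would be shuffled first, which is omitted here for determinism.

```python
def k_fold_indices(n_samples, k=5):
    """Partition sample indices into k folds; each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    indices = list(range(n_samples))
    # Distribute any remainder across the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

splits = k_fold_indices(10, k=5)
```

Every sample appears in exactly one test fold across the five splits, which is what makes the averaged performance estimate unbiased.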

Implementation of Our Proposed CSGBBNet
We propose the framework of CSGBBNet by combining the COVID-Seg model and the GBBNet. The basic steps for implementation are as follows.
Step 1: Import the original image set T_R and its ground truth label G_t. Conduct the image pre-processing using COVID-Seg.
Step 3: Pre-train each model, compare their testing accuracies to select the optimum one with maximum accuracy, and store that model into variable M_0. Let L_0 be its number of learnable layers.
Step 4: Remove the last L learnable layers from M_0 to get a new M_1, where F_rl refers to a function of removing layers, and L refers to the number of specific last layers to be removed.
Step 5: Add five newly designed layers/blocks to M_1 to get a new M_2, where F_adl refers to a function of adding newly designed layers, and the constant 5 refers to the number of newly designed layers/blocks appended to M_1. Let L_2 be the number of learnable layers of M_2.
Step 6: Freeze the specific early layers by setting their learning rate to 0, where l_r refers to the learning rate, and M(m, n) refers to the layers from m to n in the network model M.
Step 7: Make the five newly added layers/blocks retrainable by setting their learning rate to 1.
Step 8: Import the segmented image set T_Seg.
Step 9: Retrain the whole neural network with input T_seg^{train,i} and the Bayesian Optimization training algorithm, and get the new network GBBNet_i, where F_rn refers to a function of retraining the network and T_seg^{train,i} refers to the input training dataset.
Step 10: Output the test performances of GBBNet.
Finally, the pseudocode for CSGBBNet is listed in Algorithm 2.

Algorithm 2. Pseudocode of the proposed CSGBBNet.
for i = 1:5
    T_seg^{train,i} is the training set and T_seg^{test,i} is the testing set.
    Remove the last L learnable layers from M_0 to get a new M_1, see Equation (25).
    Add 5 newly designed layers/blocks (containing Batch Normalization layers), see Equation (26).
    Freeze the specific early layers, see Equation (27).
    Make the newly added layers retrainable, see Equation (28).
    Retrain the whole neural network with input T_seg^{train,i} and the Bayesian Optimization training algorithm; get a new network, see Equation (29).
end
Phase IV: Report the test performance of our proposed framework.
    The training set is T_seg^{train} with labels G_t(T_seg^{train}); the testing set is T_seg^{test} with labels G_t(T_seg^{test}).
for i = 1:5
    Prediction: Pred(i) = predict(GBBNet_i, T_seg^{test,i})
    Test confusion matrix: Conf_Test = compare(Pred(i), G_t(T_seg^{test,i}))
    Calculate indicators: accuracy, sensitivity, specificity, precision, F1 score, see Equation (30).
end
Output: the best model GBBNet and its test performances.
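Steps 4-7 above can be sketched with a toy layer list. This is an illustrative Python sketch, not the paper's actual MATLAB implementation; the layer names and the learning-rate-multiplier convention (0 = frozen, 1 = retrainable) are assumptions made for illustration.

```python
def build_gbbnet(pretrained_layers, new_layers, last_l_to_remove):
    """Each layer is a (name, learn_rate_multiplier) pair.

    Mirrors Steps 4-7: truncate the pretrained model, append the newly
    designed layers, freeze the retained layers, retrain the new ones.
    """
    # Step 4: remove the last L learnable layers from the pretrained model M0.
    m1 = pretrained_layers[:-last_l_to_remove]
    # Step 6: freeze the retained early layers by setting their rate to 0.
    frozen = [(name, 0.0) for name, _ in m1]
    # Steps 5 and 7: append the new layers/blocks and make them retrainable.
    retrainable = [(name, 1.0) for name, _ in new_layers]
    return frozen + retrainable  # this is M2

m0 = [("conv1", 1.0), ("inception4", 1.0), ("fc", 1.0)]
new = [("bn_block", 1.0), ("fc_new", 1.0)]
gbbnet = build_gbbnet(m0, new, last_l_to_remove=1)
```

Only the appended layers carry a non-zero learning-rate multiplier, so retraining updates the new head while the pretrained backbone stays fixed.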
All in all, our experiment used the novel COVID-Seg model for image preprocessing, GBBNet for classification, and five-fold cross-validation as a validation method that reports unbiased performance. We named the whole framework CSGBBNet. A simplified diagram of the running process of our proposed CSGBBNet is displayed in Figure 10. The detailed structures of the 'Frozen layers' and 'New layers' are defined in Figure 6.

Data Splits
All the experiments were conducted on a laptop with a GTX 1060 GPU. Most of the experiments were written in MATLAB 2020a. To verify the robustness of our framework, we introduced a new benchmark dataset, 'COVID-CT', with 539 CT scan images. In our experiment, we consider COVID-19 patient images as positive samples and healthy control images as negative samples on both datasets. Because we utilized a five-fold cross-validation method, we have five runs, and in each run, images are divided into 80% for training and 20% for testing. A clear table of the data splits for both the COVID Academic dataset and the new benchmark dataset can be found in Table 5. ('COVID Academic' is referenced from [18]; 'COVID-CT' is referenced from [46].) Besides, to comprehensively evaluate the performance of the classifier, we added the harmonic-mean F1 score [47][48][49][50] as an evaluation criterion, which considers both precision and recall according to the equation:

F1 = 2 × (precision × recall) / (precision + recall)
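The indicators, including the F1 score, can be computed directly from confusion matrix counts. The following is an illustrative Python sketch; the example counts are taken from the overall COVID Academic confusion matrix reported in the next section (97 true positives, 96 true negatives, 2 false positives, 1 false negative).

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the five indicators used in the paper from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)      # recall: real positives found
    specificity = tn / (tn + fp)      # real negatives found
    precision = tp / (tp + fp)        # predicted positives that are right
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

acc, sens, spec, prec, f1 = classification_metrics(tp=97, fn=1, fp=2, tn=96)
```

On these counts the accuracy works out to 193/196 and the F1 score to 194/197, consistent with the roughly 98.5% figures reported for the COVID Academic dataset.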

Statistical Results
From the test performance and the confusion matrix results displayed in Table 6 and Figure 11, on the COVID Academic dataset, CSGBBNet predicted 193 of the 196 samples correctly. On the COVID-CT dataset, CSGBBNet achieves a mean accuracy of 95.17 ± 1.22% and a mean F1 score of 96.21 ± 0.98%, which are higher than the best accuracy (89.1%) and the best F1 score (89.6%) reported in [46]. This verifies that our framework has excellent robustness when encountering different datasets. Taking the overall confusion matrix on the COVID Academic dataset as an illustration, the classifier correctly recognized 97 'COVID-19' images as 'COVID-19' and 96 'Healthy Control' images as 'Healthy Control', but misidentified two 'Healthy Control' images as 'COVID-19' and one 'COVID-19' image as 'Healthy Control'. In the upper-left box of the confusion matrix, the value 98.0% means that 98.0% of the samples predicted as 'COVID-19' were classified correctly, and the value 99.0% means that 99.0% of the real 'COVID-19' samples were classified correctly. Meanwhile, in the lower-right box, the value 99.0% means that 99.0% of the samples predicted as 'Healthy' were classified correctly, and the value 98.0% means that 98.0% of the real 'Healthy' samples were classified correctly. In a medical sense, our approach has a very low misdiagnosis rate, especially when the target is a real COVID-19 patient. This is very helpful for practical applications, because a high diagnosis rate for COVID-19 patients can effectively prevent secondary transmission of the epidemic.
The graphs of minimum objective vs. number of function evaluations achieved during the Bayesian Optimization process on the test sets of both datasets are shown in Figure 12, where different combinations of hyperparameters such as 'initial learning rate' and 'momentum' were attempted during training to achieve an optimum result. In the next section, we implement an ablation study to conduct a full analysis of our results.

Ablation Study
To verify the effectiveness of the different proposed modules more explicitly, we name the framework without the COVID-Seg model 'GBBNet Only', the model without the newly added designed layers 'CSGBBNet no BN', and the model without the Bayesian Optimization module 'CSGBBNet no BO'. In this section, all the experiments were conducted on 'COVID Academic'.
As displayed in Table 7 and Figure 13, firstly, after introducing the COVID-Seg model, the results on the testing set improved significantly: the accuracy by 6.68%, the sensitivity by 7.26%, the specificity by 6.21%, the precision by 5.44%, the F1 score by 6.56%, and the standard deviation by 3.36-7.25%. This means our segmentation was very successful and effective. Second, after adding the newly designed layers, the accuracy on the testing set improved by 2.07%, the sensitivity by 3.05%, the specificity by 1.06%, the precision by 1.10%, and the F1 score by 2.13%, and the stability of the results also slightly improved. Third, utilizing Bayesian Optimization helps to find the optimum hyperparameters for the model, because all the evaluated criteria show different degrees of improvement: the accuracy improved by 4.11%, the sensitivity by 7.32%, the specificity by 0.95%, the precision by 1.21%, and the F1 score by 4.48%, and the stability improved by 0.16-5.19% as revealed by the standard deviation. Finally, according to the Receiver Operating Characteristic (ROC) curves in Figure 14, the illustrated diagnostic ability of the CSGBBNet classifier system is also superior to those without complete modules. To sum up, the results above indicate that the modules (the COVID-Seg model, the Bayesian Optimization training algorithm, and the newly added layers) are all effective in our framework.
In Table 8, when compared with seven state-of-the-art methods on the COVID Academic dataset, CSGBBNet has advantages in terms of most of the performance metrics. The results show excellent stability according to the standard deviations. It is also clear in Figure 15 that CSGBBNet presents very good computational efficiency, as it runs faster than most other approaches, with the exception of GoogLeNet and VGG16. However, the speed difference among them is not significant, and our framework shows a much higher diagnostic rate than GoogLeNet and VGG16.

To conclude this section, CSGBBNet achieves excellent performance, with a mean accuracy of 98.49 ± 1.23%, a sensitivity of 99.00 ± 2.00%, a specificity of 97.95 ± 2.51%, a precision of 98.10 ± 2.61%, and an F1 score of 98.51 ± 1.22%. It is worth mentioning that, as illustrated in Figure 16, it performs better than seven other state-of-the-art methods in accuracy, specificity, precision, F1 score, and especially the sensitivity criterion. The sensitivity of CSGBBNet is 12.64% higher than that of the approach ranked second in the table. The reason is that we effectively removed the factors in the image that would affect the judgment of the model, such as the regions of the heart, ribs, and thoracic vertebrae. Moreover, our proposed model shows excellent stability, robustness, and very good computational efficiency when compared with other state-of-the-art machine learning approaches and the traditional RT-PCR test. In all, CSGBBNet has a low misdiagnosis rate and high stability, and can diagnose multiple slice images within one second, which is of great significance in the practical application of COVID-19 detection.
To further verify that CSGBBNet provides an explainable and promising way to detect COVID-19, gradient-weighted class activation mapping (Grad-CAM) is employed in the next section.

Explainability of Proposed CSGBBNet
We take samples from the COVID-19 and HC classes respectively and display the relevant heatmaps in Figure 17a-d to verify the explainability of CSGBBNet. The manual delineations of the lesion areas are shown in Figure 17e-h. These heatmaps were generated from the 'Inception 5b output' feature map in GBBNet with the help of the Grad-CAM approach. The red areas in a heatmap are the parts receiving strong attention from the model, and the deep blue areas are the parts receiving little attention. According to these heatmaps, we can confirm that, for the COVID-19 images (Figure 17a,b), our framework pays more attention to the lesion areas while paying little attention to the non-lesion areas. For the HC images (Figure 17c,d), by contrast, the attention of the model is not focused on any particular area, because there is no lesion area in the healthy control category.
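The Grad-CAM weighting behind these heatmaps can be sketched numerically. The following is an illustrative Python toy on 2×2 feature maps (the paper's heatmaps were produced from the 'Inception 5b output' feature map of GBBNet via a standard Grad-CAM implementation; this sketch only reproduces the core formula, heatmap = ReLU(∑_k a_k · A_k), with a_k the global average of the class gradient over channel k).

```python
def grad_cam(activations, gradients):
    """activations, gradients: K feature maps (2-D lists) of equal size.

    Channel weight a_k = global average pooling of the gradients for
    channel k; heatmap = ReLU(sum_k a_k * A_k).
    """
    h, w = len(activations[0]), len(activations[0][0])
    # Global-average-pool each gradient map to get one weight per channel.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    heatmap = [[0.0] * w for _ in range(h)]
    for a_k, act in zip(weights, activations):
        for i in range(h):
            for j in range(w):
                heatmap[i][j] += a_k * act[i][j]
    # ReLU keeps only regions that positively support the target class.
    return [[max(v, 0.0) for v in row] for row in heatmap]

acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)
```

The channel with positive gradients survives into the heatmap while the negatively weighted channel is suppressed by the ReLU, which is why the red regions highlight evidence for the predicted class.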

To sum up, the heatmaps provide a clear and understandable interpretation of how our CSGBBNet distinguishes COVID-19 images from healthy control images. In other words, the concerns of our model are very similar to the standard already approved in the medical community, which adds confidence that it is capable of assisting the diagnosis of doctors, radiologists, and inspectors at each epidemic prevention site in the real world.

Conclusions
Chest CT is a rapid, non-invasive method for screening COVID-19. Our proposed deep learning framework CSGBBNet, which uses the COVID-Seg model for image preprocessing and GBBNet for classification, is an explainable framework for this classification task. CSGBBNet not only shows advantages in diagnostic rate, stability, and computational efficiency when compared with other state-of-the-art machine learning approaches and traditional RT-PCR tests, but also shares similar concerns with the standard approved in the medical community. Moreover, it shows excellent robustness when encountering different datasets. To conclude, combining these two techniques can help to realize rapid, effective, stable, and safe detection of COVID-19, which is of great significance to clinical medicine and society.
In the future, we plan to incorporate or learn from more advanced deep learning techniques and revise the model to improve classification efficiency.