Article

A Novel COVID-19 Image Classification Method Based on the Improved Residual Network

School of Computer Science and Engineering, Anhui University of Science & Technology, Huainan 232001, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(1), 80; https://doi.org/10.3390/electronics12010080
Submission received: 25 November 2022 / Revised: 20 December 2022 / Accepted: 21 December 2022 / Published: 25 December 2022
(This article belongs to the Section Computer Science & Engineering)

Abstract
In recent years, chest X-ray (CXR) imaging has become one of the significant tools to assist in the diagnosis and treatment of novel coronavirus pneumonia. However, CXR images have complex-shaped and changing lesion areas, which makes it difficult to identify novel coronavirus pneumonia from the images. To address this problem, a new deep learning network model (BoT-ViTNet) for automatic classification is designed in this study, which is constructed on the basis of ResNet50. First, we introduce multi-headed self-attention (MSA) to the last Bottleneck block of the first three stages in the ResNet50 to enhance the ability to model global information. Then, to further enhance the feature expression performance and the correlation between features, the TRT-ViT blocks, consisting of Transformer and Bottleneck, are used in the final stage of ResNet50, which improves the recognition of complex lesion regions in CXR images. Finally, the extracted features are delivered to the global average pooling layer for global spatial information integration in a concatenated way and used for classification. Experiments conducted on the COVID-19 Radiography database show that the classification accuracy, precision, sensitivity, specificity, and F1-score of the BoT-ViTNet model are 98.91%, 97.80%, 98.76%, 99.13%, and 98.27%, respectively, outperforming the other classification models. The experimental results show that our model classifies CXR images more accurately than the compared models.

1. Introduction

At the end of December 2019, cases of novel coronavirus pneumonia of unknown origin were reported in Wuhan, Hubei province; the disease was officially named COVID-19 by the World Health Organization in February 2020 [1]. COVID-19 has pandemic characteristics and spread rapidly worldwide, seriously threatening people's lives and health [2]. According to the Global New Coronavirus Pneumonia Epidemic Real-Time Big Data Report, as of October 2022, more than 200 countries and regions worldwide had reported COVID-19 infections, with the cumulative number of confirmed cases exceeding 600 million and the number of deaths exceeding 6 million.
COVID-19 is a novel infectious disease caused by severe acute respiratory syndrome coronavirus 2 infection. Its early clinical features are mainly fever, dry cough, and malaise, with a few cases accompanied by symptoms such as runny nose and diarrhea. Severe cases can cause dyspnea and organ failure, and may even lead to death [3,4]. Over more than two years, the instability of the viral genome has given rise to several variants of COVID-19. These variants are harder to detect, which makes accurate diagnosis extremely difficult.
Nucleic acid testing is the most common method used to diagnose COVID-19. It detects viral fragments using the reverse transcription polymerase chain reaction (RT-PCR) technique [4]. However, nucleic acid testing is time consuming, has limited sensitivity and a high false-negative rate, and requires special test kits [5,6], which limits its applicability. In recent years, medical imaging techniques have been widely used for the diagnosis of various diseases, and chest X-ray (CXR) and computed tomography (CT) [7] are used to diagnose COVID-19. Compared with the nucleic acid test, medical-imaging-based diagnosis is faster and more effective. CT involves a relatively high radiation dose and is not suitable for pregnant women and infants, whereas CXR involves little radiation and can reduce the risk of cross-infection to some extent. Furthermore, CXR is less costly and more widely available than CT. However, manual analysis and diagnosis based on CXR images depends heavily on the expertise of healthcare professionals, and the analysis of image characteristics is time consuming, which makes it difficult to observe occult lesions at an early stage and to distinguish COVID-19 from other viral and bacterial pneumonias [8]. Given the urgent need, experts recommend using computer-aided diagnosis to supplement manual diagnosis, improving detection efficiency and helping doctors diagnose more accurately.
With the development of artificial intelligence, deep learning methods [9,10] have achieved considerable success in the field of computer vision. Several studies [11,12,13] have shown that convolutional neural networks (CNNs) have excellent feature extraction capabilities and can accurately extract image features at different scales. Medical image classification using CNNs requires fusing feature maps from different scales while taking into account both local and global information. Representative models used for COVID-19 classification include VGG networks, ResNet networks, and high-resolution networks [14]. These experimental results suggest that local feature extraction from medical images using CNNs is feasible. However, CNNs sample at fixed locations, and their limited receptive fields lead to poor global modeling capability, so they cannot effectively learn image features that follow the complex variations of lesion regions. COVID-19 CXR images show pulmonary consolidation and ground-glass opacities with commonly irregular shapes, such as hazy, patchy, diffuse, and reticular-nodular patterns, which greatly increases the difficulty of COVID-19 detection [15,16]. Consequently, improving feature extraction from infected regions with complex shapes and establishing long-distance dependencies between features are the keys to recognizing COVID-19 accurately. The Transformer is a state-of-the-art sequence encoder whose core idea is self-attention [17], which can establish long-range dependencies between feature vectors and improve feature extraction and representation. Vision Transformer (ViT) [18,19] is the representative model. Experimental results indicate that extracting global information from medical images using a pure Transformer is practicable, but it tends to incur excessive memory and computational costs. As a result, some studies [20,21] have shown that combining convolution and Transformer into a hybrid network model can help improve classification performance while reducing computational cost.
In this paper, we design a new deep learning network model (BoT-ViTNet) based on ResNet50 for automatic image classification to help doctors identify COVID-19 and other viral pneumonias more accurately. The network first combines the advantages of CNNs for extracting local feature representations with multi-headed self-attention for global information modeling. Then, TRT-ViT blocks are used in the final stage to fuse global and local feature information. This addresses the problem of learning feature representations from infected regions of CXR images with varied and complex shapes, thereby significantly improving the classification performance of the model. The main contributions are as follows:
  • A novel model called BoT-ViTNet is constructed for COVID-19 image classification and it can simultaneously extract both local feature information and global semantic information from infected regions with complex shapes of CXR images, achieving good classification performance.
  • Multi-headed self-attention (MSA) is introduced in the last Bottleneck block of the first three stages of ResNet50 to enhance the ability to model global information.
  • In the final stage of ResNet50, the TRT-ViT block [22] is used to replace the original bottleneck block, which can extract both global and local feature information to enhance the feature representation and correlation between feature locations, so as to help identify complex lesion areas in CXR images.
The remainder of this paper is organized as follows. Section 2 briefly reviews related work, including convolutional neural networks, vision Transformers, and hybrid network models. Section 3 presents the proposed BoT-ViTNet in detail and describes each of its components. In Section 4, extensive experiments are conducted to demonstrate the effectiveness of BoT-ViTNet, and the experimental results are discussed and analyzed. Finally, Section 5 concludes the paper.

2. Related Work

2.1. Convolutional Neural Network

In recent years, CNN models have been widely used in computer vision tasks such as image classification, object detection, and semantic segmentation. Anand et al. [23] fine-tuned a VGGNet [24] with an input size of 200 × 200 and tested it with three different pooling layers to obtain high classification accuracy. Rajpal et al. [25] designed a new classification method consisting of three modules. In the first module, they used ResNet50 [26] for feature extraction, alleviating the network degradation problem through the skip connections in the residual blocks. In the second module, a pool of frequency- and texture-based features was constructed from selected features and further reduced using PCA before being passed to a feed-forward neural network. The third module concatenated the features obtained from the first two modules, passed them to a dense layer, and classified them with the softmax function. Sarker et al. [27] proposed COVID-DenseNet, which used the densely connected DenseNet-121 [28] as a feature extractor and a fully connected layer with softmax activation as a classifier for COVID-19 patient detection. This method reduces the number of parameters and the computational complexity through feature reuse and can effectively mitigate the vanishing-gradient problem. However, challenges remain when using CNNs for COVID-19 classification, such as the irregular shapes and complex positional information of lesions in CXR images.

2.2. Vision Transformer

The Transformer was originally applied in natural language processing with significant results. ViT [18,19] showed that the Transformer can also achieve strong performance in computer vision tasks. ViT performs self-attention by mapping a series of image patches to semantic tokens, which helps capture long-range relationships between sequence elements. The Data-efficient Image Transformer (DeiT) [19] applies a knowledge-distillation strategy on top of the ViT architecture: the self-attention mechanism in the encoder attends to different regions of the image and integrates this with whole-image information, and classification is completed by two connected classifiers. Jalalifar et al. [29] found experimentally that DeiT, without a single convolutional layer, matched the performance of DenseNet169, showing that ViT can be applied to medical image analysis tasks. To reduce computation, the Swin Transformer [30] computes self-attention within non-overlapping local windows, which lowers computational complexity, but the sparse attention it employs limits the ability to model long-range relationships. Currently, researchers are focusing more on efficiency, including the effectiveness of self-attention and various training strategies.

2.3. Hybrid Network Models

Recent research [31,32] has indicated that combining convolution and Transformer into a hybrid network model can fully incorporate the benefits of both. Rao et al. [33] enabled the model to focus more on deep semantic information by introducing a self-attention mechanism into a CNN. Lin et al. [34] proposed an adaptive attention network (AANet). The network first uses a deformable ResNet to learn feature representations adapted to the diversity of COVID-19 features; it then uses a self-attention mechanism to model non-local interactions and learn rich contextual information to detect complex-shaped lesion regions, which improves recognition efficiency. Aboutalebi et al. [35] proposed a multi-scale encoder-decoder self-attention (MEDUSA) model to address overlapping image appearance. The model improves the ability to model global long-range spatial context by introducing self-attention modules, achieving good classification performance on several datasets. Li et al. [36] proposed UniFormer, which effectively unifies convolution and self-attention in a concise Transformer format to overcome local redundancy and capture global dependencies, achieving better performance on image classification tasks.

3. Method

To identify COVID-19 CXR images accurately, we propose a novel model called BoT-ViTNet, whose architecture is shown in Figure 1. The BoT-ViTNet model contains three parts. In the first part, we use the Bottleneck block for local feature extraction from the lesion regions of CXR images. In the second part, a multi-headed self-attention (MSA) block is introduced to learn the contextual information of the extracted features, which enhances the global modeling capability of the feature information. In the last part, the TRT-ViT block is used to extract both local and global feature information to further enhance the feature representation and the correlation between feature locations.
The general structure of BoT-ViTNet is similar to ResNet50: it also comprises four stages consisting of 3, 4, 6, and 3 blocks, respectively, as shown in Figure 1. The CXR image is first passed through a 7 × 7 convolution layer with a stride of 2 and a 3 × 3 pooling layer with a stride of 2 to obtain a feature map with a resolution of 56 × 56 × 64. The feature map is then input into the Bottleneck block, which comes in two residual convolution variants, as shown in Figure 1a,b. The Bottleneck block expands the channels of the feature map when the stride is 1 and performs a downsampling operation to increase the receptive field when the stride is 2. After passing through two Bottleneck blocks sequentially, the feature map is input to the MSA block, whose structure is shown in Figure 1c. The MSA block learns the global information of image features and establishes long-range dependencies to enhance the expression of features. To further fuse global and local information, the feature maps are input to the TRT-ViT block after being processed by multiple Bottleneck blocks and MSA blocks. The structure of the TRT-ViT block is shown in Figure 1d. The TRT-ViT block extracts global features and local information with a Transformer and a Bottleneck, respectively, and then fuses them, which improves the expression of features and the correlation of positions between features. The global average pooling layer integrates the global spatial information of the fused features, and the output features are mapped to the softmax layer for probability prediction. BoT-ViTNet can capture both the deep global semantic information and the shallow local texture information of CXR images. It inherits the advantages of the Transformer and CNN, improving recognition performance. Table 1 shows the structural details of the BoT-ViTNet model.
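To make the stage layout above concrete, the following PyTorch sketch shows one way such a backbone could be assembled, with a torchvision ResNet50 standing in for the paper's implementation: the last Bottleneck of stages 1–3 is swapped for a channel-preserving self-attention block, and the classifier is reduced to the three CXR classes. The SpatialSelfAttention module and the builder function are illustrative placeholders, not the authors' code, and the stage-4 TRT-ViT replacement (Section 3.3) is only indicated by a comment.

```python
# Sketch of the BoT-ViTNet stage layout (assumption: torchvision's ResNet50 is an
# acceptable stand-in; this is not the authors' released code).
import torch
import torch.nn as nn
from torchvision.models import resnet50


class SpatialSelfAttention(nn.Module):
    """Channel-preserving multi-head self-attention over the H*W positions of a feature map."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = self.norm(x.flatten(2).transpose(1, 2))      # (B, H*W, C) tokens
        out, _ = self.attn(seq, seq, seq)                  # global information exchange
        return x + out.transpose(1, 2).reshape(b, c, h, w)  # residual connection


def build_bot_vitnet_like(num_classes: int = 3) -> nn.Module:
    net = resnet50()
    # Stages 1-3: replace the last Bottleneck of each stage with a self-attention block
    # (Section 3.2). Note that attention at 56 x 56 resolution is memory-hungry.
    for name, channels in [("layer1", 256), ("layer2", 512), ("layer3", 1024)]:
        stage = getattr(net, name)
        setattr(net, name, nn.Sequential(*list(stage)[:-1], SpatialSelfAttention(channels)))
    # Stage 4 ("layer4") would be rebuilt from TRT-ViT blocks (Section 3.3); left unchanged here.
    net.fc = nn.Linear(net.fc.in_features, num_classes)   # COVID / Normal / Viral
    return net


if __name__ == "__main__":
    model = build_bot_vitnet_like()
    logits = model(torch.randn(1, 3, 224, 224))            # CXR images resized to 224 x 224
    print(logits.shape)                                    # torch.Size([1, 3])
```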

3.1. Bottleneck Block

Unlike feature extraction with standard convolutions, the Bottleneck block can reduce the computational complexity of the model while extracting features. Therefore, we use the Bottleneck block for local feature extraction from the complex lesion regions in CXR images. The Bottleneck block consists of two 1 × 1 convolutions and a 3 × 3 depth-wise convolution. The first 1 × 1 convolution reduces the number of channels of the feature map so that feature extraction can be performed more efficiently. The 3 × 3 depth-wise convolution extracts the local feature information of the image. The second 1 × 1 convolution expands the number of channels so that the output feature map has the same number of channels as the input feature map and the residual summation can be performed. The Bottleneck structure greatly reduces the number of parameters and the amount of computation, thus improving computational efficiency. In addition, a residual connection is added to each output to avoid network degradation and over-fitting. The residual block is computed as follows:
y = F(x) + x (1)
where x denotes the input feature map, y denotes the output feature map, and F(x) represents the convolution operation. The residual connection allows earlier layers of the network to be linked directly to later layers, which alleviates the vanishing-gradient problem during back-propagation.
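As an illustration of the block just described, here is a minimal PyTorch sketch of a Bottleneck with a 1 × 1 reduction, a 3 × 3 depth-wise convolution, a 1 × 1 expansion, and the residual summation of Equation (1). The channel sizes, reduction ratio, and BatchNorm/ReLU placement are assumptions for illustration rather than the paper's exact configuration.

```python
# Minimal sketch of the Bottleneck block: 1x1 reduce -> 3x3 depth-wise -> 1x1 expand,
# plus the residual summation y = F(x) + x of Equation (1). Channel sizes and the
# reduction ratio are illustrative assumptions.
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, channels: int, reduction: int = 4, stride: int = 1):
        super().__init__()
        mid = channels // reduction
        self.f = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),          # reduce channels
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, stride=stride, padding=1,
                      groups=mid, bias=False),                            # 3x3 depth-wise conv
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),          # expand channels back
            nn.BatchNorm2d(channels),
        )
        # When stride = 2, the identity path is downsampled so that it matches F(x).
        self.shortcut = (nn.Identity() if stride == 1 else
                         nn.Conv2d(channels, channels, kernel_size=1, stride=stride, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.f(x) + self.shortcut(x))                    # y = F(x) + x


x = torch.randn(1, 256, 56, 56)
print(Bottleneck(256)(x).shape)   # torch.Size([1, 256, 56, 56])
```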

3.2. MSA Block

To enhance the long-range dependencies of the features, we introduce the MSA block into ResNet50, as shown in Figure 1c. MSA [37] is an essential component of the Transformer that can unite feature information from different locations representing different subspaces. It is an extension of self-attention (SA) that runs k SA operations in parallel and projects their concatenated outputs. We first review the basic SA module, which is widely used in neural network architectures. SA is the core idea of the Transformer and has a weak inductive bias. By performing similarity calculations, it can establish long-distance dependencies between feature vectors and improve feature extraction and expression. The input of each SA consists of a query Q, a key K, and a value V, which are linear transformations of the input sequence obtained by multiplying it with the weight matrices W_Q, W_K, and W_V, respectively; these matrices are learned during training. We use scaled dot-product attention for the similarity calculation among vectors, with the following equation:
SA(X) = softmax(QK^T / √d) V (2)
where X denotes the input sequence, SA(·) represents the SA operation, and d denotes the dimension of each head.
MSA concatenates k single-head self-attentions and performs a linear projection operation on them with the following equation:
X_m = MSA(X) = Concat[SA_1(X), …, SA_k(X)] W_m (3)
In Equation (3), X_m is the output of MSA, MSA(·) denotes the MSA operation, Concat[·] denotes the concatenation of feature maps with the same dimension, and W_m is a learnable linear transformation.
CNNs have a strong inductive bias and can effectively extract local texture information from feature maps in shallow layers, whereas MSA has a weak inductive bias and can establish long-range dependencies of features in deep layers. Consequently, combining CNN with MSA yields powerful feature representation capability and high accuracy.
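The two equations above translate directly into code. The sketch below implements Equations (2) and (3) from scratch in PyTorch: k single-head scaled dot-product attentions computed in parallel, concatenated, and projected by W_m. Tensor shapes and head counts are illustrative, and the module is a generic MSA layer rather than the authors' exact block.

```python
# From-scratch sketch of Equations (2)-(3): k scaled dot-product attention heads run
# in parallel, concatenated, and projected by W_m. Dimensions are illustrative.
import math
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.h, self.d = num_heads, embed_dim // num_heads        # d = dimension of each head
        self.w_q = nn.Linear(embed_dim, embed_dim, bias=False)    # W_Q
        self.w_k = nn.Linear(embed_dim, embed_dim, bias=False)    # W_K
        self.w_v = nn.Linear(embed_dim, embed_dim, bias=False)    # W_V
        self.w_m = nn.Linear(embed_dim, embed_dim, bias=False)    # W_m in Equation (3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:           # x: (B, L, C)
        b, l, c = x.shape
        # Project the input and split it into k heads: (B, h, L, d).
        q = self.w_q(x).view(b, l, self.h, self.d).transpose(1, 2)
        k = self.w_k(x).view(b, l, self.h, self.d).transpose(1, 2)
        v = self.w_v(x).view(b, l, self.h, self.d).transpose(1, 2)
        # Equation (2): SA(X) = softmax(Q K^T / sqrt(d)) V, computed per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, l, c)          # Concat[SA_1(X), ..., SA_k(X)]
        return self.w_m(out)                                       # Equation (3)


tokens = torch.randn(2, 196, 512)   # e.g., a 14 x 14 feature map flattened into 196 tokens
print(MultiHeadSelfAttention(512, 8)(tokens).shape)   # torch.Size([2, 196, 512])
```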

3.3. TRT-ViT Block

To further improve the feature representation, the TRT-ViT block [22], consisting of a Transformer and a Bottleneck, is introduced in the last stage of ResNet50; it adopts a global-then-local hybrid block pattern for feature extraction. As described in [38], a Transformer, with its larger receptive field, can extract global information from the feature map and enable information exchange, whereas a convolution with a small receptive field can only extract local information. The TRT-ViT block fully combines the advantages of the Transformer and the Bottleneck, which enhances the expression of features and the correlation of positions between features, thus helping to identify complex lesion regions in CXR images.
The network structure of the TRT-ViT block is shown in Figure 1d, which first uses Transformer to model the global information and then uses Bottleneck to extract the local information. Transformer is calculated as follows:
X = X_in + MSA(Norm(X_in))
X_out = X + MLP(Norm(X)) (4)
where X_in ∈ R^(H×W×C) is the input feature map and X_out ∈ R^(H×W×C) is the output feature map. We first perform a channel reduction using a 1 × 1 convolution with a stride of 1, reducing the number of channels of the feature map to half of the original. Then, to capture the long-range dependencies of features in complex lesion regions of the image, we use MSA in the Transformer to extract global information from the feature map and exchange information within each channel. Finally, the exchanged global features are passed to the multilayer perceptron (MLP) layer to improve the network's ability to acquire image context, which helps to identify complex lesions in CXR images. After the Transformer operation, the feature map containing global information is fed into Bottleneck blocks to learn local spatial information. We connect the extracted global features with the local features to enhance the expression of the features and the correlation between feature positions, improving recognition accuracy.
The Transformer aims to establish global connections between features, whereas convolution captures only local information. The computational cost of the Transformer and the Bottleneck is almost equal when the resolution of the input feature map is low, indicating that placing the Transformer at a later stage of the network helps to balance performance and efficiency [39]. It has further been shown that a global-then-local hybrid block pattern helps to identify complex lesion regions in CXR images. Consequently, the Bottleneck block in the last stage of ResNet50 is replaced by the TRT-ViT block and cross-stacked, which effectively extracts local texture information and global semantic information from infected regions with complex shapes and performs feature fusion, achieving both high performance and high accuracy.
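The following sketch illustrates the global-then-local pattern described in this section: a pre-norm Transformer layer implementing Equation (4) models global context over the flattened feature map, and a small convolutional Bottleneck then refines local detail with a residual connection. The exact channel-halving and fusion scheme of the TRT-ViT block [22] is simplified here, and all dimensions are assumptions for illustration.

```python
# Sketch of a global-then-local block: a pre-norm Transformer layer (Equation (4))
# followed by a convolutional Bottleneck, with illustrative channel sizes. The exact
# channel-halving and fusion used by TRT-ViT [22] is simplified here.
import torch
import torch.nn as nn


class GlobalThenLocalBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8, mlp_ratio: int = 2):
        super().__init__()
        # Transformer part: X = X_in + MSA(Norm(X_in)); X_out = X + MLP(Norm(X)).
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio), nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )
        # Bottleneck part: 1x1 reduce -> 3x3 -> 1x1 expand for local spatial information.
        mid = channels // 4
        self.local = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                   # (B, H*W, C) tokens
        n = self.norm1(t)
        t = t + self.attn(n, n, n)[0]                      # global information via MSA
        t = t + self.mlp(self.norm2(t))                    # MLP sub-layer
        g = t.transpose(1, 2).reshape(b, c, h, w)          # global features, back to a map
        return torch.relu(g + self.local(g))               # local refinement + residual fusion


x = torch.randn(1, 1024, 14, 14)                           # a stage-4-scale feature map
print(GlobalThenLocalBlock(1024)(x).shape)                 # torch.Size([1, 1024, 14, 14])
```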

4. Results and Discussion

4.1. Datasets

We used the COVID-19 Radiography database [39] as the experimental data; it was collected by researchers from Qatar University and Dhaka University in collaboration with physicians from Pakistan and Malaysia. The dataset contains 15,169 CXR images from 15,153 patients, including 3616 COVID-19-positive cases (COVID), 1345 viral pneumonia cases (Viral), and 10,192 normal, uninfected cases (Normal). Sample CXR images from the COVID-19 Radiography database are shown in Figure 2. In the experiments, the dataset is divided into a training set and a testing set in the ratio 6:4, used respectively to train the model parameters and to validate the classification accuracy. There are 9094 images in the training set and 6059 images in the testing set. The number of images in each category is shown in Table 2. To further demonstrate the robustness of our model, another CXR dataset, Coronahack [40], was used. This dataset has 5922 CXR images, including 1576 normal images and 4346 pneumonia images. Detailed information about the dataset is shown in Table 3. We divided this dataset in the ratio 8:2 to obtain 4737 training images and 1185 test images; sample CXR images from this dataset are shown in Figure 3.
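For reference, a data-preparation sketch with the 6:4 split might look like the following. The directory layout (one sub-folder per class) and the root path are assumptions; the downloaded database would need to be arranged accordingly, and the random split here is not the authors' exact partition.

```python
# Sketch of a 6:4 split of the COVID-19 Radiography data. The folder layout
# (one sub-directory per class) and the path "data/covid19_radiography" are
# assumptions for illustration.
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),   # all CXR images are resized to 224 x 224
    transforms.ToTensor(),
])

full = datasets.ImageFolder("data/covid19_radiography", transform=tfm)  # COVID / Normal / Viral
n_train = int(0.6 * len(full))                                          # 6:4 train/test split
train_set, test_set = random_split(
    full, [n_train, len(full) - n_train],
    generator=torch.Generator().manual_seed(0),                         # reproducible split
)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False, num_workers=4)
print(len(train_set), len(test_set))
```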

4.2. Experimental Details

The specific configuration of the experimental environment and parameters is shown in Table 4. Our programming environment used the deep learning framework PyTorch 1.9.0 and the programming language Python 3.8, and the operating system was Ubuntu 18.04. During training, we resize all CXR images to 224 × 224 and use the Adam optimizer for model optimization, with the learning rate set to 0.001, the number of iterations set to 100, and the batch size set to 64. All experiments are performed on an RTX A4000 GPU with 16 GB of memory.
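A minimal training-loop sketch matching the settings in Table 4 (Adam optimizer, batch size 64, 224 × 224 inputs, cross-entropy loss) is shown below. A plain torchvision ResNet50 and a small synthetic dataset stand in for BoT-ViTNet and the CXR DataLoader so that the snippet is self-contained; in practice, the model and loaders from the previous sections would be used.

```python
# Minimal training-loop sketch with the Table 4 settings (Adam, batch size 64,
# 224 x 224 inputs). A plain ResNet50 and synthetic tensors stand in for BoT-ViTNet
# and the CXR DataLoader so the snippet runs on its own.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = resnet50()                                    # stand-in backbone
model.fc = nn.Linear(model.fc.in_features, 3)         # 3 classes: COVID / Normal / Viral
model = model.to(device)

# Synthetic stand-in data; in practice the CXR DataLoader from Section 4.1 is used.
images, labels = torch.randn(64, 3, 224, 224), torch.randint(0, 3, (64,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # learning rate as listed in Table 4

model.train()
for epoch in range(2):                                # the paper trains for 100 iterations
    running_loss = 0.0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * x.size(0)
    print(f"epoch {epoch + 1}: loss = {running_loss / len(loader.dataset):.4f}")
```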

4.3. Evaluation Metrics

In order to verify the validity and robustness of the BoT-ViTNet model, the confusion matrix and commonly used evaluation metrics were selected for effect evaluation in this experiment, including accuracy, precision, sensitivity, specificity, and F1-score. The equations for each indicator are as follows:
Accuracy (Acc.) = Nc / Nt (5)
Precision (Pre.) = TP / (TP + FP) (6)
Sensitivity (Sen.) = TP / (TP + FN) (7)
Specificity (Spe.) = TN / (TN + FP) (8)
F1-score (F1.) = 2 × Pre. × Sen. / (Pre. + Sen.) (9)
where Nc is the number of correctly predicted cases and Nt is the total number of predicted cases. TP (True Positive) denotes the number of correctly predicted COVID-19 positive cases. TN (True Negative) represents the number of correctly predicted normal and viral pneumonia cases. FP (False Positive) is the number of normal or viral pneumonia cases misdiagnosed as COVID-19 positive. FN (False Negative) indicates the number of COVID-19 positive cases misdiagnosed as normal or viral pneumonia.
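These definitions can be computed one-vs-rest from a multi-class confusion matrix, as in the short sketch below. The example matrix is invented purely for illustration and does not correspond to the paper's results.

```python
# One-vs-rest computation of the metrics above from a multi-class confusion matrix
# (rows = true class, columns = predicted class). The matrix below is invented for
# illustration only and is not taken from the paper's results.
import numpy as np


def per_class_metrics(cm: np.ndarray, cls: int) -> dict:
    tp = cm[cls, cls]
    fn = cm[cls, :].sum() - tp        # class 'cls' predicted as another class
    fp = cm[:, cls].sum() - tp        # other classes predicted as 'cls'
    tn = cm.sum() - tp - fn - fp
    pre = tp / (tp + fp)              # Precision, Eq. (6)
    sen = tp / (tp + fn)              # Sensitivity, Eq. (7)
    spe = tn / (tn + fp)              # Specificity, Eq. (8)
    f1 = 2 * pre * sen / (pre + sen)  # F1-score, Eq. (9)
    return {"Pre.": pre, "Sen.": sen, "Spe.": spe, "F1.": f1}


# Hypothetical 3-class confusion matrix; rows/columns ordered [COVID, Normal, Viral].
cm = np.array([[ 88,   7,  5],
               [  6, 180, 14],
               [  4,   3, 93]])

print(f"Acc. = {np.trace(cm) / cm.sum():.4f}")        # Nc / Nt, Eq. (5)
for i, name in enumerate(["COVID", "Normal", "Viral"]):
    print(name, {k: round(float(v), 4) for k, v in per_class_metrics(cm, i).items()})
```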

4.4. Experimental Results and Analysis

4.4.1. Comparison of Classification Effects of Different Models

To validate the effectiveness of the BoT-ViTNet model, we compare it experimentally with several common deep learning models. The classification results on the COVID-19 Radiography database are shown in Table 5. From Table 5, the BoT-ViTNet model achieves the highest values of 98.91%, 97.80%, 98.76%, 99.13%, and 98.27% in classification accuracy, precision, sensitivity, specificity, and F1-score, respectively. Compared with the other deep learning models, the classification accuracy of our model is improved by 1.98%, 1.80%, 1.72%, 1.53%, 2.11%, 4.19%, 5.99%, 4.70%, 1.29%, and 0.79%, respectively. Table 6 provides detailed results of the different models and BoT-ViTNet on the Coronahack dataset. From Table 6, the BoT-ViTNet model achieves the highest values of 98.40%, 97.99%, 97.89%, 97.89%, and 97.94% in classification accuracy, precision, sensitivity, specificity, and F1-score, respectively. Compared with the other deep learning models, the classification accuracy of our model is improved by 2.03%, 1.10%, 1.27%, 0.93%, 1.71%, 3.29%, 6.42%, 4.48%, 0.68%, and 0.43%, respectively. These results show that the BoT-ViTNet model makes full use of the advantages of convolution and MSA for CXR image classification: it can extract the local texture information of CXR images as well as capture their global semantic information. Meanwhile, using the global-then-local hybrid block pattern (TRT-ViT) to acquire image information at a later stage is more efficient and helps to identify complex lesions in the images, achieving higher classification performance.
Figure 4 shows the confusion matrices of the different models on the test set of the COVID-19 Radiography database. The identification results for COVID, Normal, and Viral can be read directly from the confusion matrices. The CXR images in the test set are largely concentrated on the diagonal, indicating that these images are correctly classified into the categories to which they belong. Meanwhile, the BoT-ViTNet model misclassifies only 21 COVID-19 cases and correctly predicts 1426 COVID-19 cases, with an error rate of only 1.47%. Consequently, the BoT-ViTNet model can effectively and robustly identify COVID-19 cases.
Figure 5 shows the confusion matrices of the different models on the test set of the Coronahack dataset: the BoT-ViTNet model misclassifies only 9 pneumonia cases and correctly predicts 860 pneumonia cases, with an error rate of only 1.46%. Consequently, BoT-ViTNet can effectively and robustly identify pneumonia. The data in Figure 5 illustrate the good classification performance of BoT-ViTNet.

4.4.2. Comparison of Loss Curves of Different Models

Figure 6 and Figure 7 show the training loss curves of ResNet50, ViT, AlterNet, and BoT-ViTNet on the COVID-19 Radiography database and the Coronahack dataset. The curves show that BoT-ViTNet converges relatively quickly, whereas the ViT model converges the most slowly; AlterNet and ResNet50 converge at similar rates. These results suggest that, on the same dataset, the BoT-ViTNet model has a shorter training time, a lower training loss curve, and a faster rate of convergence, reaching a good optimum more quickly and improving training efficiency.

4.4.3. Analysis of Classification Results by Each Category

Table 7 and Table 8 report the per-category performance of BoT-ViTNet on the COVID-19 Radiography database and the Coronahack dataset, respectively.
As can be seen from the experimental results in Table 7, BoT-ViTNet achieves high classification results for COVID-19 case recognition on the COVID-19 Radiography database, with precision, sensitivity, specificity, and F1-score of 98.55%, 99.00%, 99.67%, and 98.77%, respectively. Table 8 indicates that the BoT-ViTNet model achieves high classification results for pneumonia case recognition on the Coronahack dataset, with precision, sensitivity, specificity, and F1-score of 98.85%, 98.96%, 96.83%, and 98.90%, respectively. These results illustrate the good recognition performance of BoT-ViTNet.

4.4.4. Ablation Experiments

In this section, we perform ablation experiments to verify the performance impact of introducing the TRT-ViT block and MSA block to replace the Bottleneck block in ResNet50. The results of the ablation experiment on the COVID-19 Radiography database and Coronahack dataset are shown in Table 9 and Table 10.
(1) The effect of the MSA block: To show the impact of replacing part of the Bottleneck blocks with MSA blocks on the classification results, we compare configurations that use only Bottleneck blocks for feature extraction in the first three stages, as shown in Table 9. For No. 3, removing the MSA block from BoT-ViTNet results in a significant degradation of performance on the dataset: compared with No. 4, the accuracy, precision, sensitivity, and F1-score of No. 3 decrease by 1.09%, 0.08%, 2.18%, and 1.12%, respectively. In contrast, after introducing the MSA block in the last Bottleneck block of the first three stages, No. 2 improves over the original ResNet50 (No. 1) by 0.53%, 1.63%, 0.70%, and 0.77% in accuracy, sensitivity, specificity, and F1-score, respectively. This shows that the MSA block is important for the performance of the BoT-ViTNet model.
(2) The effect of the TRT-ViT block: We also explored the contribution of introducing the TRT-ViT block in the last stage of ResNet50, as shown in Table 9. In comparison with the original ResNet50 (No. 1), the classification results of No. 3 using the TRT-ViT block show a significant improvement in accuracy, precision, sensitivity, specificity, and F1-score of 0.71%, 1.05%, 1.51%, 1.51%, and 1.32%, respectively. These results demonstrate that the TRT-ViT block plays a crucial role in BoT-ViTNet. The data in Table 10 show that BoT-ViTNet also achieves good classification performance on the Coronahack dataset.
As mentioned above, the MSA block and the TRT-ViT block can effectively improve the performance of COVID-19 classification in BoT-ViTNet. As shown in Table 9 and Table 10, the classification effect of No. 4 is superior to the other settings in most metrics. These results show that the MSA block and the TRT-ViT block are important components of BoT-ViTNet for achieving good classification results.

4.4.5. Robustness of BoT-ViTNet

To further verify the robustness of the BoT-ViTNet model, we conducted additional CXR image classification experiments with different batch sizes. The batch sizes selected were 4, 8, 16, 32, and 64, and the datasets were the same as those used in the classification experiments above. The classification results for different batch sizes on the COVID-19 Radiography database and the Coronahack dataset are shown in Table 11 and Table 12.
From the data in Table 11 and Table 12, it can be observed that when the batch size is 4, the BoT-ViTNet model has the worst classification performance, with classification accuracies of 96.39% and 96.03% on the two datasets. When the batch size is 64, the BoT-ViTNet model performs best, with classification accuracies of 98.91% and 98.40%, a considerable improvement. During training, an overly small batch size leads to long training times and gradient oscillation, which is not conducive to the convergence of the model parameters.

5. Conclusions

In this paper, we designed the BoT-ViTNet model for COVID-19 image classification based on ResNet50. First, the MSA block is introduced in the last Bottleneck block of the first three stages of ResNet50 to enhance the ability to model global information. Then, to further enhance the correlation between features and the feature representation, the TRT-ViT block, which consists of a Transformer and a Bottleneck, is used in the final stage of ResNet50 to fuse global and local information and improve the recognition of complex lesion regions in CXR images. Finally, the extracted features are delivered to the global average pooling layer for global spatial information integration in a concatenated way and used for classification. The image classification results on the publicly accessible COVID-19 Radiography database and Coronahack dataset show that the BoT-ViTNet model achieves better results. The overall accuracy, precision, sensitivity, specificity, and F1-score of the BoT-ViTNet model on the COVID-19 Radiography database are 98.91%, 97.80%, 98.76%, 99.13%, and 98.27%, respectively. The BoT-ViTNet model also recognizes COVID-19 well, with precision, sensitivity, specificity, and F1-score of 98.55%, 99.00%, 99.67%, and 98.77%, respectively. Compared with other classification models, the BoT-ViTNet model performs better at recognizing and classifying COVID-19 images. Although the BoT-ViTNet model achieves good results for the classification of COVID-19 images, further clinical studies and tests are still required.

Author Contributions

Methodology, H.C., T.Z. and R.C.; conceptualization, T.Z.; software, T.Z. and R.C.; validation, H.C., Z.Z. and X.W.; writing—original draft preparation, H.C., T.Z., R.C. and Z.Z.; writing—review and editing, H.C., T.Z., R.C. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (grant number 61170060) and the Key Teaching Research Project of Anhui Province (grant number 2020jyxm0458).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wu, C.; Chen, X.; Cai, Y.; Zhou, X.; Xu, S.; Huang, H.; Zhang, L.; Zhou, X.; Du, C.; Zhang, Y. Risk factors associated with acute respiratory distress syndrome and death in patients with coronavirus disease 2019 pneumonia in Wuhan, China. JAMA Intern. Med. 2020, 180, 934–943. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Cucinotta, D.; Vanelli, M. WHO declares COVID-19 a pandemic. Acta Biomed. 2020, 91, 157–160. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7569573 (accessed on 23 November 2022).
  3. Wu, Q.; Xing, Y.; Shi, L.; Li, W.; Gao, Y.; Pan, S.; Wang, Y.; Wang, W.; Xing, Q. Coinfection and Other Clinical Characteristics of COVID-19 in Children. Pediatrics 2020, 146, e20200961. [Google Scholar] [CrossRef] [PubMed]
  4. Wang, W.; Xu, Y.; Gao, R.; Lu, R.; Han, K.; Wu, G.; Tan, W. Detection of SARS-CoV-2 in different types of clinical specimens. JAMA 2020, 323, 1843–1844. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Mohamadou, Y.; Halidou, A.; Kapen, P.T. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of COVID-19. Appl. Intell. 2020, 50, 3913–3925. [Google Scholar] [CrossRef] [PubMed]
  6. Ito, R.; Iwano, S.; Naganawa, S. A review on the use of artificial intelligence for medical imaging of the lungs of patients with coronavirus disease 2019. Diagn. Interv. Radiol. 2020, 26, 443–448. [Google Scholar] [CrossRef] [PubMed]
  7. Ng, M.-Y.; Lee, E.Y.; Yang, J.; Yang, F.; Li, X.; Wang, H.; Lui, M.M.-s.; Lo, C.S.-Y.; Leung, B.; Khong, P.-L. Imaging profile of the COVID-19 infection: Radiologic findings and literature review. Radiol. Cardiothorac Imaging 2020, 2, e200034. [Google Scholar] [CrossRef] [Green Version]
  8. Wu, X.; Hui, H.; Niu, M.; Li, L.; Wang, L.; He, B.; Yang, X.; Li, L.; Li, H.; Tian, J. Deep learning-based multi-view fusion model for screening 2019 novel coronavirus pneumonia: A multicentre study. Eur. J. Radiol. 2020, 128, 109041. [Google Scholar] [CrossRef]
  9. Abdullah, S.M.S.; Abdulazeez, A.M. Facial expression recognition based on deep learning convolution neural network: A review. J. Soft Comput. Data Min. 2021, 2, 53–65. Available online: https://publisher.uthm.edu.my/ojs/index.php/jscdm/article/view/7906 (accessed on 23 November 2022).
  10. Ardakani, A.A.; Kanafi, A.R.; Acharya, U.R.; Khadem, N.; Mohammadi, A. Application of deep learning technique to manage COVID-19 in routine clinical practice using CT images: Results of 10 convolutional neural networks. Comput. Biol. Med. 2020, 121, 103795. [Google Scholar] [CrossRef]
  11. Li, L.; Qin, L.; Xu, Z.; Yin, Y.; Wang, X.; Kong, B.; Bai, J.; Lu, Y.; Fang, Z.; Song, Q. Using Artificial Intelligence to Detect COVID-19 and Community-acquired Pneumonia Based on Pulmonary CT: Evaluation of the Diagnostic Accuracy. Radiology 2020, 296, E65–E71. [Google Scholar] [CrossRef] [PubMed]
  12. Singh, D.; Kumar, V.; Kaur, M. Classification of COVID-19 patients from chest CT images using multi-objective differential evolution–based convolutional neural networks. Eur. J. Clin. Microbiol. Infect. Dis. 2020, 39, 1379–1389. [Google Scholar] [CrossRef] [PubMed]
  13. Narin, A.; Kaya, C.; Pamuk, Z. Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks. Pattern Anal. Appl. 2021, 24, 1207–1220. [Google Scholar] [CrossRef]
  14. Hong, G.; Chen, X.; Chen, J.; Zhang, M.; Ren, Y.; Zhang, X. A multi-scale gated multi-head attention depthwise separable CNN model for recognizing COVID-19. Sci. Rep. 2021, 11, 18048. [Google Scholar] [CrossRef]
  15. Jacobi, A.; Chung, M.; Bernheim, A.; Eber, C. Portable chest X-ray in coronavirus disease-19 (COVID-19): A pictorial review. Clin. Imaging 2020, 64, 35–42. [Google Scholar] [CrossRef]
  16. Stogiannos, N.; Fotopoulos, D.; Woznitza, N.; Malamateniou, C. COVID-19 in the radiology department: What radiographers need to know. Radiography 2020, 26, 254–263. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. Available online: https://arxiv.org/abs/2010.11929 (accessed on 20 November 2022).
  19. Al Rahhal, M.M.; Bazi, Y.; Jomaa, R.M.; AlShibli, A.; Alajlan, N.; Mekhalfi, M.L.; Melgani, F. COVID-19 detection in ct/x-ray imagery using vision transformers. J. Pers. Med. 2022, 12, 310. [Google Scholar] [CrossRef] [PubMed]
  20. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Kuala Lumpur, Malaysia, 18–20 December 2021; pp. 16519–16529. [Google Scholar]
  21. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 22–31. [Google Scholar]
  22. Xia, X.; Li, J.; Wu, J.; Wang, X.; Wang, M.; Xiao, X.; Zheng, M.; Wang, R. TRT-ViT: TensorRT-oriented Vision Transformer. arXiv 2022, arXiv:2205.09579. Available online: https://arxiv.org/abs/2205.09579 (accessed on 23 August 2022).
  23. Anand, R.; Sowmya, V.; Gopalakrishnan, E.; Soman, K. Modified Vgg deep learning architecture for Covid-19 classification using bio-medical images. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1084, 12001. [Google Scholar] [CrossRef]
  24. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. Available online: https://arxiv.org/abs/1409.1556 (accessed on 7 June 2022).
  25. Rajpal, S.; Lakhyani, N.; Singh, A.K.; Kohli, R.; Kumar, N. Using handpicked features in conjunction with ResNet-50 for improved detection of COVID-19 from chest X-ray images. Chaos Solitons Fractals 2021, 145, 110749. [Google Scholar] [CrossRef]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Sarker, L.; Islam, M.M.; Hannan, T.; Ahmed, Z. COVID-DenseNet: A deep learning architecture to detect COVID-19 from chest radiology images. Preprints 2020, 2020050151. Available online: https://www.preprints.org/manuscript/202005.0151/v3 (accessed on 20 August 2022).
  28. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  29. Jalalifar, S.A.; Sadeghi-Naini, A. Data-Efficient Training of Pure Vision Transformers for the Task of Chest X-ray Abnormality Detection Using Knowledge Distillation. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; pp. 1444–1447. [Google Scholar]
  30. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  31. Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5270–5279. [Google Scholar]
  32. Fan, X.; Feng, X.; Dong, Y.; Hou, H. COVID-19 CT image recognition algorithm based on transformer and CNN. Displays 2022, 72, 102150. [Google Scholar] [CrossRef] [PubMed]
  33. Rao, A.; Park, J.; Woo, S.; Lee, J.-Y.; Aalami, O. Studying the Effects of Self-Attention for Medical Image Analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 3416–3425. [Google Scholar]
  34. Lin, Z.; He, Z.; Xie, S.; Wang, X.; Tan, J.; Lu, J.; Tan, B. AANet: Adaptive attention network for COVID-19 detection from chest X-ray images. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4781–4792. [Google Scholar] [CrossRef] [PubMed]
  35. Aboutalebi, H.; Pavlova, M.; Gunraj, H.; Shafiee, M.J.; Sabri, A.; Alaref, A.; Wong, A. MEDUSA: Multi-Scale Encoder-Decoder Self-Attention Deep Neural Network Architecture for Medical Image Analysis. Front. Med. 2021, 8, 821120. [Google Scholar] [CrossRef]
  36. Li, K.; Wang, Y.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv 2022, arXiv:2201.04676. Available online: https://arxiv.org/abs/2201.04676 (accessed on 16 May 2022).
  37. Rao, R.M.; Liu, J.; Verkuil, R.; Meier, J.; Canny, J.; Abbeel, P.; Sercu, T.; Rives, A. Msa transformer. In Proceedings of the 38th International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8844–8856. [Google Scholar]
  38. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2021; pp. 9355–9366. [Google Scholar]
  39. Oyelade, O.N.; Ezugwu, A.E.-S.; Chiroma, H. CovFrameNet: An enhanced deep learning framework for COVID-19 detection. IEEE Access 2021, 9, 77905–77919. [Google Scholar] [CrossRef]
  40. Hilmizen, N.; Bustamam, A.; Sarwinda, D. The multimodal deep learning for diagnosing COVID-19 pneumonia from chest CT-scan and X-ray images. In Proceedings of the 2020 3rd International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Yogyakarta, Indonesia, 10–11 December 2020; pp. 26–31. [Google Scholar]
  41. Khan, E.; Rehman, M.Z.U.; Ahmed, F.; Alfouzan, F.A.; Alzahrani, N.M.; Ahmad, J. Chest X-ray classification for the detection of COVID-19 using deep learning techniques. Sensors. 2022, 22, 1211. [Google Scholar] [CrossRef]
  42. Giełczyk, A.; Marciniak, A.; Tarczewska, M.; Lutowski, Z. Pre-processing methods in chest X-ray image classification. PLoS ONE 2022, 17, e0265949. [Google Scholar] [CrossRef]
  43. Hamza, A.; Khan, M.A.; Wang, S.-H.; Alqahtani, A.; Alsubai, S.; Binbusayyis, A.; Hussein, H.S.; Martinetz, T.M.; Alshazly, H. COVID-19 classification using chest X-ray images: A framework of CNN-LSTM and improved max value moth flame optimization. Front Public Health. 2022, 10, 948205. [Google Scholar] [CrossRef]
  44. Park, N.; Kim, S. How Do Vision Transformers Work? arXiv 2022, arXiv:2202.06709. Available online: https://arxiv.org/abs/2202.06709 (accessed on 18 July 2022).
Figure 1. The whole structure of the BoT-ViTNet model. (a) Bottleneck Block(s = 2); (b) Bottleneck Block(s = 1); (c) MSA Block; (d) TRT-ViT Block.
Figure 2. Sample CXR images from the COVID-19 Radiography database.
Figure 3. Sample CXR images from the Coronahack dataset.
Figure 4. The confusion matrices on the test set of the COVID-19 Radiography database. (a) VGG16; (b) ResNet50; (c) DenseNet121; (d) [41]; (e) [42]; (f) [43]; (g) ViT; (h) Swin Transformer; (i) Uniformer; (j) AlterNet; (k) BoT-ViTNet.
Figure 5. The confusion matrices on the test set of the Coronahack dataset. (a) VGG16; (b) ResNet50; (c) DenseNet121; (d) [41]; (e) [42]; (f) [43]; (g) ViT; (h) Swin Transformer; (i) Uniformer; (j) AlterNet; (k) BoT-ViTNet.
Figure 6. Loss variation curves of different models on the COVID-19 Radiography database.
Figure 7. Loss variation curves of different models on the Coronahack dataset.
Table 1. The structure details and specific parameters of the BoT-ViTNet model.
Stage | Output | Operation
Conv | 56 × 56 × 64 | 7 × 7, 64, stride = 2; 3 × 3 max pool, stride = 2
stage 1 | 56 × 56 × 256 | [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 2, [MSA, 256] × 1
stage 2 | 28 × 28 × 512 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 3, [MSA, 512] × 1
stage 3 | 14 × 14 × 1024 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 5, [MSA, 1024] × 1
stage 4 | 7 × 7 × 2048 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 1, [MSA, 1024; 1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 2
Output | 1 × 1 × 3 | average pool, fc, softmax
Table 2. Detailed information about the COVID-19 Radiography database.
Dataset | COVID | Viral | Normal | Total
Training set | 2169 | 808 | 6116 | 9093
Testing set | 1447 | 537 | 4076 | 6060
Table 3. Detailed information about the Coronahack dataset.
Dataset | Pneumonia | Normal | Total
Training set | 3477 | 1260 | 4737
Testing set | 869 | 316 | 1185
Table 4. The configuration of the environment and parameters for the experiment.
Experimental Environment | Configuration Description
CPU | 12-Core Intel(R) Xeon(R) Gold 5320 CPU @ 2.20 GHz
GPU | RTX A4000
Operating system | Ubuntu 18.04
Development environment | PyCharm 2021.3
CUDA | 11.1
Programming environment | PyTorch 1.9.0, Python 3.8
Optimizer | Adam
Learning rate | 0.0001
Iterations | 100
Batch size | 64
Image resolution | 224 × 224
Table 5. The classification results of different models on the COVID-19 Radiography database.
Model | Acc. | Pre. | Sen. | Spe. | F1.
VGG16 [24] | 96.93% | 96.87% | 94.83% | 97.63% | 95.83%
ResNet50 [26] | 97.11% | 96.67% | 95.07% | 97.77% | 95.86%
DenseNet121 [28] | 97.19% | 96.77% | 95.43% | 97.83% | 96.07%
Khan et al. [41] | 97.38% | 96.83% | 95.37% | 98.00% | 96.07%
Giełczyk et al. [42] | 96.80% | 95.88% | 94.65% | 97.63% | 95.25%
Hamza et al. [43] | 94.72% | 95.28% | 90.93% | 95.49% | 92.96%
ViT [18] | 92.92% | 92.47% | 88.33% | 94.50% | 90.27%
Swin Transformer [30] | 94.21% | 94.97% | 89.87% | 95.00% | 92.23%
Uniformer [36] | 97.62% | 95.71% | 97.60% | 98.00% | 96.63%
AlterNet [44] | 98.12% | 97.40% | 96.41% | 98.83% | 96.90%
BoT-ViTNet | 98.91% | 97.80% | 98.76% | 99.13% | 98.27%
Table 6. The classification results of different models on the Coronahack dataset.
Model | Acc. | Pre. | Sen. | Spe. | F1.
VGG16 [24] | 96.37% | 95.57% | 95.11% | 95.11% | 95.34%
ResNet50 [26] | 97.30% | 96.35% | 96.07% | 96.07% | 96.21%
DenseNet121 [28] | 97.13% | 96.50% | 96.13% | 96.13% | 96.31%
Khan et al. [41] | 97.47% | 96.85% | 96.66% | 96.66% | 96.75%
Giełczyk et al. [42] | 96.69% | 96.00% | 95.54% | 95.54% | 95.77%
Hamza et al. [43] | 95.11% | 94.50% | 92.83% | 92.83% | 93.62%
ViT [18] | 91.98% | 90.71% | 88.39% | 88.39% | 89.46%
Swin Transformer [30] | 93.92% | 93.02% | 91.22% | 91.22% | 92.07%
Uniformer [36] | 97.72% | 97.22% | 96.93% | 96.93% | 97.08%
AlterNet [44] | 97.97% | 97.32% | 97.51% | 97.51% | 97.41%
BoT-ViTNet | 98.40% | 97.99% | 97.89% | 97.89% | 97.94%
Table 7. Recognition results of the BoT-ViTNet model on the COVID-19 Radiography database.
Category | Pre. | Sen. | Spe. | F1.
COVID | 98.55% | 99.00% | 99.67% | 98.77%
Normal | 99.51% | 99.00% | 97.88% | 99.25%
Viral | 95.34% | 98.27% | 99.84% | 96.78%
Average | 97.80% | 98.76% | 99.13% | 98.27%
Table 8. Recognition results of the BoT-ViTNet model on the Coronahack dataset.
Category | Pre. | Sen. | Spe. | F1.
Pneumonia | 98.85% | 98.96% | 96.83% | 98.90%
Normal | 97.14% | 96.83% | 98.96% | 96.98%
Average | 97.99% | 97.89% | 97.89% | 97.94%
Table 9. Results of the ablation experiment on the COVID-19 Radiography database.
No. | MSA Block | TRT-ViT Block | Acc. | Pre. | Sen. | Spe. | F1.
1 | × | × | 97.11% | 96.67% | 95.07% | 97.77% | 95.83%
2 | ✓ | × | 97.64% | 96.63% | 96.70% | 98.47% | 96.60%
3 | × | ✓ | 97.82% | 97.72% | 96.58% | 99.28% | 97.15%
4 | ✓ | ✓ | 98.91% | 97.80% | 98.76% | 99.13% | 98.27%
Table 10. Results of the ablation experiment on the Coronahack dataset.
No. | MSA Block | TRT-ViT Block | Acc. | Pre. | Sen. | Spe. | F1.
1 | × | × | 97.30% | 96.35% | 96.07% | 96.07% | 96.21%
2 | ✓ | × | 97.55% | 96.83% | 96.92% | 96.92% | 96.87%
3 | × | ✓ | 98.01% | 97.21% | 97.85% | 97.85% | 97.53%
4 | ✓ | ✓ | 98.40% | 97.99% | 97.89% | 97.89% | 97.94%
Table 11. Classification results of different batch sizes on the COVID-19 Radiography database.
Batch Size | Acc. | Pre. | Sen. | Spe. | F1.
4 | 96.39% | 95.29% | 92.71% | 97.18% | 93.94%
8 | 97.18% | 96.10% | 94.54% | 97.87% | 95.30%
16 | 97.24% | 96.03% | 94.62% | 98.01% | 95.31%
32 | 97.96% | 97.37% | 95.88% | 98.80% | 96.61%
64 | 98.91% | 97.80% | 98.76% | 99.13% | 98.27%
Table 12. Classification results of different batch sizes on the Coronahack dataset.
Batch Size | Acc. | Pre. | Sen. | Spe. | F1.
4 | 96.03% | 95.22% | 94.58% | 94.58% | 94.89%
8 | 96.96% | 96.46% | 95.71% | 95.71% | 96.08%
16 | 97.21% | 96.57% | 96.28% | 96.28% | 96.42%
32 | 97.64% | 96.89% | 97.08% | 97.08% | 96.98%
64 | 98.40% | 97.99% | 97.89% | 97.89% | 97.94%
