A Survey on Deep Learning in COVID-19 Diagnosis

According to the World Health Organization statistics, as of 25 October 2022, there have been 625,248,843 confirmed cases of COVID-19, including 65,622,281 deaths worldwide. The spread and severity of COVID-19 are alarming. The economy and life of countries worldwide have been greatly affected. The rapid and accurate diagnosis of COVID-19 directly affects the spread of the virus and the degree of harm. Currently, the classification of chest X-ray or CT images based on artificial intelligence is an important method for COVID-19 diagnosis. It can assist doctors in making judgments and reduce the misdiagnosis rate. The convolutional neural network (CNN) is very popular in computer vision applications, such as applied to biological image segmentation, traffic sign recognition, face recognition, and other fields. It is one of the most widely used machine learning methods. This paper mainly introduces the latest deep learning methods and techniques for diagnosing COVID-19 using chest X-ray or CT images based on the convolutional neural network. It reviews the technology of CNN at various stages, such as rectified linear units, batch normalization, data augmentation, dropout, and so on. Several well-performing network architectures are explained in detail, such as AlexNet, ResNet, DenseNet, VGG, GoogleNet, etc. We analyzed and discussed the existing CNN automatic COVID-19 diagnosis systems from sensitivity, accuracy, precision, specificity, and F1 score. The systems use chest X-ray or CT images as datasets. Overall, CNN has essential value in COVID-19 diagnosis. All of them have good performance in the existing experiments. If expanding the datasets, adding GPU acceleration and data preprocessing techniques, and expanding the types of medical images, the performance of CNN will be further improved. This paper wishes to make contributions to future research.


Introduction
WHO declared COVID-19 as a global pandemic in March 2020. The COVID-19 pandemic has affected thousands of people. It has affected people's ordinary lives and the global economy. COVID-19 can cause respiratory, gastrointestinal, and neurological syndromes [1]. Cough, fever, and other respiratory issues are the most common symptoms [2].
Rapid COVID-19 diagnosis is essential during the pandemic. Currently, the commonly used diagnostic methods include the molecular assay, chest computed tomography (CT) scan combined with the evaluation of clinical symptoms [3], artificial intelligence (AI) methods [4], potential electrochemical (EC) biosensors [5], surface plasmon resonance (SPR)-based biosensors [6], field-effect transistor (FET)-based biosensors [7], etc. As a frequently employed auxiliary detection technology, CT images can show the changes in the lung caused by virus infection [8]. Compared with CT images, X-ray images are more accessible to obtain. It is also an important means of medical detection.
The most common diagnostic measure for COVID-19 is through reverse transcriptionpolymerase chain reaction (RT-PCR) assays of nasopharyngeal swabs [9]. However, the high false negative rate of RT-PCR [10] may affect the timely treatment of infected patients.

Convolutional Neural Networks
Convolutional neural networks (CNN) are widely applied to computer vision currently. In the aspect of medical image processing about health, CNN performs outstanding [50]. CNN uses multi-layer superposition to extract from low-level features to high-level features. It is like the hierarchical structure of the human brain function [51][52][53][54]. A CNN comprises an input layer, an output layer, convolution layers, pooling layers, and fully connected layers. Input the raw data in the input layer. Convolution operations are performed to extract features in the convolution layers. The scale of parameters is further reduced in the pooling layers. Fully connected layers connect all the features and output them to the classifier.
CNN is a kind of local perception. Divide the entire image into multiple small windows that can overlap locally. The local features of the image are identified by utilizing sliding windows. A window can be regarded as a filter, which is a neuron. The convolutional layer has a set of such filters. Figure 1 is a 6 × 6 image. The size of filter one is 3 × 3. Put filter one in the upper left corner of the 6 × 6 image to do the convolution operation, and obtain the result −1. The step size of the sliding window is one. Filter one moves one step to the right and then performs the convolution operation, and the result is −2. Do convolution operation step by step from top to bottom and from left to right, and a 4 × 4 matrix can be obtained as shown in Equation (1). Doing the same operation with filter2 will obtain another different 4 × 4 matrix.
where n in is the number of input features. p is the size of padding. k is the kernel size. s is the stride. n out is the number of output features.
fully connected layers. Input the raw data in the input layer. Convolution operations are performed to extract features in the convolution layers. The scale of parameters is further reduced in the pooling layers. Fully connected layers connect all the features and output them to the classifier. CNN is a kind of local perception. Divide the entire image into multiple small windows that can overlap locally. The local features of the image are identified by utilizing sliding windows. A window can be regarded as a filter, which is a neuron. The convolutional layer has a set of such filters. Figure 1 is a 6 × 6 image. The size of filter one is 3 × 3. Put filter one in the upper left corner of the 6 × 6 image to do the convolution operation, and obtain the result −1. The step size of the sliding window is one. Filter one moves one step to the right and then performs the convolution operation, and the result is −2. Do convolution operation step by step from top to bottom and from left to right, and a 4 × 4 matrix can be obtained as shown in Equation (1). Doing the same operation with filter2 will obtain another different 4 × 4 matrix.
where is the number of input features. is the size of padding. is the kernel size. is the stride.
is the number of output features. The dimensionality of the convolution results can be further reduced in the pooling layers. The location of the pooling layer is usually in the middle of the two convolutional layers [55]. Pooling layers reduce the feature model size of the previous convolutional layer, speed up the calculation, and reduce the probability of overfitting [56]. Pooling operations divide the feature image into regions and choose a value in one pooling area to replace. In this way, the data and parameters of the feature image are compressed. Average pooling [57] and max pooling [58] are the two most typical pooling operations.
The fully connected layers come after the convolution layers and the pooling layers. There may be one or more fully connected layers [59][60][61]. In the fully connected layers, each neuron is fully connected to all neurons in the previous layer. The local features of the previous outputs are integrated into global features through fully connected layers.

Batch Normalization
A deep neural network may have many hidden layers. When training each layer, the parameters are updated, causing the input data distribution of the upper layer to change accordingly. Layer-by-layer accumulation will cause the input distribution of the front layer to change drastically. There will be changes not only in the input layer but also in the hidden layers. The phenomenon is defined as Internal Covariate Shift (ICS) [62]. The front layers should constantly adapt to the update of the parameters in the last layers, which will reduce the training efficiency of the whole network.
Whitening the neural network's input layer can make the network training converge faster [63,64]. The whitening operation refers to linearly transforming the input data to reach a normal distribution with means of 0 and variances of 1. Each neuron in hidden layers can be seen as input to the next layer. Scholars can perform a whitening operation The dimensionality of the convolution results can be further reduced in the pooling layers. The location of the pooling layer is usually in the middle of the two convolutional layers [55]. Pooling layers reduce the feature model size of the previous convolutional layer, speed up the calculation, and reduce the probability of overfitting [56]. Pooling operations divide the feature image into regions and choose a value in one pooling area to replace. In this way, the data and parameters of the feature image are compressed. Average pooling [57] and max pooling [58] are the two most typical pooling operations.
The fully connected layers come after the convolution layers and the pooling layers. There may be one or more fully connected layers [59][60][61]. In the fully connected layers, each neuron is fully connected to all neurons in the previous layer. The local features of the previous outputs are integrated into global features through fully connected layers.

Batch Normalization
A deep neural network may have many hidden layers. When training each layer, the parameters are updated, causing the input data distribution of the upper layer to change accordingly. Layer-by-layer accumulation will cause the input distribution of the front layer to change drastically. There will be changes not only in the input layer but also in the hidden layers. The phenomenon is defined as Internal Covariate Shift (ICS) [62]. The front layers should constantly adapt to the update of the parameters in the last layers, which will reduce the training efficiency of the whole network.
Whitening the neural network's input layer can make the network training converge faster [63,64]. The whitening operation refers to linearly transforming the input data to reach a normal distribution with means of 0 and variances of 1. Each neuron in hidden layers can be seen as input to the next layer. Scholars can perform a whitening operation on each hidden layer neuron to solve the ICS. Transform increasingly skewed data distributions to more standard normal distributions. This is Batch Normalization (BN). BN is located after the input linear activation and before the nonlinear transformation. That is, BN is performed on the activation values a i of neurons in each hidden layer. If there are n activation values in a mini-batch, the process of BN is as follows [62]: (1) Calculate the mean µ, as shown in Equation (2); J. Imaging 2023, 9, 1 6 of 31 (2) Calculate the variance δ 2 , as shown in Equation (3); (3) Normalize a * i , as shown in Equation (4); where e is a constant used to ensure numerical stability. (4) Scale and shift, as shown in Equation (5).
ϕ is the parameter that scales the normalized value. ω is the parameter that shifts the normalized value. These parameters are learnable. Simply normalizing the input may lead to changing what the input represents. The addition of these parameters can restore the representation of the network.

Dropout Technology
Between the input and the output layers, there are hidden layers. A deep neural network (DNN) usually has many hidden layers. The training set can be modeled correctly by adapting the weights on the incoming connections of the neurons in hidden layers [60]. However, the weight matrix performs poorly on the test set and has poor generalization ability. This phenomenon is called overfitting.
Dropout can solve the problem of overfitting. Dropout means that some neurons in the network are dropped out. However, instead of being deleted, they are temporarily dropped out of the network, including their output and input connections [65], as shown in Figure 2. The neurons that are dropped out are randomly selected. Each neuron is left with a fixed probability p. The neurons that are left form new networks thinner than the original network. These networks use the same weight matrix, so there is no increase in the number of parameters.
butions to more standard normal distributions. This is Batch Normalization (BN). BN is located after the input linear activation and before the nonlinear transformation. That is, BN is performed on the activation values of neurons in each hidden layer. If there are activation values in a mini-batch, the process of BN is as follows [62]: (1) Calculate the mean , as shown in Equation (2); (2) Calculate the variance 2 , as shown in Equation (3); (3) Normalize * , as shown in Equation (4); * = − where e is a constant used to ensure numerical stability.
is the parameter that scales the normalized value. is the parameter that shifts the normalized value. These parameters are learnable. Simply normalizing the input may lead to changing what the input represents. The addition of these parameters can restore the representation of the network.

Dropout Technology
Between the input and the output layers, there are hidden layers. A deep neural network (DNN) usually has many hidden layers. The training set can be modeled correctly by adapting the weights on the incoming connections of the neurons in hidden layers [60]. However, the weight matrix performs poorly on the test set and has poor generalization ability. This phenomenon is called overfitting.
Dropout can solve the problem of overfitting. Dropout means that some neurons in the network are dropped out. However, instead of being deleted, they are temporarily dropped out of the network, including their output and input connections [65], as shown in Figure 2. The neurons that are dropped out are randomly selected. Each neuron is left with a fixed probability p. The neurons that are left form new networks thinner than the original network. These networks use the same weight matrix, so there is no increase in the number of parameters.  When the network is training, a neuron is left with probability p (p is randomly generated from a Bernoulli distribution). It is connected to the next layer with a weight w. Then the weight value is ultimately p × w. To make the weights connecting the neurons in the next layer consistent with those during training, the neurons during testing are always present, and the weights are reduced to p × w, as shown in Figure 3.
When the network is training, a neuron is left with probability ( is randomly generated from a Bernoulli distribution). It is connected to the next layer with a weight . Then the weight value is ultimately × . To make the weights connecting the neurons in the next layer consistent with those during training, the neurons during testing are always present, and the weights are reduced to × , as shown in Figure 3.  The dropout operation of the neural network is as Equation (6) [65]: where is a vector produced by the probability of the neurons in the hidden layer . is the output vector of hidden layer .
is the input vector of the hidden layer . and are the weights and biases of hidden layer l. ( ) is the activation function. ̃ is the output vector of hidden layer l with the dropout. ̃ is processed to be used as the input of the hidden layer + 1.

ReLU
As the activation function, ReLU functions increase the nonlinear relationship between the layers of CNN to complete complex tasks. The neurons of each layer in the convolutional neural network are weighted and shifted, then output to the next layer, as shown in Equation (7).
where x is the input vector, y is the output vector. w is the weight vector for the layer, and b is the bias vector for the layer. If there are two hidden layers, the input is weighted and biased twice to obtain the output, as shown in Equations (8) and (9). = 2 ( 1 + 1 ) + 2 (8) From Equation (9), it can be concluded that multiple hidden layers can be combined into one layer if only linear transformations of weights and biases are applied to the input. The linear expression ability is limited and insufficient to fit nonlinear neural networks. The activation functions process the outputs of the neurons in the previous layers. The processing results are passed as inputs to the neurons of the next layer. Activation functions can add nonlinear factors. Rectified linear unit (ReLU) is a kind of activation function commonly used in CNN. ReLU function is defined as Equation (10): The dropout operation of the neural network is as Equation (6) [65]: where r l is a vector produced by the probability p l of the neurons in the hidden layer l. d l is the output vector of hidden layer l. c l is the input vector of the hidden layer l. w l and b l are the weights and biases of hidden layer l. f () is the activation function. d l is the output vector of hidden layer l with the dropout. d l is processed to be used as the input of the hidden layer l + 1.

ReLU
As the activation function, ReLU functions increase the nonlinear relationship between the layers of CNN to complete complex tasks. The neurons of each layer in the convolutional neural network are weighted and shifted, then output to the next layer, as shown in Equation (7).
where x is the input vector, y is the output vector. w is the weight vector for the layer, and b is the bias vector for the layer. If there are two hidden layers, the input is weighted and biased twice to obtain the output, as shown in Equations (8) and (9).
From Equation (9), it can be concluded that multiple hidden layers can be combined into one layer if only linear transformations of weights and biases are applied to the input. The linear expression ability is limited and insufficient to fit nonlinear neural networks. The activation functions process the outputs of the neurons in the previous layers. The processing results are passed as inputs to the neurons of the next layer. Activation functions can add nonlinear factors. Rectified linear unit (ReLU) is a kind of activation function commonly used in CNN. ReLU function is defined as Equation (10): ReLU is illustrated in Figure 4a. The ReLU function is a piecewise linear function. It has unilateral inhibition, which enables sparse activation of neurons in a deep neural network. Sparsity networks are more helpful for mining features. The ReLU function accelerates the convergence, and the gradient vanishing problem is solved [66]. A large parameter update may cause some ReLU neurons to never be activated. The phenomenon of "dying ReLU" appeared [67]. Variants of the ReLU function appeared to solve the problem, such as LReLU (Figure 4b ReLU is illustrated in Figure 4a. The ReLU function is a piecewise linear function. It has unilateral inhibition, which enables sparse activation of neurons in a deep neural network. Sparsity networks are more helpful for mining features. The ReLU function accelerates the convergence, and the gradient vanishing problem is solved [66]. A large parameter update may cause some ReLU neurons to never be activated. The phenomenon of "dying ReLU" appeared [67]. Variants of the ReLU function appeared to solve the problem, such as LReLU (Figure 4b

Leaky ReLU
Leaky ReLU (LReLU) solves the dying ReLU problem using a small slope [68]. The LReLU function is defined as Equation (11): where is generally 0.01. The LReLU has the advantage that if the input of the LReLU activation function is less than zero, the gradient can also be calculated, unlike ReLU, which keeps the gradient zero.

Leaky ReLU
Leaky ReLU (LReLU) solves the dying ReLU problem using a small slope [68]. The LReLU function is defined as Equation (11): where β is generally 0.01. The LReLU has the advantage that if the input of the LReLU activation function is less than zero, the gradient can also be calculated, unlike ReLU, which keeps the gradient zero.

Parametric ReLU
Parametric rectified linear unit (PReLU) is a variant of LeakyReLU. The parameter β of PReLU [69] is not set artificially but is obtained through training. The parameter is the key to improving the classification performance. PReLU is defined as Equation (12): When β = 0, PReLU becomes ReLU. β controls the slope of the negative axis.

Randomized Leaky ReLU
Randomized leaky rectified linear unit (RReLU) is another variant of LeakyReLU. RReLU is defined as Equation (13). Parameter β is randomly valued during training and becomes a fixed value during testing. During the training process, the distribution of β satisfies the normal distribution with a standard deviation of 1 and a mean of 0. When testing, Srivastava [70] et al. took an average of all β using the method of dropout. The NDSB competitor suggested that it would be better if β is sampled from C (3,8).

Exponential Linear Unit
The exponential linear unit (ELU) is also an activation function. It has similarities and differences with ReLU. ELU is an activation function with negative values, and it is defined as Equation (14): where β controls the saturation value for negative net input [71]. ELU solves the problem of gradient vanishing. The mean values of the ELU output are close to zero, which makes the convergence faster. Because ELU is nonzero for negative values, there is no "dying ReLU" phenomenon. ELU is an unsaturated function, and there is no problem with vanishing gradients or exploding gradients.

Pooling
Pooling usually follows the convolution operation. The essence of pooling is sampling. The dimension of the input feature map can be reduced using pooling by choosing an appropriate method. Pooling can reduce the operation's complexity and improve the operation's speed. The addition of pooling layers in the architecture of CNN can reduce the scale of weights and parameters [72], reducing the input's dimensions and memory consumption [73][74][75]. The use of pooling layers can also solve overfitting [76].

Max Pooling
Max pooling is an extensively applied pooling method in CNNs [58]. This method is popular because it does not require tuning parameters. The max pooling technique is to find the max element in the pooling region [77,78]. Max pooling can exhibit the max value of the feature map in the k × k neighborhood. Not only the scale of the feature map space is optimized, but also the translation invariance of the network is preserved by max pooling [79].
In Max pooling, the whole image is divided into several blocks of the same size that do not overlap. We discard all other nodes and take only the max value in each block. For example, there is a pooling region of size 4 × 4. The pooling filter size is 2 × 2, and the stride is 2. These setups enable pooling to be applied to regions of the image that do not overlap. The pooling filter selects the max value of each block to obtain the final output. Figure 5 shows the process of max pooling. Max pooling focuses on the max element and discards the others. This may lead to wrong results of the disappearance of salient features [80].
discards the others. This may lead to wrong results of the disappearance of salient features [80].

Average Pooling
The input sample is divided into multiple blocks of the same size and not overlapping each other. Average pooling is to calculate the average value of all elements in each block and present output. Use the input sample in Section 3.4.1. Similarly, the average pooling filter size is 2 × 2, and the stride is 2. The purpose is to ensure that the areas swept by the pooling filter do not overlap. Calculate the average of all values in each 2 × 2 pooling block as the output. Figure 6 shows the process of average pooling. In average pooling, if the averages are low, the contrast will be diminished [79]. The convolutional features will be decreased [81] if most of the averages are zero. Wang et al. [82] proposed using a rank-based average pooling module in the network for COVID-19 COVID-19 recognition to avoid overfitting. Fractional max pooling (FMP) is a special max pooling. It has an important parameter called the pooled fraction. Because represents the ratio of the input-spatial size to the output-spatial size of the pooling region, the image size can be reduced by setting . For regular max pooling, is set to an integer. The size of the feature map is rapidly reduced. For fractional max pooling, is set to a non-integer (a fraction). The decay rate of the feature map will slow down. If the focus is on accuracy, set ∈ (1,2), as shown in Figure  7. Suppose the focus is on speed, set ∈ (2,3) [83].

Average Pooling
The input sample is divided into multiple blocks of the same size and not overlapping each other. Average pooling is to calculate the average value of all elements in each block and present output. Use the input sample in Section 3.4.1. Similarly, the average pooling filter size is 2 × 2, and the stride is 2. The purpose is to ensure that the areas swept by the pooling filter do not overlap. Calculate the average of all values in each 2 × 2 pooling block as the output. Figure 6 shows the process of average pooling. In average pooling, if the averages are low, the contrast will be diminished [79]. The convolutional features will be decreased [81] if most of the averages are zero. Wang et al. [82] proposed using a rank-based average pooling module in the network for COVID-19 COVID-19 recognition to avoid overfitting. discards the others. This may lead to wrong results of the disappearance of salient features [80].

Average Pooling
The input sample is divided into multiple blocks of the same size and not overlapping each other. Average pooling is to calculate the average value of all elements in each block and present output. Use the input sample in Section 3.4.1. Similarly, the average pooling filter size is 2 × 2, and the stride is 2. The purpose is to ensure that the areas swept by the pooling filter do not overlap. Calculate the average of all values in each 2 × 2 pooling block as the output. Figure 6 shows the process of average pooling. In average pooling, if the averages are low, the contrast will be diminished [79]. The convolutional features will be decreased [81] if most of the averages are zero. Wang et al. [82] proposed using a rank-based average pooling module in the network for COVID-19 COVID-19 recognition to avoid overfitting. Fractional max pooling (FMP) is a special max pooling. It has an important parameter called the pooled fraction. Because represents the ratio of the input-spatial size to the output-spatial size of the pooling region, the image size can be reduced by setting . For regular max pooling, is set to an integer. The size of the feature map is rapidly reduced. For fractional max pooling, is set to a non-integer (a fraction). The decay rate of the feature map will slow down. If the focus is on accuracy, set ∈ (1,2), as shown in Figure  7. Suppose the focus is on speed, set ∈ (2,3) [83].

Fractional Max Pooling
Fractional max pooling (FMP) is a special max pooling. It has an important parameter µ called the pooled fraction. Because µ represents the ratio of the input-spatial size to the output-spatial size of the pooling region, the image size can be reduced by setting µ. For regular max pooling, µ is set to an integer. The size of the feature map is rapidly reduced. For fractional max pooling, µ is set to a non-integer (a fraction). The decay rate of the feature map will slow down. If the focus is on accuracy, set µ ∈ (1, 2), as shown in Figure 7. Suppose the focus is on speed, set µ ∈ (2, 3) [83]. There are two types of pooling regions: disjoint rectangular areas and overlapping square areas. The pooling regions can be chosen in pseudorandom and random ways. FMP works better in a random way [83]. Wang et al. [84] used fractional max pooling (FMP) instead of max pooling and average pooling in a network for COVID-19 recognition. There are two types of pooling regions: disjoint rectangular areas and overlapping square areas. The pooling regions can be chosen in pseudorandom and random ways. FMP works better in a random way [83]. Wang et al. [84] used fractional max pooling (FMP) instead of max pooling and average pooling in a network for COVID-19 recognition.

Other Popular Pooling Methods
Both average pooling and max pooling have disadvantages. Yu et al. [85] proposed a way to mix the two. This method is called mixed pooling. The weights of average pooling and max pooling are merged [86], which can overcome their shortcomings. In many experiments, the performance of mixed pooling is better than that of max pooling and average pooling in image classification [87][88][89].
Tree pooling [90] uses the data from pooling filters to learn and combines these learned filters. The performance of tree pooling is better than that of average pooling, max pooling, and mixed pooling. However, more parameters must be learned than mixed average and max pooling [91]. Tree pooling is suitable for lower network layers focusing on functional responses [92].
Fergus and Zeiler et al. [59] proposed stochastic pooling. Stochastic pooling randomly selects values. It is not suitable for negative activations. With little training data, stochastic pooling tends to generate strong activation, leading to overfitting. Rank-based stochastic pooling [93] may solve this problem. It estimates the ranking of the activations within the pooling region. Stochastic pooling needs to address issues related to scaling [94]. Wang et al. [95] proposed the stochastic pooling module and a stochastic pooling neural network for COVID-19 diagnosis. Zhang et al. [96] proposed a deep-learning model for COVID-19 diagnosis. It used stochastic pooling instead of max pooling and average pooling.

Data Augmentation
Data augmentation (DA) is an effective way to expand datasets. The larger the size of the data, the better the generalization ability of the trained model. When the number of samples in the dataset is relatively small, data augmentation increases the amount of training data by modifying the existing training data [97]. Data augmentation is mainly used to prevent overfitting, especially when the training dataset is small. There are many ways for data augmentation [98][99][100][101].

Geometric Transforms
Geometric transforms are standard methods of data augmentation. Training data can be effectively increased with geometric transforms [102]. Flipping is one of the easiest ways to implement data augmentation [103]. Flipping is a mirror image reflected along a line. Horizontal axis flipping is commonly used. Figure 8a shows the original image. Figure 8b shows the image after horizontal flipping. Figure 9a shows the original image. Figure      Rotation augmentation rotates the image to the left or right according to the selected angle. The image is rotated around a central point [104]. The angle of rotation should be chosen appropriately. A wide rotation may make recognition unsafe [105]. Figure 10a shows the original image. Figure 10b shows the result of a clockwise rotation of 10 degrees.
(a) Original image (b) Clockwise rotation of 10 degrees Shear is the non-vertical projection effect of a plane object on the projected plane. Horizontal shear and vertical shear are commonly used in data enhancement. If the original coordinate of a point is ( , ), the coordinate after horizontal shear is ( ℎ ′ , ℎ ′ ), as shown in Equation (15), where ℎ is horizontal shear parameter. The coordinate after the vertical shear is ( ′ , ′ ), as shown in Equation (16), where is vertical shear parameter. Rotation augmentation rotates the image to the left or right according to the selected angle. The image is rotated around a central point [104]. The angle of rotation should be chosen appropriately. A wide rotation may make recognition unsafe [105]. Figure 10a shows the original image. Figure 10b shows the result of a clockwise rotation of 10 degrees.  Rotation augmentation rotates the image to the left or right according to the selected angle. The image is rotated around a central point [104]. The angle of rotation should be chosen appropriately. A wide rotation may make recognition unsafe [105]. Figure 10a shows the original image. Figure 10b shows the result of a clockwise rotation of 10 degrees.
(a) Original image (b) Clockwise rotation of 10 degrees where ℎ is horizontal shear parameter. The coordinate after the vertical shear is ( ′ , ′ ), as shown in Equation (16), where is vertical shear parameter. Shear is the non-vertical projection effect of a plane object on the projected plane. Horizontal shear and vertical shear are commonly used in data enhancement. If the original coordinate of a point is (x, y), the coordinate after horizontal shear is x h , y h , as shown in Equation (15), where β h is horizontal shear parameter. The coordinate after the vertical shear is (x v , y v ), as shown in Equation (16), where β v is vertical shear parameter.
Cropping cuts a portion of the input image. It is an effective tool for extracting patches [106,107]. Images can be classified according to patches rather than the whole image. An image can be cropped into several pieces. It does not lose important information about the image when cropping.

Noise Injection
There are many ways for data augmentation using noise injection, such as Gaussian noise, salt-and-pepper noise, speckle noise, etc. Noise injection has been successfully used in plant leaf disease recognition [108], robot speech commands [109], fruit classification [110], and so on. Gaussian noise injection is more popular. Gaussian noise follows the normal distribution. The probability density function (PDF) f (x) of Gaussian noise can be expressed as Equation (17), where α 2 is the variance and β is the mean value. Gaussian injection means injecting random values from a Gaussian distribution into the pixels of the image.
In addition to adding noise to the input layer, you can also add noise to other layers. DeVries et al. add noise to a learned feature space instead of the input space [111]. Xie et al. [112] add noise to the loss layer. These prevent the network from overfitting.

Color Space
Implementing data augmentation in a color space is also a practical approach. The transformations are based on gray or RGB color values [113]. Color augmentation can be completed by isolating a color channel and converting an image into a representation in one color channel. Color augmentation can be completed by manipulating the RGB values to change the image's brightness. Color augmentation can be completed by changing the color histograms of the image [98].

Random Erasing
Random erasing means randomly selecting an area of the image, and its pixels are erased by random or mean pixel values [114]. It can augment the recognition of occluded images. Random erasing requires no parameters to learn and is easy to implement.
The size of the source image is M × N. The size of the randomly selected erasing rectangle region is M × N . Then, select randomly a point W = (x, y) in the source image region. If x + M < M and y + N < N, S = (x, y, x + M , y + N ) is the erasing region. Each pixel in S is assigned a random value between 0 and 255. Finally, a new image with a part of the regions erased is obtained.

Kernel Filters
Kernel filters can sharpen or blur images. The data augmentation can be obtained by convolving the kernel filter with the image. A Gaussian blur filter [115] is used to slide an n × n matrix across the image. A blurry image is yielded. An unsharp masking [116] slides an n × n matrix over the image. A sharpened image is yielded. The blurred image can resist motion blur. The sharpened image allows for more detail. Figure 11a shows the original image. Figure 11b shows the image after sharpening. Figure 11c shows the image after blurring.

Mixing Images
Mixing images is a method for data augmentation of two or more images. Calculating the average of the pixels can achieve a mixed image. This strategy of data augmentation may seem irrational, but it is effective. Ionue [117] came up with Samplepaining technology. Two images were randomly selected from the dataset, and one was overlaid on another. A new sample is obtained by superimposing two images. This simple data augmentation technology significantly improved classification accuracy. Summers and Dinneen

Mixing Images
Mixing images is a method for data augmentation of two or more images. Calculating the average of the pixels can achieve a mixed image. This strategy of data augmentation may seem irrational, but it is effective. Ionue [117] came up with Samplepaining technology. Two images were randomly selected from the dataset, and one was overlaid on another. A new sample is obtained by superimposing two images. This simple data augmentation technology significantly improved classification accuracy. Summers and Dinneen [118] mixed the images in a nonlinear manner. Takahashi et al. [119] generate a new training image by mixing four images obtained through random image cropping and patching (RICAP).

Data Augmentation Methods Based on Deep Learning
There are some DA ways based on deep learning. The feature space in CNN is a low-dimensional representation in high-level layers [98]. Feature space augmentation opens up opportunities for many vector operations for data augmentation [111]. Adversarial training helps to search the space for possible augmentations. It can improve the weakness in the boundaries of learnable decisions. Generative Adversarial Networks (GANs) are a way to obtain additional information from a dataset [120]. Figure 12 shows the architecture of GAN.

Mixing Images
Mixing images is a method for data augmentation of two or more images. Calculating the average of the pixels can achieve a mixed image. This strategy of data augmentation may seem irrational, but it is effective. Ionue [117] came up with Samplepaining technology. Two images were randomly selected from the dataset, and one was overlaid on another. A new sample is obtained by superimposing two images. This simple data augmentation technology significantly improved classification accuracy. Summers and Dinneen [118] mixed the images in a nonlinear manner. Takahashi et al. [119] generate a new training image by mixing four images obtained through random image cropping and patching (RICAP).

Data Augmentation Methods Based on Deep Learning
There are some DA ways based on deep learning. The feature space in CNN is a lowdimensional representation in high-level layers [98]. Feature space augmentation opens up opportunities for many vector operations for data augmentation [111]. Adversarial training helps to search the space for possible augmentations. It can improve the weakness in the boundaries of learnable decisions. Generative Adversarial Networks (GANs) are a way to obtain additional information from a dataset [120]. Figure 12 shows the architecture of GAN. Neural style transfer transfers the style of an image created in CNN to another image and preserves the original content [121]. Overfitting of the model can be prevented. Metalearning refers to optimizing neural networks with neural networks [122]. There is neural augmentation [123], smart augmentation [124], and auto-augmentation [125].

Pretrained Models
With the development of CNN, many mature networks perform well in image recognition, object recognition, natural language processing, and other fields. These networks Neural style transfer transfers the style of an image created in CNN to another image and preserves the original content [121]. Overfitting of the model can be prevented. Metalearning refers to optimizing neural networks with neural networks [122]. There is neural augmentation [123], smart augmentation [124], and auto-augmentation [125].

Pretrained Models
With the development of CNN, many mature networks perform well in image recognition, object recognition, natural language processing, and other fields. These networks include AlexNet, ResNet, DenseNet, VGG, GoogleNet, etc. During the COVID-19 pandemic, these networks could be fine-tuned and used to classify chest CT or chest X-ray images into different categories (such as infection and health).

AlexNet
AlexNet contains five convolutional layers and three fully connected layers [66]. Compared with LeNet, it uses more new technology [126]. Figure 13 is the structure of AlexNet. include AlexNet, ResNet, DenseNet, VGG, GoogleNet, etc. During the COVID-19 pandemic, these networks could be fine-tuned and used to classify chest CT or chest X-ray images into different categories (such as infection and health).

AlexNet
AlexNet contains five convolutional layers and three fully connected layers [66]. Compared with LeNet, it uses more new technology [126]. Figure 13 is the structure of AlexNet.
(1) AlexNet uses ReLU as the activation function. ReLU solves the problem of gradient descent. (2) For fully connected layer training, AlexNet randomly ignores some neurons using dropout to avoid model overfitting. (3) AlexNet uses overlapping max pooling in CNN. The step size in AlexNet is smaller than the size of the pooling kernel, so overlapping pooling is obtained. (4) AlexNet uses the local normalization scheme, which is more helpful for generalization. (5) AlexNet uses the computing power of parallel GPU, and the training of CNN is accelerated. (6) Two different forms of data augmentation are used in AlexNet. The first form of data augmentation is the translation and horizontal reflection. The second form of data augmentation is the change in the intensity of the image color channel. Figure 13. Structure of AlexNet.

ResNet
ResNet is developed and optimized based on AlexNet. One of the advantages of residual neural networks is identity mapping. ResNet has 152 layers. Simply increasing the  (1) AlexNet uses ReLU as the activation function. ReLU solves the problem of gradient descent. (2) For fully connected layer training, AlexNet randomly ignores some neurons using dropout to avoid model overfitting. (3) AlexNet uses overlapping max pooling in CNN. The step size in AlexNet is smaller than the size of the pooling kernel, so overlapping pooling is obtained. (4) AlexNet uses the local normalization scheme, which is more helpful for generalization. (5) AlexNet uses the computing power of parallel GPU, and the training of CNN is accelerated. (6) Two different forms of data augmentation are used in AlexNet. The first form of data augmentation is the translation and horizontal reflection. The second form of data augmentation is the change in the intensity of the image color channel.

ResNet
ResNet is developed and optimized based on AlexNet. One of the advantages of residual neural networks is identity mapping. ResNet has 152 layers. Simply increasing the depth of the network can lead to the problem of vanishing gradients or exploding gradients [127,128]. This problem has been addressed by normalized initialization [63,69,128,129] and intermediate normalization layers. However, with the network layers stacking, the accuracy of the training dataset saturates or even descends. That is the degradation. The degradation is not due to overfitting. Stacking layers in the deep model will make the training error larger [130,131]. ResNet uses a deep residual learning framework to mitigate the degradation [132].
The shallow network gradually stacks layers to make it a deep network. In a deep network, if the stacked layers are identity mapping, and the other layers copy the shallow network, the performance can be almost the same as the shallow network. A few stacked layers are called a block. x is the input to the first layer of the block. The expected mapping is G(x). The fitting function of the block is Equation (18): So G(x) is recast into D(x) + x. It is easier to approximate D(x) to G(x) − x than to approximate D(x) to G(x). G(x) − x is the residual mapping. To the extreme, the identity mapping G(x) = x can be obtained by pushing the residual D(x) to zero. Figure 14 shows that feedforward neural networks with shortcut connections can achieve D(x) + x.

DenseNet
All the front layers in DenseNet are densely connected to the back layers. Each layer connects to every other layer. One of its characteristics is that the connection of features on channels enables feature reuse. DenseNet has fewer parameters, faster computation, and better performance.
DenseNet is a densely connected convolutional network. Feature maps from all preceding layers are concatenated as additional inputs to this layer. Its feature maps are passed on as inputs to all subsequent layers, which preserves the feedforward nature of the network. If DenseNet has N layers, there are ( + 1)/2 connections. 0 is the input image of a DenseNet.
is the output of the network at layer n. Layer n receives feature maps from all preceding layers as inputs (19):

DenseNet
All the front layers in DenseNet are densely connected to the back layers. Each layer connects to every other layer. One of its characteristics is that the connection of features on channels enables feature reuse. DenseNet has fewer parameters, faster computation, and better performance.
DenseNet is a densely connected convolutional network. Feature maps from all preceding layers are concatenated as additional inputs to this layer. Its feature maps are passed on as inputs to all subsequent layers, which preserves the feedforward nature of the network. If DenseNet has N layers, there are N(N + 1)/2 connections. x 0 is the input image of a DenseNet. x n is the output of the network at layer n. Layer n receives feature maps from all preceding layers as inputs (19): where F n () is a nonlinear transformation function of layer n. It is a composite function of operations [133]. It includes batch normalization (BN), rectified linear units (ReLU), pooling [126], or convolution. Pooling and convolution in CNN will change the size of feature maps. In DenseNet, only feature maps of the same size can be densely connected. To resolve this contradiction, DenseNet divides the network into several dense blocks. The feature maps of each layer in one dense block have the same size. They are densely connected. Between two adjacent dense blocks are transition layers. Pooling and convolution in the transition layer make the feature map smaller. Figure 15 is the structure of DenseNet.
All the front layers in DenseNet are densely connected to the back layers. Each layer connects to every other layer. One of its characteristics is that the connection of features on channels enables feature reuse. DenseNet has fewer parameters, faster computation, and better performance.
DenseNet is a densely connected convolutional network. Feature maps from all preceding layers are concatenated as additional inputs to this layer. Its feature maps are passed on as inputs to all subsequent layers, which preserves the feedforward nature of the network. If DenseNet has N layers, there are ( + 1)/2 connections. 0 is the input image of a DenseNet.
is the output of the network at layer n. Layer n receives feature maps from all preceding layers as inputs (19): where () is a nonlinear transformation function of layer . It is a composite function of operations [133]. It includes batch normalization (BN), rectified linear units (ReLU), pooling [126], or convolution. Pooling and convolution in CNN will change the size of feature maps. In DenseNet, only feature maps of the same size can be densely connected. To resolve this contradiction, DenseNet divides the network into several dense blocks. The feature maps of each layer in one dense block have the same size. They are densely connected. Between two adjacent dense blocks are transition layers. Pooling and convolution in the transition layer make the feature map smaller. Figure 15 is the structure of DenseNet.

VGG
Visual Geometry Group (VGG) won first place in localization and second place in the classification on ImageNet Challenge in 2014. VGG is an improvement on AlexNet by deepening the depth of the network. VGG uses 3 × 3 convolution filters in all layers. It enables the network to add more convolutional layers. Increasing the depth of the network leads to better performance. Figure 16 is the structure of VGG16.
VGG uses a stack of three 3 × 3 convolution kernels. Three 3 × 3 convolution kernels have the same receptive field of 7 × 7. The three nonlinear rectification layers make

VGG
Visual Geometry Group (VGG) won first place in localization and second place in the classification on ImageNet Challenge in 2014. VGG is an improvement on AlexNet by deepening the depth of the network. VGG uses 3 × 3 convolution filters in all layers. It enables the network to add more convolutional layers. Increasing the depth of the network leads to better performance. Figure 16 is the structure of VGG16.
VGG uses the max pooling layer. The size of the pooling kernel is 2 × 2, wi 2. Smaller pooling kernels can capture more details of the information. Max po easier to capture the changes in the image and obtain more differences in loc mation.  Figure 16. Structure of VGG16.

GoogleNet
GoogleNet believes that increasing the number of layers (depth) or the nu VGG uses a stack of three 3 × 3 convolution kernels. Three 3 × 3 convolution kernels have the same receptive field of 7 × 7. The three nonlinear rectification layers make the decision function more discriminative. Then, the number of parameters is decreased. If the input and the output of three 3 × 3 convolution layers both have X channels, the number of parameters is 3 × 3 2 × X 2 = 27X 2 . One 7 × 7 convolution layer has the parameters is 7 2 × X 2 = 49X 2 , 81% more parameters [61].
VGG uses the max pooling layer. The size of the pooling kernel is 2×2, with stride 2. Smaller pooling kernels can capture more details of the information. Max pooling is easier to capture the changes in the image and obtain more differences in local information.

GoogleNet
GoogleNet believes that increasing the number of layers (depth) or the number of neurons at each layer (width) of the networks can improve the performance of the networks. Increasing the size of a network has two drawbacks: (1) Larger sizes of networks generate more parameters. When the data in the training set is small, too many parameters will cause the network to overfit. (2) Larger size of networks increases computation dramatically. Adding a convolution layer will result in a quadratic increase in computation.
To overcome the two drawbacks, GoogleNet uses a sparse connection instead of a full connection. Inception architecture is constructed in GoogleNet to realize sparse connections [134]. The idea of Inception is to find the optimal local sparse structure in the network and repeat it. The filter sizes of Inception are 1 × 1, 3 × 3, and 5×5. The output filter banks concatenate into a single vector as the input of the next step. At the same time, it is also necessary to add an alternative parallel pooling path in each stage, as shown in Figure 17.

Transfer Learning
In some cases, it is not easy to obtain training data that matches the feature space of the test data [135]. If a test dataset is related to a training dataset that has already been trained, the parameters of the trained model (pre-trained model) can be transferred to the target model, as shown in Figure 19. This avoids training the target model from scratch. But in Inception architecture, the outputs of the convolutional layer and the pooling layer will increase. It can become expensive and a computational blow-up. Dimension reductions and projections can solve this problem. Using 1 × 1 convolutions can reduce the computation before 3 × 3 and 5×5 convolutions and rectify linear activation, as shown in Figure 18.

Transfer Learning
In some cases, it is not easy to obtain training data that matches the feature space of the test data [135]. If a test dataset is related to a training dataset that has already been trained, the parameters of the trained model (pre-trained model) can be transferred to the target model, as shown in Figure 19. This avoids training the target model from scratch. It speeds up and optimizes the learning of the target model. Transfer learning can provide related but not identical existing datasets for the target model.
Transfer learning has been applied to include text sentiment classification [136], image classification [137][138][139], human activity classification [140], software defect classification [141], and multi-language text classification [142][143][144]. The networks that can be used as pre-trained are also the focus of research. The feature space of the source domain is , and the feature space of the target domain is . If = , this is homogeneous transfer learning. If ≠ , this is heterogeneous transfer learning. The strategies adopted by homogeneous transfer learning [135] include correcting the difference in the marginal distribution in the source, correcting the

Transfer Learning
In some cases, it is not easy to obtain training data that matches the feature space of the test data [135]. If a test dataset is related to a training dataset that has already been trained, the parameters of the trained model (pre-trained model) can be transferred to the target model, as shown in Figure 19. This avoids training the target model from scratch. It speeds up and optimizes the learning of the target model. Transfer learning can provide related but not identical existing datasets for the target model.

Transfer Learning
In some cases, it is not easy to obtain training data that matches the feature space of the test data [135]. If a test dataset is related to a training dataset that has already been trained, the parameters of the trained model (pre-trained model) can be transferred to the target model, as shown in Figure 19. This avoids training the target model from scratch. It speeds up and optimizes the learning of the target model. Transfer learning can provide related but not identical existing datasets for the target model.
Transfer learning has been applied to include text sentiment classification [136], image classification [137][138][139], human activity classification [140], software defect classification [141], and multi-language text classification [142][143][144]. The networks that can be used as pre-trained are also the focus of research. The feature space of the source domain is , and the feature space of the target domain is . If = , this is homogeneous transfer learning. If ≠ , this is heterogeneous transfer learning. The strategies adopted by homogeneous transfer learning [135] include correcting the difference in the marginal distribution in the source, correcting the Transfer learning has been applied to include text sentiment classification [136], image classification [137][138][139], human activity classification [140], software defect classification [141], and multi-language text classification [142][143][144]. The networks that can be used as pre-trained are also the focus of research.
The feature space of the source domain is F s , and the feature space of the target domain is F t . If F t = F s , this is homogeneous transfer learning. If F t = F s , this is heterogeneous transfer learning. The strategies adopted by homogeneous transfer learning [135] include correcting the difference in the marginal distribution in the source, correcting the difference in the conditional distribution in the source, or correcting the difference between the marginal and conditional distribution in the source. Heterogeneous transfer learning [135] uses strategies that align the input spaces of the source and target domains. If the domain distributions are still unequal, further modulations are needed.
There are three strategies for transfer learning: (i) inductive transfer learning, (ii) transductive transfer learning, and (iii) unsupervised transfer learning [145,146]. There are two common types of inductive learning. One is to use the source domain to obtain a trained learning model and fine-tune different target layers [147]. Another is multitask-learning multiple tasks simultaneously from the same input [148]. The target task is similar to the source task in transductive transfer learning, but the domains differ. The factors affecting transductive transfer learning include domain adaptation [149], sample selection bias [150], and covariate shift [151]. Unsupervised transfer learning mainly solves unsupervised learning tasks in the target domain.
There are four kinds of approaches to transfer learning [145]. The first is instance-based transfer learning. A common approach is to reweight instances of the source domain [152][153][154].
These worked best when the conditional distributions were the same in the source domain and target domain. The second is feature-based transfer learning. The standard methods include asymmetric feature transformation [155] and symmetric feature transformation [156]. The third is parameter-based transfer learning. Source tasks and target tasks share the parameters of the models [157,158]. Target learners are formed by optimally combining the reweighted source learners [158][159][160]. The last one is relational-based transfer learning [161][162][163]. If the source and target data have similar relationships, the relationship among the data can be transferred. Figure 20 shows the classification of transfer learning. lar to the source task in transductive transfer learning, but the domains differ. The factors affecting transductive transfer learning include domain adaptation [149], sample selection bias [150], and covariate shift [151]. Unsupervised transfer learning mainly solves unsupervised learning tasks in the target domain.
There are four kinds of approaches to transfer learning [145]. The first is instancebased transfer learning. A common approach is to reweight instances of the source domain [152][153][154]. These worked best when the conditional distributions were the same in the source domain and target domain. The second is feature-based transfer learning. The standard methods include asymmetric feature transformation [155] and symmetric feature transformation [156]. The third is parameter-based transfer learning. Source tasks and target tasks share the parameters of the models [157,158]. Target learners are formed by optimally combining the reweighted source learners [158][159][160]. The last one is relationalbased transfer learning [161][162][163]. If the source and target data have similar relationships, the relationship among the data can be transferred. Figure 20 shows the classification of transfer learning.

Application, Analysis, and Discussion
Currently, the diagnosis of COVID-19 mainly relies on epidemiological history, clinical symptoms, laboratory results, chest imaging findings, nucleic acid testing, or homologous comparison of gene sequencing [164]. COVID-19 is highly contagious and has an incubation period. It needs repeated testing over some time to determine infection. It is difficult for doctors to diagnose COVID-19 quickly and accurately. Chest computed tomography (CT) and chest X-ray images can reflect whether the patient's lung is infected and the degree of damage. However, doctors rely on experience to judge CT images, and X-ray images can also lead to misdiagnosis. Popular artificial intelligence (AI) systems can quickly analyze large amounts of image data. After learning, the system can accurately classify chest CT images and X-ray images. This will help doctors diagnose COVID-19 quickly and accurately.

Application, Analysis, and Discussion
Currently, the diagnosis of COVID-19 mainly relies on epidemiological history, clinical symptoms, laboratory results, chest imaging findings, nucleic acid testing, or homologous comparison of gene sequencing [164]. COVID-19 is highly contagious and has an incubation period. It needs repeated testing over some time to determine infection. It is difficult for doctors to diagnose COVID-19 quickly and accurately. Chest computed tomography (CT) and chest X-ray images can reflect whether the patient's lung is infected and the degree of damage. However, doctors rely on experience to judge CT images, and X-ray images can also lead to misdiagnosis. Popular artificial intelligence (AI) systems can quickly analyze large amounts of image data. After learning, the system can accurately classify chest CT images and X-ray images. This will help doctors diagnose COVID-19 quickly and accurately.

CNN Applied to CT Images
The papers mentioned in Table 2 used convolutional neural networks (CNN) to analyze chest CT images and proposed various algorithms for COVID-19 diagnosis. These algorithms modify the parameters and architecture of the convolutional neural network (CNN) to obtain better accuracy.
Aslan et al. [165] improved the architecture of AlexNet to obtain mAlexNet. In the mAlexNet architecture, the fully connected layer eight has 25 neurons instead of 1000 neurons. The features extracted by mAlexNet are given into the tree seed algorithm (TSA) ANN structure for classification. The experimental comparison shows that the performance of mAlexNet + TSA-ANN is more excellent. AlexNet and mAlexNet can also obtain good performance by hybridizing other structures or algorithms [166,167]. Different total number of layers [168,169] in ResNet leads to different performance results. Rahimzadeh et al. [170] modified ResNet50V2. This model can retain the data of small objects and improve the classification accuracy of images containing small important objects. By comparison, Loey et al. [171] found that among AlexNet, VGGNet16, VGGNet19, GoogleNet, and ResNet50, the ResNet50 was the most appropriate deep learning model for using the classical data augmentation and CGAN. Özdemir et al. [97] extended ResNet50 architecture with a feature-wise attention layer and used the mixup data augmentation technique. This architecture achieves higher accuracy. Mondal et al. [172] proposed the scheme of optimized InceptionResNetV2 for COVID-19 (CO-IRv2). It combines InceptionNet with ResNet with hyperparameter tuning. DenseNet-121 is a network with 121 layers [173]. DenseNet-121 D solves the problem of vanishing gradients, allows better feature reuse, and reduces the number of parameters. It is more beneficial to the training of deep learning models [174]. Xiao et al. [175] improved DenseNet with a parallel attention module (PAM-DenseNet) which has spatial and channel attention modules. These make the net have better classification performance and patient-wise prediction performance. The VGG network alone [176], or together with other neural networks [177], has been used to classify chest CT images with high accuracy. Wang et al. [178] proposed a VGG-style base network (VSBN). Convolutional block attention module (CBAM) serves as the attention module of VSBN. In order to solve the problem of artificial intelligence (AI) model overfitting, VSBN uses an improved multiple-way data augmentation method. GoogleNet is retrained over COVID-CT-Dataset [179]. GoogleNet learned the variations present in diverse types of CT images. The system model GoogleNet-COD developed by Yu et al. [180] takes GoogleNet as the backbone network. It removes the last top-two layers and replaces them with four new layers, which include the dropout layer, two fully connected layers, and the output layer. Zhang et al. improved the deep convolutional neural network (DCNN) [181]: the pooling layer adopts stochastic pooling; construct a convolution block (CB), which is obtained by combining the convolution layer with the batch normalization layer; construct a fully connected block (FCB), which is obtained by combining the dropout layer with the fully connected layer. Pham [182] uses multiple CNNS to classify CT images collected from COVID-19 patients and non-COVID-19 subjects. Among them, the deepest net, DenseNet201, has the best performance. Transfer learning with the direct input of whole image slice and without using data augmentation provided better classification rates. JavadiMoghaddam et al. [183] proposed a convolutional neural network structure containing a wavelet and four convolutional layers. It optimizes convergence time using batch normalization (BN) and Mish Functions. The Haar wavelet transform occurs at the pooling layer. There are squeeze and Excitation blocks (SE blocks) after each dropout layer. As can be seen from Table 2, there have been many suggestions, proposals, and implementations for applying CNN to COVID-19 diagnosis by analyzing chest CT images. For example, some researchers focus on adjusting the architecture of the network. Some focus on adjusting the number of layers of the network, some on improving existing algorithms, and some on mixing several structures.
Many schemes listed above have achieved accuracy higher than 90% when implemented. However, there are a lot of CNN networks that have a huge space for improvement: (1) It can be concluded from the results that the modified AlexNet can obtain better accuracy. When combined with other structures, the network is improved so that the accuracy is higher for COVID-19 diagnosis. (2) Different accuracy will be obtained from different depths of ResNet. Common ones include ResNet18, ResNet50, ResNet101, etc. However, according to the results of the above experiments, ResNet18 performs the best. ResNet can be modified to improve accuracy. The modified ResNet or ResNet combined with other networks could achieve more than 90% accuracy. ResNet50 is often used with other architectures or algorithms for COVID-19 diagnosis. Moreover, we found that the larger the dataset ResNet applied, the higher the accuracy. (3) With DenseNet, the accuracy was close to 95%. DenseNet has a better performance than other networks on the same dataset. The dataset size does not affect DenseNet accuracy as much as the depth of the network. DenseNet201 shows excellent performance for COVID-19 diagnosis. (4) Even on small datasets, VGG combined with other structures or algorithms can achieve more than 95% accuracy. VGG19 and VGG16 performed similarly. (5) The accuracy of GoogleNet for COVID-19 diagnosis is not good enough, generally less than 90%. Try tweaking the architecture of GoogleNet or combining GoogleNet with other networks to improve classification accuracy. (6) For almost all networks, accuracy increases as the dataset become larger. Private datasets are generally smaller than public datasets. When there are more than 1000 images in the dataset, almost all networks or models can achieve more than 95% accuracy.

CNN Applied to X-ray Images
The papers mentioned in Table 3 use the architecture or technology of convolutional neural networks (CNNs) to analyze chest X-ray images and establish various training models for COVID-19 diagnosis. Through optimizations, these automatic diagnosis systems have achieved better performance. Chest X-ray images were more readily available. So COVID-19 diagnosis systems could be more widely used by analyzing chest X-ray images in areas with inadequate medical systems.
Cortés et al. [184] applied to learn transfer to AlexNet and fine-tuned it. The first layer of AlexNet is replaced for images in a single intensity. Kaur et al. [185] used the improved AlexNet architecture. Strength Pareto evolutionary algorithm-II (SPEA-II) is used to optimize parameters. Narin et al. [186] compared different layers of ResNet to classify chest X-ray images. ResNet50 provides the highest classification performance. In the study of Chowdhury et al. [187], DenseNet201 performed well in classifying chest X-ray images with image augmentation. In the study, Hernandez et al. applied transfer learning through ResNet, DenseNet, and VGG and fine-tuned them, which achieved higher accuracy [188]. Sitaula et al. [189] added the attention module to the appropriate convolution layer of VGG16. Classification experiments were performed on three COVID-19 chest X-ray image datasets. Haritha et al. [190] used GoogleNet to classify X-ray images and predict COVID-19. Kaya et al. [191] first applied the angle transformation (AT) on X-ray images. Then these images are trained using GoogleNet combined with LSTM. Finally, they obtained a better accuracy rate. As can be seen from Table 3, there have also been many suggestions, proposals, and implementations for applying CNN to COVID-19 diagnosis by analyzing chest X-ray images. The researchers focused on tweaking the number of layers in the network and mixing CNNS with other network structures. They divided datasets into several classes for training and validation. Many automatic diagnosis systems can achieve accuracy higher than 90%. However, there is a huge space for improvement.

Conclusions
In this paper, we reviewed and summarized convolutional neural networks in COVID-19 diagnosis. We introduced various technologies related to CNN and some mature CNN networks with excellent performance. Then, we analyzed and compared various suggestions of other researchers on the application of CNN for COVID-19 diagnosis. Here are a few conclusions and suggestions: (1) At present, rapid and accurate COVID-19 diagnosis is vital. The classification method of chest CT or chest X-ray images based on CNN plays an important role.
(2) The current experiment has limited datasets. It is necessary to collect more data or explore better methods for small datasets. (3) Most experiments do not consider the execution time problem. It is necessary to shorten the execution time with appropriate data preprocessing [192][193][194][195] strategies or GPU acceleration. (4) The experiments discussed in this paper use chest CT or chest X-ray images as the input datasets of CNN and have achieved good performance. Although X-ray image is not as good as CT in performance, it has low cost, low radiation dose, and easyto-operate in general hospitals [196]. Future research could consider more types of medical image forms. The application of the CNN method on medical images has potential value for COVID-19 diagnosis.