Performance Comparison of CNN Models Using Gradient Flow Analysis

Abstract: Convolutional neural networks (CNNs) are widely used among the various deep learning techniques available because of their superior performance in the fields of computer vision and natural language processing. CNNs can effectively extract the locality and correlation of input data using structures in which convolutional layers are successively applied to the input data. In general, the performance of neural networks has improved as the depth of CNNs has increased. However, an increase in the depth of a CNN is not always accompanied by an increase in the accuracy of the neural network, because the gradient vanishing problem may arise, causing the weights of the weighted layers to fail to converge. Accordingly, in this study, the gradient flows of the VGGNet, ResNet, SENet, and DenseNet models were analyzed and compared, and the reasons for the differences in the error rate performances of the models were derived.


Introduction
Convolutional neural networks (CNNs) are one of the most widely used deep learning techniques because of their superior performance in the fields of computer vision and natural language processing [1]. Moreover, CNNs have been successfully applied to different machine-learning-related tasks, such as object detection, recognition, classification, regression, and segmentation [2][3][4]. Recently, CNNs have been actively applied in the medical field, and CNN models that can be used as automated diagnostic tools to aid experts in the detection of hypertension, coronary artery disease, myocardial infarction, and congestive heart failure have been proposed [5][6][7][8]. However, in order to use deep CNNs in mobile and embedded systems, it is necessary to overcome challenges relating to their large computational cost and high memory usage [9,10]. Several studies have been conducted on the gating mechanism to overcome these limitations [11].
In a fully connected neural network (FCNN), spatial information is lost, and the features of adjacent pixels cannot be recognized in the process of learning and classification because 3D image data are flattened into a one-dimensional array. In contrast, a CNN has a translation invariance feature that effectively recognizes the features of adjacent pixels while maintaining the spatial information of the image. Because a CNN uses filters as parameters shared across the image data, it is an effective deep learning algorithm for learning and classifying images, requiring far fewer learning parameters than an FCNN. Computer vision performance has significantly improved in recent years with the re-emergence of CNNs and deep learning techniques [12][13][14][15]. The performance of neural networks such as CNNs has improved as their depth has increased, as deeper structures in which convolutional layers are successively applied to input data can more effectively extract the locality and correlation of the input data. From the beginning of the ImageNet Large Scale Visual Recognition Competition (ILSVRC), the depth of CNNs has increased to improve the accuracy of object recognition. The AlexNet model [16], comprising 8 weighted layers, lowered the top-5 error rate to 16% and won the ILSVRC in 2012; the VGGNet model [17], comprising 16 weighted layers, lowered the top-5 error rate to 7.3% in 2014; and the ResNet model [18], comprising 152 weighted layers and shortcut identity connections, lowered the top-5 error rate to 3.6% and won the ILSVRC in 2015. The DenseNet model is a deep CNN that was proposed by Huang et al. in 2017; the top-5 error rate of DenseNet-161, comprising 161 weighted layers, was found to be 5.30% [19].
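The parameter saving from weight sharing can be made concrete with a back-of-the-envelope count. The sizes below (a 32 × 32 × 3 input, 64 filters or an equivalently sized dense layer) are illustrative assumptions, not figures from the text:

```python
# Parameter count: one 3x3 convolutional layer vs. one fully connected
# layer producing the same output volume (illustrative sizes only).

def conv_params(in_ch, out_ch, k):
    # Each of the out_ch filters has k*k*in_ch weights plus one bias.
    return out_ch * (k * k * in_ch + 1)

def fc_params(in_features, out_features):
    # Dense weight matrix plus one bias per output unit.
    return in_features * out_features + out_features

conv = conv_params(3, 64, 3)               # 64 filters of size 3x3x3
fc = fc_params(32 * 32 * 3, 32 * 32 * 64)  # flattened input -> flattened output

print(conv)  # 1792
print(fc)    # roughly 200 million
```

The shared 3 × 3 filters need only a few thousand parameters, whereas a dense layer covering the same input/output volume needs hundreds of millions, which is the efficiency argument made above.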
However, the accuracy of neural networks does not necessarily increase with an increase in the depth of CNNs. This is because the weights of the weighted layers may not converge owing to the gradient vanishing problem [20,21]. The weights of the weighted layers are updated in the direction in which the loss function decreases using the backpropagation algorithm. The gradient at each node is calculated based on the chain rule in a backward manner as follows: (local gradient) × (gradient flowing from ahead). Therefore, the gradient calculated at the data input node may vanish to an extremely small value close to 0 or diverge infinitely if the depth of the neural network is sufficiently large. The ResNet and DenseNet models, which introduced skip connections, were developed to overcome this problem. In addition, SENets [22] were proposed to improve the representational power of CNN models. In the ResNet model, gradients do not vanish but are effectively transmitted when the weights are updated, even in deep neural networks, because there is a shortcut identity connection for every two weighted layers, as in Figure 2. DenseNet can be considered a model that maximizes the idea of skip connections. In the DenseNet model, each layer is connected to all the other layers in a feed-forward manner, as in Figure 4. Therefore, each layer receives the full output data of the previous layers. Because of this feature, the DenseNet model effectively overcomes the gradient vanishing problem even when the depth of the neural network increases. SENets model the interdependencies between the channels of convolutional features by introducing a squeeze-and-excitation (SE) block, which improves the representational power of CNNs when plugged into various CNN models.
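The multiplicative chain-rule behaviour described above can be sketched numerically. The depth and the sampled local-gradient magnitudes below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
# Local gradients of each layer, drawn with magnitude < 1 to mimic
# saturating activations (illustrative values only).
local = rng.uniform(0.1, 0.9, depth)

plain = np.prod(local)            # chain rule: product of local gradients
residual = np.prod(local + 1.0)   # a skip connection adds an identity term

print(plain)     # vanishes toward 0
print(residual)  # stays well away from 0
```

With 50 factors smaller than 1, the plain product collapses to a negligible value, while the "+1" contributed by each identity path keeps every factor above 1, which is the intuition behind the skip-connection models analyzed below.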
In performance comparison studies of CNN models [18,22], the superiority of the proposed models over conventional models was demonstrated by calculating the accuracy of the trained models on several datasets (CIFAR-10 and ImageNet) and by comparing the number of parameters and the amount of computation (FLOPs) required for forwarding a single input image. He et al. (2016) measured the top-1 error rate performance of ResNets and of plain networks constructed with the same rules as the VGGNet model on the ImageNet validation set. He et al. (2016) verified that ResNet reduces the top-1 error rate compared to a plain network on an extremely deep system, as shown in Table 1. The aim of this study was to present an analysis tool that can be used for the performance analysis of CNN models in the future by deriving the theoretical basis for the differences in the top-1 error rates of the four models through the analysis and comparison of gradient flows based on a single bottleneck layer for the VGGNet, ResNet, SE-ResNet, and DenseNet models. The proposed gradient flow analysis method based on a single bottleneck layer can also be applied when designing a CNN model to enhance its learning ability.

Materials and Research Method
In this study, we hypothesized that the error rate performance of CNN models with various architectures could be predicted by analyzing how efficiently each model overcomes the gradient vanishing problem. Accordingly, two research questions were established. The analysis results for research question 1 are described in Section 3, while the analysis results for research question 2 are described in Section 4. By analyzing and comparing the gradient flows of the VGGNet, ResNet, SE-ResNet, and DenseNet models based on a single bottleneck layer, this study aimed to derive the theoretical basis for the differences in the error rate performances of CNN models.
Let F(x) denote the output data that passed through the weighted layer in a bottleneck block of a CNN model, and let H(x) denote the output data of a bottleneck block of a CNN model with a skip connection. For the analysis, the gradient of the loss function L(x) with respect to the input data x of the bottleneck layer was expressed, according to the chain rule, as the product of the rate of change of the loss function with respect to F(x) (or H(x)) and the rate of change of F(x) (or H(x)) with respect to x. To study the gradient vanishing problem, which occurs when ∂F/∂x (or ∂H/∂x) converges to 0 as the number of weighted layers increases, the lower limit of ∂L/∂x in the VGGNet, ResNet, SE-ResNet, and DenseNet models was investigated for the case in which ∂F/∂x (or ∂H/∂x) converges to 0. Moreover, the findings of this investigation were compared with the error rate performance analysis results [18,19,22] to check for consistency. Through this method, this study aimed to provide a theoretical basis for the reported error rate performances [18,19,22].
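As a sanity check on this analysis method, the chain-rule factorization ∂L/∂x = ∂L/∂F × ∂F/∂x can be verified against a finite-difference gradient for toy scalar choices of F and L (both are illustrative stand-ins, not functions from the study):

```python
import math

# Finite-difference check of the chain rule dL/dx = (dL/dF) * (dF/dx)
# for toy scalar functions (illustrative stand-ins only).
F = math.tanh                 # stands in for the weighted layer
L = lambda y: 0.5 * y * y     # stands in for the loss function

def num_grad(g, x, h=1e-6):
    # Central-difference approximation of dg/dx.
    return (g(x + h) - g(x - h)) / (2 * h)

x = 0.3
dF_dx = 1.0 - math.tanh(x) ** 2   # analytic local gradient
dL_dF = F(x)                      # analytic gradient flowing from ahead
chain = dL_dF * dF_dx

direct = num_grad(lambda t: L(F(t)), x)
print(abs(chain - direct) < 1e-8)  # True: the factorization matches
```

The same factorization, applied block by block, is what the following sections instantiate for each of the four architectures.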
If the number of filters in the convolutional layer is n, the width of the filter is k W , and the height is k H , the architecture of the VGGNet, ResNet, SE-ResNet, and DenseNet models when the filters of the convolutional layer are simply expressed as n@k W × k H is summarized in Table A1 in Appendix A. The symbol f C used in the architecture of the SE-ResNet model in Table A1 indicates the output dimensions of the two fully connected layers of the SE block.

Research Results
In the VGGNet model, the size of all the filters of the convolutional layers is 3 × 3, and the activation function ReLU is applied to the data that have passed through each convolutional layer. A certain level of complexity is maintained for each layer by doubling the number of filters in the next convolutional layer whenever the size of the feature map is halved by a 2 × 2 max pooling layer. Three fully connected layers (FC layers) are located in the last stage of the network and serve as classifiers. The ResNet model is a CNN that overcomes the degradation problem by introducing residual learning. If H(x) denotes the underlying mapping of the stacked layers and x denotes the input data, the stacked layers are trained to fit the residual F(x) = H(x) − x, x is added to the output data of the stacked layers, and the activation function ReLU is then applied. He et al. (2016) used a bottleneck building block composed of 1 × 1 − 3 × 3 − 1 × 1 convolutions to present the structure of an iterative ResNet, as shown in Table A1. SE-ResNet is a model that improves feature discriminability by plugging an SE block into the bottleneck block of ResNet, extracting a channel-wise multiplication factor s from the output data of the residual block, and scaling the output data of the residual block by s. The DenseNet model has a structure in which the dense block of Table A1 is repeated. In the DenseNet model, x_l = H_l([x_0, x_1, · · · , x_{l−1}]), where x_l denotes the output data of the l-th dense block, H_l[·] denotes the nonlinear transformation of the l-th dense block, and [x_0, x_1, · · · , x_{l−1}] denotes the concatenation of the feature maps created in the previous layers.
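The concatenation rule x_l = H_l([x_0, · · · , x_{l−1}]) implies that channel counts inside a dense block grow linearly. A minimal bookkeeping sketch, with an assumed input width of 64 channels and growth rate 32 (DenseNet-style but illustrative values, not the paper's configuration):

```python
# Channel bookkeeping inside one dense block: the input to each layer is
# the concatenation of all earlier outputs, so widths grow by the growth
# rate at every layer (illustrative sizes only).

def dense_block_channels(in_ch, growth_rate, num_layers):
    channels = [in_ch]
    for _ in range(num_layers):
        # Next layer's input = previous input plus one more set of maps.
        channels.append(channels[-1] + growth_rate)
    return channels

print(dense_block_channels(64, 32, 6))
# [64, 96, 128, 160, 192, 224, 256]
```

This linear growth is the structural source both of DenseNet's many gradient paths and of its higher FLOP count discussed later.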
A single bottleneck block in the VGGNet model is illustrated in Figure 1. The gradient of L(x) is expressed as shown in Equation (1), where x denotes the input data of the bottleneck block of the VGGNet model, F(x) denotes the output data that passed through the weighted layer, and L(x) denotes the loss function.

∂L/∂x = ∂L/∂F × ∂F/∂x    (1)
Therefore, a gradient vanishing problem occurred when ∂F/∂x vanished to 0. A single bottleneck block in the ResNet model is illustrated in Figure 2. The output data of the bottleneck block are represented as F(x) + x, where x denotes the input data of the bottleneck block of the ResNet model and F(x) denotes the output data that passed through the weighted layer. If H(x) = F(x) + x and the loss function is L(x), then

∂L/∂x = ∂L/∂H × ∂H/∂x = ∂L/∂H × (∂F/∂x + 1)    (2)

Therefore, the gradient vanishing problem could be overcome more effectively than by using the VGGNet model because the ∂L/∂H component remained even if ∂F/∂x vanished to 0. A single bottleneck block in the SE-ResNet model is illustrated in Figure 3. Let x denote the input data of the bottleneck block of the SE-ResNet model and F(x) denote the output data that have passed through the residual block. Then, F(x) = (u_1, u_2, · · · , u_C), assuming that F(x) comprises C feature maps. If the result of the squeeze step obtained by applying global average pooling to F(x) is z = F_sq(F(x)), and the result of the excitation step obtained by applying FC − ReLU − FC − Sigmoid to z is s = (s_1, s_2, · · · , s_C), then s transforms F(x) into F̃(x) = (s_1 u_1, s_2 u_2, · · · , s_C u_C) as a channel-wise multiplication factor that acts as a scale factor to improve the feature discriminability. The output data of the SE-ResNet model bottleneck block become H(x) = F̃(x) + x because x is added to F̃(x) through the shortcut identity connection. Therefore, if the loss function is L(x), the following equation is established:

∂L/∂x = ∂L/∂H × (∂F̃/∂x + 1)

As F̃(x) can be expressed as the Hadamard product F̃(x) = s · F(x), the following equation holds:

∂L/∂x = ∂L/∂H × (∂s/∂x · F + s · ∂F/∂x + 1)    (3)

Therefore, the possibility of overcoming the gradient vanishing problem more effectively than can be achieved using the ResNet model increased because the ∂L/∂H × (∂s/∂x · F + 1) components remained even if ∂F/∂x vanished to 0 in Equation (3).
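The role of the "+1" identity term in the ResNet gradient can be observed numerically: when the residual branch saturates and its local gradient vanishes, the gradient of the full block stays near 1. The toy branch F(x) = tanh(5x) below is an illustrative stand-in for the weighted layers, not the paper's block:

```python
import math

def num_grad(g, x, h=1e-6):
    # Central-difference approximation of dg/dx.
    return (g(x + h) - g(x - h)) / (2 * h)

F = lambda x: math.tanh(5.0 * x)   # toy residual branch that saturates quickly
H = lambda x: F(x) + x             # bottleneck output with a skip connection

x = 3.0                            # deep in the saturated region
print(round(num_grad(F, x), 6))    # ~0.0: the branch gradient has vanished
print(round(num_grad(H, x), 6))    # ~1.0: the identity path still carries gradient
```

Even with ∂F/∂x numerically indistinguishable from zero, ∂H/∂x remains close to 1, matching the "+1" term that the analysis credits for ResNet's trainability.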
A single dense block in the DenseNet model is illustrated in Figure 4. If x_0 denotes the initial input data of DenseNet, x_{i−1} denotes the input data of the i-th dense block, and x_i denotes the output data, the following equation is established:

x_l = H_l([x_0, x_1, · · · , x_{l−1}])    (4)

If H([x_0, x_1, · · · , x_{l−1}]) = F_l(x_{l−1}) || x_{l−1} || · · · || x_0 and x = (x_0, x_1, · · · , x_{l−1}) in Equation (4), then

∂L/∂x = ∂L/∂H × ∂H/∂x = ∂L/∂H × (∂F_l(x_{l−1})/∂x + I)    (5)

with the lower limit

|∂L/∂x| ≥ |∂L/∂H| × √l    (6)

Therefore, the gradient vanishing problem could be overcome more effectively than it can be using the ResNet model because ∂L/∂H × √l remained even if all of ∂F_l(x_{l−1})/∂x_0, ∂F_l(x_{l−1})/∂x_1, · · · , ∂F_l(x_{l−1})/∂x_{l−1} vanished to 0.
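The √l lower limit can be illustrated through the Jacobian structure of the concatenation: even when the branch gradient is exactly zero, each input x_i retains an identity block, leaving l direct gradient paths. The sketch below uses toy one-dimensional features and l = 4 (illustrative assumptions, not the paper's dimensions):

```python
import numpy as np

# Jacobian structure of H = concat(F(x_{l-1}), x_{l-1}, ..., x_0):
# even if dF/dx is zero, every input x_i keeps an identity block in dH/dx,
# so l direct gradient paths survive (toy 1-dim features, l = 4).

l, d = 4, 1
dF = np.zeros((d, l * d))          # branch gradient fully vanished
identity_paths = np.eye(l * d)     # one identity block per concatenated x_i
J = np.vstack([dF, identity_paths])  # Jacobian of the concatenation

dL_dH = np.ones(J.shape[0])        # upstream gradient, all ones for simplicity
dL_dx = dL_dH @ J
print(dL_dx)                       # each x_i still receives gradient 1.0
print(np.linalg.norm(dL_dx))       # sqrt(l) = 2.0, matching the sqrt(l) lower limit
```

With unit upstream gradient, the l surviving identity paths give ∥∂L/∂x∥ = √l, which grows with the number of layers in the dense block rather than vanishing.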

Discussion
The results of measuring the top-1 error rate performances of ResNets and of plain networks constructed with the same rules as the VGGNet model on the CIFAR-10 dataset while varying the number of weighted layers are presented in Figure 5. The results of measuring the top-1 error rate performances of plain networks and ResNets on the ImageNet dataset while varying the number of weighted layers are presented in Table 1. As presented in Figure 5 and Table 1, the performance of the plain networks decreased as the number of weighted layers increased, whereas the ResNets showed a performance gain, with accuracy increasing as the number of weighted layers increased, because the gradient vanishing problem was overcome.
The performance in terms of the top-1 error rate, top-5 error rate, and amount of computation (GFLOPs) needed by the ResNet and SE-ResNet models for the ImageNet dataset when the number of weighted layers is 50, 101, and 152 is summarized in Table 2 [22]. A graph showing the change in the top-1 error rate over the epochs of ResNet-50 and SE-ResNet-50 is depicted in Figure 6, which shows that the validation top-1 error rate of SE-ResNet-50 is lower than that of ResNet-50 [22]. Although the amount of computation needed by the SE-ResNet model, with the SE block plugged into the ResNet model, slightly increased, its top-1 and top-5 error rates were lower than those of the ResNet model, as presented in Table 2 and Figure 6. This is consistent with the result of the gradient flow analysis, which showed that the SE-ResNet model is more likely than the ResNet model to effectively overcome the gradient vanishing problem.
The differences in error rate performance in Tables 1 and 2 can be said to be statistically significant if we consider the results showing that the standard deviation of the accuracies of CNNs over 15 datasets is less than 1% [23] and the results showing that the standard deviation of the layer response, which is the output of the 3 × 3 layer of ResNet, is less than 1 [18]. The experimental results shown in Table 1 were obtained using the following parameters: the learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10^4 iterations. The experimental results shown in Table 2 were obtained using the following parameters: the learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs, and the models are trained for 100 epochs from scratch. Although the error rate of VGGNet might increase because of the gradient vanishing problem if the number of weighted layers is increased to improve the extraction of features from the input data, ResNet can overcome this disadvantage of VGGNet using shortcut identity connections. SE-ResNet increases the possibility of overcoming the gradient vanishing problem more effectively than ResNet through the improved feature discriminability obtained by plugging the SE block into ResNet.
DenseNet overcomes the gradient vanishing problem more effectively than ResNet by maximizing the idea of skip connections in ResNet to connect each layer to all other layers in a feedforward manner. However, the computational amount (FLOPs) also increases compared to ResNet when forwarding a single input image owing to the complexity of the model [19].

Conclusions
This study is meaningful because the basis for the difference in the performance of the four models was derived by analyzing and comparing the gradient flow based on a single bottleneck block for the VGGNet, ResNet, SE-ResNet, and DenseNet models, which are representative models of CNNs.
A gradient vanishing problem occurred when the gradient of L(x) was calculated, as shown in Equation (1), where x denotes the input data of the bottleneck block of VGGNet, F(x) denotes the output data that passed through the weighted layer, and L(x) denotes the loss function. In the case of the ResNet model, the output data of the bottleneck block were F(x) + x. If H(x) = F(x) + x, the gradient vanishing problem could be overcome more effectively than it could in the VGGNet model because the ∂L/∂H component remained even if ∂F/∂x vanished to 0 when the gradient of L(x) was calculated, as shown in Equation (2). If the input data of the bottleneck block of SE-ResNet were expressed as x, the output data of the residual block were expressed as F(x), the channel-wise multiplication factor obtained by passing F(x) through the SE block was expressed as s, and the loss function was expressed as L(x), then the gradient of L(x) was calculated as shown in Equation (3). Therefore, the possibility of overcoming the gradient vanishing problem more effectively than could be achieved by the ResNet model increased because the ∂L/∂H × (∂s/∂x · F + 1) components remained even if ∂F/∂x vanished to 0. In the case of the DenseNet model, if the initial input data were expressed as x_0, the input data of the l-th dense block were expressed as x_{l−1}, the output data were expressed as x_l, and H([x_0, x_1, · · · , x_{l−1}]) = F_l(x_{l−1}) || x_{l−1} || · · · || x_0, then the gradient of L(x) for x = (x_0, x_1, · · · , x_{l−1}) was calculated as shown in Equation (5), with a lower limit as shown in Equation (6). Therefore, the gradient vanishing problem could be overcome more effectively than it could in the ResNet model because ∂L/∂H × √l remained even if all of ∂F_l(x_{l−1})/∂x_0, ∂F_l(x_{l−1})/∂x_1, · · · , ∂F_l(x_{l−1})/∂x_{l−1} vanished to 0.
The performance of a plain network constructed with the same rules as the VGGNet model decreased as the number of weighted layers increased, whereas the accuracy of the ResNet model increased as the number of weighted layers increased because the gradient vanishing problem was overcome, as shown in the performance analysis results in Figure 5. The results in Table 1 show that ResNet-34 reduces the ImageNet validation top-1 error by 3.5% compared to a plain network with the same number of layers and parameters. Although the amount of computation slightly increased in the SE-ResNet model compared to the ResNet model, the validation top-1 error rate for ImageNet was lower because the model effectively overcame the gradient vanishing problem, as presented in Table 2 and Figure 6. The results shown in Table 2 demonstrate that SE-ResNet-50 reduces the ImageNet validation top-1 error by 1.51% compared to ResNet-50. The DenseNet model had a lower validation top-1 error rate for ImageNet than ResNet because it effectively overcame the gradient vanishing problem even when the number of weighted layers increased, as shown in the performance analysis results in Figure 7.
In the future, more quantitative and qualitative studies can be conducted by analyzing the gradient flow of other CNN models and by identifying causes, other than gradient flow, for the differences in the performances of CNN models.