Bridge Crack Detection Based on SSENets

Abstract: Bridge crack detection is essential for preventing transportation accidents. However, the surrounding environment interferes strongly with crack detection, which makes it difficult to guarantee detection accuracy. To detect bridge cracks accurately, we propose an end-to-end model named Skip-Squeeze-and-Excitation Networks (SSENets). It is composed mainly of the Skip-Squeeze-and-Excitation (SSE) module and the Atrous Spatial Pyramid Pooling (ASPP) module. The SSE module uses a skip-connection strategy to strengthen the gradient correlation between the shallow and deeper layers of the network, alleviating the vanishing gradient caused by increasing depth. The ASPP module extracts multi-scale contextual information from images, while depthwise separable convolution reduces computational complexity. To avoid destroying the topology of cracks, we use atrous convolution instead of the pooling layer. The proposed SSENets achieves a detection accuracy of 97.77%, outperforming the models it was compared with. The SSE module, with its skip-connection strategy, can also be embedded in other convolutional neural networks (CNNs) to improve their performance.


Introduction
In modern society, ensuring the safety of bridges is important. Cracks are among the most common defects of bridge structures, so detecting and repairing them in time is a key task in bridge maintenance [1]; doing so can effectively prevent structural problems from endangering transportation safety. Given the strict requirements for bridge safety, a detector has to find tiny cracks while remaining robust to noise, scratches and uneven illumination. Workers used to rely on subjective judgment to detect bridge cracks, which is inefficient, inaccurate and time consuming, and is therefore unsuitable for practical application. With advances in computer vision and deep learning, computer vision has been applied to crack detection [2,3], addressing the shortcomings of manual inspection.
In recent years, crack detection algorithms based on computer vision have been developed continuously. Threshold segmentation [4], morphological operations [5], wavelet transforms [6] and filter-based algorithms [7] have been applied to detect cracks. Although these algorithms can achieve high detection accuracy after parameter tuning, they are effective only for images captured in specific environments: when the illumination or shooting distance changes, the parameters must be adjusted again to maintain high detection accuracy.
To satisfy the requirement of working in different environments, we considered the use of convolutional neural networks (CNNs) to detect bridge cracks. CNNs were first proposed by [...]

The main contributions of this paper are as follows:

• We designed an embeddable module with a skip-connection strategy, called the Skip-Squeeze-and-Excitation (SSE) module. Inserting the SSE module into an existing network can improve detection accuracy without increasing computational complexity.
• Considering the wide range of crack sizes in the crack detection task, we introduced the Atrous Spatial Pyramid Pooling (ASPP) module into our model. It improves detection accuracy by capturing image context at multiple scales.
• Based on the above-mentioned modules, we proposed SSENets and applied it to the bridge crack detection task. SSENets reaches a detection accuracy of 97.77%, which is higher than that of the traditional classification models and of the model proposed by Xu et al. [25] at the same model complexity.

Datasets
To meet the experimental requirements, we used the bridge crack dataset created by Xu et al. [25] for training and testing. The 2068 initial images of the dataset were collected with the Complementary Metal Oxide Semiconductor (CMOS) surface array camera of a Phantom 4 Pro at a resolution of 1024 × 1024. To construct positive samples (images with cracks) and negative samples (images without cracks), each initial image was divided into four parts. The sub-images were filtered, cropped and flipped, yielding 6069 images with a resolution of 224 × 224. These images, together with their labels, form the dataset. We chose 4856 images as the training set and 1213 images as the testing set.
As shown in Figure 1, the crack detection task is divided into two parts: training and testing. Training SSENets on the training set yields a crack classifier, which is then used to decide whether the images in the testing set contain cracks and so produce the output of the task. In the test, we used the sliding window technique to traverse the whole image. The structure of SSENets is described in detail below.
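The sliding window traversal mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the stride, the handling of image borders, and the `classify_patch` callable (a stand-in for the trained classifier) are all assumptions.

```python
def sliding_windows(height, width, window=224, stride=224):
    """Return (top, left) corners of window-sized patches covering the image.

    The last window along each axis is shifted back so that it stays inside
    the image, so patches at the border may overlap their neighbours.
    """
    if height < window or width < window:
        raise ValueError("image is smaller than the window")

    def starts(length):
        positions = list(range(0, length - window + 1, stride))
        if positions[-1] != length - window:
            positions.append(length - window)  # cover the remaining border
        return positions

    return [(top, left) for top in starts(height) for left in starts(width)]


def detect_cracks(image, classify_patch, window=224, stride=224):
    """Apply classify_patch to every window; return corners flagged as cracks.

    `image` is a 2-D list of pixel values. `classify_patch` is a hypothetical
    callable mapping a window-sized 2-D list to True (crack) or False.
    """
    height, width = len(image), len(image[0])
    flagged = []
    for top, left in sliding_windows(height, width, window, stride):
        patch = [row[left:left + window] for row in image[top:top + window]]
        if classify_patch(patch):
            flagged.append((top, left))
    return flagged


# A 1024 x 1024 image with a 224 x 224 window and stride 224 needs
# 5 window positions per axis (0, 224, 448, 672 and the shifted 800).
print(len(sliding_windows(1024, 1024)))  # 25
```

A smaller stride gives denser coverage at the cost of more classifier calls; the choice depends on how finely cracks need to be localized.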

Proposed Network
To improve the capability of the model, reduce its complexity and alleviate the vanishing gradient during training, we propose a model named SSENets, built on the SSE module, which uses a skip-connection strategy, and the ASPP module, which uses atrous convolutions with multi-sample rates. The structure of SSENets is shown in Figure 2; it contains the core SSE module, the ASPP module, and conventional convolutional and pooling layers. The first three convolutional layers extract image features. The model feeds the feature maps from the second and third convolutional layers into the SSE module and uses the generated channel weights to recalibrate the feature map. Because the SSE module takes feature maps from different layers as input, it mitigates the vanishing gradient during training. The structure of the SSE module is detailed in Section 2.3. To improve the learning of crack features, the model passes the output feature map of the SSE module to the ASPP module, which extracts multi-scale features. We build the ASPP module with depthwise separable convolution to greatly reduce the number of parameters and the model complexity. The structure of the ASPP module is detailed in Section 2.4. In addition, to avoid destroying the topology of the cracks after several pooling layers have downsampled the feature map, we introduce atrous convolution with an atrous rate of 2 in the last three convolutional layers. Finally, we use the Softmax function to predict whether the input image contains cracks.

Skip-Squeeze-and-Excitation Module
To alleviate the vanishing gradient problem that arises as the model grows deeper, we design an embeddable SSE module based on the skip-connection strategy; its structure is shown in Figure 3. $F_{tr}$ denotes any transformation in the network. The feature map $FM_i \in \mathbb{R}^{H_i \times W_i \times C_i}$ is obtained by $F_{tr}$, where $i \le n$ and $n$ is the total number of convolutional layers in the network. $F_{sq}$ denotes the squeeze operator of the SSE module. Its input is the feature map $FM_i$, whose spatial dimensions are aggregated per channel to obtain the channel-wise descriptor $d \in \mathbb{R}^{C_i}$. $F_{ex}$ denotes the excitation operator of the SSE module. It maps the channel-wise descriptor $d$ to a set of channel weights $d' \in \mathbb{R}^{C_j}$, whose number of channels matches that of the output feature map $FM_j \in \mathbb{R}^{H_j \times W_j \times C_j}$. The feature map $FM_j$ obtained by the j-th convolutional layer is then multiplied by the channel weights $d'$. During training, the channel weights $d'$ are adjusted continuously and each channel of $FM_j$ is recalibrated, which enhances the learning capability of the module.

Skip-Connection
The success of VGGNets [33] shows that network performance improves as network depth increases. However, as depth increases, the vanishing gradient problem appears. CNNs are trained iteratively through back propagation. Under the chain rule, gradients smaller than 1 are multiplied repeatedly and approach 0, so the parameters far from the output layer cannot be updated. Therefore, the number of network layers cannot be increased without limit to improve network performance.
To alleviate the vanishing gradient caused by increasing network depth, this paper designs the SSE module using the skip-connection strategy. The SSE module takes feature maps of different depths as input and uses the channel weights $d'$ generated from the shallow layer to recalibrate the feature map $FM_j$ generated by the deeper layer. This strategy increases the gradient correlation of the model and alleviates the vanishing gradient of CNNs as depth increases. It therefore makes the model easier to optimize and improves detection accuracy.
The simplified model of the SSE module is shown in Figure 4. We assume that the input of the model is $x_n$ and that the output after two hidden layers is $x_{n+2}$. The expression for $x_{n+2}$ is given in Equation (1), where $W_n$ denotes the parameters of the hidden layer and the operator $\odot$ denotes the Hadamard product of matrices. By the chain rule, the partial derivative of the loss function $Loss$ with respect to the parameter $W_n$ is given in Equation (2). The square brackets in that formula contain two terms, so even if the partial derivative $\frac{\partial}{\partial x_{n+1}} \mathcal{F}(x_{n+1}, W_{n+1})$ approaches 0 as the number of iterations and the depth of the model increase, $\frac{\partial Loss}{\partial W_n}$ will not be 0. Therefore, our model can alleviate the vanishing gradient of the network.
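The two-term argument can be sketched with the product rule. The forms below are an assumption read off the layout of Figure 4 (the output of the second hidden layer multiplied elementwise by its input $x_{n+1}$), not the authors' exact Equations (1) and (2); $\mathcal{F}$ stands for a hidden-layer mapping and the derivatives are written schematically.

```latex
x_{n+1} = \mathcal{F}(x_n, W_n), \qquad
x_{n+2} = x_{n+1} \odot \mathcal{F}(x_{n+1}, W_{n+1})

\frac{\partial Loss}{\partial W_n}
  = \frac{\partial Loss}{\partial x_{n+2}}
    \left[ \mathcal{F}(x_{n+1}, W_{n+1})
         + x_{n+1} \odot \frac{\partial}{\partial x_{n+1}} \mathcal{F}(x_{n+1}, W_{n+1}) \right]
    \frac{\partial x_{n+1}}{\partial W_n}
```

Under this assumption, the first bracketed term does not involve the derivative of $\mathcal{F}$ with respect to $x_{n+1}$, so the gradient with respect to $W_n$ keeps a contribution even when the second term shrinks toward 0.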

Squeeze
Each pixel obtained from conventional convolution is related only to the context within its local receptive field and cannot exploit context outside it. To solve this problem, we use the squeeze operator to aggregate global information into a channel descriptor $d$. We apply global average pooling to generate a channel-wise vector from the input feature map $FM_i$, shrinking the spatial context from $H_i \times W_i$ to $1 \times 1$. The channel descriptor is computed as shown in Equation (3):

$$d_c = F_{sq}(FM_{ic}) = \frac{1}{H_i \times W_i} \sum_{h=1}^{H_i} \sum_{w=1}^{W_i} FM_{ic}(h, w) \qquad (3)$$

where $d_c$ is the value of the c-th channel of the channel descriptor $d$ and $FM_{ic}$ refers to the c-th channel of the feature map $FM_i$. We choose the simplest aggregation strategy [32], which improves the capability of the module while minimizing its complexity.

Excitation
To alleviate the vanishing gradient as the model deepens, we let the feature map $FM_j$ from a deeper layer interact with the channel descriptor $d$ from the shallow feature map. However, the numbers of channels in $FM_j$ and $d$ generally differ, so $d$ must be post-processed before the two can be multiplied. Moreover, the squeeze operator gathers the global information of each channel in the spatial dimension but does not consider the connections between channels. The excitation operator therefore adopts the gating strategy [32] to establish these connections, as shown in Equation (4):

$$d' = F_{ex}(d, W) = \delta\left(W_2 \, \delta(W_1 d)\right) \qquad (4)$$

where $d' = [d'_1, d'_2, \ldots, d'_c, \ldots, d'_{C_j}]$, $\delta$ refers to the Rectified Linear Unit (ReLU) activation function [34], $W_1 \in \mathbb{R}^{\frac{C_i}{r} \times C_i}$, $W_2 \in \mathbb{R}^{C_j \times \frac{C_i}{r}}$, and $r$ is the reduction ratio. To build up the correlation between channels, we take the channel descriptor $d$ as the input of two fully-connected layers. According to Equation (4), the first fully-connected layer changes the number of channels from $C_i$ to $\frac{C_i}{r}$, and the second changes it from $\frac{C_i}{r}$ to $C_j$, which equals the number of channels of the feature map $FM_j$. Both fully-connected layers use the ReLU activation function.
The output of the SSE module is obtained by the following formula:

$$SSE_c = F_{sc}(FM_{jc}, d'_c) = d'_c \cdot FM_{jc}$$

where $SSE = [SSE_1, SSE_2, \ldots, SSE_c, \ldots, SSE_{C_j}]$ and $F_{sc}(FM_{jc}, d'_c)$ refers to channel-wise multiplication between the channel weight $d'_c$ and the feature map $FM_{jc}$. The SSE module essentially introduces the skip-connection strategy and depthwise separable convolution. We select feature maps of different depths as input and multiply the deeper feature map by the channel weights generated from the shallow feature map, which enhances the gradient transmission ability of the network. The squeeze operator aggregates feature maps in the spatial dimension to obtain the global information of each channel. The excitation operator uses the gating strategy to establish the correlation between channels and converts the channel descriptor into channel weights, which recalibrate the input feature map using global information that accounts for the relationships between channels.
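The squeeze, excitation and scaling steps described in this section can be sketched in plain Python. This is an illustrative sketch, not the authors' PyTorch code: feature maps are nested lists, biases are omitted, and the weight matrices `W1` and `W2` as well as all sizes in the usage example are hypothetical values chosen for readability.

```python
def squeeze(feature_map):
    """Global average pooling: one mean per channel.

    `feature_map` is a list of channels, each a 2-D list of H x W values.
    """
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]


def relu(vector):
    return [max(0.0, x) for x in vector]


def matvec(matrix, vector):
    """Matrix-vector product; stands in for a bias-free fully-connected layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def excitation(d, W1, W2):
    """Two fully-connected layers, each followed by ReLU, as the text states."""
    return relu(matvec(W2, relu(matvec(W1, d))))


def sse(fm_shallow, fm_deep, W1, W2):
    """Rescale each deep channel by a weight derived from the shallow map."""
    weights = excitation(squeeze(fm_shallow), W1, W2)
    if len(weights) != len(fm_deep):
        raise ValueError("W2 must produce one weight per deep channel")
    return [[[w * x for x in row] for row in ch]
            for w, ch in zip(weights, fm_deep)]


# Hypothetical sizes: C_i = 2 shallow channels, r = 2, C_j = 3 deep channels.
fm_shallow = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
fm_deep = [[[2.0]], [[4.0]], [[6.0]]]
W1 = [[0.5, 0.5]]              # shape (C_i / r) x C_i
W2 = [[1.0], [0.5], [-1.0]]    # shape C_j x (C_i / r)
print(sse(fm_shallow, fm_deep, W1, W2))  # [[[6.0]], [[6.0]], [[0.0]]]
```

Because the weights come from the shallow map while the rescaled map comes from a deeper layer, the spatial sizes of the two inputs do not need to match; only the output size of `W2` must equal the number of deep channels.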

Atrous Spatial Pyramid Pooling Module
In the crack detection task, cracks occupy only a small proportion of the image, and their widths vary considerably. Conventional convolution cannot perform multi-scale analysis of cracks with different widths, which hinders the full capture of crack features. The Atrous Spatial Pyramid Pooling (ASPP) module [35] uses atrous convolutions with different rates to extract multi-scale features of cracks. As shown in Figure 5, the ASPP module contains 5 parallel sub-networks. The first obtains global information through global average pooling, while the remaining four use atrous convolutions with multi-sample rates of 1, 3, 7 and 11. The parallel atrous convolutions are implemented as depthwise separable convolutions to reduce the model complexity. Since the ASPP module captures the contextual information of cracks at multiple scales, it can improve the detection accuracy.
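Two quantities help make this description concrete: the area covered by an atrous kernel at each rate, and the parameter saving from depthwise separable convolution. The sketch below uses the standard formulas; the 3 × 3 kernel size and the channel count of 256 are assumptions for illustration, and biases are ignored.

```python
def effective_kernel(kernel, rate):
    """Side length covered by a kernel x kernel atrous convolution."""
    return kernel + (kernel - 1) * (rate - 1)


def standard_conv_params(kernel, c_in, c_out):
    """Weights of an ordinary convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Depthwise (one k x k filter per input channel) plus 1 x 1 pointwise."""
    return kernel * kernel * c_in + c_in * c_out


for rate in (1, 3, 7, 11):
    # Prints 3, 7, 15 and 23 for the four rates.
    print(rate, effective_kernel(3, rate))

print(standard_conv_params(3, 256, 256))   # 589824
print(separable_conv_params(3, 256, 256))  # 67840
```

The number of weights in the atrous kernel does not change with the rate; only the spacing between its taps grows, which is why larger rates enlarge the receptive field without adding parameters.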

Experimental Results and Ablation Study
In this paper, all experiments are performed on a machine with an Intel(R) Core(TM) i5-9400F CPU @ 2.90 GHz, 32 GB of RAM and an NVIDIA GeForce GTX 1660 GPU. The model was implemented in PyTorch. Models and code are available at [36].

Hyperparameters
SSENets is trained with the Stochastic Gradient Descent (SGD) algorithm on the training set of 4856 images and labels described in Section 2. We use the learning rate reduction strategy proposed by Wilson and Martinez [37], with an initial learning rate of 0.001, a momentum of 0.9, a weight decay of 0.3 and a batch size of 32.
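For reference, a single SGD update with momentum and weight decay can be sketched as below, using the hyperparameter values quoted above as defaults. The update rule follows the common formulation with L2 weight decay folded into the gradient and no dampening, which is an assumption about the authors' setup; the learning rate reduction schedule of [37] is not reproduced here because its details are not given in this section.

```python
def sgd_step(params, grads, velocity, lr=0.001, momentum=0.9, weight_decay=0.3):
    """Return updated (params, velocity) after one SGD step.

    All three inputs are flat lists of floats of equal length.
    """
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p   # L2 penalty folded into the gradient
        v = momentum * v + g       # momentum buffer accumulates gradients
        new_params.append(p - lr * v)
        new_velocity.append(v)
    return new_params, new_velocity


# One step from p = 1.0 with gradient 0.5 and an empty momentum buffer.
params, velocity = sgd_step([1.0], [0.5], [0.0])
```

In a mini-batch setting, `grads` would be the gradient averaged over the 32 samples of a batch.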

Experimental Results
To test the performance of SSENets fairly, we compare it with the model proposed by Xu et al. [25] and with several traditional classification models. All models in the test use the hyperparameters given in Section 3.1. The experimental results are shown in Table 1. Compared with the other models, SSENets achieves a higher detection accuracy [38] of 97.77%, which shows that SSENets performs better on the bridge crack dataset.

Ablation Study
In this section, ablation experiments are conducted to better understand the effect of different configurations of the components of SSENets. All ablation experiments are performed on the dataset described in Section 2.1 with the hyperparameters given in Section 3.1.

SSE Module
Section 2.3 introduced the structure of the SSE module, its effectiveness, and its improvement over the SE module. To verify these claims, we designed the experiment shown in Table 2. The results show that the detection accuracy of the SSE module is 1.44% higher than that of the SE module. This indicates that the SSE module with the skip-connection strategy can effectively enhance network performance and improve crack detection accuracy.

Reduction Ratio
The reduction ratio r is a hyperparameter introduced in Equation (4). Changing r changes the size of the vector between the two fully-connected layers of the excitation operator in the SSE module. To study the influence of r on the experimental results, we keep the input feature maps of the SSE module fixed (the feature map Con_2 from the second convolutional layer and Con_3 from the third convolutional layer). Table 3 shows that the detection accuracy decreases as r increases, and that the highest detection accuracy is obtained when r = 0.5. As shown in [32], the larger r is, the fewer parameters the model has. When r = 0.5, the model has the most parameters and the greatest capacity, and therefore achieves the highest detection accuracy.
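The dependence of the parameter count on r can be checked with a short calculation. The sketch below counts only the weights of the two fully-connected layers of the excitation operator (biases ignored); the channel numbers 64 and 128 and the list of r values other than 0.5 are hypothetical.

```python
def excitation_params(c_in, c_out, r):
    """Weights of the two fully-connected layers for reduction ratio r."""
    hidden = int(c_in / r)  # width of the vector between the two layers
    return c_in * hidden + hidden * c_out


for r in (0.5, 1, 2, 4, 8, 16):
    print(r, excitation_params(64, 128, r))
```

With these assumed channel numbers, r = 0.5 doubles the hidden width to 128 and gives 24576 weights, while r = 16 shrinks it to 4 and gives 768, so the count falls monotonically as r grows.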

Location of SSE Module
To study the effect of the location of the SSE module on detection accuracy, we remove the ASPP module from every model and keep the reduction ratio r and the other hyperparameters identical. Since the model has 6 convolutional layers, we select the 5 pairs of adjacent convolutional layers in turn as the input of the SSE module. As shown in Table 4, the SSE module with Con_2 and Con_3 as input achieves the highest detection accuracy, 96.87%. Feature maps from the shallow layers have few channels, so the global information gathered by the squeeze operator is limited, which restricts the capacity of the model. Feature maps from the deeper layers have more channels, which increases the risk of overfitting when the dataset is small. Therefore, the location of the SSE module should be chosen for each dataset to achieve the best detection accuracy.
To find the relationship between detection performance and the skipping span between the input feature maps of the SSE module, we keep the second input feature map unchanged and vary the skipping span between the two inputs. The experimental results are shown in Table 5. The larger the skipping span, the higher the detection accuracy. The reason is that the SSE module uses the skip-connection strategy, which applies the channel weights obtained from the shallow feature map to the deeper feature map and thereby establishes a gradient connection between the shallow and deeper parts of the network. As the skipping span increases, the gradient correlation between the shallow and deeper parts increases, which raises the transmission capacity of the network and further improves its performance.

ASPP Module
To verify the contribution of the ASPP module and the influence of different sampling rates on the experimental results, we use the model without the ASPP module as the control group and set the multi-sample rates of the remaining three models to [1,3,6,9], [1,3,7,11] and [1,4,8,12], respectively. The experimental results are shown in Table 6. The models with the ASPP module achieve higher detection accuracy than the control group, and the highest detection accuracy is obtained with rates [1,3,7,11]. Compared with [1,3,6,9], the rates [1,3,7,11] give a larger receptive field and so capture more contextual information, which improves the detection accuracy. However, the cracks are tiny, and their size is reduced further by downsampling. An excessive rate causes a 3 × 3 atrous convolution to degenerate into a simple 1 × 1 convolution [39], so the detection accuracy with rates [1,4,8,12] is lower than with [1,3,7,11]. In practical applications, the characteristics of the detection object have to be considered and appropriate sampling rates chosen to achieve the best detection performance.
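The degeneration of a 3 × 3 atrous convolution at large rates can be illustrated by counting how many of its nine taps fall inside the feature map. The sketch below assumes zero padding, so out-of-bounds taps contribute nothing; the 14 × 14 feature-map size is a hypothetical value, not one taken from the model.

```python
def valid_taps_1d(size, rate, position):
    """Taps of a 3-tap atrous kernel centred at `position` inside [0, size)."""
    return sum(0 <= position + offset < size for offset in (-rate, 0, rate))


def mean_valid_taps(size, rate):
    """Average in-bounds taps of a 3 x 3 atrous kernel on a size x size map.

    The 2-D count at (y, x) is the product of the two 1-D counts, so the
    average over the whole map is the square of the 1-D average.
    """
    per_axis = sum(valid_taps_1d(size, rate, p) for p in range(size)) / size
    return per_axis ** 2


for rate in (1, 3, 7, 11, 12, 14):
    print(rate, round(mean_valid_taps(14, rate), 2))
```

On this assumed map, the average number of useful taps falls steadily as the rate grows, and once the rate reaches the side length of the map only the centre tap remains, which is exactly a 1 × 1 convolution.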

Performance of Models
To analyze the testing results quantitatively, we compare the models using several evaluation metrics commonly used in binary classification, which are discussed in detail in [25]. According to the evaluation results in Table 7, SSENets is superior to the other models in accuracy, precision, specificity and F1 score.
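These metrics follow from the four confusion-matrix counts, with crack images as the positive class. The sketch below uses the standard definitions; the counts in the usage example are hypothetical and are not taken from Table 7.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: crack images classified as crack/background;
    tn/fp: background images classified as background/crack.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=10, tn=80, fn=20)
print(metrics)
```

Specificity is the share of background images classified correctly, so a higher specificity means fewer background images mistaken for cracks.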

The 5-Fold Cross-Validation
Furthermore, we use 5-fold cross-validation to demonstrate the generalization ability of the models. After dividing the dataset evenly into five parts, we use each part in turn as the testing set and the rest as the training set. The training detection accuracy is shown in Table 8 and the testing detection accuracy in Table 9. As shown in Tables 8 and 9, SSENets achieves the highest average detection accuracy in both training and testing. To make the data more intuitive, we plot the results of the 5-fold cross-validation as histograms, which are shown in Figure 6.
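The fold construction can be sketched as follows. The shuffling, the seed, and the use of all 6069 images are assumptions; the text states only that the dataset is divided evenly into five parts.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs with near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed; the paper does not say
    base, extra = divmod(n_samples, k)
    splits, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


splits = k_fold_splits(6069)
print([len(test) for _, test in splits])  # [1214, 1214, 1214, 1214, 1213]
```

A purely random split does not control the ratio of crack to background images in each fold; a stratified split would be one way to keep that ratio stable across folds.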

Computational Efficiency and Complexity of Models
We use floating-point operations (FLOPs) and running time to measure the efficiency and complexity of the models. As shown in Table 10, compared with Xu's model, the FLOPs of SSENets increase by 0.4% and the running time increases by 1%. Compared with ResNet50, which performs best among the ResNets, the FLOPs of SSENets decrease by 38.35% and the running time decreases by 30.99%.
In this part, we discuss the performance of SSENets relative to the other models:

1. In Section 3.4.1, Table 7 shows that SSENets achieves better accuracy, precision, specificity and F1 score than the other models. This indicates that the embedded SSE module, which takes feature maps of different depths as input, improves the effectiveness of the model by recalibrating the feature maps with the squeeze and excitation operators.

2. As shown in Tables 8 and 9, the improvement in testing accuracy is larger than that in training accuracy, which indicates that SSENets has better generalization ability. In addition, all the models obtain low detection accuracy in the third fold of the cross-validation. The reason is that the testing set of this fold contains about two-thirds of the background images, so the corresponding training set contains comparatively few background images. Although this imbalance affects the training of all models, SSENets still achieves a higher detection accuracy than the other models. Considering the large improvement in specificity shown in Table 7, we conclude that SSENets can reduce the proportion of background images that are classified as crack images.

3. By taking advantage of depthwise separable convolution, SSENets requires fewer FLOPs and a shorter running time than ResNets. SSENets can therefore greatly reduce the complexity of the model and improve computational efficiency, thus improving the detection performance of the model.

4. Although SSENets achieves high detection accuracy in most situations, it still has limitations. As the number of negative samples in the training set decreases, the detection accuracy of SSENets decreases, and we will devote future work to this problem.

Conclusions
In this paper, an image classification model SSENets for crack detection is proposed, which is mainly composed of the SSE module using the skip-connection strategy and the ASPP module using the atrous convolution with multi-sample rates. By applying the channel weights generated by shallow feature map to the deeper feature map, SSE module establishes the gradient connection between the shallow network and deeper network. It will alleviate the vanishing gradient during the network training, increase the gradient correlation, and enhance the transmission ability of the model. In view of the crack detection task, we introduce the ASPP module to capture multi-scale features from crack images, thereby improving the accuracy of crack detection. The proposed model can achieve a detection accuracy of 97.77%, which performs better than the comparison models.
Furthermore, the SSE module can be embedded in any convolutional neural network to improve performance. In future work, we will apply SSE module to pixel-level crack detection. Given the computational complexity of this task, we hope that the SSE module will reduce the model parameters while improving the detection accuracy.