Concrete Crack Detection Based on Well-Known Feature Extractor Model and the YOLO_v2 Network

Abstract: This paper compares the crack detection performance (in terms of precision and computational cost) of the YOLO_v2 using 11 feature extractors, which provides a base for realizing fast and accurate crack detection on concrete structures. Cracks on concrete structures are an important indicator for assessing their durability and safety, and real-time crack detection is an essential task in structural maintenance. Object detection algorithms, especially the YOLO series of networks, have significant potential in crack detection, and the feature extractor is the most important component of the YOLO_v2. Hence, this paper employs 11 well-known CNN models as the feature extractor of the YOLO_v2 for crack detection. The results confirm that different feature extractor models of the YOLO_v2 network lead to different detection results: the AP value is 0.89, 0, and 0 for 'resnet18', 'alexnet', and 'vgg16', respectively; meanwhile, 'googlenet' (AP = 0.84) and 'mobilenetv2' (AP = 0.87) demonstrate comparable AP values. In terms of computing speed, 'alexnet' takes the least computational time, and 'squeezenet' and 'resnet18' are ranked second and third, respectively; therefore, 'resnet18' is the best feature extractor model in terms of the trade-off between precision and computational cost. Additionally, the parametric study (influence of the training epoch, feature extraction layer, and testing image size on detection results) shows that the associated parameters indeed have an impact on the detection results. It is demonstrated that excellent crack detection results can be achieved by the YOLO_v2 detector, in which an appropriate feature extractor model, training epoch, feature extraction layer, and testing image size play an important role.


Introduction
Timely detection of cracks in concrete structures is an important step of structural health monitoring (SHM) [1]. In recent years, due to aging and environmental impacts, surface cracks of infrastructures, especially concrete structures, are a hidden danger that needs to be focused on [2]. Therefore, it is necessary to detect the cracks of concrete structures to prevent any further losses in their durability. In general, inspectors collect images or videos through on-site optical instruments, process the collected data, and finally draw the inspection conclusions; this is an effective method for simple tasks, but it is unsuitable for large-scale inspection due to its low efficiency and high cost [3]. The detection of bridge surface defects based on images is highly repetitive work [4]. In the case of a large amount of data, manual detection is tedious, inefficient, and expensive [5]. Therefore, it is essential to investigate some automated methods, e.g., artificial intelligence (AI) technology, to undertake this labor-intensive work.
AI technology provides a more advanced method for SHM, which has the ability to perform various tasks (such as classification or regression) with outstanding performance. A number of AI-based image processing methods can be used for bridge defect detection [5,6]. Fujita et al. detected cracks in an asphalt pavement surface using a support vector machine (SVM); Shi et al. detected road cracks using random structured forests (RSF) [7]. Furthermore, as an emerging AI algorithm, the artificial neural network (ANN) [8] has been used to classify rail surface cracks [9], and detect potholes on an asphalt pavement surface [10], which has become a popular technique in the SHM field. However, the practical application of this method is limited due to its slow convergence, over-fitting, and high computational cost, etc. [11]. Therefore, a fast and automatic feature extraction algorithm [12,13] is needed to process the huge monitoring data [12].
The application of deep convolutional neural network (DCNN) algorithms further improves the detection method: a DCNN can automatically extract features from the raw data and gradually obtain advanced features through multiple processing layers [14]. In the field of SHM, the CNN can extract the signal distribution features from the images of fused time and frequency domain information [15], extract the frequency features from the acceleration signals [12], and extract the structural damage features from the modal shapes [13]. Meanwhile, a DCNN uses partial connections and pooling of neurons and thus requires less computation and has better robustness, which makes the DCNN an effective and fast SHM method. A lot of research has been conducted for SHM based on the DCNN [16,17] by using vibration signals [12,18,19] and defect images [16,20,21]. In the field of defect image-based SHM [22], image classification is a popular method for automatic defect detection. The DCNN has been used in image classification for pavement cracks [23,24], sewer defects [21], and road damages [25]. In further research to determine the location of defects, a basic method was proposed to scan the image with a fixed-size sliding window and then apply the trained DCNN to each small window. This method was used by Cha [26], and also to classify two types of defects, cracking and spalling [27]. These studies confirmed that the location of cracks in an image can be obtained by using a sliding window. However, the challenge of this sliding window method is to find the appropriate window size when dealing with defects of different scales. Moreover, the computational cost of this method is very high, because the DCNN classifier must be applied to every window in every image. To improve the efficiency of detecting and locating an object (such as a crack), more advanced object detection technology needs to be further explored.
Region-based classification or object detection provides a state-of-the-art method for SHM. This method creates a bounding box around the region of interest (ROI), such as cracks, spalls, components, etc. Common methods include two-stage and one-stage algorithms. Two-stage algorithms include region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN. The R-CNN has been applied to post-event building reconnaissance with an accuracy of nearly 60% [28], and the related research shows that its computation speed is low [29]. As an advanced version of the R-CNN, the Fast R-CNN was proposed to improve computational efficiency and accuracy, and it has been used to detect different defects and their locations on concrete structures [30]. Then the Faster R-CNN was proposed by introducing a region proposal network (RPN) [29], and it has been employed to automatically detect structural components of the RC bridge system [31] and cracks in asphalt pavements [32]. Although the above studies demonstrated the ideal accuracy of two-stage detectors, their detection is not very fast. High-speed detection is essential for the development of real-time automatic inspection systems, which seem to be the future trend of the industry [33]. Due to the above limitations, one-stage detectors have been proposed. The popular one-stage models include You Only Look Once (YOLO) and the single-shot multi-box detector (SSD). These are faster than two-stage deep learning object detectors, such as region-based CNNs (e.g., Faster R-CNN) [34]. Recent studies employed the YOLO network to detect multiple concrete bridge defects [6] and pavement cracks [35]; the SSD was also applied to detect road defects in real time [25]. However, some researchers found that locating defects using one-stage detectors could compromise the accuracy of the detection [36]. The above research results show that the crack detection method based on an object detection algorithm has become a hot topic, and both two-stage and one-stage algorithms need a feature extractor (i.e., a CNN model), but the effect of different feature extractors on image feature extraction is not clear. With the continuous updating of neural networks, some well-known CNN models are widely used for different tasks, which will provide more suitable feature extractors for defect detection. Therefore, the influence of different feature extractors on crack detection is a problem worthy of in-depth investigation.
Appl. Sci. 2021, 11, 813
The application of transfer learning technology provides a state-of-the-art method for feature extraction. As the well-known CNN models have strong feature extraction ability, it will save a lot of time (no-training process) to use these CNN models as feature extractors. Therefore, this paper compares various CNN models (e.g., 'alexnet', 'resnet18', etc.) as the feature extractor of the YOLO_v2 to identify a model with high precision and fast computational speed. Meanwhile, parametric studies are implemented to confirm the effect of the relevant parameters on the detection results. Finally, the influence of different feature layers on the detection results is revealed by feature visualization.

Methods
Popular programming languages such as Java, C++, Python, and MATLAB (MathWorks Inc., Natick, MA, USA) are usually used to build the YOLO_v2, among which MATLAB is more suitable for non-professional programmers. In this paper, the YOLO_v2 was established by using MATLAB, in which 11 CNN models were used respectively as the feature extractor of the YOLO_v2 network. The precision and computational cost were compared by using different feature extractors.

YOLO_v2
The YOLO_v2 object detector is a one-stage detection network, which runs a CNN model (Figure 1) on an input image to produce feature images from a feature extraction layer. Then, these feature images are input into the YOLO_v2 detection layer and the classification and anchor box of the detection object are obtained. Therefore, the selection of the feature extractor model would be an interesting focus for object detection. It should be noted that the detailed feature extractor contains the convolution, pooling, and ReLU layers [18].

Furthermore, anchor boxes (Figure 2) were adopted to detect classes of objects in the image. Anchor boxes are a set of predefined bounding boxes of certain heights and widths. These boxes (defined based on the object sizes of the samples) capture the scale and aspect ratio of the specific object classes to be detected. The use of anchor boxes significantly improves the efficiency of object detection [37], making real-time detection possible.

For the detection layer, the YOLO_v2 predicted the location of the anchor box by the following four steps (Figure 2):
Step 1: Feature images were obtained from the feature extractor;
Step 2: The predefined anchor boxes were tiled across the image;
Step 3: To generate the final object detections, tiled anchor boxes that belonged to the background class were removed;
Step 4: The location error (the distance between the refined and predefined anchor boxes) was gradually reduced by minimizing the loss function.
As shown in Figure 2, the upper-left coordinates of the predefined anchor box were (x1, y1), and those of the refined anchor box were (x2, y2). During training, the precise location of the anchor box was obtained by minimizing the loss function (squared error loss, SEL):
$$\mathrm{SEL} = (x_1 - x_2)^2 + (y_1 - y_2)^2 \quad (1)$$
Finally, the most suitable anchor box (crack location) and the highest classification score (crack or background) were obtained. As a result, the YOLO_v2 predicted the class probability (i.e., how precisely the model found the object) and the crack location (i.e., using the anchor box) for each image.
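The tiling of predefined anchor boxes (Step 2) and the location error (Step 4) can be illustrated with a minimal Python sketch. It is not the MATLAB implementation used in this paper: the function names, the grid stride, the anchor sizes, and the example coordinates are hypothetical, and the loss is assumed to be the squared distance between the upper-left corners of the predefined and refined anchor boxes.

```python
from itertools import product


def tile_anchor_boxes(image_size, stride, anchor_sizes):
    """Tile predefined anchor boxes across an image.

    Returns a list of (x, y, w, h) tuples, where (x, y) is the upper-left
    corner. One box per anchor size is placed at every grid position.
    """
    height, width = image_size
    boxes = []
    for y, x in product(range(0, height, stride), range(0, width, stride)):
        for w, h in anchor_sizes:
            boxes.append((x, y, w, h))
    return boxes


def squared_error_loss(predefined, refined):
    """Squared distance between the upper-left corners of two boxes.

    Assumption: only the corner coordinates (x1, y1) and (x2, y2) enter
    the loss, as suggested by the description of Figure 2.
    """
    x1, y1 = predefined[:2]
    x2, y2 = refined[:2]
    return (x1 - x2) ** 2 + (y1 - y2) ** 2


# Hypothetical values, chosen only for illustration.
anchors = tile_anchor_boxes(image_size=(224, 224), stride=32,
                            anchor_sizes=[(60, 120), (120, 60)])
loss = squared_error_loss(predefined=(32, 64, 60, 120),
                          refined=(35, 60, 58, 118))
print(len(anchors), loss)
```

In this sketch, a smaller stride or more anchor sizes increases the number of candidate boxes, which is one reason why removing background boxes (Step 3) matters for speed.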

Transfer Learning-Based Feature Extractors
Transfer learning technology is the reuse of a pre-trained model on a new problem. With transfer learning technology, we transferred the weights that CNN had learned at 'Question A' to a new 'Question B'. Some well-known CNN models (e.g., alexnet, googlenet, etc.) were trained on the ImageNet database [22] which was used in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [38]. These CNN models had strong feature extraction ability and could classify images into 1000 object categories. Therefore, we employed these excellent CNN models as feature extractors of the YOLO_v2.

Experimental Setup and Performance Evaluation
The image dataset included 990 RGB crack images (namely CR, 227 × 227 pixels) of a concrete bridge. Among them, 90% of the images (891 images) were used for training and 10% (99 images) for testing the performance of the YOLO_v2. The 'Image labeler' toolbox in MATLAB was employed to label cracks (including the classification and location of cracks with anchor boxes). The performance of the YOLO_v2 was investigated by comparing four factors:
(1) Feature extractor model. The 11 well-known CNN models (illustrated in Section 2.2) were used as the feature extractor of the YOLO_v2;
(2) Maximum epoch. An epoch was a full pass over the entire training dataset. When the precision of the YOLO_v2 reached a plateau and was clearly no longer improving, the starting epoch of the plateau was defined as the maximum epoch. In this paper, 1~30 epochs were designed to observe the change of detection results with epochs;
(3) Feature images in different CNN layers. The optimal model obtained in Step (2) was the feature extractor, and feature images of multiple layers were obtained and used as the input of the YOLO_v2 detection layer; the influence of feature images of different layers on the detection results was studied;
(4) Testing image size. Testing image sets with different sizes were input into the YOLO_v2 to study the influence of image resolution on the detection results.
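The 90%/10% division of the 990 images described at the start of this section can be sketched in Python as follows. The random shuffling, the seed, and the function name are assumptions made for illustration; the paper does not state how the images were assigned to the two sets.

```python
import random


def split_dataset(num_images, train_fraction=0.9, seed=0):
    """Split image indices into training and testing subsets.

    Assumption: images are assigned at random; a fixed seed keeps the
    split reproducible.
    """
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    num_train = round(num_images * train_fraction)
    return indices[:num_train], indices[num_train:]


train_idx, test_idx = split_dataset(990)
print(len(train_idx), len(test_idx))  # 891 99
```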
Computational cost and detection precision are important indexes to evaluate the performance of the YOLO detector. The computation time (in seconds) was the processing time of the entire detection process, which included feature extraction and detector training. Detection precision included precision, recall, and average precision (AP). The following is a detailed explanation of these three terms:
Precision: represented the classification effect of the classifier, which was the ratio of true positive instances to the total instances predicted to be positive (Equation (2)).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (2)$$
Recall: was the ratio of true positive instances to the sum of true positives and false negatives in the detection (Equation (3)).
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (3)$$
where TP (true positive instances): positive instances that were predicted to be positive instances. FN (false negative instances): positive instances that were predicted to be negative instances. FP (false positive instances): negative instances that were predicted to be positive instances.
AP was used to reflect the detection precision, which was a combined outcome of the precision and recall metrics.
$$AP = \sum_{k=1}^{N} P(k)\,\Delta R(k) \quad (4)$$
where k = 1, 2, . . . , N; N is the number of samples, P(k) is the precision at the k-th sample, and ∆R(k) is the difference in recall between the k-th sample and the (k − 1)-th sample.
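The three metrics can be computed with a short Python sketch. It assumes that the 'samples' in the AP definition are detections sorted by descending confidence score and that each detection has already been marked as a true or false positive; the function names and the example values are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(is_true_positive, num_ground_truth):
    """AP as the sum over k of P(k) * delta R(k).

    is_true_positive: one boolean per detection, sorted by descending
    confidence score (an assumption about how samples are ordered).
    num_ground_truth: number of labelled cracks (TP + FN).
    """
    tp = fp = 0
    previous_recall = 0.0
    ap = 0.0
    for hit in is_true_positive:
        if hit:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += precision * (recall - previous_recall)
        previous_recall = recall
    return ap


# Hypothetical example: four detections, four labelled cracks.
print(precision_recall(tp=3, fp=1, fn=1))                   # (0.75, 0.75)
print(average_precision([True, True, False, True], 4))      # 0.6875
```

A false positive leaves the recall unchanged, so it contributes nothing to the sum directly, but it lowers the precision of every later true positive.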

Crack Detection Results of the YOLO_v2
The testing images described in Section 2.3 were input into the trained YOLO_v2; the detection results of using 11 CNN models as the feature extractors are listed in Table 3. The feature extractors based on six transfer learning models achieved superior detection results, with AP values higher than 0.6. Among them, 'resnet18' achieved the best detection results with the AP value reaching 0.89, and 'googlenet' (AP = 0.84) and 'mobilenetv2' (AP = 0.87) also had comparable AP values. The performance of some feature extractors (i.e., 'alexnet', 'resnet50', 'vgg16') was unsatisfactory, with AP values close to 0.
The computational cost was an important index to evaluate whether the model was suitable for fast and real-time detection. In this paper, the computation time of the 11 models was recorded and is listed in Table 3. The 'alexnet' took the least computation time, and the 'squeezenet' and 'resnet18' were ranked second and third, respectively; the 'inceptionresnetv2' used the most computation time, which was more than 10 times that of 'alexnet'. Within a given series of networks, the computational cost increased with the network complexity, e.g., in the series of 'resnet18', 'resnet50', and 'resnet101'. On the whole, 'resnet18' was the best detector in the trade-off between detection precision and computational cost.

Parametric Study
Three parameters of the YOLO_v2 using 'resnet18' (as the feature extractor) were studied: (1) influence of the training epochs on detection results; (2) influence of the feature extraction layer on detection results; (3) influence of the testing image size on detection results.
(1) A case study for the maximum epoch was performed on a training set with 891 training images to determine the optimal epoch for the training. The AP value and computing time are recorded in Figure 4. By studying the influence of the epoch number on detection results, the optimal detection precision could be achieved with fewer epochs: once the detection precision was stable, further increasing the epoch number only increased the computational cost, which was not necessary.
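The definition of the maximum epoch as the starting epoch of the precision plateau can be expressed as a small Python sketch. The tolerance and the AP values below are hypothetical and do not reproduce the data of Figure 4; the paper does not specify a numerical criterion for 'no longer improving'.

```python
def plateau_start(ap_per_epoch, tolerance=0.01):
    """Return the first 1-based epoch after which AP never improves
    by more than `tolerance` over the AP at that epoch.

    Assumption: 'no longer improving' means every later AP value is at
    most `tolerance` above the current one.
    """
    for epoch, ap in enumerate(ap_per_epoch, start=1):
        later = ap_per_epoch[epoch:]
        if all(value - ap <= tolerance for value in later):
            return epoch
    return len(ap_per_epoch)


# Hypothetical AP values for eight epochs, for illustration only.
ap_history = [0.10, 0.45, 0.70, 0.85, 0.88, 0.885, 0.88, 0.885]
print(plateau_start(ap_history))
```

Training beyond the returned epoch would, under this criterion, add computing time without a meaningful gain in AP.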
(2) Eight layers (L1~L8 in Figure 5) of the 'resnet18' were employed as the feature extraction layers of the YOLO_v2 respectively. These networks were trained with the training data, the computational cost was recorded, and then the testing data were input into the eight YOLO_v2 detectors. The results are illustrated in Table 4. The AP value increased first, and then decreased with the increase of the layer depth; 'L6' achieved the best detection results. However, the computational time increased gradually by selecting a deeper layer as the feature extraction layer.
(3) According to the optimal network obtained above (the YOLO_v2 with 'resnet18', and 'L6' as the feature extraction layer), seven testing image sets (Section 2.3) with different sizes were input into the YOLO_v2 respectively to evaluate the influence of image resolution (image size) on detection results (Table 5). The results indicated that, with the increase of resolution, the AP value increased gradually and tended to become stable. However, the FPS (frames per second), which evaluates the testing speed, decreased gradually with the increase of resolution. Therefore, an image with an appropriate resolution could achieve the optimal combination of precision and computing speed.
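The FPS used to evaluate the testing speed is the number of processed images divided by the elapsed time. A minimal Python sketch of such a measurement is given below; the `detect` callable and the dummy images are hypothetical placeholders for a trained detector and the testing image sets.

```python
import time


def measure_fps(detect, images):
    """Run `detect` on every image and return frames per second."""
    start = time.perf_counter()
    for image in images:
        detect(image)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed if elapsed > 0 else float("inf")


def dummy_detect(image):
    # Placeholder workload: its cost grows with the number of pixels,
    # mimicking the slower detection on larger images.
    return sum(image)


small_images = [[0] * (64 * 64) for _ in range(20)]
large_images = [[0] * (256 * 256) for _ in range(20)]
fps_small = measure_fps(dummy_detect, small_images)
fps_large = measure_fps(dummy_detect, large_images)
print(fps_small, fps_large)
```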

Visualization of Features
As the feature extractor of the YOLO_v2, the deep CNN was a key component. In this paper, the features extracted by 'resnet18' were visualized to explain the influence of the feature extraction process on the detection results. The feature images of L1~L8 (Figure 5) were displayed respectively. Figure 6 shows the feature images of L1 and L2; 64 feature images were obtained for each layer. Some feature images showed the outline of the crack, but there was interference from non-target objects (some grass and background).
Figure 7 shows the feature images of L3 and L4; 128 feature images were obtained for each layer. Compared with L1 and L2, the resolution of the feature images was reduced due to the convolution process. Some extra interference was filtered out. The ideal feature (crack shape, marked by a red circle) of the crack was extracted from some feature images. Therefore, using L3 and L4 as the feature extraction layer, the detection effect (Table 4) was better than with L1 and L2.
Figure 8 shows the feature images of L5 and L6; 256 feature images were obtained for each layer. The resolution continued to decrease and the features were still visible, as shown in the marked location in Figure 8. Using L5 and L6 as feature extraction layers, the detection performance (Table 4) remained at a high level.
Figure 9 shows the feature images of L7 and L8; 512 feature images were obtained for each layer. At this point, the resolution continued to decrease. At L7, the crack features were still visible, but at L8, the image became extremely fuzzy and the visual features disappeared. This was also confirmed in Table 4 (detection results).
In general, a deeper network layer (which yields lower-resolution feature images) could help to filter out some non-target objects (grass and background). However, this did not mean that the lower the resolution, the better the detection results. If the resolution was too low, it was easy to ignore the features of the object (crack), which can be confirmed in Table 4. Therefore, the selection of the feature extraction layer, which determined the feature information obtained by the YOLO_v2 detection layer, was worth further study. Figure 10 shows some results of crack feature extraction ('resnet18' with 'L6' as the feature extraction layer).

Conclusions
In this paper, we have compared the detection performance of 11 well-known CNN models as the YOLO_v2 feature extractors and identified an excellent feature extractor based on the precision and computational cost. Then, parametric studies were carried out on the influence of (1) the training epochs, (2) the feature extraction layer, and (3) the size of the testing images on the detection results. The associated parameters indeed have an impact on the detection results. Finally, the working mechanism of the feature extractor was explained by visualizing the features obtained by the feature extractor.
Based on the above research results, this paper draws the following conclusions:
1. The comparison of 11 well-known network models indicated that the 'resnet18' has a high precision (AP = 0.89) and fast computing speed.
2. Influence of relevant parameters on detection results:
(a) Once the detection precision is stable, there is no need to increase the epoch number, which will increase the computing cost.
(b) An appropriate selection of the feature extraction layer can help to improve the detection results. Too shallow or too deep layers can also lead to unsatisfactory detection results.
(c) The detection precision increases with the resolution of the images used; but once it reaches the optimal value, it is meaningless to further increase the image resolution, as it means more detection time.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Some or all data, models, or code generated or used during the study are available from the corresponding author by request.

Conflicts of Interest: The authors declare no conflict of interest.