1. Introduction
With the rapid development of China’s economy, the electric power industry has ushered in unprecedented opportunities. The large-scale construction of power facilities has expanded the coverage of transmission lines, while geological and environmental conditions in different regions directly affect their safety and reliability [1]. As shown in Figure 1, geologic hazards (e.g., avalanches, landslides, mudslides, and ground subsidence) pose serious threats to the stable operation of transmission lines, potentially leading to line interruptions, equipment damage, and even casualties [2,3]. Therefore, detailed geological surveys are essential before construction to assess disaster risks and enhance resilience.
According to the Code for Geologic Hazard Risk Assessment (GB/T 40112-2021) [4], surface cracks should be treated seriously in the assessment of geologic hazards. The presence of surface cracks may not only trigger uneven settlement of foundations but also be a precursor of larger-scale geohazards (e.g., landslides or ground collapse) [5]. Cracks on the rock surface can induce weathering and fragmentation due to moisture, chemical agents, plant root infiltration, or human activities. Under heavy loads, the rock may break along these cracks, potentially compromising the safety of power transmission towers. Therefore, accurate identification of surface cracks is of great practical significance for the siting and construction of transmission lines. However, traditional surface crack identification methods mainly rely on manual investigation and simple image processing techniques, such as impact echo methods [6], radar methods [7], and infrared thermography methods [8]. These methods are inefficient and subjective, making it difficult to satisfy large-scale and high-precision engineering demands. Especially in complex surface environments, manual recognition is easily affected by lighting, texture, and background interference, resulting in low recognition accuracy and poor generalization ability [9,10,11]. An efficient, accurate, and adaptable crack recognition method is therefore needed.
In recent years, surface crack identification has attracted the attention of many researchers. The most common crack recognition approach applies digital image processing to the collected crack images to determine the location of the cracks. In 2003, Abdel-Qader et al. [12] used image segmentation for crack recognition, and their results confirmed its strong performance. Subsequently, Ayenu-Prah et al. [13] and Wei et al. [14] added image denoising and image enhancement as preprocessing steps to improve crack recognition accuracy. To cope with non-crack target interference and low robustness, Han et al. [15] further reduced the effect of noise using Gaussian spatial filtering and the top-hat transform. These methods can obtain relatively good crack recognition performance with a small data volume; however, they require a complex feature engineering process, which limits their practical application.
The rapid development of artificial intelligence technology provides new solutions for surface crack recognition. Unlike the traditional image processing methods above, deep learning methods use neural networks to automatically capture regularities in images, which solves the problems related to feature extraction [16,17,18]. A deep learning-based feature extractor can extract the regularities hidden in an image, such as crack edges, width, position, and brightness. Liu et al. [19] used a deep learning algorithm to automatically identify tunnel lining cracks. Fan et al. [20] proposed a supervised pavement crack detection method based on convolutional neural networks and solved the data imbalance problem by adjusting the ratio of positive and negative samples. Liu et al. [21] used the UNet model to identify cracks on concrete surfaces. Cao et al. [22] incorporated an attention mechanism into the encoder–decoder neural network structure to recognize pavement cracks. Al-Huda et al. [23] proposed a hybrid deep learning approach for crack localization and crack recognition on pavement crack images.
Despite the significant progress in existing research, the following shortcomings still exist: (1) traditional methods rely on manual feature extraction, making it difficult to adapt to complex and changing real-world scenarios; (2) existing deep learning models perform poorly when dealing with tiny cracks and complex backgrounds; and (3) most of the methods lack an end-to-end solution, which leads to low recognition efficiency. These problems limit the application and promotion of crack recognition technology in practical engineering.
To obtain an efficient, accurate, and highly generalizable surface crack identification algorithm, this paper proposes a smart surface crack identification method that integrates multiple neural networks, which we call the MultiNet crack identification method. First, a crack classification strategy based on convolutional neural networks (CNN) is designed to filter out images that do not contain cracks, thereby improving the efficiency of subsequent crack identification. Subsequently, a crack identification algorithm based on the UNet-YOLO architecture is constructed, in which the UNet module performs the preliminary segmentation of crack images and the YOLO module optimizes the identification results. This combination significantly reduces the interference of non-crack information, such as shadows and potholes, in the recognition outcomes. Finally, to validate the accuracy of the proposed algorithm, experiments are conducted on a surface rock crack dataset, and the results indicate that crack information can be accurately identified and extracted. Rock crack identification is a crucial basis for assessing the degree of surface fragmentation in rocks. Based on the fragmentation level, it is possible to preliminarily infer whether the geological conditions in the area are suitable for the construction of transmission towers. Generally, a high density of cracks and significant fragmentation in rocks can negatively impact the stability of transmission towers, making them susceptible to collapse due to unstable foundations. Additionally, to assess the generalization capability of the proposed algorithm, it is validated on other crack identification scenarios, confirming good crack identification performance across different scenarios.
Therefore, the research in this paper not only provides reliable technical support for transmission station siting but also promotes the development of intelligent crack identification and extraction technology, which has important theoretical value and engineering application significance.
3. CNN-Based Image Classification Module
In order to effectively filter out images without crack features, a lightweight CNN-based image classification module was designed (Figure 3). The model contains four convolutional layers, four pooling layers, and two fully connected layers. It adopts a series of regularization methods to prevent overfitting and uses ReLU as the activation function.
3.1. Convolution Operation
Let $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$ be the three-dimensional pixel tensor of the input RGB surface image, where the channel index $k$ runs over the R, G, and B channels. The core part of the designed CNN-based classification model is the convolutional layer, which gradually extracts the multi-level spatial features of the input image by means of local receptive fields and a weight-sharing mechanism. Let $\mathbf{X}^{(l)} \in \mathbb{R}^{C_{in}^{(l)} \times H^{(l)} \times W^{(l)}}$ be the input tensor of the convolutional layer, where $l$ is the index of the convolutional layer, $C_{in}^{(l)}$ denotes the number of input channels at layer $l$, and $H^{(l)}$ and $W^{(l)}$ denote the height and width of the input image of layer $l$, respectively. $x^{(l)}_{k,i,j}$ represents the pixel value of the input tensor at position $(i, j)$ of channel $k$ of layer $l$. When $l$ = 1, $\mathbf{X}^{(1)} = \mathbf{X}$ and $C_{in}^{(1)}$ = 3.
Let $\mathbf{K}^{(l)} \in \mathbb{R}^{C_{out}^{(l)} \times C_{in}^{(l)} \times F \times F}$ be the convolution kernel, where $C_{out}^{(l)}$ is the number of output channels of layer $l$, $F$ is the convolution kernel size, and $w^{(l)}_{c,k,m,n} \in \mathbf{K}^{(l)}$, where $w^{(l)}_{c,k,m,n}$ represents the weight value of the convolution kernel of output channel $c$ of layer $l$ applied to input channel $k$ at position $(m, n)$. The corresponding bias vector is $\mathbf{b}^{(l)} \in \mathbb{R}^{C_{out}^{(l)}}$.
The convolution operation is given by Equation (1):
$$y^{(l)}_{c,i,j} = \sum_{k=1}^{C_{in}^{(l)}} \sum_{m=0}^{F-1} \sum_{n=0}^{F-1} w^{(l)}_{c,k,m,n}\, x^{(l)}_{k,i+m,j+n} + b^{(l)}_{c} \tag{1}$$
In Equation (1), $(i, j)$ represents the position of the pixel values of the feature matrix, and $x^{(l)}_{k,i+m,j+n}$ represents the pixel value of channel $k$ of the input tensor of layer $l$ at position $(i+m, j+n)$. The number of channels in the convolutional layers starts from the three RGB channels of the input image, increases layer by layer to 48, 96, and 192, and finally decreases to 96 channels in the convergent convolutional layer. This design enhances the model’s ability to capture the local details and overall structural features of the cracks. In this paper, the convergent convolution operation uses small 3 × 3 convolution kernels, which can effectively capture the local details of the crack.
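As an illustration only, the convolution described above can be sketched in plain Python for a single sample. The function name `conv2d_single`, the nested-list layout, and the choice of stride 1 without padding are assumptions made for this sketch; the paper does not state the padding used in the classification module.

```python
def conv2d_single(x, w, b):
    """Stride-1, no-padding convolution for one sample (illustrative sketch).

    x: input tensor   [C_in][H][W]
    w: kernel weights [C_out][C_in][F][F]
    b: bias vector    [C_out]
    Returns a tensor  [C_out][H-F+1][W-F+1].
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    f = len(w[0][0])  # kernel size F
    out = []
    for c, kernel in enumerate(w):
        fmap = []
        for i in range(height - f + 1):
            row = []
            for j in range(width - f + 1):
                acc = b[c]  # start from the bias of output channel c
                for k in range(c_in):
                    for m in range(f):
                        for n in range(f):
                            acc += kernel[k][m][n] * x[k][i + m][j + n]
                row.append(acc)
            fmap.append(row)
        out.append(fmap)
    return out
```

The sum over input channels and kernel offsets mirrors the weight-sharing idea: the same kernel weights are reused at every position of the feature map.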
3.2. Pooling Operation
Each convolution operation of the model is followed by a max pooling layer, using a 2 × 2 pooling kernel with a stride of 2. Assume that convolutional layer $l$ outputs, for each sample, a three-dimensional tensor $\mathbf{Y}^{(l)}$ consisting of $C_{out}^{(l)}$ two-dimensional feature maps. The elements of the pooled output are $p^{(l)}_{b,c,i,j}$, and the calculation formula is as follows:
$$p^{(l)}_{b,c,i,j} = \max_{m,n \in \{0,1\}} y^{(l)}_{b,c,\,s \cdot i+m,\,s \cdot j+n} \tag{2}$$
where $(m, n)$ represents the offset within the pooling kernel (each with a value of 0 or 1), $s$ represents the pooling stride (set to 2 in this paper), and $y^{(l)}_{b,c,\,s \cdot i+m,\,s \cdot j+n}$ represents the value of the four-dimensional batched output tensor of convolutional layer $l$ where the batch is $b$, the channel is $c$, and the position is $(s \cdot i+m, s \cdot j+n)$. $p^{(l)}_{b,c,i,j}$ represents the four-dimensional output tensor after pooling of layer $l$, where the batch is $b$, the channel is $c$, and the position is $(i, j)$.
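A minimal pure-Python sketch of the 2 × 2 max pooling with stride 2 described above is given below for a single two-dimensional feature map. The function name `max_pool2d` and the list-of-lists representation are assumptions for illustration.

```python
def max_pool2d(fmap, size=2, stride=2):
    """Max pooling over one 2-D feature map given as a list of rows."""
    height, width = len(fmap), len(fmap[0])
    out_h = (height - size) // stride + 1
    out_w = (width - size) // stride + 1
    return [
        [
            # maximum over the size x size window whose top-left corner
            # is at (stride * i, stride * j)
            max(
                fmap[stride * i + m][stride * j + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]
```

Applied to a 4 × 4 map, this yields a 2 × 2 map, i.e., both spatial dimensions are halved.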
3.3. BatchNorm2d+Dropout Regularization
To improve the generalization ability of the model and reduce the risk of overfitting, BatchNorm2d+Dropout is used for regularization in the convolutional layer operations. Specifically, BatchNorm normalizes the input data of each layer to speed up training and mitigate gradient explosion or vanishing problems. The calculation formula is shown in Equation (3):
$$\hat{x}^{(l)}_{b,c,i,j} = \gamma^{(l)}_{c}\,\frac{x^{(l)}_{b,c,i,j} - \mu^{(l)}_{c}}{\sqrt{\left(\sigma^{(l)}_{c}\right)^{2} + \epsilon}} + \beta^{(l)}_{c} \tag{3}$$
where $x^{(l)}_{b,c,i,j}$ represents the values of the input tensor in layer $l$, where $b$ denotes the sample, $c$ denotes the channel, and $(i, j)$ denotes the position; $\mu^{(l)}_{c}$ and $\sigma^{(l)}_{c}$ respectively represent the data mean and standard deviation of channel $c$ at layer $l$; $\gamma^{(l)}_{c}$ and $\beta^{(l)}_{c}$ respectively represent the learnable scaling factor and offset of channel $c$ at layer $l$; $\epsilon$ is a small constant that prevents the denominator from being 0; and $\hat{x}^{(l)}_{b,c,i,j}$ represents the output tensor value after normalization and the scaling and offset transformation.
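The per-channel normalization can be illustrated with the following sketch, which operates on a flat list holding all values of one channel across the batch and spatial positions. The function name `batch_norm_channel` and the default `eps` are assumptions; the sketch covers only the training-time batch statistics, not the running averages that frameworks maintain for inference.

```python
import math


def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize the values of one channel, then scale and shift them."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]
```

With `gamma = 1` and `beta = 0`, the output has approximately zero mean and unit variance.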
In addition, differential dropout regularization is used to break the co-adaptation between features and improve the generalization ability of the model. Specifically, lower dropout rates (0.3 and 0.3) are applied to the shallow and middle convolutional layers, while higher dropout rates (0.5 and 0.4) are applied to the deep and convergent convolutional layers, respectively. The dropout operation is given by Equation (4):
$$\tilde{x}^{(l)}_{b,c,i,j} = m^{(l)}_{b,c,i,j}\,\hat{x}^{(l)}_{b,c,i,j}, \quad m^{(l)}_{b,c,i,j} \sim \mathrm{Bernoulli}(p) \tag{4}$$
where $p$ represents the probability that the mask tensor $m^{(l)}$, which follows a Bernoulli distribution, retains the element at position $(b, c, i, j)$, and $\tilde{x}^{(l)}_{b,c,i,j}$ represents the output tensor after dropout processing.
Finally, in this study, we use two fully connected layers to integrate the features extracted from the convolutional layers and output the final classification results. To further prevent overfitting, 0.5 dropout is used to regularize the fully connected layers.
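The dropout operation can be sketched as follows. The function name `dropout`, the `rescale` flag, and the explicit random generator argument are assumptions for illustration. Note that deep learning frameworks such as PyTorch additionally rescale the retained activations by the inverse of the retention probability during training; whether the paper's formulation includes this rescaling is not stated, so it is exposed here as an option.

```python
import random


def dropout(values, drop_rate, rng, rescale=True):
    """Zero each value with probability `drop_rate` (training-time sketch)."""
    keep = 1.0 - drop_rate  # retention probability of the Bernoulli mask
    scale = 1.0 / keep if rescale else 1.0
    out = []
    for v in values:
        retained = rng.random() < keep  # one Bernoulli draw per element
        out.append(v * scale if retained else 0.0)
    return out
```

For example, `dropout(features, 0.3, random.Random(0))` would correspond to the rate used for the shallow convolutional layers, with a fixed seed for reproducibility.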
3.4. Activation Function
This model uses ReLU as the activation function after each convolutional layer and the fully connected layer. The ReLU function mitigates the vanishing gradient problem and speeds up the training process. The calculation formula is shown in Equation (5):
$$a^{(l)}_{b,c,i,j} = \max\left(0,\; x^{(l)}_{b,c,i,j}\right) \tag{5}$$
where $a^{(l)}_{b,c,i,j}$ represents the output tensor after passing through the ReLU activation function.
3.5. Loss Function for the CNN-Based Image Classification Module
The cross-entropy loss, which is widely used in classification tasks, is adopted in this paper. This loss measures the deviation between the predicted probability and the true label, which helps to achieve steady gradient descent during training. The cross-entropy loss formula is shown in Equation (6):
$$L = -\sum_{i=1}^{N} y_{i} \log\left(\hat{y}_{i}\right) \tag{6}$$
where $L$ represents the loss value, $N$ represents the number of categories of the sample, $\hat{y}_{i}$ in the surface fissure classification scenario in this paper represents the predicted category probability of the model for the surface fissure image, and $y_{i}$ represents the true category of the surface fissure.
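As a small worked example of the cross-entropy loss, the sketch below evaluates it for one image given predicted class probabilities and a one-hot true label. The function name `cross_entropy` and the small constant `eps` guarding the logarithm are assumptions for illustration.

```python
import math


def cross_entropy(probs, one_hot, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(y * math.log(p + eps) for p, y in zip(probs, one_hot))
```

For a two-class crack / no-crack problem, an undecided prediction of (0.5, 0.5) gives a loss of about ln 2, while a confident correct prediction gives a loss close to 0.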
4. UNet-YOLOv8 Crack Segmentation Module
A semantic segmentation model for crack identification was initially constructed based on the UNet architecture, yielding preliminary crack extraction results. The YOLOv8 model was subsequently employed to precisely locate crack regions. By integrating the detection boxes produced by YOLOv8 with the segmentation outcomes from UNet, the effects of confounding factors such as shadows and potholes on crack extraction were suppressed, thereby optimizing extraction performance. Specifically, the YOLOv8 model provides accurate location information for cracks, while the UNet model offers detailed shape information. By utilizing the detection boxes from YOLOv8 as prior information to constrain the segmentation results of UNet, occurrences of mis-segmentation can be effectively reduced, enhancing both the accuracy and robustness of crack extraction.
4.1. The Input of the UNet-YOLOv8 Model
The UNet architecture employed in this study consists of three components: an encoder (down-sampling path), a decoder (up-sampling path), and a skip connection mechanism. The encoder systematically extracts high-level semantic information from the input images through multiple down-sampling operations, simultaneously reducing the spatial resolution of the feature maps. This process is crucial for capturing the fundamental features of cracks at various scales. In contrast, the decoder progressively restores the spatial resolution of the feature maps via up-sampling while integrating features from the encoder stage. This integration enables the acquisition of rich detail information, which is essential for accurately delineating crack boundaries. Ultimately, the model outputs the probability of each pixel belonging to the target category (i.e., crack or non-crack) through a Sigmoid activation function. The primary structure of the UNet network as applied to the crack identification scenario is illustrated in Figure 4.
Before being input into the model, both the original images and their corresponding binarized masks are loaded in single-channel grayscale mode. After pre-processing, the images are converted into tensors and fed into the model. Let $\mathbf{X} \in \mathbb{R}^{B \times C \times H \times W}$ be the input tensor, where $B$ represents the batch size, $C$ denotes the number of channels (set to 1 for grayscale images), $H$ indicates the image height, and $W$ refers to the image width. During the pre-processing stage, all images are resized to 256 × 256 pixels.
4.2. Down-Sampling Spatial Encoding Layer
Subsequently, the input images are processed using a down-sampling spatial encoding layer, which reduces the spatial resolution of the feature maps while expanding the channel dimension to extract more abstract semantic information. Each down-sampling module consists of a max pooling layer and a double convolution module.
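Let $\mathbf{X}_{d}^{(l)}$ be the input tensor of the down-sampling module at layer $l$. The double convolution operation with 3 × 3 convolution kernels is calculated as follows:
$$\mathbf{F}^{(l)} = \delta\left(\mathrm{BN}\left(\mathbf{W}_{2}^{(l)} * \delta\left(\mathrm{BN}\left(\mathbf{W}_{1}^{(l)} * \mathbf{X}_{d}^{(l)} + \mathbf{b}_{1}^{(l)}\right)\right) + \mathbf{b}_{2}^{(l)}\right)\right) \tag{7}$$
In Equation (7), $\mathrm{BN}$ represents the BatchNorm operation, $\mathbf{W}_{1}^{(l)}$ and $\mathbf{W}_{2}^{(l)}$ respectively represent the weights of the two convolution kernels of layer $l$, $\mathbf{b}_{1}^{(l)}$ and $\mathbf{b}_{2}^{(l)}$ respectively represent the biases of the two convolution operations of layer $l$, $*$ denotes convolution, $\delta$ represents the ReLU activation function, and $\mathbf{F}^{(l)}$ represents the intermediate feature tensor obtained after the double convolution operation.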
The model uses 3 × 3 convolution kernels to ensure adequate feature extraction, with a stride of 1 and a padding of 1. The output feature map size is calculated as follows:
$$H_{out} = \left\lfloor \frac{H_{in} + 2P - K_{h}}{S} \right\rfloor + 1 \tag{8}$$
$$W_{out} = \left\lfloor \frac{W_{in} + 2P - K_{w}}{S} \right\rfloor + 1 \tag{9}$$
In Equation (8), $H_{in}$ represents the height of the input feature map, $P$ is the number of padding layers, $K_{h}$ is the height of the convolution kernel, $S$ is the stride, and $H_{out}$ is the height of the output feature map. In Equation (9), $W_{in}$ represents the width of the input feature map, $K_{w}$ is the width of the convolution kernel, and $W_{out}$ is the width of the output feature map, where $P$ and $S$ are consistent with those in Equation (8).
Substituting $S$ = 1, $P$ = 1, and $K_{h}$ = $K_{w}$ = 3 gives $H_{out}$ = $H_{in}$ and $W_{out}$ = $W_{in}$. The double convolution with 3 × 3 kernels therefore leaves the spatial size of the feature map unchanged while enhancing its channel expression ability.
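The output-size arithmetic can be checked with a short sketch. The function name `conv_output_size` is an assumption; the formula is the standard one for convolution and pooling layers with integer floor division.

```python
def conv_output_size(size_in, kernel, stride=1, padding=0):
    """Output height or width of a convolution or pooling layer."""
    return (size_in + 2 * padding - kernel) // stride + 1
```

With a 3 × 3 kernel, stride 1, and padding 1, a 256-pixel side stays at 256 pixels, whereas a 2 × 2 pooling kernel with stride 2 and no padding reduces it to 128 pixels.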
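After the double convolution module, spatial down-sampling is performed with a 2 × 2 max pooling kernel, and the calculation formula is as follows:
$$\mathbf{P}^{(l)} = \mathrm{MaxPool}_{2 \times 2}\left(\mathbf{F}^{(l)}\right) \tag{10}$$
In Equation (10), $\mathrm{MaxPool}_{2 \times 2}$ represents the max pooling operation applied to the double convolution output $\mathbf{F}^{(l)}$, and $\mathbf{P}^{(l)}$ represents the tensor output after one max pooling operation.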
A double convolution operation and a max pooling operation constitute a complete down-sampling layer. The input feature tensor undergoes four consecutive down-sampling operations, each of which doubles the number of channels and halves the height and width of the feature map. The output tensor then enters the decoder section. Reducing the spatial resolution while increasing the number of channels at each encoder stage enables the model to capture more abstract and complex features in the image, which lays the foundation for the subsequent up-sampling operations and supports the precise segmentation of cracks.
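To make the effect of the four down-sampling layers concrete, the sketch below traces the feature map shape through the encoder. The function name `encoder_shapes` and, in particular, the starting channel count of 64 are hypothetical; the paper does not state the channel width of the first encoder stage.

```python
def encoder_shapes(channels, size, depth=4):
    """Trace (channels, height, width) through `depth` down-sampling layers."""
    shapes = [(channels, size, size)]
    for _ in range(depth):
        channels *= 2  # each down-sampling layer doubles the channel count
        size //= 2     # 2 x 2 max pooling with stride 2 halves height and width
        shapes.append((channels, size, size))
    return shapes
```

Under these assumptions, a 256 × 256 input with 64 channels would reach the decoder as a 16 × 16 feature map with 1024 channels.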
4.3. Up-Sampling Spatial Decoding Layer
The up-sampling spatial decoder is employed to gradually restore the spatial resolution of the feature maps while integrating the features collected during the encoder phase to obtain richer detail information. Each up-sampling module consists of a transpose convolution operation followed by a dual-convolution module. Through skip connections, the decoder can merge features from the encoder phase, allowing it to retain more detailed information while recovering the spatial resolution.
Let $\mathbf{D}^{(l)} \in \mathbb{R}^{B \times C_{l} \times H_{l} \times W_{l}}$ be the input tensor of the up-sampling module at layer $l$, where $B$ represents the batch size, and $C_{l}$, $H_{l}$, and $W_{l}$ represent the number of channels, height, and width of the feature map of that layer, respectively.
The transposed convolution operation expands the spatial dimension of the feature map, as shown below:
$$\mathbf{U}^{(l)} = \mathrm{ConvTranspose}_{2 \times 2}\left(\mathbf{D}^{(l)}\right) \tag{11}$$
In Equation (11), $\mathbf{U}^{(l)}$ represents the feature tensor obtained from the transposed convolution of layer $l$, with the kernel size set to 2 × 2 and the stride set to 2. After transposed convolution, the number of channels is halved and the spatial dimension is doubled.
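Then, the up-sampled feature tensor $\mathbf{U}^{(l)}$ is concatenated with the corresponding feature tensor $\mathbf{F}^{(l)}$ of the encoder stage along the channel dimension, as shown below:
$$\mathbf{M}^{(l)} = \mathrm{Concat}\left(\mathbf{U}^{(l)},\; \mathbf{F}^{(l)}\right) \tag{12}$$
In Equation (12), $\mathbf{M}^{(l)}$ represents the output tensor after the skip connection fusion of layer $l$.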
A double convolution operation is then performed on the merged feature tensor $\mathbf{M}^{(l)}$. Consistent with the down-sampling stage, 3 × 3 convolution kernels are used to ensure sufficient feature extraction, with a stride of 1 to maintain the feature map size while halving the number of channels. Combined with batch normalization and the ReLU activation function, this completes a full up-sampling process. After four consecutive up-sampling operations, the resulting tensor $\mathbf{D}_{out}$ is the output of the up-sampling module.
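The size and channel bookkeeping of one up-sampling step can be sketched as follows. The function names are hypothetical, and the size formula is the standard one for transposed convolution without dilation. Under this formula, a 2 × 2 kernel doubles the spatial size when the stride is 2, whereas a stride of 1 would only add one pixel. The sketch also assumes that the encoder feature map joined through the skip connection has the same channel count as the up-sampled tensor.

```python
def transposed_conv_output_size(size_in, kernel=2, stride=2,
                                padding=0, output_padding=0):
    """Output height or width of a transposed convolution (dilation 1)."""
    return (size_in - 1) * stride - 2 * padding + kernel + output_padding


def up_block_channels(channels_in):
    """Channel counts after up-sampling, concatenation, and double convolution."""
    upsampled = channels_in // 2    # transposed convolution halves the channels
    merged = upsampled + upsampled  # skip connection concatenates encoder features
    refined = merged // 2           # double convolution halves the channels again
    return upsampled, merged, refined
```

For instance, a 16 × 16 feature map becomes 32 × 32 after one step, and four such steps restore a 256 × 256 map.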
4.4. Output of the UNet Module
The output layer includes a convolution operation that uses a 1 × 1 convolution kernel to reduce the number of channels in the feature map to the number of target categories (1 in this article). The Sigmoid activation function is then applied to map the output values to the range [0, 1], representing the probability that each pixel belongs to the target category. The formula is as follows:
$$\hat{\mathbf{Y}} = \mathrm{Sigmoid}\left(\mathrm{Conv}_{1 \times 1}\left(\mathbf{D}_{out}\right)\right) \tag{13}$$
In Equation (13), $\mathbf{D}_{out}$ is the output tensor of the up-sampling module, and $\hat{\mathbf{Y}}$ is the probability value tensor of each pixel in the feature map.
Through the above steps, the model can effectively extract the crack feature and output the probability value that each pixel belongs to the target category, thereby achieving a probabilistic representation of the crack feature. This design enables the model to better understand the content of the image, providing strong support for precise crack segmentation.
4.5. Loss Function for the UNet-YOLOv8 Crack Segmentation Module
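For the crack segmentation task, we adopt a composite loss function consisting of Binary Cross-Entropy (BCE) loss and Dice loss. BCE measures the discrepancy between predicted and actual pixel labels, as shown in Equation (14), while Dice loss emphasizes the overlap between predicted and ground truth regions, as shown in Equation (15), which is especially important for handling the class imbalance between crack and background pixels. The combined loss is defined in Equation (16).
$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log \hat{y}_{i} + \left(1 - y_{i}\right) \log\left(1 - \hat{y}_{i}\right) \right] \tag{14}$$
$$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_{i} \hat{y}_{i} + \epsilon}{\sum_{i=1}^{N} y_{i} + \sum_{i=1}^{N} \hat{y}_{i} + \epsilon} \tag{15}$$
$$L = \lambda L_{BCE} + \left(1 - \lambda\right) L_{Dice} \tag{16}$$
where $N$ is the number of pixels, $y_{i}$ and $\hat{y}_{i}$ are the true label and predicted probability of pixel $i$, $\epsilon$ represents a very small value, and $\lambda \in$ (0, 1) is the balance weight.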
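A minimal sketch of the composite loss on flattened pixel lists is given below. The function names, the clamping constant, and the specific weighting form (the balance weight on BCE and its complement on Dice) are assumptions for illustration; the paper states only that a balance weight in (0, 1) combines the two terms.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over pixels; predictions are clamped."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


def dice_loss(pred, target, eps=1e-6):
    """One minus the Dice overlap between prediction and ground truth."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def combined_loss(pred, target, weight=0.5):
    """Weighted sum of BCE and Dice (weighting form assumed)."""
    return weight * bce_loss(pred, target) + (1.0 - weight) * dice_loss(pred, target)
```

Because the Dice term depends on the overlap ratio rather than on the pixel count, it remains informative even when crack pixels are a small fraction of the image.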
4.6. YOLOv8-Based Boolean-Type Identification Optimization Matrix
Considering that interference information, such as shadows that resemble cracks, can arise during crack feature extraction and lead to mis-identification of similar features as cracks, thereby affecting engineering judgments, this study introduces a Boolean-type identification matrix optimization algorithm based on YOLOv8.
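This method first employs the YOLOv8 object detection algorithm to obtain the relative position vector $(x_{c}, y_{c}, w, h)$ of cracks in the input feature map, where $(x_{c}, y_{c})$ represents the coordinates of the bounding box center, and $w$ and $h$ denote the relative width and height of the bounding box.
Based on the bounding box information $(x_{c}, y_{c}, w, h)$, the absolute positions of the four vertices of the crack feature bounding box can be determined, with the calculation formulas being as follows:
$$x_{min} = \left(x_{c} - \frac{w}{2}\right) W_{img} \tag{17}$$
$$x_{max} = \left(x_{c} + \frac{w}{2}\right) W_{img} \tag{18}$$
$$y_{min} = \left(y_{c} - \frac{h}{2}\right) H_{img} \tag{19}$$
$$y_{max} = \left(y_{c} + \frac{h}{2}\right) H_{img} \tag{20}$$
In the above equations, $W_{img}$ is the width of the original picture (usually 256), $H_{img}$ is the height of the original picture (usually 256), $x_{min}$ and $x_{max}$ are the left and right boundaries of the bounding box, and $y_{min}$ and $y_{max}$ are the top and bottom boundaries of the bounding box, respectively.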
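The conversion from a relative bounding box to absolute pixel boundaries can be sketched as follows. The function name `box_to_pixels`, the rounding to integers, and the clamping to the image extent are assumptions added for illustration.

```python
def box_to_pixels(xc, yc, w, h, img_w=256, img_h=256):
    """Convert a relative (center x, center y, width, height) box to pixels.

    Returns (x_min, y_min, x_max, y_max), clamped to the image extent.
    """
    x_min = int(round((xc - w / 2) * img_w))
    x_max = int(round((xc + w / 2) * img_w))
    y_min = int(round((yc - h / 2) * img_h))
    y_max = int(round((yc + h / 2) * img_h))
    # keep the box inside the image
    x_min, x_max = max(0, x_min), min(img_w, x_max)
    y_min, y_max = max(0, y_min), min(img_h, y_max)
    return x_min, y_min, x_max, y_max
```

For a box centered in a 256 × 256 image with relative width 0.5 and relative height 0.25, the left and right boundaries fall at columns 64 and 192 and the top and bottom boundaries at rows 96 and 160.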
As a result, a Boolean identification matrix (MaskMap) with a size of $H_{img}$ × $W_{img}$ and initialized to False can be established to model the presence or absence of crack features in each region of the feature map. The location information $(x_{min}, x_{max}, y_{min}, y_{max})$ indicates that crack features exist in this range, so the feature points of the MaskMap within this range are marked as True. By inputting the information of multiple bounding boxes, the MaskMap can represent all regions in which cracks exist.
The mask map $\mathbf{S}$ of the crack extraction results obtained through the UNet model is binarized to obtain the feature matrix $\mathbf{B}$ consisting of 0 and 1. At this time, the MaskMap and $\mathbf{B}$ are of the same size, and the data at corresponding positions are multiplied together to form the optimized mask image (OptimizedMask):
$$\mathrm{OptimizedMask}_{i,j} = \mathrm{MaskMap}_{i,j} \cdot \mathbf{B}_{i,j} \tag{21}$$
In Equation (21), $i$ and $j$ are the row and column positions of the data in the feature matrix, respectively. Taking a 5 × 5 pixel crack feature map as an example, the principle by which the optimized mask image is formed is shown in Figure 5.
The pixels marked in red in the binarized feature matrix $\mathbf{B}$ are the initially extracted pixels (1 is the crack feature and 0 is the background feature). The red box in the MaskMap encloses the crack feature area, and everything outside it is the background feature area. Superimposing the two retains only the crack features within the crack feature area, so the optimized mask image constrains the extraction range to within the labeled box.
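The construction of the Boolean matrix and the element-wise optimization can be sketched as follows. The function names `build_mask_map` and `optimize_mask`, the half-open treatment of box boundaries, and the 5 × 5 values used in the usage note are hypothetical and are not taken from Figure 5.

```python
def build_mask_map(boxes, height, width):
    """Boolean matrix that is True inside any (x_min, y_min, x_max, y_max) box."""
    mask_map = [[False] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        # boundaries are treated as half-open: [min, max)
        for i in range(max(0, y_min), min(height, y_max)):
            for j in range(max(0, x_min), min(width, x_max)):
                mask_map[i][j] = True
    return mask_map


def optimize_mask(segmentation, mask_map):
    """Keep segmentation values inside the boxes and zero them elsewhere."""
    return [
        [value if keep else 0 for value, keep in zip(seg_row, mask_row)]
        for seg_row, mask_row in zip(segmentation, mask_map)
    ]
```

As a usage example, a stray pixel in the top-right corner of a 5 × 5 binary mask is removed when the only detection box covers rows and columns 1 to 3. The same `optimize_mask` function can be applied to a probability map instead of a binary mask, since values inside the boxes are passed through unchanged.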
Pixel-level crack detection is performed on the input image using the UNet model to obtain the probability distribution map of crack pixels, in which the value of each pixel indicates the probability that the point belongs to a crack. Subsequently, the predicted probability map of the UNet model is multiplied pixel by pixel with the mask generated from the YOLOv8 bounding boxes (MaskMap) to optimize the probability values. Within the YOLOv8 detection range, the probability value remains unchanged, while outside the range, it is set to 0. This process removes possible false detection areas of the UNet model while retaining the complete structural information of the cracks, which provides more accurate and reliable data for the subsequent generation of the heat map. The optimized heat map effectively eliminates the interference of crack-like features (e.g., shadows), which is useful for engineering crack extraction.