A Multi-Deep Learning Intelligent Surface Rock Crack Identification Method for Transmission Tower Siting

Tang, Xiaoxian; Liu, Xin; Liu, Yuhai; Zhao, Bowen; Xie, Peng; Zhao, Jianwen; Gao, Xingqiang; Zhang, Ran

doi:10.3390/electronics14112255

by Xiaoxian Tang ¹, Xin Liu ¹, Yuhai Liu ¹, Bowen Zhao ², Peng Xie ¹, Jianwen Zhao ¹, Xingqiang Gao ¹ and Ran Zhang ²,*

¹ State Grid Shandong Electric Power Company Construction Company, Jinan 250000, China

² Department of Engineering Software, School of Civil Engineering, Shandong University, Jinan 250000, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(11), 2255; https://doi.org/10.3390/electronics14112255

Submission received: 19 May 2025 / Accepted: 25 May 2025 / Published: 31 May 2025

Abstract

Accurate identification of surface cracks is of great significance for the site selection of transmission towers, as it directly affects the safety and stability of power grid construction. Traditional manual inspection methods are labor-intensive and inefficient, making them inadequate for large-scale and high-precision applications. Consequently, intelligent crack recognition technologies are receiving increasing attention and adoption. This study proposes a novel surface crack identification method that integrates multiple neural networks, aiming to overcome the limitations of traditional crack identification approaches, such as low accuracy and poor generalization. The proposed framework incorporates a convolutional neural network (CNN)-based data filtering module and a crack segmentation module combining UNet and YOLOv8 architectures. Together, these components form a robust and accurate end-to-end crack identification system, which is further applied to the evaluation of rock fragmentation. To verify the effectiveness and accuracy of the proposed method, experiments were conducted on a rock crack dataset and compared against several existing approaches. The results demonstrate that the proposed method achieves superior performance in crack detection accuracy. Moreover, tests on various scenario datasets also yielded promising results, indicating the strong generalization ability and adaptability of the method.

Keywords: crack identification; multi-deep learning; CNN; UNet-YOLO; power grid siting

1. Introduction

With the rapid development of China’s economy, the electric power industry has ushered in unprecedented opportunities. The large-scale construction of power facilities has expanded the coverage of transmission lines, while geological and environmental conditions in different regions directly affect their safety and reliability [1]. As shown in Figure 1, geologic hazards (e.g., avalanches, landslides, mudslides, and ground subsidence) pose serious threats to the stable operation of transmission lines, potentially leading to line interruptions, equipment damage, and even casualties [2,3]. Therefore, detailed geological surveys are essential before construction to assess disaster risks and enhance resilience.

According to the Code for Geologic Hazard Risk Assessment (GB/T 40112-2021) [4], surface cracks should be treated seriously in the assessment of geologic hazards. The presence of surface cracks may not only trigger uneven settlement of foundations but also be a precursor of larger-scale geohazards (e.g., landslides or ground collapse) [5]. Cracks on the rock surface can induce weathering and fragmentation due to moisture, chemical agents, plant root infiltration, or human activities. Under heavy loads, the rock may break along these cracks, potentially compromising the safety of power transmission towers. Therefore, accurate identification of surface cracks is of great practical significance for the siting and construction of transmission lines. However, traditional surface crack identification methods mainly rely on manual investigation and simple image processing techniques, such as impact echo methods [6], radar methods [7], infrared thermography methods [8], etc. The aforementioned methods are inefficient and subjective, making it difficult to satisfy large-scale and high-precision engineering demands. Especially in complex surface environments, manual recognition is easily affected by lighting, texture, and background interference, resulting in low recognition accuracy and poor generalization ability [9,10,11]. Therefore, it is of great practical significance to develop an efficient, accurate, and adaptable crack recognition method.

In recent years, surface crack identification has attracted the attention of many researchers. The most common approach applies digital image processing to the collected crack images to determine the location of the cracks. In 2003, Abdel-Qader et al. [12] applied image segmentation to crack recognition, and their results confirmed that it performs well for this task. Subsequently, Ayenu-Prah et al. [13] and Wei et al. [14] added image denoising and image enhancement as preprocessing steps to improve crack recognition accuracy. To cope with interference from non-crack targets and low robustness, Han et al. [15] further reduced the effect of noise using Gaussian spatial filtering and the top-hat transform. These methods can achieve relatively good crack recognition performance with a small data volume; however, they require a complex feature engineering process, which limits their practical application.

The rapid development of artificial intelligence technology provides new solutions for surface crack recognition. Different from the above traditional image processing methods, deep learning methods automatically capture the regularities in pictures by using neural networks to solve the problems related to feature extraction [16,17,18]. A deep learning-based feature extractor can extract the regularities, such as crack edges, width, position, and brightness, hidden in the picture. Liu et al. [19] used a deep learning algorithm to automatically identify tunnel lining cracks. Fan et al. [20] proposed a supervised pavement crack detection method based on convolutional neural networks and solved the data imbalance problem by adjusting the ratio of positive and negative samples. Liu et al. [21] used the UNet model to identify cracks on concrete surfaces. Cao et al. [22] incorporated an attention mechanism into the encoder–decoder neural network structure to recognize pavement cracks. Al-Huda et al. [23] proposed a hybrid deep learning approach for crack localization and crack recognition on pavement crack images.

Despite the significant progress in existing research, the following shortcomings still exist: (1) traditional methods rely on manual feature extraction, making it difficult to adapt to complex and changing real-world scenarios; (2) existing deep learning models perform poorly when dealing with tiny cracks and complex backgrounds; and (3) most of the methods lack an end-to-end solution, which leads to low recognition efficiency. These problems limit the application and promotion of crack recognition technology in practical engineering.

To obtain an efficient, accurate, and highly generalizable surface crack identification algorithm, this paper proposes an intelligent surface crack identification method that integrates multiple neural networks, referred to as the MultiNet crack identification method. First, a crack classification strategy based on a convolutional neural network (CNN) is designed to filter out images that do not contain cracks, thereby improving the efficiency of subsequent crack identification. Subsequently, a crack identification algorithm based on the UNet-YOLO architecture is constructed, in which the UNet module performs the preliminary segmentation of crack images and the YOLO module optimizes the identification results. This combination significantly reduces the interference of non-crack information, such as shadows and potholes, in the recognition outcomes. To validate the accuracy of the proposed algorithm, experiments are conducted on a surface rock crack dataset, and the results indicate that crack information can be accurately identified and extracted. Rock crack identification is a crucial basis for assessing the degree of surface fragmentation in rocks. Based on the fragmentation level, it is possible to preliminarily infer whether the geological conditions in an area are suitable for the construction of transmission towers. Generally, a high density of cracks and significant fragmentation in rocks can negatively impact the stability of transmission towers, making them susceptible to collapse due to unstable foundations. Additionally, to assess the generalization capability of the proposed algorithm, it is validated on other crack identification scenarios, confirming that it achieves good crack identification performance across different scenarios.

Therefore, the research in this paper not only provides reliable technical support for transmission station siting but also promotes the development of intelligent crack identification and extraction technology, which has important theoretical value and engineering application significance.

2. Framework of the MultiNet Crack Identification Method

This section primarily presents the framework of the crack identification algorithm based on multiple deep learning techniques, as illustrated in Figure 2. The multi-deep learning framework for crack identification consists mainly of a crack image classification module based on the CNN, a crack segmentation module based on UNet, and an optimization module based on YOLOv8. The CNN-based image classification module is employed to perform preliminary filtering of images using a crack classification algorithm, facilitating the selection of images that exhibit crack features and enhancing the efficiency of subsequent crack identification. The UNet-based crack segmentation module utilizes an image segmentation model to achieve a probabilistic representation of crack features, thereby effectively extracting these features. To address the interference from factors such as shadows and potholes during the crack identification process, an optimization matrix for crack recognition based on YOLO was designed, which significantly reduces the influence of non-crack factors.

3. CNN-Based Image Classification Module

In order to effectively filter out images without crack features, a lightweight CNN-based image classification module was designed (Figure 3). The model contains four convolutional layers, four pooling layers, and two fully connected layers. It adopts a series of regularization methods to prevent overfitting and uses ReLU as the activation function.

3.1. Convolution Operation

Let $\chi \in \mathbb{R}^{3 \times m \times n}$ be the three-dimensional pixel tensor of the input RGB surface image, where $\chi = \{x_{k,i,j} \mid k \in [1,3],\ i \in [1,m],\ j \in [1,n]\}$. The core part of the designed CNN-based classification model is the convolutional layer, which gradually extracts the multi-level spatial features of the input image by means of local receptive fields and a weight-sharing mechanism. Let $\chi^{(l)} = \{x_{k,i,j}^{(l)} \mid k \in [1, C_{in}^{(l)}],\ i \in [1, H^{(l)}],\ j \in [1, W^{(l)}]\}$ be the input tensor of convolutional layer $l$, where $l \in [1,4]$ is the index of the convolutional layer, $C_{in}^{(l)}$ denotes the number of input channels at layer $l$, and $H^{(l)}$ and $W^{(l)}$ denote the height and width of the input of layer $l$, respectively. $x_{k,i,j}^{(l)} \in \mathbb{R}$ represents the value of the input tensor at position $(i,j)$ of channel $k$ of layer $l$. When $l = 1$, $\chi^{(l)} = \chi$ and $x_{k,i,j}^{(l)} = x_{k,i,j}$.

Let $\mathbf{W}^{(l)} \in \mathbb{R}^{C_{out}^{(l)} \times C_{in}^{(l)} \times K \times K}$ be the convolution kernel (written in bold to distinguish it from the width $W^{(l)}$), where $C_{out}^{(l)}$ is the number of output channels of layer $l$, $K$ is the convolution kernel size, and $\mathbf{W}^{(l)} = \{w_{c,k,u,v}^{(l)} \mid c \in [1, C_{out}^{(l)}],\ k \in [1, C_{in}^{(l)}],\ u \in [1,K],\ v \in [1,K]\}$. Here, $w_{c,k,u,v}^{(l)} \in \mathbb{R}$ represents the weight that the convolution kernel of output channel $c$ of layer $l$ applies to input channel $k$ at position $(u,v)$. The corresponding bias vector is $b^{(l)} = \{b_{c}^{(l)} \mid c \in [1, C_{out}^{(l)}]\}$.

The convolution operation is given by Equation (1):

$$y_{c,i,j}^{(l)} = \sum_{k=1}^{C_{in}^{(l)}} \sum_{u=1}^{K} \sum_{v=1}^{K} w_{c,k,u,v}^{(l)} \cdot x_{k,\,i+u-1,\,j+v-1}^{(l)} + b_{c}^{(l)} \tag{1}$$

In Equation (1), $(i,j)$ represents the position of a value in the output feature matrix, and $x_{k,\,i+u-1,\,j+v-1}^{(l)}$ represents the value of channel $k$ of the input tensor of layer $l$ at position $(i+u-1,\ j+v-1)$. The number of channels in the convolutional layers starts from the three RGB channels of the input image, increases layer by layer to 48, 96, and 192, and finally decreases to 96 channels in the convergent convolutional layer. This design enhances the model's ability to capture the local details and overall structural features of the cracks. In this paper, the convolution operations use small 3 × 3 convolution kernels, which can effectively capture the local details of the crack.
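
To make the index arithmetic of Equation (1) concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes 0-based indices, no padding, and a stride of 1; the function name `conv2d` and the toy input are hypothetical.

```python
def conv2d(x, w, b):
    """Direct evaluation of Equation (1) with 0-based indices and no padding.

    x: input tensor as nested lists, indexed x[k][i][j]
    w: kernels, indexed w[c][k][u][v]
    b: biases, indexed b[c]
    """
    c_in = len(x)
    k_size = len(w[0][0])
    h_out = len(x[0]) - k_size + 1
    w_out = len(x[0][0]) - k_size + 1
    y = []
    for c in range(len(w)):
        # Start every output position at the bias of output channel c.
        fmap = [[b[c]] * w_out for _ in range(h_out)]
        for i in range(h_out):
            for j in range(w_out):
                for k in range(c_in):
                    for u in range(k_size):
                        for v in range(k_size):
                            fmap[i][j] += w[c][k][u][v] * x[k][i + u][j + v]
        y.append(fmap)
    return y


# Toy example: one 3x3 input channel, one 2x2 kernel of ones, bias 0.5.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
w = [[[[1, 1], [1, 1]]]]
b = [0.5]
y = conv2d(x, w, b)
print(y)  # [[[12.5, 16.5], [24.5, 28.5]]]
```

Each output value is the sum of a 2 × 2 window plus the bias, for example 1 + 2 + 4 + 5 + 0.5 = 12.5 at the top-left position.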

3.2. Pooling Operation

Each complete convolution operation of the model is followed by a max pooling layer that uses a 2 × 2 pooling kernel with a step size of 2. Assume that convolutional layer $l$ outputs a three-dimensional tensor $Y^{(l)} \in \mathbb{R}^{C_{out}^{(l)} \times H_{out}^{(l)} \times W_{out}^{(l)}}$ consisting of $C_{out}^{(l)}$ two-dimensional feature maps, with elements $Y^{(l)} = \{y_{c,i,j}^{(l)} \mid c \in [1, C_{out}^{(l)}],\ i \in [1, H_{out}^{(l)}],\ j \in [1, W_{out}^{(l)}]\}$. For a batch of samples indexed by $b$, the pooling operation is calculated as follows:

$$\mathrm{pool}_{b,c,i,j}^{(l)} = \max_{0 \leq m,n \leq 1} y_{b,c,\,si+m,\,sj+n}^{(l)} \tag{2}$$

where $m$ and $n$ represent the offsets within the pooling kernel (each with a value of 0 or 1), $s$ represents the pooling step size (set to 2 in this paper), and $y_{b,c,\,si+m,\,sj+n}^{(l)}$ represents the value of the four-dimensional output tensor of convolutional layer $l$ for batch index $b$, channel $c$, and position $(si+m,\ sj+n)$. $\mathrm{pool}_{b,c,i,j}^{(l)}$ represents the value of the four-dimensional output tensor after pooling at layer $l$ for batch index $b$, channel $c$, and position $(i,j)$.
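
Equation (2) can be illustrated on a single feature map with a short pure-Python sketch. The function name `max_pool_2x2` and the sample values are hypothetical, and 0-based indices are assumed.

```python
def max_pool_2x2(y, s=2):
    """Equation (2) for one 2D feature map y[i][j].

    The offsets m and n range over {0, 1} (a 2x2 kernel) and s is the step size.
    """
    h_out = (len(y) - 2) // s + 1
    w_out = (len(y[0]) - 2) // s + 1
    return [
        [max(y[s * i + m][s * j + n] for m in (0, 1) for n in (0, 1))
         for j in range(w_out)]
        for i in range(h_out)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [7, 6, 3, 2],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[4, 5], [7, 9]]
```

With a step size of 2 the windows do not overlap, so a 4 × 4 map is reduced to 2 × 2.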

3.3. BatchNorm2d+ Dropout Regularization

To improve the generalization ability of the model and reduce the risk of overfitting, BatchNorm2d+Dropout is used for regularization in the convolutional layer operations. Specifically, BatchNorm normalizes the input data of each layer to speed up training and avoid exploding or vanishing gradients. The calculation is given by Equation (3):

$$\hat{y}_{b,c,i,j}^{(l)} = \gamma_{c}^{(l)} \cdot \frac{y_{b,c,i,j}^{(l)} - \mu_{c}^{(l)}}{\sqrt{(\sigma_{c}^{(l)})^{2} + \epsilon}} + \beta_{c}^{(l)} \tag{3}$$

where $y_{b,c,i,j}^{(l)}$ represents a value of the input tensor at layer $l$, with $b$ denoting the sample, $c$ the channel, and $(i,j)$ the position; $\mu_{c}^{(l)}$ and $\sigma_{c}^{(l)}$ respectively represent the mean and standard deviation of the data of channel $c$ at layer $l$; $\gamma_{c}^{(l)}$ and $\beta_{c}^{(l)}$ respectively represent the learnable scaling factor and offset of channel $c$ at layer $l$; $\epsilon$ is a small constant that prevents the denominator from being 0; and $\hat{y}_{b,c,i,j}^{(l)}$ represents the output tensor value after the normalization transformation.
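
As a hedged illustration of Equation (3), the sketch below normalizes the values of a single channel in pure Python. It assumes the mean and the (biased) variance are computed over all values of that channel in the batch, as in training mode; the function name `batch_norm_channel` and the default parameter values are hypothetical.

```python
import math


def batch_norm_channel(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Apply Equation (3) to all values belonging to one channel."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)  # sigma squared
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in values]


normalized = batch_norm_channel([1.0, 2.0, 3.0, 4.0])
shifted = batch_norm_channel([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=1.0)
print(normalized)
print(shifted)
```

With the default scale and offset, the output has approximately zero mean and unit variance; `gamma` and `beta` then rescale and shift it.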

In addition, differential dropout regularization is used to break the co-adaptation between features and improve the generalization ability of the model. Specifically, lower dropout rates (0.3, 0.3) are applied to the shallow and middle convolutional layers, while higher dropout rates (0.5, 0.4) are applied to the deep and convergent convolutional layers. The dropout operation is given by Equation (4):

$$\tilde{y}_{b,c,i,j}^{(l)} = p_{b,c,i,j}^{(l)} \cdot \hat{y}_{b,c,i,j}^{(l)} \tag{4}$$

where $p_{b,c,i,j}^{(l)} \in \{0, 1\}$ is the element of a mask tensor following a Bernoulli distribution, indicating whether the value at position $(i,j)$ is retained, and $\tilde{y}_{b,c,i,j}^{(l)}$ represents the output tensor after dropout processing.
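
The masking in Equation (4) can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's code: the function name `dropout` is hypothetical, and the sketch applies only the mask shown in Equation (4), whereas frameworks such as PyTorch additionally rescale the retained values by 1/(1 - rate) during training.

```python
import random


def dropout(values, rate, rng):
    """Multiply each activation by a Bernoulli keep-mask, as in Equation (4).

    rate is the probability of dropping a value; rng is a random.Random.
    """
    mask = [1 if rng.random() >= rate else 0 for _ in values]
    return [m * v for m, v in zip(mask, values)], mask


rng = random.Random(0)
activations = [0.7, 1.2, 0.4, 2.5]
kept_all, _ = dropout(activations, 0.0, rng)      # nothing is dropped
dropped_all, _ = dropout(activations, 1.0, rng)   # everything is dropped
partial, mask = dropout(activations, 0.3, rng)    # shallow-layer rate from the text
print(mask, partial)
```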

Finally, in this study, we use two fully connected layers to integrate the features extracted from the convolutional layers and output the final classification results. To further prevent overfitting, 0.5 dropout is used to regularize the fully connected layers.

3.4. Activation Function

This model uses ReLU as the activation function after each convolutional layer and the fully connected layer. The ReLU function effectively avoids the vanishing gradient problem and speeds up the training process. The calculation is given by Equation (5):

$$z_{b,c,i,j}^{(l)} = \max\left(0,\ \tilde{y}_{b,c,i,j}^{(l)}\right) \tag{5}$$

where $z_{b,c,i,j}^{(l)}$ represents the output tensor after passing through the ReLU activation function.

3.5. Loss Function for the CNN-Based Image Classification Module

The cross-entropy loss, which is widely used in classification tasks, is adopted in this paper. This loss measures the deviation between the predicted probability and the true label, which helps to achieve steady gradient descent during training. The cross-entropy loss is given by Equation (6):

$$L = -\sum_{i=1}^{N} \left[ y_{i} \ln \hat{y}_{i} + (1 - y_{i}) \ln (1 - \hat{y}_{i}) \right] \tag{6}$$

where $L$ represents the loss value; $N$ represents the number of categories of the sample, with $N = 2$ in the surface fissure classification scenario in this paper; $\hat{y}_{i} \in [0, 1]$ represents the model's predicted probability for the surface fissure image; and $y_{i} \in \{0, 1\}$ represents the true category of the surface fissure.
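
A small numerical sketch of Equation (6) follows. It is illustrative only: the function name `cross_entropy` is hypothetical, and the clipping constant `eps` is an added assumption that keeps the logarithm finite when a predicted probability is exactly 0 or 1.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Sum of binary cross-entropy terms, following Equation (6)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


confident = cross_entropy([1, 0], [0.9, 0.1])
uncertain = cross_entropy([1, 0], [0.5, 0.5])
print(confident, uncertain)
```

The loss grows as the predicted probabilities move away from the true labels, so the confident and correct prediction scores lower than the uncertain one.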

4. UNet-YOLOv8 Crack Segmentation Module

A semantic segmentation model for crack identification was first constructed based on the UNet architecture, yielding preliminary crack extraction results. The YOLOv8 model was subsequently employed to precisely locate crack regions. By integrating the detection boxes produced by YOLOv8 with the segmentation outcomes from UNet, the effects of confounding factors such as shadows and potholes on crack extraction were suppressed, thereby optimizing extraction performance. Specifically, the YOLOv8 model provides accurate location information for cracks, while the UNet model offers detailed shape information. By using the detection boxes from YOLOv8 as prior information to constrain the segmentation results of UNet, occurrences of mis-segmentation can be effectively reduced, enhancing both the accuracy and robustness of crack extraction.

4.1. The Input of the UNet-YOLOv8 Model

The UNet architecture employed in this study consists of three components: an encoder (down-sampling path), a decoder (up-sampling path), and a skip connection mechanism. The encoder systematically extracts high-level semantic information from the input images through multiple down-sampling operations, simultaneously reducing the spatial resolution of the feature maps. This process is crucial for capturing the fundamental features of cracks at various scales. In contrast, the decoder progressively restores the spatial resolution of the feature maps via up-sampling while integrating features from the encoder stage. This integration enables the acquisition of rich detail information, which is essential for accurately delineating crack boundaries. Ultimately, the model outputs the probability values for each pixel belonging to the target category (i.e., crack or non-crack), achieved through the application of a Sigmoid activation function. The primary structure of the UNet network as applied to the crack identification scenario is illustrated in Figure 4.

Before being input into the model, both the original images and their corresponding binarized masks are loaded in single-channel grayscale mode. After pre-processing, these images are converted into tensors and input into the model. Let $I \in \mathbb{R}^{N \times C \times H \times W}$ be the input tensor, where $N$ represents the batch size, $C$ denotes the number of channels (set to 1 for grayscale images), $H$ indicates the image height, and $W$ refers to the image width. During the pre-processing stage, all images are standardized to a size of 256 × 256 pixels.

4.2. Down-Sampling Spatial Encoding Layer

Subsequently, the input images are processed using a down-sampling spatial encoding layer, which reduces the spatial resolution of the feature maps while expanding the channel dimension to extract more abstract semantic information. Each down-sampling module consists of a max pooling layer and a double convolution module.

Let $U^{(l)} \in \mathbb{R}^{N \times C^{(l)} \times H^{(l)} \times W^{(l)}}$ be the input tensor of the down-sampling module at layer $l$. The double convolution operation with 3 × 3 convolution kernels is calculated as follows:

$$Q^{(l)} = \sigma\left(\mathrm{Bnorm}\left(w_{2}^{(l)} \cdot \left[\sigma\left(\mathrm{Bnorm}\left(w_{1}^{(l)} \cdot U^{(l)} + b_{1}^{(l)}\right)\right)\right] + b_{2}^{(l)}\right)\right) \tag{7}$$

In Equation (7), $\mathrm{Bnorm}$ represents the BatchNorm operation, $w_{1}^{(l)}$ and $w_{2}^{(l)}$ respectively represent the weights of the two convolution kernels of layer $l$, $b_{1}^{(l)}$ and $b_{2}^{(l)}$ respectively represent the biases of the two convolution operations of layer $l$, $\sigma$ represents the ReLU activation function, and $Q^{(l)}$ represents the intermediate feature tensor obtained after the double convolution operation.

The model uses 3 × 3 convolution kernels to ensure adequate feature extraction, with a step size of 1 and a padding of 1. The output feature map size is calculated as follows:

$$H_{out} = \frac{H_{in} + 2P - F_{H}}{S} + 1, \tag{8}$$

$$W_{out} = \frac{W_{in} + 2P - F_{W}}{S} + 1. \tag{9}$$

In Equation (8), $H_{in}$ represents the height of the input feature map, $P$ is the number of padding layers, $F_{H}$ is the height of the convolution kernel, $S$ is the step size, and $H_{out}$ is the height of the output feature map. In Equation (9), $W_{in}$ represents the width of the input feature map, $F_{W}$ is the width of the convolution kernel, and $W_{out}$ is the width of the output feature map; $P$ and $S$ are the same as in Equation (8).

It follows from Equations (8) and (9) that when the step size is 1, the padding is 1, and the double convolution uses 3 × 3 kernels, the spatial size of the feature map is unchanged by the convolutions, while the channel-wise representational capacity is enhanced.
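
The size relation in Equations (8) and (9) is easy to check numerically. The helper below is a minimal sketch with a hypothetical name, `conv_output_size`; it uses floor division, which is an assumption for cases where the division is not exact.

```python
def conv_output_size(size_in, kernel, padding, stride):
    """Output height or width according to Equations (8) and (9)."""
    return (size_in + 2 * padding - kernel) // stride + 1


# 3x3 kernel, padding 1, step size 1: the size is preserved.
same = conv_output_size(256, kernel=3, padding=1, stride=1)
# The same relation applied to a 2x2 window with step size 2 and no padding.
halved = conv_output_size(256, kernel=2, padding=0, stride=2)
print(same, halved)  # 256 128
```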

After the double convolution module, spatial down-sampling is performed with a 2 × 2 max pooling kernel, calculated as follows:

$$P^{(l+1)} = \mathrm{MaxPool}\left(Q^{(l)}\right) \tag{10}$$

In Equation (10), $\mathrm{MaxPool}$ represents the max pooling operation, and $P^{(l+1)}$ represents the tensor output after one max pooling operation.

A double convolution operation and a max pooling operation constitute a complete down-sampling layer. The input feature tensor $I \in \mathbb{R}^{N \times C \times H \times W}$ undergoes four consecutive down-sampling operations, each doubling the number of channels and halving the height and width of the feature map. Ultimately, the output tensor $\hat{I} \in \mathbb{R}^{N \times 1024 \times 16 \times 16}$ enters the decoder section. Through these four down-sampling operations, the model extracts high-level semantic information from the image, laying the foundation for the subsequent up-sampling operations. At the encoder stage, each down-sampling layer reduces the spatial resolution of the feature map while increasing the number of channels, enabling the model to capture more abstract and complex features in the image and supporting the precise segmentation of cracks.
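
The shape bookkeeping of the encoder can be traced with a short sketch. It assumes that an initial double convolution first maps the single grayscale channel to 64 channels; this starting width is not stated in the text but is implied by the 1024 × 16 × 16 encoder output and the 64-channel decoder output. The function name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(channels=64, size=256, steps=4):
    """Return (channels, height, width) before and after each down-sampling step."""
    shapes = [(channels, size, size)]
    for _ in range(steps):
        channels *= 2   # each step doubles the number of channels
        size //= 2      # 2x2 max pooling halves the height and width
        shapes.append((channels, size, size))
    return shapes


shapes = encoder_shapes()
for shape in shapes:
    print(shape)
# The last entry is (1024, 16, 16), matching the encoder output in the text.
```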

4.3. Up-Sampling Spatial Decoding Layer

The up-sampling spatial decoder is employed to gradually restore the spatial resolution of the feature maps while integrating the features collected during the encoder phase to obtain richer detail information. Each up-sampling module consists of a transpose convolution operation followed by a dual-convolution module. Through skip connections, the decoder can merge features from the encoder phase, allowing it to retain more detailed information while recovering the spatial resolution.

Let $T^{(l)} \in \mathbb{R}^{N \times C^{(l)} \times H^{(l)} \times W^{(l)}}$ be the input tensor of the up-sampling module at layer $l$, where $N$ represents the batch size, and $C^{(l)}$, $H^{(l)}$, and $W^{(l)}$ represent the number of channels, height, and width of the feature map of that layer, respectively.

The transposed convolution operation expands the spatial dimensions of the feature map, as shown below:

$$Z^{(l)} = \mathrm{ConvTranspose}\left(T^{(l)}\right) \tag{11}$$

In Equation (11), $Z^{(l)} \in \mathbb{R}^{N \times \frac{C^{(l)}}{2} \times 2H^{(l)} \times 2W^{(l)}}$ represents the feature tensor obtained from the transposed convolution at layer $l$, with the kernel size set to 2 × 2 and the stride set to 2, as required for the spatial dimensions to double. After the transposed convolution, the number of channels is halved and the spatial dimensions are doubled.

Then, the up-sampled feature tensor $Z^{(l)}$ is concatenated along the channel dimension with the corresponding feature tensor $Q^{(l)} \in \mathbb{R}^{N \times \frac{C^{(l)}}{2} \times 2H^{(l)} \times 2W^{(l)}}$ from the encoder stage, as shown below:

$$\hat{Z}^{(l)} = \mathrm{Concat}\left(Z^{(l)}, Q^{(l)}\right) \tag{12}$$

In Equation (12), $\hat{Z}^{(l)} \in \mathbb{R}^{N \times C^{(l)} \times 2H^{(l)} \times 2W^{(l)}}$ represents the output tensor after the skip-connection fusion at layer $l$.

A double convolution operation is then performed on the merged feature tensor $\hat{Z}^{(l)}$. Consistent with the down-sampling stage, 3 × 3 convolution kernels are used to ensure sufficient feature extraction, with a step size of 1 to maintain the feature map size while halving the number of channels. Combined with batch normalization and the ReLU activation function, this completes a full up-sampling process. After four consecutive up-sampling operations, the tensor $O_{final} \in \mathbb{R}^{N \times 64 \times 256 \times 256}$ is produced as the output of the up-sampling module.

4.4. Output of the UNet Module

The output layer includes a convolution operation that uses a 1 × 1 convolution kernel to reduce the number of channels in the feature map to the number of target categories (1 in this article), giving $\hat{O}_{final} \in \mathbb{R}^{N \times 1 \times 256 \times 256}$. The Sigmoid activation function is then applied to map the output values to the range [0, 1], representing the probability that each pixel belongs to the target category. The formula is as follows:

$$\mathrm{ans} = \frac{1}{1 + e^{-\hat{O}_{final}}} \tag{13}$$

In Equation (13), $\mathrm{ans} \in \mathbb{R}^{N \times 1 \times 256 \times 256}$ is the tensor of probability values for each pixel in the feature map.

Through the above steps, the model can effectively extract the crack feature and output the probability value that each pixel belongs to the target category, thereby achieving a probabilistic representation of the crack feature. This design enables the model to better understand the content of the image, providing strong support for precise crack segmentation.

4.5. Loss Function for the UNet-YOLOv8 Crack Segmentation Module

For the crack segmentation task, we adopt a composite loss function consisting of Binary Cross-Entropy (BCE) loss and Dice loss. BCE loss, given in Equation (14), measures the discrepancy between the predicted pixel probabilities $O_{finalOut}$ output by the UNet model and the actual pixel labels $O_{ground}$, while Dice loss, given in Equation (15), emphasizes the overlap between the predicted and ground truth regions, which is especially important for handling the class imbalance between crack and background pixels. The combined loss is defined in Equation (16), where $\varepsilon$ represents a very small value and $\alpha \in (0, 1)$ is the balance weight.

$$L_{BCE} = -\left[O_{ground} \cdot \log\left(O_{finalOut}\right) + \left(1 - O_{ground}\right) \cdot \log\left(1 - O_{finalOut}\right)\right], \tag{14}$$

$$L_{Dice} = 1 - \frac{2\, O_{ground}\, O_{finalOut} + \varepsilon}{O_{ground} + O_{finalOut} + \varepsilon}, \tag{15}$$

$$L_{seg} = \alpha \cdot L_{BCE} + (1 - \alpha) \cdot L_{Dice}. \tag{16}$$
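
The composite loss in Equations (14) to (16) can be sketched on a flattened list of pixels as follows. This is a hedged illustration rather than the authors' code: it assumes that BCE is averaged over pixels and that the Dice terms are summed over pixels, and the function name `seg_loss` and the default values of `alpha` and `eps` are hypothetical.

```python
import math


def seg_loss(target, pred, alpha=0.5, eps=1e-6):
    """alpha * BCE + (1 - alpha) * Dice for flattened pixel lists."""
    n = len(target)
    tiny = 1e-12  # keeps log() finite at probabilities of exactly 0 or 1
    bce = -sum(
        t * math.log(max(p, tiny)) + (1 - t) * math.log(max(1 - p, tiny))
        for t, p in zip(target, pred)
    ) / n
    intersection = sum(t * p for t, p in zip(target, pred))
    dice = 1 - (2 * intersection + eps) / (sum(target) + sum(pred) + eps)
    return alpha * bce + (1 - alpha) * dice


labels = [1, 0, 1, 0]
perfect = seg_loss(labels, [1.0, 0.0, 1.0, 0.0])
uncertain = seg_loss(labels, [0.5, 0.5, 0.5, 0.5])
print(perfect, uncertain)
```

A perfect prediction drives both terms to approximately zero, while a uniformly uncertain prediction is penalized by both the BCE and the Dice term.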

4.6. YOLOv8-Based Boolean-Type Identification Optimization Matrix

Considering that interference information, such as shadows that resemble cracks, can arise during crack feature extraction and lead to mis-identification of similar features as cracks, thereby affecting engineering judgments, this study introduces a Boolean-type identification matrix optimization algorithm based on YOLOv8.

This method first employs the YOLOv8 object detection algorithm to obtain the relative position vector $[x_{center}, y_{center}, \mathrm{width}, \mathrm{height}]$ of cracks in the input feature map, where $(x_{center}, y_{center})$ represents the coordinates of the bounding box center, and $\mathrm{width}$ and $\mathrm{height}$ denote the relative width and height of the bounding box.

Based on the bounding box information $[x_{center}, y_{center}, \mathrm{width}, \mathrm{height}]$, the absolute positions of the four vertices of the crack feature bounding box can be determined as follows:

$$x_{min} = \left(x_{center} - \frac{\mathrm{width}}{2}\right) \times w_{img}, \tag{17}$$

$$y_{min} = \left(y_{center} - \frac{\mathrm{height}}{2}\right) \times h_{img}, \tag{18}$$

$$x_{max} = \left(x_{center} + \frac{\mathrm{width}}{2}\right) \times w_{img}, \tag{19}$$

$$y_{max} = \left(y_{center} + \frac{\mathrm{height}}{2}\right) \times h_{img}. \tag{20}$$

In the above equations, $w_{img}$ is the width of the original image (usually 256), $h_{img}$ is the height of the original image (usually 256), $x_{min}$ and $x_{max}$ are the left and right boundaries of the bounding box, and $y_{min}$ and $y_{max}$ are the upper and lower boundaries of the bounding box, respectively.
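
Equations (17) to (20) amount to a small coordinate conversion, sketched below. The function name `yolo_to_corners` and the sample box are hypothetical; the default image size of 256 follows the text.

```python
def yolo_to_corners(x_center, y_center, width, height, w_img=256, h_img=256):
    """Convert a relative (center, size) box to absolute corner coordinates."""
    x_min = (x_center - width / 2) * w_img
    y_min = (y_center - height / 2) * h_img
    x_max = (x_center + width / 2) * w_img
    y_max = (y_center + height / 2) * h_img
    return x_min, y_min, x_max, y_max


# A box centered in the image, a quarter of its width and half of its height.
corners = yolo_to_corners(0.5, 0.5, 0.25, 0.5)
print(corners)  # (96.0, 64.0, 160.0, 192.0)
```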

As a result, a Boolean identification matrix (MaskMap) with a size of $w_{img} \times h_{img}$, initialized to False, can be established to model the presence or absence of crack features in each region of the feature map. Based on the location information $[x_{min}, x_{max}, y_{min}, y_{max}]$, crack features are known to exist within this range, so the feature points of the MaskMap within this range are marked as True. By processing the information of multiple bounding boxes in this way, the MaskMap can represent all regions in which cracks exist.

The mask map $O_{finalOut}$ of the crack extraction results obtained through the UNet model is binarized to obtain the feature matrix $O_{mask}$ consisting of 0 and 1. The MaskMap and $O_{mask}$ then have the same size, and their values at corresponding positions are multiplied together to form the optimized mask image (OptimizedMask):

$$\mathrm{OptimizedMask}(i, j) = \mathrm{MaskMap}(i, j) \times O_{mask}(i, j). \tag{21}$$

In Equation (21), $i$ and $j$ give the position of each element in the feature matrix. Taking a 5 × 5 pixel crack feature map as an example, the principle by which the optimized mask image is formed is shown in Figure 5.

The pixels marked in red in O_{mask} are the initially extracted pixels (1 denotes a crack feature and 0 a background feature). The red box in the MaskMap delimits the crack feature area: positions inside the box belong to the crack feature area, and positions outside it belong to the background feature area. Superimposing the two retains only the crack features that lie within the crack feature area, so the optimized mask image constrains the extraction range to within the labeled box.
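A small numeric illustration of Equation (21) on a 5 × 5 grid is given below. The matrix values are invented for this sketch and are not the ones drawn in Figure 5; the variable names mirror the symbols in the text.

```python
# Hypothetical binarized UNet output: 1 = crack, 0 = background.
o_mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]

# Hypothetical MaskMap: True inside one box covering rows 0-3, columns 1-3.
mask_map = [[(row <= 3 and 1 <= col <= 3) for col in range(5)] for row in range(5)]

# Equation (21): element-wise product of MaskMap and O_mask.
optimized_mask = [
    [int(inside) * value for inside, value in zip(mask_row, out_row)]
    for mask_row, out_row in zip(mask_map, o_mask)
]

for row in optimized_mask:
    print(row)
```

In this example the crack pixels in column 4 and in the bottom-left corner fall outside the box, so they are set to 0, while the vertical crack in column 2 is kept unchanged.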

Pixel-level crack detection is performed on the input image using the UNet model to obtain a probability distribution map of crack pixels, in which the value of each pixel indicates the probability that the pixel belongs to a crack. Subsequently, the predicted probability map of the UNet model is multiplied pixel by pixel with the mask generated from the YOLOv8 box labels (MaskMap) to optimize the probability values. Inside the YOLOv8 recognition range, the probability value remains unchanged, while outside the range, it is set to 0. This process removes possible false detections of the UNet model while retaining the complete structural information of the cracks, which provides more accurate and reliable base data for the subsequent heat map generation. The optimized heat map eliminates the interference of crack-like features (e.g., shadows), which is of significant use for engineering crack extraction.

5. Experiments and Analysis

Considering that the MultiNet crack identification method comprises two distinct steps, a CNN-based crack classification module and a UNet-YOLO-based crack segmentation module, each implemented using an independent neural network architecture, this paper conducts separate experiments and analyses for each component.

5.1. Experimental Preparation

All experiments were conducted on a personal computer running the Windows operating system, equipped with an NVIDIA GeForce RTX 4060 GPU. The deep learning framework used is PyTorch 2.4.1 with CUDA version 11.8.

In order to validate these two modules with different functions separately, two datasets were selected for experiments:

Dataset for the CNN-based crack classification module: An open-source, publicly available surface crack dataset was used. The original dataset was manually screened and divided into images with crack features and images without crack features. The training dataset used in this paper comprises 4900 images, which were divided into a training set (80%) and a validation set (10%) for real-time monitoring of the training process. To ensure the practicability of the model, a separate test dataset of 2600 images was prepared. These images were carefully labeled to ensure accurate feature identification. Additionally, median filtering was applied to remove noise from the images.

Dataset for the UNet-YOLOv8-based crack segmentation module: An open-source, publicly available dataset of 2083 high-resolution (256 × 256) crack images with corresponding manually annotated binary masks was used. The dataset encompasses various complex scenarios, including open cracks, closed cracks, high-density distribution cracks, and low-density distribution cracks, and covers diverse lighting conditions and background interference. The dataset was divided into a training set (1666 samples) and a validation set (417 samples) at an 8:2 ratio. Additionally, 100 crack feature maps of four types were prepared as the test set. The data distribution is shown in Table 1.
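The 8:2 partition of the 2083 images can be reproduced with a short sketch such as the one below. The helper name `split_indices`, the shuffling, and the fixed seed are assumptions made for illustration; the paper does not state how the split was drawn.

```python
import random


def split_indices(n_samples, train_ratio=0.8, seed=0):
    """Shuffle sample indices and split them into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary assumption
    n_train = int(n_samples * train_ratio)  # truncates 1666.4 to 1666
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(2083)
print(len(train_idx), len(val_idx))  # 1666 417
```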

5.2. Experimental Settings and Result Analyses

5.2.1. For the CNN-Based Crack Classification Module

(1): Experimental setting and process

Before the experiment, the image data were denoised through median filtering. The images were then normalized and uniformly resized to 128 × 128 pixels. Since there are significantly more crack feature images than non-crack feature images, the two classes were augmented unequally. For images with crack features, random rotation and random horizontal flipping were performed. For images without crack features, random rotation, horizontal flipping, vertical flipping, color jittering, random auto-contrast, and random histogram equalization were carried out, so that the heavier augmentation of the minority class reduces the imbalance of the training data.

As listed in Table 2, the experiment was conducted over 200 training epochs with a batch size of 32. Optimization was performed with the AdamW optimizer, initialized with a learning rate of 0.004 and combined with the StepLR scheduler, which multiplies the learning rate by a decay factor of 0.3 every 40 training epochs (listed as patience in Table 2). Additionally, a weight decay of 0.0001 was applied.
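The step schedule described above has a simple closed form, sketched here in plain Python so that it runs without PyTorch. The function name `step_lr` is hypothetical; the formula, the initial rate multiplied by the decay factor once per completed 40-epoch block, is the standard step-decay rule.

```python
def step_lr(epoch, initial_lr=0.004, gamma=0.3, step_size=40):
    """Learning rate at a given epoch under step decay (values from Table 2)."""
    return initial_lr * gamma ** (epoch // step_size)


# Learning rate at the start of each 40-epoch block of the 200-epoch run.
for epoch in range(0, 200, 40):
    print(epoch, step_lr(epoch))
```

Over 200 epochs the rate is reduced four times, ending at 0.004 × 0.3⁴, which is roughly 3.2 × 10⁻⁵.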

During training, the model adjusts the network parameters using the back-propagation algorithm to minimize the loss function (cross-entropy loss). A validation set is used to monitor overfitting and to adjust the training strategy based on validation accuracy. After each iteration, the accuracy and loss values on the training and validation sets are recorded so that the training effect of the model can be analyzed. After training, the model is evaluated on a test set to check its generalization ability on unseen data, with test-set accuracy as the performance measure. Based on the evaluation results, the hyperparameters of the model (e.g., learning rate, dropout ratio, etc.) are further tuned to improve the model’s performance. Table 3 details the process of model evaluation and tuning; by comparing the information in Table 3 and Figure 6, C2 was finally selected as the optimal hyperparameter combination.

To balance scenario complexity and inference accuracy, a network architecture consisting of four convolutional layers, four pooling layers, and two fully connected layers was adopted. With the number of weight layers fixed, the number of channels follows the progression 3-48-96-192-96. Compared to the 3-32-64-128-48 channel progression in Table 3, the former allows the model to capture more complex and abstract features at deeper levels and to extract fine feature information, such as crack morphology and texture, more adequately. The latter channel progression is smoother, which helps prevent the model from becoming overly complex. However, its smaller number of channels restricts the amount of feature information that the model can learn from complex images, which leaves the model insufficiently capable of recognizing complex road cracks.
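To give a rough sense of the capacity difference between the two channel progressions, the sketch below counts the weights and biases of the four convolutional layers only. The 3 × 3 kernel size is an assumption, since the kernel size is not stated in this section, and the helper name `conv_param_count` is hypothetical. Batch normalization and fully connected layers are not counted.

```python
def conv_param_count(channels, kernel_size=3):
    """Count weights and biases of a chain of 2D convolutions.

    `channels` lists input and output channel counts, e.g. 3-48-96-192-96.
    The kernel size of 3 is an assumption made for this illustration.
    """
    total = 0
    for c_in, c_out in zip(channels, channels[1:]):
        total += c_in * c_out * kernel_size * kernel_size + c_out
    return total


wide = conv_param_count([3, 48, 96, 192, 96])
narrow = conv_param_count([3, 32, 64, 128, 48])
print(wide, narrow)  # 374976 148592
```

Under this assumption the wider progression carries about 2.5 times as many convolutional parameters as the narrower one.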

To prevent model overfitting, Dropout2d was used. Relatively low dropout rates were set in the shallow convolutional layers (layers 1 and 2), while higher dropout rates were adopted in the deeper convolutional layers (layers 3 and 4), so as to reduce the model’s dependence on specific features and improve robustness. During the experiments, the model was found to perform better when the Dropout2d rates vary smoothly across layers than when their variance is large. A smooth allocation helps the model maintain stable, continuous learning during training and encourages the network to gradually build a feature hierarchy from simple to complex. If the dropout rate varies too much from layer to layer, the model can easily over-rely on certain features in the early stage of training and ignore other potentially important information. In addition, an excessively high dropout rate may cause information loss and impair learning. The optimal allocation of 0.3, 0.3, 0.5, and 0.4 for the dropout rates was finally determined.

The initial learning rate of 0.004 avoids instability during training while letting the model learn and converge at an appropriate rate: basic features are learned quickly in the early stage of training, and fine adjustments are made in the later stage. For the learning rate scheduler, StepLR and CosineAnnealingLR were compared. StepLR was found to be more convenient and accurate; with an appropriate step size, training progresses from shallow to deep and model convergence is accelerated. In comparison, CosineAnnealingLR hindered convergence and produced large fluctuations in the loss value. Therefore, StepLR was used as the learning rate scheduler.

(2): Experimental results

(i) Accuracy and Convergence for Loss Value: In the model training phase, after 200 training epochs, the accuracy of the model on the training set was stable at 98%, and the accuracy of the test set was stable at 96.41%. During the experiment, the loss value gradually decreased and finally converged to a stable value, indicating that the model has fully learned the features in the data and that the training process shows no overfitting, as shown in Figure 6.

In the model testing phase, the model achieved an accuracy of 93.23% on the test set, which verifies the effectiveness and usefulness of the model in practical applications and confirms the excellent performance of the model in the crack recognition task. The small error in the test set proves that the model is robust in the classification of crack images.

(ii) Comparisons with Other Methods: To verify the performance of the proposed model, we compared it with other machine learning models; the experimental results are shown in Table 4. Four methods, Support Vector Machine (SVM), Random Forest, Naive Bayes, and Decision Tree, were selected for the comparison test. SVM is a supervised learning algorithm that maximizes the margin between classes by finding the optimal separating hyperplane, which makes it especially suitable for high-dimensional data. Random Forest is an ensemble learning model built from multiple decision trees. It improves generalization through a voting mechanism, but it has limited ability to extract complex texture features. Naive Bayes is based on Bayes' theorem and is suitable for simple scenarios. The Decision Tree uses an intuitive tree structure for feature partitioning.

Through the experimental comparative analysis, the test accuracies of each comparison model were 87.14%, 92.73%, 65.33%, and 87.63%, respectively. The results show that the crack recognition method based on the CNN model proposed in this paper exhibits significant advantages considering Accuracy, Recall rate, F1-value, and Mean average precision.

5.2.2. For the UNet-YOLOv8 Crack Segmentation Module

(1): Experimental setting and process

YOLOv8 and UNet were trained in a parallel framework sharing the same preprocessed dataset. YOLOv8 training focuses on the detection task, optimizing bounding box regression and crack presence classification using Binary Cross-Entropy loss. U-Net training targets pixel-level segmentation, optimizing learning rate and weight decay parameters. After each training epoch, YOLOv8 outputs detection boxes while UNet generates pixel-level crack probability maps. All mentioned preprocessing operations, including spatial transformations and dataset partitioning, are applied before experimentation.

Table 5 lists the main hyperparameter settings for the UNet-YOLOv8 crack segmentation module. The experiment consists of 100 training epochs with a batch size of 16. The AdamW optimizer is used (weight decay = 0.0001 and initial learning rate = 0.001) in conjunction with the ReduceLROnPlateau scheduler, which is configured to monitor validation loss with a decay factor of 0.5 and a patience of 8 epochs.
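The behavior of a plateau-based scheduler with these settings can be illustrated with a simplified, framework-free sketch. It is not PyTorch's implementation: the function name `reduce_on_plateau` is hypothetical, any strict decrease in validation loss counts as an improvement, and the improvement threshold, cooldown, and minimum learning rate options of the real scheduler are omitted. The validation losses are invented.

```python
def reduce_on_plateau(val_losses, lr=0.001, factor=0.5, patience=8):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the validation loss has failed
    to improve for more than `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor
            bad_epochs = 0
        history.append(lr)
    return history


# Hypothetical losses: two improving epochs, nine stagnant ones, then progress.
losses = [1.0, 0.9] + [0.95] * 9 + [0.8]
history = reduce_on_plateau(losses)
print(history[9], history[10])  # 0.001 0.0005
```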

Preprocessed images are fed simultaneously into UNet and YOLOv8 for iterative training. The UNet updates parameters through backpropagation using BCELoss:

L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \cdot \ln x_{i} + (1 - y_{i}) \cdot \ln (1 - x_{i}) \right]

(22)

In Equation (22), N is the total number of pixel points, y_{i} is the true label, and x_{i} is the probability value predicted by the model.
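A worked instance of Equation (22) on four pixels is sketched below. The function name `bce_loss`, the clipping constant `eps` (added so that the logarithm stays finite), and the sample labels and probabilities are assumptions for illustration.

```python
import math


def bce_loss(labels, probs, eps=1e-12):
    """Binary cross-entropy of Equation (22), averaged over N pixels."""
    total = 0.0
    for y, x in zip(labels, probs):
        x = min(max(x, eps), 1.0 - eps)  # keep log() finite (assumption)
        total += y * math.log(x) + (1 - y) * math.log(1 - x)
    return -total / len(labels)


# Four hypothetical pixels: two crack (1) and two background (0).
loss = bce_loss([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
print(round(loss, 4))  # 0.1976
```

Here the loss equals −(ln 0.9 + ln 0.9 + ln 0.8 + ln 0.7) / 4, so the least confident pixel (predicted 0.3 for a background label) contributes the largest term.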

The model parameters are updated using the AdamW optimizer, and the learning rate is dynamically adjusted via the ReduceLROnPlateau scheduler. To enhance generalization and prevent overfitting, we implemented early stopping (15-epoch patience on validation IoU plateau) and metric monitoring.
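The early-stopping rule mentioned above (stop once the validation IoU has not improved for 15 epochs) could look like the following sketch. The function name `early_stopping_epoch` and the IoU values are hypothetical, and treating any strict increase as an improvement is an assumption.

```python
def early_stopping_epoch(val_ious, patience=15):
    """Return the epoch index at which training would stop, or None.

    Training stops when `patience` epochs have passed since the best
    validation IoU was observed.
    """
    best_iou = float("-inf")
    best_epoch = -1
    for epoch, iou in enumerate(val_ious):
        if iou > best_iou:
            best_iou, best_epoch = iou, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return None


# Hypothetical curve: IoU peaks at epoch 1 and then stagnates.
stop_epoch = early_stopping_epoch([0.50, 0.60] + [0.55] * 20)
print(stop_epoch)  # 16
```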

In order to better evaluate the performance of the model, this experiment was analyzed using a Confusion Matrix. The Confusion Matrix is denoted as follows:

ConfusionMatrix = \begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}.

In this case, TP (true positive) is a sample of the positive class (crack) that is correctly predicted as positive, TN (true negative) is a sample of the negative class (background) that is correctly predicted as negative, FP (false positive) is a sample of the negative class that is incorrectly predicted as positive, and FN (false negative) is a sample of the positive class that is incorrectly predicted as negative. A model that performs well should have sufficiently large TP and TN and sufficiently small FP and FN.

In the validation phase, the model takes in a standard mask map and generates a predicted mask map, and the two are compared at the pixel level to obtain the four base counts TP, TN, FP, and FN. This experiment uses four evaluation metrics, Precision, Recall, Accuracy, and Specificity, which are given in Equations (23)–(26).

Precision = \frac{TP}{TP + FP}

(23)

Recall = \frac{TP}{TP + FN}

(24)

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(25)

Specificity = \frac{TN}{TN + FP}

(26)

In Equations (23)–(26), Precision denotes the percentage of real cracked pixels among the pixels predicted to be cracked, Recall denotes the percentage of correctly predicted pixels among the real cracked pixels, Accuracy denotes the percentage of correctly predicted pixels among the total pixels, and Specificity denotes the percentage of correctly predicted pixels among the real background pixels.
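The pixel-level comparison and Equations (23)–(26) can be sketched as follows. The helper names `confusion_counts` and `segmentation_metrics`, the zero-denominator guard, and the two 4 × 4 masks are illustrative assumptions, not the authors' evaluation code.

```python
def confusion_counts(pred_mask, true_mask):
    """Count TP, FP, FN, TN by comparing two binary masks pixel by pixel."""
    tp = fp = fn = tn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            if pred and true:
                tp += 1
            elif pred and not true:
                fp += 1
            elif not pred and true:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(tp, fp, fn, tn):
    """Evaluate Equations (23)-(26); a zero denominator yields 0.0 (assumption)."""
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Hypothetical ground-truth and predicted masks (1 = crack, 0 = background).
true_mask = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
pred_mask = [[0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]

counts = confusion_counts(pred_mask, true_mask)
metrics = segmentation_metrics(*counts)
print(counts)   # (3, 1, 1, 11)
print(metrics)
```

In this toy case Accuracy (0.875) and Specificity (about 0.917) are higher than Precision and Recall (both 0.75) because background pixels dominate, which is the same pattern seen when crack pixels occupy only a small part of an image.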

Based on the four model metrics, U-Net model performance can be clearly and comprehensively evaluated. By training and validating the model under different parameter settings, the experiment records the trends and stable values of Precision, Recall, Accuracy, and Specificity during the training process. These metrics provide the experiments with a comprehensive view of model performance.

(2): Experimental results

To verify the effectiveness and accuracy of the method, this section compares the crack identification results of this paper’s method with the original image and conducts ablation experiments to verify how much the YOLOv8-based optimization module and the heat map processing improve identification, as shown in Figure 7. Figure 7a is the original rock crack image. Figure 7b is the identification result as an unoptimized mask map, where neither the optimization module nor the color-mapping operation is used. Figure 7c is the identification result as an unoptimized heat map, where the optimization module is not used. Figure 7d is the identification result where both the optimization module and the color-mapping operation are used. Figure 7 shows that the YOLOv8-based optimization module can significantly improve the crack feature identification accuracy of the method and substantially reduce the influence of non-crack factors (e.g., shadows) on the identification results. Therefore, it can be inferred that combining the accurate localization of crack features by YOLOv8 with the pixel-level crack feature extraction of the U-Net model effectively avoids the interference of crack-like features. Moreover, color-mapped heat maps better show the connectivity of cracks.

Additionally, the evaluation metrics are presented in Figure 8, which shows that Precision, Recall, Accuracy, and Specificity converge to 79%, 70%, 97.5%, and 98.9%, respectively, during the training process. The Precision value indicates that most of the regions predicted to be cracks are indeed crack features, and the Recall value indicates that the model captures most of the crack pixels. The Accuracy value shows that nearly all pixel points are correctly classified, and the Specificity value shows that the model very rarely mistakes non-crack pixels for cracks. The training and validation loss values decay smoothly, and the final fit is reached without overfitting. Therefore, the U-Net model and the training mechanism perform as expected.

After YOLOv8 optimization, the interference of large shadows, pits, and other crack-like features was excluded, so a larger proportion of the pixels predicted to be crack features are real crack pixels. Consequently, Precision was significantly improved on the test set images, with the average Precision increasing to 90.03%, while the other metrics remained relatively stable. Moreover, Figure 9 shows that the loss value curves tend to converge after 80 iterations, which supports the reasonableness of the training setup used in this paper.

Meanwhile, the experiments also added an asphalt pavement crack image, a concrete crack image, and a wall crack image as test sets to verify the generalization ability of the model, as shown in Figure 10. The results show that the method can also accurately extract crack information in other scenarios.

6. Conclusions

With the rapid development of electric power facilities, surface crack identification is crucial for transmission station siting and safety assessment. Traditional methods rely on manual surveys and simple image processing, which have problems of low efficiency and poor accuracy. This paper proposes a novel surface crack identification framework, which significantly enhances the accuracy and generalization capabilities of traditional crack detection methods by innovatively integrating various neural network approaches. Our experimental results indicate that the proposed end-to-end identification method, which includes a CNN-based data filtering module and a crack extraction module combining UNet and YOLO architectures, demonstrates good identification accuracy on rock crack datasets. This suggests that it can effectively support the site selection of transmission towers in power engineering, thereby enhancing the safety and stability of power grid construction. Through intelligent crack recognition technology, it can better address the demands of large-scale and high-precision crack detection, providing a reliable solution for future infrastructure monitoring and maintenance. Additionally, tests on different scenario datasets also reveal the strong generalization ability and adaptability of the method, making it applicable in other crack recognition scenarios.

Author Contributions

Conceptualization, X.T. and P.X.; data curation, J.Z.; formal analysis, X.L.; funding acquisition, X.T.; investigation, X.L. and Y.L.; methodology, R.Z.; project administration, Y.L.; resources, J.Z.; software, X.G.; supervision, X.G.; validation, P.X. and B.Z.; visualization, B.Z.; writing—original draft, B.Z. and R.Z.; writing—review and editing, X.T. and R.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the State Grid Shandong Electric Power Company Science and Technology Project Funding “Research on Intelligent Survey Technology of Transmission and Substation Engineering Based on Multi-source Data Fusion”, grant number 520632240002 (Project number 2024A-097).

Data Availability Statement

We would like to share our datasets and source code. If you would like to get the data related to this article, you can contact us via correspondence email.

Conflicts of Interest

Authors Xiaoxian Tang, Xin Liu, Yuhai Liu, Peng Xie, Jianwen Zhao and Xingqiang Gao are employed by State Grid Shandong Electric Power Company Construction Company. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Yuan, M. Foundation settlement deformation and tilt correction of extra-high voltage transmission tower. Shandong Power Technol. 2015, 42, 78–80.
Zhao, L.; Yuan, P.; Zhang, X.; Zheng, T. Online monitoring of tower settlement of power transmission line in coal mine hollow area based on laser ranging method. J. Xi’an Eng. Univ. 2021, 35, 60–66.
Zhao, H.; Zhou, L.; Tan, M.; Tang, M.; Tong, Q.; Qin, J.; Peng, Y. Early identification of landslide potentials in Sichuan and Chongqing transmission grids based on optical remote sensing and SBAS-InSAR. Remote Sens. Nat. Resour. 2023, 35, 264–272.
GB/T 40112-2021; Specification for Geological Hazard Risk Assessment. National Standardization Administration of China: Beijing, China, 2021.
Li, W.; Li, Y. Analysis of Geological Hazard Types Classification and Disaster Reduction Countermeasures. Urban Constr. Theory Res. (Electron. Ed.) 2014, 36, 6468–6469.
Gu, Y.; Lin, W.; Su, H. Application of impact echo method in nondestructive testing of concrete. Nondestruct. Test. 2004, 26, 468–470.
Zhao, Y.; Wu, J.; Wan, M.; Tan, C. Nondestructive testing technology and application of perspective radar for reinforced concrete. Nondestruct. Test. 2002, 24, 234–236.
Liu, Y.; Fan, J.; Nie, J.; Kong, S.; Qi, Y. A review and prospect of research on digital image method for recognizing cracks on structural surface. J. Civ. Eng. 2021, 54, 20.
Deng, H.; Hou, S.; Zheng, S. Implementation of digital image processing based technology in crack detection. Comput. Sci. Appl. 2023, 13, 2030–2036.
Zhou, G. Research and Realization of Tunnel Crack Detection Technology Based on Structured Light. Master’s Thesis, Hebei University of Science and Technology, Shijiazhuang, China, 2021.
Li, Y. Research on Road Crack Identification Method Based on Laser Three-Dimensional. Master’s Thesis, Chongqing Jiaotong University, Chongqing, China, 2024.
Abdel-Qader, I.; Abudayyeh, O.; Kelly, M.E. An edge detection based crack identification method for bridges. J. Civ. Eng. Comput. 2003, 17, 255–263.
Ayenu-Prah, A.; Attoh-Okine, N. Pavement crack analysis based on empirical modal decomposition. J. Signal Process. 2008, 34, 1–7.
Wei, N.; Zhao, X.; Wang, T.; Song, H. Detection of asphalt pavement cracks based on mathematical morphology. J. Transp. Eng. 2009, 31, 383–387.
Han, H.; Deng, H.; Dong, Q.; Gu, X.; Zhang, T.; Wang, Y. An advanced Otsu method integrated with edge detection and decision tree for crack detection in highway transportation infrastructure. Adv. Mater. Sci. Eng. 2021, 2021, 1–12.
Dung, C.V.; Anh, D.L. Autonomous concrete crack detection using deep fully convolutional neural network. Autom. Constr. 2019, 99, 52–58.
Wang, H.-Y.; Pan, Z.-J.; Cao, J.-K.; Zhang, J.; Guo, B.-D. Pavement crack recognition method based on double convolutional neural network with composite graph. Highw. Transp. Sci. Technol. 2024, 41, 1–9.
Lua, W.K.H.; Yau, P.C.; Seow, C.K.; Wong, D. Lightweight CNN-Based Deep Neural Networks Application in Safety Measurement. In Proceedings of the 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Chengdu, China, 19–21 August 2022; pp. 455–459.
Liu, X.; Chen, Y.; Zhu, A.; Liu, H. Deep learning based tunnel crack recognition method. J. Guangxi Univ. Nat. Sci. Ed. 2018, 43, 2243–2251.
Fan, Z.; Wu, Y.; Lu, J.; Li, W. Automatic pavement crack detection based on structured prediction with the convolutional neural network. arXiv 2018, arXiv:1802.02208.
Liu, Z.; Cao, Y.; Wang, Y.; Wang, W. Computer vision-based concrete crack detection using U-net fully convolutional networks. Build. Autom. 2019, 104, 129–139.
Cao, J.; Yang, G.; Yang, X. Deep learning pavement crack detection based on attention mechanism. J. Comput.-Aided Des. Graph. 2020, 32, 10.
Al-Huda, Z.; Peng, B.; Algburi, R.N.A.; Al-antari, M.A.; Al-Jarazi, R.; Zhai, D. Hybrid deep learning for semantic segmentation of pavement cracks. Artif. Intell. Eng. Appl. 2023, 122, 106142.

Figure 1. Schematic diagram of transmission tower collapse due to cracks in the ground rock surface, with crack examples. (a) The rock mass that provides bearing capacity for the foundation has multiple tiny cracks. (b) When there are cracks on the surface of the rock mass, moisture, chemical substances, plant roots, or human factors may cause the rock to weather and break. Especially when subjected to heavy loads, the rock will rupture along the development direction of the cracks, and (c) the cracks gradually expand over time. (d) When the cracks expand to a certain extent, the rock mass topples and collapses, thereby affecting the safety of the tower base of the transmission tower.

Figure 2. Framework for the crack identification algorithm.

Figure 3. A lightweight CNN-based image classification module.

Figure 4. Primary structure of the UNet network applied to the crack identification scenario.

Figure 5. Optimized mask image formation principle.

Figure 6. Loss values with different parameter combinations.

Figure 7. Rock crack identification results under different methods: (a) original image; (b) unoptimized mask map where the white part indicates the crack; (c) unoptimized heat map where the red and yellow part indicates the crack; (d) optimization of heat maps where the red and yellow part indicates the crack.

Figure 8. Assessment of indicator change curves.

Figure 9. Variation curve of training and validation loss values.

Figure 10. The crack identification effect in different scenarios. White part in the picture of the second row indicates the crack corresponding to the picture of the first row.

Table 1. Distribution of datasets.

Dataset	Training Set	Validation Set	Test Set
Quantities	1666	417	100

Table 2. The main hyperparameter settings for the CNN-based crack classification module.

Main Hyperparameters	Value
Batch size	32
Epochs	200
Learning Rate Tuning Scheduler (LRTS)	StepLR
Decay factor for the LRTS	0.3
Patience for the LRTS	40
Initial learning rate	0.004
Optimizer	AdamW
Weight decay	0.0001

Table 3. Analysis and comparison under different combinations of training process hyperparameters.

Parameter Combination	Parameter Settings					Training Process			Testing Accuracy
Parameter Combination	Conv	Pooling	FC	Channel	Regularized Distribution (Dropout)	Lr Initialization	Optimizer	Scheduler	Testing Accuracy
C1	4	4	2	3-48-96-192-96	0.2 0.3 0.6 0.4	0.004	AdamW	StepLR	92.88%
C2	4	4	2	3-48-96-192-96	0.3 0.3 0.5 0.4	0.004	AdamW	StepLR	93.23%
C3	4	4	2	3-32-64-128-48	0 0.2 0.5 0.3	0.004	AdamW	StepLR	89.61%
C4	4	4	2	3-48-96-192-96	0.2 0.3 0.6 0.4	0.004	AdamW	CosineAnnealingLR	92.69%
C5	4	4	2	3-48-96-192-96	0.3 0.3 0.5 0.4	0.004	AdamW	CosineAnnealingLR	93.04%
C6	4	4	2	3-32-64-128-48	0 0.2 0.5 0.3	0.004	AdamW	CosineAnnealingLR	91.70%

Table 4. Comparisons with other state-of-the-art methods.

Methodologies	CNN-Based Classification Method	Support Vector Machine	Random Forest	Naive Bayes	Decision Tree
Accuracy	96.41%	87.14%	92.16%	65.33%	87.63%
Recall rate	96.04%	85.87%	95.63%	64.96%	95.13%
F1-value	95.94%	87.14%	95.88%	77.56%	94.66%
Mean average precision (mAP)	96.43%	90.32%	94.88%	69.64%	94.66%

Table 5. The main hyperparameter settings for the UNet-YOLOv8 crack segmentation module.

Main Hyperparameters	Value
Batch size	16
Epochs	100
Learning Rate Tuning Scheduler (LRTS)	ReduceLROnPlateau
Decay factor for LRTS	0.5
Patience for LRTS	8
Initial learning rate	0.001
Optimizer	AdamW
Weight decay	0.0001

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
