4.1. Experimental Environment and Parameter Settings
The experiments were conducted on a 64-bit Windows operating system. The hardware configuration included an NVIDIA GeForce RTX 3080 Ti GPU with 12 GB of VRAM, an Intel Core i7-11700K CPU, and 31.9 GB of RAM. The deep-learning framework was PyTorch 2.3.0 with Python 3.8, running on CUDA 12.1. The proposed model was developed on the YOLOv5s architecture. Training used stochastic gradient descent (SGD) with momentum: the initial learning rate was set to 0.01, the momentum to 0.937, and the weight decay to 0.0005. The model was trained for 100 epochs with a batch size of 16, and all input images were resized to 640 × 640 pixels.
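For reference, the settings above correspond to the following optimizer configuration in PyTorch. This is a minimal sketch rather than the actual training script; `model` is a placeholder for the YOLOv5s-based network, which in practice reads these values from its hyperparameter file.

```python
# Minimal sketch of the reported training hyperparameters using the standard
# PyTorch API. "model" is a placeholder for the YOLOv5s-based network.
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.SGD:
    # SGD with momentum, as stated: lr0 = 0.01, momentum = 0.937, weight decay = 5e-4
    return torch.optim.SGD(
        model.parameters(),
        lr=0.01,
        momentum=0.937,
        weight_decay=0.0005,
    )

EPOCHS = 100        # training epochs
BATCH_SIZE = 16     # images per batch
IMG_SIZE = 640      # all inputs resized to 640 x 640 pixels
```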
To validate the effectiveness of the improved loss function, we compared it with the original YOLOv5s model (Figure 6). After approximately 40 iterations, the loss curves stabilize and the models converge. As shown in Figure 6, the loss of the proposed model decays faster and reaches a lower final value than that of the model before optimization, indicating that the DGFW-IoU loss function significantly accelerates convergence.
4.2. Experimental Dataset
Although the proposed CGSW-YOLOv5 algorithm improves the ability to capture fine details, the diversity and complexity of the existing datasets remain limited when handling cracks at different scales and under varying environmental conditions. To further enhance the model’s performance in real-world applications, especially for fine, non-uniform, or occluded cracks, it was necessary to introduce a new dataset that includes more complex backgrounds and multi-scale crack features. Training the model on such a challenging and diverse dataset significantly improves its adaptability and robustness, enabling it to better handle the complex environmental variations encountered in practice. Therefore, the construction of this dataset was divided into three parts. First, Crack500 and SDNET2018 were selected as benchmark test sets. Second, appropriate images were selected from publicly available datasets, such as “Concrete Crack Images for Classification” and “crack-detection”. Finally, additional images of concrete structural cracks were collected from various structures in Dalian City, as shown in Figure 7.
To ensure annotation accuracy, the collected images were manually filtered during the labeling phase. Non-structural microcracks and surface shrinkage cracks caused by thermal stress were removed to avoid label noise that could negatively impact model training. After merging the public and collected data, a total of 5400 crack images were obtained. The image data covered a variety of concrete structures, including bridge decks, retaining walls, and tunnel linings. All crack images were labeled at the pixel level using Labelme 3.16 software and classified into three categories: transverse cracks, longitudinal cracks, and network (crazing) cracks. The dataset included cracks caused by shrinkage, freeze-thaw cycles, mechanical loads, and material aging, offering high diversity and contributing to improved model generalization and adaptability. The classification of transverse, longitudinal, and network cracks was based on crack orientation, formation mechanism, and morphology. Transverse cracks extend perpendicular to the principal stress direction or material texture, typically caused by tensile or shrinkage stress, and appear straight and evenly distributed. Longitudinal cracks run parallel to the main stress direction, often resulting from shear, compression, or freeze-thaw effects, and extend along or obliquely to the structural axis. Network cracks are formed due to multi-directional stress or material degradation, exhibiting irregular intersecting patterns that are dense, shallow, and lack a dominant direction. The dataset was randomly split into a training set and a validation set with a ratio of 8:2.
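The 8:2 split can be reproduced with a simple random partition such as the sketch below. The directory path and random seed are illustrative assumptions; the paper does not specify the splitting script.

```python
# Illustrative 8:2 train/validation split of the merged 5400-image crack set.
import random
from pathlib import Path

def split_dataset(image_dir: str, train_ratio: float = 0.8, seed: int = 0):
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)          # deterministic shuffle for reproducibility
    cut = int(len(images) * train_ratio)
    return images[:cut], images[cut:]            # (training set, validation set)

train_imgs, val_imgs = split_dataset("datasets/concrete_cracks/images")
print(len(train_imgs), len(val_imgs))            # e.g., 4320 / 1080 for 5400 images
```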
4.3. Model Performance Evaluation
To comprehensively evaluate the performance of the improved model, the following metrics were used:
(1) Recall (R): the proportion of correctly detected targets among all ground-truth targets. (2) Precision (P): the proportion of correct detections among all predicted positives. (3) mAP@0.5: the mean average precision at an Intersection over Union (IoU) threshold of 0.5. Higher values of Precision, Recall, and mAP@0.5 indicate better model performance. The formulas for each metric are as follows:
In the formulas, TP is the number of correctly identified cracks, FP the number of incorrectly identified cracks, FN the number of missed cracks, and AP the area under the Precision-Recall (P-R) curve. mAP is the mean of the AP values over the m class labels at a given IoU threshold. Additionally, the number of parameters is an important indicator of model complexity: in general, the fewer the parameters, the lighter the model. The computational load is measured by the number of floating-point operations (FLOPs).
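As a reference for how these complexity metrics are typically obtained in PyTorch, the sketch below counts parameters directly and estimates FLOPs with the thop profiler. The exact profiling tool used in this work is not stated, so thop is an assumption, and the FLOPs-versus-MACs convention varies between papers.

```python
# Sketch of how parameter count and FLOPs can be obtained for a PyTorch model.
import torch
from thop import profile  # pip install thop

def model_complexity(model: torch.nn.Module, img_size: int = 640):
    n_params = sum(p.numel() for p in model.parameters())       # total parameter count
    dummy = torch.randn(1, 3, img_size, img_size)
    macs, _ = profile(model, inputs=(dummy,), verbose=False)    # multiply-accumulate operations
    flops = 2 * macs                                             # FLOPs are often reported as 2 x MACs
    return n_params, flops
```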
4.4. Experimental Results
4.4.1. Comparison of Attention Mechanisms
To address the significant variation in feature information across crack targets of different scales, we propose the Adaptive Multi-Scale Feature Aggregation (AMFA) attention mechanism to enhance multi-scale feature interaction. To evaluate the impact of different attention mechanisms on model performance, six attention modules (GAMAttention, S2A-Net, SEAttention, ShuffleAttention, AMFA, and SOCA) were integrated into the improved model. The experimental results are summarized in Table 1.
The results indicate that, although AMFA slightly increases model complexity, it achieves superior detection performance compared to the other methods. Specifically, on the custom concrete crack dataset, AMFA obtained the highest accuracy of 76.54% and mAP50 of 70.24%, outperforming the second-best SOCA by 0.77% in accuracy and 1.47% in mAP. While the parameter count of AMFA is higher than that of SEAttention, the improvement in feature extraction capability clearly offsets the additional complexity; lightweight attention mechanisms such as SE incur a considerable accuracy loss and are insufficient for capturing fine crack details.
Meanwhile, the parameter counts of the GAM, S2A, and Shuffle attention models increased to 8.77 × 106, 9.13 × 106, and 7.03 × 106, respectively, yet their mAP gains were only 1.51%, 1.28%, and 0.37%, indicating that the performance improvement was not proportional to the increase in model size. In terms of computational efficiency, SOCA requires 2.8 G fewer FLOPs than AMFA but fails to meet real-time monitoring requirements. In contrast, AMFA achieves notable accuracy improvements while maintaining a relatively high FLOPs level, demonstrating that effective allocation of computational resources is crucial for performance optimization.
AMFA enables the network to focus more on informative channels during feature extraction and automatically selects optimal convolution operators, thereby improving crack detection performance and supporting subsequent multi-scale detection. These results confirm that AMFA-weighted features, when passed to the Neck for fusion, achieve a better balance between detection accuracy and efficiency. This provides a more effective solution for the detection of fine-scale cracks.
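The internal structure of AMFA is not detailed in this section. Purely as an illustration of the behavior described above (re-weighting informative channels and softly selecting between convolution operators with different receptive fields), a selective-kernel-style sketch in PyTorch might look as follows. All module and parameter names are hypothetical and should not be read as the actual AMFA implementation.

```python
# Illustrative only: channel re-weighting combined with soft selection between
# convolution branches of different kernel sizes (selective-kernel style).
import torch
import torch.nn as nn

class ChannelSelectAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # two candidate convolution operators with different receptive fields
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5)]
        )
        hidden = max(channels // reduction, 8)
        self.fc = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True))
        self.select = nn.Linear(hidden, channels * len(self.branches))

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)   # B, 2, C, H, W
        pooled = feats.sum(dim=1).mean(dim=(2, 3))                  # global channel context: B, C
        weights = self.select(self.fc(pooled))                      # B, 2*C
        weights = weights.view(x.size(0), len(self.branches), -1).softmax(dim=1)
        # per-channel soft selection between the two operators
        return (feats * weights.unsqueeze(-1).unsqueeze(-1)).sum(dim=1)
```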
4.4.2. Comparison of Loss Functions
To evaluate the impact of different loss functions on model performance, comparative experiments were conducted on the improved model. The tested loss functions include CIoU, EIoU, AlphaIoU, SIoU, and the proposed DGFW-IoU. The experimental results are shown in Table 2. As observed in Table 2, although the proposed DGFW-IoU introduces a slight increase in parameters, it achieves the best performance in terms of precision (P), recall (R), and mAP, obtaining 73.82% in precision, 57.22% in recall, and 69.55% in mAP. These results demonstrate that DGFW-IoU provides more accurate localization for concrete crack detection.
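The DGFW-IoU formulation itself is not reproduced in this section. For context, the sketch below shows only the plain axis-aligned IoU term on which CIoU, EIoU, SIoU, and the proposed variant all build; the dynamic gradient and focal-style weighting terms of DGFW-IoU are omitted.

```python
# Plain IoU regression loss for axis-aligned boxes in (x1, y1, x2, y2) format.
# Variants such as CIoU/EIoU/SIoU add penalty terms on top of this base term.
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)                     # intersection area
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    return 1.0 - iou                                                     # loss decreases as overlap grows
```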
4.4.3. Ablation Experiment
To evaluate the contribution of each improvement module to the overall model performance, we used YOLOv5s as the baseline and progressively integrated the DGFW-IoU loss function along with the other proposed modules. The effectiveness of each component was assessed through a series of ablation experiments, and the results are summarized in Table 3.
Comparing Experiments 2 and 3 with the baseline (Experiment 1), the introduction of the CNeB module and the AMFA convolutional attention mechanism clearly improves model performance, with mAP50 increasing by 1.28% and 0.69%, respectively. From the inference-time perspective, the CNeB module optimizes the computation path using depthwise separable convolution, reducing the per-frame inference time from 59.5 ms to 56.7 ms, whereas AMFA slightly increases it to 58.6 ms because of its dual-branch attention computation. The performance gains can be attributed to the CNeB module's ability to better capture fine-grained crack features, while AMFA enhances the model's perception of multi-scale crack characteristics by adaptively adjusting the receptive field. To further improve performance, Experiment 4 introduces the lightweight LDSConv module, yielding an mAP50 improvement of 0.89%. Notably, LDSConv reduces the inference time to 54.2 ms, a 9.1% decrease relative to the baseline, while maintaining strong feature representation. This is achieved through a dual-stream feature reorganization strategy that enriches feature diversity and improves model generalization. Experiments 5 and 6 show that combining the CNeB module with LDSConv and AMFA yields further gains while reducing the parameter count, without sacrificing accuracy. When CNeB and LDSConv operate jointly, the inference time drops further to 52.9 ms, 2.3 ms less than with LDSConv alone, indicating a cumulative effect of the two modules' computation-path optimizations.
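Since the speedup of CNeB is attributed above to depthwise separable convolution, the generic form of that operator is sketched below for reference. This is the standard operator, not the full CNeB block, and the layer names are illustrative.

```python
# Depthwise separable convolution: a per-channel spatial filter followed by a
# 1x1 pointwise convolution, which shortens the computation path relative to a
# standard dense convolution.
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)  # spatial filtering per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)                               # 1x1 channel mixing
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```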
Finally, as shown in Experiment 8, when all the proposed improvement modules are combined, the model achieves its best performance, with mAP50 rising to 71.74%, a significant improvement over the baseline of Experiment 1. The model is also the most time-efficient, with an inference time of only 42.7 ms, a 28.2% reduction relative to the original YOLOv5s. Across the ablation study, the CNeB module, the AMFA mechanism, the DGFW-IoU loss function, and the LDSConv module were added step by step to assess their individual contributions. The results show that the CNeB module markedly improves the detection of fine cracks, the AMFA mechanism further enhances adaptability to cracks at different scales, and the DGFW-IoU loss function improves the localization of small cracks in complex backgrounds. The LDSConv module optimizes the feature extraction path, substantially reducing inference time while maintaining high detection accuracy, which improves the model's generalization ability and practical applicability in engineering.
4.4.4. Comparison Experiment
To further validate the improvements and performance advantages of CGSW-YOLOv5, we conducted comparison experiments on our custom dataset with several mainstream object detection algorithms, including Faster R-CNN [21], SSD [22], YOLOv3-tiny [23], YOLOv4 [24], YOLOv5s [25], YOLOv7-tiny [26], YOLOv8 [27], and the general-purpose model Segment Anything (SAM) [28], to verify the necessity of domain-specific designs. All experiments used the same training parameter settings, and the results are shown in Table 4.
The experimental results show that, compared to the two-stage detector Faster R-CNN and the single-stage detector SSD, our algorithm improves mAP50 by 12.70% and 13.01%, respectively, while reducing single-frame inference time by 29.5 ms and 27.1 ms. Notably, SAM, as a general-purpose segmentation model, achieves an mAP50 of only 41.27% in the zero-shot detection task, with an inference time of 3200 ms, and its FLOPs (4500 G) are 138 times those of CGSW-YOLOv5, which fully demonstrates the limitations of general models in specialized scenarios. From a computational-efficiency perspective, CGSW-YOLOv5 requires 32.5 G FLOPs, higher than YOLOv5s (15.8 G) but far lower than the computational load of Faster R-CNN and SSD. In addition, key metrics such as parameter count and model weight file size are significantly reduced: its weight file is 55.85 MB, only 50.3% of Faster R-CNN's size and 97.7% smaller than SAM's 2400 MB, making it well suited to embedded devices with limited storage. YOLOv3-tiny, owing to its inherent limitations in multi-object and small-object detection, achieves an mAP50 of only 22.71%, far below the other algorithms. Although SAM has 636 M parameters, its detection accuracy is still 30.47% lower than that of CGSW-YOLOv5, showing that simply increasing model capacity does not solve the specialized detection problem. Compared with YOLOv4, CGSW-YOLOv5s not only holds an advantage in detection accuracy but also reduces the inference time from 66.2 ms to 42.7 ms (the gap relative to SAM is larger still), and it also outperforms YOLOv4 in FLOPs, parameter count, and model weight file size. Although YOLOv7-tiny achieves a detection accuracy of 59.02%, this is still 12.72% lower than our algorithm. Notably, CGSW-YOLOv5s reduces inference time by 30.4% while maintaining higher accuracy than YOLOv7-tiny, thanks to the dual-stream feature reorganization mechanism of the LDSConv module, which improves feature-extraction efficiency through parallel computation paths. In contrast, SAM's ViT-H architecture, with its global attention mechanism, is difficult to optimize along the computation path, resulting in a frame rate of less than 0.3 FPS on ARM devices. In the comparison with YOLOv8, CGSW-YOLOv5 shows clear advantages: mAP50 increases by 3.1%, inference time decreases by 17.1 ms, and FLOPs are 46.7% lower. Although YOLOv8 performs well on conventional datasets, it exhibits notable errors on fine, complex cracks, especially at crack intersections and occlusions. CGSW-YOLOv5 captures these details more accurately, and its single-frame processing time of 42.7 ms meets real-time detection requirements while maintaining high detection accuracy, even in low-computational-resource environments.
In summary, CGSW-YOLOv5s achieves a good balance between accuracy and lightweight design in concrete crack detection tasks. By optimizing feature extraction efficiency and resource allocation, it demonstrates advanced performance and practical applicability in the field of object detection.
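The single-frame inference times quoted in this comparison are of the kind typically obtained with a warm-up phase followed by synchronized timing on the GPU. The paper's exact timing protocol is not stated, so the sketch below is only indicative.

```python
# Typical latency measurement: warm-up iterations, then averaged timing with
# CUDA synchronization so GPU work is fully counted. Assumes a CUDA device.
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, img_size=640, warmup=20, runs=100, device="cuda"):
    model.eval().to(device)
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(warmup):          # warm up kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0   # milliseconds per frame
```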
Based on the results in Table 5 and the statistical tests (the Friedman test, the Nemenyi post-hoc test, and critical difference (CD) analysis), the proposed CGSW-YOLOv5 demonstrates clear advantages over the other models. Although the Friedman test yielded a p-value of 0.066, which does not meet the conventional threshold for statistical significance, the subsequent Nemenyi test revealed significant differences between CGSW-YOLOv5 and models such as SAM and YOLOv3-tiny in detection accuracy and inference time: the differences in average ranks exceeded the critical difference (CD = 8.216), supporting the effectiveness of the proposed improvements. Specifically, CGSW-YOLOv5 outperforms both traditional methods and general-purpose models in mAP50 and inference speed. For example, SAM performs poorly under complex environmental conditions, with an inference time of up to 3200 ms, whereas CGSW-YOLOv5 achieves an inference time of 42.7 ms and a detection accuracy of 72.85%, demonstrating strong adaptability and efficiency. In contrast, no significant difference was observed between YOLOv4 and YOLOv7-tiny, suggesting performance convergence among some lightweight models.
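As an illustration of this statistical protocol, the Friedman test and Nemenyi post-hoc test can be run as sketched below. The score matrix is a placeholder rather than the values from Table 5, and scikit-posthocs is assumed as the implementation of the Nemenyi test.

```python
# Friedman test followed by the Nemenyi post-hoc test over a models-by-settings
# score matrix. The numbers below are placeholders, not the paper's results.
import numpy as np
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare

# rows = evaluation settings (blocks), columns = models; placeholder mAP50 scores
scores = np.array([
    [55.1, 58.4, 60.2, 71.7],
    [54.3, 57.9, 59.8, 72.9],
    [53.8, 58.1, 60.5, 71.2],
    [56.0, 59.2, 61.1, 72.1],
    [54.9, 58.7, 60.0, 71.9],
])

stat, p = friedmanchisquare(*scores.T)          # one sample of scores per model
print(f"Friedman chi2 = {stat:.3f}, p = {p:.3f}")

nemenyi = sp.posthoc_nemenyi_friedman(scores)   # pairwise p-values between models
print(nemenyi)
```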
In summary, CGSW-YOLOv5 significantly improves detection accuracy and practical applicability while maintaining a compact model size and low computational cost, making it suitable for intelligent crack detection in concrete structures under resource-constrained conditions.
4.4.5. Generalization Experiment
In practical applications, environmental factors such as weather changes, lighting conditions, and background noise can significantly affect crack detection. To investigate the impact of these external variables, this study conducted extensive experiments under various environmental conditions, including wind, rain, and fog. The data were preprocessed and split into training and testing sets in an 8:2 ratio. Under the same experimental setup, the CGSW-YOLOv5s concrete crack detection model was compared with Faster R-CNN [21], SSD [22], YOLOv3-tiny [23], YOLOv4 [24], YOLOv5s [25], YOLOv7-tiny [26], YOLOv8 [27], and SAM [28]. The results are shown in Table 6 and Figure 8.
The results in Table 6 show that, compared with Faster R-CNN and SSD, the proposed CGSW-YOLOv5s model achieves the highest detection accuracy while maintaining the lowest number of parameters, reducing inference time by 29.1 ms and 26.7 ms, respectively. Notably, SAM performs poorly in complex environments, with an mAP50 of only 41.78% and an inference time of 3200 ms; with 1332 M parameters and a model size of 4570 MB, it is not feasible for mobile deployment. Although YOLOv3-tiny and YOLOv7-tiny have an advantage in parameter count, their detection accuracy is 47.24% and 13.34% lower than that of our algorithm, respectively, and their inference times are 8.8 ms and 18.1 ms longer, which fails to meet the detection requirements for the various types of concrete cracks.
Compared with the original YOLOv5s, CGSW-YOLOv5s improves detection accuracy by 4.6% and reduces inference time from 65.1 ms to 43.1 ms, owing to the depthwise separable convolution design in the CNeB module and the dual-stream feature reorganization strategy of LDSConv. Compared with the latest YOLOv8, our algorithm is still 16.7 ms faster per frame while maintaining its accuracy advantage and reducing FLOPs by 45.5%. SAM shows lower accuracy in the wind, rain, and fog scenarios, whereas CGSW-YOLOv5 maintains a detection accuracy of 72.85% under these complex environmental conditions with only a 2.3 ms increase in inference time.
The experimental results demonstrate that environmental factors do not lead to a decrease in detection accuracy. By introducing the AMFA module for enhanced multi-scale feature aggregation and optimizing localization accuracy with the DGFW-IoU loss function, the model’s adaptability in complex environments is significantly improved. Specifically, the AMFA mechanism effectively captures multi-scale crack features, reducing missed and false detections caused by environmental interference, while the DGFW-IoU loss function dynamically adjusts the loss weight for small targets, improving the detection accuracy of small cracks in complex backgrounds.
4.4.6. Visual Comparison Experiment
Figure 9 presents the detection results for three different types of concrete structure cracks: transverse cracks, longitudinal cracks, and mesh cracks. For each crack type, the figure shows the detection results of seven comparison models: Faster R-CNN [21], SSD [22], YOLOv3-tiny [23], YOLOv4 [24], YOLOv5s [25], YOLOv7-tiny [26], and YOLOv8 [27].
The model comparison in Figure 9 visually demonstrates the detection advantages of CGSW-YOLOv5, especially in capturing the complete contours of intersecting mesh cracks (highlighted in the red box) against complex backgrounds, with the proposed CGSW-YOLO shown alongside the seven comparison models (Faster R-CNN [21], SSD [22], YOLOv3-tiny [23], YOLOv4 [24], YOLOv5s [25], YOLOv7-tiny [26], and YOLOv8 [27]). For the transverse crack category, CGSW-YOLO achieves confidence scores of 0.99 and 0.92, a clear improvement over Faster R-CNN, SSD, YOLOv3-tiny, YOLOv4, YOLOv5s, YOLOv7-tiny, YOLOv8, and SAM, and the highest detection accuracy among all compared models. In the longitudinal crack category, YOLOv3-tiny exhibits missed detections, whereas CGSW-YOLO not only meets the basic detection requirements but also produces higher confidence scores, making it more suitable for practical applications. For mesh cracks, all baseline models show varying degrees of false negatives and false positives, while CGSW-YOLO attains a confidence score of 1.00, significantly higher than the other models. Overall, CGSW-YOLO delivers robust and accurate performance across all crack types, with particularly strong results on complex mesh cracks, highlighting its superior reliability and detection precision.
The results in Table 7 and Table 8 indicate that false positives for transverse cracks often arise from similar linear surface textures, such as formwork joints. The high missed-detection rate for mesh (reticular) cracks is closely related to their non-uniform, asymmetric topology: when the branching angles deviate from the typical distribution of the training samples, the current models exhibit insufficient feature coupling at intersection nodes. Regarding the high false positive rate under rainy conditions, although the dilated convolution in the AMFA mechanism helps expand the receptive field, it does not effectively suppress the pseudo-symmetric features caused by water-stain reflections.
Figure 10 compares the training curves of the original YOLOv5s model and the improved CGSW-YOLO model in terms of the loss function, mAP50, and mAP50-95. As shown in subfigures (a) and (b), the CGSW-YOLO model converges faster and trains more stably than YOLOv5s. In subfigures (c) and (d), the mAP50 and mAP50-95 curves of the improved CGSW-YOLO are clearly higher than those of YOLOv5s. The CGSW-YOLO algorithm not only provides high-precision detection of concrete structural cracks but also demonstrates strong accuracy and robustness in detecting fine cracks, effectively meeting the practical demands for efficiency and accuracy in crack detection under various environmental conditions. This offers strong technical support for the quality monitoring and maintenance of concrete structures and holds significant value for advancing the health assessment of concrete structures in the construction industry.