1. Introduction
With the global expansion of transportation networks and the gradual extension of bridge service lives, the demand for reliable detection and assessment of structural defects in bridge infrastructure has become increasingly urgent. As one of the most common and critical early-stage damages in bridges, cracks directly reflect fundamental issues such as abnormal structural stress, material degradation, and reduced long-term durability. The accuracy of crack identification not only ensures the scientific validity of bridge safety evaluations but also guides the formulation of data-driven maintenance strategies, thereby exerting a pivotal impact on the overall service life and operational safety of bridge structures [1, 2].
Traditional crack detection methods primarily rely on manual inspection and contact-based measurement techniques [3]. While manual inspection is operationally simple, it suffers from inherent limitations: strong subjectivity (e.g., inconsistent judgments on crack dimensions among different inspectors), low efficiency, and high labor intensity [4]. These drawbacks are particularly prominent for large-span or high-altitude bridge components, where inspections are time-consuming and some areas are difficult or impossible to reach [5]. Contact measurement methods such as laser profiling and ultrasonic testing can achieve high precision in crack width measurement (down to 0.01 mm), but they typically require temporary traffic closures, incur substantial economic costs, and lack scalability for large-scale bridge network inspections. These inherent deficiencies have thus driven an urgent need for more efficient, non-contact, and automated detection solutions [6, 7].
The combination of Unmanned Aerial Vehicle (UAV)-based image acquisition and state-of-the-art computer vision algorithms has emerged as a transformative approach to overcome the limitations of traditional methods [8, 9]. UAVs equipped with high-definition stabilized imaging systems can efficiently capture high-resolution images of bridge surfaces. This not only eliminates the safety risks associated with manual climbing operations but also improves inspection efficiency by 5–10 times compared with manual work. Meanwhile, deep learning-based image segmentation models built on Convolutional Neural Networks (CNNs) have revolutionized crack detection through end-to-end intelligent recognition, and have been successfully applied in road crack extraction, tunnel defect identification, and concrete surface flaw detection [9, 10]. Among these architectures, U-Net [11], a classic model initially developed for biomedical image segmentation, has been widely adapted for bridge crack segmentation. Its symmetric encoder-decoder structure and effective skip connection mechanism facilitate the fusion of shallow-layer detail information (e.g., crack edges) and deep-layer semantic features (e.g., crack contours), making it highly suitable for fine-grained defect segmentation tasks [12, 13]. Inspired by recent work such as MARBLE-DA [14], which emphasizes explainable and domain-adaptive image classification for crack detection, our study highlights the importance of model interpretability alongside accurate segmentation in UAV-based bridge inspection.
Nevertheless, practical bridge crack segmentation using UAV-captured images still faces three core challenges that severely hinder the performance of existing deep learning methods. First, UAV-acquired bridge surface images often contain complex non-structural interferences (e.g., concrete surface textures, steel bar occlusions, and shadow artifacts). These interferences share similar grayscale characteristics with cracks, leading to unstable model responses and increased false detection rates [15, 16, 17]. The issue is particularly pronounced in high-resolution (HR) images, where background details are amplified alongside crack features [18]. Second, the simple concatenation of shallow-layer detail and deep-layer semantic features in traditional U-Net fails to achieve adaptive fusion, resulting in blurred crack edges and missed detections of narrow (width < 0.5 mm) or discontinuous crack segments [19]. Although existing studies have attempted to enhance boundary refinement or address crack discontinuity, they still rely on fixed fusion strategies that lack dynamic adjustments tailored to crack morphology [20, 21]. Third, the extreme imbalance between crack pixels (typically accounting for only 1–5% of the total image area) and background pixels introduces biases into model training, preventing the model from fully learning the discriminative features of cracks [22]. Despite the proposal of specialized loss functions to mitigate this issue, the feature learning bias caused by pixel imbalance remains unresolved in complex practical scenarios [23].
Existing studies have made targeted efforts to address some of the challenges in UAV-based crack segmentation but have not yet produced a comprehensive solution. For example, Liu et al. [24] proposed MFDA-Net, which integrates multi-scale feature fusion and hybrid attention to enhance noise resistance in road crack segmentation, but it lacks dynamic adaptability between shallow edge details and deep semantic features. Zhou et al. [25] developed DBE-YOLO with 3D crack visualization to improve detection accuracy and quantification precision, yet it cannot effectively suppress background interferences such as concrete textures. Pervaiz et al. [26] designed MSMC-U-Net, which combines Transformer and multi-scale context analysis to achieve a high mIoU, but it still uses fixed feature fusion strategies, leading to missed detections of narrow cracks. Wang et al. [27] proposed CSHD with a hybrid-window Transformer to improve mIoU across datasets, but it did not alleviate the bias caused by crack-background pixel imbalance. Zhou et al. [28] optimized YOLO-Edge with an edge branch (achieving 83.8% AP0.5_seg for masonry cracks) but overlooked the mutual influence between background interference and feature fusion. Notably, each of these studies addresses only a single challenge (whether multi-scale feature extraction, edge refinement, or detection efficiency) and does not jointly tackle the three core issues (complex background interference, non-adaptive feature fusion, and pixel imbalance) in UAV-based bridge crack segmentation. This fragmented approach results in performance degradation in practical scenarios, where the three challenges interact: for instance, background noise exacerbates the difficulty of feature fusion, and pixel imbalance weakens the model’s ability to learn features of small cracks. These research gaps highlight the need for an integrated network capable of resolving these intertwined problems together, which is the core focus of this study.
To fill this critical research gap, this study proposes the Crack Segmentation Network integrating Channel-Spatial Attention and Multi-Scale Structural Enhancement (CFM-Net), a novel crack segmentation network based on an optimized U-Net backbone. It integrates three key technologies: channel-spatial attention, gated feature fusion, and morphology-guided multi-scale perception, and is specifically designed to address the aforementioned three challenges through coordinated interactions among its modules. First, the Convolutional Block Attention Module (CBAM) is incorporated to dynamically weight channel and spatial features—this enhances the model’s ability to distinguish crack regions and effectively suppresses complex background interference. Second, the Gated Fusion Module for Skip Connections (GFF) replaces the traditional concatenation operation in U-Net, enabling adaptive fusion of shallow-layer detail and deep-layer semantic features to significantly improve the recognition accuracy of narrow cracks. Third, the Morphology-Guided Multi-Scale Interaction Block (MGMSIB) is constructed to capture crack features across various scales and orientations, thereby enhancing the modeling of crack structural continuity. Additionally, a combined loss function integrating Binary Cross-Entropy (BCE) and Dice loss is adopted to alleviate the adverse effects of data imbalance between crack and background pixels.
The rest of this paper is structured as follows:
Section 2 presents the preliminary UAV-based bridge crack detection framework, laying the foundation for subsequent model improvements.
Section 3 elaborates on the detailed architecture of CFM-Net, including the design principles of its three core modules.
Section 4 presents the experimental setup and result analysis, covering dataset construction, hardware and software parameters, evaluation metrics, comparative experiments, and ablation studies.
Section 5 summarizes the key conclusions of this study and proposes future research directions. This study is expected to provide a reliable technical solution for UAV-based bridge crack detection and to support more intelligent and efficient inspection of bridge structural defects.
2. Preliminary UAV-Based Bridge Crack Detection Framework
As a core component of transportation infrastructure, the structural health of bridges is directly related to traffic safety and the stability of transportation networks. Unmanned Aerial Vehicles (UAVs), leveraging their advantages of flexibility, maneuverability, and efficient coverage, have become the mainstream technical carrier for bridge crack detection, while the development of deep learning provides key technical support for pixel-level crack segmentation. To bridge the gap between theoretical algorithms and engineering practice, the research team has previously developed a preliminary UAV-based system framework for bridge crack detection [29]. This framework fully covers the entire workflow from high-resolution image acquisition, background filtering, and crack segmentation to quantitative output, effectively addressing the limitations of traditional manual inspection, such as low efficiency and poor accessibility to hazardous areas. It thus lays a solid foundation for the subsequent implementation of end-to-end detection technologies.
In the first stage, overlapping tiling technology is employed to process high-resolution images of bridge surfaces captured by UAVs. This technique divides large images into smaller, more manageable blocks, facilitating detailed analysis of specific regions of the bridge. Beyond improving processing efficiency, overlapping tiling enhances the accuracy of crack detection. A bridge classification model is applied to filter out irrelevant background elements such as the sky and water, ensuring that detection is focused solely on the bridge structure. A CNN-based classification model is selected for this task due to its simple structure, mature training process, and efficient optimization techniques, which help reduce annotation costs and minimize the impact of irrelevant background information.
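The overlapping tiling step described above can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the function names, the 512-pixel tile size, and the 64-pixel overlap are assumptions chosen for the example, since the text does not state the values used.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles along one axis so that the whole axis is covered."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # shift the last tile back so it ends at the border
    return origins


def tile_boxes(width, height, tile=512, overlap=64):
    """(left, top, right, bottom) boxes covering a width x height image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


# Hypothetical 1000 x 600 image: three tiles per row, two rows.
boxes = tile_boxes(1000, 600)
```

Shifting the final tile back to the image border, rather than padding it, keeps every tile the same size; neighboring tiles then overlap by at least the requested amount, so a crack cut by one tile boundary appears whole in an adjacent tile.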
In the second stage, a mixed dataset, combining publicly available crack datasets with the bridge-specific image data, is used to train a crack segmentation model. This mixed dataset improves the model’s robustness in various environmental conditions, enabling the system to detect cracks in diverse settings. The crack segmentation is performed using deep learning methods, specifically the U-Net network, which excels at pixel-level segmentation tasks and ensures precise identification of cracks and other types of damage.
The third stage involves the classification and quantification of detected cracks. A custom seed-point growth method is adopted to accurately measure crack length and width, enabling efficient quantification of crack dimensions. Additionally, different types of cracks (e.g., micro-cracks and severe cracks) are classified to provide detailed reports for maintenance personnel. The crack detection results are integrated into a comprehensive reporting system that outputs detailed crack information to support bridge maintenance and management decisions. The architecture of the proposed system is illustrated in Figure 1.
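The custom seed-point growth method is not specified in this section. As a rough illustration of the general idea only, the sketch below uses generic 8-connected region growing on a binary segmentation mask and then derives simple size proxies. The function names, the extent-based length proxy, and the `mm_per_pixel` scale are all assumptions, not the authors' method.

```python
from collections import deque


def grow_region(mask, seed):
    """8-connected region growing on a binary mask (rows of 0/1) from seed (row, col)."""
    h, w = len(mask), len(mask[0])
    if not mask[seed[0]][seed[1]]:
        return set()
    region = {seed}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < h and 0 <= nc < w and mask[nr][nc]
                        and (nr, nc) not in region):
                    region.add((nr, nc))
                    queue.append((nr, nc))
    return region


def crack_size_proxy(region, mm_per_pixel):
    """Crude (length, mean width) in mm: longest bounding-box side and area / length."""
    rows = [r for r, _ in region]
    cols = [c for _, c in region]
    length_px = max(max(rows) - min(rows) + 1, max(cols) - min(cols) + 1)
    width_px = len(region) / length_px
    return length_px * mm_per_pixel, width_px * mm_per_pixel


# Toy mask: one 5 x 2 pixel crack plus two isolated noise pixels.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1],
]
region = grow_region(mask, (1, 1))
length_mm, width_mm = crack_size_proxy(region, mm_per_pixel=0.5)
```

Growing from a seed isolates one connected crack, so unrelated noise pixels do not contaminate its measurements. A production method would more likely measure length along the crack skeleton than from the bounding box.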
Although this preliminary framework has met the basic requirements for engineering applications and demonstrated certain practicality in conventional inspection scenarios, its segmentation performance still has room for optimization when confronting practical challenges such as complex texture interference on bridge surfaces and weak feature expression of narrow cracks. A core limitation lies in the fact that the baseline network architecture lacks targeted design for the unique characteristics of bridge cracks, making it difficult to accurately balance background suppression and crack detail preservation. This has to some extent affected the reliability and engineering applicability of the segmentation results.
To further improve the system’s detection accuracy in complex scenarios and meet the stringent requirements for crack recognition in practical engineering, this study proposes an improved crack segmentation network, CFM-Net, based on the aforementioned preliminary framework. By integrating channel-spatial attention, gated feature fusion, and morphology-guided multi-scale perception modules, CFM-Net specifically addresses the aforementioned core limitations. Subsequent sections will elaborate on its architectural design and optimization strategies in detail.
3. A Concrete Bridge Crack Segmentation Network Based on Improved U-Net
To address the three core challenges in bridge crack segmentation for UAV inspection scenarios—false detections caused by complex background interference, missed detections of narrow cracks due to imbalanced fusion of shallow and deep features, and insufficient modeling of multi-scale crack structural continuity—this study proposes the Crack Segmentation Network integrating Channel-Spatial Attention and Multi-Scale Structural Enhancement (CFM-Net), based on the classic U-Net architecture. Overall, CFM-Net adopts a progressive structural optimization path of CBAM, GFF, and MGMSIB, which effectively resolves bottleneck issues in crack image segmentation, such as weak target perception, strong background interference, and poor scale adaptability, and achieves significant accuracy improvements on multiple real-world datasets.
As shown in Figure 2, the proposed CFM-Net retains the symmetric structure of the U-Net framework while improving performance through three key modules:
(1) First, the network introduces the CBAM (Convolutional Block Attention Module). Through the sequential action of channel attention (average and max pooling sharing MLP prediction weights) and spatial attention (channel pooling and 7 × 7 convolution generating weights), it strengthens the response to crack regions and suppresses background interference. Compared with modules like SE and ECA, CBAM has greater advantages in slender structure perception and computational efficiency.
(2) To address the semantic-detail imbalance in shallow–deep feature fusion, the network embeds the GFF (Gated Fusion Module for Skip Connections). Using dynamic weight adjustment via a reset gate and a select gate, it enhances high-level semantics in the main crack regions and preserves shallow details in edge regions, reducing noise propagation and avoiding information dilution.
(3) To tackle the problem of multi-scale crack parsing, the designed MGMSIB (Morphology-Guided Multi-Scale Interaction Block) adopts a multi-scale parallel structure with 3 × 3, 1 × 3/3 × 1, and 5 × 5/7 × 7 convolutions. Combined with morphology-guided dynamic weight prediction, it accurately captures crack features of different scales and orientations, and has better task adaptability than fixed-structure modules like ASPP.
In addition, the network employs a combined loss function of Binary Cross-Entropy (BCE) and Dice loss to balance the class imbalance problem and improve segmentation accuracy. Overall, through progressive optimization of feature purification, adaptive fusion, and multi-scale refinement, CFM-Net effectively solves the issues of weak target perception, strong background interference, and poor scale adaptability in crack segmentation, significantly enhancing segmentation performance and enabling the network to maintain robust crack detection with high precision in complex environments.
3.1. U-Net Network Architecture
U-Net is a convolutional neural network proposed by Ronneberger et al. in 2015 for high-precision biomedical image segmentation. Its architecture centers on a symmetric encoder and decoder, with performance enhanced by skip connections:
The encoder uses convolution and pooling operations to extract and integrate complex semantic features while reducing the spatial resolution of the input image.
The decoder gradually restores high-resolution details through upsampling.
Skip connections transmit low-level detail information from the encoder to the decoder, which is crucial for image detail restoration and accurate segmentation of object boundaries. The final output layer completes pixel-level category prediction (see Figure 3 for the structure).
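To make the encoder, decoder, and skip-connection pairing concrete, the following sketch tracks only the spatial size of the feature maps. It assumes size-preserving ("same"-padded) convolutions, 2 × 2 pooling, 2× upsampling, and four downsampling stages; these are illustrative assumptions, and the function name is hypothetical.

```python
def unet_stage_sizes(input_size, depth=4):
    """Return (encoder_sizes, decoder_sizes) for a square input.

    Assumes size-preserving convolutions, 2x2 pooling and 2x upsampling.
    """
    if input_size % (2 ** depth) != 0:
        raise ValueError("input_size must be divisible by 2**depth")
    # One entry per encoder stage; the last entry is the bottleneck.
    encoder = [input_size // (2 ** d) for d in range(depth + 1)]
    # Each decoder stage doubles the resolution of the previous one.
    decoder = [encoder[-1] * (2 ** d) for d in range(1, depth + 1)]
    return encoder, decoder


enc, dec = unet_stage_sizes(512)
# Each decoder stage is paired with the encoder stage of equal resolution.
skips = list(zip(reversed(enc[:-1]), dec))
```

Because each skip connection joins two feature maps of equal resolution, they can be concatenated channel-wise without resampling, which is what allows fine edge detail from the encoder to reach the decoder.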
U-Net demonstrates excellent performance in the field of medical image segmentation, especially in handling images with complex structures and rich details. It can effectively address common challenges such as blurred boundaries, small-scale feature recognition, and limited datasets. This advantage also applies to binary crack image segmentation: since the shape and texture of cracks are often similar to the background, high precision in edge segmentation is required. With its outstanding detail extraction and boundary restoration capabilities, U-Net can efficiently complete crack segmentation tasks even when data is limited.
3.2. Convolutional Attention Module
Proposed by Woo et al. [30], the Convolutional Block Attention Module (CBAM) applies channel attention and spatial attention in sequence to convolutional feature maps, enhancing the response of both key channels and target regions.
As shown in Figure 4, the CBAM architecture consists of two sub-modules: the Channel Attention Module and the Spatial Attention Module.
(1) In the Channel Attention Module: First, average pooling and max pooling are applied over the spatial dimensions of the input feature map to generate two 1 × 1 × C spatial context descriptors (denoted as Avg and Max). These two descriptors are then fed into a shared Multi-Layer Perceptron (MLP) with one hidden layer. The hidden layer has C/r neurons (where r is the reduction ratio) and uses the ReLU activation function; the output layer has C neurons. The two outputs of the shared MLP are added element-wise and activated by the sigmoid function to generate the channel attention map (Mc), which is multiplied element-wise with the input feature map to give the channel-refined feature map.
(2) In the Spatial Attention Module: Taking the channel-refined feature map as input, max pooling and average pooling along the channel dimension are first performed to obtain two H × W × 1 feature maps. These two feature maps are then concatenated along the channel dimension, reduced to a single-channel H × W × 1 feature map via a 7 × 7 convolution, and activated by the sigmoid function to generate the spatial attention map (Ms).
(3) Finally, the spatial attention map (Ms) is multiplied element-wise with the channel-refined feature map to obtain the final output feature.
The entire process retains information more effectively through parallel pooling and sequentially focuses on important channels and key regions, thereby enhancing the model’s feature representation capability.
Its structure can be mathematically expressed as:
$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$
where $F$ denotes the input feature map, $F'$ and $F''$ denote the channel-refined feature map and the final output, $M_c$ and $M_s$ represent the channel attention map and spatial attention map, respectively, and $\otimes$ denotes element-wise multiplication. Channel attention extracts channel statistical features through global average pooling and max pooling, and learns channel weights using two fully connected layers; spatial attention obtains the spatial attention map using 7 × 7 convolution based on the results of channel pooling.
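The two attention steps can be traced on a tiny tensor with the pure-Python sketch below. It is meant only to show the data flow (channel gate first, spatial gate second), not an efficient or trained implementation. The nested-list layout (C × H × W), the bias-free layers, the zero padding, and all weight arguments are assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat, w1, w2):
    """Return C channel weights. feat: C x H x W; w1: (C/r) x C; w2: C x (C/r)."""
    def shared_mlp(vec):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    n = len(feat[0]) * len(feat[0][0])
    avg = [sum(sum(row) for row in ch) / n for ch in feat]  # global average pooling
    mx = [max(max(row) for row in ch) for ch in feat]       # global max pooling
    return [sigmoid(a + m) for a, m in zip(shared_mlp(avg), shared_mlp(mx))]


def spatial_attention(feat, kernel):
    """Return an H x W map. kernel: 2 x k x k weights over [avg, max] channel-pooled maps."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    k = len(kernel[0])
    pad = k // 2
    avg = [[sum(feat[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(feat[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    pooled = [avg, mx]
    att = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for p in range(2):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:  # zero padding keeps H x W
                            acc += kernel[p][di][dj] * pooled[p][y][x]
            att[i][j] = sigmoid(acc)
    return att


def cbam(feat, w1, w2, kernel):
    """Channel attention followed by spatial attention on a C x H x W feature."""
    mc = channel_attention(feat, w1, w2)
    refined = [[[mc[c] * v for v in row] for row in ch] for c, ch in enumerate(feat)]
    ms = spatial_attention(refined, kernel)
    return [[[ms[i][j] * v for j, v in enumerate(row)] for i, row in enumerate(ch)]
            for ch in refined]
```

Both gates pass through a sigmoid, so each output value is the input scaled by a factor between 0 and 1: the module can attenuate background responses but never amplifies a feature beyond its input magnitude.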
Compared with existing attention modules:
Unlike the SE module [31], which only focuses on the channel dimension, CBAM adds spatial modeling capabilities.
Compared with the local cross-channel interaction of the ECA module [32], the global statistics of CBAM are more suitable for slender structures like cracks.
Compared with the Coordinate Attention (CA) module [33], CBAM has a simpler structure and higher computational efficiency, making it suitable for real-time processing of UAV images.
The application of CBAM in this task focuses on enhancing the response activation of crack regions and reducing interference from backgrounds such as steel bars, concrete textures, and shadows, thereby providing a more discriminative feature foundation for subsequent decoding and reconstruction.
3.3. Gated Fusion Module
Although CBAM can effectively improve the quality of feature representation, directly concatenating or adding shallow and deep features may still lead to the problem of semantic-detail imbalance: deep features have stronger semantic discriminative ability but blurred structural details, while shallow features retain edge contours but have weak semantic distinguishability. To solve this fusion problem, this study introduces the Gated Fusion Module for Skip Connections (GFF).
As shown in Figure 5, let $E_l$ denote the encoder feature at the $l$-th scale after CBAM refinement, and let $D_l$ denote the corresponding decoder feature after upsampling and channel alignment:
$$E_l,\ D_l \in \mathbb{R}^{C_l \times H_l \times W_l}$$
Before fusion, the decoder feature is adjusted to have the same spatial size and channel number as the encoder feature. The two features are then concatenated along the channel dimension:
$$Z_l = \mathrm{Concat}(E_l, D_l)$$
Based on the concatenated feature $Z_l$, two learnable transformations $W_r$ and $W_s$ are used to generate the reset gate and the select gate, respectively:
$$R_l = \sigma(W_r(Z_l)), \qquad S_l = \sigma(W_s(Z_l))$$
where $\sigma$ denotes the sigmoid activation function. The reset gate $R_l$ and select gate $S_l$ are constrained to the range [0, 1] and are used as adaptive attention weights during feature fusion. If the gates are generated as spatial attention maps, they are broadcast along the channel dimension when performing element-wise multiplication.
The reset gate $R_l$ is first applied to the encoder feature to suppress irrelevant shallow background information:
$$\tilde{E}_l = R_l \odot E_l$$
where $\odot$ denotes element-wise multiplication. The reset encoder feature is then concatenated with the decoder feature and transformed by a learnable fusion operation $\Phi$:
$$\tilde{F}_l = \Phi(\mathrm{Concat}(\tilde{E}_l, D_l))$$
Finally, the select gate $S_l$ adaptively controls the contribution of the newly fused feature $\tilde{F}_l$ and the original decoder feature $D_l$:
$$F_l^{out} = S_l \odot \tilde{F}_l + (1 - S_l) \odot D_l$$
In this formulation, $R_l$ mainly suppresses noise contained in shallow encoder features, while $S_l$ determines how much complementary shallow information should be introduced into the decoder stream. Compared with direct concatenation, the proposed gated fusion strategy adaptively balances fine boundary details and high-level semantic information, thereby reducing background noise propagation and improving the segmentation of narrow and discontinuous cracks.
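The gating steps can be traced at a single pixel with the sketch below. It treats each learnable transformation as a bias-free 1 × 1 convolution (a matrix-vector product over channels) and blends the fused and decoder features with a convex combination controlled by the select gate. Both choices, along with all names and weight arguments, are assumptions consistent with the description rather than the confirmed implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, vec):
    """A 1x1 convolution evaluated at one pixel is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_fusion_pixel(e, d, w_r, w_s, w_f):
    """Fuse encoder vector e and decoder vector d (both length C) at one pixel.

    w_r, w_s, w_f are hypothetical C x 2C weight matrices for the reset gate,
    the select gate and the fusion operation.
    """
    z = e + d                                  # channel concatenation (list join)
    r = [sigmoid(v) for v in linear(w_r, z)]   # reset gate in (0, 1)
    s = [sigmoid(v) for v in linear(w_s, z)]   # select gate in (0, 1)
    e_reset = [ri * ei for ri, ei in zip(r, e)]            # suppress shallow noise
    fused = linear(w_f, e_reset + d)                       # fuse reset encoder + decoder
    # Select gate blends the fused feature with the original decoder feature.
    return [si * fi + (1.0 - si) * di for si, fi, di in zip(s, fused, d)]


# Hypothetical two-channel example with all-zero weights: both gates equal 0.5
# and the fused feature is zero, so the output is half the decoder feature.
zeros = [[0.0] * 4 for _ in range(2)]
out = gated_fusion_pixel([1.0, -2.0], [0.5, 3.0], zeros, zeros, zeros)
```

With a select gate near 0 the decoder feature passes through unchanged, and with a gate near 1 the output is dominated by the fused feature, which is how the module can decide per location how much shallow detail to admit.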
The advantage of GFF lies in the adjustability of the fusion process: it enhances high-level semantic representation in the main crack regions and improves shallow details in the edge contour regions. Compared with the direct concatenation in traditional U-Net, GFF can significantly reduce the risk of background noise propagation; compared with additive fusion, it avoids the problem of information dilution caused by averaging; compared with some attention fusion modules (e.g., AFM [34]), it has a lighter structure and fewer parameters, making it suitable for real-time scenarios of high-resolution UAV images.
It is worth noting that CBAM and GFF are not stacked independently but form an information flow path of feature purification and adaptive fusion: the preprocessed output of CBAM serves as the input of GFF, which helps to improve the discriminability of fusion. When generating gating weights, GFF can also leverage the spatial response map output by CBAM to achieve region enhancement guidance—strengthening details in crack regions and suppressing noise in background regions, thus realizing spatial-semantic collaborative optimization.
3.4. Morphology-Guided Multi-Scale Interaction Module
To further enhance the model’s ability to parse multi-scale crack structures, this study proposes the Morphology-Guided Multi-Scale Interaction Block (MGMSIB). In UAV images, bridge cracks exhibit large variations in scale, diverse orientations, and complex morphologies, making it difficult for traditional decoders to balance the restoration of narrow and wide crack regions during structure reconstruction. The MGMSIB module adopts a multi-scale parallel structure, as shown in Figure 6.
Let $X_l$ denote the feature map input to the MGMSIB at the $l$-th decoding stage. This module consists of five parallel convolution branches with distinct receptive fields:
$$\mathcal{K} = \{3 \times 3,\ 1 \times 3,\ 3 \times 1,\ 5 \times 5,\ 7 \times 7\}$$
For each branch $k$, the corresponding multi-scale feature is extracted as:
$$F_k = \mathrm{Conv}_k(X_l)$$
where $k \in \mathcal{K}$. Appropriate padding is adopted in each branch to keep the spatial resolution unchanged. The 3 × 3 branch captures local crack details, the 1 × 3 and 3 × 1 branches enhance horizontal and vertical linear structures, and the 5 × 5 and 7 × 7 branches capture wider cracks and larger structural trends.
To adaptively determine the importance of different receptive fields, a morphology-aware weight generation branch is introduced. This branch takes the input feature $X_l$ as input and predicts a set of branch weights:
$$w = \phi(\mathrm{MLP}(G(X_l)))$$
where $G(\cdot)$ denotes the feature descriptor extraction operation used to summarize the morphology-related response of $X_l$, MLP denotes a lightweight multi-layer perceptron, and $\phi$ denotes the weight activation or normalization function. The predicted weight vector is defined as:
$$w = [w_{3 \times 3},\ w_{1 \times 3},\ w_{3 \times 1},\ w_{5 \times 5},\ w_{7 \times 7}]$$
Each weight $w_k$ measures the relative importance of the corresponding convolutional branch for the current input feature. If softmax normalization is adopted, the weights satisfy:
$$\sum_{k \in \mathcal{K}} w_k = 1, \qquad w_k > 0$$
The final output of MGMSIB is obtained by weighted multi-scale fusion:
$$Y_l = \sum_{k \in \mathcal{K}} w_k F_k$$
Through this dynamic weighting mechanism, MGMSIB can assign higher weights to 1 × 3 and 3 × 1 branches when the input feature contains elongated thin cracks, while increasing the contribution of 5 × 5 and 7 × 7 branches for wider or discontinuous crack regions. Compared with multi-scale modules with fixed structures (e.g., ASPP [35], Inception [36]), MGMSIB has stronger task orientation and structural adaptability.
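The dynamic weighting can be illustrated with the small sketch below. It assumes softmax normalization over five branch scores and takes the branch features and scores as given inputs, so the convolutions and the weight-prediction MLP are not modeled. The names `KERNELS`, `same_padding`, and `fuse_branches` are hypothetical.

```python
import math

# Kernel shapes (height, width) of the five parallel branches.
KERNELS = {"3x3": (3, 3), "1x3": (1, 3), "3x1": (3, 1), "5x5": (5, 5), "7x7": (7, 7)}


def same_padding(kh, kw):
    """Padding that keeps spatial size for odd kernels at stride 1."""
    return (kh // 2, kw // 2)


def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branch_feats, scores):
    """Weighted sum of branch outputs.

    branch_feats: dict branch name -> flat list of feature values (equal lengths).
    scores: one raw score per branch, in the order of KERNELS.
    """
    names = list(KERNELS)
    weights = softmax(scores)
    n = len(branch_feats[names[0]])
    fused = [sum(w * branch_feats[b][i] for w, b in zip(weights, names)) for i in range(n)]
    return fused, weights


# Toy features: a high score for the 1x3 branch makes the output follow that branch.
feats = {"3x3": [1.0, 1.0], "1x3": [2.0, 0.0], "3x1": [0.0, 2.0],
         "5x5": [3.0, 3.0], "7x7": [4.0, 4.0]}
fused, weights = fuse_branches(feats, [0.0, 50.0, 0.0, 0.0, 0.0])
```

Because the weights sum to one, the fused output stays on the same scale as the individual branches, and the block reduces to a plain average when the predicted scores are equal.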
In addition, the design of MGMSIB forms a sequential collaborative relationship with the GFF module: the fused features output by GFF are further refined in terms of response according to crack morphology after entering MGMSIB, which effectively enhances the information complementarity between features of different scales and prevents structural discontinuity and scale deviation.
3.5. Loss Function
3.5.1. Binary Cross-Entropy (BCE) Loss
To guide pixel-wise classification in crack segmentation, we adopt the standard Binary Cross-Entropy (BCE) loss. BCE evaluates the discrepancy between predicted probabilities and ground truth labels for each pixel, providing stable gradient signals for model optimization. In crack images, since crack pixels occupy a small portion of the image, BCE alone may underrepresent slender cracks, but in combination with Dice loss (see next section), it still provides strong pixel-level supervision.
The BCE loss is defined as follows:
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log p_i + (1 - g_i)\log(1 - p_i)\right]$$
where $p_i$ denotes the predicted probability for the $i$-th pixel, $g_i$ is the corresponding ground truth label, and $N$ is the total number of pixels.
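A minimal sketch of the BCE computation on a flattened image is given below. The clamping constant `eps` is an implementation detail assumed here to keep the logarithm finite, and the 100-pixel example with two crack pixels is hypothetical.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        total += g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return -total / len(pred)


# Hypothetical 100-pixel image with 2 crack pixels, all predicted as background.
target = [1, 1] + [0] * 98
all_background = [0.01] * 100
loss = bce_loss(all_background, target)
```

In this example the loss is only about 0.10 even though both crack pixels are missed, because the 98 correctly classified background pixels dominate the average. This is the imbalance effect that motivates pairing BCE with an overlap-based term.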
3.5.2. Dice Loss
In crack images, the background class dominates, resulting in a pronounced class imbalance. Using only BCE loss may therefore cause the model to misclassify key crack regions as background, which degrades feature learning. The Dice loss mitigates this imbalance by directly measuring the overlap between the predicted region and the ground truth region. Its mathematical form is:
$$L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} p_i g_i + \epsilon}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} g_i + \epsilon}$$
where $p_i$ denotes the predicted value of the $i$-th pixel, $g_i$ is the corresponding ground truth label, $N$ is the total number of pixels, and $\epsilon$ is a small constant used to prevent division by zero.
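A minimal sketch of a Dice loss on flattened pixels follows. It assumes one common soft-Dice variant, with unsquared sums in the denominator and the smoothing constant added to both numerator and denominator; the exact variant used in the study is not confirmed here.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - (2 * overlap + eps) / (sum(pred) + sum(target) + eps)."""
    overlap = sum(p * g for p, g in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 - (2.0 * overlap + eps) / (denom + eps)


# Hypothetical 100-pixel image with 2 crack pixels, all predicted as background.
target = [1, 1] + [0] * 98
all_background = [0.01] * 100
loss = dice_loss(all_background, target)
```

Here the loss stays close to 1 because almost none of the crack area is recovered, no matter how many background pixels are classified correctly. This makes the Dice term sensitive to exactly the failure that an averaged pixel-wise loss tends to hide.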
3.5.3. Combined Loss Function
Combining the BCE and Dice loss functions can effectively improve the model’s accuracy. BCE provides stable pixel-wise supervision; Dice loss enhances the model’s ability to perceive the overall shape of cracks by quantifying the overlap between the predicted and actual regions. The synergistic effect of the two can effectively improve segmentation accuracy. In addition, adjusting the weight ratio of the two in the combined loss can further optimize the balance between pixel-level classification accuracy and shape perception ability, thereby achieving optimal segmentation performance. The total expression of the loss function is as follows:
$$L_{total} = \alpha L_{Dice} + \beta L_{BCE}$$
where α = 0.7 and β = 0.3 are weight coefficients that control the proportion of the two terms in the total loss. The larger Dice weight emphasizes overlap optimization for the sparse foreground, such as slender cracks, while the BCE term retains sufficient cross-entropy gradient to stabilize training and refine boundary probabilities.
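The weighted combination can be sketched as below. Assigning 0.7 to the Dice term and 0.3 to the BCE term is an assumption inferred from the explanation above (overlap emphasized, cross-entropy retained for stability), and the Dice variant and clamping constants are likewise illustrative choices.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() stays finite
        total += g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return -total / len(pred)


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss with smoothing constant eps."""
    overlap = sum(p * g for p, g in zip(pred, target))
    return 1.0 - (2.0 * overlap + eps) / (sum(pred) + sum(target) + eps)


def combined_loss(pred, target, w_dice=0.7, w_bce=0.3):
    """Weighted sum of Dice and BCE; the weight assignment is an assumption."""
    return w_dice * dice_loss(pred, target) + w_bce * bce_loss(pred, target)


# A perfect prediction versus one that misses the only crack pixel.
target = [1, 0, 0, 0]
perfect = combined_loss([1.0, 0.0, 0.0, 0.0], target)
missed = combined_loss([0.0, 0.0, 0.0, 0.0], target)
```

Keeping the weights as function arguments makes it straightforward to test other ratios when tuning the balance between pixel-level accuracy and region overlap.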
4. Experiments and Result Analysis
4.1. Dataset
4.1.1. Project Overview
The Yiling Yangtze River Bridge is a key urban cross-river artery in Yichang City, Hubei Province, connecting the northern (Xiling District) and southern banks of the Yangtze River. It plays a vital role in alleviating regional cross-river traffic pressure and promoting integrated development of the city’s north-south areas. Completed and opened to traffic on 28 December 2001, the bridge has been in continuous service for 22 years, bearing long-term dynamic loads from mixed urban traffic (motor vehicles, non-motorized vehicles, and pedestrians). With a total length of 3246 m, it consists of three functional components: the northern approach bridge and overpass (1150 m, linking to Yiling Avenue and Sports Road), the southern approach bridge and overpass (1160 m, connecting to Wuyi Road and Wulong New District), and the main bridge (936 m, spanning the Yangtze River’s main channel), which serves as the core load-bearing structure.
The main bridge adopts a single-cable-plane concrete stiffened-girder three-pylon cable-stayed design, optimized for wide river channels and navigation requirements. With a deck width of 23 m, it accommodates four bidirectional motor vehicle lanes, two non-motorized lanes, and two sidewalks to meet urban mixed traffic demands. Its span arrangement is symmetric about the middle pylon: (38.0 + 38.5 + 43.5) m (northern approach spans) + 2 × 348.0 m (main spans) + (43.5 + 38.5 + 38.0) m (southern approach spans), with the 348 m main span ensuring navigation clearance for large vessels. Notably, the bridge operates in a subtropical monsoon climate (high rainfall, humidity, and occasional extreme temperatures) with exposure to river vapor and industrial corrosive factors, accelerating concrete degradation and crack initiation—making it a representative scenario for validating crack detection models. The bridge’s location and front elevation are shown in
Figure 7 and
Figure 8:
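As a quick consistency check on the dimensions quoted above, the span arrangement sums to the stated main-bridge length, and the three components sum to the stated total length:

$$(38.0 + 38.5 + 43.5) + 2 \times 348.0 + (43.5 + 38.5 + 38.0) = 120 + 696 + 120 = 936\ \text{m}$$
$$1150 + 1160 + 936 = 3246\ \text{m}$$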
4.1.2. Data Collection
After clarifying the shooting objects and defect distribution characteristics, UAV aerial photography technology was used for systematic data collection. With the widespread application of UAV technology in infrastructure inspection, especially in refined tasks such as bridge crack detection, higher requirements have been put forward for the imaging quality, flight stability, and positioning accuracy of aerial platforms. Since public datasets generally lack high-quality bridge crack images captured under real UAV inspection environments, constructing a real-scenario dataset with engineering applicability has become a key objective of this study. To this end, it is necessary to first address the selection of UAV platforms to ensure the effectiveness and quality control of data collection. Based on the specific needs of bridge inspection operations, a performance comparison was conducted between two quadcopter UAV devices owned by the research team: Matrice 350 RTK and Mavic 3 Enterprise (Da-Jiang Innovations, Shenzhen, China), as shown in Figure 9.
As shown in Table 1, although the Mavic 3 Enterprise has advantages such as easy operation and moderate price, making it suitable for image collection in general scenarios, it has obvious limitations in image resolution, lens expandability, and adaptability to complex environments. In contrast, the Matrice 350 RTK can be equipped with Zenmuse series high-resolution lenses and high-precision radar systems, enabling it to capture high-pixel, detailed images. It also integrates dual positioning modules (GNSS and RTK), which can achieve high-precision positioning even under harsh conditions such as occlusion and interference, providing strong support for obtaining high-quality images.
Using the Matrice 350 RTK, multiple rounds of on-site aerial photography were carried out on the Yiling Yangtze River Bridge in Yichang City, Hubei Province. The missions focused on key structural parts such as the bridge deck, beam bottom, and piers, and used multi-angle, multi-altitude, and multi-time-period shooting to systematically collect crack images covering different defect types and morphologies. The acquired images were then screened for image quality, composition integrity, and crack recognizability. In total, 11,298 high-quality bridge crack images were retained to form the self-built dataset of this study.
Meanwhile, to improve the model’s generalization in bridge crack segmentation, 11 public datasets were collected: RDD2022, Structural-Defects Network (SDNET) 2018, CrackForest-dataset-master, DeepCrack, LCW (Concrete Crack Detection), Crackseg9k, GAPs384, AEL (Active Learning Crack Evaluation Dataset), CFD (Crack Forest Dataset), CrackTree200, and Crack500. From these datasets, 11,812 high-quality crack images were selected as the source pool for the mixed public dataset.
4.1.3. Data Processing
Data augmentation is a basic technique in machine learning and deep learning that improves a model’s generalization by enriching the diversity of the training data [37]. It not only improves model performance but also reduces the risk of overfitting when training data are limited or imbalanced. Specifically, data augmentation applies a series of transformations to the original images, such as rotation, translation, horizontal or vertical flipping, and scaling (up-sampling or down-sampling), to generate new training samples. By learning from these transformed samples, the model learns to perceive targets at different angles, positions, and sizes, and thus adapts better to complex real-world scenarios. In this study, image augmentation techniques (e.g., adjusting hue, saturation, and exposure) were used to strengthen generalization under different imaging conditions. The image augmentation effects are illustrated in Figure 10.
After augmentation, a subset of 4000 high-quality bridge crack images was manually selected from the self-built and public datasets. Selection criteria included image clarity, completeness of bridge structural coverage, and unambiguous crack annotations, so that the training set contains representative examples of various crack types with high-quality labels and selection bias is reduced. All selected images were manually annotated using the Labelme tool; the annotations were saved in JSON format, with the background shown in black and cracks in white. To reduce potential information leakage between the training, validation, and test sets, the 7:2:1 split was accompanied by a manual check of images from the same bridge section or spatially adjacent UAV frames, so that highly similar crack instances were assigned to the same subset where possible. After verification, the three subsets are mutually exclusive and together form the Mix Bridge Crack dataset used for model training. Similarly, 8000 images were randomly sampled from the 11,812 selected public images to form the mixed public dataset, which was also split at a 7:2:1 ratio.
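The leakage-aware 7:2:1 split described above can be illustrated with a minimal sketch. It assumes each image carries a group label (for example a bridge section or a run of adjacent UAV frames); the function name `grouped_split`, the group labels, and the greedy assignment rule are hypothetical and are not taken from the authors' pipeline, which relied on manual checking.

```python
import random
from collections import defaultdict


def grouped_split(items, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (image_id, group_id) pairs into train/val/test subsets.

    All images that share a group_id end up in the same subset, so
    near-duplicate frames cannot leak across subsets.
    """
    groups = defaultdict(list)
    for image_id, group_id in items:
        groups[group_id].append(image_id)

    # Shuffle groups (not images) so that assignment is random but reproducible.
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    targets = [r * len(items) for r in ratios]
    subsets = ([], [], [])
    for gid in group_ids:
        # Assumption: greedily place the group where the shortfall is largest.
        deficits = [targets[k] - len(subsets[k]) for k in range(3)]
        k = deficits.index(max(deficits))
        subsets[k].extend(groups[gid])
    return subsets


if __name__ == "__main__":
    # Hypothetical example: 400 images in 100 groups of 4 adjacent frames.
    demo = [(f"img{i}", i // 4) for i in range(400)]
    train, val, test = grouped_split(demo)
    print(len(train), len(val), len(test))
```

Because whole groups are moved, the realized ratio only approximates 7:2:1; the deviation is bounded by the size of the largest group.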
To ensure reproducibility and maintain robustness across different UAV-acquired bridge images, all raw images were preprocessed before training CFM-Net. High-resolution images were first divided into overlapping tiles of 512 × 512 pixels with a 50% overlap, resulting in a stride of 256 pixels, which not only facilitates efficient computation but also preserves narrow cracks that may appear near tile boundaries. Tiles with inconsistent resolution were resized using bilinear interpolation to match the network input size, and pixel intensities were normalized to the [0, 1] range on a per-channel basis to standardize the input data and improve training stability. To further reduce irrelevant background information, such as sky and water regions, a lightweight CNN classifier was applied to filter out tiles that did not contain bridge structures, ensuring that subsequent segmentation focused exclusively on relevant areas. Data augmentation was then performed to enhance model generalization, including random rotations within ±30°, horizontal and vertical flipping, scaling in the range of 0.8–1.2×, and adjustments to brightness, contrast, and saturation, simulating variations in UAV acquisition conditions. A subset of tiles was manually annotated using Labelme, with black backgrounds and white crack masks, carefully verified to preserve edge accuracy and structural continuity. During inference, overlapping tiles were merged back into full images using pixel-wise maximum aggregation, which maintains crack continuity across tile boundaries and minimizes fragmented predictions.
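As a rough illustration of the tiling and merging steps, the sketch below computes tile offsets for 512 × 512 tiles with a 256-pixel stride and merges per-tile probability maps with a pixel-wise maximum. The helper names `tile_origins` and `merge_max` are hypothetical, and shifting the last tile back to the image border is an assumption, since the text does not state how borders were handled.

```python
def tile_origins(length, tile=512, stride=256):
    """Start offsets of tiles along one image axis.

    Assumption: if the regular grid does not reach the border, one extra
    tile is shifted back so that it ends exactly at the border.
    """
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts


def merge_max(height, width, tiles):
    """Merge (top, left, prob_tile) predictions by pixel-wise maximum.

    prob_tile is a list of rows of crack probabilities; pixels covered by
    several overlapping tiles keep the largest predicted probability.
    """
    full = [[0.0] * width for _ in range(height)]
    for top, left, prob in tiles:
        for dy, row in enumerate(prob):
            out = full[top + dy]
            for dx, p in enumerate(row):
                if p > out[left + dx]:
                    out[left + dx] = p
    return full


if __name__ == "__main__":
    print(tile_origins(1024))  # offsets along a 1024-pixel axis
    merged = merge_max(1, 3, [(0, 0, [[0.2, 0.9]]), (0, 1, [[0.5, 0.1]])])
    print(merged)
```

Plain lists keep the sketch dependency-free; with arrays, the same merge is usually written with an element-wise maximum over the tile's target region.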
To provide a clear and reproducible summary of the preprocessing and augmentation steps, the complete image-processing pipeline is summarized in
Table 2.
4.2. Experimental Environment
All models were trained on a cloud server with 18 vCPUs of an AMD EPYC 9754 128-Core Processor and an NVIDIA GeForce RTX 4090D GPU (24 GB of video memory). The software environment consisted of Ubuntu 22.04, Python 3.12, PyTorch 2.5.1, and CUDA 12.4. The Adam optimizer was used to update the model weights. The number of training epochs was set to 100 and the batch size to 8; early stopping was enabled, so that training would halt automatically and the best-performing model would be saved if overfitting occurred. The initial learning rate was 0.0001, and a cosine annealing learning rate decay strategy was adopted with a minimum learning rate of 1 × 10⁻⁶.
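For reference, the learning rate schedule can be sketched in closed form. The snippet assumes a single cosine cycle over all 100 epochs with per-epoch updates, which matches the usual behavior of PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=100` and `eta_min=1e-6`; the exact scheduler configuration used by the authors is not stated, so these settings are assumptions.

```python
import math


def cosine_annealing_lr(epoch, total_epochs=100, lr_max=1e-4, lr_min=1e-6):
    """Closed-form cosine annealing without restarts.

    The rate starts at lr_max (epoch 0) and decays to lr_min at total_epochs.
    """
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


if __name__ == "__main__":
    for epoch in (0, 25, 50, 75, 100):
        print(epoch, f"{cosine_annealing_lr(epoch):.3e}")
```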
4.3. Evaluation Metrics
Evaluation metrics quantify the detection performance of the model. The metrics adopted in this study are Mean Intersection over Union (mIoU), F1-score, Recall, and Precision. In crack segmentation, mIoU measures how well the predicted crack regions overlap with the ground-truth labels. Its calculation formula is as follows:
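In its standard form, assumed here to match the authors' definition and the symbols used in this section, mIoU is written as:

```latex
\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}
```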
where N is the total number of classes, TP_i (True Positive) is the number of pixels correctly predicted as class i, FP_i (False Positive) is the number of pixels incorrectly predicted as class i, and FN_i (False Negative) is the number of pixels of class i that were not predicted by the model. In essence, mIoU measures the overlap between the model’s predicted regions and the ground-truth regions, and it is one of the most commonly used metrics in image segmentation.
Precision measures the proportion of pixels predicted as a certain class that are actually of that class. It focuses on the ratio of pixels that are truly cracks among all detected pixels. A higher value indicates fewer misjudgments (treating non-crack regions as cracks) during detection. Its calculation formula is as follows:
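The standard definition, assumed to correspond to the authors' formula with TP and FP counted for the crack class, is:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}
```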
Recall measures the proportion of actual pixels of a certain class that are correctly identified by the model. It focuses on how many of all truly existing crack pixels are successfully detected. A higher value indicates fewer missed detections (failing to detect actually existing cracks). Its calculation formula is as follows:
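The standard definition, again assumed to correspond to the authors' formula with TP and FN counted for the crack class, is:

```latex
\mathrm{Recall} = \frac{TP}{TP + FN}
```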
The F1-score comprehensively balances Precision and Recall, and is suitable for task scenarios with class imbalance. Its calculation formula is as follows:
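In its standard form, F1 = 2 · Precision · Recall / (Precision + Recall). The sketch below computes all four metrics from flattened label masks. It assumes a binary setting (0 = background, 1 = crack), pixel counts pooled over the evaluated images, and Precision, Recall, and F1 reported for the crack class; the text does not specify these aggregation choices, so they are assumptions, and the function names are hypothetical.

```python
def confusion(pred, target, cls):
    """Count TP, FP, FN for one class over two equal-length label sequences."""
    tp = fp = fn = 0
    for p, t in zip(pred, target):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    return tp, fp, fn


def segmentation_metrics(pred, target, classes=(0, 1), crack=1):
    """Return mIoU over `classes` plus Precision, Recall, F1 for the crack class."""
    ious = []
    for c in classes:
        tp, fp, fn = confusion(pred, target, c)
        denom = tp + fp + fn
        if denom:  # skip classes absent from both prediction and label
            ious.append(tp / denom)

    tp, fp, fn = confusion(pred, target, crack)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    miou = sum(ious) / len(ious) if ious else 0.0
    return {"mIoU": miou, "Precision": precision, "Recall": recall, "F1": f1}


if __name__ == "__main__":
    pred = [1, 1, 0, 0, 1, 0]
    target = [1, 0, 0, 0, 1, 1]
    print(segmentation_metrics(pred, target))
```

Per-image averaging, as opposed to pooling pixels over the whole test set, can give slightly different values, which is one reason a reported F1 need not equal the harmonic mean of the reported Precision and Recall.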
4.4. Comparative Experiments
To verify the effectiveness of the proposed method, eight state-of-the-art image segmentation algorithms were selected for comparative experiments: UNet++, PSPNet, U2-Net, DeepLabv3+, SegFormer, HRNet, DeepCrack, and CrackFormer. All algorithms were trained with the same training parameters to ensure a consistent comparison. To validate the applicability and generalization of CFM-Net, comparative experiments were conducted on both the Mix Bridge Crack dataset and the mixed public dataset.
4.4.1. Experimental Result Analysis on the Mix Bridge Crack Dataset
After 100 training epochs, the losses of the nine networks (including CFM-Net) on the Mix Bridge Crack training set had converged.
Figure 11 and
Figure 12 show the loss curves of the nine networks on the training set and validation set, respectively. It is evident that after 100 epochs, the training set losses of CrackFormer and CFM-Net were significantly lower than those of the other seven networks, with CFM-Net showing a marginally lower loss than CrackFormer.
The evaluation results are presented in Table 3. Although the Precision of the proposed method (86.77%) was slightly lower than the highest value, its mIoU, Recall, and F1-score reached 80.05%, 87.31%, and 87.06%, respectively, ranking first among all compared methods. Relative to the next-best result on each metric, the proposed method improved mIoU by 2.30 percentage points (over DeepCrack, 77.75%), Recall by 1.1 percentage points, and F1-score by 1.03 percentage points (over CrackFormer, 86.03%). These results indicate that the proposed method achieves competitive and consistent performance improvements under the evaluated experimental settings. More importantly, the observed gains are consistent with the intended architectural design, in which background interference suppression, adaptive semantic-detail fusion, and morphology-aware multi-scale refinement jointly contribute to improved crack segmentation.
To more intuitively demonstrate the performance advantages of the improved CFM-Net in crack segmentation tasks, three bridge crack images were selected from the self-built dataset to compare the segmentation visualization results of different models. The outcomes are shown in
Figure 13.
While several comparative algorithms could effectively segment wide cracks with simple textures, they performed poorly on narrow cracks in complex environments, leading to missed and false detections. In the first group of images, the cracks had complex textures and included narrow segments, but the proposed algorithm captured these details with almost no omissions. It presented all cracks with complex textures and did not misclassify other textures as cracks. In the second group of sample images, the cracks had distinct features but blurred edges, making them difficult to distinguish. All comparative algorithms showed varying degrees of missed detection on this group: some marked only a small part of the cracks, while others failed to mark them at all. In contrast, the proposed algorithm marked a more complete area; despite the blurred crack edges, it extracted the edge and structural information of the cracks, allowing a clearer understanding of the crack morphology. The third group of sample images contained numerous non-crack interference elements that were visually similar to cracks, posing significant challenges for recognition. All comparative algorithms exhibited varying degrees of false detection on this group (misclassifying non-crack elements as cracks), whereas the proposed algorithm produced significantly fewer false detections. It segmented the actual cracks accurately and largely avoided the interference elements, demonstrating more stable performance in complex environments.
To evaluate the stability and reproducibility of CFM-Net, we repeated the experiment with five different random seeds. Each seed corresponds to a complete training and evaluation cycle with the same hyperparameters, data splits, and augmentation settings. Table 4 reports the mean and standard deviation of each metric across seeds; these statistics indicate that CFM-Net’s performance is stable under different initializations and support the reliability and reproducibility of the reported segmentation results.
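The per-seed aggregation can be sketched as follows. The function name `summarize_runs` is hypothetical, the numbers in the demo are placeholders rather than values from Table 4, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Aggregate per-seed results.

    runs: list of dicts mapping metric name -> value, one dict per seed.
    Returns a dict mapping metric name -> (mean, sample standard deviation).
    """
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        summary[metric] = (mean(values), stdev(values))
    return summary


if __name__ == "__main__":
    # Placeholder values for illustration only; not results from the paper.
    demo_runs = [{"mIoU": 80.0}, {"mIoU": 80.2}, {"mIoU": 79.8}]
    for metric, (mu, sigma) in summarize_runs(demo_runs).items():
        print(f"{metric}: {mu:.2f} ± {sigma:.2f}")
```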
4.4.2. Experimental Result Analysis on the Mixed Public Dataset
After 100 training epochs, the losses of the nine networks on the mixed public training set had converged.
Figure 14 and
Figure 15 show the loss curves of the nine networks on the training set and validation set, respectively. Similarly, after 100 epochs, the training set loss of CFM-Net was significantly lower than that of the other networks. Moreover, in terms of validation set loss, CFM-Net was markedly superior to the other eight algorithms.
The evaluation results are presented in Table 5. The proposed algorithm maintained strong performance on the mixed public dataset and outperformed all other compared algorithms, achieving 80.62% in mIoU, 88.23% in Precision, 88.21% in Recall, and 87.65% in F1-score. This result supports the generalization of the algorithm across different crack types and image sources in crack segmentation tasks.
To further intuitively demonstrate its segmentation accuracy, three sample images were randomly selected from the mixed public dataset, processed using the aforementioned nine algorithms, and segmentation visualization results were generated. The specific outcomes are shown in
Figure 16. It can be clearly observed from the figure that the detection performance of CFM-Net on the mixed public dataset was significantly superior to that of other comparative methods. The segmentation results of CFM-Net were clearer with less background noise, and it could not only accurately recognize narrow crack segments but also capture more key detail information (e.g., discontinuous crack connections and subtle edge contours), which fully verified its strong adaptability to diverse crack morphologies in public datasets.
Meanwhile,
Figure 17 compares CFM-Net with other state-of-the-art models across various crack datasets, presenting the validated IoU, Dice score, and Recall. These results confirm that CFM-Net effectively addresses the limitations of existing models: it can reliably detect tiny cracks (width < 0.5 mm) and complex cracks (e.g., mesh cracks with irregular contours) across different datasets, without performance degradation caused by dataset differences. This further reinforces its practical application potential in large-scale bridge crack detection and infrastructure health monitoring.
4.4.3. Model Complexity and Inference Analysis
In addition to segmentation accuracy, it is important to evaluate the computational efficiency and practical deployment feasibility of CFM-Net. This analysis provides insight into the model’s real-world applicability, particularly for UAV-based bridge inspection tasks where inference speed and resource requirements are critical.
Table 6 summarizes the parameter count, FLOPs, and inference time per 512 × 512 image for CFM-Net, the baseline U-Net, and two representative competitors, DeepCrack and CrackFormer.
All inference times in Table 6 were measured under the same hardware and software environment, using 512 × 512 input images and the same evaluation settings. Although CFM-Net contains more parameters than the baseline U-Net, its FLOPs are lower in our implementation. Measured inference time is therefore not determined by parameter count alone; it also depends on FLOPs, memory access, operator implementation, and GPU parallelization efficiency. The lightweight attention and fusion modules in CFM-Net execute efficiently on the GPU, giving inference efficiency comparable to the baseline U-Net, with slightly lower measured latency under the tested setting. Importantly, the inference time per image (~17 ms) remains competitive: it is faster than the Transformer-based CrackFormer and on par with U-Net.
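A latency measurement of this kind is often implemented with warm-up iterations followed by repeated timed calls. The sketch below is a generic harness under that assumption; the authors' exact timing protocol is not described, and `measure_latency_ms`, `infer`, and `sync` are hypothetical names. For GPU inference in PyTorch, `torch.cuda.synchronize` would typically be passed as `sync` so that asynchronous kernels finish before the clock is read.

```python
import time
from statistics import median


def measure_latency_ms(infer, sample, warmup=10, repeats=50, sync=None):
    """Median wall-clock latency of infer(sample) in milliseconds.

    infer:  callable running one forward pass (hypothetical placeholder).
    sync:   optional callable that blocks until device work has finished.
    """
    for _ in range(warmup):  # warm-up runs are excluded from the statistics
        infer(sample)
    if sync is not None:
        sync()

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        infer(sample)
        if sync is not None:
            sync()
        timings.append((time.perf_counter() - start) * 1000.0)
    return median(timings)


if __name__ == "__main__":
    # Stand-in workload; replace with a real model call on a 512 x 512 tile.
    print(measure_latency_ms(lambda x: sum(x), list(range(10_000))))
```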
This combination represents a favorable trade-off between segmentation accuracy and computational efficiency. The model complexity suggests that deployment on UAV onboard computing systems and typical edge devices is feasible for real-time or near-real-time processing, although the latencies reported here were measured on an RTX 4090D GPU rather than on embedded hardware.
Furthermore, compared with lightweight models such as DeepCrack, CFM-Net achieves significantly higher segmentation accuracy while maintaining manageable computational cost, demonstrating that the added modules provide practical benefits rather than unnecessary overhead. This analysis confirms that CFM-Net not only advances segmentation performance but is also feasible for real-world deployment in UAV-based bridge inspection tasks.
4.5. Ablation Experiments
To verify the performance improvements contributed by the Convolutional Block Attention Module (CBAM, which combines channel and spatial attention), the Gated Fusion Module for Skip Connections (GFF), and the Morphology-Guided Multi-Scale Interaction Block (MGMSIB) relative to the original U-Net model, ablation experiments were conducted on the Mix Bridge Crack dataset. The experimental design isolates each module to quantify its independent contribution and then combines modules to analyze their interaction.
As shown in
Table 7, the Baseline model (original U-Net without any enhanced modules) achieved the lowest performance across all evaluation metrics, with an mIoU of 77.19%, a Precision of 85.06%, a Recall of 85.75%, and an F1-score of 85.55%. In contrast, CFM-Net (integrating all three modules) exhibited the optimal performance in mIoU, Recall, and F1-score, reaching 80.05%, 87.31%, and 87.06%, respectively.
Model A (Baseline + CBAM): After integrating only the CBAM module, the model’s Precision reached 87.11%—the highest among all ablation models—and other metrics also improved significantly (mIoU increased by 1.44 percentage points, Recall by 0.59 percentage points, F1-score by 0.61 percentage points compared with the Baseline). This improvement is attributed to the CBAM module’s ability to dynamically weight channel and spatial features: it enhances the model’s response to crack regions (e.g., highlighting pixel clusters with crack-like linear features) and effectively suppresses interference from non-target backgrounds (e.g., concrete textures and steel bar shadows), thereby reducing false detections.
Model B (Baseline + GFF): When integrating only the GFF module into the baseline U-Net, the model achieved notable improvements in mIoU and F1-score (mIoU increased by 1.33 percentage points, F1-score by 0.71 percentage points compared with the Baseline). The GFF module, through its reset and select gates, dynamically balances semantic and detail information: it strengthens the high-level semantic features of primary crack regions while preserving fine edge details of narrow cracks. This allows accurate crack positioning and minimizes missed detections, even without the assistance of CBAM for background suppression.
Model C (Baseline + MGMSIB): Incorporating only the MGMSIB module into the baseline U-Net led to a significant enhancement in crack continuity and width estimation, reflected in the increased Recall and F1-score (Recall increased by 1.27 percentage points and F1-score by 0.80 percentage points compared with the Baseline). MGMSIB captures multi-scale contextual features and integrates morphology-guided perception, which effectively addresses the challenge of discontinuous crack predictions in narrow and branching regions. By focusing on multi-scale feature interactions, MGMSIB complements the baseline segmentation, enhancing structural consistency without additional attention weighting.
Model D (Baseline + CBAM + GFF): On the basis of Model A, adding the GFF module further improved the mIoU and F1-score by 0.62 and 0.49 percentage points, respectively. By introducing a reset gate and select gate for dynamic weight adjustment, the GFF module solves the semantic-detail imbalance problem in traditional U-Net’s fixed feature fusion: it strengthens high-level semantic information in main crack regions (to ensure accurate crack positioning) and preserves shallow edge details in narrow crack segments (to avoid missed detections). This dynamic fusion method complements the CBAM module’s interference suppression capability, as it can prioritize fusing crack-related features that have been enhanced by CBAM, rather than blindly concatenating all shallow and deep features.
Model E (Baseline + GFF + MGMSIB): Combining both GFF and MGMSIB modules (without CBAM) further improved overall segmentation performance, achieving higher mIoU and F1-score than either Model B or Model C individually. This synergistic improvement arises from the complementary capabilities of the two modules: GFF ensures accurate crack localization and preserves fine details, whereas MGMSIB enhances multi-scale feature perception and continuity. Together, they improve both the precision and structural integrity of predicted crack masks, demonstrating the benefit of coordinated module integration.
Model F (Full CFM-Net): After adding the MGMSIB module to Model D, mIoU, Recall, and F1-score reached their best values, while Precision (86.77%) remained slightly below that of Model A (87.11%). Compared with the Baseline model, the mIoU increased by 2.86 percentage points and the F1-score by 1.51 percentage points; compared with Model D, the mIoU and Recall further improved by 0.80 and 0.42 percentage points, respectively. The performance gain mainly stems from MGMSIB’s multi-scale parsing capability: 3 × 3 convolutions capture local details of micro-cracks, 1 × 3/3 × 1 rectangular convolutions strengthen the modeling of horizontal/vertical linear crack structures, and 5 × 5/7 × 7 convolutions extract global trend information of wide cracks. This multi-scale feature integration works together with CBAM (interference suppression) and GFF (adaptive fusion): CBAM filters out background noise to ensure the purity of input features for MGMSIB, while GFF provides balanced semantic-detail features to lay a foundation for multi-scale refinement, ultimately enabling the model to accurately identify cracks of various sizes and morphologies.
From the perspective of comprehensive metrics (F1-score and mIoU), the gains of the three modules are not simply additive: the single-module F1-score gains (0.61, 0.71, and 0.80 percentage points) sum to more than the 1.51-point gain of the full model. Instead, the modules play complementary roles: CBAM improves feature quality by suppressing interference, GFF optimizes feature utilization through adaptive fusion, and MGMSIB enhances feature coverage via multi-scale perception. This collaborative relationship shows that each core module contributes to CFM-Net’s overall architectural design and performance under the current experimental setting.
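The percentage-point gains quoted in the ablation discussion can be checked directly from the reported metric values. The short sketch below uses only values stated in the text (Baseline and CFM-Net mIoU and F1-score, and the mIoU of Models A and D); the dictionary `REPORTED` and helper `gain_pp` are hypothetical names introduced for illustration.

```python
# Metric values (in %) as stated in the text; other entries of Table 7 are omitted.
REPORTED = {
    "Baseline": {"mIoU": 77.19, "F1": 85.55},
    "Model A": {"mIoU": 78.63},   # Baseline + CBAM
    "Model D": {"mIoU": 79.25},   # Baseline + CBAM + GFF
    "CFM-Net": {"mIoU": 80.05, "F1": 87.06},
}


def gain_pp(model, reference, metric, table=REPORTED):
    """Absolute gain in percentage points, rounded to two decimals."""
    return round(table[model][metric] - table[reference][metric], 2)


if __name__ == "__main__":
    comparisons = [
        ("Model A", "Baseline", "mIoU"),
        ("Model D", "Model A", "mIoU"),
        ("CFM-Net", "Model D", "mIoU"),
        ("CFM-Net", "Baseline", "mIoU"),
        ("CFM-Net", "Baseline", "F1"),
    ]
    for model, reference, metric in comparisons:
        print(f"{model} vs {reference} ({metric}): {gain_pp(model, reference, metric):+.2f} pp")
```

Reporting differences in percentage points rather than relative percent keeps the ablation rows directly comparable.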
5. Conclusions
This study addresses three key challenges in bridge crack segmentation for UAV inspection scenarios: false detections caused by complex background interference, missed detections of narrow cracks due to imbalanced shallow–deep feature fusion, and insufficient modeling of multi-scale crack structural continuity. To address these intertwined issues, we proposed CFM-Net, a crack segmentation network with an optimized U-Net backbone that integrates three core modules (CBAM, GFF, and MGMSIB). Comprehensive experiments were conducted on two datasets: the self-constructed Mix Bridge Crack dataset and a mixed public dataset (composed of 8000 images from 11 public sources). The proposed method was compared with eight mainstream segmentation models (UNet++, PSPNet, DeepCrack, etc.), and the effectiveness of each core module was verified via ablation experiments. The key findings of this study are summarized as follows:
(1) CFM-Net achieves superior overall segmentation performance on the Mix Bridge Crack dataset compared with mainstream models. On this dataset, CFM-Net reaches an mIoU of 80.05% and an F1-score of 87.06%, which is 2.30 percentage points higher in mIoU and 1.21 percentage points higher in F1-score than DeepCrack (77.75% mIoU, 85.85% F1-score), and 2.42 percentage points higher in mIoU and 1.03 percentage points higher in F1-score than CrackFormer (77.63% mIoU, 86.03% F1-score). Its Recall of 87.31% reflects fewer missed detections of narrow cracks (width < 0.5 mm), while its Precision of 86.77% indicates that false detections caused by concrete textures or shadow artifacts remain limited; both are key pain points in practical UAV inspection.
(2) The three core modules of CFM-Net exhibit significant synergistic effects in enhancing segmentation capability. Ablation experiments on the Mix Bridge Crack dataset show that the Baseline (original U-Net) achieves an mIoU of only 77.19% and an F1-score of 85.55%. Adding CBAM alone raises the mIoU to 78.63% (with a Precision of 87.11%, the highest for single-module enhancement) by suppressing background interference; further integrating GFF improves the mIoU to 79.25% through adaptive feature fusion, which mitigates the problem of blurred narrow crack edges. When all three modules are integrated, CFM-Net’s mIoU and F1-score reach 80.05% and 87.06%, which are 2.86 and 1.51 percentage points higher than the Baseline, supporting the complementarity and collaborative value of the modules.
(3) CFM-Net generalizes robustly to the mixed public dataset. On this dataset, which comprises 8000 images from 11 public crack datasets (covering diverse scenarios such as road cracks, tunnel cracks, and masonry cracks), CFM-Net maintains leading performance: 80.62% mIoU, 88.23% Precision, 88.21% Recall, and 87.65% F1-score. These values are 1.86 percentage points higher in mIoU and 0.54 percentage points higher in F1-score than CrackFormer (78.76% mIoU, 87.11% F1-score), suggesting that CFM-Net does not overfit to a single dataset and can adapt to diverse image sources and crack morphologies.
The preliminary UAV-based bridge crack detection framework outlined in Section 2 has laid a solid foundation for end-to-end inspection, covering the entire workflow of high-resolution image tiling, background filtering, crack segmentation, and post-processing quantification. However, its practical performance is constrained by the limitations of the Baseline U-Net: complex background interference (e.g., concrete textures and steel bar shadows) is prone to causing false detections; rigid feature fusion leads to frequent missed detections of narrow cracks; and the modeling of multi-scale crack continuity, from 0.5 mm micro-cracks to 5 mm wide cracks, is insufficient. These limitations not only reduce the framework’s segmentation accuracy but also undermine the reliability of downstream quantitative analysis (e.g., crack length/width measurement) and maintenance guidance.
CFM-Net breaks through this core bottleneck by optimizing the framework’s segmentation engine, and its impact extends beyond the segmentation stage to drive subtle yet critical improvements across the entire workflow. For instance, the background interference suppression capability of the CBAM module means the VGG16-based background filter in the preliminary framework no longer needs to be overly stringent—even if 10% more residual background blocks are retained (to reduce the false rejection of edge bridge regions), CFM-Net can still effectively filter out interference. This alleviates the data annotation burden of the background classifier, as fewer manual corrections are required for misclassified bridge regions. Meanwhile, the adaptive fusion of shallow-edge and deep-semantic features by the GFF module addresses the narrow crack missed detection issue of the Baseline U-Net, ensuring that critical small cracks (which are often early warning signs of structural degradation) are not overlooked in practical inspections.
Future research will expand CFM-Net’s practical value in three key directions: First, explore lightweight optimization strategies (e.g., depthwise separable convolutions, parameter pruning, and knowledge distillation) to reduce the model’s computational complexity and parameter count by over 50%, enabling real-time inference on UAV-mounted edge devices (e.g., embedded GPUs with limited computing resources). Second, enhance cross-environment generalization by collecting additional data from different bridge types (e.g., arch bridges, beam bridges) and extreme environmental conditions (e.g., rainy weather, strong light, and fog), and applying domain adaptation techniques to mitigate performance degradation caused by environmental variations. Third, integrate CFM-Net with structural health monitoring systems: combine CFM-Net’s crack detection results with structural mechanics models to calculate crack expansion rates and residual service life, achieving seamless connection from automatic crack detection to intelligent health assessment—providing stronger technical support for the digital transformation of bridge infrastructure.