Article

SAM-Guided Concrete Bridge Damage Segmentation with Mamba–ResNet Hierarchical Fusion Network

1 School of Traffic and Transportation, Chongqing Jiaotong University, Chongqing 400060, China
2 School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400060, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(8), 1497; https://doi.org/10.3390/electronics14081497
Submission received: 12 March 2025 / Revised: 5 April 2025 / Accepted: 6 April 2025 / Published: 8 April 2025

Abstract:
Automated damage segmentation for concrete bridges is a fundamental task in infrastructure maintenance, yet existing systems often depend heavily on large annotated datasets, which are costly and time-consuming to produce. This paper presents an innovative framework for concrete bridge damage segmentation, leveraging the Segment Anything Model (SAM) to reduce the reliance on extensive annotated data while enhancing segmentation accuracy and efficiency. Firstly, a SAM-guided mask generation network is introduced, which utilizes the SAM’s segmentation capabilities to generate supplementary supervision labels for damage segmentation. Then, a novel point-prompting strategy, incorporating saliency information, is proposed to refine SAM’s prompts, ensuring accurate mask generation for complex damage patterns. Next, a trainable semantic segmentation network is designed, integrating MambaVision and ResNet as dual backbones to capture multi-level features from concrete bridge damages. To fuse these features effectively, a Hierarchical Attention Fusion (HAF) mechanism is introduced. Finally, a Polarized Self-Attention (PSA) decoder is employed to improve segmentation precision. In experiments on a dataset of 10,000 concrete bridge images with box-level annotations, the proposed method achieved state-of-the-art performance, with an MIoU of 60.13%, a PA of 74.02%, and an MDice of 75.40%, outperforming existing segmentation models. In summary, this study improves the accuracy of concrete bridge damage segmentation through a set of complementary methods and strategies, opening up new directions for concrete bridge damage segmentation.

1. Introduction

With the growing problem of infrastructure aging, particularly in concrete bridges, the detection and assessment of bridge damages have become crucial for ensuring traffic safety [1]. Concrete bridges are subject to various forms of deterioration such as cracks and spall, which directly affect their structural integrity and service life [2]. Therefore, timely and accurate damage detection is essential to prevent catastrophic failures and to extend the lifespan of these vital infrastructures [3,4]. Traditional methods for damage detection predominantly rely on manual inspection or hand-labeling of damaged areas [5,6]. While these approaches are still in use, they suffer from several significant limitations, including low efficiency, high labor costs, and dependence on the experience and skill of the inspectors. Moreover, manual detection is prone to human error, and the results often exhibit inconsistencies, making the process unreliable for large-scale assessments [7]. As a result, there is a pressing need for automated damage detection systems that can provide both efficiency and accuracy [8].
Deep learning techniques have shown great promise in overcoming these challenges. However, training these models requires large quantities of high-quality labeled data, which remains a significant bottleneck. The process of labeling damages is time-consuming and requires expert knowledge, especially due to the diversity of damage types and the intricacies of different concrete surfaces. This results in a scarcity of labeled data and a slow labeling process, ultimately hindering the widespread application of automated damage detection systems [9]. Consequently, improving data annotation methods and efficiency remains one of the primary challenges for advancing automated concrete damage detection technologies.
While significant progress has been made in the application of deep learning techniques, such as convolutional neural networks (CNNs) [10], for concrete damage segmentation, these methods still face several challenges, particularly concerning the dependence on large amounts of labeled data. These techniques have shown remarkable success in accurately detecting and segmenting damages, such as cracks and delamination, on concrete surfaces, making them highly valuable for automated inspection systems [11]. However, the performance of these models is still highly dependent on the availability of substantial amounts of high-quality annotated data. The major issue lies in the data annotation process itself. To train deep learning models effectively, large annotated datasets are essential. For concrete bridge damage detection, this means the need for meticulous and time-consuming labeling, often requiring expert knowledge to accurately identify and mark damages. Consequently, obtaining a sufficient volume of high-quality labeled data remains a significant obstacle to the widespread application of these techniques [12]. Even when a considerable amount of annotated data is available, the process of labeling is still slow and inefficient, especially for large-scale bridge inspection tasks. The complexity and time required for accurate labeling significantly hinder the scaling of data collection efforts, which in turn limits the expansion of datasets and the continuous optimization of deep learning models. This persistent bottleneck in annotation efficiency continues to impede the progress of automated concrete damage detection systems, highlighting the urgent need for alternative methods that can improve both data annotation efficiency and quality [13].
In order to address the challenges of limited labeled data and low annotation efficiency, the Segment Anything Model (SAM) [14] has emerged as a powerful image segmentation technique. The SAM leverages self-supervised learning and pre-trained models to perform efficient and accurate damage segmentation with minimal labeled data, providing a promising solution to the data bottleneck in automated concrete damage detection. The SAM algorithm operates by utilizing a guided segmentation approach that significantly reduces the dependency on large amounts of manually annotated data [15]. Unlike traditional deep learning methods that require substantial datasets for training, the SAM can generate high-quality segmentation results with only a small amount of annotated data. This approach dramatically cuts down the cost and time required for data annotation, making it a more scalable option for real-world applications [16].
In this paper, we propose a novel training framework for concrete bridge damage segmentation that leverages the Segment Anything Model (SAM) to reduce the reliance on extensive annotated data while improving segmentation efficiency and accuracy. Specifically, we introduce a SAM-guided mask generation network, which harnesses the segmentation mask generation capability of SAM. These masks are subsequently utilized as supplementary supervision labels to train a damage segmentation network. To ensure that SAM generates reliable and precise masks, we propose an Optimizing Prompting Strategy with saliency information. This strategy enhances the quality of the prompts provided to the SAM, thereby improving the accuracy of the generated masks. Following this, we present a trainable semantic segmentation network that employs a multi-level feature extraction architecture. This architecture utilizes distinct backbone networks to extract semantic features of varying levels from concrete bridge damages. To effectively fuse the features obtained at different levels, we propose a Hierarchical Attention Fusion (HAF) mechanism. HAF enables the integration of multi-level features to enhance the overall segmentation performance. Additionally, we utilize a Polarized Self-Attention (PSA) decoder as the feature decoder, which predicts precise segmentation results for concrete bridge damages. The contributions of this study can be summarized as follows:
1. We propose a SAM-guided mask generation network that leverages the segmentation mask generation capabilities of SAM to provide supplementary supervision labels for training concrete bridge damage segmentation networks. This innovation significantly reduces reliance on large annotated datasets, addressing a critical limitation in existing automated damage detection systems.
2. A novel point-prompting strategy is introduced, which employs saliency information to refine the prompts provided to SAM. This ensures the generation of reliable and accurate segmentation masks, even for complex damage patterns, thereby enhancing the overall segmentation quality.
3. We design a trainable semantic segmentation network featuring a multi-level feature extraction framework. By integrating MambaVision and ResNet as dual backbone networks, the system captures hierarchical semantic features from concrete bridge damages, simultaneously extracting high-level global semantic features and robust local feature representations.
4. To effectively integrate multi-level features and enhance segmentation performance, we introduce a Hierarchical Attention Fusion (HAF) mechanism. This includes a Boosting Textures Module (BTM) designed to strengthen the representation of discriminative texture features. Combined with the Polarized Self-Attention (PSA) decoder, our approach achieves high-precision damage segmentation results, demonstrating its robustness and applicability in practical engineering scenarios.
The rest of the paper is structured as follows: Section 2 presents related work focusing on semantic segmentation algorithms with applications to concrete bridge damage segmentation and the application of the Segment Anything Model. Section 3 describes the SAM-guided concrete damage segmentation mask generation and the Mamba–ResNet hierarchical fusion network. Section 4 details the experimental procedure and the experimental results of the proposed method. Section 5 provides a discussion of our study. Finally, Section 6 summarizes the conclusions of this paper.

2. Related Works

2.1. Semantic Segmentation Algorithm with Application to Concrete Bridge Damage Segmentation

Concrete bridge surface damage recognition typically relies on classification and object detection methods. These approaches classify regions in images into predefined categories or detect specific damages within candidate boxes. Such methods have been widely applied in the rapid detection and localization of concrete structures. In contrast, semantic segmentation plays a crucial role in concrete bridge surface damage recognition because it allows for pixel-level classification of the damages. This fine-grained approach is vital for obtaining quantitative parameters (such as the exact area of damage on the bridge surface), which classification or object detection methods cannot directly provide [17,18].
While concrete bridge surface damage classification methods classify image patches, and object detection methods focus on classifying and regressing candidate boxes of various sizes within the image, pixel-level semantic segmentation requires classifying every pixel in the image. Rubio et al. [19] proposed a pixel-level recognition method for concrete bridge stratification and reinforcement based on Fully Convolutional Networks [20], using a pre-trained VGG network as the feature extractor. This method achieved average accuracies of 89.7% for stratification and 78.4% for exposed reinforcement. Shi et al. [21] introduced a VGG-Unet-based semantic segmentation method for bridge reinforcement corrosion and rubber bearing cracks, where the Background Data Drop Rate (BDDR) was used to reduce the number of background pixels and control the proportion of damaged pixels in the dataset. Deng et al. [22] incorporated the atrous spatial pyramid pooling (ASPP) [23] module into LinkNet and used pre-trained ResNet34 as the feature extractor. The model also employed a weighted–balanced joint intersection over union (IoU) loss function to achieve precise segmentation on highly imbalanced small datasets. Their experimental results showed that the method achieved a mean IoU (mIoU) of 61.95%, outperforming other models. Narazaki et al. [24] developed a vision-based automatic structural state assessment system that can identify and locate concrete bridge damages. To address the issue of limited image data, they used a synthetic environment to randomly generate target structures and damage scenarios. The model achieved an IoU of 87.9% on synthetic images and 73.9% on real images. Zou et al. [25] adopted the idea of SegNet [26] and chose a labeled pooling layer during the down-sampling stage to avoid information loss during up-sampling. Yang et al. [27] combined the feature pyramid network (FPN) [28] architecture with semantic segmentation networks to detect multi-scale cracks, achieving good results on public datasets. Choi and Cha [29] proposed a novel deep learning structure called the Semantic Damage Detection Network (SDDNet) for real-time segmentation of structural surface cracks, and created a new crack dataset called Crack200. The authors of [30] separated the crack regions from point cloud images captured by a depth camera and directly quantified the crack volume. Zhang et al. [31] modified the Mask R-CNN [32] network to address issues with the mask branch’s inability to accurately predict crack details. Additionally, Li et al. [33] proposed a pixel-based adaptive weighted cross-entropy (WCE) loss function combined with Jaccard distance to analyze crack images, which facilitated high-quality pixel-level road crack detection. However, convolutional neural networks (CNNs) lack a global understanding of images and cannot establish dependencies between features [34]. Vaswani et al. [35] introduced the Vision Transformer (ViT), applying self-attention to image segmentation tasks for the first time. Liu et al. [36] proposed the Swin Transformer, based on sliding windows and hierarchical structures, achieving better results in semantic segmentation tasks than most CNNs. As Transformer-based networks have gained widespread application, more and more researchers are utilizing Transformers for damage segmentation tasks, particularly for concrete crack segmentation. Xu et al. [37] combined Transformers with skip connection strategies for high-resolution road crack recognition. Shamsabadi et al. [38] proposed a method based on ViT, while Chen et al. [39] pointed out that compared to CNNs, Transformer-based image recognition networks suffer from data dependency and the loss of local features due to the lack of inductive biases, such as positional encoding, translation invariance, and hierarchical structure.

2.2. Application of Segment Anything Model

The Segment Anything Model (SAM), introduced by Meta AI, is a versatile segmentation framework designed for universal application. Leveraging prompt-based learning and pre-trained on over a billion masks, SAM achieves high generalization across tasks without task-specific fine-tuning [14]. Its Vision Transformer backbone ensures robust performance, making it a cornerstone for interactive and zero-shot segmentation.
Zhou et al. [40] leveraged the Segment Anything Model (SAM), known for its rich prior knowledge and strong generalization ability, by integrating a lightweight and learnable crack-adaptive layer with a sparse prompt generation method based on high-frequency components of road images. This framework enabled efficient road crack segmentation. However, in scenarios with subtle crack patterns, both this framework and the Mask2Former model struggled to achieve accurate segmentation. Additionally, the inference speed of the framework was slightly slower compared to the CNN-based PIDNet model. Wang et al. [41] proposed a SAM-based Image Enhancement (SAM-IE) approach, which utilizes the masks and stability scores generated by SAM to enhance the diagnostic performance of medical image classification models. This method does not require extensive modifications or excessive prompts for SAM, making it a practical enhancement strategy. Further, Zhou et al. [42] introduced a novel SAM-based framework named AoP-SAM, specifically designed to improve segmentation accuracy of the pubic symphysis (PS) and fetal head (FH) in intrapartum ultrasound images, demonstrating SAM’s adaptability to specialized medical imaging applications.
Building upon the findings in the related works section, the following key insights can be drawn:
  • Focus on Single-Class Segmentation: Research on the semantic segmentation of concrete bridge damages leveraging deep learning has predominantly concentrated on single-class segmentation. While effective in isolating specific damage types, this approach often overlooks the complexity and diversity of real-world scenarios where multiple damage types co-exist. As a result, the exploration of multi-class segmentation remains underdeveloped, leaving a critical gap in addressing practical applications.
  • Challenges in Multi-Class Segmentation: Multi-class segmentation faces considerable hurdles, with two primary challenges being the scarcity of annotated datasets and the inherent difficulty of the annotation process. The annotation of multi-class datasets demands significant expertise, time, and resources, particularly in scenarios involving subtle or overlapping damage features. These factors have constrained the advancement of robust multi-class segmentation techniques tailored for bridge damage analysis.
  • Limitations of SAM in Bridge Damage Segmentation: While the Segment Anything Model (SAM) represents a promising tool for mask generation in general segmentation tasks, its application to bridge damage imagery is not without limitations. The generic nature of SAM’s mask generation requires adaptation to account for the unique visual characteristics of bridge damages, such as irregular shapes, complex textures, and diverse environmental conditions. Therefore, further refinements are necessary to improve the effectiveness of SAM in this specialized domain.

3. Method

3.1. Overall Architecture

The overall framework of our proposed structure is shown in Figure 1. We propose a SAM-guided mask generation framework for damage segmentation in concrete bridges. Let the input image be $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and channels of the image. The initial bounding box annotation is defined as $Y_B \in \mathbb{R}^4$, where the four values represent the coordinates of the bounding box. We propose a Prompt Optimization Generation module $T_{POG}$ that refines the initial bounding box $Y_B$ into an optimized prompt $Y_B' \in \mathbb{R}^4$. The optimized prompt $Y_B'$ is then used as input to the SAM model $F_{SAM}$, which generates a segmentation mask $Y_M \in \mathbb{R}^{H \times W}$ as the output:
$Y_M = F_{SAM}(X, Y_B')$
where $Y_B' = T_{POG}(Y_B)$. Here, $F_{SAM}$ denotes the segmentation model, which is composed of a frozen encoder–decoder architecture. Next, we employ a trainable encoder–decoder network $F_{ED}$ to process the segmentation mask $Y_M$. The network $F_{ED}$, consisting of encoder $E_{ED}$ and decoder $D_{ED}$, produces the predicted segmentation mask $Y_P \in \mathbb{R}^{H \times W}$ as follows:
$Y_P = D_{ED}\left( E_{ED}(X, Y_M); \theta_{ED} \right)$
where $\theta_{ED}$ represents the parameters of the encoder–decoder network $F_{ED}$, which are learned during training. The objective of the training process is to minimize the segmentation loss $L_{seg}$, which measures the difference between the predicted segmentation $Y_P$ and the ground-truth segmentation mask $Y_M$. Specifically, we aim to minimize the following loss:
$L_{seg} = L(Y_P, Y_M)$
where $L$ is a suitable loss function. During training, the parameters $\theta_{ED}$ of the encoder–decoder network are updated, while the SAM model’s encoder–decoder components are kept frozen. After training, the learned encoder–decoder network $F_{ED}$ is used for inference on new, unseen images $X$, generating segmentation predictions $Y_P$.
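To make this supervision flow concrete, the following is a minimal PyTorch-style sketch of one training step under this scheme. The names `optimize_prompt`, `sam_generate_mask`, and `seg_net` are hypothetical placeholders for the prompt optimization module $T_{POG}$, the frozen SAM $F_{SAM}$, and the trainable encoder–decoder $F_{ED}$; it illustrates the idea rather than the authors' exact implementation.

```python
import torch

def train_step(seg_net, sam_generate_mask, optimize_prompt, optimizer, loss_fn, image, box):
    """One SAM-guided step: the frozen SAM produces the pseudo-label Y_M,
    and only the trainable encoder-decoder is updated against it."""
    with torch.no_grad():                                        # SAM stays frozen
        refined_prompt = optimize_prompt(image, box)             # T_POG: box -> optimized prompt
        pseudo_mask = sam_generate_mask(image, refined_prompt)   # Y_M = F_SAM(X, Y_B')

    pred_mask = seg_net(image)                                   # predicted mask Y_P
    loss = loss_fn(pred_mask, pseudo_mask)                       # L_seg(Y_P, Y_M)

    optimizer.zero_grad()
    loss.backward()                                              # gradients flow only into seg_net
    optimizer.step()
    return loss.item()
```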

3.2. SAM-Guided Concrete Damage Segmentation Mask Generation

3.2.1. Segment Anything Model

As shown in Figure 2, the Segment Anything Model (SAM) [14] segmentation framework represents a state-of-the-art approach to generalizable object segmentation. Its architecture is structured around three key components: a Transformer-based image encoder, a mask decoder, and a prompt embedding module.
1. Transformer Image Encoder: The image encoder in SAM leverages a Vision Transformer (ViT) [43] architecture to encode the input image into a high-dimensional latent representation. Formally, let $I \in \mathbb{R}^{H \times W \times C}$ denote the input image, where $H$, $W$, and $C$ represent height, width, and the number of channels, respectively. The image is partitioned into patches, and each patch is projected into a token embedding $X \in \mathbb{R}^{N \times d}$, where $N = HW/P^2$ is the number of patches and $d$ is the embedding dimension. These tokens are then processed through a series of transformer layers, each comprising multi-head self-attention and feed-forward networks. The resulting feature map $F \in \mathbb{R}^{N \times d}$ encodes the spatial and contextual information critical for segmentation tasks.
2. Mask Decoder: The mask decoder refines the encoded image features to produce segmentation masks. Using the encoded feature map $F$ as input, along with a set of learnable query embeddings $Q \in \mathbb{R}^{M \times d}$ (where $M$ is the number of masks), the decoder computes attention weights to localize the object of interest. The mask prediction is then given by $M = \sigma(QF^T)$, where $\sigma$ is the sigmoid activation applied element-wise to generate pixel-wise probabilities. This formulation allows the SAM to generate multiple masks simultaneously while maintaining scalability.
3. Prompt Embedding: The prompt embedding module allows the SAM to adapt to various inputs, including points, bounding boxes, and textual descriptions. For a point-based prompt, the embedding is represented as a learnable vector $p \in \mathbb{R}^d$, added to the corresponding spatial location in the encoded feature map. For bounding boxes, the embeddings are parameterized as the positional encodings of the box’s corners, integrated into the transformer layers. This versatility enables SAM to function effectively across diverse user-provided guidance signals; a minimal usage sketch follows this list.
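For illustration, the sketch below shows how such box and point prompts are typically supplied to a pre-trained SAM through Meta AI's publicly released `segment_anything` package and its `SamPredictor` interface. The checkpoint path and prompt coordinates are placeholders, and this usage pattern is an assumption rather than the exact pipeline used in this work.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pre-trained SAM checkpoint (path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

predictor.set_image(image_rgb)                  # H x W x 3 uint8 image

masks, scores, _ = predictor.predict(
    box=np.array([x0, y0, x1, y1]),             # bounding-box prompt (damage annotation)
    point_coords=np.array([[cx, cy]]),          # saliency-derived centroid (Section 3.2.2)
    point_labels=np.array([1]),                 # 1 marks a foreground point
    multimask_output=False,
)
```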

3.2.2. Optimizing Prompting Strategy with Saliency Information

In the field of image segmentation, the SAM algorithm has significantly reduced the need for large-scale manually annotated data by leveraging prompting techniques. Unlike traditional deep learning models that require vast amounts of labeled data for training, SAM allows for high-quality segmentation with minimal input through various prompting strategies, such as bounding boxes or points. This ability to generate accurate segmentation results with minimal user intervention not only reduces the time and cost associated with data labeling but also presents a more scalable solution for real-world applications. When only bounding boxes are used as prompts, the SAM struggles to accurately segment narrow or elongated objects. Cracks, as shown in Figure 3a, are typically thin and irregular, and a bounding box prompt often leads to imprecise segmentation that includes parts of the background. This issue arises because a bounding box provides a coarse estimate of the object’s position, which is insufficient to guide the model in refining the segmentation of such intricate structures. While bounding box prompts serve as an initial guide, they are not always detailed enough to achieve the fine-grained segmentation required for challenging cases like cracks, where precision is crucial [44,45].
The SAM primarily relies on the Transformer architecture and large-scale pre-training data to achieve generalized segmentation capabilities for arbitrary objects. Its segmentation results are highly dependent on the provided prompts, with common prompt types including point prompts and box prompts. The combination of saliency maps generated from point and box prompts can leverage their complementary advantages. Box prompts help the SAM constrain the approximate target region, reducing background interference, while point prompts provide fine-grained local information, guiding the model to segment crack regions more accurately. This approach is particularly effective for elongated targets, as cracks exhibit weak local features, making box prompts alone prone to misclassification. In contrast, point prompts can provide additional information at critical locations along the crack, facilitating the learning of its complete structure by the SAM. Moreover, existing studies have demonstrated that integrating multiple prompts can significantly improve the SAM’s segmentation performance in complex scenarios. For instance, in medical image segmentation tasks, researchers have found that using both point and box prompts enhances model stability, thereby improving accuracy [46]. Similarly, in concrete bridge damage detection, this combined strategy strengthens the SAM’s ability to adapt to intricate damage patterns such as cracks, ultimately improving the quality of segmentation results.
As shown in Figure 3b, to enhance segmentation accuracy, we propose the incorporation of point prompts in addition to bounding box cues. In order to accurately obtain prompt points, we propose a strategy for optimizing prompts using saliency information. This process involves four key steps: (1) computing the saliency map, (2) thresholding, (3) connected component analysis, and (4) computing the centroid of the region. By refining the segmentation prompts with these saliency-driven steps, we aim to improve both the precision and accuracy of the segmentation masks generated by the SAM. The steps are shown below:
(1) Compute the Saliency Map: The first step involves computing the saliency map for the input image, which highlights regions of interest based on their visual importance. The spatial contrast-based saliency map $S$ can be formulated mathematically as follows:
$S(x, y) = \sum_{(i, j) \in N(x, y)} \left| I(x, y) - I(x + i, y + j) \right|$
where $I(x, y)$ is the intensity of the pixel at position $(x, y)$ in the image, and $N(x, y)$ is a local neighborhood of size $k \times k$ centered at the pixel $(x, y)$. The summation is taken over all neighboring pixels $(x + i, y + j)$ in the local neighborhood. This measure computes the contrast between a pixel and its neighbors by evaluating the absolute differences in pixel intensities.
(2) Thresholding: After normalizing the saliency map $S$ to produce $S_{norm}$ with values in the range $[0, 1]$, we proceed to identify the most salient regions by applying a threshold. To achieve this, we first calculate the maximum value of the normalized saliency map, $S_{max}$, which is given by
$S_{max} = \max(S_{norm})$
This value represents the highest saliency level in the image. We define the threshold $T$ as half of the maximum saliency value:
$T = 0.5 \times S_{max}$
Using this threshold, we create a binary map $B$ by comparing each pixel in the normalized saliency map $S_{norm}(x, y)$ to $T$. Specifically, for each pixel $(x, y)$, if the saliency value $S_{norm}(x, y)$ is greater than or equal to $T$, it is considered a salient pixel and assigned a value of 255 in the binary map:
$B(x, y) = \begin{cases} 255, & S_{norm}(x, y) \ge T \\ 0, & S_{norm}(x, y) < T \end{cases}$
(3) Connected Component Analysis: After thresholding, connected component analysis is performed to identify distinct, contiguous regions in the image. This step is crucial as it allows the model to isolate individual objects or regions of interest that have been highlighted by the saliency map. To identify regions of interest in the binary map $B$, the first step is to find all connected components. A connected component is a set of pixels in $B$ that are connected to each other and share the same binary value, in this case, 255. These connected components correspond to distinct regions of interest in the binary map. Once the connected components are identified, we proceed to compute the area of each component. For a given connected component $R_i$, the area $A_i$ is defined as the total number of pixels within that component. Mathematically, this is expressed as
$A_i = \sum_{(x, y) \in R_i} 1$
where the summation iterates over all pixels $(x, y)$ that belong to the region $R_i$, and the result gives the total count of pixels.
(4) Compute Centroid of the Region: The final step is to compute the centroid of each connected region, which provides a precise point location that represents the “center” of the object or region of interest. This centroid can be used as a point-based prompt to guide the segmentation process, improving the accuracy of the segmentation masks. By using the centroid as a reference, the model can more effectively refine the segmentation boundaries and focus its attention on the most salient areas [4]. For each valid region $R_i$, we compute its centroid $c_i = (x_c, y_c)$, which represents the center of mass of the region. The centroid coordinates are calculated as the average of the pixel positions within the region, given by
$c_i = (x_c, y_c) = \frac{1}{A_i} \sum_{(x, y) \in R_i} (x, y)$
where $A_i$ is the area of the region $R_i$, and the summation iterates over all pixels $(x, y)$ in the region. This compact form allows simultaneous computation of both the $x$ and $y$ coordinates of the centroid.
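A compact OpenCV sketch of this four-step prompt optimization is given below. It follows the procedure above under two small assumptions of our own: the per-pixel neighborhood contrast is approximated by the absolute difference to a local mean, and a minimum-area filter (`min_area`) is added to discard tiny noise regions.

```python
import cv2
import numpy as np

def saliency_point_prompts(image_bgr, min_area=50):
    """Derive centroid point prompts from a spatial-contrast saliency map."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)

    # (1) Saliency map: contrast of each pixel against its local neighborhood
    #     (approximated here by the absolute difference to a 9x9 neighborhood mean).
    saliency = np.abs(gray - cv2.blur(gray, (9, 9)))

    # (2) Normalize to [0, 1] and threshold at T = 0.5 * S_max.
    s_norm = saliency / (saliency.max() + 1e-8)
    binary = np.where(s_norm >= 0.5 * s_norm.max(), 255, 0).astype(np.uint8)

    # (3) Connected-component analysis on the binary map.
    num, _, stats, centroids = cv2.connectedComponentsWithStats(binary, connectivity=8)

    # (4) Centroids of sufficiently large regions become foreground point prompts for SAM.
    return [tuple(centroids[i]) for i in range(1, num)          # label 0 is the background
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```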

3.3. Mamba–ResNet Hierarchical Fusion Network for Concrete Damage Segmentation

The overall architecture of our proposed method is shown in Figure 4. In the task of concrete bridge damage segmentation, traditional single-encoder architectures often struggle to effectively balance the capture of global semantic information and local structural details. To overcome this limitation, we propose the Mamba–ResNet hierarchical fusion network, which utilizes a dual-branch design with MambaVision [47] as the main backbone and ResNet [48] as the auxiliary backbone. This dual-backbone architecture ensures a complementary feature extraction process, enhancing the model’s ability to handle complex structural variations in damaged bridges. As shown in Figure 4, MambaVision, as the main backbone, leverages its self-attention mechanism to excel in capturing long-range dependencies and extracting high-level semantic features. This is particularly useful for understanding the broader context in bridge damage, where large spatial relationships are essential for distinguishing various types of damage from the surrounding background. ResNet, acting as the auxiliary backbone, provides robust local feature representations through its hierarchical convolutional structure. This allows it to focus on capturing fine-grained details, such as the precise boundaries of cracks and surface degradation, which are crucial for accurate damage segmentation. The fusion of these two backbones allows the model to simultaneously benefit from both global context understanding and local feature precision, ensuring comprehensive damage segmentation. Bridge damage typically appears in different scales, from small cracks to large-area spalling or corrosion. A single-backbone network often struggles to capture both fine-scale details and large-scale structural patterns simultaneously. By using MambaVision as the main backbone, the network can establish a global understanding of damage distribution, while ResNet as the auxiliary backbone refines the segmentation results by enhancing local details at various scales. This hierarchical fusion ensures that the model can adapt to both local irregularities and global structural changes, improving segmentation robustness in real-world applications.
To integrate the multi-level features extracted by MambaVision and ResNet, we introduce a Hierarchical Attention Fusion (HAF) module. To improve the segmentation of concrete bridge damages, we integrated the Polarized Self-Attention (PSA) mechanism into the decoder. PSA enhances the model’s ability to distinguish texture features by effectively capturing both global patterns and local details. This is particularly important for damage segmentation, where subtle texture variations often define the boundary between damages and background, and traditional methods struggle with such fine-grained distinctions [49,50].
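The sketch below summarizes this dual-backbone layout at a high level; all module arguments are placeholders standing in for the components detailed in Sections 3.3.1, 3.3.2 and 3.3.3, so it should be read as a structural outline rather than the exact implementation.

```python
import torch.nn as nn

class MambaResNetFusionNetSketch(nn.Module):
    """Dual-backbone segmentation network: MambaVision (global semantics) and ResNet
    (local detail) features are fused stage-by-stage with HAF and decoded with PSA."""
    def __init__(self, mamba_backbone, resnet_backbone, haf_modules, psa_decoder):
        super().__init__()
        self.mamba = mamba_backbone          # main backbone, multi-stage features
        self.resnet = resnet_backbone        # auxiliary backbone, multi-stage features
        self.haf = nn.ModuleList(haf_modules)
        self.decoder = psa_decoder

    def forward(self, x):
        mamba_feats = self.mamba(x)          # list of stage features
        resnet_feats = self.resnet(x)        # list of stage features (matched resolutions)
        fused = [haf(m, r) for haf, m, r in zip(self.haf, mamba_feats, resnet_feats)]
        return self.decoder(fused)           # pixel-wise damage logits
```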

3.3.1. MambaVision Backbone

Recent developments in deep learning have revealed fundamental differences between CNN, Transformers, and state-space models like Mamba in modeling long-range dependencies. CNNs, known for their hierarchical feature extraction, primarily focus on local spatial patterns. Their reliance on stacked convolutional layers limits their ability to capture global context efficiently, as deeper layers are required to expand the receptive field. While dilated convolutions and non-local blocks have been introduced to alleviate this limitation, they still struggle with fully modeling distant dependencies [51]. Transformers effectively capture global dependencies through self-attention mechanisms, allowing each token to attend to all others in the input. This enables robust contextual modeling across an entire image. However, the computational complexity of self-attention scales quadratically with the number of tokens, posing challenges for high-resolution dense prediction tasks such as damage segmentation [43]. Various adaptations, such as windowed attention and sparse attention, attempt to address this, but they often compromise on capturing full global dependencies [52]. Mamba, a state-space sequence model (SSM), presents an alternative approach by encoding long-range interactions through structured state-space modeling [53]. Unlike CNNs, which require multiple layers to aggregate global information, Mamba inherently maintains long-range dependencies within a single operation. Unlike Transformers, which rely on explicit pairwise attention, Mamba propagates information sequentially in a manner that allows for efficient modeling of distant relationships without quadratic complexity. This structured approach ensures that long-range dependencies are captured effectively, making it particularly suited for segmentation tasks that require preserving global structural integrity.
In order to address the issues of insufficient feature extraction and inadequate exploration of deep contextual information in concrete bridge damage segmentation, we chose MambaVision as the backbone network for feature extraction. MambaVision has demonstrated significant advantages in capturing fine-grained features and preserving long-range dependencies, which are essential for accurately identifying subtle damage patterns in concrete structures. Recent studies have shown that deep contextual information plays a crucial role in improving segmentation performance by providing a more comprehensive understanding of the spatial relationships between different regions of the bridge surface [54]. Therefore, utilizing MambaVision allows for enhanced feature extraction and better integration of spatial context, ultimately improving the accuracy and reliability of damage detection in concrete bridges. As shown in Figure 5, the MambaVision architecture is a hierarchical network that consists of four distinct stages. The first two stages are primarily dedicated to feature extraction using CNN-based blocks, while the third and fourth stages focus on more complex feature extraction using the Mamba-based model.
The input image, sized $H \times W \times 3$, is first divided into overlapping patches of size $\frac{H}{4} \times \frac{W}{4} \times C$, and these patches are then projected into a $C$-dimensional embedding space. The stem module, consisting of two sequential $3 \times 3$ convolutional layers with stride 2, performs the initial feature extraction. This step is essential for reducing the image resolution while preserving essential information, enabling a more efficient analysis. The structure then progresses into stages 1 and 2, which rely on CNN blocks for rapid extraction of localized features. These stages use a down-sampling mechanism via batch-normalized convolutional layers with stride 2, which progressively reduces the image resolution. The CNN block formulation is represented by the following equations:
$\hat{z} = \mathrm{GELU}(\mathrm{BN}(\mathrm{Conv}_{3 \times 3}(z)))$
$z = \mathrm{BN}(\mathrm{Conv}_{3 \times 3}(\hat{z})) + z$
where GELU refers to the Gaussian error linear unit activation function, and BN is batch normalization [55].
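A direct PyTorch rendering of this residual CNN block is sketched below; the channel count is left as a parameter, and the padding choice is our assumption for keeping the spatial size unchanged inside the block.

```python
import torch.nn as nn

class CNNBlock(nn.Module):
    """Residual conv block used in MambaVision stages 1-2 (sketch of the two equations above)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.act = nn.GELU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, z):
        z_hat = self.act(self.bn1(self.conv1(z)))   # z_hat = GELU(BN(Conv3x3(z)))
        return self.bn2(self.conv2(z_hat)) + z      # z     = BN(Conv3x3(z_hat)) + z
```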
As the architecture progresses into stages 3 and 4, it integrates the MambaVision Mixer and Transformer blocks, which enhance the model’s ability to capture both spatial and sequential dependencies. The design ensures that each of these components works in concert, balancing local feature extraction with broader context modeling.
A distinctive feature of the Mamba architecture is its use of the Structured State Space Model (SSM), which plays a central role in overcoming the limitations of traditional convolutional and Transformer-based networks. The SSM is designed to model long-range dependencies in visual data, allowing the network to process sequential input in an efficient and context-aware manner. This is particularly important for visual tasks where global context and distant spatial relationships need to be captured. Specifically, the input $x(t)$ is processed through a series of hidden states $h(t) \in \mathbb{R}^M$, with the following formulation:
$h'(t) = A h(t) + B x(t)$
$y(t) = C h(t)$
where $A$, $B$, and $C$ are learnable parameters. To improve computational efficiency, these continuous parameters are discretized as follows:
$\bar{A} = \exp(\Delta A)$
$\bar{B} = (\Delta A)^{-1} \left( \exp(\Delta A) - I \right) (\Delta B)$
$\bar{C} = C$
This discretization enables the model to perform efficient computations in sequential tasks while preserving the ability to model long-range dependencies. The output can then be computed using a convolutional kernel $\bar{K}$, which aggregates the processed input $x(t)$:
$y = x * \bar{K}$
$\bar{K} = \left[ C\bar{B},\ C\bar{A}\bar{B},\ \ldots,\ C\bar{A}^{T-1}\bar{B} \right]$
This structure allows the network to integrate both local and global features efficiently, significantly enhancing its representational capacity.
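To illustrate the discretized formulation, the sketch below builds the convolution kernel $\bar{K}$ explicitly and applies it to a 1-D input sequence. This naive construction is for exposition only (practical Mamba implementations use a selective scan), and the scalar input/output shapes are our simplifying assumption.

```python
import torch
import torch.nn.functional as F

def ssm_kernel(A_bar, B_bar, C, seq_len):
    """K_bar = [C B_bar, C A_bar B_bar, ..., C A_bar^{T-1} B_bar] for a scalar-output SSM.
    A_bar: (M, M), B_bar: (M, 1), C: (1, M)."""
    taps, power = [], torch.eye(A_bar.shape[0])
    for _ in range(seq_len):
        taps.append((C @ power @ B_bar).squeeze())   # C * A_bar^t * B_bar
        power = power @ A_bar
    return torch.stack(taps)                         # shape (T,)

def ssm_apply(x, K_bar):
    """y = x * K_bar: causal convolution of the length-T input with the SSM kernel."""
    T = x.shape[-1]
    x_pad = F.pad(x, (T - 1, 0))                     # left-pad so the convolution is causal
    return F.conv1d(x_pad.view(1, 1, -1), K_bar.flip(-1).view(1, 1, -1)).view(-1)
```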
As shown in Figure 6, the MambaVision block introduces a dual-branch design, combining the strengths of selective scan operations (Scan) and regular convolutional paths.
The input $X_{in}$ undergoes two parallel processing streams. In the first branch, $X_1$ is computed using a selective scan operation applied to features processed by a linear transformation and convolution, mathematically described as:
$X_1 = \mathrm{Scan}\left( \sigma\left( \mathrm{Conv}\left( \mathrm{Linear}(C, \tfrac{C}{2})(X_{in}) \right) \right) \right)$
where $\sigma$ represents the activation function (e.g., the Sigmoid Linear Unit, SiLU), and $\mathrm{Linear}(C, \tfrac{C}{2})$ denotes a linear transformation reducing the input dimension. The second branch computes $X_2$ by applying a similar convolution and activation process without the Scan operation:
$X_2 = \sigma\left( \mathrm{Conv}\left( \mathrm{Linear}(C, \tfrac{C}{2})(X_{in}) \right) \right)$
The outputs from both branches are concatenated and projected back into the embedding space $C$ using a final linear transformation:
$X_{out} = \mathrm{Linear}(\tfrac{C}{2}, C)\left( \mathrm{Concat}(X_1, X_2) \right)$
This design ensures that both local features (captured by the convolutional path) and global dependencies (modeled by the Scan operation) are effectively integrated into the final feature representation. By incorporating a symmetric path without SSM in the second branch, the MambaVision block mitigates any loss of critical local information, striking a balance between sequential and spatial modeling.
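The dual-branch mixer can be sketched as follows. The `selective_scan` argument is a placeholder for an SSM/scan implementation, the depthwise 1-D convolutions and SiLU activation follow the equations above, and the output projection maps the concatenated $C/2 + C/2$ channels back to $C$; this is our reading of the block, not the official code.

```python
import torch
import torch.nn as nn

class MambaVisionMixerSketch(nn.Module):
    """Dual-branch mixer: a selective-scan (SSM) branch plus a symmetric conv branch without SSM."""
    def __init__(self, dim, selective_scan):
        super().__init__()
        half = dim // 2
        self.proj1 = nn.Linear(dim, half)
        self.proj2 = nn.Linear(dim, half)
        self.conv1 = nn.Conv1d(half, half, kernel_size=3, padding=1, groups=half)
        self.conv2 = nn.Conv1d(half, half, kernel_size=3, padding=1, groups=half)
        self.act = nn.SiLU()
        self.scan = selective_scan                      # placeholder SSM operator
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, N, C) token sequence
        b1 = self.act(self.conv1(self.proj1(x).transpose(1, 2))).transpose(1, 2)
        b2 = self.act(self.conv2(self.proj2(x).transpose(1, 2))).transpose(1, 2)
        x1 = self.scan(b1)                              # X1 = Scan(sigma(Conv(Linear(X_in))))
        x2 = b2                                         # X2 = sigma(Conv(Linear(X_in)))
        return self.out_proj(torch.cat([x1, x2], dim=-1))   # X_out = Linear(Concat(X1, X2))
```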
In the final stages of MambaVision, Transformer blocks are integrated to recover lost global context and capture long-range spatial dependencies. These blocks employ a multi-head self-attention mechanism, which is essential for modeling the complex relationships between distant parts of the image. The attention mechanism is formulated as
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left( \frac{Q K^T}{\sqrt{d_h}} \right) V$
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_h$ is the dimension of the attention heads. This mechanism enables the model to focus on relevant parts of the image, even if they are far apart spatially, allowing it to effectively capture dependencies across large regions of the image.

3.3.2. Hierarchical Attention Fusion

Integrating multi-level features from different network layers is a widely used approach to improve feature representation in vision tasks [56]. Two common methods for such integration are feature concatenation and element-wise addition. Feature concatenation combines features by stacking them along the channel dimension, allowing the model to preserve all information from both levels. On the other hand, element-wise addition fuses features by summing corresponding elements, effectively reducing redundancy and maintaining compactness. These methods, while effective in many applications, struggle to capture the intricate texture details required for segmenting damages in concrete bridges.
In the context of concrete bridge damage segmentation, these general methods often fail to effectively highlight the subtle texture variations that distinguish damages from background noise. This limitation arises due to the lack of an adaptive mechanism to emphasize critical texture features and suppress irrelevant information. To address this challenge, we propose the Hierarchical Attention Fusion (HAF) framework. As shown in Figure 7, the proposed HAF framework employs an attention mechanism to integrate features from different levels of MambaVision and ResNet. By utilizing attention, the HAF adaptively assigns importance weights to features, enhancing their discriminative power for damage segmentation.
For a given pair of input features $X$ and $Y$ with dimensions $C \times H \times W$, HAF operates through parallel pathways. Each pathway first processes its input using the BTM, refining the texture details and emphasizing the regions critical for damage identification. Mathematically, the attention-weighted feature maps $F_X$ and $F_Y$ can be represented as
$F_X = \mathrm{Conv}_{1 \times 1}(\mathrm{ReLU}(\mathrm{BTM}(X)))$
$F_Y = \mathrm{Conv}_{1 \times 1}(\mathrm{ReLU}(\mathrm{BTM}(Y)))$
where $\mathrm{BTM}$ represents the Boosting Textures Module and $\mathrm{Conv}_{1 \times 1}$ is a $1 \times 1$ convolution layer. The output of HAF is then given by
$F_{HAF} = X \cdot \sigma(F_X + F_Y) + Y \cdot \sigma(F_X + F_Y)$
where $\cdot$ denotes element-wise multiplication, and $\sigma$ denotes the sigmoid activation function.
The BTM, a core component of HAF, is specifically designed to enhance the representation of discriminative texture features. As detailed in Figure 7, the BTM captures multi-scale texture patterns through a combination of $3 \times 3$ convolutions and dilated convolutions with dilation rates $d = 3, 5, 7$. These multi-scale outputs are concatenated along the channel dimension:
$F_{BTM} = \mathrm{Concat}\left( \mathrm{Conv}_{3 \times 3}(F),\ \mathrm{Conv}_{3 \times 3}^{d=3}(F),\ \mathrm{Conv}_{3 \times 3}^{d=5}(F),\ \mathrm{Conv}_{3 \times 3}^{d=7}(F) \right)$
The resulting feature map is then refined through a $1 \times 1$ convolution to reduce dimensionality:
$F = \mathrm{Conv}_{1 \times 1}(F_{BTM})$
Neuroscience studies have revealed that the human visual system employs a series of receptive fields with varying sizes to process visual information effectively. Specifically, it has been demonstrated that these receptive fields are organized to emphasize the area near the fovea of the retina, which is particularly sensitive to subtle spatial displacements [57]. In the context of damage segmentation in concrete bridges, this design principle allows the network to simulate the multi-scale processing mechanism of the human visual system. By employing convolutions with dilation rates d = 3 , 5 , 7 , the network can simultaneously capture fine-grained details, such as micro-cracks, and broader contextual patterns, such as spalling or extensive cracking.
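A sketch of the BTM and the HAF fusion, following the equations above, is given below; the per-branch channel widths and the choice of a standard $3 \times 3$ convolution as the undilated branch are our assumptions.

```python
import torch
import torch.nn as nn

class BTM(nn.Module):
    """Boosting Textures Module: parallel 3x3 convs with dilation rates 1, 3, 5, 7,
    concatenated and reduced by a 1x1 conv (channel sizes are assumptions)."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in (1, 3, 5, 7)
        ])
        self.reduce = nn.Conv2d(4 * channels, channels, kernel_size=1)

    def forward(self, f):
        return self.reduce(torch.cat([b(f) for b in self.branches], dim=1))

class HAF(nn.Module):
    """Hierarchical Attention Fusion of two feature maps X (MambaVision) and Y (ResNet)."""
    def __init__(self, channels):
        super().__init__()
        self.btm_x, self.btm_y = BTM(channels), BTM(channels)
        self.proj_x = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_y = nn.Conv2d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, y):
        fx = self.proj_x(self.relu(self.btm_x(x)))          # F_X
        fy = self.proj_y(self.relu(self.btm_y(y)))          # F_Y
        attn = torch.sigmoid(fx + fy)                       # shared attention weights
        return x * attn + y * attn                          # F_HAF
```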

3.3.3. Polarized Self-Attention

In segmentation networks, the decoder plays a pivotal role in recovering high-resolution spatial details from low-resolution feature maps generated by the encoder. Traditional decoders often rely on up-sampling operations such as transposed convolutions, bilinear interpolation, or nearest-neighbor interpolation to reconstruct spatial details [58]. While effective in generating higher-resolution outputs, these methods primarily focus on local context and often fail to model global dependencies, which are critical for understanding complex spatial relationships. To address this, researchers have proposed feature fusion techniques that aggregate information across different levels of the network. For instance, skip connections, as seen in U-Net, combine high-resolution features from earlier encoder layers with the up-sampled outputs of the decoder, thereby enriching the spatial detail [59]. Pyramid-based approaches, such as the Pyramid Scene Parsing Network (PSPNet), use multi-scale pooling to capture context at various resolutions [60]. Similarly, atrous spatial pyramid pooling (ASPP) effectively integrates features at multiple scales, improving the segmentation of objects with varying sizes [61]. Despite their effectiveness, these methods are limited in their ability to dynamically adapt to diverse inputs, often treating all features equally without emphasizing the most salient ones.
Attention mechanisms offer a solution by enabling selective focus on critical regions while suppressing irrelevant information. This capability is particularly valuable for tasks like concrete bridge damage segmentation, where anomalies such as cracks, spalling, and other forms of damage vary significantly in appearance and scale. By incorporating Polarized Self-Attention into the decoder, the network gains the ability to model both local and global dependencies, enhancing precision and robustness in feature reconstruction.
As illustrated in Figure 8, PSA consists of two key components: channel-only self-attention and spatial-only self-attention. In the channel dimension, PSA retains half of the original features’ dimensions, while in the spatial dimension, it maintains the full dimensions of the original features. This design reduces the information loss caused by dimensionality reduction.
In the channel attention component, we first apply a $1 \times 1$ convolution to the input feature $X \in \mathbb{R}^{\frac{C}{2} \times H \times W}$, transforming it into two feature maps, $Q \in \mathbb{R}^{\frac{C}{2} \times H \times W}$ and $V \in \mathbb{R}^{C \times H \times W}$. Here, $Q$’s channel dimension is compressed to half, whereas $V$’s channel dimension is preserved at a higher level. To strengthen the information content of $Q$, the Softmax function is applied across the channel dimension:
$Q_c = \mathrm{Softmax}(Q_c), \quad c \in \left[ 1, \tfrac{C}{2} \right]$
Next, a matrix multiplication is performed between $Q$ and $K$ to aggregate channel attention:
$A = Q \cdot K^T$
where $K$ is another transformation of the input features via $1 \times 1$ convolution. Subsequently, we use a $1 \times 1$ convolution and normalization layer to project the aggregated features back to $C$-dimensional space:
$F_{channel} = \mathrm{Norm}(\mathrm{Conv}_{1 \times 1}(A \cdot V))$
Finally, a sigmoid function is applied to restrict all parameters within the range $[0, 1]$:
$F_{channel}^{out} = \mathrm{Sigmoid}(F_{channel})$
In the spatial attention component, we similarly apply a $1 \times 1$ convolution to transform the input feature $X$ into two spatial maps, $Q \in \mathbb{R}^{C \times H \times W}$ and $V \in \mathbb{R}^{C \times H \times W}$. To compress the spatial dimension of $Q$, we use global average pooling:
$Q = \mathrm{GAP}(Q)$
The spatial dimension of $V$ remains intact as $H \times W$. Softmax is then applied to enhance the information content of $Q$:
$Q_{h,w} = \mathrm{Softmax}(Q_{h,w}), \quad (h, w) \in [1, H] \times [1, W]$
The attention map is obtained by performing matrix multiplication between $Q$ and $K$:
$A_{spatial} = Q \cdot K^T$
Subsequently, feature transformation and a sigmoid function are applied:
$F_{spatial}^{out} = \mathrm{Sigmoid}(\mathrm{Transform}(A_{spatial} \cdot V))$
The final output feature $F_{PSA}$ is obtained by combining both channel attention and spatial attention outputs:
$F_{PSA} = F_{channel}^{out} + F_{spatial}^{out}$
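The sketch below gives a simplified parallel implementation of the two branches: a channel-only attention that reweights channels using a spatially softmaxed query, and a spatial-only attention that reweights positions using a channel-softmaxed, globally pooled query, with the two outputs summed. Tensor shapes and the exact normalization differ from the reference PSA code, so this is a loose illustration of the mechanism only.

```python
import torch
import torch.nn as nn

class PolarizedSelfAttentionSketch(nn.Module):
    """Simplified parallel PSA: channel-only + spatial-only attention branches, summed."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2
        # Channel-only branch
        self.ch_q = nn.Conv2d(channels, 1, kernel_size=1)
        self.ch_v = nn.Conv2d(channels, mid, kernel_size=1)
        self.ch_up = nn.Sequential(nn.Conv2d(mid, channels, 1), nn.LayerNorm([channels, 1, 1]))
        # Spatial-only branch
        self.sp_q = nn.Conv2d(channels, mid, kernel_size=1)
        self.sp_v = nn.Conv2d(channels, mid, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel attention: softmax over spatial positions of a 1-channel query.
        q = torch.softmax(self.ch_q(x).view(b, 1, h * w), dim=-1)             # (B, 1, HW)
        v = self.ch_v(x).view(b, -1, h * w)                                   # (B, C/2, HW)
        ch = torch.sigmoid(self.ch_up((v @ q.transpose(1, 2)).view(b, -1, 1, 1)))
        out_ch = x * ch                                                       # channel-weighted features
        # Spatial attention: softmax over channels of a globally pooled query.
        q = torch.softmax(self.sp_q(x).mean(dim=(2, 3)), dim=-1).unsqueeze(1)  # (B, 1, C/2)
        v = self.sp_v(x).view(b, -1, h * w)                                   # (B, C/2, HW)
        sp = torch.sigmoid((q @ v).view(b, 1, h, w))                          # (B, 1, H, W)
        out_sp = x * sp                                                       # spatially weighted features
        return out_ch + out_sp
```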

3.4. Loss Function

In this study, the proposed method for segmenting concrete bridge damages leverages a combination of Cross-Entropy Loss (CE Loss) and Dice Loss to optimize the segmentation performance.
CE Loss, commonly used in multi-class and binary classification tasks, measures the dissimilarity between the predicted probabilities and the ground-truth labels. It effectively penalizes incorrect predictions, ensuring that the model learns to output probabilities close to the true label distributions. For damage segmentation, CE Loss helps in handling class imbalances by emphasizing the need for accurate pixel-wise classification. The loss function is defined as follows:
$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$
where $y_i$ and $\hat{y}_i$ represent the true and predicted probabilities for pixel $i$, and $N$ is the total number of pixels.
Dice Loss, on the other hand, directly optimizes for the overlap between the predicted and ground-truth masks by focusing on maximizing their Dice Similarity Coefficient (DSC). It is particularly effective in addressing issues of class imbalance, as it equally emphasizes both large and small regions within the mask. Dice Loss is given by
$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i}$
The combined loss function, a weighted sum of CE Loss and Dice Loss, ensures robust optimization by leveraging the strengths of both approaches:
$L = \alpha L_{CE} + \beta L_{Dice}$
where α and β are hyperparameters that balance the contributions of each loss term. By combining these loss functions, the model achieves enhanced segmentation accuracy, as CE Loss ensures precise pixel-wise predictions while Dice Loss enforces shape and region-level consistency.
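For reference, a minimal sketch of the combined loss for the binary (single-class) case written out in the equations above is shown below; the smoothing term `eps` and the use of logits as input are our implementation assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_logits, target, alpha=0.5, beta=0.5, eps=1e-6):
    """Weighted sum of pixel-wise binary cross-entropy and Dice loss (sketch).
    pred_logits, target: tensors of shape (B, 1, H, W); target values in {0, 1}."""
    ce = F.binary_cross_entropy_with_logits(pred_logits, target.float())      # L_CE

    prob = torch.sigmoid(pred_logits)
    intersection = (prob * target).sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * intersection + eps) / (
        prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps)            # L_Dice per sample

    return alpha * ce + beta * dice.mean()                                    # L = a*L_CE + b*L_Dice
```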

4. Experiments and Results

4.1. Experimental Environment

4.1.1. Datasets

Since there is no publicly available dataset for concrete bridge damage detection, we constructed our own dataset by collecting 10,000 concrete bridge inspection images from various bridge inspection reports, as shown in Figure 9. These images contain six types of damage, including cracks, spalling, corrosion, rebar exposure, stains, and holes, as well as background images of the bridge surface. Among these, 7354 images depict different types of damage, while 2644 images represent background scenes.
In order to create the box-level annotations for these six types of damage, we annotated the damage regions with bounding boxes, as illustrated in Figure 10. The distribution of images and instance-level annotations for each type of damage are provided in Table 1.
The annotation process was conducted under the supervision of a domain expert with extensive experience in machine vision and structural damage identification. The expert first guided the selection of bridge damage images, which were compiled into a curated damage image sample library. Next, the expert distributed the damage sample images to three independent annotators, who manually labeled the damage regions using a bounding-box-based approach. After the initial annotation, the labeled images were reviewed by the expert. Any incorrect or inconsistent annotations were reassigned to another annotator for re-labeling. This iterative annotation and validation process was repeated until all annotations met the predefined accuracy and consistency standards. Each image in the dataset is uniquely identified by a numerical file name ranging from 1 to 10,000, ensuring clear and consistent labeling. The original image files are in JPG format, and the corresponding annotation files are stored in XML format. The XML files contain detailed information about the annotations, including the bounding box coordinates and damage categories for each image. As shown in Figure 11, the XML file structure includes metadata, the labeled damage regions, and additional attributes such as image dimensions and class labels. The standardized naming and file structure facilitate efficient management and use of the dataset.
In our dataset, the distribution of damage center points is predominantly concentrated near the center of the image, with a relatively small proportion of damage regions appearing at the edges. This is illustrated in Figure 12, which shows the normalized label center points and the width-to-height ratios of the bounding boxes for the annotated damage regions. Concrete bridge damage typically occupies a small portion of the image, suggesting that networks designed for small-object detection could achieve better performance in detecting such damage types.
Since the task at hand is semantic segmentation, we further created pixel-level annotations for 1000 images to evaluate the proposed method. A sample of these pixel-level annotations is shown in Figure 13.

4.1.2. Experimental Settings

The model training was carried out over a total of 100 epochs, with each epoch consisting of a complete pass through the training dataset. The batch size was set to 16, meaning that the model parameters were updated after processing 16 samples. This batch size strikes a balance between computational efficiency and stability of gradient estimates. Training was conducted using a Stochastic Gradient Descent (SGD) optimizer, widely known for its efficiency in large-scale machine learning tasks. The optimizer incorporated a momentum term of 0.9 to accelerate convergence and smooth the optimization process by considering past gradients, which helps in reducing oscillations. The initial learning rate $\eta_0$ was set to $10^{-3}$, providing a good balance between rapid convergence and stability during training. As training progressed, the learning rate decayed exponentially according to the following schedule:
$\eta_t = \eta_0 \cdot e^{-\lambda t}$
where $t$ is the epoch index and $\lambda$ is the decay factor, set to 0.01. This exponential decay allowed for finer adjustments to the learning rate as training advanced.
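These optimizer settings correspond to the following short PyTorch sketch, where `model` and `train_one_epoch` are placeholders for the network and training routine:

```python
import math
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: math.exp(-0.01 * epoch)   # eta_t = eta_0 * exp(-lambda * t)
)

for epoch in range(100):
    train_one_epoch(model, optimizer)   # hypothetical per-epoch training routine
    scheduler.step()                    # apply the exponential decay after each epoch
```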
The model was optimized using a combined loss function, which consisted of both Cross-Entropy Loss and Dice Loss, addressing the specific challenges of bridge disease segmentation in concrete structures. Cross-Entropy Loss measures the discrepancy between predicted probabilities and true class labels, ensuring accurate classification performance, while Dice Loss quantifies the overlap between predicted and true segmentation masks, which is particularly valuable when dealing with imbalanced datasets. Both losses were weighted equally, with coefficients α = 0.5 and β = 0.5 , to balance their respective contributions during training.
All experiments were conducted using the PyTorch 2.1 framework. The training and evaluation processes were performed on four NVIDIA RTXA6000 GPUs (Nvidia Corporation, Santa Clara, CA, USA), leveraging their high computational efficiency. Mixed precision training was employed to optimize memory usage and speed up computation.

4.1.3. Evaluation Metrics

To evaluate the performance of the proposed segmentation method for concrete bridge damage detection, three widely used metrics were adopted: intersection over union (IoU), pixel accuracy (PA), and the Dice Similarity Coefficient (Dice). These metrics comprehensively measure the accuracy of pixel-level predictions, ensuring both localization and classification performance are assessed.
IoU measures the overlap between the predicted segmentation and the ground truth, defined as the ratio of their intersection over their union:
$IoU = \frac{TP}{TP + FP + FN}$
where TP, FP, and FN represent the number of true positive, false positive, and false negative pixels, respectively. IoU provides an overall measure of the region-level overlap.
PA indicates the number of pixels with correct prediction categories as a percentage of the total number of pixels:
$PA = \frac{TP + TN}{TP + TN + FP + FN}$
Dice is a region-based metric that quantifies the harmonic mean of precision and recall, providing a balanced measure of segmentation accuracy:
$Dice = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$
Dice is particularly sensitive to imbalanced datasets and small damage regions, making it an essential metric for assessing damage segmentation tasks.
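For a single class, the three metrics can be computed directly from the pixel-wise confusion counts, as in the NumPy sketch below (the paper reports their means over all damage classes); the small epsilon is an assumption to avoid division by zero.

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-8):
    """Compute IoU, pixel accuracy, and Dice from binary prediction / ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()

    iou = tp / (tp + fp + fn + eps)
    pa = (tp + tn) / (tp + tn + fp + fn + eps)
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    return iou, pa, dice
```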
In addition to segmentation accuracy, computational efficiency is a crucial aspect in evaluating the practicality of deep learning models for real-world deployment in bridge inspection tasks. Therefore, we report the model’s performance in terms of frames per second (FPS) and GPU memory usage to provide a comprehensive assessment of its applicability.

4.2. Performance Comparison of Different Models

To evaluate the effectiveness of the proposed Mamba–ResNet hierarchical fusion network in addressing the segmentation of concrete bridge damage, we conducted comprehensive experiments comparing its performance against several state-of-the-art models. The selected baseline models represent a diverse set of architectures and methodologies, including both transformer-based and convolutional neural network (CNN)-based designs, to ensure a thorough performance assessment. Specifically, we included DeepLabV3+, a widely used semantic segmentation framework known for its atrous spatial pyramid pooling [62]; SegFormer, which combines a lightweight design with Transformer-based representation [63]; HRViT, leveraging high-resolution feature fusion with Vision Transformers [64]; ABCNet, which uses spatial path and contextual path architecture [65]; PIDNet, which uses a novel three-branch network architecture [66]; SeaFormer, emphasizing efficiency in lightweight segmentation tasks [67]; Mamba–UNet, a Mamba-based U-Net architecture [68]; and SCTNet, which extracts rich semantic information by learning from Transformer and CNN semantic information alignment [69]. The inclusion of these models provides a comprehensive benchmarking landscape, highlighting the strengths and limitations of our approach.
As shown in Table 2, our proposed method achieves superior performance, with MIoU, PA, and MDice scores of 60.13%, 74.02%, and 75.40%, respectively. Compared to the second-best-performing model, Semi-Mamba, our approach improves MIoU by 1.04%, PA by 0.81%, and MDice by 0.54%. These improvements highlight the efficacy of our hierarchical fusion strategy in feature extraction and integration. Notably, Semi-Mamba also incorporates the SSM method from Mamba for backbone feature extraction, which aligns with our approach in utilizing advanced feature extraction techniques. However, the performance gains in our proposed method underscore the advantages of leveraging ViT methods for backbone network feature extraction over traditional CNN approaches. From the individual class IoU results, it is evident that our model consistently achieves the best results across various damage categories. The segmentation precision for the hole class, in particular, shows a remarkable improvement, attributable to our multi-layer feature extraction and Hierarchical Attention Fusion (HAF) methods. These techniques enhance the network’s ability to capture intricate damage textures and efficiently merge features, thereby boosting the overall segmentation accuracy.

4.3. Ablation Studies

4.3.1. Ablation Study on Overall Framework

To assess the contribution of each component within the proposed Mamba–ResNet hierarchical fusion network for concrete bridge damage segmentation, an ablation study was conducted. The experimental configurations tested are as follows:
1. This configuration corresponds to the baseline network, which excludes all the components proposed in this study, including MambaVision, the Optimizing Prompting Strategy (OPS), Hierarchical Attention Fusion (HAF), the Boosting Textures Module (BTM), and Polarized Self-Attention (PSA). This baseline serves as a reference to evaluate the network’s performance without the optimization and enhancement mechanisms introduced in this work. In the baseline network, the backbone is ResNet, and all feature fusion is performed using simple feature map concatenation.
2. In this configuration, OPS is used for training the baseline network.
3. This configuration replaces the baseline backbone with MambaVision as the main network backbone, without employing OPS for training.
4. Building on the previous configuration, this setup uses MambaVision as the main backbone and incorporates OPS for network training.
5. This configuration includes OPS and HAF, but excludes the BTM within HAF.
6. In this setup, the Boosting Textures Module (BTM) is introduced in addition to OPS and HAF, enabling the evaluation of the role of texture enhancement within the feature fusion framework.
7. This configuration removes BTM from HAF and includes Polarized Self-Attention (PSA).
8. In this configuration, only the PSA module is introduced, allowing for the isolated evaluation of PSA’s effect on segmentation performance.
9. This represents the full model configuration proposed in this study.
As shown in Table 3, the ablation study presents the results of evaluating the overall framework of the proposed method. Exp. #1 and #2 demonstrate that adding OPS to the baseline model results in a performance improvement, with MIoU increasing by 1.39%. Exp. #3 and #4 further confirm that incorporating OPS into the MambaVision backbone also enhances the network’s performance, with MIoU increasing by 0.84%. Exp. #4 and #5 show that adding HAF without BTM also increases the network’s accuracy, with MIoU improving by 0.35%. The comparison between Exp. #1 and #3 shows a substantial improvement in performance when using MambaVision as the backbone, with MIoU increasing by 3.57%, proving the effectiveness of the hierarchical feature extraction network proposed in this work. Exp. #5 and #6 indicate that incorporating BTM further enhances accuracy, with MIoU improving by 1.17%, validating the effectiveness of BTM in capturing the texture features specific to concrete bridge damages. Exp. #7 and #8 show that HAF provides a slight improvement to the network, with MIoU increasing by 0.11%. Finally, the comparison of Exp. #4 and #8 reveals that using PSA as the decoding network, even without HAF, enhances segmentation performance, with MIoU improving by 0.62%. Overall, the proposed method demonstrates significant improvements over the baseline, with MIoU, PA, and MDice increasing by 6.85%, 5.79%, and 5.18%, respectively.

4.3.2. Ablation Study on Different SAM Prompting Strategies

To rigorously analyze the impact of different SAM prompting strategies on concrete bridge damage segmentation, we conducted an ablation study comparing three prompting approaches: point, box, and box + point. The three prompting strategies were evaluated using standard segmentation metrics including MIoU, MPA, and MDice. The experimental results are presented in Table 4.
The results indicate that the box + point prompting strategy outperforms the other approaches with the highest MIoU of 60.13%, MPA of 74.02%, and MDice of 75.40%. The box-only strategy shows better performance than the point-only method, with improvements of approximately 1.39% in MIoU, 0.77% in MPA, and 0.66% in MDice. These findings confirm that integrating both types of prompts offers complementary benefits by constraining the target region and providing essential fine-grained details, which is particularly beneficial for complex and elongated damage patterns.
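For context, the sketch below shows how a box and a point prompt can be passed jointly to SAM through the official segment_anything interface. The checkpoint path, image, and coordinates are placeholders, and the snippet does not reproduce the saliency-based point selection of our Optimizing Prompting Strategy.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Placeholder checkpoint path; any SAM variant ("vit_h", "vit_l", "vit_b") could be used.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for an RGB bridge photograph
predictor.set_image(image)

box = np.array([120, 80, 360, 240])   # annotated damage box (x1, y1, x2, y2)
point = np.array([[240, 160]])        # e.g., a saliency-guided foreground point
label = np.array([1])                 # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(
    point_coords=point,
    point_labels=label,
    box=box,
    multimask_output=False,           # a single mask for the prompted region
)
```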

4.3.3. Ablation Study on Different Backbone Networks

The goal of this section is to investigate the impact of different backbone networks on the segmentation results. Specifically, we aim to assess how the choice of backbone network affects the performance of the proposed method for concrete bridge damage segmentation. The approach employs a dual-backbone architecture, where the primary backbone, based on MambaVision, captures global contextual features, while the secondary backbone, using ResNet, extracts local features.
To explore the role of the main backbone, we replace MambaVision with several alternative backbone networks, including MobileViT, a lightweight hybrid model combining CNNs and ViTs to balance efficiency and accuracy [70]; Swin Transformer, which uses hierarchical window-based attention to model local and global features at multiple scales [52]; ConvFormer, which integrates convolutional operations into Transformer structures to enhance locality modeling [71]; and CaiT, a Transformer model designed to improve attention mechanisms for capturing fine-grained details [72]. These ablation experiments are designed to demonstrate the effectiveness of MambaVision in comparison to other popular architectures.
As shown in Table 5, the experimental results using different backbone networks highlight the superior performance of the MambaVision backbone. It achieves the best results in terms of MIoU, PA, and MDice, demonstrating MambaVision’s significant advantage in capturing fine-grained features and maintaining long-range dependencies. Compared to the Swin Transformer, the MambaVision backbone improves the MIoU, PA, and MDice by 0.92%, 0.24%, and 1.19%, respectively.
In order to explore the individual contributions of different backbone networks to the performance of concrete bridge damage segmentation, we designed an ablation study to compare the performance of the proposed Mamba–ResNet hierarchical fusion network with two separate backbone configurations: MambaVision and ResNet.
As shown in Table 6, the experimental results indicate that the ResNet-only configuration achieves an mIoU of 54.33%, an mPA of 69.67%, and an mDice of 71.24%, while the MambaVision-only configuration yields an mIoU of 58.42%, an mPA of 73.36%, and an mDice of 74.11%. Moreover, the fusion of ResNet and MambaVision produces the highest performance with an mIoU of 60.13%, an mPA of 74.02%, and an mDice of 75.40%. By using MambaVision as the main backbone, the network can establish a global understanding of damage distribution, while ResNet as the auxiliary backbone refines the segmentation results by enhancing local details at various scales. This hierarchical fusion ensures that the model can adapt to both local irregularities and global structural changes, improving segmentation robustness in real-world applications.

4.3.4. Ablation Study on Different Attention Module Decoders

To evaluate the impact of different attention mechanisms on segmentation performance, we conducted an ablation study by integrating five distinct attention modules into the decoder architecture of our proposed Mamba–ResNet hierarchical fusion network. Specifically, we compared the performance of the following attention mechanisms: Squeeze-and-Excitation (SE) [73], Convolutional Block Attention Module (CBAM) [74], Selective Kernel Attention (SKA) [75], Efficient Multi-scale Attention (EMA) [76], and Global Attention Module (GAM) [77].
As shown in Table 7, the results of experiments using different attention modules for the decoder demonstrate that the PSA decoder achieves the best performance in terms of MIoU, PA, and MDice. This confirms that PSA effectively models both local and global dependencies, offering a more balanced performance compared to other attention mechanisms. Compared to EMA, PSA improves the MIoU, PA, and MDice by 0.11%, 0.10%, and 0.19%, respectively.

4.3.5. Ablation Study on Loss Function

In this section, we compare our proposed CE + Dice Loss with three commonly used loss functions, Mean Absolute Error (MAE), Cross-Entropy (CE), and Dice Loss, to demonstrate the advantages of our approach for concrete bridge damage segmentation.
As shown in Table 8, our proposed CE + Dice Loss combines the advantages of CE and Dice to deal with category imbalance, improve the detection of small damage regions, and maintain robustness to noise. Our ablation experimental results show that CE + Dice outperforms other loss functions in MIOU, PA, and MDice.

4.3.6. Hyperparametric Analysis

To analyze the impact of the hyperparameters on the model’s accuracy, we designed an ablation experiment. The learning rate (Lr) is one of the most critical hyperparameters in training machine learning models, influencing the final performance of the model. In this experiment, we systematically varied the learning rate while keeping all other hyperparameters constant. This approach allowed us to evaluate how different learning rates affected the model’s ability to detect and segment concrete bridge defects. Specifically, we tested a range of learning rates from 0.0001 to 0.01, with each experiment conducted under identical conditions to ensure the results were solely due to the change in learning rate.
The experimental results are presented in Table 9. The table summarizes the performance of the model under different learning rates, showing the impact on both the segmentation accuracy and the overall model efficiency. As seen in the table, the model with a learning rate of 0.001 achieved the highest accuracy, with a significant improvement in defect segmentation quality compared to the other rates.

4.4. Visualization Results

In this section, we present representative segmentation results to demonstrate the effectiveness of the proposed method. We selected several representative cases to perform comparisons with other state-of-the-art networks and analyze the results of ablation experiments. The visual comparisons serve as a critical evaluation tool for highlighting the superiority of the SAM-guided Mamba–ResNet hierarchical fusion network in concrete bridge damage segmentation.
Figure 14 presents a comparison of various methods evaluated in Section 4.2, including DeepLabV3+, SegFormer, HRViT, ABCNet, PIDNet, SeaFormer, Semi-Mamba, and SCTNet. As observed in the first and second rows of the figure, our proposed method demonstrates excellent segmentation results for single-class damage, particularly for the crack category, where the crack contours are closest to the ground truth. In the third row, it is evident that our method excels in segmenting small objects, with other networks struggling to accurately segment smaller features, such as rebar, leading to segmentation failures and fragmentations. The fourth row highlights our method’s superior ability to identify regions in the concrete background that resemble damage. While some methods incorrectly classify damaged background regions as cracks, our approach accurately differentiates these areas. Finally, in the fifth row, the dense damage segmentation results show that although our method, like others, occasionally misclassifies some background regions, it still produces the most accurate segmentation for actual damage areas. This demonstrates the robustness of our approach in handling complex and densely damaged concrete structures.
Figure 15 illustrates the ablation study results presented in Section 4.3.1, comparing several configurations: baseline, baseline + OPS, baseline + OPS + MambaVision, baseline + OPS + MambaVision + HAF, and baseline + OPS + MambaVision + HAF + PSA. As shown in the first row for crack segmentation, without OPS, the baseline method struggles to accurately segment the background surrounding the cracks. However, after adding OPS, the crack segmentation results improve significantly, with the network able to better delineate the crack contours. In the second row, for multi-class damage segmentation, the baseline model without OPS mistakenly recognizes complex background areas as damage categories. After integrating OPS, the contours of the damage become more aligned with the ground truth, although some discrepancies still remain. Adding MambaVision and HAF further refines the segmentation, with the contours becoming progressively clearer. The final model, with all components (OPS, MambaVision, HAF, PSA), shows a segmentation result that closely matches the ground truth, with minimal difference. The comparison in the second row clearly demonstrates that OPS provides a substantial improvement in segmentation performance. This is primarily due to the enhanced quality of labels generated by the SAM, which leads to more accurate and reliable segmentation results, bringing the network’s performance closer to the true segmentation.

4.5. Diversity Validation

To comprehensively evaluate the diversity and generalization capability of the proposed Mamba–ResNet hierarchical fusion network, we conducted extensive experiments on the publicly available CrackLS315 dataset. This dataset consists of a variety of real-world concrete crack images, making it a suitable benchmark for validating the robustness of our approach. We compared our model against several state-of-the-art segmentation networks, including DeepLabV3+ [62], SegFormer [63], HRViT [64], ABCNet [65], PIDNet [66], SeaFormer [67], Semi-Mamba [68], and SCTNet [69]. These models represent a range of architectures, from transformer-based designs to CNN-based and hybrid approaches, ensuring a fair and comprehensive evaluation. The evaluation metrics used in this experiment were MIoU, MPA, and MDice.
As shown in Table 10, our proposed method achieved an MIoU of 69.01%, an MPA of 82.19%, and an MDice of 77.73%, outperforming all the compared networks. Notably, compared with traditional CNN-based models like DeepLabV3+ and Transformer-based models such as HRViT and SeaFormer, our model exhibits significant improvements. These improvements highlight the efficacy of our hierarchical fusion strategy in feature extraction and integration. The results demonstrate that our method effectively combines local and global contextual information, resulting in enhanced segmentation performance across diverse scenarios.
Figure 16 illustrates the experimental crack segmentation visualization results on the CrackLS315 dataset. This visual evidence further confirms the superiority of our approach in accurately delineating crack regions, even in cases of extremely fine cracks and challenging conditions.

5. Discussion

The proposed framework has demonstrated significant potential in concrete bridge damage segmentation. Beyond this specific application, the underlying principles of our architecture suggest its adaptability to a broader range of infrastructure damage detection tasks. Given that our approach combines hierarchical feature fusion with attention mechanisms for effective feature extraction of surface defects to remove other interfering backgrounds, it can be extended to the detection of structural defects in other infrastructure domains, such as pavement collapse [78], tunnel cracks [79], and dam surface erosion [80]. One key advantage of our framework is its ability to leverage foundational models like SAM, which enables robust feature extraction even in scenarios where labeled training data are scarce. This property suggests promising applicability in scenarios where high-quality annotations are difficult to obtain, such as corrosion detection in steel structures or defect identification in historical buildings, where manual labeling remains a significant challenge [81,82]. Furthermore, the effectiveness of our method in segmenting bridge damages indicates that it could be generalized to other multi-scale damage detection tasks. For example, in railway track inspection [83], where damages can appear at varying resolutions and orientations, our hierarchical fusion strategy could enhance segmentation accuracy. Similarly, in offshore wind turbine maintenance, where visual inspection is often constrained by harsh environmental conditions, SAM-guided segmentation could facilitate automated damage detection using drone imagery [84].
In the pursuit of enhancing concrete bridge damage segmentation, future research should explore the integration of few-shot learning and unsupervised learning methodologies. These approaches have the potential to significantly reduce the dependency on extensive labeled datasets while maintaining robust segmentation performance. Few-shot learning aims to enable models to generalize from limited annotated samples, which is particularly beneficial for damage segmentation where acquiring labeled data is labor-intensive and costly. Unsupervised learning approaches, such as self-supervised representation learning and contrastive learning, offer a promising direction for bridge damage segmentation. These methods enable models to extract meaningful features from large-scale unlabeled datasets, reducing reliance on manually annotated damage images. The fusion of few-shot learning and unsupervised learning could lead to a more data-efficient and adaptable segmentation framework. A hybrid pipeline could leverage self-supervised pre-training for feature extraction, followed by fine-tuning with few-shot learning strategies on specific damage categories. This integration may reduce annotation costs while maintaining high segmentation precision, particularly in real-world scenarios where annotated datasets are scarce. Future studies should explore optimal network architectures and training paradigms that effectively combine these learning paradigms to maximize segmentation performance.
The integration of Siamese neural networks (SNNs) into our proposed framework holds significant potential. SNNs have been widely used in image similarity tasks due to their ability to learn discriminative feature embeddings for paired input comparisons [85]. By incorporating SNNs, our framework could enhance damage detection accuracy by improving feature differentiation between damaged and undamaged regions. Moreover, SNNs could improve the adaptability of our framework to unseen damage patterns by learning fine-grained similarities across different bridge structures. This would be particularly beneficial for real-world applications where pre-existing datasets may not fully capture the diversity of damage manifestations. Additionally, the combination of SNNs and our proposed hierarchical fusion network could reduce false positives by emphasizing structural consistency in segmentation predictions.
Concrete bridges often suffer from large-area spalling, which not only compromises structural integrity but also complicates damage detection due to the similarity between damage and background, as well as the presence of complex textures. These factors pose significant challenges for current deep learning-based segmentation methods, limiting their ability to distinguish defects from surrounding environments and leading to reduced accuracy and robustness. To address these challenges, future research should focus on integrating prior knowledge fusion with graph neural networks (GNNs) [86] to enhance defect segmentation performance. Specifically, prior knowledge should incorporate spatial relationships and co-occurrence patterns of defects, as certain types of defects often appear in specific locations or in relation to other defects due to structural stress distribution and material degradation processes. By encoding these spatial and co-occurrence constraints into the learning model, segmentation accuracy can be significantly improved, as the model learns to predict defect locations with higher contextual awareness. Moreover, GNNs can effectively model these spatial dependencies and relational structures, allowing the network to capture the underlying geometric and topological patterns of defect distribution [87]. By constructing a graph representation of the defects, GNNs enable the model to refine segmentation results by leveraging the correlations between different defect types. Additionally, integrating multi-scale attention mechanisms with GNNs can further enhance segmentation by dynamically focusing on critical defect areas while suppressing background noise.
We plan to represent each segmented image as a graph $G = (V, E)$, where each node $v_i \in V$ corresponds to a damage region (e.g., cracks, spalling), and each edge $e_{ij} \in E$ encodes spatial adjacency or structural dependency between regions. The feature vector of each node, $h_i^{(0)}$, is initialized using region-level CNN embeddings extracted from our current segmentation backbone. We will apply a standard GCN layer, defined as
$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$
where $\tilde{A} = A + I$ is the adjacency matrix with self-connections, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $H^{(l)}$ is the node feature matrix at layer $l$, and $W^{(l)}$ is the learnable weight matrix of that layer. The activation function $\sigma(\cdot)$ can be ReLU or LeakyReLU. We will train this GNN integration model on the image dataset generated by the framework proposed in this paper and evaluate it using metrics such as MIoU. The contribution of the GNN module to the overall performance will also be assessed through ablation experiments.
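A minimal PyTorch sketch of this propagation rule is given below. Since the GNN extension is planned future work, the layer dimensions, adjacency construction, and example inputs are purely illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution step: H' = sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)
        self.act = nn.ReLU()

    def forward(self, h, adj):
        # adj: (N, N) binary adjacency between damage regions, h: (N, in_dim) node features
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)   # add self-connections
        d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))       # D^{-1/2}
        a_norm = d_inv_sqrt @ a_tilde @ d_inv_sqrt                  # symmetric normalization
        return self.act(a_norm @ self.weight(h))                    # sigma(A_hat H W)

# Example: 5 damage regions with 64-dim CNN embeddings
h0 = torch.randn(5, 64)
adj = (torch.rand(5, 5) > 0.5).float()
adj = ((adj + adj.t()) > 0).float()   # symmetrize spatial adjacency
out = GCNLayer(64, 32)(h0, adj)       # (5, 32) refined node features
```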

6. Conclusions

In this paper, we have introduced a novel training framework for concrete bridge damage segmentation that significantly reduces the dependency on extensive annotated data while enhancing segmentation efficiency and accuracy. Our approach leverages the Segment Anything Model (SAM) to generate high-quality segmentation masks, which are then used as supplementary supervision labels to train a damage segmentation network. To ensure the reliability and precision of the masks generated by the SAM, we proposed an Optimizing Prompting Strategy with saliency information, which improves the quality of the prompts provided to SAM, thereby enhancing the accuracy of the generated masks. Furthermore, we presented a trainable semantic segmentation network that employs a multi-level feature extraction architecture. In this architecture, MambaVision serves as the primary backbone network, enabling the extraction of rich and diverse semantic features from concrete bridge damages at varying levels. To effectively integrate these multi-level features and improve segmentation performance, we introduced a Hierarchical Attention Fusion (HAF) mechanism. The HAF mechanism includes a Boosting Textures Module (BTM), which is designed to strengthen the representation of discriminative texture features, thereby enhancing the network’s ability to capture fine-grained details in concrete bridge damages. Additionally, we utilized a Polarized Self-Attention (PSA) decoder as the feature decoder, which predicts precise segmentation results for concrete bridge damages. Our experimental results demonstrate that the proposed framework achieves state-of-the-art performance in concrete bridge damage segmentation, outperforming existing methods in terms of both accuracy and efficiency. The integration of SAM-guided mask generation, the Optimizing Prompting Strategy, and the Hierarchical Attention Fusion mechanism with its Boosting Textures Module, together with the robust feature extraction capabilities of MambaVision, collectively contribute to the robustness and effectiveness of our approach.

Author Contributions

Conceptualization, J.Y. and S.J.; methodology, H.L. and X.Y.; software, H.L.; validation, J.Y. and S.J.; investigation, X.Y.; resources, J.Y.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, J.Y. and S.J.; visualization, H.L. and X.Y.; supervision, J.Y.; project administration, J.Y. and S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62003063, the National Natural Science Foundation of China under Grant 62103068, the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJZD-M202000702, the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJQN202000726, the Science and Technology Research Program of Chongqing Municipal Education Commission under Grant KJQN202100748, the Science and Technology Research Program of Chongqing Municipal Education Commission of China under Grant KJQN202200720, and the Graduate Student Research Innovation Project of Chongqing under Grant CYB240260.

Data Availability Statement

The data generated and/or analyzed during the current study are not publicly available due to legal/ethical reasons but are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xiao, Q.; Li, R.; Yang, J.; Chen, Y.; Jiang, S.; Wang, D. TPKE-QA: A gapless few-shot extractive question answering approach via task-aware post-training and knowledge enhancement. Expert Syst. Appl. 2024, 254, 124475. [Google Scholar]
  2. Khan, S.M.; Atamturktur, S.; Chowdhury, M.; Rahman, M. Integration of structural health monitoring and intelligent transportation systems for bridge condition assessment: Current status and future direction. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2107–2122. [Google Scholar]
  3. Bhattacharya, G.; Mandal, B.; Puhan, N.B. Multi-deformation aware attention learning for concrete structural defect classification. IEEE Trans. Circuits Syst. Video Technol. 2020, 31, 3707–3713. [Google Scholar]
  4. Bhattacharya, G.; Mandal, B.; Puhan, N.B. Interleaved deep artifacts-aware attention mechanism for concrete structural defect classification. IEEE Trans. Image Process. 2021, 30, 6957–6969. [Google Scholar]
  5. Wan, H.; Gao, L.; Yuan, Z.; Qu, H.; Sun, Q.; Cheng, H.; Wang, R. A novel transformer model for surface damage detection and cognition of concrete bridges. Expert Syst. Appl. 2023, 213, 119019. [Google Scholar]
  6. Jeong, E.; Seo, J.; Wacker, J.P. UAV-aided bridge inspection protocol through machine learning with improved visibility images. Expert Syst. Appl. 2022, 197, 116791. [Google Scholar]
  7. Hu, X.; Assaad, R.H. The use of unmanned ground vehicles and unmanned aerial vehicles in the civil infrastructure sector: Applications, robotic platforms, sensors, and algorithms. Expert Syst. Appl. 2023, 232, 120897. [Google Scholar]
  8. Chen, H.M.; Hou, C.C.; Wang, Y.H. A 3D visualized expert system for maintenance and management of existing building facilities using reliability-based method. Expert Syst. Appl. 2013, 40, 287–299. [Google Scholar]
  9. Huang, L.; Fan, G.; Li, J.; Hao, H. Deep learning for automated multiclass surface damage detection in bridge inspections. Autom. Constr. 2024, 166, 105601. [Google Scholar]
  10. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar]
  11. Chu, H.; Deng, L.; Yuan, H.; Long, L.; Guo, J. A transformer and self-cascade operation-based architecture for segmenting high-resolution bridge cracks. Autom. Constr. 2024, 158, 105194. [Google Scholar] [CrossRef]
  12. Amirkhani, D.; Allili, M.S.; Hebbache, L.; Hammouche, N.; Lapointe, J.F. Visual Concrete Bridge Defect Classification and Detection Using Deep Learning: A Systematic Review. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10483–10505. [Google Scholar]
  13. Gadetsky, A.; Brbic, M. The pursuit of human labeling: A new perspective on unsupervised learning. Adv. Neural Inf. Process. Syst. 2024, 36, 60527–60546. [Google Scholar]
  14. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
  15. Chen, C.; Miao, J.; Wu, D.; Zhong, A.; Yan, Z.; Kim, S.; Hu, J.; Liu, Z.; Sun, L.; Li, X.; et al. Ma-sam: Modality-agnostic sam adaptation for 3d medical image segmentation. Med. Image Anal. 2024, 98, 103310. [Google Scholar] [CrossRef]
  16. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
  17. Jiang, S.; Tang, C.; Yang, J.; Li, H.; Zhang, T.; Li, R.; Wang, D.; Wu, J. TSCB-Net: Transformer-enhanced semantic segmentation of surface damage of concrete bridges. Struct. Infrastruct. Eng. 2024, 1–10. [Google Scholar]
  18. Du, H.; Wang, H.; Zhang, X.; Peng, H.; Gao, R.; Zheng, X.; Tong, Y.; Shan, Y.; Pan, Z.; Huang, H. Automated intelligent measurement of cracks on bridge piers using a ring-climbing vision scanning operation robot. Measurement 2024, 237, 115197. [Google Scholar]
  19. Rubio, J.J.; Kashiwa, T.; Laiteerapong, T.; Deng, W.; Nagai, K.; Escalera, S.; Nakayama, K.; Matsuo, Y.; Prendinger, H. Multi-class structural damage segmentation using fully convolutional networks. Comput. Ind. 2019, 112, 103121. [Google Scholar]
  20. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  21. Shi, J.; Dang, J.; Cui, M.; Zuo, R.; Shimizu, K.; Tsunoda, A.; Suzuki, Y. Improvement of damage segmentation based on pixel-level data balance using vgg-unet. Appl. Sci. 2021, 11, 518. [Google Scholar] [CrossRef]
  22. Deng, W.; Mou, Y.; Kashiwa, T.; Escalera, S.; Nagai, K.; Nakayama, K.; Matsuo, Y.; Prendinger, H. Vision based pixel-level bridge structural damage detection using a link ASPP network. Autom. Constr. 2020, 110, 102973. [Google Scholar]
  23. Chen, L.C. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  24. Narazaki, Y.; Hoskere, V.; Yoshida, K.; Spencer, B.F.; Fujino, Y. Synthetic environments for vision-based structural condition assessment of Japanese high-speed railway viaducts. Mech. Syst. Signal Process. 2021, 160, 107850. [Google Scholar] [CrossRef]
  25. Zou, Q.; Zhang, Z.; Li, Q.; Qi, X.; Wang, Q.; Wang, S. Deepcrack: Learning hierarchical convolutional features for crack detection. IEEE Trans. Image Process. 2018, 28, 1498–1512. [Google Scholar] [CrossRef]
  26. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  27. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535. [Google Scholar] [CrossRef]
  28. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  29. Choi, W.; Cha, Y.J. SDDNet: Real-time crack segmentation. IEEE Trans. Ind. Electron. 2019, 67, 8016–8025. [Google Scholar] [CrossRef]
  30. Beckman, G.H.; Polyzois, D.; Cha, Y.J. Deep learning-based automatic volumetric damage quantification using depth camera. Autom. Constr. 2019, 99, 114–124. [Google Scholar] [CrossRef]
  31. Zhang, Y.; Chen, B.; Wang, J.; Li, J.; Sun, X. APLCNet: Automatic pixel-level crack detection network based on instance segmentation. IEEE Access 2020, 8, 199159–199170. [Google Scholar] [CrossRef]
  32. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  33. Li, K.; Wang, B.; Tian, Y.; Qi, Z. Fast and accurate road crack detection based on adaptive cost-sensitive loss function. IEEE Trans. Cybern. 2021, 53, 1051–1062. [Google Scholar] [CrossRef]
  34. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 205–218. [Google Scholar]
  35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.U.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  36. Liu, C.; Zhang, D.; Zhang, S. Characteristics and treatment measures of lining damage: A case study on a mountain tunnel. Eng. Fail. Anal. 2021, 128, 105595. [Google Scholar] [CrossRef]
  37. Xu, Z.; Guan, H.; Kang, J.; Lei, X.; Ma, L.; Yu, Y.; Chen, Y.; Li, J. Pavement crack detection from CCD images with a locally enhanced transformer network. Int. J. Appl. Earth Obs. Geoinf. 2022, 110, 102825. [Google Scholar]
  38. Shamsabadi, E.A.; Xu, C.; Rao, A.S.; Nguyen, T.; Ngo, T.; Dias-da Costa, D. Vision transformer-based autonomous crack detection on asphalt and concrete surfaces. Autom. Constr. 2022, 140, 104316. [Google Scholar] [CrossRef]
  39. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  40. Zhou, W.; Huang, H.; Zhang, H.; Wang, C. Teaching Segment-Anything-Model Domain-Specific Knowledge for Road Crack Segmentation From On-Board Cameras. IEEE Trans. Intell. Transp. Syst. 2024, 25, 20588–20601. [Google Scholar]
  41. Wang, C.; Chen, H.; Zhou, X.; Wang, M.; Zhang, Q. SAM-IE: SAM-based image enhancement for facilitating medical image diagnosis with segmentation foundation model. Expert Syst. Appl. 2024, 249, 123795. [Google Scholar]
  42. Zhou, Z.; Lu, Y.; Bai, J.; Campello, V.M.; Feng, F.; Lekadir, K. Segment Anything Model for fetal head-pubic symphysis segmentation in intrapartum ultrasound image analysis. Expert Syst. Appl. 2024, 263, 125699. [Google Scholar] [CrossRef]
  43. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  44. Teng, S.; Liu, A.; Situ, Z.; Chen, B.; Wu, Z.; Zhang, Y.; Wang, J. Plug-and-play method for segmenting concrete bridge cracks using the segment anything model with a fractal dimension matrix prompt. Autom. Constr. 2025, 170, 105906. [Google Scholar]
  45. Li, W.; Liu, W.; Zhu, J.; Cui, M.; Hua, R.Y.X.; Zhang, L. Box2mask: Box-supervised instance segmentation via level-set evolution. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5157–5173. [Google Scholar] [PubMed]
  46. Ding, Y.; Liu, H. Barely-supervised Brain Tumor Segmentation via Employing Segment Anything Model. IEEE Trans. Circuits Syst. Video Technol. 2024, 35, 2975–2986. [Google Scholar] [CrossRef]
  47. Hatamizadeh, A.; Kautz, J. Mambavision: A hybrid mamba-transformer vision backbone. arXiv 2024, arXiv:2407.08083. [Google Scholar]
  48. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  49. Mengiste, E.; Mannem, K.R.; Prieto, S.A.; Garcia de Soto, B. Transfer-learning and texture features for recognition of the conditions of construction materials with small data sets. J. Comput. Civ. Eng. 2024, 38, 04023036. [Google Scholar] [CrossRef]
  50. Xie, J.; Li, G.; Zhang, L.; Cheng, G.; Zhang, K.; Bai, M. Texture feature-aware consistency for semi-supervised honeycomb lung lesion segmentation. Expert Syst. Appl. 2024, 258, 125119. [Google Scholar] [CrossRef]
  51. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  52. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  53. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  54. Wang, J.; Yao, H.; Hu, J.; Ma, Y.; Wang, J. Dual-encoder network for pavement concrete crack segmentation with multi-stage supervision. Autom. Constr. 2025, 169, 105884. [Google Scholar] [CrossRef]
  55. Ioffe, S. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  56. Zhang, Y.; Yin, J.; Gu, Y.; Chen, Y. Multi-level Feature Attention Network for medical image segmentation. Expert Syst. Appl. 2024, 263, 125785. [Google Scholar] [CrossRef]
  57. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  58. Li, X.; Ding, H.; Yuan, H.; Zhang, W.; Pang, J.; Cheng, G.; Chen, K.; Liu, Z.; Loy, C.C. Transformer-based visual segmentation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10138–10163. [Google Scholar] [CrossRef]
  59. Azad, R.; Aghdam, E.K.; Rauland, A.; Jia, Y.; Avval, A.H.; Bozorgpour, A.; Karimijafarbigloo, S.; Cohen, J.P.; Adeli, E.; Merhof, D. Medical image segmentation review: The success of u-net. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10076–10095. [Google Scholar] [CrossRef]
  60. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  61. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  62. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  63. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  64. Gu, J.; Kwon, H.; Wang, D.; Ye, W.; Li, M.; Chen, Y.H.; Lai, L.; Chandra, V.; Pan, D.Z. Multi-scale high-resolution vision transformer for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 12094–12103. [Google Scholar]
  65. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  66. Xu, J.; Xiong, Z.; Bhattacharyya, S.P. PIDNet: A real-time semantic segmentation network inspired by PID controllers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19529–19539. [Google Scholar]
  67. Wan, Q.; Huang, Z.; Lu, J.; Gang, Y.; Zhang, L. Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  68. Wang, Z.; Zheng, J.Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv 2024, arXiv:2402.05079. [Google Scholar]
  69. Xu, Z.; Wu, D.; Yu, C.; Chu, X.; Sang, N.; Gao, C. SCTNet: Single-Branch CNN with Transformer Semantic Information for Real-Time Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 6378–6386. [Google Scholar]
  70. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  71. Lin, X.; Yan, Z.; Deng, X.; Zheng, C.; Yu, L. ConvFormer: Plug-and-play CNN-style transformers for improving medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Vancouver, BC, Canada, 8–12 October 2023; Springer: Cham, Switzerland, 2023; pp. 642–651. [Google Scholar]
  72. Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 32–42. [Google Scholar]
  73. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 23–28 June 2018; pp. 7132–7141. [Google Scholar]
  74. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  75. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  76. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  77. Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  78. Wang, S.; Jiao, H.; Su, X.; Yuan, Q. An ensemble learning approach with attention mechanism for detecting pavement distress and disaster-induced road damage. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13667–13681. [Google Scholar] [CrossRef]
  79. Qiu, J.; Liu, D.; Zhao, K.; Lai, J.; Wang, X.; Wang, Z.; Liu, T. Influence spatial behavior of surface cracks and prospects for prevention methods in shallow loess tunnels in China. Tunn. Undergr. Space Technol. 2024, 143, 105453. [Google Scholar] [CrossRef]
  80. Kang, F.; Huang, B.; Wan, G. Automated detection of underwater dam damage using remotely operated vehicles and deep learning technologies. Autom. Constr. 2025, 171, 105971. [Google Scholar] [CrossRef]
  81. Li, Z.; Shao, P.; Zhao, M.; Yan, K.; Liu, G.; Wan, L.; Xu, X.; Li, K. Optimized deep learning for steel bridge bolt corrosion detection and classification. J. Constr. Steel Res. 2024, 215, 108570. [Google Scholar] [CrossRef]
  82. Karimi, N.; Valibeig, N.; Rabiee, H.R. Deterioration detection in historical buildings with different materials based on novel deep learning methods with focusing on isfahan historical bridges. Int. J. Archit. Herit. 2024, 18, 981–993. [Google Scholar]
  83. Germoglio Barbosa, I.; Lima, A.d.O.; Edwards, J.R.; Dersch, M.S. Development of track component health indices using image-based railway track inspection data. Proc. Inst. Mech. Eng. Part F J. Rail Rapid Transit 2024, 238, 706–716. [Google Scholar]
  84. Zhang, K.; Pakrashi, V.; Murphy, J.; Hao, G. Inspection of floating offshore wind turbines using multi-rotor unmanned aerial vehicles: Literature review and trends. Sensors 2024, 24, 911. [Google Scholar] [CrossRef]
  85. Yang, X.; Peng, P.; Li, D.; Ye, Y.; Lu, X. Adaptive decoupling-fusion in Siamese network for image classification. Neural Netw. 2025, 187, 107346. [Google Scholar]
  86. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef]
  87. Yang, J.; Li, H.; Zhang, L.; Zhao, L.; Jiang, S.; Xie, H. Multi-label Concrete Bridge Damage Classification Using Graph Convolution. IEEE Trans. Instrum. Meas. 2024, 73, 1–14. [Google Scholar]
Figure 1. The overall framework of the proposed method structure.
Figure 2. Overview of Segment Anything Model. Different colored points indicate different categories of prompt points.
Figure 3. Overview of Segment Anything Model. The green points represent prompt points generated by Prompting optimization.
Figure 4. The architecture of the Mamba–ResNet hierarchical fusion network.
Figure 5. The architecture of MambaVision models.
Figure 6. The architecture of the MambaVision block.
Figure 7. The architecture of Hierarchical Attention Fusion.
Figure 8. The architecture of Polarized Self-Attention.
Figure 9. Example of concrete bridge damage image.
Figure 10. Example of concrete bridge damage annotation.
Figure 11. Details of the annotation protocol.
Figure 12. Characteristics of the dataset.
Figure 13. Example of concrete bridge damage pixel-level annotations.
Figure 14. Visualization results of comparison with other methods. (a) Image, (b) ground truth, (c) DeepLabV3, (d) SegFormer, (e) HRViT, (f) ABCNet, (g) PIDNet, (h) SeaFormer, (i) SemiMamba, (j) SCTNet, (k) OUR.
Figure 15. Visualization results of the overall framework ablation studies. (a) Image, (b) ground truth, (c) baseline, (d) baseline + OPS, (e) baseline + OPS + MambaVision, (f) baseline + OPS + MambaVision + HAF, (g) baseline + OPS + MambaVision + HAF + PSA.
Figure 16. Visualization results of comparison with other methods on CrackLS315 dataset. (a) Image, (b) ground truth, (c) DeepLabV3, (d) SegFormer, (e) HRViT, (f) ABCNet, (g) PIDNet, (h) SeaFormer, (i) SemiMamba, (j) SCTNet, (k) OUR.
Table 1. Statistical information on concrete bridge damage dataset.
 | Spall | Rebar | Corrosion | Crack | Hole | Speckle | Total
Number of boxes | 6921 | 9901 | 2486 | 2735 | 1260 | 1092 | 30,458
Number of images | 3365 | 2711 | 2486 | 1287 | 633 | 557 | 7354
Table 2. Results of comparison with different methods.
Method | Backbone | MIoU (%) | PA (%) | MDice (%) | Model Parameters (M) | FPS | Rebar IoU (%) | Spall IoU (%) | Corrosion IoU (%) | Crack IoU (%) | Hole IoU (%) | Speckle IoU (%) | Background IoU (%)
DeepLabV3+ | ResNet50 | 54.45 | 69.84 | 70.91 | 42.03 | 0.2 | 48.05 | 60.86 | 50.52 | 41.49 | 20.57 | 62.77 | 96.94
SegFormer | MiT-B0 | 56.45 | 71.33 | 72.96 | 84.72 | 0.3 | 50.82 | 62.25 | 51.55 | 45.16 | 23.55 | 64.91 | 96.95
HRViT | HRViT-B3 | 57.11 | 72.47 | 73.88 | 28.65 | 0.6 | 51.05 | 63.27 | 52.57 | 46.37 | 23.73 | 65.84 | 96.97
ABCNet | ResNet101 | 58.31 | 72.55 | 74.68 | 14.64 | 0.2 | 51.08 | 64.13 | 56.75 | 48.37 | 24.05 | 66.89 | 96.92
PIDNet | - | 58.20 | 72.35 | 73.72 | 36.93 | 1.1 | 51.12 | 64.06 | 56.72 | 48.16 | 21.77 | 68.65 | 96.95
SeaFormer | MiT-B4 | 58.70 | 73.48 | 73.35 | 17.16 | 2.8 | 50.86 | 63.67 | 57.21 | 48.90 | 23.82 | 69.54 | 96.95
Semi-Mamba | VMamba | 59.09 | 73.21 | 74.86 | 65.52 | 5.6 | 52.23 | 64.34 | 57.34 | 48.73 | 25.88 | 68.23 | 96.93
SCTNet | - | 57.90 | 72.51 | 74.03 | 17.46 | 1.2 | 49.86 | 63.62 | 57.18 | 48.23 | 23.88 | 65.63 | 96.91
OUR | MambaVision | 60.13 | 74.02 | 75.40 | 106.71 | 5.3 | 53.55 | 64.74 | 57.24 | 49.75 | 26.92 | 71.76 | 96.98
Table 3. Results of the overall framework ablation studies. ✓ represents the use of this experimental setup.
Exp. | Baseline | MambaVision | OPS | HAF | BTM | PSA | MIoU (%) | PA (%) | MDice (%)
#1 | ✓ |  |  |  |  |  | 53.28 | 68.23 | 70.22
#2 | ✓ |  | ✓ |  |  |  | 54.67 | 69.93 | 71.03
#3 | ✓ | ✓ |  |  |  |  | 56.85 | 71.67 | 73.31
#4 | ✓ | ✓ | ✓ |  |  |  | 57.69 | 72.79 | 73.68
#5 | ✓ | ✓ | ✓ | ✓ |  |  | 58.04 | 73.10 | 73.91
#6 | ✓ | ✓ | ✓ | ✓ | ✓ |  | 59.21 | 73.58 | 74.57
#7 | ✓ | ✓ | ✓ | ✓ |  | ✓ | 58.42 | 73.21 | 74.11
#8 | ✓ | ✓ | ✓ |  |  | ✓ | 58.31 | 73.25 | 74.08
#9 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 60.13 | 74.02 | 75.40
Table 4. Results of the different SAM prompting strategies. ✓ represents the use of this experimental setup.
Point | Box | Box + Point | MIoU (%) | MPA (%) | MDice (%)
✓ |  |  | 57.28 | 72.34 | 73.56
 | ✓ |  | 58.67 | 73.11 | 74.22
 |  | ✓ | 60.13 | 74.02 | 75.40
Table 5. Results of ablation study on main backbone.
Backbone Networks | MIoU (%) | PA (%) | MDice (%)
MobileViT | 58.26 | 73.57 | 73.91
Swin Transformer | 59.21 | 73.78 | 74.21
ConvFormer | 58.13 | 73.21 | 74.34
CaiT | 58.78 | 73.65 | 74.21
MambaVision | 60.13 | 74.02 | 75.40
Table 6. Results of the ablation study on the individual contributions of different backbones. ✓ represents the use of this experimental setup.
ResNet | MambaVision | MIoU (%) | MPA (%) | MDice (%)
✓ |  | 54.33 | 69.67 | 71.24
 | ✓ | 58.42 | 73.36 | 74.11
✓ | ✓ | 60.13 | 74.02 | 75.40
Table 7. Results of ablation study on attention module decoders.
Attention Module | MIoU (%) | PA (%) | MDice (%)
SE | 59.32 | 73.68 | 74.77
CBAM | 59.51 | 73.79 | 74.83
SKA | 59.59 | 73.85 | 74.95
EMA | 60.02 | 73.92 | 75.21
GAM | 59.48 | 73.72 | 74.78
PSA | 60.13 | 74.02 | 75.40
Table 8. Results of ablation study on loss function.
Loss Function | MIoU (%) | PA (%) | MDice (%)
$L_{MAE}$ | 58.11 | 72.56 | 73.62
$L_{CE}$ | 59.52 | 73.73 | 74.76
$L_{Dice}$ | 59.89 | 73.95 | 75.32
$L_{CE} + L_{Dice}$ | 60.13 | 74.02 | 75.40
Table 9. Results of hyperparametric ablation experiments.
Lr | MIoU (%) | MPA (%) | MDice (%)
0.0001 | 58.27 | 73.35 | 73.64
0.0005 | 59.32 | 73.88 | 74.36
0.001 | 60.13 | 74.02 | 75.40
0.005 | 58.77 | 73.72 | 74.41
0.01 | 58.68 | 73.62 | 74.12
Table 10. Results of comparison with different methods on CrackLS315 dataset.
Method | Backbone | MIoU (%) | MPA (%) | MDice (%) | Model Parameters (M) | FPS
DeepLabV3+ | ResNet50 | 65.89 | 75.44 | 74.88 | 42.03 | 0.2
SegFormer | MiT-B0 | 66.22 | 76.33 | 75.13 | 84.72 | 0.3
HRViT | HRViT-B3 | 67.22 | 80.62 | 76.12 | 28.65 | 0.6
ABCNet | ResNet101 | 66.41 | 76.21 | 75.78 | 14.64 | 0.2
PIDNet | - | 67.89 | 80.89 | 76.69 | 36.93 | 1.1
SeaFormer | MiT-B4 | 68.06 | 81.86 | 76.85 | 17.16 | 2.8
Semi-Mamba | VMamba | 67.21 | 80.58 | 76.26 | 65.52 | 5.6
SCTNet | - | 66.32 | 80.51 | 76.01 | 17.46 | 1.2
OUR | MambaVision | 69.01 | 82.19 | 77.73 | 106.71 | 5.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
