Boundary-Sensitive Hybrid Attention Network for Multi-Scale Crack Fine Segmentation
Highlights
- The BSA-Net model achieves competitive performance against existing crack detection models, with improved results in complex environments with weak contrast and noisy backgrounds.
- The proposed model improves segmentation accuracy, particularly in fine-scale crack detection, by leveraging multi-scale feature extraction and boundary refinement mechanisms.
- BSA-Net offers a robust and scalable solution for automated infrastructure monitoring, providing accurate and reliable crack segmentation in real-world applications.
- The method can be adapted for use in other engineering fields that require high-precision segmentation of fine details in challenging imaging conditions.
Abstract
1. Introduction
- A hierarchical encoder tailored to crack scenarios, Hiera-A, is proposed. It incorporates a lightweight adapter module (Adapter) into the backbone to achieve efficient parameter transfer and feature alignment. This approach reduces fine-tuning costs while improving generalization from public training domains to engineering deployment domains.
- A multi-scale feature enhancement module, Light-ASPP, is proposed, which utilizes depthwise separable/dilated convolutions and attention recalibration to achieve multi-receptive field context aggregation. This effectively enhances crack responses and suppresses complex background texture interference.
- A boundary-aware dual-branch decoder, BAD, is proposed, which explicitly injects high-frequency boundary cues into the segmentation decoding flow. This alleviates edge blurring, breakage, and adhesion, yielding fine crack segmentation outputs for engineering scenarios.
2. BSA-Net: Fine Crack Segmentation Network Architecture
2.1. Overall Network Architecture
- 1. Patch Embed: The input image is mapped into a sequence of tokens, and positional encoding is introduced to preserve spatial structure information.
- 2. Hiera-A: Multi-scale features are learned by stacking encoding blocks with hierarchical downsampling. In the diagram, block 0, block 4, block 20, and block 23 mark the outputs of the different levels; features at different scales represent local textures, edge details, and larger semantic contexts. For convenience in subsequent descriptions, the outputs of the encoder at the four scales are denoted as: , , , . The channel width is doubled stage by stage. This configuration slightly increases the channel capacity of the hierarchical features and helps preserve more discriminative semantic and boundary cues for fine crack segmentation.
- 3. Light-ASPP: To handle the multi-scale nature of crack targets, multi-scale context enhancement is applied to mid-to-high-level features , producing to expand the effective receptive field and preserve fine-grained boundary information.
- 4. Boundary-Aware Decoder: The main semantic segmentation branch (Seg branch) is built using multi-scale fusion based on the Feature Pyramid Network (FPN), while a boundary branch (Boundary branch) is designed to explicitly learn the crack boundary response.
- 5. Output Fusion: The boundary branch and segmentation branch are fused at the output to produce the final crack segmentation result.
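The stage-by-stage channel doubling described for Hiera-A can be illustrated with a small shape trace. This is a minimal sketch, not the authors' configuration: the input size (512), base channel width (96), patch stride (4), and the halving of spatial resolution at every stage are assumed values chosen only for illustration. Only the channel doubling and the use of four hierarchical output scales come from the text.

```python
def stage_shapes(input_size, base_channels, patch_stride=4, num_stages=4):
    """Return (height, width, channels) for each encoder stage.

    Assumptions (hypothetical, not taken from the paper):
    - the patch embedding reduces resolution by `patch_stride`;
    - every later stage halves the spatial resolution.
    The channel width doubles stage by stage, as stated in the text.
    """
    shapes = []
    size = input_size // patch_stride
    channels = base_channels
    for _ in range(num_stages):
        shapes.append((size, size, channels))
        size //= 2       # assumed hierarchical downsampling factor
        channels *= 2    # channel width doubled stage by stage
    return shapes


# Hypothetical 512 x 512 input with a base width of 96 channels.
for index, shape in enumerate(stage_shapes(512, 96), start=1):
    print(f"stage {index}: {shape}")
```

Doubling channels while halving resolution keeps the per-stage feature volume from collapsing, which is one common reason hierarchical encoders adopt this pattern.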
2.2. Hiera-A Model Based on the Improved SAM2-Hiera Architecture
2.2.1. Hiera-A Backbone Hierarchical Transformer Encoding
2.2.2. Hybrid Attention Mechanism
2.2.3. Lightweight Adapter
2.3. Multi-Scale Feature Enhancement Mechanism
2.3.1. Lightweight Atrous Spatial Pyramid
- Branch 1: 3 × 3 Dilated Convolution with dilation rate , output ;
- Branch 2: 1 × 1 Convolution, output ;
- Branch 3: 3 × 3 Dilated Convolution with dilation rate , output ;
- Branch 4: Global Average Pooling followed by Upsampling to , output .
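To make the effect of the dilated and depthwise separable convolutions concrete, the sketch below computes the effective kernel size of a dilated convolution and compares weight counts for a standard versus a depthwise separable convolution. The dilation rates and channel counts used here are hypothetical, since the actual rates are not recoverable from the text above; biases are ignored.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1)(d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def conv_params(kernel, c_in, c_out):
    """Weight count of a standard convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weight count of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    return kernel * kernel * c_in + c_in * c_out


# Hypothetical dilation rates, used only to illustrate the formula.
HYPOTHETICAL_RATES = (1, 2, 4)
for rate in HYPOTHETICAL_RATES:
    print(f"3x3 kernel, dilation {rate}: covers {effective_kernel(3, rate)} pixels per side")

# Hypothetical channel counts (256 in, 256 out).
standard = conv_params(3, 256, 256)
separable = separable_conv_params(3, 256, 256)
print(f"standard: {standard} weights, separable: {separable} weights")
```

A dilated 3 × 3 kernel widens the receptive field without adding weights, and the separable factorization cuts the weight count substantially, which is consistent with the "lightweight" goal of the module.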
2.3.2. Fusion and Dimensionality Reduction
2.3.3. Feature Refinement Based on Channel and Spatial Attention
- 1. Channel Attention: Perform Global Average Pooling and Max Pooling on along the spatial dimension to obtain a -dimensional descriptor:
- 2. Spatial Attention: Perform mean and max aggregation on along the channel dimension:
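The two pooling steps above can be sketched in plain Python on a tiny feature tensor. This only shows how the pooled descriptors are formed; how they are turned into attention weights (for example through a shared MLP, a convolution, or a sigmoid) is not shown, because those details are not given in the text here. The tensor layout (channels, height, width) and the function names are assumptions made for illustration.

```python
def channel_descriptors(feature):
    """Pool each channel over its spatial extent.

    `feature` is a list of C channels, each an H x W nested list.
    Returns two length-C lists: the spatial average and the spatial maximum.
    """
    averages = [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
                for channel in feature]
    maxima = [max(map(max, channel)) for channel in feature]
    return averages, maxima


def spatial_descriptors(feature):
    """Aggregate across channels at every pixel.

    Returns two H x W maps: the channel-wise mean and the channel-wise maximum.
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    mean_map = [[sum(feature[c][y][x] for c in range(channels)) / channels
                 for x in range(width)] for y in range(height)]
    max_map = [[max(feature[c][y][x] for c in range(channels))
                for x in range(width)] for y in range(height)]
    return mean_map, max_map


# Toy tensor with 2 channels of size 2 x 2 (values are arbitrary).
feature = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 8.0], [0.0, 0.0]],
]
channel_avg, channel_max = channel_descriptors(feature)
spatial_mean, spatial_max = spatial_descriptors(feature)
print(channel_avg, channel_max)
print(spatial_mean, spatial_max)
```

The channel descriptors have one value per channel, while the spatial descriptors keep the full H × W layout, which is why the former can reweight channels and the latter can reweight pixel locations.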
2.4. Boundary-Aware Dual-Branch Decoder
- A multi-scale feature aggregation module based on FPN [26], designed to alleviate the semantic gap caused by scale variations;
- A parallel dual-branch processing architecture, responsible for main semantic segmentation and high-frequency boundary prediction, defined as the Segmentation Branch and the Boundary Branch;
- Dual-branch fusion, where residual connections inject explicit boundary cues into the segmentation flow, enabling pixel-level prediction with refined boundaries.
2.4.1. Multi-Scale Feature Pyramid
2.4.2. Segmentation Branch
- is fused through convolution and upsampled to obtain , with a size of 32 × 32 × 128;
- is concatenated with , then fused through convolution and upsampled to obtain , with a size of 64 × 64 × 64;
- is concatenated with , and fused through convolution to obtain the segmentation feature , with a size of 128 × 128 × 32.
2.4.3. Boundary Branch
2.4.4. Dual-Branch Fusion
2.5. Loss Function and Training Objective
2.5.1. Joint Supervision Modeling
2.5.2. Region Segmentation Loss
2.5.3. Boundary Branch Loss
2.5.4. Boundary Label Generation
2.5.5. Joint Loss Function
3. Experimental Design and Result Analysis
3.1. Experimental Environment Configuration
3.2. Dataset Composition
3.3. Evaluation Metrics
- 1. Mean Intersection over Union (mIoU).
- 2. Precision and Recall.
- 3. F1 Score: The F1 score is an evaluation metric for binary classification tasks that combines precision and recall into a single value measuring classifier performance. Provided that positive and negative samples are defined consistently, it is mathematically equivalent to the Dice coefficient introduced in Section 2.5.
- 4. HD95: The 95% Hausdorff Distance (HD95) measures the boundary accuracy of image segmentation results. In pixel-level segmentation tasks, it quantifies the spatial deviation between predicted boundaries and ground-truth boundaries.
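The metrics listed above can be sketched with the standard library alone. This is an illustrative implementation under stated assumptions, not the authors' evaluation code: masks are nested lists of 0/1 values, mIoU is taken as the mean of the crack and background IoU, and HD95 is computed with a nearest-rank 95th percentile over the pooled directed distances between two non-empty boundary point sets. Other HD95 conventions exist (for example, taking the maximum of the two directed percentiles), and boundary extraction itself is not shown.

```python
import math


def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for two binary masks given as nested lists."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def region_metrics(pred, target):
    """Precision, recall, F1, Dice, and two-class mIoU for binary masks."""
    tp, fp, fn, tn = confusion_counts(pred, target)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": _ratio(2 * precision * recall, precision + recall),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        # Assumed definition: average of crack IoU and background IoU.
        "miou": (_ratio(tp, tp + fp + fn) + _ratio(tn, tn + fp + fn)) / 2,
    }


def hd95(points_a, points_b):
    """95th-percentile symmetric Hausdorff distance (nearest-rank percentile).

    Both inputs must be non-empty sequences of (row, col) boundary points.
    """
    def directed(source, destination):
        return [min(math.dist(p, q) for q in destination) for p in source]

    distances = sorted(directed(points_a, points_b) + directed(points_b, points_a))
    index = max(0, math.ceil(0.95 * len(distances)) - 1)
    return distances[index]


# Toy masks: the prediction bends right where the true crack runs straight down.
pred = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
target = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]]
metrics = region_metrics(pred, target)
print(metrics)
print(hd95([(0, 0), (0, 1)], [(0, 0), (0, 4)]))
```

In this toy case F1 and Dice coincide, which illustrates the equivalence noted for the F1 score. The example also shows why mIoU can stay near 0.5 even when the crack class is almost entirely missed: the background IoU dominates the two-class average.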
3.4. Overall Performance Evaluation of BSA-Net
3.4.1. Quantitative Comparison of Evaluation Metrics
- 1. Crack500 Dataset
- 2. DeepCrack Dataset
- 3. GAPs384 Dataset
3.4.2. Visual Comparison
3.5. Ablation Experiment and Module Effectiveness Analysis
3.5.1. Quantitative Comparison of Evaluation Metrics
- 1. Crack500 Dataset
- 2. DeepCrack Dataset
- 3. GAPs384 Dataset
3.5.2. Visual Comparison
3.6. Ablation Experiment and Effectiveness Analysis of Boundary Branch Loss
4. Conclusions
5. Discussion and Future Work
5.1. Discussion
5.2. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Feng, C.; Li, B.; Liu, Y. A Bridge Crack Segmentation Method Based on Full Attention Feature Integration and Rendering. Structures 2025, 82, 110849.
- Zhang, J.; Qian, S.; Tan, C. Automated Bridge Surface Crack Detection and Segmentation Using Computer Vision-Based Deep Learning Model. Eng. Appl. Artif. Intell. 2022, 115, 105225.
- Liu, T.; Zhang, L.; Zhou, G.; Cai, W.; Cai, C.; Li, L. BC-DUnet-Based Segmentation of Fine Cracks in Bridges under a Complex Background. PLoS ONE 2022, 17, e0265258.
- Ding, H.; Wu, S. Bridge Crack Segmentation and Measurement Based on SOLOv2 Segmentation Model. J. Meas. Eng. 2024, 12, 502–518.
- Shokri, P.; Shahbazi, M.; Nielsen, J. Semantic Segmentation and 3D Reconstruction of Concrete Cracks. Remote Sens. 2022, 14, 5793.
- Kheradmandi, N.; Mehranfar, V. A Critical Review and Comparative Study on Image Segmentation-Based Techniques for Pavement Crack Detection. Constr. Build. Mater. 2022, 321, 126162.
- Sun, L.; Yang, Y.; Zhou, G.; Chen, A.; Zhang, Y.; Cai, W.; Li, L. An Integration–Competition Network for Bridge Crack Segmentation under Complex Scenes. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 617–634.
- Zhou, S.; Canchila, C.; Song, W. Deep Learning-Based Crack Segmentation for Civil Infrastructure: Data Types, Architectures, and Benchmarked Performance. Autom. Constr. 2023, 146, 104678.
- Sholevar, N.; Golroo, A.; Esfahani, S.R. Machine Learning Techniques for Pavement Condition Evaluation. Autom. Constr. 2022, 136, 104190.
- Asadi Shamsabadi, E.; Xu, C.; Rao, A.S.; Nguyen, T.; Ngo, T.; Dias-da-Costa, D. Vision Transformer-Based Autonomous Crack Detection on Asphalt and Concrete Surfaces. Autom. Constr. 2022, 140, 104316.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation; Springer: Cham, Switzerland, 2015.
- Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651.
- Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation; Springer: Cham, Switzerland, 2018.
- Wang, D.; Dai, L.; Song, F.; Yang, H.; Li, J.; Yang, X.; Jiang, S. Enhanced Mask2Former with Multi-Scale and High-Resolution for Concrete Bridge Crack Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2025, 26, 14942–14952.
- Guo, F.; Qian, Y.; Liu, J.; Yu, H. Pavement Crack Detection Based on Transformer Network. Autom. Constr. 2023, 145, 104646.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021.
- Zhou, Z.; Zhang, J.; Gong, C. Hybrid Semantic Segmentation for Tunnel Lining Cracks Based on Swin Transformer and Convolutional Neural Network. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 2491–2510.
- Sun, X.; Xie, Y.; Jiang, L.; Cao, Y.; Liu, B. DMA-Net: DeepLab with Multi-Scale Attention for Pavement Crack Segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18392–18403.
- Lei, Q.; Zhong, J.; Wang, C. Joint Optimization of Crack Segmentation with an Adaptive Dynamic Threshold Module. IEEE Trans. Intell. Transp. Syst. 2024, 25, 6902–6916.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. arXiv 2023.
- Deng, F.; Yang, S.; Wang, B.; Dong, X.; Tian, S. UCrack-DA: A Multi-Scale Unsupervised Domain Adaptation Method for Surface Crack Segmentation. Remote Sens. 2025, 17, 2101.
- Liu, G.; Li, X.; Di, J.; Sun, R.; Qin, F.; Su, Y.; Liu, D. A Novel Framework for Crack Segmentation Using Image Augmentation and a CannyNet. Eng. Appl. Artif. Intell. 2025, 159, 111644.
- Wang, F.; Wang, Z.; Wu, X.; Wu, D.; Hu, H.; Liu, X.; Zhou, Y. E2S: A UAV-Based Levee Crack Segmentation Framework Using the Unsupervised Deblurring Technique. Remote Sens. 2025, 17, 935.
- Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. arXiv 2024, arXiv:2408.00714.
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017.
- Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535.
- Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A Deep Hierarchical Feature Learning Architecture for Crack Segmentation. Neurocomputing 2019, 338, 139–153.
- Eisenbach, M.; Stricker, R.; Seichter, D.; Amende, K.; Debes, K.; Sesselmann, M.; Ebersbach, D.; Stoeckert, U.; Gross, H.M. How to Get Pavement Distress Detection Ready for Deep Learning? A Systematic Approach. In 2017 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2017; pp. 2039–2047.
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New York, NY, USA, 2017.
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation; Springer: Cham, Switzerland, 2018.
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020.
- Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020; pp. 173–190.
- Zhou, Y.; Liu, Y.; Lian, Y.; Pan, T.; Zheng, Y.; Zhou, Y. Ambient Vibration Measurement-Aided Multi-1D CNNs Ensemble for Damage Localization Framework: Demonstration on a Large-Scale RC Pedestrian Bridge. Mech. Syst. Signal Process. 2025, 224, 111937.
- Zhou, Y.; Liang, M.; Yue, X. Deep Residual Learning for Acoustic Emission Source Localization in a Steel-Concrete Composite Slab. Constr. Build. Mater. 2024, 411, 134220.
- Pan, T.; Zheng, Y.; Zhou, Y.; Luo, W.; Xu, X.; Hou, C.; Zhou, Y. Damage Pattern Recognition for Corroded Beams Strengthened by CFRP Anchorage System Based on Acoustic Emission Techniques. Constr. Build. Mater. 2023, 406, 133474.









| Experimental Environment | Configuration |
|---|---|
| CPU | Intel Xeon |
| GPU | NVIDIA RTX 4090 |
| GPU memory | 24 GB |
| Operating system | Linux Ubuntu 20.04 LTS |
| CUDA version | CUDA 11.8 |
| Deep-learning framework | PyTorch 2.0.1 |

| Model | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| SegFormer | 0.556 | 0.455 | 0.114 | 0.183 | 371.36 |
| U-Net | 0.487 | 0.018 | 0.002 | 0.003 | 595.83 |
| PSPNet | 0.529 | 0.342 | 0.104 | 0.159 | 454.28 |
| DeepLabV3+ | 0.558 | 0.390 | 0.181 | 0.247 | 257.24 |
| BiSeNet | 0.550 | 0.385 | 0.137 | 0.202 | 363.54 |
| DANet | 0.530 | 0.318 | 0.111 | 0.165 | 379.94 |
| OCRNet | 0.552 | 0.421 | 0.179 | 0.251 | 307.85 |
| BSA-Net | 0.589 | 0.388 | 0.308 | 0.343 | 181.37 |
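
The Score column in the table above is consistent with the harmonic mean of Precision and Recall, that is, the F1 score. The short check below recomputes it from the reported Precision and Recall values of each row. Because the table entries are rounded to three decimals, a small tolerance is assumed; the 0.002 threshold is an illustrative choice, not a value from the paper.

```python
# (precision, recall, reported score) copied from the table above.
ROWS = {
    "SegFormer": (0.455, 0.114, 0.183),
    "U-Net": (0.018, 0.002, 0.003),
    "PSPNet": (0.342, 0.104, 0.159),
    "DeepLabV3+": (0.390, 0.181, 0.247),
    "BiSeNet": (0.385, 0.137, 0.202),
    "DANet": (0.318, 0.111, 0.165),
    "OCRNet": (0.421, 0.179, 0.251),
    "BSA-Net": (0.388, 0.308, 0.343),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Absolute gap between the recomputed F1 and the reported score, per model.
deviations = {
    name: abs(f1_from(precision, recall) - reported)
    for name, (precision, recall, reported) in ROWS.items()
}
worst = max(deviations.values())

for name, gap in deviations.items():
    print(f"{name:12s} recomputed F1 differs from reported score by {gap:.4f}")
print(f"largest deviation: {worst:.4f}")
```

All rows agree with the recomputed value to within rounding of the published Precision and Recall figures.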

| Model | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| SegFormer | 0.859 | 0.876 | 0.753 | 0.810 | 20.89 |
| U-Net | 0.856 | 0.912 | 0.804 | 0.854 | 27.08 |
| PSPNet | 0.716 | 0.738 | 0.552 | 0.632 | 71.20 |
| DeepLabV3+ | 0.858 | 0.891 | 0.801 | 0.844 | 16.54 |
| BiSeNet | 0.821 | 0.876 | 0.729 | 0.796 | 22.56 |
| DANet | 0.728 | 0.742 | 0.585 | 0.654 | 57.24 |
| OCRNet | 0.816 | 0.865 | 0.724 | 0.788 | 25.32 |
| BSA-Net | 0.866 | 0.924 | 0.774 | 0.842 | 16.07 |

| Model | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| SegFormer | 0.663 | 0.577 | 0.458 | 0.511 | 68.97 |
| U-Net | 0.659 | 0.642 | 0.406 | 0.497 | 71.53 |
| PSPNet | 0.567 | 0.445 | 0.180 | 0.256 | 165.54 |
| DeepLabV3+ | 0.673 | 0.593 | 0.496 | 0.540 | 61.32 |
| BiSeNet | 0.489 | 0.002 | 0.003 | 0.003 | 631.80 |
| DANet | 0.572 | 0.399 | 0.254 | 0.311 | 150.61 |
| OCRNet | 0.630 | 0.486 | 0.383 | 0.428 | 80.06 |
| BSA-Net | 0.684 | 0.592 | 0.564 | 0.577 | 58.48 |

| Model | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| BSA-Net (Backbone Replaced with ResNet50) | 0.565 | 0.449 | 0.196 | 0.273 | 238.36 |
| BSA-Net (Without Light-ASPP in the Neck) | 0.519 | 0.501 | 0.063 | 0.112 | 470.75 |
| BSA-Net (Head Replaced with a Single-Branch Decoder) | 0.489 | 0.047 | 0.015 | 0.023 | 506.23 |
| BSA-Net (Comprehensive) | 0.589 | 0.388 | 0.308 | 0.343 | 181.37 |

| Model | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| BSA-Net (Backbone Replaced with ResNet50) | 0.853 | 0.897 | 0.785 | 0.837 | 18.52 |
| BSA-Net (Without Light-ASPP in the Neck) | 0.824 | 0.828 | 0.773 | 0.800 | 22.06 |
| BSA-Net (Head Replaced with a Single-Branch Decoder) | 0.804 | 0.801 | 0.746 | 0.773 | 33.94 |
| BSA-Net (Comprehensive) | 0.866 | 0.924 | 0.774 | 0.842 | 16.07 |

| Model | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| BSA-Net (Backbone Replaced with ResNet50) | 0.494 | 0.023 | 0.003 | 0.005 | 622.79 |
| BSA-Net (Without Light-ASPP in the Neck) | 0.515 | 0.121 | 0.079 | 0.096 | 377.24 |
| BSA-Net (Head Replaced with a Single-Branch Decoder) | 0.495 | 0.027 | 0.012 | 0.020 | 536.42 |
| BSA-Net (Comprehensive) | 0.684 | 0.592 | 0.564 | 0.577 | 58.48 |

| Loss Function | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| Single Region Segmentation Loss | 0.512 | 0.363 | 0.275 | 0.313 | 235.62 |
| Joint Loss Function | 0.589 | 0.388 | 0.308 | 0.343 | 181.37 |

| Loss Function | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| Single Region Segmentation Loss | 0.783 | 0.858 | 0.697 | 0.769 | 35.10 |
| Joint Loss Function | 0.866 | 0.924 | 0.774 | 0.842 | 16.07 |

| Loss Function | mIoU | Precision | Recall | F1 Score | HD95 |
|---|---|---|---|---|---|
| Single Region Segmentation Loss | 0.601 | 0.521 | 0.489 | 0.504 | 82.16 |
| Joint Loss Function | 0.684 | 0.592 | 0.564 | 0.577 | 58.48 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Jiang, Y.; Wang, T.; Shao, C.; Chen, X.; Liang, J. Boundary-Sensitive Hybrid Attention Network for Multi-Scale Crack Fine Segmentation. Sensors 2026, 26, 3200. https://doi.org/10.3390/s26103200
