Efficient Global–Local Context Fusion with Mobile-Optimized Transformers for Concrete Dam Crack Inspection
Abstract
1. Introduction
2. Background of Deep Learning-Based Crack Analysis
2.1. YOLO-Based Detection Paradigm for Crack Localization
2.2. Mask R-CNN-Based Segmentation Paradigm for Morphology Preservation
3. Related Work
3.1. Segmentation Architecture Innovations
3.2. Hybrid Architecture and Loss Optimization
3.2.1. Hybrid Design Paradigms
3.2.2. Loss Function Evolution
3.3. Overall Framework of the Proposed Method
4. Methodology
4.1. MobileNetV2-Based U-Net Architecture
- (1) Encoder Stream: Use the original MobileNetV2 layers up to the final 1280-channel expansion (excluding the classification head).
- (2) Channel Balancing: Insert 1 × 1 convolutions to align skip connection channels with decoder dimensions.
- (3) Skip Connections: Extract multi-scale features from four strategic stages:
4.2. Enhanced Transformer Block
4.2.1. Design Motivation and Architecture
4.2.2. Mathematical Formulation and Complexity Analysis
4.2.3. Benefits for Dam Crack Segmentation
5. Experiments
5.1. DamCrackSet-1K Dataset Construction
5.1.1. Dataset Preparation
5.1.2. Dataset Generation
5.2. Training Strategy
5.3. Ablation Study
5.3.1. Backbone Architecture Analysis
5.3.2. Transformer Block Design
5.3.3. Loss Function Components
Edge-Refinement Term
5.3.4. Component Combinations
5.4. Comparative Experiments
5.5. Visual Comparison
5.5.1. Intersecting Cracks
5.5.2. Fine and Low-Contrast Cracks
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest










| Model | mAP@0.50 (%) | mAP@0.50:0.95 (%) | Params (M) | FPS |
|---|---|---|---|---|
| YOLOv5 | 71.85 | 48.04 | 97.2 | 45 |
| YOLOv8 | 63.39 | 37.73 | 68.2 | 58 |
| YOLOv9 | 64.79 | 37.09 | 57.3 | 62 |
| YOLOv10 | 69.67 | 43.09 | 47.8 | 67 |
| YOLOv11 | 61.85 | 34.94 | 56.9 | 71 |

| Model | Params (M) | Edge-Friendly | Key Feature |
|---|---|---|---|
| YOLOv10 | 47.8 | Yes | Fast, low precision |
| U-Net | 34.5 | No | Boundary-preserving |
| DeepLabV3+ | 41.2 | No | High accuracy, slower |
| Mask R-CNN | 158.0 | No | Heavy, two-stage |

| Stage | Output Size | Channels | Blocks |
|---|---|---|---|
| Input | | 3 | - |
| Stage 1 | | 16 | 1 |
| Stage 2 | | 24 | 2 |
| Stage 3 | | 32 | 3 |
| Stage 4 | | 96 | 4 |
| Stage 5 | | 1280 | 3 |
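
The channel-balancing step in Section 4.1 can be made concrete with a little arithmetic. The sketch below counts the learnable parameters of the 1 × 1 convolutions that would map each encoder stage to a decoder width. The encoder channel counts come from the stage table above; the decoder widths, the choice of Stages 1 to 4 as the skip sources, and the helper names are assumptions made for illustration only and are not taken from the paper.

```python
# Encoder channels per stage, as listed in the stage table.
ENCODER_CHANNELS = {"stage1": 16, "stage2": 24, "stage3": 32, "stage4": 96, "stage5": 1280}

# ASSUMPTION: hypothetical decoder widths, chosen only to illustrate the arithmetic.
ASSUMED_DECODER_CHANNELS = {"stage1": 16, "stage2": 32, "stage3": 64, "stage4": 128}


def conv1x1_params(c_in: int, c_out: int, bias: bool = True) -> int:
    """Learnable parameters of a 1x1 convolution: one weight per (in, out) pair, plus optional bias."""
    return c_in * c_out + (c_out if bias else 0)


def skip_alignment_cost(encoder: dict, decoder: dict) -> dict:
    """Parameter count of the 1x1 projection on each skip connection."""
    return {name: conv1x1_params(encoder[name], c_out) for name, c_out in decoder.items()}


costs = skip_alignment_cost(ENCODER_CHANNELS, ASSUMED_DECODER_CHANNELS)
total = sum(costs.values())

for name, n in costs.items():
    print(f"{name}: {n} parameters")
print(f"total: {total} parameters")
```

Under these assumed widths the four projections together cost well under 0.1 M parameters, which suggests why 1 × 1 alignment is a cheap way to reconcile encoder and decoder dimensions on a mobile backbone.
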

| Parameter | Value |
|---|---|
| Optimizer | Adam |
| Epochs | 100 |
| Batch size | 8 |
| Initial learning rate | |
| Momentum (β₁, β₂) | 0.9, 0.999 |
| Weight decay | |
| Hardware | NVIDIA RTX 3090 GPU |
| CUDA/cuDNN version | CUDA 11.3, cuDNN 8.2 |

| Backbone | mIoU (%) | Dice (%) | Precision (%) | Recall (%) | Latency (ms) |
|---|---|---|---|---|---|
| ResNet-50 | 64.9 | 78.1 | 87.5 | 72.5 | 18.7 |
| EfficientNet-B3 | 65.2 | 78.3 | 85.1 | 73.7 | 15.2 |
| MobileNetV2 | 65.8 | 78.6 | 83.8 | 76.1 | 12.4 |
| Ours (Modified MV2) | 67.9 | 80.3 | 86.7 | 76.6 | 13.1 |

| Attention Type | mIoU (%) | Dice (%) | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| Local Window | 66.25 | 79.10 | 2.1 | 3.7 |
| Global | 68.53 | 80.60 | 4.3 | 8.2 |
| Non-Local | 68.28 | 80.76 | 5.2 | 9.1 |
| Cascaded Group (Ours) | 69.20 | 81.32 | 1.8 | 2.9 |

| Scheduling Method | Description | mIoU (%) | Edge IoU (%) | Dice (%) | Epochs |
|---|---|---|---|---|---|
| Fixed | No change | 66.90 | 61.0 | 79.59 | 90 |
| | Decrease 0.1 every 10 epochs | 68.55 | 65.9 | 80.70 | 70 |
| | Decrease 0.05 every 10 epochs | 68.26 | 64.6 | 80.57 | 80 |
| Exponential decay | | 67.87 | 63.4 | 80.37 | 100 |

| Components | mIoU (%) | Dice (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| Focal | 66.45 | 79.17 | 86.14 | 75.33 |
| Focal + Edge-Aware | 67.77 | 80.23 | 84.97 | 78.59 |
| Focal + Shape-Regularized | 68.50 | 80.70 | 83.63 | 80.15 |
| Curvature-Aware (Ours) | 69.20 | 81.32 | 84.60 | 80.04 |
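
Every row of the loss ablation builds on a focal term. As a point of reference, the sketch below implements the standard binary focal loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), in plain Python. It is not the paper's Curvature-Aware loss, whose edge and shape terms are not recoverable from this page; the function name and the defaults `alpha=0.25`, `gamma=2.0` are common choices assumed here, not values reported by the authors.

```python
import math


def binary_focal_loss(probs, targets, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean standard binary focal loss over per-pixel crack probabilities.

    probs   : predicted probability of the crack class for each pixel
    targets : ground-truth labels (1 = crack, 0 = background)
    alpha, gamma : ASSUMED defaults, not taken from the paper
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)            # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confidently correct pixel contributes far less than a badly missed crack pixel.
easy = binary_focal_loss([0.95], [1])
hard = binary_focal_loss([0.10], [1])
print(f"easy pixel: {easy:.6f}, hard pixel: {hard:.6f}")
```

The (1 − p_t)^γ factor down-weights pixels that are already classified well, which is the usual motivation for focal losses when thin cracks occupy only a small fraction of the image.
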

| Backbone | Attention | Loss | mIoU (%) | Latency (ms) |
|---|---|---|---|---|
| MV2 | - | Focal | 63.27 | 12.4 |
| MV2 | Local | Focal+Wing | 65.81 | 14.3 |
| MV2 | Cascaded Group | Focal | 67.28 | 13.9 |
| MV2 | Cascaded Group | Curvature-Aware | 69.20 | 14.1 |

| Method | mIoU (%) | Edge IoU (%) | FPS | Latency (ms) |
|---|---|---|---|---|
| U-Net | 64.53 | 64.8 | 28.1 | 35.6 |
| Mask R-CNN | 70.35 | 67.21 | 16.7 | 59.9 |
| DeepLabV3+ | 68.35 | 65.14 | 22.4 | 44.6 |
| Ours | 69.20 | 66.87 | 70.9 | 14.1 |
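
The FPS and latency columns of the comparison table appear to be two views of the same measurement, related by FPS ≈ 1000 / latency (ms) for single-image inference. The short check below, a sketch that assumes this batch-size-1 relationship, recomputes FPS from the reported latencies and derives the speed-up of the proposed model over Mask R-CNN. All numbers are copied from the table; the helper name is illustrative.

```python
# Latency (ms) and FPS as reported in the comparison table.
LATENCY_MS = {"U-Net": 35.6, "Mask R-CNN": 59.9, "DeepLabV3+": 44.6, "Ours": 14.1}
REPORTED_FPS = {"U-Net": 28.1, "Mask R-CNN": 16.7, "DeepLabV3+": 22.4, "Ours": 70.9}


def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-image latency, assuming one image per forward pass."""
    return 1000.0 / latency_ms


derived_fps = {name: fps_from_latency(ms) for name, ms in LATENCY_MS.items()}
speedup_vs_mask_rcnn = LATENCY_MS["Mask R-CNN"] / LATENCY_MS["Ours"]

for name, fps in derived_fps.items():
    print(f"{name}: derived {fps:.1f} FPS, reported {REPORTED_FPS[name]} FPS")
print(f"speed-up over Mask R-CNN: {speedup_vs_mask_rcnn:.2f}x")
```

On the reported figures, the derived values agree with the FPS column to one decimal place, and the proposed model runs roughly four times faster than Mask R-CNN while trailing it by about one mIoU point.
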
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hu, J.; Huang, B.; Kang, F. Efficient Global–Local Context Fusion with Mobile-Optimized Transformers for Concrete Dam Crack Inspection. Buildings 2025, 15, 4487. https://doi.org/10.3390/buildings15244487
