Visual Guidance for Construction Equipment: A Masked-Attention Multi-Scale Transformer for Vibrated Concrete Recognition
Abstract
1. Introduction
2. Related Works
2.1. CNN-Based Segmentation Models
2.2. Transformer-Based Segmentation Models
3. Methodology
3.1. The Proposed Framework
3.2. Multi-Scale Feature Extraction for Input Images
3.3. Edge Feature Enhancement for Vibrated and Non-Vibrated Regions
3.4. Mask Modeling and Prediction for Vibrated Regions
3.5. Image Augmentation, Loss Function, Optimizer, and Learning Rate Scheduling Strategy
4. Experiment
4.1. Dataset
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ren, Q.; Li, H.; Li, M. Towards online monitoring of concrete dam displacement subject to time-varying environments: An improved sequential learning approach. Adv. Eng. Inform. 2023, 55, 101881.
2. Huang, B.; Kang, F.; Li, J. Displacement prediction model for high arch dams using long short-term memory based encoder-decoder with dual-stage attention considering measured dam temperature. Eng. Struct. 2023, 280, 115686.
3. Ren, B.; Wang, H.; Wang, D. Vision method based on deep learning for detecting concrete vibration quality. Case Stud. Constr. Mater. 2023, 18, e02132.
4. Quan, Y.; Wang, F. Machine learning-based real-time tracking for concrete vibration. Autom. Constr. 2022, 140, 104343.
5. Li, T.; Wang, H.; Tan, J.; Tan, J. Intelligent quality assessment of concrete vibration using computer vision and large language models. Autom. Constr. 2025, 180, 106507.
6. Li, J.; Tian, Z.; Ma, Y.; Ma, Y. Feedback control system for vibration construction of fresh concrete. Mech. Syst. Signal Process. 2024, 216, 111461.
7. Boumiz, A.; Vernet, C. Mechanical properties of cement pastes and mortars at early ages: Evolution with time and degree of hydration. Adv. Cem. Based Mater. 1996, 3, 94–106.
8. Zhu, B.; Zheng, Y.; Rong, Z.; Zhao, Z.; Yang, J.; Yang, B. Impact of early-age vibration on the permeability and microstructure of mature-age concrete. Constr. Build. Mater. 2024, 453, 139124.
9. Li, T.; Wang, H.; Tan, J.; Tan, J. A continuous concrete vibration method for robots based on machine vision with integrated spatial features. Appl. Soft Comput. 2024, 167, 112231.
10. Xu, Y.; Zhou, Y.; Sekula, P. Machine learning in construction: From shallow to deep learning. Dev. Built Environ. 2021, 6, 100045.
11. Li, T.; Wang, H.; Pan, D. A machine vision approach with temporal fusion strategy for concrete vibration quality monitoring. Appl. Soft Comput. 2024, 160, 111684.
12. Zhang, H.; Jin, Y.; Liu, Q. Intelligent monitoring method for tamping times during dynamic compaction construction using machine vision and pattern recognition. Measurement 2022, 193, 110835.
13. Yang, K.; Wang, H.; Wang, K. An effective monitoring method of dynamic compaction construction quality based on time series modeling. Measurement 2024, 224, 113930.
14. Zhang, H.; Yang, Q.; Liu, Q. Multi-sensor integrated monitoring equipment and its application to dynamic compaction quality in construction. Autom. Constr. 2023, 156, 105151.
15. Jiang, D.; Kong, L.; Wang, H. Precise control mode for concrete vibration time based on attention-enhanced machine vision. Autom. Constr. 2024, 158, 105232.
16. Wang, D.; Ren, B.; Cui, B. Real-time monitoring for vibration quality of fresh concrete using convolutional neural networks and IoT technology. Autom. Constr. 2021, 123, 103510.
17. Wang, S.; Chen, L.; Shi, P. Computer vision based manual concrete vibration quality monitoring. Dev. Built Environ. 2026, 26, 100895.
18. Cao, G.; Bai, Y.; Shi, Y. Investigation of vibration on rheological behavior of fresh concrete using CFD-DEM coupling method. Constr. Build. Mater. 2024, 425, 135908.
19. Koch, J.A.; Castaneda, D.I.; Ewoldt, R.H. Vibration of fresh concrete understood through the paradigm of granular physics. Cem. Concr. Res. 2019, 115, 31–42.
20. Mia, M.S.; Arnob, A.B.H.; Naim, A. ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain. In Proceedings of the 2023 International Conference on the Cognitive Computing and Complex Data (ICCD), Huai’an, China, 21–22 October 2023; pp. 101–117.
21. Shan, J.; Huang, Y.; Jiang, W. DCUFormer: Enhancing pavement crack segmentation in complex environments with dual-cross/upsampling attention. Expert Syst. Appl. 2025, 264, 125891.
22. Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 207–210.
23. Cheng, B.; Schwing, A.G.; Kirillov, A. Per-Pixel Classification is Not All You Need for Semantic Segmentation. arXiv 2021.
24. Goo, J.M.; Milidonis, X.; Artusi, A. Hybrid-Segmentor: Hybrid approach for automated fine-grained crack segmentation in civil infrastructure. Autom. Constr. 2025, 170, 105960.
25. Ma, Y.; Zhang, Z. Position-Guided Hybrid Convolutional Neural Network and Transformer Network for steel strip surface defect detection. Eng. Appl. Artif. Intell. 2025, 162, 112741.
26. Beyene, D.A.; Tola, K.D.; Yigzew, F.E. Hybrid multi-scale CNN-Transformer network for structural surface crack segmentation. Results Eng. 2026, 30, 110145.
27. Cheng, B.; Misra, I.; Schwing, A.G. Masked-attention Mask Transformer for Universal Image Segmentation. arXiv 2022.
28. Tanaka, H.; Shibano, K.; Suzuki, T. Scene classification-assisted deep learning for crack detection in asphalt pavements. Case Stud. Constr. Mater. 2025, 23, e05064.
29. Li, D.; Xie, Q.; Gong, X. Automatic defect detection of metro tunnel surfaces using a vision-based inspection system. Adv. Eng. Inform. 2021, 47, 101206.
30. Fan, Z.; Chen, B.; Wang, Z. Defects recognition in on-site GPR images of mountain tunnel linings based on YOLOv8N model. Case Stud. Constr. Mater. 2025, 23, e05196.
31. Zhao, N.; Song, Y.; Liu, H. A novel MPDENet model and efficient combined loss function for real-time pixel-level segmentation detection of tunnel lining cracks. Case Stud. Constr. Mater. 2025, 22, e04618.
32. Fan, H.; Tian, Z.; Xu, X. Rockfill material segmentation and gradation calculation based on deep learning. Case Stud. Constr. Mater. 2022, 17, e01216.
33. Wang, S.; Han, R.; Wu, X.; Zhao, D.; Zeng, X.; Yin, R.; Han, Z.; Liu, Y.; Shu, S. Crack segmentation and quantification in concrete structures using a lightweight YOLO model based on pruning and knowledge distillation. Expert Syst. Appl. 2025, 283, 127834.
34. Yang, K.; Bao, Y.; Li, J. Deep learning-based YOLO for crack segmentation and measurement in metro tunnels. Autom. Constr. 2024, 168, 105818.
35. Li, H. Tunnel lining crack detection method based on deformable convolution and feature fusion with image enhancement of Retinex theory. Expert Syst. Appl. 2026, 299, 130285.
36. Liu, B.; Zhang, J.; Lei, M. Simultaneous tunnel defects and lining thickness identification based on multi-tasks deep neural network from ground penetrating radar images. Autom. Constr. 2023, 145, 104633.
37. Al-Huda, Z.; Peng, B.; Algburi, R.N.A. Asymmetric dual-decoder-U-Net for pavement crack semantic segmentation. Autom. Constr. 2023, 156, 105138.
38. Qu, Z.; Chen, W.; Wang, S.-Y. A Crack Detection Algorithm for Concrete Pavement Based on Attention Mechanism and Multi-Features Fusion. IEEE Trans. Intell. Transp. Syst. 2022, 23, 11710–11719.
39. Li, M.; Yuan, J.; Ren, Q. CNN-Transformer hybrid network for concrete dam crack patrol inspection. Autom. Constr. 2024, 163, 105440.
40. Ma, Y.; Bao, T.; Li, Y. GANFormerNet: A UAV-based Concrete Crack Segmentation Model for Water-related Structures Using Vision Transformer and Graph Attention Network. Adv. Eng. Inform. 2025, 68, 103725.
41. Wang, W.; Su, C. Automatic concrete crack segmentation model based on transformer. Autom. Constr. 2022, 139, 104275.
42. Cao, H.; Wang, Y.; Chen, J. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Computer Vision—ECCV 2022 Workshops; Springer Nature: Cham, Switzerland, 2023; pp. 205–218.
43. Li, G.; Zhang, J.; Hu, K. HAIS-SegFormer: A Lightweight Underwater Crack Segmentation Network Based on Hybrid Attention and Feature Inhibition. J. Mar. Sci. Eng. 2026, 14, 526.
44. Zhang, E.; Shao, L.; Wang, Y. Unifying transformer and convolution for dam crack detection. Autom. Constr. 2023, 147, 104712.
45. Gan, G.; Xu, X.; Ding, Y. Pixel-Level Detection of Cracks Based on Loop Semantic Diffusion Integration. J. Comput. Civ. Eng. 2025, 39, 04025049.
46. Asadi Shamsabadi, E.; Xu, C.; Rao, A.S. Vision transformer-based autonomous crack detection on asphalt and concrete surfaces. Autom. Constr. 2022, 140, 104316.
47. Zhu, X.; Yu, W.; Dong, X. MR-Former: Improving universal image segmentation via refined masked-attention transformer. Alex. Eng. J. 2025, 131, 232–244.
48. Zuo, X.; Sheng, Y.; Shen, J. Multilabel Sewer Pipe Defect Recognition with Mask Attention Feature Enhancement and Label Correlation Learning. J. Comput. Civ. Eng. 2025, 39, 04024050.
49. Zhang, H.; Li, F.; Xu, H. MP-Former: Mask-Piloted Transformer for Image Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 18074–18083.
50. Li, F.; Zhang, H.; Xu, H. Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation. arXiv 2022.
51. Meda, D.; Ahmed, M.M.; Kalapatapu, P.; Pasupuleti, V.D.K. Enhanced Structural Damage Detection, Segmentation, and Quantification Using Computer Vision and Deep Learning. J. Comput. Civ. Eng. 2025, 39, 04025066.
52. Pantoja-Rosero, B.G.; Salamone, S. Integrating extended reality and AI-based damage segmentation for near real-time, traceable bridge inspections. Autom. Constr. 2025, 180, 106567.
53. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. arXiv 2019, arXiv:1904.02689.
54. Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. SOLOv2: Dynamic and Fast Instance Segmentation. arXiv 2020, arXiv:2003.10152.
55. Tian, Z.; Shen, C.; Chen, H. Conditional Convolutions for Instance Segmentation. arXiv 2020, arXiv:2003.05664.
56. Khatir, A.; Capozucca, R.; Khatir, S.; Magagnini, E.; Le Thanh, C.; Riahi, M.K. Advancements and emerging trends in integrating machine learning and deep learning for SHM in mechanical and civil engineering: A comprehensive review. J. Braz. Soc. Mech. Sci. Eng. 2025, 47, 419.
57. Bouabdallah, A.; Benaissa, A.; Bouabdallah, M.A.; Malab, S.; Khatir, A. Development and performance evaluation of self-leveling sand concrete: Enhanced fluidity, mechanical strength, durability, and non-destructive analysis. Constr. Build. Mater. 2025, 468, 140463.
58. Cacciola, M.; Angiulli, G.; Burrascano, P.; Laganà, F.; Versaci, M. A Prototypical Fuzzy Similarity-Based Classification Framework for Ultrasonic Defect Detection in Concrete. Eng 2026, 7, 88.
59. Laganà, F.; Pratticò, D.; Quattrone, M.F.; Pullano, S.A.; Calcagno, S. Hybrid AI–Taguchi–ANOVA Approach for Thermographic Monitoring of Electronic Devices. Eng 2026, 7, 28.

| Dam Pouring Block ID | Video ID | Acquisition Time | Weather Condition | Training Images | Validation Images | Total (per block) |
|---|---|---|---|---|---|---|
| 1 | V1 | 4:00–4:40 PM | Cloudy | 104 | 26 | 146 |
| 1 | V2 | 5:10–5:30 PM | Cloudy | 12 | 4 | |
| 2 | V3 | 10:30–11:00 AM | Sunny | 56 | 16 | 198 |
| 2 | V4 | 2:00–2:30 PM | Sunny | 103 | 23 | |
| 3 | V5 | 10:40–11:10 AM | Sunny | 73 | 19 | 92 |
| Total | - | - | - | 348 | 88 | 436 |

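
The totals in the dataset table can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary names (`counts`, `blocks`) and the helper `split_summary` are hypothetical, and the numbers are copied from the table rather than taken from any released code.

```python
# Per-video (training, validation) image counts, copied from the dataset table.
counts = {
    "V1": (104, 26),
    "V2": (12, 4),
    "V3": (56, 16),
    "V4": (103, 23),
    "V5": (73, 19),
}

# Videos grouped by dam pouring block, as listed in the table.
blocks = {1: ["V1", "V2"], 2: ["V3", "V4"], 3: ["V5"]}


def split_summary(counts):
    """Return total training images, total validation images, and the validation share."""
    train = sum(t for t, _ in counts.values())
    val = sum(v for _, v in counts.values())
    return train, val, val / (train + val)


# Images per block: training plus validation images of every video in the block.
block_totals = {b: sum(sum(counts[v]) for v in vids) for b, vids in blocks.items()}

train, val, val_share = split_summary(counts)
print(block_totals)
print(train, val, f"{val_share:.1%}")
```

Under these numbers roughly one image in five is held out for validation, which is consistent with a conventional 80/20 split, although the source table does not state the split rule.
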

| Model | Precision (P) | Recall (R) | F1 | mAP@0.5:0.95 | mAP@0.5 |
|---|---|---|---|---|---|
| YOLACT [53] | 0.8140 | 0.5580 | 0.6621 | 0.4580 | 0.6860 |
| YOLO11 [52] | 0.8888 | 0.7832 | 0.8327 | 0.5908 | 0.8607 |
| SOLOv2 [54] | 0.9220 | 0.7900 | 0.8509 | 0.7730 | 0.8670 |
| CondInst [55] | 0.9000 | 0.8740 | 0.8868 | 0.8070 | 0.9250 |
| Our study | 0.9390 | 0.9370 | 0.9380 | 0.9010 | 0.9620 |

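
The F1 column in the comparison table agrees with the usual harmonic mean of precision and recall, F1 = 2PR / (P + R). The following is a minimal sketch under that assumption; the function name `f1_score` and the `rows` dictionary are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table.
rows = {
    "YOLACT": (0.8140, 0.5580, 0.6621),
    "YOLO11": (0.8888, 0.7832, 0.8327),
    "SOLOv2": (0.9220, 0.7900, 0.8509),
    "CondInst": (0.9000, 0.8740, 0.8868),
    "Our study": (0.9390, 0.9370, 0.9380),
}

for name, (p, r, reported) in rows.items():
    computed = f1_score(p, r)
    print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```

Each computed value matches the reported one to four decimal places, so the F1 column appears to be derived directly from the P and R columns.
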

| Metric | 1 | 2 | 3 | 4 | 5 | Mean ± std |
|---|---|---|---|---|---|---|
| Precision | 0.9390 | 0.8970 | 0.9060 | 0.9130 | 0.9250 | 0.9160 ± 0.0164 |
| Recall | 0.9370 | 0.9110 | 0.9190 | 0.9230 | 0.9310 | 0.9242 ± 0.0102 |
| F1 | 0.9380 | 0.9039 | 0.9124 | 0.9180 | 0.9280 | 0.9201 ± 0.0133 |
| mAP@0.5:0.95 | 0.9010 | 0.8780 | 0.8830 | 0.8860 | 0.8901 | 0.8876 ± 0.0087 |
| mAP@0.5 | 0.9620 | 0.9430 | 0.9490 | 0.9520 | 0.9590 | 0.9530 ± 0.0076 |

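
The mean ± std figures can be reproduced from the five per-column values of each metric. The sketch below assumes the spread is the sample standard deviation (divisor n − 1), which is the variant that matches the reported numbers; the names `results` and `summarize` are hypothetical.

```python
from statistics import mean, stdev

# Five values per metric, copied from the table (columns 1 to 5).
results = {
    "Precision": [0.9390, 0.8970, 0.9060, 0.9130, 0.9250],
    "Recall": [0.9370, 0.9110, 0.9190, 0.9230, 0.9310],
    "F1": [0.9380, 0.9039, 0.9124, 0.9180, 0.9280],
    "mAP@0.5:0.95": [0.9010, 0.8780, 0.8830, 0.8860, 0.8901],
    "mAP@0.5": [0.9620, 0.9430, 0.9490, 0.9520, 0.9590],
}


def summarize(values):
    """Return (mean, sample standard deviation) of a list of scores."""
    # statistics.stdev uses the n - 1 divisor (sample standard deviation).
    return mean(values), stdev(values)


for metric, values in results.items():
    m, s = summarize(values)
    print(f"{metric}: {m:.4f} ± {s:.4f}")
```

Using the population standard deviation (`statistics.pstdev`) instead would give smaller spreads, for example about 0.0147 rather than 0.0164 for precision.
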

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Lei, L.; Ji, Y.; Zhou, Y.; Zhao, C.; Wang, F.; Zhou, H.; Liang, Z. Visual Guidance for Construction Equipment: A Masked-Attention Multi-Scale Transformer for Vibrated Concrete Recognition. Appl. Sci. 2026, 16, 5479. https://doi.org/10.3390/app16115479

