FAFMNet: Feature Attention Fusion Multimodal Network of Road Potholes for Mobile Robot
Abstract
1. Introduction
- A novel multimodal fusion network is proposed for road-pothole semantic segmentation by jointly using RGB and disparity images.
- A feature fusion module is designed to integrate global and local features effectively, thereby enhancing fusion performance and boundary segmentation accuracy. This module reduces information loss during RGB–disparity fusion and better exploits useful cues, particularly in regions where boundary information is easily missed.
- Three feature attention fusion modules are designed and placed at different stages of the network according to their strengths. By incorporating spatial attention, channel attention, and hybrid attention mechanisms, these modules enable the network to exploit complementary spatial and channel information more effectively. As a result, they sharpen the network’s focus on small potholes and improve semantic segmentation accuracy.
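As a rough illustration of the channel, spatial, and hybrid attention ideas named in the contributions, the following toy sketch gates a summed RGB and disparity feature map. It is not the authors' module design: the parameter-free sigmoid gates, the additive fusion, and all function names are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(x):
    """Logistic function mapping real values to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Reweight each channel of a (C, H, W) map by a gate from its global average."""
    gate = sigmoid(feat.mean(axis=(1, 2)))  # shape (C,)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    """Reweight each pixel of a (C, H, W) map by a gate from its channel-wise mean."""
    gate = sigmoid(feat.mean(axis=0))  # shape (H, W)
    return feat * gate[None, :, :]

def hybrid_fusion(rgb_feat, dis_feat):
    """Toy hybrid fusion (assumed): sum both modalities, then apply both gates."""
    fused = rgb_feat + dis_feat
    return spatial_attention(channel_attention(fused))

# Hypothetical feature maps standing in for RGB and disparity encoder outputs.
rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((64, 8, 8))
dis_feat = rng.standard_normal((64, 8, 8))
out = hybrid_fusion(rgb_feat, dis_feat)
```

Because both gates lie in (0, 1), this toy version can only attenuate features; a learned module would normally include trainable projections before the gates.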
2. Related Work
2.1. Detection of Road Potholes and Crevices Using RGB Images
2.2. Detection of Road Potholes and Crevices Using Disparity Images
2.3. Detection of Road Potholes and Crevices Using Multimodal Data
3. Methodology
3.1. Overall Network
3.2. Feature Fusion Module
3.3. Feature Attention Fusion Module
3.3.1. Spatial–Directional Attention Fusion
3.3.2. Channel-Interdependence Fusion
3.3.3. Multidimensional Correlation Fusion
3.3.4. Module Naming and Placement
4. Experiments
4.1. Dataset and Training Details
4.2. Ablation Study
4.2.1. Ablation of Feature Attention Fusion Modules
- SDAF: inserted immediately after the initial convolutional module and the first encoder layer to exploit its spatial–directional sensitivity at high resolution.
- CIF: placed after the second encoder layer, where the number of feature-map channels is sufficiently large to benefit from efficient channel reweighting.
- MCF: applied to deeper network stages, namely the third and fourth encoder layers, to exploit its self-attention capability for global and local correlation refinement.
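The placement described above can be written down as a small configuration. This is only a sketch: the stage names, the dictionary layout, and the helper function are hypothetical, while the channel counts are the encoder output channels listed in the architecture table.

```python
# Encoder stages in order, paired with their output channel counts
# (values taken from the encoder/decoder architecture table).
ENCODER_STAGES = [
    ("initial", 64),
    ("1st", 64),
    ("2nd", 128),
    ("3rd", 256),
    ("4th", 512),
]

# Attention fusion module assigned to each stage, following the bullets above.
PLACEMENT = {
    "initial": "SDAF",  # high resolution: spatial-directional sensitivity
    "1st": "SDAF",
    "2nd": "CIF",       # enough channels for channel reweighting
    "3rd": "MCF",       # deeper stages: correlation refinement
    "4th": "MCF",
}

def describe_placement(stages, placement):
    """Return one readable line per encoder stage (hypothetical helper)."""
    return [f"{name}: {placement[name]} on {ch} channels" for name, ch in stages]

SUMMARY = describe_placement(ENCODER_STAGES, PLACEMENT)
```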
4.2.2. Ablation of the Placement of the Three Feature Attention Fusion Modules
4.2.3. Evaluation of Results
4.3. Comparative Study
4.3.1. Qualitative Demonstrations
4.3.2. Evaluation of Results
4.3.3. Failure Cases and Limitations
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Karunasekera, H.; Wang, H.; Zhang, H. Energy Minimization Approach for Negative Obstacle Region Detection. IEEE Trans. Veh. Technol. 2019, 68, 11668–11678.
2. Qiu, J.; Jiang, C. A bilateral semantic guidance network for detection of off-road freespace with impairments based on joint semantic segmentation and edge detection. Comput. Electr. Eng. 2025, 123, 110045.
3. Wang, Z.; Ma, Z.; Wang, Z.; Gao, S.; Peng, J. A novel road damage detection model with efficient attention and Dynamic Snake Convolution. Eng. Appl. Artif. Intell. 2026, 163, 112618.
4. Fan, R.; Wang, H.; Wang, Y.; Liu, M.; Pitas, I. Graph Attention Layer Evolves Semantic Segmentation for Road Pothole Detection: A Benchmark and Algorithms. IEEE Trans. Image Process. 2021, 30, 8144–8154.
5. Ye, M.; Li, X.; Dai, J.; Li, H.; Xu, Z.; Zhang, C. SCSANet: Split Convolution Selective Attention Network of Drivable Area Detection for Mobile Robots. Eng 2026, 7, 176.
6. Subramanian, R.; Büker, U. Study of Contactless Computer Vision-Based Road Condition Estimation Methods Within the Framework of an Operational Design Domain Monitoring System. Eng 2024, 5, 2778–2804.
7. Deng, K.; Xing, L.; Wu, H.; Ma, H.; Ling, Y.; Gao, J. Advances in Object Detection for Autonomous Driving Using mmWave Radar and Camera: A Comprehensive Survey. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 328.
8. Dodge, D.; Yilmaz, M. Convex Vision-Based Negative Obstacle Detection Framework for Autonomous Vehicles. IEEE Trans. Intell. Veh. 2023, 8, 778–789.
9. Pandey, A.K.; Iqbal, R.; Maniak, T.; Karyotis, C.; Akuma, S.; Palade, V. Convolution neural networks for pothole detection of critical road infrastructure. Comput. Electr. Eng. 2022, 99, 107725.
10. Fu, T.; Dong, H.; Yang, B.; Deng, B. DE-DFNet: Edge Enhanced Diversity Feature Fusion Guided by Differences in Remote Sensing Imagery Tiny Object Detection. Image Vis. Comput. 2025, 161, 105627.
11. Zhou, Y.; Zhang, C.; Deng, L.; Fu, J.; Li, H.; Xu, Z.; Zhang, J. Resolution-sensitive self-supervised monocular absolute depth estimation. Appl. Intell. 2024, 54, 4781–4793.
12. Feng, Z.; Guo, Y.; Liang, Q.; Bhutta, M.; Wang, H.; Liu, M.; Sun, Y. MAFNet: Segmentation of Road Potholes with Multimodal Attention Fusion Network for Autonomous Vehicles. IEEE Trans. Instrum. Meas. 2022, 71, 1–12.
13. Wang, Z.; Wang, W.; Li, N.; Zhang, S.; Chen, Q.; Jiang, Z. Multimodal Parallel Attention Network for Medical Image Segmentation. Image Vis. Comput. 2024, 147, 105069.
14. Fan, R.; Wang, H.; Bocus, M.; Liu, M. We Learn Better Road Pothole Detection: From Attention Aggregation to Adversarial Domain Adaptation. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020, Proceedings, Part IV; Springer: Berlin/Heidelberg, Germany, 2020; pp. 285–300.
15. Feng, Z.; Guo, Y.; Sun, Y. Segmentation of Road Negative Obstacles Based on Dual Semantic-Feature Complementary Fusion for Autonomous Driving. IEEE Trans. Intell. Veh. 2024, 9, 4687–4697.
16. Liang, M.; Hu, J.; Bao, C.; Feng, H.; Deng, F.; Lam, T. Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks. IEEE Robot. Autom. Lett. 2023, 8, 4060–4067.
17. Sun, T.; Pan, W.; Wang, Y.; Liu, Y. Region of Interest Constrained Negative Obstacle Detection and Tracking with a Stereo Camera. IEEE Sens. J. 2022, 22, 3616–3625.
18. Guo, M.H.; Lu, C.Z.; Hou, Q.; Liu, Z.; Cheng, M.M.; Hu, S.M. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. arXiv 2022, arXiv:2209.08575.
19. He, L.; Todorovic, S. Attention Decomposition for Cross-Domain Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 414–431.
20. Yu, H.; Cho, Y.; Kang, B.; Moon, S.; Kong, K.; Kang, S.J. Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2024; pp. 92–110.
21. Shan, J.; Huang, Y.; Jiang, W. DCUFormer: Enhancing Pavement Crack Segmentation in Complex Environments with Dual-Cross/Upsampling Attention. Expert Syst. Appl. 2025, 264, 125891.
22. Fan, J.; Bocus, M.; Hosking, B.; Wu, R.; Liu, Y.; Vityazev, S.; Fan, R. Multi-scale Feature Fusion: Learning Better Semantic Segmentation for Road Pothole Detection. In Proceedings of the 2021 IEEE International Conference on Autonomous Systems (ICAS); IEEE: New York, NY, USA, 2021; pp. 1–5.
23. He, C.; Yang, H.; Zhang, Z.; Wang, H.; Cai, Y.; Chen, L.; Zhong, C.; Zhang, Y. Dual-stream Detection and Segmentation Framework for Vision Based Unmanned Ground Vehicle Pothole Perception on Unstructured Roads. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 203.
24. Han, J.; Zhang, Z.; Gao, X.; Li, K.; Kang, X. Research on Negative Obstacle Detection Method Based on Image Enhancement and Improved Anchor Box YOLO. In Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA); IEEE: New York, NY, USA, 2022; pp. 1216–1221.
25. Zhang, Y.; Zuo, Z.; Xu, X.; Wu, J.; Zhu, J.; Zhang, H.; Wang, J.; Tian, Y. Road Damage Detection Using UAV Images Based on Multi-level Attention Mechanism. Autom. Constr. 2022, 144, 104613.
26. Zhang, C.; Li, G.; Zhang, Z.; Shao, R.; Li, M.; Han, D.; Zhou, M. AAL-Net: A Lightweight Detection Method for Road Surface Defects Based on Attention and Data Augmentation. Appl. Sci. 2023, 13, 1435.
27. Ali, R.; Bin-Saeed, Q.; Buyukozturk, O.; Lee, S.; Cha, Y. Monocular Computer Vision-Based Simultaneous Pothole Segmentation and 3D Volume Prediction Using 3DPredictNet. SSRN Electron. J. 2024.
28. Lin, W.; Li, X.; Han, H.; Yu, Q.; Cho, Y.H. A Novel Approach for Pavement Distress Detection and Quantification Using RGB-D Camera and Deep Learning Algorithm. Constr. Build. Mater. 2023, 407, 133593.
29. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
30. Hu, X.; Assaad, R.H. Real-time robotic teleoperation for pavement pothole segmentation, quantification, and localization using multimodal sensing and efficient multi-scale attention-enhanced edge deep learning. Autom. Constr. 2026, 183, 106806.
31. Feng, Z.; Guo, Y.; Navarro-Alarcon, D.; Lyu, Y.; Sun, Y. InconSeg: Residual-Guided Fusion with Inconsistent Multi-Modal Data for Negative and Positive Road Obstacles Segmentation. IEEE Robot. Autom. Lett. 2023, 8, 4871–4878.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306.
34. Li, N.; Zhang, X.; Li, B.; Yuan, B.; Yang, G. IE-collaborative attention for spatial feature refinement and boundary aware in real-time semantic segmentation. Neurocomputing 2025, 653, 131096.
35. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
36. Li, X.; Xu, F.; Yu, A.; Lyu, X.; Gao, H.; Zhou, J. A Frequency Decoupling Network for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–21.
37. Lv, Y.; Liu, Z.; Li, G. Context-Aware Interaction Network for RGB-T Semantic Segmentation. IEEE Trans. Multimed. 2024, 26, 6348–6360.
38. Ma, X.; Zhang, X.; Pun, M.O.; Huang, B. A unified framework with multimodal fine-tuning for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5405015.
39. Wang, Y.; Li, G.; Liu, Z. SGFNet: Semantic-Guided Fusion Network for RGB-Thermal Semantic Segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7737–7748.
40. Bai, L.; Yang, J.; Tian, C.; Sun, Y.; Mao, M.; Xu, Y.; Xu, W. DCANet: Differential Convolution Attention Network for RGB-D Semantic Segmentation. Pattern Recognit. 2025, 162, 111379.









| | Encoder: Initial | Encoder: 1st | Encoder: 2nd | Encoder: 3rd | Encoder: 4th | Decoder: 1st | Decoder: 2nd | Decoder: 3rd | Decoder: 4th | Decoder: 5th |
|---|---|---|---|---|---|---|---|---|---|---|
| Input Size | 512 × 512 | 256 × 256 | 128 × 128 | 64 × 64 | 32 × 32 | 16 × 16 | 32 × 32 | 64 × 64 | 128 × 128 | 256 × 256 |
| Output Size | 256 × 256 | 128 × 128 | 64 × 64 | 32 × 32 | 16 × 16 | 32 × 32 | 64 × 64 | 128 × 128 | 256 × 256 | 512 × 512 |
| Input Channel | 3 | 64 | 64 | 128 | 256 | 512 | 256 | 128 | 64 | 32 |
| Output Channel | 64 | 64 | 128 | 256 | 512 | 256 | 128 | 64 | 32 | 2 |

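The shape progression in the table above follows a simple pattern: each encoder stage halves the spatial size and each decoder stage doubles it. The sketch below reproduces the listed output shapes under that assumption; the function names are hypothetical, and nothing here reflects how the layers are actually implemented.

```python
def encoder_shapes(size=512, channels=(64, 64, 128, 256, 512)):
    """Output (channels, height, width) per encoder stage, halving the size each time."""
    shapes = []
    for ch in channels:
        size //= 2
        shapes.append((ch, size, size))
    return shapes

def decoder_shapes(size=16, channels=(256, 128, 64, 32, 2)):
    """Output (channels, height, width) per decoder stage, doubling the size each time."""
    shapes = []
    for ch in channels:
        size *= 2
        shapes.append((ch, size, size))
    return shapes

ENC = encoder_shapes()  # initial conv plus four encoder layers
DEC = decoder_shapes()  # five decoder layers, ending at 2 output classes
```

The final decoder output of 2 channels at 512 × 512 matches a per-pixel two-class prediction at the input resolution.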
| Model | Input | Backbone | LR | Decay | Weight Decay | Batch | Epochs | Architecture-Defining Modules (Source) |
|---|---|---|---|---|---|---|---|---|
| ADFormer | RGB/DIS | ResNet-34 | | | | 8 | 100 | Attention-decomposition module; 8 attention heads [19] |
| EDAFormer | RGB/DIS | ResNet-34 | | | | 8 | 100 | Embedding-free attention; 8 heads; spatial-reduction ratios [20] |
| FDNet | RGB/DIS | ResNet-34 | | | | 8 | 100 | Frequency-decoupling modules (high-/low-frequency branches) [36] |
| SegNeXt | RGB/DIS | ResNet-34 | | | | 8 | 100 | Multi-scale convolutional attention (kernels ) + Hamburger decoder [18] |
| EAEFNet | RGB+DIS | ResNet-34 | | | | 8 | 100 | Explicit attention-enhanced fusion (EAEF) module [16] |
| CAINet | RGB+DIS | ResNet-34 | | | | 8 | 100 | Context-aware interaction module with global context [37] |
| MFNet | RGB+DIS | ResNet-34 | | | | 8 | 100 | Multimodal fine-tuning fusion adapters [38] |
| SGFNet | RGB+DIS | ResNet-34 | | | | 8 | 100 | Semantic-guided fusion module [39] |
| DCANet | RGB+DIS | ResNet-34 | | | | 8 | 100 | Differential convolution attention module [40] |
| PotCrackSeg | RGB+DIS | ResNet-34 | | | | 8 | 100 | Dual semantic-feature complementary fusion (DSCF) + CompSemFE [15] |
| FAFMNet (Ours) | RGB+DIS | ResNet-34 | | | | 8 | 100 | Feature fusion module (global + local branches) with SDAF/CIF/MCF attention |

| Variants | mPre (%) | mRec (%) | mAcc (%) | mF1 (%) | mIoU (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| Baseline | 91.25 | 86.23 | 98.32 | 88.67 | 79.64 | 64.79 | 112.22 |
| Baseline + CIF | 91.32 | 89.63 | 98.57 (+0.25) | 90.26 (+1.59) | 82.25 (+2.61) | 64.71 | 112.14 |
| Baseline + SDAF | 90.97 | 89.82 | 98.62 (+0.30) | 90.39 (+1.72) | 82.46 (+2.82) | 64.82 | 112.24 |
| Baseline + MCF | 90.86 | 90.16 | 98.67 (+0.35) | 90.51 (+1.84) | 82.66 (+3.02) | 65.04 | 112.61 |

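The bracketed values in the ablation table are absolute gains over the baseline in percentage points. A small sketch of that arithmetic, using only numbers from the table (the dictionary layout and function name are assumptions for illustration):

```python
# Baseline and single-module variants, values in percent from the ablation table.
BASELINE = {"mAcc": 98.32, "mF1": 88.67, "mIoU": 79.64}
VARIANTS = {
    "Baseline + CIF": {"mAcc": 98.57, "mF1": 90.26, "mIoU": 82.25},
    "Baseline + SDAF": {"mAcc": 98.62, "mF1": 90.39, "mIoU": 82.46},
    "Baseline + MCF": {"mAcc": 98.67, "mF1": 90.51, "mIoU": 82.66},
}

def gains(variant, baseline=BASELINE):
    """Absolute gain over the baseline, in percentage points, rounded to 2 decimals."""
    return {metric: round(variant[metric] - baseline[metric], 2) for metric in baseline}

GAINS = {name: gains(metrics) for name, metrics in VARIANTS.items()}
```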
| No. | Module at Initial | Module at 1st | Module at 2nd | Module at 3rd | Module at 4th | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|---|---|
| A | SDAF | CIF | MCF | MCF | MCF | 90.64 | 81.09 |
| B | CIF | SDAF | MCF | MCF | MCF | 89.66 | 81.26 |
| C | CIF | CIF | SDAF | MCF | MCF | 90.04 | 81.89 |
| D | CIF | SDAF | SDAF | MCF | MCF | 90.19 | 82.14 |
| E | SDAF | CIF | CIF | MCF | MCF | 90.44 | 82.56 |
| F | SDAF | SDAF | SDAF | CIF | MCF | 90.23 | 82.20 |
| G | SDAF | SDAF | CIF | CIF | MCF | 90.45 | 82.56 |
| H | SDAF | CIF | CIF | CIF | MCF | 90.43 | 82.54 |
| I | CIF | SDAF | SDAF | SDAF | MCF | 89.96 | 81.76 |
| J | CIF | CIF | SDAF | SDAF | MCF | 90.24 | 82.22 |
| K | CIF | CIF | CIF | SDAF | MCF | 89.68 | 81.29 |
| L | SDAF | SDAF | CIF | MCF | MCF | 91.26 | 83.93 |

| Approach | BF1 (%) | Trimap IoU (%) | Hausdorff Distance (px) |
|---|---|---|---|
| MAFNet | 83.47 | 72.58 | 9.64 |
| FAFMNet (Ours) | 87.92 | 77.36 | 6.81 |
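For readers unfamiliar with the Hausdorff distance reported above, the sketch below computes the standard symmetric Hausdorff distance between two sets of boundary pixel coordinates. How the boundaries are extracted, and whether a percentile variant is used in the evaluation, is not stated here, so this is the textbook definition rather than the exact evaluation code; the function names and the toy coordinates are hypothetical.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets (pixels)."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Toy boundaries as (row, col) pixel coordinates.
pred_boundary = [(0, 0)]
true_boundary = [(0, 0), (3, 4)]
distance = hausdorff(pred_boundary, true_boundary)
```

A lower value means the worst-case boundary deviation is smaller, which is why it complements region-overlap metrics such as IoU.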

| Approach | mPre (%) | mRec (%) | mAcc (%) | mF1 (%) | mIoU (%) |
|---|---|---|---|---|---|
| ADFormer (RGB) | 68.51 ± 0.53 | 72.24 ± 0.42 | 94.35 ± 0.24 | 70.79 ± 0.34 | 57.79 ± 0.41 |
| ADFormer (DIS) | 82.32 ± 0.58 | 84.21 ± 0.62 | 96.77 ± 0.18 | 83.25 ± 0.47 | 73.31 ± 0.53 |
| EDAFormer (RGB) | 66.74 ± 0.73 | 74.89 ± 0.51 | 93.53 ± 0.28 | 70.58 ± 0.42 | 54.54 ± 0.45 |
| EDAFormer (DIS) | 85.98 ± 0.54 | 84.33 ± 0.51 | 97.21 ± 0.15 | 85.14 ± 0.37 | 74.13 ± 0.42 |
| FDNet (RGB) | 73.21 ± 0.61 | 75.78 ± 0.57 | 95.27 ± 0.22 | 74.47 ± 0.50 | 59.33 ± 0.43 |
| FDNet (DIS) | 86.82 ± 0.57 | 82.33 ± 0.53 | 96.45 ± 0.17 | 84.52 ± 0.43 | 73.18 ± 0.40 |
| SegNeXt (RGB) | 76.25 ± 0.67 | 69.61 ± 0.73 | 95.23 ± 0.25 | 72.78 ± 0.62 | 61.08 ± 0.67 |
| SegNeXt (DIS) | 87.54 ± 0.52 | 81.63 ± 0.55 | 97.85 ± 0.16 | 85.54 ± 0.41 | 74.73 ± 0.53 |
| EAEFNet | 91.61 ± 0.32 | 83.42 ± 0.38 | 98.43 ± 0.13 | 87.32 ± 0.30 | 77.50 ± 0.34 |
| CAINet | 88.26 ± 0.42 | 87.24 ± 0.47 | 98.51 ± 0.12 | 87.75 ± 0.36 | 78.17 ± 0.42 |
| MFNet | 90.04 ± 0.43 | 85.28 ± 0.40 | 98.45 ± 0.13 | 87.60 ± 0.35 | 77.93 ± 0.44 |
| SGFNet | 85.68 ± 0.32 | 89.52 ± 0.47 | 98.44 ± 0.14 | 87.56 ± 0.38 | 77.87 ± 0.41 |
| DCANet | 90.93 ± 0.23 | 84.59 ± 0.33 | 98.60 ± 0.11 | 88.32 ± 0.20 | 79.09 ± 0.26 |
| PotCrackSeg | 91.45 ± 0.33 | 87.42 ± 0.34 | 98.65 ± 0.11 | 89.39 ± 0.28 | 80.81 ± 0.35 |
| FAFMNet (Ours) | 90.22 ± 0.26 | 92.32 ± 0.25 | 98.73 ± 0.09 | 91.26 ± 0.21 | 83.93 ± 0.27 |

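For the rows checked below, the reported mF1 appears to equal the harmonic mean of the reported mPre and mRec. Whether mF1 is computed this way for every row is an assumption; values are given as mean ± deviation, so some rows may not satisfy the identity exactly. The sketch uses only the central values from the table, and the names are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    return 2.0 * precision * recall / (precision + recall)

# (mPre, mRec, reported mF1) central values for three rows of the comparison table.
REPORTED = {
    "EAEFNet": (91.61, 83.42, 87.32),
    "PotCrackSeg": (91.45, 87.42, 89.39),
    "FAFMNet (Ours)": (90.22, 92.32, 91.26),
}

# Absolute difference between the recomputed and the reported mF1, per model.
RESIDUALS = {
    name: abs(f1_from(pre, rec) - f1) for name, (pre, rec, f1) in REPORTED.items()
}
```

The identity also makes the trade-off visible: FAFMNet has a lower mPre than PotCrackSeg but a markedly higher mRec, which is what lifts its mF1.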
| Approach | Backbone | Parameters (M) | FLOPs (G) |
|---|---|---|---|
| EAEFNet | ResNet-34 | 67.34 | 121.37 |
| CAINet | ResNet-34 | 66.69 | 118.32 |
| MFNet | ResNet-34 | 65.21 | 112.65 |
| SGFNet | ResNet-34 | 67.05 | 119.85 |
| DCANet | ResNet-34 | 64.85 | 111.76 |
| PotCrackSeg | ResNet-34 | 65.86 | 116.74 |
| FAFMNet (Ours) | ResNet-34 | 65.36 | 113.18 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Fu, J.; Li, H.; Liu, Q.; Zheng, G.; Zhang, J.; Jiang, J.; Zhang, C. FAFMNet: Feature Attention Fusion Multimodal Network of Road Potholes for Mobile Robot. Eng 2026, 7, 289. https://doi.org/10.3390/eng7060289

