Dual-Level Attention Relearning for Cross-Modality Rotated Object Detection in UAV RGB–Thermal Imagery
Highlights
- The proposed DLANet introduces a dual-level attention relearning mechanism (IF2M + ABFW) that moves beyond simple feature addition, achieving a 9.6–12.7% performance gain by actively suppressing modality-specific noise (such as thermal ghosts) while preserving fine-grained complementary details.
- To address the scarcity of land-based multimodal inspection data, a dedicated OilLeak dataset is constructed, coupled with a novel coarse-to-fine registration strategy that effectively resolves spatial misalignments in coaxial UAV imagery.
- A critical synergistic mechanism is established: while the Implicit Fine-Grained Fusion Module (IF2M) captures deep semantic interactions, the Adaptive Branch Feature Weighting (ABFW) module acts as a dynamic gatekeeper, ensuring robustness under environmental extremes (e.g., occlusion, low light) where single-modality methods fail.
- The framework challenges the trend toward heavy Transformer-based fusion: it achieves state-of-the-art accuracy with the fewest parameters (39.04 M) and a low computational cost (72.69 GFLOPs) among the compared methods. This demonstrates that lightweight, rotation-aware CNNs are well suited to real-time edge deployment in resource-constrained UAV operations.
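The ABFW idea in the highlights — dynamically regulating each modality's contribution per scene — can be sketched as a softmax-normalized pair of scalar gates over the RGB and TIR branches. This is an illustrative reconstruction, not the paper's exact module: the function name, the scalar-logit parameterization, and the toy features are assumptions.

```python
import numpy as np

def abfw_fuse(f_rgb, f_tir, logit_rgb, logit_tir):
    """Hypothetical ABFW-style fusion: softmax-normalized scalar gates
    weight each modality branch before element-wise addition."""
    logits = np.array([logit_rgb, logit_tir], dtype=np.float64)
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    a_r, a_i = exp / exp.sum()            # alpha_r + alpha_i == 1
    return a_r * f_rgb + a_i * f_tir, (a_r, a_i)

# Toy 2x2 feature maps; a larger TIR logit mimics a low-light scene,
# so the fused map leans toward the thermal branch.
f_rgb = np.ones((2, 2))
f_tir = 2.0 * np.ones((2, 2))
fused, (a_r, a_i) = abfw_fuse(f_rgb, f_tir, logit_rgb=-1.0, logit_tir=1.0)
```

Because the gates are softmax-normalized, the weights always sum to one, matching the complementary αr/αi behavior reported across illumination bins later in the paper.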
Abstract
1. Introduction
- (1) The introduction of an Implicit Fine-Grained Fusion Module (IF2M) to model channel–spatial dependencies and an Adaptive Branch Feature Weighting (ABFW) module to dynamically regulate modality contributions based on scene conditions.
- (2) An efficient coarse–fine multimodal alignment strategy to ensure reliable feature correspondence.
- (3) The construction of the “OilLeak” dataset to support research in UAV-based RGB–TIR detection for petroleum-leak monitoring.
2. Data and Methods
2.1. Data
2.1.1. DroneVehicle Dataset
2.1.2. OilLeak Dataset
2.1.3. Data Preparation and Statistical Overview
2.2. Methods
2.2.1. Overview
2.2.2. Cross-Modality UAV Image Registration
- (1) Geometric Optics-Based Coarse Alignment of TIR and RGB Images
- (2) Fine Registration Based on Hierarchical Feature Template Matching
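The fine-registration step named above relies on template matching. A minimal single-level sketch, using zero-normalized cross-correlation on raw intensities rather than the paper's hierarchical features, looks like this; the function name, search-window size, and synthetic data are assumptions for illustration.

```python
import numpy as np

def ncc_offset(ref, tpl, search=4):
    """Estimate the integer (dy, dx) shift of template `tpl` inside `ref`
    by maximizing zero-normalized cross-correlation over a small window."""
    th, tw = tpl.shape
    t = tpl - tpl.mean()
    best, best_off = -np.inf, (0, 0)
    for dy in range(search + 1):
        for dx in range(search + 1):
            win = ref[dy:dy + th, dx:dx + tw]
            w = win - win.mean()
            denom = np.sqrt((w * w).sum() * (t * t).sum())
            if denom == 0:
                continue
            score = (w * t).sum() / denom
            if score > best:
                best, best_off = score, (dy, dx)
    return best_off

# Synthetic check: cut a template from a reference at a known (3, 2) shift
rng = np.random.default_rng(0)
ref = rng.random((32, 32))
tpl = ref[3:3 + 16, 2:2 + 16]
offset = ncc_offset(ref, tpl)
```

A hierarchical variant would run the same search coarse-to-fine on feature maps at several scales, narrowing the window at each level.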
2.3. The Architecture of the Cross-Modality Network
2.4. Dual-Level Attention Relearning Mechanism
2.4.1. Implicit Fine-Grained Feature Module (IF2M)
- (1) Sub-feature Group Partitioning
- (2) Enhanced Directional Spatial Attention
- (3) Cross-Branch Attention Modeling
2.4.2. The ABFW Module
3. Experimental Results and Analysis
3.1. Implementation Details
3.1.1. Training Configurations
3.1.2. Evaluation Metrics
3.2. Comparative Evaluation of Detection Performance
3.2.1. Evaluation on the DroneVehicle Dataset
3.2.2. Evaluation on the OilLeak Dataset
3.3. Ablation Study of DLANet
3.3.1. Module Contribution Analysis
3.3.2. Effect of Different K Values in IF2M
3.3.3. Sensitivity Analysis of Detection Accuracy to Simulated Registration Offsets
3.4. Detection Accuracy and Rotation Angle Error Across Different Target Sizes
3.5. Robustness Assessment to Varying Illumination and Occlusion
3.6. Feature-Level Interpretability and Visualization Analysis
4. Discussions
4.1. Synergistic Effects of Dual-Level Attention Relearning
4.2. Robustness Under Challenging Environmental Conditions
4.3. Mitigation of Modality-Specific False Positives
4.4. Efficiency Paradox: Challenging the Transformer Trend
4.5. Potential and Limitations
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| UAV | Unmanned aerial vehicle |
| DLANet | Dual-level attention network |
| TIR | Thermal infrared |
| IF2M | Implicit fine-grained fusion module |
| ABFW | Adaptive branch feature weighting |
| YOLO | You only look once |
| SSD | Single-shot MultiBox detector |
| CMAFF | Cross-modal attention feature fusion |
| RISNet | Redundant information suppression network |
| NMS | Non-maximum suppression |
| CFT | Cross-modal feature Transformer |
| UMA | Unified modality attention |
| CMA | Cross-modality attention |
| HRNet | High resolution network |
| CFR | Context feature representation |
| C2PSA | Cross stage partial with spatial attention |
| SPPF | Spatial pyramid pooling fast |
| MSFC | Multi-scale feature cascade |
| AAP | Adaptive average pooling |
Appendix A. The Architecture of the DLANet
| Module | Scale | Unit | Kernels’ Number | Parameters | Input | Output |
|---|---|---|---|---|---|---|
| Backbone (RGB/TIR) | R1 | Conv | 64 | k:3 × 3, s:2, p:1 | 640 × 640 × 3 | 320 × 320 × 64 |
| | R2 | Conv | 128 | k:3 × 3, s:2, p:1 | 320 × 320 × 64 | 160 × 160 × 128 |
| | | C3K2 | 32, 64, 128, 256 | k:3 × 3/1 × 1, s:1, p:1 | 160 × 160 × 128 | 160 × 160 × 256 |
| | R3 | Conv | 256 | k:3 × 3, s:2, p:1 | 160 × 160 × 256 | 80 × 80 × 256 |
| | | C3K2 | 64, 128, 256, 512 | k:3 × 3/1 × 1, s:1, p:1 | 80 × 80 × 256 | 80 × 80 × 512 |
| | R4 | Conv | 512 | k:3 × 3, s:2, p:1 | 80 × 80 × 512 | 40 × 40 × 512 |
| | | C3K2 | 128, 256, 512, 1024 | k:3 × 3/1 × 1, s:1, p:1 | 40 × 40 × 512 | 40 × 40 × 512 |
| | R5 | Conv | 512 | k:3 × 3, s:2, p:1 | 40 × 40 × 512 | 20 × 20 × 512 |
| | | C3K2 | 128, 256, 512, 1024 | k:3 × 3/1 × 1, s:1, p:1 | 20 × 20 × 512 | 20 × 20 × 512 |
| | | SPPF | 256, 512, 1024 | k:5 × 5/1 × 1, s:1, p:1,2 | 20 × 20 × 512 | 20 × 20 × 512 |
| | | C2PSA | 256, 512 | k:3 × 3/1 × 1, s:1, p:1 | 20 × 20 × 512 | 20 × 20 × 512 |
| IF2M | R3 | Softmax, AAP, Conv | 32 | k:3 × 3/1 × 1, s:1, p:1 | 80 × 80 × 256, 80 × 80 × 256 | 80 × 80 × 256, 80 × 80 × 256 |
| | R4 | | 64 | k:3 × 3/1 × 1, s:1, p:1 | 40 × 40 × 512, 40 × 40 × 512 | 40 × 40 × 512, 40 × 40 × 512 |
| | R5 | | 64 | k:3 × 3/1 × 1, s:1, p:1 | 20 × 20 × 512, 20 × 20 × 512 | 20 × 20 × 512, 20 × 20 × 512 |
| ABFW | R3 | Add | 512 | αr, αi | 80 × 80 × 512, 80 × 80 × 512 | 80 × 80 × 512 |
| | R4 | | 512 | αr, αi | 40 × 40 × 512, 40 × 40 × 512 | 40 × 40 × 512 |
| | R5 | | 512 | αr, αi | 20 × 20 × 512, 20 × 20 × 512 | 20 × 20 × 512 |
| Detection head | R3 | Conv | 64, 256 | k:3 × 3/1 × 1, s:1, p:1 | 80 × 80 × 256 | 80 × 80 × (5 + L) |
| | R4 | | 64, 512 | k:3 × 3/1 × 1, s:1, p:1 | 40 × 40 × 512 | 40 × 40 × (5 + L) |
| | R5 | | 64, 512 | k:3 × 3/1 × 1, s:1, p:1 | 20 × 20 × 512 | 20 × 20 × (5 + L) |
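The spatial sizes in the appendix table follow from the standard convolution output-size formula, floor((size − k + 2p)/s) + 1. A quick check of the five stride-2 stem convolutions (k = 3, s = 2, p = 1) reproduces the 640 → 320 → 160 → 80 → 40 → 20 progression; the helper name is illustrative.

```python
def conv_out(size, k, s, p):
    """Standard convolution output-size formula: floor((size - k + 2p)/s) + 1."""
    return (size - k + 2 * p) // s + 1

# Walk the five stride-2 convolutions of the backbone stem (R1..R5)
sizes = [640]
for _ in range(5):
    sizes.append(conv_out(sizes[-1], k=3, s=2, p=1))
```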
References
- Han, W.; Chen, J.; Wang, L.; Feng, R.; Li, F.; Wu, L.; Tian, T.; Yan, J. Methods for Small, Weak Object Detection in Optical High-Resolution Remote Sensing Images: A Survey of Advances and Challenges. IEEE Geosci. Remote Sens. Mag. 2021, 9, 8–34. [Google Scholar] [CrossRef]
- Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Building Damage Assessment for Rapid Disaster Response with a Deep Object-Based Semantic Change Detection Framework: From Natural Disasters to Man-Made Disasters. Remote Sens. Environ. 2021, 265, 112636. [Google Scholar] [CrossRef]
- Zhong, Y.; Zheng, Z.; Ma, A.; Lu, X.; Zhang, L. COLOR: Cycling, Offline Learning, and Online Representation Framework for Airport and Airplane Detection Using GF-2 Satellite Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8438–8449. [Google Scholar] [CrossRef]
- Hu, J.; Zhi, X.; Shi, T.; Wang, J.; Li, Y.; Sun, X. Dataset and Benchmark for Ship Detection in Complex Optical Remote Sensing Image. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5642611. [Google Scholar] [CrossRef]
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet++ for Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 3509–3521. [Google Scholar] [CrossRef]
- Xiao, S.; Wang, P.; Diao, W.; Rong, X.; Li, X.; Fu, K.; Sun, X. MoCG: Modality Characteristics-Guided Semantic Segmentation in Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5625818. [Google Scholar] [CrossRef]
- Yuan, D.; Zhang, H.; Shu, X.; Liu, Q.; Chang, X.; He, Z.; Shi, G. Thermal Infrared Target Tracking: A Comprehensive Review. IEEE Trans. Instrum. Meas. 2023, 73, 5000419. [Google Scholar] [CrossRef]
- Feng, M.; Su, J. RGBT Tracking: A Comprehensive Review. Inf. Fusion 2024, 110, 102492. [Google Scholar] [CrossRef]
- Feng, D.; Haase-Schutz, C.; Rosenbaum, L.; Hertlein, H.; Glaser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep Multi-Modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges. IEEE Trans. Intell. Transp. Syst. 2021, 22, 1341–1360. [Google Scholar] [CrossRef]
- Bao, W.; Huang, M.; Hu, J.; Xiang, X. Dual-Dynamic Cross-Modal Interaction Network for Multimodal Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5401013. [Google Scholar] [CrossRef]
- Zhang, N.; Chai, B.; Song, J.; Tian, T.; Zhu, P.; Ma, J.; Tian, J. Omni-Scene Infrared Vehicle Detection: An Efficient Selective Aggregation Approach and a Unified Benchmark. ISPRS J. Photogramm. Remote Sens. 2025, 223, 244–260. [Google Scholar] [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
- Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3163–3171. [Google Scholar] [CrossRef]
- Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
- Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
- Cheng, G.; Han, J. A Survey on Object Detection in Optical Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
- Yan, D.; Zhang, H.; Li, G.; Li, X.; Lei, H.; Lu, K.; Zhang, L.; Zhu, F. Improved Method to Detect the Tailings Ponds from Multispectral Remote Sensing Images Based on Faster R-CNN and Transfer Learning. Remote Sens. 2021, 14, 103. [Google Scholar] [CrossRef]
- Zhao, W.; Zhao, Z.; Xu, M.; Ding, Y.; Gong, J. Differential Multimodal Fusion Algorithm for Remote Sensing Object Detection through Multi-Branch Feature Extraction. Expert Syst. Appl. 2025, 265, 125826. [Google Scholar] [CrossRef]
- Kang, X.; Yin, H.; Duan, P. Global–Local Feature Fusion Network for Visible–Infrared Vehicle Detection. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6005805. [Google Scholar] [CrossRef]
- Fang, Q.; Wang, Z. Cross-Modality Attentive Feature Fusion for Object Detection in Multispectral Remote Sensing Imagery. Pattern Recognit. 2022, 130, 108786. [Google Scholar] [CrossRef]
- Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection Via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
- Wang, Q.; Chi, Y.; Shen, T.; Song, J.; Zhang, Z.; Zhu, Y. Improving RGB-Infrared Object Detection by Reducing Cross-Modality Redundancy. Remote Sens. 2022, 14, 2020. [Google Scholar] [CrossRef]
- Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-Aware Faster R-CNN for Robust Multispectral Pedestrian Detection. Pattern Recognit. 2019, 85, 161–171. [Google Scholar] [CrossRef]
- Wang, S.; Wang, C.; Shi, C.; Liu, Y.; Lu, M. Mask-Guided Mamba Fusion for Drone-Based Visible-Infrared Vehicle Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5005712. [Google Scholar] [CrossRef]
- Fang, Q.; Han, D.; Wang, Z. Cross-Modality Fusion Transformer for Multispectral Object Detection. arXiv 2022, arXiv:2111.00273. [Google Scholar] [CrossRef]
- Jiang, C.; Ren, H.; Yang, H.; Huo, H.; Zhu, P.; Yao, Z.; Li, J.; Sun, M.; Yang, S. M2FNet: Multi-Modal Fusion Network for Object Detection from Visible and Thermal Infrared Images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103918. [Google Scholar] [CrossRef]
- Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
- Zhou, M.; Li, T.; Qiao, C.; Xie, D.; Wang, G.; Ruan, N.; Mei, L.; Yang, Y.; Shen, H.T. DMM: Disparity-Guided Multispectral Mamba for Oriented Object Detection in Remote Sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5404913. [Google Scholar] [CrossRef]
- Asadzadeh, S.; de Oliveira, W.J.; de Souza Filho, C.R. UAV-Based Remote Sensing for the Petroleum Industry and Environmental Monitoring: State-of-the-Art and Perspectives. J. Pet. Sci. Eng. 2022, 208, 109633. [Google Scholar] [CrossRef]
- De Kerf, T.; Sels, S.; Samsonova, S.; Vanlanduit, S. A Dataset of Drone-Captured, Segmented Images for Oil Spill Detection in Port Environments. Sci. Data 2024, 11, 1180. [Google Scholar] [CrossRef]
- Kapil, R.; Castilla, G.; Marvasti-Zadeh, S.M.; Goodsman, D.; Erbilgin, N.; Ray, N. Orthomosaicking Thermal Drone Images of Forests via Simultaneously Acquired RGB Images. Remote Sens. 2023, 15, 2653. [Google Scholar] [CrossRef]
- Meng, L.; Zhou, J.; Liu, S.; Wang, Z.; Zhang, X.; Ding, L.; Shen, L.; Wang, S. A Robust Registration Method for UAV Thermal Infrared and Visible Images Taken by Dual-Cameras. ISPRS J. Photogramm. Remote Sens. 2022, 192, 189–214. [Google Scholar] [CrossRef]
- Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
- Li, Z.; Chen, S.; Meng, X.; Zhu, R.; Lu, J.; Cao, L.; Lu, P. Full Convolution Neural Network Combined with Contextual Feature Representation for Cropland Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 2157. [Google Scholar] [CrossRef]
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
- Rasheed, A.F.; Zarkoosh, M. YOLOv11 Optimization for Efficient Resource Utilization. arXiv 2024, arXiv:2412.14790. [Google Scholar] [CrossRef]
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhu, G.; Yuan, B.; Sun, Y.; Zhang, W. Adaptive Feature Fusion with Attention-Guided Small Target Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5623116. [Google Scholar] [CrossRef]
- Zhang, N.; Liu, Y.; Liu, H.; Tian, T.; Ma, J.; Tian, J. DTNet: A Specialized Dual-Tuning Network for Infrared Vehicle Detection in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5002815. [Google Scholar] [CrossRef]
- Zhang, N.; Liu, Y.; Liu, H.; Tian, T.; Tian, J. Oriented Infrared Vehicle Detection in Aerial Images via Mining Frequency and Semantic Information. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5002315. [Google Scholar] [CrossRef]
- Wang, J.; Xu, C.; Zhao, C.; Gao, L.; Wu, J.; Yan, Y.; Feng, S.; Su, N. Multimodal Object Detection of UAV Remote Sensing Based on Joint Representation Optimization and Specific Information Enhancement. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 12364–12373. [Google Scholar] [CrossRef]
- Liu, J.; Chen, H.; Wang, Y. Multi-Source Remote Sensing Image Fusion for Ship Target Detection and Recognition. Remote Sens. 2021, 13, 4852. [Google Scholar] [CrossRef]
- Zhao, M.; Li, W.; Li, L.; Hu, J.; Ma, P.; Tao, R. Single-Frame Infrared Small-Target Detection: A Survey. IEEE Geosci. Remote Sens. Mag. 2022, 10, 87–119. [Google Scholar] [CrossRef]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhu, L. A Review on Unmanned Aerial Vehicle Remote Sensing: Platforms, Sensors, Data Processing Methods, and Applications. Drones 2023, 7, 398. [Google Scholar] [CrossRef]
- Arroyo-Mora, J.P.; Kalacska, M.; Løke, T.; Schläpfer, D.; Coops, N.C.; Lucanus, O.; Leblanc, G. Assessing the Impact of Illumination on UAV Pushbroom Hyperspectral Imagery Collected under Various Cloud Cover Conditions. Remote Sens. Environ. 2021, 258, 112396. [Google Scholar] [CrossRef]
- Bhadoriya, A.S.; Vegamoor, V.; Rathinam, S. Vehicle Detection and Tracking Using Thermal Cameras in Adverse Visibility Conditions. Sensors 2022, 22, 4567. [Google Scholar] [CrossRef] [PubMed]
- Li, J.; Fan, C.; Ou, C.; Zhang, H. Infrared and Visible Image Fusion Techniques for UAVs: A Comprehensive Review. Drones 2025, 9, 811. [Google Scholar] [CrossRef]
- Kahraman, S.; Bacher, R. A Comprehensive Review of Hyperspectral Data Fusion with Lidar and Sar Data. Annu. Rev. Control 2021, 51, 236–253. [Google Scholar] [CrossRef]
- Zhao, J.; Fang, D.; Ying, J.; Chen, Y.; Chen, Q.; Wang, Q.; Wang, G.; Zhou, B. A Camouflage Target Classification Method Based on Spectral Difference Enhancement and Pixel-Pair Features in Land-Based Hyperspectral Images. Eng. Appl. Artif. Intell. 2025, 156, 111141. [Google Scholar] [CrossRef]
- Alosaimi, N.; Alhichri, H.; Bazi, Y.; Ben Youssef, B.; Alajlan, N. Self-Supervised Learning for Remote Sensing Scene Classification under the Few Shot Scenario. Sci. Rep. 2023, 13, 433. [Google Scholar] [CrossRef]
- Ling, Q.; Li, Y.; An, Y.; Zhu, Z.; Li, P. Open-Set Remote Sensing Object Detection Method Based on Semantic Space Topological Features. ISPRS J. Photogramm. Remote Sens. 2025, 230, 881–894. [Google Scholar] [CrossRef]
- Wang, Y.; Albrecht, C.M.; Braham, N.A.A.; Mou, L.; Zhu, X.X. Self-Supervised Learning in Remote Sensing: A Review. IEEE Geosci. Remote Sens. Mag. 2022, 10, 213–247. [Google Scholar] [CrossRef]
- Cha, E.; Lee, C.; Jang, M.; Ye, J.C. DeepPhaseCut: Deep Relaxation in Phase for Unsupervised Fourier Phase Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9931–9943. [Google Scholar] [CrossRef]
| Data Name | Category Number | Resolution | Nadir Angles | Heights (m) | Observation Time | Training | Validation | Testing |
|---|---|---|---|---|---|---|---|---|
| DroneVehicle [23] | 5 | 840 × 712 | 0°, 15°, 30°, 45° | 80–120 | Daytime and nighttime | 17,990 | 1469 | 8980 |
| OilLeak | 1 | 640 × 512 | 0° | 30–100 | Daytime | 1743 | 449 | 450 |
| Methods | Modality | AP0.5 (Car) | AP0.5 (Freight Car) | AP0.5 (Truck) | AP0.5 (Bus) | AP0.5 (Van) | mAP0.5 | mAP0.5:0.95 |
|---|---|---|---|---|---|---|---|---|
| Baseline | RGB | 90.1 | 57.7 | 73.7 | 89.6 | 57.3 | 71.4 | 41.7 |
| Baseline | TIR | 94.5 | 69.1 | 78.0 | 92.1 | 64.3 | 74.7 | 43.8 |
| Baseline | RGB + TIR | 97.4 | 61.5 | 75.9 | 94.5 | 50.4 | 76.2 | 45.3 |
| DTNet [41] | TIR | 90.2 | 78.1 | 65.7 | 89.2 | 67.9 | 78.2 | 52.9 |
| I2MDet [42] | TIR | 96.3 | 65.0 | 73.4 | 93.2 | 58.6 | 77.3 | 46.2 |
| M2FNet [28] | RGB + TIR | - | - | - | - | - | 71.5 | - |
| FFODNet [43] | RGB + TIR | 90.4 | 68.4 | 72.6 | 89.2 | 64.1 | 76.9 | - |
| DDCINet-RoI-Trans [10] | RGB + TIR | 91.0 | 66.1 | 78.9 | 90.7 | 65.5 | 78.4 | - |
| UA-CMDet [23] | RGB + TIR | 88.6 | 56.7 | 72.5 | 88.5 | 54.8 | 73.2 | 51.1 |
| CFT [27] | RGB + TIR | 97.8 | 58.1 | 74.0 | 94.6 | 51.1 | 75.1 | 58.8 |
| MGMF [26] | RGB + TIR | 91.4 | 78.5 | 70.1 | 91.1 | 69.4 | 80.3 | 55.2 |
| DMM [30] | RGB + TIR | 90.4 | 68.2 | 79.8 | 89.9 | 68.6 | 79.4 | 52.1 |
| DLANet | RGB + TIR | 98.7 | 77.3 | 82.1 | 97.6 | 71.6 | 85.8 | 69.1 |
| Methods | Modality | Precision | Recall | mAP0.5 | mAP0.5:0.95 |
|---|---|---|---|---|---|
| Baseline | RGB | 62.8 | 70.0 | 70.9 | 55.2 |
| Baseline | TIR | 72.2 | 65.0 | 71.0 | 56.9 |
| Baseline | RGB + TIR | 62.6 | 70.3 | 72.2 | 58.0 |
| UA-CMDet [23] | RGB + TIR | 61.0 | 75.0 | 71.5 | 57.1 |
| CFT [27] | RGB + TIR | 63.2 | 67.5 | 72.8 | 59.1 |
| DLANet | RGB + TIR | 79.9 | 79.3 | 84.9 | 68.3 |
| IF2M | ABFW | Car | Freight Car | Truck | Bus | Van | mAP0.5 | Δ Params (K) |
|---|---|---|---|---|---|---|---|---|
| √ | √ | 98.7 | 77.3 | 82.1 | 97.6 | 71.6 | 85.8 | 92.806 |
| √ | | 98.6 | 71.7 | 83.1 | 96.7 | 67.3 | 83.5 | 92.800 |
| | √ | 98.6 | 66.5 | 80.6 | 95.9 | 59.7 | 80.3 | 0.006 |
| | | 97.4 | 61.5 | 75.9 | 94.5 | 50.4 | 76.2 | 0 |
| IF2M | ABFW | Precision | Recall | mAP0.5 | Δ Params (K) |
|---|---|---|---|---|---|
| √ | √ | 79.9 | 79.3 | 84.9 | 92.806 |
| √ | | 80.4 | 71.7 | 80.8 | 92.800 |
| | √ | 67.2 | 71.6 | 77.6 | 0.006 |
| | | 62.6 | 70.3 | 72.2 | 0 |
| Method | Parameters (M) | GFLOPs | Inference Time (ms) |
|---|---|---|---|
| Baseline | 38.95 | 70.54 | 30.9 |
| M2FNet [28] | 70.0 | - | 106 |
| DDCINet-RoI-Trans [10] | 112.51 | 124.86 | - |
| CFT [27] | 46.83 | 89.10 | 52.9 |
| UA-CMDet [23] | 138.69 | - | 370 |
| MGMF [26] | 122.0 | - | 223 |
| DMM [30] | 87.97 | 137 | - |
| DLANet | 39.04 | 72.69 | 35.1 |
| Methods | Precision (Small) | Recall (Small) | Precision (Medium) | Recall (Medium) | Precision (Large) | Recall (Large) | Precision (Avg.) | Recall (Avg.) |
|---|---|---|---|---|---|---|---|---|
| Baseline | 72.58 | 74.83 | 72.28 | 77.69 | 81.11 | 80.08 | 74.05 | 75.62 |
| UA-CMDet [23] | 67.92 | 69.83 | 67.53 | 79.33 | 81.98 | 78.05 | 70.87 | 73.60 |
| CFT [27] | 69.08 | 77.06 | 69.92 | 80.97 | 80.33 | 81.43 | 71.69 | 77.84 |
| DLANet | 79.69 | 83.62 | 82.21 | 87.34 | 87.43 | 86.84 | 81.93 | 84.35 |
| Method | Size | Acc@5° | Acc@10° | Acc@15° | AAE(°) | NTP | N |
|---|---|---|---|---|---|---|---|
| DLANet | Small | 82.75 | 96.46 | 97.50 | 2.91 | 30,068 | 35,959 |
| | Medium | 94.49 | 97.73 | 98.97 | 1.85 | 106,724 | 122,194 |
| | Large | 94.85 | 98.15 | 99.85 | 1.54 | 1247 | 1479 |
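The angle metrics in the table above can be computed from per-target angle errors. A minimal sketch under the usual definitions — Acc@τ as the percentage of matched predictions with angle error ≤ τ degrees, AAE as the mean error — where the function name and toy error list are illustrative:

```python
def angle_metrics(errors_deg, taus=(5, 10, 15)):
    """Acc@tau (% of predictions with angle error <= tau degrees) and
    average angle error (AAE) over a list of per-target errors."""
    n = len(errors_deg)
    acc = {t: 100.0 * sum(e <= t for e in errors_deg) / n for t in taus}
    aae = sum(errors_deg) / n
    return acc, aae

# Toy angle errors for five matched detections (degrees)
errors = [1.0, 3.0, 7.0, 12.0, 20.0]
acc, aae = angle_metrics(errors)
```

Under these definitions Acc@τ is monotonically non-decreasing in τ, which matches the Acc@5° ≤ Acc@10° ≤ Acc@15° pattern in the table.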
Illumination bins: Darkness (≤ 35), Dusk (35–105), Daytime (105–175), Overexposed (> 175).
| Scale | αr (Darkness) | αi (Darkness) | αr (Dusk) | αi (Dusk) | αr (Daytime) | αi (Daytime) | αr (Overexposed) | αi (Overexposed) |
|---|---|---|---|---|---|---|---|---|
| R3 | 0.05 ± 0.12 | 0.91 ± 0.09 | 0.25 ± 0.12 | 0.71 ± 0.11 | 0.65 ± 0.15 | 0.41 ± 0.08 | 0.25 ± 0.12 | 0.71 ± 0.13 |
| R4 | 0.11 ± 0.07 | 0.84 ± 0.12 | 0.31 ± 0.05 | 0.64 ± 0.12 | 0.51 ± 0.08 | 0.44 ± 0.12 | 0.39 ± 0.17 | 0.58 ± 0.15 |
| R5 | 0.09 ± 0.12 | 0.89 ± 0.13 | 0.41 ± 0.10 | 0.59 ± 0.19 | 0.56 ± 0.12 | 0.40 ± 0.14 | 0.42 ± 0.16 | 0.56 ± 0.22 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Li, Z.; Zhen, Z.; Chen, S.; Zhang, L.; Cao, L. Dual-Level Attention Relearning for Cross-Modality Rotated Object Detection in UAV RGB–Thermal Imagery. Remote Sens. 2026, 18, 107. https://doi.org/10.3390/rs18010107