Efficient Layer-Wise Cross-View Calibration and Aggregation for Multispectral Object Detection
Abstract
1. Introduction
- We propose LCMA, a one-stage RGB-IR object detection framework that aligns multispectral images with contextual information to improve both accuracy and efficiency.
- An Inter-Modality Spatial-Reduction Attention (ISRA) module is embedded in each backbone layer to calibrate the two modalities at the feature level through neighborhood-aware aggregation.
- The calibration and filtering performed by the ISRA and Gated Coupled Filter (GCF) modules, together with the detector head design, form an efficient one-stage oriented detector.
- Extensive experiments on three RGB-IR object detection benchmarks highlight LCMA's superiority, demonstrating significant gains in both accuracy and speed.
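To make the second bullet concrete, here is a minimal NumPy sketch of cross-modal attention with spatial reduction: RGB features attend to IR keys/values that are first average-pooled by a ratio R, which shrinks the attention matrix by R². This is illustrative only; it assumes identity projections, a single head, and no residual connection, and the function name `inter_modality_sra` is my own shorthand, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_modality_sra(rgb, ir, R):
    """Cross-attention where RGB queries attend to spatially reduced IR
    keys/values. rgb, ir: (H, W, C) feature maps; R: reduction ratio
    (H and W must be divisible by R). Illustrative sketch only."""
    H, W, C = rgb.shape
    q = rgb.reshape(H * W, C)                        # queries from RGB
    # spatial reduction: average-pool IR over R x R neighborhoods
    ir_r = ir.reshape(H // R, R, W // R, R, C).mean(axis=(1, 3))
    kv = ir_r.reshape(-1, C)                         # (HW/R^2, C) keys/values
    attn = softmax(q @ kv.T / np.sqrt(C))            # (HW, HW/R^2) attention
    out = attn @ kv                                  # IR-calibrated RGB features
    return out.reshape(H, W, C)

rgb = np.random.rand(8, 8, 16)
ir = np.random.rand(8, 8, 16)
out = inter_modality_sra(rgb, ir, R=4)
print(out.shape)  # (8, 8, 16)
```

With R = 4, the key/value set drops from 64 to 4 tokens, which is the source of the speedups reported in the efficiency tables below; a symmetric pass (IR attending to reduced RGB) would calibrate the other modality.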
2. Related Work
2.1. Oriented Object Detection
2.2. Cross-Modality Fusion
3. Methodology
3.1. Overall Architecture
3.2. Inter-Modality Spatial-Reduction Attention
3.3. Gated Coupled Filter
3.4. One-Stage Oriented Detector
3.5. Loss Function
4. Analysis of ISRA Module
5. Experiments and Results
5.1. Dataset and Implementation Details
5.2. Comparisons with State-of-the-Art Methods
5.2.1. Comparison Methods
5.2.2. DroneVehicle
5.2.3. LLVIP & FLIR
- FLIR
- Classification Performance
5.2.4. Speed and Memory
5.3. Ablation Studies
5.3.1. Effectiveness of Inter-Modality Spatial-Reduction Attention
5.3.2. Effectiveness of Gated Coupled Filter
5.4. Quantitative Visualization
5.5. Comparing the Impact of Diverse Transformer Backbones
One-Stage Oriented Detectors
- Handling of Small Objects
- Efficiency
5.6. Discussion
5.6.1. Visualization
5.6.2. Deployment Analysis
5.6.3. Limitation
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. arXiv 2019, arXiv:1905.05055. [Google Scholar] [CrossRef]
- Song, K.; Xue, X.; Wen, H.; Ji, Y.; Yan, Y.; Meng, Q. Misaligned Visible-Thermal Object Detection: A Drone-based Benchmark and Baseline. IEEE Trans. Intell. Veh. 2024, 9, 7449–7460. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
- Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2786–2795. [Google Scholar]
- Liu, T.; Lam, K.M.; Zhao, R.; Qiu, G. Deep cross-modal representation learning and distillation for illumination-invariant pedestrian detection. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 315–329. [Google Scholar] [CrossRef]
- Xu, D.; Ouyang, W.; Ricci, E.; Wang, X.; Sebe, N. Learning cross-modal deep representations for robust pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5363–5371. [Google Scholar]
- Zhou, T.; Fan, D.P.; Cheng, M.M.; Shen, J.; Shao, L. RGB-D salient object detection: A survey. Comput. Vis. Media 2021, 7, 37–69. [Google Scholar] [CrossRef]
- Tang, L.; Xiang, X.; Zhang, H.; Gong, M.; Ma, J. DIVFusion: Darkness-free infrared and visible image fusion. Inf. Fusion 2023, 91, 477–493. [Google Scholar] [CrossRef]
- Ji, W.; Li, J.; Yu, S.; Zhang, M.; Piao, Y.; Yao, S.; Bi, Q.; Ma, K.; Zheng, Y.; Lu, H.; et al. Calibrated RGB-D salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9471–9481. [Google Scholar]
- Jiang, X.; Wang, N.; Xin, J.; Xia, X.; Yang, X.; Gao, X. Learning lightweight super-resolution networks with weight pruning. Neural Netw. 2021, 144, 21–32. [Google Scholar] [CrossRef]
- Zhou, K.; Chen, L.; Cao, X. Improving multispectral pedestrian detection by addressing modality imbalance problems. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 787–803. [Google Scholar]
- Li, H.; Wang, N.; Ding, X.; Yang, X.; Gao, X. Adaptively learning facial expression representation via cf labels and distillation. IEEE Trans. Image Process. 2021, 30, 2016–2028. [Google Scholar] [CrossRef]
- Zhou, D.; Wang, N.; Peng, C.; Yu, Y.; Yang, X.; Gao, X. Towards multi-domain face synthesis via domain-invariant representations and multi-level feature parts. IEEE Trans. Multimed. 2021, 24, 3469–3479. [Google Scholar] [CrossRef]
- Xie, J.; Anwer, R.M.; Cholakkal, H.; Nie, J.; Cao, J.; Laaksonen, J.; Khan, F.S. Learning a Dynamic Cross-Modal Network for Multispectral Pedestrian Detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 4043–4052. [Google Scholar]
- Yuan, M.; Wang, Y.; Wei, X. Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 509–525. [Google Scholar]
- Liu, Y.; Wang, W.; Feng, C.; Zhang, H.; Chen, Z.; Zhan, Y. Expression snippet transformer for robust video-based facial expression recognition. Pattern Recognit. 2023, 138, 109368. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, H.; Zhan, Y.; Chen, Z.; Yin, G.; Wei, L.; Chen, Z. Noise-resistant multimodal transformer for emotion recognition. Int. J. Comput. Vis. 2025, 133, 3020–3040. [Google Scholar] [CrossRef]
- Zhang, L.; Zhu, X.; Chen, X.; Yang, X.; Lei, Z.; Liu, Z. Weakly aligned cross-modal learning for multispectral pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5127–5137. [Google Scholar]
- Zhang, L.; Liu, Z.; Zhang, S.; Yang, X.; Qiao, H.; Huang, K.; Hussain, A. Cross-modality interactive attention network for multispectral pedestrian detection. Inf. Fusion 2019, 50, 20–29. [Google Scholar] [CrossRef]
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
- Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459. [Google Scholar] [CrossRef]
- Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
- Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3520–3529. [Google Scholar]
- Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
- Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
- Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. AO2-DETR: Arbitrary-Oriented Object Detection Transformer. arXiv 2022, arXiv:2205.12785. [Google Scholar] [CrossRef]
- Yuan, M.; Wei, X. C2Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403712. [Google Scholar] [CrossRef]
- Sun, X.; Yu, Y.; Cheng, Q. Low-rank Multimodal Remote Sensing Object Detection with Frequency Filtering Experts. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5637114. [Google Scholar] [CrossRef]
- Li, C.; Song, D.; Tong, R.; Tang, M. Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recognit. 2019, 85, 161–171. [Google Scholar] [CrossRef]
- Yang, J.; Xiao, L.; Zhao, Y.Q.; Chan, J.C.W. Unsupervised deep tensor network for hyperspectral–multispectral image fusion. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 13017–13031. [Google Scholar] [CrossRef]
- Tu, Z.; Lin, C.; Zhao, W.; Li, C.; Tang, J. M5L: Multi-modal multi-margin metric learning for RGBT tracking. IEEE Trans. Image Process. 2021, 31, 85–98. [Google Scholar] [CrossRef]
- Zhou, T.; Fu, H.; Chen, G.; Zhou, Y.; Fan, D.P.; Shao, L. Specificity-preserving rgb-d saliency detection. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4681–4691. [Google Scholar]
- Bai, X.; Hu, Z.; Zhu, X.; Huang, Q.; Chen, Y.; Fu, H.; Tai, C.L. Transfusion: Robust lidar-camera fusion for 3d object detection with transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1090–1099. [Google Scholar]
- Cao, Y.; Luo, X.; Yang, J.; Cao, Y.; Yang, M.Y. Locality guided cross-modal feature aggregation and pixel-level fusion for multispectral pedestrian detection. Inf. Fusion 2022, 88, 1–11. [Google Scholar] [CrossRef]
- Yang, J.; Lin, T.; Chen, X.; Xiao, L. Multiple Deep Proximal Learning for Hyperspectral-Multispectral Image Fusion. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5525814. [Google Scholar] [CrossRef]
- Kim, J.U.; Park, S.; Ro, Y.M. Towards Versatile Pedestrian Detector with Multisensory-Matching and Multispectral Recalling Memory. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 22), Online, 22 February–1 March 2022; Association for the Advancement of Artificial Intelligence: Washington, DC, USA, 2022. [Google Scholar]
- Li, X.; Ding, M.; Pižurica, A. Deep feature fusion via two-stream convolutional neural network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 58, 2615–2629. [Google Scholar] [CrossRef]
- Tu, Z.; Ma, Y.; Li, Z.; Li, C.; Xu, J.; Liu, Y. RGBT salient object detection: A large-scale dataset and benchmark. IEEE Trans. Multimed. 2023, 25, 4163–4176. [Google Scholar] [CrossRef]
- Tang, Z.; Xu, T.; Wu, X.J. A Survey for Deep RGBT Tracking. arXiv 2022, arXiv:2201.09296. [Google Scholar] [CrossRef]
- Liu, L.; Chen, J.; Wu, H.; Li, G.; Li, C.; Lin, L. Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4823–4833. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pvt v2: Improved baselines with pyramid vision transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
- Wang, Y.; Wei, X.; Tang, X.; Shen, H.; Zhang, H. Adaptive fusion CNN features for rgbt object tracking. IEEE Trans. Intell. Transp. Syst. 2022, 23, 7831–7840. [Google Scholar] [CrossRef]
- Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2736–2746. [Google Scholar]
- Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 677–694. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-based RGB-Infrared Cross-Modality Vehicle Detection via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
- Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 3496–3504. [Google Scholar]
- Zhang, H.; Fromont, E.; Lefevre, S.; Avignon, B. Multispectral fusion for object detection with cyclic fuse-and-refine blocks. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Online, 25–28 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 276–280. [Google Scholar]
- Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. Piou loss: Towards accurate oriented object detection in complex environments. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part V 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 195–211. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
- Zhang, H.; Fromont, E.; Lefèvre, S.; Avignon, B. Guided attentive feature fusion for multispectral pedestrian detection. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 72–80. [Google Scholar]
- Cao, Y.; Bin, J.; Hamari, J.; Blasch, E.; Liu, Z. Multimodal object detection by channel switching and spatial attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 403–411. [Google Scholar]
- Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K.V.; Joulin, A.; Misra, I. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 15180–15190. [Google Scholar]
- Chen, Y.T.; Shi, J.; Ye, Z.; Mertz, C.; Ramanan, D.; Kong, S. Multimodal Object Detection via Probabilistic Ensembling. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 139–158. [Google Scholar]
- Liu, J.; Zhang, S.; Wang, S.; Metaxas, D.N. Multispectral deep neural networks for pedestrian detection. arXiv 2016, arXiv:1611.02644. [Google Scholar] [CrossRef]
- He, X.; Tang, C.; Zou, X.; Zhang, W. Multispectral Object Detection via Cross-Modal Conflict-Aware Learning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1465–1474. [Google Scholar]
- Wang, A.; Wang, H.; Huang, Z.; Zhao, B.; Li, W. Directional Alignment Instance Knowledge Distillation for Arbitrary-Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5618914. [Google Scholar] [CrossRef]
**DroneVehicle val.**

| Detectors | Backbone | Type | Car | Truck | Freight Car | Bus | Van | mAP@0.5 | Modality |
|---|---|---|---|---|---|---|---|---|---|
| RetinaNet(OBB) [46] | ResNet50 | one-stage | 78.45 | 34.39 | 24.14 | 69.75 | 28.82 | 47.11 | RGB |
| Faster R-CNN(OBB) [3] | ResNet50 | two-stage | 79.69 | 41.99 | 33.99 | 76.94 | 37.68 | 54.06 | |
| RoITransformer [22] | ResNet50 | two-stage | 61.55 | 55.05 | 42.26 | 85.48 | 44.84 | 61.55 | |
| S2ANet [25] | ResNet50 | one-stage | 79.86 | 50.02 | 36.21 | 82.77 | 37.52 | 57.28 | |
| Oriented R-CNN [23] | ResNet50 | two-stage | 80.26 | 55.39 | 42.12 | 86.84 | 46.92 | 62.30 | |
| YOLOv5(OBB) [49] | DarkNet53 | one-stage | 89.15 | 59.72 | 42.95 | 78.75 | 43.88 | 62.89 | |
| DETR (OBB) [52] | Transformer | one-stage | 87.31 | 60.20 | 43.97 | 79.02 | 44.59 | 63.00 | |
| RetinaNet(OBB) [46] | ResNet50 | one-stage | 88.81 | 35.43 | 39.47 | 76.45 | 32.12 | 54.45 | IR |
| Faster R-CNN(OBB) [3] | ResNet50 | two-stage | 89.68 | 40.95 | 43.10 | 86.32 | 41.21 | 60.27 | |
| RoITransformer [22] | ResNet50 | two-stage | 89.64 | 50.98 | 53.42 | 88.86 | 44.47 | 65.47 | |
| S2ANet [25] | ResNet50 | one-stage | 89.71 | 51.03 | 50.27 | 88.97 | 44.03 | 64.80 | |
| Oriented R-CNN [23] | ResNet50 | two-stage | 89.63 | 53.92 | 53.86 | 89.15 | 40.95 | 65.50 | |
| YOLOv5(OBB) [49] | DarkNet53 | one-stage | 89.29 | 59.89 | 49.51 | 86.72 | 45.06 | 65.49 | |
| DETR (OBB) [52] | Transformer | one-stage | 88.76 | 61.13 | 50.75 | 87.64 | 45.39 | 66.73 | |
| Halfway Fusion (OBB) [57] | ResNet50 | two-stage | 89.85 | 60.34 | 55.51 | 88.97 | 46.28 | 68.19 | RGB-IR |
| CIAN(OBB) [19] | ResNet50 | one-stage | 89.98 | 62.47 | 60.22 | 88.90 | 49.59 | 70.23 | |
| AR-CNN (OBB) [18] | ResNet50 | two-stage | 90.08 | 64.82 | 62.12 | 89.38 | 51.51 | 71.58 | |
| TSFADet [15] | ResNet50 | two-stage | 89.88 | 67.87 | 63.74 | 89.81 | 53.99 | 73.06 | |
| Cascade-TSFADet [15] | ResNet50 | two-stage | 90.01 | 69.15 | 65.45 | 89.70 | 55.19 | 73.90 | |
| CALNet [58] | ResNet50 | one-stage | 89.07 | 69.69 | 67.27 | 89.74 | 52.20 | 73.59 | |
| DAIK [59] | Efficient-rep | one-stage | 90.19 | 71.60 | 57.41 | 89.89 | 50.20 | 71.70 | |
| C2Former [27] | ResNet50 | one-stage | 90.20 | 68.30 | 64.40 | 89.80 | 58.50 | 74.20 | |
| LCMA-ResNet (199.5 M) | ResNet50 | one-stage | 90.24 | 73.93 | 70.23 | 90.04 | 57.31 | 76.35 | RGB-IR |
| LCMA-tiny (82.8 M) | DarkNet53 | one-stage | 90.09 | 67.89 | 64.27 | 89.69 | 57.51 | 73.89 | |
| LCMA-small (89.9 M) | DarkNet53 | one-stage | 90.12 | 70.26 | 66.41 | 89.49 | 58.30 | 74.91 | |
| LCMA-base (261.0 M) | DarkNet53 | one-stage | 90.31 | 76.83 | 70.91 | 89.89 | 63.45 | 78.28 | |

**DroneVehicle test.**

| Detectors | Backbone | Type | Car | Truck | Freight Car | Bus | Van | mAP@0.5 | Modality |
|---|---|---|---|---|---|---|---|---|---|
| RetinaNet (OBB) [46] | ResNet50 | one-stage | 67.50 | 28.24 | 13.72 | 62.05 | 19.26 | 38.10 | RGB |
| Faster R-CNN (OBB) [3] | ResNet50 | two-stage | 67.88 | 38.59 | 26.31 | 66.98 | 23.2 | 44.59 | |
| RoITransformer [22] | ResNet50 | two-stage | 68.13 | 44.17 | 29.08 | 70.55 | 27.64 | 47.91 | |
| ReDet [4] | ResNet50 | two-stage | 69.48 | 47.87 | 31.46 | 77.37 | 29.03 | 51.04 | |
| Gliding Vertex [21] | ResNet50 | two-stage | 75.77 | 46.08 | 33.75 | 68.05 | 38.72 | 52.48 | |
| YOLOv5 (OBB) [49] | DarkNet53 | one-stage | 76.24 | 48.91 | 35.47 | 68.91 | 40.37 | 53.98 | |
| DETR (OBB) [52] | Transformer | one-stage | 77.01 | 50.32 | 37.95 | 69.34 | 42.82 | 55.48 | |
| RetinaNet (OBB) [46] | ResNet50 | one-stage | 79.86 | 32.84 | 28.05 | 67.32 | 16.44 | 44.90 | IR |
| Faster R-CNN (OBB) [3] | ResNet50 | two-stage | 88.63 | 42.51 | 35.16 | 77.92 | 28.52 | 54.55 | |
| RoITransformer [22] | ResNet50 | two-stage | 88.85 | 51.53 | 41.49 | 79.48 | 34.39 | 59.15 | |
| ReDet [4] | ResNet50 | two-stage | 89.47 | 53.95 | 42.82 | 79.89 | 36.56 | 60.54 | |
| Gliding Vertex [21] | ResNet50 | two-stage | 89.15 | 59.72 | 42.95 | 78.75 | 43.88 | 62.89 | |
| YOLOv5 (OBB) [49] | DarkNet53 | one-stage | 89.29 | 59.89 | 49.50 | 86.72 | 42.06 | 65.49 | |
| DETR (OBB) [52] | Transformer | one-stage | 89.87 | 60.50 | 51.71 | 87.39 | 44.82 | 66.86 | |
| UA-CMDet [48] | ResNet50 | two-stage | 87.51 | 60.70 | 46.80 | 87.08 | 37.95 | 64.01 | RGB-IR |
| TSFADet [15] | ResNet50 | two-stage | 89.23 | 72.00 | 54.15 | 88.08 | 48.77 | 70.44 | |
| CALNet [58] | ResNet50 | one-stage | 88.98 | 68.95 | 67.61 | 89.86 | 51.84 | 73.45 | |
| LCMA-ResNet50 | ResNet50 | one-stage | 90.18 | 72.14 | 58.66 | 88.84 | 56.60 | 73.29 | RGB-IR |
| LCMA-base | DarkNet53 | one-stage | 90.35 | 78.45 | 65.29 | 89.30 | 62.53 | 77.19 | |

| Method | Backbone | Type | mAP@.5:.95 |
|---|---|---|---|
| Faster R-CNN [3] | ResNet50 | two-stage | 56.9 |
| Cascade R-CNN [3] | ResNet50 | two-stage | 59.6 |
| DCMNet [14] | ResNet50 | two-stage | 61.5 |
| RetinaNet [46] | ResNet50 | one-stage | 56.9 |
| CSSA [54] | ResNet50 | one-stage | 56.9 |
| ImageBind [55] | ResNet50 | two-stage | 63.4 |
| LCMA-ResNet | ResNet50 | one-stage | 63.3 |
| LCMA-base (Ours) | DarkNet53 | one-stage | 64.0 |

| Method | mAP@0.5 | mAP@.5:.95 | Modality |
|---|---|---|---|
| YOLOv5 | 73.9 | 39.5 | IR |
| YOLOv5 | 67.8 | 31.8 | RGB |
| Faster R-CNN [3] | 73.4 | 37.9 | IR |
| Faster R-CNN [3] | 65.0 | 30.2 | RGB |
| Halfway Fusion [57] | 71.5 | 35.8 | RGB-IR |
| GAFF [53] | 71.5 | 35.8 | RGB-IR |
| ProbEn [56] | 75.5 | 37.9 | RGB-IR |
| CSSA [54] | 79.2 | 41.3 | RGB-IR |
| LCMA-base | 80.5 | 42.2 | RGB-IR |

| Method | Backbone | Type | FPS | mAP@0.5 |
|---|---|---|---|---|
| DETR | Transformer | one-stage | 10.6 | 66.13 |
| Halfway Fusion | ResNet50 | two-stage | 20.4 | 68.19 |
| CIAN(OBB) | ResNet50 | one-stage | 21.7 | 70.23 |
| AR-CNN(OBB) | ResNet50 | two-stage | 18.2 | 71.58 |
| TSFADet | ResNet50 | two-stage | 18.6 | 73.06 |
| LCMA-ResNet | ResNet50 | one-stage | 20.0 | 76.35 |
| LCMA-tiny | DarkNet53 | one-stage | 43.9 | 73.89 |
| LCMA-small | DarkNet53 | one-stage | 28.5 | 74.91 |
| LCMA-base | DarkNet53 | one-stage | 14.0 | 78.28 |

| Method | Memory | ISRA | R | FPS | mAP@0.5 |
|---|---|---|---|---|---|
| LCMA-ResNet | 199.5 M | (8, 8, 8) | (4, 2, 1) | 20.0 | 76.35 |
| LCMA-tiny | 82.8 M | (2, 2, 2) | (4, 2, 1) | 43.9 | 73.89 |
| LCMA-small | 89.9 M | (3, 6, 3) | (4, 2, 1) | 28.5 | 74.91 |
| LCMA-base | 261.0 M | (8, 8, 8) | (4, 2, 1) | 14.0 | 78.28 |

| Method | SRA | ISRA | GCF | mAP@0.5 |
|---|---|---|---|---|
| Baseline | − | − | − | 69.83 |
| Baseline+GCF | − | − | w/ | 71.58 |
| Baseline+SRA | w/ | − | − | 74.93 |
| LCMA w/o GCF | − | w/ | − | 76.01 |
| LCMA w/o ISRA | w/ | − | w/ | 76.66 |
| LCMA (Ours) | − | w/ | w/ | 78.28 |

| Method | Backbone | mAP@0.5 |
|---|---|---|
| LCMA-base | PVT | 76.66 |
| LCMA-base | PVTv2 | 76.53 |
| LCMA-base | Swin-T | 77.36 |
| LCMA-base | DarkNet53 | 78.28 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
He, X.; Yang, T.; Yan, T.; Li, H.; Ge, Y.; Ren, Z.; Liu, Z.; Jiang, J.; Tang, C. Efficient Layer-Wise Cross-View Calibration and Aggregation for Multispectral Object Detection. Electronics 2026, 15, 498. https://doi.org/10.3390/electronics15030498