RT-DETR-MCDAF: Multimodal Fusion of Visible Light and Near-Infrared Images for Citrus Surface Defect Detection in the Compound Domain
Abstract
1. Introduction
- This study proposed applying an RGB-NIR multimodal fusion method to citrus defect detection and developed a multimodal dataset acquired with coaxial RGB and NIR imaging.
- This study developed a compound-domain channel-fusion model, the Real-Time DEtection TRansformer with Multimodal Compound Domain Attention Fusion (RT-DETR-MCDAF), based on vision transformers. Its MCDAF module extracts and fuses features from RGB and NIR images in both the time and frequency domains.
- This study conducted extensive comparative tests of the RT-DETR-MCDAF model to validate the effectiveness and superiority of the proposed multimodal approach.
2. Materials
2.1. Image Acquisition
2.2. Image Annotation
2.3. Data Augmentation Methods
3. Methods
3.1. Overview of Multimodal Models
3.2. MCDAF Fusion Structure
Algorithm 1: Multimodal Compound Domain Attention Fusion (MCDAF)
Input: Multi-modal image input X with NIR and RGB channels; number of input channels c1; number of feature channels c2
Output: Fused feature representation of X
1. Initialize NIR_channels ← c1/4
2. Initialize RGB_channels ← c1 − NIR_channels
3. Initialize scale_factor ← (c2/c1)/2
4. NIR_features1 ← PWConv(NIR_channels, NIR_channels × scale_factor, kernel = 1)
5. NIR_features1 ← BatchNorm(NIR_features1)
6. NIR_features1 ← ReLU(NIR_features1)
7. NIR_features1 ← MaxPool(NIR_features1, kernel = 2, stride = 2)
8. NIR_features2 ← Conv(NIR_channels, NIR_channels × scale_factor, kernel = 3)
9. NIR_features2 ← FFN(NIR_features2)
10. NIR_features ← NIR_features1 + NIR_features2
11. RGB_features1 ← PWConv(RGB_channels, RGB_channels × scale_factor, kernel = 1)
12. RGB_features1 ← BatchNorm(RGB_features1)
13. RGB_features1 ← ReLU(RGB_features1)
14. RGB_features1 ← AvgPool(RGB_features1, kernel = 2, stride = 2)
15. RGB_features2 ← Conv(RGB_channels, RGB_channels × scale_factor, kernel = 3)
16. RGB_features2 ← FFN(RGB_features2)
17. RGB_features ← RGB_features1 + RGB_features2
18. FFT_features ← Conv(X, c1, kernel = 3, stride = 2)
19. FFT_features ← FFT(FFT_features)
20. FFT_features ← SCA(FFT_features)
21. FFT_features ← InverseFFT(FFT_features)
22. FFT_features ← Conv(FFT_features, c1 × scale_factor, kernel = 3, stride = 2)
23. Concatenated_features ← Concat(RGB_features, NIR_features)
24. Fusion_features ← Conv(Concatenated_features, c1 × scale_factor, kernel = 3, stride = 2)
25. SAB_features ← SAB(Fusion_features)
26. Enhanced_features ← SAB_features × Fusion_features
27. Output ← Concat(FFT_features, Enhanced_features)
28. if Output is valid then
29.     return Output
30. else
31.     return ErrorMessage("Fusion Failed")
32. end if
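A minimal PyTorch sketch of the fusion step described by Algorithm 1 is given below. It is an illustrative interpretation, not the authors' implementation: the class name MCDAFSketch is hypothetical, the internals of SCA and SAB are not specified above (simple channel-gating and spatial-gating stand-ins are used), the FFN is modeled as two pointwise convolutions, the validity check in steps 28-31 is omitted, and the stride/padding choices on the 3×3 convolutions are assumptions made only to keep the branches spatially aligned before concatenation.

```python
# Illustrative sketch of Algorithm 1; layer choices marked as assumptions are not from the paper.
import torch
import torch.nn as nn


class MCDAFSketch(nn.Module):
    def __init__(self, c1: int, c2: int):
        super().__init__()
        nir_c = c1 // 4                 # step 1: NIR_channels <- c1 / 4
        rgb_c = c1 - nir_c              # step 2: RGB_channels <- c1 - NIR_channels
        s = max((c2 // c1) // 2, 1)     # step 3: scale_factor <- (c2 / c1) / 2

        # NIR branch (steps 4-10): pointwise conv + BN + ReLU + max pooling, plus a
        # 3x3 conv + FFN path (stride 2 is an assumption to match the pooled size).
        self.nir_pw = nn.Sequential(
            nn.Conv2d(nir_c, nir_c * s, 1), nn.BatchNorm2d(nir_c * s),
            nn.ReLU(), nn.MaxPool2d(2, 2))
        self.nir_conv = nn.Conv2d(nir_c, nir_c * s, 3, stride=2, padding=1)
        self.nir_ffn = nn.Sequential(
            nn.Conv2d(nir_c * s, nir_c * s, 1), nn.GELU(), nn.Conv2d(nir_c * s, nir_c * s, 1))

        # RGB branch (steps 11-17): same layout but with average pooling.
        self.rgb_pw = nn.Sequential(
            nn.Conv2d(rgb_c, rgb_c * s, 1), nn.BatchNorm2d(rgb_c * s),
            nn.ReLU(), nn.AvgPool2d(2, 2))
        self.rgb_conv = nn.Conv2d(rgb_c, rgb_c * s, 3, stride=2, padding=1)
        self.rgb_ffn = nn.Sequential(
            nn.Conv2d(rgb_c * s, rgb_c * s, 1), nn.GELU(), nn.Conv2d(rgb_c * s, rgb_c * s, 1))

        # Frequency-domain path (steps 18-22); SCA is a plain channel-gating stand-in
        # applied to spectrum magnitudes (assumption).
        self.fft_in = nn.Conv2d(c1, c1, 3, stride=2, padding=1)
        self.sca = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c1, c1, 1), nn.Sigmoid())
        self.fft_out = nn.Conv2d(c1, c1 * s, 3, stride=2, padding=1)

        # Time-domain fusion + spatial attention (steps 23-26); SAB is replaced by a
        # single-channel spatial gate (assumption).
        self.fuse = nn.Conv2d(c1 * s, c1 * s, 3, stride=2, padding=1)
        self.sab = nn.Sequential(nn.Conv2d(c1 * s, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        nir_c = x.shape[1] // 4
        x_nir, x_rgb = x[:, :nir_c], x[:, nir_c:]             # split NIR / RGB channels

        nir = self.nir_pw(x_nir) + self.nir_ffn(self.nir_conv(x_nir))   # steps 4-10
        rgb = self.rgb_pw(x_rgb) + self.rgb_ffn(self.rgb_conv(x_rgb))   # steps 11-17

        f = self.fft_in(x)                                     # step 18
        h, w = f.shape[-2:]
        spec = torch.fft.rfft2(f)                              # step 19: FFT
        spec = self.sca(spec.abs()) * spec                     # step 20: channel gating
        f = self.fft_out(torch.fft.irfft2(spec, s=(h, w)))     # steps 21-22

        fused = self.fuse(torch.cat([rgb, nir], dim=1))        # steps 23-24
        enhanced = self.sab(fused) * fused                     # steps 25-26
        return torch.cat([f, enhanced], dim=1)                 # step 27
```

Under these assumptions, a 4-channel input (one NIR plus three RGB channels stacked along the channel axis) with c2 = 64, e.g. `MCDAFSketch(4, 64)(torch.randn(2, 4, 256, 256))`, yields a tensor of shape (2, 64, 64, 64); the module then behaves as a stride-4 fusion stem whose output can feed the detector backbone.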
3.2.1. MCDAF Frequency Domain Direction
3.2.2. MCDAF Time Domain Direction
3.3. Experimental Environment and Hyperparameter Configuration
3.4. Evaluation Metrics
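The tables in the following sections report Precision (P), Recall (R), F1, mAP@0.5, and mAP@0.5:0.95. As a reading aid, a small sketch of the standard definitions of these metrics is given below; it reflects common practice for object detection and is not the paper's evaluation code.

```python
# Common definitions of the reported metrics (a reading aid, not the authors' code).
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    p = tp / (tp + fp) if (tp + fp) else 0.0        # fraction of detections that are correct
    r = tp / (tp + fn) if (tp + fn) else 0.0        # fraction of ground truths that are found
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0    # harmonic mean of precision and recall
    return p, r, f1


def mean_average_precision(ap: list[list[float]]) -> tuple[float, float]:
    """ap[i][c] = AP of class c at IoU threshold 0.50 + 0.05 * i (i = 0..9)."""
    per_iou = [sum(row) / len(row) for row in ap]   # mAP at each IoU threshold
    return per_iou[0], sum(per_iou) / len(per_iou)  # (mAP@0.5, mAP@0.5:0.95)
```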
4. Results and Analysis
4.1. Analysis of Multimodal Model Training Results
4.2. Ablation Experiment
4.3. Comparison Experiment of Fusion Module Attention Mechanisms
4.4. Comparison with Other Fusion Models
4.5. Analysis of Comparison with Single-Modal Models
5. Discussion
6. Conclusions
- A deep learning-based multimodal method was proposed to fuse RGB and NIR images to detect external defects in citrus fruits, leveraging the unique characteristics of each modality to enhance defect detection.
- The first coaxial imaging multimodal dataset for citrus external defect detection, comprising RGB and NIR images, was developed to cover defect types such as cankers, pests, melanoses, and cracks.
- A multimodal fusion model, RT-DETR-MCDAF, was developed based on the advanced ViT-based object detection model RT-DETR. It integrated RGB and NIR data through the multimodal module MCDAF, which fused RGB and NIR images in the channel dimension, leveraging both time- and frequency-domain spaces. Meanwhile, a traditional channel-fusion model, RT-DETR-RGB&IR, was developed as the baseline. Experiments demonstrated that RT-DETR-MCDAF achieved optimal detection performance, with mAP@0.5 reaching 0.937 and mAP@0.5:0.95 reaching 0.598. Each module within MCDAF significantly contributed to enhancing the multimodal fusion performance.
- Traditional channel-fusion models built upon advanced single-modal object detection frameworks were developed and compared with RT-DETR-MCDAF. The results showed that RT-DETR-MCDAF outperformed numerous models, attaining the highest mAP@0.5 and mAP@0.5:0.95 scores and achieving the highest AP@0.5 scores for the vast majority of categories in the dataset. Visual analysis of the model’s detection effect indicated that RT-DETR-MCDAF exhibited excellent generalization capabilities in complex scenarios involving defects of varying sizes, shapes, and background color interference.
- Comparative experiments were conducted between the proposed spatial attention mechanism SAB and existing attention mechanisms. The results indicated that SAB achieved the best performance.
- The RGB detection performance of RT-DETR-MCDAF was compared with that of current advanced single-modal detection models. The results showed that RT-DETR-MCDAF achieved the best balance of Precision, Recall, and F1, as well as the highest mAP@0.5 and mAP@0.5:0.95 scores.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chen, H.; Xuan, Z.; Yang, L.; Zhang, S.; Cao, M. Managing virus diseases in citrus: Leveraging high-throughput sequencing for versatile applications. Hortic. Plant J. 2024, 11, 57–68.
- Feng, J.; Wang, Z.; Wang, S.; Tian, S.; Xu, H. MSDD-YOLOX: An enhanced YOLOX for real-time surface defect detection of oranges by type. Eur. J. Agron. 2023, 149, 126918.
- Lopez, J.J.; Cobos, M.; Aguilera, E. Computer-based detection and classification of flaws in citrus fruits. Neural Comput. Appl. 2011, 20, 975–981.
- Bhargava, A.; Barisal, A. Automatic Detection and Grading of Multiple Fruits by Machine Learning. Food Anal. Methods 2020, 13, 751–761.
- López-García, F.; Andreu-García, G.; Blasco, J.; Aleixos, N.; Valiente, J.-M. Automatic detection of skin defects in citrus fruits using a multivariate image analysis approach. Comput. Electron. Agric. 2010, 71, 189–197.
- Wu, Y.; Chen, J.; Wu, S.; Li, H.; He, L.; Zhao, R.; Wu, C. An improved YOLOv7 network using RGB-D multi-modal feature fusion for tea shoots detection. Comput. Electron. Agric. 2024, 216, 108541.
- Cai, X.; Zhu, Y.; Liu, S.; Yu, Z.; Xu, Y. FastSegFormer: A knowledge distillation-based method for real-time semantic segmentation of surface defects in navel oranges. Comput. Electron. Agric. 2024, 217, 108604.
- Xu, X.; Xu, T.; Li, Z.; Huang, X.; Zhu, Y.; Rao, X. SPMUNet: Semantic segmentation of citrus surface defects driven by superpixel feature. Comput. Electron. Agric. 2024, 224, 109182.
- Liu, D.; Parmiggiani, A.; Psota, E.; Fitzgerald, R.; Norton, T. Where’s your head at? Detecting the orientation and position of pigs with rotated bounding boxes. Comput. Electron. Agric. 2023, 212, 108099.
- Chen, Y.; An, X.; Gao, S.; Li, S.; Kang, H. A deep learning-based vision system combining detection and tracking for fast on-line citrus sorting. Front. Plant Sci. 2021, 12, 622062.
- Da Costa, A.Z.; Figueroa, H.E.; Fracarolli, J.A. Computer vision based detection of external defects on tomatoes using deep learning. Biosyst. Eng. 2020, 190, 131–144.
- Hu, W.; Xiong, J.; Liang, J.; Xie, Z.; Liu, Z.; Huang, Q.; Yang, Z. A method of citrus epidermis defects detection based on an improved YOLOv5. Biosyst. Eng. 2023, 227, 19–35.
- Jia, X.; Zhao, C.; Zhou, J.; Wang, Q.; Liang, X.; He, X.; Huang, W.; Zhang, C. Online detection of citrus surface defects using improved YOLOv7 modeling. Trans. Chin. Soc. Agric. Eng. 2023, 39, 142–151.
- Lu, J.; Chen, W.; Lan, Y.; Qiu, X.; Huang, J.; Luo, H. Design of citrus peel defect and fruit morphology detection method based on machine vision. Comput. Electron. Agric. 2024, 219, 108721.
- Fan, S.; Liang, X.; Huang, W.; Zhang, V.J.; Pang, Q.; He, X.; Li, L.; Zhang, C. Real-time defects detection for apple sorting using NIR cameras with pruning-based YOLOV4 network. Comput. Electron. Agric. 2022, 193, 106715.
- Zhang, B.; Fan, S.; Li, J.; Huang, W.; Zhao, C.; Qian, M.; Zheng, L. Detection of Early Rottenness on Apples by Using Hyperspectral Imaging Combined with Spectral Analysis and Image Processing. Food Anal. Methods 2015, 8, 2075–2086.
- Blasco, J.; Aleixos, N.; Gómez, J.; Moltó, E. Citrus sorting by identification of the most common defects using multispectral computer vision. J. Food Eng. 2007, 83, 384–393.
- Abdelsalam, A.M.; Sayed, M.S. Real-time defects detection system for orange citrus fruits using multi-spectral imaging. In Proceedings of the 2016 IEEE 59th International Midwest Symposium on Circuits and Systems (MWSCAS), Abu Dhabi, United Arab Emirates, 16–19 October 2016; pp. 1–4.
- Fan, X.; Ge, C.; Yang, X.; Wang, W. Cross-Modal Feature Fusion for Field Weed Mapping Using RGB and Near-Infrared Imagery. Agriculture 2024, 14, 2331.
- Liu, C.; Feng, Q.; Sun, Y.; Li, Y.; Ru, M.; Xu, L. YOLACTFusion: An instance segmentation method for RGB-NIR multimodal image fusion based on an attention mechanism. Comput. Electron. Agric. 2023, 213, 108186.
- Lu, Y.; Gong, M.; Li, J.; Ma, J. Strawberry Defect Identification Using Deep Learning Infrared–Visible Image Fusion. Agronomy 2023, 13, 2217.
- Sa, I.; Ge, Z.; Dayoub, F.; Upcroft, B.; Perez, T.; McCool, C. DeepFruits: A fruit detection system using deep neural networks. Sensors 2016, 16, 1222.
- Zhang, J.; Shen, D.; Chen, D.; Ming, D.; Ren, D.; Diao, Z. ISMSFuse: Multi-modal fusing recognition algorithm for rice bacterial blight disease adaptable in edge computing scenarios. Comput. Electron. Agric. 2024, 223, 109089.
- Wang, W. Advanced Auto Labeling Solution with Added Features. 2023. Available online: https://github.com/CVHub520/X-AnyLabeling (accessed on 21 December 2024).
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
- Vijayarekha, K.; Govindaraj, R. Citrus fruit external defect classification using wavelet packet transform features and ANN. In Proceedings of the 2006 IEEE International Conference on Industrial Technology, Mumbai, India, 15–17 December 2006; pp. 2872–2877.
- Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Why did you say that? arXiv 2016, arXiv:1611.07450.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542.
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
- Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 9167–9176.
Model | RT-DETR-RGB&IR | RT-DETR-MCDAF I | RT-DETR-MCDAF II | RT-DETR-MCDAF III | RT-DETR-MCDAF IV | RT-DETR-MCDAF |
---|---|---|---|---|---|---|
RGB&IR | √ | |||||
PM | √ | √ | √ | √ | √ | |
PA | √ | √ | √ | √ | √ | |
DS1 | √ | √ | √ | √ | ||
DS2 | √ | √ | √ | √ | ||
SCA | √ | √ | √ | √ | ||
SAB | √ | √ | √ | √ | ||
P | 0.901 | 0.889 | 0.884 | 0.909 | 0.912 | 0.914 |
R | 0.902 | 0.847 | 0.861 | 0.893 | 0.870 | 0.919 |
F1 | 0.89 | 0.88 | 0.87 | 0.89 | 0.88 | 0.90 |
mAP@0.5 | 0.922 | 0.901 | 0.891 | 0.926 | 0.923 | 0.937 |
mAP@0.5:0.95 | 0.581 | 0.562 | 0.569 | 0.586 | 0.584 | 0.598 |
Model | Param (M) | Model Size (MB)
---|---|---
RT-DETR-RGB&IR | 32.82 | 63.1
RT-DETR-MCDAF | 32 | 63.1
Model | P | R | F1 | mAP@0.5 | mAP@0.5:0.95 |
---|---|---|---|---|---|
SE | 0.901 | 0.898 | 0.88 | 0.928 | 0.588 |
CBAM | 0.880 | 0.911 | 0.87 | 0.926 | 0.587 |
ECA | 0.895 | 0.909 | 0.89 | 0.932 | 0.590 |
CA | 0.90 | 0.897 | 0.89 | 0.934 | 0.592 |
EMA | 0.916 | 0.893 | 0.90 | 0.935 | 0.595 |
Ours | 0.914 | 0.919 | 0.90 | 0.937 | 0.598 |
Model | P | R | F1 | mAP@0.5 | mAP@0.5:0.95 |
---|---|---|---|---|---|
YOLOv5l-RGB&IR | 0.905 | 0.893 | 0.88 | 0.909 | 0.568 |
YOLOv7-RGB&IR | 0.851 | 0.869 | 0.86 | 0.897 | 0.502 |
YOLOv8l-RGB&IR | 0.901 | 0.903 | 0.89 | 0.921 | 0.571 |
YOLOv10l-RGB&IR | 0.892 | 0.862 | 0.86 | 0.901 | 0.568 |
YOLOv11l-RGB&IR | 0.924 | 0.911 | 0.90 | 0.919 | 0.568 |
RT-DETRl-RGB&IR | 0.901 | 0.902 | 0.89 | 0.922 | 0.581 |
Ours | 0.914 | 0.919 | 0.90 | 0.937 | 0.598 |
Model | Canker | Pest | Melanose | Crack | Peduncle (Stem) | Navel |
---|---|---|---|---|---|---|
YOLOv5l-RGB&IR | 0.893 | 0.889 | 0.914 | 0.893 | 0.936 | 0.930 |
YOLOv7-RGB&IR | 0.898 | 0.812 | 0.906 | 0.901 | 0.950 | 0.912 |
YOLOv8l-RGB&IR | 0.904 | 0.897 | 0.909 | 0.925 | 0.962 | 0.929 |
YOLOv10l-RGB&IR | 0.861 | 0.853 | 0.914 | 0.902 | 0.942 | 0.935 |
YOLOv11l-RGB&IR | 0.895 | 0.898 | 0.916 | 0.919 | 0.953 | 0.933 |
RT-DETRl-RGB&IR | 0.910 | 0.889 | 0.920 | 0.912 | 0.968 | 0.933 |
Ours | 0.911 | 0.903 | 0.945 | 0.918 | 0.977 | 0.966
Model | P | R | F1 | mAP@0.5 | mAP@0.5:0.95
---|---|---|---|---|---
Faster R-CNN | 0.89 | 0.871 | 0.86 | 0.90 | 0.55
SSD | 0.894 | 0.883 | 0.86 | 0.89 | 0.56 |
YOLOv5l | 0.879 | 0.916 | 0.91 | 0.924 | 0.581 |
YOLOv8l | 0.910 | 0.899 | 0.90 | 0.917 | 0.579 |
YOLOv10l | 0.893 | 0.895 | 0.89 | 0.917 | 0.574 |
YOLOv11l | 0.905 | 0.912 | 0.90 | 0.915 | 0.569 |
RT-DETRl | 0.909 | 0.919 | 0.89 | 0.920 | 0.566 |
Ours | 0.914 | 0.919 | 0.90 | 0.937 | 0.598 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).