Article

Cross-Modal Data Fusion via Vision-Language Model for Crop Disease Recognition

Wenjie Liu, Guoqing Wu, Han Wang and Fuji Ren *
1 School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China
2 Department of Mechanical Engineering, Nantong Institute of Technology, Nantong 226002, China
3 The College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(13), 4096; https://doi.org/10.3390/s25134096
Submission received: 27 May 2025 / Revised: 25 June 2025 / Accepted: 29 June 2025 / Published: 30 June 2025
(This article belongs to the Section Smart Agriculture)

Abstract

Crop diseases pose a significant threat to agricultural productivity and global food security. Timely and accurate disease identification is crucial for improving crop yield and quality. While most existing deep learning-based methods focus primarily on image datasets for disease recognition, they often overlook the complementary role of textual features in enhancing visual understanding. To address this problem, we propose a cross-modal data fusion method based on a vision-language model for crop disease recognition. Our approach leverages the Zhipu.ai multimodal model to generate comprehensive textual descriptions of crop leaf diseases, including a global description, a local lesion description, and a color-texture description. These descriptions are encoded into feature vectors, while an image encoder extracts image features. A cross-attention mechanism then iteratively fuses the multimodal features across multiple layers, and a classification prediction module generates class probabilities. Extensive experiments on the Soybean Disease, AI Challenge 2018, and PlantVillage datasets demonstrate that our method outperforms state-of-the-art image-only approaches with higher accuracy and fewer parameters. Specifically, with only 1.14 M parameters, our model achieves recognition accuracies of 98.74%, 87.64%, and 99.08% on the three datasets, respectively. These results highlight the effectiveness of cross-modal learning in leveraging both visual and textual cues, offering a precise, efficient, and scalable solution for crop disease recognition.
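To make the fusion pipeline described above concrete, the sketch below illustrates the general idea in PyTorch: image tokens produced by an image encoder attend to the encoded text-description embeddings through stacked cross-attention layers, and the fused representation feeds a classification head. This is a minimal sketch of the technique named in the abstract, not the authors' implementation; all dimensions, head counts, layer counts, and names here are illustrative assumptions.

# Minimal sketch of cross-attention fusion of image and text features.
# All dimensions, layer counts, and names are assumptions for exposition;
# the paper's exact architecture is not reproduced here.
import torch
import torch.nn as nn


class CrossAttentionFusionLayer(nn.Module):
    """One fusion layer: image tokens attend to text tokens, then a feed-forward block."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Query: image tokens; key/value: text-description embeddings.
        attn_out, _ = self.cross_attn(img_tokens, txt_tokens, txt_tokens)
        x = self.norm1(img_tokens + attn_out)
        return self.norm2(x + self.ffn(x))


class CrossModalClassifier(nn.Module):
    """Iteratively fuses image features with encoded text descriptions, then classifies."""

    def __init__(self, dim: int = 256, num_layers: int = 3, num_classes: int = 10):
        super().__init__()
        self.layers = nn.ModuleList(
            CrossAttentionFusionLayer(dim) for _ in range(num_layers)
        )
        self.head = nn.Linear(dim, num_classes)  # classification prediction module

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:            # multi-layer iterative fusion
            img_tokens = layer(img_tokens, txt_tokens)
        pooled = img_tokens.mean(dim=1)      # average-pool the fused image tokens
        return self.head(pooled)             # class logits


# Toy usage: 49 image patch tokens and 3 text-description embeddings per sample
# (e.g., global, local-lesion, and color-texture descriptions).
model = CrossModalClassifier()
img = torch.randn(2, 49, 256)   # features from a hypothetical image encoder
txt = torch.randn(2, 3, 256)    # encoded textual descriptions
logits = model(img, txt)
print(logits.shape)             # torch.Size([2, 10])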
Keywords: crop disease recognition; image classification; cross-modal data fusion; vision-language model

Share and Cite

MDPI and ACS Style

Liu, W.; Wu, G.; Wang, H.; Ren, F. Cross-Modal Data Fusion via Vision-Language Model for Crop Disease Recognition. Sensors 2025, 25, 4096. https://doi.org/10.3390/s25134096


