Open Access Article
Efficient Image-Only Inference for Multimodal Crop Disease Recognition via Modal Dropout and Adaptive Multi-Task Loss Learning
by Jianlin Qiu 1, Depeng Gao 1, Shuxi Chen 1 and Wenjie Liu 2,*
1 School of Yonyou Digital Intelligence, Nantong Institute of Technology, Nantong 226002, China
2 School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(13), 4052; https://doi.org/10.3390/s26134052 (registering DOI)
Submission received: 13 May 2026 / Revised: 23 June 2026 / Accepted: 24 June 2026 / Published: 25 June 2026
Abstract
Crop leaf diseases cause 10–40% annual yield losses, yet timely field diagnosis remains difficult. Vision-language models (VLMs) lift recognition accuracy with rich textual descriptions, but multimodal pipelines are too slow for real-time field use because they require text processing at inference. We present MTL-AWL, a framework built on a training–inference asymmetry: VLM text serves as privileged training-time supervision, and two coupled mechanisms (one retaining VLM semantics in the image encoder, one exploiting them) enable image-only deployment at close to multimodal accuracy. A modal-dropout strategy intermittently masks the VLM text sequence during training, forcing the image encoder to retain cross-modal representations independently. An adaptive multi-task loss jointly optimizes InfoNCE contrastive alignment, attention diversity, and modality consistency under learnable softmax weights. These weights consistently converge to a dominant contrastive weight (55% on soybean, 68% on PlantDoc), identifying cross-modal alignment as the primary mechanism of VLM knowledge transfer. At inference, the image-only model reaches 818 FPS (3.7× faster than multimodal methods) at a cost of only 0.41 percentage points of accuracy on soybean. It attains 99.30%/98.89% (multimodal/image-only) on soybean and 72.65%/68.80% on PlantDoc, and is compact enough for real-time, offline field screening.
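The two mechanisms named in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the `None` mask placeholder, the example loss values, and the drop probability `p_drop` are all assumptions made for illustration (the abstract does not state the drop probability or how the weights are parameterized or regularized). The sketch only shows the general idea of masking the text sequence at random during training and combining several task losses under softmax-normalized weights.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_total_loss(losses, logits):
    """Combine task losses with softmax-normalized weights.

    `logits` stands in for the learnable weight parameters; in a real
    training loop they would be updated by the optimizer (assumption).
    Returns (weighted loss, weights).
    """
    weights = softmax(logits)
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights


def modal_dropout(text_tokens, p_drop, rng):
    """With probability p_drop, mask the entire text sequence.

    `None` is a hypothetical mask placeholder; `p_drop` is a placeholder
    value, not a setting taken from the paper.
    """
    if rng.random() < p_drop:
        return [None] * len(text_tokens)
    return list(text_tokens)


# Illustrative values only: contrastive, attention-diversity, consistency.
example_losses = [0.9, 0.3, 0.2]
example_logits = [1.0, 0.0, 0.0]
total, weights = adaptive_total_loss(example_losses, example_logits)

rng = random.Random(0)
maybe_masked = modal_dropout(["leaf", "with", "brown", "spots"], 0.5, rng)
```

Because the weights pass through a softmax, they stay positive and sum to one, which is what makes a statement such as "a contrastive weight of 55%" meaningful as a share of the total objective.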