Efficient Image-Only Inference for Multimodal Crop Disease Recognition via Modal Dropout and Adaptive Multi-Task Loss Learning

Qiu, Jianlin; Gao, Depeng; Chen, Shuxi; Liu, Wenjie

doi:10.3390/s26134052


Open Access Article


¹ School of Yonyou Digital Intelligence, Nantong Institute of Technology, Nantong 226002, China

² School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China

* Author to whom correspondence should be addressed.

Sensors 2026, 26(13), 4052; https://doi.org/10.3390/s26134052 (registering DOI)

Submission received: 13 May 2026 / Revised: 23 June 2026 / Accepted: 24 June 2026 / Published: 25 June 2026

(This article belongs to the Special Issue Computer Vision and Pattern Recognition for Advanced Smart Agriculture Solutions—Second Edition)


Abstract

Crop leaf diseases cause 10–40% annual yield losses, yet timely field diagnosis remains difficult. Vision-language models (VLMs) improve recognition accuracy by adding rich textual descriptions, but multimodal pipelines are too slow for real-time field use because they require text processing at inference. We present MTL-AWL, a framework built on a training–inference asymmetry: VLM text serves as privileged training-time supervision, and two coupled mechanisms, one that retains VLM semantics in the image encoder and one that exploits them, enable image-only deployment at close to multimodal accuracy. A modal-dropout strategy (p = 0.6) intermittently masks the VLM text sequence during training, forcing the image encoder to retain cross-modal representations on its own. An adaptive multi-task loss jointly optimizes InfoNCE contrastive alignment, attention diversity, and modality consistency under learnable softmax weights. These weights consistently converge to a dominant contrastive weight (55% on soybean, 68% on PlantDoc), identifying cross-modal alignment as the primary mechanism of VLM knowledge transfer. At inference, the image-only model reaches 818 FPS (3.7× faster than multimodal methods). It attains 99.30%/98.89% (multimodal/image-only) accuracy on soybean, a cost of only 0.41 percentage points, and 72.65%/68.80% on PlantDoc, making it compact enough for real-time, offline field screening.
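The modal-dropout idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that p = 0.6 is the per-step probability of masking the text, that masking means zeroing the whole text sequence, and that embeddings are plain lists of floats. The name `modal_dropout` and its parameters are hypothetical.

```python
import random


def modal_dropout(text_tokens, p=0.6, training=True, rng=random):
    """Mask the whole text sequence with probability p during training.

    Assumption: masking is modelled as replacing every text token
    embedding with zeros. Returns (tokens, was_masked).
    """
    if training and rng.random() < p:
        masked = [[0.0] * len(token) for token in text_tokens]
        return masked, True
    # At inference (training=False) the input is never masked here;
    # an image-only deployment would simply not supply text at all.
    return [list(token) for token in text_tokens], False


# Toy text sequence: two tokens with two-dimensional embeddings.
text = [[0.5, -1.0], [2.0, 0.25]]

# Over many simulated training steps, roughly a fraction p is masked.
rng = random.Random(0)
steps = 10_000
masked_steps = sum(modal_dropout(text, p=0.6, rng=rng)[1] for _ in range(steps))
print(f"fraction of masked steps: {masked_steps / steps:.2f}")
```

On masked steps the downstream model sees only the image branch, which is what pushes the image encoder to carry the cross-modal information by itself.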

Keywords: crop leaf disease recognition; modal dropout; multi-task loss function; adaptive weight learning
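The adaptive weighting named in the abstract and keywords can be sketched as a softmax over one learnable logit per loss term. This is a hedged illustration only: the function names, the loss values, and the logits are made up, and the paper may combine or regularize the terms differently. In a real training loop the logits would be parameters updated by the optimizer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_losses(losses, logits):
    """Weight each loss term by softmax(logits) and sum the result.

    Returns (total_loss, weights); the weights are positive and sum to 1.
    """
    weights = softmax(logits)
    total = sum(w * l for w, l in zip(weights, losses))
    return total, weights


# Placeholder loss values for the three terms named in the abstract.
losses = {"contrastive": 1.20, "diversity": 0.40, "consistency": 0.70}
values = list(losses.values())

# Zero logits give equal weights, a natural starting point.
uniform_total, uniform_w = combine_losses(values, [0.0, 0.0, 0.0])

# A larger logit on the first term makes the contrastive weight dominant.
skewed_total, skewed_w = combine_losses(values, [1.0, 0.0, 0.0])

for name, w in zip(losses, skewed_w):
    print(f"{name}: weight {w:.3f}")
```

Because the weights are tied together by the softmax, raising one term's share necessarily lowers the others, so the learned weights can be read as relative importance, as in the reported 55% and 68% contrastive shares.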
