This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Efficient Image-Only Inference for Multimodal Crop Disease Recognition via Modal Dropout and Adaptive Multi-Task Loss Learning

1
School of Yonyou Digital Intelligence, Nantong Institute of Technology, Nantong 226002, China
2
School of Transportation and Civil Engineering, Nantong University, Nantong 226019, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(13), 4052; https://doi.org/10.3390/s26134052 (registering DOI)
Submission received: 13 May 2026 / Revised: 23 June 2026 / Accepted: 24 June 2026 / Published: 25 June 2026

Abstract

Crop leaf diseases cause 10–40% annual yield losses, yet timely field diagnosis remains difficult. Vision-language models (VLMs) improve recognition accuracy by adding rich textual descriptions, but multimodal pipelines are too slow for real-time field use because they require text processing at inference. We present MTL-AWL, a framework built on a training–inference asymmetry: VLM text serves as privileged training-time supervision, and two coupled mechanisms, one that retains VLM semantics in the image encoder and one that exploits them, enable image-only deployment at close to multimodal accuracy. A modal-dropout strategy (p = 0.6) intermittently masks the VLM text sequence during training, forcing the image encoder to retain cross-modal representations on its own. An adaptive multi-task loss jointly optimizes InfoNCE contrastive alignment, attention diversity, and modality consistency under learnable softmax weights. These weights consistently converge to a dominant contrastive weight (55% on soybean, 68% on PlantDoc), identifying cross-modal alignment as the primary mechanism of VLM knowledge transfer. At inference, the model reaches 818 FPS (3.7× faster than multimodal methods). Accuracy (multimodal/image-only) is 99.30%/98.89% on soybean, a cost of only 0.41 percentage points, and 72.65%/68.80% on PlantDoc, a cost of 3.85 percentage points. The model is compact enough for real-time, offline field screening.
Keywords: crop leaf disease recognition; modal dropout; multi-task loss function; adaptive weight learning
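
The two training-time mechanisms named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: that p is the probability of masking the whole text sequence, that masking means zeroing it, that InfoNCE uses cosine similarity with a temperature of 0.07, and that the three loss terms are combined as a softmax-weighted sum. All function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modal_dropout(text_features, p, rng):
    """Assumption: with probability p, zero out the whole text sequence.

    Returns the (possibly masked) features and a flag saying whether
    masking happened, so the caller can skip text-dependent loss terms.
    """
    if rng.random() < p:
        return [0.0] * len(text_features), True
    return list(text_features), False


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(image_vecs, text_vecs, temperature=0.07):
    """Image-to-text InfoNCE: the i-th image should match the i-th text."""
    losses = []
    for i, img in enumerate(image_vecs):
        logits = [cosine(img, txt) / temperature for txt in text_vecs]
        probs = softmax(logits)
        losses.append(-math.log(probs[i]))
    return sum(losses) / len(losses)


def adaptive_loss(task_losses, weight_logits):
    """Combine task losses with softmax-normalised learnable weights.

    weight_logits stand in for trainable parameters; in a real training
    loop an optimizer would update them together with the network.
    """
    weights = softmax(weight_logits)
    total = sum(w * l for w, l in zip(weights, task_losses))
    return total, weights
```

A usage sketch: call `modal_dropout(text, 0.6, random.Random(0))` once per training step, and when the text is not masked, pass the contrastive, attention-diversity, and consistency losses to `adaptive_loss` with three weight logits. Because the weights pass through a softmax, they always sum to one, which is what allows the abstract to report a single term's share (for example 55% or 68%) as a percentage.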

Share and Cite

MDPI and ACS Style

Qiu, J.; Gao, D.; Chen, S.; Liu, W. Efficient Image-Only Inference for Multimodal Crop Disease Recognition via Modal Dropout and Adaptive Multi-Task Loss Learning. Sensors 2026, 26, 4052. https://doi.org/10.3390/s26134052
