HortiVQA-PP: Multitask Framework for Pest Segmentation and Visual Question Answering in Horticulture
Abstract
1. Introduction
- The first interactive recognition system for horticultural pest–predator identification is proposed, combining semantic segmentation with visual question answering (VQA) to build an ecologically aware visual intelligence framework.
- A comprehensive dataset, HortiVQA-PP, is constructed, containing image annotations and Q&A pairs that cover 30 pest types, 10 predator types, and their predation relationships.
- Three key modules are developed: a segmentation-aware visual encoder, a multi-label co-occurrence detection module, and a semantic alignment question-answering module, which together improve identification accuracy and response quality (a pipeline sketch follows this list).
- Ecological knowledge is integrated into an AI vision system for the first time in horticultural multimodal understanding tasks, enabling an interpretable and interactive intelligent platform for digital horticulture.
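To make the division of labor concrete, the sketch below shows one plausible way the three modules could compose into a single multitask forward pass. It is an illustrative PyTorch skeleton with hypothetical class and argument names, not the authors' released implementation.

```python
import torch.nn as nn

class HortiVQAPPPipeline(nn.Module):
    """Illustrative three-module composition; module names are hypothetical."""

    def __init__(self, visual_encoder, cooccur_head, qa_module):
        super().__init__()
        self.visual_encoder = visual_encoder  # segmentation-aware visual encoder
        self.cooccur_head = cooccur_head      # multi-label pest-predator co-occurrence head
        self.qa_module = qa_module            # knowledge-augmented QA module

    def forward(self, image, question_tokens):
        # 1) Segmentation-aware encoding: dense features plus per-class masks.
        features, masks = self.visual_encoder(image)
        # 2) Multi-label logits over pest-predator co-occurrence pairs.
        cooccur_logits = self.cooccur_head(features)
        # 3) Answer generation conditioned on vision, masks, co-occurrence, and question.
        answer_logits = self.qa_module(features, masks, cooccur_logits, question_tokens)
        return masks, cooccur_logits, answer_logits
```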
2. Related Work
2.1. Research on Pest Identification and Ecological Regulation in Horticultural Environments
2.2. The Development of Visual Question Answering (VQA) Technology
2.3. Multimodal Semantic Alignment and Knowledge Enhancement
3. Materials and Methods
3.1. Data Collection
3.2. Data Preprocessing
3.3. Proposed Method
3.3.1. Overview
3.3.2. Segmentation-Aware Visual Encoder
3.3.3. Pest–Predator Co-Occurrence Multi-Object Segmentation Network
3.3.4. Horticultural Knowledge-Augmented Question Answering Module
4. Results and Discussion
4.1. Experimental Setup and Evaluation Metrics
4.2. Performance Comparison on Pest and Predator Semantic Segmentation Task (Task 1)
4.3. Performance Comparison on Pest–Predator Co-Occurrence Matching Task (Task 2)
4.4. Performance Comparison on Visual Question Answering Task (Task 3)
4.5. Robustness Evaluation on Public Datasets
4.6. Ablation Study
4.7. Discussion
4.7.1. Impact of Core Architectural Modules on System Performance
4.7.2. Application Prospects of HortiVQA-PP in Intelligent Horticulture Management
4.7.3. Algorithm Complexity and Real-Time Deployment Analysis
4.8. Limitations and Future Work
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Lu, P.; Zheng, W.; Zhang-Zhong, L.; Lyu, X.; Shi, K.; Liu, C. Research progress and frontier hotspot of intelligent facility horticulture. Agric. Eng. 2025, 15, 58–66.
- Wang, S.; Xu, D.; Liang, H.; Bai, Y.; Li, X.; Zhou, J.; Su, C.; Wei, W. Advances in deep learning applications for plant disease and pest detection: A review. Remote Sens. 2025, 17, 698.
- Farjon, G.; Liu, H.; Edan, Y. Deep-learning-based counting methods, datasets, and applications in agriculture: A review. Precis. Agric. 2023, 24, 1683–1711.
- Tong, Y.S.; Lee, T.H.; Yen, K.S. Deep Learning for Image-Based Plant Growth Monitoring: A Review. Int. J. Eng. Technol. Innov. 2022, 12, 225–246.
- Qin, C.Y.; Yang, Y.S.; Gu, F.W.; Chen, P.Y.; Qin, W.C. Application and development of computer vision technology in modern agriculture. J. Chin. Agric. Mech. 2023, 44, 119.
- Lun, Z.; Hui, Z. Research on agricultural named entity recognition based on pre train BERT. Acad. J. Eng. Technol. Sci. 2022, 5, 34–42.
- Li, C.; Zhen, T.; Li, Z. Image classification of pests with residual neural network based on transfer learning. Appl. Sci. 2022, 12, 4356.
- Ukwuoma, C.C.; Qin, Z.; Bin Heyat, M.B.; Ali, L.; Almaspoor, Z.; Monday, H.N. Recent advancements in fruit detection and classification using deep learning techniques. Math. Probl. Eng. 2022, 2022, 9210947.
- Alexandridis, N.; Marion, G.; Chaplin-Kramer, R.; Dainese, M.; Ekroos, J.; Grab, H.; Jonsson, M.; Karp, D.S.; Meyer, C.; O’Rourke, M.E.; et al. Archetype models upscale understanding of natural pest control response to land-use change. Ecol. Appl. 2022, 32, e2696.
- Wang, L.; Jin, T.; Yang, J.; Leonardis, A.; Wang, F.; Zheng, F. Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases. arXiv 2024, arXiv:2412.02158.
- Lan, Y.; Guo, Y.; Chen, Q.; Lin, S.; Chen, Y.; Deng, X. Visual question answering model for fruit tree disease decision-making based on multimodal deep learning. Front. Plant Sci. 2023, 13, 1064399.
- Yang, T.; Mei, Y.; Xu, L.; Yu, H.; Chen, Y. Application of question answering systems for intelligent agriculture production and sustainable management: A review. Resour. Conserv. Recycl. 2024, 204, 107497.
- Thenmozhi, K.; Reddy, U.S. Crop pest classification based on deep convolutional neural network and transfer learning. Comput. Electron. Agric. 2019, 164, 104906.
- Tian, L.; Liu, C.; Liu, Y.; Li, M.; Zhang, J.; Duan, H. Research on plant diseases and insect pests identification based on CNN. In Proceedings of the IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2020; Volume 594, p. 012009.
- Liu, Y.; Zhang, X.; Gao, Y.; Qu, T.; Shi, Y. Improved CNN method for crop pest identification based on transfer learning. Comput. Intell. Neurosci. 2022, 2022, 9709648.
- Huang, M.L.; Chuang, T.C.; Liao, Y.C. Application of transfer learning and image augmentation technology for tomato pest identification. Sustain. Comput. Inform. Syst. 2022, 33, 100646.
- Martínez-Sastre, R.; García, D.; Miñarro, M.; Martín-López, B. Farmers’ perceptions and knowledge of natural enemies as providers of biological control in cider apple orchards. J. Environ. Manag. 2020, 266, 110589.
- Jasrotia, P.; Bhardwaj, A.K.; Katare, S.; Yadav, J.; Kashyap, P.L.; Kumar, S.; Singh, G.P. Tillage intensity influences insect-pest and predator dynamics of wheat crop grown under different conservation agriculture practices in rice-wheat cropping system of indo-Gangetic plain. Agronomy 2021, 11, 1087.
- Ji, M.; Zhang, K.; Wu, Q.; Deng, Z. Multi-label learning for crop leaf diseases recognition and severity estimation based on convolutional neural networks. Soft Comput. 2020, 24, 15327–15340.
- Duan, J.; Ding, H.; Kim, S. A multimodal approach for advanced pest detection and classification. arXiv 2023, arXiv:2312.10948.
- Zhang, J.; Liu, Z.; Yu, K. MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection. arXiv 2025, arXiv:2505.02441.
- Jin, K.; Zi, X.; Thiyagarajan, K.; Braytee, A.; Prasad, M. IP-VQA Dataset: Empowering Precision Agriculture with Autonomous Insect Pest Management through Visual Question Answering. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 2000–2007.
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
- Ni, F.; Hao, J.; Wu, S.; Kou, L.; Yuan, Y.; Dong, Z.; Liu, J.; Li, M.; Zhuang, Y.; Zheng, Y. Peria: Perceive, reason, imagine, act via holistic language and vision planning for manipulation. Adv. Neural Inf. Process. Syst. 2024, 37, 17541–17571.
- Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478.
- Yang, J.; Guo, X.; Li, Y.; Marinello, F.; Ercisli, S.; Zhang, Z. A survey of few-shot learning in smart agriculture: Developments, applications, and challenges. Plant Methods 2022, 18, 28.
- Sharma, K.; Vats, V.; Singh, A.; Sahani, R.; Rai, D.; Sharma, A. LLaVA-PlantDiag: Integrating Large-scale Vision-Language Abilities for Conversational Plant Pathology Diagnosis. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–7.
- Nanavaty, A.; Sharma, R.; Pandita, B.; Goyal, O.; Rallapalli, S.; Mandal, M.; Singh, V.K.; Narang, P.; Chamola, V. Integrating deep learning for visual question answering in Agricultural Disease Diagnostics: Case Study of Wheat Rust. Sci. Rep. 2024, 14, 28203.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763.
- Saravanan, K.S.; Bhagavathiappan, V. AOQAS: Ontology Based Question Answering System for Agricultural Domain. Int. J. Comput. Inf. Syst. Ind. Manag. Appl. 2024, 16, 16.
- Wang, Y.; Yasunaga, M.; Ren, H.; Wada, S.; Leskovec, J. VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering. arXiv 2022, arXiv:2205.11501.
- Zhang, W.; Yu, J.; Zhao, W.; Ran, C. DMRFNet: Deep multimodal reasoning and fusion for visual question answering and explanation generation. Inf. Fusion 2021, 72, 70–79.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
- Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 4015–4026.
- Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. 2018, 12, 191–202.
- Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains: A review and perspectives. J. Artif. Intell. Res. 2021, 70, 683–718.
- Seo, H.; Lee, M.; Cheong, W.; Yoon, H.; Kim, S.; Kang, M. Enhancing multi-label long-tailed classification on chest x-rays through ML-GCN augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 2747–2756.
- Omeroglu, A.N.; Mohammed, H.M.; Oral, E.A.; Aydin, S. A novel soft attention-based multi-modal deep learning framework for multi-label skin lesion classification. Eng. Appl. Artif. Intell. 2023, 120, 105897.
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742.
- Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; Yuan, L. Video-llava: Learning united visual representation by alignment before projection. arXiv 2023, arXiv:2311.10122.
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592.
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736.
HortiVQA-PP dataset composition:

| Category | Number of Classes | Number of Images | Avg. Q&A Pairs/Image |
|---|---|---|---|
| Pest Images | 30 | 9560 | 6.4 |
| Predator Images | 10 | 3120 | 5.2 |
| Pest–Predator Co-occurrence Images | 15 combinations | 1780 | 7.1 |
| Total | - | 14,460 | 6.3 |
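The per-category counts are internally consistent (9560 + 3120 + 1780 = 14,460, matching the Total row). For orientation, the dataclass below sketches what a single sample might contain; the field names are assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class HortiVQARecord:
    """Hypothetical layout of one HortiVQA-PP sample; field names are assumed."""
    image_path: str                                            # RGB scene image
    mask_path: str                                             # per-pixel class mask
    pest_labels: list[str] = field(default_factory=list)      # e.g., ["aphid"]
    predator_labels: list[str] = field(default_factory=list)  # e.g., ["ladybird"]
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)
```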
Semantic segmentation of pests and predators (Task 1):

| Model | Precision (%) | Recall (%) | F1-Score (%) | mAP50 (%) | IoU (%) |
|---|---|---|---|---|---|
| SegNet | 78.5 | 72.9 | 75.6 | 68.4 | 62.7 |
| UNet | 81.3 | 76.5 | 78.8 | 72.1 | 65.2 |
| UNet++ | 83.4 | 78.9 | 81.1 | 74.8 | 67.5 |
| Mask R-CNN | 84.7 | 79.6 | 82.0 | 75.9 | 68.9 |
| Segment Anything | 86.2 | 81.5 | 83.8 | 78.7 | 71.2 |
| HortiVQA-PP (Ours) | 89.6 | 85.2 | 87.3 | 82.4 | 75.1 |
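The table's segmentation columns follow the standard per-class definitions over predicted and ground-truth masks. The NumPy sketch below restates those definitions; it is a generic reference implementation, not the paper's evaluation code.

```python
import numpy as np

def mask_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Precision, recall, F1, and IoU for one class from boolean masks."""
    tp = np.logical_and(pred, gt).sum()   # correctly predicted pixels
    fp = np.logical_and(pred, ~gt).sum()  # false-alarm pixels
    fn = np.logical_and(~pred, gt).sum()  # missed pixels
    eps = 1e-9                            # guard against empty masks
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall + eps),
        "iou": tp / (tp + fp + fn + eps),
    }
```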
Pest–predator co-occurrence matching (Task 2):

| Model | Subset Accuracy (%) | Hamming Loss | Macro-F1 (%) |
|---|---|---|---|
| Binary Relevance | 71.4 | 0.096 | 68.9 |
| Classifier Chain | 73.6 | 0.089 | 70.4 |
| ML-GCN | 76.2 | 0.082 | 73.1 |
| ASL (Attention-based Soft Labels) | 78.0 | 0.077 | 74.8 |
| HortiVQA-PP (Ours) | 83.5 | 0.063 | 79.4 |
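Subset accuracy, Hamming loss, and macro-F1 are the standard multi-label metrics and can be reproduced with scikit-learn; the toy label matrices below are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score

# Toy binary indicator matrices: rows = images, columns = pest/predator classes.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0]])

subset_acc = accuracy_score(y_true, y_pred)           # exact match over whole label sets
h_loss = hamming_loss(y_true, y_pred)                 # fraction of individually wrong labels
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"subset accuracy={subset_acc:.3f}, Hamming loss={h_loss:.3f}, macro-F1={macro_f1:.3f}")
```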
Visual question answering (Task 3):

| Model | BLEU-4 (%) | ROUGE-L (%) | METEOR (%) | EM (%) | GPT Rating (/5) |
|---|---|---|---|---|---|
| BLIP-2 | 38.2 | 45.6 | 35.4 | 28.7 | 3.6 |
| LLaVA | 41.5 | 48.1 | 37.8 | 30.9 | 3.8 |
| CLIP-QA | 35.7 | 44.2 | 33.5 | 27.3 | 3.5 |
| MiniGPT-4 | 43.6 | 49.3 | 39.2 | 32.5 | 4.0 |
| Flamingo | 44.1 | 50.5 | 39.6 | 33.2 | 4.1 |
| HortiVQA-PP (Ours) | 48.7 | 54.8 | 43.3 | 36.9 | 4.5 |
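Among the generation metrics, exact match (EM) and BLEU-4 are easy to restate. The snippet below shows one common way to compute them with NLTK, assuming lowercase/whitespace normalization for EM; the paper's exact protocol may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(pred: str, gold: str) -> bool:
    # EM after light normalization (a common convention, not necessarily the paper's).
    return pred.strip().lower() == gold.strip().lower()

def bleu4(pred: str, gold: str) -> float:
    # Sentence-level BLEU-4; smoothing avoids zero scores on short answers.
    return sentence_bleu([gold.split()], pred.split(),
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

print(exact_match("Ladybird beetle", "ladybird beetle"))  # True
print(bleu4("release ladybirds to control aphids",
            "release ladybird beetles to control aphids"))
```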
Robustness evaluation on public datasets ("–" marks tasks a baseline does not address):

| Model | Task 1: F1-Score (%) | Task 2: Macro-F1 (%) | Task 3: EM (%) | Task 3: GPT Rating (/5) |
|---|---|---|---|---|
| SegNet | 76.1 | – | – | – |
| UNet | 79.4 | – | – | – |
| UNet++ | 81.7 | – | – | – |
| Mask R-CNN | 82.6 | – | – | – |
| Segment Anything | 84.3 | – | – | – |
| Binary Relevance | – | 69.3 | – | – |
| Classifier Chain | – | 70.9 | – | – |
| ML-GCN | – | 73.7 | – | – |
| ASL | – | 75.2 | – | – |
| BLIP-2 | – | – | 29.1 | 3.7 |
| LLaVA | – | – | 31.4 | 3.9 |
| CLIP-QA | – | – | 27.9 | 3.6 |
| MiniGPT-4 | – | – | 33.0 | 4.1 |
| Flamingo | – | – | 33.7 | 4.2 |
| HortiVQA-PP (Ours) | 88.1 | 80.1 | 37.6 | 4.6 |
Ablation study of core modules:

| Variant | Task 1: F1-Score (%) | Task 2: Macro-F1 (%) | Task 3: EM (%) | Task 3: GPT Rating (/5) |
|---|---|---|---|---|
| w/o LoRA | 84.4 | 76.5 | 35.4 | 4.4 |
| w/o QS-Mamba | 83.7 | 74.2 | 35.9 | 4.3 |
| w/o HortiKG | 87.9 | 79.7 | 33.2 | 4.0 |
| Full Model | 88.1 | 80.1 | 37.6 | 4.6 |
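Of the three ablated components, LoRA is generic enough to state compactly: a frozen base projection W is augmented with a trainable low-rank update, y = Wx + (alpha/r) * B(Ax), with B initialized to zero so training starts from the unmodified base model. The wrapper below is a minimal generic sketch of that technique, not the adapter configuration used in HortiVQA-PP.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = Wx + (alpha / r) * B(Ax), with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # base weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=0.01)
        nn.init.zeros_(self.lora_B.weight)   # update starts at zero: no initial drift
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```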
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation
Li, Z.; Du, C.; Li, S.; Jiang, Y.; Zhang, L.; Ju, C.; Yue, F.; Dong, M. HortiVQA-PP: Multitask Framework for Pest Segmentation and Visual Question Answering in Horticulture. Horticulturae 2025, 11, 1009. https://doi.org/10.3390/horticulturae11091009