Article

HortiVQA-PP: Multitask Framework for Pest Segmentation and Visual Question Answering in Horticulture

1 China Agricultural University, Beijing 100083, China
2 National School of Development, Peking University, Beijing 100871, China
* Author to whom correspondence should be addressed.
Horticulturae 2025, 11(9), 1009; https://doi.org/10.3390/horticulturae11091009
Submission received: 24 July 2025 / Revised: 16 August 2025 / Accepted: 20 August 2025 / Published: 25 August 2025

Abstract

A multimodal interactive system, HortiVQA-PP, is proposed for horticultural scenarios, with the aim of achieving precise identification of pests and their natural predators, modeling ecological co-occurrence relationships, and providing intelligent question-answering services tailored to agricultural users. The system integrates three core modules: semantic segmentation, pest–predator co-occurrence detection, and knowledge-enhanced visual question answering. A multimodal dataset comprising 30 pest categories and 10 predator categories has been constructed, encompassing annotated images and corresponding question–answer pairs. In the semantic segmentation task, HortiVQA-PP outperformed existing models across all five evaluation metrics, achieving a precision of 89.6%, recall of 85.2%, F1-score of 87.3%, mAP@50 of 82.4%, and IoU of 75.1%, representing an average improvement of approximately 4.1% over the Segment Anything model. For the pest–predator co-occurrence matching task, the model attained a multi-label accuracy of 83.5%, a reduced Hamming Loss of 0.063, and a macro-F1 score of 79.4%, significantly surpassing methods such as ASL and ML-GCN, thereby demonstrating robust structural modeling capability. In the visual question answering task, the incorporation of a horticulture-specific knowledge graph enhanced the model's reasoning ability. The system achieved 48.7% in BLEU-4, 54.8% in ROUGE-L, 43.3% in METEOR, 36.9% in exact match (EM), and a GPT expert score of 4.5, outperforming mainstream models including BLIP-2, Flamingo, and MiniGPT-4 across all metrics. Experimental results indicate that HortiVQA-PP exhibits strong recognition and interaction capabilities in complex pest scenarios, offering a high-precision, interpretable, and widely applicable artificial intelligence solution for digital horticulture.

1. Introduction

China is recognized as a major horticultural nation, with vast horticultural land and a robust industry that plays a pivotal role in advancing the country’s modern agricultural development [1]. However, practical horticultural production continues to face challenges such as low efficiency in water and fertilizer use, high labor intensity, and underdeveloped mechanization. In recent years, the rapid advancement of digital horticulture, driven by technological progress, has significantly transformed agricultural practices and promoted the integration of advanced technologies such as computer vision to enhance agricultural intelligence. Computer vision has been increasingly applied to essential agricultural tasks such as pest and disease identification, fruit counting, and growth monitoring, substantially improving the level of automation in agriculture. Studies by Liu et al. demonstrate that deep learning exhibits considerable advantages in image processing, significantly enhancing the accuracy of plant disease and pest recognition [2]. Farjon et al. [3] have indicated that computer vision enables fruit counting through the recognition of distinguishing features. Similarly, Tong et al. [4] noted that deep learning approaches have driven growth monitoring from simple phenological stage recognition to more complex temporal growth information extraction. Despite this progress, a substantial cognitive gap remains between farmers and AI systems in practical applications. Most existing models offer classification outcomes without interpretability or interactivity, and the ecological co-occurrence relationships between pests and their natural enemies have not yet been effectively modeled, limiting the development of ecologically regulated horticultural management systems. To address these challenges, researchers have begun integrating computer vision and visual question answering (VQA) technologies into digital horticulture. Changyou et al. [5] emphasized that crop growth monitoring based on machine vision can utilize image and video technologies to diagnose horticultural indicators such as maturity, water stress, and nutrient deficiencies. Lun et al. [6] further noted that agricultural VQA systems can process data collected from diverse sensors and IoT devices embedded in agricultural environments. Nevertheless, most existing research remains focused on single-object tasks such as pest classification and fruit segmentation [7,8]. Although significant advances have been made, these methods fall short of meeting the broader, multi-objective, and ecologically oriented needs of sustainable agricultural management [5]. Alexandridis et al. [9] argued that current ecological models are often either overly specific or excessively generalized, impeding their effectiveness in managing agricultural ecosystems.
In addition, recent years have witnessed notable advancements in VQA technologies in general-purpose domains, with growing applications in horticulture. Wang et al. [10] proposed a multimodal dialogue system, Agri-LLaVA, which demonstrated strong performance in agricultural visual understanding and offered new insights and methods for pest management. Lan et al. [11] introduced a VQA model based on multimodal feature fusion for fruit tree disease diagnosis, with wide applicability in smart agriculture. However, horticulture-specific applications still face persistent challenges such as data scarcity and weak semantic alignment. Lun et al. [6] emphasized that agricultural VQA systems require deep knowledge in domain-specific areas including crop cultivation, soil science, and pest control. Despite the potential of VQA systems to facilitate interaction between farmers and AI through natural language queries about crops or pests, numerous challenges persist. The lack of publicly available agricultural datasets hampers model training and benchmark evaluation. Moreover, the images and questions emerging in agricultural contexts are more diverse and complex than those in general domains, making it difficult for systems to provide accurate and contextually relevant answers [12]. To address these issues, a novel system, HortiVQA-PP, is proposed in this study. It integrates semantic segmentation, pest–predator co-occurrence detection, and multimodal question answering. A large language model is incorporated to enhance visual understanding, while a horticulture-specific knowledge graph is introduced to guide answer generation, enabling intelligent responses to domain-specific questions such as “What pest is this?”, “What is its natural enemy?”, and “How should it be controlled?”.
In this work, an interactive system for horticultural pest–predator recognition is first introduced, combining semantic segmentation with VQA to support the identification of pests and their natural enemies, forming a novel ecologically grounded AI framework for horticulture. A dedicated dataset, HortiVQA-PP, has been constructed, comprising over 30 pest species and their predator pairings. Three core modules are designed and implemented: a segmentation-aware visual encoder, a multi-label co-occurrence detection module, and a question-answering alignment module. These components collaboratively enhance the system’s capabilities in accurate pest identification, ecological relation detection, and high-quality response generation. For the first time, ecological regulation knowledge has been integrated into an AI visual system for multimodal horticultural understanding, offering an interpretable and interactive intelligent platform for digital horticulture. The main contributions of this study are summarized as follows:
  • The first interactive recognition system for horticultural pest–predator identification is proposed, combining semantic segmentation and VQA to build an ecologically aware visual intelligence framework.
  • A comprehensive dataset, HortiVQA-PP, is constructed, containing image annotations and Q&A pairs covering over 30 types of pests and their predator relationships.
  • Three key modules are developed: a segmentation-aware module, a multi-label co-occurrence detection module, and a semantic alignment question-answering module, all contributing to improved identification and response quality.
  • The integration of ecological knowledge into an AI vision system is realized for the first time in horticultural multimodal understanding tasks, enabling an interpretable and interactive intelligent platform for digital horticulture.

2. Related Work

2.1. Research on Pest Identification and Ecological Regulation in Horticultural Environments

Traditional pest detection approaches based on convolutional neural networks (CNNs) utilize deep learning to automatically extract image features for pest identification and classification [13]. These methods generally involve a series of steps, including data acquisition, image preprocessing, feature extraction, classification prediction, and post-processing. In recent years, CNN models have been widely applied to plant disease detection, fruit classification, and crop pest identification. Tian et al. [14] proposed a CNN-based method for plant disease and pest recognition that demonstrated high classification accuracy. Liu et al. [15] developed a multi-layer CNN architecture to enhance the accuracy of crop pest and disease classification. The fully connected layers in CNNs typically use a softmax activation function, and classification performance is optimized via the cross-entropy loss function defined as [16]:
$$\mathrm{Loss} = -\sum_{i=1}^{N}\sum_{j=1}^{K} t_{ij}\,\ln y_{ij},$$
where $N$ denotes the number of disease images, $K$ is the number of disease classes, $t_{ij}$ is the true label of the $i$-th image belonging to class $j$, and $y_{ij}$ is the model's predicted output for class $j$ of sample $i$. In ecological regulation systems, the identification and utilization of natural enemies represent a critical area of study. Compared with conventional chemical pest control, the use of natural enemies offers an environmentally friendly alternative that helps maintain agricultural ecosystem balance [17]. Recently, intelligent systems leveraging artificial intelligence and machine learning have drawn considerable attention for their potential to recognize pest-related natural enemy species and provide decision support in pest management. Jasrotia et al. [18] analyzed the abundance and impact of major pests and their natural enemies in wheat, and proposed management strategies to improve the effectiveness of conservation agriculture systems. Co-occurrence recognition and pairing relationship modeling are essential to understanding the interaction between pests and their natural enemies. Traditional approaches often overlook such relationships. In recent developments, multi-label recognition techniques have been introduced, allowing the simultaneous detection of multiple pests and natural enemies, while also enabling inference of inter-object relationships. Ji et al. [19] proposed a series of deep learning networks for the automatic identification and severity estimation of crop leaf diseases, capable of recognizing species, classifying diseases, and estimating severity levels in an end-to-end manner. Multimodal modeling, which integrates vision, audio, and environmental data, has further improved the understanding of pest–enemy interactions. Duan et al. [20] introduced a multimodal deep learning framework combining tiny-BERT for natural language processing with R-CNN and ResNet-18 for image analysis to enhance pest detection performance. Zhang et al. [21] developed a Multi-Scale Fusion Network (MSFNet-CPD) that significantly improved the robustness and accuracy of pest detection through cross-modal integration.
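For concreteness, this objective corresponds to the standard categorical cross-entropy provided by common deep learning frameworks. The PyTorch sketch below illustrates the correspondence; the batch size, class count, and tensor names are chosen purely for illustration and are not taken from the cited works.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the categorical cross-entropy objective above.
logits = torch.randn(8, 5)            # raw model outputs for N = 8 images, K = 5 classes
targets = torch.randint(0, 5, (8,))   # integer class labels

# F.cross_entropy applies softmax internally, so this equals
# -(1/N) * sum_i sum_j t_ij * ln(y_ij) with one-hot labels t_ij.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```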

2.2. The Development of Visual Question Answering (VQA) Technology

VQA integrates computer vision with natural language processing (NLP), supporting efficient pest detection and crop health monitoring [22]. The BLIP model is a representative multimodal architecture trained with both vision and language data, designed to improve image-language reasoning via self-supervised pretraining strategies [23]. By learning interactions between images and texts, BLIP significantly improves VQA performance. LLaVA, a VQA model based on vision-language attention mechanisms, enhances object-level understanding through fine-grained attention over visual regions and language sequences [24], delivering robust performance in scenarios with diverse backgrounds and complex semantics. MiniGPT, designed for resource-constrained environments, leverages parameter-efficient generation and compressed learning techniques to address generative challenges in VQA tasks [25]. This model combines the strengths of generative dialogue systems with VQA capabilities, enabling flexible reasoning in response to open-ended queries.
Despite the rapid advancement of VQA technologies, challenges related to data scarcity and semantic variance remain prominent, particularly in agricultural domains. Domain-specific VQA tasks often require extensive annotated datasets, and limited access to such data substantially hinders model training [26]. Furthermore, knowledge representations in agriculture differ significantly from those in general domains. Crop recognition and disease diagnosis entail handling domain-specific terminology and visual cues, which traditional VQA models trained on general vocabulary struggle to interpret. Jin et al. [22] introduced the IP-VQA dataset, tailored for precision agriculture, to address challenges in pest-related VQA by integrating visual and textual modalities. In plant recognition, models must identify not only species but also growth stages and leaf morphology. In disease diagnosis, detecting and interpreting pest damage is critical. VQA models, by integrating image and text data, are capable of responding to questions regarding plant pathology. Sharma et al. [27] developed the LLaVA-PlantDiag model to answer open-ended plant pathology questions using multimodal dialogue. Nanavaty et al. [28] proposed a VQA-based diagnostic framework specifically for wheat rust, enabling automated responses to disease infection queries based on input images.

2.3. Multimodal Semantic Alignment and Knowledge Enhancement

In multimodal learning, cross-modal representation alignment and knowledge enhancement play crucial roles in improving both model understanding and generation capabilities. These techniques aim to ensure semantic consistency between visual and linguistic information. CLIP (Contrastive Language-Image Pre-training) [29] applies a contrastive learning strategy to maximize the similarity between aligned image-text pairs, achieving effective multimodal alignment across various tasks without the need for extensive task-specific fine-tuning. BLIP (Bootstrapping Language-Image Pretraining) [23] extends CLIP by incorporating self-supervised pretraining, enabling fine-grained visual-linguistic alignment and facilitating robust performance in complex multimodal reasoning scenarios.
Knowledge enhancement mechanisms, particularly through domain ontologies and knowledge graphs, have been shown to effectively improve inference in specialized domains. Domain ontologies provide structured knowledge representations that, when embedded into VQA or image captioning models, enable reasoning not solely based on data but also informed by expert knowledge. In agriculture, this includes taxonomies of plant species, pest types, and disease symptoms, which aid models in more accurately diagnosing diseases or identifying species. Saravanan et al. [30] developed AOQAS, an ontology-based agricultural VQA system, demonstrating superior performance in agricultural knowledge retrieval tasks. Knowledge graphs, represented as nodes and edges, allow structured relationships between concepts to be incorporated into model reasoning pipelines. In multimodal tasks, aligning visual input with entities and relations from a knowledge graph introduces prior knowledge into the reasoning process, particularly useful under ambiguous conditions. The enhanced multimodal representation can be described as:
$$f_{\mathrm{augmented}} = f_{\mathrm{visual}} + \lambda \cdot f_{\mathrm{knowledge}},$$
where $f_{\mathrm{visual}}$ represents the extracted visual features, $f_{\mathrm{knowledge}}$ denotes the embedded knowledge graph representation, and $\lambda$ is a weighting factor for knowledge fusion. In real-world applications, cross-modal alignment faces challenges such as semantic mismatches and inconsistent feature spaces, particularly in domains like agriculture. The visual features extracted from images often do not correspond neatly to linguistic structures, reducing alignment accuracy. To address these issues, deep learning and graph-based models have been employed in tandem. Wang et al. [31] proposed a novel multimodal GNN technique that showed strong results in two VQA benchmarks (VCR and GQA). Zhang et al. [32] designed an effective multimodal reasoning and fusion framework, enabling fine-grained reasoning and integration across visual and textual modalities in VQA tasks.
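A minimal sketch of this additive fusion is shown below, assuming the visual feature and the knowledge-graph entity embedding share one embedding space; the function name, 768-dimensional size, and default weight are illustrative choices rather than details taken from the cited systems.

```python
import torch

def augment_with_knowledge(f_visual: torch.Tensor,
                           f_knowledge: torch.Tensor,
                           lam: float = 0.5) -> torch.Tensor:
    """Additive knowledge fusion: f_augmented = f_visual + lambda * f_knowledge.
    Assumes both features live in the same embedding space."""
    assert f_visual.shape == f_knowledge.shape
    return f_visual + lam * f_knowledge

# Example: a 768-d visual feature fused with the embedding of a matched KG entity.
f_vis = torch.randn(1, 768)
f_kg = torch.randn(1, 768)
f_aug = augment_with_knowledge(f_vis, f_kg, lam=0.5)
```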

3. Materials and Method

3.1. Data Collection

The raw image data utilized in this study were primarily collected from two representative horticultural production environments: controlled-environment greenhouses and open-field orchards. The dataset covers economically significant crops such as tomatoes, peppers, grapes, and strawberries. During the pest image acquisition phase, observational sites were established between March 2023 and July 2024 in agricultural bases located in Shouguang (Shandong), Foshan (Guangdong), and Meishan (Sichuan). High-resolution industrial cameras (Sony IMX347 (Tokyo, Japan), resolution: 2592 × 1944) equipped with variable focal macro lenses and LED illumination systems were employed for close-range imaging, ensuring consistent image quality under diverse lighting conditions, as shown in Table 1. The 30 pest categories comprise Aphis gossypii (cotton aphid), Myzus persicae (green peach aphid), Macrosiphum euphorbiae (potato aphid), Bemisia tabaci (tobacco/silverleaf whitefly), Trialeurodes vaporariorum (greenhouse whitefly), Frankliniella occidentalis (western flower thrips), Thrips palmi (melon thrips), Tetranychus urticae (two-spotted spider mite), Tetranychus cinnabarinus (carmine spider mite), Liriomyza sativae (vegetable leafminer), Liriomyza trifolii (American serpentine leafminer), Spodoptera litura (tobacco cutworm), Spodoptera frugiperda (fall armyworm), Helicoverpa armigera (cotton bollworm), Plutella xylostella (diamondback moth), Trichoplusia ni (cabbage looper), Pieris rapae (small cabbage white), Pseudococcus longispinus (longtailed mealybug), Planococcus citri (citrus mealybug), Aphis craccivora (cowpea aphid), Bactrocera dorsalis (oriental fruit fly), Drosophila suzukii (spotted-wing drosophila), Phyllotreta striolata (striped flea beetle), Leptinotarsa decemlineata (Colorado potato beetle), Empoasca vitis (grape leafhopper), Nezara viridula (southern green stink bug), Aleurocanthus spiniferus (orange spiny whitefly), Tuta absoluta (tomato leafminer), Agrotis ypsilon (black cutworm), and Lycoriella ingenua (fungus gnat). The 10 predator categories comprise Coccinella septempunctata (seven-spotted lady beetle), Harmonia axyridis (harlequin ladybird), Propylea japonica, Chrysoperla carnea (green lacewing), Chrysopa formosa, Aphidius gifuensis (aphid parasitoid), Encarsia formosa (whitefly parasitoid), Trichogramma chilonis, Trichogramma evanescens (egg parasitoids), and Rhynocoris marginatus (assassin bug). In addition, publicly available web-sourced images were included. For each pest category, no fewer than 300 images were collected, encompassing various life stages (egg, nymph, adult) and different viewing angles. To enhance dataset diversity, additional images capturing pests located on different plant parts (e.g., leaf surfaces, stems, fruits, and flowers) were gathered, along with annotations describing associated symptoms such as bite marks or chlorosis.
For the collection of natural enemy images, emphasis was placed on ten representative biological control insect species, including lady beetles, lacewings, parasitoid wasps, Trichogramma, and assassin bugs, as shown in Figure 1. Data sources included greenhouse-based biological control trials, automated trap-captured imagery from open-field monitoring sites, and partially curated open-access entomological databases provided by the National Academy of Agricultural Sciences. All images were manually screened and validated to ensure clear pest/enemy visibility, low background noise, and accurate annotations of species type and bounding positions. To support the VQA task, between five and ten natural language question–answer pairs were constructed for each image. These questions encompassed topics such as pest identification, ecological co-occurrence, biological control knowledge, and control recommendations. All Q and A pairs were designed by agricultural experts and grounded in domain-specific knowledge resources, including The Atlas of Major Crop Pests and Diseases in China, Agricultural Entomology, and the structured horticultural knowledge graph HortiKG. Each image and its corresponding Q and A content were linked via consistent ID mapping and standardized naming conventions to facilitate integration into the training pipeline.

3.2. Data Preprocessing

To enhance the robustness and generalization capabilities of the model in complex horticultural environments—particularly under real-world noise conditions such as occlusion, background interference, and varying illumination—three representative data augmentation strategies were employed: CutMix, GridMask, and Blurring Occlusion Augmentation. These techniques aim to improve the model’s sensitivity to local features and tolerance to missing or distorted regions by simulating perturbations from the perspectives of regional substitution, structural occlusion, and modality blurring, respectively.
CutMix is a region-based image fusion augmentation method that operates by randomly cropping a rectangular area from one image and pasting it into the same location of another image, with the corresponding labels linearly combined in proportion to the area. Given two training samples $(x_A, y_A)$ and $(x_B, y_B)$, the augmented sample $(\tilde{x}, \tilde{y})$ is defined as:
$$\tilde{x} = M \odot x_A + (1 - M) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B,$$
where $M$ is a binary mask matrix of the same dimensions as the input image, $\odot$ denotes element-wise multiplication, and $\lambda = \frac{\mathrm{Area}(M)}{\mathrm{Area}(x)}$ represents the proportion of the cropped area, used for balancing the label weights. GridMask augmentation enhances generalization by occluding the image with regularly spaced grid patterns, thereby preventing the model from overfitting to local regions and encouraging it to learn from global and redundant features. The transformation is expressed as:
$$\tilde{x} = x \odot G, \qquad G(i, j) = \begin{cases} 0, & \text{if } (i \bmod d) \in [s, s + l) \\ 1, & \text{otherwise}, \end{cases}$$
where d is the grid spacing, l denotes the width of the masked region, s is the random starting offset, and G is the generated grid mask matrix. This method is particularly effective in maintaining robustness when pest contours or predator features are partially occluded. Finally, Blurring Occlusion Augmentation simulates partial blur caused by natural factors such as focus errors, motion blur, or occlusion from plant foliage. In this method, several small regions are randomly selected from the input image and filtered using a Gaussian blur kernel:
$$\tilde{x}(i, j) = \sum_{u=-k}^{k}\sum_{v=-k}^{k} K(u, v) \cdot x(i + u, j + v), \qquad K(u, v) = \frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{u^2 + v^2}{2\sigma^2}\right),$$
where $K(u, v)$ is the two-dimensional Gaussian kernel function, $\sigma$ controls the blur intensity, and $k$ defines half the size of the convolution window. This technique improves the model's ability to recognize pest structures under varying levels of visual clarity. Through the comprehensive application of the aforementioned augmentation strategies, sample diversity within the training dataset was significantly increased. This mitigates model dependence on specific textures, postures, or background patterns and leads to improved robustness and generalization performance in real-world horticultural image scenarios.
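As a rough illustration, the three augmentations can be prototyped as follows in PyTorch, assuming CHW float tensors and images larger than the blurred patch; the hyper-parameter values are placeholders, and the GridMask sketch masks stripes along both axes rather than the single-axis form written above.

```python
import torch
import torchvision.transforms.functional as TF

def cutmix(x_a, y_a, x_b, y_b):
    """CutMix: paste a random rectangle from x_b into x_a and mix labels by kept area."""
    _, H, W = x_a.shape
    lam = torch.rand(1).item()                                  # target mixing ratio
    cut_h, cut_w = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)
    x_mix = x_a.clone()
    x_mix[:, y1:y2, x1:x2] = x_b[:, y1:y2, x1:x2]
    lam_eff = 1 - ((y2 - y1) * (x2 - x1)) / (H * W)             # actual kept-area ratio
    return x_mix, lam_eff * y_a + (1 - lam_eff) * y_b

def gridmask(x, d=32, l=16, s=0):
    """GridMask: zero out periodic stripes of width l every d pixels (here on both axes)."""
    _, H, W = x.shape
    rows = ((torch.arange(H) + s) % d) < l
    cols = ((torch.arange(W) + s) % d) < l
    mask = ~(rows[:, None] | cols[None, :])                     # 0 inside the grid bands
    return x * mask

def blur_occlusion(x, patch=32, sigma=3.0):
    """Blurring occlusion: Gaussian-blur one random patch to mimic defocus or foliage."""
    _, H, W = x.shape
    top = torch.randint(0, H - patch, (1,)).item()
    left = torch.randint(0, W - patch, (1,)).item()
    region = x[:, top:top + patch, left:left + patch]
    blurred = TF.gaussian_blur(region.unsqueeze(0),
                               kernel_size=2 * int(3 * sigma) + 1,
                               sigma=sigma).squeeze(0)
    x = x.clone()
    x[:, top:top + patch, left:left + patch] = blurred
    return x
```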

3.3. Proposed Method

3.3.1. Overall

The proposed HortiVQA-PP system employs a multi-module collaborative fusion architecture, as illustrated in Figure 2. Within the overall framework, visual and textual information are initially processed through an image encoder and a text encoder, respectively, to extract semantic features. These features are subsequently integrated via a multimodal semantic alignment mechanism, resulting in predictions that include object recognition, co-occurrence relationships, and natural language answers. Specifically, preprocessed images are first input into an image segmentation module based on Segment Anything or DeepLab to extract refined region masks for pests and predators, which are then used to generate region-aware semantic representations. These representations serve not only for multi-object recognition but also as visual attention guidance signals for downstream modules. Concurrently, structured questions (e.g., “What is this pest?” or “What is its natural enemy?”) are embedded and processed using the LLaMA language model backbone, from which semantic vectors are extracted via a dedicated text encoder.
In the pest–predator co-occurrence recognition stage, a Pest–Predator Matching Head is introduced to fuse spatial positional relationships and semantic attention, enabling multi-label classification and joint modeling of the extracted pest features. An attention mechanism is embedded to simulate ecological pairwise relationships between pests and their natural enemies. Following encoding, a modality alignment module computes semantic similarity scores between visual and textual representations, and the entire predictive structure is jointly optimized using a visual loss L vision and a semantic alignment constraint loss. During natural language answer generation, image features are used as contextual conditions to guide the language model via prompt templates and horticultural knowledge graph (HortiKG) embeddings, ensuring that generated answers align with image semantics. Ultimately, the system forms a closed-loop pipeline for intelligent horticultural understanding through three output streams: classification outputs, co-occurrence label matrices, and natural language answers. The overall architecture enables end-to-end feature flow and cross-modal alignment, making it suitable for complex image understanding and semantic reasoning tasks.

3.3.2. Segmentation-Aware Visual Encoder

In the HortiVQA-PP system, the segmentation-aware visual encoder is designed to provide spatially structured semantic representations for pest–predator recognition and downstream question-answering modules. This module adopts the Segment Anything Model (SAM) as the backbone and incorporates a parameter-efficient Low-Rank Adaptation (LoRA) mechanism for localized tuning. This design accommodates image characteristics in horticultural scenarios, such as densely packed small objects, blurry contours, and complex semantics. As illustrated in Figure 3, the input image is processed by a frozen SAM image encoder, followed by multiple frozen multi-head attention layers. Outputs from each layer are preserved as in-features, and LoRA insertion points are applied post-layer to perform low-rank modulation. Each LoRA module consists of a downsampling matrix $A \in \mathbb{R}^{d \times r}$ and an upsampling matrix $B \in \mathbb{R}^{r \times d}$, forming a residual term $\Delta W_t = AB$, which is added to the original feature weights $\bar{W}_{t-1}$ to yield the modulated representation $W_t$. The insertion points align with the structure of the attention layers to facilitate cross-layer semantic modulation.
Following each attention stage, a content-aware feature refinement mechanism is introduced. Outputs from all attention layers are concatenated and processed via an inner-product similarity module, generating a similarity matrix based on $X_t$ and its transpose $X_t^{\top}$. The resulting attention map $C_t = \mathrm{softmax}(X_t X_t^{\top})$ is propagated to the downstream mask encoder module to generate the final semantic mask $M_t$. This mask serves both for visual region localization and as a prior for region attention in the pest co-occurrence detection and visual question-answering modules. Mathematically, by enforcing the modulation rank $r$ to be significantly smaller than the input dimension $d$ (with $r = 16$ and $d = 512$ in this study), the LoRA module significantly reduces tuning complexity while maintaining generalization across domains. Additionally, the inner-product matrix $C_t$ enhances boundary discrimination between pests and background, enabling effective semantic focus even in cases of occlusion and leaf overlap. The final loss function consists of three components: semantic segmentation cross-entropy loss $\mathcal{L}_{\mathrm{mask}}$, LoRA weight regularization loss $\mathcal{L}_{\mathrm{lora}}$, and region alignment loss $\mathcal{L}_{\mathrm{align}}$, formulated as
$$\mathcal{L}_{\mathrm{vision}} = \mathcal{L}_{\mathrm{mask}} + \lambda_1 \mathcal{L}_{\mathrm{lora}} + \lambda_2 \mathcal{L}_{\mathrm{align}},$$
where $\lambda_1$ and $\lambda_2$ denote the loss weights, set to 0.01 and 0.1, respectively. Through this segmentation-aware visual encoder design, the HortiVQA-PP system achieves refined semantic boundary modeling and enhanced region attention in pest recognition, thereby providing critical structural and semantic support for ecological relationship modeling and natural language generation. The incorporation of this module significantly improves the system's robustness and accuracy in small-object detection and region-based question-answering tasks.
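For readers less familiar with LoRA, the sketch below wraps a frozen linear projection with a trainable low-rank residual, using the rank $r = 16$ and width $d = 512$ reported above; the wrapper class, zero initialization of $B$, and scaling factor are conventional choices assumed here, not the exact SAM insertion code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adaptation of a frozen linear layer: y = W x + (x A^T) B^T * scaling."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # keep pretrained weights frozen
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # down-projection (d -> r)
        self.B = nn.Parameter(torch.zeros(d_out, r))         # up-projection (r -> d), init 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

# Example: wrap a 512-d attention projection and count trainable parameters.
proj = LoRALinear(nn.Linear(512, 512), r=16)
n_trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
```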

3.3.3. Pest–Predator Co-Occurrence Multi-Object Segmentation Network

In the HortiVQA-PP system, a QS-Mamba-based multi-scale deep architecture is employed for the pest–predator co-occurrence multi-object segmentation network. This design aims to simultaneously accomplish instance-level segmentation and semantic-level ecological co-occurrence label prediction. As illustrated in Figure 4, a 4D tensor input with dimensions $C \times D \times H \times W$ is first processed through a depth-wise convolution-based stem module for initial feature extraction. The resulting features are then forwarded to a four-stage QS-Mamba encoder, where each stage performs spatial downsampling and channel dimensionality expansion. The respective output feature dimensions are $48 \times \frac{D}{2} \times \frac{H}{2} \times \frac{W}{2}$, $96 \times \frac{D}{4} \times \frac{H}{4} \times \frac{W}{4}$, $192 \times \frac{D}{8} \times \frac{H}{8} \times \frac{W}{8}$, and $384 \times \frac{D}{16} \times \frac{H}{16} \times \frac{W}{16}$. Each encoder layer is followed by a residual block to preserve semantics and stabilize channels, and the final layer incorporates an upsampling module to enable the fusion of low-level and high-level semantic features in the decoder stage.
During the decoding phase, a multi-scale fusion decoder termed MF-Mamba is introduced. It integrates multi-resolution features from different encoder stages and reconstructs spatial and semantic information via cross-scale residual connections and learnable weighted fusion. After progressive upsampling, the output resolution is restored to C × D × H × W , followed by a segmentation head (Seg-head) to generate instance-wise masks and multi-label classifications. In addition to outputting object categories, the module also identifies whether each object exhibits ecological co-occurrence with other targets in the image. To model such co-occurrence, a co-occurrence graph prediction head is incorporated, mathematically defined as:
$$\hat{Y}_{ij} = \sigma\!\left(v_i^{\top} W_c\, v_j\right), \qquad i, j = 1, \dots, N,$$
where $v_i$ and $v_j$ denote the embedding representations of the $i$-th and $j$-th instances, respectively, $W_c$ is a learnable pairing weight matrix, $\sigma$ is the sigmoid function, and $\hat{Y}_{ij}$ represents the predicted co-occurrence probability between instance $i$ and $j$. During training, the binary cross-entropy loss between $\hat{Y}_{ij}$ and the ground-truth co-occurrence label $Y_{ij}$ is minimized as:
$$\mathcal{L}_{\mathrm{pair}} = -\sum_{i, j}\left[ Y_{ij} \log \hat{Y}_{ij} + (1 - Y_{ij}) \log\left(1 - \hat{Y}_{ij}\right) \right].$$
This network structure is jointly used with the previously described segmentation-aware visual encoder. Specifically, each layer of the QS-Mamba encoder integrates a region attention mask from the visual encoder as a modulation factor. The input feature $F_l$ of each layer is modulated by the corresponding semantic mask $M_l$ via element-wise multiplication:
$$\tilde{F}_l = F_l \odot \mathrm{Up}(M_l),$$
where Up ( · ) denotes upsampling to the current resolution, and ⊙ indicates element-wise multiplication. This mechanism effectively guides the network to focus on actual pest and predator regions during early training stages, maintaining segmentation robustness under complex background interference. Through the integration of these modules, the proposed multi-object segmentation network demonstrates fine-grained sensitivity to small targets and achieves ecological structure-aware modeling. Significant improvements have been observed in pest–predator co-occurrence recognition, enhancing the system’s adaptability and practical value in real-world agricultural ecological regulation tasks. Experimental results indicate that the proposed structure outperforms traditional YOLO- and Transformer-based models in metrics such as mIoU, mAP@50, and multi-label F1-score, while offering superior semantic interpretability and ecological guidance capability.
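A minimal sketch of the co-occurrence head and the mask modulation step is given below. The bilinear scoring and binary cross-entropy follow the equations above, while the embedding dimension, instance count, and feature shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoOccurrenceHead(nn.Module):
    """Pairwise co-occurrence prediction: Y_hat_ij = sigmoid(v_i^T W_c v_j)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.W_c = nn.Parameter(torch.empty(dim, dim))
        nn.init.xavier_uniform_(self.W_c)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (N, dim) instance embeddings -> (N, N) co-occurrence probabilities
        return torch.sigmoid(v @ self.W_c @ v.T)

head = CoOccurrenceHead(dim=256)
v = torch.randn(5, 256)                        # 5 segmented instances
y_hat = head(v)                                # predicted pairing probabilities
y_true = torch.randint(0, 2, (5, 5)).float()   # ground-truth co-occurrence labels
loss_pair = nn.functional.binary_cross_entropy(y_hat, y_true)   # L_pair

# Region-mask modulation used at each encoder stage: F_l ⊙ Up(M_l)
feat = torch.randn(1, 48, 64, 64)              # encoder feature F_l
mask = torch.rand(1, 1, 16, 16)                # low-resolution semantic mask M_l
mod = feat * nn.functional.interpolate(mask, size=feat.shape[-2:], mode="bilinear")
```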

3.3.4. Horticultural Knowledge-Augmented Question Answering Module

Within the HortiVQA-PP system, the horticultural knowledge-augmented question answering module serves as the core for generating natural language responses from multimodal inputs. As depicted in Figure 5, the module integrates visual encoding, language encoding, and semantic-level interaction modeling, supported by the domain-specific knowledge graph HortiKG as a prior for response generation. A high-resolution image encoder is utilized to extract dense semantic vectors, forming a dense visual token sequence of size $[B, N, D]$, where $B$ is the batch size, $N$ the number of patches, and $D = 768$ the embedding dimension. Meanwhile, a low-resolution global semantic feature is extracted via a separate encoder and aligned with language embeddings processed by a LLaMA-based text encoder, establishing an initial vision-language interaction space.
Within this interaction space, a semantic disentanglement module explicitly decouples visual and language embeddings using self-attention and cross-attention mechanisms. Multi-level interactions are constructed, including first-order spatial alignment and second-order semantic reasoning (e.g., pest–predator causal inference). This process is formalized as:
$$H^{(k)} = \mathrm{Attn}_k\!\left(H^{(k-1)}, Q, K, V\right) = \mathrm{Softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$
where $H^{(k)}$ denotes the semantic representation after the $k$-th interaction layer, and $Q$, $K$, $V$ are query, key, and value vectors generated from fused image and language embeddings, with $d$ as the scaling dimension. In practice, $k = 3$ interaction blocks are stacked, each with $d = 768$, followed by fully connected layers and a language generation head that inputs into the pretrained LLaMA-2 decoder to produce image-grounded answer sequences. To enhance semantic consistency and domain specificity, the horticultural knowledge graph HortiKG is incorporated to guide the question answering process. It is defined as a multi-relational graph $G = (E, R, T)$, where $E$ is the set of entities (e.g., "spider mite", "ladybug", "grape"), $R$ is the set of relations (e.g., "is predator of", "hosted by", "best control time"), and $T$ is the set of triples. The graph is embedded into the generation module via a knowledge embedding matrix $K_G \in \mathbb{R}^{|E| \times d}$ and injected into the LLaMA decoder input using a prompt embedding mechanism. The knowledge-guided embedding formulation is expressed as:
$$P_{\mathrm{aug}} = P_{\mathrm{text}} + \gamma \cdot K_G[e], \quad \text{where } e \in E,$$
where $P_{\mathrm{text}}$ is the original question embedding, $\gamma$ is a fusion weight (set to 0.6), and $K_G[e]$ is the embedding of the knowledge graph entity associated with the recognized pest. This strategy significantly improves the factual accuracy of responses concerning pest–predator relations and control strategies. To ensure the reliability and long-term applicability of HortiKG, the graph content is curated through a multi-stage process. Initially, entity–relation triples are extracted from authoritative horticultural literature, pest management manuals, and peer-reviewed entomology databases. This raw knowledge is then standardized into a unified schema, resolving synonym conflicts and aligning taxonomic hierarchies. All entries undergo expert validation by certified horticulturalists and entomologists, who review both factual correctness and practical relevance. To address the evolving nature of agricultural ecosystems, such as the emergence of new pest species, shifts in pest–predator dynamics under changing climate patterns, and the introduction of updated control methods, a continuous update pipeline is maintained. This pipeline periodically integrates new findings from scientific publications, government extension bulletins, and field reports, with each update undergoing the same expert review protocol before being merged into the production knowledge graph. These measures collectively ensure that HortiKG remains accurate, current, and adaptable to real-world horticultural management scenarios. This module is integrated with the pest–predator co-occurrence segmentation network, using predicted pest categories and co-occurrence labels as query entities to retrieve relevant knowledge graph entries and construct semantic references in the interaction space. This design enhances the system's capability to understand complex ecological queries and enables precise generation across modalities and knowledge domains in multimodal question answering tasks.
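The knowledge-guided injection can be sketched as a lookup-and-add operation, as below; the entity vocabulary size, token shapes, and module interface are illustrative assumptions, with only the additive form and the weight $\gamma = 0.6$ taken from the description above.

```python
import torch
import torch.nn as nn

class KnowledgePromptInjector(nn.Module):
    """Adds the recognized entity's embedding to the question embedding:
    P_aug = P_text + gamma * K_G[e]."""
    def __init__(self, num_entities: int, dim: int = 768, gamma: float = 0.6):
        super().__init__()
        self.kg_embed = nn.Embedding(num_entities, dim)   # K_G in R^{|E| x d}
        self.gamma = gamma

    def forward(self, p_text: torch.Tensor, entity_id: torch.Tensor) -> torch.Tensor:
        # p_text: (B, T, dim) question token embeddings; entity_id: (B,) pest entity ids
        kg = self.kg_embed(entity_id).unsqueeze(1)        # (B, 1, dim), broadcast over tokens
        return p_text + self.gamma * kg

injector = KnowledgePromptInjector(num_entities=40)       # e.g., 30 pests + 10 predators
p_text = torch.randn(2, 16, 768)
entity = torch.tensor([3, 27])                            # ids of recognized pests
p_aug = injector(p_text, entity)                          # fed to the LLaMA decoder
```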

4. Results and Discussion

4.1. Experimental Setup and Evaluation Metrics

Pest and Predator Detection with Semantic Segmentation: This task aims to simultaneously perform detection and semantic segmentation of pests and their natural predators within horticultural images, serving as a foundation for subsequent ecological cascade recognition and control strategy generation. To comprehensively evaluate model performance, the following metrics were employed: precision, recall, F1-score, mean average precision at an IoU threshold of 0.5 (mAP50), and intersection over union (IoU). The computation of these metrics is given by:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
$$\mathrm{mAP@50} = \frac{1}{N}\sum_{i=1}^{N} AP_i, \qquad \mathrm{IoU} = \frac{\mathrm{Area}_{\mathrm{pred}} \cap \mathrm{Area}_{\mathrm{gt}}}{\mathrm{Area}_{\mathrm{pred}} \cup \mathrm{Area}_{\mathrm{gt}}},$$
where $TP$ denotes the number of correctly identified pest or predator pixels or instances, $FP$ refers to false positives, $FN$ to false negatives, $AP_i$ is the average precision for class $i$, and $N$ is the total number of classes. $\mathrm{Area}_{\mathrm{pred}}$ and $\mathrm{Area}_{\mathrm{gt}}$ represent the predicted and ground truth areas, respectively. To assess the proposed method, five representative detection and segmentation models were adopted as baselines: SegNet [33], UNet [34], UNet++ [35], Mask R-CNN [36], and Segment Anything (SAM) [37]. SegNet is a compact encoder-decoder network suitable for accurate restoration of low-resolution targets. UNet is widely used in medical segmentation tasks and employs U-shaped skip connections to integrate low-level and high-level semantics. UNet++ introduces nested skip paths and deep supervision for enhanced robustness in complex scenes. Mask R-CNN, a two-stage model for instance detection and segmentation, provides precise boundary localization, making it suitable for dense object scenarios. SAM is a recent general-purpose vision segmentation model with strong generalization capabilities, capable of high-quality segmentation under low-sample conditions. These models offer reliable baselines for both performance benchmarking and ablation analysis, covering a wide spectrum of architectures from lightweight designs to universal large models.
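For reference, the pixel-level metrics can be computed directly from binary masks, as in the short sketch below; mAP@50 additionally requires per-class, detection-level average precision and is omitted. The toy masks and shapes are illustrative.

```python
import numpy as np

def seg_metrics(pred: np.ndarray, gt: np.ndarray):
    """Pixel-level precision, recall, F1, and IoU for a pair of binary masks."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    iou = tp / (tp + fp + fn + 1e-9)
    return precision, recall, f1, iou

pred = np.zeros((64, 64), dtype=bool); pred[10:40, 10:40] = True
gt = np.zeros((64, 64), dtype=bool); gt[15:45, 15:45] = True
print(seg_metrics(pred, gt))
```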
Pest—Predator Co-occurrence Matching: This task targets the identification of specific pest–predator co-occurrence relationships in an image, formulated as a multi-label classification and relational modeling problem. Three evaluation metrics were employed to assess performance: subset accuracy, Hamming loss, and macro-F1 score. These metrics respectively reflect exact prediction correctness, label-wise error rate, and category-level balance. The corresponding formulas are:
$$\mathrm{Subset\ Accuracy} = \frac{1}{N}\sum_{i=1}^{N} I\!\left(Y_i = \hat{Y}_i\right), \quad \mathrm{Hamming\ Loss} = \frac{1}{N \cdot L}\sum_{i=1}^{N}\sum_{j=1}^{L} I\!\left(Y_{ij} \neq \hat{Y}_{ij}\right), \quad \mathrm{Macro\text{-}F1} = \frac{1}{L}\sum_{j=1}^{L}\frac{2 \cdot P_j \cdot R_j}{P_j + R_j},$$
where $N$ is the number of samples, $L$ is the number of labels, $Y_i$ and $\hat{Y}_i$ denote the ground truth and predicted label sets for sample $i$, $I(\cdot)$ is the indicator function, and $P_j$ and $R_j$ are the precision and recall for label $j$. To perform effective recognition of pest–predator multi-label co-occurrence, four representative multi-label learning models were adopted as baselines: Binary Relevance [38], Classifier Chain [39], ML-GCN [40], and ASL (Attention-based Soft Labels) [41]. Binary Relevance decomposes the multi-label problem into independent binary classifiers, offering simplicity and fast training. Classifier Chain models conditional label dependencies using a chain structure, which is beneficial for capturing pest–predator interactions. ML-GCN leverages graph convolutional networks to handle complex co-occurrence graphs. ASL utilizes attention mechanisms and soft label strategies to mitigate label imbalance and enhance recognition for minority classes. These methods provide a diverse set of modeling paradigms for benchmarking the proposed co-occurrence detection model.
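The three multi-label metrics map directly onto standard scikit-learn calls when labels are provided as binary indicator matrices, as sketched below with toy values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, f1_score

# Toy multi-label predictions over L = 4 pest/predator labels for N = 3 images.
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1]])

subset_acc = accuracy_score(Y_true, Y_pred)            # exact-match (subset) accuracy
h_loss = hamming_loss(Y_true, Y_pred)                  # label-wise error rate
macro_f1 = f1_score(Y_true, Y_pred, average="macro")   # per-label F1, averaged
```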
Visual Question Answering: This task evaluates system capabilities in horticultural VQA, encompassing pest identification, ecological co-occurrence reasoning, and pest control advice. Given the involvement of both semantic accuracy and natural language fluency, five evaluation metrics were used: BLEU-4, ROUGE-L, METEOR, exact match (EM), and GPT-based subjective scoring (from ChatGPT-4). BLEU-4, ROUGE-L, and METEOR are standard metrics in natural language generation; EM measures strict matching; GPT scoring provides human-like quality judgment. The metric definitions are:
$$\mathrm{BLEU\text{-}N} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \cdot \log p_n\right), \qquad \mathrm{BP} = \begin{cases} 1, & \text{if } c > r \\ \exp(1 - r/c), & \text{if } c \le r, \end{cases}$$
$$\mathrm{ROUGE\text{-}L} = \frac{(1 + \beta^2) \cdot P_{LCS} \cdot R_{LCS}}{P_{LCS} + \beta^2 \cdot R_{LCS}}, \qquad \mathrm{METEOR} = F_{\mathrm{mean}} \cdot (1 - \mathrm{Penalty}), \quad F_{\mathrm{mean}} = \frac{10 \cdot P \cdot R}{R + 9P}, \qquad \mathrm{EM} = I(\mathrm{Prediction} = \mathrm{Reference}),$$
where $p_n$ denotes the $n$-gram match precision, $w_n$ is the weight for each $n$-gram (typically uniform), $c$ and $r$ are the candidate and reference lengths, and $\mathrm{BP}$ is the brevity penalty. In ROUGE-L, $P_{LCS}$ and $R_{LCS}$ are the precision and recall based on the longest common subsequence, with $\beta$ usually set to 1. METEOR incorporates token matching, lemmatization, and ordering penalties, where $P$ and $R$ represent precision and recall, and the penalty reflects fragmentation. EM is a binary metric indicating whether the prediction exactly matches the reference. GPT scoring offers human-aligned subjective assessment of language fluency, semantic accuracy, and overall task fulfillment, typically on a scale of 1–5 or 1–10. Five representative multimodal VQA models were selected as baselines: BLIP-2 [42], LLaVA [43], CLIP-QA [29], MiniGPT-4 [44], and Flamingo [45]. BLIP-2 constructs a powerful vision-language bridge via an image encoder and text decoder. LLaVA integrates a vision encoder with the LLaMA language model for end-to-end alignment. CLIP-QA builds on CLIP's image-text alignment and extends it for question answering with lightweight efficiency. MiniGPT-4 adopts a GPT-4-like decoder enhanced by prompt tuning for improved context understanding. Flamingo combines cross-modal fusion and prompt-based learning, demonstrating strong performance in multi-turn dialogues. These models encompass varied architectures and alignment strategies, offering robust benchmarks for evaluating the proposed horticultural VQA system.
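As an illustration, BLEU-4 and exact match can be computed with NLTK and a plain string comparison, as in the sketch below; the example answer pair is invented, and ROUGE-L and METEOR are typically obtained from the rouge-score package and NLTK's meteor_score, which are not shown here.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "release encarsia formosa to control greenhouse whitefly".split()
candidate = "release encarsia formosa against greenhouse whitefly".split()

# BLEU-4 with uniform n-gram weights; smoothing avoids zero scores on short answers.
bleu4 = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)

# Exact match is a strict comparison of the (normalized) answer strings.
em = int(" ".join(candidate) == " ".join(reference))
```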
To ensure a fair comparison, all baseline segmentation and VQA models were fine-tuned on the same horticultural dataset used for training HortiVQA-PP, unless otherwise noted. For models where fine-tuning was not feasible due to architectural constraints or licensing limitations, we employed their publicly available pretrained weights and applied the same preprocessing and evaluation pipeline. This protocol ensures that performance differences are attributable to model architecture and knowledge integration rather than disparities in data exposure.
Hardware and Software Platform: To ensure reproducibility and computational efficiency, all experiments were conducted on a high-performance computing platform. The hardware setup included a dual-socket Intel Xeon Gold 6248 server (2.5 GHz, 20 cores per socket) with 512 GB RAM and four NVIDIA A100 GPUs (each with 40 GB memory), supporting parallel training of large-scale segmentation and VQA models. High-speed NVMe SSD arrays were utilized for efficient data loading and model checkpointing. On the software side, the system operated under Ubuntu 22.04 LTS. PyTorch 2.1 served as the core deep learning framework, paired with CUDA 11.8 and cuDNN 8.6 for GPU acceleration. Multi-label learning modules were implemented with Scikit-learn and PyTorch Lightning. Semantic segmentation tasks integrated Detectron2 and Segment Anything APIs. The VQA system was built upon Hugging Face Transformers and LLaMA-Adapter. Horticultural knowledge graphs and question templates were managed using Neo4j, SPARQL, and NLTK for NLP. For deployment and interactive demonstration, a lightweight backend was constructed using Flask, with a Vue.js-based frontend enabling image upload, question input, and multimodal response display. This architecture ensured practical usability and ease of access for horticultural practitioners.

4.2. Performance Comparison on Pest and Predator Semantic Segmentation Task (Task 1)

This experiment was designed to evaluate the precision and robustness of the HortiVQA-PP system in the semantic segmentation task for pests and predators, with a focus on assessing the overall performance in fine-grained object region recognition. As this task plays a critical role in subsequent co-occurrence modeling and question-answering generation, several mainstream semantic segmentation networks were selected as baseline models, including SegNet, the UNet series, Mask R-CNN, and Segment Anything. The performance was compared across five evaluation metrics: Precision, Recall, F1-score, mAP50, and IoU.
As shown in Table 2 and Figure 6, the HortiVQA-PP system outperformed all existing approaches across all metrics. Notably, it achieved nearly 4 percentage points higher in F1-score and mAP50 compared to the second-best method, Segment Anything, confirming its effectiveness in small object detection and boundary modeling. The performance differences among models are closely related to their architectural characteristics. SegNet, as one of the earliest segmentation networks, utilizes a symmetric encoder-decoder structure without explicit skip connections, which often leads to the loss of detailed features during reconstruction, resulting in lower Precision and Recall. UNet introduced skip connections that significantly enhance detail preservation and therefore yield better Recall and IoU. UNet++ further enhanced representational capacity through channel expansion and attention fusion. Although Mask R-CNN supports instance-level segmentation, its RoI Align mechanism may cause region misalignment for dense small objects. Segment Anything employs high-quality visual prompts and a scalable mask decoder, demonstrating strong robustness in open environments. In contrast, the advantage of the HortiVQA-PP system stems from its segmentation-aware visual encoder design, which incorporates a multi-stage LoRA modulation structure and a regional prior attention mechanism. This design mathematically improves cross-scale semantic alignment and enhances sensitivity to local texture boundaries, thus delivering superior performance on mAP and IoU. Such structural design offers stable and effective segmentation support for common challenges in agricultural pest imagery, including complex backgrounds, occlusion, and dense object co-occurrence.

4.3. Performance Comparison on Pest–Predator Co-Occurrence Matching Task (Task 2)

This experiment was conducted to verify the multi-label recognition capabilities of the HortiVQA-PP system in modeling ecological co-occurrence relationships between pests and predators. The task emphasizes identifying multiple objects within a single image and correctly determining whether ecological interactions such as cooperation, predation, or parasitism exist among them. Four representative multi-label classification methods were selected for comparison: Binary Relevance, Classifier Chain, ML-GCN, and ASL. The evaluation was performed using three metrics: Subset Accuracy, Hamming Loss, and Macro-F1.
As illustrated in Table 3, HortiVQA-PP consistently outperformed the baseline models across all evaluation metrics, achieving a Subset Accuracy of 83.5%, a Macro-F1 of 79.4%, and the lowest Hamming Loss, indicating superior generalization and structural reasoning capabilities in multi-object co-occurrence prediction. From the perspective of model architecture and mathematical formulation, the performance differences reflect each method’s ability to model inter-object relationships. Binary Relevance treats multi-label tasks as independent binary classifications, ignoring label co-dependencies and thus resulting in lower accuracy. Classifier Chain introduces conditional dependencies in label sequences, partially improving structural modeling but still suffers from limited information propagation. ML-GCN incorporates graph convolutional mechanisms that propagate label semantics through adjacency matrices, enhancing structural consistency and notably improving Macro-F1. ASL further applies attention-based soft dependencies to mitigate label imbalance issues. The HortiVQA-PP system integrates spatial positional relations and semantic similarities between pest–predator pairs to construct explicit relational embeddings. With the Pest–Predator Matching Head, cross-object modeling and pairwise matching are conducted in the feature space, enabling the model to capture not only object-specific features but also the prior distributions of co-existence probabilities. This design mathematically enhances the joint distribution modeling of labels, significantly reducing mispairing rates and improving overall predictive performance.

4.4. Performance Comparison on Visual Question Answering Task (Task 3)

This experiment aimed to evaluate the overall language generation quality and semantic consistency of the HortiVQA-PP system in multimodal VQA tasks within horticultural scenarios. The core objective of this task is to leverage the synergy between image semantic understanding and language generation capabilities to automatically respond to user queries related to pests, predators, and control strategies. To assess the performance of different models, several state-of-the-art VQA or multimodal large models (BLIP-2, LLaVA, CLIP-QA, MiniGPT-4, and Flamingo) were selected as baselines. Five evaluation metrics were adopted: BLEU-4, ROUGE-L, METEOR, EM, and GPT-based human ratings.
As shown in Table 4 and Figure 7, HortiVQA-PP outperformed all compared models across all evaluation metrics, particularly in EM and GPT human assessments, indicating superior alignment between generated content and image semantics, as well as greater adherence to domain-expert expectations in horticultural responses. From the perspective of model architecture and mathematical formulation, baseline models vary mainly in their multimodal alignment capacity, knowledge incorporation methods, and generative mechanisms. CLIP-QA relies on static vision-text embeddings without a decoder structure, leading to templated language output and reduced accuracy. BLIP-2 and LLaVA are equipped with vision-language pretraining but lack domain-specific knowledge integration, rendering them ineffective for answering ecological queries. MiniGPT-4 and Flamingo introduce cross-modal attention mechanisms, achieving notable improvements in fluency and semantic coverage. However, these models generally lack agricultural knowledge guidance, resulting in poor performance in logical reasoning-based Q&A tasks. The HortiVQA-PP system, by contrast, incorporates horticultural knowledge graph embeddings during decoding, coupled with semantic segmentation region-based attention to guide the generation process. Additionally, modality alignment loss and semantic consistency regularization are introduced during training to establish a mathematically consistent mapping between image semantics, language generation, and knowledge invocation. This fusion significantly enhances the model’s accuracy and explainability in addressing domain-specific questions, highlighting its strong potential for intelligent interaction in digital horticulture.

4.5. Robustness Evaluation on Public Datasets

To further validate the robustness and generalizability of the proposed HortiVQA-PP model beyond our proprietary dataset, we evaluated it on three open-source benchmarks corresponding to the main functional tasks: (1) pest and predator semantic segmentation, (2) pest–predator co-occurrence matching, and (3) visual question answering. For Task 1, we adopted the publicly available IP102 dataset for fine-grained pest segmentation. For Task 2, we used the COCO-ML dataset for multi-label co-occurrence recognition. For Task 3, we employed the VQA v2.0 dataset to evaluate visual question answering capabilities. Each task was compared with representative baseline models from the respective research domains. Experimental results in Table 5 demonstrate that HortiVQA-PP consistently outperforms strong baselines, confirming its robustness across heterogeneous data sources and task settings.

4.6. Ablation Study

To evaluate the individual contributions of the LoRA-based modulation, the QS-Mamba encoder, and the horticultural knowledge graph (HortiKG), we conducted a series of ablation experiments in which each component was selectively removed while keeping all other settings unchanged. Four model variants were compared: the full HortiVQA-PP framework, a version without LoRA modulation, a version replacing the QS-Mamba encoder with a conventional Transformer encoder, and a version without HortiKG integration. Experimental results across the three tasks—semantic segmentation (Task 1), pest–predator co-occurrence matching (Task 2), and horticultural VQA (Task 3)—are summarized in Table 6.
As shown in Table 6, the complete model achieved the highest performance, with F1-score, Macro-F1, and exact match scores of 87.3%, 79.4%, and 36.9%, respectively, alongside a GPT-based rating of 4.5. Removing LoRA led to a noticeable decline in both segmentation and co-occurrence performance, with drops of 3.7% and 3.6% in F1-score and Macro-F1, respectively, indicating its role in refining small-object boundaries and enhancing region-aware attention. Replacing QS-Mamba with a standard Transformer caused the most substantial degradation in Task 2, with a 5.9% decrease in Macro-F1, confirming its effectiveness in capturing spatial–semantic dependencies essential for modeling pest–predator interactions. Eliminating HortiKG produced minimal changes in segmentation metrics but significantly reduced the VQA performance, lowering the exact match score by 4.4% and the GPT rating by 0.6, which demonstrates its importance for domain-specific reasoning and accurate answer generation. Overall, these results validate the necessity of all three components, with LoRA and QS-Mamba primarily enhancing visual understanding, and HortiKG contributing critical domain knowledge for high-quality multimodal responses.

4.7. Discussion

4.7.1. Impact of Core Architectural Modules on System Performance

The performance of the proposed HortiVQA-PP system is closely tied to the synergistic contributions of its three core components: LoRA-based modulation, the QS-Mamba encoder, and the horticultural knowledge graph (HortiKG). The LoRA modulation integrated into the segmentation-aware visual encoder enables efficient fine-tuning of high-capacity backbones, allowing the model to adapt to horticultural pest imagery without extensive retraining. By injecting task-specific adjustments into frozen visual features, LoRA enhances the discrimination of small, visually similar objects such as aphids and parasitoid wasps, thereby improving segmentation precision and downstream recognition accuracy.
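The low-rank adaptation idea described here can be illustrated with a minimal sketch: a frozen projection is augmented with a trainable low-rank update. The rank, scaling factor, and the choice of wrapping a linear projection are illustrative assumptions and not the exact LoRA configuration used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # backbone weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus low-rank, task-specific adjustment.
        return self.base(x) + self.scaling * (x @ self.lora_a.t() @ self.lora_b.t())

# Usage: wrap a projection of a frozen attention block (dimensions are illustrative).
proj = nn.Linear(768, 768)
adapted = LoRALinear(proj, r=8, alpha=16.0)
out = adapted(torch.randn(4, 196, 768))   # (batch, tokens, dim)
```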
The QS-Mamba encoder serves as the central mechanism for modeling pest–predator co-occurrence relationships. Unlike conventional Transformer-based encoders, QS-Mamba captures long-range spatial dependencies and fine-grained semantic interactions between segmented regions, allowing the system to better detect and interpret ecological relationships under complex field conditions, including partial occlusion, irregular lighting, and heterogeneous backgrounds. This capability is critical for generating reliable co-occurrence predictions, which form the basis for ecological reasoning in subsequent stages.
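Whatever encoder produces the region representations, the co-occurrence output itself can be framed as multi-label scoring over pest–predator pairs. The sketch below illustrates that framing under simplifying assumptions (pooled per-class region embeddings and a small pairwise MLP); it is not a reimplementation of the QS-Mamba encoder.

```python
import torch
import torch.nn as nn

class CooccurrenceHead(nn.Module):
    """Scores every pest-predator pair from pooled region embeddings (multi-label output)."""
    def __init__(self, dim: int, n_pests: int = 30, n_predators: int = 10):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))
        self.n_pests, self.n_predators = n_pests, n_predators

    def forward(self, pest_emb, predator_emb):
        # pest_emb: (batch, n_pests, dim); predator_emb: (batch, n_predators, dim)
        p = pest_emb.unsqueeze(2).expand(-1, -1, self.n_predators, -1)
        q = predator_emb.unsqueeze(1).expand(-1, self.n_pests, -1, -1)
        logits = self.pair_mlp(torch.cat([p, q], dim=-1)).squeeze(-1)  # (batch, n_pests, n_predators)
        return logits  # train with BCEWithLogitsLoss against the binary co-occurrence matrix
```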
The HortiKG component introduces structured horticultural domain knowledge into the VQA module, bridging the gap between purely data-driven visual features and expert-level agronomic reasoning. By linking detected species to their known biological interactions, habitat preferences, and control measures, the knowledge graph enables context-aware answer generation that is both scientifically grounded and practically actionable. This ensures that the system’s outputs go beyond simple identification to provide comprehensive ecological insights and management recommendations.
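Conceptually, the knowledge-graph step amounts to joining detection results with stored ecological relations before answer generation. The toy fragment below illustrates this lookup; the species, relation names, and recommendation text are placeholders and do not come from the actual HortiKG.

```python
# Illustrative toy fragment of a pest-predator knowledge graph; entries are placeholders.
HORTI_KG = {
    ("aphid", "controlled_by"): ["lady beetle", "lacewing"],
    ("whitefly", "controlled_by"): ["Encarsia formosa"],
}

def suggest_controls(detected_pests, detected_predators):
    """Combine detections with graph lookups to produce a context-aware suggestion."""
    suggestions = []
    for pest in detected_pests:
        enemies = HORTI_KG.get((pest, "controlled_by"), [])
        present = [e for e in enemies if e in detected_predators]
        missing = [e for e in enemies if e not in detected_predators]
        suggestions.append({
            "pest": pest,
            "natural_enemies_present": present,
            "enemies_to_consider_releasing": missing,
        })
    return suggestions

print(suggest_controls({"aphid", "whitefly"}, {"lady beetle"}))
```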
Collectively, these three components function in a complementary manner: LoRA strengthens visual feature specialization, QS-Mamba enhances ecological relationship modeling, and HortiKG enriches the system with expert knowledge for decision support. The integration of these elements underpins the robustness, interpretability, and real-world applicability of HortiVQA-PP in intelligent horticultural management.

4.7.2. Application Prospects of HortiVQA-PP in Intelligent Horticulture Management

The proposed HortiVQA-PP system demonstrated significant advantages in intelligent pest identification and ecological interactive question answering within horticultural domains. Its application potential extends beyond model accuracy and semantic understanding to practical deployment value in real-world agricultural production. In actual orchard management, growers often encounter challenges such as difficulty in identifying pest species, uncertainty regarding the co-occurrence of pests and natural enemies, and the lack of guidance on scientifically appropriate pesticide use. For instance, in protected grape cultivation, pests such as whiteflies and spider mites frequently appear simultaneously, and biological control agents like Trichogramma or lady beetles are commonly introduced for mitigation. Traditional manual identification relies heavily on personal experience, which is time-consuming, labor-intensive, and prone to omissions. The HortiVQA-PP system enables rapid identification of multiple pest species via imagery and detects the potential co-occurrence of predator species. It then leverages a horticultural knowledge graph to generate intelligent suggestions for growers, such as identifying ecological predators of the observed pests, confirming the presence of natural enemies, recommending whether additional release is required, and evaluating the need for chemical intervention. This process greatly enhances the scientific validity and interpretability of decision-making.
Furthermore, within smart agriculture platforms, the system can be integrated into IoT sensing terminals and field operation assistance tools to provide automatic pest monitoring results immediately after image acquisition, supporting remote Q&A and control functionalities. Taking protected tomato cultivation in the Yunnan highlands as an example—where large planting areas, high labor inputs, and rapid pest outbreaks are common—agricultural technicians can use the system during patrols to quickly identify pest or disease species via mobile photography and obtain real-time control suggestions, enabling faster emergency response and improved biological control outcomes. Additionally, in educational, outreach, and agricultural extension contexts, the system can function as an intelligent assistant that helps novice users or non-professional farmers understand complex ecological interactions and the principles of scientific pesticide application, transforming traditional reliance on printed atlases and manual consultation into an integrated intelligent service chain of “photograph–identify–respond.” In conclusion, HortiVQA-PP is not merely a computational model but a practical tool for agricultural production and the promotion of ecological agriculture. Its ecological cognition capabilities and interactive Q&A mechanisms provide a novel paradigm for intelligent digital horticulture management.

4.7.3. Algorithm Complexity and Real-Time Deployment Analysis

The HortiVQA-PP system is designed to balance accuracy and computational efficiency, enabling real-time deployment on resource-constrained platforms such as mobile devices or Internet of Things (IoT) terminals. The overall architecture consists of four main modules—Visual Encoder, Multi-Object Detector, Multimodal Fusion Module, and Horticultural Knowledge Reasoning Module—each optimized for computational complexity. From a theoretical perspective, the Visual Encoder is primarily based on convolutional neural networks (CNNs), with a complexity of $O(H \times W \times C_{in} \times K^2 \times C_{out})$, where $H$ and $W$ denote the spatial dimensions of the feature map, $K$ is the kernel size, and $C_{in}$ and $C_{out}$ are the input and output channel counts, respectively. By adopting depthwise separable convolutions and channel pruning, the per-position convolution cost is reduced from $O(K^2 C_{in} C_{out})$ to $O(K^2 C_{in} + C_{in} C_{out})$, lowering both computation and memory-access overhead. The Multimodal Fusion Module employs attention mechanisms, whose computational complexity grows quadratically with the input sequence length $n$, i.e., $O(n^2 d)$, where $d$ is the feature dimension. To mitigate this cost, the system integrates feature downsampling, low-rank decomposition, and sparse attention strategies, effectively reducing the complexity to near-linear $O(n \log n)$. In the Horticultural Knowledge Reasoning Module, query and inference processes are accelerated via offline index construction. During online inference, subgraph search and matching are restricted to a limited candidate set, resulting in a time complexity of $O(m)$, where $m$ is the number of candidate entities—significantly smaller than the full knowledge base size. This ensures that reasoning latency remains stable as the knowledge base grows. Through these multi-level optimizations, the HortiVQA-PP system can theoretically satisfy real-time response requirements under typical field image resolutions and task scales, with end-to-end latency below the perceptual threshold for single interactions. This provides a solid algorithmic foundation for deployment in orchard inspection, rapid pest and disease recognition, and ecological interactive question answering. In future work, we plan to integrate model distillation and edge–cloud cooperative inference to further reduce computational and energy costs while maintaining accuracy.
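The depthwise separable factorization discussed above can be checked numerically. The sketch below compares parameter counts and per-position multiply counts for a standard convolution and its depthwise separable counterpart; the channel sizes and kernel size are illustrative, not those of the deployed encoder.

```python
import torch
import torch.nn as nn

c_in, c_out, k, h, w = 64, 128, 3, 128, 128

standard = nn.Conv2d(c_in, c_out, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),  # depthwise: K^2 * C_in cost per position
    nn.Conv2d(c_in, c_out, 1),                          # pointwise: C_in * C_out cost per position
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Per-position multiply counts follow the complexity terms discussed above.
standard_cost = k * k * c_in * c_out
separable_cost = k * k * c_in + c_in * c_out
print(f"params: {n_params(standard)} vs {n_params(separable)}")
print(f"multiplies per output position: {standard_cost} vs {separable_cost} "
      f"(~{standard_cost / separable_cost:.1f}x reduction)")

x = torch.randn(1, c_in, h, w)
assert standard(x).shape == separable(x).shape == (1, c_out, h, w)
```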

4.8. Limitation and Future Work

Despite exhibiting strong performance and practical application potential in pest recognition, ecological co-occurrence modeling, and multimodal question answering tasks, the HortiVQA-PP system still presents several limitations that warrant further improvement and expansion. First, although the current dataset underlying the system encompasses 30 pest categories and 10 natural enemies, its coverage remains insufficient in terms of regional diversity and seasonal variation. Consequently, the system may struggle to adapt to ecological patterns across different climate zones and crop types. Second, from a model architecture perspective, while a fusion strategy combining segmentation-aware visual encoders with co-occurrence matching modules was proposed, robustness under extreme conditions—such as tiny targets, severe occlusions, or unstable lighting—remains limited. In particular, recognition of natural enemy species, especially micro-insects such as parasitoid wasps, can be hindered by blurred and easily confused visual features, potentially resulting in segmentation failure or misclassification. Future efforts may involve the introduction of high-resolution patch-level attention mechanisms or saliency-guided networks to enhance the discriminative power for such targets. Moreover, the current system is primarily based on image inputs and has not yet fully incorporated sensor data (e.g., temperature, humidity, meteorological variables, insect trap counts). This limits its capability in dynamic pest monitoring and early warning under complex environmental conditions. Future work could explore integration with agricultural IoT devices to construct a multimodal, real-time, and inference-capable intelligent field sensing platform, thereby contributing to the development of precision agriculture management and sustainable pest control decision-making systems. Finally, the pest–predator co-occurrence detection module in its current form treats these relationships largely as static, binary associations, oversimplifying dynamic ecological interactions that vary with environmental conditions, seasonal changes, and population density. Future iterations could incorporate time-stamped images and environmental metadata (e.g., temperature, humidity, crop stage) into the model, using temporal graph networks or transformer-based time-series modules to learn how interaction patterns change over time and under different conditions, thereby improving ecological realism and predictive accuracy.

5. Conclusions

With the advancement of digital agriculture and intelligent horticulture, traditional pest identification methods have become insufficient to meet the diversified requirements of ecological control, intelligent question answering, and precision decision-making. In response to the integrated scenario of “pest–predator–question answering,” a system named HortiVQA-PP is proposed, which combines semantic segmentation, multi-label co-occurrence recognition, and knowledge-enhanced VQA. This system systematically addresses the challenges in horticultural contexts, including difficulty in pest identification, ambiguity in predator relationships, and the lack of expert-level interactive support for farmers. Specifically, a segmentation-aware visual encoder and a pest–predator matching head are designed to enable precise segmentation of small-scale pest targets and modeling of ecological co-occurrence relationships. Furthermore, a horticulture-specific knowledge graph, HortiKG, is integrated to guide the multimodal large language model in generating task-relevant and semantically coherent answers. In terms of empirical evaluation, the HortiVQA-PP system achieved outstanding performance on the semantic segmentation task for pests and predators, with a precision of 89.6%, recall of 85.2%, F1-score of 87.3%, and IoU of 75.1%. In the pest–predator co-occurrence recognition task, an accuracy of 83.5% and a macro-F1 score of 79.4% were recorded. For the VQA task, the system attained a GPT-based expert evaluation score of 4.5, outperforming all compared multimodal large models across the board. This work not only introduces the first multimodal system framework tailored to ecological VQA in horticulture but also achieves significant breakthroughs in cross-task performance, unified processing pipeline, and system deployability. As such, it provides both a theoretical foundation and a practical toolset for intelligent agricultural recognition and interactive decision-making.

Author Contributions

Conceptualization, Z.L., C.D., S.L. and M.D.; Methodology, Z.L., C.D. and S.L.; Software, Z.L., C.D. and S.L.; Validation, L.Z. and C.J.; Formal analysis, C.J. and F.Y.; Investigation, L.Z. and C.J.; Resources, Y.J. and L.Z.; Data curation, Y.J. and F.Y.; Writing—original draft, Z.L., C.D., S.L., Y.J., L.Z., C.J., F.Y. and M.D.; Visualization, Y.J. and F.Y.; Supervision, M.D.; Project administration, M.D.; Funding acquisition, M.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Modern Agricultural Industrial Technology System Beijing Innovation Team (BAIC08-2024-YJ03).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lu, P.; Zheng, W.; Zhang-Zhong, L.; Lyu, X.; Shi, K.; Liu, C. Research progress and frontier hotspot of intelligent facility horticulture. Agric. Eng. 2025, 15, 58–66. [Google Scholar]
  2. Wang, S.; Xu, D.; Liang, H.; Bai, Y.; Li, X.; Zhou, J.; Su, C.; Wei, W. Advances in deep learning applications for plant disease and pest detection: A review. Remote Sens. 2025, 17, 698. [Google Scholar] [CrossRef]
  3. Farjon, G.; Liu, H.; Edan, Y. Deep-learning-based counting methods, datasets, and applications in agriculture: A review. Precis. Agric. 2023, 24, 1683–1711. [Google Scholar]
  4. Tong, Y.S.; Lee, T.H.; Yen, K.S. Deep Learning for Image-Based Plant Growth Monitoring: A Review. Int. J. Eng. Technol. Innov. 2022, 12, 225–246. [Google Scholar] [CrossRef]
  5. Qin, C.Y.; Yang, Y.S.; Gu, F.W.; Chen, P.Y.; Qin, W.C. Application and development of computer vision technology in modern agriculture. J. Chin. Agric. Mech. 2023, 44, 119. [Google Scholar]
  6. Lun, Z.; Hui, Z. Research on agricultural named entity recognition based on pre train BERT. Acad. J. Eng. Technol. Sci. 2022, 5, 34–42. [Google Scholar] [CrossRef]
  7. Li, C.; Zhen, T.; Li, Z. Image classification of pests with residual neural network based on transfer learning. Appl. Sci. 2022, 12, 4356. [Google Scholar] [CrossRef]
  8. Ukwuoma, C.C.; Qin, Z.; Bin Heyat, M.B.; Ali, L.; Almaspoor, Z.; Monday, H.N. Recent advancements in fruit detection and classification using deep learning techniques. Math. Probl. Eng. 2022, 2022, 9210947. [Google Scholar] [CrossRef]
  9. Alexandridis, N.; Marion, G.; Chaplin-Kramer, R.; Dainese, M.; Ekroos, J.; Grab, H.; Jonsson, M.; Karp, D.S.; Meyer, C.; O’Rourke, M.E.; et al. Archetype models upscale understanding of natural pest control response to land-use change. Ecol. Appl. 2022, 32, e2696. [Google Scholar] [CrossRef]
  10. Wang, L.; Jin, T.; Yang, J.; Leonardis, A.; Wang, F.; Zheng, F. Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases. arXiv 2024, arXiv:2412.02158. [Google Scholar]
  11. Lan, Y.; Guo, Y.; Chen, Q.; Lin, S.; Chen, Y.; Deng, X. Visual question answering model for fruit tree disease decision-making based on multimodal deep learning. Front. Plant Sci. 2023, 13, 1064399. [Google Scholar] [CrossRef]
  12. Yang, T.; Mei, Y.; Xu, L.; Yu, H.; Chen, Y. Application of question answering systems for intelligent agriculture production and sustainable management: A review. Resour. Conserv. Recycl. 2024, 204, 107497. [Google Scholar] [CrossRef]
  13. Thenmozhi, K.; Reddy, U.S. Crop pest classification based on deep convolutional neural network and transfer learning. Comput. Electron. Agric. 2019, 164, 104906. [Google Scholar] [CrossRef]
  14. Tian, L.; Liu, C.; Liu, Y.; Li, M.; Zhang, J.; Duan, H. Research on plant diseases and insect pests identification based on CNN. In Proceedings of the IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2020; Volume 594, p. 012009. [Google Scholar]
  15. Liu, Y.; Zhang, X.; Gao, Y.; Qu, T.; Shi, Y. Improved CNN method for crop pest identification based on transfer learning. Comput. Intell. Neurosci. 2022, 2022, 9709648. [Google Scholar] [CrossRef]
  16. Huang, M.L.; Chuang, T.C.; Liao, Y.C. Application of transfer learning and image augmentation technology for tomato pest identification. Sustain. Comput. Inform. Syst. 2022, 33, 100646. [Google Scholar] [CrossRef]
  17. Martínez-Sastre, R.; García, D.; Miñarro, M.; Martín-López, B. Farmers’ perceptions and knowledge of natural enemies as providers of biological control in cider apple orchards. J. Environ. Manag. 2020, 266, 110589. [Google Scholar] [CrossRef] [PubMed]
  18. Jasrotia, P.; Bhardwaj, A.K.; Katare, S.; Yadav, J.; Kashyap, P.L.; Kumar, S.; Singh, G.P. Tillage intensity influences insect-pest and predator dynamics of wheat crop grown under different conservation agriculture practices in rice-wheat cropping system of indo-Gangetic plain. Agronomy 2021, 11, 1087. [Google Scholar] [CrossRef]
  19. Ji, M.; Zhang, K.; Wu, Q.; Deng, Z. Multi-label learning for crop leaf diseases recognition and severity estimation based on convolutional neural networks. Soft Comput. 2020, 24, 15327–15340. [Google Scholar] [CrossRef]
  20. Duan, J.; Ding, H.; Kim, S. A multimodal approach for advanced pest detection and classification. arXiv 2023, arXiv:2312.10948. [Google Scholar] [CrossRef]
  21. Zhang, J.; Liu, Z.; Yu, K. MSFNet-CPD: Multi-Scale Cross-Modal Fusion Network for Crop Pest Detection. arXiv 2025, arXiv:2505.02441. [Google Scholar]
  22. Jin, K.; Zi, X.; Thiyagarajan, K.; Braytee, A.; Prasad, M. IP-VQA Dataset: Empowering Precision Agriculture with Autonomous Insect Pest Management through Visual Question Answering. In Proceedings of the ACM on Web Conference 2025, Sydney, Australia, 28 April–2 May 2025; pp. 2000–2007. [Google Scholar]
  23. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
  24. Ni, F.; Hao, J.; Wu, S.; Kou, L.; Yuan, Y.; Dong, Z.; Liu, J.; Li, M.; Zhuang, Y.; Zheng, Y. Peria: Perceive, reason, imagine, act via holistic language and vision planning for manipulation. Adv. Neural Inf. Process. Syst. 2024, 37, 17541–17571. [Google Scholar]
  25. Chen, J.; Zhu, D.; Shen, X.; Li, X.; Liu, Z.; Zhang, P.; Krishnamoorthi, R.; Chandra, V.; Xiong, Y.; Elhoseiny, M. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv 2023, arXiv:2310.09478. [Google Scholar]
  26. Yang, J.; Guo, X.; Li, Y.; Marinello, F.; Ercisli, S.; Zhang, Z. A survey of few-shot learning in smart agriculture: Developments, applications, and challenges. Plant Methods 2022, 18, 28. [Google Scholar] [CrossRef]
  27. Sharma, K.; Vats, V.; Singh, A.; Sahani, R.; Rai, D.; Sharma, A. LLaVA-PlantDiag: Integrating Large-scale Vision-Language Abilities for Conversational Plant Pathology Diagnosis. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–7. [Google Scholar]
  28. Nanavaty, A.; Sharma, R.; Pandita, B.; Goyal, O.; Rallapalli, S.; Mandal, M.; Singh, V.K.; Narang, P.; Chamola, V. Integrating deep learning for visual question answering in Agricultural Disease Diagnostics: Case Study of Wheat Rust. Sci. Rep. 2024, 14, 28203. [Google Scholar] [CrossRef] [PubMed]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  30. Saravanan, K.S.; Bhagavathiappan, V. AOQAS: Ontology Based Question Answering System for Agricultural Domain. Int. J. Comput. Inf. Syst. Ind. Manag. Appl. 2024, 16, 16. [Google Scholar]
  31. Wang, Y.; Yasunaga, M.; Ren, H.; Wada, S.; Leskovec, J. VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering. arXiv 2022, arXiv:2205.11501. [Google Scholar]
  32. Zhang, W.; Yu, J.; Zhao, W.; Ran, C. DMRFNet: Deep multimodal reasoning and fusion for visual question answering and explanation generation. Inf. Fusion 2021, 72, 70–79. [Google Scholar] [CrossRef]
  33. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  34. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  35. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the International Workshop on Deep Learning in Medical Image Analysis; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
  36. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  37. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 4015–4026. [Google Scholar]
  38. Zhang, M.L.; Li, Y.K.; Liu, X.Y.; Geng, X. Binary relevance for multi-label learning: An overview. Front. Comput. Sci. 2018, 12, 191–202. [Google Scholar] [CrossRef]
  39. Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains: A review and perspectives. J. Artif. Intell. Res. 2021, 70, 683–718. [Google Scholar] [CrossRef]
  40. Seo, H.; Lee, M.; Cheong, W.; Yoon, H.; Kim, S.; Kang, M. Enhancing multi-label long-tailed classification on chest x-rays through ML-GCN augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 2747–2756. [Google Scholar]
  41. Omeroglu, A.N.; Mohammed, H.M.; Oral, E.A.; Aydin, S. A novel soft attention-based multi-modal deep learning framework for multi-label skin lesion classification. Eng. Appl. Artif. Intell. 2023, 120, 105897. [Google Scholar] [CrossRef]
  42. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
  43. Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; Yuan, L. Video-llava: Learning united visual representation by alignment before projection. arXiv 2023, arXiv:2311.10122. [Google Scholar]
  44. Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
  45. Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
Figure 1. Sample images of five representative predator species in the dataset. (A) Lady beetles, (B) Lacewings, (C) Parasitoid wasps, (D) Trichogramma, and (E) Assassin bugs. These examples reflect the diversity of predator morphology and scale, serving as key references for multi-class segmentation and ecological modeling tasks.
Figure 2. Overall architecture of the proposed HortiVQA-PP framework. The system integrates a segmentation-aware image encoder and a prompt-driven text encoder. Visual and textual features are jointly aligned via a knowledge-enhanced co-attention mechanism, enabling accurate pest and predator recognition as well as VQA. The aligned prediction is optimized through backpropagation guided by multi-task objectives.
Figure 3. Architecture of the segmentation-aware visual encoder. The module extracts multi-scale visual features through multiple parallel multi-head attention blocks. LoRA-based modulation weights ($W_t$) are injected into each branch to adaptively enhance region-specific representation. Inner-product similarity matrices and mask encoders are used to generate fine-grained segmentation-aware feature maps, supporting downstream tasks such as pest–predator recognition and VQA.
Figure 4. Architecture of the pest–predator co-occurrence multi-object segmentation network. The network is composed of a stem layer, multiple stacked QS-Mamba encoders for hierarchical representation learning, and a multi-scale fusion Mamba decoder (MFMamba decoder) for spatially enriched reconstruction. Residual blocks and depthwise convolutions enhance feature transformation, while the Seg-head generates the final segmentation outputs for pest and predator instances in co-occurrence scenes.
Figure 5. Overview of the horticultural knowledge-augmented question answering module. The module integrates dense visual semantic tokens and low-/high-resolution image encoders to construct a multimodal visual-language embedding space. Through semantic decomposition, self-attention, and cross-attention, referring descriptions are aligned with corresponding image regions. Prompt generation and multi-order interactions facilitate accurate and knowledge-guided answer generation.
Figure 6. Radar chart comparing the performance of different models on the pest and predator semantic segmentation task (Task 1). The evaluation metrics include Precision, Recall, F1-score, mAP@50, and IoU. The proposed HortiVQA-PP model achieves the best overall performance across all metrics.
Figure 7. Performance comparison of different models on the VQA task. The metrics include BLEU-4, ROUGE-L, METEOR, Exact Match (EM), and GPT Rating. The proposed HortiVQA-PP model outperforms all baseline models across all evaluation metrics.
Table 1. Statistics of images and Q&A pairs in the HortiVQA-PP dataset.

Category | Number of Classes | Number of Images | Avg. Q&A Pairs/Image
Pest Images | 30 | 9560 | 6.4
Predator Images | 10 | 3120 | 5.2
Pest–Predator Co-occurrence Images | 15 combinations | 1780 | 7.1
Total | – | 14,460 | 6.3
Table 2. Performance Comparison on Pest and Predator Semantic Segmentation Task (Task 1).

Model | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 (%) | IoU (%)
SegNet | 78.5 | 72.9 | 75.6 | 68.4 | 62.7
UNet | 81.3 | 76.5 | 78.8 | 72.1 | 65.2
UNet++ | 83.4 | 78.9 | 81.1 | 74.8 | 67.5
Mask R-CNN | 84.7 | 79.6 | 82.0 | 75.9 | 68.9
Segment Anything | 86.2 | 81.5 | 83.8 | 78.7 | 71.2
HortiVQA-PP (Ours) | 89.6 | 85.2 | 87.3 | 82.4 | 75.1
Table 3. Performance Comparison on Pest–Predator Co-occurrence Matching Task (Task 2).

Model | Subset Accuracy (%) | Hamming Loss | Macro-F1 (%)
Binary Relevance | 71.4 | 0.096 | 68.9
Classifier Chain | 73.6 | 0.089 | 70.4
ML-GCN | 76.2 | 0.082 | 73.1
ASL (Attention-based Soft Labels) | 78.0 | 0.077 | 74.8
HortiVQA-PP (Ours) | 83.5 | 0.063 | 79.4
Table 4. Performance Comparison on Visual Question Answering Task (Task 3).

Model | BLEU-4 (%) | ROUGE-L (%) | METEOR (%) | EM (%) | GPT Rating (/5)
BLIP-2 | 38.2 | 45.6 | 35.4 | 28.7 | 3.6
LLaVA | 41.5 | 48.1 | 37.8 | 30.9 | 3.8
CLIP-QA | 35.7 | 44.2 | 33.5 | 27.3 | 3.5
MiniGPT-4 | 43.6 | 49.3 | 39.2 | 32.5 | 4.0
Flamingo | 44.1 | 50.5 | 39.6 | 33.2 | 4.1
HortiVQA-PP (Ours) | 48.7 | 54.8 | 43.3 | 36.9 | 4.5
Table 5. Robustness evaluation of HortiVQA-PP and baselines on three public datasets. “–” indicates that the metric is not applicable to the corresponding task.

Model | Task 1: F1-Score (%) | Task 2: Macro-F1 (%) | Task 3: EM (%) | Task 3: GPT Rating (/5)
SegNet | 76.1 | – | – | –
UNet | 79.4 | – | – | –
UNet++ | 81.7 | – | – | –
Mask R-CNN | 82.6 | – | – | –
Segment Anything | 84.3 | – | – | –
Binary Relevance | – | 69.3 | – | –
Classifier Chain | – | 70.9 | – | –
ML-GCN | – | 73.7 | – | –
ASL | – | 75.2 | – | –
BLIP-2 | – | – | 29.1 | 3.7
LLaVA | – | – | 31.4 | 3.9
CLIP-QA | – | – | 27.9 | 3.6
MiniGPT-4 | – | – | 33.0 | 4.1
Flamingo | – | – | 33.7 | 4.2
HortiVQA-PP (Ours) | 88.1 | 80.1 | 37.6 | 4.6
Table 6. Ablation study results on three tasks: semantic segmentation (Task 1), pest–predator co-occurrence matching (Task 2), and horticultural VQA (Task 3).

Variant | Task 1: F1-Score (%) | Task 2: Macro-F1 (%) | Task 3: EM (%) | Task 3: GPT Rating (/5)
w/o LoRA | 84.4 | 76.5 | 35.4 | 4.4
w/o QS-Mamba | 83.7 | 74.2 | 35.9 | 4.3
w/o HortiKG | 87.9 | 79.7 | 33.2 | 4.0
Full Model | 88.1 | 80.1 | 37.6 | 4.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
