1. Introduction
With the continuous growth of the global population and the intensification of climate change, agricultural production is confronting unprecedented challenges [1,2,3,4,5]. Globally, pests and pathogens are responsible for approximately 10–40% of crop yield losses, resulting in substantial economic impacts across regions [6]. In northern China, climatic fluctuations have led to frequent outbreaks of pests and diseases, posing severe threats to crop security and yield stability [7]. Leafy vegetables, among the most important daily-consumed crops in China, are particularly vulnerable due to their short cultivation cycles and exposure to open environments. In some production and distribution chains, postharvest losses of leafy vegetables can exceed 50% [8]. These factors contribute to the high diversity of pathogens, rapid disease transmission [9], and considerable diagnostic difficulties [10], while current disease detection systems in real-field conditions may show reduced accuracy (77–90%) and error rates of 10–23% [11], ultimately exerting significant impacts on both yield and quality [12]. Against this backdrop, achieving efficient detection and precise identification of diseases affecting leafy vegetables has become a critical component in safeguarding agricultural productivity and food safety [13].
Traditional methods for disease identification typically rely on visual inspection and expert judgment [14]. These approaches are not only inefficient and costly, but their accuracy also depends heavily on the skill of the inspectors [15], making them ill-suited for large-scale, high-frequency monitoring [16]. In recent years, the adoption of modern agricultural models such as aquaponics has offered new avenues for intensive vegetable production, but it has also introduced new challenges for disease management. In closed or semi-closed recirculating systems, once a disease outbreak occurs, it can spread rapidly, severely compromising system stability and water quality. Therefore, it is imperative to develop an intelligent disease recognition system tailored to leafy vegetables in aquaponic environments, enabling efficient and fine-grained monitoring of multiple diseases to support precise intervention and sustainable control.
With the ongoing progress of agricultural modernization and rapid advancements in computer vision technologies [17,18,19], image-based approaches for detecting plant diseases have gained increasing attention [20,21,22]. These techniques generally function by analyzing visual features such as color distribution, texture patterns, and structural shapes within images to locate diseased areas [23]. Typical methodologies include threshold-based segmentation, edge detection, and texture characterization [24]. Despite demonstrating acceptable results under controlled or simple conditions, these traditional image processing methods are inherently dependent on handcrafted features and fixed rule sets, which limits their adaptability to real-world, complex agricultural environments. When faced with a broad variety of disease types and constantly changing field conditions, conventional methods often fail to effectively distinguish between healthy and infected crops, leading to significant performance degradation [25,26,27]. Moreover, their limited generalization capacity becomes apparent in scenarios involving coexisting diseases, inconsistent growing environments, or data that are scarce or imbalanced across disease categories [28,29,30]. These challenges significantly hinder their applicability in practical agricultural settings, especially in rural regions where digital infrastructure is underdeveloped. Therefore, there is a pressing need for more intelligent and adaptable solutions for the effective detection of leafy vegetable diseases.
In recent years, deep learning—particularly convolutional neural networks (CNNs)—has emerged as a powerful alternative for visual analysis in agriculture [31,32]. Unlike traditional image-based techniques that rely on predefined features, deep learning models are capable of automatically learning hierarchical representations directly from large datasets, reducing manual intervention. These models have shown remarkable success in various computer vision tasks such as image classification, object localization, and semantic segmentation [33,34,35,36]. For instance, Abdu et al. introduced a disease recognition framework that efficiently captures both local and global lesion features, reducing redundancy in feature vectors and achieving a recall rate exceeding 99% [37]. Similarly, Rahman et al. developed an automated tomato leaf disease detection system that computed 13 statistical descriptors and used a support vector machine (SVM) for classification, yielding accuracy above 85% [38]. Li et al. enhanced the YOLOv5s model by modifying components such as the CSP, FPN, and NMS modules to improve feature extraction across multiple scales and adapt to environmental variability, achieving a mean average precision (mAP) of 93.1% [39]. Tiwari et al. employed transfer learning to construct a potato disease detection model, resulting in a classification accuracy of 97.8% [40]. Wang et al. proposed a two-stage framework combining DeepLabV3+ and U-Net for identifying the severity of cucumber leaf diseases under complex background conditions; their system reached a segmentation accuracy of 93.27%, a Dice coefficient of 0.6914 for lesion areas, and an average disease severity classification accuracy of 92.85% [41]. Lastly, Jiang et al. utilized CNNs for feature extraction on rice leaf disease images and adopted an SVM classifier, achieving an overall accuracy of 96.8% [42].
Although deep learning–based models have achieved remarkable progress in agricultural disease detection, they still face numerous challenges in real-world applications. These include the large variety of disease types, highly similar visual characteristics among diseases, and significant interference from complex backgrounds, which often lead to target recognition confusion and unclear boundary segmentation. To effectively address these challenges, this study proposes a disease detection and segmentation model incorporating a prototype attention mechanism—ProtoLeafNet—specifically designed to meet the demands of multi-disease coexistence and fine-grained recognition in leafy vegetables. The proposed approach substantially improves both the accuracy and robustness of disease identification. The main contributions of this work are as follows:
Large-scale leafy vegetable disease dataset: We collected and organized a large-scale image dataset comprising multiple typical leafy vegetable disease categories, covering diverse scenarios and disease morphologies, with strong representativeness and diversity.
Prototype-guided attention mechanism design: We propose a class-prototype–based attention mechanism to guide the model in focusing on key discriminative features of disease regions, effectively suppressing background interference and enhancing its ability to distinguish among multiple diseases.
Prototype loss optimization strategy: By optimizing the distance relationships between sample embeddings and class prototypes, we enhance inter-class separability and intra-class compactness in the feature space, thereby improving classification and segmentation accuracy.
Dual-task network structure for detection and segmentation: We construct a multi-task model that integrates object detection and semantic segmentation, enabling both precise localization of disease regions and fine-grained segmentation, thereby improving the overall recognition capability of the system.
The remainder of this paper is organized as follows: Section 2 reviews related work in plant disease detection and segmentation. Section 3 describes the dataset construction and data augmentation strategies and introduces the overall architecture of ProtoLeafNet, including the prototype attention mechanism. Section 4 details the experimental setup and presents quantitative and qualitative results, including ablation studies and specific case discussions. Section 5 discusses the advantages, limitations, and future perspectives of the proposed approach. Finally, Section 6 concludes the paper and outlines future research directions.
3. Methods and Materials
3.1. Dataset Construction
Constructing a high-quality dataset is a foundational step in this study, as the data quality directly affects the performance and generalization ability of the model. In this work, an image dataset was created specifically for leafy vegetable disease detection and segmentation tasks. The dataset includes several common vegetable types and their corresponding disease symptoms.
The images were collected from two sources: (1) public online platforms including PlantVillage, Kaggle, and Google Image Search, and (2) field photographs taken at the experimental agricultural base of China Agricultural University in Haidian District, Beijing. When searching online, we used disease-specific and crop-specific keywords (e.g., “lettuce downy mildew”, “spinach anthracnose”) and manually filtered the results based on image clarity, completeness, and relevance. Low-resolution, watermarked, or duplicate images were excluded.
In the field collection, we used handheld cameras (mainly mobile phone cameras) under natural lighting to photograph various disease conditions. The selected vegetable species included spinach, lettuce, water spinach, Chinese cabbage, and celtuce. The disease types covered in the dataset include downy mildew, white rust, anthracnose, virus disease, gray mold, soft rot, black rot, black spot, sclerotinia, and powdery mildew. For each disease type, about 1000 to 2000 images were collected to ensure sufficient variation and coverage.
Although healthy leaves were not labeled as a separate class, most of the disease symptoms are localized in nature. As a result, each image naturally contains areas of healthy leaf tissue, which serve as negative examples during training. A summary of the number of images per disease type is provided in
Table 1. To support reproducibility, we are open to sharing the dataset upon reasonable request.
To guarantee high visual fidelity and preserve fine-grained details during the image acquisition process, high-resolution digital cameras were employed. The hardware setup primarily consisted of digital single-lens reflex (DSLR) cameras manufactured by Nikon and Canon. Furthermore, to enrich the dataset and enhance its diversity, additional images were sourced from publicly available online platforms. Only images that met strict selection standards—such as high pixel density, clearly distinguishable disease symptoms, and credible origins—were included. An example of the image quality and selection rationale is shown in
Figure 1.
The dataset was partitioned with 80% allocated for training, 10% for validation, and the remaining 10% for testing—serving purposes of model fitting, parameter adjustment, and evaluation of final performance, respectively.
3.2. Data Augmentation
In the context of deep learning, data augmentation serves as a vital strategy to improve the robustness and predictive performance of models. It involves generating additional training instances by applying a range of transformations to the original dataset, thereby enhancing generalization capabilities and reducing overfitting tendencies. As deep learning research progresses, a number of advanced augmentation techniques—such as Cutout, Mixup, and CutMix—have emerged and gained significant attention. These methods introduce unique perturbations to training data, greatly enriching the variability of input distributions and contributing to performance improvements across diverse application domains.
3.2.1. Cutout
Among these, Cutout stands out for its conceptual simplicity and practical effectiveness. The method operates by randomly masking a rectangular patch within the input image, setting all pixel values in the selected area to zero. This localized dropout encourages the model to focus on more comprehensive contextual cues during training. One notable advantage of Cutout is its minimal computational overhead, as the operation requires only a straightforward masking step. The formal representation of the Cutout procedure is provided in Equation (3):
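As a point of reference, a standard Cutout formulation consistent with this description (the symbol \tilde{I} for the augmented image and \odot for element-wise multiplication are introduced here for illustration) is:

\tilde{I} = M \odot I, \qquad M \in \{0, 1\}^{H \times W}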
In the Cutout method, I represents the original input image, and the augmented image is obtained by applying a binary mask M that shares the same dimensions as the input. Within this mask, a rectangular region is randomly selected and assigned a value of 0 to indicate masked pixels, while all other positions retain a value of 1, preserving the corresponding original pixel values. By introducing such masked-out areas, the Cutout strategy compels the neural network to learn more generalized and spatially diverse features, thereby improving robustness during training.
3.2.2. Mixup
Mixup is a data augmentation strategy that synthesizes new examples by blending two input images in a fixed proportion. This process involves pixel-wise interpolation and label fusion through weighted averaging. Its core benefit lies in promoting model generalization by mitigating issues such as noisy annotations, data scarcity, and unclear class boundaries. The procedure of Mixup is given in Equations (4) and (5):
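A standard Mixup formulation matching this description (with x_i, x_j denoting the two samples, y_i, y_j their labels, and \lambda the mixing parameter; this notation is assumed here) is:

\tilde{x} = \lambda\, x_i + (1 - \lambda)\, x_j
\tilde{y} = \lambda\, y_i + (1 - \lambda)\, y_j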
In the case of Mixup, two distinct input samples and their respective labels are blended. A mixing parameter drawn from a uniform distribution determines the ratio used to combine both images and their corresponding labels. This interpolation process not only exposes the model to transitional cases between different classes but also reduces sensitivity to label noise and sharp decision boundaries, ultimately enhancing generalization performance and mitigating the risk of overfitting.
3.2.3. CutMix
CutMix is an alternative augmentation technique that constructs synthetic samples by extracting a region from one image and embedding it into a different image. This method contributes to better learning of localized patterns and strengthens the model’s resistance to overfitting and noise. The operation of CutMix can be formulated as Equations (6) and (7):
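In its standard form (written here with assumed notation: x_A, x_B and y_A, y_B for the two image–label pairs, M for the binary mask marking the pasted region, and \lambda for the area proportion of that region), CutMix reads:

\tilde{x} = M \odot x_A + (\mathbf{1} - M) \odot x_B
\tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B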
Specifically, A and B denote the two input images with their corresponding labels. A binary mask M is generated to indicate the rectangular region cropped from image A and pasted onto image B, and the proportion of the cropped region is determined by a scalar sampled from a uniform distribution. The resulting label corresponds to the augmented image. CutMix improves the model’s understanding of local image regions and enhances its ability to recognize combinations of features from different objects. The data augmentation effects are illustrated in Figure 2.
3.3. Proposed Method
3.3.1. Network Architecture Overview
This paper proposes a vegetable disease detection and segmentation model based on a Transformer architecture combined with a prototype enhancement mechanism, as illustrated in
Figure 3. The model integrates the local perception capability of convolutional neural networks with the global modeling strength of Transformers, and significantly improves the detection and segmentation performance in complex diseased regions through foreground-guided and prototype optimization mechanisms. The overall network employs ResNet-50 as the backbone feature extractor, encoding the input image into multi-scale convolutional features, which are then projected to a fixed dimension and flattened into a sequence fed into the Transformer.
During the semantic modeling process, the Transformer encoder utilizes multi-layer, multi-head self-attention to capture global relationships among different image regions, yielding semantically enriched representations. To enhance the model’s focus on key regions, a class-agnostic foreground prediction mechanism is introduced to estimate the probability of each encoded token belonging to the foreground. The top-k tokens with the highest confidence are selected as query vectors for the decoder, enabling a foreground-driven decoding process. The Transformer decoder further integrates global semantics and local spatial information to produce task-discriminative representations.
To improve the category distinctiveness and semantic consistency of the decoder outputs, a prototype enhancement module is introduced following the decoder. This module extracts prototype vectors for each disease category from the training samples and computes the similarity between the decoder features and the prototypes, refining the model’s response to target regions through weighted fusion. Finally, the enhanced features are fed into a multi-task predictor comprising three branches for object classification, bounding box regression, and lesion mask generation.
The entire model is trained end-to-end by jointly optimizing multiple loss functions, including standard classification, bounding box, and segmentation losses, as well as an additional prototype loss that constrains the distance between the decoder output and the corresponding category prototype. This design enhances the discriminability and stability of semantic modeling. The proposed architecture supports end-to-end training and achieves high-precision object detection and semantic segmentation in agricultural scenarios characterized by imbalanced data distributions and diverse disease region morphologies.
3.3.2. Feature Encoding and Semantic Modeling Module
As illustrated in Figure 4, the model first takes an input image and employs ResNet-50 as the backbone feature extractor to obtain multi-scale convolutional features. To achieve efficient information fusion and construct a unified representation, features from multiple stages of the backbone, corresponding to different spatial resolutions and semantic depths, are selected. Each feature map is first passed through a 1 × 1 convolution to adjust the channel dimension to a common size C, and then combined through upsampling and concatenation operations to produce a unified global feature representation F.
To incorporate global contextual information, the fused convolutional features F are flattened into a token sequence X, in which each token corresponds to a spatial location of F. A learnable positional encoding is then added to X to form the input to the Transformer encoder (Equation (8)).
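One plausible form of this step, with E_pos denoting the learnable positional encoding and Z_0 the resulting encoder input (both symbols assumed here), is:

X = \mathrm{Flatten}(F) \in \mathbb{R}^{N \times C}, \qquad Z_0 = X + E_{pos}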
Specifically, the query matrix Q, key matrix K, and value matrix V for each token are obtained via linear projections, and d_k denotes the dimension of each attention head (Equation (9)).
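These projections and the scaled dot-product attention they feed into take the standard form (the projection matrices W_Q, W_K, W_V and the per-layer input Z_{l-1} are named here for illustration):

Q = Z_{l-1} W_Q, \quad K = Z_{l-1} W_K, \quad V = Z_{l-1} W_V
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V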
After passing through L layers of the encoder, the globally modeled output sequence Z is obtained. This sequence encodes rich semantic and contextual information, which serves as the foundation for the subsequent foreground selection and decoding stages (Equation (10)).
For the i-th token, a foreground score is computed by applying a learnable linear projection followed by a sigmoid function. Based on the foreground scores, the top-k tokens are selected to form the foreground feature set, which serves as the query input to the Transformer decoder.
The decoder takes the selected foreground tokens as queries and the encoder output Z as keys and values. Through cross-attention, it fuses these features and generates task-discriminative representations, which are subsequently used for target classification, localization, and mask prediction.
This module effectively combines local convolutional perception with global Transformer-based semantic modeling, which not only enhances the ability of this model to perceive disease regions but also provides a solid semantic foundation for subsequent prototype enhancement and multi-task prediction.
3.3.3. Prototype Enhancement Module
Although the Transformer decoder possesses strong global modeling capability, its reliance solely on the attention mechanism may lead to inaccurate category discrimination in the presence of irregularly distributed disease regions and subtle inter-class differences, particularly in multi-class disease images with confusing regions. To address this issue, a prototype enhancement module is introduced, which leverages category-level semantic representations as guidance to improve feature discriminability from a categorical perspective. This module consists of two parts, a prototype extraction module and a prototype attention mechanism, which respectively construct category prototypes during training and utilize them to semantically enhance the decoder outputs during inference.
Prototype Extraction Module: During training, for each disease category c, a semantic centroid vector, referred to as the category prototype vector, is computed by aggregating the features of all samples belonging to category c. This prototype summarizes the shared semantics of the samples and thus serves as the most representative feature of the category (Equation (11)).
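A standard class-centroid formulation consistent with this description (with S_c denoting the set of training features assigned to category c and p_c the resulting prototype; notation assumed) is:

p_c = \frac{1}{|S_c|} \sum_{f \in S_c} f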
Prototype Attention Mechanism: During inference or training, the output tokens of the Transformer decoder are taken as input. To enhance the category-specific expressiveness of each target token, a prototype attention mechanism based on cosine similarity is designed. Specifically, the similarity between each token and each category prototype is computed as in Equation (12).
The attention weights are then normalized via a softmax function to obtain the attention distribution (Equation (13)). Based on these weights, the prototype-enhanced representation of each token is formulated as in Equation (14).
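Written with assumed notation (d_i for a decoder token, p_c for the prototype of category c, s_{i,c} for their cosine similarity, \alpha_{i,c} for the normalized attention weight, and \hat{d}_i for the enhanced token), one plausible instantiation of Equations (12)–(14) is:

s_{i,c} = \frac{d_i \cdot p_c}{\lVert d_i \rVert \, \lVert p_c \rVert}, \qquad
\alpha_{i,c} = \frac{\exp(s_{i,c})}{\sum_{c'} \exp(s_{i,c'})}, \qquad
\hat{d}_i = d_i + \sum_{c} \alpha_{i,c}\, p_c

The additive form in the last expression is one reasonable choice for the weighted fusion described above; a gating or concatenation operator could serve the same role.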
The enhanced token sequence incorporates category-guided information, resulting in more discriminative and semantically consistent features, which are beneficial for subsequent multi-task branches performing classification, localization, and mask prediction.
3.3.4. Multi-Task Predictor and Loss Function Design
After completing the Transformer decoding and prototype enhancement, the output feature sequence is fed into three task-specific branches, which are responsible for target classification, bounding box regression, and pixel-level mask prediction of the diseased regions, respectively. This constitutes an end-to-end multi-task learning framework. The structure of each branch is as follows:
Target Classification Branch: The classification branch employs a multi-layer perceptron (MLP) to predict the category probability distribution for each token. The predicted distribution of the j-th token is compared against its ground-truth category label, and the classification loss is defined as the average multi-class cross-entropy (Equation (15)).
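The average multi-class cross-entropy referred to here, written with assumed notation (\hat{y}_{j,c} for the predicted probability of class c for the j-th token, y_{j,c} for the corresponding one-hot label, and N for the number of matched tokens), takes the form:

\mathcal{L}_{cls} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{c} y_{j,c} \log \hat{y}_{j,c}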
Bounding Box Regression Branch: The regression branch outputs the four-dimensional bounding box parameters for each target, indicating the bounding box’s center position along with its width and height. The loss function combines the L1 loss with the Generalized IoU (GIoU) loss (Equation (16)), evaluated between the predicted box and the ground-truth bounding box of the j-th target.
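A combined L1 + GIoU objective of the kind described, sketched with assumed notation (\hat{b}_j for the predicted box, b_j for the ground truth, and term weights \lambda_{L1}, \lambda_{giou} introduced for illustration), is:

\mathcal{L}_{box} = \frac{1}{N} \sum_{j=1}^{N} \Big[ \lambda_{L1} \lVert b_j - \hat{b}_j \rVert_1 + \lambda_{giou} \big( 1 - \mathrm{GIoU}(b_j, \hat{b}_j) \big) \Big]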
Mask Prediction Branch: The mask prediction branch generates a binary mask for each detected diseased region. To obtain precise and complete segmentation, the loss function combines the binary cross-entropy (BCE) and Dice losses (Equation (17)), evaluated between the predicted mask and the ground-truth mask of the j-th target.
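A BCE + Dice combination consistent with this description, written with assumed notation (\hat{m}_j for the predicted mask and m_j for the ground-truth mask of the j-th target), is:

\mathcal{L}_{mask} = \frac{1}{N} \sum_{j=1}^{N} \left[ \mathrm{BCE}(\hat{m}_j, m_j) + \left( 1 - \frac{2\, \lvert \hat{m}_j \cap m_j \rvert}{\lvert \hat{m}_j \rvert + \lvert m_j \rvert} \right) \right]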
Prototype Loss: To further improve the discriminability of the model for different disease categories, a prototype loss (PrototypeLoss) is introduced. This loss encourages the feature representation of each target to move closer to the semantic prototype of its ground-truth category, as defined in Equation (18). Minimizing it reduces the intra-class distance while enlarging the inter-class distance, improving feature compactness and separability, which is particularly beneficial for distinguishing confusing disease categories.
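A distance-based prototype loss matching this description, written with assumed notation (f_j for the feature representation of the j-th target and p_{y_j} for the prototype vector of its ground-truth category), is:

\mathcal{L}_{proto} = \frac{1}{N} \sum_{j=1}^{N} \lVert f_j - p_{y_j} \rVert_2^2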
3.3.5. Multi-Task Joint Loss Function
To jointly optimize the model for classification, localization, segmentation, and semantic consistency, the total loss function is defined as a weighted sum of the individual loss terms (Equation (19)), in which four hyperparameters control the relative contributions of the classification loss, bounding box regression loss, mask prediction loss, and prototype loss, respectively. This joint optimization enables the model to learn a balanced representation that is both semantically discriminative and spatially precise.
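Denoting the four weighting hyperparameters as \lambda_{cls}, \lambda_{box}, \lambda_{mask}, and \lambda_{proto} (names chosen here for illustration), the weighted sum reads:

\mathcal{L}_{total} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{box} \mathcal{L}_{box} + \lambda_{mask} \mathcal{L}_{mask} + \lambda_{proto} \mathcal{L}_{proto}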
3.4. Evaluation Metrics
In this research, several quantitative indicators—such as accuracy, precision, recall, and mAP (mean Average Precision)—are utilized to assess model performance. Among them, accuracy is a straightforward yet fundamental indicator that quantifies the proportion of correctly classified instances relative to the total number of samples. Within classification scenarios, this metric captures how many predictions match the ground truth across the entire dataset. Precision acts as an essential metric to evaluate how reliably the model identifies positive predictions. It indicates the proportion of correctly classified positives out of all instances marked as positive by the system. In contrast, recall measures the model’s capacity to detect actual positive cases, calculated as the number of true positives divided by the total count of ground-truth positive examples. While precision emphasizes prediction correctness, recall focuses on the ability to capture all relevant positives, aiming to maximize coverage of true cases. To deliver a thorough evaluation of performance across multiple object classes and varying confidence levels, the metric of mAP is adopted. The calculation process consists of several stages: initially, a PR (precision–recall) curve is plotted for each category using model predictions; subsequently, the area beneath each PR curve is computed to yield the AP (Average Precision) for the corresponding class. The overall mAP is then obtained by averaging these AP values across all object categories. The formal mathematical expressions for these evaluation measures are presented as follows:
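The standard definitions matching the descriptions above (with TP, TN, FP, FN denoting true/false positives and negatives, P(r) the precision at recall level r, and N the number of samples or classes as appropriate) are:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad
\mathrm{Precision} = \frac{TP}{TP + FP}, \quad
\mathrm{Recall} = \frac{TP}{TP + FN}

\mathrm{AP} = \int_{0}^{1} P(r)\, dr, \qquad
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i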
where TP and TN indicate the counts of correctly predicted positive and negative instances, while FP and FN correspond to incorrect positive and negative predictions. The symbol N denotes either the total sample count (for accuracy) or the number of object classes (for mAP). Additionally, P(r) reflects the precision value at a given recall level r.
4. Results
4.1. Experimental Setup
The experiments were executed on a high-efficiency computing system configured with GPU-enabled infrastructure tailored for deep learning workloads.
Table 2 outlines the experimental environment, including both software frameworks and hardware specifications utilized during the evaluation.
To ensure consistency and fairness in performance comparison, all models were trained for a unified total of 200 epochs. For ProtoLeafNet, the number of Transformer decoder layers was empirically selected to balance representational depth and computational cost.
The training process employed a carefully tuned set of hyperparameters to achieve stable model behavior under data-scarce conditions. Specifically, the learning rate was initialized at 0.001 and decayed by a factor of 10 every 10 epochs to facilitate stable convergence toward a good optimum. A batch size of 16 was chosen to strike a balance between computational efficiency and gradient stability.
Prior to training, all input images were resized to 224 × 224 pixels to match the input dimension expectations of standard convolutional architectures. To enhance generalization and reduce the variance introduced by dataset partitioning, we employed a five-fold cross-validation strategy. The dataset was evenly divided into five subsets; in each round, one subset was used for validation while the remaining four were used for training. The final results were reported as the average across all five folds.
To prevent overfitting, both L2 regularization (with a weight decay of 0.001) and dropout (with a rate of 0.5) were applied during training. Optimization was performed using the Adam algorithm with standard first- and second-moment decay hyperparameters and a small numerical stability constant.
For all baseline models, including YOLOv8, YOLOv9, YOLOv10, YOLOv11, and TinySegformer, we adhered to the best-performing configurations and training procedures as reported in their respective original publications to ensure a fair and reliable comparison.
4.2. Multi-Task Evaluation Results for Disease Recognition
This study conducts a comparative analysis of several representative models in object detection and segmentation, aiming to verify the proposed method’s capability in identifying disease regions on leafy vegetables. The evaluation relies on multiple performance indicators, including precision, recall, classification accuracy, and mAP calculated at IoU thresholds of 50% (mAP@50) and 75% (mAP@75), ensuring a holistic measurement of both localization precision and general robustness.
The outcomes of the experiments are reported in
Table 3 and
Table 4, where various models’ effectiveness on detection and segmentation tasks is outlined. Results suggest that architectures with enhanced multi-scale feature extraction, such as YOLOv10 and TinySegformer, yield better results than prior baselines. Importantly, the proposed approach consistently surpasses all benchmarks, especially in complex environments and scenarios with limited training data, highlighting its practical advantages and generalization strength.
In the object detection evaluation, YOLOv9 surpassed DETR by a margin of 2.94% on both mAP@50 and mAP@75, reflecting a clear performance uplift. Building on this, YOLOv10 demonstrated a strong balance between precision and recall, achieving mAP scores of 85.26% at 50% IoU and 84.28% at 75% IoU. These gains stem from its refined detection head and enhanced multi-scale feature integration, which notably improve the identification of small-scale targets. Continuing this progression, YOLOv11 sustained high accuracy in detection scenarios.
Advancing beyond existing frameworks, the method proposed in this work reached 91.07% and 90.25% for mAP@50 and mAP@75, respectively—exceeding all comparative models. Its superior performance is largely due to the integration of a prototype-driven feature encoder and attention-guided refinement module. These components boost the distinctiveness of extracted representations and intensify focus on semantically critical regions, especially in scenes with dense background interference or limited annotated data, thereby affirming its robust generalization capability.
Within the segmentation evaluation, TinySegformer exhibited strong performance in capturing global contextual dependencies, attributed to its Transformer-based design. It achieved mAP scores of 89.82% at 50% IoU and 88.83% at 75% IoU. However, the method introduced in this study yielded even better outcomes, recording 90.80% for both mAP@50 and mAP@75.
Furthermore, the proposed model reported an accuracy of 93.77%, recall of 90.80%, and precision of 91.79%. These metrics validate the model’s strength in improving segmentation fidelity and its resilience under adverse conditions, such as cluttered scenes and category overlap—challenges commonly encountered in agricultural disease imagery.
From the modeling standpoint, DETR employs self-attention to capture holistic feature interactions across the image. Nonetheless, its detection capability is hindered by substantial computational overhead and inadequate handling of small-scale objects, limiting its overall applicability in practical detection scenarios. In contrast, the YOLO family improves inference efficiency and detection accuracy through architectural simplification and multi-scale feature pyramid enhancement. TinySegformer integrates a lightweight Transformer backbone with multi-level receptive field fusion strategies, offering certain advantages in contextual modeling, though its performance remains limited by the spatial resolution constraints inherent to multi-head attention.
In contrast to previous approaches, the method introduced here integrates prototype-based representation learning and attention-driven refinement to reinforce intra-class consistency while maximizing inter-class distinction within the embedding space. This structural design strengthens the model’s ability to differentiate complex and texture-rich disease types. Additionally, a prototype-aware loss formulation is employed to explicitly minimize the distance between feature vectors and their respective class centers during optimization, thereby enhancing generalization performance, particularly in data-constrained environments.
In summary, the presented approach delivers leading results across object detection and semantic segmentation benchmarks, highlighting its versatility and practical value in the context of agricultural image analysis.
4.3. Comparative Ablation of Attention Mechanisms
This investigation explores how various attention mechanisms influence model performance, with a specific focus on their effectiveness in detecting and segmenting diseases in leafy vegetables. Three different attention approaches were evaluated: the traditional self-attention mechanism, the CBAM (Convolutional Block Attention Module), and the newly proposed prototype-guided attention mechanism. Each method was assessed for its contribution to improving detection accuracy and segmentation quality within the agricultural disease analysis task.
Through a comparative evaluation using indicators including precision, recall, classification accuracy, and mAP computed at IoU levels of 50% and 75%, this study investigates the impact of distinct attention mechanisms on feature expressiveness, detection robustness, and overall recognition performance. The outcomes, presented in
Table 5, highlight clear performance variations across the three evaluated strategies.
As shown in
Table 5, the ablation study on different attention mechanisms demonstrates that the proposed prototype attention mechanism outperforms the baseline methods across all evaluation metrics, exhibiting superior feature focusing and discriminative capabilities. Specifically, the standard self-attention mechanism yielded relatively poor performance, with a precision of 72.85%, recall of 69.86%, accuracy of 71.86%, mAP@50 of 71.86%, and mAP@75 of 70.86%. Although self-attention is effective in modeling global dependencies, it is prone to interference from redundant information in scenarios with complex backgrounds or subtle inter-class differences. This leads to reduced selectivity in feature extraction and degrades overall performance.
Conversely, the Convolutional Block Attention Module (CBAM), which integrates spatial and channel-wise attention components, notably strengthens the model’s resilience and feature discrimination capability. This enhancement translated into consistent gains across evaluation criteria, yielding a precision of 84.83%, recall of 80.84%, accuracy of 82.83%, mAP@50 of 83.83%, and mAP@75 of 82.83%.
Remarkably, the introduced prototype-guided attention module delivered the highest overall performance, achieving 94.81% in precision, 91.82% in recall, 92.81% in accuracy, and identical mAP values of 91.82% at both 50% and 75% IoU thresholds. This mechanism integrates prototype vector modeling with an adaptive weighting strategy, enabling the extraction of representative features for each disease class at the global semantic level. It dynamically emphasizes relevant regions in local feature maps via attention weighting, thereby enhancing the response to disease-affected areas while effectively suppressing background noise.
Overall, the prototype attention mechanism demonstrates superior performance in both detection and segmentation tasks, confirming its strong adaptability to the challenges of high-variance backgrounds, complex textures, and inter-class feature overlap in vegetable disease imagery. These findings highlight its potential as an effective solution for intelligent disease recognition in precision agriculture applications.
4.4. Ablation Study on Different Loss Functions
In order to assess how different loss formulations affect model behavior in detecting and segmenting diseases on leafy vegetables, an ablation analysis was performed. The analysis centers on three widely adopted loss formulations: conventional cross-entropy, focal loss designed to handle class imbalance, and a newly developed prototype-oriented loss proposed in this work.
A comprehensive evaluation was carried out using several performance indicators, including precision, recall, accuracy, and mAP, assessed at IoU thresholds of 50% (mAP@50) and 75% (mAP@75). This investigation explores how various loss functions influence model convergence behavior, prediction reliability, and general robustness. As detailed in
Table 6, experimental outcomes indicate that the prototype-guided loss consistently delivers the best performance, surpassing both the conventional cross-entropy and focal losses across all measured dimensions.
The experimental outcomes reveal that the model employing the cross-entropy loss function demonstrates relatively inferior performance across various evaluation metrics, with mAP@50 and mAP@75 values of 65.87% and 64.87%, respectively. This finding underscores that, despite its extensive use in multi-class classification tasks, cross-entropy loss is incapable of adequately modeling the discriminative boundaries between classes in scenarios involving fine-grained recognition and class imbalance.
In contrast, the introduction of the focal loss leads to a substantial improvement in model performance, with mAP@50 and mAP@75 increasing to 80.84%. This enhancement can be attributed to the focal loss’s ability to focus on hard-to-classify samples, thereby effectively alleviating the performance degradation caused by class imbalance.
Furthermore, the prototype-based loss function yielded the highest performance across all evaluation indicators, achieving mAP scores of 91.82% at both the 50% and 75% IoU thresholds. By explicitly regulating the embedding-to-prototype distance during training, this loss formulation markedly improves the model’s capability to differentiate between low-frequency categories and reduces the impact of irrelevant background features. Consequently, it leads to accuracy gains in both classification and segmentation tasks. This prototype-driven strategy introduces an effective mechanism for strengthening feature separability under complex visual conditions.
4.5. Specific Case Discussions
This experiment is designed to assess how effectively the proposed model identifies various plant disease types, with a particular emphasis on challenging conditions such as cluttered visual backgrounds and feature-level ambiguity. The analysis investigates the model’s strengths and potential shortcomings in fine-grained disease recognition under these difficult cases. Detection outcomes for multiple disease types—namely downy mildew, gray mold, black rot, sclerotinia, and brown spot—are comparatively analyzed, as summarized in
Table 7 and
Table 8.
The integrated results of the detection and segmentation tasks demonstrate that the proposed model holds significant potential for application in aquaponic systems, particularly in the identification of diseases affecting various leafy vegetables. For instance, in the case of spinach downy mildew, the model achieved an accuracy of 94.56%, a recall of 91.61%, and an mAP@50 of 92.59% in the object recognition task. In the segmentation task, the precision reached 97.80%, with mAP@50 and mAP@75 values of 94.47% and 93.47%, respectively. These results indicate that the model is highly capable of recognizing and segmenting diseases with clear lesion edges and distinct morphological features, making it suitable for early disease monitoring in aquaponic systems.
In contrast, for diseases with less prominent features or stronger background interference, such as cabbage sclerotinia and lettuce leaf brown spot, the model’s mAP@75 values in the detection and segmentation tasks were slightly lower (88.65% and 90.45%, respectively). This suggests that the model still faces challenges in handling lesions with blurred textures, diffused spots, or unclear boundaries. It also reflects the heterogeneous distribution of different diseases in the image feature space, which imposes stricter demands on inter-class differentiation. To further illustrate these findings, representative detection and segmentation results are visualized in
Figure 5.
Figure 5 presents representative qualitative detection results for five common leafy vegetable diseases: soft rot and gray mold on lettuce, downy mildew on spinach, sclerotinia on cabbage, and brown spot on celtuce leaves. The predicted bounding boxes accurately highlight symptomatic areas of varying shapes, colors, and textures, demonstrating the model’s robustness across diverse species and disease manifestations. These visual results complement the quantitative metrics by confirming that the proposed method can localize and classify disease symptoms consistently, even under variations in leaf morphology and lesion appearance, thereby reinforcing the overall reliability of the detection framework.
5. Discussion
5.1. Advantages
ProtoLeafNet enhances the discriminative and feature-focusing capabilities of key regions in disease images by introducing a prototype-guided attention mechanism and a prototype loss function. This model delivers strong results across both detection and segmentation tasks, exhibiting remarkable resilience and generalization capabilities in agricultural environments characterized by visual clutter and overlapping disease categories. According to the evaluation, it obtains mAP scores of 91.07% at 50% IoU and 90.25% at 75% IoU for object detection, while achieving an mAP@75 of 90.80% in the delineation of pathological areas. In comparison with mainstream models like YOLOv10 and TinySegformer, ProtoLeafNet shows significant advantages in boundary discrimination and feature differentiation. Its multi-task joint training strategy further promotes feature sharing and collaborative optimization, significantly enhancing the model’s overall disease recognition capability. Additionally, the prototype-enhanced mechanism improves the model’s interpretability and semantic consistency, making it particularly suitable for fine-grained recognition tasks involving multiple co-existing diseases, thus, providing theoretical and algorithmic support for smart agricultural disease monitoring systems.
Given its intended use in precision agriculture and aquaponic environments, we further analyzed ProtoLeafNet’s computational efficiency and deployment feasibility on resource-constrained edge devices. The model comprises approximately 38.2 million parameters and requires 144 MB in storage (FP32). Inference tests on an NVIDIA RTX 4090D GPU showed an average processing time of 28.4 ms per 224 × 224 image and a peak memory usage of 1.7 GB. While slightly heavier than TinySegformer, ProtoLeafNet offers superior accuracy and robustness, and remains compatible with edge AI accelerators, making it practical for real-time deployment in smart greenhouses, vertical farms, and other embedded agricultural systems.
5.2. Challenges and Limitations
Despite the strong overall performance of ProtoLeafNet, several limitations were observed during experimental evaluation. First, the model exhibits performance degradation under extreme lighting variations, such as overexposure or severe shadows, which can distort disease color and texture cues. Second, in cases where multiple leaves overlap significantly, occlusion may lead to missegmentation or missed detections, particularly for small or early-stage lesions. Additionally, some visually ambiguous symptoms—such as early powdery mildew versus dust or leaf aging effects—may challenge the model’s feature differentiation capacity. These limitations suggest that while ProtoLeafNet generalizes well across most conditions, its performance may be affected in low-quality image settings or ambiguous disease scenarios. Future work will consider incorporating illumination-invariant pre-processing and occlusion-aware attention mechanisms to further improve robustness.
5.3. Future Perspectives
Future research can further expand and optimize the findings of this study in several key areas. Firstly, it would be beneficial to combine the prototype mechanism with a dynamic update strategy to enable adaptive updates of the prototype vectors during both training and inference phases. This would enhance the model’s ability to adapt to diverse agricultural environments. Secondly, in terms of model architecture, exploring more lightweight Transformer variants or incorporating techniques such as knowledge distillation and sparse attention mechanisms could improve deployment efficiency and inference performance, facilitating its application in edge computing and IoT devices for agricultural monitoring. Additionally, multi-modal fusion approaches could be further explored, such as integrating image data with environmental variables like dissolved oxygen and pH values from aquaponic systems, to build a multi-source perception disease recognition model that handles visually complex cases. Finally, while this model demonstrates promising performance, there is still room for improvement in its generalization ability and domain adaptability. Future work could employ strategies like distribution shift mitigation to increase this model’s transferability across regions and crops, thereby laying the foundation for building more universally applicable smart agricultural vision systems to serve aquaponic systems.
To better contextualize the model’s technical performance within the framework of sustainable agriculture, we emphasize that the improvement in detection accuracy—from 85.26% in YOLOv10 to 91.07% in ProtoLeafNet—has practical implications beyond numerical gains. Higher accuracy in disease detection enables earlier and more precise identification of affected plants, allowing farmers to apply treatments only when and where necessary. This targeted intervention reduces excessive or preventive pesticide use, thereby lowering environmental toxicity and input costs. Furthermore, precise lesion segmentation supports quantification of disease severity, enabling informed decisions about culling, nutrient adjustment, or environmental control in hydroponic systems. By improving detection robustness under real-world conditions, ProtoLeafNet contributes to reducing crop losses due to late or missed diagnoses. Collectively, these outcomes align with sustainable agriculture goals by enhancing resource efficiency, minimizing ecological impact, and supporting high-yield, low-input production systems.
6. Conclusions
This paper proposed ProtoLeafNet, a model for identifying and segmenting diseases in leafy vegetables that integrates a prototype-guided attention mechanism, aiming to address the challenges of multi-disease recognition, complex backgrounds, and high sample diversity in agricultural systems such as aquaponics.
In performance assessments, ProtoLeafNet attained 93.12% accuracy, alongside mAP scores of 91.07% at 50% IoU and 90.25% at 75% IoU for object detection. For segmentation, the model reached 90.80% at both mAP@50 and mAP@75, outperforming leading models such as YOLOv10 and TinySegformer across several key metrics. Specifically, in the case of Spinacia oleracea infected with downy mildew, it recorded an mAP@50 of 92.59% and a segmentation accuracy of 97.80%. These findings underscore the model’s effectiveness in fine-level feature targeting, detailed classification, and maintaining generalization across challenging visual conditions.
Moreover, ablation studies comparing different attention strategies and supervision signals revealed that the prototype-based attention mechanism markedly enhances focus on semantically important areas, leading to an approximate 10-point improvement in mAP. When contrasted with conventional cross-entropy loss, the prototype-guided loss displayed superior consistency and class discriminability, especially in subtle category distinctions. Collectively, these enhancements contribute to better learning under data scarcity and improve the model’s interpretability in practical deployments.
Despite the promising experimental results, ProtoLeafNet still faces several challenges in practical applications. Firstly, the prototype vectors’ expressiveness heavily depends on high-quality annotated data and balanced class distributions. In cases where some disease categories have very few samples or poor-quality labels, the model’s attention allocation may become skewed, impacting overall recognition performance. Secondly, while the Transformer module enhances the model’s semantic modeling and global perception, its substantial computational burden restricts deployment on low-resource edge platforms and hinders real-time application in smart agricultural terminals. Furthermore, the current model relies primarily on a single visual modality. Under interference from factors such as lighting variations, leaf occlusion, or image blur, the model’s robustness and stability still have room for improvement.
In the future, research may explore strategies to enhance model flexibility and streamline its deployment in practical settings. One possible avenue is to incorporate a dynamic prototype update mechanism, allowing the prototype vectors to be self-adjusted during both training and inference phases, thereby enhancing the model’s response to real-time environmental changes. In terms of architecture, exploring more lightweight model designs, such as MobileViT or TinyFormer, combined with techniques like knowledge distillation or sparse attention mechanisms, could reduce inference overhead and meet the requirements for edge deployment. Additionally, multi-modal fusion of image data with environmental sensor data from aquaponic systems, such as water temperature, pH value, and dissolved oxygen concentration, could help construct a more comprehensive disease perception model and improve its understanding of complex agricultural environments. Finally, adopting meta-learning or domain adaptation strategies could enhance the model’s transferability across crop varieties, cultivation regions, and even entire systems, thereby improving its generalization ability. With these improvements, ProtoLeafNet has the potential to further enhance its practicality and broader application value, providing stable and reliable technical support for efficient disease monitoring in smart agriculture, especially within aquaponic systems.