Article

Disease-Seg: A Lightweight and Real-Time Segmentation Framework for Fruit Leaf Diseases

1 College of Information and Technology, Jilin Agricultural University, Changchun 130118, China
2 College of Foreign Languages, Jilin Agricultural University, Changchun 130118, China
3 School of Information and Electrical Engineering, Shenyang Agricultural University, Shenyang 110866, China
* Authors to whom correspondence should be addressed.
Agronomy 2026, 16(3), 311; https://doi.org/10.3390/agronomy16030311
Submission received: 18 December 2025 / Revised: 13 January 2026 / Accepted: 21 January 2026 / Published: 26 January 2026

Abstract

Accurate segmentation of fruit tree leaf diseases is critical for yield protection and precision crop management, yet it is challenging due to complex field conditions, irregular leaf morphology, and diverse lesion patterns. To address these issues, Disease-Seg, a lightweight real-time segmentation framework, is proposed. It integrates CNN and Transformer with a parallel fusion architecture to capture local texture and global semantic context. The Extended Feature Module (EFM) enlarges the receptive field while retaining fine details. A Deep Multi-scale Attention mechanism (DM-Attention) allocates channel weights across scales to reduce redundancy, and a Feature-weighted Fusion Module (FWFM) optimizes integration of heterogeneous feature maps, enhancing multi-scale representation. Experiments show that Disease-Seg achieves 90.32% mIoU and 99.52% accuracy, outperforming representative CNN, Transformer, and hybrid-based methods. Compared with HRNetV2, it improves mIoU by 6.87% and FPS by 31, while using only 4.78 M parameters. It maintains 69 FPS on 512 × 512 crops and requires approximately 49 ms per image on edge devices, demonstrating strong deployment feasibility. On two grape leaf diseases from the PlantVillage dataset, it achieves 91.19% mIoU, confirming robust generalization. These results indicate that Disease-Seg provides an accurate, efficient, and practical solution for fruit leaf disease segmentation, enabling real-time monitoring and smart agriculture applications.

1. Introduction

Fruits are rich in vitamins and other nutrients required by the human body. However, leaf diseases of fruit trees, driven by environmental and weather conditions, frequently weaken photosynthesis and reduce both fruit yield and quality, causing serious economic losses for growers [1]. According to global estimates, plant diseases cause average annual yield losses of 17.2% to 22.5% in major crops [2], while for high-value perennial fruit trees the potential yield reduction can exceed 40% if effective management is not implemented [3]. In terms of economic impact, leaf diseases cause global losses exceeding USD 220 billion each year, with high-value fruit crops often being the most vulnerable [4]. Traditional crop disease diagnosis relies on visual inspection of symptoms and lesions together with expert consultation, which is laborious for large orchards with many fruit tree species and increases growers' costs [5]. Therefore, studying fruit tree disease detection is crucial for agriculture. Pixel-level image segmentation, which assigns a label to each pixel, enables a more in-depth understanding and analysis of diseases [6]. It can help growers and plant protection personnel monitor crop health and formulate effective control measures to prevent the spread of diseases, thereby reducing pesticide use and ensuring food safety [7].
With the continuous development of deep learning, researchers have gradually applied it to agriculture and achieved good results. The Fully Convolutional Network (FCN) was the first model to use fully convolutional layers for semantic segmentation: it classifies each pixel by extracting image features with convolutional layers and recovering spatial resolution through up-sampling and skip connections [8]. U-Net, first used in the medical field, extracts image features through encoder down-sampling and recovers resolution by combining decoder up-sampling with skip connections, achieving high-precision pixel-level segmentation [9]. PSPNet extracts multi-level features of different dimensions through a pyramid pooling module to achieve target segmentation at different scales [10]. Other CNN-based semantic segmentation networks include HRNetV2 [11], SegNet [12], and Mask R-CNN [13]. Owing to their strong transferability, these networks have been widely adapted to agriculture with good results. Z. Wang et al. [14] combined U-Net with multi-scale convolution to obtain multi-scale information on tomato diseases and achieved 91.32% accuracy by highlighting the edge features of tomato lesions through the Squeeze-and-Excitation (SE) module. Tassis et al. [15] divided the recognition of diseases and pests on coffee leaves into three stages: in the first stage, a Mask R-CNN network was employed for instance segmentation; in the second stage, U-Net and PSPNet networks were applied for semantic segmentation; and in the third stage, a ResNet network was used for classification. C. Wang et al. [16] used a two-stage model to separate leaf segmentation from lesion extraction: DeepLabv3+ segmented cucumber leaves in the first stage, and U-Net extracted cucumber downy mildew spots in the second, effectively extracting leaves and lesions with high lesion segmentation accuracy. However, CNNs excel at local feature extraction, and global features usually rely on stacking more layers, yet the deeper the network becomes, the more severe the degradation phenomenon is [17].
In recent years, the Transformer, based on a self-attention mechanism and fully connected layers, has been used to capture global dependencies among all positions of an input sequence and to perform efficient sequence-to-sequence modeling with an encoder–decoder structure [18]. The Vision Transformer (ViT) first adopted the Transformer in computer vision, discarding conventional convolutional layers in favor of a purely attention-based design [19]. Other scholars introduced the Transformer into image segmentation, such as SegFormer [20], PoolFormer [21], and the Swin Transformer [22]. Salamai [23] proposed a novel Dual-Path Vision Transformer for real-time diagnosis of coffee leaves through collaboration between a lesion segmentation path and a classification path. Thai et al. [24] accurately detected leaf diseases with a Vision Transformer while using sparse matrix–matrix multiplication to reduce training time, ultimately improving model performance. However, the self-attention mechanism is redundant when computing correlations between neighboring pixels and cannot encode positional information, which increases network complexity and restricts the accurate acquisition of target position information.
Researchers have also demonstrated robustness in image segmentation by combining Transformers with conventional CNNs. Pacal et al. [25] improved the Transformer by replacing the conventional convolutional structure in the MaxViT stem with SE blocks, achieving high accuracy in detecting maize leaf diseases. Xuechen Li et al. [26] used Generative Adversarial Networks (GANs) to generate disease datasets suitable for downstream tasks and then combined a CNN with the Swin Transformer to accurately classify sugarcane diseases. Zhang et al. [7] used a Local Learning Bottleneck (LLB) to enhance local perception and combined it with inverted residual convolution to extract richer semantic information about grape leaf diseases, with good results. Y. Wang et al. [27] achieved accurate segmentation of anthracnose spots and species recognition based on the Swin Transformer and path aggregation. Lu et al. [28] achieved accurate segmentation of multiple fruit tree leaf diseases by alternately stacking two different encoders, a CNN and a Transformer, in series. Zhang et al. [29] proposed a lightweight segmentation architecture called the U-shape sharpening perceptual method for efficient extraction of adjacent grapevine leaf diseases, achieving a better trade-off between performance and efficiency. Z. Guo et al. [30] achieved optimal segmentation accuracy in a complex and variable weed-shape task by using two feature extraction modules to generate global semantic features and local visual features. Although combining a Transformer with a CNN balances local and global feature extraction, it increases model complexity, computational cost, and training time, while limited generalization restricts direct application to other crops or diseases.
Despite recent advances in hybrid architectures that integrate convolutional neural networks (CNNs) and Transformers, several critical challenges remain in fruit leaf disease detection and assessment. First, many existing models rely on computationally intensive Transformer components, resulting in high model complexity and inference latency, which limits their applicability for real-time disease monitoring on resource-constrained devices commonly used in field environments. This issue is particularly pronounced when processing high-resolution leaf images required for accurate disease delineation.
Second, many existing models are easily influenced by inherent leaf characteristics, such as venation patterns, surface textures, and species-specific leaf shapes, which can interfere with the accurate identification of disease regions. As a result, models trained on one fruit species often perform poorly when applied to another, indicating limited robustness and generalization across different leaf types.
Segmenting pathological images of different fruit tree leaves in real, complex environments involves many difficulties. For leaves: (1) leaf curling and folding cast shadows and thus increase segmentation difficulty; (2) the irregular shape of leaf edges makes edge feature extraction challenging; (3) alternating and overlapping leaves make global feature extraction difficult. For disease spots: (1) strong shifts in light intensity can blur lesion boundaries and make lesion colors similar to leaf colors, leading to inaccurate segmentation; (2) in dense areas of small lesions, spots are easily missed or merged during segmentation; (3) in rainy conditions, light reflection blurs lesion features and makes them difficult to extract. Specific details are shown in Table 1.
We hypothesize that by jointly integrating lightweight convolutional representations for local lesion texture modeling with transformer-based global context encoding, a unified segmentation framework can achieve robust cross-crop and cross-acquisition generalization under complex field conditions, while maintaining real-time inference efficiency.
In this work, we make the following key contributions to the real-time segmentation of fruit leaf diseases:
  • We propose a lightweight real-time leaf disease segmentation framework that employs a parallel fusion of Transformer and CNN, effectively capturing both local texture and global context, significantly reducing lesion omission.
  • We design an Extended Feature Module (EFM), a Deep Multi-scale Attention mechanism (DM-Attention), and a Feature-Weighted Fusion Module (FWFM) to expand the receptive field, enhance multi-scale feature representation, and optimize heterogeneous feature fusion.
  • The model demonstrates strong segmentation accuracy and cross-dataset generalization across various fruit disease datasets and public benchmarks.
  • The framework achieves 69 FPS on high-performance devices and an average inference time of 49 ms per image on edge devices, showing practical applicability in smart agriculture scenarios.

2. Materials and Methods

This study first systematically delineates the construction process of datasets comprising seven disease categories across six fruit tree species. Subsequently, the design of a CNN-Transformer hybrid architecture is presented, with a particular focus on the structural characteristics of the proposed modules and their contributions to enhancing segmentation performance. Thereafter, a comparative analysis is conducted to evaluate the differences in segmentation accuracy and inference speed between the proposed method and multiple classical, high-performance models. Finally, the deployment and testing results of the model on edge devices and web platforms are further demonstrated. The overall technical framework of the study is illustrated in Figure 1.

2.1. Data Collection

The dataset in this study integrates multi-source data from both authentic field environments and standardized laboratory settings, covering seven disease types across six fruit species. Specifically, field data were collected at the teaching and research base of Jilin Agricultural University during the fruiting period from July to September 2024. Images were acquired at a strictly controlled distance of 5–10 cm under diverse and challenging scenarios, including intense midday sun, low-light evening conditions, and post-rain environments (characterized by leaf surface moisture and specular reflections), aiming to enhance the model’s robustness in real-world agronomic contexts. To further strengthen the model’s feature extraction capabilities and taxonomic diversity, mango and pomegranate disease images from the Plant Village [31] database were incorporated into the training set. By integrating these high-contrast standardized samples, the model is compelled to learn more discriminative lesion texture features. Furthermore, Plant Village images of grape brown spot and black rot were reserved as an independent validation set to rigorously evaluate the model’s generalization and transferability when encountering unseen species and cross-domain tasks. Relevant geographic information and acquisition environment schematics are detailed in Figure 2.
Table 2 summarizes the sample distribution of different leaf disease categories under outdoor and indoor scenes, with data collected at noon and in the evening.
Table 3 provides a detailed overview of the parameters of different fruit tree leaf disease datasets, including disease categories, fruit species, data sources, sample sizes, and pixel-level statistics. Specifically, the dataset covers seven disease types—Spotted Leaf Drop, Rust, White rot, Black star, Red spot, Cercospora spot, and Brown spot—corresponding to various fruit trees such as apple, grape, pear, plum, pomegranate, and mango. The first five disease categories were collected from real field environments at JLAU, while the Cercospora spot and Brown spot datasets were obtained from the public Plant Village dataset. The number of images per category ranges from 204 to 261, with leaf pixel counts varying from 5.07 M to 28.61 M and lesion pixel counts ranging from 0.28 M to 1.3 M, indicating notable differences in the proportion of diseased areas among categories. All images were standardized to a resolution of 512 × 512, ensuring consistent input conditions for model training and fair performance comparison.

2.2. Data Preprocessing

All images in the self-built dataset were resized to 512 × 512 pixels. Expert-guided Labelme [32] annotations produced a VOC-format segmentation dataset covering six fruit species and seven disease types (13 classes). Grape brown spot and black rot images from Plant Village were used as an independent validation set and resized to 256 × 256 to test the effect of resolution. Annotated images are shown in Figure 3. Although some disease colors are similar across sets, the independent data sources prevent interference between training and validation data.
Deep learning networks tend to overfit on small datasets, achieving high training accuracy but poor generalization. To address this, collected images were randomly augmented with rotation, cropping, contrast adjustment, and 10 × 20 black mask blocks to increase data diversity, mitigate class imbalance, and improve model robustness across different scenarios, as illustrated in Figure 4.
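As an illustration of the augmentation pipeline described above, the following sketch jointly applies random rotation, cropping, contrast adjustment, and a 10 × 20 black mask block to an image/mask pair. It assumes a PyTorch/torchvision environment, and the parameter ranges (rotation angle, crop ratio, contrast factors) are illustrative choices rather than the exact values used in this study.

```python
import random
from PIL import Image
from torchvision.transforms import InterpolationMode
import torchvision.transforms.functional as TF

def augment_pair(image: Image.Image, mask: Image.Image):
    """Jointly augment an RGB image and its segmentation mask.

    Illustrative re-creation of the augmentations described in the text:
    random rotation, cropping, contrast adjustment, and a 10 x 20 black
    mask block; parameter ranges are assumed, not taken from the paper.
    """
    # Random rotation (nearest-neighbour keeps mask labels discrete)
    angle = random.uniform(-30, 30)
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle, interpolation=InterpolationMode.NEAREST)

    # Random crop, then resize back to the 512 x 512 working resolution
    w, h = image.size
    crop = int(0.8 * min(w, h))
    top, left = random.randint(0, h - crop), random.randint(0, w - crop)
    image = TF.resized_crop(image, top, left, crop, crop, [512, 512])
    mask = TF.resized_crop(mask, top, left, crop, crop, [512, 512],
                           interpolation=InterpolationMode.NEAREST)

    # Random contrast adjustment (applied to the image only)
    image = TF.adjust_contrast(image, random.uniform(0.7, 1.3))

    # Paste a 10 x 20 black block at a random position (simple occlusion)
    x0, y0 = random.randint(0, 512 - 10), random.randint(0, 512 - 20)
    image.paste((0, 0, 0), (x0, y0, x0 + 10, y0 + 20))
    return image, mask
```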

2.3. Disease-Seg

To address diverse fruit leaf diseases, this study proposes a Transformer-based fusion segmentation method, Disease-Seg, which combines the strengths of CNNs and Transformers for accurate disease recognition. The model uses a single-layer parallel fusion encoder to effectively integrate local features and global context without increasing network depth, reducing the risk of missing key features.
As shown in Figure 5, the two encoders produce features at 1/4, 1/8, 1/16, and 1/32 resolutions with channel sizes of 16, 32, 64, 160, and 256. EFM employs dilated convolutions to expand the receptive field and capture local details; FWFM adaptively weights features to reduce redundancy; DM-Att (DM-Attention) uses multi-scale Depthwise and pointwise convolutions to enhance multi-scale feature extraction efficiently. The decoder integrates these features and progressively up-samples to full resolution, yielding multi-class leaf disease segmentation for precise localization and classification.

2.4. Extended Feature Module

To address small spots or blurred features caused by uneven illumination on fruit tree leaves, we propose an Extended Feature Module (EFM) to enhance disease region detection. Through designed convolutional operations, it effectively captures multi-scale disease features and adapts well to small early lesions. Specific implementation details are shown in Figure 6.
The module first applies a 1 × 1 convolution to map the input channels $C$ to $C_{out}$, preserving spatial dimensions while aggregating channel information. A dilated convolution with rate 2 expands the receptive field and captures large-scale local features without extra parameters. SiLU activation enhances nonlinear transformation and sensitivity to subtle leaf edge or disease boundary changes, improving segmentation accuracy. Batch Normalization reduces inter-channel variation, preventing overfitting and improving robustness under variable lighting. Finally, another 1 × 1 convolution compresses and integrates disease features, producing high-quality outputs for subsequent fusion and prediction. The specific computation process is as follows.
$$x_1 = \mathrm{conv}_{1\times1}^{(1)}(x) = x \cdot W_1 + b_1$$
where $x$ is the input tensor, $W_1$ is the convolution kernel, and $b_1$ is the bias. Then, Batch Normalization is applied to $x_1$:
$$x_1 = \mathrm{BN}(x_1)$$
Next, the SiLU activation function is applied:
$$x_2 = \mathrm{SiLU}(x_1) = x_1 \cdot \sigma(x_1) = \frac{x_1}{1 + \exp(-x_1)}$$
After that, the dilated convolution operation is performed:
$$x_3 = \mathrm{Conv}_{3\times3}^{\mathrm{dilated}}(x_2) = x_2 \cdot W_2 + b_2$$
The BN and SiLU activation functions are then applied:
$$x_4 = \mathrm{SiLU}\big(\mathrm{BN}(x_3)\big)$$
Finally, a 1 × 1 convolution operation is performed to compress and integrate the feature information:
$$x_{\mathrm{final}} = x_1 + \mathrm{BN}\big(\mathrm{conv}_{1\times1}^{(2)}(x_4)\big)$$
This design enhances the model’s ability to capture disease characteristics through multi-level convolution and activation functions, especially when processing lesion details, which can effectively improve the segmentation accuracy and robustness of the model.
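A minimal PyTorch sketch of the EFM computation above is given below. The layer arrangement follows the equations (1 × 1 projection, BN, SiLU, dilated 3 × 3 convolution with rate 2, BN, SiLU, and a 1 × 1 fusion with a residual connection); the padding choice and channel handling are assumptions of this illustration rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EFM(nn.Module):
    """Sketch of the Extended Feature Module following the steps above.

    The dilation rate (2) and kernel sizes follow the paper's description;
    everything else (padding, channel handling) is an assumption.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # first 1x1 conv
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()
        # Dilated 3x3 convolution (rate 2) enlarges the receptive field
        self.dilated = nn.Conv2d(out_ch, out_ch, kernel_size=3,
                                 padding=2, dilation=2)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.fuse = nn.Conv2d(out_ch, out_ch, kernel_size=1)  # second 1x1 conv
        self.bn3 = nn.BatchNorm2d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.bn1(self.proj(x))          # x1 = BN(conv_1x1(x))
        x2 = self.act(x1)                    # x2 = SiLU(x1)
        x3 = self.dilated(x2)                # x3 = dilated 3x3 conv
        x4 = self.act(self.bn2(x3))          # x4 = SiLU(BN(x3))
        return x1 + self.bn3(self.fuse(x4))  # residual fusion

# Quick shape check:
# EFM(64, 64)(torch.randn(1, 64, 128, 128)).shape  ->  (1, 64, 128, 128)
```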

2.5. Deep Multi-Scale Attention Mechanism

When detecting disease edges and fine details, the Transformer’s attention reduces sequence length with ratio R to save computation, which may cause feature loss. To address this, we propose DM-Attention, which applies Depthwise separable convolutions and integrates a Deep Multi-scale Module (DMSM) to preserve fine details and edges with minimal parameter overhead, making it well-suited for crop disease segmentation in smart agriculture. The specific structure is shown in Figure 7.
First, the input features are processed by a multi-head attention mechanism, which generates a query matrix (Q), a key matrix (K), and a value matrix (V) via linear transformations, allowing each head to focus on diseased regions from different perspectives. Specifically, the input feature x is first mapped to the query matrix Q through a fully connected layer, and its dimensions are adjusted so that each attention head attends to different feature channels. The computation is expressed as follows:
$$Q = \mathrm{reshape}\big(\mathrm{permute}\big(\mathrm{Linear}(x)\big), B, N, \mathrm{num\_heads}, C/\mathrm{num\_heads}\big)$$
where $B$ denotes the batch size, $N$ the sequence length, $C$ the number of channels, and num_heads the number of attention heads.
The K and V are generated similarly to the queries, through a linear layer to obtain two separate matrices, and partitioned by num_heads.
$$K, V = \mathrm{split}\big(\mathrm{permute}\big(\mathrm{Linear}(x), B, -1, 2, \mathrm{num\_heads}, C/\mathrm{num\_heads}\big)\big)$$
The dot product of the query $Q$ and key $K$ is scaled by the factor $1/\sqrt{d}$ and then passed through the softmax function to obtain the attention weights.
$$\mathrm{Attention} = \mathrm{Softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d}}\right)$$
where d is the dimension of each head. By using the attention weights to weight the value matrix V, the result is projected back to the original dimension through a linear layer, and finally the output feature x with global context is obtained as follows:
$$x = \mathrm{reshape}\big(\mathrm{proj}\big(\mathrm{Dropout}(\mathrm{Attn} \cdot V)\big), B, N, C\big)$$
To further improve the feature expression capability, DM-Attention introduces a DMSM to convolve features using convolution kernels of different sizes to capture local features of different scales. The output results of multiple convolution operations are aggregated into a multi-scale convolution list, which is calculated as
$$\mathrm{conv\_outs} = \big\{\mathrm{conv}_{k}(x) \;\big|\; k \in \{1, 3, 5, 7\}\big\}$$
The different scale features are then added together to generate the aggregate feature U, which is the superposition of all convolution outputs, to represent the overall multi-scale information.
$$U = \sum_{i=0}^{\mathrm{len}(\mathrm{conv\_outs})-1} \mathrm{conv\_outs}_i$$
The feature map of U is compressed into a vector S by global average pooling to obtain the global features of the input.
$$S = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} U(h, w)$$
where H and W are the height and width of the feature map, respectively.
A fully connected layer is used to map S to a smaller dimension Z, which reduces the computational burden and speeds up reasoning, and guides the subsequent weighting process.
$$Z = \mathrm{fc}(S)$$
A weight matrix is then generated, in which each weight is used for the weighted combination of the features produced by the different convolutional kernel sizes:
$$V = \sum_{i=0}^{\mathrm{len}(\mathrm{weights})-1} \mathrm{weights}_i \cdot \mathrm{conv\_outs}_i$$
Finally, each multi-scale feature map is weighted and summed to produce the output V, integrating multi-scale information. This allows DM-Attention to adapt to diverse disease morphologies, textures, and structures, enhancing overall fruit tree disease detection and accurately capturing subtle leaf lesions.
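The following sketch illustrates the DMSM branch of DM-Attention as described above: multi-scale convolutions with kernel sizes 1, 3, 5, and 7, aggregation into $U$, global average pooling to $S$, a fully connected reduction to $Z$, and a weighted recombination into $V$. The preceding multi-head attention follows the standard formulation given above and is omitted; the use of depthwise convolutions, the reduction ratio, and the softmax-based weight generation are assumptions made for this illustration.

```python
import torch
import torch.nn as nn

class DMSM(nn.Module):
    """Sketch of the Deep Multi-scale Module used inside DM-Attention.

    Kernel sizes {1, 3, 5, 7} follow the text; the depthwise convolutions,
    the reduction ratio, and the softmax weights are assumptions.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in (1, 3, 5, 7)                       # multi-scale branches
        ])
        self.fc = nn.Linear(channels, channels // reduction)       # S -> Z
        self.weight_fc = nn.Linear(channels // reduction, len(self.convs))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        conv_outs = [conv(x) for conv in self.convs]    # per-scale features
        u = torch.stack(conv_outs, dim=0).sum(dim=0)    # U: aggregated feature
        s = u.mean(dim=(2, 3))                          # S: global average pool
        z = torch.relu(self.fc(s))                      # Z: compressed descriptor
        w = torch.softmax(self.weight_fc(z), dim=-1)    # per-branch weights
        # Weighted sum over the multi-scale branches -> output V
        v = sum(w[:, i, None, None, None] * conv_outs[i]
                for i in range(len(conv_outs)))
        return v
```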

2.6. Feature-Weighted Fusion Module

In a single-layer parallel encoder architecture, since different encoders focus on different aspects of extracted features, the FWFM is designed to efficiently integrate features from both encoders in an adaptive manner. The detailed structure is shown in Figure 8.
First, the module concatenates the two input feature maps $C_1$ and $C_2$ along the channel dimension to generate a joint feature map of size $H \times W \times (C_1 + C_2)$ containing all the information from the two encoders. Next, the concatenated feature map is processed with a 1 × 1 convolution to generate a weight map, which is used for the weighted fusion of the input features in subsequent steps.
$$W_{\mathrm{Sigmoid}} = \sigma\big(\mathrm{conv}_{1\times1}(\mathrm{cat}(C_1, C_2))\big) \in \mathbb{R}^{H \times W \times C_1}$$
The weight map is normalized to [0,1] via the Sigmoid function, where each element indicates the importance of its spatial position in the fused feature map. The two input features are then combined through weighted summation, as follows:
$$x_w = W_{\mathrm{Sigmoid}} \cdot C_1 + (1 - W_{\mathrm{Sigmoid}}) \cdot C_2$$
Here, $W_{\mathrm{Sigmoid}}$ is the Sigmoid-based fusion weight, which adaptively adjusts the fusion ratio according to each feature map’s contribution, enhancing the model’s ability to capture complex patterns.
$$x_{\mathrm{final}} = \mathrm{BN}\big(\mathrm{conv}_{1\times1}(x_w)\big) \in \mathbb{R}^{H \times W \times C_{\mathrm{out}}}$$
Through weighted fusion, the FWFM integrates features from two sources, capturing multi-level context and enhancing performance in diverse feature extraction tasks.
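A minimal sketch of the FWFM is given below; it assumes both encoder branches deliver feature maps with the same spatial size and channel count, and the layer names are illustrative.

```python
import torch
import torch.nn as nn

class FWFM(nn.Module):
    """Sketch of the Feature-Weighted Fusion Module.

    Assumes the two branches provide feature maps of matching shape;
    the weight map is produced by a 1x1 conv on the concatenation.
    """
    def __init__(self, channels: int, out_channels: int):
        super().__init__()
        self.weight_conv = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.out_conv = nn.Conv2d(channels, out_channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, c1: torch.Tensor, c2: torch.Tensor) -> torch.Tensor:
        # Sigmoid weight map from the concatenated features
        w = torch.sigmoid(self.weight_conv(torch.cat([c1, c2], dim=1)))
        x_w = w * c1 + (1.0 - w) * c2          # adaptive weighted fusion
        return self.bn(self.out_conv(x_w))     # x_final = BN(conv_1x1(x_w))
```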

2.7. Experimental Platform and Evaluation Metrics

To verify the proposed method, all experiments were conducted under the same hardware and software environment: an NVIDIA GeForce RTX 2080 Ti (TU102) GPU (NVIDIA Corporation, Santa Clara, CA, USA) and an 11th Gen Intel® Core™ i5-11400F processor (Intel Corporation, Santa Clara, CA, USA) (2.60 GHz, 6 cores/12 threads). AdamW was used as the optimizer with weight decay $10^{-2}$, cosine learning rate decay, momentum 0.9, batch size 16, and 300 training epochs.
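For reference, a minimal sketch of this training configuration (AdamW with weight decay $10^{-2}$, cosine learning-rate decay over 300 epochs, batch size 16) is shown below; the base learning rate is an assumed placeholder, and the 0.9 momentum term is mapped to AdamW's first beta.

```python
import torch
from torch.utils.data import DataLoader

def build_training_setup(model, train_dataset):
    """Optimizer/scheduler setup matching the reported configuration.

    The base learning rate (1e-3) is a placeholder assumption; batch size,
    weight decay, and the 300-epoch cosine schedule follow the text.
    """
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                                  betas=(0.9, 0.999), weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                           T_max=300)
    return loader, optimizer, scheduler
```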
To validate the robustness of the model, this paper employs eight evaluation metrics, including Intersection over Union (IoU), mean Intersection over Union (mIoU), mean Pixel Accuracy (mPA), Accuracy (Acc), Floating Point Operations (FLOPs), Parameters (Params), Coefficient of Determination (R2), and Root Mean Square Error (RMSE) to comprehensively assess the proposed model from various perspectives.
IoU measures the overlap between predicted and true regions for a single category, serving as a key indicator of category-level segmentation accuracy.
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP is true positive, FP is false positive, and FN is false negative.
mIoU evaluates segmentation performance by measuring the overlap between predicted and ground-truth regions, effectively reflecting the model’s overall accuracy.
$$\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i + FN_i}$$
where $N$ denotes the total number of categories, and $TP_i$, $TN_i$, $FP_i$, and $FN_i$ denote the true positive, true negative, false positive, and false negative pixels of category $i$, respectively.
The mPA is used to evaluate the proportion of pixels in each category that are correctly categorized. It measures the pixel-level predictive effect of each category.
$$\mathrm{mPA} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i + TN_i}{TP_i + FP_i + TN_i + FN_i}$$
Acc measures the overall classification correctness of the model, defined as the ratio of correctly predicted pixels to the total number of pixels.
$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + TN + FN}$$
In linear regression, common metrics include R2 and RMSE. R2 measures the goodness of fit, with values closer to 1 indicating higher accuracy. RMSE represents the average prediction error, with smaller values indicating better performance.
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
where $\hat{y}_i$ is the $i$-th predicted value, $y_i$ is the $i$-th actual value, and $n$ is the total number of samples.
FLOPs measure the model’s computational complexity during inference, indicating suitability for resource-limited devices. Parameters count the total number of model weights, reflecting model size and complexity. Inference speed represents the time or resources needed for a single prediction, directly affecting real-time performance.
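The segmentation metrics above can be computed from a class-wise confusion matrix; the following NumPy sketch (an illustrative helper, not the authors' evaluation code) returns per-class IoU, mIoU, and overall pixel accuracy from predicted and ground-truth label maps.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Compute per-class IoU, mIoU, and pixel accuracy from label maps.

    `pred` and `gt` are integer class maps of the same shape; this mirrors
    the IoU/mIoU/Acc definitions given above.
    """
    valid = (gt >= 0) & (gt < num_classes)
    # Confusion matrix: hist[i, j] counts pixels with gt = i, pred = j
    hist = np.bincount(num_classes * gt[valid].astype(int)
                       + pred[valid].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes,
                                                           num_classes)
    tp = np.diag(hist)
    fp = hist.sum(axis=0) - tp
    fn = hist.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)        # per-class IoU
    miou = iou.mean()                             # mean IoU over classes
    acc = tp.sum() / np.maximum(hist.sum(), 1)    # overall pixel accuracy
    return iou, miou, acc
```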

3. Results

This section compares the proposed method with state-of-the-art and classical semantic segmentation models, including PSPNet, HRNetV2, U-Net, DeepLabV3+, SegFormer, SegNeXt [33], DANet [34], OCRNet [35], and UPerNet [36], with some models implemented using the open-source segmentation framework [37]. These methods cover three architectures: CNN, Transformer, and CNN-Transformer fusion. The effectiveness of the proposed method is further demonstrated through systematic comparisons with existing models, with the experimental results providing clear quantitative and qualitative evidence of its robustness, generalization ability, and practical applicability in precision agriculture.

3.1. Comparative Experiment

To validate the proposed method in real scenarios, Table 4 and Table 5 present single-category segmentation results for apple, grape, and plum samples, with bold values indicating the highest performance.
Table 4 shows the segmentation of apple rust, spotted leaf drop disease, grape white rot, and plum red spot disease in outdoor environments. The proposed method achieved the best performance on four disease types and maintained a stable leaf segmentation accuracy of around 98%. For instance, in apple spotted leaf drop disease, its IoU was 33%, 44%, and 38% higher than OCRNet, UPerNet, and PSPNet, and 14% higher than SegNeXt. Across other disease types, the method outperformed SegFormer and U-Net by an average of 13%. These results indicate that the proposed approach not only excels in standard segmentation metrics but also effectively addresses practical challenges encountered in real-world scenarios, such as leaf folding, jagged or irregular edges, and variations in lighting conditions.
As shown in Table 5, the proposed method also performed well on indoor pomegranate cercospora spot, indoor mango brown spot, and outdoor pear black star disease. For pomegranate cercospora spot, HRNetV2, UPerNet, and OCRNet showed significantly lower leaf and spot segmentation accuracy. In the mango category, although SegFormer was the best among the comparisons, its IoU for leaf and spot segmentation was still 1% and 16% lower than the proposed method. For pomegranate, UPerNet’s leaf and spot IoU were 6% and 15% lower. Specifically, while SegFormer (Mit B_3) achieves competitive leaf IoU, its disease IoU is notably lower than that of the proposed model, particularly for fine-grained disease regions.
These results indicate that increasing the Transformer depth alone does not necessarily improve segmentation accuracy for complex leaf disease patterns. In contrast, the proposed single-layer parallel CNN-Transformer architecture effectively compensates for the reduced Transformer depth by explicitly enhancing local texture representation through the CNN branch, thereby preserving boundary details while maintaining sufficient global context modeling. Overall, it offers a reliable solution for precise leaf and disease spot segmentation and provides valuable insights for advancing research in agricultural disease detection and management.
Figure 9 provides a systematic comparison of the proposed Disease-Seg model against several classical segmentation methods across multiple disease categories. Although some traditional methods maintain competitive performance in specific cases, Disease-Seg consistently achieves higher IoU scores and exhibits more stable and reliable segmentation results. These quantitative improvements are further supported by qualitative observations, where Disease-Seg more accurately delineates leaf boundaries and small lesion regions under challenging conditions such as occlusion, irregular shapes, and variable lighting. Overall, the results highlight the model’s robustness, generalization, and effectiveness in multi-category leaf disease segmentation, providing strong support for smart agriculture research.
Figure 10 shows the boxplot distribution of per-class IoU across 13 categories for different segmentation methods. Compared with CNN-based baselines, Transformer-based models exhibit higher median IoU and more compact distributions, indicating improved consistency across categories. Among all methods, the proposed approach achieves the highest median and mean IoU with a narrower interquartile range and fewer low-IoU outliers, demonstrating both superior accuracy and stronger robustness for fine-grained leaf and disease segmentation.
To comprehensively evaluate the proposed method across different scenarios, we compared it with CNN-based, Transformer-based, and CNN-Transformer fusion models. Table 6 reports mIoU, Acc, Params, FLOPs, and FPS, with bold and underlined values indicating the highest and second-highest, respectively.
CNNs benefit from weight sharing, local connectivity, and fast inference, but struggle with small lesion segmentation due to limited long-range dependency capture. For example, DeepLabV3+, HRNetV2, and U-Net achieve mIoUs 11.78%, 6.87%, and 7.71% lower than the proposed method, though PSPNet has 2.41 M fewer parameters.
Transformers capture global context via self-attention, dynamically adjusting feature representations. UPerNet improves mIoU by 7.79% and 7.45% over SegFormer and SegNeXt, which adopt lightweight decoders. Disease-Seg runs 1.8× faster than SegFormer in FPS but slightly slower than SegNeXt, demonstrating strong local and fine-grained feature capture while maintaining high inference speed.
CNN-Transformer fusion methods aim to balance global and local features, but excessive parameter growth can reduce efficiency. The proposed method achieves a favorable trade-off; OCRNet and UPerNet mIoUs are 18.60% and 25.67% lower, while their Params and FLOPs are significantly higher. Although not optimal in Params, FLOPs, and FPS, the overall performance is among the best, suitable for cloud and edge deployment.
In summary, the results demonstrate the proposed method’s effectiveness and applicability in leaf disease segmentation, providing a reliable technical foundation for smart agriculture.
Figure 11 presents a detailed comparison of disease segmentation for apple rust, spotted leaf drop disease, and plum red spot disease, with white boxes highlighting model differences. In sample (a), SegFormer, U-Net, and HRNetV2 partially segment smaller spots but miss some target areas. In sample (b), SegNeXt better segments leaves and spots but misjudges overlapping regions. OCRNet, DeepLabV3+, and DANet struggle to distinguish leaf edges in overlapping or similarly colored areas. In sample (c), most models locate diseased spots despite complex backgrounds, but over-segmentation occurs at edges for OCRNet, DeepLabV3+, and HRNetV2.
Overall, under complex real-world conditions, the proposed method precisely segments diseased spots and maintains edge details, highlighting its effectiveness for smart agriculture.
Figure 12 presents the segmentation of grape and pear leaves, where folding, jagged edges, irregular shapes, and background interference pose significant challenges. Compared with DeepLabV3+, OCRNet, U-Net, SegFormer, DANet, and SegNeXt, which often miss diseased spots or produce blurred leaf edges, the proposed approach more effectively captures complex leaf morphology and densely distributed lesions, preserves fine structural details, and mitigates background interference, resulting in markedly improved segmentation accuracy.
Figure 13A,B present the segmentation of mango brown spot and pomegranate cercospora spot disease in a laboratory environment, where leaf folding, small and variable lesion morphology, diverse leaf shapes, and color similarity to the background challenge accurate segmentation. As shown in the white box in (a), models such as DeepLabV3+, U-Net, and OCRNet captured only basic differences between spots and leaves, failing to delineate lesion shapes. Similarly, in (b), OCRNet, SegFormer, and U-Net exhibited limited capability in suppressing background interference in pomegranate lesion segmentation, which consequently resulted in lesion omission. Overall, the proposed approach demonstrates superior performance in precisely extracting lesion morphology and suppressing background interference, achieving higher robustness and accuracy in complex scenarios.

3.2. Generalization Experiment

The model’s generalization was evaluated on the Plant Village dataset by comparison with classical and state-of-the-art methods at different image resolutions, as shown in Table 7.
The results indicate that the proposed method exhibits significant advantages. HRNetV2, despite its competitive performance, achieves mIoU and Acc values 1.96% and 0.12% lower, with 24.76 M additional parameters and 18.70 G higher FLOPs. U-Net and DeepLabV3+ also underperform, with mIoU and Acc 4.65%, 0.86%, 3.48%, 0.81%, 0.82% and 1.12% lower. SegNeXt, a Transformer-based fusion model, records mIoU and Acc 28.84% and 1.16% lower, while OCRNet and UPerNet fail to achieve a balanced trade-off between segmentation accuracy and computational efficiency. Overall, the proposed method demonstrates superior performance and robustness in complex disease segmentation tasks.
To illustrate the superiority of the proposed method on the Plant Village dataset, segmentation results for brown spot and black rot are visualized in Figure 14A,B. In sample (a), shown in the white dashed box, DeepLabV3+, U-Net, and HRNetV2 fail to accurately extract detailed features of closely adjacent spots, whereas the proposed method effectively preserves fine-grained information lost during multi-resolution feature fusion. In sample (b), within the white dashed box, the proposed method achieves more precise segmentation of disease shapes compared to SegFormer and SegNeXt. Overall, the model excels at segmenting subtle leaf disease spots, particularly capturing edge textures and fine details, demonstrating superior performance.
To more clearly and intuitively highlight the advantages of the proposed method, Figure 15 presents the comprehensive performance of various models under different datasets and input resolution conditions. This design not only facilitates a thorough evaluation of the models’ generalization ability and robustness across diverse data scenarios but also verifies their adaptability in balancing accuracy and efficiency under varying image quality conditions. The comparison between Figure 15a,b clearly demonstrates that the proposed Disease-Seg model exhibits significant superiority across multiple experimental settings. While maintaining a relatively low parameter count and computational complexity, the method achieves the highest mIoU, with overall performance markedly surpassing that of other classical models. These results fully indicate that Disease-Seg ensures high-precision segmentation while simultaneously maintaining a lightweight design and computational efficiency, reflecting its comprehensive advantages in both accuracy and practicality. Importantly, this characteristic provides solid support and broad application prospects for disease segmentation tasks on resource-constrained edge devices and in complex real-world agricultural scenarios.

3.3. Attention Comparison Experiment

To evaluate DM-Attention in complex agricultural scenarios, it was compared with SE [38], CBAM [39], COT [40], SK [41], Triplet [42], and Global Context [43] attention mechanisms, as shown in Table 8. The baseline model achieved 81.70% mIoU, 87.90% mPA, and 99.09% Acc. By integrating DM-Attention, the performance was improved to 85.98% mIoU, 91.31% mPA, and 99.11% Acc, representing gains of 4.28%, 3.41%, and 0.02%, respectively. Among the compared methods, CBAM ranked second with 84.52% mIoU and 88.83% mPA, followed by SE with 82.57% mIoU and 87.57% mPA. In contrast, COT, SK, and Triplet exhibited relatively weaker performance, with Triplet performing the worst. These results highlight the superior capacity of DM-Attention to model complex features and effectively capture local–global interactions, thereby providing robust support for precise segmentation and detection of crop diseases.
To further evaluate the effectiveness of DM-Attention, this study visualizes its regions of interest using Grad-CAM [44] and compares it with classical and efficient attention mechanisms. Figure 16a–f presents heat maps of six fruit leaves and seven disease spots.
Samples (a) and (b) show diseased areas on pomegranate and mango leaves. DM-Attention captures the full leaf contour and clearly distinguishes disease spots from leaf edges, outperforming SE, SK, Triplet, Global Context, CBAM, and CoT in targeting regions and differentiating background-similar features. Samples (c)–(e) illustrate plum, pear, and grape leaves under complex outdoor conditions with occlusion, blurred edges, and complex backgrounds; DM-Attention consistently preserves edge details and focuses on disease regions more accurately than other methods. Sample (f) shows an apple leaf affected by both spotted leaf drop disease and rust, where DM-Attention accurately highlights all characteristic regions and differentiates the two disease types. The module employs Depthwise separable and pointwise convolutions with multi-scale feature fusion, enhancing extraction efficiency and robustness under complex backgrounds and occlusion.
Figure 17a–f presents heat maps of seven disease spots, illustrating differences among attention mechanisms in segmentation. In laboratory samples (a,b), spots similar in color to the background caused blurred edges; SE, SK, CoT, and Triplet were distracted and failed to focus on targets, while CBAM located spots more accurately but missed edge details. In contrast, DM-Attention precisely localized spots and captured edges, improving segmentation accuracy.
In outdoor samples (c–e), variations in illumination and climate significantly challenged segmentation performance. Global Context failed to capture the discriminative patterns of plum red spot, while CoT and Global Context were susceptible to background interference. SE and SK exhibited insufficient local perception, limiting their ability to identify fine-grained pear leaf spots, and grape leaf spots were frequently misclassified as background, leading CoT, Triplet, and Global Context to attend to non-target regions. In sample (f), the coexistence of apple rust and spotted leaf drop increased the complexity of multi-class recognition. Triplet exhibited weak responses to rust lesions, and CoT, Global Context, and SK were adversely influenced by background noise. Overall, DM-Attention performs excellently in both laboratory and complex outdoor environments, enabling precise multi-class disease segmentation with high robustness, supporting rapid and accurate crop disease detection in smart agriculture.

3.4. Ablation Experiment

Six ablation experiments were conducted to evaluate the effectiveness of Disease-Seg, as shown in Table 9. Test1 serves as the baseline. In Test2, introducing DM-Attention improves mIoU, mPA, and Acc by 4.28%, 3.41%, and 0.02%, respectively. Test4 combines EFM and DM-Attention, raising mIoU, mPA, and Acc by 6.23%, 3.97%, and 0.35%, respectively, demonstrating the critical role of fusing local and global features in complex agricultural scenarios. Test5 incorporates FWFM on top of Test3, effectively reducing feature loss and improving performance via weighted feature fusion. Test6 represents the complete Disease-Seg model, achieving improvements of 8.62%, 5.75%, and 0.43% over the baseline, confirming its superior segmentation accuracy and computational efficiency.

3.5. Deploy Experiment

To evaluate the practical applicability of the proposed model, inference time was tested on the resource-constrained Jetson Nano platform. As shown in Table 10, Disease-Seg achieves faster inference speed than all comparison models while maintaining high segmentation accuracy. Models such as DeepLabV3+, U-Net, DANet, PSPNet, and UPerNet, despite strong performance in general semantic segmentation tasks, exhibit slower inference due to their complex architectures and higher computational demands. In contrast, Disease-Seg achieves efficient operation on low-power devices by optimizing computation and storage.
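For context, per-image inference time on such devices can be estimated with a simple warm-up-and-average loop as sketched below; this is an illustrative benchmarking helper and does not reflect the exact deployment pipeline (e.g., any TensorRT optimization or I/O handling) used in this study.

```python
import time
import torch

def measure_latency(model, input_size=(1, 3, 512, 512), runs=50, warmup=10):
    """Rough per-image inference-time measurement (e.g., on a Jetson Nano).

    Averages `runs` forward passes after `warmup` iterations; returns the
    mean latency in milliseconds per image.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):                 # warm-up iterations
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.time() - start) / runs * 1000.0   # average ms per image
```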

3.6. Heterogeneous Comparison Experiment

To investigate the impact of convolutional feature extraction depth on model performance, comparative experiments with different numbers of stacked CNN modules were conducted within a CNN-Transformer hybrid architecture. While CNNs are responsible for modeling local texture information, Transformers excel at capturing global semantic context; therefore, the depth of their feature fusion directly affects the model’s multi-scale representation capability and segmentation performance.
As shown in Table 11, increasing the number of CNN stacks from ×1 to ×4 leads to a continuous growth in model parameters and FLOPs, indicating a significant increase in network complexity with deeper convolutional structures. In terms of performance, the configuration with a single CNN layer achieves the best results, reaching an mIoU of 90.32%, an mPA of 93.65%, and an Acc of 99.52%. Further increasing the CNN depth does not yield additional performance gains; instead, segmentation accuracy degrades, particularly for the ×3 and ×4 configurations. This suggests that excessive CNN stacking may introduce feature redundancy and weaken the model’s ability to capture global semantic information.
Overall, a moderate CNN depth effectively enhances local feature representation and complements the Transformer’s global modeling capability, whereas deeper structures fail to provide meaningful performance improvements. Considering the trade-off between accuracy and computational cost, the ×1 CNN configuration achieves the optimal balance between performance and efficiency within the proposed hybrid architecture.

3.7. Experiments for Disease Severity Assessment

To clarify the process of disease severity assessment, four severity levels of fruit leaves are defined and listed in Table 12. Disease coverage, which reflects the proportion of the diseased area relative to the total leaf area, is used as a key indicator of disease spread. This metric provides an effective quantitative basis for severity evaluation and supports precise disease management. The disease rate is calculated as follows.
This grading strategy follows the general principles of field efficacy evaluation adopted in national standards, such as the National Standard of the People’s Republic of China GB/T 17980.47-2000 [45] Pesticide—Guidelines for the Field Efficacy Trials (I)—Herbicides against Weeds in Root Vegetable Fields, in which the affected area ratio is commonly used as a quantitative indicator for damage assessment and management decision-making.
Specifically, disease severity is graded by increasing coverage, with Level 1 corresponding to 0–5%, Level 2 to 5–10%, and Level 3 to 10–20% (the complete four-level grading is given in Table 12). These thresholds represent increasing degrees of disease spread and functional impairment of leaves and are consistent with established agronomic severity scales.
$$\mathrm{Disease\ Ratio} = \frac{S_{\mathrm{Disease}}}{S_{\mathrm{leaf}} + S_{\mathrm{Disease}}}$$
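As a minimal illustration of how this ratio and the severity grade can be derived from a predicted segmentation mask, the sketch below assumes hypothetical label indices for the leaf and disease classes and uses the Level 1–3 thresholds listed above; the full grading is defined in Table 12.

```python
import numpy as np

# Hypothetical label indices and the severity thresholds quoted above;
# the complete grading table is given in Table 12.
LEAF_ID, DISEASE_ID = 1, 2
LEVELS = [(0.05, "Level 1"), (0.10, "Level 2"), (0.20, "Level 3")]

def disease_severity(mask: np.ndarray):
    """Compute S_disease / (S_leaf + S_disease) from a predicted mask and
    map the ratio to a severity level."""
    s_leaf = float((mask == LEAF_ID).sum())
    s_disease = float((mask == DISEASE_ID).sum())
    ratio = s_disease / max(s_leaf + s_disease, 1.0)
    for threshold, level in LEVELS:
        if ratio <= threshold:
            return ratio, level
    return ratio, "highest level (see Table 12)"
```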
To further assess the reliability of the Disease-Seg model in predicting spot coverage, linear regression was performed between model-predicted and true spot coverage using 179 test samples. Model performance was evaluated using the correlation coefficient R2 and RMSE, as shown in Figure 18.
To test the practicality of the Disease-Seg model in fruit leaf disease segmentation, we established a fruit tree leaf disease segmentation system, as shown in Figure 19.

4. Discussion

While the Disease-Seg model achieves high inference speed and sound quantitative performance, its practical deployment in unstructured orchard environments presents several non-trivial challenges that stem from both data constraints and methodological design choices [46].
A primary limitation concerns dataset bias. For specific disease categories, such as those of mango and pomegranate, the model relies on Plant Village images, which are captured under controlled illumination with uniform, distraction-free backgrounds [47]. These laboratory-style samples lack the environmental perturbations characteristic of field settings at Jilin Agricultural University, such as specular leaf reflections following rainfall, occlusions from overlapping foliage, wind-induced leaf motion, and visual clutter caused by weeds, branches, and background debris, all of which are known to challenge visual perception systems in outdoor agricultural settings. Although the PlantVillage dataset has played a pivotal role in advancing plant disease recognition, multiple studies have demonstrated that models trained predominantly on laboratory-style imagery tend to exhibit degraded performance when transferred to real-world field environments [48]. Consequently, although the model demonstrates strong performance on standardized imagery, its generalization may deteriorate when confronted with uncontrolled orchard conditions. This gap underscores the importance of curating or synthesizing large-scale field datasets that capture real-world variability in lighting, phenology, disease progression, and weather.
Another methodological constraint is the exclusive use of static image segmentation for performance evaluation. Many operational agricultural tasks, such as robotic harvesting, drone-based disease scouting, and continuous orchard surveillance, depend on robust video analytics rather than isolated frame inference.
Additionally, the experiments focus primarily on pixel-level segmentation accuracy without addressing broader system-level concerns. For example, disease detection latency, power consumption on edge hardware, network bandwidth requirements for remote orchards, and resilience to adverse weather are increasingly relevant for real-world agricultural deployments. Even high-performing segmentation models may remain impractical if they require computational resources unavailable on embedded platforms or if their output is not easily translated into actionable decision-support tools for growers.
Beyond technical limitations, it is also important to note that plant disease symptomatology evolves over time and may vary across cultivars and climatic zones. Models trained exclusively on visual cues from specific growth stages or single geographic regions risk performance degradations when deployed in orchards with different phenological calendars, soil characteristics, or microclimates [49]. This challenge suggests a need for multi-institutional dataset collaboration, longitudinal data collection, and multi-modal sensing to improve cross-domain robustness.
To address these gaps, future work will focus on replacing laboratory-style samples with high-fidelity field datasets and evaluating the Disease-Seg architecture under authentic orchard conditions. Integrating temporal smoothing or video-based deep learning techniques offers a promising pathway toward stable real-time surveillance. In addition, deployment-oriented studies involving embedded inference, model compression, and uncertainty quantification could further bridge the divide between academic benchmarking and operational farm use [50]. Through these efforts, the Disease-Seg framework can evolve from a static segmentation tool into a reliable component within autonomous orchard management systems.

5. Conclusions

This paper presents a novel algorithm based on the Disease-Seg network for accurate segmentation of diseased fruit tree leaves. The proposed network adopts a single-layer parallel fusion architecture that extracts dense multi-scale feature representations through dual pathways, significantly enhancing the ability to characterize diverse lesion morphologies and effectively reducing missed detections. The DM-Attention module integrates the DMSM mechanism to extract hierarchical features and achieve global modeling while maintaining low computational complexity through positional information.
In addition, this study introduces a collaborative mechanism between the EFM and Transformer modules to jointly capture global and local features, enabling finer segmentation of leaf and lesion boundaries and accurate detection of small disease spots. The FWFM employs an adaptive weighting strategy to fuse shallow features containing detailed edge information with deep features rich in semantic context, thereby achieving precise feature reconstruction and semantic recovery.
Experimental results demonstrate that the proposed method outperforms existing segmentation approaches across multiple fruit leaf datasets, achieving an mIoU of 90.32% and an accuracy of 99.52%, representing improvements of 8.62% and 0.43% over the baseline, respectively. The results on the Plant Village dataset further confirm the strong generalization capability and robustness of Disease-Seg, providing effective technical support for intelligent analysis of pathological fruit leaf images. Notably, Disease-Seg achieves lower Params and FLOPs compared with most existing models, and maintains real-time inference performance on both local servers and edge devices, meeting the practical requirements of agricultural applications.
Furthermore, the model’s performance was validated on a comprehensive dataset covering six fruit species and seven disease types, and its generalization ability was further tested on two grape disease subsets from the Plant Village dataset, demonstrating strong robustness and cross-domain adaptability. Future work will focus on improving model accuracy and efficiency while expanding experimental validation to more crop leaf disease datasets, thereby promoting the practical implementation of this method in intelligent agriculture.

Author Contributions

Conceptualization, D.J., X.S. and L.C.; methodology, Y.W. and Z.L.; software, D.J. and J.C.; validation, J.L., J.C. and Z.L.; formal analysis, J.L. and Z.L.; data curation, J.C., Z.L. and J.L.; writing—original draft preparation, D.J. and Y.W.; writing—review and editing, L.C., X.S. and W.D.; supervision, Y.W. and W.D.; funding acquisition, L.C.; Resources, W.D. All authors have read and agreed to the published version of the manuscript.

Funding

This article was supported by the Jilin Provincial Science and Technology Development Plan Project for Cultivating Young and Middle-aged Scientific and Technological Innovation Talents (Teams)—Research and Application of Key Technologies for Precision Decision-Making in Agricultural Production Based on Multi Source Data (Grant No. 20250601061RC).

Data Availability Statement

The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ebrahimi, M.; Khoshtaghaza, M.H.; Minaei, S.; Jamshidi, B. Vision-based pest detection based on SVM classification method. Comput. Electron. Agric. 2017, 137, 52–58.
  2. Savary, S.; Willocquet, L.; Pethybridge, S.J.; Esker, P.; McRoberts, N.; Nelson, A. The global burden of pathogens and pests on major food crops. Nat. Ecol. Evol. 2019, 3, 430–439.
  3. Oerke, E.-C. Crop losses to pests. J. Agric. Sci. 2006, 144, 31–43.
  4. Fisher, M.C.; Henk, D.A.; Briggs, C.J.; Brownstein, J.S.; Madoff, L.C.; McCraw, S.L.; Gurr, S.J. Emerging fungal threats to animal, plant and ecosystem health. Nature 2012, 484, 186–194.
  5. Hasan, S.; Jahan, S.; Islam, M.I. Disease detection of apple leaf with combination of color segmentation and modified DWT. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 7212–7224.
  6. Zheng, J.; Yang, L.; Li, Y.; Yang, K.; Wang, Z.; Zhou, J. Lightweight Vision Transformer with Spatial and Channel Enhanced Self-Attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1492–1496.
  7. Zhang, X.; Li, F.; Jin, H.; Mu, W. Local Reversible Transformer for semantic segmentation of grape leaf diseases. Appl. Soft Comput. 2023, 143, 110392.
  8. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. arXiv 2015, arXiv:1411.4038.
  9. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; pp. 234–241.
  10. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
  11. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
  12. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  13. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
  14. Wang, Z.; Zhang, Z.; Lu, Y.; Luo, R.; Niu, Y.; Yang, X.; Jing, S.; Ruan, C.; Zheng, Y.; Jia, W. SE-COTR: A novel fruit segmentation model for green apples application in complex orchard. Plant Phenomics 2022, 2022, 0005.
  15. Tassis, L.M.; de Souza, J.E.T.; Krohling, R.A. A deep learning approach combining instance and semantic segmentation to identify diseases and pests of coffee leaves from in-field images. Comput. Electron. Agric. 2021, 186, 106191.
  16. Wang, C.; Du, P.; Wu, H.; Li, J.; Zhao, C.; Zhu, H. A cucumber leaf disease severity classification method based on the fusion of DeepLabV3+ and U-Net. Comput. Electron. Agric. 2021, 189, 106373.
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  18. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017.
  19. Dosovitskiy, A. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
  20. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090.
  21. Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829.
  22. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
  23. Salamai, A.A. Towards automated, efficient, and interpretable diagnosis coffee leaf disease: A dual-path visual transformer network. Expert Syst. Appl. 2024, 255, 124490.
  24. Thai, H.-T.; Le, K.-H.; Nguyen, N.L.-T. FormerLeaf: An efficient vision transformer for Cassava Leaf Disease detection. Comput. Electron. Agric. 2023, 204, 107518.
  25. Pacal, I. Enhancing crop productivity and sustainability through disease identification in maize leaves: Exploiting a large dataset with an advanced vision transformer model. Expert Syst. Appl. 2024, 238, 122099.
  26. Li, X.; Li, X.; Zhang, M.; Dong, Q.; Zhang, G.; Wang, Z.; Wei, P. SugarcaneGAN: A novel dataset generating approach for sugarcane leaf diseases based on lightweight hybrid CNN-Transformer network. Comput. Electron. Agric. 2024, 219, 108762. [Google Scholar] [CrossRef]
  27. Wang, Y.; Wang, S.; Ni, W.; Zeng, Q. An Instance Segmentation Method for Anthracnose Based on Swin Transformer and Path Aggregation. In Proceedings of the 2022 7th International Conference on Image, Vision and Computing (ICIVC), Xi’an, China, 26–28 July 2022; pp. 381–386. [Google Scholar]
  28. Lu, J.; Lu, B.; Ma, W.; Sun, Y. EAIS-Former: An efficient and accurate image segmentation method for fruit leaf diseases. Comput. Electron. Agric. 2024, 218, 108739. [Google Scholar] [CrossRef]
  29. Zhang, X.; Li, F.; Zheng, H.; Mu, W. UPFormer: U-sharped perception lightweight transformer for segmentation of field grape leaf diseases. Expert Syst. Appl. 2024, 249, 123546. [Google Scholar] [CrossRef]
  30. Guo, Z.; Cai, D.; Jin, Z.; Xu, T.; Yu, F. Research on unmanned aerial vehicle (UAV) rice field weed sensing image segmentation method based on CNN-transformer. Comput. Electron. Agric. 2025, 229, 109719. [Google Scholar] [CrossRef]
  31. Hughes, D.; Salathé, M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv 2015, arXiv:1511.08060. [Google Scholar]
  32. Russell, B.C.; Torralba, A.; Murphy, K.P.; Freeman, W.T. LabelMe: A database and web-based tool for image annotation. Int. J. Comput. Vis. 2008, 77, 157–173. [Google Scholar] [CrossRef]
  33. Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  34. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  35. Yuan, Y.; Chen, X.; Wang, J. Object-contextual representations for semantic segmentation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VI 16, 2020. pp. 173–190. [Google Scholar]
  36. Xiao, T.; Liu, Y.; Zhou, B.; Jiang, Y.; Sun, J. Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 418–434. [Google Scholar]
  37. MMSegmentation: Openmmlab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 13 January 2026).
  38. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  39. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 3–19. [Google Scholar]
  40. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1489–1500. [Google Scholar] [CrossRef] [PubMed]
  41. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  42. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3139–3148. [Google Scholar]
  43. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  44. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  45. GB/T 17980.47-2000; Pesticide--Guidelines for the Field Efficacy Trials(I)--Herbicides Against Weeds in Root Vegetables. Standardization Administration of China: Beijing, China, 2000.
  46. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep learning in agriculture: A survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  47. Mohanty, S.P.; Hughes, D.P.; Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 2016, 7, 215232. [Google Scholar] [CrossRef]
  48. Too, E.C.; Li, Y.; Njuki, S.; Lin, Y. A comparative study of fine-tuning deep learning models for plant disease identification. Comput. Electron. Agric. 2019, 161, 272–279. [Google Scholar] [CrossRef]
  49. Picon, A.; Alvarez-Gila, A.; Seitz, M.; Ortiz-Barredo, A.; Echazarra, J.; Johannes, A. Deep convolutional neural networks for mobile capture device-based crop disease classification in the wild. Comput. Electron. Agric. 2019, 161, 280–290. [Google Scholar] [CrossRef]
  50. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv 2015, arXiv:1510.00149. [Google Scholar]
Figure 1. Schematic overview of the research workflow.
Figure 2. Study site locations.
Figure 3. Examples of leaf disease annotations from both datasets.
Figure 4. (a) Randomly flipped image, (b) contrast-adjusted image, (c) cropped image, and (d) image with an added black mask block.
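The four augmentations illustrated in Figure 4 can be sketched as follows. This is a minimal NumPy example; the crop size, block size, and flip probability are assumed values, not the exact parameters used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_flip(img, mask):
    # (a) random horizontal flip, applied jointly to the image and its label mask
    if rng.random() < 0.5:
        img, mask = img[:, ::-1], mask[:, ::-1]
    return img, mask

def adjust_contrast(img, factor=1.3):
    # (b) contrast adjustment around the per-image mean intensity
    mean = img.mean(axis=(0, 1), keepdims=True)
    return np.clip((img - mean) * factor + mean, 0, 255).astype(img.dtype)

def random_crop(img, mask, size=384):
    # (c) random crop; the mask is cropped with the same window
    h, w = img.shape[:2]
    top, left = rng.integers(0, h - size + 1), rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size], mask[top:top + size, left:left + size]

def black_mask_block(img, block=64):
    # (d) occlude a random square region with a black block (cutout-style)
    h, w = img.shape[:2]
    top, left = rng.integers(0, h - block + 1), rng.integers(0, w - block + 1)
    out = img.copy()
    out[top:top + block, left:left + block] = 0
    return out
```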
Figure 5. Overall architecture of Disease-Seg.
Figure 6. Overall EFM architecture.
Figure 7. DM-Attention overall architecture.
Figure 8. FWFM overall architecture.
Figure 9. Performance of various classical models across different categories, where (a–f) show the visualized segmentation accuracy for the different diseases.
Figure 10. Boxplot of per-class IoU across 13 categories.
Figure 11. Samples of apple spotted leaf drop, rust, and plum red spot, where the lowercase letters (a–c) correspond to the visualization images of the uppercase ones (A–C).
Figure 12. Samples of grape white rot and pear black star, where the lowercase letters (a,b) correspond to the visualization images of the uppercase ones (A,B).
Figure 13. Samples of mango brown spot and pomegranate cercospora spot.
Figure 14. Samples of grape brown spot and black rot.
Figure 15. Performance of various models under different datasets and resolution conditions.
Figure 16. Heat map of diseased leaves for pomegranate, mango, plum, pear, grape, and apple.
Figure 17. Heat map of disease spots on pomegranate, mango, plum, pear, grape, and apple, where (1) and (2) represent rust and spotted leaf drop disease spots, respectively.
Figure 18. Linear regression analysis of disease coverage in different models.
Figure 19. Disease detection system.
Table 1. Sample dataset.
Crop | Disease | Sample
Pomegranate | Cercospora spot | Agronomy 16 00311 i001
Apple | Spotted leaf drop disease | Agronomy 16 00311 i002
Mango | Brown spot | Agronomy 16 00311 i003
Grape | White rot | Agronomy 16 00311 i004
Apple | Rust | Agronomy 16 00311 i005
Plum | Red spot disease | Agronomy 16 00311 i006
Pear | Black star disease | Agronomy 16 00311 i007
Table 2. Sample size distribution of multiple leaf disease categories in different scenarios.
Class | Scene | Noon | Evening | Total
Spotted leaf drop | Outdoor | 185 | 33 | 218
Rust | Outdoor | 163 | 41 | 204
White rot | Outdoor | 223 | 28 | 251
Black star | Outdoor | 192 | 31 | 223
Red spot | Outdoor | 222 | 36 | 258
Cercospora spot | Indoor | 261 | – | –
Brown spot | Indoor | 248 | – | –
Table 3. Detailed parameter examples for various fruit tree datasets.
Category (Disease) | Species | Origin | Images | Leaf Pixels | Spot Pixels | Resolution
Spotted leaf drop | Apple | JLAU (Field) | 218 | 16.14 M | 1.3 M | 512 × 512
Rust | Apple | JLAU (Field) | 204 | 16.14 M | 0.83 M | 512 × 512
White rot | Grape | JLAU (Field) | 251 | 28.61 M | 0.60 M | 512 × 512
Black star | Pear | JLAU (Field) | 223 | 11.84 M | 0.42 M | 512 × 512
Red spot | Plum | JLAU (Field) | 258 | 18.68 M | 0.54 M | 512 × 512
Cercospora spot | Pomegranate | Plant Village | 261 | 5.07 M | 0.28 M | 512 × 512
Brown spot | Mango | Plant Village | 248 | 11.09 M | 0.42 M | 512 × 512
Table 4. Results of validation experiments for apple rust and spotted leaf drop disease, grape white rot, and plum red spot disease (SLD: spotted leaf drop disease).
Method | Backbone | Apple Leaf IoU | Apple SLD IoU | Apple Rust IoU | Grape Leaf IoU | Grape Disease IoU | Plum Leaf IoU | Plum Disease IoU
DeepLabV3+ | Mobilenet | 95% | 53% | 76% | 94% | 56% | 95% | 67%
U-Net | Vgg | 96% | 62% | 81% | 96% | 62% | 95% | 70%
HRNetV2 | W18 | 95% | 52% | 77% | 95% | 50% | 95% | 61%
SegNeXt | Mscan | 96% | 64% | 82% | 97% | 61% | 97% | 69%
DANet | Resnet101 | 95% | 47% | 77% | 94% | 48% | 95% | 60%
OCRNet | HR-W18 | 93% | 45% | 75% | 95% | 39% | 93% | 57%
UPerNet | Resnet18 | 85% | 34% | 60% | 91% | 32% | 82% | 53%
PSPNet | Mobilenet | 95% | 40% | 71% | 94% | 36% | 94% | 34%
SegFormer | Mit b_3 | 96% | 58% | 82% | 96% | 59% | 96% | 73%
Ours | Mit b_0 | 98% | 78% | 93% | 98% | 77% | 98% | 83%
Table 5. Results of validation experiments on brown spot disease of mango, pomegranate cercospora spot, and pear black star disease.
Method | Backbone | Mango Leaf IoU | Mango Disease IoU | Pomegranate Leaf IoU | Pomegranate Disease IoU | Pear Leaf IoU | Pear Disease IoU
DeepLabV3+ | Mobilenet | 98% | 54% | 96% | 74% | 96% | 61%
U-Net | Vgg | 98% | 55% | 97% | 79% | 96% | 70%
HRNetV2 | W18 | 97% | 34% | 95% | 67% | 96% | 60%
SegNeXt | Mscan-T | 98% | 53% | 96% | 78% | 97% | 71%
DANet | Resnet101 | 97% | 42% | 95% | 72% | 96% | 55%
OCRNet | HR-W18 | 97% | 39% | 95% | 73% | 95% | 31%
UPerNet | Resnet18 | 96% | 35% | 92% | 71% | 82% | 30%
PSPNet | Mobilenet | 96% | 27% | 93% | 57% | 96% | 48%
SegFormer | Mit B_3 | 98% | 54% | 97% | 80% | 97% | 70%
Ours | Mit B_0 | 99% | 70% | 98% | 86% | 98% | 81%
Table 6. Experimental results of different models under three architectures for the dataset of this paper.
Architecture | Method | Backbone | Crop Size | mIoU | Acc | Params (M) | FLOPs (G) | FPS
CNN-based | DeepLabV3+ | Mobilenet | 512 × 512 | 78.45% | 98.64% | 5.82 | 52.96 | 72
CNN-based | U-Net | Vgg | 512 × 512 | 82.61% | 98.97% | 24.89 | 452.07 | 28
CNN-based | HRNetV2 | W32 | 512 × 512 | 83.45% | 99.28% | 29.54 | 91.11 | 38
CNN-based | PSPNet | Mobilenet | 512 × 512 | 68.72% | 73.61% | 2.37 | 6.03 | 102
CNN-based | DANet | Resnet101 | 512 × 512 | 77.02% | 98.88% | 66.47 | 289.05 | 18
Transformer-based | SegNeXt | Mscan-T | 512 × 512 | 82.87% | 99.19% | 4.23 | 6.30 | 88
Transformer-based | SegFormer | Mit b_3 | 512 × 512 | 82.53% | 99.14% | 44.64 | 42.53 | 38
CNN-Transformer | OCRNet | HR-W18 | 512 × 512 | 71.42% | 98.34% | 12.08 | 53.42 | 26
CNN-Transformer | UPerNet | Swin | 512 × 512 | 64.65% | 96.46% | 40.81 | 220.00 | 35
CNN-Transformer | Ours | Mit b_0 | 512 × 512 | 90.32% | 99.52% | 4.78 | 16.25 | 69
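The Params (M) and FPS columns of Tables 6 and 7 can be measured as sketched below, assuming a PyTorch implementation of the models; the batch size, warm-up, and iteration counts are illustrative choices rather than the exact protocol used in the paper.

```python
import time
import torch

def count_params_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions, as reported in the Params (M) column."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def benchmark_fps(model, crop=512, warmup=10, iters=100, device="cuda"):
    """Average frames per second on a single crop-sized input."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, crop, crop, device=device)
    for _ in range(warmup):          # warm-up passes so lazy initialization is excluded
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)
```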
Table 7. Experimental results of different models for three architectures on public datasets.
Architecture | Method | Backbone | Crop Size | mIoU | Acc | Params (M) | FLOPs (G)
CNN-based | DeepLabV3+ | Mobilenet | 256 × 256 | 87.71% | 97.49% | 5.82 | 13.23
CNN-based | U-Net | Vgg | 256 × 256 | 86.54% | 97.44% | 24.89 | 112.96
CNN-based | HRNetV2 | W32 | 256 × 256 | 89.23% | 98.18% | 29.54 | 22.75
CNN-based | PSPNet | Mobilenet | 256 × 256 | 74.81% | 96.18% | 2.37 | 2.15
CNN-based | DANet | Resnet101 | 256 × 256 | 28.8% | 70.68% | 66.47 | 72.26
Transformer-based | SegNeXt | Mscan-T | 256 × 256 | 62.35% | 97.14% | 4.23 | 1.56
Transformer-based | SegFormer | Mit b_3 | 256 × 256 | 76.51% | 96.77% | 44.64 | 10.63
CNN-Transformer | OCRNet | HR-W18 | 256 × 256 | 38.28% | 87.11% | 12.08 | 13.38
CNN-Transformer | UPerNet | Swin | 256 × 256 | 60.45% | 93.70% | 40.81 | 55.02
CNN-Transformer | Ours | Mit b_0 | 256 × 256 | 91.19% | 98.30% | 4.78 | 4.05
Table 8. Experimental results comparing different attentions.
Method | mIoU | mPA | Acc
SE-Attention | 82.57% | 87.57% | 99.04%
CBAM-Attention | 84.52% | 88.83% | 99.28%
COT-Attention | 71.28% | 78.89% | 96.20%
SK-Attention | 72.84% | 79.61% | 96.98%
Triplet-Attention | 69.39% | 76.87% | 95.85%
Global Context | 67.51% | 75.46% | 94.94%
Base | 81.70% | 87.90% | 99.09%
DM-Attention | 85.98% | 91.31% | 99.11%
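As a point of reference for Table 8, the SE-Attention baseline [38] reweights feature channels with a squeeze-and-excitation gate; a minimal PyTorch sketch is given below, with the reduction ratio chosen as an assumed default.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Squeeze-and-Excitation channel attention [38], one of the baselines compared in Table 8."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling, then excitation MLP
        return x * w.view(b, c, 1, 1)     # reweight each channel of the input feature map
```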
Table 9. Results of ablation experiments.
Test | Base | DM | EFM | Fusion | mIoU (%) | mPA (%) | Acc (%)
Test1 | ✓ | – | – | – | 81.70 | 87.90 | 99.09
Test2 | ✓ | ✓ | – | – | 85.98 | 91.31 | 99.11
Test3 | ✓ | – | ✓ | – | 86.27 | 90.28 | 99.28
Test4 | ✓ | ✓ | ✓ | – | 87.93 | 91.87 | 99.44
Test5 | ✓ | ✓ | – | ✓ | 87.21 | 91.55 | 99.38
Test6 | ✓ | ✓ | ✓ | ✓ | 90.32 | 93.65 | 99.52
Table 10. Experimental results for testing inference speed in edge devices.
Method | Backbone | Inference Speed (ms)
DeepLabV3+ | Mobilenet | 52
U-Net | Vgg | 129
HRNetV2 | W32 | 54
DANet | Resnet101 | 150
OCRNet | HR-W18 | 70
UPerNet | Swin | 108
PSPNet | Mobilenet | 22
SegFormer | Mit b_3 | 105
Ours | Mit b_0 | 49
Table 11. Results of comparative experiments at different depths.
Depth | mIoU | mPA | Acc | Params | FLOPs
Stage1 | 90.32% | 93.65% | 99.52% | 4.78 M | 16.25 G
Stage2 | 87.50% | 91.12% | 99.21% | 4.87 M | 67.71 G
Stage3 | 88.16% | 91.70% | 99.31% | 5.49 M | 73.27 G
Stage4 | 88.66% | 92.10% | 99.33% | 7.08 M | 77.05 G
Table 12. Results of disease severity assessment.
Original Image | Visualized Image | Category | Value | Ratio | Disease Proportion | Level
Agronomy 16 00311 i008 | Agronomy 16 00311 i009 | Background | 165,758 | 63.23% | 6.36% | Level 2
 | | Leaf | 90,249 | 34.43% | |
 | | Disease spots | 6137 | 2.34% | |
Agronomy 16 00311 i010 | Agronomy 16 00311 i011 | Background | 160,851 | 61.36% | 6.19% | Level 2
 | | Leaf | 95,017 | 36.25% | |
 | | Disease spots | 6276 | 2.39% | |
Agronomy 16 00311 i012 | Agronomy 16 00311 i013 | Background | 176,920 | 67.49% | 7.51% | Level 2
 | | Leaf | 78,818 | 30.07% | |
 | | Disease spots | 6406 | 2.44% | |
Agronomy 16 00311 i014 | Agronomy 16 00311 i015 | Background | 132,128 | 50.40% | 0.37% | Level 1
 | | Leaf | 129,531 | 49.41% | |
 | | Disease spots | 485 | 0.19% | |
Agronomy 16 00311 i016 | Agronomy 16 00311 i017 | Background | 145,951 | 55.68% | 0.48% | Level 1
 | | Leaf | 115,628 | 44.11% | |
 | | Disease spots | 565 | 0.22% | |
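In Table 12, the disease proportion equals the spot pixel count divided by the leaf-plus-spot pixel count (e.g., 6137 / (90,249 + 6137) ≈ 6.36%), and the severity level is assigned from this proportion. A minimal sketch of the computation follows, assuming class indices 0/1/2 for background/leaf/spot; the level thresholds shown are illustrative, not those of the grading standard used in the paper.

```python
import numpy as np

# class indices assumed for the predicted segmentation label map
BACKGROUND, LEAF, SPOT = 0, 1, 2

# illustrative severity thresholds on the disease proportion (upper bound, level name)
LEVELS = [(0.01, "Level 1"), (0.10, "Level 2"), (0.25, "Level 3"), (1.00, "Level 4")]

def assess_severity(mask: np.ndarray) -> dict:
    """Compute per-class pixel ratios and the disease proportion spot / (leaf + spot)."""
    total = mask.size
    counts = {c: int((mask == c).sum()) for c in (BACKGROUND, LEAF, SPOT)}
    proportion = counts[SPOT] / max(counts[LEAF] + counts[SPOT], 1)
    level = next(name for upper, name in LEVELS if proportion <= upper)
    return {
        "ratios": {c: counts[c] / total for c in counts},
        "disease_proportion": proportion,
        "level": level,
    }

# toy 512 x 512 mask, matching the pixel total of the samples in Table 12 (262,144 pixels)
mask = np.random.randint(0, 3, size=(512, 512))
print(assess_severity(mask))
```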
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
