Multi-Scale Feature Rectification for Crop Leaf Disease Segmentation in Complex Scenarios

Gao, Bingpeng; Nie, Huishan; Du, Tiantian; Cai, Xin

doi:10.3390/horticulturae12050640

¹ School of Intelligence Science and Technology, Xinjiang University, Urumqi 830017, China

² School of Electrical Engineering, Xinjiang University, Urumqi 830017, China

* Author to whom correspondence should be addressed.

Horticulturae 2026, 12(5), 640; https://doi.org/10.3390/horticulturae12050640

Submission received: 18 March 2026 / Revised: 15 May 2026 / Accepted: 17 May 2026 / Published: 21 May 2026

(This article belongs to the Special Issue Advances in Digital Technologies for Precision Horticultural Crop Production)

Abstract

Crop leaf disease segmentation in complex natural environments remains challenging because lesion regions often exhibit substantial scale variation, blurred boundaries, and severe background interference. To address these issues, this study proposes a Multi-Scale Feature Rectification Network (MFR-Net) for crop leaf disease segmentation. The proposed network adopts an EfficientNetV2-S-based encoder to extract hierarchical features, incorporates a hybrid attention mechanism to enhance lesion-sensitive spatial and channel representations, introduces a Cross-Window Atrous Spatial Pyramid Pooling (CWASPP) module to strengthen multi-scale contextual modeling, and employs a Feature Rectification Module (FRM) in the decoder to alleviate semantic inconsistency during cross-level feature fusion. Experiments on a Kaggle-derived benchmark constructed from the unaugmented data folder of the public Leaf Disease Segmentation Dataset, containing 588 diseased-leaf images and 588 corresponding binary lesion masks, showed that MFR-Net achieved the highest mIoU of 74.27% and the highest Recall of 87.61% among the compared methods, and maintained competitive Dice performance (84.25%) with 25.10 M parameters and 37.55 G FLOPs. Ablation results further confirmed the effectiveness of the proposed design, with CWASPP providing the most notable individual contribution. Additional experiments were conducted on an independent Apple Leaf Dataset comprising 3197 image–mask pairs, collected under mixed controlled and natural field-like imaging conditions. The results showed competitive performance under a different data distribution, and robustness evaluation further verified stable performance under severe noise, blur, darkness, and contrast variation. All experiments were implemented in PyTorch 2.11.0 (CUDA 12.8) on a workstation equipped with an NVIDIA GeForce RTX 4060 Ti GPU (8 GB). 
These results indicate that MFR-Net provides an effective and robust solution for crop leaf disease segmentation in complex agricultural scenarios.

Keywords: crop leaf disease segmentation; semantic segmentation; MFR-Net; multi-scale feature rectification; hybrid attention; CWASPP; agricultural image analysis

1. Introduction

Crop diseases are among the major factors limiting agricultural productivity and crop quality worldwide. Early and accurate identification of disease-affected regions on crop leaves is therefore essential for disease monitoring, precision intervention, and intelligent field management [1]. Compared with manual inspection, image-based disease analysis may provide a more efficient and objective solution and has become an important research direction in smart agriculture [2]. However, crop leaf disease segmentation in practical agricultural environments remains highly challenging because lesion regions often exhibit substantial scale variation, irregular morphology, blurred boundaries, and low contrast with surrounding healthy tissues [3]. For example, anthracnose of black pepper (Piper nigrum) caused by Colletotrichum siamense was reported to initially appear as chlorotic circular spots that later coalesced into larger irregular lesions [4]. Similarly, leaf spot of peanut (Arachis hypogaea) caused by Nigrospora oryzae showed early symptoms such as small brown circular or irregular spots that enlarged and were surrounded by chlorotic halos [5]. In addition, complex backgrounds, uneven illumination, leaf overlap, and occlusion further increase the difficulty of accurately extracting diseased regions, especially under natural field conditions [6,7,8,9,10].

Early disease segmentation and analysis methods mainly relied on handcrafted features and conventional image processing techniques, including thresholding, clustering, saliency analysis, and related segmentation strategies [11,12,13,14,15,16]. Although these approaches are computationally efficient and relatively easy to implement, their performance is often sensitive to environmental conditions and image quality. When lesions are small, scattered, or visually similar to the surrounding background, such methods usually fail to capture sufficiently robust discriminative features, resulting in limited segmentation accuracy and weak generalization ability [17,18,19].

With the rapid development of deep learning, convolutional neural networks (CNNs) have demonstrated strong capability in plant disease analysis and have significantly advanced disease identification, localization, and segmentation tasks [20]. Representative architectures such as DeepLabV3+, U-Net, and their variants have achieved promising results by learning hierarchical feature representations directly from data [21,22,23,24,25,26,27,28]. These methods can model semantic information more effectively than traditional approaches and have shown superior performance in many crop disease analysis tasks. Nevertheless, several challenges remain unresolved in complex natural scenarios. First, lesion regions vary greatly in size, from tiny scattered spots to large continuous infected areas, which makes it difficult for a single-scale representation to capture complete disease characteristics. Second, background clutter and interference from veins, shadows, and specular reflections may cause false responses and reduce feature discriminability. Third, although skip connections in encoder–decoder networks help preserve spatial details, low-level and high-level features often exhibit semantic inconsistency, which may lead to inaccurate fusion and suboptimal boundary delineation.

Existing plant disease segmentation methods have explored multi-scale fusion, attention enhancement, and encoder–decoder refinement, yet these strategies are often introduced as relatively independent architectural add-ons rather than being designed to address the coupled failure modes of complex natural scenes. In practical crop disease images, substantial lesion-scale variation weakens single-scale representation, background clutter and illumination disturbance reduce lesion discriminability, and direct fusion of low-level and high-level features often introduces semantic inconsistency near lesion boundaries. Moreover, field-oriented agricultural vision studies have shown that model reliability is further influenced by illumination heterogeneity, occlusion, spectral variability, and image degradation, highlighting the need to evaluate robustness and cross-domain generalization in addition to in-domain accuracy. To address these challenges, this study proposes a Multi-Scale Feature Rectification Network (MFR-Net) for crop leaf disease segmentation in complex scenarios. The proposed framework is built around three coordinated objectives: lesion-sensitive encoding through stage-aware hybrid attention, multi-scale contextual aggregation through the Cross-Window Atrous Spatial Pyramid Pooling (CWASPP) module, and cross-level semantic alignment through Feature Rectification Module (FRM)-guided rectification before feature fusion. Based on an EfficientNetV2-S encoder, the network integrates hybrid attention, CWASPP, and FRM to improve hierarchical feature representation, contextual modeling, and decoder fusion consistency. Experimental results on a Kaggle-derived benchmark and an independent Apple Leaf Dataset demonstrate that the proposed network achieves strong segmentation accuracy, competitive cross-dataset generalization ability, and stable robustness under simulated environmental disturbances.

2. Materials and Methods

2.1. Dataset and Experimental Protocol

2.1.1. Data Sources

Two datasets were used in this study: a Kaggle-derived benchmark for the main experiments and a curated supplementary Apple Leaf Dataset for cross-dataset evaluation. The main benchmark was derived from the public Kaggle dataset, Leaf Disease Segmentation Dataset [29], released by Fakhre Alam and available at the Kaggle dataset page. The released dataset contains two folders, namely data and aug_data. In the present study, only the unaugmented data folder was used. According to the dataset description, this folder contains 588 diseased-leaf images and 588 corresponding masks, and the data collection is based on PlantDoc images [30]. The augmented aug_data folder was not used in this study. The dataset description lists several representative crop–disease examples, such as apple scab leaf, apple rust leaf, bell_pepper leaf spot, corn leaf blight, and potato leaf early blight. However, the public release does not provide a complete class-level annotation file or an exhaustive list of crop and disease labels for all 588 retained image–mask pairs. Therefore, to avoid introducing unsupported category statistics, this study did not use crop or disease class labels for supervised learning, model evaluation, or disease-wise statistical analysis. Instead, the dataset was used strictly as a binary lesion segmentation benchmark, in which diseased regions were treated as foreground and all remaining pixels were treated as background. Based on the retained 588 unaugmented image–mask pairs, a fixed split of 470 training images, 59 validation images, and 59 test images was constructed for all main experiments. The retained benchmark contains diverse lesion appearances, with variations in lesion scale, morphology, boundary clarity, illumination conditions, image resolution, and background complexity, making it suitable for evaluating binary leaf disease segmentation under realistic agricultural imaging conditions.
Figure 1 presents the representative original images and corresponding binary masks from the two datasets used in this study.

In addition, a curated supplementary Apple Leaf Dataset was used for cross-dataset evaluation. The images were collected from publicly available sources, including PlantVillage, AppleLeaf9-main, and several additional public apple leaf image sets, whereas all lesion masks were manually annotated in this study. After filename-based image–mask matching and validity checking, 3197 image–mask pairs were retained. The dataset contained apple leaf images from four disease classes: apple scab, apple black rot, cedar-apple rust, and Alternaria leaf spot. The images were collected under mixed controlled and natural field-like imaging conditions. A fixed train–validation–test split of 2557/320/320 image–mask pairs was used for this dataset. Because this dataset was assembled from multiple public sources and unified only for binary lesion segmentation, it was used solely as an external supplementary benchmark for cross-dataset generalization rather than for in-domain model selection or disease-category classification.
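The filename-based image–mask matching mentioned above can be illustrated with a short sketch. It is not the authors' script: the function name `match_pairs`, the example filenames, and the rule that an image and its mask correspond when their filename stems are identical are all assumptions made for illustration.

```python
from pathlib import PurePath


def match_pairs(image_names, mask_names):
    """Pair images and masks whose filename stems are identical.

    Assumption: an image and its mask share the same stem
    (e.g. 'leaf_001.jpg' and 'leaf_001.png'). Files without a
    counterpart are returned separately so they can be excluded.
    """
    masks_by_stem = {PurePath(name).stem: name for name in mask_names}
    pairs, unmatched = [], []
    for image in sorted(image_names):
        mask = masks_by_stem.pop(PurePath(image).stem, None)
        if mask is None:
            unmatched.append(image)
        else:
            pairs.append((image, mask))
    # Masks still in the dictionary have no corresponding image.
    unmatched.extend(sorted(masks_by_stem.values()))
    return pairs, unmatched


# Hypothetical filenames, used only to exercise the function.
pairs, unmatched = match_pairs(
    ["leaf_001.jpg", "leaf_002.jpg", "leaf_003.jpg"],
    ["leaf_001.png", "leaf_003.png", "leaf_999.png"],
)
```

In this toy run, two pairs are retained and the two files without a counterpart are reported for exclusion.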

2.1.2. Data Partitioning and Experimental Protocols

A unified train–validation–test protocol was adopted for all experiments. For the Kaggle-derived benchmark, only the 588 unaugmented image–mask pairs from the data folder of the public Leaf Disease Segmentation Dataset were used. These image–mask pairs were divided into fixed training, validation, and test subsets at the original-image level. Because the augmented aug_data folder released with the public dataset was not included in this study, augmented duplicates from the public release could not appear across different subsets. This strategy reduced the risk of train–test leakage caused by duplicated visual content and ensured a fair evaluation of model generalization.

For the Kaggle-derived benchmark, the fixed split contained 470 training images, 59 validation images, and 59 test images. The same split was consistently used for the main quantitative comparison, the ablation study, the qualitative visual analysis, the robustness evaluation, and the supplementary repeated-run experiments. This unified protocol ensured that all reported results were directly comparable across different models and experimental settings.

The independent Apple Leaf Dataset was not merged with the Kaggle-derived benchmark. Instead, it was treated as a separate supplementary dataset with a fixed train–validation–test split of 2557/320/320 image–mask pairs. The same split was used consistently for all apple-dataset experiments, and the results were reported independently to examine cross-dataset generalization under a different data distribution and annotation style. This design avoided potential interference between heterogeneous data sources and allowed the supplementary experiments to serve as an external validation of segmentation performance.
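A fixed split of this kind can be reproduced with a seeded shuffle. The sketch below is only one possible realization: the 80/10/10 ratio, the integer rounding rule, and the seed are assumptions (the text reports the subset sizes but not how they were drawn). Under these assumptions the subset sizes coincide with the reported 470/59/59 and 2557/320/320 counts.

```python
import random


def fixed_split(items, seed=0):
    """Deterministically split items into train/val/test subsets.

    Assumptions: an 80/10/10 ratio computed with integer arithmetic
    and a seeded shuffle; neither is stated in the text.
    """
    items = sorted(items)                # canonical order before shuffling
    random.Random(seed).shuffle(items)   # same seed gives the same split
    n = len(items)
    n_train = n * 8 // 10
    n_val = (n - n_train) // 2
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Integer indices stand in for image-mask pairs.
kaggle = fixed_split(range(588))
apple = fixed_split(range(3197))
kaggle_sizes = [len(subset) for subset in kaggle]
apple_sizes = [len(subset) for subset in apple]
```

Because the shuffle is seeded and the input is sorted first, repeated calls return identical subsets, which is what allows the same split to be reused across comparison, ablation, and robustness experiments.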

Unless otherwise stated, all models were trained and evaluated under the same experimental settings. The input image resolution was fixed at 512 × 512, the same training schedule and early stopping strategy were used, the decision threshold was fixed at 0.5, and no test-time augmentation was applied. For robustness evaluation, all simulated disturbances were introduced only at test time, while the training protocol and model selection strategy remained unchanged. This setting ensured that performance differences could be attributed primarily to the model architectures rather than to inconsistent experimental conditions.
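The fixed decision threshold can be made concrete with a small example that binarizes a probability map at 0.5 and scores it against a reference mask. This is a hedged sketch: the function names are hypothetical, treating a probability of exactly 0.5 as lesion is an assumption, and the foreground-only IoU, Dice, and Recall below may differ from the exact metric definitions used in the experiments (for example, mIoU may average over lesion and background classes).

```python
def binarize(prob_map, threshold=0.5):
    """Convert a probability map (nested lists) into a 0/1 mask.

    Assumption: values equal to the threshold count as lesion.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


def lesion_metrics(pred, target):
    """Foreground IoU, Dice, and Recall from two 0/1 masks."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            tp += int(p == 1 and t == 1)
            fp += int(p == 1 and t == 0)
            fn += int(p == 0 and t == 1)
    # max(..., 1) avoids division by zero; empty cases then score 0.0
    # (a convention chosen here, not taken from the text).
    return {
        "iou": tp / max(tp + fp + fn, 1),
        "dice": 2 * tp / max(2 * tp + fp + fn, 1),
        "recall": tp / max(tp + fn, 1),
    }


prob = [[0.9, 0.6, 0.2],
        [0.4, 0.7, 0.1]]
target = [[1, 0, 0],
          [1, 1, 0]]
pred = binarize(prob)
scores = lesion_metrics(pred, target)
```

Here two lesion pixels are recovered, one background pixel is falsely marked, and one lesion pixel is missed, giving an IoU of 0.5 and Dice and Recall of 2/3.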

2.1.3. Preprocessing and Data Augmentation

All images and masks were resized to 512 × 512 before training and evaluation. The masks were converted into binary lesion masks, in which diseased regions were treated as the foreground and all remaining pixels were treated as the background. This binary formulation was consistently used for both datasets.

It should be noted that the augmented aug_data folder released with the Kaggle dataset was not used in this study. To improve training diversity and preserve evaluation fairness, only online augmentation was applied to the training subset after the fixed data split had been constructed. The augmentation operations included random horizontal flipping, rotation, scaling, and brightness/contrast perturbation, followed by padding and center cropping when necessary. No augmentation was applied to the validation or test subsets.

Because the augmentation was performed online during data loading rather than by using the pre-generated augmented files from the public Kaggle release, it did not change the nominal number of training samples. Therefore, the Kaggle-derived benchmark still contained 470 training image–mask pairs per epoch. During training, stochastic transformations generated different augmented views of these samples across iterations and epochs, which increased the effective visual diversity of the training data while keeping the validation and test protocols strictly unchanged.
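The idea of online augmentation can be sketched with a toy dataset wrapper. The class and function names, the flip probability, and the brightness range are hypothetical, and only two of the listed operations (horizontal flipping and a brightness shift) are shown. The points being illustrated are that geometric transforms are applied identically to the image and its mask, that photometric changes touch only the image, and that the dataset length is unaffected.

```python
import random


def hflip(grid):
    """Mirror a 2D nested list left to right."""
    return [row[::-1] for row in grid]


def augment_pair(image, mask, rng, flip_prob=0.5, max_shift=0.2):
    """Apply one random view: a shared flip, then an image-only brightness shift."""
    if rng.random() < flip_prob:
        image, mask = hflip(image), hflip(mask)   # same geometry for both
    shift = rng.uniform(-max_shift, max_shift)
    image = [[min(1.0, max(0.0, v + shift)) for v in row] for row in image]
    return image, mask


class OnlineAugmentedDataset:
    """Returns a freshly augmented view on every access."""

    def __init__(self, pairs, seed=0):
        self.pairs = pairs
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.pairs)   # nominal size is unchanged by augmentation

    def __getitem__(self, index):
        image, mask = self.pairs[index]
        return augment_pair(image, mask, self.rng)


image = [[0.1, 0.5, 0.9]]   # toy 1 x 3 grayscale image in [0, 1]
mask = [[0, 0, 1]]          # lesion at the brightest pixel
dataset = OnlineAugmentedDataset([(image, mask)])
views = [dataset[0] for _ in range(20)]
```

Every view keeps the lesion label aligned with the pixel it annotates, while the stored sample and the dataset length stay the same.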

Before model training, all image–mask pairs were checked for filename correspondence and basic validity. Unmatched or invalid pairs were excluded from further analysis. Masks stored in different formats were converted into a unified binary lesion format, in which nonzero pixels were treated as lesion regions. When the spatial size of a mask did not match that of its corresponding image, the mask was first aligned to the image size using nearest-neighbor interpolation to avoid label contamination during subsequent preprocessing.
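The reason for choosing nearest-neighbor interpolation for masks can be seen in a small sketch: copying the nearest source pixel never creates intermediate label values, so the mask stays strictly binary after resizing. The function names and the floor-based index mapping are assumptions (libraries differ in how they place pixel centers).

```python
def to_binary_mask(mask):
    """Treat every nonzero pixel as lesion (1) and zero as background (0)."""
    return [[1 if value != 0 else 0 for value in row] for row in mask]


def resize_nearest(mask, out_h, out_w):
    """Nearest-neighbor resize of a 2D nested list.

    Assumption: output pixel (y, x) copies input pixel
    (floor(y * in_h / out_h), floor(x * in_w / out_w)).
    """
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


raw = [[0, 255],
       [128, 0]]                 # toy mask stored with 8-bit gray values
binary = to_binary_mask(raw)
aligned = resize_nearest(binary, 4, 4)
```

A bilinear resize of the same mask would produce fractional values along lesion borders, which is the label contamination the preprocessing step is meant to avoid.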

2.1.4. Roles of the Two Datasets in This Study

The two datasets served different purposes in the experimental design. The Kaggle-derived benchmark, constructed from the 588 unaugmented image–mask pairs in the data folder of the public Leaf Disease Segmentation Dataset, was used as the primary dataset for model development and systematic evaluation, including the main comparison experiments, component ablation analysis, qualitative visual comparison, robustness assessment under simulated environmental disturbances, and supplementary repeated-run experiments. In contrast, the independent Apple Leaf Dataset was used only for supplementary evaluation of cross-dataset generalization.

This dual-dataset design allowed the proposed method to be assessed from both an in-domain perspective and an out-of-domain perspective. The former reflects performance under the main benchmark setting, whereas the latter provides additional evidence regarding the stability and transferability of the proposed model under a different data distribution.

2.2. Overall Architecture of MFR-Net

As shown in Figure 2, the proposed Multi-Scale Feature Rectification Network (MFR-Net) adopts an encoder–decoder architecture for binary crop leaf disease segmentation. The network consists of three main parts: a hierarchical feature extraction encoder, a multi-scale context enhancement bottleneck, and a progressive feature restoration decoder. Given an input crop leaf image of size 512 × 512 × 3, the network outputs a binary lesion segmentation mask. The overall design aims to improve lesion localization, multi-scale representation, and boundary recovery in complex natural scenes while maintaining moderate computational cost.

In the encoding stage, EfficientNetV2-S is employed as the backbone feature extractor. The encoder contains four major stages. The first two stages are based on Fused-MBConv blocks and mainly extract shallow features with relatively high spatial resolution, which are useful for representing local texture, fine lesion appearance, and boundary details. The latter two stages are based on MBConv blocks and are used to capture deeper semantic representations with larger receptive fields. To improve discriminative representation under complex agricultural backgrounds, a Coordinate Attention module is inserted after the third-stage feature extraction to enhance directional sensitivity and spatial localization ability, whereas a Convolutional Block Attention Module (CBAM) is introduced after the fourth-stage feature extraction to further recalibrate high-level semantic features in both channel and spatial dimensions.

At the bottleneck stage, the deepest encoder feature is fed into the proposed Cross-Window Atrous Spatial Pyramid Pooling (CWASPP) module. This module is designed to strengthen multi-scale contextual modeling for lesion regions with substantial size variation. In practical crop disease images, lesions may appear as tiny scattered spots or as large continuous diseased regions. Therefore, the bottleneck representation should simultaneously capture local lesion structures and broader contextual dependencies. CWASPP enhances this capability by aggregating contextual information from multiple receptive fields together with window-based feature enhancement, thereby producing a more discriminative and scale-aware feature representation for subsequent decoding.

In the decoding stage, the network progressively restores spatial resolution through a series of upsampling blocks. Unlike conventional encoder–decoder frameworks that directly concatenate encoder and decoder features, MFR-Net introduces a Feature Rectification Module (FRM) in the skip connection pathway. Before fusion, encoder features are rectified and aligned with the corresponding decoder features to reduce semantic inconsistency between low-level detail features and high-level semantic features. This process helps suppress irrelevant background responses during fusion and improves the recovery of lesion structures and boundaries.

More specifically, the deepest enhanced feature first enters the bottom decoder block and is then progressively upsampled to higher-resolution stages. At each stage, the rectified encoder feature provides complementary structural information for feature recovery. High-resolution encoder features mainly contribute boundary and texture cues, whereas lower-resolution encoder features provide intermediate structural and semantic information. Through FRM-guided fusion, these complementary cues are more effectively integrated with the upsampled semantic features, improving segmentation continuity and reducing boundary ambiguity. After the final decoding stage, the restored feature is projected by a terminal convolution layer to generate a single-channel lesion probability map, from which the final binary segmentation mask is obtained.

Overall, MFR-Net follows a coordinated design strategy of enhanced discriminative encoding, multi-scale bottleneck aggregation, and rectified cross-level decoding. The encoder improves lesion-related feature representation through hierarchical extraction and hybrid attention enhancement, the bottleneck increases sensitivity to lesion scale variation through CWASPP, and the decoder improves spatial recovery through FRM-guided feature rectification. Through the cooperation of these components, the proposed network is expected to achieve more robust lesion segmentation in complex scenes characterized by scale variation, background clutter, and blurred lesion boundaries.

2.3. EfficientNetV2-S Encoder with Hybrid Attention

To simultaneously preserve fine lesion details and enhance high-level semantic discrimination under complex background conditions, EfficientNetV2-S was adopted as the encoder backbone of MFR-Net, together with a hybrid attention mechanism composed of Coordinate Attention and CBAM. This design aims to exploit the efficient hierarchical representation capability of EfficientNetV2-S while further improving lesion localization and background suppression in challenging agricultural scenes.

2.3.1. EfficientNetV2-S Backbone Encoder

In crop leaf disease segmentation, the network is required to identify not only large continuous lesion regions but also fine-grained patterns such as scattered spots, edge erosion, and subtle texture abnormalities. Therefore, the encoder should provide strong hierarchical representation capability while maintaining moderate model complexity. EfficientNetV2-S offers a favorable balance between representation capacity and computational efficiency through its optimized Fused-MBConv and MBConv structures, making it suitable for the proposed framework.

As shown in Figure 3, the encoder backbone in MFR-Net is organized into four major stages. The first two stages are built upon Fused-MBConv blocks and mainly extract relatively high-resolution features, which are beneficial for preserving local texture, color transitions, and lesion boundary details. The latter two stages are based on MBConv blocks and are used to capture deeper semantic representations with larger receptive fields. Through this hierarchical extraction process, the encoder produces multi-level features ranging from shallow detail-rich representations to deep semantic representations, thereby establishing an effective basis for subsequent context enhancement and cross-level feature fusion.

2.3.2. Coordinate Attention

In practical agricultural images, diseased regions are frequently affected by distracting factors such as leaf veins, shadows, specular reflections, and overlapping structures. Conventional channel attention can enhance salient channel responses, but its spatial sensitivity is often limited. To improve spatial localization while preserving directional information, a Coordinate Attention module was introduced after the third encoder stage, where the feature map already contains intermediate semantic abstraction while still retaining substantial spatial structure.

The key idea of Coordinate Attention is to decompose spatial encoding into two one-dimensional aggregation processes along the horizontal and vertical directions. Let the input feature map be denoted by X \in ℝ^{C \times H \times W}. Its aggregated descriptors along the two coordinate directions can be written as

z_{c}^{h} (h) = \frac{1}{W} \sum_{i = 1}^{W} X_{c} (h, i),

(1)

z_{c}^{w} (w) = \frac{1}{H} \sum_{j = 1}^{H} X_{c} (j, w),

(2)

where z^h and z^w represent the directional context descriptors along the horizontal and vertical axes, respectively. These descriptors are then jointly transformed to generate direction-aware attention weights:

A^{h}, A^{w} = δ (f ([z^{h}, z^{w}])),

(3)

where f(⋅) denotes the feature transformation operation and δ(⋅) denotes the activation function. The refined output feature can then be expressed as

Y = X ⊙ A^{h} ⊙ A^{w},

(4)

where ⊙ denotes element-wise multiplication.
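Equations (1), (2), and (4) can be traced numerically on a single channel. The sketch below is a simplification under stated assumptions: the learned transformation f in Equation (3) is replaced by the identity and δ is taken to be a sigmoid, so it shows only the directional pooling and the broadcast reweighting, not a trained Coordinate Attention layer. The function name is hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention_single_channel(x):
    """x is an H x W nested list holding one channel X_c."""
    height, width = len(x), len(x[0])
    # Eq. (1): average over the width for each row h.
    z_h = [sum(row) / width for row in x]
    # Eq. (2): average over the height for each column w.
    z_w = [sum(x[j][w] for j in range(height)) / height for w in range(width)]
    # Eq. (3), simplified: f is the identity and delta is a sigmoid (assumption).
    a_h = [sigmoid(v) for v in z_h]
    a_w = [sigmoid(v) for v in z_w]
    # Eq. (4): broadcast the two 1D attention vectors over the 2D map.
    y = [[x[h][w] * a_h[h] * a_w[w] for w in range(width)] for h in range(height)]
    return z_h, z_w, y


x = [[1.0, 3.0],
     [0.0, 0.0],
     [2.0, 2.0]]   # toy 3 x 2 channel
z_h, z_w, y = coordinate_attention_single_channel(x)
```

Each output value is scaled by one weight that depends only on its row and one that depends only on its column, which is how positional information along both axes is retained.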

By enhancing the intermediate encoder representation with directional attention, the model is encouraged to focus more accurately on elongated, scattered, or irregular lesion patterns while suppressing spatially distributed background interference.

2.3.3. Convolutional Block Attention Module

As the encoder becomes deeper, the extracted features become increasingly semantic, but they may still be influenced by background texture, dark regions, and overlapping leaf structures. To further recalibrate the deep semantic representation before context aggregation, a Convolutional Block Attention Module (CBAM) was introduced after the fourth encoder stage, as shown in Figure 3.

Given an input feature map F \in ℝ^{C \times H \times W}, CBAM first performs channel attention refinement, which can be expressed as

F^{'} = M_{c} (F) ⊙ F,

(5)

where M_c(F) denotes the channel attention map. Then, spatial attention is applied to the channel-refined feature:

F^{″} = M_{s} (F^{'}) ⊙ F^{'},

(6)

where M_s(F′) denotes the spatial attention map. The final output F″ therefore incorporates both channel-wise and spatially selective enhancement.

In MFR-Net, CBAM is placed after the deepest encoder stage to refine high-level semantic features before they are sent to the CWASPP bottleneck. The channel attention branch strengthens channels that are more relevant to lesion semantics and suppresses irrelevant high-level responses, while the spatial attention branch further guides the model to focus on lesion-dominant regions at the coarse feature resolution. This design enables the network to obtain a cleaner and more discriminative deep representation for subsequent multi-scale context modeling.
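The sequential structure of Equations (5) and (6), channel reweighting followed by spatial reweighting, can be illustrated with parameter-free stand-ins for the learned attention maps. This is explicitly not CBAM as trained: the shared MLP in the channel branch and the convolution in the spatial branch are omitted, and the pooled statistics are simply summed and passed through a sigmoid. All function names are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(f):
    """Stand-in for M_c: one weight per channel from avg- and max-pooled values."""
    weights = []
    for channel in f:
        values = [v for row in channel for v in row]
        weights.append(sigmoid(sum(values) / len(values) + max(values)))
    return weights


def spatial_attention(f):
    """Stand-in for M_s: one weight per pixel from the channel-wise mean and max."""
    channels, height, width = len(f), len(f[0]), len(f[0][0])
    attention = []
    for h in range(height):
        row = []
        for w in range(width):
            column = [f[c][h][w] for c in range(channels)]
            row.append(sigmoid(sum(column) / channels + max(column)))
        attention.append(row)
    return attention


def cbam_like(f):
    m_c = channel_attention(f)
    # Eq. (5): F' = M_c(F) ⊙ F, each channel weight broadcast over H x W.
    f1 = [[[m_c[c] * v for v in row] for row in channel]
          for c, channel in enumerate(f)]
    m_s = spatial_attention(f1)
    # Eq. (6): F'' = M_s(F') ⊙ F', the spatial map broadcast over channels.
    f2 = [
        [[m_s[h][w] * v for w, v in enumerate(row)] for h, row in enumerate(channel)]
        for channel in f1
    ]
    return f1, f2


feature = [[[1.0, 0.0], [0.0, 0.0]],
           [[0.0, 2.0], [0.0, 0.0]]]   # toy C=2, H=2, W=2 feature map
f1, f2 = cbam_like(feature)
```

The output keeps the input shape, and because the spatial map is computed from the channel-refined feature F′ rather than from F, the order of the two steps matters.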

2.3.4. Cooperative Role of the Hybrid Attention Mechanism

The proposed attention design does not simply stack multiple attention modules at the same feature level. Instead, different attention mechanisms are assigned to different encoder stages according to their semantic depth and spatial resolution. Coordinate Attention is used to enhance the intermediate feature representation, where preserving positional information is particularly important for lesion localization, whereas CBAM is used to refine the deepest feature representation, where semantic discrimination and background suppression become more critical.

This stage-aware configuration provides two major advantages. First, each attention module operates at a feature level where its characteristics can be more effectively exploited, thereby reducing unnecessary redundancy. Second, the intermediate features enhanced by Coordinate Attention and the deep semantic features refined by CBAM together provide more discriminative inputs for the subsequent CWASPP module and FRM-guided decoder. As a result, the encoder output becomes more informative in terms of lesion relevance, spatial localization, and semantic consistency, which is beneficial for robust segmentation under complex scene conditions.

2.4. Cross-Window Atrous Spatial Pyramid Pooling Module

To improve the adaptability of the network to lesion regions with substantial scale variation, a Cross-Window Atrous Spatial Pyramid Pooling (CWASPP) module was introduced at the bottleneck stage. In practical crop leaf disease images, lesions may appear as tiny scattered spots, elongated infected stripes, or large continuous diseased regions. Under such conditions, relying on a single receptive field is often insufficient to capture both local lesion details and broader contextual dependencies. Therefore, the proposed CWASPP module was designed to enhance multi-scale feature representation while maintaining moderate computational complexity.

As shown in Figure 4, the proposed CWASPP consists of a detail-preserving branch, a multi-dilation atrous context branch, and a cross-window enhancement branch, followed by feature concatenation and channel fusion. This design enables the bottleneck representation to incorporate local structural information, multi-scale contextual cues, and cross-window spatial interactions in a coordinated manner.

2.4.1. Detail-Preserving Branch

The first branch applies a 1 × 1 convolution to the input bottleneck feature map. This branch is intended to retain local structural details while adjusting the channel representation. Because lesion segmentation requires the preservation of boundary-related information and local appearance cues, directly transforming all bottleneck features into dilated representations may weaken fine structural details. The 1 × 1 branch therefore provides a relatively detail-preserving path and serves as a reference representation during subsequent feature fusion.

Given an input feature map F, the output of this branch can be written as

F_{0} = ϕ_{1 \times 1} (F),

(7)

where ϕ_{1 \times 1}(⋅) denotes the 1 × 1 convolutional transformation.

2.4.2. Multi-Dilation Atrous Context Branch

To model lesion regions of different spatial scales, the second part of CWASPP performs context extraction using multiple atrous convolution branches with different dilation settings. In the current implementation, dilation rates of 1, 3, 6, and 9 are used to generate feature responses with progressively enlarged receptive fields. This design allows the bottleneck representation to capture both fine local lesion structures and broader semantic context.

For an input feature map F, the outputs of the multi-dilation branches can be written as

F_{r} = ϕ_{3 \times 3}^{r} (F), \quad r \in \{1, 3, 6, 9\},

(8)

where ϕ_{3 \times 3}^{r}(\cdot) denotes a 3 × 3 atrous convolution with dilation rate r.

Compared with using a single receptive field, this multi-dilation design enables the network to better adapt to the substantial scale variation in lesion regions, ranging from small scattered spots to large continuous infected areas.
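The effect of the dilation rates on spatial coverage follows from the standard relation for atrous convolution: a k × k kernel with dilation r spans k + (k − 1)(r − 1) pixels per side. The short sketch below evaluates this for the four rates in Equation (8), together with the padding that keeps the feature-map size unchanged at stride 1. The helper names are hypothetical, and stride 1 with size-preserving padding is an assumption about the branch configuration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the region covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size, dilation):
    """Padding that preserves spatial size at stride 1 (odd kernel sizes)."""
    return dilation * (kernel_size - 1) // 2


dilations = [1, 3, 6, 9]
spans = [effective_kernel_size(3, r) for r in dilations]
paddings = [same_padding(3, r) for r in dilations]
```

With 3 × 3 kernels, the four branches therefore cover 3, 7, 13, and 19 pixels per side while each still uses only nine weights per channel pair.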

2.4.3. Cross-Window Enhancement Branch

In addition to atrous context extraction, CWASPP further incorporates a cross-window enhancement branch to strengthen spatial interactions within local regions. Specifically, window-based convolutions with different kernel sizes are used to model complementary structural cues at the bottleneck stage. In the current configuration, kernel sizes of 3 and 5 are adopted to capture local lesion continuity and regional structural variation.

Let the corresponding window-based features be denoted by

F_{w}^{(k)} = ϕ_{w}^{(k)} (F), \quad k \in \{3, 5\},

(9)

where ϕ_{w}^{(k)}(\cdot) represents the window-based feature transformation with kernel size k.

This branch is intended to enhance local spatial interaction beyond standard atrous sampling, thereby improving the sensitivity of the bottleneck representation to elongated lesions, fragmented regions, and irregular morphological patterns.

2.4.4. Feature Fusion

After the detail-preserving branch, the multi-dilation atrous branches, and the cross-window enhancement branches generate their respective outputs, all features are concatenated along the channel dimension and projected through a final 1 × 1 convolution for channel fusion. The final output of the CWASPP module can be expressed as

F_{CWASPP} = ϕ_{1 \times 1} ([F_{0}, F_{1}, F_{3}, F_{6}, F_{9}, F_{w}^{(3)}, F_{w}^{(5)}]),

(10)

where F_0 denotes the output of the detail-preserving branch and [⋅] denotes channel-wise concatenation.

This fusion step integrates local detail information, multi-scale contextual information, and cross-window structural information into a unified bottleneck representation. Compared with directly using a single deep feature map, the fused representation is more suitable for subsequent decoder recovery because it contains richer disease-related cues across multiple spatial scales.
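The branch-and-fuse structure of Equations (8)–(10) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the detail-preserving branch is assumed to be a 1 × 1 convolution, the window-based transformations are approximated by ordinary k × k convolutions, and the class name, channel widths, and the omission of normalization and activation layers are all choices made for this example.

```python
import torch
import torch.nn as nn


class CWASPPSketch(nn.Module):
    """Illustrative CWASPP-style block: detail + atrous + window branches, then 1x1 fusion."""

    def __init__(self, in_ch, branch_ch, out_ch,
                 dilations=(1, 3, 6, 9), window_kernels=(3, 5)):
        super().__init__()
        # Assumed detail-preserving branch F_0 (1x1 convolution).
        self.detail = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        # Atrous branches F_r, Eq. (8); padding == dilation keeps the spatial size.
        self.atrous = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=r, dilation=r)
            for r in dilations
        )
        # Window branches F_w^(k), Eq. (9), approximated here by plain k x k convolutions.
        self.window = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2)
            for k in window_kernels
        )
        n_branches = 1 + len(dilations) + len(window_kernels)
        # Final 1x1 projection over the concatenated branches, Eq. (10).
        self.fuse = nn.Conv2d(n_branches * branch_ch, out_ch, kernel_size=1)

    def forward(self, f):
        feats = [self.detail(f)]
        feats += [conv(f) for conv in self.atrous]
        feats += [conv(f) for conv in self.window]
        return self.fuse(torch.cat(feats, dim=1))


block = CWASPPSketch(in_ch=16, branch_ch=8, out_ch=32)
out = block(torch.randn(2, 16, 8, 8))
print(out.shape)  # torch.Size([2, 32, 8, 8])
```

Because every branch preserves the spatial size, the seven outputs can be concatenated directly, and the 1 × 1 convolution only mixes channels.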

2.4.5. Role of CWASPP in MFR-Net

In the proposed network, CWASPP is placed after the deepest encoder feature refined by CBAM and before the decoder begins progressive feature restoration. In this position, it serves as the core multi-scale context enhancement component of the bottleneck. Its role is not limited to enlarging the receptive field; rather, it provides a richer and more balanced feature representation for the decoder, which is particularly important in crop disease segmentation where lesion scale variability and background interference are both prominent.

By combining a detail-preserving path, multiple atrous branches, and cross-window enhancement, CWASPP improves the ability of MFR-Net to represent lesion regions with complex morphology and diverse spatial scales. As a result, the decoder receives a more discriminative and context-aware feature map, which contributes to improved lesion segmentation accuracy and more stable boundary recovery in challenging agricultural scenes.

2.5. Feature Rectification Module

In encoder–decoder segmentation networks, skip connections are commonly used to fuse low-level and high-level features. However, because these features differ substantially in semantic abstraction, direct fusion may introduce semantic inconsistency and reduce segmentation quality, especially near lesion boundaries. To address this issue, a Feature Rectification Module (FRM) was introduced in the skip pathway of MFR-Net to align encoder features with decoder features before cross-level fusion.

Let F_e denote the encoder feature and F_d denote the decoder feature at a given fusion stage. The rectified feature can be expressed as

F_{r} = \mathrm{Rectify} (F_{e}, F_{d}),

(11)

where Rectify(⋅) denotes the feature rectification operation used to reduce semantic inconsistency between encoder and decoder features. The refined output feature is then obtained by adaptive fusion:

F_{out} = ϕ (F_{r}),

(12)

where ϕ(⋅) denotes the subsequent convolution and channel recalibration operations. Through this rectification process, the decoder can make more effective use of complementary information from different feature levels, thereby improving lesion boundary delineation and structural recovery.

2.5.1. Motivation of Feature Rectification

In leaf disease images, lesion regions often exhibit irregular boundaries, fragmented distribution, and low contrast with surrounding healthy tissues. At the same time, shallow encoder features may strongly respond to background textures such as leaf veins and illumination variations, whereas deeper decoder features focus more on lesion semantics but may lose fine-grained structural cues. Therefore, directly merging these two types of features may weaken the overall segmentation quality. The objective of FRM is to reduce this semantic inconsistency and allow the decoder to better exploit complementary information from different levels.

2.5.2. Rectification Formulation

Let F_e denote the encoder feature and F_d denote the decoder feature at a given fusion stage. Since these two features may differ in channel dimension and semantic distribution, they are first projected into a compatible embedding space through lightweight convolutional transformations:

{\hat{F}}_{e} = ϕ_{e} (F_{e}),

(13)

{\hat{F}}_{d} = ϕ_{d} (F_{d}),

(14)

where ϕ_e and ϕ_d denote feature projection operations. After alignment, the two features are jointly used to generate a rectification map:

A = σ (ϕ_{r} ([{\hat{F}}_{e}, {\hat{F}}_{d}])),

(15)

where [⋅,⋅] denotes channel-wise concatenation, ϕ_r(⋅) denotes the rectification transformation, and σ(⋅) denotes the sigmoid activation. The encoder feature is then recalibrated by this rectification map:

{\tilde{F}}_{e} = A ⊙ {\hat{F}}_{e},

(16)

where ⊙ denotes element-wise multiplication. Through this process, encoder responses that are less consistent with the decoder semantics can be suppressed, whereas feature components that are more relevant to lesion restoration can be emphasized.

After rectification, the refined encoder feature is fused with the aligned decoder feature to produce the output of the current decoder stage:

F_{out} = ϕ_{f} ([{\tilde{F}}_{e}, {\hat{F}}_{d}]),

(17)

where ϕ_f(⋅) denotes the subsequent fusion operation.
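Equations (13)–(17) can be read as a gated skip connection. The PyTorch sketch below follows them step by step under several assumptions that the text does not fix: ϕ_e and ϕ_d are taken as 1 × 1 convolutions, ϕ_r and ϕ_f as 3 × 3 convolutions, the decoder feature is assumed to have already been upsampled to the encoder resolution, and the class and method names are hypothetical.

```python
import torch
import torch.nn as nn


class FRMSketch(nn.Module):
    """Illustrative feature rectification: project, gate the encoder feature, fuse."""

    def __init__(self, enc_ch, dec_ch, mid_ch):
        super().__init__()
        self.proj_e = nn.Conv2d(enc_ch, mid_ch, kernel_size=1)   # phi_e, Eq. (13)
        self.proj_d = nn.Conv2d(dec_ch, mid_ch, kernel_size=1)   # phi_d, Eq. (14)
        self.rect = nn.Conv2d(2 * mid_ch, mid_ch, kernel_size=3, padding=1)  # phi_r
        self.fuse = nn.Conv2d(2 * mid_ch, mid_ch, kernel_size=3, padding=1)  # phi_f

    def rectify(self, f_e, f_d):
        e_hat = self.proj_e(f_e)
        d_hat = self.proj_d(f_d)
        # Rectification map A in (0, 1), Eq. (15).
        a = torch.sigmoid(self.rect(torch.cat([e_hat, d_hat], dim=1)))
        # Element-wise recalibration of the encoder feature, Eq. (16).
        return a * e_hat, d_hat, a

    def forward(self, f_e, f_d):
        e_tilde, d_hat, _ = self.rectify(f_e, f_d)
        # Fusion of the rectified encoder feature with the decoder feature, Eq. (17).
        return self.fuse(torch.cat([e_tilde, d_hat], dim=1))


frm = FRMSketch(enc_ch=24, dec_ch=48, mid_ch=32)
f_out = frm(torch.randn(2, 24, 16, 16), torch.randn(2, 48, 16, 16))
print(f_out.shape)  # torch.Size([2, 32, 16, 16])
```

Since A is produced by a sigmoid, each encoder response is scaled by a factor between 0 and 1, so the map can attenuate encoder responses but never amplify them beyond their projected value.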

2.5.3. Structural Interpretation of FRM

As illustrated in Figure 5, the encoder feature first enters the skip branch, where it is transformed into a feature space suitable for cross-level interaction. Meanwhile, the decoder feature passes through the gating branch to produce a guidance signal. The two branches then interact through element-wise modulation, allowing the decoder context to control which parts of the encoder feature should be enhanced or suppressed. The rectified skip feature is subsequently fused with the decoder feature and forwarded to the next decoding stage.

This design differs from conventional skip connections in that the encoder feature is not assumed to be fully reliable. Instead, it is selectively filtered according to the current decoder context. As a result, the skip pathway becomes an adaptive correction mechanism rather than a direct transmission channel.

2.5.4. Role of FRM in MFR-Net

The introduction of FRM provides two main advantages. First, it improves semantic consistency between encoder and decoder features, which is particularly important when shallow lesion-like textures resemble the surrounding background. Second, it helps preserve lesion structure and boundary continuity during progressive upsampling, because useful edge-related information can still be retained after rectification while irrelevant background responses are reduced.

In MFR-Net, FRM is applied at multiple skip fusion levels, allowing semantic correction to be performed throughout the decoding process rather than only at a single stage. As a result, the decoder can more effectively integrate multi-level features and progressively recover lesion regions with improved structural integrity and boundary delineation. Therefore, FRM serves as a key component for reducing semantic conflict in cross-level fusion and enhancing the final segmentation quality of the proposed network.

2.6. Decoder and Segmentation Output

After multi-level feature extraction in the encoder, context enhancement in the CWASPP bottleneck, and cross-level feature correction through FRM, the decoder is responsible for progressively restoring the spatial resolution of the feature maps and generating the final lesion segmentation result. In the proposed MFR-Net, the decoder follows a top-down progressive recovery strategy. Starting from the deepest bottleneck feature, the decoder gradually upsamples the feature representation and integrates the rectified encoder features through skip connections. This design allows high-level semantic information and low-level structural information to be combined stage by stage, which is beneficial for recovering lesion shapes, preserving boundary continuity, and reducing the loss of local detail. At each decoding stage, the incoming feature from the previous lower-resolution stage is first upsampled to a higher spatial resolution. The upsampled decoder feature is then fused with the rectified encoder feature delivered by the corresponding FRM. Through this process, the decoder progressively reconstructs lesion regions from coarse semantic localization to fine-grained structural delineation. The low-resolution stages mainly restore the overall lesion distribution and regional consistency, whereas the high-resolution stages are more important for recovering lesion boundaries and local morphological details.

Let F_{d}^{(l)} denote the decoder feature at stage l, and let F_{e}^{(l)} denote the corresponding rectified encoder feature produced by FRM. The decoding process at stage l can be written as

F_{d}^{(l)} = ψ_{l} ([Up (F_{d}^{(l + 1)}), F_{e}^{(l)}]),

(18)

where Up(⋅) denotes the upsampling operation, [⋅,⋅] denotes channel-wise concatenation, and ψ_l(⋅) denotes the convolutional refinement operation at the current decoding stage. This formulation indicates that each decoder stage receives two complementary inputs: one from the deeper decoder branch containing stronger semantic context, and the other from the encoder branch containing rectified structural detail.

Through repeated application of this process, the decoder generates a high-resolution feature representation that integrates multi-level semantic and structural cues. Compared with direct decoding from a single deep feature map, such progressive recovery is better suited for leaf disease segmentation, because lesion regions often exhibit irregular morphology, weak contrast, and partially blurred edges. The stage-wise fusion mechanism allows the decoder to progressively refine the lesion mask rather than attempting to recover all details in a single step.
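A single decoding stage in the sense of Equation (18) can be sketched as below. The upsampling operator is assumed to be bilinear interpolation, and ψ_l is assumed to be one 3 × 3 convolution followed by batch normalization and ReLU; neither choice is specified in the text, and the class name is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderStageSketch(nn.Module):
    """Illustrative decoder stage: upsample, concatenate with the skip feature, refine."""

    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        # Assumed refinement psi_l: conv 3x3 -> BatchNorm -> ReLU.
        self.refine = nn.Sequential(
            nn.Conv2d(deep_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_deep, f_skip):
        # Up(.): resize the deeper decoder feature to the skip feature's resolution.
        up = F.interpolate(f_deep, size=f_skip.shape[-2:],
                           mode="bilinear", align_corners=False)
        # Channel-wise concatenation followed by refinement, Eq. (18).
        return self.refine(torch.cat([up, f_skip], dim=1))


stage = DecoderStageSketch(deep_ch=64, skip_ch=32, out_ch=32)
f_next = stage(torch.randn(2, 64, 8, 8), torch.randn(2, 32, 16, 16))
print(f_next.shape)  # torch.Size([2, 32, 16, 16])
```

Stacking such stages from the bottleneck upward reproduces the coarse-to-fine recovery described above, with each stage doubling the resolution in this example.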

After the final decoding stage, the restored feature map is projected into the segmentation space by a terminal 1 × 1 convolution. Let the output feature of the last decoder stage be denoted by F_{d}^{(0)}. The lesion probability map P can be expressed as

P = σ (ϕ_{out} (F_{d}^{(0)})),

(19)

where ϕ_out(⋅) denotes the final 1 × 1 convolutional projection and σ(⋅) denotes the sigmoid activation function. The resulting probability map provides the confidence of each pixel belonging to the lesion class.

To obtain the final binary lesion mask, the probability map is thresholded as

S (x, y) = \begin{cases} 1, & P (x, y) \geq τ, \\ 0, & P (x, y) < τ, \end{cases}

(20)

where S(x,y) denotes the predicted lesion label at pixel (x,y), and τ denotes the decision threshold. In the present study, a fixed threshold of 0.5 was used in the main experiments to convert the lesion probability map into the final binary segmentation output.
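The two output steps in Equations (19) and (20), a sigmoid followed by thresholding at τ = 0.5, can be illustrated with NumPy on a small array of logits. The function name and the example values are made up for this sketch.

```python
import numpy as np


def logits_to_mask(logits, tau=0.5):
    """Apply the sigmoid (Eq. 19) and the threshold rule (Eq. 20) to raw logits."""
    prob = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    mask = (prob >= tau).astype(np.uint8)  # 1 = lesion, 0 = background
    return prob, mask


prob, mask = logits_to_mask([[-2.0, 0.0], [0.5, 3.0]])
print(mask.tolist())  # [[0, 1], [1, 1]]
```

A logit of exactly 0 maps to a probability of 0.5 and is therefore labeled as lesion, because Equation (20) uses a non-strict inequality.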

Overall, the decoder in MFR-Net is designed not merely as a spatial upsampling pathway, but as a progressive lesion restoration mechanism guided by rectified skip features. By combining stage-wise upsampling, FRM-guided feature fusion, and final probability projection, the decoder is able to recover lesion regions with improved structural completeness and more accurate boundary delineation. This makes the proposed network more suitable for crop leaf disease segmentation in complex natural environments where both semantic consistency and fine spatial recovery are critical.

2.7. Experimental Settings

All experiments were implemented in PyTorch and conducted on a workstation equipped with an NVIDIA GeForce RTX 4060 Ti GPU (8 GB). All input images and corresponding masks were resized to 512 × 512 before training and evaluation.

The proposed network was optimized using the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻⁵. A cosine annealing schedule was adopted, and the minimum learning rate was set to 1 × 10⁻⁶. The batch size was set to 8, and automatic mixed precision (AMP) was enabled during training. The maximum number of training epochs was set to 60, and early stopping was applied with a patience of 10 epochs to reduce the risk of overfitting. Model selection was based on the validation mIoU, and the checkpoint with the best validation performance was used for final evaluation.
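The learning-rate schedule and the stopping rule described above can be sketched in plain Python. The sketch assumes that annealing is applied once per epoch over the full 60 epochs without restarts, which the text does not state explicitly; the function and class names are illustrative.

```python
import math


def cosine_lr(epoch, total_epochs=60, lr_max=1e-4, lr_min=1e-6):
    """Cosine-annealed learning rate at a given epoch (no restarts assumed)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr_min + (lr_max - lr_min) * cos_term


class EarlyStopper:
    """Signals a stop when validation mIoU has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_miou):
        if val_miou > self.best:
            self.best = val_miou      # new best checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


for epoch in (0, 30, 60):
    print(epoch, cosine_lr(epoch))
```

Under these assumptions the rate starts at 1 × 10⁻⁴, passes through roughly 5.05 × 10⁻⁵ at the midpoint, and reaches 1 × 10⁻⁶ at epoch 60.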

The loss function was defined as a weighted combination of Focal Loss and Tversky Loss to address class imbalance and improve lesion region segmentation. In the current implementation, the loss was formulated as

L = λ_{f} L_{Focal} + λ_{t} L_{Tversky},

(21)

where λ_f = 0.3 and λ_t = 0.7. The focal-loss parameters were set to α = 0.25 and γ = 2.0, while the Tversky-loss parameters, which use their own α distinct from the focal-loss α, were set to α = 0.3 and β = 0.7, with a smoothing factor of 10⁻⁶.
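A NumPy sketch of the combined loss in Equation (21) is given below. The text does not spell out the individual loss formulas, so two common conventions are assumed here: the focal loss weights positives by α and negatives by 1 − α, and the Tversky index is TP / (TP + α·FP + β·FN), so that β = 0.7 penalizes missed lesion pixels more heavily than false alarms. Function names are illustrative.

```python
import numpy as np


def focal_loss(prob, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Binary focal loss averaged over pixels (assumed alpha-balanced form)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    p_t = np.where(target == 1, prob, 1.0 - prob)
    a_t = np.where(target == 1, alpha, 1.0 - alpha)
    return float(np.mean(-a_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def tversky_loss(prob, target, alpha=0.3, beta=0.7, smooth=1e-6):
    """1 - Tversky index with soft counts; alpha weights FP, beta weights FN (assumed)."""
    tp = np.sum(prob * target)
    fp = np.sum(prob * (1.0 - target))
    fn = np.sum((1.0 - prob) * target)
    return float(1.0 - (tp + smooth) / (tp + alpha * fp + beta * fn + smooth))


def combined_loss(prob, target, lam_f=0.3, lam_t=0.7):
    """Weighted sum of the two terms, Eq. (21)."""
    return lam_f * focal_loss(prob, target) + lam_t * tversky_loss(prob, target)


target = np.array([1.0, 1.0, 0.0, 0.0])
good = np.array([0.9, 0.8, 0.1, 0.1])
poor = np.array([0.6, 0.4, 0.3, 0.2])
print(combined_loss(good, target) < combined_loss(poor, target))  # True
```

With these weights, a prediction that misses one lesion pixel incurs a larger Tversky penalty than one that adds a single false-positive pixel, which is in line with the recall-oriented behavior reported for MFR-Net.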

For the Kaggle-derived benchmark, all compared methods were trained and evaluated under a unified experimental protocol, including the same data split, input resolution, training schedule, early stopping strategy, and fixed binarization threshold of 0.5. The same settings were also adopted for the ablation study and the robustness experiments. For the independent Apple Leaf Dataset, a separate train–validation–test split was used, and all methods were trained and evaluated under the same settings as those used on the Kaggle-derived benchmark.

During manuscript preparation, generative AI tools were employed to optimize English expression and standardize academic writing format. All research content, experimental designs, result analyses, and academic discussions were independently completed, checked, and approved by the authors.

2.8. Evaluation Metrics

To comprehensively evaluate the proposed method, both segmentation accuracy and computational efficiency were considered. The primary segmentation metrics included mean Intersection over Union (mIoU), Dice coefficient, Precision, Recall, and Accuracy. Among them, mIoU and Dice were used to measure the overlap between the predicted masks and the ground-truth annotations, whereas Precision and Recall were used to assess false-positive and false-negative tendencies, respectively. Accuracy was further used to evaluate overall pixel-level classification performance.

For a predicted lesion region P and the corresponding ground-truth region G, mIoU and Dice can be defined as

mIoU = \frac{| P \cap G |}{| P \cup G |},

(22)

Dice = \frac{2 | P \cap G |}{| P | + | G |},

(23)

Precision and Recall are given by

Precision = \frac{T P}{T P + F P},

(24)

Recall = \frac{T P}{T P + F N},

(25)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. Accuracy is defined as

Accuracy = \frac{T P + T N}{T P + T N + F P + F N},

(26)

where TN denotes true negatives.
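Equations (22)–(26) can all be computed from the four pixel-level confusion counts, using the identities |P ∩ G| = TP, |P ∪ G| = TP + FP + FN, and |P| + |G| = 2TP + FP + FN. The NumPy sketch below does this for a single pair of binary masks; how the scores are averaged over images is not stated in the text and is left out, and the function name is illustrative.

```python
import numpy as np


def binary_segmentation_metrics(pred, gt):
    """Lesion-class IoU, Dice, Precision, Recall, and Accuracy for one mask pair."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    tn = int(np.sum(~pred & ~gt))

    def ratio(num, den):
        # Return 0.0 for an empty denominator instead of raising.
        return num / den if den > 0 else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),             # Eq. (22)
        "dice": ratio(2 * tp, 2 * tp + fp + fn),    # Eq. (23)
        "precision": ratio(tp, tp + fp),            # Eq. (24)
        "recall": ratio(tp, tp + fn),               # Eq. (25)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),  # Eq. (26)
    }


pred = [[1, 1, 0], [0, 1, 0]]
gt = [[1, 0, 0], [0, 1, 1]]
print(binary_segmentation_metrics(pred, gt))
```

In this toy case TP = 2, FP = 1, FN = 1, and TN = 2, giving an IoU of 0.5 and a Dice score of about 0.667, which illustrates that Dice is never lower than IoU for the same prediction.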

To evaluate model efficiency, the number of parameters (Params) and floating-point operations (FLOPs) were also reported. These measures were used to assess whether the proposed method could achieve favorable segmentation performance while maintaining moderate computational complexity.

In addition, for the robustness experiments under simulated environmental disturbances, mIoU was used as the primary metric to quantify performance degradation under different corruption types and severity levels.

3. Results

To comprehensively evaluate the proposed Multi-Scale Feature Rectification Network (MFR-Net) for crop leaf disease segmentation, a series of quantitative and qualitative experiments were conducted on a Kaggle-derived benchmark and an independent Apple Leaf Dataset. The evaluation focused on segmentation accuracy, model complexity, cross-dataset generalization, and robustness under simulated environmental disturbances. The main results are presented in Sections 3.1–3.5, covering quantitative comparison, ablation analysis, external dataset validation, visual comparison, and robustness evaluation, respectively.

3.1. Quantitative Comparison on the Kaggle-Derived Benchmark

To evaluate the effectiveness of the proposed method for binary leaf lesion segmentation, comparative experiments were conducted on the Kaggle-derived benchmark constructed from the unaugmented data folder of the public Leaf Disease Segmentation Dataset. This benchmark contained 588 diseased-leaf images and 588 corresponding binary lesion masks. For fairness, all deep learning methods were trained and evaluated under the same protocol, including the same fixed data split, an input resolution of 512 × 512, the same training schedule, a fixed threshold of 0.5, and no test-time augmentation. In addition, three representative traditional segmentation methods, namely Otsu thresholding, K-means clustering, and saliency-based segmentation, were also evaluated on the same test split for reference. The quantitative results are summarized in Table 1.

As shown in Table 1, MFR-Net achieved the highest mIoU of 74.27% among all deep learning models and also obtained the highest Recall of 87.61%, indicating a stronger ability to cover lesion regions more completely. Although HRNet-W18 (FPN) achieved higher Dice and Precision, it required substantially larger computational cost, with 91.57 G FLOPs, whereas MFR-Net maintained competitive segmentation accuracy with only 37.55 G FLOPs. Compared with U-Net++, DeepLabV3+, and U-Net, MFR-Net also provided superior or more balanced performance while requiring lower or moderate model complexity. These results indicate that MFR-Net delivers the most favorable overall trade-off between segmentation accuracy and computational cost on the Kaggle-derived benchmark.

By contrast, the traditional methods performed substantially worse than all deep learning models. Among them, Otsu thresholding achieved the best traditional result, with 17.48% mIoU and 26.74% Dice, whereas K-means clustering and saliency-based segmentation achieved only 14.26%/21.66% and 14.21%/22.60% in mIoU/Dice, respectively. These results indicate that conventional low-level segmentation strategies are insufficient for handling lesion-scale variation, irregular lesion boundaries, and strong background interference in complex agricultural scenes.

To assess result stability, the representative methods reported in Table 1 were additionally retrained with three random seeds (42, 52, and 62), and the corresponding mean ± standard deviation results are summarized in Table S1 (Supplementary Materials), complementing the single-run comparison in Table 1. Across seeds, MFR-Net remained stable and competitive, achieving 73.02 ± 0.78% mIoU and 83.19 ± 0.68% Dice. In terms of average mIoU, it outperformed U-Net, U-Net++, DeepLabV3+, and PSPNet, while remaining close to SegFormer and HRNet-W18. These results further support the stability of the proposed method under repeated training runs.

3.2. Ablation Study on the Kaggle-Derived Benchmark

To assess the contribution of each component in the proposed architecture, an ablation study was conducted on the same Kaggle-derived benchmark constructed from the 588 unaugmented image–mask pairs in the data folder of the public Leaf Disease Segmentation Dataset. Five model variants were considered, including the baseline model, the baseline with the hybrid attention mechanism, the baseline with CWASPP, the baseline with FRM, and the full model. All experiments were performed under the same settings as those used in Table 1, including the identical data split, input size, training schedule, early stopping strategy, fixed threshold, and no test-time augmentation. The results are presented in Table 2.

As shown in Table 2, the full model achieved the best overall performance, with 73.95% mIoU and 83.89% Dice. Compared with the baseline, the full model improved mIoU by 2.15 percentage points and Dice by 1.77 percentage points. Among the individual components, CWASPP provided the largest single gain in mIoU, indicating that multi-scale contextual aggregation plays a particularly important role in lesion segmentation. The hybrid attention mechanism mainly improved Recall, suggesting enhanced sensitivity to lesion regions, whereas FRM further improved feature fusion quality and contributed to more stable decoding performance. Overall, the ablation results confirm that all proposed components contribute positively and that their combination yields the strongest segmentation performance.

To further assess the robustness of the ablation findings, repeated experiments with five random seeds (42, 52, 62, 72, and 82) were additionally conducted, and the corresponding mean ± standard deviation results are summarized in Table S2 (Supplementary Materials). The repeated experiments show that all proposed components generally improve over the baseline. Among the variants, FRM achieved the highest average mIoU (73.65 ± 0.84%), whereas the full model achieved the highest Dice (83.73 ± 0.29%) and Accuracy (95.62 ± 0.15%), while also showing relatively smaller fluctuation across seeds. These supplementary results indicate that the proposed architectural components provide stable performance gains beyond single-run point estimates.

3.3. Quantitative Comparison on the Apple Leaf Dataset

To further examine the generalization ability of the proposed method, supplementary comparative experiments were conducted on the independent Apple Leaf Dataset. A fixed training, validation, and test split was adopted, and all methods were evaluated under the same settings as those used on the Kaggle-derived benchmark, including the same input resolution, training schedule, early stopping strategy, fixed threshold, and no test-time augmentation. The results are summarized in Table 3.

As shown in Table 3, the performance differences among most compared methods on this dataset were relatively small, whereas HRNet-W18 exhibited a marked performance drop under the apple leaf data distribution. MFR-Net achieved competitive performance, with 52.70% mIoU and 61.25% Dice. U-Net obtained the highest mIoU of 52.93%, SegFormer achieved the highest Dice of 61.35% and the highest Recall of 73.57%, and HRNet-W18 achieved the highest Precision of 67.78% and the highest Accuracy of 98.69%. Nevertheless, HRNet-W18 also yielded the lowest mIoU and Dice, together with a low Recall of 45.89%, indicating that its predictions were relatively conservative and tended to under-segment lesion regions. Overall, these results indicate that MFR-Net remains competitive under a different dataset distribution and annotation style, demonstrating good cross-dataset generalization ability rather than dataset-specific overfitting.

3.4. Visual Comparison Under Representative Lesion Morphologies

To further compare the segmentation behavior of different methods, representative challenging samples from the test subset of the Kaggle-derived benchmark were selected for visual analysis. These samples were chosen to reflect typical lesion morphology patterns encountered in practical crop disease segmentation, including elongated lesions, scattered small lesion regions, and lesions with irregular boundaries. Figure 6 presents the error map comparison, and Figure 7 shows the contour overlay results.

As shown in Figure 6, the compared methods exhibit distinct error distributions on challenging samples. In elongated lesion regions, some baseline methods tend to produce local discontinuities or incomplete coverage, whereas MFR-Net preserves lesion continuity more effectively. For scattered small lesions, baseline methods are more prone to missed detections and fragmented predictions, while MFR-Net yields fewer omission errors and suppresses isolated false responses more effectively. In samples with irregular boundaries, MFR-Net generally produces more concentrated error regions, indicating better control of over-segmentation and under-segmentation in structurally complex lesion areas.

Figure 7 further confirms these observations from the contour perspective. Compared with the baseline methods, the contours generated by MFR-Net are generally more consistent with the ground-truth annotations, especially in samples containing fragmented lesion regions and complex boundary shapes. This result suggests that the proposed network can better preserve lesion structure while maintaining more stable boundary delineation under challenging conditions.

Overall, the visual observations are consistent with the quantitative comparisons in Tables 1–3 and further demonstrate that MFR-Net achieves accurate lesion localization and stable boundary delineation in complex agricultural scenes.

3.5. Robustness Analysis Under Simulated Environmental Disturbances

To further evaluate model robustness under practical agricultural imaging conditions, additional experiments were conducted under four types of simulated test-time disturbances, including additive Gaussian noise, brightness reduction, Gaussian blur, and contrast reduction. Each disturbance was evaluated at two severity levels, where L1 denotes a moderate setting and L2 denotes a severe setting. Specifically, Noise (L1/L2) was implemented as zero-mean Gaussian noise with standard deviations of 8.0 and 16.0, respectively; Dark (L1/L2) was implemented by multiplying image intensity by factors of 0.80 and 0.60, respectively; Blur (L1/L2) was implemented using Gaussian blur with kernel sizes of 5 × 5 and 9 × 9 and corresponding standard deviations of 2.0 and 2.5, respectively; and Contrast (L1/L2) was implemented by scaling image contrast around the per-image mean with factors of 0.80 and 0.60, respectively. All corruptions were applied only to the test set, while the training protocol, model selection strategy, decision threshold, and all other evaluation settings remained unchanged. The same corruption pipeline and severity definitions were used for all compared methods to ensure a fair robustness comparison.
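Three of the four corruptions can be sketched in a few lines of NumPy using the severity values given above. The sketch assumes 8-bit images on a 0–255 intensity scale (so that noise standard deviations of 8.0 and 16.0 are in gray levels) and computes the contrast mean over all pixels and channels of an image; both points are assumptions, as are the function names. Gaussian blur is omitted here; it would typically be applied with an image-processing library using the stated kernel sizes and standard deviations.

```python
import numpy as np

# Severity settings quoted in the text (L1 = moderate, L2 = severe).
SEVERITY = {
    "noise": {"L1": 8.0, "L2": 16.0},      # Gaussian noise standard deviation
    "dark": {"L1": 0.80, "L2": 0.60},      # intensity multiplier
    "contrast": {"L1": 0.80, "L2": 0.60},  # contrast scaling factor
}


def _to_uint8(x):
    """Round and clip back to the valid 8-bit range."""
    return np.clip(np.rint(x), 0, 255).astype(np.uint8)


def add_gaussian_noise(img, std, rng):
    """Add zero-mean Gaussian noise with the given standard deviation."""
    noise = rng.normal(0.0, std, size=img.shape)
    return _to_uint8(img.astype(np.float64) + noise)


def darken(img, factor):
    """Multiply image intensity by a constant factor."""
    return _to_uint8(img.astype(np.float64) * factor)


def reduce_contrast(img, factor):
    """Scale deviations from the per-image mean by the given factor."""
    x = img.astype(np.float64)
    mean = x.mean()
    return _to_uint8((x - mean) * factor + mean)


rng = np.random.default_rng(0)  # fixed seed so every method sees the same noise
image = np.full((4, 4, 3), 128, dtype=np.uint8)
noisy = add_gaussian_noise(image, SEVERITY["noise"]["L2"], rng)
print(darken(image, SEVERITY["dark"]["L2"])[0, 0].tolist())  # [77, 77, 77]
```

Fixing the random seed is one simple way to ensure that the same corrupted test images are presented to every compared method.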

For conciseness, Table 4 reports only the severe-disturbance (L2) results in order to focus the comparison on the most challenging cases, whereas Figure 8 presents representative qualitative examples under disturbed conditions.

As shown in Table 4, MFR-Net achieved the highest mIoU not only under the clean setting but also under all severe disturbance conditions. Specifically, its mIoU remained 69.11% under Noise (L2), 74.12% under Dark (L2), 66.91% under Blur (L2), and 74.39% under Contrast (L2), consistently exceeding those of the compared methods. The advantage was particularly evident under severe blur, where MFR-Net substantially outperformed U-Net, SegFormer, and DeepLabV3+, indicating stronger resilience to structural degradation and local detail loss. In addition, the average mIoU of MFR-Net across the four severe disturbance settings reached 71.13%, which was the highest among all compared methods. These results demonstrate that the proposed architecture maintains more stable lesion localization under adverse visual conditions and provides the strongest overall robustness in the present robustness evaluation.
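The reported average can be reproduced directly from the four L2 values quoted above; the short check below uses only those numbers.

```python
# MFR-Net mIoU (%) under the four severe (L2) disturbances, as quoted in the text.
miou_l2 = {"noise": 69.11, "dark": 74.12, "blur": 66.91, "contrast": 74.39}

average = sum(miou_l2.values()) / len(miou_l2)
hardest = min(miou_l2, key=miou_l2.get)

print(round(average, 2))  # 71.13
print(hardest)            # blur
```

The unrounded mean is 71.1325%, which matches the reported 71.13%, and blur is the condition with the lowest mIoU among the four.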

Figure 8 further confirms this trend from a qualitative perspective. Under severe noise, darkness, blur, and contrast variation, the baseline methods tend to exhibit more contour deviation, local over-segmentation, or incomplete lesion delineation, whereas MFR-Net preserves lesion shape more consistently and produces fewer spurious responses in disturbed conditions. These observations are consistent with the quantitative results in Table 4 and further demonstrate the suitability of the proposed method for complex real-world agricultural scenes.

3.6. Summary of Results

Overall, the experimental results consistently demonstrate the effectiveness of the proposed MFR-Net for crop leaf disease segmentation. On the Kaggle-derived benchmark, MFR-Net achieved the highest mIoU and the highest Recall among the compared methods while maintaining competitive Dice performance and moderate computational cost. The ablation study further verified that all proposed components contribute positively to performance improvement and that their combination yields the strongest overall result within the proposed architecture.

On the independent Apple Leaf Dataset, MFR-Net remained competitive under a different data distribution and annotation granularity, indicating good cross-dataset generalization ability. The visual comparisons further showed that MFR-Net produces more consistent contours and more concentrated error regions on challenging samples with elongated lesions, scattered small lesions, and irregular boundaries. In addition, the robustness experiments demonstrated that MFR-Net maintained the highest mIoU under all tested severe disturbance conditions, especially under blur degradation. Taken together, these results confirm that the proposed method achieves a favorable balance between segmentation accuracy, generalization capability, computational efficiency, and robustness.

4. Discussion

The experimental results confirm the effectiveness of MFR-Net for crop leaf disease segmentation in complex scenarios. On the Kaggle-derived benchmark, MFR-Net achieved the highest mIoU and the highest Recall among the compared methods, while maintaining competitive Dice performance and moderate computational cost. This advantage is closely related to the coordinated design of the hybrid attention mechanism, the CWASPP module, and the feature rectification module, which together improve lesion-sensitive representation, multi-scale contextual modeling, and cross-level feature fusion.

Another important finding is that MFR-Net achieves a favorable balance between segmentation accuracy and computational cost. Compared with lightweight models, MFR-Net delivers stronger overall segmentation performance, while compared with heavier models such as U-Net++ and DeepLabV3+, it provides competitive or better accuracy without excessive computational overhead. Although HRNet-W18 achieved higher Dice and Precision on the Kaggle-derived benchmark, it required substantially larger FLOPs and showed a marked performance drop on the independent Apple Leaf Dataset, suggesting that higher local overlap on the in-domain benchmark did not necessarily translate into better cross-dataset stability.

The ablation study further clarifies the role of each proposed component. Among the individual modules, CWASPP provided the most notable improvement in mIoU, highlighting the importance of multi-scale contextual aggregation for lesion segmentation. The hybrid attention mechanism mainly improved Recall, indicating enhanced sensitivity to lesion regions. FRM further strengthened the consistency of decoder feature fusion and contributed to more stable structural recovery. The superior performance of the full model indicates that these components are complementary rather than redundant.

The supplementary experiments on the independent Apple Leaf Dataset further show that MFR-Net remains competitive under a different dataset distribution, suggesting that the proposed framework has good cross-dataset adaptability. The visual comparison results also demonstrate that MFR-Net produces more consistent contours and more concentrated error regions on representative difficult samples. In addition, the robustness analysis reveals that MFR-Net remains more stable than the compared methods under simulated environmental disturbances, especially severe blur, which is highly relevant for practical agricultural applications where image degradation is common.

Furthermore, robust cross-domain evaluation under illumination changes, occlusion, and other unstructured field disturbances is important for reliable agricultural visual perception. Rana et al. [31] emphasized that domain variability and field disturbances can substantially affect visual detection reliability in complex agricultural environments. In this study, the simulated disturbance experiments provide additional evidence that the hybrid attention mechanism, CWASPP, and FRM improve the stability of lesion segmentation under degraded imaging conditions, which is relevant for practical agricultural scenarios with uneven illumination, blurred images, and partial leaf occlusion.

Despite these advantages, several limitations should be acknowledged. First, further validation on larger and more diverse multi-source datasets would be beneficial for a more comprehensive assessment of generalization ability. Second, the current framework still relies on supervised pixel-level annotations, which are costly to obtain in large-scale agricultural applications. Third, although representative conventional baselines, including Otsu thresholding, K-means clustering, and saliency-based segmentation, were included in the main benchmark comparison, these methods were used as reference binary segmentation baselines rather than as extensively tuned task-specific pipelines. Therefore, broader benchmarking against additional conventional, hybrid, and disease-oriented specialized methods would still be valuable in future work. Fourth, although the computational cost is moderate, further optimization would still be desirable for deployment on resource-constrained edge devices. Future work may therefore explore broader cross-domain validation, weakly supervised or semi-supervised learning strategies, more comprehensive benchmarking against conventional and specialized segmentation methods, and lightweight deployment-oriented variants of the proposed architecture.

Overall, the present results indicate that MFR-Net is an effective and reliable method for crop leaf disease segmentation in complex agricultural scenes, achieving a favorable balance among accuracy, efficiency, generalization, and robustness.

5. Conclusions

In this study, a Multi-Scale Feature Rectification Network (MFR-Net) was proposed for crop leaf disease segmentation in complex scenarios. The proposed model integrates an EfficientNetV2-S-based encoder, a hybrid attention mechanism, a CWASPP module, and a feature rectification module to improve multi-scale contextual representation and cross-level feature fusion.

Experimental results showed that MFR-Net achieved the highest mIoU (74.27%) and the highest Recall (87.61%) on the Kaggle-derived benchmark, while maintaining competitive Dice performance (84.25%) with a moderate computational cost of 25.10 M parameters and 37.55 G FLOPs. The ablation study confirmed the effectiveness of the proposed components, with CWASPP providing the most notable individual contribution. On the independent Apple Leaf Dataset, MFR-Net remained competitive, demonstrating good cross-dataset generalization ability under a different dataset distribution and annotation style. In addition, MFR-Net consistently achieved the highest mIoU under all severe simulated disturbance conditions, indicating strong robustness to degraded imaging quality.

Overall, these findings demonstrate that MFR-Net is a promising solution for crop leaf disease segmentation in complex agricultural environments. Future work will focus on validating the framework on larger multi-source datasets; reducing annotation dependence; expanding the benchmark to include more conventional, hybrid, and task-specific disease segmentation methods under a unified protocol; and improving deployment efficiency for practical field applications.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/horticulturae12050640/s1, Table S1: Multi-seed stability results (mean ± std over three random seeds) for the representative methods reported in Table 1 on the Kaggle-derived benchmark; Table S2: Multi-seed ablation results (mean ± std over five random seeds) on the Kaggle-derived benchmark.

Author Contributions

Conceptualization, B.G. and H.N.; writing—original draft preparation, T.D.; validation, X.C.; writing—review and editing, B.G. and H.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Tianshan Young Talent—Outstanding Young Talent Project under Grant 2024TSYCCX0011 and in part by the National Natural Science Foundation of China under Grant 62303394.

Data Availability Statement

The main benchmark used in this study was derived from the public Kaggle dataset, Leaf Disease Segmentation Dataset, released by Fakhre Alam and cited in Ref. [29]. The released dataset contains two folders, namely data and aug_data. In the present study, only the unaugmented data folder containing 588 diseased-leaf images and 588 corresponding masks was used to construct the Kaggle-derived benchmark. The augmented aug_data folder was not used. The public dataset page describes the images as including several crop–disease examples, such as apple scab leaf, apple rust leaf, bell_pepper leaf spot, corn leaf blight, and potato leaf early blight, but it does not provide a complete class-level annotation file or an exhaustive list of crop and disease labels for all 588 retained image–mask pairs. Therefore, the dataset was only used in this study for binary lesion segmentation, in which diseased regions were treated as foreground and all remaining pixels were treated as background. The supplementary Apple Leaf Dataset was curated from publicly available apple leaf images collected from PlantVillage, AppleLeaf9-main, and several additional public sources. All lesion masks for this dataset were manually annotated in this study. After filename-based image–mask matching and validity checking, 3197 image–mask pairs were retained and split into fixed train–validation–test subsets of 2557/320/320. In both datasets, masks stored in different formats were converted into a unified binary lesion format, in which nonzero mask pixels were treated as lesion regions. When an image and its corresponding mask had inconsistent spatial sizes, the mask was first aligned to the image size using nearest-neighbor interpolation before subsequent resizing and evaluation. The processed split files, preprocessing scripts, and evaluation outputs generated in this study are available from the corresponding author upon reasonable request.
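The mask handling described above (nonzero pixels treated as lesion, nearest-neighbor alignment when sizes disagree) can be sketched as follows. This is an illustrative reconstruction, not the preprocessing script of this study: the function names are hypothetical, masks are represented as nested Python lists to keep the example dependency-free, and the pixel-center sampling convention is an assumption, since image libraries differ on this detail.

```python
Mask = list[list[int]]


def binarize(mask: Mask) -> Mask:
    """Map every nonzero pixel to 1 (lesion) and every zero pixel to 0 (background)."""
    return [[1 if p != 0 else 0 for p in row] for row in mask]


def resize_nearest(mask: Mask, out_h: int, out_w: int) -> Mask:
    """Nearest-neighbor resize; each output pixel copies one input pixel,
    so no intermediate label values are created."""
    in_h, in_w = len(mask), len(mask[0])
    # Assumed convention: sample the input at the center of each output pixel.
    rows = [min(in_h - 1, int((y + 0.5) * in_h / out_h)) for y in range(out_h)]
    cols = [min(in_w - 1, int((x + 0.5) * in_w / out_w)) for x in range(out_w)]
    return [[mask[r][c] for c in cols] for r in rows]


def align_mask(mask: Mask, image_h: int, image_w: int) -> Mask:
    """Binarize the mask, then match the image size only if the sizes disagree."""
    binary = binarize(mask)
    if (len(binary), len(binary[0])) != (image_h, image_w):
        binary = resize_nearest(binary, image_h, image_w)
    return binary


# A 2x2 mask stored with arbitrary nonzero codes, aligned to a 4x4 image.
aligned = align_mask([[0, 255], [128, 0]], image_h=4, image_w=4)
```

Nearest-neighbor interpolation is the natural choice for label maps because bilinear or bicubic resampling would blend 0 and 1 into fractional values along lesion boundaries. With nearest-neighbor sampling, binarizing before or after the resize gives the same result.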

Acknowledgments

We acknowledge the use of generative AI tools to assist with language polishing and format adjustment during manuscript preparation. The AI tools were only used for linguistic optimization and did not participate in research design, data analysis, result interpretation, or core academic content creation.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Zhu, R.; Zhang, J.; Huang, J.; Kang, R.; Chen, K.J. Research progress of crop leaf disease detection based on convolutional neural network. Trans. Chin. Soc. Agric. Eng. 2025, 41, 15–28.
2. Bao, W.; Lin, Z.; Hu, G.; Liang, D.; Huang, L.; Yang, X. Severity estimation of wheat leaf diseases based on RSTCNN. Trans. Chin. Soc. Agric. Mach. 2021, 52, 242–252+263.
3. Zhao, W.B.; Hu, L.J.; Wang, Q.; Wu, H.X.; Wang, J.B.; Li, X.; Wu, C.Y. RMP-UNet: An efficient and lightweight model for apple leaf disease segmentation. Agronomy 2025, 15, 770.
4. Verma, R.; Das, A.; Chakrawarti, N.; Narzary, P.R.; Kaman, P.K.; Sharma, S. First Report of Black Pepper (Piper nigrum) Anthracnose Caused by Colletotrichum siamense in North-East India. Plant Dis. 2023, 107, 2249.
5. Kim, S.-M.; Kim, S.; Lee, J.; Choi, S.-Y.; Chung, H.; Chun, J.; Seo, B.-Y.; Lim, J.-R.; Choi, N.-J. First Report of Nigrospora oryzae Causing Leaf Spot on Peanut (Arachis hypogaea) in the Republic of Korea. Plant Dis. 2024, 108, 3202.
6. Zhang, X.; Li, H.; Sun, S.; Zhang, W.; Shi, F.; Zhang, R.; Liu, Q. Classification and Identification of Apple Leaf Diseases and Insect Pests Based on Improved ResNet-50 Model. Horticulturae 2023, 9, 1046.
7. Li, Z.; Tao, W.; Liu, J.; Zhu, F.; Du, G.; Ji, G. Tomato Leaf Disease Recognition via Optimizing Deep Learning Methods Considering Global Pixel Value Distribution. Horticulturae 2023, 9, 1034.
8. Chen, Z.L.; Peng, Y.L.; Jiao, J.D.; Wang, A.G.; Wang, L.G.; Lin, W.; Guo, Y. MD-Unet for tobacco leaf disease spot segmentation based on multi-scale residual dilated convolutions. Sci. Rep. 2025, 15, 2759.
9. Ahmed, W.A.; Abiola, O.A.; Yang, D.; Olatoyinbo, S.F.; Jing, G. Integrating UAVs and Deep Learning for Plant Disease Detection: A Review of Techniques, Datasets, and Field Challenges with Examples from Cassava. Horticulturae 2026, 12, 87.
10. Fan, Y.; Yu, M.; Shen, L.; Ma, J.; Zeng, Z.; Wang, H. Robust plant disease segmentation in complex field environments: An in-depth analysis and validation with STAR-Net. Front. Plant Sci. 2026, 16, 1706072.
11. Zi, J.J.; Liu, T.; Zhang, W.; Pan, X.H.; Ji, H.; Zhu, H.H. Quantitatively characterizing sandy soil structure altered by MICP using multi-level thresholding segmentation algorithm. J. Rock Mech. Geotech. Eng. 2024, 16, 4285–4299.
12. Xie, J.; Kong, W.Y.; Xia, S.Y.; Wang, G.Y.; Gao, X.B. An efficient spectral clustering algorithm based on granular-ball. IEEE Trans. Knowl. Data Eng. 2023, 35, 9743–9753.
13. Keshavarzi, M.; Mesarich, C.; Bailey, D.; Johnson, M.; Sengupta, G. A review of semantic segmentation methods and their application in apple disease detection. Comput. Electron. Agric. 2025, 237, 110531.
14. Ding, Z.Y.; Zeng, F.G.; Li, H.F.; Zheng, J.Y.; Chen, J.Z.; Chen, B.; Zhong, W.S.; Li, X.T.; Wang, Z.Y.; Huang, L.F.; et al. Identification of sweetpotato virus disease-infected leaves from field images using deep learning. Front. Plant Sci. 2024, 15, 1456713.
15. Kumar, D.; Kukreja, V. Image segmentation, classification, and recognition methods for wheat diseases: Two decades’ systematic literature review. Comput. Electron. Agric. 2024, 221, 109005.
16. Zhao, C.Y.; Li, C.C.; Wang, X.; Wu, X.F.; Du, Y.Q.; Chai, H.B.; Cai, T.Y.; Xiang, H.M.; Jiao, Y.H. Plant disease segmentation networks for fast automatic severity estimation under natural field scenarios. Agriculture 2025, 15, 583.
17. Nyawose, T.; Maswanganyi, R.C.; Khumalo, P. A review on the detection of plant disease using machine learning and deep learning approaches. J. Imaging 2025, 11, 326.
18. Upadhyay, A.; Chandel, N.S.; Singh, K.P.; Chakraborty, S.K.; Nandede, B.M.; Kumar, M.; Subeesh, A.; Upendar, K.; Salem, A.; Elbeltagi, A. Deep learning and computer vision in plant disease detection: A comprehensive review of techniques, models, and trends in precision agriculture. Artif. Intell. Rev. 2025, 58, 92.
19. Raufer, L.; Wiedey, J.; Mueller, M.; Penava, P.; Buettner, R. A deep learning-based approach for the detection of cucumber diseases. PLoS ONE 2025, 20, e0320764.
20. Zhao, J.M.; Xu, L.X.; Ma, Z.Z.; Li, J.C.; Wang, X.W.; Liu, Y.C.; Du, X.J. A review of plant leaf disease identification by deep learning algorithms. Front. Plant Sci. 2025, 16, 1637241.
21. Yuan, H.B.; Zhu, J.J.; Wang, Q.F.; Cheng, M.; Cai, Z.J. An improved DeepLab v3+ Deep learning network applied to the segmentation of grape leaf black rot spots. Front. Plant Sci. 2022, 13, 795410.
22. Zhang, S.W.; Zhang, C.L. Modified U-Net for plant diseased leaf image segmentation. Comput. Electron. Agric. 2023, 204, 107511.
23. Wang, Y.; Zhang, P.X.; Tian, S. Tomato leaf disease detection based on attention mechanism and multi-scale feature fusion. Front. Plant Sci. 2024, 15, 1382802.
24. Zhang, S.W.; Wang, Z.; Wang, Z.L. Method for image segmentation of cucumber disease leaves based on multi-scale fusion convolutional neural networks. Trans. Chin. Soc. Agric. Eng. 2020, 36, 149–157.
25. Ren, S.G.; Jia, F.W.; Gu, X.J.; Yuan, P.S.; Xue, W.; Xu, H.L. Recognition and segmentation model of tomato leaf diseases based on deconvolution-guiding. Trans. Chin. Soc. Agric. Eng. 2020, 36, 186–195.
26. Niu, Z.Y.; Zhong, G.Q.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62.
27. Li, K.Y.; Zhu, X.Y.; Ma, J.C.; Zhang, L.X. Estimation method of leaf disease severity of cucumber based on mixed dilated convolution and attention mechanism. Trans. Chin. Soc. Agric. Mach. 2023, 54, 231–239.
28. Deng, Y.J.; Wang, X.; Long, C.F.; Liu, J.L.; Zhu, X.H.; Tan, S.Q. Segmenting and grading the blast disease of rice leaves using VCDM-UNet. Trans. Chin. Soc. Agric. Eng. 2024, 40, 190–198.
29. Alam, F. Leaf Disease Segmentation Dataset; Kaggle, 2021. Available online: https://www.kaggle.com/datasets/fakhrealam9537/leaf-disease-segmentation-dataset (accessed on 14 April 2026).
30. Singh, D.; Jain, N.; Jain, P.; Kayal, P.; Kumawat, S.; Batra, N. PlantDoc: A Dataset for Visual Plant Disease Detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, Hyderabad, India, 5–7 January 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 249–253.
31. Rana, S.; Hensel, O.; Nasirahmadi, A. From vineyard to vision: Multi-domain analysis and mitigation of grape cluster detection failures in complex viticultural environments. Results Eng. 2026, 29, 108833.

Figure 1. Representative original images and corresponding binary masks from the two datasets used in this study. The first row shows examples from the Kaggle-derived benchmark, and the second row shows examples from the independent Apple Leaf Dataset.

Figure 2. Overall architecture of the proposed Multi-Scale Feature Rectification Network (MFR-Net).

Figure 3. Structure of the EfficientNetV2-S encoder with Coordinate Attention and a Convolutional Block Attention Module (CBAM).

Figure 4. Structure of the proposed Cross-Window Atrous Spatial Pyramid Pooling (CWASPP) module.

Figure 5. Structure of the Feature Rectification Module (FRM).

Figure 6. Error map comparison of different segmentation models on representative challenging image samples (white: ground-truth contour; red: false positive; blue: false negative).

Figure 7. Contour overlay comparison of different segmentation models on representative challenging image samples (green: ground-truth contour; red: predicted contour).

Figure 8. Qualitative robustness comparison of segmentation models under severe simulated environmental disturbances (L2). Each row corresponds to one disturbance type, namely Noise, Dark, Blur, and Contrast. From left to right, the columns show the disturbed input image, the ground-truth mask, and the prediction error maps of U-Net, DeepLabV3+, SegFormer, and MFR-Net. In the error maps, white denotes the ground-truth contour, red denotes false-positive regions, and blue denotes false-negative regions.

Table 1. Quantitative comparison of deep learning and representative traditional segmentation models on the Kaggle-derived benchmark constructed from the 588 unaugmented image–mask pairs in the data folder of the public Leaf Disease Segmentation Dataset.

Model	Params/M	FLOPs/G	mIoU/%	Dice/%	Precision/%	Recall/%
U-Net++	26.08	147.60	73.35	83.67	83.99	86.60
U-Net	24.44	62.80	72.62	82.97	82.75	85.69
SegFormer	3.71	13.53	72.34	82.81	82.86	86.07
DeepLabV3+	26.68	73.81	71.41	81.79	83.01	85.59
PSPNet	24.30	23.68	62.58	74.73	72.02	83.84
HRNet-W18 (FPN)	14.70	91.57	73.75	84.89	86.63	83.22
MFR-Net (Ours)	25.10	37.55	74.27	84.25	83.89	87.61
Otsu thresholding	-	-	17.48	26.74	17.84	91.67
K-means clustering	-	-	14.26	21.66	17.77	63.49
Saliency-based segmentation	-	-	14.21	22.60	18.03	69.01
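
The overlap columns in Table 1 can be related to pixel counts with a short sketch. It uses the standard binary definitions of IoU, Dice, Precision, and Recall for the lesion class. How per-image or per-class scores are averaged into the reported mIoU follows the metric definitions of this study and is not reproduced here, so the function names and the `eps` stabilizer below are assumptions for illustration only.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for two binary masks of equal size (nested lists)."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def lesion_metrics(pred, target, eps=1e-7):
    """Standard overlap metrics for the lesion (foreground) class."""
    tp, fp, fn, _ = confusion_counts(pred, target)
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
        "Recall": tp / (tp + fn + eps),
    }


# Toy case: the prediction marks every pixel as lesion, but only one pixel is.
target = [[1, 0], [0, 0]]
over_segmented = [[1, 1], [1, 1]]
scores = lesion_metrics(over_segmented, target)
```

In this toy case Recall is close to 1 while Precision and IoU are close to 0.25. This is one plausible reading of the pattern in the Otsu thresholding row, where high Recall coexists with low Precision and low mIoU.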

Table 2. Ablation study of the proposed model on the Kaggle-derived benchmark.

Hybrid Attention	CWASPP	FRM	Params/M	FLOPs/G	mIoU/%	Dice/%	Precision/%	Recall/%
			21.12	32.10	71.80	82.12	80.89	88.39
√			21.12	32.11	72.80	83.06	81.09	89.01
	√		24.28	33.69	73.87	83.59	82.42	88.39
		√	21.94	35.95	73.16	83.41	83.29	87.00
√	√	√	25.10	37.55	73.95	83.89	83.47	87.79

Note: √ indicates that the corresponding module was included in the model variant.
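
The per-module gains implied by Table 2 can be recomputed directly from its rows. The sketch below copies the four accuracy columns and subtracts the baseline row (no module enabled) from each variant. The dictionary keys and the function name are hypothetical labels introduced only for this illustration.

```python
# (mIoU, Dice, Precision, Recall) in %, copied from Table 2.
ABLATION = {
    "baseline": (71.80, 82.12, 80.89, 88.39),
    "hybrid_attention": (72.80, 83.06, 81.09, 89.01),
    "cwaspp": (73.87, 83.59, 82.42, 88.39),
    "frm": (73.16, 83.41, 83.29, 87.00),
    "full": (73.95, 83.89, 83.47, 87.79),
}
METRICS = ("mIoU", "Dice", "Precision", "Recall")


def gains_over_baseline(table):
    """Return the change in each metric relative to the baseline row,
    in percentage points, rounded to two decimals."""
    base = table["baseline"]
    return {
        name: {m: round(v - b, 2) for m, v, b in zip(METRICS, values, base)}
        for name, values in table.items()
        if name != "baseline"
    }


gains = gains_over_baseline(ABLATION)

# Which single module helps each metric most?
single = ("hybrid_attention", "cwaspp", "frm")
best_miou = max(single, key=lambda n: gains[n]["mIoU"])
best_recall = max(single, key=lambda n: gains[n]["Recall"])
```

Under these numbers, `best_miou` is the CWASPP variant and `best_recall` is the hybrid attention variant. The same dictionary also shows where a variant trades Recall for Precision relative to the baseline.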

Table 3. Quantitative comparison of different segmentation models on the Apple Leaf Dataset.

Model	Params/M	FLOPs/G	mIoU/%	Dice/%	Precision/%	Recall/%	Accuracy/%
U-Net++	26.08	147.60	52.66	60.86	62.04	67.97	97.07
U-Net	24.44	62.80	52.93	61.20	61.68	70.28	96.44
SegFormer	3.71	13.53	52.53	61.35	60.04	73.57	95.32
DeepLabV3+	26.68	73.81	51.75	60.39	60.89	68.59	96.83
PSPNet	24.30	23.68	51.31	60.59	60.40	68.57	96.93
HRNet-W18 (FPN)	14.70	91.57	37.67	54.73	67.78	45.89	98.69
MFR-Net (Ours)	25.10	37.55	52.70	61.25	61.20	69.78	96.50

Table 4. Robustness comparison of different segmentation models under severe simulated environmental disturbances (L2). Values are reported in mIoU (%). All corruptions were applied only at test time under the same evaluation pipeline. Avg. Corrupted denotes the mean mIoU across the four disturbed L2 settings.

Model	Clean	Noise (L2)	Dark (L2)	Blur (L2)	Contrast (L2)	Avg. Corrupted
U-Net	72.63	56.18	72.11	56.22	72.30	64.20
DeepLabV3+	71.41	67.93	69.19	35.12	68.25	60.12
SegFormer	72.34	64.88	71.74	58.93	69.37	66.23
MFR-Net (Ours)	74.27	69.11	74.12	66.91	74.39	71.13
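
As a check on the Avg. Corrupted column, the mean over the four disturbed settings can be recomputed from the table rows. The sketch below also derives the gap between the clean mIoU and that mean. This gap is a derived quantity introduced here for illustration, not a value reported in the table, and the variable names are hypothetical.

```python
# model: (clean mIoU, [Noise, Dark, Blur, Contrast] mIoU at L2), copied from Table 4.
ROBUSTNESS = {
    "U-Net": (72.63, [56.18, 72.11, 56.22, 72.30]),
    "DeepLabV3+": (71.41, [67.93, 69.19, 35.12, 68.25]),
    "SegFormer": (72.34, [64.88, 71.74, 58.93, 69.37]),
    "MFR-Net": (74.27, [69.11, 74.12, 66.91, 74.39]),
}


def avg_corrupted(scores):
    """Arithmetic mean of the mIoU values over the disturbed settings."""
    return sum(scores) / len(scores)


summary = {}
for model, (clean, corrupted) in ROBUSTNESS.items():
    mean = avg_corrupted(corrupted)
    summary[model] = {
        "avg_corrupted": round(mean, 2),       # should match the table column
        "drop_from_clean": round(clean - mean, 2),  # derived, in percentage points
    }

# Model whose average mIoU falls least when inputs are disturbed.
most_stable = min(summary, key=lambda m: summary[m]["drop_from_clean"])
```

The recomputed means agree with the Avg. Corrupted column, and `most_stable` evaluates to MFR-Net, which is consistent with the robustness discussion in the text.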

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Gao, B.; Nie, H.; Du, T.; Cai, X. Multi-Scale Feature Rectification for Crop Leaf Disease Segmentation in Complex Scenarios. Horticulturae 2026, 12, 640. https://doi.org/10.3390/horticulturae12050640
