1. Introduction
Pears are widely cultivated worldwide due to their rich nutritional value and strong adaptability, making them a popular fruit tree variety. In agricultural economics, pear cultivation provides significant economic benefits to farmers, forming an essential component of their income [
1]. However, pear trees are susceptible to various diseases, particularly under conditions of high humidity, poor ventilation, or high planting density. If these diseases are not controlled in a timely manner, they can severely impact fruit yield and quality, and may even lead to plant death. Although farmers can estimate the extent and severity of diseases based on experience, this approach is labor-intensive and lacks accuracy. With the advancement of computer technology, image segmentation techniques have enabled precise delineation of diseased areas, thereby facilitating quantitative analysis and providing technical support for a more accurate assessment of disease severity.
Image segmentation techniques include traditional feature extraction-based methods and convolutional neural network (CNN)-based deep learning approaches. Traditional feature extraction methods rely on recognizing image features and partitioning regions, typically involving steps such as image preprocessing, feature selection, region segmentation, and edge detection [
2,
3,
4]. These traditional approaches usually utilize manually crafted features such as color histograms, texture descriptors, and shape features, followed by classifiers like SVM or k-NN. While effective in controlled or simple environments, these methods often struggle with complex agricultural scenes where disease appearances vary widely due to lighting conditions, leaf occlusions, and background clutter. Furthermore, the reliance on manual feature design limits their adaptability and robustness across diverse datasets.
In contrast, CNN-based deep learning approaches train on large amounts of labeled data and can automatically extract hierarchical and discriminative disease features without handcrafted preprocessing. This end-to-end learning capability enables the model to better generalize across varying conditions and complex backgrounds. For example, Fu et al. [
5] proposed an improved DeepLabv3+ network using the MobileNetV2 backbone, which has been successfully applied to pear leaf disease segmentation. Similarly, Rai and Pahuja [
6] employed an attention residual U-Net model based on deep learning to achieve effective real-time segmentation and detection of rice diseases. These studies show that deep learning methods have achieved remarkable results in processing complex disease images; however, there remains room for improvement in capturing fine edge details.
Based on these observations, deep learning approaches are preferred for their superior feature learning capacity, adaptability, and performance in complex scenarios, which motivates our adoption of a CNN-based model for precise segmentation of pear leaf diseases.
The U-Net network achieves precise pixel-level segmentation by integrating deep and shallow features through the combination of downsampling and upsampling processes, making it highly effective in disease segmentation tasks. Its symmetrical encoder–decoder structure, along with skip connections, significantly enhances the ability to capture detailed information in diseased regions, which is why it is widely used in plant disease segmentation tasks [
However, the U-Net model also has limitations: its fixed-size convolutional kernels are insufficient for handling irregular disease shapes, and its segmentation accuracy in complex backgrounds is not ideal. Therefore, incorporating attention modules that combine global and local contextual features may enhance U-Net’s ability to recognize diseases in complex backgrounds. Adding an edge feature extraction branch can further improve the model’s capability to capture features in boundary regions, and the synergy between the different feature paths can further enhance the model’s robustness.
Currently, pear leaf disease segmentation faces two major challenges: first, the diversity in the color, texture, and morphology of diseases makes it difficult for models to capture different disease features; second, variations in lighting and gradual color changes in diseased areas often result in blurred edges, and the resolution reduction in the encoder–decoder structure further exacerbates the loss of edge information [
8].
To address the difficulty of capturing diverse features, Y. Zhao et al. [
9] proposed a method based on the DoubleU-Net framework, which incorporates multi-level feature fusion and an atrous decoder to effectively capture and understand contextual information. Riu et al. [
10] designed an enhanced Dual Attention module (EDAM) to improve model performance by learning global feature correlations and developed a two-class feature fusion module (2-class FFM) to improve the accuracy of structural edges. G. Liu et al. [
11] designed a Cross Attention and Feature Exploration Network, which combines cross-attention modules with upsampling modules, resulting in a cross-attention decoder that establishes connections between high-level and low-level features. This approach more effectively preserves low-level feature information and enhances the ability to restore details. However, this method still falls short in capturing fine details. Additionally, Yuan et al. [
12], inspired by the original scSE module [
13], designed an advanced attention mechanism called D-scSE. This mechanism introduces dynamic weights and a diversified pooling strategy to provide a more adaptive balance between the importance of spatial and channel information. Nevertheless, the aforementioned methods still exhibit limitations in handling complex disease shapes and balancing features under varying lighting conditions, restricting their practicality and accuracy.
Enhancing edge details is crucial for improving model performance. In terms of edge detail recovery, Patil and Kannan [
14] used deep learning and classical segmentation techniques to predict maize leaf disease by converting images to HSV color space, creating binary masks based on color thresholds to isolate the green regions of leaves, and using edge detection (Canny Edge Detector) to identify the edges of diseased areas, thereby supplementing edge information to some extent. J. Zhou et al. [
15] proposed a cascade structure of dilated branches with truncated gradient flow, which could alleviate low-level confusion and improve the performance of edge detection. To fully capture edge features, Bui et al. [
16] proposed a Multi-scale Edge Guided Attention Network (MEGANet), which integrates the EGA model during the decoding process and uses the Laplacian operator to retain high-frequency information, preserving edge integrity and enhancing weak boundary detection. However, the extracted edge features were not well integrated into the model. Z. Zhu et al. [
17] designed a Sparse Dynamic Volume Network with Multi-level Edge Fusion (SDV-TUNet), proposing a Multi-level Edge Feature Fusion (MEFF) module during the skip connection stage to combine low-level detail features containing spatial edge information with high-level global spatial features rich in semantic information. This approach facilitates the propagation and perception of spatial edge information, thereby enhancing the model’s sensitivity to edges. J. Zhao et al. [
18] proposed a Multi-Scale Edge Fusion Network (MSEF-Net), which incorporates an Edge Feature Fusion Module (EFFM) combining Sobel operator edge detection and an Edge Attention Module (EAM) to fuse multi-scale edge features, and applies threshold processing (set to 0.5) to remove noise and irrelevant edge information, thus enhancing multi-scale edge features in images. Although numerous studies have explored edge handling mechanisms, most methods merely integrate edge processing into the backbone network. While this partially supplements edge information, segmentation accuracy remains suboptimal when lesion edges are blurred or color variations are not significant.
In summary, existing solutions show limitations in effectively capturing and balancing the diverse features of different pear leaf diseases in complex backgrounds. Most current methods either focus primarily on global feature extraction or edge detection, but seldom integrate both in a synergistic manner. Furthermore, their ability to restore fine edge details, especially for diseases with prominent or irregular edges, remains inadequate, which restricts the accuracy of disease segmentation.
To address these shortcomings, we propose the EBMA-Net segmentation model, which differs from previous works by incorporating a dedicated edge feature extraction branch (EFFB) alongside the main network. This branch explicitly enhances the model’s sensitivity to edge information, allowing better preservation and discrimination of lesion boundaries. Additionally, EBMA-Net introduces a Multi-Dimensional Joint Attention Module (MDJA), which simultaneously integrates global and local contextual information and adaptively balances channel features. This design enables more effective feature fusion and disease differentiation than traditional attention mechanisms used in earlier models.
Moreover, unlike many studies relying solely on public datasets or limited disease categories, we built a comprehensive dataset covering multiple pear leaf diseases under varied lighting conditions and occlusions, providing a robust benchmark for model evaluation. The primary contributions of this research are outlined as follows:
We introduced an Edge Feature Extraction Branch (EFFB), which consists of an Edge Feature Extraction module (EFE) and a Bilateral Feature Aggregation module (BFA). This branch is designed to extract edge features and fuse them with the outputs of the main network, effectively addressing the limitations of a single architecture in edge information extraction and enhancing the model’s sensitivity to details in complex environments.
We designed a Multi-Dimensional Joint Attention Module (MDJA), which includes two consecutive modules: the Global Multi-Scale Spatial Attention module (GMSA) and the Adaptive Grouped Channel Attention module (AGC). This module integrates global and local features, deeply explores the intrinsic relationships between channels, and further enhances the model’s ability to capture features of different diseases.
We collected data on three types of diseases, covering images under various lighting conditions and with leaf occlusions in complex environments, to comprehensively evaluate the model’s generalization ability and robustness, ensuring stable performance in diverse real-world application scenarios.
The rest of this paper is structured as follows:
Section 2 describes the construction of the dataset and the introduction of our model.
Section 3 presents the experimental setup, the loss functions and evaluation metrics used by the model, various experimental results, and an analysis of these results. Finally,
Section 4 concludes the paper and discusses prospects for future research. The dataset and partial implementation code are available at our GitHub repository:
https://github.com/echo080105/EBMA-Net (accessed on 23 July 2025).
3. Results
3.1. Running Environment
All experiments were conducted based on the PyTorch framework in the same hardware and software environment. The specific experimental environment parameters are shown in
Table 2, including hardware configuration, software versions, and experimental settings, to ensure the reproducibility and consistency of the experimental results.
To ensure the rationality of training, we randomly split the dataset into a 70% training set and a 30% test set. The training set was used for learning model parameters, while the test set was used to evaluate the model’s performance and verify its generalization ability on unseen data.
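For reference, the snippet below is a minimal sketch of how such a random 70/30 partition can be generated with a fixed seed; the function name and file-list handling are illustrative rather than the exact preprocessing script used in this work.

```python
import random

def split_dataset(image_paths, train_ratio=0.7, seed=42):
    """Randomly partition image paths into training and test subsets (illustrative)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)       # fixed seed keeps the split reproducible
    n_train = int(len(paths) * train_ratio)  # 70% for training
    return paths[:n_train], paths[n_train:]  # remaining 30% for testing
```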
3.2. Loss Function
In this study, we adopted the Dice loss strategy. Dice loss focuses on the overlap between the predicted results and the true labels, which is particularly effective for the segmentation of small objects and can effectively alleviate class imbalance problems, especially when the background pixels significantly outnumber the disease pixels. The Dice loss function is specifically defined as follows:
$$L_{\mathrm{Dice}} = 1 - \frac{1}{N}\sum_{i=1}^{N}\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}$$
where $TP_i$ represents the True Positives for class $i$, i.e., the number of pixels correctly predicted as belonging to class $i$; $FN_i$ represents the False Negatives for class $i$, i.e., the number of pixels that actually belong to class $i$ but were not predicted as such (missed detections); $FP_i$ represents the False Positives for class $i$, i.e., the number of pixels incorrectly predicted as belonging to class $i$ (false detections); and $N$ is the number of classes.
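A minimal PyTorch sketch of a multi-class Dice loss consistent with the definition above is given below; the softmax/one-hot handling, tensor layout (N, C, H, W logits with integer label maps), and smoothing constant are our assumptions, not necessarily the exact implementation used in training.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Multi-class Dice loss: logits (N, C, H, W), target (N, H, W) integer labels."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)                                  # per-class probabilities
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()  # (N, C, H, W) masks
    dims = (0, 2, 3)                                                      # sum over batch and space
    intersection = (probs * one_hot).sum(dims)                            # soft TP per class
    cardinality = probs.sum(dims) + one_hot.sum(dims)                     # soft 2*TP + FP + FN
    dice_per_class = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice_per_class.mean()                                    # averaged over classes
```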
3.3. Model Evaluation Metrics
We used the following evaluation metrics: Mean Intersection over Union (MIoU), Mean Pixel Accuracy (MPA), Dice coefficient, and model parameters (Parameters). These metrics together provide a comprehensive quantitative basis for evaluating the model’s performance in image segmentation tasks.
MIoU calculates the average Intersection over Union (IoU) for each class, where IoU represents the ratio of the intersection area to the union area between the predicted and actual regions. This metric is used to measure the accuracy of the model for each class.
$$\mathrm{MIoU} = \frac{1}{n}\sum_{i=1}^{n}\frac{TP_i}{TP_i + FP_i + FN_i}$$
where $n$ is the total number of segmentation classes, and $TP_i$, $FP_i$, and $FN_i$ represent the True Positives, False Positives, and False Negatives for class $i$, respectively.
MPA calculates the average pixel accuracy of the model across all classes. For each class, pixel accuracy is defined as the ratio of correctly classified pixels to the total number of pixels for that class. This metric effectively reflects the model’s classification performance across different classes, providing an important reference for a comprehensive evaluation of image segmentation quality.
$$\mathrm{MPA} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i}$$
where $N$ is the total number of segmentation classes, and $TP_i$ and $FN_i$ represent the True Positives and False Negatives for class $i$, respectively.
The Dice coefficient is a statistical tool used to measure the similarity between two samples. In the field of image segmentation, it is used to evaluate the spatial overlap between the predicted segmentation result and the ground truth.
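In the same per-class notation used above, the mean Dice coefficient can be written as follows (our notation, consistent with the Dice loss definition in Section 3.2):
$$\mathrm{Dice} = \frac{1}{N}\sum_{i=1}^{N}\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}$$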
Parameters refer to the total number of trainable parameters in the network model. This metric is related to spatial complexity and not only determines the complexity of the network but also affects the training speed and computational resource requirements of the model.
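The sketch below illustrates how MIoU, MPA, and the parameter count can be computed; it assumes a class-by-class confusion matrix (rows: ground truth, columns: prediction) and is meant as a reference implementation of the formulas above, not the exact evaluation script.

```python
import numpy as np

def segmentation_metrics(conf_matrix):
    """MIoU and MPA from an (n x n) confusion matrix (rows: ground truth, cols: prediction)."""
    tp = np.diag(conf_matrix).astype(float)
    fn = conf_matrix.sum(axis=1) - tp            # ground-truth pixels missed for each class
    fp = conf_matrix.sum(axis=0) - tp            # pixels wrongly assigned to each class
    iou = tp / (tp + fp + fn + 1e-10)            # per-class IoU
    pa = tp / (tp + fn + 1e-10)                  # per-class pixel accuracy
    return iou.mean(), pa.mean()                 # MIoU, MPA

def count_parameters(model):
    """Total number of trainable parameters (the 'Parameters' metric)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```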
3.4. Effectiveness Analysis
3.4.1. Effectiveness Analysis of MDJA Module
To validate the effectiveness of the proposed MDJA module, we used the EBMA-Net network without the EFFB module as the baseline. Keeping the basic architecture unchanged, we conducted replacement experiments on the MDJA module to explore its effectiveness within the network. The specific experimental results are shown in
Table 3.
From the results in the table, it can be seen that the network using the MDJA module improves MIoU, MPA, and the Dice coefficient by 1.99%, 0.9%, and 1.23%, respectively, compared with the network using the typical attention module CBAM. These data indicate that the proposed MDJA module effectively extracts multi-scale disease features, enhances feature representation, and captures the distinctive characteristics of different diseases. Comparing the experiment that uses only the SE module with the one that uses only the AGC module, the AGC module achieved a 1.19% improvement in the main metric, MIoU. This gain is mainly due to its dual-channel independent feature extraction mechanism, which allows the model to capture the unique information in each channel more effectively and reduces interference between channels, thereby improving sensitivity to diverse disease features. Relative to the configuration using only the SE module, adding the GMSA module resulted in significant improvements across all metrics, most notably a 2.83% increase in MIoU, owing to the GMSA module’s strong spatial feature capture capability. Whether combined with the SE module or with our designed AGC module, the GMSA module significantly enhanced the model’s performance. Taken together, the results in the table show that the designed MDJA module outperforms the other compared attention mechanisms on all performance metrics, fully demonstrating its effectiveness.
3.4.2. Effectiveness Analysis of EFFB Module
To intuitively demonstrate the effectiveness of our designed edge extraction branch (EFFB), we visualized the attention points of the EBMA-Net model with and without the EFFB branch based on Grad-CAM technology. The visualization heatmap results are shown in
Figure 8.
When the edge feature extraction branch is added, the model focuses its attention on the disease regions effectively, reducing focus on irrelevant areas and thus avoiding performance loss. As shown in the figure, in the segmentation of rust disease, the model not only attends to the diseased area but also accurately captures the disease boundary. For brown spot disease, which appears as small spots, the model successfully concentrates its attention on the diseased area and markedly reduces ineffective attention to the surroundings, thereby improving overall performance. When the model mistakes other elements or noise around leaf curl disease for disease, the edge extraction branch effectively suppresses these false detections by supplementing edge detail information, helping the model delineate the diseased area more precisely and avoid interference from non-leaf regions. The introduction of the EFFB branch thus significantly enhances the model’s ability to focus on disease regions while improving its perception of edge details, fully demonstrating the effectiveness of the edge feature extraction branch.
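Heatmaps of this kind can be produced with a simple hook-based Grad-CAM routine such as the sketch below; the choice of target layer and the aggregation of the class score over all pixels are our assumptions and may differ from the exact visualization setup used here.

```python
import torch

def grad_cam(model, image, target_layer, class_idx):
    """Grad-CAM heatmap for one class of a segmentation model (image: (1, 3, H, W))."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(x=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(x=go[0]))

    logits = model(image)                                    # (1, C, H, W)
    model.zero_grad()
    logits[:, class_idx].sum().backward()                    # aggregate class score over pixels
    h1.remove(); h2.remove()

    weights = grads["x"].mean(dim=(2, 3), keepdim=True)      # global-average-pooled gradients
    cam = torch.relu((weights * feats["x"]).sum(dim=1))      # weighted sum of feature maps
    return cam / (cam.max() + 1e-8)                          # normalize; upsample for overlay
```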
3.5. Impact of Different Input Scales
To evaluate the sensitivity of the model to changes in input image scale, we experimented with different input resolutions. These resolutions essentially cover all the settings used during model training, helping to analyze the impact of resolution changes on model performance. In this experiment, the computational load was normalized to the range [0, 1] relative to its maximum, to show intuitively how the computational load and the main evaluation metric (MIoU) grow with input resolution. The specific experimental results are shown in
Table 4.
The resolution of the input image directly affects the training efficiency and performance of the model. As the input scale increases, the model can capture more detailed features, but it also significantly impacts training speed.
Figure 9 shows the relationship line chart between the main metric (MIoU) and normalized computational load at different input scales. As can be seen from
Figure 9, as the resolution increases, the model’s performance gradually saturates, with limited improvement in the evaluation metrics while training costs increase significantly. Therefore, all experiments in this work use an input resolution of 512 × 512, which keeps the computational cost and training time manageable while maintaining high model performance.
3.6. Model Structure Sensitivity Analysis
We also made some structural adjustments to the module, including changing the weighting method of feature interaction in the BFA module, replacing dilated convolutions with regular convolutions, and adjusting the position of the attention mechanism module to study the impact of these changes on model performance. These experiments provided important insights into understanding the internal working mechanisms of the model.
To explore the optimal fusion method in our BFA module, we designed comparative experiments involving different spatial feature weights. In the first experiment, we swapped the weights of the edge feature map and input feature map, using mismatched weights for horizontal and vertical features. In the second experiment, we averaged the two weights before performing weighted fusion. Through these comparative experiments, we aimed to determine which weight fusion strategy could better enhance model performance, thereby optimizing the effectiveness of feature fusion. The experimental results are shown in
Table 5.
The comparative experimental results showed that both of these weighting methods performed worse than our designed weighted fusion strategy. This is because the two feature maps contain rich information in different directions, and the two weights respectively focus on vertical and horizontal features. Averaging the weights for fusion may weaken the representation of directional features, thereby affecting the fusion effect. Only through weighted fusion that is sensitive to the direction of the feature maps can the best effect be achieved, which is the weighting method adopted in our model.
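To make the direction-sensitive weighting concrete, the following is a hypothetical sketch of fusing an edge feature map with a backbone feature map using separate horizontal and vertical weights; the module name, gating design, and pooling choices are illustrative only and do not reproduce the actual BFA implementation.

```python
import torch
import torch.nn as nn

class DirectionalFusion(nn.Module):
    """Illustrative direction-sensitive fusion of backbone and edge features (not the real BFA)."""

    def __init__(self, channels):
        super().__init__()
        self.h_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.v_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feat, edge):
        w_h = self.h_gate(feat.mean(dim=2, keepdim=True))   # (N, C, 1, W): horizontal weight
        w_v = self.v_gate(edge.mean(dim=3, keepdim=True))   # (N, C, H, 1): vertical weight
        # Direction-matched weighting; the compared variants swap or average w_h and w_v.
        return feat * w_v + edge * w_h
```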
Multi-convolution parallelism is a common strategy in the field of image processing. Compared to using a single large convolution, stacking multiple smaller convolutions usually yields better results. At the same time, without affecting the connectivity of the model, this method significantly reduces the number of parameters and computational complexity. Dilated convolutions allow small convolutions to achieve a larger receptive field, thereby enhancing the ability to recognize and segment small lesions, reaching a similar receptive range as large convolutions [
21]. Therefore, reasonably setting the structure of the convolutions can effectively balance the relationship between model complexity and performance. We designed two comparative experiments to explore this: in the first, we set the dilation rate of all dilated convolutions in the GMSA module of the mixed attention mechanism to 1, that is, using regular convolutions of the same size; in the second, we replaced the dilated convolutions with large convolutions of the same receptive range. By comparing the experimental results of these two configurations, we can gain a deeper understanding of the role of dilated convolutions in capturing lesion details and their specific impact on model performance. The specific results are shown in
Table 6.
From the comparative experimental results, the convolution structure designed in our model improves MIoU, MPA, and the Dice coefficient by 2.31%, 1.98%, and 0.0144, respectively, compared with using regular convolutions. At the same time, it achieves better performance with roughly half the parameters of large convolutions covering the same receptive field. This fully demonstrates the effectiveness and rationality of the designed convolution structure in balancing model complexity and performance.
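The parameter savings discussed above can be checked with a small PyTorch comparison; the channel count is arbitrary, and the kernel sizes (3 × 3 with dilation 2 versus a plain 5 × 5) are only one example of matched receptive fields, since the exact kernel configuration inside GMSA determines the roughly twofold saving reported in the table.

```python
import torch.nn as nn

channels = 64
# A 3x3 convolution with dilation 2 has an effective 5x5 receptive field.
dilated = nn.Conv2d(channels, channels, kernel_size=3, dilation=2, padding=2)
# A regular 5x5 convolution covers the same receptive field with more weights.
large = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dilated))  # 64*64*3*3 + 64 = 36,928
print(count(large))    # 64*64*5*5 + 64 = 102,464
```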
Spatial attention mainly focuses on the positional information of key feature points in the image, while channel attention primarily focuses on the importance of different channels in the image [
22]. To study the effectiveness of different attention structures and their impact on model performance, we compared three configurations: applying the AGC module first and then the GMSA module; running AGC and GMSA in parallel and combining their outputs; and applying GMSA first and then AGC. These experiments aimed to explore the synergy between the attention mechanisms in the network and to determine the optimal combination for further enhancing the model’s performance. The specific results are shown in
Table 7.
The comparative experimental results indicate that for leaf diseases, focusing on the location of the disease first and then extracting features of different diseases is the most effective structure. Specifically, we first use GMSA to capture the disease location, followed by AGC to extract features of different diseases in each channel. This strategy achieved the highest improvements in MIoU, MPA, and Dice metrics, with increases of 0.68%, 0.22%, and 0.042, respectively, outperforming the other two structural combinations. This fully demonstrates the effectiveness of the structure design of focusing on spatial attention first and then channel feature extraction in disease segmentation tasks.
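The orderings compared in Table 7 can be expressed as simple compositions of the two attention blocks, as in the schematic sketch below; GMSA and AGC are treated as black-box nn.Module instances, and summation for the parallel variant is our assumption.

```python
import torch.nn as nn

class SpatialFirstAttention(nn.Module):
    """Adopted ordering: GMSA locates diseased regions, then AGC re-weights channels."""

    def __init__(self, gmsa: nn.Module, agc: nn.Module):
        super().__init__()
        self.gmsa, self.agc = gmsa, agc

    def forward(self, x):
        return self.agc(self.gmsa(x))        # GMSA -> AGC

class ParallelAttention(nn.Module):
    """Alternative: apply both branches to the same input and fuse (here by addition)."""

    def __init__(self, gmsa: nn.Module, agc: nn.Module):
        super().__init__()
        self.gmsa, self.agc = gmsa, agc

    def forward(self, x):
        return self.gmsa(x) + self.agc(x)
```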
3.7. Ablation Study
To explore the impact of our designed modules on the overall performance of the model and study the interaction between different modules, we conducted ablation experiments. Starting with the baseline network, we gradually added different modules to observe the specific contributions of each module to model performance. By comparing the experimental results under different module configurations (see
Table 8), we systematically evaluated the role of each module in improving model performance as well as the synergistic effects between modules.
From the above table, it can be seen that the initial encoder–decoder network performed the worst across all evaluation metrics. This is primarily due to the lack of edge detail supplementation and insufficient feature extraction capabilities, which made it difficult to balance the feature extraction for different types of diseases, resulting in only coarse segmentation results. In contrast, after adding the GMSA module, the model’s spatial feature extraction ability was significantly improved. Its parallel convolution architecture effectively handled large disease areas, extracted more extensive features, and identified more disease characteristics. The introduction of the AGC module further enhanced the model’s ability to recognize and differentiate between different diseases, especially in cases where leaf curl disease and healthy leaves had similar color features, improving the model’s capacity to distinguish between similar features and thus reducing misclassification.
Comparing experiments with and without the EFFB module, it is evident that the EFFB module consistently enhanced the model’s performance in all scenarios. This improvement is attributed to its excellent edge feature extraction mechanism, which significantly increased the model’s sensitivity to disease edges, allowing it to effectively segment even when the edges were blurred. The complete EBMA-Net improved MIoU, MPA, and Dice metrics by 6.52%, 3.61%, and 0.0425, respectively, compared to the initial network, with a particularly notable improvement in the primary evaluation metric MIoU. Overall, the effective synergy of the EFFB, GMSA, and AGC modules resulted in outstanding performance in disease edge, spatial scale, and detail capture, fully validating the rationality and effectiveness of our designed structure.
In addition,
Figure 10 shows the changes in MIoU as the baseline model gradually adds each module. It can be observed that as the number of training epochs increases, the MIoU gradually converges. Although MIoU still shows a slight upward trend, the required number of training epochs increases significantly. This indicates that our model has been sufficiently trained on the training set, without showing signs of overfitting.
3.8. Performance on Different Disease Categories
To verify that data imbalance did not negatively affect the model’s results, we examined its performance on each disease category separately. The results in
Table 9 show fairly balanced accuracy across all classes. This suggests that our strategies, such as data augmentation and the use of Dice loss, helped the model handle uneven sample sizes effectively.
3.9. Comparison Experiments
In the comparison experiments, we evaluated several state-of-the-art segmentation network models to analyze their performance in different scenarios. The comparison networks included classic U-shaped structures and their improved versions (UNet [
7], UNet++ [
23], I2UNet [
24]), transformer-based networks (TransNet [
25], SwinUNet [
26]), encoder–decoder semantic segmentation networks (SegNet [
27], FCN [
28]), deep convolution-based networks (DeepLabv3 [
29], DconnNet [
30], LRASPP [
31]), advanced residual networks (HRNet [
32]), and networks that integrate multi-scale contextual information (PSPNet [
33]). These networks cover the application of classical convolutional neural networks to modern attention mechanisms in semantic segmentation, aiming to comprehensively compare the performance of different methods in disease recognition tasks.
We conducted experimental comparisons between these networks and the proposed EBMA-Net. The results, summarized in
Table 10, show each model’s performance on the evaluation metrics, including MIoU (Mean Intersection over Union), MPA (Mean Pixel Accuracy), Dice coefficient, model parameters (Parameters), and inference speed measured in FPS (Frames Per Second). These results provide an intuitive comparison of the strengths and weaknesses of each model in disease segmentation tasks and validate the effectiveness of the proposed EBMA-Net.
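FPS figures of this kind are typically obtained by timing repeated forward passes after a warm-up; the sketch below shows one common way to do so on GPU, with the batch size, number of iterations, and synchronization pattern as our assumptions rather than the exact benchmarking protocol used here.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 512, 512), iters=100, device="cuda"):
    """Average frames per second over repeated forward passes after a short warm-up."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(10):                      # warm-up to stabilize CUDA kernels
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return iters / (time.time() - start)
```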
We performed disease segmentation predictions on all the compared networks, and the specific segmentation results are shown in
Figure 11. In the figure, the prediction results of leaf curl disease, rust disease, and brown spot disease are marked in yellow, red, and green, respectively. From a visual perspective, it can be seen that our model effectively captured the shape features and color changes of the diseases, refined the edge segmentation effect, and fully processed the disease edges, accurately segmenting the detailed parts of the disease edges. Additionally, the model effectively suppressed false positives and false negatives, even constraining diseases with indistinct boundaries by limiting their edges to a certain range. Overall, our model achieved better results, proving its advancement and superiority.
To demonstrate the effectiveness of our model, we conducted a visual analysis of the attention maps for the top five networks in MIoU rankings using Grad-CAM technology. The specific results are shown in
Figure 12. These visualization results clearly illustrate the feature-capturing capabilities of each model when focusing on disease regions, further validating the advantages of our model in both detail and global feature extraction.
With the integration of the EFFB branch, the model effectively focuses on the edges of various diseases, enhancing the delineation of subtle edges, particularly in rust disease. The prediction results indicate that the EFFB branch aids the network in concentrating on the affected regions, thereby mitigating performance loss during edge processing. The GMSA module exhibits outstanding spatial feature perception, capturing multi-scale features and effectively managing brown spot diseases of varying sizes. For leaf curl disease, characterized by subtle color features, the AGC module enhances the model’s ability to discern morphological differences and local structural damage, excelling in identifying and extracting features of leaf curl. Given the distinct activation patterns of different diseases across various channels, the AGC module captures valuable features more effectively, distinguishing between disease characteristics and achieving superior balanced performance in multi-disease segmentation tasks.
4. Conclusions
The proposed dual-branch EBMA-Net aims to address challenges such as the difficulty of extracting features from multiple diseases in complex environments, the difficulty of distinguishing between different diseases, and the indistinct edges of disease regions. Its goal is to achieve precise segmentation in challenging environments, providing technical support for agricultural disease prevention and control.
The designed multi-dimensional joint attention module integrates attention mechanisms to balance the ability to differentiate between various diseases, integrates both local and global features, and enhances the expression of key features while suppressing irrelevant ones, thereby strengthening the model’s feature extraction capabilities. The edge feature extraction branch employs various types of convolution operations to extract single-channel feature maps rich in edge information, effectively ignoring insignificant features and retaining critical edge details. Subsequently, the deep fusion of dual-branch output features significantly enhances the model’s focus on edge information.
Experimental validation on the self-built dataset demonstrates that EBMA-Net effectively extracts both edge and disease-specific features, achieving accurate disease segmentation. In comparison experiments, EBMA-Net outperformed other state-of-the-art methods, demonstrating its potential for practical agricultural applications. Although we have not yet collected datasets categorized by lighting and occlusion conditions, we recognize the importance of such analyses. Future work will include targeted data collection and classification evaluations to further validate and improve the model’s robustness under real-world complex scenarios.
Looking ahead, we plan to conduct further research into joint segmentation of multiple diseases and optimize the model architecture by incorporating depthwise separable convolutions to reduce the number of parameters and computational complexity. Meanwhile, although the proposed approach partially alleviates the interference caused by shadow occlusion, it still faces challenges under severe shadow conditions. We will explore shadow-invariant feature enhancement and domain adaptation strategies to further improve the model’s robustness.