1. Introduction
As one of the most widely cultivated fruit crops in the world, the apple plays a crucial role in agricultural production because of its high yield and economic value. However, apple cultivation is frequently threatened by a variety of pests and diseases, which reduce fruit yield and quality and cause substantial economic losses for growers. Common problems, such as apple aphid and woolly apple aphid infestations, gray mold, and apple scab, are prevalent throughout the growing season; they often cause fruit rot and leaf damage and can even impair the growth of the entire tree. Research indicates that in non-preventive areas (regions where pest and disease prevention measures are not implemented), the actual yield loss per hectare can reach as high as . Beyond these direct losses, such pests and diseases compel growers to invest considerable time and resources in prevention and control. Timely and accurate detection and management of pests and diseases affecting apple leaves are therefore of paramount importance.
Current research on pest and disease detection largely relies on manual visual inspection and laboratory chemical analysis. While these methods can provide a certain level of accuracy, they are often time-consuming, labor-intensive, and susceptible to environmental variations. In recent years, the emergence of machine learning has revolutionized object detection. Dubey and Jalal [1] applied K-Means clustering to segment defects in fruit images and used multi-class support vector machines (SVMs) to classify the images into specific categories. Singh et al. [2] proposed an improved algorithm for automatic detection and classification of plant leaf diseases through image segmentation using thresholding techniques, which enhanced detection accuracy compared to earlier methods. Gangadevi et al. [3] addressed the local minima problem by incorporating a hybrid approach based on fruit fly behavior and simulated annealing. After selecting the required features, they classified tomato plant diseases using an SVM classifier. However, these methods are limited by the complexity of feature selection and poor adaptability to background interference, making them difficult to apply in complex field environments.
Hassan et al. [4] proposed a novel convolutional neural network (CNN) architecture based on Inception and ResNet. By leveraging multiple convolutional layers in the Inception architecture to extract better features and employing residual connections to mitigate the vanishing gradient problem, their approach achieved impressive accuracy on three major datasets: PlantVillage, Rice Disease, and Cassava. Shoaib et al. [5] used U-Net and its improved variants to separate diseased regions from the background, followed by InceptionNet series models for binary and multi-class classification tasks on the segmented leaf images, effectively supporting the automatic detection of tomato diseases.
Recently, the application of transfer learning [6] and data augmentation [7] has further enhanced model generalization. The development of object detection algorithms such as YOLO [8] and Faster R-CNN [9] has significantly improved both real-time performance and accuracy, enabling the detection of crop diseases on a larger scale. Additionally, the integration of multi-scale feature fusion and attention mechanisms has notably strengthened the ability of models to identify diseased regions in complex backgrounds. For instance, Tian et al. [10] optimized the feature layers of the YOLOv3 model by incorporating DenseNet, proposing a YOLOv3-dense-based method for detecting anthracnose on apple surfaces; DenseNet demonstrated excellent performance in enhancing feature utilization, achieving a detection accuracy of  and a maximum detection speed of 31 FPS. Li et al. [11] modified YOLOv5s for vegetable disease detection, achieving an mAP@50 of  on a dataset containing  million images of five disease categories. Lin et al. [12] introduced an improved YOLOx-Tiny model for detecting tobacco brown spot disease, incorporating an HMU module to enhance feature interaction and small feature extraction in the neck network, achieving an AP of .
Faiza Khan [13] utilized the YOLOv8n model to detect three types of maize diseases, achieving an AP of . Firozeh Solimani [14] added an SE attention module after the C2f module in the YOLOv8 model, exploring its impact on small object detection; the results indicated that the SE module improved detection efficiency for small objects, such as tomato plant flowers and fruits. Ma [15] developed a YOLOv8 model tailored for all growth stages of apples. This model combined ShuffleNetv2, Ghost modules, and SE attention modules, utilizing the CIoU loss function for bounding box regression, and achieved an mAP of .
With the continuous advancements in the YOLO series algorithms and the release of newer versions such as YOLOv5 and YOLOv8, significant improvements have been achieved in both the accuracy and speed of object detection. The latest version, YOLOv11, introduces a series of innovations in feature extraction, feature fusion, and multi-scale detection, further enhancing its ability to detect small objects. Additionally, the network architecture of YOLOv11 [16] is specifically optimized for embedded device applications, providing new possibilities for real-time pest and disease detection in agricultural scenarios.
A review of current research at home and abroad shows that, although YOLOv11 delivers significant performance improvements, several challenges remain when it is applied to the detection of pests and diseases on apple leaves. First, there is the issue of target size diversity. Pests and diseases on apple leaves vary greatly in size, and small targets such as aphids are particularly prone to being missed. While YOLOv11 incorporates multi-scale feature extraction, its limitations in detecting extremely small targets are evident, especially when pests or diseases closely resemble the leaf background color, leading to missed detections [17]. Second, complex background interference poses another challenge. In natural environments, the detection of pests and diseases on apple leaves is often affected by complex backgrounds, such as sunlight, shadows, overlapping leaves, and the presence of non-target objects, which can easily compromise the model's discrimination capability. YOLOv11's performance under complex backgrounds still requires optimization, particularly under high illumination or low contrast, where detection accuracy is often reduced. Third, the significant differences in pest and disease characteristics present additional difficulties. Apple leaf pests and diseases encompass a wide variety of types, each with distinct visual features, and some diseases exhibit similar visual characteristics, such as spots of similar shape and color, leading to misdetections and missed detections. Furthermore, as pests and diseases develop over time, their visual features change, imposing higher requirements on the model's generalization ability.
To address the above difficulties and challenges, this study proposes an improved YOLOv11-based algorithm, YOLO-PEL, for detecting pests and diseases on apple leaves. The proposed algorithm integrates advanced feature extraction, hierarchical feature fusion, and enhanced spatial awareness capabilities, achieving efficient and accurate detection of pests and diseases on apple leaves. It improves the robustness and generalization of the detection model while maintaining high performance in complex scenarios. The main contributions of this study are as follows:
This study proposes a module named PMFEM, which achieves more efficient feature representation by integrating multi-scale convolution operations with the CSPNet architecture. The module applies multi-scale convolutions to partition and process input features, capturing feature information across different scales. By aggregating these features, it enhances the representational capability of the network, ultimately improving the detection accuracy of apple pests and diseases. Notably, it demonstrates superior performance in complex environments and under varying lighting conditions, showcasing exceptional feature extraction and detection capabilities.
An innovative EHFPN module is designed for feature fusion by constructing a hierarchical feature pyramid network that enables efficient integration of multi-scale features. Through an adaptive weighting mechanism, the module dynamically adjusts the importance of features at different levels, significantly improving the network’s ability to detect pest and disease targets of various sizes. In particular, the EHFPN module excels at detecting tiny spots and early-stage symptoms on apple leaves, delivering outstanding detection performance and leading to a substantial improvement in detection accuracy.
This study also introduces the LKAP module, which integrates a large-kernel attention mechanism to effectively expand the network’s receptive field. While maintaining computational efficiency, the module enhances the acquisition of spatial features through position-sensitive attention computations. This makes it particularly suitable for detecting irregularly shaped disease regions on apple surfaces. In practical applications, the LKAP module significantly improves the model’s precision in locating disease boundaries, thereby boosting detection accuracy.
To address the challenges posed by the diverse visual characteristics of apple leaf pests and diseases and the complex lighting conditions in natural environments, this study adopts a data augmentation strategy. The strategy includes techniques such as image rotation, scaling, and color variation to enhance the diversity of training samples. This data augmentation approach not only improves the model’s adaptability to different scenarios but also significantly enhances its generalization ability, effectively reducing misdetection rates under varying lighting conditions.
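As an illustration of this strategy, the following minimal sketch shows how rotation, scaling, and color augmentations could be composed for YOLO-format detection labels. The specific transforms, parameter ranges, and the use of the Albumentations library are assumptions for illustration, not the exact pipeline used in this study.

```python
# Hypothetical augmentation pipeline (rotation, scaling, color variation) for
# detection data with YOLO-format boxes; parameter ranges are illustrative assumptions.
import albumentations as A

train_transform = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),               # random image rotation
        A.RandomScale(scale_limit=0.2, p=0.5),   # random scaling
        A.HueSaturationValue(p=0.5),             # color variation
        A.RandomBrightnessContrast(p=0.5),       # simulate lighting changes
        A.HorizontalFlip(p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: boxes are (x_center, y_center, w, h) normalized to [0, 1].
# augmented = train_transform(image=image, bboxes=boxes, class_labels=labels)
```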
To facilitate a clearer understanding of the mathematical formulations and the proposed architecture, the key notations used throughout this study are summarized in Table 1.
4. Results and Discussion
4.1. Experimental Parameters
To ensure the efficiency and comparability of training and evaluation for the improved YOLOv11 model, the main training parameters were configured as summarized in Table 2.
4.2. Evaluation Metrics
To evaluate the detection performance of the proposed framework, several metrics were employed: the mean Average Precision at an IoU threshold of 0.5 (mAP@50), the number of model parameters (Parameters), model size (Model Size), and computational cost (GFLOPs). mAP@50 is the mean of the per-class average precision computed at an IoU threshold of 0.5, where the average precision of each class is obtained from the precision-recall curve evaluated over varying confidence thresholds. Its value ranges from zero to one, and a value closer to one indicates better performance in multi-class object detection. GFLOPs denotes the number of floating-point operations, in billions, required for a single forward pass and is a key indicator of the computational resources a model consumes.
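For reference, the parameter count and GFLOPs values reported in the following experiments can be estimated with a short script such as the sketch below. The use of the thop profiler and a 640 x 640 input resolution are assumptions for illustration, not the exact measurement procedure of this study.

```python
# Hedged sketch: estimating parameter count and computational cost for a detection model.
import torch
from thop import profile  # assumed third-party FLOPs profiler

def complexity_report(model, img_size=640):
    dummy = torch.randn(1, 3, img_size, img_size)              # single RGB image
    flops, _ = profile(model, inputs=(dummy,), verbose=False)  # operations per forward pass
    params = sum(p.numel() for p in model.parameters())
    return {
        "GFLOPs": flops / 1e9,                # billions of floating-point operations
        "Params (M)": params / 1e6,           # total parameter count in millions
        "Model size (MB)": params * 4 / 1e6,  # assuming 32-bit (4-byte) weights
    }
```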
4.3. Comparative Experiment
To verify the practical effectiveness of the key modules in the improved YOLOv11 model for apple pest and disease detection, four sets of comparative experiments were designed, covering different backbone structures, neck structures, downsampling modules, and overall models (including earlier YOLO versions). In all experiments, model performance was primarily evaluated using mAP@50, with GFLOPs and parameter count used for supplementary analysis of the impact of each module and its combinations on accuracy, computational complexity, and model size, ensuring that the results are comprehensive and rigorous.
4.3.1. Backbone Module Comparison
In the backbone comparative experiment on the C3k2 module, the performance of the following modules was tested: C3k2-iRMB [22], C3k2-RVB-EMA [23], C3k2-Star-CAA [24], C3k2-AdditiveBlock [25], C3k2-IdentityFormer [26], and CSP-PMFEM. The results are shown in Table 3. The table demonstrates that the model using the CSP-PMFEM module achieves the best balance between accuracy and computational complexity, with an mAP@50 of 70.8%, a 4.9% improvement over the second-best C3k2-IdentityFormer.
Although the computational cost of CSP-PMFEM is 7.6 GFLOPs, slightly higher than that of C3k2-IdentityFormer and C3k2-RVB-EMA, it is still significantly lower than that of C3k2-Star-CAA. Moreover, its parameter count is 2.62 M, only 0.43 M more than C3k2-IdentityFormer, which remains at a relatively low level. CSP-PMFEM therefore maintains excellent computational efficiency while improving detection accuracy, making it well suited for practical pest and disease detection tasks.
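The CSP-style, multi-scale structure underlying CSP-PMFEM (channel splitting followed by parallel multi-scale convolutions and feature aggregation, as outlined in the contributions) can be sketched roughly as follows. The channel split ratio, kernel sizes, and class name are illustrative assumptions rather than the exact implementation used in this paper.

```python
# Hedged sketch of a CSP-style multi-scale feature extraction block (PMFEM-like).
import torch
import torch.nn as nn

class PMFEMBlock(nn.Module):
    """CSP-style block: split channels, apply parallel multi-scale convs, then fuse."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        hidden = channels // 2
        self.split = nn.Conv2d(channels, hidden, 1)    # partial branch (CSP split)
        self.bypass = nn.Conv2d(channels, hidden, 1)   # identity-like bypass branch
        # Depthwise convolutions at several kernel sizes capture multi-scale context.
        self.branches = nn.ModuleList(
            [nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden) for k in kernel_sizes]
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden * (len(kernel_sizes) + 1), channels, 1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x):
        a = self.split(x)
        feats = [self.bypass(x)] + [branch(a) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x   # aggregate and add residual
```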
4.3.2. Neck Module Comparison
In the neck comparison experiment, three different structures, GhostHGNet [27], Goldyolo [28], and GDFPN [29], were tested, along with attention-based variants, namely CA-HSFPN [30], CAA-HSFPN [31], and EHFPN. The results are shown in Table 4. EHFPN achieved the best balance between mAP@50 and computational efficiency, with an mAP@50 of 71.5%, significantly higher than that of the other modules. Specifically, it improved by 6.6% compared to CAA-HSFPN and by 7.4% compared to CA-HSFPN, and it also outperformed the more computationally complex GDFPN.
In terms of GFLOPs and parameter count, EHFPN requires 5.7 GFLOPs, slightly higher than CA-HSFPN but significantly lower than Goldyolo and GDFPN, demonstrating higher computational efficiency. Additionally, its parameter count is 2.51 M, a modest increase of only 0.51 M compared to CAA-HSFPN, while still being much lower than that of Goldyolo. This demonstrates that EHFPN maintains a lightweight design while significantly improving accuracy.
In contrast, CA-HSFPN and CAA-HSFPN exhibit lower computational overhead in terms of GFLOPs and parameter count but deliver weaker detection accuracy. While GDFPN and Goldyolo offer certain accuracy advantages, their computational complexity and parameter counts increase significantly, making them less suitable for lightweight applications. By incorporating an efficient local attention (ELA) mechanism, EHFPN significantly optimizes feature extraction and fusion, achieving a good balance between detection accuracy and computational efficiency and demonstrating broad applicability.
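As a rough illustration of how the adaptive, level-wise weighting in such a hierarchical fusion neck can be realized, the sketch below fuses several pyramid levels with learned, softmax-normalized weights. The resizing strategy and module name are assumptions; the actual EHFPN additionally employs the ELA attention described above.

```python
# Minimal sketch of adaptively weighted multi-scale fusion in the spirit of EHFPN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse features from several pyramid levels with learned, normalized weights."""
    def __init__(self, in_channels: int, num_levels: int = 3):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_levels))  # one scalar weight per level
        self.proj = nn.Conv2d(in_channels, in_channels, 1)

    def forward(self, feats):
        # feats: list of tensors with the same channel count but different spatial sizes.
        target_size = feats[0].shape[-2:]
        resized = [F.interpolate(f, size=target_size, mode="nearest") for f in feats]
        w = torch.softmax(self.weights, dim=0)                # adaptive importance per level
        fused = sum(wi * fi for wi, fi in zip(w, resized))
        return self.proj(fused)
```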
4.3.3. Downsampling Module Comparison
In the comparison of downsampling modules, the performances of AIFIRepBN [32], FocalModulation [33], AIFI [34], and LKAP were evaluated. The results are shown in Table 5. LKAP achieved an mAP@50 of 72.9%, significantly surpassing the other modules. Specifically, it improved by 8.8% compared to the second-best AIFI (67.0%) and by 17.0% compared to FocalModulation (62.3%), demonstrating its remarkable advantages in multi-scale feature extraction.
In terms of computational complexity, LKAP requires 7.3 GFLOPs, comparable to AIFI and AIFIRepBN (both 7.4) and slightly higher than FocalModulation (7.2), while achieving a much larger lead in accuracy. Additionally, LKAP has a parameter count of 2.30 M, significantly lower than AIFI (2.65 M) and AIFIRepBN (2.65 M), highlighting its advantage in lightweight design. By optimizing feature fusion and incorporating a large-kernel attention mechanism, LKAP strikes an effective balance between accuracy and efficiency, showcasing its strong potential for practical applications.
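The decomposed large-kernel attention idea behind LKAP can be sketched as below, following the common depthwise + dilated depthwise + pointwise decomposition popularized by the Visual Attention Network; the exact kernel sizes and the placement of this operator inside LKAP are assumptions for illustration.

```python
# Sketch of a decomposed large-kernel attention operator (VAN-style); LKAP's exact
# configuration may differ, so treat kernel sizes and dilation as assumptions.
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Local DW conv + dilated DW conv + pointwise mixing, used to reweight features."""
    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.spatial = nn.Conv2d(channels, channels, 7, padding=9, dilation=3, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.pointwise(self.spatial(self.local(x)))  # large effective receptive field
        return x * attn                                     # position-sensitive reweighting
```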
4.3.4. Overall Model Comparison
In the model comparison experiments, the performances of YOLOv5, YOLOv6, YOLOv8n, YOLOv9t, YOLOv10n, SSD [35], Faster R-CNN [36], RetinaNet [37], YOLOv11, and the YOLO-PEL framework were evaluated. The experimental results are summarized in Table 6, and the corresponding performance metrics are illustrated in Figure 11.
The results in Table 6 and the convergence trends in Figure 11 demonstrate the superiority of YOLO-PEL. It achieves the highest mAP@50 (72.9%), outperforming all previous YOLO versions, SSD, and RetinaNet, while maintaining a low computational cost (7.3 GFLOPs) and a compact parameter size (2.30 M). Additionally, the learning curves highlight YOLO-PEL's faster convergence and greater stability across epochs. These findings underscore YOLO-PEL's ability to balance detection accuracy, efficiency, and scalability, making it highly suitable for real-time and resource-constrained applications.
4.4. Ablation Experiment
To verify the contributions of the proposed modules to model performance, a series of ablation experiments were conducted on YOLOv11. These experiments included the individual introduction of the CSP-PMFEM, EHFPN, and LKAP modules, as well as their various combinations. Performance was evaluated using the mAP@50 metric, while also recording the parameter count (Param/M) and computational complexity (GFLOPs). The results are shown in Table 7. Each module introduced individually, as well as their combinations, resulted in varying degrees of improvement, validating the effectiveness and rationality of the proposed modules and providing theoretical support for the model optimization.
The ablation study was designed to evaluate the impact of each module on the performance of YOLOv11. The original YOLOv11 model (without the proposed modules) achieved an mAP@50 of only 68.6%, serving as the baseline.
When the CSP-PMFEM module was introduced, the mAP@50 significantly increased to 70.8%, indicating that this module enhances key feature representation by introducing partially shareable features and more refined feature fusion structures, while reducing computational complexity.
Further integration of the EHFPN module raised the mAP@50 to 71.5%, demonstrating that the combination of Efficient Local Attention (ELA) and Hierarchical Scale Feature Pyramid Network (HSFPN) not only improved the detection of multi-scale objects but also enhanced the recognition of small objects.
By incorporating the LKAP module, the model also showed considerable improvements in mAP@50-95, illustrating its effectiveness in multi-scale feature fusion and spatial receptive field enhancement.
The final model, combining all three modules, achieved the best performance with an mAP@50 of 72.9%, while maintaining reasonable computational complexity (GFLOPs) and parameter count (Param/M). These results confirm the overall effectiveness of the design, validating the rationality and practical value of the proposed model optimization strategy.
4.5. Visualization and Analysis
In deep learning, the receptive field (RF) refers to the region of the input image that a specific neuron in the network can perceive. A larger receptive field enables the model to capture more contextual information, which is crucial for complex scenes and multi-object detection tasks. In this study, we selected a threshold parameter of to measure the global feature coverage. Under this threshold, the original model exhibited an area ratio of and a rectangle side length of 565, demonstrating the limitations of its receptive field in capturing both local and global information. In contrast, our proposed model achieved an area ratio of and increased the rectangle side length to 599, indicating significant improvements in expanding the receptive field and capturing global contextual information.
To provide a more intuitive understanding of this improvement, Figure 12 presents a visualization of the receptive field under the same threshold for both models. It is evident that the receptive field of the improved model covers a significantly larger area, with a more widespread distribution of high-contribution regions, leading to richer feature extraction. These results validate the effectiveness of YOLO-PEL in enhancing the receptive field and further demonstrate its advantages in handling complex scenes and multi-scale object detection.
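A common way to obtain the area-ratio statistic used above is to back-propagate a central output activation to the input and measure how much of the image exceeds a chosen fraction of the peak gradient contribution. The sketch below illustrates this procedure under the assumption that the model exposes a single feature-map output; it is not the exact measurement script used in this study.

```python
# Hedged sketch of an effective-receptive-field area-ratio measurement.
import torch

def erf_area_ratio(model, image, threshold=0.2):
    """Fraction of input pixels whose gradient contribution exceeds the threshold."""
    image = image.clone().requires_grad_(True)
    feat = model(image)                               # assume output shape (B, C, H, W)
    center = feat[..., feat.shape[-2] // 2, feat.shape[-1] // 2].sum()
    center.backward()                                 # gradient of a central activation
    grad = image.grad.abs().sum(dim=1)                # aggregate magnitude over channels
    grad = grad / grad.max()                          # normalize so the peak is 1
    return (grad > threshold).float().mean().item()   # area ratio above the threshold
```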
In the receptive field visualization, the model's attention to key regions can be observed intuitively. In the heatmap, the intensity of the color reflects the contribution strength of different regions, with deeper colors indicating areas that contribute more strongly to the model's decision-making, as shown in Figure 13.
The heatmap clearly demonstrates the advantages of YOLO-PEL after its receptive field is expanded. Compared with the original model, the improved model captures more high-contribution information over a broader region and exhibits stronger contextual awareness in complex backgrounds and multi-object scenes, attending to both fine details and global features. The expanded receptive field enables the model to recognize and focus on more important areas, further confirming that receptive field expansion improves object detection performance. It is worth noting that in a few cases YOLOv11 appears to localize certain infected regions more precisely; however, the overall performance of YOLO-PEL, as evidenced by its higher mAP and improved feature extraction across a broader range of examples, demonstrates superior generalization and detection capability.
To further illustrate the advantages of the proposed improvements in apple leaf pest and disease detection, Figure 14 provides a visualization of detection results for different targets, including detection confidence and bounding box accuracy for various pest and disease types. For example, Figure 14a–d show detection results without the proposed improvements, where some targets exhibit low confidence or inaccurate bounding boxes. In contrast, Figure 14e–h display results after the improvements, where the bounding boxes are more accurate and the confidence for small targets increases significantly. This indicates that the proposed model effectively enhances multi-scale feature extraction and target recognition capabilities.
5. Conclusions
This study proposes a YOLO-PEL algorithm for apple pest and disease detection, which is based on an improved YOLOv11 architecture. The goal of YOLO-PEL is to enhance detection accuracy and efficiency, particularly for complex backgrounds and small target detection. YOLO-PEL integrates the PMFEM, EHFPN, and LKAP modules to enhance multi-scale feature extraction, small target detection, and receptive field expansion. Experimental results demonstrate that YOLO-PEL achieves an mAP@50 of on the Turkey_Plant dataset, significantly outperforming existing algorithms such as YOLOv11 and YOLOv8n, with an overall accuracy improvement of . Moreover, it maintains a high advantage in computational efficiency and parameter count, proving its effectiveness in practical applications.
The innovation of this research lies in the combination of multiple deep learning techniques to propose a high-precision, low-complexity apple pest and disease detection solution, suitable for resource-constrained environments. YOLO-PEL not only provides technical support for precision agriculture but also contributes to the broader development of smart agriculture. In the future, the model will be further optimized to improve real-time processing capabilities, particularly in terms of detection accuracy under extreme lighting conditions and large-scale imagery.
Furthermore, future research will explore the application of this model to pest and disease monitoring across different crop types. The YOLO-PEL model holds potential for integration into intelligent agricultural systems, such as precision spraying platforms for targeted pesticide delivery, autonomous UAV-based field scouting systems, and edge computing devices for in-field, real-time disease detection. These practical implementations can facilitate large-scale deployment of automated pest monitoring systems, enhance early warning mechanisms, and support data-driven decision-making in crop management—ultimately accelerating the development of sustainable and intelligent agricultural ecosystems.
In future work, we also plan to evaluate the model’s generalization ability across diverse datasets with varying image resolutions and disease types to further validate its robustness and applicability in broader agricultural scenarios.