1. Introduction
As one of the most widely cultivated fruit crops in the world, the apple plays a crucial role in agricultural production because of its high yield and economic value. However, apple cultivation is frequently threatened by a variety of pests and diseases, which reduce fruit yield and quality and cause substantial economic losses for growers. Common problems, such as apple aphid and woolly apple aphid infestations, gray mold, and apple scab, are prevalent throughout the growing season; they often cause fruit rot and leaf damage and can even impair the growth of the entire tree. Research indicates that in non-preventive areas (regions where pest and disease prevention measures are not implemented), the actual yield loss per hectare can reach as high as . Beyond these direct losses, such pests and diseases compel growers to invest considerable time and resources in prevention and control. Timely and accurate detection and management of pests and diseases affecting apple leaves are therefore of paramount importance.
Current research on pest and disease detection largely relies on manual visual inspection and laboratory chemical analysis. While these methods can provide a certain level of accuracy, they are often time-consuming, labor-intensive, and susceptible to environmental variations. In recent years, the emergence of machine learning has revolutionized object detection. Dubey and Jalal [1] applied K-Means clustering to segment defects in fruit images and used multi-class support vector machines (SVMs) to classify the images into specific categories. Singh et al. [2] proposed an improved algorithm for automatic detection and classification of plant leaf diseases through image segmentation using thresholding techniques, which enhanced detection accuracy compared to earlier methods. Gangadevi et al. [3] addressed the local minima problem by incorporating a hybrid approach based on fruit fly behavior and simulated annealing. After selecting the required features, they classified tomato plant diseases using an SVM classifier. However, these methods are limited by the complexity of feature selection and poor adaptability to background interference, making them difficult to apply in complex field environments.
Hassan et al. [4] proposed a novel convolutional neural network (CNN) architecture based on Inception and ResNet. By leveraging multiple convolutional layers in the Inception architecture to extract better features and employing residual connections to mitigate the vanishing gradient problem, their approach achieved impressive accuracy on three major datasets: PlantVillage, Rice Disease, and Cassava. Shoaib et al. [5] used U-Net and its improved variants to separate diseased regions from the background, followed by InceptionNet series models for binary and multi-class classification tasks on the segmented leaf images, effectively supporting the automatic detection of tomato diseases.
Recently, the application of transfer learning [6] and data augmentation [7] has further enhanced model generalization. The development of object detection algorithms such as YOLO [8] and Faster R-CNN [9] has significantly improved both real-time performance and accuracy, enabling the detection of crop diseases on a larger scale. Additionally, the integration of multi-scale feature fusion and attention mechanisms has notably strengthened the ability of models to identify diseased regions in complex backgrounds. For instance, Tian et al. [10] optimized the feature layers of the YOLOv3 model by incorporating DenseNet, proposing a YOLOv3-dense-based method for detecting anthracnose on apple surfaces; DenseNet demonstrated excellent performance in enhancing feature utilization, achieving a detection accuracy of  and a maximum detection speed of 31 FPS. Li et al. [11] modified YOLOv5s for vegetable disease detection, achieving an mAP@50 of  on a dataset containing  million images of five disease categories. Lin et al. [12] introduced an improved YOLOx-Tiny model for detecting tobacco brown spot disease, incorporating an HMU module to enhance feature interaction and small feature extraction in the neck network, achieving an AP of .
Faiza Khan [13] utilized the YOLOv8n model to detect three types of maize diseases, achieving an AP of . Firozeh Solimani [14] added an SE attention module after the C2f module in the YOLOv8 model, exploring its impact on small object detection; the results indicated that the SE module improved detection efficiency for small objects, such as tomato plant flowers and fruits. Ma [15] developed a YOLOv8 model tailored for all growth stages of apples. This model combined ShuffleNetv2, Ghost modules, and SE attention modules, utilizing the CIoU loss function for bounding box regression, and achieved an mAP of .
With the continuous advancements in the YOLO series algorithms and the release of newer versions such as YOLOv5 and YOLOv8, significant improvements have been achieved in both the accuracy and speed of object detection. The latest version, YOLOv11, introduces a series of innovations in feature extraction, feature fusion, and multi-scale detection, further enhancing its ability to detect small objects. Additionally, the network architecture of YOLOv11 [16] is specifically optimized for embedded device applications, providing new possibilities for real-time pest and disease detection in agricultural scenarios.
A review of current research at home and abroad shows that, although YOLOv11 delivers significant performance improvements, several challenges remain when it is applied to the detection of pests and diseases on apple leaves. First, there is the issue of target size diversity. Pests and diseases on apple leaves vary greatly in size, and small targets such as aphids are particularly prone to being missed. While YOLOv11 incorporates multi-scale feature extraction, its limitations in detecting extremely small targets are evident, especially when pests or diseases closely resemble the leaf background color, leading to missed detections [17]. Second, complex background interference poses another challenge. In natural environments, the detection of pests and diseases on apple leaves is often affected by complex backgrounds, such as sunlight, shadows, overlapping leaves, and the presence of non-target objects, which can easily compromise the model's discrimination capability. YOLOv11's performance under complex backgrounds still requires optimization, particularly under high illumination or low contrast, where detection accuracy is often reduced. Third, the significant differences in pest and disease characteristics present additional difficulties. Apple leaf pests and diseases encompass a wide variety of types, each with distinct visual features, and some diseases exhibit similar visual characteristics, such as spots of similar shape and color, leading to misdetections and missed detections. Furthermore, as pests and diseases develop over time, their visual features change, imposing higher requirements on the model's generalization ability.
To address the above difficulties and challenges, this study proposes an improved YOLOv11-based algorithm, YOLO-PEL, for detecting pests and diseases on apple leaves. The proposed algorithm integrates advanced feature extraction, hierarchical feature fusion, and enhanced spatial awareness capabilities, achieving efficient and accurate detection of pests and diseases on apple leaves. It improves the robustness and generalization of the detection model while maintaining high performance in complex scenarios. The main contributions of this study are as follows:
This study proposes a module named PMFEM, which achieves more efficient feature representation by integrating multi-scale convolution operations with the CSPNet architecture. The module applies multi-scale convolutions to partition and process input features, capturing feature information across different scales. By aggregating these features, it enhances the representational capability of the network, ultimately improving the detection accuracy of apple pests and diseases. Notably, it demonstrates superior performance in complex environments and under varying lighting conditions, showcasing exceptional feature extraction and detection capabilities.
An innovative EHFPN module is designed for feature fusion by constructing a hierarchical feature pyramid network that enables efficient integration of multi-scale features. Through an adaptive weighting mechanism, the module dynamically adjusts the importance of features at different levels, significantly improving the network’s ability to detect pest and disease targets of various sizes. In particular, the EHFPN module excels at detecting tiny spots and early-stage symptoms on apple leaves, delivering outstanding detection performance and leading to a substantial improvement in detection accuracy.
This study also introduces the LKAP module, which integrates a large-kernel attention mechanism to effectively expand the network’s receptive field. While maintaining computational efficiency, the module enhances the acquisition of spatial features through position-sensitive attention computations. This makes it particularly suitable for detecting irregularly shaped disease regions on apple surfaces. In practical applications, the LKAP module significantly improves the model’s precision in locating disease boundaries, thereby boosting detection accuracy.
To address the challenges posed by the diverse visual characteristics of apple leaf pests and diseases and the complex lighting conditions in natural environments, this study adopts a data augmentation strategy. The strategy includes techniques such as image rotation, scaling, and color variation to enhance the diversity of training samples. This data augmentation approach not only improves the model’s adaptability to different scenarios but also significantly enhances its generalization ability, effectively reducing misdetection rates under varying lighting conditions.
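As an illustration of this strategy, the following minimal sketch shows how rotation, scaling, and color augmentations could be composed for YOLO-format detection labels. The specific transforms, parameter ranges, and the use of the Albumentations library are assumptions for illustration, not the exact pipeline used in this study.

```python
# Hypothetical augmentation pipeline (rotation, scaling, color variation) for
# detection data with YOLO-format boxes; parameter ranges are illustrative assumptions.
import albumentations as A

train_transform = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),               # random image rotation
        A.RandomScale(scale_limit=0.2, p=0.5),   # random scaling
        A.HueSaturationValue(p=0.5),             # color variation
        A.RandomBrightnessContrast(p=0.5),       # simulate lighting changes
        A.HorizontalFlip(p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: boxes are (x_center, y_center, w, h) normalized to [0, 1].
# augmented = train_transform(image=image, bboxes=boxes, class_labels=labels)
```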
To facilitate a clearer understanding of the mathematical formulations and the proposed architecture, the key notations used throughout this study are summarized in Table 1.
4. Results and Discussion
4.1. Experimental Parameters
To ensure the efficiency and comparability of training and evaluation for the improved YOLOv11 model, the main training parameters were configured as summarized in Table 2.
4.2. Evaluation Metrics
To evaluate the detection performance of the proposed framework, several metrics were employed: the mean Average Precision at an IoU threshold of 0.5 (mAP@50), the number of model parameters (Parameters), model size (Model Size), and computational cost (GFLOPs). mAP@50 is the mean of the per-class average precision computed at an IoU threshold of 0.5, where the average precision of each class is obtained from the precision-recall curve evaluated over varying confidence thresholds. Its value ranges from zero to one, and a value closer to one indicates better performance in multi-class object detection. GFLOPs denotes the number of floating-point operations, in billions, required for a single forward pass and is a key indicator of the computational resources a model consumes.
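For reference, the parameter count and GFLOPs values reported in the following experiments can be estimated with a short script such as the sketch below. The use of the thop profiler and a 640 x 640 input resolution are assumptions for illustration, not the exact measurement procedure of this study.

```python
# Hedged sketch: estimating parameter count and computational cost for a detection model.
import torch
from thop import profile  # assumed third-party FLOPs profiler

def complexity_report(model, img_size=640):
    dummy = torch.randn(1, 3, img_size, img_size)              # single RGB image
    flops, _ = profile(model, inputs=(dummy,), verbose=False)  # operations per forward pass
    params = sum(p.numel() for p in model.parameters())
    return {
        "GFLOPs": flops / 1e9,                # billions of floating-point operations
        "Params (M)": params / 1e6,           # total parameter count in millions
        "Model size (MB)": params * 4 / 1e6,  # assuming 32-bit (4-byte) weights
    }
```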
4.3. Comparative Experiment
To verify the practical effectiveness of the key modules in the improved YOLOv11 model for apple pest and disease detection, four sets of comparative experiments were designed, covering different backbone structures, neck structures, downsampling modules, and overall models (including earlier YOLO versions). In all experiments, model performance was primarily evaluated using mAP@50, with GFLOPs and parameter count used for supplementary analysis of the impact of each module and its combinations on accuracy, computational complexity, and model size, ensuring that the results are comprehensive and rigorous.
4.3.1. Backbone Module Comparison
In the backbone comparative experiment on the C3k2 module, the performance of the following modules was tested: C3k2-iRMB [22], C3k2-RVB-EMA [23], C3k2-Star-CAA [24], C3k2-AdditiveBlock [25], C3k2-IdentityFormer [26], and CSP-PMFEM. The results are shown in Table 3. The table demonstrates that the model using the CSP-PMFEM module achieves the best balance between accuracy and computational complexity, with an mAP@50 of 70.8%, a 4.9% improvement over the second-best C3k2-IdentityFormer.
Although the computational cost of CSP-PMFEM is 7.6 GFLOPs, slightly higher than that of C3k2-IdentityFormer and C3k2-RVB-EMA, it is still significantly lower than that of C3k2-Star-CAA. Moreover, its parameter count is 2.62 M, only 0.43 M more than C3k2-IdentityFormer, which remains at a relatively low level. CSP-PMFEM therefore maintains excellent computational efficiency while improving detection accuracy, making it well suited for practical pest and disease detection tasks.
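The CSP-style, multi-scale structure underlying CSP-PMFEM (channel splitting followed by parallel multi-scale convolutions and feature aggregation, as outlined in the contributions) can be sketched roughly as follows. The channel split ratio, kernel sizes, and class name are illustrative assumptions rather than the exact implementation used in this paper.

```python
# Hedged sketch of a CSP-style multi-scale feature extraction block (PMFEM-like).
import torch
import torch.nn as nn

class PMFEMBlock(nn.Module):
    """CSP-style block: split channels, apply parallel multi-scale convs, then fuse."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        hidden = channels // 2
        self.split = nn.Conv2d(channels, hidden, 1)    # partial branch (CSP split)
        self.bypass = nn.Conv2d(channels, hidden, 1)   # identity-like bypass branch
        # Depthwise convolutions at several kernel sizes capture multi-scale context.
        self.branches = nn.ModuleList(
            [nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden) for k in kernel_sizes]
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(hidden * (len(kernel_sizes) + 1), channels, 1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )

    def forward(self, x):
        a = self.split(x)
        feats = [self.bypass(x)] + [branch(a) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x   # aggregate and add residual
```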
4.3.2. Neck Module Comparison
In the neck comparison experiment, three different structures, GhostHGNet [27], Goldyolo [28], and GDFPN [29], were tested, along with attention-based variants, namely CA-HSFPN [30], CAA-HSFPN [31], and EHFPN. The results are shown in Table 4. EHFPN achieved the best balance between mAP@50 and computational efficiency, with an mAP@50 of 71.5%, significantly higher than that of the other modules. Specifically, it improved by 6.6% compared to CAA-HSFPN and by 7.4% compared to CA-HSFPN, and it also outperformed the more computationally complex GDFPN.
In terms of GFLOPs and parameter count, EHFPN requires 5.7 GFLOPs, slightly higher than CA-HSFPN but significantly lower than Goldyolo and GDFPN, demonstrating higher computational efficiency. Additionally, its parameter count is 2.51 M, a modest increase of only 0.51 M compared to CAA-HSFPN, while still being much lower than that of Goldyolo. This demonstrates that EHFPN maintains a lightweight design while significantly improving accuracy.
In contrast, CA-HSFPN and CAA-HSFPN exhibit lower computational overhead in terms of GFLOPs and parameter count but deliver weaker detection accuracy. While GDFPN and Goldyolo offer certain accuracy advantages, their computational complexity and parameter counts increase significantly, making them less suitable for lightweight applications. By incorporating an efficient local attention (ELA) mechanism, EHFPN significantly optimizes feature extraction and fusion, achieving a good balance between detection accuracy and computational efficiency and demonstrating broad applicability.
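As a rough illustration of how the adaptive, level-wise weighting in such a hierarchical fusion neck can be realized, the sketch below fuses several pyramid levels with learned, softmax-normalized weights. The resizing strategy and module name are assumptions; the actual EHFPN additionally employs the ELA attention described above.

```python
# Minimal sketch of adaptively weighted multi-scale fusion in the spirit of EHFPN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse features from several pyramid levels with learned, normalized weights."""
    def __init__(self, in_channels: int, num_levels: int = 3):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_levels))  # one scalar weight per level
        self.proj = nn.Conv2d(in_channels, in_channels, 1)

    def forward(self, feats):
        # feats: list of tensors with the same channel count but different spatial sizes.
        target_size = feats[0].shape[-2:]
        resized = [F.interpolate(f, size=target_size, mode="nearest") for f in feats]
        w = torch.softmax(self.weights, dim=0)                # adaptive importance per level
        fused = sum(wi * fi for wi, fi in zip(w, resized))
        return self.proj(fused)
```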
4.3.3. Downsampling Module Comparison
In the comparison of downsampling modules, the performances of AIFIRepBN [32], FocalModulation [33], AIFI [34], and LKAP were evaluated. The results are shown in Table 5. LKAP achieved an mAP@50 of 72.9%, significantly surpassing the other modules. Specifically, it improved by 8.8% compared to the second-best AIFI (67.0%) and by 17.0% compared to FocalModulation (62.3%), demonstrating its remarkable advantages in multi-scale feature extraction.
In terms of computational complexity, LKAP requires 7.3 GFLOPs, comparable to AIFI and AIFIRepBN (both 7.4) and slightly higher than FocalModulation (7.2), while achieving a much larger lead in accuracy. Additionally, LKAP has a parameter count of 2.30 M, significantly lower than AIFI (2.65 M) and AIFIRepBN (2.65 M), highlighting its advantage in lightweight design. By optimizing feature fusion and incorporating a large-kernel attention mechanism, LKAP strikes an effective balance between accuracy and efficiency, showcasing its strong potential for practical applications.
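The decomposed large-kernel attention idea behind LKAP can be sketched as below, following the common depthwise + dilated depthwise + pointwise decomposition popularized by the Visual Attention Network; the exact kernel sizes and the placement of this operator inside LKAP are assumptions for illustration.

```python
# Sketch of a decomposed large-kernel attention operator (VAN-style); LKAP's exact
# configuration may differ, so treat kernel sizes and dilation as assumptions.
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Local DW conv + dilated DW conv + pointwise mixing, used to reweight features."""
    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.spatial = nn.Conv2d(channels, channels, 7, padding=9, dilation=3, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        attn = self.pointwise(self.spatial(self.local(x)))  # large effective receptive field
        return x * attn                                     # position-sensitive reweighting
```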
4.3.4. Overall Model Comparison
In the model comparison experiments, the performances of YOLOv5, YOLOv6, YOLOv8n, YOLOv9t, YOLOv10n, SSD [35], Faster R-CNN [36], RetinaNet [37], YOLOv11, and the YOLO-PEL framework were evaluated. The experimental results are summarized in Table 6, and the corresponding performance metrics are illustrated in Figure 11.
The results in Table 6 and the convergence trends in Figure 11 demonstrate the superiority of YOLO-PEL. It achieves the highest mAP@50 (72.9%), outperforming all previous YOLO versions, SSD, and RetinaNet, while maintaining a low computational cost (7.3 GFLOPs) and a compact parameter size (2.30 M). Additionally, the learning curves highlight YOLO-PEL's faster convergence and greater stability across epochs. These findings underscore YOLO-PEL's ability to balance detection accuracy, efficiency, and scalability, making it highly suitable for real-time and resource-constrained applications.
4.4. Ablation Experiment
To verify the contributions of the proposed modules to model performance, a series of ablation experiments were conducted on YOLOv11. These experiments included the individual introduction of the CSP-PMFEM, EHFPN, and LKAP modules, as well as their various combinations. Performance was evaluated using the mAP@50 metric, while also recording the parameter count (Param/M) and computational complexity (GFLOPs). The results are shown in Table 7. Each module introduced individually, as well as their combinations, resulted in varying degrees of improvement, validating the effectiveness and rationality of the proposed modules and providing theoretical support for the model optimization.
The ablation study was designed to evaluate the impact of each module on the performance of YOLOv11. The original YOLOv11 model (without the proposed modules) achieved an mAP@50 of only 68.6%, serving as the baseline.
When the CSP-PMFEM module was introduced, the mAP@50 significantly increased to 70.8%, indicating that this module enhances key feature representation by introducing partially shareable features and more refined feature fusion structures, while reducing computational complexity.
Further integration of the EHFPN module raised the mAP@50 to 71.5%, demonstrating that the combination of Efficient Local Attention (ELA) and Hierarchical Scale Feature Pyramid Network (HSFPN) not only improved the detection of multi-scale objects but also enhanced the recognition of small objects.
By incorporating the LKAP module, the model also showed considerable improvements in mAP@50-95, illustrating its effectiveness in multi-scale feature fusion and spatial receptive field enhancement.
The final model, combining all three modules, achieved the best performance with an mAP@50 of 72.9%, while maintaining reasonable computational complexity (GFLOPs) and parameter count (Param/M). These results confirm the overall effectiveness of the design, validating the rationality and practical value of the proposed model optimization strategy.
4.5. Visualization and Analysis
In deep learning, the receptive field (RF) refers to the region of the input image that a specific neuron in the network can perceive. A larger receptive field enables the model to capture more contextual information, which is crucial for complex scenes and multi-object detection tasks. In this study, we selected a threshold parameter of to measure the global feature coverage. Under this threshold, the original model exhibited an area ratio of and a rectangle side length of 565, demonstrating the limitations of its receptive field in capturing both local and global information. In contrast, our proposed model achieved an area ratio of and increased the rectangle side length to 599, indicating significant improvements in expanding the receptive field and capturing global contextual information.
To provide a more intuitive understanding of this improvement, Figure 12 presents a visualization of the receptive field under the same threshold for both models. It is evident that the receptive field of the improved model covers a significantly larger area, with a more widespread distribution of high-contribution regions, leading to richer feature extraction. These results validate the effectiveness of YOLO-PEL in enhancing the receptive field and further demonstrate its advantages in handling complex scenes and multi-scale object detection.
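A common way to obtain the area-ratio statistic used above is to back-propagate a central output activation to the input and measure how much of the image exceeds a chosen fraction of the peak gradient contribution. The sketch below illustrates this procedure under the assumption that the model exposes a single feature-map output; it is not the exact measurement script used in this study.

```python
# Hedged sketch of an effective-receptive-field area-ratio measurement.
import torch

def erf_area_ratio(model, image, threshold=0.2):
    """Fraction of input pixels whose gradient contribution exceeds the threshold."""
    image = image.clone().requires_grad_(True)
    feat = model(image)                               # assume output shape (B, C, H, W)
    center = feat[..., feat.shape[-2] // 2, feat.shape[-1] // 2].sum()
    center.backward()                                 # gradient of a central activation
    grad = image.grad.abs().sum(dim=1)                # aggregate magnitude over channels
    grad = grad / grad.max()                          # normalize so the peak is 1
    return (grad > threshold).float().mean().item()   # area ratio above the threshold
```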
In the receptive field visualization, the model's attention to key regions can be observed intuitively. In the heatmap, the intensity of the color reflects the contribution strength of different regions, with deeper colors indicating areas that contribute more strongly to the model's decision-making, as shown in Figure 13.
The heatmap clearly demonstrates the advantages of YOLO-PEL after its receptive field is expanded. Compared with the original model, the improved model captures more high-contribution information over a broader region and exhibits stronger contextual awareness in complex backgrounds and multi-object scenes, attending to both fine details and global features. The expanded receptive field enables the model to recognize and focus on more important areas, further confirming that receptive field expansion improves object detection performance. It is worth noting that in a few cases YOLOv11 appears to localize certain infected regions more precisely; however, the overall performance of YOLO-PEL, as evidenced by its higher mAP and improved feature extraction across a broader range of examples, demonstrates superior generalization and detection capability.
To further illustrate the advantages of the proposed improvements in apple leaf pest and disease detection, Figure 14 provides a visualization of detection results for different targets, including detection confidence and bounding box accuracy for various pest and disease types. For example, Figure 14a–d show detection results without the proposed improvements, where some targets exhibit low confidence or inaccurate bounding boxes. In contrast, Figure 14e–h display results after the improvements, where the bounding boxes are more accurate and the confidence for small targets increases significantly. This indicates that the proposed model effectively enhances multi-scale feature extraction and target recognition capabilities.
5. Conclusions
This study proposes a YOLO-PEL algorithm for apple pest and disease detection, which is based on an improved YOLOv11 architecture. The goal of YOLO-PEL is to enhance detection accuracy and efficiency, particularly for complex backgrounds and small target detection. YOLO-PEL integrates the PMFEM, EHFPN, and LKAP modules to enhance multi-scale feature extraction, small target detection, and receptive field expansion. Experimental results demonstrate that YOLO-PEL achieves an mAP@50 of on the Turkey_Plant dataset, significantly outperforming existing algorithms such as YOLOv11 and YOLOv8n, with an overall accuracy improvement of . Moreover, it maintains a high advantage in computational efficiency and parameter count, proving its effectiveness in practical applications.
The innovation of this research lies in the combination of multiple deep learning techniques to propose a high-precision, low-complexity apple pest and disease detection solution, suitable for resource-constrained environments. YOLO-PEL not only provides technical support for precision agriculture but also contributes to the broader development of smart agriculture. In the future, the model will be further optimized to improve real-time processing capabilities, particularly in terms of detection accuracy under extreme lighting conditions and large-scale imagery.
Furthermore, future research will explore the application of this model to pest and disease monitoring across different crop types. The YOLO-PEL model holds potential for integration into intelligent agricultural systems, such as precision spraying platforms for targeted pesticide delivery, autonomous UAV-based field scouting systems, and edge computing devices for in-field, real-time disease detection. These practical implementations can facilitate large-scale deployment of automated pest monitoring systems, enhance early warning mechanisms, and support data-driven decision-making in crop management—ultimately accelerating the development of sustainable and intelligent agricultural ecosystems.
In future work, we also plan to evaluate the model’s generalization ability across diverse datasets with varying image resolutions and disease types to further validate its robustness and applicability in broader agricultural scenarios.