1. Introduction
Honeysuckle, a crucial traditional Chinese medicinal herb, holds substantial medicinal and economic value. Its active compounds, such as chlorogenic acid and forsythoside, exhibit heat-clearing, detoxifying, anti-inflammatory, and antibacterial effects, and are widely utilized in traditional Chinese medicine. As a significant herb, honeysuckle faces strong market demand both domestically and internationally, with an annual output value in the billions of yuan, making it an essential crop for improving farmers’ incomes in many regions [1].
In honeysuckle agricultural production, pest infestation is a leading cause of reduced yields, with pest damage accounting for yield losses as high as 20–30% [2]. Pest invasion not only diminishes yield but also affects the concentration of active compounds [3], thus lowering both quality and market value. Traditional pest identification methods rely on field devices that capture pest images, which are then transmitted to backend servers for analysis by experts. This approach is inefficient and prone to inaccuracies due to subjective judgment. Although recent efforts have applied deep learning-based object detection networks (such as YOLO [4,5,6,7,8,9,10] and R-CNN [11,12,13]) in agricultural scenarios, the average pest detection accuracy remains between 55% and 75%, and some pests are detected at rates below 50%. This study acknowledges that pest–flower interactions, occlusion, and color blending are critical real-world factors contributing to these challenges. However, this paper posits that these specific factors collectively lead to a more fundamental algorithmic bottleneck: the severe imbalance in the spatial and class distribution of targets in complex field environments. Therefore, this work focuses on this fundamental issue, aiming to provide a general, highly robust solution.
The pursuit of robust solutions in complex visual environments is an active frontier in computer vision. Recent advancements in related domains offer valuable insights. For instance, the work on “A weakly supervised pavement crack segmentation based on adversarial learning and transformers” demonstrates a novel approach to tackling annotation scarcity and complex background interference in pavement inspection, a challenge analogous to the difficulties in agricultural pest detection. By leveraging adversarial learning and Transformer architectures, that study highlights the potential of these techniques in learning robust feature representations from imperfect or limited supervisory signals. While our current work focuses on a fully supervised object detection paradigm to establish a high-performance baseline for honeysuckle pest detection, the success of such weakly supervised methods in other domains strongly motivates our future exploration into reducing the heavy reliance on large-scale, finely annotated datasets. The multi-scale and attention mechanisms developed in our YOLO-MPAM model, which enhance feature discrimination, could serve as a foundational component for future extensions into semi-supervised or weakly supervised learning frameworks for agricultural applications, thereby addressing a broader range of practical constraints.
Spatial distribution imbalance arises because pests are small, numerous, and unevenly distributed: they tend to cluster near the image center while appearing only sparsely at the edges. This presents challenges for traditional detection algorithms. For instance, YOLO struggles in scattered scenes because of errors in locating targets that differ in size and position, and in dense scenes, target occlusion and overlap exacerbate localization errors, reducing the detection accuracy of small pests.
Category distribution imbalance arises when certain pest categories dominate the dataset, causing YOLO and other detection algorithms to become biased toward more frequent categories. This limits the detection of less common pests.
To address the challenge of spatial distribution imbalance, this paper proposes a novel YOLO-based model termed YOLO-MPAM, which incorporates a Multi-dimensional Pyramid Attention Mechanism. This design is motivated not only by practical detection needs but also by a fundamental computational principle observed in biological vision systems. Recent neuroscience research reveals that even the evolutionarily ancient cortex of turtles achieves robust positional encoding in the face of retinal input variations, through mechanisms that parallel multi-scale feature processing and dynamic attentional selection [14]. Inspired by this efficient strategy for achieving invariant representation, our YOLO-MPAM explicitly implements a multi-dimensional pyramid to handle scale variation and an attention mechanism to actively suppress irrelevant regions while amplifying critical features. Consequently, the model enhances the integration of channel and spatial features, effectively guiding the network’s focus toward areas with dense insect populations and significantly improving detection accuracy in localized, imbalanced regions.
In object detection, symmetry provides a mathematical framework for designing balanced architectures that can handle the inherent asymmetries in real-world agricultural images, including uneven spatial distributions and scale variations among pests. The YOLO-MPAM model designed in this study, leveraging its core multi-dimensional pyramid attention mechanism, can overcome complex interference in field environments to achieve real-time pest identification and accurate density statistics. This precision control mode based on real-time monitoring is expected to significantly reduce pesticide usage associated with traditional extensive application methods. While effectively controlling pests, it substantially lowers production costs and environmental residue, ultimately ensuring the yield and quality of honeysuckle. The proposed approach is grounded in symmetry principles, which have proven effective in addressing imbalance problems in complex systems.
2. Related Work
2.1. Deep Learning-Based Object Detection Algorithms
The rapid advancements in artificial intelligence, particularly in deep learning, have led to a revolutionary transformation in insect detection technology. Traditional methods, which relied on handcrafted feature extraction and classifiers, have been gradually supplanted by efficient object detection algorithms.
Deep learning-based object detection algorithms are currently categorized into two main types. Two-stage algorithms, exemplified by R-CNN and Fast R-CNN, first generate region proposals and then perform classification and regression for high-precision detection. One-stage algorithms, represented by the YOLO series, divide the image into a grid of equal-sized regions and regress center points and IoU errors, simplifying the detection task into an end-to-end regression problem. These one-stage methods achieve significant improvements in detection speed while maintaining high accuracy.
However, these algorithms still face significant challenges in honeysuckle pest detection. The pests are often numerous, small, and unevenly distributed, with an average size comprising only 5–8% of the image area. Additionally, the issue of pest occlusion further complicates detection. Existing methods for honeysuckle pest detection yield an average recall rate of just 52.8%, and mAP@0.5 of only 54.1%. This subpar performance can be attributed to the fact that current detection networks treat input images as homogeneous representations composed of equivalent candidate regions or uniformly divided grids. In real-world scenarios, however, different image regions possess varying degrees of importance for detection tasks. This disparity forms the basis for the exploration of attention mechanisms in this study.
2.2. Attention Mechanism-Based Object Detection Algorithms
Attention mechanisms dynamically allocate spatial and channel weights, guiding the model to focus on key information while effectively filtering out background noise, which significantly enhances object detection performance. However, this approach still faces notable limitations in honeysuckle pest detection. A primary issue arises because, to reduce the complexity of the attention mechanism, input feature maps are typically pooled before explicitly modeling channel and spatial dependencies. This pooling process hampers the subsequent training of channel and spatial weights. Several researchers have proposed improvements to feature map compression methods in recent years.
For instance, Jie Hu et al. introduced SENet [15], which pools input feature maps using Global Average Pooling (GAP) [16] to reduce data size. The SE module then models channel dependencies, improving the representation of important channels. However, relying solely on GAP for information compression leads to significant information loss due to its coarse granularity.
Woo et al. proposed CBAM [17], which employs both Global Max Pooling (GMP) [18,19,20] and GAP to pool spatial and channel feature information, improving the model’s ability to focus on crucial regions. CBAM further confirmed that relying exclusively on GAP yields insufficient information. The pooling step in attention mechanisms, which extracts effective information from input feature maps, is critical to the performance of attention modules.
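For concreteness, the following is a minimal PyTorch sketch of a CBAM-style channel attention gate that combines GAP and GMP descriptors; the reduction ratio and layer names are illustrative choices rather than the cited implementations.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel gate: GAP and GMP descriptors pass through a
    shared MLP, are summed, and gate the input channels via a sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg_desc = self.mlp(x.mean(dim=(2, 3)))       # GAP descriptor, (B, C)
        max_desc = self.mlp(x.amax(dim=(2, 3)))       # GMP descriptor, (B, C)
        weights = torch.sigmoid(avg_desc + max_desc)  # channel weights, (B, C)
        return x * weights.view(b, c, 1, 1)           # reweighted feature map

# Example: gate a 256-channel feature map
attn = ChannelAttention(256)
y = attn(torch.randn(2, 256, 40, 40))  # output has the same shape as the input
```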
FCANet proposed by Qin et al., incorporated frequency-domain analysis of feature maps to demonstrate that Global Average Pooling (GAP) represents a special case of low-frequency feature decomposition in the frequency domain. By extending GAP to accommodate additional frequency components, their method captured more informative signals, thereby mitigating the information loss typically associated with both GAP and Global Max Pooling (GMP). Although this frequency-domain approach alleviated information degradation within attention mechanisms, a significant limitation persisted: the transformation from the spatial domain to the frequency domain inherently obscured spatial localization. Since frequency representations are global and cannot directly preserve positional relationships, the model’s capability for precise local feature extraction—such as identifying subtle pest textures or edges in honeysuckle images—remained constrained. This limitation is particularly critical for agricultural pest detection, where accurately locating small, densely distributed targets is essential. To address this specific shortcoming, our proposed method builds upon the concept of multi-scale feature analysis but shifts the focus back to the spatial domain through a pyramid pooling strategy. This approach aims to preserve crucial spatial information while achieving multi-scale perception, effectively overcoming the spatial localization ambiguity inherent in pure frequency-domain methods like FCANet.
Building upon the foundational work of SENet and CBAM, more sophisticated attention mechanisms have been developed. A significant recent advancement is Coordinate Attention (CA) [21], which factorizes channel attention into two parallel 1D feature encoding processes along the horizontal and vertical coordinates. This decomposition allows CA to capture long-range dependencies with precise positional information, effectively preserving spatial locality, a critical aspect that is lost in channel-only attention mechanisms like SENet and, to some extent, in frequency-domain approaches like FCANet.
While advanced attention mechanisms like Coordinate Attention (CA) have improved positional awareness through directional pooling, they remain limited in handling extreme scale variations and complex multi-scale contexts typical of agricultural pest detection. CA’s one-dimensional encoding captures long-range dependencies but may compromise fine spatial details essential for recognizing small, locally distinctive pests.
In contrast, YOLO-MPAM introduces a multi-dimensional pyramid pooling strategy applied directly in the spatial domain. This design simultaneously preserves fine-grained details for small pest identification and incorporates coarse-grained semantic contexts for background distinction, all while maintaining precise spatial localization. Our contribution thus lies in a symmetrical pyramid-based framework that holistically balances multi-scale feature fusion and spatial-channel interdependencies.
2.3. Vision Transformers and Pre-Trained Models in Agricultural Vision
The success of Transformers in natural language processing has rapidly permeated computer vision, leading to the emergence of Vision Transformers (ViTs) [22]. Unlike CNNs that process images via local convolutions, ViTs treat an image as a sequence of patches, enabling global contextual modeling from the very first layer. This capability is particularly advantageous for capturing long-range dependencies in agricultural scenes, such as the spatial relationship between a pest and its host plant across a complex background [23]. Recent studies have begun exploring ViTs for agricultural tasks. For instance, Wang et al. [24] demonstrated that a hybrid architecture combining CNN backbones with Transformer encoders could achieve superior performance in crop disease recognition by effectively integrating local and global features.
Parallel to this architectural shift is the paradigm of leveraging large-scale pre-trained models [25]. Pre-training on massive datasets like ImageNet provides models with robust and generalized feature representations, which can be effectively fine-tuned for specific downstream tasks with limited data, a common scenario in agricultural applications. This approach mitigates overfitting and often leads to faster convergence and improved accuracy compared to training from scratch [26]. While pre-trained CNNs are widely used, the application of pre-trained ViTs in pest detection is an emerging and promising trend that warrants attention. However, the computational complexity of standard ViTs can be prohibitive for real-time field applications, prompting research into more efficient architectures [27].
2.4. Comparative Analysis of Research Gaps and Motivations
The preceding review of existing object detection algorithms and their attention-based enhancements reveals that a significant performance gap remains when these general-purpose models are applied to honeysuckle pest detection, primarily because they fail to effectively address the spatial distribution imbalance challenge presented in this study.
To systematically delineate these limitations and substantiate the research necessity of our work, a comparative summary of existing object detection algorithms is presented in
Table 1. This table provides not only an evaluation of these methods but also a critical analysis of their inherent capabilities in addressing the two core challenges identified therein.
3. A Multi-Scale Pyramid Pooling Attention-Based Method for Honeysuckle Pest Detection
3.1. Definition and Terminology Explanation
To ensure the clarity and consistency of terminology used in this paper and to facilitate understanding for all readers, this section provides centralized definitions of key terms and abbreviations that appear herein. These definitions form the foundation for understanding the subsequent model architecture and algorithmic details. The definitions of key terms and abbreviations are summarized in
Table 2.
3.2. Overall Research Framework
To offer a comprehensive visualization of our research pipeline and elucidate the synergistic relationships among its constituents, we present an overall research framework in
Figure 1. This framework delineates the systematic progression from dataset construction and preprocessing, through the core model design featuring our novel attention modules, to final performance evaluation, specifically tailored to address the challenges of spatial imbalance and scale variation in honeysuckle pest detection.
As illustrated in
Figure 1, this study begins with the collection, cleaning, and annotation of field images of honeysuckle, leading to the construction of a dedicated pest detection dataset.
Utilizing this dataset, we conducted object detection tasks with existing baseline models [28]. The results indicated that detection accuracy remained unsatisfactory in complex field scenarios. Consequently, we introduced and evaluated several mainstream attention-based models (e.g., SENet, CBAM, and Vision Transformer) [29] for comparative analysis. Statistical analysis of the experimental outcomes revealed that the performance metrics of these existing attention models were still suboptimal on samples containing rare categories or exhibiting spatially imbalanced distributions. This investigation thereby identifies and articulates the two core challenges addressed in this paper: spatial distribution imbalance and category distribution imbalance [30].
To address the challenges mentioned above, this research draws inspiration from the way advanced biological visual systems process multi-source information to refine the attention mechanism. By integrating attention weights from shallow layers, which capture detailed perceptions, with those from deep layers, which convey semantic understanding, we achieve cross-level and cross-domain fusion of attention features. This approach enhances the model’s perceptual capability concerning spatially heterogeneous regions and effectively mitigates the performance degradation in detection caused by category distribution imbalance.
3.3. Symmetry-Aware Design Principles in YOLO-MPAM
The overall architecture of YOLO-MPAM is conceived under a symmetry-aware design paradigm [31], which seeks to maintain structural and functional balance across feature hierarchies. This symmetry-aware approach addresses the fundamental challenges in pest detection through three key principles:
Scale Symmetry in Feature Processing: Traditional object detectors often exhibit asymmetry in handling objects of different sizes, favoring larger targets over smaller ones. YOLO-MPAM introduces pyramid symmetry through its multi-scale processing architecture, where features at different scales (from fine-grained details to coarse semantic information) are treated with equal importance. This symmetric scaling ensures consistent detection performance regardless of pest size variations in the image.
Channel–Spatial Symmetry: Most attention mechanisms focus predominantly on either channel or spatial dimensions, creating an inherent asymmetry. Our MPAM module achieves bidirectional symmetry by simultaneously optimizing both channel weights and spatial attention maps. This symmetric treatment allows the model to focus on important feature channels while also emphasizing relevant spatial regions, creating a balanced attention mechanism.
Architectural Symmetry in Module Design: The proposed PPSAM and PPCAM modules exhibit mirror-symmetric structures in their processing pipelines. While PPSAM processes spatial information through pyramid pooling along channel dimensions, PPCAM employs a symmetrically opposite approach by processing channel information through spatial pyramid pooling. This complementary design ensures comprehensive feature enhancement from multiple symmetric perspectives.
3.4. YOLO-MPAM Model Architecture
Based on the YOLO model architecture, the overall design of the YOLO-MPAM model architecture was optimized in three key aspects: the backbone network, neck network, and head network, in alignment with the practical requirements for pest detection in honeysuckle.
The existing YOLO backbone networks (such as DarkNet and CSPDarkNet) primarily rely on stacked convolutional and pooling operations. While these can extract global features, they fail to adequately capture the local details of small targets, such as pests (e.g., antennae and texture). To address the shortcomings of local feature extraction and spatial information utilization, a Pyramid Pooling Spatial Attention Module (PPSAM) is embedded within the backbone network. This enhancement is designed to improve the backbone’s ability to extract detailed local features.
The current YOLO neck networks rely on simple upsampling, addition, or concatenation to fuse multi-scale feature maps. However, these methods do not account for the varying contributions of different channels. To optimize the channel feature fusion capability of the neck network, a Pyramid Pooling Channel Attention Module (PPCAM) is introduced. This module is aimed at enhancing the fusion of features across different channels.
The YOLO head networks typically rely on convolutional or fully connected layers to process feature maps, lacking explicit attention modeling for both spatial and channel dimensions. A Multi-scale Pyramid Attention Module (MPAM) is embedded within the head network, combining the feature information from both the Pyramid Pooling Channel Attention Module (PPCAM) and the Pyramid Pooling Spatial Attention Module (PPSAM). This integration aims to improve the accuracy of detection head localization and classification.
As shown in Figure 2, the model architecture primarily consists of three components: Backbone, Neck, and Head. The CBS module is composed of Conv, BatchNorm [32], and SiLU [33] modules. The C3 module consists of multiple convolutional layers and Bottleneck structures. The SPPF is a multi-scale pooling module that enables the network to process input images of arbitrary sizes while simultaneously extracting multi-scale features.
The following section will provide a detailed explanation of the design principles behind the MPAM, PPCAM, and PPSAM modules.
3.5. Pyramid Pooling Channel Attention Module
Traditional channel attention mechanisms like SENet, and even advanced ones like Coordinate Attention (CA) [34], primarily focus on modeling interdependencies between channels. However, they often rely on global spatial compression (e.g., GAP in SENet) or 1D encoding (in CA), which can oversimplify the rich spatial information contained within each channel. This is particularly detrimental when the critical features for distinguishing pests are localized in small image regions. To overcome this, our Pyramid Pooling Channel Attention Module (PPCAM) is designed to incorporate multi-scale spatial context directly into the channel weighting process.
SENet utilizes GAP to spatially compress feature maps, preserving significant spatial features while reducing the feature scale. It combines this with a channel attention mechanism to dynamically adjust channel weights, thereby enhancing the network’s sensitivity to important features.
However, in the context of honeysuckle pest detection, the spatial features embedded in images are complex, and relying solely on GAP to compress the input feature maps may overlook rich details in the spatial information, weakening the effectiveness of the channel attention mechanism. To address this issue, the PPCAM builds upon the traditional channel attention mechanism by integrating multi-scale 2D GAP and GMP operations, aiming to effectively incorporate local spatial information within the channel attention mechanism.
Additionally, the module trains the channel weights using feature maps of various scales, outputted by the pyramid pooling. The global channel weights from the upper layers are then fused with the local channel features from the lower layers. This fusion allows the network to compute channel weights while considering local spatial information, significantly improving the channel-space collaborative perception capability of the channel attention mechanism. The principle of PPCAM is illustrated in
Figure 3.
Colors are used to distinguish different functional components: light blue corresponds to the Multilayer Perceptron (MLP), light purple corresponds to the Convolutional Neural Network (CNN), and beige corresponds to the intermediate feature maps. The variations in lightness and granularity of the beige shades further reflect the differences in channel weights generated after spatial feature fusion, enabling a more refined representation of the spatial hierarchical information of the feature maps.
As shown in Figure 3, the PPCAM performs spatial pyramid pooling on the input feature map F through an adaptive pooling method, which effectively preserves more local spatial information. The pooled multi-scale information is then fed into the corresponding MLP network and CNN to calculate the intermediate channel weights for each layer, from which the final channel attention weights CAW{C × 1 × 1} are computed. These channel weights are used to perform a weighted fusion with the feature map F, generating the final weighted feature map U. The specific implementation steps of the PPCAM are outlined below.
In Figure 3, the input feature map F has dimensions H × W × C, where H and W represent the height and width of the feature map and C denotes the number of channels. For the input feature map F, multi-scale GAP and GMP are applied along the spatial dimensions. The pooling kernel size is adaptively adjusted based on the divided regions, as defined by Equation (1). Through these operations, three distinct scales of 2D pooling regions are obtained, corresponding to {H × W, H/2 × W/2, H/4 × W/4}, where Q denotes the number of divisions and takes the value 1, 2, or 4.
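Consistent with the pooling regions listed above, one way to express the adaptive kernel size referenced by Equation (1), for a division number Q, is:

$$
k_Q = \left(\left\lceil \frac{H}{Q} \right\rceil,\ \left\lceil \frac{W}{Q} \right\rceil\right), \qquad Q \in \{1, 2, 4\},
$$

so that Q = 1 covers the whole map, while Q = 2 and Q = 4 pool progressively smaller local regions.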
Each scale of pooling is independently applied to the input feature map
F, performing both GAP and GMP along the spatial dimensions. This results in three distinct pooling outputs at different scales.
Here, the pooled output at each scale constitutes a channel feature map, F denotes the input feature map, and Q ∈ {1, 2, 4}. The symbol ⊕ indicates concatenation along the channel dimension.
Multiscale Channel Feature Map Processing: The multiscale channel feature maps produced by the pyramid pooling are first fed into an MLP network to compute intermediate channel weights for the corresponding feature map. These intermediate weights are multiplied element-wise with the pyramid-pooled channel feature map, effectively merging local channel information with global channel weights and ensuring a balanced consideration of both. The product is then passed through CNN and MLP networks to compute refined channel weights of size C × 1 × 1. Finally, these weights are fused once more with the local channel feature map, yielding the final channel attention weights CAW.
Channel Attention Weighting: The channel attention weights CAW{C × 1 × 1} are multiplied element-wise with the input feature map F{H × W × C}, resulting in the weighted output feature map U.
PPCAM, through multiscale pooling, integrates global and local channel feature information into the channel weights, enhancing the model’s ability to focus on local features. Compared to traditional channel attention mechanisms, the CAW channel weights not only indicate which channels are important but also emphasize channel significance using local spatial information, achieving a channel-space collaborative perception effect. In detection scenarios involving small objects, the model can automatically enhance the corresponding channel weights of the local region where the target is located, thereby improving detection performance. This approach is particularly effective in handling scenarios with dense pest distributions in images.
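To make the data flow described above concrete, the following is a minimal PyTorch sketch of the PPCAM computation; the per-scale MLP widths and the coarse-to-fine fusion (implemented here as a running product of per-scale weights) are simplifying assumptions rather than the exact wiring shown in Figure 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PPCAM(nn.Module):
    """Sketch of the Pyramid Pooling Channel Attention Module: spatial
    pyramid pooling (GAP and GMP at 1x1, 2x2 and 4x4 output sizes) feeds
    per-scale MLPs, whose outputs are fused into channel attention
    weights CAW of shape (B, C, 1, 1)."""
    def __init__(self, channels: int, reduction: int = 16, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        hidden = max(channels // reduction, 8)
        # one 1x1-conv MLP per pyramid level; input is GAP||GMP, hence 2*C channels
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2 * channels, hidden, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, kernel_size=1),
            )
            for _ in scales
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        caw = None
        for q, mlp in zip(self.scales, self.mlps):
            avg = F.adaptive_avg_pool2d(x, q)             # (B, C, q, q) multi-scale GAP
            mx = F.adaptive_max_pool2d(x, q)              # (B, C, q, q) multi-scale GMP
            desc = torch.cat([avg, mx], dim=1)            # concatenate along channels
            w = mlp(desc).mean(dim=(2, 3), keepdim=True)  # (B, C, 1, 1) per-scale weight
            caw = w if caw is None else caw * w           # fuse global and local weights
        caw = torch.sigmoid(caw)                          # final channel attention weights
        return x * caw                                    # weighted feature map U

# Example: u = PPCAM(256)(torch.randn(2, 256, 40, 40))
```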
3.6. Pyramid Pooling Spatial Attention Module
Spatial attention mechanisms aim to highlight important regions. However, methods that pool across channels to generate spatial weights (e.g., the spatial part of CBAM) or those that derive positional information indirectly (like CA) might not fully leverage the channel-specific spatial patterns that are key for identifying similar-looking pests. Our Pyramid Pooling Spatial Attention Module (PPSAM) addresses this by performing multi-scale pooling along the channel dimension, preserving the channel-wise spatial information at different granularities. The proposed PPSAM was designed to leverage adaptive pooling for channel pyramid pooling on the input feature map F, thereby retaining essential local information within the channels.
In object detection neural networks, as the input feature map undergoes multiple convolutions, deeper spatial information (such as edges and textures) is merged with channel information. At this stage, the channel contains rich feature information, and using a single GAP to process these features fails to accurately reflect the importance of local information within the channel, leading to the loss of critical local features. The proposed PPSAM addresses this issue by incorporating multiscale pyramid pooling to prevent the loss of local key features in GAP operations.
By using one-dimensional multiscale GAP and GMP pyramid pooling, feature maps of different scales are generated. These maps are then input into the convolution layers to compute spatial weights. By integrating the spatial weights from higher layers with the channel information from lower layers, the model can focus on spatial details while also considering the critical information in important channels. This significantly enhances the channel-space collaborative perception capability of the spatial attention mechanism. The detailed implementation process and principles of PPSAM are shown in
Figure 4.
Colors are used to distinguish the functional modules and data flows in the figure: yellow represents the pooling results generated by Global Average Pooling (GAP), green represents the pooling results generated by Global Max Pooling (GMP), and blue corresponds to the final output Spatial Attention Weights (SAW). The variations in the depth of the blue color reflect the fine-grained perceptual information after channel feature fusion, indicating differences in importance across spatial locations. After multiplying SAW with the input feature map F, the resulting feature map integrates both spatial and channel attention. In the figure, darker areas represent key positions that the model focuses on, while lighter areas correspond to suppressed interference features.
As shown in Figure 4, the PPSAM applies adaptive channel pyramid pooling to the input feature map F, thereby retaining essential local information within the channels. The feature maps generated at different scales by the one-dimensional pyramid pooling are subsequently fed into a convolutional neural network to compute intermediate spatial attention weights of size 1 × H × W at each level. By integrating the local channel information from the lowest pyramid level, the final spatial attention weights SAW{1 × H × W} are computed. The detailed implementation of the PPSAM is described in the following section.
3.7. Multi-Scale Channel Pyramid Pooling
The pyramid pooling-based spatial attention mechanism first applies multi-scale GMP and GAP operations along the channel dimension of the input feature map F, and concatenates them to generate a multi-channel two-dimensional spatial feature. Pooling along the channel axis has been shown to be effective in highlighting informative regions. The kernel size for one-dimensional pooling is adaptively adjusted based on the number of channel segments and the number of input feature channels, specifically as follows.
In this setting, C represents the number of channels in F, and P corresponds to the number of channel partitions, with P being set to 1, 2, or 4.
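Consistent with the kernel sizes listed below, the adaptive one-dimensional kernel can be written as:

$$
k_P = \left\lceil \frac{C}{P} \right\rceil, \qquad P \in \{1, 2, 4\}.
$$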
Three pooling kernels with sizes of C, C/2, and C/4 are obtained through the aforementioned operations. Each kernel is independently employed to perform GMP and GAP on the input F along the channel dimension. The detailed operations are defined as follows. In this context, the pooled output at each scale denotes the spatial feature map, F denotes the input feature map, and P denotes the number of segments in the multi-scale channel pyramid pooling, where P ∈ {1, 2, 4}. The symbol ⊕ denotes concatenation along the channel dimension.
Multi-scale Spatial Feature Map Processing: The multi-scale spatial feature maps (P ∈ {1, 2, 4}) are sequentially fed into a convolutional neural network for training. Initially, the coarsest spatial feature map is input, and an intermediate spatial weight is derived after training. This intermediate weight is then element-wise multiplied with the next, finer spatial feature map in the spatial domain, effectively fusing spatial and channel information. The resulting spatial weight is computed by the network and, at the finest granularity, is fused with the channel information to produce the spatial weight SAW, which incorporates crucial channel-specific information.
Spatial Attention Weighting: The spatial attention-weighted output is obtained by performing element-wise multiplication between the spatial attention weights SAW{1 × H × W} and the input feature map F{C × H × W}.
PPSAM incorporates a multi-scale pooling pyramid structure to compress channel features, effectively preserving more relevant channel information in complex conditions. Additionally, multi-scale channel feature maps are processed, integrating spatial weights and channel information at each layer to enhance spatial attention performance.
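A corresponding minimal PyTorch sketch of the PPSAM flow is given below; the group-wise channel pooling (splitting the C channels into P groups) and the running-product fusion of per-scale weights are simplifying assumptions, and C is assumed divisible by 4.

```python
import torch
import torch.nn as nn

class PPSAM(nn.Module):
    """Sketch of the Pyramid Pooling Spatial Attention Module: GAP and GMP
    along the channel axis at pyramid granularities P in {1, 2, 4} produce
    2D descriptors, which convolutions turn into spatial attention weights
    SAW of shape (B, 1, H, W)."""
    def __init__(self, segments=(1, 2, 4), kernel_size: int = 7):
        super().__init__()
        self.segments = segments
        self.convs = nn.ModuleList(
            nn.Conv2d(2 * p, 1, kernel_size, padding=kernel_size // 2)
            for p in segments
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        saw = None
        for p, conv in zip(self.segments, self.convs):
            groups = x.view(b, p, c // p, h, w)        # split channels into p groups
            avg = groups.mean(dim=2)                   # (B, p, H, W) channel-wise GAP
            mx = groups.amax(dim=2)                    # (B, p, H, W) channel-wise GMP
            w_p = conv(torch.cat([avg, mx], dim=1))    # (B, 1, H, W) per-scale weight
            saw = w_p if saw is None else saw * w_p    # fuse across pyramid levels
        saw = torch.sigmoid(saw)                       # final spatial attention weights
        return x * saw                                 # spatially weighted feature map

# Example: out = PPSAM()(torch.randn(2, 256, 40, 40))
```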
3.8. Multi-Scale Pyramid Attention Module
MPAM is designed to optimize feature maps simultaneously across spatial and channel dimensions [35,36], whereas PPSAM and PPCAM, when used individually, are limited to a single dimension. Specifically, PPSAM focuses exclusively on spatial positions within the feature maps, while PPCAM attends solely to channel-wise information. By incorporating MPAM into the head network, the final feature representations are directly optimized, resulting in comprehensive improvements in both classification accuracy and localization precision, thereby leading to more effective enhancements in overall detection performance.
As illustrated in
Figure 5, MPAM sequentially connects PPCAM and PPSAM to generate the
CAW and SAW for the input feature map F. Element-wise multiplication is then performed among
CAW, SAW, and F, resulting in a refined feature map. This refinement strengthens the network’s capability to extract fine-grained features, thereby improving both the accuracy and the recall rate in object detection.
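Reusing the PPCAM and PPSAM sketches above, the serial composition in MPAM can be summarized as follows; as before, this is an illustrative sketch rather than the exact head-network integration.

```python
import torch
import torch.nn as nn

class MPAM(nn.Module):
    """Sketch of the Multi-scale Pyramid Attention Module: channel
    attention (CAW) from PPCAM followed by spatial attention (SAW)
    from PPSAM, so the output is effectively F * CAW * SAW."""
    def __init__(self, channels: int):
        super().__init__()
        self.ppcam = PPCAM(channels)   # channel branch (see the Section 3.5 sketch)
        self.ppsam = PPSAM()           # spatial branch (see the Section 3.6 sketch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ppcam(x)              # apply channel attention weights
        x = self.ppsam(x)              # apply spatial attention weights
        return x

# Example: refined = MPAM(256)(torch.randn(2, 256, 40, 40))
```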
To intuitively demonstrate the ultimate effectiveness of the multidimensional feature fusion mechanism at the core of the YOLO-MPAM model, we first analyze the model’s learning behavior through visualization techniques. As shown in
Figure 6, we employed XGradCAM [37] to generate attention distribution heatmaps, providing a qualitative comparison of the differences in feature fusion and focus localization among different attention mechanism models. Compared to SE-Net and CBAM, the heatmaps generated by the YOLO-MPAM model exhibit two key advantages: Firstly, the activated regions show a high degree of overlap with the actual pest target contours in the images, indicating that the features fused through the MPAM module can more accurately localize the spatial positions of the targets. Secondly, the heatmaps are more concentrated within the target regions, while the activation responses in the background or non-critical areas are significantly weaker. This fully demonstrates that our pyramid attention module, by symmetrically fusing multi-scale spatial and channel information, successfully enables the model to learn to suppress irrelevant interferences and enhance critical features, ultimately achieving high-quality feature representation.
As shown in
Figure 6, the heatmap visualizations of the integrated MPAM, CBAM, and SE network models are compared. The true labels for each input image are displayed at the bottom. P (Precision) represents the accuracy score of the model for correctly identifying the true class, while R (Recall) indicates the proportion of positive samples correctly predicted as positive by the model. From the comparison of the PR-Score, it can be observed that YOLO-MPAM outperforms the other two models in terms of both detection accuracy and recall rate. Additionally, the heatmap comparison reveals that the heat distribution in YOLO-MPAM is more concentrated, effectively focusing on the target areas to be detected.
4. Experiment
4.1. Dataset Preparation
Extensive pest trapping activities were conducted across multiple honeysuckle cultivation regions, during which pest species and occurrence times were systematically recorded.
Table 3 summarizes the key attributes collected throughout the dataset construction process, including insect scientific names, taxonomic classifications, collection dates, and sampling locations.
The Honeysuckle Pest and Insect Dataset (HPID) was constructed from imagery systematically collected during the 2023 growing season across multiple honeysuckle plantations in northern China’s primary production region, specifically within Hebei Province. Julu County served as the core sampling area, with Dicun Township as a key collection site. A standardized protocol utilizing industrial-grade cameras paired with uniform insect trap boards was implemented for image acquisition. The spatial distribution of captured samples across townships, including Dicun, is detailed in
Table 3. All images were captured in situ under unconstrained field conditions, covering a wide spectrum of natural variations in illumination, weather, and crop phenology. This approach ensures the dataset’s high representativeness and methodological consistency, enhancing its utility for developing and validating robust agricultural monitoring applications.
Table 3 summarizes the types, quantities, and distributions of pests monitored across multiple honeysuckle cultivation areas in Julu County, Hebei Province, on 28 June 2022. The data reveal that noctuid moths (28 individuals) and scarab beetles (19 individuals) constituted the dominant pest populations, whereas crambid moths and geometrid moths were observed in significantly lower numbers, with only four individuals each. These findings indicate that substantial class imbalance was inherently present among the samples during the data collection process.
To provide a comprehensive overview of the final experimental dataset,
Table 4 summarizes the detailed statistics, including the number of images, bounding box instances, and resolution distribution for each pest category. The dataset was randomly split into training, validation, and test sets with a ratio of 7:2:1.
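As a concrete illustration of this split, the following sketch partitions a list of image paths with the 7:2:1 ratio and a fixed seed; the helper name and the path-list interface are illustrative rather than taken from the actual data pipeline.

```python
import random

def split_hpid(image_paths, seed: int = 42, ratios=(0.7, 0.2, 0.1)):
    """Shuffle once with a fixed seed, then cut into train/val/test
    subsets following the 7:2:1 ratio used for HPID."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(n * ratios[0]), int(n * ratios[1])
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]
    return train, val, test
```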
4.2. Data Preprocessing
To ensure dataset quality for model training, a rigorous data cleaning process was applied. Images unsuitable for training and validation were manually removed. The primary criteria for removal included: (1) blank images containing no pests, which provide no positive training signal; (2) images with excessively dense clusters of insects (“insect piles”), where severe occlusion prevents accurate individual labeling and could mislead the model; and (3) low-quality images corrupted by equipment malfunctions (e.g., motion blur, overexposure), which degrade feature extraction. Retaining such images would adversely affect training convergence and final detection performance.
Following this cleaning step, image categories were manually annotated. As shown in Table 5, before cleaning the dataset contained 1431 images at 4000 × 3000 pixels and 1738 images at 5472 × 3648 pixels; after cleaning, 1354 images at 4000 × 3000 pixels and 1682 images at 5472 × 3648 pixels were retained.
To quantitatively and intuitively present the core challenges in the HPID dataset,
Figure 7 provides detailed statistical analyses. The figure systematically reveals three major difficulties: severe class imbalance, significant scale variation, and spatial distribution imbalance, all of which are key obstacles to accurate pest detection.
Figure 7a illustrates the category distribution, highlighting the pronounced imbalance among the four pest classes. The number of instances for Cockchafer and Noctuid is significantly higher than that for Geometridae and SnoutMoth, with Cockchafer occurring approximately 4.5 times more frequently than SnoutMoth; moreover, Noctuid appears nearly 10 times more often than SnoutMoth. Such imbalance may lead the model to bias toward majority classes during training.
Figure 7b employs a heatmap to visualize the spatial distribution of bounding box centers across all images. The intense blue concentration in the central region indicates that pests predominantly appear near the image center, a phenomenon attributed to camera framing during data collection. In contrast, the sparse peripheral areas make it difficult to detect pests near the edges.
Figure 7c shows a scatter plot of normalized pest sizes (width and height relative to image dimensions). The wide distribution of points, ranging from near zero to approximately 0.4, confirms significant scale variation. Notably, dense clusters are observed in the small-scale region (width and height < 0.1), underscoring the prevalence of small-target pests and the necessity of multi-scale feature learning.
These statistical characteristics demand a robust detection model capable of handling class imbalance, adapting to varying scales, and maintaining sensitivity across the entire image plane, which directly informs the design of our YOLO-MPAM architecture.
4.3. Classification and Labeling Process
Under the guidance of agricultural experts, the insects in the images were divided into four main categories based on the pest characteristics of honeysuckle crops, as shown in
Figure 8. The four categories of pests in this study were defined according to strict taxonomic standards: the grub beetle belongs to the family Scarabaeidae within the order Coleoptera, the geometer moth belongs to the family Geometridae within the order Lepidoptera, the noctuid moth belongs to the family Noctuidae within the order Lepidoptera, and the pyralid moth belongs to the family Pyralidae within the order Lepidoptera. This classification system not only complies with entomological taxonomy norms but also better reflects the essential differences in morphological characteristics and damage mechanisms among different pests [38]. All annotated images ultimately formed the Honeysuckle Pest and Insect Dataset (HPID) used throughout this study.
As illustrated in
Figure 8, the dataset used for honeysuckle pest detection was categorized into four groups: Cockchafer, Geometridae, Noctuid, and Snout Moth.
4.4. Experimental Setup
The single-stage object detection algorithm YOLOv5, based on the PyTorch framework, was employed in the experiments. The hardware configuration included an Intel i7 processor and two NVIDIA RTX 3080 GPUs with a combined 32 GB of VRAM to accelerate training. The experiments were conducted in a CentOS 7.8 environment with Python 3.9, PyTorch 2.2.2, CUDA 11.4, and cuDNN 7.5.5.
The model training employed the Adam optimizer, with an initial learning rate set to 0.001, and utilized a cosine annealing scheduling strategy. The batch size was set to 16, and the training was conducted for 100 epochs. To ensure comparability of the results, all comparative experiments were performed under the same training-validation-test split, and the random seed was fixed at 42 to guarantee the reproducibility of the experiments.
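The following sketch mirrors these settings in plain PyTorch; `model` and `train_loader` are placeholders for the YOLO-MPAM network and the HPID data loader, and the loss call stands in for the YOLOv5-style composite detection loss.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

torch.manual_seed(42)                      # fixed random seed for reproducibility
EPOCHS, BATCH_SIZE, INIT_LR = 100, 16, 1e-3

# model: the YOLO-MPAM detector; train_loader: HPID batches of size 16 (placeholders)
optimizer = Adam(model.parameters(), lr=INIT_LR)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine annealing schedule

for epoch in range(EPOCHS):
    for images, targets in train_loader:
        loss = model(images, targets)      # composite detection loss (box + obj + cls)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                       # anneal the learning rate once per epoch
```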
4.5. Analysis of the YOLO-MPAM Model
Parameter Configuration and Network Model Parameters: During the experimental phase, the YOLO-MPAM network model was adopted for training purposes. The detailed architecture and corresponding parameter settings of the model are summarized in
Table 6. In this table, “Current Layer” denotes the index of each network module, serving as its unique identifier. “Input Source” indicates the origin of the input data for each module, where a value of “−1” signifies that the input is derived from the immediately preceding layer. “Module Type” specifies the nature of the network module, while “Parameters” list the configuration details associated with each module. For the final Detect layer, “nc” represents the number of target classes to be predicted, and “anchor” provides the configuration details of the pre-defined anchors.
4.6. Model Performance Evaluation
To assess the effectiveness of the YOLO-MPAM model in the honeysuckle pest detection task, a comparative analysis was conducted against several state-of-the-art attention mechanisms, including SE-Net, SA-Net [39], CBAM, and ECA-Net [40]. The evaluation was performed across multiple dimensions using comprehensive performance metrics.
Initially, detection performance was compared based on mean Average Precision for rare categories (mAPc) and small object categories (mAPs). A fine-grained classification analysis was then conducted using confusion matrices, with particular attention to the comparison of True Positive Rates (TPR) and False Positive Rates (FPR) across various categories. Subsequently, the F1-score was employed to evaluate the model’s balance between precision and recall. Finally, the stability of detection performance under varying confidence thresholds was systematically assessed through Precision-Recall (P-R) curves.
Experimental results indicate that the YOLO-MPAM model consistently outperformed competing methods across all three evaluation dimensions.
The comparative experiments were carried out on the HPID dataset, and the corresponding results are summarized in
Table 7.
As presented in Table 7, mAPc denotes the mean Average Precision for rare categories, while mAPs refers to the mean Average Precision for small object categories. The experimental results indicate that the YOLO-MPAM model achieved substantial performance gains over the baseline model, with improvements of 16.0% in mAPc@[0.5–0.95] and 12.3% in mAPs@[0.5–0.95]. Furthermore, compared to other attention-based models, YOLO-MPAM exhibited average increases of 10.15% in mAPc@[0.5–0.95] and 5.1% in mAPs@[0.5–0.95].
The results indicate that the enhanced detection accuracy of YOLO-MPAM comes with increased computational cost. With 32.86 M parameters and 10.67 GFLOPs, our model is more complex than the baseline YOLOv5 (25.63 M parameters, 7.98 GFLOPs) and the other attention-based variants. Consequently, the inference speed of YOLO-MPAM is measured at 82 FPS on our experimental hardware (NVIDIA RTX 3080), which is lower than YOLOv5’s 112 FPS. This trade-off between accuracy and efficiency is a common challenge when introducing advanced modules like MPAM.
Although the comparison with attention-mechanism variants validates the effectiveness of the module, to more comprehensively evaluate the competitiveness of YOLO-MPAM, this paper further compares it with recognized state-of-the-art baseline models in the object detection domain, including YOLOv8 [41], YOLOv9 [9], and the Transformer-based RT-DETR [42]. This comparison aims to verify the advanced capabilities and practical value of the proposed method in addressing the specific challenges of agricultural pest detection from a broader perspective.
As shown in
Table 8, the results of the comparative experiments with current mainstream advanced detection models demonstrate that the YOLO-MPAM model proposed in this study exhibits significant advantages in detection accuracy, computational efficiency, and practical applicability. In terms of the core detection accuracy metrics, YOLO-MPAM achieves comprehensive superiority: it attains an mAP@0.5 of 0.835 and an mAP@[0.5:0.95] of 0.573, both of which significantly outperform all the compared models. This outcome validates the effectiveness of the network architecture design of YOLO-MPAM, particularly its multi-dimensional pyramid attention mechanism, in detecting multi-scale and imbalanced distributed pests in complex agricultural scenarios. While the number of parameters in YOLO-MPAM (32.86 M) is comparable to that of RT-DETR-L (32.0 M) and YOLOv9c (25.5 M), its computational complexity exhibits an order-of-magnitude advantage: YOLO-MPAM requires only 10.67 GFLOPs, far lower than the 102.01 GFLOPs of YOLOv9c and the 97.13 GFLOPs of RT-DETR-L, roughly one-tenth of both. This indicates that the architectural design of YOLO-MPAM is highly efficient, enabling top-tier detection performance with minimal computational cost, avoiding unnecessary computational redundancy, and making it particularly suitable for deployment on edge devices with limited computing power.
4.7. Ablation Study of YOLO-MPAM
To validate the independent contributions of each module in the YOLO-MPAM model, we designed systematic ablation experiments. The experiments used YOLOv5 as the baseline and progressively integrated the PPSAM, the PPCAM, and the MPAM. The performance of these models was evaluated on the HPID dataset.
As shown in Table 9, the experimental results show that the progressive integration of each module leads to significant performance improvements:
Contribution of the PPSAM module: Introducing PPSAM into the backbone network (Model A) increases mAP@0.5 from 0.628 to 0.725 (a relative improvement of 15.4%), demonstrating that multi-scale spatial attention effectively enhances the model’s ability to perceive small targets with imbalanced spatial distribution.
Synergistic effect of the PPCAM module: Adding the PPCAM on top of Model A (Model B) further increases mAP@0.5 to 0.785 (a relative improvement of 8.3%), demonstrating the complementary nature of channel attention and spatial attention, which enhances the discriminability between similar categories.
Complete optimization with the MPAM module: The full model integrating MPAM achieves a further increase in mAP@0.5 to 0.835 (a relative improvement of 6.4%), validating the collaborative optimization of classification and localization tasks at the detection head through multi-dimensional attention mechanisms.
Parameter efficiency analysis reveals that compared to the baseline model, YOLO-MPAM incurs a 28.2% increase in the number of parameters but delivers a 32.7% gain in accuracy. The marginal benefit (a 1.16% improvement in accuracy for each 1% increase in parameters) surpasses that of comparative models (such as YOLOv5-CBAM, which achieves a marginal benefit of 0.89%), highlighting the computational efficiency of the module design.
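The marginal benefit quoted above follows directly from the reported figures:

$$
\text{marginal benefit} = \frac{\Delta\,\text{mAP@0.5 (relative)}}{\Delta\,\text{parameters (relative)}} = \frac{32.7\%}{28.2\%} \approx 1.16,
$$

i.e., each 1% of additional parameters yields roughly 1.16% of additional accuracy, versus 0.89% for YOLOv5-CBAM.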
4.8. Confusion Matrix Comparison Experiment
In this study, a normalized confusion matrix is employed to visualize model performance [43], primarily motivated by the inherent class imbalance of the dataset (see Table 4). Given the substantial variation in sample sizes across categories (e.g., 5688 instances for Cockchafer compared to 1254 for SnoutMoth), the normalization procedure scales each row to 100%, thereby mitigating visual biases introduced by differing class frequencies. This approach enables an equitable assessment of the model’s recognition capability, specifically the recall rate, for each individual class, thereby reinforcing the core thesis regarding the model’s robustness under imbalanced class distributions.
Furthermore, the normalized matrix aligns conceptually with the principal evaluation metrics adopted in this work—namely, mean Average Precision (mAP), Precision, and Recall. Its diagonal elements correspond directly to per-class recall, while off-diagonal entries reflect the proportion of misclassifications. In contrast to the raw confusion matrix, where absolute counts may disproportionately reflect the influence of majority classes, the normalized representation offers clearer insight into the model’s performance on minority categories. Consequently, it provides a more targeted and interpretable analytical tool for evaluating classification performance in imbalanced data settings.
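A minimal sketch of the row normalization described above is shown below; the example counts are illustrative, and the normalized dimension is assumed to index the true classes so that the diagonal reads as per-class recall.

```python
import numpy as np

def normalize_confusion_matrix(cm: np.ndarray) -> np.ndarray:
    """Scale each row of a raw-count confusion matrix to sum to 1, so that
    majority and minority classes are displayed on the same [0, 1] scale."""
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)     # guard against empty rows

# Illustrative imbalanced two-class matrix (5688 vs. 1254 instances)
raw = np.array([[5460, 228],
                [188, 1066]])
print(normalize_confusion_matrix(raw).round(2))
# [[0.96 0.04]
#  [0.15 0.85]]
```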
According to the confusion matrix, YOLO-MPAM demonstrated significant improvements in both Precision and Recall compared to the baseline model, with an overall mAP increase of 5%.
For difficult-to-distinguish targets such as Noctuid moths, the FPR decreased by 12.8%.
The TPR improved by 8.7% over the original model, and for Geometridae moths, which previously exhibited low TPR values, a substantial increase of 40% in TPR was observed.
As illustrated in Figure 9, the Y-axis (“Predicted”) denotes the model’s predicted labels, whereas the
X-axis (“True”) indicates the ground truth labels. In the confusion matrix of YOLO-MPAM, when analyzing the Geometridae class horizontally, the predicted label distribution is observed as {0, 0.79, 0, 0.01, 0.04}, with a correct prediction rate of 0.79 and a cumulative incorrect prediction rate of 0.05. Consequently, the FPR for Geometridae is calculated as 0.05, representing an average reduction of 6.4% compared to other models.
Conversely, when examining the Geometridae class vertically, the true label distribution appears as {0.01, 0.03, 0.17, 0.79, 0}, wherein a proportion of 0.79 of the true Geometridae samples are correctly predicted, yielding a TPR of 0.79. This reflects an average improvement of 12.7% over other models.
These results demonstrate a substantial enhancement in the model’s recall and a notable reduction in its error rate.
4.9. F1-Score Comparison Experiments
The F1-score serves as a comprehensive metric that balances precision and recall. Across varying confidence thresholds, a model achieving an F1 score closer to 1 demonstrates superior predictive performance, indicating a more optimal trade-off between precision and recall.
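For reference, the F1-score is the harmonic mean of precision P and recall R:

$$
F_1 = \frac{2 \cdot P \cdot R}{P + R},
$$

so it approaches 1 only when both precision and recall are simultaneously high.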
Figure 10 presents a comparison of the F1-Score curves between YOLO-MPAM and other detection models. The baseline model, which does not incorporate an attention mechanism, exhibits relatively low F1-Scores, whereas YOLO-MPAM demonstrates a significant improvement. Notably, within the confidence interval [0.2, 0.8], YOLO-MPAM raises the average F1-Score from roughly 0.6, the level achieved by the other models, to roughly 0.8. In addition, across all classes, YOLO-MPAM reaches an accuracy of 86% at a confidence level of 61.5%, a 24% improvement over the baseline model’s optimal accuracy of 62%. These results indicate that the YOLO-MPAM model not only enhances prediction accuracy in object detection tasks but also strengthens the recall of target samples, demonstrating a clear advantage in overall performance.
From the F1-Score curve in
Figure 10, it can be observed that YOLO-MPAM exhibits a relatively gentle decline within the confidence threshold range of [0.2, 0.5], which contrasts sharply with the baseline model and other attention-based models. This phenomenon arises from the more concentrated and reliable distribution of confidence scores generated by the MPAM module, reducing the production of low-quality prediction boxes.
It is noteworthy that in the region with confidence thresholds above 0.8, the F1-Scores of all models show a significant decline, which aligns with the general pattern of object detection tasks—excessively high thresholds filter out a large number of true positive detection results. The relative advantage of YOLO-MPAM in this region further demonstrates the high reliability of its prediction quality.
4.10. PR Comparison Experiments
The Precision-Recall (P-R) curve provides an intuitive visualization of a model’s performance by illustrating the relationship between Precision and Recall at varying threshold levels. As shown in the P-R curve of the improved YOLO-MPAM model, it can be observed that the model significantly enhances Precision while maintaining a high level of Recall. Consequently, the overall performance of YOLO-MPAM surpasses that of the baseline model.
As illustrated in
Figure 11, the YOLO-MPAM model achieves a Precision of 84.3% on the P-R curve while maintaining a Recall of 80%. Under the same conditions, the baseline model attains only 28.7% Precision, and other attention-based models achieve an average Precision of 68.9%. Additionally, for all classes, the mAP@0.5 of YOLO-MPAM is improved by 19.1% compared to the baseline model and by an average of 11.3% compared to other attention-based models. These results collectively demonstrate that YOLO-MPAM achieves significant improvements in both Precision and Recall.
In the P-R curve shown in
Figure 11, the curve of YOLO-MPAM is positioned overall in the upper right, indicating that it maintains high precision across various recall levels. The slight fluctuation observed in the recall range of 0.7–0.8 is related to the distribution characteristics of the Geometridae samples in the dataset, as the morphological variations of this pest category are relatively large, leading to a small number of false detections at higher recall rates. Nonetheless, the performance of YOLO-MPAM in this challenging scenario remains significantly superior to that of the comparison models.
5. Conclusions
To intuitively demonstrate the progressive performance optimization achieved through module integration, a visual comparison of detection results across the different model variants is presented in
Figure 12. The figure clearly shows how the detection capability is significantly enhanced at each stage: the baseline model exhibits substantial misdetections and misclassifications; the intermediate model incorporating PPSAM and PPCAM shows improved recall for small and spatially imbalanced pests; finally, our full YOLO-MPAM model achieves the most comprehensive and accurate detection.
This study effectively addresses the critical challenges of spatial and category distribution imbalances in honeysuckle pest detection by proposing the YOLO-MPAM model, which incorporates a novel multi-dimensional pyramid attention mechanism. Through rigorous experimentation on the dedicated HPID dataset, our approach demonstrates significant performance enhancements over the YOLOv5 baseline. The key findings reveal that the progressive integration of the Pyramid Pooling Spatial Attention Module (PPSAM), Pyramid Pooling Channel Attention Module (PPCAM), and Multi-scale Pyramid Attention Module (MPAM) contributed to a substantial cumulative improvement, elevating the mAP@0.5 from 0.628 to 0.835—a relative gain of 32.7%. This improvement is attributed to the model’s enhanced capability in perceiving small targets with imbalanced distributions and distinguishing between similar pest categories. Statistical validation further confirms the model’s robustness; it achieved a notable reduction in the False Positive Rate (FPR) by 12.8% for challenging categories like Noctuid moths and increased the True Positive Rate (TPR) for Geometridae by 40%. Moreover, despite a 28.2% increase in parameters, the model’s marginal benefit of a 1.16% accuracy gain per 1% parameter increase surpasses that of comparable models like YOLOv5-CBAM (0.89%), highlighting its superior computational efficiency. While the current design may present challenges for real-time applications due to increased complexity, this work provides a solid foundation for accurate pest detection in precision agriculture. Future efforts will focus on model lightweighting to facilitate practical deployment, thereby offering a reliable technological solution to ensure honeysuckle yield and quality while supporting farmers’ income.
While the YOLO-MPAM model achieves state-of-the-art detection accuracy, its practical deployment in precision agriculture requires a balanced consideration of performance and computational cost. The current model, with an inference speed of 82 FPS on a high-end GPU (NVIDIA RTX 3080), is readily deployable on centralized processing servers or high-performance edge computing devices (e.g., NVIDIA Jetson AGX Orin) for real-time monitoring tasks. For large-scale field applications involving resource-constrained edge devices (e.g., NVIDIA Jetson Nano), the model’s complexity may present a challenge for strict real-time processing. However, many agricultural monitoring scenarios, such as automated pest counting in static image traps, do not require high frame rates (1–5 FPS is often sufficient), making the current model highly viable for a wide range of practical use cases where accuracy is paramount.
To further enhance deployment flexibility and broaden the model’s applicability, future work will focus on two main directions. The first is model lightweighting, where promising optimization strategies such as knowledge distillation, structured pruning, and post-training quantization will be employed. These techniques are expected to produce a streamlined variant that retains the core detection capabilities while meeting the efficiency demands of cost-sensitive, large-scale deployment. The second direction involves enhancing the model’s robustness for field deployment. Recognizing that intense sunlight, variable illumination, and complex backgrounds in real-world environments can challenge the model’s performance, we will prioritize the development of advanced data augmentation techniques and explore domain adaptation methods. This will specifically aim to improve the model’s invariance to lighting variations and occlusions, ensuring reliable real-time pest monitoring under actual field conditions. These efforts collectively will facilitate the translation of research outcomes into practical tools, ultimately providing a reliable technological solution to safeguard honeysuckle yield and quality and support sustainable agricultural livelihoods.