1. Introduction
China is a traditional agricultural country, and agriculture, as the fundamental industry in China, occupies a dominant position in the national economy [1]. In 2022, the total grain output reached 1373.06 billion catties, with a year-on-year growth rate of 0.5% [2]. The sown area of grain crops was 1.775 billion mu, a year-on-year increase of 0.6%, and the grain yield per mu was 386.8 kg, a year-on-year increase of 0.5% [3]. The level of agricultural production and grain yield directly affects the daily lives of the population [4]. Pest infestation is a major factor that damages crops and reduces crop productivity on a global scale [5], which in turn affects the stability of regional agricultural economies and food security [6]. Pest and weed infestations affect approximately 20–30% of global agricultural output, causing estimated economic losses of around 70 billion US dollars [7]. In recent years, the Ministry of Agriculture and Rural Affairs has repeatedly deployed major pest control work for autumn grain crops, making every effort to “prevent pests from eating crops” and ensure a good harvest. In response to new situations and characteristics of pests and diseases in autumn grain crops, monitoring and investigation have been strengthened, and the system of reporting and alerting on pest and disease disasters has been strictly implemented; the timely release of early-warning information and guidance helps farmers control pests and diseases promptly. Accurate pest identification not only reduces the blind use of pesticides, lowering production costs and environmental pollution, but also improves control effectiveness and crop quality [8,9,10,11]. To effectively combat and manage pests and diseases in farmland, the collection and analysis of information pertaining to these issues is imperative [12]. Traditional monitoring techniques, such as manual observation and statistical analysis, cannot meet the demands of modern large-scale agricultural production because of the vast diversity of pests and the complexity of available information sources [13]. Consequently, the application of deep learning methods to agricultural pest detection has emerged as a prominent area of research [14].
There are two main types of deep learning-based object detection algorithms [15,16,17]: one-stage detection algorithms [18], which rely on regression, and two-stage detection algorithms [19], which generate candidate regions. Two-stage algorithms first extract potential regions from the input image and then use these regions for target classification and position estimation; notable examples include R-CNN [20], Fast R-CNN [21], Faster R-CNN [22,23], and Mask R-CNN [24]. One-stage algorithms eliminate the candidate-region extraction step and use convolutional neural networks to perform target classification and position estimation directly on the input image. The YOLO (you only look once) series [25,26,27,28] and the SSD (single shot multibox detector) [29] are typical one-stage detectors. Compared with two-stage algorithms, they offer several advantages: they substantially reduce computational cost by removing candidate-region generation before classification and localization, and their high real-time performance makes them suitable for rapid detection of agricultural pests. The detection algorithms mentioned above have achieved impressive results on conventional datasets. However, these deep learning models commonly resize input samples, so when they are applied to pest detection, there is a risk of losing crucial sample features of the pests. Furthermore, their effectiveness depends heavily on the number of samples in the dataset. Consequently, limited sample numbers and datasets containing images that are either too small or too large pose significant challenges for agricultural pest detection models.
Currently, existing pest detection methods have the following limitations. Pest datasets exhibit extremely imbalanced class distributions, with a few classes contributing most of the training samples while others are severely underrepresented. Detection accuracy also declines when multiple pest targets overlap, producing numerous redundant detection boxes, and similar, finely detailed pest species are difficult to distinguish. To address these issues, this paper proposed a novel agricultural pest detection model incorporating multi-scale feature fusion and attention mechanisms. Specifically, the proposed model integrated several custom modules into the YOLOv11s network: the symmetric dual-path convolution (SDConv) module, the lightweight dynamic spatial-aware pyramid fusion (LDSPF) module, the adaptive spatial-channel concat (ASCC) module, and the multiscale dynamic fusion (MDF) module. This combination yielded an agricultural pest object detection algorithm that enhanced detection accuracy while reducing model parameters and computational costs.
3. Materials and Methods
To improve the detection accuracy of agricultural pest detection models and address the issues of high computational complexity and insufficient feature extraction in existing object detection models, this study proposed a novel agricultural pest detection algorithm (MA-YOLO). Based on the YOLOv11s network, MA-YOLO enhanced detection performance in complex agricultural pest scenarios through four lightweight and efficient modules. First, the SDConv module was designed by combining depthwise separable and group convolution to replace the traditional convolution. Second, the LDSPF module employed parallel dilated convolutions to capture multi-scale information, integrated dynamic weighting with a spatial attention mechanism, and introduced dual residual connections to achieve synergistic optimization of multi-scale perception and spatial enhancement. Third, the ASCC module utilized a ternary weighting system (global, channel, spatial) to adaptively concatenate multi-source feature maps, reinforcing semantic responses in target regions and significantly improving feature discriminability and robustness. Fourth, the MDF module extracted multi-scale features using multi-branch depthwise separable convolution, combined dynamic weighting via soft attention with cross-channel interaction through 1 × 1 convolution, balancing lightweight design and feature enhancement. The architecture of the proposed MA-YOLO agricultural pest detection algorithm is illustrated in Figure 1.
3.1. YOLOv11s Model
Considering the real-time requirements and detection accuracy for agricultural pest detection, this study selected the YOLOv11s model [64] as the base model for improvement. YOLOv11, released by Ultralytics in 2024 as part of the YOLO series, underwent refined architectural adjustments compared to YOLOv5 and YOLOv8, adopting a more efficient feature extraction network to enhance small-object detection capability. Additionally, YOLOv11 introduced an adaptive anchor mechanism and advanced data augmentation techniques, further improving model robustness and generalization.
YOLOv11 incorporated two core modules: the C3K2 module and the C2PSA module. The C3K2 module, an improved version of the C3 module, adjusted its behavior through parameter tuning in C3K, enabling flexible feature extraction tailored to different scenarios. The C2PSA module integrated stacked PSA blocks, leveraging attention mechanisms to enhance feature representation while efficiently processing feature information, significantly boosting detection performance. In the detection head, YOLOv11 added two depthwise separable convolutions for classification tasks. Compared to standard convolutions, depthwise separable convolutions reduced computational cost and parameter count in large-scale data processing and real-time detection tasks. Through pointwise convolution, they maintained performance close to standard convolutions with lower computational overhead.
3.2. SDConv Module
During convolution operations on input feature maps, conventional convolutions typically employed fixed-size kernels for feature extraction. However, fixed-size kernels struggled to simultaneously capture both local details and global contextual information. Small kernels (e.g., 3 × 3) effectively extracted local features but were limited by their restricted receptive fields in capturing large-scale global information. Conversely, large kernels (e.g., 7 × 7) expanded the receptive field but performed poorly in extracting fine-grained local details while significantly increasing parameter counts and computational costs. Moreover, as network depth increased, conventional convolutions exhibited a substantial rise in both parameter size and computational complexity. This not only prolonged model training and inference time but also led to excessive GPU memory consumption, restricting practical deployment, particularly on edge devices with limited computational resources. Reducing model parameters and computational overhead thus became a critical challenge for real-world applications.
Depthwise separable convolution reduced the number of parameters and computational cost through depthwise convolution and pointwise convolution, while group convolution divided the convolutional kernels into different groups, with each group performing convolution operations on corresponding input channels before merging the results. This effectively decreased network parameters to improve model inference speed. Based on the combination of depthwise separable convolution and group convolution, this study constructed the SDConv module, whose network architecture is illustrated in Figure 2.
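As background for this design, the parameter counts of the convolution variants can be compared directly. For a k × k kernel with C_in input channels and C_out output channels:

```latex
\text{standard: } k^{2} C_{\text{in}} C_{\text{out}}, \qquad
\text{depthwise separable: } k^{2} C_{\text{in}} + C_{\text{in}} C_{\text{out}}, \qquad
\text{group (}g\text{ groups): } \frac{k^{2} C_{\text{in}} C_{\text{out}}}{g}
```

For example, a 3 × 3 layer mapping 64 channels to 64 channels needs 36,864 weights in standard form but only 4672 (576 depthwise + 4096 pointwise) in depthwise separable form.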
The workflow of the SDConv module was as follows:
Feature Grouping: Given an input feature map X with C channels, this step evenly split it into two sub-feature maps X1 and X2 along the channel dimension, each containing C/2 channels.
Depthwise Separable Convolution: The main branch employed depthwise separable convolution to achieve spatial–channel decoupled feature extraction, reducing computational complexity while preserving spatial perception.
Group Convolution: The auxiliary branch enhanced local feature representation through dynamic group convolution.
Aggregation: After performing depthwise separable convolution and group convolution, the feature maps from both branches were concatenated. To facilitate information fusion between different sub-feature maps, a 1 × 1 convolution was applied to the aggregated feature map to generate the final output.
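A minimal PyTorch sketch of this four-step workflow is given below. The module name, SiLU activations, and the group count of 4 in the auxiliary branch are illustrative assumptions rather than the authors' released implementation.

```python
# Sketch of the SDConv idea: split channels, process one half with a depthwise
# separable convolution and the other with a group convolution, then
# concatenate and mix with a 1x1 convolution.
import torch
import torch.nn as nn

class SDConvSketch(nn.Module):
    def __init__(self, channels, kernel_size=3, groups=4):
        super().__init__()
        half = channels // 2
        # Main branch: depthwise separable convolution (depthwise + pointwise).
        self.dw = nn.Sequential(
            nn.Conv2d(half, half, kernel_size, padding=kernel_size // 2, groups=half, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(),
            nn.Conv2d(half, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(),
        )
        # Auxiliary branch: group convolution for local feature enhancement.
        self.gc = nn.Sequential(
            nn.Conv2d(half, half, kernel_size, padding=kernel_size // 2, groups=groups, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(),
        )
        # 1x1 convolution fuses information across the two sub-feature maps.
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)   # feature grouping along channels
        y = torch.cat((self.dw(x1), self.gc(x2)), dim=1)
        return self.fuse(y)

# Example: a 64-channel feature map keeps its shape while using far fewer
# multiply-accumulates than a standard 3x3 convolution with 64 filters.
out = SDConvSketch(64)(torch.randn(1, 64, 80, 80))
```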
The input module of the YOLOv11s network primarily handled preprocessing tasks such as image scaling rather than feature extraction from images. On the other hand, the output module utilized detection heads to predict objects of varying sizes based on the extracted multi-scale feature maps. In this study, the constructed SDConv module was embedded into the backbone network, the neck network, and both the backbone and neck networks. The corresponding network variants were named YOLOv11s-SDConv-A, YOLOv11s-SDConv-B, and YOLOv11s-SDConv-C, respectively. Their network architectures are illustrated in Figure 3.
3.3. LDSPF Module
Although the SPPF module enhanced the multi-scale object recognition capability of the target detection network through its pyramid pooling structure, the three cascaded 5 × 5 pooling operations significantly increased the module’s parameter count and computational cost. Additionally, the SPPF module employed channel concatenation and convolution operations for multi-scale feature fusion, failing to adaptively account for the varying importance of features across different scenarios. Moreover, its fixed structure and parameters lacked the ability to dynamically adjust feature extraction strategies, leading to insufficient feature representation in complex scenes, particularly at object edges or texture regions, which adversely affected detection accuracy. Finally, the SPPF module did not perform feature channel dimensionality reduction, resulting in redundant high-dimensional features that further increased computational overhead during fusion and hindered the model’s deployment efficiency on edge computing devices.
This study integrated the concepts of residual modules, depthwise separable convolution, and dynamic weighted multi-scale fusion to design the LDSPF module, whose architecture is illustrated in Figure 4. Figure 4a shows the overall structure of the LDSPF module, Figure 4b depicts the spatial-aware attention module, Figure 4c presents the feature fusion module, and Figure 4d illustrates the dynamic weight generation module. The LDSPF module employed depthwise separable convolution for feature extraction from the input feature maps, effectively reducing model parameters and computational costs while preserving the receptive field. Subsequently, the feature maps were fed into three parallel dilated convolutions with dilation rates of 1, 3, and 5, respectively. This design was inspired by the multi-resolution perception mechanism of biological vision systems, enabling the network to capture contextual information at different scales. The dynamic weight generation module extracted channel statistics via global average pooling and generated fusion weights for each branch through two convolutional layers and a sigmoid function. This allowed the network to adaptively adjust the importance weights of features at different scales based on the input. The weighted multi-scale features were fused using a 1 × 1 convolutional layer and combined with the original input features through residual connections. This design, integrating multi-scale feature fusion and residual connections, preserved the initial feature information while capturing multi-scale contextual details.
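A hedged PyTorch sketch of the multi-scale branches and dynamic weight generation path described above is shown below; the stem design, the channel reduction ratio of 4, and the SiLU activations are assumptions, not the published implementation.

```python
# Sketch of the LDSPF core path: parallel dilated convolutions with rates 1/3/5,
# sigmoid-gated branch weights from global average pooling, 1x1 fusion, and a
# residual connection back to the input.
import torch
import torch.nn as nn

class LDSPFCoreSketch(nn.Module):
    def __init__(self, channels, rates=(1, 3, 5)):
        super().__init__()
        # Depthwise separable "stem" keeps the receptive field at low cost.
        self.stem = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU(),
        )
        # Parallel dilated convolutions capture context at different scales.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r, bias=False) for r in rates
        )
        # Dynamic weight generation: GAP -> two 1x1 convs -> sigmoid, one weight per branch.
        self.weights = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.SiLU(),
            nn.Conv2d(channels // 4, len(rates), 1), nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        s = self.stem(x)
        w = self.weights(s)                           # shape (B, num_branches, 1, 1)
        feats = [b(s) * w[:, i:i + 1] for i, b in enumerate(self.branches)]
        return x + self.fuse(sum(feats))              # residual connection to the input
```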
Based on the aforementioned module, a spatial attention enhancement mechanism was incorporated to develop the spatial-aware attention module. The spatial-aware module applied both max pooling and average pooling operations along the channel dimension of the input feature maps to capture high-response features (such as object edges) and global distribution characteristics. The resulting feature maps were concatenated and processed through a 7 × 7 large-kernel convolution to generate a spatial attention map. This attention map was then element-wise multiplied with the input feature maps to enhance the network’s focus on target regions. Finally, a secondary residual connection was established with the original input features to balance feature intensity levels. Through the integration of multi-scale perception, dynamic fusion, and spatial enhancement strategies, the LDSPF module enabled the network to simultaneously capture long-range dependencies and concentrate on local detail features, while maintaining robust discriminative capability even under complex background interference.
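The spatial-aware attention step follows a standard CBAM-style spatial attention, which matches the description above; the sketch below uses that form with the exact details assumed: channel-wise max/avg pooling, a 7 × 7 convolution producing the attention map, multiplication with the features, and a second residual connection.

```python
# Minimal sketch of the spatial-aware attention enhancement.
import torch
import torch.nn as nn

class SpatialAwareAttentionSketch(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.act = nn.Sigmoid()

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)     # global distribution cue
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # high-response cue (e.g., object edges)
        attn = self.act(self.conv(torch.cat((avg_map, max_map), dim=1)))
        return x + x * attn                              # enhance target regions, keep a residual path
```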
3.4. ASCC Module
The concat module fused low-level positional information with high-level semantic features through channel-wise concatenation, preserving the original distribution characteristics of different branches. This operation required no additional learnable parameters, demonstrating significant advantages in lightweight network design. In multi-scale object detection tasks, integrating semantic features from different levels effectively enhanced the model’s adaptability to complex scenes. However, the concat module simply concatenated feature maps from different levels along the channel dimension without dynamic perception of feature importance. Consequently, both noisy and effective features were assigned equal weights, leading to semantic confusion in challenging scenarios such as occlusion and illumination variations. During gradient propagation, the distribution discrepancy among hierarchical features caused imbalanced gradient magnitudes, potentially hindering the consistency of network optimization and affecting model convergence stability.
The BiFPN module improved upon the concat module by introducing learnable global weighting parameters, which dynamically adjusted fusion weights based on the semantic importance of input features while preserving the advantages of multi-scale feature concatenation. This approach mitigated channel redundancy and gradient imbalance caused by direct channel-wise concatenation, with weight normalization strategies enhancing the stability of weight allocation. However, the BiFPN module relied solely on scalar weights for global feature modulation without modeling the channel characteristics and spatial distribution variations across different-scale feature maps. Consequently, when detecting targets with significant channel sensitivity or spatial bias, this coarse-grained weighting mechanism struggled to adequately capture feature responses in critical regions. To address this limitation, this study integrated an adaptive weight allocation mechanism with channel-wise concatenation, enhancing fusion robustness through dynamic feature selection.
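For reference, the fast normalized fusion used by BiFPN combines n input feature maps I_i with learnable non-negative scalar weights w_i (and a small constant ε for numerical stability) as

```latex
O = \sum_{i=1}^{n} \frac{w_i}{\epsilon + \sum_{j=1}^{n} w_j}\, I_i
```

Each input is modulated by a single scalar only, which is precisely the coarse-grained weighting that the ASCC module refines with additional channel and spatial terms.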
The ASCC module achieved cross-modal information fusion through a dual-path feature joint perception mechanism. By constructing a differentiable ternary weighting system, it first retained the global scale weights of the BiFPN module as the basic fusion coefficients, while introducing channel sensitivity weight and spatial attention weight to capture the semantic importance of input feature maps in the channel dimension and the regional response intensity in the spatial dimension, respectively. The channel weight compressed spatial information through global average pooling to obtain channel-wise statistical features, while the spatial weights generated spatial response heatmaps through channel squeezing. Finally, the channel and spatial weights were nonlinearly coupled with trainable base weights, forming a feature enhancement mechanism with dual-domain adaptability in both spatial and channel dimensions. Through parameterization, the ASCC module realized progressive feature fusion with multi-dimensional perception, organically integrating channel sensitivity analysis and spatial attention modeling into the feature concatenation process. This enabled the multi-scale information fusion to focus more precisely on the feature regions of detection targets.
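The following sketch illustrates one way the ternary weighting could be realized in PyTorch; the gating layer shapes and the multiplicative coupling of the three weights are assumptions made for illustration, not the authors' exact design.

```python
# Sketch of ASCC-style fusion: each input map gets a learnable global scalar
# (as in BiFPN), a channel weight from global average pooling, and a spatial
# weight from channel squeezing; the re-weighted maps are concatenated.
import torch
import torch.nn as nn

class ASCCSketch(nn.Module):
    def __init__(self, channels_list):
        super().__init__()
        n = len(channels_list)
        self.global_w = nn.Parameter(torch.ones(n))  # BiFPN-style base fusion coefficients
        self.channel_gates = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
            for c in channels_list
        )
        self.spatial_gates = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, 1, 1), nn.Sigmoid()) for c in channels_list
        )

    def forward(self, feats):
        g = torch.relu(self.global_w)
        g = g / (g.sum() + 1e-4)                       # normalized global scale weights
        out = []
        for i, x in enumerate(feats):
            cw = self.channel_gates[i](x)              # channel sensitivity weight
            sw = self.spatial_gates[i](x)              # spatial response heatmap
            out.append(g[i] * x * cw * sw)             # coupling of the three weights
        return torch.cat(out, dim=1)

# Example: fuse two pyramid levels that were resized to a common resolution.
y = ASCCSketch([128, 256])([torch.randn(1, 128, 40, 40), torch.randn(1, 256, 40, 40)])
```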
3.5. MDF Module
The C3K2 module in the YOLOv11s network was an improved structure based on the Cross Stage Partial (CSP) architecture, which inherited the C2f structure and incorporated an optional C3K module for feature fusion. The C3K2 module utilized either bottleneck or C3K modules to extract high-level features and fused shallow and deep features through concatenation operations. However, the C3K2 module employed convolution kernels of the same size, lacking adaptive processing for multi-scale features, which limited its ability to capture objects of varying sizes. Particularly in complex scenes, the fixed receptive field struggled to adaptively match target features of different scales, leading to the loss of fine-grained details or insufficient contextual associations. Additionally, the C3K2 module relied on simple channel-wise concatenation for feature fusion without a dynamic weight allocation mechanism, making it difficult to emphasize critical information in the feature maps.
Following the design concept of the LDSPF module, this study also integrated depthwise separable convolution and dynamic weighted multi-scale fusion to construct the MDF module, whose network architecture is illustrated in Figure 5. Figure 5a shows the architecture of the multiscale dynamic fusion module, while Figure 5b depicts the soft attention module. The MDF module first extracted multi-scale features using multiple parallel depthwise separable convolutions with different kernel sizes. The soft attention mechanism learned spatial attention through global average pooling and 1 × 1 convolutions, automatically computing the weights of each feature extraction branch and normalizing them via the softmax function to achieve dynamic fusion. Finally, a 1 × 1 convolution was applied to the fused feature maps for information interaction.
The MDF module expanded the receptive field coverage through multi-scale convolutional kernels, enhancing both local detail perception and large-scale contextual modeling. In addition, the soft attention mechanism adaptively optimized multi-scale features in a learnable manner, enabling the network to dynamically adjust weights based on multi-scale feature information. Meanwhile, the depthwise separable convolutions reduced parameter count and computational overhead, achieving the dual benefits of lightweight design and high accuracy through efficient feature reuse.
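A compact PyTorch sketch of this fusion path is given below; the kernel sizes (3, 5, 7) and the channel reduction ratio in the attention branch are illustrative assumptions.

```python
# Sketch of the MDF fusion path: parallel depthwise separable convolutions at
# several kernel sizes, soft attention weights from global average pooling
# normalized with softmax, and a final 1x1 convolution for cross-channel
# interaction on the fused result.
import torch
import torch.nn as nn

def dw_separable(c, k):
    return nn.Sequential(
        nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
        nn.Conv2d(c, c, 1, bias=False),
        nn.BatchNorm2d(c), nn.SiLU(),
    )

class MDFSketch(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(dw_separable(channels, k) for k in kernel_sizes)
        # Soft attention: GAP -> 1x1 convs -> one logit per branch -> softmax.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.SiLU(),
            nn.Conv2d(channels // 4, len(kernel_sizes), 1),
        )
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        w = torch.softmax(self.attn(x), dim=1)                  # (B, branches, 1, 1)
        feats = [b(x) * w[:, i:i + 1] for i, b in enumerate(self.branches)]
        return self.fuse(sum(feats))
```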
In this study, the constructed MDF module was embedded into the backbone network, the neck network, and both the backbone and neck networks. The corresponding networks were named YOLOv11s-MDF-A, YOLOv11s-MDF-B, and YOLOv11s-MDF-C, with their architectures illustrated in Figure 6.
3.6. Experimental Dataset
The datasets used in this study are the IP102 dataset [65] and the Pest24 dataset [66]. Both datasets are specifically designed for pest detection in precision agriculture and can effectively support deep learning-based pest identification and detection research.
The IP102 dataset is a large-scale benchmark dataset for insect pest recognition released in June 2019, specifically designed for pest identification tasks in agricultural production. This dataset contains 75,222 high-quality images covering 102 different pest categories, including important pest species that affect major crops such as rice, maize, wheat, and soybeans. Examples include rice pests such as rice leaf roller, rice leaf caterpillar, white-backed planthopper, and grain spreader thrips, as well as maize pests such as corn borer and armyworm. The dataset exhibits a natural long-tailed distribution, accurately reflecting the realistic occurrence patterns of pests in agricultural production, where some pest species have abundant samples while rare pest species have relatively fewer samples. The dataset contains 18,981 images with precise bounding box annotations supporting object detection tasks. Notably, small inter-class differences (similar features) and large intra-class differences (multiple stages in the lifecycle of pests) exist among pest species, fully reflecting the technical challenges faced in pest identification for practical agricultural applications.
The Pest24 dataset is a large-scale agricultural pest object detection dataset released in August 2020, specifically targeting key monitored pests in Chinese agricultural production. This dataset is automatically collected through specialized pest trapping and imaging devices deployed in fields, ensuring data authenticity and scene diversity. The dataset selects 24 categories of key pests designated for monitoring by China’s Ministry of Agriculture and Rural Affairs as annotation targets, all of which are important species that pose serious threats to China’s major crops. Specific examples include rice pests such as rice planthopper and rice leaf folder, and vegetable pests such as Plutella xylostella and Spodoptera species. The Pest24 dataset contains 25,378 professionally annotated pest images, divided into training and testing sets with a 7:3 ratio. A distinctive feature of this dataset is that pest targets are generally small with high similarity, fully conforming to the technical challenges of pest identification in actual field monitoring. Additionally, non-target pest samples are mixed in the images, realistically simulating complex situations in actual agricultural environments. Since automated pest capture equipment is used for collection, the data collection process is highly standardized, ensuring the authenticity of field environments and reliability of data quality.
The selection of IP102 and Pest24 datasets fully considers the practical needs of agricultural pest monitoring, covering the complete application scenario from pest identification to field detection. The IP102 dataset provides rich pest species information, helping to build comprehensive pest identification models, while the Pest24 dataset focuses on key pest monitoring in agricultural production, providing more targeted training data for practical agricultural applications. The combined use of both datasets ensures both model generalization capability and detection accuracy for key monitored pests, providing reliable technical support for pest control in precision agriculture.
3.7. Experimental Setup
The experimental framework comprised two distinct computational environments: a training server and an embedded testing platform. The training server was equipped with an Intel(R) Xeon(R) Platinum 8358P CPU, the Ubuntu 18.04 operating system, and an NVIDIA RTX 3080Ti graphics card. The deep learning environment used CUDA 11.1 and PyTorch 1.9.0. Model initialization employed pretrained weights derived from the COCO dataset, and training was conducted over 300 epochs with a batch size of 32 and input image dimensions of 640 × 640 pixels. All models employed the official default parameters, with an initial learning rate of 0.01, the SGD (stochastic gradient descent) optimizer, a momentum of 0.937, and a weight decay coefficient of 0.0005. Additionally, a fixed random seed was used to further enhance the reproducibility of experiments and the reliability of results.
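For illustration, the reported settings correspond to a training call of the following form in the Ultralytics framework; the dataset YAML path and seed value are placeholders, and the exact invocation used in this study is not reproduced here.

```python
# Illustrative training call matching the reported hyperparameters.
from ultralytics import YOLO

model = YOLO("yolo11s.pt")           # COCO-pretrained weights for initialization
model.train(
    data="pest_dataset.yaml",        # hypothetical dataset config (IP102 or Pest24 in YOLO format)
    epochs=300,
    batch=32,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    seed=0,                          # fixed seed for reproducibility
)
```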
For embedded system evaluation, a Jetson Xavier AGX development platform was employed (NVIDIA Corporation, Santa Clara, CA, USA), configured with Ubuntu 16.04, Jetpack 4.5.1, CUDA 10.2, and cuDNN 8.0. The embedded testing environment utilized PyTorch 1.10.0 and Torchvision 0.11.1 to ensure compatibility and optimal performance on the edge computing platform. This dual-platform approach enabled comprehensive evaluation of both training efficiency and deployment feasibility across different computational architectures.
5. Conclusions
Currently, the existing pest detection methods suffer from imbalanced class distribution in datasets, small inter-class variations, and large intra-class variations, resulting in poor detection performance of current object detection algorithms on pest datasets and insufficient generalization in real-world scenarios. To address the issues of high computational complexity and inadequate feature representation in traditional convolutional networks, a novel agricultural pest detection model (MA-YOLO) based on multi-scale fusion and attention mechanisms was proposed in this study. The SDConv module reduces computational costs through depthwise separable convolution and dynamic group convolution while enhancing local feature extraction. The LDSPF module captures multi-scale information via parallel dilated convolutions with spatial attention mechanisms and dual residual connections. The ASCC module improves feature discriminability by establishing an adaptive triple-weight system for global, channel, and spatial semantic responses. The MDF module balances efficiency and multi-scale feature extraction using multi-branch depthwise separable convolution and soft attention-based dynamic weighting. Experimental results demonstrated that the proposed modules, through dynamic fusion and lightweight design, significantly enhance multi-scale modeling and spatial optimization capabilities while reducing computational costs, achieving a synergistic improvement in high-precision detection and efficient computation in complex scenarios. The superior performance on public datasets further validated the effectiveness of the proposed method.
Although the MA-YOLO model constructed in this study demonstrates exceptional performance in pest detection tasks, its performance depends on the training dataset: it can effectively identify known pest species but cannot recognize new or unknown species that are not represented in the training data. Future research should focus on two directions. First, unknown-species detection should be addressed through continual learning or open-set recognition methods. Second, extensive field trials should be conducted to evaluate the model’s robustness under different agricultural conditions, including varying lighting, weather, and crop growth stages. Furthermore, the pest detection model should be integrated with automated intervention mechanisms, such as real-time warning systems and precision spraying systems, to ultimately establish a comprehensive smart agriculture ecosystem that not only accurately identifies pest threats but also provides automated responses tailored to specific pest categories, effectively reducing crop damage.