1. Introduction
Agricultural production faces numerous challenges, among which pest infestation poses a serious threat to food security. Crop losses caused by pest damage reduce farmers’ incomes and strain regional food supplies. Timely and accurate monitoring of agricultural pests is a critical step in Integrated Pest Management (IPM) strategies [1]. Efficient pest-monitoring methods have therefore become an important research direction in modern agriculture.
Traditional monitoring methods mainly rely on professional technicians conducting field inspections or observing crop damage. These approaches work to some extent, but they are labor-intensive and struggle to detect initial infestations in time, which delays pest control. To address these limitations, automated pest monitoring has emerged as a solution, realized by deploying pest detection devices in agricultural fields. These devices automatically capture pests and regularly monitor their species and quantities through various trapping methods [2,3,4], including attractant lures, light traps, and sticky boards. The captured pests are photographed at regular intervals, and automated image analysis is used to identify pest species. The identification can be implemented in two ways: images can be transmitted over the network to centralized servers for processing, or detection can be performed locally on the device with results uploaded to cloud servers. Local detection offers advantages such as reduced server computational load and lower network transmission requirements.
The core challenge of automated pest monitoring lies in accurate image analysis for pest identification. Earlier work applied classical machine learning methods to pest classification. For example, Ebrahimi et al. [5] used region index and color index features with an SVM classifier to detect thrips in crop canopy images. Such traditional approaches, especially those relying on manual feature engineering, struggled with the diversity of pest appearances, which made generalization difficult. Modern CNN-based object detection models are more suitable for pest monitoring, as they automatically extract features without manual feature engineering and provide both the location and the category of pests in images.
Early applications of CNN in pest detection mainly adopted two-stage detection architectures. These techniques include R-CNN [
6] and its subsequent developments such as Faster R-CNN [
7], Cascade R-CNN [
8], and Mask R-CNN [
9]. They work by generating region proposals and then classifying the proposed regions. Jiao et al. [
10] proposed AF-RCNN for multi-category pest detection. This model integrates Fast R-CNN with an Anchor-Free Region Proposal Network to eliminate anchor dependencies. Wang et al. [
11] addressed the foreground–background imbalance challenges in small crop pest detection by proposing S-RPN (Sampling-balanced Region Proposal Network). This approach utilizes sparse sampling strategies and centerness-based sample selection mechanisms to improve detection performance. Teng et al. [
12] proposed MSR-RCNN by incorporating a multi-scale super-resolution feature enhancement module. This model achieved strong performance on the LLPD-26 dataset. Gao et al. [
4] proposed an enhanced Cascade R-CNN model to detect small pests on yellow sticky board images. Although two-stage detection approaches generally provide high accuracy, their model complexity is high, which makes them poorly suited for deployment on resource-constrained devices such as field monitoring equipment.
Single-stage detection frameworks, such as the Single Shot Multibox Detector (SSD) [
13] and the YOLO series [
14,
15,
16,
17,
18,
19,
20,
21], are widely used due to their computational efficiency and accuracy. These characteristics have made them popular choices for pest detection applications. However, existing single-stage pest detection methods face two primary limitations: limited pest coverage and high model complexity.
Many existing single-stage detection studies suffer from limited pest coverage. Li et al. [
22] proposed YOLO-TP as a lightweight model for Lasioderma serricorne counting, employing the GIoU loss function, GSConv, and the PC2f structure, but it focuses on only a single species. Li et al. [
23] proposed YOLO-JD for jute disease and pest detection, integrating SCFEM, DSCFEM, and SPPM for effective feature extraction. Although this model covers eight types of jute diseases, it only includes two pest species. Addressing grain pest detection challenges, Lyu et al. [
24] proposed a feature fusion SSD algorithm. They implemented a top–down strategy to combine multi-layer features and removed the network block that was unfavorable to small-object detection, but the method targets only five types of grain pests. Wang et al. [
25] proposed Insect-YOLO, augmented with CBAM for crop insect detection, but it only covers seven agricultural pest species. Wang and Wang [
26] proposed GAS-YOLOv8 as a lightweight YOLOv8 variant for mango pest and disease detection, incorporating GhostHGNetv2 backbone, AsDDet detection head, and C2f-SE module, targeting 10 mango-related pests. Zhao et al. [
27] proposed AC-YOLO as a multi-class detection model in stored grain pest applications, integrating ECIoU loss function, CBAM and ACmix, covering 12 types of grain pests.
Many studies focus primarily on improving accuracy without fully considering the lightweight requirements for practical deployment, making even single-stage models quite complex. Zhang et al. [
28] proposed AgriPest-YOLO. This model employed the coordination and local attention mechanism and the grouping spatial pyramid pooling fast module. Although this model achieved 71.3% mAP50 on the Pest24 dataset, it has 16.2 GFLOPs, making it still unsuitable for resource-constrained deployment environments. Tian et al. [
29] proposed MD-YOLO. This model deployed the DenseNet block and the adaptive attention module to enhance feature utilization. Although it achieved 86.2% mAP50, it has 126.8 M parameters and targets only three species of lepidopteran pests. Dai et al. [
30] proposed an enhanced YOLOv5m-based pest detection method. This method integrated the C3TR and SWinTR modules to extract global features. Additionally, it incorporated ResSPP and WConcat components for improved feature extraction and fusion. Although it achieved 96.4% mAP50, the dataset used was not focused on small targets, potentially reducing detection difficulty, and the model size is 38.1 MB.
This study proposes a lightweight YOLOv8 model for agricultural pest detection. Our research uses the Pest24 dataset for evaluation, which covers 24 categories of pests designated for detection by the Ministry of Agriculture of China, making our model applicable to a broad range of pest detection applications. Furthermore, our research seeks a balance between accuracy and efficiency, keeping the model size suitable for deployment on resource-constrained devices, such as field monitoring equipment. The primary technical contributions include the following:
(1) We propose a Lightweight Complementary Residual (LCR) module. This module has two branches that respectively focus on extracting different types of features, forming feature complementarity. Feature extraction is implemented through depthwise convolution, thus maintaining lightweight design. Furthermore, we integrate the LCR into the C2f module to form C2f-LCR.
(2) We propose Efficient Partial Convolution (EPConv), a downsampling operator based on PConv [31], which adopts an asymmetric channel splitting strategy to utilize features efficiently. Additionally, EPConv incorporates the shortcut of ResNet-D [32] and the SE module [33].
(3) We introduce the Ghost module [
34] to the detection head, which generates more feature maps through cheap operations while reducing computational cost.
(4) We adopt WIoUv3 [
35] as the localization loss function, which improves detection performance through the dynamic non-monotonic gradient allocation mechanism.
These integrated modifications aim to produce an efficient and lightweight pest detection model suitable for agricultural applications.
2. Materials and Methods
2.1. Dataset
This study uses the Pest24 dataset [
36], which is a benchmark dataset for agricultural pest detection applications. The dataset encompasses 25,378 images across 24 distinct crop pest species. These 24 pest species are selected from the 38 field crop pest categories designated for detection by the Ministry of Agriculture of China. The remaining 14 categories are excluded from this dataset due to insufficient instances for reliable training and evaluation.
Table 1 presents details regarding the 24 pest categories within the Pest24 dataset, encompassing identifiers and category names.
Figure 1 illustrates the instance count distribution across the 24 pest categories. The dataset exhibits significant class imbalance, with Anomala corpulenta (ID: 10) having the highest number of instances at 53,347, while Holotrichia oblita (ID: 18) has the fewest instances with only 108 samples.
The Pest24 dataset exhibits the following characteristics:
(1) The dataset is characterized by extremely small target scales, with pest relative scales primarily distributed between 0 and 0.01 [
36], much smaller than those in typical object detection datasets. As shown in
Figure 2a,b, individual pests occupy relatively few pixels in the images, making feature extraction particularly challenging.
(2) The dataset contains some imprecise annotations. As illustrated in
Figure 2a, several bounding boxes (marked with red circles) are relatively loose and do not tightly fit the pest boundaries. Since evaluating bounding box annotation accuracy involves subjective factors, we present here only a few examples that we consider to have inaccurate annotations.
(3) The dataset employs an incomplete annotation strategy, labeling only the 24 target pest categories while leaving other pest species in the images unlabeled. These unlabeled pests are referred to as non-target pests, creating additional detection challenges.
(4) Many different pest classes exhibit high visual similarity, which manifests in two forms. Similarity among target pests is demonstrated in
Figure 2b, where the blue circles highlight two visually similar but different pest species—Bollworm (ID: 0) and Armyworm (ID: 8). Additionally, similarity between non-target and target pests is shown in
Figure 2a, where Agriotes fuscicollis Miwa (ID: 4) appears similar to many unlabeled black insects in the background.
(5) The dataset contains numerous instances of pest adhesion and overlap, as evidenced in
Figure 2b, where multiple pests are clustered together, making individual detection difficult.
(6) The dataset includes various environmental disturbances, such as illumination reflections. As shown in
Figure 2b, green circles mark areas where pests are partially obscured by lighting reflections.
2.2. YOLO-LCE Network Architecture
The standard YOLOv8 model can process RGB images and is composed of three main parts: the backbone, neck, and head. The backbone is primarily responsible for feature extraction. It adopts CBS modules for downsampling and employs multiple C2f modules to capture rich features. At the end of the backbone, a spatial pyramid pooling fast (SPPF) module is used to expand the receptive field and fuse features from different receptive fields. The neck adopts the Path Aggregation Network (PANet) structure. This design fuses low-level spatial details with high-level semantic features from the backbone, enhancing the model’s ability to detect objects at various scales. Finally, the detection head is decoupled, which allows two separate branches to focus on the classification and regression tasks respectively, improving detection accuracy. YOLOv8 includes multiple variants (n, s, m, l) with varying complexity levels.
YOLO-LCE is designed based on the YOLOv8n architecture for multi-class agricultural pest detection, capable of simultaneously localizing pest objects and assigning category labels. As shown in
Figure 3, the model introduces targeted improvements while maintaining the core architecture of YOLOv8n.
To enhance feature representation capability through complementary features and reduce model complexity, YOLO-LCE employs the C2f-LCR modules to replace some of the original C2f modules in the network. Specifically, the last C2f module in the backbone network is replaced with C2f-LCR for high-level semantic feature extraction; meanwhile, all C2f modules preceding the three detection heads in the neck network are also replaced with C2f-LCR.
Additionally, the last convolutional layer in the backbone network and all convolutional layers in the neck network employ EPConv to achieve efficient downsampling with low computational complexity.
The detection head incorporates the Ghost module [
34], which generates more feature maps through cheap operations.
In terms of loss function design, YOLO-LCE adopts WIoUv3 [
35] to improve bounding box regression for pest targets.
Through these components, YOLO-LCE constructs a lightweight and efficient pest detection network that reduces computational resource requirements.
2.3. Lightweight Complementary Residual Module
In pest detection tasks, unlabeled non-target pests in the dataset have appearances similar to target pests, interfering with target pest detection. The C2f module in YOLOv8 employs the Bottleneck module for feature extraction. However, this module incurs high computational overhead, and its single feature extraction path limits feature diversity.
To address these issues, this study proposes a Lightweight Complementary Residual (LCR) module, which designs a complementary dual-branch structure. This structure has two branches that focus on extracting different types of complementary features respectively. The first branch focuses on extracting stable features of pests. The second branch focuses on extracting discriminative features of pests. This design enhances the discriminative capability between target and non-target pests through complementary feature extraction while reducing model complexity.
As shown in Figure 4, given an input feature map $X$ with dimensions $C \times H \times W$, the LCR module first splits it equally along the channel dimension into two branches:
$$X_1, X_2 = \mathrm{Split}(X)$$
where each of $X_1$ and $X_2$ has dimensions $\frac{C}{2} \times H \times W$.
The two branches employ different pooling strategies for feature abstraction. Pooling is beneficial for detecting dense and even occluded pests, as it expands the receptive field through aggregation, providing more context for the next layer. The first branch uses average pooling:
$$P_1 = \mathrm{AvgPool}(X_1)$$
The second branch uses max pooling:
$$P_2 = \mathrm{MaxPool}(X_2)$$
Both pooling operations use a kernel size of $3 \times 3$, stride of 1, and padding of 1. This ensures that the output feature maps maintain the same spatial dimensions as the input.
This dual-branch pooling design provides complementary feature foundations. Average pooling preserves smooth local responses. Max pooling preserves maximum local responses, highlighting potential discriminative information of similar pests.
Both branches then employ depthwise convolution for lightweight feature learning. Depthwise convolution conducts feature extraction independently on each channel. Compared to standard convolution, this approach reduces parameter complexity.
The depthwise convolution processing in the LCR module is as follows:
$$F_1 = \mathrm{DWConv}(P_1), \quad F_2 = \mathrm{DWConv}(P_2)$$
where $\mathrm{DWConv}(\cdot)$ represents the depthwise convolution operation.
Through these depthwise convolutions, the first branch extracts stable features based on smooth responses. The second branch extracts discriminative features based on maximum responses. This helps the network better distinguish target pests from non-target pests.
These two complementary features are then concatenated along the channel dimension:
$$F = \mathrm{Concat}(F_1, F_2)$$
To enhance feature learning capability and alleviate gradient vanishing problems, LCR adopts residual connections:
$$F_{res} = F + X$$
Since depthwise convolution lacks channel interaction capability, a final $1 \times 1$ convolution is applied for channel mixing [37]:
$$Y = \mathrm{Conv}_{1 \times 1}(F_{res})$$
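To make the structure concrete, a minimal PyTorch sketch of the LCR module is given below. Only the split–pool–depthwise–concatenate–residual–pointwise flow follows the description above; the class name, the 3 × 3 depthwise kernel, and the omission of normalization and activation layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LCR(nn.Module):
    """Lightweight Complementary Residual module (illustrative sketch).

    Splits the input channels into two halves, abstracts them with average
    and max pooling respectively, applies depthwise convolution to each
    branch, concatenates the complementary features, adds a residual
    connection, and mixes channels with a 1x1 convolution.
    """

    def __init__(self, channels: int, dw_kernel: int = 3):
        super().__init__()
        half = channels // 2
        # 3x3 pooling with stride 1 and padding 1 keeps spatial dimensions.
        self.avg_pool = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        self.max_pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        # Depthwise convolutions (groups == channels) for lightweight feature learning.
        self.dw1 = nn.Conv2d(half, half, dw_kernel, padding=dw_kernel // 2, groups=half, bias=False)
        self.dw2 = nn.Conv2d(half, half, dw_kernel, padding=dw_kernel // 2, groups=half, bias=False)
        # 1x1 convolution for channel mixing after the residual addition.
        self.pw = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)     # equal channel split
        f1 = self.dw1(self.avg_pool(x1))      # stable features from smooth responses
        f2 = self.dw2(self.max_pool(x2))      # discriminative features from max responses
        f = torch.cat((f1, f2), dim=1)        # complementary feature fusion
        return self.pw(f + x)                 # residual connection + channel mixing


if __name__ == "__main__":
    y = LCR(64)(torch.randn(1, 64, 40, 40))
    print(y.shape)  # torch.Size([1, 64, 40, 40])
```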
To leverage the advantages of the LCR module, this study integrates it into the YOLOv8 C2f structure to form the C2f-LCR module. As shown in
Figure 5, C2f-LCR replaces the original Bottleneck structure in C2f with the LCR module.
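A corresponding sketch of C2f-LCR is shown below, following the standard C2f layout of YOLOv8 with its Bottleneck units replaced by the LCR module from the previous listing. The Conv helper and the hidden-channel ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn
# Assumes the LCR class from the previous listing is in scope.

class Conv(nn.Module):
    """Conv2d + BatchNorm + SiLU, mirroring the CBS block used in YOLOv8."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class C2f_LCR(nn.Module):
    """C2f block with its Bottleneck units replaced by LCR modules (sketch)."""
    def __init__(self, c_in, c_out, n=1, e=0.5):
        super().__init__()
        self.c = int(c_out * e)                       # hidden channels
        self.cv1 = Conv(c_in, 2 * self.c, 1)
        self.cv2 = Conv((2 + n) * self.c, c_out, 1)   # fuse all intermediate features
        self.m = nn.ModuleList(LCR(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        y.extend(m(y[-1]) for m in self.m)
        return self.cv2(torch.cat(y, dim=1))
```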
2.4. Efficient Partial Convolution
FasterNet [
31] proposed Partial Convolution (PConv) to meet lightweight requirements. In PConv, only a small portion of the feature map is processed by convolution, while most of the feature map is passed directly to the output without processing, reducing computational complexity. However, the large number of unprocessed features leads to insufficient feature utilization, limiting the feature representation capability of the network.
Addressing the above problems, this study proposes Efficient Partial Convolution (EPConv) as a downsampling operator based on PConv. As shown in
Figure 6, EPConv implements an asymmetric channel splitting strategy. In this strategy, $\frac{1}{8}$ of the input channels are processed through the main convolution branch1 for extracting primary pest features, while the remaining $\frac{7}{8}$ are processed through the parameter-efficient group convolution branch2 for extracting auxiliary features. This design ensures complete feature utilization while reducing computational cost and model parameters.
Given an input feature map $X$ with dimensions $C \times H \times W$, EPConv splits it along the channel dimension into two parts:
$$X_1, X_2 = \mathrm{Split}(X)$$
where $X_1$ has dimensions $\frac{C}{8} \times H \times W$, representing features processed by the branch1 main convolution, and $X_2$ has dimensions $\frac{7C}{8} \times H \times W$, representing features processed by the branch2 group convolution.
When used for downsampling with stride $s$ (typically 2), the processing of both branches is as follows:
$$F_1 = \mathrm{BN}(\mathrm{Conv}(X_1)), \quad F_2 = \mathrm{BN}(\mathrm{GConv}(X_2))$$
where $\mathrm{Conv}(\cdot)$ represents a standard convolution for extracting main features and $\mathrm{GConv}(\cdot)$ applies group convolution for extracting auxiliary features. Both operations use stride $s$ for spatial dimension reduction and are followed by batch normalization (BN) [38]. The application of group convolution reduces computational complexity, achieving the lightweight design goal.
The outputs are then concatenated:
$$F = \mathrm{Concat}(F_1, F_2)$$
To match the target output dimensions, a $1 \times 1$ convolution followed by BN is applied for channel projection:
$$F_{proj} = \mathrm{BN}(\mathrm{Conv}_{1 \times 1}(F))$$
To enable residual learning during downsampling, EPConv adopts the shortcut in ResNet-D [32]:
$$X_{sc} = \mathrm{BN}(\mathrm{Conv}_{1 \times 1}(\mathrm{AvgPool}(X)))$$
where $\mathrm{AvgPool}(\cdot)$ represents the average pooling operation with stride $s$; the subsequent $1 \times 1$ convolution adjusts the channel dimensions and is followed by BN. This shortcut conducts spatial dimension reduction through average pooling, which preserves more information during spatial downsampling and is beneficial for small-target detection.
The features are then combined through this residual connection:
$$F_{res} = F_{proj} + X_{sc}$$
To focus on important feature channels, EPConv introduces the SE (Squeeze-and-Excitation) module [33]. The SE module processes the features as follows:
$$w = \sigma\left(W_2\,\delta\left(W_1\,\mathrm{GAP}(F_{res})\right)\right)$$
where $\mathrm{GAP}(\cdot)$ denotes global average pooling, $\delta$ is the ReLU activation, $\sigma$ is the Sigmoid function, and $W_1$ and $W_2$ are the weights of the two fully connected layers. The final output of the EPConv is
$$Y = w \odot F_{res}$$
where $\odot$ denotes channel-wise multiplication.
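The following PyTorch sketch summarizes the EPConv computation described above under several assumptions: the 3 × 3 kernels of the two branches, the group count of the group convolution, and the SE reduction ratio are not specified in the text and are chosen here only for illustration.

```python
import torch
import torch.nn as nn

class EPConv(nn.Module):
    """Efficient Partial Convolution downsampling operator (illustrative sketch).

    Splits input channels 1:7 between a standard convolution branch and a
    group convolution branch, concatenates and projects the results, adds a
    ResNet-D style average-pooling shortcut, and reweights channels with SE.
    """

    def __init__(self, c_in, c_out, stride=2, groups=4, se_ratio=16):
        super().__init__()
        self.c1 = c_in // 8                 # 1/8 of channels -> main branch
        self.c2 = c_in - self.c1            # remaining 7/8 -> group conv branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(self.c1, self.c1, 3, stride, 1, bias=False),
            nn.BatchNorm2d(self.c1))
        self.branch2 = nn.Sequential(
            nn.Conv2d(self.c2, self.c2, 3, stride, 1, groups=groups, bias=False),
            nn.BatchNorm2d(self.c2))
        self.proj = nn.Sequential(          # 1x1 projection to the target width
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out))
        self.shortcut = nn.Sequential(      # ResNet-D shortcut: AvgPool + 1x1 conv
            nn.AvgPool2d(kernel_size=stride, stride=stride),
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out))
        self.se = nn.Sequential(            # Squeeze-and-Excitation channel attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_out, c_out // se_ratio, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_out // se_ratio, c_out, 1),
            nn.Sigmoid())

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c1, self.c2], dim=1)
        f = torch.cat((self.branch1(x1), self.branch2(x2)), dim=1)
        f = self.proj(f) + self.shortcut(x)
        return f * self.se(f)
```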
2.5. Ghost Module
The detection head usually contains multiple standard convolutional layers. Because it processes high-dimensional features, it typically accounts for a concentrated share of the model’s computational cost.
To reduce computational overhead, this study introduces the Ghost module [
34] in YOLOv8’s detection head. As shown in
Figure 7, the standard convolutions are replaced with Ghost modules.
The idea of the Ghost module is to generate ghost features through cheap operations, reducing computational cost.
According to Figure 8, for an input feature map $X$ with dimensions $C_{in} \times H \times W$, where $C_{in}$ denotes the input channel number, the Ghost module processing procedure comprises the following stages:
First, half of the output features are generated through a standard convolution:
$$Y_1 = \mathrm{Conv}_{k \times k}(X)$$
where $Y_1$ has dimensions $\frac{C_{out}}{2} \times H' \times W'$, $C_{out}$ is the output channel number, and $H' \times W'$ indicates the output spatial size.
Subsequently, ghost features are generated via depthwise convolution as the cheap operation:
$$Y_2 = \mathrm{DWConv}_{d \times d}(Y_1)$$
where $Y_2$ has dimensions $\frac{C_{out}}{2} \times H' \times W'$.
Finally, these two feature parts are concatenated along the channel dimension to form the complete output:
$$Y = \mathrm{Concat}(Y_1, Y_2)$$
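A minimal sketch of such a Ghost module is given below. The primary 1 × 1 convolution, the 5 × 5 depthwise kernel of the cheap operation, and the use of BN and SiLU follow common GhostConv implementations and are assumptions rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Ghost module (sketch): half the outputs come from a standard convolution,
    the other half are generated from them by a cheap depthwise convolution."""

    def __init__(self, c_in, c_out, k=1, dw_k=5, stride=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(         # depthwise conv = "cheap operation"
            nn.Conv2d(c_half, c_half, dw_k, 1, dw_k // 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.primary(x)                # intrinsic features
        y2 = self.cheap(y1)                 # ghost features
        return torch.cat((y1, y2), dim=1)
```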
To quantify the computational efficiency advantages of the Ghost module, this study analyzes the theoretical computational complexity, comparing standard convolution with the Ghost module. Note that the following analysis mainly considers the multiplication operations in convolution and ignores the computational overhead of the bias terms. The standard $k \times k$ convolution operation can be formulated as
$$Y = W * X + b$$
where $Y$ has dimensions $C_{out} \times H' \times W'$, representing the output feature map, $W$ has dimensions $C_{out} \times C_{in} \times k \times k$ as the convolution kernel parameters, and $b$ is the bias term. The computational complexity of standard convolution is
$$\Omega_{std} = C_{out} \cdot C_{in} \cdot k^2 \cdot H' \cdot W'$$
The computational cost of the Ghost module consists of two parts. The first is the computational cost of the main path $k \times k$ convolution:
$$\Omega_{main} = \frac{C_{out}}{2} \cdot C_{in} \cdot k^2 \cdot H' \cdot W'$$
The second is the computational cost of the ghost path $d \times d$ depthwise convolution:
$$\Omega_{cheap} = \frac{C_{out}}{2} \cdot d^2 \cdot H' \cdot W'$$
The total computational cost of the Ghost module is
$$\Omega_{Ghost} = \Omega_{main} + \Omega_{cheap} = \frac{C_{out}}{2} \cdot H' \cdot W' \cdot \left(C_{in} \cdot k^2 + d^2\right)$$
The computational efficiency ratio $r$ can be expressed as
$$r = \frac{\Omega_{std}}{\Omega_{Ghost}} = \frac{C_{out} \cdot C_{in} \cdot k^2 \cdot H' \cdot W'}{\frac{C_{out}}{2} \cdot H' \cdot W' \cdot \left(C_{in} \cdot k^2 + d^2\right)} = \frac{2\,C_{in}\,k^2}{C_{in}\,k^2 + d^2}$$
In the deep layers where the detection heads are located, the input channel number $C_{in}$ is typically large. For theoretical analysis, when $C_{in}$ approaches infinity, the efficiency ratio approaches the limit value:
$$\lim_{C_{in} \to \infty} r = \lim_{C_{in} \to \infty} \frac{2\,C_{in}\,k^2}{C_{in}\,k^2 + d^2} = 2$$
These theoretical analyses show that as the number of channels increases, the computational efficiency advantage of the Ghost module becomes greater.
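As a quick numerical illustration of this limit, the snippet below evaluates the efficiency ratio for several input channel counts, assuming (for concreteness only) 3 × 3 kernels for both the replaced convolution and the cheap operation.

```python
def ghost_speedup(c_in: int, k: int = 3, d: int = 3) -> float:
    """Ratio of standard-conv multiplications to Ghost-module multiplications
    (per output position), following the analysis above."""
    standard = c_in * k * k                 # the C_out * H' * W' factor cancels in the ratio
    ghost = 0.5 * (c_in * k * k + d * d)
    return standard / ghost

for c in (16, 64, 256, 1024):
    print(c, round(ghost_speedup(c), 2))
# 16: 1.88, 64: 1.97, 256: 1.99, 1024: 2.0 -> approaches the limit of 2
```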
2.6. WIoUv3
The bounding box loss function plays an important role in training object detection models. YOLOv8 adopts CIoU [39] as the bounding box regression loss. The CIoU loss considers the overlap area, center-point distance, and aspect-ratio consistency, but it still has limitations. The pest dataset used in this study inevitably contains some low-quality samples, yet CIoU applies the same calculation to anchor boxes of all qualities and therefore cannot dynamically adjust bounding box regression.
According to Figure 9, the CIoU loss is formulated as
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b_{gt}\right)}{c^2} + \alpha v$$
where $IoU$ denotes the overlap ratio between the predicted box and the ground truth box, $\rho^2(b, b_{gt})$ represents the squared distance between the box centers, $c$ is the diagonal length of the smallest enclosing rectangle, $v$ quantifies the consistency of aspect ratios, and $\alpha$ is a trade-off coefficient.
To reduce the impact of low-quality samples on training, this study introduces the WIoUv3 [35] loss function. Since WIoUv3 is improved based on WIoUv1, the details of WIoUv1 are first presented:
$$\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU}\,\mathcal{L}_{IoU}, \quad \mathcal{R}_{WIoU} = \exp\left(\frac{\left(x - x_{gt}\right)^2 + \left(y - y_{gt}\right)^2}{\left(W_g^2 + H_g^2\right)^{*}}\right)$$
where $\mathcal{L}_{IoU} = 1 - IoU$, $(x, y)$ and $(x_{gt}, y_{gt})$ are the center coordinates of the predicted and ground truth boxes, and $W_g$ and $H_g$ represent the width and height of the smallest enclosing box. The superscript $*$ indicates that this term does not participate in gradient calculation during backpropagation.
WIoUv3 is built on WIoUv1 and adds a dynamic non-monotonic gradient allocation mechanism:
$$\mathcal{L}_{WIoUv3} = r\,\mathcal{L}_{WIoUv1}, \quad r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}}, \quad \beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}}$$
where $\beta$ represents the sample anomaly degree, $\overline{\mathcal{L}_{IoU}}$ is the running mean of the IoU loss, and $\alpha$ and $\delta$ are hyperparameters.
This non-monotonic design allows WIoUv3 to allocate small gradient gain to high-quality anchor boxes with low anomaly degrees. It also allocates small gradient gain to low-quality anchor boxes with high anomaly degrees. This makes the model focus on optimizing ordinary-quality anchor boxes instead of extreme cases, weakening the harmful gradients generated by low-quality samples. Therefore, the model can achieve better performance.
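A minimal PyTorch sketch of this loss, following the formulas above, is shown below. The box format (x1, y1, x2, y2), the exponential moving average used to track the mean IoU loss, and the momentum value are implementation assumptions.

```python
import torch

class WIoUv3Loss:
    """Minimal sketch of the WIoUv3 loss following the formulation above."""

    def __init__(self, alpha=1.7, delta=2.7, momentum=0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.mean_iou_loss = 1.0            # running mean of L_IoU

    def __call__(self, pred, target):
        # IoU loss from intersection and union areas.
        lt = torch.max(pred[:, :2], target[:, :2])
        rb = torch.min(pred[:, 2:], target[:, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        iou = inter / (area_p + area_t - inter + 1e-7)
        l_iou = 1.0 - iou

        # R_WIoU: normalized squared center distance; the denominator is detached
        # so it receives no gradient (the superscript * in the text).
        cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
        cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
        enclose_lt = torch.min(pred[:, :2], target[:, :2])
        enclose_rb = torch.max(pred[:, 2:], target[:, 2:])
        wg, hg = (enclose_rb - enclose_lt).unbind(dim=1)
        r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2)
                           / (wg ** 2 + hg ** 2 + 1e-7).detach())
        l_wiou_v1 = r_wiou * l_iou

        # Dynamic non-monotonic focusing: beta is the anomaly degree.
        beta = l_iou.detach() / self.mean_iou_loss
        gain = beta / (self.delta * self.alpha ** (beta - self.delta))
        self.mean_iou_loss = (1 - self.momentum) * self.mean_iou_loss \
                             + self.momentum * l_iou.detach().mean().item()
        return (gain * l_wiou_v1).mean()
```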
3. Results
3.1. Experimental Setup
For this study, the Pest24 dataset was randomly partitioned into training, validation, and testing subsets using a 6:2:2 distribution ratio based on image count. All experimental work was conducted on a computational server. The server is equipped with Intel Xeon Gold 6430 CPU and NVIDIA GeForce RTX 4090 GPU, operating under Ubuntu 22.04.3 with CUDA version 12.1.
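A simple sketch of such a 6:2:2 random split by image count is shown below; the directory layout, file extension, and fixed random seed are illustrative assumptions rather than the exact procedure used in this study.

```python
import random
from pathlib import Path

random.seed(0)
images = sorted(Path("Pest24/images").glob("*.jpg"))
random.shuffle(images)
n = len(images)
n_train, n_val = int(0.6 * n), int(0.2 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
for name, files in splits.items():
    # Write one image path per line for each subset.
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```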
The Faster R-CNN model is implemented with MMDetection. Model training employed the hyperparameter settings detailed in
Table 2. The “Other YOLO Series” column represents the settings for YOLOv4-tiny, YOLOv5n, YOLOv7-tiny, YOLOv8n, YOLOv10n, YOLOv11n, and YOLO-LCE.
For data augmentation, all models are trained using their respective default settings. The proposed YOLO-LCE follows the default data augmentation strategy [
40] of YOLOv8, with the specific enabled techniques detailed in
Table 3. Additionally, the WIoUv3 loss function employed in YOLO-LCE uses the following hyperparameters: $\alpha$ = 1.7 and $\delta$ = 2.7.
3.2. Evaluation Metrics
This study conducts a quantitative assessment of the enhanced model across two aspects: detection performance and model efficiency.
The detection performance metrics [
41,
42] include precision, recall, AP50, mAP50, and mAP50-95. Precision and recall are calculated based on the counts of true positives (TPs), false positives (FPs), and false negatives (FNs). First, only predictions with a confidence score above a specific threshold are considered for evaluation. Then, these filtered predictions are categorized as follows. A prediction is a TP if it correctly identifies a pest’s species and its bounding box achieves an Intersection over Union (IoU) of at least the predefined IoU threshold. A prediction is an FP if it is incorrect, either due to a class mismatch or an IoU below the IoU threshold. An FN represents a pest that was present but not detected by the model with sufficient confidence.
Based on these counts, precision and recall are defined as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}$$
For the $i$-th pest category, its average precision is defined as
$$AP_i = \int_0^1 p_i(r)\,dr$$
where $p_i(r)$ represents the precision at recall $r$ for that category. AP50 is calculated when the IoU threshold is set to 0.5.
The mean Average Precision (mAP) represents the overall detection performance across all categories. mAP is calculated as
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where $N$ is the total number of pest categories. This study uses two primary mAP metrics. The first, mAP50, is calculated at a single IoU threshold of 0.5. The second, mAP50-95, is the average mAP across ten IoU thresholds, ranging from 0.5 to 0.95 in increments of 0.05.
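For reference, a small sketch of how AP and mAP can be computed from a per-category precision–recall curve is given below; it uses all-point interpolation, which may differ from the exact scheme of the evaluation toolkit used in this study.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve for one category
    (all-point interpolation; recall must be sorted in ascending order)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing, then integrate over recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class: list) -> float:
    """mAP: mean of the per-category AP values."""
    return float(np.mean(ap_per_class))
```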
The model efficiency metrics include three measures. Params (parameters) [
43] reflect model complexity, Model Size (MB) indicates storage requirements, and GFLOPs (Giga Floating Point Operations) [
43] represent computational complexity.
3.3. Comparison Experiments with Other Models
To evaluate the comprehensive performance of the improved algorithm, this study compares the proposed method with multiple advanced object detection models on the same dataset. The comparison includes the two-stage detector (Faster R-CNN) [
7] and multiple lightweight versions of the YOLO series [
17,
18,
19,
21]. The comparison outcomes are shown in
Table 4.
The experimental results show clear performance differences across detection architectures. Faster R-CNN, as a representative of traditional two-stage detectors, achieves only 43.2% mAP50 while requiring 41.47 M parameters and 91.0 GFLOPs. This indicates that the two-stage architecture not only fails to achieve high detection accuracy but also introduces substantial model parameters and computational cost.
Among early YOLO models, YOLOv4-tiny achieves 55.1% mAP50 with 5.93 M parameters and 16.3 GFLOPs. YOLOv5n outperforms it, achieving 57.8% mAP50 with only 1.79 M parameters and 4.2 GFLOPs. YOLOv6n exhibits slightly lower performance compared to YOLOv5n, achieving 57.5% mAP50 using 4.63 M parameters and 11.4 GFLOPs. YOLOv7-tiny shows further improvements over YOLOv5n, reaching 59.2% mAP50 but requiring 6.07 M parameters and 13.2 GFLOPs.
Recent YOLO models present a better balance between performance and efficiency. YOLOv8n achieves 62.2% mAP50 with 3.01 M parameters and 8.1 GFLOPs. YOLOv11n attains the same 62.2% mAP50 with fewer parameters of 2.59 M and 6.3 GFLOPs. YOLOv10n achieves 61.8% mAP50 with 2.70 M parameters and 8.3 GFLOPs.
YOLO-LCE achieves the highest mAP50 of 63.9% among all evaluated models. It uses only 1.69 M parameters, 3.69 MB model size, and 5.4 GFLOPs. Compared to YOLOv8n, YOLO-LCE improves mAP50 by 1.7 percentage points. It also reduces parameters by 43.9%, model size by 41.1%, and GFLOPs by 33.3%. When compared to YOLOv11n, YOLO-LCE achieves the same 1.7 percentage point improvement in mAP50. It also reduces parameters by 34.7%, model size by 32.7%, and GFLOPs by 14.3%.
On the stricter mAP50-95 metric, YOLO-LCE achieves the highest performance of 39.1% among all compared models, indicating that YOLO-LCE has better localization precision. Regarding precision and recall analysis, compared to YOLOv7-tiny, YOLO-LCE has lower precision by 4.0 percentage points, achieving 69.3% compared to YOLOv7-tiny’s 73.3%, but demonstrates higher recall by 4.2 percentage points with 60.3% compared to YOLOv7-tiny’s 56.1%. YOLOv5n demonstrates the highest precision at 76.2%, but its recall of 52.7% is 7.6 percentage points lower than YOLO-LCE’s 60.3%. YOLOv4-tiny achieves the highest recall at 65.5%, but its precision of 35.3% is significantly lower than YOLO-LCE’s 69.3% by 34.0 percentage points. YOLO-LCE surpasses Faster R-CNN, YOLOv10n, and YOLOv11n in precision and recall. Additionally, YOLO-LCE achieves the same recall as YOLOv8n while improving precision by 0.5 percentage points.
3.4. Ablation Experiments
To validate the effectiveness of each component, this study conducted ablation experiments by progressively integrating each component into the baseline YOLOv8n according to the YOLO-LCE design. The evaluation outcomes are shown in
Table 5.
The introduction of the C2f-LCR module reduces parameters and GFLOPs by 19.9% and 11.1%, respectively. The reduction in parameters and GFLOPs is primarily due to the use of depthwise convolutions. These convolutions perform independent computations on each channel. Compared to standard convolutions, they require fewer parameters and lower computational overhead. Although recall decreases from 60.3% to 59.1%, the module significantly improves precision from 68.8% to 75.6%, an increase of 6.8 percentage points. More importantly, mAP50, as a comprehensive metric that considers both precision and recall, improves by 0.8 percentage points from 62.2% to 63.0%. This demonstrates that the comprehensive detection performance is enhanced. This improvement can be attributed to complementary features that enhance feature representation, thereby improving comprehensive detection capability.
The EPConv continues to reduce parameters to 2.05 M and GFLOPs to 6.8. This indicates that the asymmetric channel splitting strategy reduces both parameters and computational complexity while ensuring complete feature utilization. Additionally, the shortcut based on average pooling [
32] preserves more pest details. SE attention [
33] focuses on important channels. These strategies enable the model to maintain mAP50 at 63.1% and mAP50-95 at 38.8%. Although EPConv causes precision to decrease from 75.6% to 74.1%, recall improves from 59.1% to 60.1%.
The introduction of the Ghost module [
34] generates more ghost features through cheap operations. This reduces model complexity, with parameters reduced to 1.69 M and GFLOPs to 5.4. Although mAP50 slightly decreases to 62.8%, mAP50-95 remains at 38.8%. Moreover, the 0.3 percentage point mAP50 loss is exchanged for a 20.6% reduction in GFLOPs. Additionally, although recall drops to 58.4%, this module enables the model to achieve the highest precision of 76.3%.
The adoption of the WIoUv3 loss function [
35] enables recall to recover to the same level as the baseline at 60.3%. Although precision decreases compared to the previous stage, it still improves by 0.5 percentage points compared to the baseline, reaching 69.3%. In terms of mAP metrics that comprehensively consider precision and recall, it enables the final model to achieve optimal mAP50 and mAP50-95 performance. mAP50 and mAP50-95 improve by 1.1 percentage points and 0.3 percentage points respectively compared to the previous stage, reaching 63.9% and 39.1%. This indicates that WIoUv3 can optimize overall detection performance through its dynamic non-monotonic gradient allocation mechanism.
The ablation experiments verify the effectiveness of the integrated components. Each component works synergistically to achieve the goals of detection performance improvement and lightweight design. Through integrating these components into YOLOv8n, YOLO-LCE is constructed. YOLO-LCE achieves 63.9% mAP50 and 39.1% mAP50-95 while reducing parameters by 43.9% and computational cost by 33.3% in GFLOPs compared to baseline YOLOv8n.
3.5. Per-Class AP50 Comparison
This study conducted per-class AP50 analysis comparing YOLOv8n and YOLO-LCE.
Table 6 presents the AP50 values for each pest category, ranked by improvement magnitude in descending order.
The results show notable performance variations across different pest categories, with YOLO-LCE achieving the highest AP50 for Gryllotalpa orientalis (97.9%) and the lowest for Rice planthopper (1.49%). Holotrichia oblita achieves the largest improvement of 13.1 percentage points, followed by Nematode trench with 8.4 percentage points. However, both categories have limited test instances (29 and 30 respectively), which affects the reliability of these improvements.
Categories with moderate sample sizes demonstrate more reliable improvements. Rice Leaf Roller (243 instances) and Stem borer (384 instances) show consistent gains of 5.2% and 4.7% respectively. These improvements are more statistically meaningful due to adequate sample representation.
For pest categories with large test populations, the results show stable performance patterns. Anomala corpulenta (10,533 instances), Athetis lepigone (6000 instances), and Bollworm (5496 instances) maintain strong detection rates with modest improvements. These categories provide the most reliable evidence of model performance due to sufficient statistical power.
YOLO-LCE improves or maintains performance for 20 out of 24 categories. The exceptions include Eight-character tiger, which decreases by 8.51 percentage points; however, this category has only 30 test instances, so the decline is potentially attributable to the limited sample size rather than a genuine model weakness. Categories with substantial test instances show improvements, demonstrating that our model achieves enhanced detection accuracy while maintaining a lightweight design.
3.6. Comparison of EPConv Channel Splitting Ratios
To determine the optimal channel splitting ratio for EPConv, we conducted experiments with different splitting strategies. As shown in
Table 7, this study evaluated several ratios under the same experimental framework.
As demonstrated in
Table 7, the 1:7 channel splitting ratio achieves the highest mAP50 of 63.1% and mAP50-95 of 38.8% while maintaining the lowest parameter count of 2.05 M. It also attains the highest precision and recall, indicating that the 1:7 ratio delivers the best overall performance among the three experimental ratios. It should be noted that this study did not experiment with ratios smaller than 1:7. This is because excessively small ratios would cause the standard convolution component to be overwhelmed by group convolution, which violates the asymmetric design concept.
3.7. Detection Results Visualization Analysis
To validate the capability of the LCR module in enhancing discrimination between target pests and non-target pests, this study conducted visualization comparison analysis. Representative test images were selected for this analysis.
Figure 10 shows the detection result comparisons between YOLOv8n and YOLOv8n integrated with C2f-LCR. Each row displays the same test image processed by different methods.
The visualization results demonstrate clear improvements after integrating the C2f-LCR module. In the first comparison group, the enhanced model reduces three false detections. In the second group, the enhanced model shows two fewer false detections. It also eliminates one missed detection and one classification error. In the third group, the enhanced model reduces three false detections and one missed detection. In the fourth group, the enhanced model produces one false detection that YOLOv8n does not have, but it still achieves a net reduction of two false detections and one missed detection compared to YOLOv8n. In the fifth group, the enhanced model reduces four false detections.
Overall, YOLOv8n integrated with the C2f-LCR module reduces false detections across different test scenarios. This performance is better than YOLOv8n. These improvements demonstrate that the introduction of the LCR module enhances the model’s capability to discriminate between target pests and non-target pests.
3.8. Heatmap Analysis
To intuitively observe the attention distribution of YOLO-LCE on pest targets, heatmap visualization analysis was conducted. The analysis was conducted on both YOLOv8n and YOLO-LCE models.
As shown in
Figure 11, performance differences are revealed across three representative scenarios. Each row displays the same test image processed by different models. In the first comparison group, which represents a clustered pest detection scenario, YOLOv8n fails to focus on some target pests. In contrast, YOLO-LCE focuses on more target pests. The second comparison group shows that YOLO-LCE generates more intense attention regions compared to YOLOv8n. In the third comparison group, YOLOv8n incorrectly focuses on non-target pest regions. In contrast, YOLO-LCE partially avoids this erroneous focus. Overall, the heatmap visualization demonstrates that YOLO-LCE can better focus on pest targets.