1. Introduction
As one of the three major staple food crops worldwide, rice serves as a fundamental pillar of global food security [1]. During its growth cycle, infestation by pests such as rice flies seriously threatens rice production: it not only causes significant yield losses [2,3] but also degrades grain quality through increased chalkiness and decreased protein content, directly affecting national food security and farmers’ economic returns [4,5]. According to 2023 monitoring data from the Agricultural Technology Extension Service Center, the cumulative area affected by rice pests in China reached 5.33 × 10⁸ hm², with the resulting economic losses accounting for more than 55% of total crop pest losses [6]. The State of Food and Agriculture 2024 report of the Food and Agriculture Organization of the United Nations emphasizes that rice cultivation is an important component of the agrifood system [7]. Therefore, improving the efficiency and accuracy of crop pest and disease identification is of great significance for reducing agricultural production losses and promoting high-quality agricultural development [8].
The current rice pest monitoring technology system faces a dual dilemma: traditional manual surveys are limited by low sampling density, high subjective bias, and delayed response [9], making it difficult to meet the demand for precision plant protection [4]. Over the past two decades, many scholars have studied automatic image-based identification of crop pests and diseases [10,11,12]. Traditional machine learning methods classify crop pests and diseases through manually designed feature extraction and classification strategies, and have achieved some success in specific scenarios. For example, Thenmozhi et al. employed digital image processing techniques for preprocessing, segmentation, and geometric shape extraction to identify insect species in sugarcane crops, achieving high accuracy across nine shape categories [13]. Wang et al. designed an automatic insect identification system based on support vector machines (SVM), attaining a recognition accuracy of 93% [14]. Larios et al. proposed a classification method combining Haar features with random forests (RF), which improved recognition performance for aquatic stonefly species [15]. Li et al. applied spectral regression linear discriminant analysis (SR-LDA) for dimensionality reduction, followed by K-nearest neighbor (KNN) classification, achieving 90% accuracy in recognizing unclassified insect images [16]. However, these approaches rely heavily on expert-driven feature engineering and therefore generalize poorly in complex field environments [17].
With the continuous improvement of computing power and the rapid development of deep learning, research on deep-learning-based intelligent detection of agricultural pests has grown explosively [18,19]. Current object detection algorithms fall into two main categories of architecture. Two-stage models, such as the R-CNN series [20,21], generate candidate regions through a region proposal network; they offer high detection accuracy but suffer from slow inference and a large memory footprint, which makes deployment on resource-constrained field equipment difficult [22,23]. Single-stage models, such as YOLO [24] and SSD [25], adopt an end-to-end detection strategy; among them, the YOLO series balances accuracy and speed through a grid-based prediction mechanism and has become the preferred solution for agricultural scenarios [26]. For example, the PestLite crop pest detection model proposed by Dong et al. compresses the YOLOv5 parameter count to 1.2 M through multilevel spatial pyramid pooling, reducing computational cost by 32% while maintaining 85.7% mAP [27]. The rice pest detection model proposed by Zhou et al. uses GhostNet to reconstruct the YOLOv4 backbone, reducing model size by 41% and increasing inference speed to 67 FPS [28]. Liao introduced a lightweight GsConv module with dilated convolution into YOLOv7 to enhance feature extraction for small target spots, reducing the missed detection rate to 6.3% [29]. Li et al. suppressed complex background interference by integrating a channel–spatial dual-attention mechanism and the EfficientIoU loss function, improving YOLOv5’s recognition accuracy by 11.2 percentage points in pest occlusion scenes [30]. Di et al. proposed TP-YOLO, a lightweight attention-based network that introduces a context converter and a full-dimensional dynamic convolution module for enhanced feature extraction [31]. Sun et al. implemented three core improvements on the YOLOv8l architecture: adopting an asymptotic feature pyramid network to optimize multi-scale feature fusion, reconfiguring the C2f module to achieve 55.26% parameter compression, and integrating an attention mechanism to enhance feature discrimination; experiments show that this scheme improves mAP by 1% while maintaining detection efficiency [32]. Hu et al. introduced a global contextual attention module to enhance feature characterization and optimized cross-layer feature fusion with a bidirectional feature pyramid network, improving mAP by 5.4% over the YOLOv5 baseline [33].
Deep learning offers an effective solution for intelligent pest detection, significantly enhancing both accuracy and processing speed in modern agriculture. To address the time-consuming, labor-intensive nature of traditional manual detection and the low accuracy of existing machine learning models in complex farmland scenarios, this study proposes MTD-YOLO, a high-precision rice pest detection model based on the YOLOv8 framework, with the following core improvements and contributions:
The original YOLOv8 backbone network (5.08 M parameters) is replaced by the lightweight MobileNetV3 (2.97 M parameters), which relies on depthwise separable convolution (illustrated in the code sketch following this list of contributions) to remove about 2.11 M parameters (a 41.5% reduction) and to shrink the model size from 21.5 MB to 11.1 MB (a 48.4% reduction), while preserving a strong representation of fine-grained pest features.
The C2f module is fused with the Triplet Attention module to construct the C2f-T structure, which alleviates the confusion between leaf texture and pest region features by capturing spatial location relationships, channel dependencies, and cross-scale contextual information in parallel (a compact sketch of the attention block also follows this list).
Dynamic Head is introduced to replace the original detection head, using its scale-aware, spatial-aware, and task-aware triple-attention mechanism to dynamically enhance the semantic clarity of pest targets and the spatial focus on them.
A diversified data augmentation strategy was adopted, specifically including geometric transformations (horizontal/vertical flipping, random rotation); lighting adjustments (dynamic adjustment of brightness and exposure); noise interference (Gaussian noise); and weather simulation (raindrop degradation effect). This approach systematically covers the main types of interference that may be encountered during field testing. The experimental dataset covers 12 types of typical agricultural pests and is characterized by significant biodiversity and scene complexity.
The synergistic effect of the module combinations was demonstrated by ablation experiments. Using MobileNetV3 alone improved mAP@0.5 from 85.8% to 88.3%, indicating its strength in lightweight feature extraction. Using C2f-T alone increased recall by 2.5% but reduced mAP@0.5 by 0.8%, indicating that without feature compression support, background noise is easily amplified. Combining MobileNetV3 with C2f-T raised mAP@0.5 to 89.9%, demonstrating the structural synergy between the two. The best performance was achieved when all three modules were used together, further validating the complementary nature of the overall architecture.
The model performance was quantitatively and qualitatively analyzed through comparative experiments and visualization of test results.
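As a minimal illustration of why replacing standard convolutions with depthwise separable ones reduces the parameter count, the following PyTorch sketch compares the two operations; the channel widths are arbitrary examples rather than the actual MTD-YOLO configuration.

```python
import torch
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch = 128, 256  # example channel widths, not the MTD-YOLO values

# Standard 3x3 convolution.
standard = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

# Depthwise separable convolution: per-channel 3x3 depthwise conv + 1x1 pointwise conv.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),  # depthwise
    nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),                          # pointwise
)

x = torch.randn(1, in_ch, 80, 80)
assert standard(x).shape == separable(x).shape          # identical output shape
print(count_params(standard), count_params(separable))  # 294912 vs. 33920 parameters (~8.7x fewer)
```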
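The attention block inside C2f-T can be summarized by the compact, self-contained PyTorch sketch below, which follows the publicly described Triplet Attention design (Z-Pool followed by a 7 × 7 convolution in three permuted branches); how the block is wired into the C2f bottlenecks is omitted here and detailed in Section 3.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooling along the channel dimension -> 2 channels."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-Pool -> 7x7 conv -> sigmoid gate, applied to one permutation of the tensor."""
    def __init__(self):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False),
            nn.BatchNorm2d(1),
        )
    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three parallel branches capture (C,H), (C,W), and (H,W) interactions; outputs are averaged."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()   # channel-width branch
        self.hc = AttentionGate()   # height-channel branch
        self.hw = AttentionGate()   # plain spatial branch
    def forward(self, x):                                            # x: (B, C, H, W)
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)    # rotate C <-> H
        x_hc = self.hc(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)    # rotate C <-> W
        x_hw = self.hw(x)
        return (x_cw + x_hc + x_hw) / 3.0

feat = torch.randn(2, 64, 40, 40)
print(TripletAttention()(feat).shape)   # torch.Size([2, 64, 40, 40])
```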
The structure of the paper is organized as follows:
Section 2 describes the rice pest datasets and the data augmentation strategy;
Section 3 analyzes the original architecture of YOLOv8 and details the three improvements of MTD-YOLO: MobileNetV3 backbone, C2f-T cross-dimensional feature fusion, and Dynamic Head;
Section 4 presents a detailed evaluation of the model through comparative experiments, ablation studies, and visualization of test results;
Section 5 summarizes the research results and outlines future research directions.
2. Datasets
To improve the generalization performance and recognition accuracy of the pest detection model, this study adopts a two-stage dataset validation strategy based on two rice pest image datasets, Rice Pest1 and Rice Pest2, obtained from the Roboflow platform (available online: https://roboflow.com, accessed on 9 February 2025). The distribution of pest objects per category in the different sets is shown in Figure 1; both datasets maintain balanced category distributions internally. The images are divided into training, validation, and test sets in a 7:2:1 ratio, and each image has a resolution of 640 × 640 pixels and contains one to four target objects. Rice Pest1, the core training set, focuses on two pests that are prominent in rice production areas, the stem borer and the brown planthopper, and contains 2639 high-quality labeled images; sample training images are shown in Figure 2. Rice Pest2 contains 5564 multi-category samples covering 10 species of rice pests, specifically lepidopteran pests (stem borer, stickleback, rice leaf borer), hemipteran pests (brown planthopper, rice leafhopper, white-backed fly), coleopteran pests (bean scabbard fly, rice water weevil), arachnid pests (red spider mite), and nematode species (wheat root-knot nematode). This diversity makes it well suited to validating the model’s adaptability in cross-species recognition tasks; sample training images are shown in Figure 3.
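The 7:2:1 split is provided directly by the Roboflow export; for readers reproducing it from raw images, a minimal sketch is shown below (the directory name is a placeholder).

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Shuffle image paths and split them 7:2:1 into train/val/test lists."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(paths)
    n_train, n_val = int(0.7 * len(paths)), int(0.2 * len(paths))
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

train, val, test = split_dataset("rice_pest1/images")  # placeholder directory name
```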
To account for the variable insect postures and complex lighting conditions of agricultural scenes, the datasets were processed with multiple data augmentation techniques: horizontal and vertical flips applied with 50% probability, random rotations within ±15°, brightness adjustments of ±66%, and exposure adjustments of ±25%. Because Rice Pest2 covers more species and is suited to evaluating model robustness in complex environments, a richer set of augmentation strategies was applied to it, whereas Rice Pest1 serves as the benchmark set and retains only the base augmentations to keep variables controlled and to facilitate comparison of the model improvements. The additional augmentations applied to Rice Pest2 include adding 4.22% noise to each image, applying random Gaussian blur of up to 25 px, and artificially generating raindrops to mimic adverse weather and thus stress detection under degraded imaging conditions. The raindrop generation parameters follow the China Meteorological Administration precipitation intensity rating standard (GB/T 28592-2012) [34]. Specifically, the intensity factor is uniformly sampled from the interval [0.3, 0.8], corresponding to moderate to torrential rainfall; the raindrop density ranges from 900 to 1900 drops/m² (with 1000 ± 100 drops/m² as a reference for moderate rain); and raindrop lengths are randomly generated between 19 and 34 pixels, optically equivalent to a physical diameter of 1.9 to 3.4 mm. This parameterized perturbation strategy is intended to enhance the model’s robustness to environmental variability and improve generalization in complex real-world scenarios.
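For readers who augment locally rather than on the Roboflow platform used here, a pipeline of this kind can be approximated with the Albumentations library. The sketch below is only an illustrative mapping of the listed settings: parameter names follow Albumentations 1.x and may differ in other versions, the exposure adjustment is approximated by a contrast term, and the raindrop transform only loosely corresponds to the GB/T 28592-2012-based generator described above.

```python
import albumentations as A

# Approximate augmentation pipeline for the Rice Pest2 training images (illustrative values).
train_aug = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),                     # random rotation within ±15°
        A.RandomBrightnessContrast(brightness_limit=0.66, contrast_limit=0.25, p=0.5),
        A.GaussNoise(p=0.3),                           # additive Gaussian noise
        A.GaussianBlur(blur_limit=(3, 7), p=0.2),      # mild random blur
        A.RandomRain(drop_length=25, blur_value=3,     # synthetic raindrops
                     brightness_coefficient=0.8, p=0.2),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# usage: augmented = train_aug(image=img, bboxes=boxes, class_labels=labels)
```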
4. Experimental Design and Analysis of Results
4.1. Experimental Environment and Parameters
Hardware Configuration: The processor is an Intel(R) Core(TM) i7-14650HX (Intel Corporation, Santa Clara, CA, USA) at 2.20 GHz with 32 GB of RAM, and the graphics card is an NVIDIA RTX 4060 (NVIDIA Corporation, Santa Clara, CA, USA) with 8 GB of video memory. Software Configuration: The operating system is Windows 11, the deep learning framework is PyTorch 3.10.1, the CUDA version is 12.4, and the programming language is Python 3.8.10.
Parameter settings: The input image size is 640 × 640, the number of training epochs is 100, the batch size is 8, and the initial learning rate is 0.01.
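Assuming the standard Ultralytics training interface, these hyperparameters correspond to a call of the following form; "mtd-yolo.yaml" and "rice_pest1.yaml" are placeholder names standing in for the modified model definition and the dataset configuration file.

```python
from ultralytics import YOLO

# Hypothetical training invocation mirroring the hyperparameters listed above.
model = YOLO("mtd-yolo.yaml")   # placeholder for the MTD-YOLO model definition
model.train(
    data="rice_pest1.yaml",     # placeholder dataset configuration
    imgsz=640,                  # input image size
    epochs=100,                 # training epochs
    batch=8,                    # batch size
    lr0=0.01,                   # initial learning rate
    device=0,                   # single RTX 4060 GPU
)
```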
4.2. Evaluation Metrics
The evaluation is based on four widely used metrics: Precision (P), Recall (R), Average Precision (AP), and mean Average Precision (mAP). Precision measures the ratio of true positive predictions to the total number of predicted positives, while Recall represents the ratio of true positives to the total number of actual positive samples. Average Precision is defined as the area under the Precision–Recall curve for a single class, reflecting the model’s performance in that specific category. Meanwhile, mean Average Precision is the mean value of AP across all categories, providing an overall performance metric for multi-class detection tasks. The mAP@0.5 metric refers to the mean AP computed using an Intersection over Union (IoU) threshold of 0.5. This metric is widely used in object detection tasks and serves as a reliable indicator of a model’s ability to accurately locate targets. In addition to mAP@0.5, this study also reports mAP@[0.5:0.95], which represents the average mAP calculated at multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. This stricter and more comprehensive evaluation metric allows for a more rigorous assessment of the model’s performance across varying levels of localization precision. The definitions of the formulas used are as follows:
P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad AP = \int_{0}^{1} P(R)\,dR, \quad mAP = \frac{1}{N}\sum_{i=1}^{N} AP_{i}
where TP (True Positives) denotes the number of samples correctly predicted as positive by the model, FP (False Positives) denotes the number of negative samples incorrectly predicted as positive, FN (False Negatives) denotes the number of positive samples incorrectly predicted as negative, and N denotes the number of categories.
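As a minimal, library-free illustration of these definitions (not the evaluation code used in the experiments), the following NumPy sketch computes precision, recall, and the all-point-interpolated AP from a precision–recall curve; mAP@0.5 is then the mean of the per-class APs computed at an IoU threshold of 0.5.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # make precision monotonically decreasing
    idx = np.where(r[1:] != r[:-1])[0]            # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP = np.mean([average_precision(r_c, p_c) for r_c, p_c in per_class_curves])
```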
4.3. Experimental Demonstration
All model training and evaluation were conducted on an NVIDIA RTX 4060 GPU under consistent hyperparameter settings. As shown in Table 3, on the Rice Pest1 dataset MTD-YOLO improves on the baseline YOLOv8 across all performance metrics: Precision rises from 90.9% to 92.5%, Recall from 87.2% to 90.1%, mAP@0.5 from 85.8% to 90.0%, and mAP@[0.5:0.95] from 66.5% to 67.8%, gains of 1.6, 2.9, 4.2, and 1.3 percentage points, respectively. In addition, Table 4 compares the average precision for the two pest categories: the AP of Penggerek Batang padi kuning improved from 95.2% to 96.8% and that of Wereng Coklat from 76.4% to 83.2%, indicating that the model’s ability to detect the harder-to-identify category has been significantly enhanced.
On the Rice Pest2 dataset, MTD-YOLO also shows clear performance advantages. As shown in Table 5, precision improves from 93.1% to 95.6%, recall from 90.6% to 92.8%, mAP@0.5 from 94.2% to 96.6%, and mAP@[0.5:0.95] from 80.7% to 82.5%, gains of 2.5, 2.2, 2.4, and 1.8 percentage points, respectively. Moreover, Table 6 presents the average precision for the ten individual pest categories, all of which show consistent improvements. Among them, the already high APs of red spider and rice gall midge were further raised to 99.5%, while the relatively weak baseline performances on yellow rice borer and rice leaf roller were also steadily improved. This verifies the robustness and generalization ability of the method under multi-category and complex background conditions.
To assess the stability of the model’s performance, we conducted five independent training runs on the Rice Pest1 dataset, as shown in Figure 10. The mean, standard deviation, and 95% confidence interval of the key metrics are reported in Table 7. Specifically, the model achieved a precision of 92.46% ± 0.34 (95% CI: [92.04%, 92.88%]) and a recall of 90.12% ± 0.29 (95% CI: [89.76%, 90.48%]). The mAP@0.5 was 90.02% ± 0.36 (95% CI: [89.58%, 90.46%]), while mAP@[0.5:0.95] reached 67.80% ± 0.32 (95% CI: [67.42%, 68.18%]). These results demonstrate that the proposed MTD-YOLO model exhibits strong consistency and robustness across repeated experiments.
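The reported intervals appear consistent with a t-distribution-based confidence interval over the five runs; a sketch of that computation is given below, with hypothetical metric values in place of the actual per-run results.

```python
import numpy as np
from scipy import stats

def mean_std_ci(values, confidence=0.95):
    """Mean, sample standard deviation, and a t-based confidence interval for a small sample."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std(ddof=1)                        # sample standard deviation
    sem = std / np.sqrt(len(values))                # standard error of the mean
    half = stats.t.ppf((1 + confidence) / 2, df=len(values) - 1) * sem
    return mean, std, (mean - half, mean + half)

# hypothetical mAP@0.5 values from five runs (illustrative, not the actual run results)
print(mean_std_ci([89.7, 90.3, 90.0, 89.9, 90.2]))
```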
4.4. Ablation Experiments
To assess the individual contributions of the MobileNetV3, DyHead, and C2f-T modules within the YOLOv8 framework, ablation experiments were conducted on the Rice Pest1 dataset, as shown in Table 8. Replacing the original backbone with MobileNetV3 alone significantly reduces the parameter count while achieving an mAP@0.5 of 88.3%, demonstrating the effectiveness of its inverted residual structure in compressing features without sacrificing accuracy. Introducing the DyHead module increases recall by 2.0%, although it considerably increases the parameter count, suggesting that the enhanced multiscale detection capability comes at the cost of greater computational complexity. Applying the C2f-T module in isolation results in an unexpected 0.8% drop in mAP@0.5, indicating that its cross-stage fusion design is most effective when supported by the multi-level features extracted by MobileNetV3. When all three modules are integrated, the model achieves peak performance with 92.5% precision, 90.1% recall, and 90.0% mAP@0.5. The experiments show that the performance improvement comes from the complementary design of the three components: MobileNetV3 achieves efficient feature compression, DyHead enhances the multi-scale target response, and C2f-T optimizes cross-level semantic fusion.
To visually validate the effectiveness of the key modules introduced in this study, Figure 11 shows how the detection confidence for a specific target (Penggerek Batang padi kuning) changes as each core improvement module is progressively added to the model. (a) Base model: the confidence score is 89%, indicating that the unmodified YOLOv8 model exhibits a baseline level of detection capability. (b) +MobileNetV3: confidence increases to 92%, demonstrating that the lightweight backbone improves feature extraction efficiency for the target object. (c) +C2f-T: adding the C2f-T module further increases confidence to 93%; the integrated Triplet Attention mechanism rotates tensor dimensions and applies Z-Pool to model spatial–channel dependencies, sharpening the model’s focus on discriminative features while suppressing background noise. (d) +Dynamic Head: with the Dynamic Head module, confidence rises further to 95%; this module enables adaptive multiscale feature learning via scale-aware, spatial-aware, and task-aware attention, resulting in more confident and precise detection.
This visualization result strongly validates the lightweight feature extraction capability of the MobileNetV3 backbone network, the effectiveness of Triplet Attention in enhancing feature discriminability, and the role of Dynamic Head in optimizing prediction accuracy. Working in tandem, these three components collectively form an efficient and precise object detection framework, significantly enhancing the model’s detection confidence and reliability in complex backgrounds.
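Of the three attentions in Dynamic Head, the scale-aware component is the easiest to illustrate in isolation. The sketch below is a simplified stand-in written for this description, not the actual DyHead implementation: it resizes the pyramid levels to a common resolution and learns a per-level gating weight, while the spatial-aware (deformable convolution) and task-aware branches are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareAttention(nn.Module):
    """Simplified scale-aware attention: one learned weight per pyramid level,
    derived from globally pooled features, rescales that level's responses."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):                       # list of (B, C, Hi, Wi) pyramid levels
        target = feats[len(feats) // 2].shape[-2:]  # resize all levels to the middle resolution
        stacked = torch.stack(
            [F.interpolate(f, size=target, mode="bilinear", align_corners=False) for f in feats],
            dim=1,
        )                                           # (B, L, C, H, W)
        B, L, C, H, W = stacked.shape
        pooled = stacked.mean(dim=(3, 4))           # (B, L, C): global average over space
        weight = torch.sigmoid(self.fc(pooled.view(B * L, C, 1, 1))).view(B, L, 1, 1, 1)
        return stacked * weight                     # per-level reweighted features

levels = [torch.randn(1, 64, s, s) for s in (80, 40, 20)]
print(ScaleAwareAttention(64)(levels).shape)        # torch.Size([1, 3, 64, 40, 40])
```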
4.5. Model Comparison Experiments
To demonstrate the effectiveness of the improved algorithm, comparisons were made with mainstream object detection algorithms, including Faster R-CNN, SSD, YOLOv7 [50], YOLOv5 [51], YOLOv3 [52], YOLOv6 [53], YOLOv9, YOLOv10, and YOLOv11, under the same experimental environment. As shown in Table 9, MTD-YOLO leads all comparison models with 92.1% precision and 90.1% recall, significantly better than YOLOv5 through YOLOv11, while its mAP@0.5 is 27.6 percentage points higher than that of the two-stage Faster R-CNN and 1.7 percentage points higher than that of the next best model, YOLOv3-tiny. This demonstrates that MTD-YOLO has clear advantages in the rice pest detection task.
4.6. Pest Detection
In this study, the superiority of the improved MTD-YOLO model over the baseline YOLOv8 in the rice pest detection task is verified through four sets of comparative test images. As shown in Figure 12, MTD-YOLO achieves a 1–5% confidence improvement in complex scenarios involving multiple viewing angles and varying numbers of targets. Furthermore, in densely grouped target scenarios, the improved model exhibits stronger discriminative capability. These results highlight the potential of MTD-YOLO for practical deployment in real-world agricultural environments.
To further explore differences in attention regions between models during pest identification, this study employed Grad-CAM to visualize and compare the focus areas of the original YOLOv8 and the proposed MTD-YOLO. The visualization was configured as follows: the detection layer was set to [10] and the confidence threshold to 0.2. Regions highlighted in red indicate higher model attention or activation.
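Grad-CAM itself is model-agnostic; the sketch below illustrates the underlying computation on a generic torchvision classifier rather than the exact tooling used here, since applying it to a detector additionally requires reducing the detection output to a scalar score (for example, the summed confidence of detections above the 0.2 threshold mentioned above).

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Generic classifier as a stand-in; for a detector the output must first be reduced to a scalar.
model = resnet18(weights=None).eval()
target_layer = model.layer4[-1]                 # analogous to selecting detection layer [10]

feats = {}
target_layer.register_forward_hook(lambda module, inputs, output: feats.update(a=output))

x = torch.randn(1, 3, 224, 224)                 # stand-in for a preprocessed field image
score = model(x).max()                          # scalar whose evidence we want to localize
grads = torch.autograd.grad(score, feats["a"])[0]

weights = grads.mean(dim=(2, 3), keepdim=True)                   # channel weights = averaged gradients
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))    # weighted sum of activation maps
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # normalize; red overlay = high values
```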
As shown in Figure 13, the MTD-YOLO model produces more localized and focused activation regions than the baseline YOLOv8, accurately attending to pest targets while minimizing activation over background noise. This indicates improved attention precision and enhanced model robustness in complex visual scenes.
5. Conclusions
This study addresses the inefficiency of traditional manual inspection and the low accuracy of existing detection methods in complex farmland environments by proposing a novel rice pest detection model, MTD-YOLO. The model incorporates MobileNetV3 as the backbone network, integrates the Triplet Attention mechanism into the C2f module, and replaces the original detection head with Dynamic Head. These improvements collectively construct an efficient and accurate detection framework. Two high-quality datasets covering 12 major rice pests were used, and multiple augmentation strategies (including Gaussian blurring, noise injection, raindrop simulation, and lighting adjustment) were employed to improve adaptability to complex farmland scenes. Experimental results demonstrate that MTD-YOLO significantly improves detection accuracy under complex agricultural conditions, effectively overcoming the limitations of existing approaches. Experiments on the Rice Pest1 and Rice Pest2 datasets further validate the model’s effectiveness, with mAP@0.5 improvements of 4.2 and 2.4 percentage points, respectively, over the baseline model.
Future work will focus on enhancing hardware-aware algorithm co-design, expanding the range of pest categories to improve model generalization, and exploring multimodal data fusion to strengthen feature representation. These efforts aim to develop a more robust and deployable pest detection system, thereby supporting intelligent agricultural management.