1. Introduction
Rice serves as the staple food for nearly half of the global population, making its stable production essential for ensuring food security [1]. However, agricultural pests remain one of the most persistent and severe threats to rice growth [2], and large-scale outbreaks can cause substantial yield losses and even jeopardize regional food stability [3]. Efficient monitoring and accurate identification of field pests have therefore become critical components of modern rice production. Traditional pest monitoring relies on manual field inspection and specimen comparison, which is labor-intensive, prone to subjective errors, and unable to meet the requirements of real-time, large-scale, and precise monitoring in modern agriculture [4]. With the rise of smart agriculture and machine vision, image-based automated recognition has emerged as a promising solution. Nevertheless, the effectiveness of such approaches fundamentally depends on the availability of high-quality, large-scale, and multi-scene image datasets.
To support model training and evaluation, several datasets such as IP102 [5], RP11 [6], and Pest24 [7] have been released. Despite their contributions, these datasets still exhibit notable limitations: although IP102 is large, its rice-related subset includes only 14 categories and 1248 images; RP11 focuses mainly on adult pests and contains only 11 categories with limited diversity; Pest24 provides fine-grained annotations but suffers from circular field-of-view imaging, resulting in large black margins that hinder feature extraction. Overall, existing datasets often suffer from uneven class distribution, limited sample size, and restricted imaging scenarios, making them insufficient for multi-scale pest detection in complex field environments, and thus constraining the performance of advanced detection models.
Image-based pest monitoring has therefore become an important research direction [8]. Early approaches relied on handcrafted features and traditional machine-learning classifiers such as SVM and KNN [9,10], but these methods lacked robustness under varying illumination and background conditions [11]. With advancements in imaging hardware and artificial intelligence, deep learning-based detection methods have rapidly gained attention in crop pest monitoring [12,13,14]. Models such as Faster R-CNN, R-FCN, and SSD significantly improved accuracy through end-to-end feature learning [15,16]. In particular, the YOLO family has demonstrated outstanding performance in balancing speed and accuracy, with improved models (e.g., YOLOv5, YOLOv7) achieving over 90% average precision in rice [17,18], maize [19], and tea plant [20] pest detection.
In recent years, advances in deep learning and computer vision have spurred a new wave of research in rice pest detection. For example, RP-DETR (2025) introduced Transformer architectures into rice pest detection for the first time, enabling end-to-end inference, multi-scale feature fusion, and lightweight model design, thereby offering a new paradigm for pest detection under complex field conditions [21]. Meanwhile, Rice-YOLO—an improved lightweight variant of YOLOv8 incorporating attention mechanisms—achieved strong mAP across multiple pest species while substantially reducing computational cost, making on-device or edge deployment more feasible [22]. In addition, several studies have implemented pest and disease detection systems on mobile phones, demonstrating the practical viability of bringing such models directly to farmers [23].
Despite these advances, several key challenges remain unresolved. Most existing studies rely solely on RGB images and seldom consider visual degradation caused by low illumination, shadow occlusion, or nighttime light-trap settings [15,24]. Although some efforts explore multisource data (e.g., UAV imagery, light-trap images, visible–NIR fusion), these datasets are typically limited in scale and scenario diversity, failing to capture the wide range of conditions present in real agricultural environments [25,26].
With respect to data resources, publicly available datasets remain limited in scale, species coverage, and annotation consistency. Most rely on single-modality RGB images and lack scenarios involving low light, occlusion, dense distributions, small objects, or multisource imaging (light-trap images, laboratory close-ups, UAV data, web-collected samples) [27,28]. Existing datasets often suffer from small category sets, inconsistent annotation formats, and homogeneous scenes, thereby constraining model robustness and generalization [29,30]. Building a large-scale dataset that spans multiple scenes, modalities, and pest developmental stages has therefore become a critical bottleneck for advancing the field.
Taken together, current research has yet to meet the combined requirements of “multisource/multimodal data, diverse scenes, multi-scale targets, deployable lightweight models, and high generalization stability.” Addressing this gap, we constructed RicePest-30—a new rice pest image dataset comprising field ultraviolet light-trap images, laboratory close-up images, and curated web-sourced samples. Covering 30 common rice pest categories, it better reflects the diversity and complexity of real-world scenarios. Based on this dataset, we adopt YOLOv11 as the core detection framework and incorporate transfer learning to leverage its strong capability in high-precision object detection. The goal is to significantly enhance multi-scale pest recognition performance under complex backgrounds. Ultimately, this work aims to develop a comprehensive rice pest image database and a robust YOLOv11-based detection model, providing reliable data resources and technical support for intelligent pest monitoring and precision crop protection.
2. Materials and Methods
2.1. Image Acquisition
Field images were primarily collected using ultraviolet (UV) light trap systems (RNCB-III, Hunan RNXN Tech, Changsha, China) deployed across major rice-growing regions in Hunan Province, including Suining, Taoyuan, and Wangcheng. Each trap was equipped with an LED light source operating in the 360–400 nm range (Figure 1). The systems were automatically activated at night and captured images at two-hour intervals, which were transmitted to a central server in real time.
All traps were installed at a height of approximately 1.5 m, with an inter-site spacing of 300–500 m to ensure adequate spatial representativeness. Image acquisition was performed using high-definition surveillance and digital cameras with a minimum resolution of 3200 × 2700 pixels. Data collection covered dusk, nighttime, and early morning periods, corresponding to peak pest activity.
To supplement fine-grained morphological information, individual insect images were captured under controlled laboratory conditions. These images focused on detailed features such as body shape and color patterns, enhancing the model’s capacity for individual-level feature learning. In addition, white-background images were prepared by removing complex environmental backgrounds and isolating the insect body. This process emphasized key structural characteristics, including wing venation, antennae, and body segments, thereby supporting more precise feature extraction.
To address class imbalance, additional images were selectively obtained from authorized online sources. All such images were manually inspected, and only those consistent with natural field conditions were retained.
In total, the dataset comprises 8848 images covering 30 major rice pest species. This includes 6452 field images, 1597 single-insect laboratory images, and 799 online-supplemented images, with a total of 62,227 annotated instances. The resulting dataset provides a diverse and well-structured foundation for model training and performance evaluation.
2.2. Data Annotation and Quality Control
All images were annotated in the COCO format using CVAT v2.45.0. To ensure consistency and accuracy, a standardized annotation protocol was established. Each recognizable insect instance was enclosed within a bounding box, even under conditions of mild blur or partial occlusion, while severely occluded or unidentifiable individuals were excluded.
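As an illustration of the annotation structure, a COCO-format file of the kind exported by CVAT can be parsed with a few lines of Python. This is a minimal sketch under the assumption of standard COCO fields; the file name and any class names used with it are hypothetical, not taken from the released dataset.

```python
import json

def load_instances(path):
    """Read a COCO-format annotation file and return a flat list of
    (image_id, class_name, (x, y, w, h)) tuples, one per bounding box."""
    with open(path) as f:
        coco = json.load(f)
    # Map numeric category ids to human-readable class names.
    cats = {c["id"]: c["name"] for c in coco["categories"]}
    boxes = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x_min, y_min, width, height]
        boxes.append((ann["image_id"], cats[ann["category_id"]], (x, y, w, h)))
    return boxes
```

Keeping the category lookup explicit makes it easy to verify that class identifiers in the annotation files match the 30-species label set before training.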
A dual-annotation and cross-validation strategy was employed, with 10% of the images randomly selected for triple independent reviews to minimize human bias and guarantee annotation reliability. The final annotation files contain essential metadata, including image information, class identifiers, and bounding box coordinates. A sample of the dataset is shown in Figure 2, example images for 10 representative pest categories are provided in Figure 3, and a complete set of examples for all pest categories can be found in Table S1.
In total, 62,227 high-quality annotations were produced, covering 30 major rice pest species. The class distribution of images and annotated instances is summarized in Table 1, providing a statistical overview of dataset composition. The resulting multi-class rice pest detection dataset, RicePest-30, will be publicly available for research purposes at: https://github.com/kkb20334-lang/RicePest-30 (accessed on 21 December 2025).
2.3. Dataset Partitioning for Model Training
Following precise manual annotation, the dataset was divided into training and testing subsets with an approximate ratio of 9:1, ensuring proportional representation across all pest categories. The pest categories in the dataset are shown in Table 2.
A total of 953 images were randomly selected as the test set for model performance evaluation, while the remaining samples were allocated to the training set for model learning and parameter optimization.
This partitioning strategy maintains balanced class distributions and effectively prevents data leakage, thereby ensuring the objectivity and reliability of model evaluation.
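The class-proportional 9:1 split described above can be sketched with scikit-learn's stratified splitter (scikit-learn appears in the software environment of Section 3.1). The one-label-per-image assumption below is a simplification for illustration; images containing several species would need a grouping rule instead.

```python
from sklearn.model_selection import train_test_split

def stratified_split(image_ids, labels, test_fraction=0.1, seed=42):
    """Split images ~9:1 while preserving per-class proportions.
    `labels` holds one representative pest class per image (an assumption;
    multi-species images would require a different strategy)."""
    train_ids, test_ids = train_test_split(
        image_ids,
        test_size=test_fraction,
        stratify=labels,      # keep class ratios identical in both subsets
        random_state=seed,    # fixed seed for reproducibility
    )
    return train_ids, test_ids
```

Fixing the random seed makes the partition reproducible, which also helps guard against accidental data leakage between subsets across experiments.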
2.4. Model Architecture and Training Configuration
In recent years, object detection techniques have advanced rapidly within the field of computer vision [31]. Among these approaches, the You Only Look Once (YOLO) series has become one of the most representative single-stage detection frameworks, owing to its end-to-end architecture and strong balance between accuracy and real-time performance [32].
This study adopts YOLOv11 [33], one of the latest developments in the YOLO family. The model consists of three core components: a Backbone, a Neck, and a Detection Head.
In the Backbone, YOLOv11 integrates enhanced Cross Stage Partial (CSP) modules together with dynamic convolution units, improving feature extraction efficiency and multi-scale representation capability.
The Neck employs an optimized Path Aggregation Network–Feature Pyramid Network (PAN-FPN) to enable effective interaction between low-level spatial features and high-level semantic information, thereby enhancing robustness to complex pest morphologies.
In the Detection Head, YOLOv11 combines anchor-based and anchor-free paradigms and incorporates dynamic label assignment and adaptive loss re-weighting strategies, which jointly improve classification confidence and bounding-box localization accuracy.
During training, Mosaic and MixUp data augmentation techniques were applied to improve model generalization under complex field conditions. During inference, an improved Non-Maximum Suppression (NMS) strategy was used to suppress redundant detections and enhance prediction stability. Overall, YOLOv11 demonstrates a favorable trade-off between detection accuracy and computational efficiency, making it well suited for multi-class rice pest detection tasks [34].
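For background, plain greedy NMS—the baseline that improved NMS variants refine—can be sketched as follows. This is a generic illustration, not the specific suppression strategy implemented in YOLOv11.

```python
def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Keep the highest-scoring box, drop remaining boxes that overlap it
    by more than iou_thr, and repeat; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thr]
    return keep
```

In dense light-trap scenes the choice of `iou_thr` directly trades off duplicate suppression against merging genuinely adjacent insects, which is why refined NMS strategies matter for this task.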
To further optimize performance, the learning rate of YOLOv11 was determined through a grid-search strategy in the range of 0.001 to 0.01 with a step size of 0.001, and the optimal value was identified as 0.002. The optimizer was configured in automatic mode to select an appropriate optimization strategy during training. Model training was conducted for a maximum of 300 epochs, with an early-stopping mechanism that terminated training if no performance improvement was observed over 100 consecutive epochs.
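The learning-rate grid search can be sketched as a simple loop over candidate values. Here `evaluate` stands in for a full training-plus-validation run returning a scalar metric such as mAP@0.5; wiring it to an actual Ultralytics training call is left as an assumption noted in the docstring.

```python
def grid_search_lr(evaluate, lo=0.001, hi=0.01, step=0.001):
    """Return the learning rate with the best validation score.
    `evaluate(lr)` is assumed to train the model with that learning rate
    and return a scalar metric (higher is better); in practice it would
    wrap something like a YOLO training run with lr0=lr."""
    n = round((hi - lo) / step)
    candidates = [round(lo + i * step, 6) for i in range(n + 1)]
    scores = {lr: evaluate(lr) for lr in candidates}
    return max(scores, key=scores.get)
```

Rounding the candidates avoids floating-point drift, so the grid is exactly {0.001, 0.002, ..., 0.010} as described in the text.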
For comparative evaluation, YOLOv5s, YOLOv8s, Faster R-CNN, and RetinaNet were selected as benchmark models. All comparison models were trained for 300 epochs using the Adam optimizer with a fixed learning rate of 0.001 and their default data augmentation and architectural settings as provided in the official implementations. Although a unified hyperparameter search across all models was not conducted due to computational constraints, identical training durations and evaluation protocols were adopted to ensure a consistent and transparent performance comparison.
2.5. Evaluation Metrics
To comprehensively evaluate the performance of the proposed model on rice pest detection, this study adopts widely used metrics from the COCO and Pascal VOC evaluation protocols, including Precision, Recall, Average Precision (AP), and mean Average Precision (mAP). Precision and Recall assess the model’s performance under a single confidence threshold, whereas AP and mAP provide an integrated measurement across multiple confidence levels, offering a more holistic view of detection capability.
Precision and Recall are defined as follows:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

where TP, FP, TN, and FN denote the numbers of True Positives, False Positives, True Negatives, and False Negatives, respectively.
Model performance is further evaluated using mAP@0.5 and mAP@0.5:0.95, strictly following the standard evaluation protocols adopted in object detection benchmarks such as COCO and Pascal VOC. In this study, average precision (AP) is defined as the area under the precision–recall (PR) curve, rather than pointwise precision computed at a single confidence threshold. The PR curve is obtained by varying the detection confidence threshold and computing precision and recall across all recall levels.
Specifically, mAP@0.5 denotes the mean of per-class AP values calculated at a fixed Intersection over Union (IoU) threshold of 0.5, reflecting the detection performance under a relatively lenient localization criterion. In contrast, mAP@0.5:0.95 represents the average AP across multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, thereby providing a more stringent and comprehensive assessment of localization accuracy and bounding-box regression robustness under complex field conditions and varying object scales.
The overall mAP is computed as:

mAP = (1/n) Σ_{i=1}^{n} AP_i

where AP_i denotes the average precision of the i-th pest class and n is the total number of classes. Collectively, these metrics enable a rigorous and unambiguous evaluation of multi-class pest detection performance in real agricultural environments, while explicitly distinguishing integrated AP-based indicators from pointwise precision and recall.
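The mAP@0.5:0.95 aggregation described above can be sketched as a double average: over classes at each IoU threshold, then over the ten thresholds. The per-class AP values are assumed to come from a PR-curve integration step not shown here.

```python
def map_50_95(ap_per_iou):
    """Average mAP over IoU thresholds 0.50, 0.55, ..., 0.95.
    `ap_per_iou` maps each IoU threshold to a list of per-class AP values
    (assumed precomputed from the precision-recall curves)."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    per_threshold_map = []
    for t in thresholds:
        aps = ap_per_iou[t]
        per_threshold_map.append(sum(aps) / len(aps))  # mean over classes
    return sum(per_threshold_map) / len(per_threshold_map)  # mean over IoUs
```

Because stricter thresholds penalize loose boxes, a gap between mAP@0.5 and this averaged value (as reported in Section 3) directly reflects bounding-box regression quality.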
3. Results
3.1. Experimental Environment and Parameter Settings
To achieve high-precision recognition of multiple rice pest species, this study employed YOLOv11 as the core detection framework and optimized its hyperparameters through a systematic experimental process. The optimal configuration was obtained with a learning rate of 0.002 and the optimizer in automatic mode, which selects the most suitable optimization strategy during training.
All experiments were conducted on an Ubuntu server equipped with an NVIDIA GeForce RTX 3090 GPU, with the software environment comprising Python 3.10.18, PyTorch 2.6.0, Ultralytics 8.3.181, Scikit-learn 1.7.2, and CUDA 12.4. This setup provided a stable computational foundation, ensuring reproducibility and reliability in model training and evaluation.
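Under the software stack listed above, the training configuration of Sections 2.4 and 3.1 can be summarized as a sketch of Ultralytics keyword arguments. The dataset YAML name and weights file in the comment are hypothetical placeholders, not artifacts shipped with the paper.

```python
# Hyperparameters described in Sections 2.4 and 3.1 (a sketch, not the
# authors' exact script); "ricepest30.yaml" is a hypothetical dataset config.
TRAIN_ARGS = dict(
    data="ricepest30.yaml",
    epochs=300,        # maximum training epochs
    patience=100,      # stop if no improvement for 100 consecutive epochs
    lr0=0.002,         # learning rate selected by the grid search
    optimizer="auto",  # let the framework pick the optimization strategy
)

# In practice this would be launched as:
#   from ultralytics import YOLO
#   YOLO("yolo11s.pt").train(**TRAIN_ARGS)
```

Collecting the arguments in one dictionary keeps the tuned values visible in a single place, which simplifies reproducing the reported configuration.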
3.2. Prediction Performance Evaluation
To ensure a fair and interpretable comparison, YOLOv5s, YOLOv8s, Faster R-CNN [35], and RetinaNet [36] were selected as baseline detectors. All models were trained using their official implementations, and the training duration was kept consistent with YOLOv11. Performance comparisons are presented in Table 3 using standard object detection metrics.
YOLOv11 achieves the most consistent performance across evaluation criteria. Its pointwise precision (0.7969) and recall (0.7071) indicate a balanced trade-off between detection accuracy and sensitivity. Meanwhile, the integrated metrics mAP@0.5 (0.7550) and mAP@0.5:0.95 (0.5513) demonstrate reliable localization performance and robustness across multiple IoU thresholds.
YOLOv5s attains slightly higher recall but lower precision, reflecting a tendency to detect more targets at the cost of increased false positives. YOLOv8s exhibits a noticeable decline in both mAP@0.5 and mAP@0.5:0.95, suggesting insufficient robustness when applied to complex, multi-class field scenarios. Faster R-CNN and RetinaNet achieve relatively high mAP@0.5 values for selected pest species; however, their low precision indicates a substantial false-positive burden, limiting their practical usability.
Taken together, the results confirm that YOLOv11 offers the most balanced and stable detection performance when both pointwise and integrated metrics are jointly considered.
Figure 4 illustrates the detection performance of the YOLOv11 model across nine representative rice pest species that exert significant impact in Hunan Province, evaluated using Precision, Recall, AP@0.5, and AP@0.5:0.95 metrics. Overall, YOLOv11 achieved consistently high accuracy across most pest categories, demonstrating strong stability and generalization ability.
Although YOLOv11 did not attain the highest Recall for certain species such as Ostrinia furnacalis, it maintained the most stable performance and the best trade-off among all models in terms of Precision and overall balance. These results indicate that YOLOv11 can accurately detect multiple pest species under complex field conditions, highlighting its robustness and practical value for reliable automated pest monitoring in agricultural environments.
A comparative analysis reveals that RetinaNet achieves strong performance in Recall and AP@0.5:0.95 across most categories; however, its markedly low Precision indicates a substantial tendency toward false positives. Although the model detects a large number of potential instances, this imbalance undermines its overall reliability. The abnormally high AP@0.5:0.95 observed in several categories further suggests an over-sensitivity or systematic misclassification, limiting the model’s suitability for field-scale pest monitoring, where both accuracy and stability are essential.
YOLOv8 exhibits notable declines in AP@0.5 and AP@0.5:0.95 for several pest categories. Given the distributional characteristics of the dataset, these deficiencies likely arise from insufficient sample size or imbalance across categories, constraining the model’s ability to learn robust multi-scale and cross-scenario representations. Notably, YOLOv11 also shows reduced AP@0.5:0.95 for a small subset of morphologically similar pests, indicating that its localization precision and bounding-box regression remain improvable under stricter IoU thresholds.
To evaluate the counting performance of different models under complex field conditions, 50 test images containing diverse, dense, and partially overlapping pest individuals were selected. Manual inspection indicated that the number of pest instances per image ranged from 25 to 98, resulting in a total of 1237 annotated targets.
Among the evaluated models, YOLOv11 produced count estimates that were closest to the manual annotations, with an absolute error of 17 instances. In comparison, YOLOv5 and Faster R-CNN exhibited larger counting deviations, with absolute errors of 79 and 86 instances, respectively. These results suggest that YOLOv11 offers more stable counting behavior in dense and cluttered scenes, although other models may outperform it on individual detection metrics. Overall, the counting experiment highlights a trade-off between detection sensitivity and counting stability, with YOLOv11 demonstrating comparatively robust performance in scenarios involving high object density and overlap (Figure 5).
To further assess the class-specific recognition performance of YOLOv11, a confusion matrix was constructed at an IoU threshold of 0.5 (Figure 6), and detailed analyses were performed for each pest category. The results indicate that the model exhibits strong discriminative ability for most classes, though some misclassifications remain. For example, Spodoptera spp. and Sesamia inferens were occasionally confused, some Geometridae samples were misidentified as Spodoptera frugiperda or Spodoptera litura, and Agrotis ipsilon individuals were occasionally classified as Spodoptera litura. These confusions are primarily attributed to morphological similarity and limited sample numbers, which increase classification difficulty. Further examination of the evaluation metrics in Table S1 shows that Scotogramma trifolii, Tryporyza incertulas, and Agrotis ipsilon exhibit the lowest AP@50 values. Among them, Scotogramma trifolii records an AP@50 of only 0.4335, with an identification accuracy of merely 0.3524. These results are consistent with the confusion patterns observed in the confusion matrix, further confirming the model's difficulty in accurately recognizing these specific categories.
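For context, the (true class, predicted class) pairs that populate such a confusion matrix can be produced by a simplified greedy matcher at IoU ≥ 0.5. This is a sketch only: production evaluators additionally sort predictions by confidence and account for unmatched boxes as false positives or negatives.

```python
def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def confusion_pairs(preds, gts, thr=0.5):
    """Greedily match each prediction (class, box) to an unused ground
    truth with IoU >= thr; return (true_class, predicted_class) pairs."""
    used, pairs = set(), []
    for pred_cls, pred_box in preds:
        best_i, best_iou = None, thr
        for i, (gt_cls, gt_box) in enumerate(gts):
            score = iou(pred_box, gt_box)
            if i not in used and score >= best_iou:
                best_i, best_iou = i, score
        if best_i is not None:
            used.add(best_i)
            pairs.append((gts[best_i][0], pred_cls))
    return pairs
```

Tallying these pairs into a class-by-class table yields the matrix from which the cross-species confusions discussed above are read off.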
To visually demonstrate the model's field detection performance, selected test samples were annotated with ground-truth and predicted bounding boxes (Figure 7). The results show that YOLOv11 accurately detects the majority of pest instances, with predicted boxes closely overlapping and aligning well with the ground-truth annotations. In images with clear pest features and minimal background interference, the model achieved detection results nearly identical to manual annotations, reflecting strong feature extraction and spatial localization capabilities.
In more challenging scenarios with overlapping pests or uneven illumination, YOLOv11 effectively distinguished adjacent individuals and produced reasonable predictions, demonstrating robustness in dense-object environments. Minor false positives and missed detections mainly occurred in regions where pest coloration closely matched the background or individuals were partially occluded, reflecting inherent challenges related to sample distribution and subtle small-object features. Overall, YOLOv11 exhibited high detection accuracy and stability across multiple pest categories, with predicted results closely aligned with ground-truth annotations, supporting its practical application in automated field pest identification and population monitoring.
4. Discussion
In this study, we constructed the RicePest-30 dataset based on field insect images collected using UV light traps, supplemented by laboratory close-up images and a limited amount of web-sourced data. The dataset covers 30 major rice pest species and contains 8848 images with 62,227 annotated bounding boxes. Its design and construction exhibit notable characteristics and research value.
Compared with earlier datasets such as Pest24, which adopted circular framing, RicePest-30 uniformly uses square compositions. This avoids the black-edge regions caused by circular fields of view and enhances spatial completeness and effective image utilization. Moreover, the square layout captures richer spatial context, such as pest distribution patterns and environmental background, enabling models to extract more informative features. These properties significantly improve dataset robustness and representativeness under real-world field conditions with variable illumination, complex backgrounds, and overlapping insects. In addition, the dataset implements refined bounding-box annotations and strict verification procedures to ensure label accuracy and consistency. It also preserves natural class distribution and diversity in capture conditions (e.g., illumination, angle, distance), providing a realistic, reproducible, and extensible benchmark platform for pest detection research.
The YOLOv11 model trained on RicePest-30 achieved strong performance, with a mAP@50 of 0.7550. The model performs particularly well on categories with clear insect structures but exhibits certain false positives and false negatives on small targets, heavily occluded insects, or images with uneven illumination—issues closely related to the morphological complexity of pest species and class imbalance within the dataset [37].
Comparisons with previous versions (YOLOv5 and YOLOv8) demonstrate that YOLOv11 achieves substantial improvements in both accuracy and inference speed, increasing mAP@50 by approximately 1.8% and 16.36%, respectively, and improving inference speed by about 12% [38]. These gains largely stem from its enhanced feature extraction and fusion modules (e.g., C2f and GELAN) and improved multi-scale detection strategies [39]. In contrast, two-stage models such as Faster R-CNN retain advantages when detecting larger or sparsely distributed pests but suffer from slow inference and reduced localization accuracy in dense small-object scenarios. Although the RetinaNet model achieves very high precision for certain pest categories, its performance is poor across several other categories, accompanied by a substantial risk of false positives, resulting in insufficient overall detection balance.
Visualization results further confirm that YOLOv11 produces bounding boxes highly consistent with ground truth, demonstrating accurate spatial localization and strong discrimination ability. Even under challenging conditions such as illumination variation and cluttered backgrounds, the model can effectively differentiate adjacent individuals, reflecting robust generalization performance. The remaining errors are mainly associated with insects blending into the background or partially occluded, which may be mitigated through targeted sample enrichment and small-object optimization strategies [40]. However, when the IoU threshold is raised to 0.5:0.95, the AP values for several pest categories drop markedly, indicating that the model's boundary regression and localization precision still have room for improvement under stricter criteria. Overall, while YOLOv11 demonstrates clear advantages in overall performance, stability, and practical applicability, its localization robustness for small targets, morphologically similar categories, and high-IoU conditions remains somewhat limited.
YOLOv11 demonstrates stable and consistent performance in counting tasks under dense field conditions. The counting experiment was conducted on 50 test images selected to represent challenging scenarios, including high insect density, multi-species coexistence, and partial target overlap. These images cover a wide range of infestation levels, with pest counts per image spanning from low-density to highly crowded conditions, thereby enabling a focused evaluation of counting robustness beyond standard detection metrics.
Based on regression analysis, YOLOv11 achieved a total counting error of 17 instances across the selected images. The corresponding regression slope (1.01), intercept (−0.67), coefficient of determination (R2 = 0.96), and RMSE (3.68) indicate a strong linear agreement between predicted and ground-truth counts. In comparison, YOLOv5 and Faster R-CNN exhibited larger counting deviations, while YOLOv8 and RetinaNet showed substantially broader error distributions, with long-tailed absolute errors exceeding 30–40 insects in multiple samples. In particular, RetinaNet displayed pronounced over- and under-counting in high-density scenes, suggesting limited stability when insect overlap becomes severe.
It should be noted that the counting experiment focuses primarily on dense and complex scenes, where detection ambiguity and object overlap are most pronounced. While this setting highlights the relative robustness of YOLOv11 in high-density scenarios, model behavior under sparse or moderately populated conditions may differ. Therefore, the reported results should be interpreted as evidence of improved counting stability under challenging field conditions, rather than as a universal advantage across all insect density levels.
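The regression diagnostics reported above (slope, intercept, R², RMSE) can be computed from paired per-image counts with a least-squares sketch; the inputs here would be the manual and predicted counts for the 50 test images.

```python
import numpy as np

def count_regression(true_counts, pred_counts):
    """Fit pred = slope * true + intercept and report R^2 of the fit
    plus the RMSE of predicted counts against manual counts."""
    x = np.asarray(true_counts, dtype=float)
    y = np.asarray(pred_counts, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)           # least-squares line
    resid = y - (slope * x + intercept)
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(np.mean((y - x) ** 2)))     # error vs. manual counts
    return slope, intercept, r2, rmse
```

A slope near 1 with an intercept near 0, as reported for YOLOv11, indicates neither systematic over- nor under-counting across density levels.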
Analysis of the confusion matrix alongside AP@50 metrics reveals that while YOLOv11 performs well across most pest categories, it faces substantial challenges with morphologically similar or poorly delineated classes. For instance, pests such as Geometridae, Sesamia inferens, and Spodoptera spp. show elevated misclassification rates, indicating that even at an IoU threshold of 0.5, the model's discriminative capacity remains constrained. These pests exhibit minimal differences in fine-grained features such as wing venation, texture patterns, and body coloration, and their appearances in field images are often complicated by variable postures, occlusions, and lighting conditions, resulting in highly overlapping feature spaces. Consequently, AP@50 values for species like Geometridae, Sesamia inferens, and Agrotis ipsilon are notably low, consistent with the high misclassification rates observed in the confusion matrix. Collectively, these "confusing classes" highlight inherent limitations of current deep detection models in handling morphologically similar, small-sample, and boundary-ambiguous targets, and emphasize the need for future work to improve performance via multi-scale feature enhancement, fine-grained representation learning, and strategies addressing class imbalance.
Overall, YOLOv11 offers a strong combination of high accuracy, robustness, and inference efficiency across pest detection and counting tasks, making it suitable for automated field pest monitoring. Nonetheless, its performance is still constrained by class imbalance, small-object detection challenges, and feature drift under varying illumination. It should also be noted that YOLOv11 benefited from task-specific hyperparameter tuning, whereas the benchmark models were trained using default configurations from their official implementations; although this reflects common practice in applied studies, it may introduce a degree of bias in direct performance comparison. Future work may therefore explore more homogeneous optimization strategies across models to further strengthen comparative fairness. Beyond this, the remaining limitations may be addressed by: (1) integrating structures with improved global feature modeling capability [41]; (2) expanding sample diversity via adaptive data augmentation or generative adversarial networks (GANs); and (3) incorporating multimodal information, such as meteorological factors and time-series pest data, to enable spatiotemporal pest prediction [42].