1. Introduction
China is one of the world's cradles of rice cultivation and its largest rice producer [1,2]. Yunnan is an important rice cultivation and breeding base in the plateau region of China, with a rice cultivation history of over 4000 years and abundant rice germplasm resources [3]. The rice industry not only bears on national food security but also plays an important role in cultivating disease-resistant strains and developing specialty rice varieties. However, Yunnan's terrain is dominated by mountains and plateaus, with mountainous areas accounting for 88.64% of the province's total area [4]. This terrain not only leaves rice cultivation less mechanized and automated than in plain areas but also poses greater challenges for the precise monitoring of breeding experimental fields and for disease prevention and control in terraced areas [5,6,7]. Moreover, the diversity of altitude, climate, and biodiversity makes the occurrence, development, and outbreak of rice diseases difficult to predict, directly affecting the efficiency of screening disease-resistant strains during breeding and the stability of field yields.
Accurate disease monitoring is crucial for scenarios such as disease-resistance identification in rice breeding experimental fields and fixed-point monitoring of mountain terraces. During breeding, tiny disease spots must be identified precisely to screen excellent disease-resistant strains and enable early prevention and control; in field disease control, visually similar diseases must be distinguished quickly so that targeted strategies can be formulated [8,9]. However, traditional manual detection relies on expert experience: it is inefficient, misses small-target diseases at high rates, and frequently misjudges similar diseases, hindering disease-resistant breeding and delaying field disease control [10]. Therefore, developing a rice disease detection technology that supports fixed-point detection with high accuracy and the ability to detect small targets is of great significance for ensuring stable rice yields and improving breeding efficiency.
The rapid development of artificial intelligence has pushed rice disease detection from traditional expert judgment toward automated, intelligent monitoring. Early studies mainly combined traditional computer vision with machine learning: image processing techniques such as threshold segmentation and edge detection, together with manually designed features, were used to separate disease spots, and machine learning algorithms such as SVM and KNN then classified those features [11,12,13,14]. Although these methods achieved good early results, they were limited by hand-crafted features and struggled to adapt to changing lighting, complex backgrounds, or inconspicuous disease characteristics, constraining their robustness and accuracy [15]. The rise of deep learning, especially the success of convolutional neural networks (CNNs) and the Vision Transformer in general image recognition, has transformed rice disease detection research [16,17]. Improved CNN-based architectures such as VGG16 [18] and ResNet [19], as well as Transformer-based architectures such as ViT [20] and Swin Transformer [21], can better learn deep, high-dimensional disease features and thereby improve recognition accuracy. However, classification models can only determine whether a disease is present in an image; they cannot provide the specific location and extent of the disease spots, so they fall short in applications that require precise lesion localization, such as disease-resistance screening or severity assessment.
To overcome these difficulties, researchers have gradually turned to object detection, which provides category and location information simultaneously. Existing studies are mainly based on mainstream detection frameworks such as YOLO and Faster R-CNN [22] and adapt these models to rice disease detection. For example, within the YOLO framework, models pre-trained on large-scale general datasets such as COCO and ImageNet-1K are used for transfer learning [23,24] to accelerate convergence on specific disease datasets and improve baseline performance. Meanwhile, to strengthen the extraction of disease-specific features, researchers have improved baseline models from several directions. First, attention mechanisms such as GAM [25], Triplet Attention [26], and CBAM [27] are introduced so that the model adaptively focuses on lesion areas and suppresses interference from complex backgrounds. Second, convolution modules are modified to enhance the perception of lesions at different scales, especially tiny lesions [28,29]. Third, loss functions such as Wise-IoU [30,31] and DIoU loss [32] are introduced to address the class imbalance in field scenarios, where background samples far outnumber lesion samples, or to improve localization accuracy. These improvements have achieved good results on specific datasets, moving disease monitoring from category judgment toward precise localization. However, despite this progress, when applied in complex real-world field environments, missed detections, false detections, and low accuracy still commonly occur for early tiny lesions or diseases with similar visual features [33,34]. These problems restrict the practical application of automation in key processes such as disease-resistant breeding screening and precise field management.
To further break through the detection bottleneck of a single model in complex scenarios and enhance detection stability, this study explores the applicability of ensemble learning to rice disease detection. Ensemble learning is widely used in traditional machine learning; its core idea is to make joint decisions by constructing and combining multiple diverse base learners [35,36]. By effectively combining the strengths of different models, it can achieve better generalization and robustness than any single model. However, relatively few studies have applied ensemble strategies to deep learning object detection. For the rice disease detection task in this study, integrating multiple high-performance detectors helps overcome interference from complex backgrounds, resolve misjudgments of similar diseases, and ultimately raise the upper limit of detection accuracy and stability [37].
However, designing an effective integration strategy and fusing prediction results according to the specific requirements of rice disease detection remains a major challenge [38]. To address this, this paper proposes an ensemble learning method based on post-processing integration to tackle missed detections in rice disease detection. The detection boxes of multiple detection models are integrated at the post-processing stage, and the results are optimized and merged using the Weighted Boxes Fusion (WBF) method [39]. Ultimately, high-precision detection of rice diseases under complex backgrounds, especially of tiny and visually similar lesions, is achieved. The innovations and contributions of this paper are as follows:
To address the significant gap between general datasets and the characteristics of agricultural diseases, domain-adaptive pre-training was carried out on the PlantDoc plant-disease dataset. This provided a more targeted feature foundation for the subsequent disease detection model and effectively enhanced its sensitivity to disease characteristics.
To address the missed detection of tiny lesions and hard-to-distinguish samples common in rice diseases, a P2 detection head was introduced into YOLOv8s-transfer to improve small-target detection, and the EMA mechanism and the Focal loss function were combined to strengthen key features and focus training on difficult samples.
To break through the bottlenecks of single models on complex backgrounds and visually similar diseases, an ensemble detection framework based on Weighted Boxes Fusion was designed and implemented. By integrating three high-performance single detectors through WBF post-processing, the accuracy ceiling and robustness of the final detection model were improved.
In this study, a high-precision, high-robustness ensemble detection method was constructed through multiple stages, benefiting the accurate identification of disease resistance in breeding experimental fields and the fixed-point monitoring of diseases in mountain terraces. This has important practical significance for improving rice breeding efficiency, ensuring stable rice yields in plateau areas, and promoting the application of smart agriculture in complex scenarios.
2. Materials and Methods
The overall technical roadmap of this study is shown in Figure 1 and is divided into three parts: data collection, data processing, and model construction. First, data on five rice diseases, namely Bacteria blight, Blast, Brown spot, Entyloma, and Tungro, and one pest, Rice planthopper, were obtained through two channels: self-collection and public datasets. Blurred and duplicate images were then deleted, the disease images were annotated with LabelImg, and the dataset was expanded through five data augmentation methods. Subsequently, five architectures, YOLOv8, YOLOv9, Faster-RCNN, RT-DETR, and EfficientDet, were used to construct disease detection models. After the YOLO framework was selected for subsequent experiments, YOLOv8 and YOLOv9 detection models were first trained on the PlantDoc dataset; the resulting pre-trained weights were then used for training on the dataset constructed in this study, yielding the YOLOv8-transfer and YOLOv9-transfer models. Next, the YOLOv8s-transfer model was improved by adding a P2 small-target detection head and the EMA mechanism and by changing the loss function to Focal loss, aiming to improve its detection ability and prediction accuracy for small-target diseases. To further improve performance, the prediction results of the three base models with the highest mAP_0.5 were integrated in the post-processing stage via ensemble learning; by comparing four post-processing methods, NMS, SNMS, NMW, and WBF, the Ensemble-WBF model was finally constructed. Finally, the advantages and disadvantages of the single models and the Ensemble-WBF model were compared, and the results were analyzed and discussed.
2.1. Data Acquisition
In this study, rice disease and pest data were collected in two ways. First, in May 2024, images were collected at the Rice Research Institute of Yunnan Agricultural University, covering rice planted both inside and outside the greenhouse. Using mobile devices, images of four diseases (Bacteria blight, Blast, Brown spot, Entyloma) and one pest (Rice planthopper) were taken at a distance of 15–30 cm from the diseased leaves. Each image had a resolution of 3120 × 4160 pixels, for a total of 3688 images. To ensure data quality, blurred and duplicate images were removed through manual screening, leaving 2757 high-quality disease and pest images. Second, to improve the model's robustness and adaptability to different natural environments, a rice disease dataset created by Sethy et al. [40] was obtained from Mendeley Data. This dataset is mainly intended for classification tasks and includes four diseases (Bacteria blight, Blast, Brown spot, Tungro) in real field scenes. To adapt it to the rice disease detection task of this study, 2634 images with complex backgrounds were selected from it. In total, 5391 original images of rice diseases and pests were obtained.
Figure 2 shows the collected images of five diseases and one pest.
2.2. Image Processing and Dataset Construction
To construct a rice disease detection model with high accuracy and strong generalization, this study uses data augmentation to simulate rice disease images at different angles and under different lighting conditions while also addressing the limited data volume. First, the 2634 images from the Mendeley dataset were augmented twice using vertical flipping and rotation, resulting in 7902 disease images. In addition, to improve the model's robustness and adaptability to different environments and to reduce the long-tail effect on performance, random augmentation was applied to the manually collected images of Bacteria blight, Blast, Brown spot, and Entyloma, which are relatively scarce, using five methods: vertical flipping, random brightness and contrast adjustment, horizontal flipping, random gamma adjustment, and random hue adjustment. Taking Entyloma as an example, the augmentation effects are shown in Figure 3. The probability of each augmentation method was uniformly set to 0.5, and 3618 images were obtained after two rounds of random augmentation. Finally, the dataset was randomly split into training and validation sets at a ratio of 8:2. The positions and categories of all pests and diseases were manually annotated with the LabelImg software (1.8.1) and saved in the YOLO format. The numbers of pest and disease images and labels in the dataset are detailed in Table 1.
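As a concrete illustration, the per-image random augmentation described above can be sketched in pure Python. The function names, the nested-list image representation, and the parameter ranges below are illustrative assumptions, not the pipeline actually used in this study, and the random hue shift is omitted because it requires an RGB-to-HSV conversion:

```python
import random

random.seed(0)

def vertical_flip(img):
    # img: H x W x 3 nested lists of 0-255 ints; flip top-to-bottom.
    return img[::-1]

def horizontal_flip(img):
    # Flip left-to-right within each row.
    return [row[::-1] for row in img]

def _map_pixels(img, fn):
    # Apply fn to every channel value, preserving the image layout.
    return [[[fn(c) for c in px] for px in row] for row in img]

def adjust_brightness_contrast(img, alpha, beta):
    # alpha scales contrast, beta shifts brightness; clip to [0, 255].
    return _map_pixels(img, lambda c: max(0, min(255, round(alpha * c + beta))))

def adjust_gamma(img, gamma):
    # Gamma correction on normalized pixel values.
    return _map_pixels(img, lambda c: max(0, min(255, round(255 * (c / 255) ** gamma))))

def random_augment(img, p=0.5):
    # Each transform fires independently with probability p, mirroring the
    # per-method probability of 0.5 used in this study.
    if random.random() < p:
        img = vertical_flip(img)
    if random.random() < p:
        img = horizontal_flip(img)
    if random.random() < p:
        img = adjust_brightness_contrast(img, random.uniform(0.8, 1.2), random.uniform(-20, 20))
    if random.random() < p:
        img = adjust_gamma(img, random.uniform(0.7, 1.5))
    return img
```

Note that for detection datasets the flips must also be applied to the YOLO-format bounding-box labels, not just the pixels.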
2.3. Model Construction
2.3.1. YOLOv8s Disease Detection Model Based on Transfer Learning of PlantDoc
Figure 4 illustrates the feature extraction and prediction workflow of YOLOv8s. The input image first undergoes feature extraction in the backbone; after passing through multiple convolutional layers, feature maps at multiple scales are generated. These feature maps are then fused in the neck module to capture context at different levels, supporting the final detection task. Finally, the head module outputs predictions from feature maps of different sizes [41].
The backbone consists of CBS, C2f, and SPPF modules. The backbone networks of both YOLOv5 and YOLOv8 are based on the CSPDarkNet53 architecture [42]. Unlike YOLOv5, YOLOv8 replaces the C3 module with the C2f module, which remains lightweight while providing richer gradient-flow information, improving convergence speed and performance.
The head of YOLOv8 differs from those of YOLOv3 and YOLOv5. YOLOv3 and YOLOv5 adopt coupled heads with anchors, whereas YOLOv8 uses the same decoupled, anchor-free head as YOLOX. The decoupled head removes the objectness prediction branch and splits into two branches for bounding-box (bbox) regression and category prediction, extracting positional and classification features separately; each branch then completes its localization or classification task through convolutional layers, improving detection accuracy and accelerating convergence. For these two tasks, YOLOv8 employs different loss functions: Binary Cross-Entropy loss (BCE loss) for classification, and Distribution Focal Loss (DFL) together with CIoU for bounding-box regression.
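For readers unfamiliar with the regression loss, a minimal scalar sketch of the CIoU term is shown below. This is illustrative only; the Ultralytics implementation differs in details such as batching and numerical safeguards. CIoU augments IoU with a center-distance penalty and an aspect-ratio consistency term:

```python
import math

def ciou_loss(box_p, box_g):
    """Boxes are (x1, y1, x2, y2) with positive width/height.
    Returns (IoU, CIoU loss = 1 - IoU + distance + aspect penalties)."""
    # Intersection and union areas.
    ix = max(0.0, min(box_p[2], box_g[2]) - max(box_p[0], box_g[0]))
    iy = max(0.0, min(box_p[3], box_g[3]) - max(box_p[1], box_g[1]))
    inter = ix * iy
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    iou = inter / (area_p + area_g - inter + 1e-9)
    # Squared center distance over squared enclosing-box diagonal (DIoU term).
    cxp, cyp = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cxg, cyg = (box_g[0] + box_g[2]) / 2, (box_g[1] + box_g[3]) / 2
    rho2 = (cxp - cxg) ** 2 + (cyp - cyg) ** 2
    ex = max(box_p[2], box_g[2]) - min(box_p[0], box_g[0])
    ey = max(box_p[3], box_g[3]) - min(box_p[1], box_g[1])
    c2 = ex ** 2 + ey ** 2 + 1e-9
    # Aspect-ratio consistency term.
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wg, hg = box_g[2] - box_g[0], box_g[3] - box_g[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return iou, 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes yield a loss near zero, while shifted boxes are penalized for both lower overlap and larger center distance.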
In addition, transfer learning can leverage knowledge gained from one task and apply it to another related task, thereby improving performance [43]. When data is limited, it allows the model to transfer knowledge from a pre-trained model rather than training from scratch [44]. For disease detection, we first trained the YOLO model on the publicly available PlantDoc dataset [45], which is tailored to plant disease detection, and then used the resulting pre-trained weights on the rice disease detection dataset of this experiment, achieving more efficient feature learning on the limited rice disease samples and improving detection accuracy.
2.3.2. EMA Mechanism
The efficient multi-scale attention (EMA) module [46] aims to enhance the feature representation ability of convolutional neural networks (CNNs) in computer vision tasks such as image classification and object detection. Its architecture focuses on retaining information in each channel while reducing computational overhead. The EMA module is implemented in three steps: feature grouping, parallel sub-networks, and cross-spatial learning. Feature grouping and the parallel sub-networks improve computational efficiency, while cross-spatial learning combines features of different scales with global context. A Matmul operation performs the weighted fusion of cross-spatial features; through matrix operations on global and local features, it strengthens the representation of key disease regions. The structure of the EMA mechanism is shown in Figure 5.
2.3.3. Focal Loss
Object detection methods typically use prior boxes to improve prediction performance. An image may generate thousands of prior boxes, but only a small fraction match targets (positive samples), while most match none [47]. This creates an imbalance between positive and negative samples in one-stage object detectors. Focal Loss [48] dynamically adjusts the cross-entropy loss according to prediction confidence, addressing the imbalance from another angle: as the confidence of a correct prediction increases, its loss weight decays toward zero. Training loss therefore concentrates on challenging cases, while the contribution of the many easy instances remains low.
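The down-weighting of easy examples can be seen in a minimal scalar sketch of binary Focal Loss. The defaults α = 0.25 and γ = 2 follow the original paper; this standalone formulation is illustrative, not the YOLOv8 implementation:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.
    p: predicted probability of the positive class; y: label (0 or 1)."""
    p_t = p if y == 1 else 1 - p          # probability assigned to the true class
    a_t = alpha if y == 1 else 1 - alpha  # class-balance weight
    # (1 - p_t)^gamma shrinks the loss of well-classified (high p_t) samples.
    return -a_t * (1 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted cross-entropy; larger γ pushes training to focus harder on misclassified samples.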
2.3.4. Rice Pest and Disease Detection Model Based on Improved YOLOv8s
Although the sample size was increased through local data collection, public datasets, and data augmentation, the limited number of original images in the rice dataset may still constrain the model's generalization ability. Therefore, the backbone, neck, and loss function of YOLOv8s were each improved in this study; the structure of the improved model is shown in Figure 6. Introducing the EMA mechanism after the feature-fusion module SPPF (Spatial Pyramid Pooling-Fast) strengthens the expression of important features, improving stability, reducing instability during training, alleviating overfitting, and enhancing robustness. A P2 detection head was then added as a new output layer; it improves the capture of small-target features and details with only a modest increase in parameters, thereby raising the accuracy of detecting small-target pests and diseases. In addition, Focal loss better fits the target distribution of the dataset, improving detection of difficult samples.
2.3.5. Rice Disease Detection Model Based on Ensemble Learning
In response to the limits of transfer learning and model improvement in enhancing generalization and prediction stability, this study applies ensemble learning to further optimize the overall performance of rice disease detection by integrating multiple object detectors in the post-processing stage. The integration process is shown in Figure 7.
As shown in Figure 7, the proposed method combines multiple trained object detection models by aggregating the large number of candidate bounding boxes they generate during detection, and then applies a suitable post-processing algorithm to select the most appropriate boxes. Applications of ensemble learning in agriculture are still relatively limited. In this study, following the Bagging idea, the best-performing models among the previously constructed pest and disease detectors predict in parallel; each produces prediction boxes with corresponding categories and confidence scores. All prediction boxes are then aggregated, and model fusion is achieved through four commonly used post-processing techniques: Non-Maximum Suppression (NMS) [49], Soft Non-Maximum Suppression (SNMS) [50], Non-Maximum Weighting (NMW) [51], and Weighted Boxes Fusion (WBF) [39]. These methods let different models complement one another, improving overall performance. Notably, NMS is one of the most common post-processing techniques in object detection: it keeps the prediction box with the highest confidence and suppresses boxes with lower confidence that overlap it heavily, selecting the best box from a set of overlapping ones. Soft-NMS is an improved version that, instead of discarding overlapping boxes outright, linearly reduces their confidence scores, retaining useful information. NMW is similar to NMS but assigns a weight to each candidate box, aiming to improve detection under heavy overlap. WBF combines candidate boxes from different models and assigns weights, balancing their predictions and enhancing detection ability.
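The key difference between suppression (NMS) and fusion (WBF) can be sketched with toy implementations. This is a simplified, single-pass version: the reference WBF of Solovyev et al. additionally handles per-model weights and rescales fused scores by the number of contributing models:

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thr=0.6):
    # Keep the highest-score box, drop boxes overlapping it above thr; repeat.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thr]
    return keep

def wbf(boxes, scores, thr=0.6):
    # Greedily cluster overlapping boxes, then fuse each cluster into one box
    # whose coordinates are the confidence-weighted average (nothing is discarded).
    clusters = []
    for i in sorted(range(len(boxes)), key=lambda k: -scores[k]):
        for c in clusters:
            if iou(boxes[c[0]], boxes[i]) >= thr:
                c.append(i)
                break
        else:
            clusters.append([i])
    fused = []
    for c in clusters:
        w = sum(scores[i] for i in c)
        coords = tuple(sum(scores[i] * boxes[i][k] for i in c) / w for k in range(4))
        fused.append((coords, w / len(c)))  # fused box with averaged score
    return fused
```

Where NMS discards all but one of a set of overlapping predictions, WBF blends their coordinates, which is why fusing boxes from several detectors can correct the localization errors of any single one.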
2.4. Experimental Setup
The hardware configuration of this experiment includes an Intel(R) Core(TM) i9-13900K CPU, 32 GB of memory, and an NVIDIA RTX 2080 Ti 11 GB GPU. The software runs on the Windows 10 operating system. All programs are executed under Python 3.11 and the deep-learning framework PyTorch 2.0.1, with the NVIDIA CUDA 11.8 parallel computing driver used to accelerate training. The models are trained with the YOLOv8 and YOLOv9 algorithms defined in Ultralytics version 8.2.36. The parameters for each model are set as follows: 300 epochs, a batch size of 16, an initial learning rate of 0.1, a final learning-rate scaling factor of 0.001, and the AdamW optimizer with a weight decay of 5 × 10^-4; the thresholds for all post-processing methods are set to 0.6; data pre-processing includes pixel-value normalization (/255) and mosaic augmentation (enabled during training), and a cosine annealing strategy is adopted for learning-rate adjustment.
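Under these settings, the learning-rate schedule can be sketched as follows, assuming the Ultralytics convention in which lr0 is the initial rate and lrf the final-rate fraction (the trainer internals may differ in warmup handling):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr0=0.1, lrf=0.001):
    """Cosine-annealed learning rate: decays smoothly from lr0 at epoch 0
    to lr0 * lrf at the final epoch."""
    return lr0 * (lrf + (1 - lrf) * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs)))
```

With the values above, the rate starts at 0.1 and ends at 0.1 × 0.001 = 1e-4, falling fastest in the middle of training.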
2.5. Evaluation Metrics
The YOLO series includes versions such as n(t), s, m, l, and x. Among them, the n(t) and s versions offer excellent real-time performance, maintaining high detection accuracy while ensuring real-time responsiveness, so this study focuses on them when evaluating model accuracy and efficiency on the disease detection task. Correspondingly, EfficientDet-D0, the variant with the fewest parameters, is selected for EfficientDet. Since the Ultralytics library provides only two RT-DETR sizes (L and X), RT-DETR-L is used in the experiments. Three indicators, Precision, Recall, and mAP_0.5, are selected to evaluate model performance.
- (1) Precision
Precision characterizes the proportion of samples that are actually positive among those predicted to be positive, as given in Equation (2):

Precision = TP / (TP + FP)    (2)

where TP (True Positives) denotes instances the model correctly classifies as positive, and FP (False Positives) denotes negative samples the model erroneously predicts as positive.
- (2) Recall
The recall rate characterizes the proportion of samples correctly classified as positive among all samples that are actually positive, as given in Equation (3):

Recall = TP / (TP + FN)    (3)

where FN (False Negatives) denotes positive samples the model erroneously classifies as negative, and TN (True Negatives) denotes negative samples the model predicts correctly.
- (3) Mean Average Precision (mAP)
The mean average precision (mAP) is derived from the average precision (AP); the calculation formulas for AP and mAP are given in Equations (4) and (5) [16]. Average precision describes the prediction accuracy for each class, computed as the area enclosed by that class's precision-recall curve and the coordinate axes; mean average precision is the average of the APs over all classes [16]. Generally, a higher mAP indicates better model performance.

AP = Σ_j (R_j − R_(j−1)) · P_inter(j)    (4)

mAP = (1/N) Σ_(i=1)^N AP_i    (5)

where R_j denotes the recall at the j-th recall threshold point, P_inter(j) is the interpolated precision at the j-th recall threshold, N is the total number of categories, and AP_i is the Average Precision (AP) value of the i-th category.
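Equations (4) and (5) can be sketched directly in code. This is an illustrative all-point interpolation; real evaluation tools additionally handle confidence ranking and IoU matching of detections to ground truth:

```python
def average_precision(recalls, precisions):
    """AP as the area under the interpolated precision-recall curve.
    recalls must be sorted ascending; precision is interpolated to be
    monotonically non-increasing (P_inter), matching Eq. (4)."""
    # Interpolation: P_inter(j) = max precision at any recall >= R_j.
    p_inter = list(precisions)
    for j in range(len(p_inter) - 2, -1, -1):
        p_inter[j] = max(p_inter[j], p_inter[j + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, p_inter):
        ap += (r - prev_r) * p  # (R_j - R_(j-1)) * P_inter(j)
        prev_r = r
    return ap

def mean_average_precision(ap_per_class):
    # mAP: arithmetic mean of per-class AP values (Eq. (5)).
    return sum(ap_per_class) / len(ap_per_class)
```

For a detector with precision 1.0 up to recall 0.5 and 0.5 thereafter, AP evaluates to 0.5 × 1.0 + 0.5 × 0.5 = 0.75.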
The Intersection over Union (IoU) is a widely used evaluation metric in object detection. It measures the degree of overlap between the predicted bounding boxes and the actual bounding boxes; its value ranges from 0 to 1, with higher values indicating greater overlap and thus more accurate detection. IoU is calculated by dividing the overlapping area of the predicted and actual bounding boxes by the total area of their union, and the result is compared with a predetermined threshold to decide whether to retain the predicted box. The calculation formula is as follows:

IoU = area(B_gt ∩ B_pred) / area(B_gt ∪ B_pred)

where B_gt denotes the true bounding box and B_pred the predicted bounding box; B_gt ∩ B_pred denotes their intersection area and B_gt ∪ B_pred their union area.
3. Results and Discussion
3.1. Analysis of Experimental Results of Different Object Detection Models
To screen object detection models suitable for the improvement and integration experiments, this study selects five models with different architectures, namely Faster-RCNN, EfficientDet, YOLOv8, YOLOv9, and RT-DETR, and compares three accuracy indicators (mAP_0.5, Precision, and Recall) and two efficiency indicators (Params (M) and GFLOPs) on the self-built rice pest and disease dataset. All models are tested under the same parameter configuration, and the results are shown in Table 2. Faster-RCNN and EfficientDet-D0 are trained through the MMDetection framework, while YOLO and RT-DETR-L are trained via Ultralytics.
As shown in Table 2, models with different architectures differ significantly in accuracy and efficiency. RT-DETR achieves the highest detection accuracy, with mAP_0.5 of 0.869 and Recall of 0.848, absolute improvements of 0.9% and 4.3% over YOLOv9s. This may be because the Transformer architecture captures global features more strongly, performing especially well on small-scale rice diseases. However, its Precision of 0.854 is slightly lower than that of YOLOv9s, and its 32.00 M parameters and 103.5 GFLOPs are 4.5 and 3.9 times those of YOLOv9s; this high computational cost makes it ill-suited to batch detection. The traditional two-stage model Faster-RCNN performs worst, with an mAP_0.5 of only 0.791; moreover, its two-stage candidate-box generation is poorly compatible with the post-processing integration methods used in this study, such as WBF and NMS, making integration relatively difficult. Although the real-time-oriented EfficientDet has the lowest GFLOPs at 3.61, its mAP_0.5 is only 0.804, indicating that its lightweight structure struggles to capture the subtle features of rice diseases and lacks detection stability.
In contrast, the YOLO series strikes a better balance between accuracy and efficiency. YOLOv9s reaches an mAP_0.5 of 0.86 and a Precision of 0.858, absolute improvements of 1.6% and 3.7% over YOLOv8s, while the Recall values of the two models are close. Meanwhile, its 7.17 M parameters and 26.7 GFLOPs represent decreases of 35.6% and 6.0% relative to YOLOv8s, indicating better performance after architecture optimization. Even the lightweight YOLOv8n and YOLOv9t, whose mAP_0.5 values of 0.818 and 0.813 are lower than those of YOLOv8s and YOLOv9s, have significantly fewer parameters and GFLOPs, and their overall performance remains superior to Faster-RCNN and EfficientDet, further validating the adaptability of the YOLO series.
In summary, the YOLO series offers advantages in accuracy, efficiency, and ease of integration. Moreover, because it is developed on the Ultralytics library, later versions such as YOLOv10 and YOLOv11 can be adopted directly without reconstructing the data pre-processing and integration interfaces, guaranteeing the scalability of this study. Therefore, the YOLO series is selected for the subsequent improvement and integration experiments to develop a high-precision disease detection framework for specialized agricultural scenarios.
3.2. Performance Analysis of YOLO Pre-Trained Model Based on PlantDoc Dataset
3.2.1. PlantDoc Pre-Trained Model
The default pre-trained models of the YOLO series are built on the COCO dataset, whose general targets, such as vehicles and pedestrians, differ markedly from the characteristics of rice diseases, easily leading to insufficient feature adaptation in agricultural scenarios. To enhance the model's ability to learn plant disease characteristics, this study pre-trains on the PlantDoc plant pest and disease dataset, laying a foundation for subsequent transfer learning. The training results are shown in Table 3.
As can be seen from Table 3, YOLOv9s achieved the best performance on the PlantDoc dataset, with mAP_0.5, Precision, and Recall of 0.667, 0.625, and 0.619, respectively, possibly because it improves on YOLOv8's feature extraction and fusion mechanisms and thus mines disease details more effectively. Notably, the mAP_0.5 and Precision of YOLOv8n are higher than those of YOLOv8s, but its Recall is markedly lower. The streamlined architecture of YOLOv8n may reduce the redundant features introduced by more complex networks, lowering the probability of misjudging non-disease areas and yielding higher precision; however, its limited capacity for extracting shallow features makes it hard to cover the scattered small disease areas in an image, causing missed detections of real disease targets and ultimately reducing Recall. In contrast, YOLOv9t reduced missed detections through structural improvements, but its lightweight structure limits disease-feature extraction, leading to lower precision.
Table 4 presents the detection results of the top five categories for YOLOv9s on the PlantDoc dataset. The mAP_0.5 values for Apple leaf and Apple rust reach 0.902 and 0.912, respectively, indicating that the model has strong feature-learning ability for single-category targets. These results validate the effectiveness of domain-related pre-training, suggesting that close alignment between source-domain crop diseases and target-domain rice diseases is an effective strategy for improving model performance [56,57]. By introducing plant disease features from the PlantDoc dataset, the feature gap between the general targets of the COCO dataset and rice diseases is avoided, laying a foundation for subsequent transfer learning. However, the current pre-training covers only a variety of crop diseases; in the future, pre-training samples of different rice disease types can be added on the basis of this study’s dataset to further enhance feature adaptation.
3.2.2. Validation of the Effectiveness of Transfer Learning
To verify the impact of the pre-trained weights from the PlantDoc dataset (transfer learning) on model accuracy, this study conducted a comparative transfer learning experiment on the rice disease dataset. The experimental results are shown in Table 5.
According to the results in Table 5, all indicators improved after the models were pre-trained on PlantDoc, which verifies the effectiveness of transfer learning. Without the PlantDoc pre-trained weights, YOLOv9s performed best, with mAP_0.5, Precision, and Recall of 0.86, 0.858, and 0.805, respectively, absolute improvements of 1.6%, 3.7%, and 0.3% over YOLOv8s. After applying the PlantDoc pre-trained weights, the mAP_0.5, Precision, and Recall of every model increased; for YOLOv9s, the absolute improvements over its original counterpart were 2.3%, 1.1%, and 4.0%, respectively. YOLOv8s benefited even more from transfer learning, with absolute improvements of 3.2%, 4.7%, and 4.1% in the three indicators, indicating its strong optimization potential under transfer learning.
3.3. Improved YOLOv8s for Rice Disease Detection
Based on the transfer learning results, this study selects YOLOv8s-transfer as the base model for improvement. The improved model, Improved YOLOv8s-transfer, is constructed by introducing the small-target detection head P2, the EMA attention mechanism, and the Focal Loss function. To verify the effectiveness of each component, this study conducts an ablation experiment using the control-variable method, gradually adding components and evaluating the resulting performance changes. The results of the ablation experiment are shown in Table 6.
The ablation results in Table 6 clearly demonstrate the contribution of each improvement. After adding the P2 detection head to the baseline model, the mAP_0.5 increased from 0.876 to 0.888, an absolute improvement of 1.2%; however, the GFLOPs increased from 28.4 to 37.0, an approximately 30.3% increase in computational overhead. This indicates that the P2 small-object detection head is crucial for capturing small disease spots such as Entyloma and Brown Spot, compensating for small-object feature loss by making better use of shallow features. After adding the EMA module, the mAP_0.5 further increased to 0.892, an absolute improvement of 0.4%, while both the parameter count and GFLOPs decreased slightly, showing that the efficient parallel sub-network design of EMA strengthens the focus on key disease features while also improving computational efficiency. After introducing Focal Loss, the mAP_0.5 reached 0.899, a further absolute improvement of 0.7%, indicating that by down-weighting easily classified samples and focusing on hard-to-distinguish samples such as Blast and Brown Spot, it effectively reduces false detections of similar diseases. The final Improved YOLOv8s-transfer model achieves an mAP_0.5 of 0.899, an absolute improvement of 2.3% over the baseline, verifying the rationality of the improvement strategy and showing that the proposed model can effectively alleviate feature loss in small-object detection tasks.
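As an illustration, the down-weighting behaviour of Focal Loss can be sketched in a minimal binary form (a simplified sketch of the standard formulation, not the exact multi-class implementation used in the training code):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary Focal Loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p: predicted probability of the positive class; y: label in {0, 1}.
    gamma > 0 shrinks the loss of well-classified samples (p_t near 1),
    shifting the training focus to hard, easily confused samples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p=0.95) contributes far less loss than a hard one (p=0.3),
# which is how the loss keeps Blast/Brown Spot-like hard samples in focus.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
print(easy, hard)
```

Here the `(1 - p_t)**gamma` modulating factor is what suppresses the contribution of easily classified samples; with `gamma = 0` the expression reduces to ordinary weighted cross-entropy.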
To further analyze the optimization effect, Table 7 compares the per-category performance of the improved model and the baseline. The Improved YOLOv8s-transfer shows the largest gains on small-target diseases: the mAP_0.5 of Entyloma is absolutely increased by 6.7% over YOLOv8s-transfer, and that of Brown Spot by 3.3%, directly verifying the synergistic effect of the P2 detection head and the EMA mechanism. In terms of recall, the improved model achieves absolute increases of 0.9%, 1.2%, 1.3%, 8.0%, and 0.7% for Bacterial blight, Brown spot, Tungro, Entyloma, and Rice planthopper, respectively, indicating that it effectively mitigates missed detections. However, for Blast, the mAP_0.5 decreases from 0.852 to 0.837, an absolute reduction of 1.5%. This may be because, although the improved model is optimized for small targets and hard samples, the hard-sample focusing of Focal Loss assigns insufficient learning weight to the easily classified Blast samples, reducing prediction accuracy for this category.
Furthermore, as Figure 8 shows, in the dense small-target scenario of Entyloma, the detection results of YOLOv8s-transfer and YOLOv9s-transfer are similar, with YOLOv9s-transfer missing the most detections when the lesions are very small. In contrast, Improved YOLOv8s-transfer can identify smaller disease spots and detect more small-target diseases, further confirming the improvement strategy’s enhanced ability to capture subtle features.
Although the performance of the improved model has been significantly enhanced, certain limitations remain. Its ability to distinguish visually similar diseases such as Blast and Brown Spot in complex backgrounds is still insufficient, and tiny targets are still missed in extreme backlight environments, so the model cannot yet fully adapt to complex field conditions. This trade-off reflects the challenge of fine-grained classification in visual tasks: like humans, the model may struggle to distinguish visually similar categories [58,59]. In the future, fine-grained feature modules capturing cues such as lesion texture and color depth can be introduced to further reduce false detections of similar diseases.
3.4. Rice Disease Detection Based on Ensemble Learning
To further enhance detection robustness, this study selected the three best-performing models, Improved YOLOv8s-transfer, YOLOv9s-transfer, and YOLOv8s-transfer (original), to construct an ensemble framework, and compared the fusion effects of four post-processing techniques: NMS, SNMS, NMW, and WBF. The experiment recorded the outputs in a JSON format consistent with the COCO dataset. Since the inference time of the ensemble model was long, for efficient evaluation we randomly selected 600 images from the validation set, 100 for each of the six diseases and pests, to construct a representative test subset. The Pycocotools library was used to calculate the AP_0.5 and AR_0.5:0.95 metrics, with an IoU threshold of 0.60 and all post-processing weights set to 1. The experimental results are shown in Table 8.
As shown in Table 8, among the four ensemble strategies, Ensemble-WBF performs best, with an AP_0.5 of 0.922 and an AR_0.5:0.95 of 0.648, significantly outperforming the single models; compared with Improved YOLOv8s-transfer, these represent absolute increases of 2.2% and 3.2%, respectively. WBF generates new bounding boxes by taking a confidence-weighted average of highly overlapping predicted boxes, rather than simply suppressing redundant boxes as NMS does, which allows it to fully integrate the consensus of different models and improve localization accuracy. In contrast, Ensemble-SNMS performs worse than the single model, possibly because its excessive suppression of bounding boxes discards useful information.
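The weighted averaging at the core of WBF can be illustrated with a simplified single-class sketch (a toy version assuming equal model weights and `[x1, y1, x2, y2]` boxes; the experiments used an existing WBF implementation, not this code):

```python
def iou(a, b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def wbf(boxes, scores, iou_thr=0.6):
    """Fuse overlapping boxes by confidence-weighted coordinate averaging.

    Unlike NMS, which keeps one box and discards the rest, every
    overlapping prediction contributes to the fused coordinates;
    the fused confidence here is the mean score of the cluster.
    """
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    clusters = []  # each cluster is a list of box indices
    for i in order:
        for c in clusters:
            if iou(boxes[c[0]], boxes[i]) >= iou_thr:
                c.append(i)
                break
        else:
            clusters.append([i])
    fused = []
    for c in clusters:
        w = sum(scores[i] for i in c)
        coords = [sum(scores[i] * boxes[i][k] for i in c) / w for k in range(4)]
        fused.append((coords, w / len(c)))
    return fused

# Two models predict nearly the same lesion; WBF blends them into one box.
out = wbf([[10, 10, 50, 50], [12, 12, 52, 52]], [0.9, 0.8])
print(out)
```

In this toy example the two overlapping predictions collapse into a single fused box whose coordinates lean toward the higher-confidence model, which is the behaviour that improves localization over plain suppression.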
However, the performance gain of the ensemble model comes at a significant computational cost. The inference times of Ensemble-WBF and Improved YOLOv8s-transfer are 86.021 ms/image and 30.028 ms/image, respectively, roughly 2.9 times the single-model latency, which makes it difficult to meet low-latency scenarios such as real-time monitoring by drones. Nevertheless, in scenarios that prioritize high precision and high recall, such as rice breeding greenhouses, the proposed ensemble learning strategy still has practical value. For instance, by regularly collecting images with fixed cameras and uploading them to the cloud, early disease warning can be significantly strengthened at an average inference speed of 86 ms/image.
In addition, Table 9 further compares the per-category performance of Ensemble-WBF and Improved YOLOv8s-transfer. Ensemble-WBF shows the most significant improvement on easily confused diseases: the AP and AR of Bacterial blight are absolutely increased by 4.2% and 7.0%, respectively, and those of Blast by 3.8% and 3.1%, indicating that the ensemble strategy effectively combines the strengths of different models and reduces single-model misjudgments. For Brown Spot, both AP and AR are absolutely increased by 3.3%, further improving detection performance. However, for the small-target disease Entyloma, the accuracy gain is small, an absolute increase of 0.2% with the average recall unchanged, indicating that the small-target optimization of Improved YOLOv8s-transfer has approached its limit under the current strategy, while the ensemble mainly enhances the ability to distinguish similar disease categories.
Figure 9 intuitively demonstrates the advantages of Ensemble-WBF in complex and dense scenarios. It not only reduces missed detections but also enhances the recognition stability of small and dense targets, verifying the application potential of ensemble learning in agricultural disease detection.
Although Ensemble-WBF demonstrates excellent performance, the high computational complexity of multi-model fusion restricts its deployment on portable devices [60]. In the future, knowledge distillation [61] or pruning techniques [62] can be employed to transfer the detection capability of the ensemble model to a single lightweight model, reducing deployment difficulty while maintaining high accuracy.
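For illustration, the soft-target loss at the heart of knowledge distillation can be sketched as follows (hypothetical logits and Hinton-style temperature scaling; this is a generic sketch, not part of this study’s experiments):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher T yields softer distributions."""
    exps = [math.exp(z / temperature) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """KL divergence between softened teacher and student class outputs.

    Minimizing this pushes a lightweight student model toward the class
    probabilities of a stronger teacher (e.g. an ensemble such as
    Ensemble-WBF), transferring its behaviour into a single model.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

# A student whose logits match the teacher incurs zero loss; a mismatched
# student incurs a positive loss that training would then reduce.
teacher = [2.0, 0.5, -1.0]
print(distillation_loss(teacher, [2.0, 0.5, -1.0]))  # 0.0
print(distillation_loss(teacher, [0.0, 2.0, 0.0]))   # positive
```

In practice this soft-target term is combined with the ordinary hard-label detection loss, but the sketch captures the mechanism by which ensemble behaviour can be compressed into one deployable model.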
3.5. Limitations and Future Prospects
This study has made progress in constructing and optimizing a rice disease detection model for complex environments, but some limitations remain. First, the model’s ability to distinguish visually similar diseases needs improvement; for example, under complex backgrounds there are still recurring false detections between Blast and Brown spot, indicating deficiencies in capturing fine-grained features such as lesion texture and color depth. Second, deployment of the ensemble model is challenging: fusing multiple models increases computational overhead, making deployment on resource-constrained portable devices difficult. Although deployment can be achieved through cloud computing with network-based image capture, this approach is hard to adapt to portable detection devices. In addition, limited by the rice growth cycle, the off-greenhouse cultivation in this study took place in an artificially controlled environment, which differs from real field scenarios with natural weeds and complex lighting. Follow-up work will supplement the data with the next season’s rice cultivation to improve the model’s robustness in field scenarios such as weed backgrounds and backlighting.
Future research will focus on three aspects. First, expand the multi-source dataset by acquiring public rice datasets from different regions to cover disease phenotypes under varied climate and soil conditions, thereby improving the model’s generalization ability. Second, introduce disease-specific feature extraction modules to enhance the recognition of similar diseases. Finally, explore lightweight ensemble strategies: through techniques such as knowledge distillation and model pruning, transfer the detection ability of the Ensemble-WBF model to a single lightweight model, lowering the threshold for deployment. This will promote the large-scale application of high-precision rice disease detection on field portable devices (such as handheld detectors and small drones [63]) and provide efficient, implementable technical support for the early prevention and control of diseases in modern agriculture.
4. Conclusions
To address the missed detection of minor diseases and the misjudgment of similar diseases in rice disease detection under complex field and greenhouse backgrounds, this study collected images of five common rice diseases and one common pest and proposed a phased model optimization method based on transfer learning, model improvement, and ensemble learning. The results show that, to overcome the limitations of a single model, improving the model on the basis of transfer learning, by adding the small-target detection head P2, the EMA mechanism, and the Focal Loss function, is an effective way to reduce missed detections of small-target rice diseases and improve prediction accuracy. On this basis, the ensemble learning method can further address the misjudgment of similar diseases and achieve more accurate disease detection.
The findings of this study contribute to accurate disease detection in core scenarios such as breeding experimental fields in the Yunnan Plateau region and mountainous terraced fields. By constructing an ensemble detection framework, the work holds significant practical implications for enhancing rice breeding efficiency, ensuring food security in the plateau region, and promoting the application of smart agriculture in complex scenarios. Future research can combine methods such as knowledge distillation and pruning to build lightweight single models, further balancing the accuracy and efficiency of the ensemble learning framework.