Article

DRPU-YOLO11: A Multi-Scale Model for Detecting Rice Panicles in UAV Images with Complex Infield Background

1 College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
2 Academy of Contemporary Agriculture Engineering Innovations, Zhongkai University of Agriculture and Engineering, Guangzhou 510225, China
3 Academy of Interdisciplinary Studies, Guangdong Polytechnic Normal University, Guangzhou 510665, China
4 School of Marine Science and Technology, Shanwei Institute of Technology, Shanwei 516600, China
* Authors to whom correspondence should be addressed.
Agriculture 2026, 16(2), 234; https://doi.org/10.3390/agriculture16020234
Submission received: 11 December 2025 / Revised: 10 January 2026 / Accepted: 14 January 2026 / Published: 16 January 2026
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

In the field of precision agriculture, accurately detecting rice panicles is crucial for monitoring rice growth and managing rice production. To address the challenges posed by complex field backgrounds, including variety differences, variations across growth stages, background interference, and occlusion due to dense distribution, this study develops an improved YOLO11-based rice panicle detection model, termed DRPU-YOLO11. The model incorporates a task-oriented CSP-PGMA module in the backbone to enhance multi-scale feature extraction and provide richer representations for downstream detection. In the neck network, DySample and CGDown are adopted to strengthen global contextual feature aggregation and suppress background interference for small targets. Furthermore, fine-grained P2 level information is integrated with higher-level features through a cross-scale fusion module (CSP-ONMK) to improve detection robustness in dense and occluded scenes. In addition, the PowerTAL strategy adapts quality-aware label assignment to emphasize high-quality predictions during training. The experimental results based on a self-constructed dataset demonstrate that DRPU-YOLO11 significantly outperforms baseline models in rice panicle detection under complex field environments, achieving a precision of 82.5%. Compared with the baseline model YOLO11 and RT-DETR, the mAP50 increases by 2.4% and 5.0%, respectively. These results indicate that the proposed task-driven design provides a practical and high-precision solution for rice panicle detection, with potential applications in rice growth monitoring and yield estimation.

1. Introduction

As a major staple crop worldwide, rice provides the principal food supply for large populations across Asia and many other regions [1]. The production benefits of rice are reflected not only in yield but also in quality and sustainability [2]. During rice production, the rice panicle, as the final harvest component, plays a decisive role in determining both the yield and quality of rice. Therefore, effectively acquiring information on the growth stage and quantity of panicles is critical for enhancing rice production and monitoring. Traditional panicle detection typically relies on manual field investigation, which is costly in labor and time and can introduce subjective errors, reducing accuracy [3]. Traditional image processing methods struggle to handle challenges such as irrelevant background interference, occlusion between crops, and variations in rice varieties and growth stages, which results in low detection accuracy. Currently, rice panicle detection methods mainly rely on computer vision technologies and deep learning models [4], but these methods are often constrained by complex field environments and large-scale data processing. Therefore, developing a high-precision rice panicle detection method that can adapt to complex field environments is of great significance for rice yield prediction and growth monitoring [5].
Automatic detection of rice panicles is a significant challenge, as the complex field environment presents numerous difficulties. Factors such as irrelevant background interference, occlusion between crops, and variations in rice varieties and growth stages can all affect panicle detection accuracy to varying degrees. Recently, significant progress in deep learning and image processing has promoted their extensive use in the agricultural sector, especially in crop yield prediction, disease detection and crop growth monitoring [6]. In the context of rice growth and yield estimation, detecting rice panicles is of crucial importance, and deep learning offers new solutions for this task. The yield of rice is determined by three key parameters: the panicle number per unit area, grains per panicle, and thousand-grain weight [7]. In particular, the number of panicles per unit area can be obtained from images using deep learning methods.
At present, mainstream research on rice panicles primarily relies on object detection. Object detection algorithms have been widely adopted in agricultural applications and have demonstrated strong performance across various detection tasks [8]. In complex open-field environments, object detection models can localize targets and generate bounding boxes, allowing them to effectively handle dense, overlapping, and occluded panicles. These models also offer high efficiency and robustness for tasks such as panicle counting and growth monitoring. In the field of object detection, methods can broadly be categorized into Transformer-based approaches [9] and CNN-based methods [10]. Transformer-based object detectors have achieved significant progress in recent years, especially in capturing long-range dependencies and modeling complex scenes, with representative models such as DETR [11] and RT-DETR [12]. Unlike traditional CNNs, Transformers employ self-attention mechanisms to extract global features more effectively. However, Transformer-based models generally require high computational and memory resources, large amounts of training data, and often perform suboptimally in local feature extraction compared to CNNs. As a result, their application in real-time and resource-constrained crop detection scenarios remains limited. Gao et al. [13] adopted a modified PCERT-DETR model for identifying rice seedlings as well as missing plants, and reported an mAP of 81.2%, although the network includes 21.4 M parameters and requires 66.6 GFLOPs. Fang et al. [14] developed a computationally efficient wheat-head detection algorithm, CML-RTDETR, derived from RT-DETR, which achieved an mAP of 90.5%. In addition, representative CNN-based object detectors include the YOLO series [15] and RetinaNet [16], which directly regress target positions and categories and thus offer fast detection speeds. Zhang et al. [17] proposed an improved Faster R-CNN model for detecting rice panicles of the Jinnongsimiao variety and achieved an mAP of 80.3%; however, the method was developed only for potted rice and cannot be directly applied to field conditions. Xu et al. [18] introduced a multi-scale hybrid window–based detection method, MHW-PD, for rice panicles at maturity. This method demonstrated strong robustness in high-density panicle counting and achieved an average counting accuracy of 87.2%, although its performance deteriorates under occlusion. Wang et al. [19] proposed a de-duplication strategy based on YOLOv5 for multi-variety field rice panicle detection, enabling the preservation of small panicles without extensive image cropping or adjustment. This method achieved an accuracy of 92.77%, but its robustness has been validated only under a single planting density. Tan et al. [20] developed RiceRes2Net based on an improved Cascade R-CNN, achieving mean accuracies of 96.8%, 93.7%, and 82.4% for the jointing, heading, and filling stages, respectively, although the model has not yet been evaluated on UAV images. Teng et al. [21] introduced Panicle-Cloud, an open-access, AI-enabled cloud platform for rice panicle detection and counting from drone imagery, and released a diverse open-source dataset, DRPD. The platform incorporates Panicle-AI, an improved YOLOv5-based model that achieves robust performance on rice-field images captured at three different flight altitudes.
However, despite the progress achieved in the detection of rice panicles and other crop ears, detecting rice panicles in complex field environments still faces multiple challenges, and many existing approaches show degraded robustness under dense distribution, frequent occlusion, and background-complex UAV scenes. With the continuous advancement of detection algorithms, the YOLO architecture has undergone progressive optimization. The recently proposed YOLO11 [22] has achieved a new balance between speed and accuracy. Nevertheless, YOLO11 and other detectors still exhibit insufficient accuracy when applied to rice panicle detection under highly complex field conditions. First, field backgrounds are complicated by factors such as water surface reflections, strong illumination, and overcast conditions, which can easily lead to false positives and missed detections. Second, rice panicles exhibit substantial multi-scale variations across different growth stages and cultivars. In such scenarios, traditional detectors with limited cross-scale feature interaction often fail to simultaneously represent large, medium, and small targets effectively. Therefore, a key challenge in rice panicle detection is to effectively enhance multi-scale feature extraction and fusion while maintaining a reasonable detection speed and model size.
To address these issues, this study proposes an improved YOLO11-based rice panicle detection model, termed DRPU-YOLO11. By optimizing the network architecture and strengthening the feature-fusion module, the proposed model achieves superior detection performance. Rather than introducing an entirely new detection paradigm, this work focuses on task-driven architectural adaptation of YOLO11 to improve robustness for rice panicle detection in complex in-field UAV imagery. The main contributions are summarized as follows:
  • Enhanced multi-scale feature extraction and occlusion suppression: A task-oriented multi-scale feature extraction module, CSP-PGMA, is introduced to address scale variation and partial occlusion commonly observed in UAV-based rice panicle imagery. Through progressive extraction under different receptive fields, this module strengthens the model’s ability to represent multi-scale targets and substantially improves the detection of multi-scale and partially occluded panicles.
  • Suppression of background interference and enhanced small object detection: The Small Object and Environment Context Feature Pyramid Network (SOCFPN) redesigns the neck network of the original architecture by integrating dynamic upsampling (DySample), context-guided downsampling (CGDown), and the cross-scale feature fusion module CSP-ONMK. This task-driven design promotes precise detection of small objects through cross-scale feature interaction, avoids computational redundancy caused by additional detection layers, and maximizes the retention of small object details in the P2 layer while reducing background interference.
  • Optimization of prediction box quality and loss weighting: The PowerTAL strategy is employed to adapt quality-aware label assignment for rice panicle detection by differentiating the contribution of predictions with varying localization quality. Through power-based transformation, higher-quality prediction boxes are assigned greater importance during training, enabling the model to focus more effectively on reliable predictions in occluded and cluttered field environments.

2. Materials and Methods

2.1. Dataset

2.1.1. Data Acquisition

Data for the experiments were gathered from the teaching and research base of South China Agricultural University in Zengcheng District, Guangzhou (P1), as shown in Figure 1. The region is characterized by a subtropical monsoon climate with abundant rainfall, an average annual precipitation of approximately 1800 mm, and around 150 days of rainfall per year. The rice research area within the base includes 115 rice varieties, with each variety planted in rows and replicated three times in separate plots. The rice was seeded in late March. A DJI M300 UAV (DJI, Shenzhen, China) equipped with a 100-megapixel ultra-high-resolution camera SHARE A10 (SHARE, Shenzhen, China) was used for image capture. To achieve optimal image quality, other camera settings, such as ISO and aperture, were set to automatic mode. Various flight heights were tested, including 7 m, 10 m, 12 m, and 15 m. Since flying at 15 m allowed for coverage of a larger area and did not interfere with rice fields, we selected this height for image capture. The flight activities were conducted between 10:00 AM and 3:00 PM on 18 June 2025, coinciding with the heading stage of the rice. The weather that day was mostly cloudy with intermittent sunlight, providing varied lighting conditions. The temperature ranged from 26 °C to 32 °C. Images were stored in JPG format with a resolution of 11,648 × 8736 pixels, with a total of 157 images collected.

2.1.2. Data Processing

The rice images obtained from the UAV covered large areas of field plots, with a resolution of 11,648 × 8736 pixels. The high resolution increased the data processing time for training deep learning models. The original UAV images were first cropped into sub-images of 1024 × 1024 pixels to reduce the time required for annotation and preserve fine-grained details during annotation. During training, these sub-images were resized to 640 × 640 pixels to match the network input resolution, which is a standard practice in YOLO-based detectors. These cropped sub-images were stored as JPG files, with each patch containing 10 to 60 rice panicles. In total, more than 2000 cropped images were obtained. Subsequently, the rice panicle sub-images were manually annotated with LabelMe (version 5.4.1). Specifically, the targets were labeled by drawing the smallest bounding box around each rice panicle, and the resulting labels were exported to corresponding text files in the YOLO detection format. The data processing workflow for this dataset is shown in Figure 2.
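For illustration, the cropping step can be sketched as follows; this is a minimal sketch assuming the Pillow library, and the helper name sample_crops is hypothetical rather than the authors' actual preprocessing script. Patches are sampled randomly, consistent with the random cropping described in the following paragraph.

import random
from pathlib import Path

from PIL import Image

TILE = 1024  # sub-image size used for annotation

def sample_crops(src: Path, out_dir: Path, n_crops: int = 20) -> None:
    """Sample random 1024 x 1024 patches from one full-resolution UAV frame."""
    img = Image.open(src)
    w, h = img.size  # e.g., 11,648 x 8736 pixels
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(n_crops):
        x = random.randint(0, w - TILE)
        y = random.randint(0, h - TILE)
        patch = img.crop((x, y, x + TILE, y + TILE))
        patch.save(out_dir / f"{src.stem}_crop{i:03d}.jpg", quality=95)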
High-quality images were selected from the original images for random cropping. Subsequently, data cleaning was performed on the cropped sub-images. Sub-images containing a large number of rice panicles were retained, whereas images containing no rice panicles and dominated by weeds, as well as background-dominated images, were discarded. To prevent model overfitting, enhance model robustness, and improve rice panicle recognition capability, data augmentation was performed online during training using the Albumentations library. The applied transformations included random brightness and contrast adjustment with a maximum variation of ±15% (probability 0.7), horizontal flipping (probability 0.5), and random affine translation with a translation ratio ranging from 1% to 3% while preserving the aspect ratio (probability 0.5). In addition, either Gaussian noise (variance range: 0.5–1.5) or Gaussian blur (kernel size range: 1–3) was randomly applied with a combined probability of 0.3. These augmentation strategies were designed to simulate variations in illumination, viewpoint displacement, and sensor noise commonly encountered in UAV-based field imaging. A total of 3010 rice panicle images were compiled as the dataset for this experiment. The images and annotation files were partitioned into training, validation, and test sets at a ratio of 8:1:1. The training subset contains 1205 images with 83,639 rice panicle targets, while the validation subset comprises 150 images with 6469 targets. The test subset includes 300 images containing 11,824 rice panicle targets. Figure 3 shows the selected samples from the dataset.
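The online augmentation described above can be sketched with the Albumentations API as follows; this is a minimal sketch, and exact parameter names may differ across library versions and from the authors' training code.

import albumentations as A

train_aug = A.Compose(
    [
        # Brightness/contrast variation of up to ±15%, applied with p = 0.7
        A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
        # Horizontal flipping with p = 0.5
        A.HorizontalFlip(p=0.5),
        # Affine translation of 1-3% while preserving the aspect ratio, p = 0.5
        A.Affine(translate_percent=(0.01, 0.03), keep_ratio=True, p=0.5),
        # Either Gaussian noise or Gaussian blur, with a combined p = 0.3
        A.OneOf(
            [
                A.GaussNoise(var_limit=(0.5, 1.5)),
                A.GaussianBlur(blur_limit=(1, 3)),
            ],
            p=0.3,
        ),
    ],
    # Keep YOLO-format bounding boxes consistent with the transformed images
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)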
Figure 3 illustrates the diversity of rice panicle appearances. The images depict rice panicles with varying sizes, shapes, and growth stages, providing challenges for our research. Additionally, they illustrate challenging environmental factors, such as background interference, local occlusion, and multi-scale target distribution, all of which can significantly affect the accuracy of rice panicle detection.

2.2. DRPU-YOLO11 Model

2.2.1. Architecture

To address multi-scale variation, local occlusion, and background interference in UAV field scenes, we propose DRPU-YOLO11, an improved detector built upon YOLO11. The architecture of DRPU-YOLO11 is shown in Figure 4.
The proposed method introduces three key designs: (i) CSP-PGMA for multi-scale feature extraction in the backbone, (ii) SOCFPN for enhanced cross-scale feature fusion in the neck, and (iii) PowerTAL as a quality-aware label assignment strategy. Together, these designs improve detection robustness in complex field environments while keeping the model lightweight and efficient. The CSP-PGMA module significantly enhances the model’s ability to perceive multi-scale targets and local occlusions in complex environments through multi-scale feature extraction, partial channel depth modeling, and residual connections. In addition, SOCFPN integrates dynamic upsampling (DySample) [23], context-guided downsampling (CGDown) [24], and a multi-scale hybrid convolution feature extraction module (CSP-ONMK). By maintaining overall inference efficiency, it effectively enhances the network’s ability to represent small-scale targets and suppresses background interference through multi-path fusion and dynamic adjustment mechanisms. The PowerTAL strategy optimizes label assignment by introducing adaptive IoU adjustment.

2.2.2. CSP-PGMA

To improve the detection accuracy of rice panicles in complex field environments, we designed a new feature extraction module, CSP-PGMA. This module integrates multi-scale feature extraction, partial channel depth modeling, residual connections, and an adaptive attention mechanism, effectively enhancing the model’s ability to model local fine-grained targets without significantly increasing computational cost. The aim is to solve the challenges of multi-scale target detection and local occlusion.
Figure 4D illustrates the structure of CSP-PGMA. The input feature map is processed through three parallel convolution branches with kernel sizes of 3 × 3, 5 × 5, and 7 × 7 to capture features at different receptive fields. In the 3 × 3 and 5 × 5 branches, a partial channel convolution strategy is adopted: only a subset of channels is further convolved, while the remaining channels are preserved via residual connections. This design reduces redundant computation and helps retain features from non-occluded regions in complex backgrounds. The outputs of the three branches are then fused through residual connections to ensure information continuity, followed by a CBAM attention mechanism [25] to emphasize target regions and suppress background interference.
Specifically, CSP-PGMA achieves multi-scale representation by combining convolution kernels of different sizes, which is beneficial for handling rice panicles with varying scales across growth stages and cultivars. Partial channel depth modeling allows features from non-occluded regions to be preserved when local occlusion occurs, thereby reducing missed detections. Residual connections facilitate stable feature propagation and improve training convergence, while the adaptive attention mechanism dynamically adjusts channel-wise and spatial responses to focus on panicle regions. Assuming the input feature map is $X = [x_0, x_1, x_2, \ldots, x_{C-1}]$, where each $x_i \in \mathbb{R}^{H \times W}$ and the feature map consists of $C$ channels, each channel being a 2D feature map of size $H \times W$, the formulas for the three convolution branches in the CSP-PGMA module are as follows:
X_{\mathrm{part1}},\ X_{\mathrm{part2}} = \mathrm{split}(\mathrm{Conv}(X, K_1))
X_{\mathrm{part3}},\ X_{\mathrm{part4}} = \mathrm{split}(\mathrm{Conv}(X_{\mathrm{part1}}, K_2))
X_{\mathrm{concat}} = \mathrm{concat}(\mathrm{Conv}(X_{\mathrm{part3}}, K_3),\ X_{\mathrm{part2}},\ X_{\mathrm{part4}})
To apply partial channel depth modeling in the 3 × 3 and 5 × 5 convolution branches, we need to split the feature map $X$ along the channel dimension. Let the feature map’s channels be divided into two parts, $C_1$ and $C_2$, such that $C_1 + C_2 = C$. Here, $X_{\mathrm{part1}}$ and $X_{\mathrm{part2}}$ are the two parts of the feature map after the 3 × 3 convolution, while $X_{\mathrm{part3}}$ and $X_{\mathrm{part4}}$ come from the 5 × 5 convolution. $K_1$, $K_2$, and $K_3$ denote the kernel sizes of the first, second, and third convolution layers in the three branches, set to 3, 5, and 7, respectively. The output of the three convolution branches is denoted as $X_{\mathrm{concat}}$.
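The cascaded partial-channel structure defined by the three formulas above can be sketched in PyTorch as follows. This is an illustrative approximation only: channel splits use equal halves, and the residual fusion and CBAM attention applied afterwards are omitted; it is not the authors' implementation.

import torch
import torch.nn as nn

class PGMABranchSketch(nn.Module):
    """Sketch of the three partial-channel convolution branches (K1 = 3, K2 = 5, K3 = 7).
    Assumes a channel count divisible by four."""

    def __init__(self, channels: int):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.conv_k1 = nn.Conv2d(channels, channels, 3, padding=1)  # K1 branch on all channels
        self.conv_k2 = nn.Conv2d(half, half, 5, padding=2)          # K2 branch on half the channels
        self.conv_k3 = nn.Conv2d(quarter, quarter, 7, padding=3)    # K3 branch on a quarter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # X_part1, X_part2 = split(Conv(X, K1))
        part1, part2 = torch.chunk(self.conv_k1(x), 2, dim=1)
        # X_part3, X_part4 = split(Conv(X_part1, K2))
        part3, part4 = torch.chunk(self.conv_k2(part1), 2, dim=1)
        # X_concat = concat(Conv(X_part3, K3), X_part2, X_part4)
        return torch.cat([self.conv_k3(part3), part2, part4], dim=1)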
Through these designs, the CSP-PGMA module demonstrates exceptional performance in detecting multi-scale targets and addressing severe occlusion issues. Multi-scale convolution and partial channel depth modeling effectively improve its capability to perceive targets of varying sizes and local occlusion areas, while the introduction of residual connections along with adaptive attention mechanisms further improves robustness in complex environments.

2.2.3. SOCFPN

In deep neural networks, shallow features preserve fine spatial details but lack strong semantics, whereas deep features provide richer semantics at the cost of spatial resolution. Although feature pyramid networks partially alleviate this trade-off through multi-scale fusion, their effectiveness is still limited in complex field environments, particularly for small object detection under background interference. Small objects are often inadequately represented at the P3, P4 and P5 detection layers. While introducing an additional P2 detection layer can improve small-object sensitivity, it also increases computational overhead and post-processing complexity.
To address the aforementioned issues, this study proposes the Small Object and Environment Context Feature Pyramid Network (SOCFPN). SOCFPN adopts a new feature fusion strategy. Compared with the traditional approach of adding a P2 detection layer, it uses the P2 feature layer, processed through the context-guided downsampling (CGDown) module, to obtain rich small object information, which is then fused with P3. This approach, combined with Cross Stage Partial (CSP) and OmniKernel [26], results in the CSP-ONMK module for feature integration. This strategy preserves fine-grained P2 details while avoiding the computational redundancy introduced by additional detection layers, thereby enhancing small object detection and background suppression.
Figure 4B shows the overall structure of the proposed SOCFPN, consisting of three core modules: DySample, CGDown, and CSP-ONMK. DySample is introduced at each upsampling stage to mitigate semantic loss during feature scaling. As a lightweight dynamic upsampling module, it adaptively adjusts sampling locations, enhancing global context representation and reducing spatial distortion compared to fixed interpolation. CGDown focuses on extracting small object related information from high-resolution P2 features by retaining key contextual cues while suppressing background noise.
The structure of the CSP-ONMK module, shown in Figure 4E, combines CSP with the OmniKernel multi-scale convolution unit. After an initial 1 × 1 convolution, the feature map is split following the CSP paradigm. One branch undergoes multi-scale processing via OmniKernel, while the remaining channels are directly propagated to preserve original information. OmniKernel employs parallel depthwise convolutions with large and strip-shaped kernels to capture both global and directional contextual information. The processed features are then fused with the bypassed channels and projected through a final 1 × 1 convolution.
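A rough PyTorch sketch of this CSP-style wrapping of OmniKernel-like processing is given below. The kernel sizes and the even channel split are illustrative assumptions rather than the configuration used in the paper.

import torch
import torch.nn as nn

class OmniKernelSketch(nn.Module):
    """Parallel depthwise convolutions with a large square kernel and two strip-shaped
    kernels, added to an identity path (the kernel size k is an assumed value)."""

    def __init__(self, c: int, k: int = 7):
        super().__init__()
        self.square = nn.Conv2d(c, c, k, padding=k // 2, groups=c)             # large-kernel context
        self.strip_h = nn.Conv2d(c, c, (1, k), padding=(0, k // 2), groups=c)  # horizontal context
        self.strip_v = nn.Conv2d(c, c, (k, 1), padding=(k // 2, 0), groups=c)  # vertical context

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.square(x) + self.strip_h(x) + self.strip_v(x)

class CSPONMKSketch(nn.Module):
    """CSP split: half of the channels pass through OmniKernel-like processing, the other
    half bypass, and a final 1 x 1 convolution fuses them (assumes an even channel count)."""

    def __init__(self, c: int):
        super().__init__()
        self.reduce = nn.Conv2d(c, c, 1)
        self.omni = OmniKernelSketch(c // 2)
        self.fuse = nn.Conv2d(c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = torch.chunk(self.reduce(x), 2, dim=1)
        return self.fuse(torch.cat([self.omni(a), b], dim=1))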
Overall, SOCFPN performs feature fusion by upsampling deep features with DySample, extracting small-object semantics from P2 via CGDown, and integrating them with P3 features through CSP-ONMK. This design injects fine-grained small object information into the detection pipeline while maintaining a compact three-layer architecture, leading to improved robustness in dense and background complex field scenarios.

2.2.4. PowerTAL

In complex field environments, rice panicle detection is often affected by background interference and occlusion, which can produce a large number of low-quality bounding boxes and hinder stable training. These low-quality predictions reduce detection accuracy, especially under dense distribution and complex backgrounds. Therefore, improving the contribution of high-quality predictions while suppressing the influence of low-quality ones is critical for robust rice panicle detection.
To address the above issues, we propose PowerTAL, a quality-aware label assignment strategy based on power transformation. PowerTAL applies a power transform to the intersection over union (IoU) between predicted and ground-truth boxes, increasing the relative weight of high-quality predictions while compressing the contribution of low-quality ones. Built upon the original Task-Aligned Learning (TAL) framework, PowerTAL enhances the discriminability of the overlap metric and improves alignment quality during training. Specifically, given an IoU value, PowerTAL applies a piecewise power transformation, defined as the following formula:
\mathrm{IoU}' = \begin{cases} \mathrm{IoU}^{\gamma}, & \text{if } \mathrm{IoU} < \beta \\ \mathrm{IoU}^{1/\gamma}, & \text{if } \mathrm{IoU} \geq \beta \end{cases}
where γ controls the degree of emphasis between low- and high-quality predictions (set to 2 by default), and β is the threshold applied to the IoU. This transformation suppresses low-IoU predictions while amplifying high-IoU ones.
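The piecewise transform can be expressed compactly as follows; this is a minimal sketch, with β = 0.3 reflecting the best-performing threshold identified in the sensitivity analysis of Section 3.1.

import numpy as np

def power_transform_iou(iou, gamma: float = 2.0, beta: float = 0.3):
    """Suppress low-IoU predictions (iou ** gamma) and amplify high-IoU ones (iou ** (1 / gamma))."""
    iou = np.asarray(iou, dtype=float)
    return np.where(iou < beta, iou ** gamma, iou ** (1.0 / gamma))

# Example: power_transform_iou([0.2, 0.3, 0.8]) -> [0.04, 0.548, 0.894] (approximately)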
In the TAL method, the alignment metric is used to quantify the degree of match between the predicted and target boxes. In the PowerTAL strategy, the overlap metric is power-transformed and serves as the basis for optimizing the alignment metric. The alignment metric is essentially a combination of IoU and scores to measure the match between the predicted and target boxes. Through this approach, PowerTAL enhances the alignment metric of high-quality boxes during training, ensuring that these boxes are prioritized for assignment and optimization, while adjusting the relative importance of overlap and score in the alignment metric. In PowerTAL, we replace the original IoU with the power-transformed IoU′ to focus the loss calculation more on high-quality boxes.
Overall, PowerTAL improves the stability and effectiveness of label assignment by adaptively reweighting prediction quality, thereby enhancing detection robustness in complex and occluded field environments.

2.3. Experimental Settings

2.3.1. Training Settings

All experiments were conducted on Ubuntu 20.04.4 LTS, which was selected to ensure compatibility with the CUDA and PyTorch versions used in this study and to maintain consistency with the experimental environment, using an Intel i5-10400 processor (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of GPU memory. The software environment comprised Python 3.9.20, PyTorch 1.13.1, and CUDA 11.7. A batch size of 8 was adopted, and the model was trained for 200 epochs. The learning rate was set to 0.001, and the SGD optimizer was used, selected for its stable convergence behavior and strong generalization performance in large-scale object detection tasks, particularly for YOLO-based architectures. The input image size for the model was 640 × 640. Detailed hyperparameter configurations for the model training phase are shown in Table 1.
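For reference, this configuration maps onto the Ultralytics training API roughly as shown below. The model and dataset YAML file names are placeholders, since the DRPU-YOLO11 definition (CSP-PGMA, SOCFPN, PowerTAL) is not distributed with the standard package; this is a sketch of the settings, not the authors' training script.

from ultralytics import YOLO

# Placeholder YAML names: a custom model definition and dataset config are assumed.
model = YOLO("drpu-yolo11.yaml")
model.train(
    data="rice_panicle.yaml",  # paths to the train/validation/test split described above
    epochs=200,
    imgsz=640,
    batch=8,
    lr0=0.001,
    optimizer="SGD",
)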

2.3.2. Evaluation Metrics

To objectively evaluate the model’s detection performance on the rice panicle dataset, the evaluation metrics used in this study include precision (P), recall (R), F1-Score, and mean Average Precision (mAP).
The specific formulas are as follows:
Precision (P) represents the proportion of true positive samples among all samples predicted as positive.
P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}
Recall (R) represents the proportion of true positive samples among all actual positive samples.
R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
where TP (True Positive) refers to instances where the model predicts a rice panicle and the true class is rice panicle. FP (False Positive) refers to instances where the model predicts a rice panicle but the true class is background. FN (False Negative) refers to rice panicles that the model fails to detect.
F1-Score is the harmonic mean of precision (P) and recall (R).
\mathrm{F1\text{-}Score} = \frac{2 \times P \times R}{P + R}
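As a worked example of these three definitions (independent of the evaluation code used in this study):

def detection_scores(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1-Score computed from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: detection_scores(825, 175, 210) -> (0.825, ~0.797, ~0.811)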
mAP calculates the average AP value across all classes, providing a comprehensive evaluation of the overall performance of the model.
\mathrm{mAP} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{AP}_i
where K denotes the total number of classes and AP_i is the average precision of the i-th class.
In this study, detection performance is evaluated using mAP at an IoU threshold of 0.5 (mAP50) and mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP50–95). mAP50 mainly reflects whether rice panicles are successfully detected under a standard overlap criterion and is therefore closely related to detection completeness, which is critical for downstream tasks such as panicle counting and yield analysis. By contrast, mAP50–95 aggregates performance under progressively stricter IoU thresholds and provides a more stringent evaluation of localization robustness. It should be noted that mAP50–95 is an averaged metric and thus reflects overall performance under increasing localization strictness rather than behavior at any single IoU threshold. For field rice panicle detection, where targets are slender, densely distributed, and frequently occluded, achieving precise high IoU alignment is inherently challenging. Accordingly, this study places greater emphasis on mAP50 when assessing detection effectiveness, while mAP50–95 is reported as a complementary indicator of localization robustness.
In addition, this study adopts the number of model parameters (Params) as an indicator of structural complexity, since it is closely related to the model’s storage needs and the potential risk of overfitting. The computational complexity is quantified using giga floating-point operations (GFLOPs), which describe the demand for computational resources. Model complexity metrics (including Params and GFLOPs) are reported to facilitate an explicit comparison between detection performance and computational cost under practical deployment constraints.

3. Experimental Results and Analysis

3.1. Hyperparameter Sensitivity Analysis of PowerTAL

To validate the effectiveness of the proposed PowerTAL strategy in optimizing the model’s detection accuracy, a comparative experiment was conducted on a model with the CSP-PGMA module and SOCFPN structure, using different values for the threshold hyperparameter β .
Generally, a small IoU value indicates that the predicted box has a small overlap with the ground truth box, which is considered a low-quality prediction box and can negatively affect the model’s detection performance. However, due to the influence of complex field environments, rice panicles are often subject to occlusion and small target situations, leading to relatively small overlaps between predicted and ground truth boxes. These are often viewed as low-quality boxes, resulting in missed detections. To reduce this occurrence and suppress the negative impact of low-quality boxes, the model needs to select an appropriate hyperparameter β . Typically, detection models treat predicted boxes with an IoU below 0.5 as negative samples, and some studies lower the IoU threshold to 0.3 to handle dense and occluded target detection. Therefore, we consider the range of β values from 0.2 to 0.6 as the appropriate research range for this sensitivity analysis. The results are shown in Table 2.
The changes in IoU values after adaptive adjustment using the Power_transform function are shown in Figure 5. The intersection point of the gray solid lines in Figure 5 corresponds to an input value of 0.5; when the unimproved TAL strategy is used, the output value at this point is also 0.5.
Table 2 indicates that varying the hyperparameter β notably influences model accuracy. When β is set to 0.3, precision reaches the highest value of 82.5%. Although the recall is 0.2% lower than when β is set to 0.4, the overall F1-Score achieves the highest value of 80.4%. In general, the detection accuracy is optimal when β is set to 0.3. Additionally, as seen in the table, when β is 0.3, mAP50 is 85.7%, which is 0.4% higher than the second highest value of 85.3%. Therefore, the best performance of the PowerTAL strategy occurs when β is set to 0.3. This choice enhances the model’s rice-panicle detection capability without increasing the parameter count.

3.2. Ablation Study

To comprehensively evaluate the effectiveness of the proposed improvements and methods in this study, ablation experiments were conducted based on YOLO11 as the baseline model. The following modifications were made step by step: replacing the C3k2 block in the feature extraction network with CSP-PGMA, using SOCFPN to improve the neck network, and applying the PowerTAL method to optimize TAL. Experiments were conducted incrementally by integrating these modules, and the results are summarized in Table 3.
The experimental results demonstrate that each proposed module is effective in the DRPU-YOLO11 model for multi-scale rice panicle detection in complex field backgrounds and UAV viewpoints. The CSP-PGMA module significantly enhanced the backbone network’s ability to capture multi-scale target features and local occlusion target features, from local details to global context, thereby improving the representation of rice panicle structures in complex field environments. Compared to the baseline model, the method integrating CSP-PGMA (Method 1) improved accuracy by 0.6%, F1-Score by 0.5%, mAP50 by 1.3%, and recall by 0.3%, indicating a reduction in missed detections for small and partially occluded panicles.
The SOCFPN neck network module further optimized multi-scale feature fusion. By using DySample, CGDown, and CSP-ONMK, it achieved an adaptive path that enhanced the representation of small targets and suppressed environmental interference without sacrificing inference efficiency. Moreover, DySample, CGDown, and CSP-ONMK are not independent add-on modules; they are structural elements of SOCFPN, where CSP-ONMK is indispensable for cross-scale feature fusion and for preserving fine-grained P2 layer information for small and densely distributed panicles. The model integrating SOCFPN (Method 2) achieves improvements of 0.8% in accuracy and 0.8% in recall over the baseline, demonstrating its effectiveness in reducing missed detections in dense-field scenarios. In contrast, the absence of SOCFPN in Method 1 weakens the integration of low-level details and high-level semantics, leading to a noticeable decline in recall and F1-Score due to increased omission of small targets.
To avoid ambiguity, we clarify that PowerTAL is not an architectural component but a quality-aware label assignment strategy that improves the matching quality between predictions and ground-truth boxes by assigning higher importance to high-quality pairs during optimization. As shown in Method 3, adding PowerTAL to the baseline model improved accuracy by 0.4%, recall by 0.8%, F1-Score by 0.7%, and mAP50 by 0.5%. This adaptive mechanism enhanced the model’s sensitivity to rice panicle texture, resisted background interference, reduced localization error, and decreased both missed and false detections.
When the enhanced modules are incorporated into the base framework, the resulting model is both more accurate and more efficient. This design better accommodates multi-scale features and helps reduce false positives as well as missed detections. The results indicate that each module contributes to performance gains relative to the base framework. Overall, mAP50 improved by 2.4% and the F1-Score increased by 2%, without a significant increase in parameter count.

3.3. Comparison of Different Backbone Networks

To assess the advantages of the improved backbone in our model, we performed comparative experiments against the baseline YOLO11 backbone and several advanced backbone networks, including EfficientViT [27], FasterNet [28], RepViT [29], and ConvNeXtV2 [30]. EfficientViT reduces computational complexity and improves model efficiency by optimizing the Transformer architecture and introducing efficient computational modules, making it particularly suitable for devices with limited computational resources and visual tasks with high real-time requirements. FasterNet employs a hierarchical design composed of PConv and MLP blocks to enable efficient feature extraction, thereby lowering computational cost and enhancing hardware-level parallelism, which makes it well suited for real-time applications. RepViT enhances the model’s representational power by using structure re-parameterization to separate channel mixers, improving performance without increasing the computational burden, especially excelling in high-accuracy and low-latency applications on mobile devices. ConvNeXtV2 incorporates GRN normalization and depthwise separable convolutions to strengthen inter-channel feature interactions, thereby streamlining convolution operations. Taken together, these four backbones reflect four representative design paradigms: Transformer architecture, lightweight convolution, re-parameterization, and pure convolution optimization. The experimental conditions were kept consistent, and the results are summarized in Table 4.
In the baseline YOLO11 backbone, multiple C3k2 blocks are used for feature extraction. We replace all C3k2 blocks with the proposed CSP-PGMA module. The novelty of the CSP-PGMA module lies in its use of convolution blocks with different kernel sizes combined with the CSP approach. This replacement improves multi-scale feature extraction and robustness to occlusion and complex field interference while introducing only a slight increase in model size, thereby enhancing both detection accuracy and localization quality. As shown in Table 4, compared with YOLO11, our method improves both mAP50 and mAP50–95 by 1.3%, and increases recall by 0.3%, indicating better sensitivity to small and partially occluded targets. Relative to other mainstream backbones, our recall is 0.4% lower than that of FasterNet. However, FasterNet yields precision and mAP50 values that are 1.7% and 0.9% lower than ours, respectively. RepViT achieves comparable precision and recall, but its mAP50 is 1.1% lower. Although EfficientViT achieves the second-highest mAP50 (83.9%) among the compared backbones, slightly below our backbone (84.6%), its precision is substantially lower than ours. ConvNeXtV2 performs worse than our method across all reported metrics.
Additionally, to subjectively observe the feature extraction capabilities of each backbone, we employed Grad-CAM [31] for visualization. The results are shown in Figure 6. The heatmaps generated by Grad-CAM help us understand the regions the network focuses on, where different colors represent the degree of influence on the detection results. The darker the color, the more the network focuses on that region, and the greater its impact on the detection outcome. Considering the challenges frequently faced in rice panicle detection in field scenes, such as multi-scale targets, background interference, and local occlusion, Figure 6 shows a comparison of heatmaps from the YOLO11 model using different backbones on UAV images. Since rice panicles of different varieties and growth stages appear in the same UAV image, this results in a multi-scale distribution, making the model more susceptible to scale changes. Consequently, feature extraction for rice panicles becomes restricted. In the first column of Figure 6, YOLO11 and FasterNet focus on the entire image, but the actual rice panicles in the original image are not as widespread. EfficientViT focuses more on the leaves rather than the rice panicles. RepViT and ConvNeXtV2 do not pay enough attention to small-scale rice panicles, whereas our model performs better. Background interference, which is common in field environments, can significantly impact the model’s feature extraction capabilities. In the second column of Figure 6, FasterNet and ConvNeXtV2 are distracted by the ground background, focusing more on the ground. Although Baseline, EfficientViT, and RepViT reasonably focus on the rice panicle areas, they are also, to varying degrees, disturbed by the soil. In contrast, our model is minimally affected by background interference and correctly focuses on the rice panicles. When rice panicles are densely distributed, occlusion often occurs. Occlusion can lead to biased feature extraction by the model, resulting in missed detections. In the third column of Figure 6, compared to other models, our model better focuses on large areas of overlapping rice panicles and exhibits better feature extraction capabilities under occlusion.
Overall, with the proposed multi-scale feature-fusion strategy, the model delivers the best detection accuracy with a low computational cost, and shows particularly strong recall and localization performance on multi-scale targets. These results demonstrate its effectiveness for multi-scale detection tasks.

3.4. Comparison of Different Detection Models

To comprehensively evaluate the rice panicle detection capability of the DRPU-YOLO11 model, it was compared with seven leading object detection algorithms, as shown in Table 5. The models compared include the transformer-based detector RT-DETR, the rice panicle detection network Panicle-AI, the CNN-based detectors YOLOv8 and YOLOv10, and the baseline model YOLO11. For all competing detectors, we followed the official implementations and the recommended hyperparameter settings reported in the original papers. The input resolution was fixed at 640 × 640 to ensure a fair comparison. The validation results for all models are shown in Figure 7. From the Precision-epoch curves in Figure 7A, rapid convergence is observed for the majority of models. Precision exceeds 80% for YOLOv8, YOLO11, Panicle-AI, and our method. After convergence, DRPU-YOLO11 and Panicle-AI maintain stronger performance and exhibit higher stability than the other models. Among them, DRPU-YOLO11 achieved the highest precision at 82.5%, while Panicle-AI also achieved competitive results with 81.6% precision. Notably, the DRPU-YOLO11 model size is 3.21 M, significantly smaller than the 8.54 M of Panicle-AI. Overall, DRPU-YOLO11 shows superior performance compared with other object detection models. Furthermore, from the Precision-Recall curves (Figure 7B), it can be observed that within the recall range of 0.0–1.0, DRPU-YOLO11's precision is substantially higher than that of YOLO11, YOLOv10, YOLOv8, and RT-DETR. Additionally, in the high recall region (≥0.6), it still outperforms Panicle-AI, reflecting its advantage in identifying positive instances accurately. At the same time, the Precision-Recall curve of DRPU-YOLO11 is smooth without significant fluctuations, demonstrating superior stability compared to other models and avoiding the sharp drop in precision caused by changes in recall.
As shown in Table 5, DRPU-YOLO11 achieves an mAP50 of 85.7%, which is 5.0% and 4.6% higher than RT-DETR and YOLOv10, respectively. In addition, DRPU-YOLO11 attains an F1-Score of 78.4%, outperforming the other state-of-the-art detection algorithms. These improvements mainly arise from SOCFPN's multi-scale feature processing and PowerTAL's bounding-box quality refinement; their synergy with the high-quality multi-scale representations produced by CSP-PGMA further enhances multi-scale target detection. Consequently, DRPU-YOLO11 achieves consistent improvements in precision, recall, F1-Score, and mAP50. It is worth noting that, despite RT-DETR having a larger number of parameters, DRPU-YOLO11 still maintains a clear performance advantage, further confirming the effectiveness of the proposed architecture.
Figure 8 presents the confusion matrices of RT-DETR, YOLOv8, YOLOv10, YOLO11, Panicle-AI, and our proposed DRPU-YOLO11 on the test set. The results show that our model demonstrates superior capability in distinguishing rice panicles from the background compared with the other five models. DRPU-YOLO11 achieves higher values along the main diagonal, indicating a more accurate recognition of rice panicle targets. Meanwhile, it exhibits fewer false positives (background regions misclassified as panicles) and false negatives (panicles that are missed), suggesting stronger robustness under complex background conditions. In contrast, the other models show a relatively high proportion of misdetections in background areas, reflecting limitations in their feature extraction and target separation abilities. These advantages highlight that our model offers better robustness for multi-scale rice panicle detection under challenging field conditions.
Figure 9 provides a qualitative comparison of detection outputs produced by the YOLO11, YOLOv10, YOLOv8, RT-DETR, Panicle-AI, and DRPU-YOLO11 models on UAV images, targeting typical complex field conditions (such as dense distributions, background interference, and multi-scale issues). In the visualization, green boxes indicate correct detections, red boxes denote false or duplicate detections, and orange boxes mark missed detections. The results clearly demonstrate the advantages of DRPU-YOLO11 in suppressing occlusion interference, reducing background interference, and identifying multi-scale rice panicle targets. In the first column of Figure 9, YOLO11 experiences false detections with overlapping rice panicles, assigning multiple detection boxes to the same target. Similarly, YOLOv8, YOLOv10, and RT-DETR also experience false detections, incorrectly identifying a single target as two or more targets. Panicle-AI, when facing overlapping rice panicles, misses the features of occluded rice panicles, resulting in missed detections. In contrast, DRPU-YOLO11 performs relatively better. Under background interference conditions, we selected common field interferences such as water surface and ground interference as typical examples. As shown in the second column of Figure 9, YOLO11, YOLOv10, and YOLOv8 all mistakenly detect leaves as rice panicle targets over the water surface and miss small targets. RT-DETR experiences false detection in overlapping rice panicles, and Panicle-AI misses detections in overlapping rice panicles. In the third column of Figure 9, DRPU-YOLO11 demonstrates its advantage in multi-scale target detection, whereas YOLO11, YOLOv10, YOLOv8, and RT-DETR all miss rice panicle targets under non-occluded conditions. It is important to note that DRPU-YOLO11 also has some limitations. For example, in the first column of Figure 9, during edge detection of the image, DRPU-YOLO11, due to its ability to capture fine-grained features, often misidentifies targets with only a small amount of rice panicle features as rice panicles. These may actually belong to adjacent rice panicles or may not be rice panicles at all.

4. Discussion

4.1. Advantages

This study proposes DRPU-YOLO11, a multi-scale rice panicle detection framework designed for complex in-field UAV imagery. The main advantage of the proposed method lies in its ability to simultaneously address scale variation, background interference, and local occlusion without significantly increasing model complexity. By integrating CSP-PGMA for enhanced multi-scale feature extraction, SOCFPN for fine-grained feature fusion, and PowerTAL for improved label assignment, the model achieves consistent performance gains across multiple evaluation metrics.
From a practical perspective, the proposed DRPU-YOLO11 is designed to balance detection robustness and computational cost. While additional architectural components are introduced to address scale variation, dense overlap, and background interference, the overall model remains compact compared with heavier detection frameworks. The reported model size and computational complexity demonstrate that the proposed design does not rely on excessive depth or additional detection heads, making it suitable for UAV-based agricultural applications where onboard computing resources are limited.
Compared with closely related detectors, DRPU-YOLO11 differs not only in performance but also in architectural design philosophy. Instead of relying on deeper backbones or additional detection heads, the proposed method explicitly injects fine-grained P2-level information into higher-level features while maintaining a compact model size. This design choice is particularly effective for detecting small and densely distributed rice panicles under complex field backgrounds, as summarized in Table 6.
From an application perspective, the improved precision and recall directly benefit downstream agronomic tasks such as panicle counting, density estimation, and phenotypic analysis. Reducing false positives helps avoid yield overestimation, while improved recall ensures more reliable spatial characterization of panicle distribution across large field areas.

4.2. Challenges and Limitations

Despite these advantages, several challenges and limitations should be acknowledged. First, the dataset used in this study was collected from a single experimental site and growing season. Although diverse data augmentation strategies and multi-scale feature representations were employed to mitigate overfitting to site-specific conditions such as lighting, background changes, and planting density, full cross-site and cross-season validation remains an important direction for future investigation. Specifically, brightness and contrast augmentation was used to simulate illumination differences across weather conditions, while affine translation and multi-scale feature fusion were designed to improve robustness to viewpoint variation and scale changes commonly encountered in UAV imagery. These measures help reduce sensitivity to location-specific visual patterns, although they cannot fully substitute for validation on independent datasets.
Second, while some performance improvements are numerically modest, they are consistently reflected across multiple evaluation metrics and supported by systematic ablation experiments, providing converging evidence that the improvements are associated with the proposed design choices. However, we acknowledge that training stochasticity may affect the reported results, and repeated-run evaluation with statistical analysis would further strengthen the quantitative assessment of performance stability.
Finally, failure cases may still occur in complex field environments. Near image boundaries or in scenarios where panicles are partially visible, truncated targets may lead to ambiguous localization and redundant detections. Similarly, in densely overlapping panicle clusters, multiple highly confident predictions may coexist, increasing the risk of local over-detection. In such cases, the quality-aware label assignment mechanism in PowerTAL, which emphasizes high-IoU predictions during training, may further amplify these effects by assigning excessive importance to tightly localized but partially truncated or overlapping predictions. From a practical deployment perspective, it is therefore valuable to consider mitigation measures for these failure modes. These issues may be alleviated by refined post-processing (e.g., stronger suppression under heavy overlap and boundary-aware filtering) and by exploring adaptive confidence regulation in highly crowded regions.
It should also be noted that PowerTAL is designed as a quality-aware label assignment strategy tailored for dense and occluded rice panicle detection. While it improves performance in this target application, we do not claim that the same strategy will universally benefit all detection tasks or datasets, and its generalization warrants further evaluation.

4.3. Future Perspectives

Future research will focus on extending the proposed framework to more diverse datasets covering multiple geographic regions, growth stages, and environmental conditions to further evaluate generalization capability. Incorporating repeated-run experiments and statistical variance analysis will also enable more rigorous performance assessment. Meanwhile, we also consider a quantitative characterization of representative failure cases to complement the qualitative observations.
From a practical deployment perspective, DRPU-YOLO11 provides a balanced trade-off between accuracy and computational cost, making it suitable for UAV platforms with moderate onboard computing resources. Future work may explore adaptive weighting strategies for label assignment and additional spatial suppression mechanisms to further improve robustness in extreme density or boundary scenarios, facilitating reliable large-scale agricultural monitoring.

5. Conclusions

To achieve accurate detection of rice panicles in UAV images under complex field conditions, this study proposed DRPU-YOLO11, a multi-scale rice panicle detection model. The experiments demonstrated that the proposed DRPU-YOLO11 model achieved significant improvements in various accuracy metrics for rice panicle detection in UAV images. Ablation experiments confirmed the effectiveness of the proposed components (the CSP-PGMA module, SOCFPN structure, and PowerTAL strategy), and comparisons with other advanced object detection models demonstrated its advantages in detecting rice panicles in complex field environments. DRPU-YOLO11 provides a more efficient and innovative solution for rice panicle detection and is expected to offer valuable support for tasks such as rice production monitoring and yield estimation in the future.

Author Contributions

Conceptualization, D.H. and J.Z.; methodology, D.H.; software, D.H. and Z.C.; validation, D.H. and Z.C.; formal analysis, D.H.; investigation, D.H.; resources, D.H. and C.L.; data curation, D.H., H.H. and F.L.; writing—original draft preparation, D.H. and J.Z.; writing—review and editing, D.H., C.L., G.S. and G.H.; visualization, D.H., J.Z. and C.L.; supervision, G.H. and C.L.; project administration, D.H. and C.L.; funding acquisition, J.Z. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Philosophy and Social Sciences Planning Project of Guangdong Province of China (Grant No. GD23XGL099), the Philosophy and Social Sciences Planning Project of Guangzhou City of China (Grant No. 2025GZGJ21), the National Natural Science Foundation of China (Grant No. 62202110), the Key Science and Technology Research and Development Program of Guangzhou City, China (Grant No. 2024B03J1302), and the Doctoral program construction project for research capability of Guangdong Polytechnic Normal University (22GPNUZDJS18).

Data Availability Statement

The dataset in this study is available from the corresponding authors on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zeigler, R.S.; Barclay, A. The relevance of rice. Rice 2008, 1, 3–10. [Google Scholar] [CrossRef]
  2. Prasad, R.; Shivay, Y.S.; Kumar, D. Current status, challenges, and opportunities in rice production. In Rice Production Worldwide; Chauhan, B.S., Jabran, K., Mahajan, G., Eds.; Springer International Publishing: Cham, Switzerland, 2017; pp. 1–32. [Google Scholar]
  3. Madec, S.; Jin, X.; Lu, H.; De Solan, B.; Liu, S.; Duyme, F.; Heritier, E.; Baret, F. Ear density estimation from high resolution RGB imagery using deep learning technique. Agric. For. Meteorol. 2019, 264, 225–234. [Google Scholar] [CrossRef]
  4. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.-L.; Chen, S.-C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv. 2018, 51, 92. [Google Scholar] [CrossRef]
  5. Zualkernan, I.; Abuhani, D.A.; Hussain, M.H.; Khan, J.; ElMohandes, M. Machine learning for precision agriculture using imagery from Unmanned Aerial Vehicles (UAVs): A survey. Drones 2023, 7, 382. [Google Scholar] [CrossRef]
  6. Sanaeifar, A.; Guindo, M.L.; Bakhshipour, A.; Fazayeli, H.; Li, X.; Yang, C. Advancing precision agriculture: The potential of deep learning for cereal plant head detection. Comput. Electron. Agric. 2023, 209, 107875. [Google Scholar] [CrossRef]
  7. Fageria, N.K. Yield physiology of rice. J. Plant Nutr. 2007, 30, 843–879. [Google Scholar] [CrossRef]
  8. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef]
  9. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  10. Du, J. Understanding of Object Detection Based on CNN Family and YOLO. J. Phys. Conf. Ser. 2018, 1004, 012029. [Google Scholar] [CrossRef]
  11. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  12. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  13. Gao, J.; Tan, F.; Hou, Z.; Li, X.; Feng, A.; Li, J.; Bi, F. UAV-based automatic detection of missing rice seedlings using the PCERT-DETR model. Plants 2025, 14, 2156. [Google Scholar] [CrossRef]
  14. Fang, Y.; Yang, C.; Zhu, C.; Jiang, H.; Tu, J.; Li, J. CML-RTDETR: A lightweight wheat head detection and counting algorithm based on the improved RT-DETR. Electronics 2025, 14, 3051. [Google Scholar] [CrossRef]
  15. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  16. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  17. Zhang, Y.; Xiao, D.; Chen, H.; Liu, Y. Rice panicle detection method based on improved faster R-CNN. Trans. Chin. Soc. Agric. Mach. 2021, 52, 231–240. [Google Scholar]
  18. Xu, C.; Jiang, H.; Yuen, P.; Zaki Ahmad, K.; Chen, Y. MHW-PD: A robust rice panicles counting algorithm based on deep learning and multi-scale hybrid window. Comput. Electron. Agric. 2020, 173, 105375. [Google Scholar] [CrossRef]
  19. Wang, X.; Yang, W.; Lv, Q.; Huang, C.; Liang, X.; Chen, G.; Xiong, L.; Duan, L. Field rice panicle detection and counting based on deep learning. Front. Plant Sci. 2022, 13, 966495. [Google Scholar] [CrossRef]
  20. Tan, S.; Lu, H.; Yu, J.; Lan, M.; Hu, X.; Zheng, H.; Peng, Y.; Wang, Y.; Li, Z.; Qi, L.; et al. In-field rice panicles detection and growth stages recognition based on RiceRes2Net. Comput. Electron. Agric. 2023, 206, 107704. [Google Scholar] [CrossRef]
  21. Teng, Z.; Chen, J.; Wang, J.; Wu, S.; Chen, R.; Lin, Y.; Shen, L.; Jackson, R.; Zhou, J.; Yang, C. Panicle-cloud: An open and AI-powered cloud computing platform for quantifying rice panicles from drone-collected imagery to enable the classification of yield production in rice. Plant Phenomics 2023, 5, 0105. [Google Scholar] [CrossRef] [PubMed]
  22. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  23. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023. [Google Scholar]
  24. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef]
  25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  26. Cui, Y.; Ren, W.; Knoll, A. Omni-kernel network for image restoration. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  27. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory efficient Vision Transformer with cascaded group attention. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  28. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. arXiv 2023, arXiv:2303.03667v3. [Google Scholar]
  29. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. RepViT: Revisiting mobile CNN from ViT perspective. arXiv 2023, arXiv:2307.09283. [Google Scholar]
  30. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  31. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
Figure 1. Research site. (A) Geographic location. (B) Geographic coordinates. (C) Rice research area.
Figure 2. Data processing flow, where a1 denotes normal images, a2 denotes images affected by lighting conditions, a3 denotes images containing no rice panicles and mostly weeds, and a4 denotes images dominated by irrelevant background; the white boxes in the cropped sub-images indicate the cropping layout, and the green boxes in the manual annotation step are bounding boxes drawn with the annotation software.
Figure 3. Rice panicle images collected under complex field conditions.
Figure 4. Architecture of the proposed DRPU-YOLO11. (A) Backbone architecture. (B) SOCFPN architecture. (C) Head architecture. (D) CSP-PGMA architecture. (E) CSP-ONMK architecture. The green boxes in (C) are the detection boxes produced by the model for the example image.
Figure 5. IoU values before and after processing with the Power_transform function for different values of β .
Figure 6. Comparison of heatmaps from different backbone networks.
Figure 7. Comparison of validation results across models. (A) Precision-epoch curves. (B) Precision-Recall curves.
Figure 8. Comparison of confusion matrices for different detection networks. (A) RT-DETR. (B) YOLOv8. (C) YOLOv10. (D) YOLO11. (E) Panicle-cloud. (F) Ours.
Figure 9. Comparison of detection results from different detection networks, where green boxes represent correct detections, red boxes represent false or duplicate detections, and orange boxes represent missed detections.
Table 1. Hyperparameter settings for training.
Hyperparameter         Value
Epochs                 200
Batch size             8
Initial learning rate  0.001
Momentum               0.937
Weight decay           0.0005
Input image            640 × 640
Optimizer              SGD
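For readers reproducing the training setup, the settings in Table 1 map directly onto the training arguments of the Ultralytics framework on which YOLO11 is built. The call below is a minimal sketch under that assumption; the dataset configuration file rice_panicle.yaml is a hypothetical placeholder, and the customized DRPU-YOLO11 components would additionally require the authors' model definition.

    from ultralytics import YOLO

    # Start from the standard YOLO11 nano weights; the DRPU-YOLO11 components
    # (CSP-PGMA, SOCFPN, PowerTAL) are not part of the stock configuration.
    model = YOLO("yolo11n.pt")

    model.train(
        data="rice_panicle.yaml",  # hypothetical dataset config (image paths + class names)
        epochs=200,                # Table 1: epochs
        batch=8,                   # Table 1: batch size
        imgsz=640,                 # Table 1: 640 x 640 input
        optimizer="SGD",           # Table 1: optimizer
        lr0=0.001,                 # Table 1: initial learning rate
        momentum=0.937,            # Table 1: momentum
        weight_decay=0.0005,       # Table 1: weight decay
    )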
Table 2. Comparison of experimental results with different hyperparameter β values.
Value of β  P (%)  R (%)  F1-Score (%)  mAP50 (%)  mAP50–95 (%)  Params (M)
0.2         81.0   78.4   79.7          84.6       50.2          3.21
0.3         82.5   78.4   80.4          85.7       51.4          3.21
0.4         80.9   78.6   79.7          84.5       50.7          3.21
0.5         82.0   78.3   80.1          85.3       51.1          3.21
0.6         81.4   78.1   79.7          84.7       50.9          3.21
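To make the role of β more concrete, the sketch below assumes that PowerTAL rescales each candidate's IoU with a power function of the form IoU^β before it enters the alignment metric, which is one plausible reading of Figure 5; it is an illustration, not the authors' exact implementation.

    import numpy as np

    def power_transform(iou: np.ndarray, beta: float) -> np.ndarray:
        # Illustrative power rescaling of IoU values in [0, 1]
        return np.clip(iou, 0.0, 1.0) ** beta

    ious = np.array([0.2, 0.5, 0.8, 0.95])
    for beta in (0.2, 0.3, 0.4, 0.5, 0.6):  # the values compared in Table 2
        print(beta, np.round(power_transform(ious, beta), 3))

With β < 1 the transformed curve lies above the identity line, so moderate-quality matches are lifted while the ranking of candidates is preserved; Table 2 indicates that β = 0.3 gave the best mAP50 on this dataset.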
Table 3. Results of ablation study experiments.
CSP-PGMA  SOCFPN  PowerTAL  P (%)  R (%)  F1-Score (%)  mAP50 (%)  Params (M)  GFLOPs (G)
–         –       –         80.6   76.4   78.4          83.3       2.58        6.3
✓         –       –         81.2   76.7   78.9          84.6       2.63        7.6
–         ✓       –         81.4   77.2   79.2          84.2       3.17        13.3
–         –       ✓         81.0   77.2   79.1          83.8       2.58        6.3
✓         ✓       –         81.2   77.6   79.4          84.7       3.22        14.6
✓         –       ✓         82.2   78.1   80.1          85.0       2.63        7.6
–         ✓       ✓         81.8   76.6   79.2          84.1       3.17        13.3
✓         ✓       ✓         82.5   78.4   80.4          85.7       3.22        14.6
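The F1-scores reported in Tables 3–5 follow directly from the corresponding precision and recall values; the short standalone check below reproduces the final row of Table 3.

    def f1_score(precision: float, recall: float) -> float:
        # Harmonic mean of precision and recall, both given in percent
        return 2 * precision * recall / (precision + recall)

    # Final row of Table 3 (all three components enabled): P = 82.5%, R = 78.4%
    print(round(f1_score(82.5, 78.4), 1))  # -> 80.4, matching the reported F1-score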
Table 4. Results for comparison of different backbone networks.
Methods       P (%)  R (%)  F1-Score (%)  mAP50 (%)  mAP50–95 (%)  Params (M)  GFLOPs (G)
YOLO11        80.6   76.4   78.4          83.3       50.0          2.58        6.3
EfficientViT  78.6   76.6   77.6          83.9       51.2          3.74        7.9
FasterNet     79.5   77.1   78.3          83.5       50.4          3.90        9.2
RepViT        81.0   75.9   78.4          83.5       50.7          6.43        17.0
ConvNeXtV2    80.4   75.2   77.7          83.3       49.8          5.39        12.5
Ours          81.2   76.7   78.9          84.6       51.3          2.63        7.6
Table 5. Comparison of detection performance of different detection models.
Methods     P (%)  R (%)  F1-Score (%)  mAP50 (%)  mAP50–95 (%)  Params (M)  GFLOPs (G)
RT-DETR     79.9   72.6   76.1          80.7       46.9          19.87       56.9
YOLOv8      80.1   74.9   77.4          82.5       49.8          3.01        8.1
YOLOv10     79.1   73.4   76.1          81.1       49.5          2.27        8.2
YOLO11      80.6   76.4   78.4          83.3       50.0          2.58        6.3
Panicle-AI  81.6   77.3   79.4          83.9       47.9          8.54        28.5
Ours        82.5   78.4   80.4          85.7       52.2          3.21        14.6
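The mAP50 margins of the proposed model over the compared detectors can be read directly from Table 5; the short check below reproduces them as percentage-point differences.

    # mAP50 values (%) taken from Table 5
    map50 = {"RT-DETR": 80.7, "YOLOv8": 82.5, "YOLOv10": 81.1,
             "YOLO11": 83.3, "Panicle-AI": 83.9, "Ours": 85.7}

    for name, value in map50.items():
        if name != "Ours":
            print(f"Ours vs {name}: +{map50['Ours'] - value:.1f} percentage points mAP50")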
Table 6. Architectural-level comparison between DRPU-YOLO11 and closely related detectors.
Models      Backbone Characteristics  Multi-Scale Feature Strategies  Label Assignments   Key Limitations
RT-DETR     Transformer-based         Global attention                Hungarian           High computation cost
YOLOv8      CSP-based CNN             PAN/FPN                         SimOTA              Sensitive to background clutter
YOLOv10     Efficiency-oriented CNN   Optimized YOLO neck             End-to-end aligned  Lacks explicit P2-guided fusion
YOLO11      C3k2-based CNN            PAN/FPN                         TAL                 Limited robustness under occlusion
Panicle-AI  Panicle-C3                Task-specific FPN               IoU-based           Larger model size
Ours        CSP-PGMA                  SOCFPN                          PowerTAL            Edge false positives
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
