1. Introduction
Rice is one of the world’s most important food crops and plays a crucial role in feeding the world’s population [
1]. Yield estimation, a critical aspect of rice production, primarily depends on the number of rice panicles per unit area [
2]. Rapid and accurate determination of the number of rice panicles per unit area is important for yield estimation as well as crop breeding and plant phenotyping [
3]. Traditionally, the number of rice panicles has been counted manually, which is time-consuming, labor-intensive and inefficient [
4]. In order to improve the efficiency of rice panicle detection and counting, there is an urgent need for a fast and accurate real-time detection algorithm suitable for portable computing devices in field conditions.
In recent years, with the advent of artificial intelligence and advances in computer hardware, machine learning and deep learning [
5] techniques have been introduced into agriculture. Traditional machine learning excels in handling classification and prediction tasks involving structured data. Algorithms such as support vector machines, random forests, and continuous wavelet transforms have been widely utilized for disease identification [
6,
7], classification [
8], yield prediction [
9,
10], and growth stage monitoring [
11]. However, the above machine learning methods face limitations when applied to complex image-based tasks. For instance, in rice panicle detection [
12], these methods rely on manually designed and selected features, leading to insufficient accuracy in recognizing individual rice panicles in complex field conditions and limited generalization ability.
By contrast, deep learning automatically learns features of a target from massive datasets. In recent years, deep learning techniques have been increasingly applied in agriculture, particularly in object detection, including crop classification [
13], pest detection [
14], fruit identification and counting [
15], plant identification [
16], weed identification [
17], and animal behavior recognition [
18]. To meet diverse application requirements, deep learning-based object detection models are categorized into one-stage and two-stage models, accommodating the real-time and accuracy needs of various detection tasks. Two-stage algorithms are represented by models such as Region-based Convolutional Neural Networks (R-CNNs) [
19], Faster R-CNN [
20], etc., in which candidate regions are generated in the first stage and targets are classified in the second stage. Zhang et al. [
21] proposed a Faster R-CNN-based method to detect rice panicles in indoor environments. Jiang et al. [
22] proposed an improved Faster R-CNN model that enhances the detection of occluded and small-sized rice panicles. These studies have demonstrated the performance of two-stage models for rice panicle detection on their respective datasets. However, the detection speed of two-stage models is generally much lower than that of one-stage methods, and these models tend to be large in size with high computational costs. Consequently, they are not well-suited for real-time applications in the field. One-stage models predict both the target's bounding box and category in a single pass, striking a better balance among detection speed, accuracy, and model size. Among the one-stage object detection models, You Only Look Once (YOLO) [
23] and the Single Shot MultiBox Detector (SSD) [
24] stand out as prominent representatives. The YOLO series of algorithms have become widely popular and are extensively used, owing to their high speed and accuracy. Sun et al. [
25] and Wang et al. [
26] both used a YOLO-based approach to detect rice panicles from images captured by stationary ground-based cameras. However, ground-based cameras have limited coverage for capturing image data and generally operate with lower efficiency, making them unsuitable for large-scale field surveys.
The emergence of drones, also known as unmanned aerial vehicles (UAVs), has significantly expanded coverage and greatly enhanced the efficiency of data acquisition. Equipped with sensors such as high-resolution RGB cameras, multispectral cameras, or radar, drones can rapidly collect large volumes of diverse and comprehensive data, providing valuable insights in a short period of time. Therefore, UAVs have been widely used in agriculture [
27]. In the detection of rice panicles, Zhou et al. [
28] used a drone to capture images of rice panicles at a height of 17 m and applied an improved region-based fully convolutional network, achieving a detection accuracy of 86.8%. Chen et al. [
29] proposed an algorithm named Refined Feature Fusion for Panicle Counting, which extracts and fuses optimal features based on the size distribution of the objects and achieved an average counting accuracy of 89.80%. In practical agricultural applications, it is essential to obtain timely information about rice panicles using portable and low-cost equipment. However, most existing algorithms capable of real-time detection and counting of rice panicles require significant computational power, making them unsuitable for deployment on resource-constrained portable or edge devices such as laptops and microcomputers. Thus, it is crucial to develop efficient rice panicle detection models that can operate on portable, low-cost devices, maintaining high accuracy while enabling real-time detection.
To enable real-time detection and counting of rice panicles with diverse morphologies in complex field environments, this study introduces a lightweight detection model, YOLO-Rice, and demonstrates its deployment on portable computing devices. The main contributions of this work are as follows: (1) Dataset Creation: Image data of rice panicles under varying scales, lighting conditions, and morphologies were collected. Preprocessing steps, including image cropping and data augmentation, were performed to build a comprehensive field dataset of rice panicles. (2) Lightweight Backbone Network: The FasterNet [
30] architecture was adopted as the backbone feature extraction network for the YOLO-Rice model. This choice aimed to reduce model parameters and computational complexity, making the model more lightweight. (3) Enhanced Detection Performance: A Normalization-based Attention Module (NAM) was incorporated into the backbone network of the YOLO architecture to enhance the model’s detection performance specifically for rice panicles. (4) Model Optimization: The original three detection heads of the YOLO model were reduced to two, further decreasing the model size while optimizing the neck network structure. (5) Improved Loss Function: The Minimum Point Distance-based IoU (MPDIoU) was employed as a replacement for the CIoU loss function in the YOLOv8n framework to enhance overall network performance. Finally, the optimized lightweight model was deployed on portable computing devices, and its performance was validated through field tests.
2. Materials and Methods
2.1. Image Data Acquisition
The experimental data were collected in the rice fields at the Zhuanghang Comprehensive Experimental Station of Shanghai Academy of Agricultural Sciences, Fengxian District, Shanghai, China (30.891° N, 121.359° E), as shown in
Figure 1. The rice variety was Shenyou 28; it was sown in late May and had an approximate growth cycle of 150–155 days. The rice fields comprised 36 plots with four nitrogen fertilization levels (0, 100, 200, and 300 kg ha⁻¹), leading to varied panicle densities and morphologies.
Rice panicle images were captured using a DJI Mavic 2 Pro drone (SZ DJI Technology Co., Shenzhen, China) equipped with a Hasselblad L1D-20c camera. The camera has a field of view (FOV) of 77° and 20 million effective pixels. The images were stored in JPG format with a resolution of 5472 × 3648 pixels. Both ISO and aperture were set to automatic mode for optimal image capture. The flight height was set to approximately 3 m above the rice canopy. The flight campaign was performed between 9 a.m. and 2 p.m. on 25 September, when the rice was at the tasseling to ripening stage. The weather was partly cloudy with intermittent sunshine, creating varying lighting conditions.
Figure 2 presents a selection of images from the dataset, revealing the diversity within the rice panicle images. These images capture rice panicles in various sizes, shapes, and developmental stages. Moreover, they illustrate challenging environmental factors such as dense clustering, mutual occlusion, varying lighting conditions, and water surface disturbances, all of which significantly complicate the accurate detection of rice panicles.
2.2. Data Processing
To reduce the time required for labeling rice panicle targets and training the model, each original image was first cropped into multiple sub-images of 640 × 640 pixels. These cropped sub-images, saved in JPG format, contained between 10 and 70 rice panicles each. After collecting the rice panicle sub-images, manual labeling of the targets was conducted using LabelImg software (Version 1.8.6). Each rice panicle was annotated by drawing the smallest enclosing rectangle around it, with the labeling information saved in a corresponding TXT file following the YOLO object detection format.
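For reference, each line of a YOLO-format TXT label file gives the class index followed by the normalized center coordinates, width, and height of one bounding box. A minimal hypothetical example for a sub-image containing three panicles (class 0 denotes a rice panicle; the values are illustrative, not taken from the dataset):

0 0.412 0.305 0.082 0.117
0 0.655 0.521 0.074 0.098
0 0.231 0.774 0.091 0.126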
To prevent model overfitting and to improve model robustness and rice panicle recognition, data augmentation methods, including adding blur and noise, cropping, brightness adjustment, contrast adjustment, rotation, and flipping, were randomly applied to the labeled images before building the dataset. These methods simulate different data acquisition conditions, such as different lighting (via brightness and contrast adjustments) and different shooting angles (via flipping and rotation). Ultimately, 5525 rice panicle images were compiled to form the dataset for this experiment. The augmentation effects are shown in
Figure 3.
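As a minimal sketch of such an augmentation pipeline (implemented here with OpenCV and NumPy; the probabilities and parameter ranges are illustrative assumptions rather than the exact settings used in this study):

import cv2
import numpy as np

def augment(image):
    # Randomly apply blur, noise, brightness/contrast changes, flipping, and rotation.
    if np.random.rand() < 0.5:  # Gaussian blur
        image = cv2.GaussianBlur(image, (5, 5), 0)
    if np.random.rand() < 0.5:  # additive Gaussian noise
        noise = np.random.normal(0, 10, image.shape).astype(np.float32)
        image = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    if np.random.rand() < 0.5:  # brightness/contrast: out = alpha * img + beta
        alpha = np.random.uniform(0.8, 1.2)
        beta = np.random.uniform(-25, 25)
        image = cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
    if np.random.rand() < 0.5:  # horizontal flip
        image = cv2.flip(image, 1)
    if np.random.rand() < 0.5:  # 90-degree rotation
        image = cv2.rotate(image, cv2.ROTATE_90_CLOCKWISE)
    return image

Note that geometric transformations such as cropping, flipping, and rotation must also be applied to the corresponding bounding-box labels.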
The augmented images and label files were divided into training, validation, and test sets in an 8:1:1 ratio. The training set, consisting of 4420 images, contained 223,847 rice panicle targets, while the validation set included 552 images with 27,681 rice panicle targets. The test set was made up of 553 images with 28,048 rice panicle targets.
2.3. YOLOv8 Object Detection Model
The YOLO object detection algorithm was introduced in 2015 [
23] and quickly became popular due to its high speed and accuracy. Ultralytics LLC, the team behind YOLOv5, released YOLOv8 in January 2023. As a one-stage object detection algorithm, YOLO provides a better balance between detection accuracy and speed compared to two-stage algorithms, leading to its wide use in agriculture. Similar to YOLOv5, YOLOv8 offers models of different sizes across five scales: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. This study selected YOLOv8n as the foundational model, taking into account factors such as detection speed and training cost.
2.4. YOLO-Rice Rice Panicle Detection Model
The YOLO-Rice lightweight rice panicle detection model proposed in this study made several improvements compared to the original YOLOv8n model. Firstly, the lightweight feature extraction network, FasterNet, was used to replace the original backbone feature extraction network of YOLOv8n. The neck network employed a two-layer detection head and incorporated a NAM module to enhance feature extraction while reducing the model size. Finally, MPDIoU was utilized to improve network performance. The model structure of the improved YOLO-Rice is shown in
Figure 4.
2.4.1. Lightweight Feature Extraction Network
In order to improve the processing speed of the network while ensuring accuracy, the FasterNet network was used to replace the original backbone feature extraction network of YOLOv8n. The FasterNet architecture diminishes the computational and memory demands, allowing algorithms to operate efficiently in environments with limited resources. It enhances computational efficiency and accelerates model inference by employing Partial Convolution (PConv).
In the submodules of the FasterNet network, PConv is employed to reduce redundant computations and memory access. The convolution module processes only a subset of the channel information, while the remaining channels are not involved in the computation. This approach allows for the extraction of spatial features while simultaneously reducing FLOPs, lowering network latency, and enhancing computational speed. The use of PConv is particularly beneficial when dealing with complex backgrounds in rice panicle detection tasks. Given that rice panicles vary in density and are often influenced by background elements such as leaves and water, FasterNet’s efficient convolution process allows the model to focus on the most relevant features in the image, improving detection speed and accuracy.
A schematic diagram of the PConv structure is shown in
Figure 5. Assuming that the input and output feature maps have height $h$, width $w$, and the same number of channels $c$, and letting $k$ represent the kernel size, the theoretical computational cost in terms of FLOPs for a standard $k \times k$ convolution can be calculated using the following formula:

$$\mathrm{FLOPs}_{\mathrm{Conv}} = h \times w \times k^{2} \times c^{2} \quad (1)$$

For the PConv module, let $c_p$ denote the number of channels processed by PConv. The module processes only $c_p$ channels, while the remaining $c - c_p$ channels are not processed. Therefore, the FLOPs for PConv can be expressed as

$$\mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^{2} \times c_p^{2} \quad (2)$$

When $c_p = c/4$, the FLOPs of PConv are $1/16$ of those of the standard convolution. Additionally, to fully utilize the features from all channels, a Pointwise Convolution (PWConv) module is connected after the PConv, which is analogous to a standard $1 \times 1$ convolution. Therefore, the total FLOPs for the two convolution modules can be expressed as

$$\mathrm{FLOPs}_{\mathrm{PConv+PWConv}} = h \times w \times \left(k^{2} \times c_p^{2} + c^{2}\right) \quad (3)$$
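A minimal PyTorch sketch of Partial Convolution as described above, assuming a partial ratio of 1/4 and the split-and-concatenate forward variant (an illustrative reimplementation, not the authors' code):

import torch
import torch.nn as nn

class PConv(nn.Module):
    # Apply a k x k convolution to only c_p = c / n_div channels; pass the rest through.
    def __init__(self, channels, n_div=4, kernel_size=3):
        super().__init__()
        self.c_p = channels // n_div          # channels that are convolved
        self.c_rest = channels - self.c_p     # channels left untouched
        self.conv = nn.Conv2d(self.c_p, self.c_p, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_p, self.c_rest], dim=1)
        x1 = self.conv(x1)                    # spatial features from a channel subset
        return torch.cat((x1, x2), dim=1)     # concatenate with the untouched channels

A 1 × 1 PWConv placed after this module then mixes information across all channels, giving the combined cost in Equation (3).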
The images captured by UAVs are characterized by a wide viewing angle, high resolution, and real-time acquisition, which impose stringent demands on the processing speed and accuracy of detection algorithms. Leveraging its efficient architecture, FasterNet is capable of quickly processing large volumes of UAV-captured images for rice panicle detection. By reducing computational complexity, FasterNet enables the model to effectively handle rice panicles in environments with varying morphologies, densities, and occlusions, while maintaining high accuracy in challenging scenarios, such as overlapping leaves and complex backgrounds. The integration of FasterNet into the YOLO-Rice network significantly boosts processing speed while maintaining high accuracy, effectively meeting the real-time requirements for rice panicle detection and counting.
2.4.2. Normalization-Based Attention Module
In this study, we collected images of rice panicles with various shapes, sizes, colors, morphologies, and complex and variable backgrounds under different lighting conditions. These differences place high demands on the model’s ability to extract features. Therefore, in order to better capture the rice panicle features while constructing a lightweight model, the NAM [
31] module was added. NAM adaptively adjusts the network's attention to rice panicle features, thereby improving detection performance.
The attention mechanism includes channel attention, spatial attention, and a hybrid of both. SENet [
32] pioneered channel attention with its core Squeeze and Excitation module, which uses global average pooling and channel weighting to focus on important channel information, albeit limiting the selection of key regions. To address this limitation, the Convolutional Block Attention Module (CBAM) [
33] combines channel and spatial attention, enabling the network to focus on informative channels and their important regions. However, the spatial attention distribution across all output channels in CBAM remains consistent. NAM builds on this by redesigning the channel and spatial attention submodules, as shown in
Figure 6, to enhance network flexibility and performance.
For the Channel Attention Module, NAM uses the weights from Batch Normalization (BN) [34] to identify relatively important channels, as shown in Equation (4):

$$B_{out} = \mathrm{BN}(B_{in}) = \gamma \frac{B_{in} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}} + \beta \quad (4)$$

where $B_{in}$ and $B_{out}$ are the input and output of the batch normalization layer, respectively; $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^{2}$ are the mean and variance of the input batch of data, respectively; $\gamma$ and $\beta$ are the scaling and shifting parameters learned through training; and $\epsilon$ is a small constant used for numerical stabilization. The Channel Attention Module is shown in Figure 6 and formulated in Equation (5):

$$M_{c} = \mathrm{sigmoid}\left(W_{\gamma}\left(\mathrm{BN}(F_{1})\right)\right) \quad (5)$$

where $M_{c}$ represents the output of channel attention, $F_{1}$ is the input feature map, $\gamma$ represents the scaling factor for each channel, and the weight is $W_{\gamma} = \gamma_{i} / \sum_{j} \gamma_{j}$. In addition, the Spatial Attention Module is shown in Figure 6 and formulated in Equation (6):

$$M_{s} = \mathrm{sigmoid}\left(W_{\lambda}\left(\mathrm{BN}_{s}(F_{2})\right)\right) \quad (6)$$

where $M_{s}$ denotes the output of spatial attention, $F_{2}$ is the input feature map, $\lambda$ is the scaling factor in the spatial dimension, and the weight is $W_{\lambda} = \lambda_{i} / \sum_{j} \lambda_{j}$.
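A simplified PyTorch sketch of the channel attention branch in Equation (5), based on the published NAM formulation (illustrative only; the spatial branch in Equation (6) is analogous, with the normalization applied over the spatial dimension):

import torch
import torch.nn as nn

class NAMChannelAttention(nn.Module):
    # Weight each channel by its normalized BatchNorm scaling factor (Equation (5)).
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        # W_gamma: normalize the learned BN scaling factors into per-channel weights.
        weight = self.bn.weight.abs() / torch.sum(self.bn.weight.abs())
        x = x * weight.view(1, -1, 1, 1)
        return torch.sigmoid(x) * residual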
By combining the two attention mechanisms, NAM enables the network to selectively focus on informative channels and their corresponding spatial regions. This enhances the model’s ability to identify key features in the image, which is crucial for accurately detecting rice panicles, even in complex background conditions. This targeted emphasis on relevant features plays a crucial role in improving the model’s robustness and its ability to effectively distinguish the target from distracting background elements.
2.4.3. Loss Function Improvement
In the YOLO algorithm, Intersection over Union (IoU) is a measure of how accurately the corresponding object is detected in a given dataset. IoU calculates the overlap between the predicted bounding box and the ground-truth bounding box as the ratio of their intersection to their union, as shown in Equation (7). Ideally, the two boxes overlap completely and the ratio equals 1.

$$\mathrm{IoU} = \frac{\left|A \cap B\right|}{\left|A \cup B\right|} \quad (7)$$

where $A$ and $B$ denote the predicted and ground-truth bounding boxes, respectively.
Complete Intersection over Union Loss (CIoU) [
35] is used in the YOLOv8n algorithm. However, the CIoU formula mainly reflects the difference in aspect ratio rather than the true differences between width and height and their respective confidences, which sometimes hinders model optimization. Moreover, the rice panicle targets in this study had varied morphologies and distribution densities, and the edges of the panicles were irregular, with overlapping in some areas. CIoU therefore has limitations in dealing with these types of rice panicle targets with complex shapes and irregular locations.
To enhance the detection of rice panicles in complex backgrounds, this study optimizes the model by integrating the MPDIoU loss function. MPDIoU extends the traditional IoU method by incorporating the minimum point distance, while also considering key factors such as overlapping areas, centroid distance, and deviations in width and height, thereby improving the precision of bounding box regression. In complex backgrounds, challenges such as background noise, occlusion, and cluttered elements often hinder accurate detection, particularly when the background closely resembles the target object. Traditional IoU methods may struggle to effectively differentiate the target from the background under such conditions. MPDIoU mitigates this issue by calculating the minimum distance between the upper-left and lower-right corners of the predicted and ground truth bounding boxes, ensuring better alignment and more accurate localization. The formulas of the MPDIoU calculation process are as follows:

$$d_{1}^{2} = \left(x_{1}^{B} - x_{1}^{A}\right)^{2} + \left(y_{1}^{B} - y_{1}^{A}\right)^{2} \quad (8)$$

$$d_{2}^{2} = \left(x_{2}^{B} - x_{2}^{A}\right)^{2} + \left(y_{2}^{B} - y_{2}^{A}\right)^{2} \quad (9)$$

$$\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_{1}^{2}}{w^{2} + h^{2}} - \frac{d_{2}^{2}}{w^{2} + h^{2}} \quad (10)$$

$$L_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU} \quad (11)$$

where $(x_{1}^{A}, y_{1}^{A})$ and $(x_{2}^{A}, y_{2}^{A})$ are the coordinates of the upper-left and lower-right corners of the prediction frame $A$; $(x_{1}^{B}, y_{1}^{B})$ and $(x_{2}^{B}, y_{2}^{B})$ are the coordinates of the upper-left and lower-right corners of the target frame $B$; and $w$ and $h$ are the width and height of the input image.
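A minimal Python sketch of Equations (8)–(11) for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates (an illustrative implementation of the published formulation, not the exact training code):

def mpdiou_loss(box_a, box_b, img_w, img_h):
    # IoU of the predicted box (box_a) and the ground-truth box (box_b), Equation (7).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    # Squared distances between corresponding corners, Equations (8) and (9).
    d1 = (box_a[0] - box_b[0]) ** 2 + (box_a[1] - box_b[1]) ** 2
    d2 = (box_a[2] - box_b[2]) ** 2 + (box_a[3] - box_b[3]) ** 2
    # MPDIoU and its loss, Equations (10) and (11).
    diag = img_w ** 2 + img_h ** 2
    mpd_iou = iou - d1 / diag - d2 / diag
    return 1.0 - mpd_iou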
2.4.4. Network Neck Improvement
During the UAV flight, rice panicles appear smaller compared to images captured by a close-range handheld device, primarily due to the flight altitude. Additionally, the overhead angle limits the capture to only the upper portions of the rice panicles, leading to reduced pixel coverage for certain targets. To accurately detect these smaller and incomplete targets, the network must focus on more detailed information. The original neck network of the YOLOv8n has three scaled feature layers: P3 (80 × 80), P4 (40 × 40), and P5 (20 × 20). The P5 detection head in the network is often used to process lower resolution feature maps and is primarily used for the detection of medium to large targets. Given the dataset’s limited number of larger rice panicle targets, we opted to remove the P5 detection head and retain only the P3 and P4 layers. This modification improves detection accuracy for small- and medium-sized targets while also reducing model complexity.
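For a 640 × 640 input, the three detection scales correspond to strides of 8, 16, and 32, so removing the P5 head discards only a small fraction of the prediction cells while eliminating an entire detection branch (a simple back-of-the-envelope illustration, not implementation code):

# Grid sizes for a 640 x 640 input at strides 8, 16, and 32.
grids = [640 // s for s in (8, 16, 32)]   # [80, 40, 20]
print(sum(g * g for g in grids))          # 8400 prediction cells with P3 + P4 + P5
print(sum(g * g for g in grids[:2]))      # 8000 prediction cells with only P3 + P4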
2.5. Experiment Platform
The training and testing of models were conducted on a high-performance computer (HPC) running the Windows 11 operating system. The software environment for model training included Python 3.11, CUDA 12.3, and cuDNN 8.9.6.50, with the deep learning framework being PyTorch 2.2.1.
Table 1 presents some of the training parameter configurations used in this study.
In this study, the original training platform, a high-performance computing (HPC) system, offers substantial computational power and processing speed. However, it does not align with the practical application requirements of this research. To enable real-time processing and analysis of field data, the proposed lightweight model must be deployable on resource-constrained portable devices or embedded systems, which are typically limited in computational power, memory, and storage capacity.
To comprehensively evaluate the model’s performance in real-world scenarios, three representative test platforms were selected: the ThinkPad X13 Gen 2, the Apple Mac mini M2, and the Raspberry Pi 5. The ThinkPad X13 Gen 2 and Apple Mac mini M2 exemplify portable computing devices optimized for scenarios demanding both high performance and compact form factors. In contrast, the Raspberry Pi 5 represents a low-power embedded device widely used in edge computing applications. These platforms also span a diverse range of operating systems—Windows, macOS, and Linux—and processor architectures, enabling a well-rounded assessment of the model under varying computational constraints.
The main hardware specifications of these devices are summarized in
Table 2. The software environment for model testing includes Python 3.11, with PyTorch 2.2.1 serving as the deep learning framework.
2.6. Evaluation Metrics
The study used four evaluation metrics to assess the performance of the models for rice panicle detection, namely Average Precision (AP), mean Average Precision (mAP), Precision, and Recall, as shown in Equations (12)–(15).

$$\mathrm{AP} = \int_{0}^{1} P(R)\, dR \quad (12)$$

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_{i} \quad (13)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (14)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (15)$$
where Average Precision (AP) denotes the precision for individual categories. The mAP is defined as the average of the AP values across all categories, representing the overall precision of the model. From Equations (14) and (15), it can be inferred that the calculations of Precision and Recall are contingent upon the counts of True Positives (TP), False Positives (FP), and False Negatives (FN). TP indicates that positive sample targets are correctly identified as positive, while True Negative (TN) signifies that negative samples are correctly identified as negative. FP refers to negative samples that are incorrectly classified as positive, and FN denotes positive samples that are incorrectly classified as negative. Precision is the proportion of model-identified positive samples that are actually positive, while Recall is the proportion of actual positive samples correctly identified by the model.
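As a worked illustration of Equations (14) and (15) (the counts below are hypothetical, not results from this study):

def precision_recall(tp, fp, fn):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Example: 900 correctly detected panicles, 60 false detections, 70 missed panicles.
p, r = precision_recall(tp=900, fp=60, fn=70)  # p ~= 0.938, r ~= 0.928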
To evaluate the complexity and detection speed of the models, this study utilized three metrics: Parameters, Model Size (MS), and Frames Per Second (FPS). Parameters refers to the total number of learnable and optimizable parameters within the model. MS denotes the amount of disk space occupied by the model file after training is completed. FPS is calculated based on the total time consumed by each module of the algorithm.
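A minimal sketch of how FPS can be measured by timing inference over the test images (the model and image-loading interfaces are hypothetical placeholders):

import time

def measure_fps(model, images):
    # Average frames per second over a list of preprocessed images.
    start = time.perf_counter()
    for img in images:
        model(img)  # run detection on one image
    elapsed = time.perf_counter() - start
    return len(images) / elapsed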
The counting performance of the models was evaluated using the coefficient of determination (R²) and the Root Mean Square Error (RMSE). A higher R² value indicates a stronger relationship between the predicted and actual values. RMSE quantifies the deviation between predicted and actual values, with a smaller RMSE signifying that the predicted values are closer to the actual values. Therefore, an R² value closer to 1, combined with a lower RMSE, indicates greater accuracy in the model's counting results. The calculations for R² and RMSE are presented in Equations (16) and (17).

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i} - \bar{y}\right)^{2}} \quad (16)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}} \quad (17)$$

where $y_{i}$, $\hat{y}_{i}$, and $\bar{y}$ are the actual value, predicted value, and average of the actual values, respectively, and $n$ is the number of samples.
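A minimal NumPy sketch of Equations (16) and (17) applied to per-image panicle counts (illustrative only):

import numpy as np

def r2_rmse(y_true, y_pred):
    # Coefficient of determination (Equation (16)) and RMSE (Equation (17)).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return r2, rmse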
2.7. Experiment Setting
We conducted several experiments to validate the performance of YOLO-Rice, using the dataset developed in this study for training and testing all comparative models. The specific experimental procedures are as follows: (1) To determine the optimal lightweight network, FasterNet, MobileNetv3 [
36], GhostNet [
37], and ShuffleNet [
38] were integrated into the backbone of YOLOv8n, and their performance was evaluated. (2) An ablation study was performed to assess the impact of various improvement mechanisms on model performance. (3) YOLO-Rice was compared with several mainstream deep learning object detection algorithms from recent years. (4) Finally, to verify the lightweight nature of the proposed model and its performance on resource-constrained devices, all models were deployed on three devices, and the impact of varying input image resolutions on performance in embedded systems was further evaluated. The technical flow of this study is shown in
Figure 7.
5. Conclusions
This study presents YOLO-Rice, a lightweight and efficient rice panicle detection model tailored for large-field applications using UAV imagery. The model leverages a modified YOLOv8n framework, integrating a lightweight backbone network, FasterNet, and a two-layer detection head to enhance detection performance while reducing computational overhead. The inclusion of the NAM and the MPDIoU loss function further improves the model's ability to accurately detect rice panicles in complex agricultural settings. The experimental results validate the effectiveness of YOLO-Rice, achieving an object detection accuracy and mAP of 93.5% and 95.9%, respectively. The model's compact size, with parameters reduced to only 32.6% of the original YOLOv8n model, facilitates deployment on resource-constrained platforms such as the Raspberry Pi 5. The model demonstrated a significant increase in FPS, with a 15.3% reduction in average detection time per image compared to YOLOv8n, highlighting its suitability for real-time applications. The performance of YOLO-Rice on various computing devices, including high-performance computers and low-cost, portable or edge devices like the Raspberry Pi 5, underscores its versatility and practicality for in-field use. The model's robustness against challenges such as optical distortion and varying growth conditions of rice panicles was also demonstrated, although some missed detections in densely packed areas were noted, suggesting potential areas for future improvement. In conclusion, YOLO-Rice offers a promising solution for accurate and efficient rice panicle detection, contributing valuable support to rice yield estimation and related agricultural practices.
In this study, to ensure detection accuracy, our experimental design prioritized capturing the clearest possible images with minimal disruption to the canopy caused by drone-induced wind effects. However, this approach, while effective for improving detection precision under controlled conditions, may not fully account for more variable scenarios, such as different drone models, camera specifications, and diverse rice field conditions. Future research should address these limitations by collecting more diverse datasets. This includes using drones and cameras of various models, acquiring images at multiple altitudes, and covering a wider range of rice varieties and planting methods. Such efforts will contribute to the development of rice panicle detection models with broader applicability and improved performance across diverse environments, enabling the models to be effectively utilized in real-world crop monitoring scenarios.