1. Introduction
Wheat is one of the world’s primary staple crops and plays a crucial role in meeting global food demand. Its production and quality control are critical to global food security [1]. The wheat ear, the reproductive organ of the wheat plant, directly influences both the yield and the quality of wheat [2]. Research indicates that the number of ears per unit area is the dominant yield component, and negative correlations were obtained between most of the parameters affecting the yield per plant and the yield per unit area [3]. Plant breeding experts combine information on wheat ear quantities from different varieties with genetic and environmental factors. Through hybridization experiments involving a large number of wheat varieties, they select strains suited to various growing conditions, developing more resistant and higher-yielding varieties [4]. Effective detection and counting of wheat ears are essential for wheat yield prediction and for ensuring food security [5]. A fast and efficient automatic counting method for wheat ears is therefore of great significance: wheat ear detection technology enables precise assessment of wheat ear quantities in the field, providing vital support for agricultural production, field management, and food trade [6].
Due to the high planting density of wheat, accurate counting of wheat ears is a challenging task. In the past, wheat yield estimation relied primarily on labor-intensive manual counting [7,8] and expert visual estimation [9]. The former approach is not only time-consuming and inefficient but also struggles to acquire accurate data in large-scale farmland settings. The latter method is subjective, demanding a high level of expertise in agriculture and resulting in difficulties in scientifically and accurately estimating the correct yield. As a result, the aforementioned methods are unable to rapidly and precisely estimate the wheat yield in large-scale wheat fields.
With the advancement of computer vision technology, image processing techniques have been widely applied in agricultural production. Fernandez-Gallego et al. proposed a method for automatic wheat ear counting using RGB drone images. This method utilizes techniques such as frequency filtering, segmentation, and feature extraction to achieve efficient and accurate wheat ear counting [10]. Tan et al. introduced a rapid identification method for field wheat ears based on superpixel segmentation algorithms and digital images. This approach involves image classification based on color feature parameters and analysis of wheat ear morphology, demonstrating both speed and accuracy [11]. Bao et al. presented a wheat ear counting method based on frequency domain decomposition. They employed multiscale support value filtering (MSVF) in combination with improved sampling contour transformation (ISCT) for frequency domain decomposition of wheat ear images, after which the wheat ear images were segmented and counted [12]. Fang et al. proposed an automatic wheat tiller counting algorithm based on ground LiDAR data. This algorithm utilizes adaptive hierarchical and hierarchical clustering algorithms to comprehensively leverage 3D crop information in field environments, successfully counting wheat tillers of different varieties, nitrogen levels, planting densities, and ecological conditions [13].
In recent years, an increasing number of researchers have begun applying deep learning techniques in agriculture. Compared to traditional image processing methods, deep learning offers higher adaptability, accuracy, generalization, and scalability, and therefore performs better when processing large-scale image datasets and handling complex image tasks. Pérez-Porras et al. proposed a method for early on-ground image-based detection of poppies (Papaver rhoeas) in wheat using the YOLO architecture. Their findings demonstrate that the deep-learning-based object detection strategy can accurately identify poppies at an early stage, providing precise information for the development of accurate wheat weed management [14]. Yang et al. proposed a deep-learning-based cross-platform model for wheat ear counting. This model incorporates a collaborative attention mechanism, achieving high-density counting of wheat ears while maintaining high counting accuracy and a reduced number of model parameters [15]. Zaji et al. introduced automatic object level augmentation (AutoOLA), which decouples different objects in wheat images and generates augmented images through random combinations, significantly reducing the required training sample size for the wheat ear model [16]. Alkhudaydi et al. introduced SpikeCount, a density-based method for wheat ear counting. This approach automatically extracts useful features from images using a fully convolutional neural network and utilizes transfer learning to optimize model training [17]. Qiu et al. proposed an unsupervised learning method that automatically detects and labels wheat ears from wheat ear images. They established a dataset to train a deep convolutional neural network model for accurate detection and counting of wheat ears [18].
The Internet and computer technology are finding broader applications in agriculture [19]. Modern agriculture is increasingly demanding efficient and precise intelligent solutions. Despite the considerable research efforts focused on wheat ear counting [20], challenges persist due to variations in wheat plants across growth stages and environmental conditions, as well as the diversity of wheat ears in images. Achieving accurate and efficient wheat ear counting remains a complex task. Furthermore, previous studies on wheat ear counting, whether based on image processing or deep learning methods, predominantly employed static counting approaches: wheat ears are counted in each acquired image, and the per-image counts are then summed to obtain the total. This approach lacks real-time capability and involves time-consuming and intricate data preparation. To avoid counting the same wheat ears repeatedly, data collectors need to precisely control the shooting range while capturing images, which makes field operation difficult and data collection slow and inefficient. Wu et al. used YOLOv7 and DeepSORT, trained on a subset of the GWHD dataset, for real-time wheat ear counting [21]. However, the model has too many parameters and a slow inference speed. Even on a high-end GPU such as the RTX 3090 Ti, it achieved only 14 FPS, making it unsuitable for real-time counting in large wheat fields. This approach also demands high computational resources, increasing hardware costs and hindering future deployment of the model on mobile devices.
In response to the challenges outlined above and to address the complex issues related to efficient automatic wheat ear counting, our main objectives were as follows: (i) to propose a novel lightweight wheat ear counting model, introducing an efficient real-time wheat ear counting method based on leading-edge artificial intelligence (AI) and Internet technology (IT) solutions, and (ii) to further advance globally important agricultural practices in wheat monitoring and production. The new method implemented by our model is intended to accurately identify and count wheat ears in real time under unmanned aerial vehicle (UAV) conditions. This significantly reduces manual labor and automatically calculates the number of wheat ears, providing a preliminary evaluation of wheat yield in the field to support agricultural management and decision-making. We pursue our main objectives through the following technological approaches and specific objectives:
- (1) To enhance the robustness of our model, various data augmentation methods were applied to the acquired dataset to ensure the model would perform well under diverse conditions, such as different contrast, lighting, and environments.
- (2) To improve the computational efficiency of our model, FasterNet [22] was utilized as the primary backbone for feature extraction. A specific objective was to enhance computational efficiency while minimizing the number of parameters, thereby making the model easily deployable on mobile devices.
- (3) To enhance the backbone network, dynamic sparse attention and deformable convolution modules were integrated into the model. A specific objective was to mitigate the influence of intricate environmental factors, such as the stickiness of wheat ears, while improving the model’s capability to efficiently extract wheat ear features.
- (4) To comprehensively capture fine details and context characteristics, a feature pyramid network (FPN) [23] and lightweight upsampling operators were integrated into the path aggregation network (PAN) [24]. A specific objective was to enhance the capability of the proposed model to detect wheat ears of various sizes through optimal extraction of multi-scale features while minimizing the information loss during upsampling.
- (5) To build upon the wheat ear detection algorithm, a Kalman filter-based tracking algorithm was incorporated into our model. A specific objective was to overcome the limitations of traditional image-based counting methods by achieving accurate motion prediction, thereby avoiding repeated counting across a continuous sequence by analyzing the context of video frames. Another objective was to significantly decrease the amount of manual work needed for wheat ear counting in the field.
3. Results and Discussion
This study trained and tested the model on the Ubuntu 18.04.5 LTS 64-bit operating system. The experimental environment employed an NVIDIA RTX 3090 (24 GB) graphics card with a CUDA 11.1 driver. Python 3.8.3 and the deep learning framework PyTorch 1.8.0 were utilized. The final set of hyperparameters is presented in Table 2.
During training, we employed common object detection techniques like mosaic data augmentation, cosine learning rate scheduling, and hyperparameter evolution.
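As an illustration of cosine learning rate scheduling, the sketch below computes a cosine-annealed learning rate per epoch. It is a minimal, generic formulation and not necessarily the exact schedule used in this study; the function name `cosine_lr`, the bounds `lr_max` and `lr_min`, and the epoch count are hypothetical values chosen for the example (the actual hyperparameters are those listed in Table 2).

```python
import math


def cosine_lr(epoch, total_epochs, lr_max=0.01, lr_min=0.0001):
    """Return a cosine-annealed learning rate for a 0-based epoch index.

    lr_max and lr_min are illustrative assumptions, not values from the paper.
    """
    # Fraction of training completed, from 0.0 (first epoch) to 1.0 (last epoch).
    progress = epoch / max(1, total_epochs - 1)
    # Cosine factor decays smoothly from 1 to 0 over the course of training.
    factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_min + (lr_max - lr_min) * factor


# Build a schedule for a hypothetical 100-epoch run.
schedule = [cosine_lr(e, 100) for e in range(100)]
print(f"first={schedule[0]:.4f} mid={schedule[50]:.4f} last={schedule[-1]:.4f}")
```

The rate starts at `lr_max`, decreases slowly at first, falls fastest around the midpoint, and flattens out as it approaches `lr_min`.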
Figure 11 visualizes the bounding box regression loss, confidence loss, precision, and recall of the Wheat-FasterYOLO model on the validation set.
3.1. The Impact of Data Augmentation
In the data augmentation experiments, the baseline model was trained on both the original and the augmented GWHD datasets, yielding two models. As shown in Table 3, with FasterNet as the baseline, the non-augmented model achieved mAP and F1 scores of 84.91% and 81.17%, respectively. After augmentation, the scores improved to 85.66% and 81.97%, supporting the use of data augmentation for field-derived wheat ear images.
3.2. Comparative Experiments with Different Attention Integrations
Attention mechanisms typically enhance model performance. We assessed the effectiveness of various attention mechanisms when integrated into the FasterNet backbone feature extraction network on the global wheat ear detection dataset. The comparative experimental results in Table 4 illustrate the impact of the different attention mechanisms on the augmented dataset.
It can be observed that compared to the baseline model, the inclusion of SimAM [41] and CBAM [42] resulted in a slight improvement in model performance, with an increase of 0.45% and 0.15% in mAP values, and 0.31% and 0.66% in F1 values, respectively. In contrast, incorporating GAM [43] significantly boosted performance, with mAP and F1 values rising by 4.83% and 4.93%. However, SE [44] had no positive impact; instead, it led to a 0.48% and 0.46% decrease in mAP and F1 values, indicating its unsuitability for this model.
It is worth noting that BiFormer performed the best in the experiments, with mAP and F1 values reaching 91.21% and 87.71%, respectively, marking a significant improvement of 5.55% and 5.74% compared to the baseline model.
Figure 12 presents heatmaps for the various attention mechanisms, illustrating how precisely the model focuses on wheat ear targets. These findings reaffirm BiFormer’s strong performance in wheat ear detection.
3.3. Ablation Experiment
To validate the effectiveness of the wheat ear detection model improvement, we conducted ablation experiments, and the results are shown in Table 5. The experimental results demonstrate a significant enhancement in the model’s performance after the incorporation of BiFormer, with an increase of 5.55% in mAP and 5.74% in F1. BiFormer, with its unique sparsity and query-aware adaptability, can effectively model regions of interest across the feature maps globally.
By introducing the path aggregation network with the improved upsampling operator, the mAP and F1 of the wheat ear detection model improved by 2.37% and 2.65%, respectively. The path aggregation network helps the model better fuse multiscale features in the feature maps. After upsampling with the lightweight operator CARAFE, the model can capture the details and contextual information of wheat ears more effectively.
With the addition of the DCNv2 module, the mAP of the wheat ear detection model increased by 0.43%, and the F1 increased by 0.44%. Deformable convolution dynamically adjusts the sampling positions of the convolution kernels, responding more accurately to the deformation and spatial positional changes in wheat ear targets.
The mAP of the improved wheat ear detection model reached 94.01%, and the F1 score reached 90.8%. Compared to the baseline model before improvement, this is an increase of 8.35 percentage points in mAP and 8.83 percentage points in F1 score. The proposed improvements therefore each contribute to the wheat ear detection model and effectively enhance its performance. The detection results of wheat ears before and after the model improvement are shown in Figure 13.
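The ablation figures quoted in this subsection can be checked with simple arithmetic. The short sketch below accumulates the reported per-module gains on top of the augmented baseline (85.66% mAP, 81.97% F1, from Section 3.1). It assumes that the modules were added cumulatively in the order discussed above; the variable and function names are illustrative only.

```python
# Baseline scores after data augmentation, as reported in Section 3.1 (percent).
baseline = {"mAP": 85.66, "F1": 81.97}

# Reported gains per added module, in percentage points (Section 3.3).
gains = [
    ("BiFormer", {"mAP": 5.55, "F1": 5.74}),
    ("PAN with CARAFE upsampling", {"mAP": 2.37, "F1": 2.65}),
    ("DCNv2", {"mAP": 0.43, "F1": 0.44}),
]


def accumulate(start, steps):
    """Apply each step's gains in order and record the running scores."""
    current = dict(start)
    history = []
    for name, delta in steps:
        for metric, value in delta.items():
            current[metric] += value
        history.append((name, dict(current)))
    return history


history = accumulate(baseline, gains)
for name, scores in history:
    print(f"+ {name}: mAP={scores['mAP']:.2f}%  F1={scores['F1']:.2f}%")

final = history[-1][1]
total_gain = {m: final[m] - baseline[m] for m in baseline}
```

Under this assumption the running totals end at 94.01% mAP and 90.80% F1, and the summed gains are 8.35 and 8.83 percentage points, matching the values stated in the text.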
3.4. Comparative Experimental Analysis of Different Detection Models
To evaluate the performance of our proposed wheat ear detection model, we conducted a comparative analysis with popular object detection models. We used the same set of parameters and the same dataset, and each model was trained in the same experimental environment. The experimental findings in Table 6 show that the Wheat-FasterYOLO model introduced in this study performed strongly in terms of P, R, mAP, and F1, achieving 92.63%, 89.04%, 94.01%, and 90.8%, respectively. Furthermore, our model has fewer parameters and lower computational complexity, with only 1.34 × 10⁶ parameters and 3.9 GFLOPs. It was also faster, with a frame rate of 185 FPS.
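The F1 score is the harmonic mean of precision (P) and recall (R), so the reported F1 can be reproduced from the reported P and R. The following minimal sketch does this check; the function name `f1_score` is chosen for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (inputs in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for Wheat-FasterYOLO (percent).
f1 = f1_score(92.63, 89.04)
print(f"F1 = {f1:.2f}%")  # prints "F1 = 90.80%"
```

The result agrees with the 90.8% F1 reported for Wheat-FasterYOLO.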
Wheat-FasterYOLO significantly outperforms SSD-VGG [45,46], SSD-MobileNet [29], Faster R-CNN [47], and EfficientDet [48]. Faster R-CNN and EfficientDet reach only 30 and 21 FPS, respectively. Because adding a tracking algorithm requires further computational resources for Kalman filtering to estimate target motion, these two models are unsuitable for real-time wheat ear tracking. Compared to SSD-MobileNet, Wheat-FasterYOLO shows a slight increase of 0.9 GFLOPs, but it has only 37.93% of SSD-MobileNet’s parameters and runs 98 FPS faster, demonstrating that Wheat-FasterYOLO remains fast and lightweight despite the small increase in GFLOPs.
While Wheat-FasterYOLO has a slightly lower F1 score than YOLOX [49] and YOLOv7-Tiny [50], it surpasses all other models in terms of mAP. In other words, Wheat-FasterYOLO may not have the best precision and recall at a single confidence threshold, but it achieves better overall detection accuracy across a range of confidence thresholds. It also stands out for its efficiency, requiring only 16.71% of YOLOX’s parameters and 22.36% of YOLOv7-Tiny’s parameters, with computational demands at 18.06% of YOLOX’s and 30% of YOLOv7-Tiny’s. Combining the highest mAP with faster speed, Wheat-FasterYOLO demonstrates better overall performance, making it a suitable choice for real-time wheat ear tracking and counting tasks.
3.5. Comparative Experiments Incorporating Different Tracking Algorithms
For the real-time wheat ear tracking and counting task, the wheat ear detection model was combined with popular target-tracking algorithms. Through comparative experiments on different wheat varieties, evaluated with TrackEval [51], we assessed how models incorporating different tracking algorithms perform in practical wheat ear counting applications. The results of the experiment are shown in Table 7.
According to the experimental data, OC-SORT achieved the best performance, with an average HOTA of 60.52%, which is 7.04% higher than ByteTrack [52] and 13.05% higher than StrongSORT [53]. When working in conjunction with the wheat ear detector, OC-SORT had a slightly lower average FPS than ByteTrack. However, its average DetA, AssA, DetRe, and AssRe were 9.36%, 3.2%, 13.54%, and 5.65% higher than ByteTrack, respectively, indicating that it outperformed ByteTrack comprehensively. StrongSORT had a higher DetRe than OC-SORT in testing. However, its HOTA was significantly lower than that of OC-SORT, indicating that StrongSORT’s overall performance in real-time wheat ear counting was unsatisfactory. Additionally, because it introduces a feature re-identification network, StrongSORT consumed a large amount of computational resources, resulting in high latency, with an average FPS of only 20, making it unsuitable for practical wheat ear tracking and counting. A feature re-identification network is designed to recapture a similar-looking target and confirm whether it is the same target as one detected previously. However, when dealing with wheat ears with highly similar appearance features, the re-identification network in StrongSORT tends to misidentify different wheat ears as the same target, significantly affecting the counting results.
In summary, the Wheat-FasterYOLO model proposed in this paper, when integrated with the OC-SORT algorithm, achieved higher HOTA and overall performance compared to ByteTrack and StrongSORT. It achieved an average FPS of 92, meeting the requirements of real-time wheat ear tracking and counting.
Figure 14 shows the HOTA, DetA, AssA, DetRe, and AssRe curves of OC-SORT at different association accuracy threshold values “alpha”, reflecting the variations in scores of various metrics with the threshold “alpha”.
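For readers unfamiliar with HOTA, its standard definition combines detection and association accuracy at each localization threshold alpha as the geometric mean HOTA_alpha = sqrt(DetA_alpha × AssA_alpha), and the final score averages HOTA_alpha over alpha = 0.05, 0.10, ..., 0.95. The sketch below illustrates this relationship only; the DetA and AssA curves are hypothetical placeholder values, not the measurements shown in Figure 14, and the function name is illustrative.

```python
import math


def hota_from_curves(det_a, ass_a):
    """Combine per-threshold DetA and AssA values into HOTA.

    Returns (mean HOTA over thresholds, list of per-threshold HOTA values).
    """
    if not det_a or len(det_a) != len(ass_a):
        raise ValueError("det_a and ass_a must be non-empty and equally long")
    # Geometric mean of detection and association accuracy at each threshold.
    per_alpha = [math.sqrt(d * a) for d, a in zip(det_a, ass_a)]
    return sum(per_alpha) / len(per_alpha), per_alpha


# The 19 localization thresholds 0.05, 0.10, ..., 0.95.
alphas = [0.05 * k for k in range(1, 20)]

# Hypothetical curves that decline as the threshold becomes stricter.
det_a = [0.75 - 0.40 * a for a in alphas]
ass_a = [0.80 - 0.35 * a for a in alphas]

hota, per_alpha = hota_from_curves(det_a, ass_a)
print(f"HOTA = {hota:.4f} over {len(alphas)} thresholds")
```

Because HOTA is a geometric mean, a tracker cannot score well by excelling at detection alone or association alone, which is consistent with StrongSORT having a higher DetRe yet a lower HOTA than OC-SORT.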
3.6. Analysis of Counting Accuracy in the Wheat-FasterYOLO Model
When collecting data with a UAV, wheat ear targets are prone to temporary loss in the detector due to motion blur or occlusion. However, by integrating a target-tracking algorithm, as long as a wheat ear target is detected once in the video sequence, it can be assigned a unique ID and counted. In subsequent frames, if the detector redetects the lost target and the target has not undergone significant irregular motion or severe deformation, the tracking algorithm keeps its ID consistent. Wheat-FasterYOLO thus avoids counting the same target repeatedly in different video frames, as illustrated in Figure 15.
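The counting principle described above, one count per unique track ID, can be sketched as follows. This is a simplified illustration rather than the implementation used in this study: the per-frame ID lists are made-up inputs standing in for tracker output, and the `min_hits` parameter (ignore tracks seen in fewer than a given number of frames) is a hypothetical addition for filtering spurious tracks.

```python
def count_unique_tracks(frames, min_hits=1):
    """Count distinct track IDs across a video sequence.

    frames: iterable of per-frame collections of track IDs.
    min_hits: minimum number of frames a track must appear in to be counted
              (hypothetical filter, not taken from the paper).
    """
    hits = {}
    for ids in frames:
        # Use a set so a duplicated ID within one frame counts once.
        for track_id in set(ids):
            hits[track_id] = hits.get(track_id, 0) + 1
    return sum(1 for n in hits.values() if n >= min_hits)


# Made-up tracker output: ID 2 is lost in the second frame and recovered later.
frames = [
    [1, 2, 3],
    [1, 3],
    [1, 2, 3, 4],
    [2, 4, 5],
]

total = count_unique_tracks(frames)              # 5 distinct IDs
stable = count_unique_tracks(frames, min_hits=2)  # 4 (ID 5 appears only once)
print(total, stable)
```

Here the ear with ID 2 disappears for one frame but is counted only once, because the tracker returns the same ID when it is redetected. An ID switch, by contrast, would add a new ID and inflate the count.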
Table 8 shows the counting results for three different types of wheat ears. In the table, “IDs” represent the model’s counting results, “GT_IDs” represent the actual number of wheat ears, and “Counting accuracy” reflects the accuracy of the model in practical wheat ear counting tasks. As shown in Figure 16, a linear regression analysis was performed between the model count results and the actual quantities over a period of time. R² reflects the degree of agreement between the model count values and the actual values; the closer its value is to 1, the better the fit. RMSE represents the deviation between the value calculated by the model and the actual value. The analysis shows a strong correlation between the counting results of the proposed method and the manual counting results, indicating that our method is practical.
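The evaluation metrics named above can be computed as in the following sketch. The counts are hypothetical sample values, not data from Table 8 or Figure 16. The coefficient of determination is computed from an ordinary least-squares line fitted to model counts against manual counts, and the counting accuracy is assumed here to be 1 − |predicted − actual| / actual, since the exact formula is not stated in this section.

```python
import math


def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Coefficient of determination of the fitted regression line."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def rmse(predicted, actual):
    """Root mean square error between predicted and actual counts."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def counting_accuracy(predicted_total, actual_total):
    """Assumed definition: 1 - relative absolute counting error."""
    return 1.0 - abs(predicted_total - actual_total) / actual_total


# Hypothetical cumulative counts sampled at several points in a video.
manual = [100, 200, 300, 400]
model = [108, 210, 290, 415]

print(f"R^2  = {r_squared(manual, model):.4f}")
print(f"RMSE = {rmse(model, manual):.2f}")
print(f"Accuracy = {counting_accuracy(model[-1], manual[-1]):.2%}")
```

Note that a high R² only shows that model counts track manual counts linearly; RMSE and counting accuracy are needed to quantify how far the counts deviate in absolute terms.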
In the wheat ear counting experiment, the accuracy rates for Yangmai 17, Huanuo No.1, and Xumai 45 were 91.71%, 92.66%, and 91.28%, respectively, with an average accuracy rate of 91.88%. By analyzing the detection results, it was found that there were weeds in Yangmai 17 and Huanuo No.1 with heights similar to wheat ears, leading to the model mistakenly identifying weeds as wheat ears. Additionally, in windy conditions, when the wheat ears moved only slightly in the wind, the model was able to track the wheat ear targets well. However, when strong winds caused the wheat ears to sway significantly, the model had difficulty accurately capturing the same wheat ear target, resulting in the model incorrectly considering wheat ears that moved significantly before and after as different objects, ultimately leading to an overestimation of the detected wheat ear count.
In the case of Xumai 45 detection, there were no issues related to weeds with heights similar to wheat ears or interference from strong winds. However, due to the heavy overlap and occlusion of Xumai 45 wheat ears, the model erroneously identified overlapped wheat ears as a single target. Furthermore, the wheat ears of Xumai 45 had a relatively large aspect ratio, making it difficult for the model to fit the position information of the real bounding boxes. These factors led to instances of missed detections, resulting in a lower detected wheat ear count compared to the actual count.
Figure 17 shows a randomly selected frame from each of the three detection video sequences, illustrating the counting results of Wheat-FasterYOLO. The top-left corner displays the total number of distinct wheat ears detected by the model from the first frame to the currently selected frame. The information displayed above each detection box indicates the wheat ear’s ID value, category, and confidence level.
3.7. Advantages and Limitations
In this subsection, we will discuss the advantages and limitations of Wheat-FasterYOLO in detail as follows:
Firstly, we employed a combined model approach by introducing the OC-SORT algorithm based on the Kalman filter into the wheat ear detection model under study. This integration enables the model to accurately estimate the motion of wheat ear targets in UAV video sequences. By assigning a unique identification number (ID) to each wheat ear target, we achieved non-repetitive and high-precision counting. Farm owners only need to plan the drone’s flight path based on real-world conditions to automatically obtain the desired wheat ear count information for a better preliminary assessment and decision-making regarding their wheat fields.
Secondly, we recognized the critical importance of GPU resource allocation in our approach. While ensuring sufficient GPU resources for effective YOLO operation, reserving processing capacity for tracking is a key consideration. In extensive tests, we found that the tracking algorithm typically consumes fewer resources than YOLO. The fast and lightweight nature of Wheat-FasterYOLO allows it to operate on a variety of devices, reducing hardware costs and enabling real-time counting in diverse environments. However, for optimal results and to prevent processing delays during counting, we recommend using a GTX 1050 or higher GPU to ensure the quality of wheat ear detection and tracking in various scenarios.
Moreover, understanding the growth stages of wheat is crucial for making informed agricultural decisions. Wheat growth can be divided into six distinct phases: germination, vegetative growth, heading, flowering, grain filling, and maturation. Our model, trained on a diverse dataset, is capable of effectively counting wheat heads during the flowering stage and beyond. This feature provides valuable insights for farmers during the mid to late stages of wheat growth, contributing to improved crop management and planning.
Finally, although our model can perform real-time counting for different wheat varieties in general, there are limitations. In some cases, wheat ears may be empty, and since our model was not trained on samples of empty wheat ears, it cannot effectively handle this specific situation. To address this limitation, we plan to collect more samples in future research and continuously enhance our model to make it more versatile.
4. Conclusions
In this study, we utilized the path planning and constant-speed cruising functions of a UAV to automatically collect video sequences of wheat ears and achieved real-time tracking and counting of wheat ears in the field environment using the proposed Wheat-FasterYOLO. Compared to target detection and counting methods focused solely on static images, our approach avoids the complex operation, long collection time, and low efficiency of the data collection process. Compared to existing real-time wheat ear counting models, our approach has fewer parameters and a faster speed while still achieving good results. In practical applications, it significantly enhanced the level of automation in wheat ear counting.
For wheat ear detection, we trained the Wheat-FasterYOLO model on the GWHD dataset. Its mAP, F1 score, parameter count, GFLOPs, and FPS are 94.01%, 90.8%, 1.34 × 10⁶, 3.9, and 185, respectively. This model combines lightweight design with speed and accuracy, demonstrating better overall performance than many popular object detection models and showing great potential for wheat ear detection tasks.
For wheat ear tracking and counting tasks, this study integrated the Kalman filter-based object tracking algorithm OC-SORT with the wheat ear detection model. We collected video sequences of three different wheat varieties using the DJI Mavic 3 and annotated them frame by frame. In multi-object tracking tests, the average HOTA reached 60.52%, and the FPS was 92. In actual wheat ear counting scenarios, the average RMSE was 10.35, R² was 99.08%, and the counting accuracy was 91.88%. The lightweight design of Wheat-FasterYOLO makes it suitable for mobile edge terminals such as drones, allowing for rapid completion of wheat ear counting tasks in field environments and further advancing agricultural automation.
The effectiveness of the tracking algorithm is influenced by the detection model, as well as factors such as motion blur, image distortion generated during the drone flight, and the mutual occlusion of wheat ears, all of which can introduce certain interference into the counting results. In future research, we will continuously improve the quality of the wheat ear detection model and explore more stable counting methods to achieve efficient and accurate detection and counting of wheat ears in high-density wheat field scenarios. This will provide strong support for field management, grain trade, and agricultural production.