1. Introduction
As an important food crop, wheat provides an energy source for one-fifth of the global population [1]. Continuously increasing its yield per unit area and ensuring the food supply are issues of national importance [2]. The number of wheat heads is one of the important indicators for estimating yield [3]. Accurately identifying and counting wheat spikes and obtaining timely information on wheat growth conditions and yield not only support agricultural production and help farmers improve crop quality in the early stages but also contribute to national land policy and food price regulation [4]. However, wheat fields have complex backgrounds, and the targets are dense, with wheat spikes overlapping each other, so rapid and accurate detection of wheat spikes remains a huge challenge. Manually counting wheat spikes is susceptible to subjective factors; it is time-consuming, and it is difficult to ensure accuracy, especially over large areas of land [5].
Compared with laborious and time-consuming manual statistics, the emergence of machine learning technology greatly improves the efficiency of wheat head detection and counting. Alharbi et al. [6] combined Gabor filters with the K-means clustering algorithm to recognize and count wheat heads in the target area from wheat images captured by the Crop Quant platform. Zhou et al. [7] proposed a segmentation algorithm based on multiple optimized features and the twin support vector machine, which achieved good results in identifying wheat heads. Zhu et al. [8] presented a coarse-to-fine two-step wheat head detection mechanism, which first highlighted candidate wheat head regions and then removed non-wheat-head regions using higher-level features; the mechanism achieved good detection performance on the test set. However, traditional machine learning-based wheat head detection technology is susceptible to noise from lighting, angle, color, and soil, making it unable to accurately recognize adjoining wheat heads. It also lacks generalization ability, making it difficult to detect and count wheat heads in different scenarios.
With the development of computer vision and deep learning theory, researchers have begun to apply deep neural networks to wheat head object detection. Currently, mainstream object detection algorithms fall into two categories: two-stage and one-stage algorithms. Two-stage object detection algorithms are based on region proposals, such as R-FCN (Region-based Fully Convolutional Networks) [9] and the R-CNN (Region-CNN) series (including R-CNN [10], Fast R-CNN [11], Faster R-CNN [12], Mask R-CNN [13], etc.). Because they first generate region proposals and then classify and localize, two-stage algorithms have high accuracy, but their detection speed is often slow, and the primary goal at present is to improve their speed. One-stage object detection algorithms are based on region regression and directly extract features through convolutional neural networks (CNNs) to classify and locate targets. Representatives of this category include SSD (Single Shot MultiBox Detector) [14] and the YOLO (You Only Look Once) series (including YOLO [15], YOLO9000 [16], YOLOv3 [17], YOLOv4 [18], YOLOv5 [19], YOLOX [20], YOLOv6 [21], YOLOv7 [22], etc.). Because one-stage algorithms classify and localize directly, they have extremely fast detection speeds, and the primary goal at present is to enhance their accuracy.
The rise of deep learning methods provides a better technical platform for object detection, which can greatly reduce labor, time, and equipment costs [23]. He et al. [24] simplified the YOLOv4 network structure and used the K-means algorithm to re-cluster the anchor boxes, effectively solving the wheat head detection problem in natural scenes based on unmanned aerial platforms. Based on the YOLOv4 network structure, Gong et al. [25] optimized the double spatial pyramid pooling (SPP) network to enhance feature learning, enlarged the receptive field of the convolutional network, and improved its recognition accuracy and speed. Xu et al. [26] automatically segmented wheat head images and extracted wheat head contour features based on the K-means clustering algorithm, significantly improving the efficiency and accuracy of wheat head counting. Wang et al. [27] used multi-level CNNs (SSRNET) to segment wheat head images and achieve rapid estimation of wheat head numbers under field conditions. Zhao et al. [28] improved wheat head feature extraction by adding micro-scale detection layers and setting prior anchor boxes, optimized the YOLOv5 network structure, and increased the detection accuracy of wheat head images taken by unmanned aerial vehicles by 10.8%. Amirhossein et al. [29] applied automatic object enhancement technology to deep learning models and used the hybrid AutoOLA-DL model to improve wheat head counting performance. However, the training datasets used in the above research come from the same region, with relatively uniform wheat varieties, growth environments, and wheat head morphology, so the models have limited adaptability to a wide range of other datasets. David et al. [30] noted that the diversity and richness of wheat head data are crucial to model performance. Therefore, some researchers have used public datasets from different countries and regions to increase sample diversity and enhance model generalization. Bhagat et al. [31] proposed the WheatNet-Lite model to address overlapping and occluded wheat heads, achieving higher accuracy on the GWHD 2020 public dataset. Li et al. [32] combined transfer learning with the Faster R-CNN and RetinaNet models to compare and evaluate the recognition accuracy for wheat heads in different growth stages, with good results. Amirhossein et al. [9] developed a hybrid U-Net architecture, pre-trained the model on the ACID public wheat head image dataset, and used transfer learning to evaluate it on the GWHD 2020 public dataset, achieving significantly better wheat head localization and counting. Based on the RetinaNet network structure, Wen et al. [2] introduced a weighted bidirectional feature pyramid network (BiFPN) and fused multi-scale features to recognize wheat heads of different varieties in complex environments; this network showed good detection ability on the GWHD 2020 and WSD public datasets. However, mutual occlusion of wheat heads is still an important factor affecting detection accuracy, and attention mechanisms can address this problem effectively. Wang et al. [33] introduced the convolutional block attention module (CBAM) into the EfficientDet-D0 algorithm, effectively alleviating the wheat head occlusion problem and enhancing counting accuracy. Li et al. [34] integrated CBAM into the YOLOv5 network and combined it with the Mosaic-8 data augmentation method to improve the accuracy of wheat head recognition in complex large-field backgrounds with overlap and occlusion and to achieve accurate counting. Dong et al. [35] introduced a random polarized self-attention mechanism (SPSA) in both spatial and channel dimensions and combined it with a random unit to improve the detection of overlapping and occluded wheat heads. Liu et al. [36] proposed the YOLO-Extract algorithm based on YOLOv5, which enhances wheat head feature extraction by introducing a coordinate attention mechanism. Zhou et al. [37] presented a multi-window Swin Transformer network, which combines self-attention mechanisms and feature pyramid networks to extract multi-scale features and effectively improve the accuracy of detecting complex wheat heads in the field. However, there is still a trade-off between the speed and accuracy of wheat head detection, and the current research focus is on improving both simultaneously.
In summary, several studies have shown that models trained on a single dataset collected from a specific region lack generalizability and exhibit significantly reduced accuracy when applied to wheat head detection in other regions [38,39,40,41]. Additionally, the presence of overlapping and small-sized wheat heads in field conditions limits the accuracy of wheat head detection. In this study, we employ YOLOv7 as the baseline model and apply it to wheat head detection and counting. Furthermore, we improve the YOLOv7 model to enhance its accuracy and speed, making the refined model more suitable for wheat head detection in complex backgrounds. This supports automatic wheat head recognition and yield estimation in actual field settings.
The main contributions of this study are as follows:
- (1)
To address the challenges of overlapping and small-sized wheat heads in natural environments, a micro-scale detection layer and the convolutional block attention module are introduced into the YOLOv7-MA network, enhancing the model's ability to detect and count wheat heads;
- (2)
Mixup and an improved mosaic method are adopted in both data pre-processing and model training to improve the representational learning ability of the algorithm;
- (3)
Compared with the Faster R-CNN, YOLOv5, YOLOX, and YOLOv7 algorithms, YOLOv7-MA exhibits the best performance on both public datasets and field datasets from different growth stages. Furthermore, YOLOv7-MA demonstrates good robustness under the challenging conditions of low illumination, blur, and occlusion.
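To make the attention component of contribution (1) concrete, the following is a minimal PyTorch sketch of a CBAM-style module (channel attention followed by spatial attention, as in the original CBAM design). It illustrates the mechanism rather than reproducing the exact YOLOv7-MA implementation; reduction ratio and kernel size are illustrative defaults.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # shared MLP applied to both global-average- and global-max-pooled features
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise mean map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """CBAM: channel attention first, then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        x = x * self.sa(x)
        return x
```

In a detector, such a module is typically inserted after a backbone or neck block so that feature maps are re-weighted toward wheat head regions before prediction.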
The remaining sections of this paper are organized as follows: Section 2 presents the Materials and Methods, with a detailed explanation of the dataset and its processing procedures, the object detection algorithm, the model enhancement approach, and the accuracy evaluation metrics. Section 3 presents the results and their analysis. Section 4 discusses the results of this paper in comparison to existing findings in the field. Finally, Section 5 summarizes the conclusions of this study and proposes future research directions.
3. Results
The experimental platform used in this study is equipped with an Intel® Xeon® W-2145 CPU @ 3.70 GHz and an NVIDIA GeForce RTX 2080Ti graphics card, running the 64-bit Windows 10 operating system. The YOLOv7-MA deep learning network model was developed in Python based on PyTorch 1.2.0 and torchvision 0.4.0, with CUDA 10.0 and cuDNN 7.4.1.5, to achieve wheat head detection, and the proposed model was validated and compared against Faster-RCNN, YOLOv5, YOLOX, and YOLOv7 to verify its recognition accuracy.
Faster-RCNN [12] is a typical two-stage object detection algorithm that introduces a region proposal network (RPN) in place of time-consuming selective search to extract candidate regions, shares convolutional image features with the detection network, and achieves real-time object detection. The YOLO (You Only Look Once) series [15] merges the candidate selection and object recognition phases into one and uses predefined candidate areas to regress both the category and position of the target at once; it is a typical one-stage family with a high real-time detection rate. YOLOv5 [19], as the fifth generation of the YOLO series, makes some major improvements, such as using cross-stage partial (CSP) connections to speed network convergence and combining FPN and path aggregation network (PAN) modules to ensure accurate predictions for images of different sizes. It regresses the category and position of the target box directly in the output layer, leading to a high recognition speed. YOLOX [20] is an improved version of YOLOv5 that uses a decoupled head, an anchor-free design, and an advanced label-assignment strategy (SimOTA) to balance speed and accuracy better across all model sizes than other models in its class.
3.1. Detection Results and Analysis of Different Algorithms on the Global Wheat Head Dataset
To scientifically evaluate the detection performance of the proposed algorithm, 1000 images of the GWHD were taken as the testing set, and the remaining images were augmented to 30,000 images as the training set. Five models, including Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA, were trained and tested, and the results were compared, as shown in Table 2 and Figure 9.
Analysis of the results in Table 2 shows that the proposed YOLOv7-MA achieves good results in wheat head detection, with mAP@0.5, precision, recall, and F1-score of 93.86%, 93.60%, 88.67%, and 0.92, respectively, and a detection speed of 35.93 FPS. Compared with the two-stage object detection method Faster-RCNN, the mAP@0.5, precision, recall, and F1-score are improved by 14.36%, 16.06%, 5.5%, and 0.12, respectively, and the detection speed is improved by 20.77 FPS. Compared with the one-stage object detection algorithm YOLOv5, the mAP@0.5, precision, recall, and F1-score are improved by 3.92%, 3.18%, 4.60%, and 0.04, respectively. Compared with YOLOX, the mAP@0.5, precision, recall, and F1-score are improved by 3.18%, 1.81%, 3.45%, and 0.03, respectively. Compared with YOLOv7, the mAP@0.5, precision, recall, and F1-score are improved by 1.88%, 0.50%, 2.95%, and 0.02, respectively, and the detection speed is slightly improved by 0.09 FPS. Overall, compared with Faster-RCNN, YOLOv5, YOLOX, and YOLOv7, the proposed YOLOv7-MA model has superior accuracy in wheat head detection without sacrificing detection speed.
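The precision, recall, and F1-score values reported here follow from true/false positive counts obtained by matching predicted boxes to ground truth at an IoU threshold (0.5 for mAP@0.5). A minimal sketch of those computations (the box format and function names are our own, not from the paper's code):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0


def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from matched detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A prediction counts as a true positive when its `iou` with an unmatched ground-truth box exceeds 0.5; mAP@0.5 then averages precision over the precision-recall curve built from these matches, ranked by confidence.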
Figure 9 illustrates the precision-recall curves of the different algorithms for wheat head detection on the GWHD. As the recall value increases, the precision gradually shows a corresponding downward trend.
3.2. Detection Results and Analysis of Wheat Heads under Different Backgrounds
Under natural conditions, the environment of wheat fields can be complex: cluttered backgrounds, low illumination, and unstable photography equipment result in blurred or defocused images, leaves obscuring wheat heads, or overlapping heads. These conditions all reduce the accuracy of detection models. To test the detection efficiency of the proposed YOLOv7-MA model under complex backgrounds in the three conditions of low illumination, blur, and occlusion, 30 images per condition were selected randomly from the GWHD as the test set. The performance of the Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA detection models was then compared and analyzed. The number of wheat spikes recognized by the different models and the number counted manually were used to calculate RMSE, rRMSE, MAE, and R2. The results are shown in Figure 10 and Figure 11.
In Figure 10, the results for the 90 test images under the different complex conditions were analyzed. Under low illumination, the R2 between the prediction results of YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN and the manual counting results was 0.9895, 0.9833, 0.9768, 0.9786, and 0.8672, respectively; the predictions of the YOLOv7-MA algorithm had the strongest correlation with manual counting, surpassing the other models. In conditions where wheat spikes were blurred against the background, the R2 between the prediction results and manual counts was 0.9872, 0.9796, 0.9786, 0.9677, and 0.8914 for YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN, respectively; again, YOLOv7-MA had the strongest correlation with manual counting. In conditions where wheat spikes were overlapping or occluded by leaves, the R2 between the prediction results and manual counts was 0.9882, 0.9810, 0.9795, 0.9808, and 0.6605 for YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN, respectively, with YOLOv7-MA once more surpassing the other models. Overall, the proposed YOLOv7-MA algorithm was more stable than YOLOv7, YOLOX, YOLOv5, and Faster-RCNN and demonstrated good wheat head counting performance under low illumination, blur, and overlapping occlusion.
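The counting metrics used here can be computed directly from paired predicted and manual counts. The sketch below uses one common convention, with rRMSE as RMSE divided by the mean manual count and R2 as the coefficient of determination; the paper may use a slightly different formulation (e.g., the R2 of a fitted regression line), so treat this as illustrative.

```python
import math

def counting_metrics(pred, true):
    """RMSE, relative RMSE, MAE, and R^2 between predicted and manual counts."""
    n = len(true)
    errs = [p - t for p, t in zip(pred, true)]
    mean_t = sum(true) / n
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    rrmse = rmse / mean_t                    # RMSE relative to the mean count
    mae = sum(abs(e) for e in errs) / n
    ss_res = sum(e * e for e in errs)
    ss_tot = sum((t - mean_t) ** 2 for t in true)
    r2 = 1 - ss_res / ss_tot                 # coefficient of determination
    return rmse, rrmse, mae, r2
```

For example, a model that counts every image perfectly yields RMSE = rRMSE = MAE = 0 and R2 = 1.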
Figure 11 shows comparison examples of the counting results of the different algorithms against manual counting for some test images under the three complex conditions, where the rectangular boxes indicate recognized wheat spikes. Under low illumination, the predicted wheat head numbers of YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN were 35, 34, 34, 33, and 42, respectively, while the manual count was 35. YOLOv7-MA accurately recognized every wheat head in the image, while YOLOv7 missed one, YOLOX missed one, YOLOv5 missed two, and Faster-RCNN produced seven false positives; the prediction of YOLOv7-MA was the closest to the manual count. In conditions where wheat spikes were blurred against the background, the predicted numbers were 67, 68, 64, 64, and 62, respectively, while the manual count was 66. YOLOv7-MA had one false positive, YOLOv7 over-counted by two, YOLOX missed two, YOLOv5 missed two, and Faster-RCNN missed four; the prediction of YOLOv7-MA was the closest to the manual count. In conditions where wheat spikes were overlapping or occluded by leaves, the predicted numbers were 47, 48, 45, 45, and 39, respectively, while the manual count was 47. YOLOv7-MA accurately recognized every spike, while YOLOv7 over-counted by one, YOLOX missed two, YOLOv5 missed two, and Faster-RCNN missed eight; the prediction of YOLOv7-MA matched the manual count exactly. Overall, the proposed YOLOv7-MA algorithm demonstrated better counting performance than YOLOv7, YOLOX, YOLOv5, and Faster-RCNN.
3.3. Ablation Experiments with YOLOv7-MA
To investigate the significance of the modifications made to the YOLOv7 model, ablation experiments were conducted on the public GWHD dataset. Using YOLOv7 as the baseline model, three optimization schemes were tested: data augmentation, adding a micro-scale detection layer, and adding the mixed attention mechanism CBAM. After each modification, the models were trained and tested with 300 epochs, 1000 iterations per epoch, a batch size of 8, an initial learning rate of 0.01, and the stochastic gradient descent (SGD) optimizer. The results are presented in Table 3 and Figure 11.
The baseline model YOLOv7 was trained on the original public GWHD dataset, with 1000 images reserved for testing and the rest for training. In this case, the mAP@0.5 was 91.63%, and the detection speed was 35.81 FPS. Next, the same 1000 reserved images were used, without augmentation, as the test set, while the training set was subjected to data augmentation operations such as random cropping, splicing, and rotation, generating a total of 30,000 images to train the baseline YOLOv7. In this case, the mAP@0.5 was 91.98%, an improvement of 0.35% over the non-augmented data, and the detection speed was 35.84 FPS, an improvement of 0.03 FPS. Adding the CBAM module to the YOLOv7 network and training with the augmented dataset resulted in an mAP@0.5 of 92.80%, an improvement of 1.17% over the original network, and a detection speed of 35.94 FPS, an improvement of 0.13 FPS. Adding the micro-scale detection layer to the YOLOv7 network and training with the augmented dataset resulted in an mAP@0.5 of 92.83%, an improvement of 1.20% over the original network, but the detection speed decreased by 0.05 FPS. Finally, adding both the CBAM module and the micro-scale detection layer to the YOLOv7 network resulted in an mAP@0.5 of 93.86%, an improvement of 2.23% over the baseline YOLOv7 network, and the detection speed increased by 0.12 FPS.
The ablation experiments showed the following: ① when both the micro-scale detection layer and CBAM were added, all performance indicators were better than those of the original network; ② increasing the number of training samples had a positive effect on both the accuracy and speed of the model; ③ adding the micro-scale detection layer yielded a 0.03% higher mAP@0.5 than adding CBAM and improved the detection accuracy of small wheat spikes, but with a slight loss of detection speed; ④ the CBAM mixed attention mechanism better focused on the characteristics of wheat spikes while suppressing irrelevant information, which led to an increase in both accuracy and speed.
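The augmentation step above includes mosaic-style splicing, which must also remap the bounding-box labels onto the composite image. A simplified sketch of the box-remapping half of a four-image mosaic follows; the placement scheme, canvas size, and clipping threshold are illustrative assumptions, not the paper's exact improved-mosaic procedure (which also rescales images).

```python
import random

def mosaic_boxes(boxes_per_image, sizes, canvas=1024):
    """Remap boxes from four images onto one mosaic canvas.

    boxes_per_image: four lists of (x1, y1, x2, y2) boxes.
    sizes: four (w, h) image sizes. Each image is placed in one quadrant
    around a randomly chosen centre point; its boxes are shifted by the
    placement offset and clipped to the canvas."""
    cx = random.randint(canvas // 4, 3 * canvas // 4)
    cy = random.randint(canvas // 4, 3 * canvas // 4)
    # top-left corner of each quadrant placement
    offsets = [(cx - sizes[0][0], cy - sizes[0][1]),  # top-left image
               (cx, cy - sizes[1][1]),                # top-right image
               (cx - sizes[2][0], cy),                # bottom-left image
               (cx, cy)]                              # bottom-right image
    out = []
    for boxes, (ox, oy) in zip(boxes_per_image, offsets):
        for x1, y1, x2, y2 in boxes:
            nx1 = min(max(x1 + ox, 0), canvas)
            ny1 = min(max(y1 + oy, 0), canvas)
            nx2 = min(max(x2 + ox, 0), canvas)
            ny2 = min(max(y2 + oy, 0), canvas)
            if nx2 - nx1 > 2 and ny2 - ny1 > 2:  # drop boxes clipped away
                out.append((nx1, ny1, nx2, ny2))
    return out
```

Mixup, by contrast, blends two whole images with a Beta-distributed weight and keeps both label sets; both augmentations increase the density and variety of wheat heads seen per training batch.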
3.4. Detection Results and Analysis of Wheat Heads under Transfer Learning
Transfer learning [51] is the process of transferring feature information from a source domain to a target domain to obtain better object detection results in the target domain. Although the GWHD contains a large number of wheat heads of different types grown in different environments, even wheat heads of the same variety have various shapes. Therefore, this study transferred the weights trained on the public GWHD to field-collected wheat head images from different growth periods for training. The recognition performance of YOLOv7-MA on wheat heads in different growth stages was compared to verify the model's adaptability and robustness.
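In practice, this kind of weight transfer can be done by copying only the parameters whose names and shapes match, then fine-tuning on the target-domain images. A hedged sketch (the function name and optional freezing policy are our own, not from the paper's code):

```python
import torch

def load_pretrained(model, checkpoint_path, freeze_backbone=False):
    """Initialise a detector from weights pre-trained on a source domain.

    Parameters whose names or shapes do not match the target model
    (e.g. a re-sized detection head) are skipped, so the transferable
    backbone features are kept intact."""
    state = torch.load(checkpoint_path, map_location="cpu")
    own = model.state_dict()
    transferable = {k: v for k, v in state.items()
                    if k in own and v.shape == own[k].shape}
    own.update(transferable)
    model.load_state_dict(own)
    if freeze_backbone:
        # optionally fine-tune only the heads, keeping source features fixed
        for name, p in model.named_parameters():
            if name.startswith("backbone"):
                p.requires_grad = False
    return model
```

Fine-tuning then proceeds as normal training, usually with a smaller learning rate than training from scratch.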
The experiment used 100 original images collected in the filling and maturity stages as the test sets, and the training sets were augmented to 1000 images each, followed by training and accuracy evaluation of Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA. The results are presented in Table 4.
Analysis of Table 4 shows slight differences among YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN in the wheat head recognition indicators for the different growth stages under transfer learning. Each model performs better on the mature wheat head dataset than on the filling-stage dataset. In particular, YOLOv7-MA outperformed the other models on all measures during both the filling and maturity stages, exhibiting strong adaptability and generalization ability. The YOLOv7-MA model achieved mAP@0.5 values exceeding 93% for both the filling and maturity stages, with F1-scores exceeding 0.9. The mAP@0.5 for the maturity stage was slightly higher than that for the filling stage by 0.26%, the precision was higher by 1.29%, and the recall was lower by 0.54%. The detection speed for the maturity stage was higher than that for the filling stage by 0.63 FPS. Therefore, the YOLOv7-MA model performs best on the field-collected wheat head dataset, demonstrating strong adaptability and generalization, with superior recognition performance on mature wheat head images compared with the filling stage.
Subsequently, based on its performance, the YOLOv7-MA model was selected, and 100 wheat head images from the filling and maturity stages were taken as test samples. The numbers of wheat spikes obtained by the proposed YOLOv7-MA model and by manual counting were used to calculate the RMSE, rRMSE, MAE, and R2 for the different wheat growth stages, and the results are shown in Figure 12 and Figure 13.
Analyzing Figure 12 and Figure 13, the R2 values in the filling and maturity stages were 0.9155 and 0.9632, respectively, indicating a stronger correlation between the YOLOv7-MA model's predictions and manual counting for the maturity stage. Figure 13 presents a comparison between the predicted wheat head number and the manual count for the test samples in the filling and maturity stages using the YOLOv7-MA algorithm. When the number of wheat spikes in the image was between 90 and 100, the counting accuracy of YOLOv7-MA was comparable for both stages, missing one wheat head in each case. When the number of spikes was between 100 and 110, YOLOv7-MA missed one wheat head in the maturity stage and falsely detected two spikes in the filling stage. When the number of spikes was between 110 and 130, YOLOv7-MA missed six spikes in the maturity stage and three spikes in the filling stage. Combining Figure 12 and Figure 13, wheat head texture features in the filling stage are less evident than those in the maturity stage; the leaves are also plumper in the filling stage, which may cause them to be misidentified as wheat spikes. Therefore, YOLOv7-MA has superior wheat head counting performance in the maturity stage compared with the filling stage.
4. Discussion
This paper proposed the YOLOv7-MA algorithm and trained it on the publicly available GWHD with data augmentation. The training dataset was large and diverse, and the proposed algorithm demonstrated good performance in terms of mAP@0.5, precision, recall, and F1-score. Meanwhile, there was a strong correlation between the predicted wheat head count and the manual counting results.
To better evaluate the wheat head recognition performance of the YOLOv7-MA algorithm, comparative experiments were conducted with other object detection algorithms. In this paper, Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA were trained on the GWHD 2021 after data augmentation. Many scholars, such as Bhagat [31] and Amirhossein [29], have shown that the sample diversity of the GWHD can better verify a model's adaptability and overcome dataset limitations. The test results indicated that YOLOv7-MA outperformed the other algorithms in terms of mAP@0.5, precision, recall, and F1-score. Although the detection speed of YOLOv7-MA was slightly slower than that of YOLOv5 and YOLOX, it improved by 0.09 FPS compared with the original YOLOv7 model. This is because the baseline model improved in this study, YOLOv7, proposed by Wang [22], added deeper convolutional layers to improve accuracy, and the increased computation had some impact on detection speed; however, it still met the needs of object detection tasks. YOLOv5, proposed by Zhou [19], and YOLOX, proposed by Ge [20], are improved versions of the YOLO series with faster detection speeds, but their detection accuracy is the target of further improvement. Furthermore, the counting performance of YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN was compared. In the field, wheat backgrounds are complex: poor illumination, mutual overlapping and occlusion between heads and between heads and leaves, and blurring caused by shooting all affect the accuracy of wheat head recognition. Dong [35], Liu [36], and other researchers have all noted that the overlapping occlusion of wheat heads makes detection difficult. In this study, these three types of images were selected as the test set to compare the correlation between the wheat head counts predicted by YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN and the manual counting results. The R2 between the wheat head numbers predicted by YOLOv7-MA and the manual counts under low illumination, blur, and overlapping occlusion was 0.9895, 0.9872, and 0.9882, respectively, indicating that the predictions of YOLOv7-MA under complex background conditions were highly consistent with the manual counts, with a stronger correlation than the other object detection algorithms compared.
Ablation experiments were performed on the publicly available GWHD using the YOLOv7-MA algorithm. The ablation study evaluated the impact of three optimization methods, namely data augmentation, adding a micro-scale detection layer, and incorporating CBAM, on wheat head recognition accuracy, using YOLOv7 as the baseline model. The results showed that adding the micro-scale detection layer contributed more to detection accuracy than incorporating CBAM but resulted in a slight decrease in detection speed due to the increased computation required to detect small wheat heads. Incorporating CBAM improved detection accuracy and speed by focusing attention on the target wheat heads and reducing the influence of complex backgrounds during feature learning; similar results were obtained by Wang [33], Li [34], and other researchers when implementing CBAM in other networks. Data augmentation played a minor role in enhancing detection accuracy and speed by increasing the diversity of training samples, thereby improving the model's ability to learn wheat head features and reducing the misclassification of complex samples. Li [34] also improved detection accuracy by augmenting the training dataset using Mosaic-8. The best detection performance was achieved by combining data augmentation with both the micro-scale detection layer and the CBAM module. Therefore, the three proposed optimization techniques effectively improve wheat head recognition accuracy without sacrificing detection speed.
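Conceptually, the micro-scale detection layer adds one more, finer-grained prediction scale on top of the detector's usual heads. The sketch below shows a generic multi-scale head with an extra stride-4 level; the channel widths, strides, and output count are illustrative assumptions rather than the actual YOLOv7-MA configuration.

```python
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """Per-scale prediction convolutions over feature maps of several strides.

    Adding a stride-4 "micro-scale" map (P2) gives the detector a finer
    prediction grid, which helps with small wheat heads at the cost of
    extra computation on the largest feature map."""
    def __init__(self, channels=(64, 128, 256, 512), num_outputs=6):
        super().__init__()
        # one 1x1 prediction conv per scale: P2 (stride 4) ... P5 (stride 32)
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_outputs, kernel_size=1) for c in channels)

    def forward(self, feats):
        # feats: feature maps ordered finest (largest) first
        return [head(f) for head, f in zip(self.heads, feats)]
```

For a 640×640 input, the four levels predict on 160×160, 80×80, 40×40, and 20×20 grids; the 160×160 level is the added micro-scale layer.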
The YOLOv7-MA algorithm was used in combination with transfer learning to apply the weights trained on the GWHD to field-collected wheat head images from different growth stages. In both the maturity and filling stages, YOLOv7-MA achieved an mAP@0.5 greater than 93% and a detection speed higher than 33 FPS, with slightly better detection accuracy and speed in the maturity stage. Comparing the predictions of YOLOv7-MA with manual counts in the different growth stages showed an R2 of 0.9155 in the filling stage and 0.9632 in the maturity stage. Consequently, YOLOv7-MA exhibited stronger transfer learning ability and more stable wheat head recognition performance in the maturity stage. This is mainly because wheat heads have different morphological characteristics in different growth stages: the features of mature wheat heads are more stable, complete, and distinctive, and the contrast between wheat heads and the background (such as stems and leaves) is more pronounced. This makes wheat head recognition easier in the maturity stage, which is consistent with the research results of Zhou [37], Li [32], and others. Transfer learning can improve the robustness of models and make them more suitable for real-world scenarios. However, the field-collected wheat head images in this study were limited, with only 1000 images even after augmentation, resulting in slightly lower recognition performance for YOLOv7-MA than on the large training set of the public dataset.
Multiple studies have shown that models trained on single-sample datasets collected in the field have significant limitations and experience a marked decrease in detection accuracy when applied to other wheat head datasets [39,40]. In this study, the model was pre-trained on the Global Wheat Head Dataset 2021, which contains wheat heads of different varieties with different morphologies in different periods, to address the poor generalization caused by a single sample set. Furthermore, the small size of wheat heads and frequent occlusion pose challenges for wheat head detection models [38,41]. These factors limit the amount of wheat head feature information that the detection model can acquire, constraining its detection accuracy. To address these challenges, the convolutional block attention module and a micro-scale detection layer were introduced into the YOLOv7-MA network [33,34]. This modification improved the model's ability to learn representative features of wheat heads, thus enhancing detection accuracy. However, high-precision recognition often requires complex, deep model structures. These structures inevitably involve a larger number of parameters, which impose certain requirements on computer hardware, and the training process takes longer. Therefore, the next research direction is to explore how to simplify the model as much as possible while maintaining accuracy.
YOLOv7 was chosen as the baseline model for this study due to its higher detection accuracy and speed compared with previous versions; it supports high-resolution images and multiple object types, and it is flexible and easy to deploy. Of course, several challenges were encountered during the research. The initial attempt was to train the model directly on wheat head images collected from the various growth stages, consisting of 500 images per stage for both maturity and filling. Regrettably, the achieved accuracy was considerably low, with an mAP@0.5 of only 18% and 22% for the respective stages. After consulting various references [43,52], it was evident that the scale and diversity of the dataset are crucial for deep learning and that the training samples employed were far from sufficient. Therefore, the Global Wheat Head Dataset 2021, with its rich scenarios and large amount of data, was utilized, and in conjunction with data augmentation strategies, the training samples were efficiently expanded [53,54]. Ultimately, the trained model exhibited superior performance across various testing conditions. It is undeniable that large-scale training datasets and complex high-precision network structures increase the time and cost of training, which is an aspect to improve in the next steps.
Overall, the proposed YOLOv7-MA model has shown excellent performance in both detection accuracy and speed, enabling fast and accurate detection and counting of wheat heads that meets practical needs. In the future, the proposed model can be deployed on unmanned aerial vehicles equipped with a high-definition 4K RGB camera to capture snapshots of wheat heads during low-altitude flight and estimate the yield of wheat fields on a large scale.