1. Introduction
As an important food crop, wheat provides an energy source for one-fifth of the global population [1]. Continuously increasing its yield per unit area and ensuring the food supply are issues of national importance [2]. The number of wheat heads is one of the important indicators for estimating yield [3]. Accurately identifying and counting wheat spikes and obtaining timely information on wheat growth conditions and yield not only support agricultural production and help farmers improve crop quality in the early stages but also contribute to national land policy and food price regulation [4]. However, wheat fields have complex backgrounds, and the targets are dense, with wheat spikes overlapping each other, so rapid and accurate detection of wheat spikes remains a huge challenge. Manually counting wheat spikes is susceptible to subjective factors; it is time-consuming, and it is difficult to ensure accuracy, especially over large areas of land [5].
Compared with laborious and time-consuming manual statistics, the emergence of machine learning technology greatly improves the efficiency of wheat head detection and counting. Alharbi et al. [6] combined Gabor filters with the K-means clustering algorithm to recognize and count wheat heads in the target area from wheat images captured by the Crop Quant platform. Zhou et al. [7] proposed a segmentation algorithm based on multiple optimized features and the twin support vector machine, which achieved good results in identifying wheat heads. Zhu et al. [8] presented a coarse-to-fine two-step wheat head detection mechanism, which first highlighted candidate wheat head regions and then removed non-wheat-head regions using higher-level features; the mechanism achieved good detection performance on the test set. However, traditional machine learning-based wheat head detection technology is susceptible to noise from lighting, angle, color, and soil, making it unable to accurately recognize adjoining wheat heads. It also lacks generalization ability, making it difficult to detect and count wheat heads in different scenarios.
With the development of computer vision and deep learning theory, researchers have begun to apply deep neural networks to wheat head object detection. Currently, mainstream object detection algorithms fall into two categories: two-stage and one-stage algorithms. Two-stage object detection algorithms are based on region proposals, such as R-FCN (Region-based Fully Convolutional Networks) [9] and the R-CNN (Region-CNN) series (including R-CNN [10], Fast R-CNN [11], Faster R-CNN [12], Mask R-CNN [13], etc.). Because they first generate region proposals and then classify and localize, two-stage algorithms have high accuracy, but their detection speed is often slow, and the primary goal at present is to improve their speed. One-stage object detection algorithms are based on region regression and directly extract features through convolutional neural networks (CNNs) to classify and locate targets. Representatives of this category include SSD (Single Shot MultiBox Detector) [14] and the YOLO (You Only Look Once) series (including YOLO [15], YOLO9000 [16], YOLOv3 [17], YOLOv4 [18], YOLOv5 [19], YOLOX [20], YOLOv6 [21], YOLOv7 [22], etc.). Because one-stage algorithms classify and localize directly, they have extremely fast detection speeds, and the primary goal at present is to enhance their accuracy.
The rise of deep learning methods provides a better technical platform for object detection, which can greatly reduce labor, time, and equipment costs [23]. He et al. [24] simplified the YOLOv4 network structure and used the K-means algorithm to re-cluster the anchor boxes, effectively solving the wheat head detection problem in natural scenes based on unmanned aerial platforms. Based on the YOLOv4 network structure, Gong et al. [25] optimized the double spatial pyramid pooling (SPP) network to enhance feature learning, enlarged the receptive field of the convolutional network, and improved its recognition accuracy and speed. Xu et al. [26] automatically segmented wheat head images and extracted wheat head contour features based on the K-means clustering algorithm, significantly improving the efficiency and accuracy of wheat head counting. Wang et al. [27] used multi-level CNNs (SSRNET) to segment wheat head images and achieve rapid estimation of wheat head numbers under field conditions. Zhao et al. [28] improved wheat head feature extraction by adding micro-scale detection layers and setting prior anchor boxes, optimized the YOLOv5 network structure, and increased the detection accuracy of wheat head images taken by unmanned aerial vehicles by 10.8%. Amirhossein et al. [29] applied automatic object enhancement technology to deep learning models and used the hybrid AutoOLA-DL model to improve wheat head counting performance. However, the training datasets used in the above research come from the same region, with relatively uniform wheat varieties, growth environments, and wheat head morphology, so the models have limited adaptability to a wide range of other datasets. David et al. [30] noted that the diversity and richness of wheat head data are crucial to model performance. Therefore, some researchers have used public datasets from different countries and regions to increase sample diversity and enhance model generalization. Bhagat et al. [31] proposed the WheatNet-Lite model to address overlapping and occluded wheat heads, achieving higher accuracy on the GWHD 2020 public dataset. Li et al. [32] combined transfer learning with the Faster R-CNN and RetinaNet models to compare and evaluate the recognition accuracy for wheat heads in different growth stages, with good results. Amirhossein et al. [9] developed a hybrid U-Net architecture, pre-trained the model on the ACID public wheat head image dataset, and used transfer learning to evaluate it on the GWHD 2020 public dataset, achieving significantly better wheat head localization and counting. Based on the RetinaNet network structure, Wen et al. [2] introduced a weighted bidirectional feature pyramid network (BiFPN) and fused multi-scale features to recognize wheat heads of different varieties in complex environments; this network showed good detection ability on the GWHD 2020 and WSD public datasets. However, mutual occlusion of wheat heads is still an important factor affecting detection accuracy, and attention mechanisms can address this problem effectively. Wang et al. [33] introduced the convolutional block attention module (CBAM) into the EfficientDet-D0 algorithm, effectively alleviating the wheat head occlusion problem and enhancing counting accuracy. Li et al. [34] integrated CBAM into the YOLOv5 network and combined it with the Mosaic-8 data augmentation method to improve the accuracy of wheat head recognition in complex large-field backgrounds with overlap and occlusion and to achieve accurate counting. Dong et al. [35] introduced a random polarized self-attention mechanism (SPSA) in both spatial and channel dimensions and combined it with a random unit to improve the detection of overlapping and occluded wheat heads. Liu et al. [36] proposed the YOLO-Extract algorithm based on YOLOv5, which enhances wheat head feature extraction by introducing a coordinate attention mechanism. Zhou et al. [37] presented a multi-window Swin Transformer network, which combines self-attention mechanisms and feature pyramid networks to extract multi-scale features and effectively improve the accuracy of detecting complex wheat heads in the field. However, there is still a trade-off between the speed and accuracy of wheat head detection, and the current research focus is on improving both simultaneously.
In summary, several studies have shown that models trained on a single dataset collected from a specific region lack generalizability and exhibit significantly reduced accuracy when applied to wheat head detection in other regions [38,39,40,41]. Additionally, the presence of overlapping and small-sized wheat heads in field conditions limits the accuracy of wheat head detection. In this study, we employ YOLOv7 as the baseline model and apply it to wheat head detection and counting. Furthermore, we improve the YOLOv7 model to enhance its accuracy and speed, making the refined model more suitable for wheat head detection in complex backgrounds. This supports automatic wheat head recognition and yield estimation in actual field settings.
The main contributions of this study are as follows:
- (1)
To address the challenges of overlapping and small-sized wheat heads in natural environments, a micro-scale detection layer and the convolutional block attention module are introduced into the YOLOv7-MA network, enhancing the model's ability to detect and count wheat heads;
- (2)
Mixup and an improved mosaic method are adopted in both data pre-processing and model training to improve the representational learning ability of the algorithm;
- (3)
Compared with the Faster R-CNN, YOLOv5, YOLOX, and YOLOv7 algorithms, YOLOv7-MA exhibits the best performance on both public datasets and field datasets from different growth stages. Furthermore, YOLOv7-MA demonstrates good robustness under the challenging conditions of low illumination, blur, and occlusion.
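To make the attention component of contribution (1) concrete, the following is a minimal PyTorch sketch of a CBAM-style module (channel attention followed by spatial attention, as in the original CBAM design). It illustrates the mechanism rather than reproducing the exact YOLOv7-MA implementation; reduction ratio and kernel size are illustrative defaults.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # shared MLP applied to both global-average- and global-max-pooled features
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise mean map
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """CBAM: channel attention first, then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)
        x = x * self.sa(x)
        return x
```

In a detector, such a module is typically inserted after a backbone or neck block so that feature maps are re-weighted toward wheat head regions before prediction.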
The remaining sections of this paper are organized as follows: Section 2 presents the Materials and Methods, with a detailed explanation of the dataset and its processing procedures, the object detection algorithm, the model enhancement approach, and the accuracy evaluation metrics. Section 3 presents the results and their analysis. Section 4 discusses the results of this paper in comparison to existing findings in the field. Finally, Section 5 summarizes the conclusions of this study and proposes future research directions.
3. Results
The experimental platform used in this study is equipped with an Intel® Xeon® W-2145 CPU @ 3.70 GHz and an NVIDIA GeForce RTX 2080Ti graphics card, running the 64-bit Windows 10 operating system. The YOLOv7-MA deep learning network model was developed in Python based on PyTorch 1.2.0 and torchvision 0.4.0, with CUDA 10.0 and cuDNN 7.4.1.5, to achieve wheat head detection, and the proposed model was validated and compared against Faster-RCNN, YOLOv5, YOLOX, and YOLOv7 to verify its recognition accuracy.
Faster-RCNN [12] is a typical two-stage object detection algorithm that introduces a region proposal network (RPN) in place of time-consuming selective search to extract candidate regions, shares convolutional image features with the detection network, and achieves real-time object detection. The YOLO (You Only Look Once) series [15] merges the candidate selection and object recognition phases into one and uses predefined candidate areas to regress both the category and position of the target at once; it is a typical one-stage family with a high real-time detection rate. YOLOv5 [19], as the fifth generation of the YOLO series, makes some major improvements, such as using cross-stage partial (CSP) connections to speed network convergence and combining FPN and path aggregation network (PAN) modules to ensure accurate predictions for images of different sizes. It regresses the category and position of the target box directly in the output layer, leading to a high recognition speed. YOLOX [20] is an improved version of YOLOv5 that uses a decoupled head, an anchor-free design, and an advanced label-assignment strategy (SimOTA) to balance speed and accuracy better across all model sizes than other models in its class.
3.1. Detection Results and Analysis of Different Algorithms on the Global Wheat Head Dataset
To scientifically evaluate the detection performance of the proposed algorithm, 1000 images of the GWHD were taken as the testing set, and the remaining images were augmented to 30,000 images as the training set. Five models, including Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA, were trained and tested, and the results were compared, as shown in Table 2 and Figure 9.
Analysis of the results in Table 2 shows that the proposed YOLOv7-MA achieves good results in wheat head detection, with mAP@0.5, precision, recall, and F1-score of 93.86%, 93.60%, 88.67%, and 0.92, respectively, and a detection speed of 35.93 FPS. Compared with the two-stage object detection method Faster-RCNN, the mAP@0.5, precision, recall, and F1-score are improved by 14.36%, 16.06%, 5.5%, and 0.12, respectively, and the detection speed is improved by 20.77 FPS. Compared with the one-stage object detection algorithm YOLOv5, the mAP@0.5, precision, recall, and F1-score are improved by 3.92%, 3.18%, 4.60%, and 0.04, respectively. Compared with YOLOX, the mAP@0.5, precision, recall, and F1-score are improved by 3.18%, 1.81%, 3.45%, and 0.03, respectively. Compared with YOLOv7, the mAP@0.5, precision, recall, and F1-score are improved by 1.88%, 0.50%, 2.95%, and 0.02, respectively, and the detection speed is slightly improved by 0.09 FPS. Overall, compared with Faster-RCNN, YOLOv5, YOLOX, and YOLOv7, the proposed YOLOv7-MA model has superior accuracy in wheat head detection without sacrificing detection speed.
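The precision, recall, and F1-score values reported here follow from true/false positive counts obtained by matching predicted boxes to ground truth at an IoU threshold (0.5 for mAP@0.5). A minimal sketch of those computations (the box format and function names are our own, not from the paper's code):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0


def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from matched detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A prediction counts as a true positive when its `iou` with an unmatched ground-truth box exceeds 0.5; mAP@0.5 then averages precision over the precision-recall curve built from these matches, ranked by confidence.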
Figure 9 illustrates the precision-recall curves of the different algorithms for wheat head detection on the GWHD. As the recall value increases, the precision gradually shows a corresponding downward trend.
3.2. Detection Results and Analysis of Wheat Heads under Different Backgrounds
Under natural conditions, the environment of wheat fields can be complex: cluttered backgrounds, low illumination, and unstable photography equipment result in blurred or defocused images, leaves obscuring wheat heads, or overlapping heads. These conditions all reduce the accuracy of detection models. To test the detection efficiency of the proposed YOLOv7-MA model under complex backgrounds in the three conditions of low illumination, blur, and occlusion, 30 images per condition were selected randomly from the GWHD as the test set. The performance of the Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA detection models was then compared and analyzed. The number of wheat spikes recognized by the different models and the number counted manually were used to calculate RMSE, rRMSE, MAE, and R2. The results are shown in Figure 10 and Figure 11.
In Figure 10, the results for the 90 test images under the different complex conditions were analyzed. Under low illumination, the R2 between the prediction results of YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN and the manual counting results was 0.9895, 0.9833, 0.9768, 0.9786, and 0.8672, respectively; the predictions of the YOLOv7-MA algorithm had the strongest correlation with manual counting, surpassing the other models. In conditions where wheat spikes were blurred against the background, the R2 between the prediction results and manual counts was 0.9872, 0.9796, 0.9786, 0.9677, and 0.8914 for YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN, respectively; again, YOLOv7-MA had the strongest correlation with manual counting. In conditions where wheat spikes were overlapping or occluded by leaves, the R2 between the prediction results and manual counts was 0.9882, 0.9810, 0.9795, 0.9808, and 0.6605 for YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN, respectively, with YOLOv7-MA once more surpassing the other models. Overall, the proposed YOLOv7-MA algorithm was more stable than YOLOv7, YOLOX, YOLOv5, and Faster-RCNN and demonstrated good wheat head counting performance under low illumination, blur, and overlapping occlusion.
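The counting metrics used here can be computed directly from paired predicted and manual counts. The sketch below uses one common convention, with rRMSE as RMSE divided by the mean manual count and R2 as the coefficient of determination; the paper may use a slightly different formulation (e.g., the R2 of a fitted regression line), so treat this as illustrative.

```python
import math

def counting_metrics(pred, true):
    """RMSE, relative RMSE, MAE, and R^2 between predicted and manual counts."""
    n = len(true)
    errs = [p - t for p, t in zip(pred, true)]
    mean_t = sum(true) / n
    rmse = math.sqrt(sum(e * e for e in errs) / n)
    rrmse = rmse / mean_t                    # RMSE relative to the mean count
    mae = sum(abs(e) for e in errs) / n
    ss_res = sum(e * e for e in errs)
    ss_tot = sum((t - mean_t) ** 2 for t in true)
    r2 = 1 - ss_res / ss_tot                 # coefficient of determination
    return rmse, rrmse, mae, r2
```

For example, a model that counts every image perfectly yields RMSE = rRMSE = MAE = 0 and R2 = 1.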
Figure 11 shows comparison examples of the counting results of the different algorithms against manual counting for some test images under the three complex conditions, where the rectangular boxes indicate recognized wheat spikes. Under low illumination, the predicted wheat head numbers of YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN were 35, 34, 34, 33, and 42, respectively, while the manual count was 35. YOLOv7-MA accurately recognized every wheat head in the image, while YOLOv7 missed one, YOLOX missed one, YOLOv5 missed two, and Faster-RCNN produced seven false positives; the prediction of YOLOv7-MA was the closest to the manual count. In conditions where wheat spikes were blurred against the background, the predicted numbers were 67, 68, 64, 64, and 62, respectively, while the manual count was 66. YOLOv7-MA had one false positive, YOLOv7 over-counted by two, YOLOX missed two, YOLOv5 missed two, and Faster-RCNN missed four; the prediction of YOLOv7-MA was the closest to the manual count. In conditions where wheat spikes were overlapping or occluded by leaves, the predicted numbers were 47, 48, 45, 45, and 39, respectively, while the manual count was 47. YOLOv7-MA accurately recognized every spike, while YOLOv7 over-counted by one, YOLOX missed two, YOLOv5 missed two, and Faster-RCNN missed eight; the prediction of YOLOv7-MA matched the manual count exactly. Overall, the proposed YOLOv7-MA algorithm demonstrated better counting performance than YOLOv7, YOLOX, YOLOv5, and Faster-RCNN.
3.3. Ablation Experiments with YOLOv7-MA
To investigate the significance of the modifications made to the YOLOv7 model, ablation experiments were conducted on the public GWHD dataset. Using YOLOv7 as the baseline model, three optimization schemes were tested: data augmentation, adding a micro-scale detection layer, and adding the mixed attention mechanism CBAM. After each modification, the models were trained and tested with 300 epochs, 1000 iterations per epoch, a batch size of 8, an initial learning rate of 0.01, and the stochastic gradient descent (SGD) optimizer. The results are presented in Table 3 and Figure 11.
The baseline model YOLOv7 was trained on the original public GWHD dataset, with 1000 images reserved for testing and the rest for training. In this case, the mAP@0.5 was 91.63%, and the detection speed was 35.81 FPS. Next, the same 1000 reserved images were used, without augmentation, as the test set, while the training set was subjected to data augmentation operations such as random cropping, splicing, and rotation, generating a total of 30,000 images to train the baseline YOLOv7. In this case, the mAP@0.5 was 91.98%, an improvement of 0.35% over the non-augmented data, and the detection speed was 35.84 FPS, an improvement of 0.03 FPS. Adding the CBAM module to the YOLOv7 network and training with the augmented dataset resulted in an mAP@0.5 of 92.80%, an improvement of 1.17% over the original network, and a detection speed of 35.94 FPS, an improvement of 0.13 FPS. Adding the micro-scale detection layer to the YOLOv7 network and training with the augmented dataset resulted in an mAP@0.5 of 92.83%, an improvement of 1.20% over the original network, but the detection speed decreased by 0.05 FPS. Finally, adding both the CBAM module and the micro-scale detection layer to the YOLOv7 network resulted in an mAP@0.5 of 93.86%, an improvement of 2.23% over the baseline YOLOv7 network, and the detection speed increased by 0.12 FPS.
The ablation experiments showed the following: ① when both the micro-scale detection layer and CBAM were added, all performance indicators were better than those of the original network; ② increasing the number of training samples had a positive effect on both the accuracy and speed of the model; ③ adding the micro-scale detection layer yielded a 0.03% higher mAP@0.5 than adding CBAM and improved the detection accuracy of small wheat spikes, but with a slight loss of detection speed; ④ the CBAM mixed attention mechanism better focused on the characteristics of wheat spikes while suppressing irrelevant information, which led to an increase in both accuracy and speed.
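The augmentation step above includes mosaic-style splicing, which must also remap the bounding-box labels onto the composite image. A simplified sketch of the box-remapping half of a four-image mosaic follows; the placement scheme, canvas size, and clipping threshold are illustrative assumptions, not the paper's exact improved-mosaic procedure (which also rescales images).

```python
import random

def mosaic_boxes(boxes_per_image, sizes, canvas=1024):
    """Remap boxes from four images onto one mosaic canvas.

    boxes_per_image: four lists of (x1, y1, x2, y2) boxes.
    sizes: four (w, h) image sizes. Each image is placed in one quadrant
    around a randomly chosen centre point; its boxes are shifted by the
    placement offset and clipped to the canvas."""
    cx = random.randint(canvas // 4, 3 * canvas // 4)
    cy = random.randint(canvas // 4, 3 * canvas // 4)
    # top-left corner of each quadrant placement
    offsets = [(cx - sizes[0][0], cy - sizes[0][1]),  # top-left image
               (cx, cy - sizes[1][1]),                # top-right image
               (cx - sizes[2][0], cy),                # bottom-left image
               (cx, cy)]                              # bottom-right image
    out = []
    for boxes, (ox, oy) in zip(boxes_per_image, offsets):
        for x1, y1, x2, y2 in boxes:
            nx1 = min(max(x1 + ox, 0), canvas)
            ny1 = min(max(y1 + oy, 0), canvas)
            nx2 = min(max(x2 + ox, 0), canvas)
            ny2 = min(max(y2 + oy, 0), canvas)
            if nx2 - nx1 > 2 and ny2 - ny1 > 2:  # drop boxes clipped away
                out.append((nx1, ny1, nx2, ny2))
    return out
```

Mixup, by contrast, blends two whole images with a Beta-distributed weight and keeps both label sets; both augmentations increase the density and variety of wheat heads seen per training batch.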
3.4. Detection Results and Analysis of Wheat Heads under Transfer Learning
Transfer learning [51] is the process of transferring feature information from a source domain to a target domain to obtain better object detection results in the target domain. Although the GWHD contains a large number of wheat heads of different types grown in different environments, even wheat heads of the same variety have various shapes. Therefore, this study transferred the weights trained on the public GWHD to field-collected wheat head images from different growth periods for training. The recognition performance of YOLOv7-MA on wheat heads in different growth stages was compared to verify the model's adaptability and robustness.
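In practice, this kind of weight transfer can be done by copying only the parameters whose names and shapes match, then fine-tuning on the target-domain images. A hedged sketch (the function name and optional freezing policy are our own, not from the paper's code):

```python
import torch

def load_pretrained(model, checkpoint_path, freeze_backbone=False):
    """Initialise a detector from weights pre-trained on a source domain.

    Parameters whose names or shapes do not match the target model
    (e.g. a re-sized detection head) are skipped, so the transferable
    backbone features are kept intact."""
    state = torch.load(checkpoint_path, map_location="cpu")
    own = model.state_dict()
    transferable = {k: v for k, v in state.items()
                    if k in own and v.shape == own[k].shape}
    own.update(transferable)
    model.load_state_dict(own)
    if freeze_backbone:
        # optionally fine-tune only the heads, keeping source features fixed
        for name, p in model.named_parameters():
            if name.startswith("backbone"):
                p.requires_grad = False
    return model
```

Fine-tuning then proceeds as normal training, usually with a smaller learning rate than training from scratch.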
The experiment used 100 original images collected in the filling and maturity stages as the test sets, and the training sets were augmented to 1000 images each, followed by training and accuracy evaluation of Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA. The results are presented in Table 4.
Analysis of Table 4 shows slight differences among YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN in the wheat head recognition indicators for the different growth stages under transfer learning. Each model performs better on the mature wheat head dataset than on the filling-stage dataset. In particular, YOLOv7-MA outperformed the other models on all measures during both the filling and maturity stages, exhibiting strong adaptability and generalization ability. The YOLOv7-MA model achieved mAP@0.5 values exceeding 93% for both the filling and maturity stages, with F1-scores exceeding 0.9. The mAP@0.5 for the maturity stage was slightly higher than that for the filling stage by 0.26%, the precision was higher by 1.29%, and the recall was lower by 0.54%. The detection speed for the maturity stage was higher than that for the filling stage by 0.63 FPS. Therefore, the YOLOv7-MA model performs best on the field-collected wheat head dataset, demonstrating strong adaptability and generalization, with superior recognition performance on mature wheat head images compared with the filling stage.
Subsequently, based on its performance, the YOLOv7-MA model was selected, and 100 wheat head images from the filling and maturity stages were taken as test samples. The numbers of wheat spikes obtained by the proposed YOLOv7-MA model and by manual counting were used to calculate the RMSE, rRMSE, MAE, and R2 for the different wheat growth stages, and the results are shown in Figure 12 and Figure 13.
Analyzing Figure 12 and Figure 13, the R2 values in the filling and maturity stages were 0.9155 and 0.9632, respectively, indicating a stronger correlation between the YOLOv7-MA model's predictions and manual counting for the maturity stage. Figure 13 presents a comparison between the predicted wheat head number and the manual count for the test samples in the filling and maturity stages using the YOLOv7-MA algorithm. When the number of wheat spikes in the image was between 90 and 100, the counting accuracy of YOLOv7-MA was comparable for both stages, missing one wheat head in each case. When the number of spikes was between 100 and 110, YOLOv7-MA missed one wheat head in the maturity stage and falsely detected two spikes in the filling stage. When the number of spikes was between 110 and 130, YOLOv7-MA missed six spikes in the maturity stage and three spikes in the filling stage. Combining Figure 12 and Figure 13, wheat head texture features in the filling stage are less evident than those in the maturity stage; the leaves are also plumper in the filling stage, which may cause them to be misidentified as wheat spikes. Therefore, YOLOv7-MA has superior wheat head counting performance in the maturity stage compared with the filling stage.
4. Discussion
This paper proposed the YOLOv7-MA algorithm and trained it on the publicly available GWHD with data augmentation. The training dataset was large and diverse, and the proposed algorithm demonstrated good performance in terms of mAP@0.5, precision, recall, and F1-score. Meanwhile, there was a strong correlation between the predicted wheat head count and the manual counting results.
To better evaluate the wheat head recognition performance of the YOLOv7-MA algorithm, comparative experiments were conducted with other object detection algorithms. In this paper, Faster-RCNN, YOLOv5, YOLOX, YOLOv7, and YOLOv7-MA were trained on the GWHD 2021 after data augmentation. Many scholars, such as Bhagat [31] and Amirhossein [29], have shown that the sample diversity of the GWHD can better verify a model's adaptability and overcome dataset limitations. The test results indicated that YOLOv7-MA outperformed the other algorithms in terms of mAP@0.5, precision, recall, and F1-score. Although the detection speed of YOLOv7-MA was slightly slower than that of YOLOv5 and YOLOX, it improved by 0.09 FPS compared with the original YOLOv7 model. This is because the baseline model improved in this study, YOLOv7, proposed by Wang [22], added deeper convolutional layers to improve accuracy, and the increased computation had some impact on detection speed; however, it still met the needs of object detection tasks. YOLOv5, proposed by Zhou [19], and YOLOX, proposed by Ge [20], are improved versions of the YOLO series with faster detection speeds, but their detection accuracy is the target of further improvement. Furthermore, the counting performance of YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN was compared. In the field, wheat backgrounds are complex: poor illumination, mutual overlapping and occlusion between heads and between heads and leaves, and blurring caused by shooting all affect the accuracy of wheat head recognition. Dong [35], Liu [36], and other researchers have all noted that the overlapping occlusion of wheat heads makes detection difficult. In this study, these three types of images were selected as the test set to compare the correlation between the wheat head counts predicted by YOLOv7-MA, YOLOv7, YOLOX, YOLOv5, and Faster-RCNN and the manual counting results. The R2 between the wheat head numbers predicted by YOLOv7-MA and the manual counts under low illumination, blur, and overlapping occlusion was 0.9895, 0.9872, and 0.9882, respectively, indicating that the predictions of YOLOv7-MA under complex background conditions were highly consistent with the manual counts, with a stronger correlation than the other object detection algorithms compared.
Ablation experiments were performed on the publicly available GWHD using the YOLOv7-MA algorithm. The ablation study evaluated the impact of three optimization methods, namely data augmentation, adding a micro-scale detection layer, and incorporating CBAM, on wheat head recognition accuracy, using YOLOv7 as the baseline model. The results showed that adding the micro-scale detection layer contributed more to detection accuracy than incorporating CBAM but resulted in a slight decrease in detection speed due to the increased computation required to detect small wheat heads. Incorporating CBAM improved detection accuracy and speed by focusing attention on the target wheat heads and reducing the influence of complex backgrounds during feature learning; similar results were obtained by Wang [33], Li [34], and other researchers when implementing CBAM in other networks. Data augmentation played a minor role in enhancing detection accuracy and speed by increasing the diversity of training samples, thereby improving the model's ability to learn wheat head features and reducing the misclassification of complex samples. Li [34] also improved detection accuracy by augmenting the training dataset using Mosaic-8. The best detection performance was achieved by combining data augmentation with both the micro-scale detection layer and the CBAM module. Therefore, the three proposed optimization techniques effectively improve wheat head recognition accuracy without sacrificing detection speed.
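Conceptually, the micro-scale detection layer adds one more, finer-grained prediction scale on top of the detector's usual heads. The sketch below shows a generic multi-scale head with an extra stride-4 level; the channel widths, strides, and output count are illustrative assumptions rather than the actual YOLOv7-MA configuration.

```python
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """Per-scale prediction convolutions over feature maps of several strides.

    Adding a stride-4 "micro-scale" map (P2) gives the detector a finer
    prediction grid, which helps with small wheat heads at the cost of
    extra computation on the largest feature map."""
    def __init__(self, channels=(64, 128, 256, 512), num_outputs=6):
        super().__init__()
        # one 1x1 prediction conv per scale: P2 (stride 4) ... P5 (stride 32)
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_outputs, kernel_size=1) for c in channels)

    def forward(self, feats):
        # feats: feature maps ordered finest (largest) first
        return [head(f) for head, f in zip(self.heads, feats)]
```

For a 640×640 input, the four levels predict on 160×160, 80×80, 40×40, and 20×20 grids; the 160×160 level is the added micro-scale layer.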
The YOLOv7-MA algorithm was used in combination with transfer learning to apply the weights trained on the GWHD to field-collected wheat head images from different growth stages. In both the maturity and filling stages, YOLOv7-MA achieved an mAP@0.5 greater than 93% and a detection speed higher than 33 FPS, with slightly better detection accuracy and speed in the maturity stage. Comparing the predictions of YOLOv7-MA with manual counts in the different growth stages showed an R2 of 0.9155 in the filling stage and 0.9632 in the maturity stage. Consequently, YOLOv7-MA exhibited stronger transfer learning ability and more stable wheat head recognition performance in the maturity stage. This is mainly because wheat heads have different morphological characteristics in different growth stages: the features of mature wheat heads are more stable, complete, and distinctive, and the contrast between wheat heads and the background (such as stems and leaves) is more pronounced. This makes wheat head recognition easier in the maturity stage, which is consistent with the research results of Zhou [37], Li [32], and others. Transfer learning can improve the robustness of models and make them more suitable for real-world scenarios. However, the field-collected wheat head images in this study were limited, with only 1000 images even after augmentation, resulting in slightly lower recognition performance for YOLOv7-MA than on the large training set of the public dataset.
Multiple studies have shown that models trained on single-sample datasets collected in the field have significant limitations and experience a marked decrease in detection accuracy when applied to other wheat head datasets [39,40]. In this study, the model was pre-trained on the Global Wheat Head Dataset 2021, which contains wheat heads of different varieties with different morphologies in different periods, to address the poor generalization caused by a single sample set. Furthermore, the small size of wheat heads and frequent occlusion pose challenges for wheat head detection models [38,41]. These factors limit the amount of wheat head feature information that the detection model can acquire, constraining its detection accuracy. To address these challenges, the convolutional block attention module and a micro-scale detection layer were introduced into the YOLOv7-MA network [33,34]. This modification improved the model's ability to learn representative features of wheat heads, thus enhancing detection accuracy. However, high-precision recognition often requires complex, deep model structures. These structures inevitably involve a larger number of parameters, which impose certain requirements on computer hardware, and the training process takes longer. Therefore, the next research direction is to explore how to simplify the model as much as possible while maintaining accuracy.
YOLOv7 was chosen as the baseline model for this study due to its higher detection accuracy and speed compared with previous versions; it supports high-resolution images and multiple object types, and it is flexible and easy to deploy. Of course, several challenges were encountered during the research. The initial attempt was to train the model directly on wheat head images collected from the various growth stages, consisting of 500 images per stage for both maturity and filling. Regrettably, the achieved accuracy was considerably low, with an mAP@0.5 of only 18% and 22% for the respective stages. After consulting various references [43,52], it was evident that the scale and diversity of the dataset are crucial for deep learning and that the training samples employed were far from sufficient. Therefore, the Global Wheat Head Dataset 2021, with its rich scenarios and large amount of data, was utilized, and in conjunction with data augmentation strategies, the training samples were efficiently expanded [53,54]. Ultimately, the trained model exhibited superior performance across various testing conditions. It is undeniable that large-scale training datasets and complex high-precision network structures increase the time and cost of training, which is an aspect to improve in the next steps.
Overall, the proposed YOLOv7-MA model has shown excellent performance in both detection accuracy and speed, enabling fast and accurate detection and counting of wheat heads that meets practical needs. In the future, the proposed model can be deployed on unmanned aerial vehicles equipped with a high-definition 4K RGB camera to capture snapshots of wheat heads during low-altitude flight and estimate the yield of wheat fields on a large scale.