1. Introduction
Wheat is an important food crop globally and is consumed by almost one-third of the world's population. Wheat yield forecasting has become an essential part of agricultural production because it provides a necessary reference for field management and agricultural decision making [1]. Therefore, the accurate identification and counting of wheat ears are essential for monitoring crop growth, estimating wheat yield, and analysing plant phenotypic characteristics.
The number of wheat ears is mainly estimated by manual field estimation, capacity (volumetric) prediction, prediction based on annual scenarios [2], and forecasting based on remote sensing images [3]. Manual field judgements are mainly empirical, have low accuracy, and are labour-intensive. Volumetric methods are costly and inefficient for measuring wheat density. Remote sensing relies on satellite images, which are taken from a great distance and are therefore only suitable for large-scale processing and analysis, resulting in low accuracy of wheat prediction. Meanwhile, predictions based on multiple linear regression are heavily influenced by variables such as precipitation, which makes their accuracy difficult to guarantee and renders them unsuitable for field yield estimation. Traditional image processing techniques often use moving window methods [4] or superpixel segmentation [5] to divide the image into sub-images, extract colour or texture features from each sub-image, and then train a classifier that identifies the wheat ears and completes the count. Alternatively, they highlight the wheat ears through image processing operations, such as binarising the image and locating the ears after removing adhesions [6]. In contrast, visual sensors can acquire rich texture and colour information at a lower cost. However, the colour and texture characteristics of wheat affect the detection accuracy. Therefore, detecting and counting wheat ears in the natural environment remains a significant challenge.
In addition, the adhesion and occlusion between wheat ears severely limit the accuracy of wheat ear identification and counting. Several scholars have attempted to detect adhering objects using segmentation techniques such as morphology [7], concave point matching [8], and watershed algorithms [9]. However, the morphological variation of wheat ears in images is considerable, so morphology-based segmentation cannot reliably identify wheat ears in the adhesion zone, resulting in missed and false detections. The concave point matching algorithm cannot detect objects with sharp edges. The edges of wheat ears are not smooth: their burr-like edges are fine and delicate and are particularly prone to blending in with complex backgrounds, which makes it difficult to obtain smooth wheat ear contours even after the binary images undergo a series of erosion and dilation operations. The watershed algorithm requires the calculation of local extrema, and the detailed texture of wheat ears produces many of them, so applying a watershed algorithm to wheat ears leads to over-segmentation. Thus, accurately counting mutually occluded wheat ears remains an urgent problem.
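For readers unfamiliar with the erosion and dilation (expansion) operations mentioned above, the following is a minimal sketch in plain Python, assuming a 3x3 square structuring element and a toy binary mask. The function names `erode`, `dilate`, and `opening` are illustrative and do not come from any of the cited works. It shows how an opening (erosion followed by dilation) strips a thin protrusion from a blob; real wheat ear masks are far noisier, so this illustrates the operation only, not its performance on field images.

```python
def _morph(mask, reduce_fn):
    """Apply a 3x3 min/max filter to a binary mask (list of lists of 0/1).

    Windows are clipped at the image border (an assumption of this sketch).
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                mask[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            out[y][x] = reduce_fn(window)
    return out


def erode(mask):
    """A pixel survives only if its whole 3x3 neighbourhood is foreground."""
    return _morph(mask, min)


def dilate(mask):
    """A pixel becomes foreground if any 3x3 neighbour is foreground."""
    return _morph(mask, max)


def opening(mask):
    """Erosion followed by dilation: removes thin protrusions."""
    return dilate(erode(mask))


# Toy mask: a 3x3 blob with a two-pixel "burr" sticking out to the right.
toy_mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
smoothed = opening(toy_mask)
```

On this toy mask the opening returns the 3x3 blob without the burr. Fine structures that are thinner than the structuring element are removed by the erosion step and are not restored by the dilation step.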
With the development of image processing techniques, previous studies [10] built classifiers for wheat ear detection using machine learning methods, which enabled wheat ear detection and counting. Xu et al. used the k-means algorithm to segment wheat ears and thereby recognise them [11]. Although these methods recognise wheat ears based on machine learning, most of them still require a priori knowledge to manually design image features, which leads to insufficient recognition accuracy in field environments with disturbances such as uneven lighting and complex backgrounds. At the same time, traditional machine learning models generalise poorly, which makes it difficult to detect and count wheat ears across different scenarios.
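To make the dependence on hand-chosen features concrete, the following is a minimal sketch of k-means clustering applied to a single feature (pixel intensity). It is not the pipeline of Xu et al.; the function name `kmeans_1d`, the deterministic initialisation, and the sample intensities are all assumptions made for illustration.

```python
def kmeans_1d(values, k=2, max_iters=20):
    """Cluster scalar values into k groups (k >= 2) with Lloyd's algorithm.

    Centroids are initialised evenly between min and max so the result
    is deterministic (an assumption of this sketch).
    """
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iters):
        # Assignment step: each value goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(v - centroids[c]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical grey levels: dark background pixels and bright ear pixels.
pixels = [12, 15, 10, 200, 210, 190, 14, 205]
labels, centroids = kmeans_1d(pixels, k=2)
```

Here the two clusters separate cleanly because the chosen feature happens to distinguish foreground from background. Under uneven lighting the same intensity can belong to either class, which is the limitation of manually designed features noted above.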
In recent years, deep learning has achieved impressive results in many fields. Significant progress has been made in target detection, one of the core problems in computer vision. Target detection uses image processing, deep learning, and other techniques to determine whether an image or video contains target objects and, if so, to locate and classify them. Deep learning-based target detection algorithms are mainly divided into two categories. One is the two-stage detection algorithm represented by R-CNN (regions with CNN features) [12] and Fast R-CNN [13], which first generates many candidate regions based on feature extraction and then classifies and regresses them. The other is the single-stage detection algorithm represented by SSD (Single Shot MultiBox Detector) [14] and the YOLO (You Only Look Once) series, which performs classification and regression while generating candidate boxes. The specific problem of wheat ear detection is currently being investigated nationally and internationally. For example, Hasan et al. [15] used R-CNN, and Madec et al. [16] used Faster R-CNN to detect wheat ears. Single-stage target detection algorithms such as YOLO have also been used to count wheat ears [17]. Xiong et al. used context-enhanced local regression networks to detect and count wheat ears [18]. Promising results have also been achieved using convolutional neural networks (CNNs) in detecting and counting wheat ears [19,20,21]. Convolutional neural networks are a class of feedforward neural networks that include convolutional computation and have a deep structure, and they are one of the representative algorithms of deep learning. However, the core of the CNN-based approach to detection is the region proposal method, in which a sliding window or extracted proposal is first selected to train the network, and classification is then performed within the region proposal. The limitation of this approach is that background regions are often mistaken for targets. Wheat images captured in a field environment contain many distracting factors, such as high plant density, multiple overlaps, uneven lighting, and complex backgrounds. Therefore, there are still problems worth exploring and solving in deep learning-based wheat detection.
As a representative framework for single-stage detection, the YOLO family has spawned many versions. YOLO is a high-performance general-purpose target detection model. YOLO v1 [22] uses a single-stage detection algorithm to handle both locating and classifying the target object. Subsequently, YOLO v2 [23] improved on v1 in three areas: more accurate predictions, faster speed, and more recognisable object categories. YOLO v3 [24] advanced object detection by introducing multi-scale prediction, backbone network optimisation, and loss function improvements. YOLO v4 [25] presented an efficient and fast object detection model that significantly reduced the number of parameters and computations, making it easier to deploy on general-purpose hardware devices. Compared with YOLO v4, YOLO v5 has a smaller and more flexible structure and faster image inference, and it is better suited to practical production use.
However, the actual environment in which wheat is grown is complex. The main problems are (1) severe object occlusion, with wheat ears obscuring and overlapping each other; (2) dense objects, with clusters of wheat ears that are challenging to distinguish; (3) small target objects, with each target taking up a small proportion of the whole image; and (4) complex backgrounds, which increase the difficulty of extracting target features. These practical factors undoubtedly affect the detection accuracy of wheat. We also found that YOLO v5 suffers from insufficiently accurate bounding box localisation and has difficulty distinguishing between overlapping objects, especially heavily occluded objects such as wheat ears. An attention mechanism can effectively mitigate these problems. When processing information, the attention module resembles the human visual attention mechanism: it scans the global image to find the target area that needs to be focused on and then devotes more attention resources to this area to obtain more detailed information about the target while filtering out secondary information, which improves the model's performance. In this study, the Convolutional Block Attention Module (CBAM) [26] was integrated into the convolutional module of YOLO v5 to learn target features in the channel dimension and location features in the global spatial dimension. One of the difficulties of small target detection is that small targets, which occupy few pixels, may be lost during multi-level convolution operations. To solve this problem, we propose adding a 4-fold downsampling layer to the original YOLO v5 feature pyramid to enrich the semantic information of small targets and thus make the model's predictions more accurate. To the best of our knowledge, there are few reports on the detection and counting of wheat ears by YOLO v5 models improved in this way.
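The channel-attention branch of CBAM mentioned above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in this study: it covers only the channel branch (the spatial branch, which applies a convolution over channel-wise average and maximum maps, is omitted), the feature map is a nested list of shape C x H x W, and the weights `w1` and `w2` of the shared two-layer perceptron are hand-picked here, whereas in a real network they are learned. All function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU; w1 is hidden x C, w2 is C x hidden."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(fmap, w1, w2):
    """Return one weight in (0, 1) per channel of a C x H x W feature map.

    Follows sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))) with a shared MLP.
    """
    avg_pool = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    max_pool = [max(max(row) for row in ch) for ch in fmap]
    from_avg = shared_mlp(avg_pool, w1, w2)
    from_max = shared_mlp(max_pool, w1, w2)
    return [sigmoid(a + m) for a, m in zip(from_avg, from_max)]


def apply_channel_attention(fmap, weights):
    """Rescale every channel by its attention weight."""
    return [[[v * wt for v in row] for row in ch] for ch, wt in zip(fmap, weights)]


# Hypothetical 2-channel, 2x2 feature map: channel 0 is flat, channel 1 is active.
feature_map = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 2.0], [3.0, 4.0]],
]
w1 = [[1.0, 1.0]]        # hand-picked, hidden size 1
w2 = [[-1.0], [1.0]]     # hand-picked
weights = channel_attention(feature_map, w1, w2)
refined = apply_channel_attention(feature_map, weights)
```

With these hand-picked weights the active channel receives a weight close to 1 and the flat channel a weight close to 0, which is the sense in which attention devotes more resources to informative features and filters out secondary ones.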
Therefore, this study proposes a novel and simple method based on YOLO v5 to detect and count wheat ears. Firstly, the real wheat images were pre-processed according to their quality, and on top of the image enhancement, we proposed an 8-Mosaic data augmentation method inspired by the 4-Mosaic of the original YOLO v5 model. At the same time, augmentation operations such as brightness adjustment to varying degrees, contrast enhancement by different factors, and random multi-angle rotation were applied to the dataset to greatly enrich the samples available for wheat ear recognition in complex backgrounds. Next, the colour and texture features of the wheat ear images were extracted, and the parameters for the subsequent training were defined. An improved YOLO v5 neural network model was then built in PyTorch. The main improvements were: (1) suppressing background interference by using a dual (channel and spatial) attention mechanism, and (2) enriching the semantic information of small targets by adding a 4-fold downsampling layer and improving the feature pyramid structure, which improves the robustness of the model. The GWHDD dataset was then divided into a training set and a test set at a 9:1 ratio, and input and output matrices were created for training and testing. Finally, the loss function was changed to GIoU [27] to speed up the convergence of the model according to the actual recognition situation. In China, there have been few studies involving deep learning for wheat counting so far. The network model proposed in this paper provides a new idea and direction for improving the accuracy of wheat counting and will facilitate the rapid development of sustainable, green, and automated smart agriculture.
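As a reference for the GIoU criterion mentioned above, the following minimal sketch computes the generalised intersection over union of two axis-aligned boxes and the corresponding loss. It assumes boxes given as `(x1, y1, x2, y2)` with positive area; the function names `giou` and `giou_loss` are illustrative, and the sketch is not the training code of this study.

```python
def giou(box_a, box_b):
    """Generalised IoU of two boxes (x1, y1, x2, y2); result lies in (-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    # Penalise the part of the enclosing box covered by neither input.
    return iou - (enclose - union) / enclose


def giou_loss(pred_box, true_box):
    """Loss that is 0 for a perfect match and grows as the boxes separate."""
    return 1.0 - giou(pred_box, true_box)
```

Unlike plain IoU, which is zero for every pair of non-overlapping boxes, GIoU keeps decreasing as the boxes move apart, so the loss still provides a useful signal when a predicted box misses its target entirely.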
The remainder of the paper is organised as follows:
Section 2 presents the proposed method for processing wheat ear images and the specification of the improved process.
Section 3 presents the experiments and results.
Section 4 offers a discussion of the proposed method.
Section 5 concludes the paper.
5. Conclusions
Wheat counting is an important research area in agricultural yield estimation. Building on the original YOLO v5 algorithm, this paper made improvements in five aspects: Mosaic data augmentation, feature extraction, loss function, target box regression, and attention mechanism, which effectively improved the detection accuracy of the YOLO v5 network model for small target objects. Finally, a neural network model for wheat ear image recognition was developed and trained. The conclusions can be summarised as follows:
(1) Convolutional neural networks can play an important role in wheat counting. The improved YOLO v5 detects and counts wheat ears with fast recognition speed and a high recognition rate. After pre-processing, feature extraction, and network optimisation, the training recognition rate reaches 94.42%.
(2) Applying 8-Mosaic data augmentation during image pre-processing can accelerate the convergence of the model.
(3) The recognition rate of colour and texture features in complex backgrounds can be significantly improved by introducing a shallow feature layer and an attention mechanism.
(4) The method achieves fast and accurate wheat counting while also eliminating the need to rely on complex technologies such as remote sensing, which helps to significantly reduce the labour intensity of agricultural practitioners and thus improve work efficiency. In the future, more images need to be collected at different heights to cover the whole process of wheat counting, and the number of training samples needs to be increased to improve the recognition rate. In addition, the number of model parameters can be compressed so that the detection method can be combined with drones to achieve real-time detection.
The neural network model proposed in this paper sheds new light on wheat counting and, owing to its advantages, opens new possibilities for crop yield estimation. However, fast detection still requires a specific hardware configuration. We will continue to optimise our model, in particular through pruning. At the same time, we will extend our research to more wheat varieties to broaden the scope of application. We believe that this work can make a substantial contribution to sustainable, green, and automated agriculture.