A Wheat Spike Detection Method in UAV Images Based on Improved YOLOv5

Abstract: Deep-learning-based object detection algorithms have significantly improved the performance of wheat spike detection. However, UAV images crowded with small-sized, highly dense, and overlapping spikes reduce detection accuracy. This paper proposes an improved YOLOv5 (You Only Look Once)-based method to detect wheat spikes accurately in UAV images and to reduce the false and missed detections caused by occlusion. The proposed method introduces data cleaning and data augmentation to improve the generalization ability of the detection network. The network is rebuilt by adding a microscale detection layer, setting prior anchor boxes, and adapting the confidence loss function of the detection layer based on the IoU (Intersection over Union). These refinements improve feature extraction for small-sized wheat spikes and lead to better detection accuracy. With confidence weights, the detection boxes from multiresolution images are fused to increase accuracy under occlusion conditions. The results show that the proposed method outperforms existing object detection algorithms such as Faster RCNN, Single Shot MultiBox Detector (SSD), RetinaNet, and standard YOLOv5. The average precision (AP) of wheat spike detection in UAV images is 94.1%, which is 10.8% higher than that of standard YOLOv5. Thus, the proposed method is a practical way to handle spike detection in complex field scenarios and provides a technical reference for field-level wheat phenotype monitoring.


Introduction
Wheat is an important food crop worldwide, with an annual global yield of about 730 million tons, and is a foundation of world food security [1]. However, biotic and abiotic stresses have frequently affected wheat production in recent years, introducing many uncertainties into wheat yield formation. Therefore, using remote sensing to monitor the wheat growth process and predict yield has become a meaningful way to stabilize yield and optimize production management [2,3]. Moreover, assessing the production of wheat spikes, the grain-bearing organ, is a valuable and practical measure of wheat yield [4,5]. Thus, detecting wheat spikes from remote sensing images has received increased interest recently.
Considering the cost and observation limitations of satellite and ground remote sensing [6], UAVs have the advantages of low-altitude flight capability and efficient operation. As a result, UAVs can easily and quickly obtain large-scale, high-spatial-resolution images of wheat fields [7] and can successfully assess large-scale wheat spikes when equipped with visible light, multispectral, and thermal infrared cameras [8][9][10]. Meanwhile, because researchers can freely customize UAV flights according to their needs and field environments [11], UAVs significantly improve the efficiency of wheat spike surveys.
Wheat spike monitoring in UAV images mainly uses object detection methods to obtain the number and geometric pattern of wheat spikes in the image. The existing detection methods are mainly divided into two categories: concrete-feature-based methods and abstract-feature-based methods. Concrete-feature-based methods realize the segmentation and detection of wheat spikes by manually selecting features. Researchers integrate color, geometric, and texture features to analyze and classify the features based on non-neural approaches (e.g., Bayesian, support vector machine, and random forest) [12][13][14][15][16][17][18]. However, concrete-feature-based methods have disadvantages of complex feature design, weak migration, and cumbersome manual design [19]. They cannot be well adapted to scenes with dense wheat spike distribution and severe occlusion in the field [20]. Deep learning based on convolutional neural networks (CNNs) in computer vision has been well developed with the advancement of computer performance and improved availability of numerous labeled images [21,22]. Methods based on abstract features realize the segmentation and detection of wheat spikes through various abstract features. These abstract features are extracted by a convolutional neural network [23] without manual intervention. The performance of abstract-feature-based methods is better than that of methods based on specific features [19]. The one-stage and two-stage detection algorithms are the two main groups of abstract-feature-based methods and have received extensive attention in wheat spike detection research studies. Two-stage detection algorithms are based on region proposals, mainly including SPP-Net [24], Fast R-CNN [25], and Faster R-CNN [26]. The detection happens in two stages: region proposal generation and detection for these proposals [27]. 
The main one-stage detection algorithms are the SSD [28] and the YOLO (You Only Look Once) family, which includes YOLO [29], YOLO9000 [30], YOLOv3 [31], YOLOv4 [32], and YOLOv5 [33]. As a regression-based object detection method, the one-stage detection algorithm does not require a proposal generation step. By directly obtaining the location and category information of the object, the one-stage detection algorithm significantly improves the detection speed. However, its detection accuracy is lower than that of the two-stage detection algorithm.
State-of-the-art deep learning object detection algorithms have made significant progress in wheat spike detection in images [34,35]. The success of the wheat spike detection led to the high accuracy of in-field spike counting in former works [36][37][38][39]. However, small-sized, highly dense, and overlapping wheat spikes in UAV images can easily lead to error detection and miss detection. Meanwhile, the complex background of UAV images in fields and the substantial morphological differences between individual wheat spikes will increase the difficulty of the detection. These problems lead to the low accuracy of wheat spike detection in UAV images and make it impossible to forecast and evaluate yield.
In order to solve the issues mentioned above, this paper proposes a method based on improved YOLOv5 to detect wheat spikes accurately in UAV images. This method improves the generalization capability of the network and the accuracy of detecting small-sized wheat spikes in UAV images. The detection process is refined by adding a microscale detection layer, setting prior anchor boxes, and adapting the confidence loss function of the detection layer based on the IoU (Intersection over Union). Moreover, we fuse predicted boxes on multiresolution images to increase the wheat spike detection accuracy in complex field scenes. The proposed method improves the applicability of the YOLO algorithm in complex field environments: it can accurately detect multisized wheat spikes, especially small-sized ones, and better solve the occlusion and overlap problem of wheat spikes.

UAV Wheat Spike Images
High-quality in-field images were taken by a DJI™ Matrice™ 210 drone equipped with a DJI™ Zenmuse™ X4S camera at three heights (7, 10, and 15 m) during the ripening stage. The experimental field was located in Xinghua, Jiangsu Province, China (119.90°E, 33.07°N) (Figure 1). The original 5472 × 3648 pixel images were cropped into 150 × 150 pixel tiles (Figure 1a) to reduce data processing time, highlight wheat characteristics, and avoid loss of image information. Some of the obtained images were blurry due to unstable UAV flight conditions (Figure 1b), so we applied the Laplace transform to filter out blurred images and keep clear ones [40]. In addition, an image annotation tool (LabelImg) was used to label wheat spikes in the clear images [41] (Figure 1c-f).
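The Laplace-based cleaning step can be illustrated with the common variance-of-Laplacian heuristic: a low variance of the Laplacian response suggests a blurry image. The sketch below is illustrative only, not the authors' implementation, and the cut-off value is an assumed, dataset-dependent parameter.

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian response of a grayscale image.

    img: 2D list of pixel intensities. A low variance suggests a blurry image.
    """
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Laplacian kernel: sum of 4 neighbours minus 4 * centre
            lap = (img[y - 1][x] + img[y + 1][x] +
                   img[y][x - 1] + img[y][x + 1] - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def is_sharp(img, threshold=100.0):
    # threshold is an assumed, dataset-dependent cut-off
    return laplacian_variance(img) >= threshold
```

A perfectly flat tile yields zero variance and would be rejected, while tiles with strong local contrast (sharp spike edges) score far above any reasonable threshold.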

Wheat Spike Detection Method
This research proposes a wheat spike detection method in UAV images based on improved YOLOv5. The method consists of data preprocessing, network training, and model inference ( Figure 2). First, all the images are cleaned, augmented, and labeled. Then, with the network improvements, the detection models are trained and used on multiresolution images. Finally, the wheat spike detection results are achieved by fusing detection boxes derived from multiresolution images. Moreover, the YOLOv5 network is mainly improved by adding a microscale detection layer, setting prior anchor boxes, and adapting the confidence loss function of the detection layer based on the IoU. The method consists of three critical parts: data processing, network training, and model inference. The improvements proposed in this method (in orange color) include refining network structure and fusing detection boxes.

Data Augmentation
This research used data augmentation to improve network learning and enhance the generalization capability of the network model [19]. We mainly chose image rotation, image flip, and luminance balance as the data augmentation methods (Figure 3). Rotated and flipped images improve the detection performance and robustness of the network. Meanwhile, luminance balance eliminates the impact on network performance of brightness deviations caused by environmental lighting changes and sensor differences [42,43]. After data augmentation, a total of 12,000 images were obtained and divided into training, validation, and test datasets at a ratio of 7:2:1.
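The three augmentations can be sketched on a plain 2D pixel grid. This is a minimal illustration, not the paper's pipeline; the luminance balance here is assumed to be a simple mean-shift to a target brightness.

```python
def rotate90(img):
    """Rotate an image (2D list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def hflip(img):
    """Horizontal flip (mirror each row)."""
    return [row[::-1] for row in img]

def luminance_balance(img, target_mean=128.0):
    """Shift pixel intensities so the image mean matches target_mean,
    a simple stand-in for the paper's luminance balancing."""
    flat = [v for row in img for v in row]
    shift = target_mean - sum(flat) / len(flat)
    clamp = lambda v: max(0, min(255, round(v + shift)))
    return [[clamp(v) for v in row] for row in img]
```

Note that rotation and flipping must also be applied to the bounding-box labels so the annotations stay aligned with the transformed images.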

YOLOv5 Network Structure and Refinements
Glenn Jocher released YOLOv5 in 2020. Its network structure is mainly divided into the backbone module, neck module, and head module [33]. The backbone module extracts features from the input image based on Focus, Bottleneck CSP (Cross Stage Partial Networks), and SPP (Spatial Pyramid Pooling) and transmits them to the neck module. The neck module generates a feature pyramid based on the PANet (Path Aggregation Network). It enhances the ability to detect objects with different scales by fusing low-level spatial features and high-level semantic features bidirectionally. The head module generates detection boxes, indicating the category, coordinates, and confidence by applying anchor boxes to multiscale feature maps from the neck module.

Microscale Detection Layer
YOLOv5 detects at three scales, obtained by downsampling the input image dimensions by 32, 16, and 8, respectively, so spikes of different sizes are detected at different scales. However, some wheat spikes are tiny and densely distributed in UAV images, and the smallest-stride detection layer of YOLOv5 has poor applicability to them. Thus, we added a new microscale detection layer, obtained by downsampling the input image dimensions by 4. This microscale layer generates a feature map by extracting lower-level spatial features and fusing them with deep semantic features. The new microscale detection layer yields a broader and more detailed detection network structure (Figure 4), which is applicable to detecting the tiny, crowded wheat spikes in UAV images.

Hierarchical Setting of Anchor Box Size Based on k-Means
Faster RCNN first proposed the concept of the anchor box to detect multiple objects in a grid cell [26]. YOLO uses anchor boxes to match objects better [30,31]. Since customizing anchor boxes depends on prior knowledge of the dataset, the anchor box auto-learning over the entire dataset used in previous studies performed well on single-scale datasets. However, wheat spike sizes differ significantly in UAV images, and the numbers of samples of different sizes are unbalanced. As a result, anchor boxes clustered over the whole dataset only cover the most frequent wheat spike sizes and cannot effectively cover all of them. This research classified all wheat spikes into four categories according to their size, matching the four detection layers. For all wheat spike ground truth boxes gt_j = (x_j, y_j, w_j, h_j), j ∈ {1, . . . , M}, in each class G_i, i ∈ {1, . . . , N}, the distance metric between the ground truth box and the anchor box is defined as

d(gt, bbox) = 1 − IoU(gt, bbox), (1)

IoU(gt, bbox) = area(gt ∩ bbox) / area(gt ∪ bbox), (2)

where gt is the ground truth wheat spike bounding box and bbox denotes the anchor box. The larger the IoU between gt and bbox, the smaller the distance metric, meaning the anchor box describes the wheat spike bounding box more precisely. This study introduced five sizes of anchor boxes, a strategy that makes it possible to detect wheat spikes of different sizes in UAV images. The clustering process is shown in Algorithm 1.

Algorithm 1. The procedure for setting sizes of anchors
Input: ground truth boxes G_i
Output: anchor boxes Y_i
1: Select S cluster center points of the anchor boxes Y_i
2: repeat
3:   Calculate the distance between Y_i and G_i by Equations (1) and (2)
4:   Recalculate the cluster centers of S by Equations (3) and (4)
5: until the clusters converge

The new cluster centers are the mean width and height of the boxes assigned to each cluster:

W_i^(O+1) = (1/N_i) Σ_{j ∈ G_i} w_j, (3)

H_i^(O+1) = (1/N_i) Σ_{j ∈ G_i} h_j, (4)

where W_i^(O+1) and H_i^(O+1) are the new cluster centers used to calculate the new distance metrics, and N_i is the number of boxes in cluster i.
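Algorithm 1 amounts to k-means over (width, height) pairs with the 1 − IoU distance of Equations (1) and (2). A minimal sketch, assuming corner-aligned boxes and a simple deterministic initialization (the paper does not specify these details):

```python
def iou_wh(a, b):
    """IoU of two boxes given as (w, h), aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100):
    """Cluster ground-truth (w, h) pairs into k anchors using the
    d = 1 - IoU distance, as in Algorithm 1."""
    centers = boxes[:k]  # simple deterministic initialization
    for _ in range(iters):
        # assign each box to the nearest center (largest IoU)
        clusters = [[] for _ in range(k)]
        for b in boxes:
            i = max(range(k), key=lambda c: iou_wh(b, centers[c]))
            clusters[i].append(b)
        # recompute centers as mean width/height per cluster (Eqs. (3)-(4))
        new = [
            (sum(b[0] for b in cl) / len(cl), sum(b[1] for b in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new == centers:  # converged
            break
        centers = new
    return centers
```

Running this per size class G_i, rather than once over the whole dataset, is what lets the anchors cover under-represented spike sizes.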

Improvement of Confidence Loss Function of Detection Layer Based on IoU
Neural networks usually use a loss function to minimize network errors, and the value calculated by the loss function is referred to as the "loss" [44]. This research defines the network loss l from the location loss e_d, classification loss e_s, and confidence loss e_i as [31]:

l = e_d + e_s + e_i. (5)

The network loss measures the difference between predicted and observed values. Since uncertainty in the samples affects network accuracy, the weights of the loss function should be set according to the quality and quantity of the samples [45,46]. Hence, we adopt a new method of setting the confidence loss function of the detection layer based on the IoU (Intersection over Union). We collect the positive anchor boxes p for each wheat spike bounding box. Among them, we take the positive anchor boxes q with the maximum IoU. All anchor boxes in the grid cell where q falls are taken as the max-IoU positive anchor boxes q^m. We then count the positive anchor boxes p_i and q_i^m in each detection layer D_i and set the weight of the confidence loss function of each detection layer accordingly (Formulas (7) and (8)). The IoU between the spike bounding box and the anchor box is calculated as

IoU(ar, tr) = area(ar ∩ tr) / area(ar ∪ tr), (6)

where ar represents the positive anchor box and tr represents the wheat spike bounding box. The process of setting the weight of the confidence loss e_i of each detection layer is shown in Algorithm 2.
Algorithm 2. The procedure of setting weights for the confidence loss e_i
Input: a set of UAV images I
Output: weights {λ_i}_{i=1}^4 of the detection layers
1: Input the images I into the network for training
2: repeat
3:   Calculate p and q^m for the detection layers
4: until the training epochs reach K
5: Calculate p_i and q_i^m for each detection layer D_i
6: Normalize the final weights {λ_i}_{i=1}^4 of the detection layers {D_i}_{i=1}^4 by Equations (7) and (8)

The network is initialized with the image set I and trained for K epochs, yielding the positive anchor boxes {p_i}_{i=1}^4 for each detection layer. The number of max-IoU positive samples {q_i^m}_{i=1}^4 in each detection layer is counted, and the weight of the confidence loss function {λ_i}_{i=1}^4 of each layer is obtained as

λ_i = (j_i + α) / Σ_{k=1}^{4} (j_k + α), (7)

where {j_i}_{i=1}^4 denotes the ratio between q_i^m and p_i in each detection layer:

j_i = q_i^m / p_i, (8)

and λ_i is the weight of the confidence loss function of each detection layer after normalization, with α = 0.1. The improved confidence loss weight accounts for the variety of spike sizes and the anchor size of each output layer. Thus, the method increases the number of positive anchor boxes and reduces the negative impact of low-quality anchor boxes on the network. As a consequence, the network learns enough high-quality positive anchor boxes and improves its capability to detect small-sized wheat spikes.
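The per-layer weighting can be sketched as follows. Note that the exact normalization in Formulas (7) and (8) is not fully specified in the extracted text, so this sketch assumes the ratio j_i = q_i^m / p_i, smoothed by α and normalized to sum to 1:

```python
def confidence_weights(p, qm, alpha=0.1):
    """Per-layer confidence-loss weights from the counts of positive
    anchors p_i and max-IoU positives q_i^m (cf. Algorithm 2).

    Assumed form: j_i = q_i^m / p_i, then lambda_i = (j_i + alpha)
    normalized across the four detection layers.
    """
    j = [q / max(pi, 1) for pi, q in zip(p, qm)]  # guard against p_i = 0
    raw = [ji + alpha for ji in j]
    total = sum(raw)
    return [r / total for r in raw]
```

Under this assumed form, a layer whose positives are rarely the best match (low j_i) receives a smaller confidence-loss weight, which is the qualitative behavior the paper describes.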

Detection Box Fusion Based on Confidence Weight
Wheat spikes in UAV images are crowded, and spike occlusion lowers detection accuracy. For multiple wheat spike detection boxes, the commonly used nonmaximum suppression (NMS) method selects only a single detection box as the result (Figure 5, modified after [47]), so NMS cannot be adapted to wheat spike detection in UAV images. To address this problem, this paper uses the WBF (weighted boxes fusion) algorithm [47] to calculate fusion weights based on the confidence of the wheat spike detection boxes generated by different networks. The fused box is taken as the final wheat spike bounding result, which improves detection accuracy under overlapping and occlusion (Figure 5).
The detection box fusion first finds, for each wheat spike bounding box, the wheat spike detection boxes responsible for it in all networks. The fused box is then generated from the confidence of each wheat spike detection box as follows:

Xa = Σ_{i=1}^{Z} C_i · Xa_i / Σ_{i=1}^{Z} C_i,  Ya = Σ_{i=1}^{Z} C_i · Ya_i / Σ_{i=1}^{Z} C_i, (9)

Xb = Σ_{i=1}^{Z} C_i · Xb_i / Σ_{i=1}^{Z} C_i,  Yb = Σ_{i=1}^{Z} C_i · Yb_i / Σ_{i=1}^{Z} C_i, (10)

C = (1/Z) Σ_{i=1}^{Z} C_i, (11)

where Xa, Ya, Xb, and Yb are the coordinates of the top-left and bottom-right vertexes of the fused box, and C is the confidence of the fused box. Xa_i, Ya_i, Xb_i, and Yb_i are the coordinates of the top-left and bottom-right vertexes of the wheat spike detection boxes involved in the calculation, C_i is the corresponding confidence, and Z is the number of wheat spike detection boxes involved. Figure 5 is modified after [47]; green boxes represent detection boxes, and red boxes represent ground truth boxes.
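The fusion of one cluster of matched boxes can be sketched directly from these formulas (full WBF also includes matching boxes across models by IoU, which is omitted here):

```python
def fuse_boxes(boxes):
    """Confidence-weighted fusion of overlapping detection boxes (WBF).

    boxes: list of (xa, ya, xb, yb, conf) tuples for one matched cluster.
    Coordinates are averaged with confidence weights; the fused
    confidence is the mean of the cluster confidences.
    """
    total_c = sum(b[4] for b in boxes)
    xa = sum(b[0] * b[4] for b in boxes) / total_c
    ya = sum(b[1] * b[4] for b in boxes) / total_c
    xb = sum(b[2] * b[4] for b in boxes) / total_c
    yb = sum(b[3] * b[4] for b in boxes) / total_c
    c = total_c / len(boxes)
    return (xa, ya, xb, yb, c)
```

Because coordinates are confidence-weighted, a high-confidence box pulls the fused result toward itself, while low-confidence boxes still contribute evidence instead of being discarded as in NMS.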

Multiresolution Image Training
Compared with previous studies that only trained images with a single resolution [48,49], the proposed method resamples images into multiple resolutions and sends them to the network for training. The images are resampled into four resolutions: 150 × 150, 300 × 300, 450 × 450, and 600 × 600.
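The resampling step can be illustrated with a minimal nearest-neighbour resize on a 2D pixel grid (the paper does not state which interpolation it uses; this is a stand-in):

```python
def resize_nearest(img, size):
    """Nearest-neighbour resampling of an image (2D list of rows) to a
    size x size grid, as a minimal stand-in for the multiresolution step."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]

# Each 150 x 150 tile would be resampled to the four training
# resolutions: 150, 300, 450, and 600 pixels per side.
```

Each resolution is then fed to its own training run, and the per-resolution detections are later combined by the confidence-weighted fusion described above.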
The experiment was performed on a workstation equipped with an Intel® Xeon® processor, an NVIDIA Titan V graphics processor (12 GB memory), and 500 GB of memory. The operating system was Ubuntu 16.04. We set the initial learning rate and batch size according to the resolution of the input images. The SGD (stochastic gradient descent) method is used to optimize the learning rate during training, the weight decay is set to 1 × 10^−4, and the momentum is set to 0.9. The specific hyperparameter settings of the network training are shown in Table 1.

Network Performance Evaluation
This research evaluates network performance by detection accuracy and speed. FPS (frames per second) is used as the indicator of detection speed. Labeling the boxes as spike or nonspike yields four potential predictions: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). If the IoU between a detection box and a wheat spike bounding box is greater than 0.5, the detection box is marked as TP; otherwise, it is marked as FP. If a wheat spike bounding box has no matching detection box, it is marked as FN. TN is not required in this binary classification problem, where the foreground is always determined for spike detection. TP and FP are the numbers of wheat spikes detected correctly and incorrectly, respectively, and FN is the number of undetected wheat spikes. The precision rate (Pr) and recall rate (Re) are defined as

Pr = TP / (TP + FP), (12)

Re = TP / (TP + FN). (13)

Pr and Re affect each other and cannot be used directly to evaluate detection accuracy. Therefore, we introduce the average precision (AP), the area under the precision-recall curve, i.e., the mean precision over recall values from 0 to 1:

AP = ∫_0^1 Pr(Re) dRe. (14)

Higher AP means higher network accuracy.
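Formulas (12)-(14) can be computed as follows; the AP here uses all-point interpolation over a list of (recall, precision) points, which is one common way to approximate the integral:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from TP/FP/FN counts (Formulas (12)-(13))."""
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return pr, re

def average_precision(points):
    """Approximate the area under the precision-recall curve:
    AP = sum over recall steps of (delta recall) * precision.

    points: list of (recall, precision) pairs sorted by increasing recall.
    """
    ap, prev_r = 0.0, 0.0
    for r, p in points:
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

For example, 90 correct detections with 10 false positives and 20 missed spikes give Pr = 0.9 and Re ≈ 0.818; a network that keeps precision high as recall grows accumulates a larger AP.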

Experimental Results
Compared with standard YOLOv5 and other general object detection methods, the proposed method based on improved YOLOv5 achieves the highest accuracy, with an AP of 94.1% (Table 2). The accuracy is 10.8% higher than that of the standard YOLOv5, and the speed is 30 FPS, realizing accurate detection of wheat spikes in UAV images. We find that the resolution of the training images significantly impacts detection accuracy: the higher the input image resolution, the higher the detection accuracy. After refining the network by adding the microscale detection layer, setting the prior anchor boxes, and adapting the confidence loss function of the detection layer, the detection accuracy of the rebuilt network is 91.9% when the input image resolution is 600 × 600, which is 8.6% higher than that of the standard YOLOv5 network (Figure 6, Table 3). Moreover, the fusion strategy leads to the best detection accuracy of 94.1%. Figure 6. The precision and recall curves of wheat spike detection. Refined YOLOv5 is based on the refined network, including adding a microscale detection layer, setting prior anchor boxes, and adapting the confidence loss function of the detection layer based on the IoU (Intersection over Union).

Discussion
Adding a microscale detection layer and adapting the confidence loss function of the detection layer based on the IoU enable small-sized wheat spike detection by supplying the network with small-sized, high-quality positive anchor boxes. For data-driven deep neural networks, there are usually far more negative anchor boxes than positive ones. Too many negative anchor boxes cause an imbalance, letting negative anchor boxes dominate the network's training [27,50]. Wheat spike sizes in UAV images are generally distributed between 25 and 400 pixels, accounting for about 80% of all spikes (Figure 7). The positive anchor boxes of the standard YOLOv5 cover few small-sized wheat spikes (Figure 8b-d), which means that with the standard three detection layers, the network cannot learn the characteristics of small-sized wheat spikes. In this case, the network mistakes wheat spikes for background, resulting in many missed detections. After adding a microscale detection layer for small-sized wheat spikes, the network acquires more small-sized positive anchor boxes (Figure 8a), which improves the detection accuracy for small-sized wheat spikes. The results of the ablation study of the components of the proposed method with 600 × 600 input images also reveal that adding the microscale detection layer is the most critical improvement (Table 4).
Focusing on high-quality positive anchor boxes benefits the detection accuracy of convolutional networks [51]. Standard YOLOv5 cannot accurately describe and fit the actual position of wheat spikes (Figure 8b-d) because few positive anchor boxes are involved in the standard YOLOv5 layers' training. A detection layer with a higher downsampling rate enlarges the difference between the detection bounding box and the ground truth. The positive anchor boxes of the microscale detection layer are closer to the actual positions of the wheat spike bounding boxes and are therefore more helpful to the network (Figure 8a). In addition, adapting the confidence loss function of the detection layer pays more attention to high-quality positive anchor boxes and increases the contribution of the microscale detection layer to the network. Hence, with a microscale detection layer, the network achieves good wheat spike detection accuracy in UAV images. The results also show that the network with the prior anchor box setting outperforms the network using the default anchor box configuration (Table 4, Figure 9a,b). The number of missed wheat spikes was reduced from 17 to 9 (Figure 9b). Anchor box configuration is a critical issue in spike detection. The default anchor box settings are not applicable when spikes have significantly different sizes in one scene [52]. Thus, anchor boxes with multiple sizes and shapes should be developed for different datasets [53,54]. Such anchors capture the characteristics of objects in the images and improve detection accuracy. Figure 9. Detection results using the default anchor setting (a) and the prior anchor setting by k-means clustering (b). Blue boxes represent correctly detected wheat spikes, red boxes represent erroneously detected wheat spikes, and green boxes represent undetected wheat spikes.
The wheat spike detection accuracy is successfully improved by using a multiresolution image training strategy and generating fusion results based on the confidence weights of the detection boxes. The input image resolution affects the detection results, so multiresolution image training is an effective method for detecting small-sized objects [55][56][57]. In this research, the detection accuracy is higher when the resolution of the input images is higher, which is consistent with the results of other studies tested on general datasets [58]. Considering factors such as individual wheat characteristics, monitoring platforms, and computational resource consumption, wheat spike detection research based on convolutional neural networks often selects the most suitable resolution manually [8,59,60]. However, wheat spike occlusion and overlapping are typical in UAV images. The detection results of networks trained on single-resolution images contain missed and false detections, and their adaptability to occlusion and overlapping conditions is poor (Figure 10a,b). Compared with single-resolution training, multiresolution training covers objects more widely and generates detection results more accurately [61]. This research integrates the detection results from different resolutions and successfully generates more accurate results. The accuracy of the fusion results is 94.1%, which indicates that the spike occlusion and overlapping problems are largely resolved (Figure 10c, Table 4).

Conclusions
We developed a wheat spike detection method based on improved YOLOv5 for UAV images. The method consists of three critical steps: data preprocessing of the UAV wheat spike images; network refinement by adding a microscale detection layer, setting prior anchor sizes, and adapting the confidence loss function; and multiresolution detection result fusion. With the proposed method, we can detect wheat spikes well in UAV images under occlusion and overlapping conditions. The average precision (AP) of 94.1% is 10.8% higher than that of the standard YOLOv5. Therefore, the proposed method improves the applicability of the YOLO algorithm in complex field environments and provides a technical reference for agricultural wheat phenotype monitoring and yield prediction. With the development of deep learning, researchers are no longer satisfied with merely applying convolutional neural networks to wheat spike detection. In future work, we will gradually dissect the constructed network structure, explain the semantics of the network, illustrate how individual hidden units of a deep convolutional neural network enable it to solve the wheat spike detection task, and further optimize the structure of the wheat spike detection network to achieve better detection performance.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.