Rapid Detection and Counting of Wheat Ears in the Field Using YOLOv4 with Attention Module

Abstract: The detection and counting of wheat ears are very important for crop field management, yield estimation, and phenotypic analysis. Previous studies have shown that most methods for detecting wheat ears were based on shallow features such as color and texture extracted by machine learning methods, and these methods obtained good results. However, because these features lack robustness, it was difficult for such methods to meet the requirements of wheat ear detection and counting in natural scenes. Other studies have shown that convolutional neural network (CNN) methods could be used to achieve wheat ear detection and counting. However, the adhesion and occlusion of wheat ears limit the accuracy of detection. Therefore, to improve the accuracy of wheat ear detection and counting in the field, an improved YOLOv4 (you only look once v4) with CBAM (convolutional block attention module), which includes spatial and channel attention, was proposed; it enhances the feature extraction capabilities of the network by adding receptive field modules. In addition, to improve the generalization ability of the model, not only local wheat data (WD) but also two public data sets (WEDD and GWHDD) were used to construct the training set, the validation set, and the test set. The results showed that the model could effectively overcome the noise in the field environment and realize accurate detection and counting of wheat ears with different density distributions. The average accuracy of wheat ear detection was 94%, 96.04%, and 93.11% for WD, WEDD, and GWHDD, respectively. Moreover, the wheat ears were counted on 60 wheat images. The results showed that R² = 0.8968 for WD, 0.955 for WEDD, and 0.9884 for GWHDD. In short, the CBAM-YOLOv4 model could meet the actual requirements of wheat ear detection and counting, which provides technical support for the extraction of other high-throughput crop parameters.


Introduction
Wheat is one of the most important food crops and plays an important role in food security. The forecast of wheat yield has become an important part of the agricultural production process, which can provide necessary references for field management and agricultural decision-making [1]. Therefore, accurately identifying and counting wheat ears is of great significance for monitoring crop growth, estimating wheat yield, and analyzing plant phenotypic characteristics.
At present, with the rapid development of machine vision technology, LiDAR [2], heat maps [3], and digital images have achieved good detection results in wheat monitoring. In particular, high-resolution images have made the detection of crops easier and more efficient [4]. Among them, image processing and feature extraction have gradually become key technologies for wheat ear recognition, and have made excellent contributions to improving the accuracy of detection and counting. Previous studies have shown that features such as texture, color, and morphology were used to successfully separate wheat ears from the background. Narkhede et al. used color space conversion to count wheat ears by using color features in different color spaces [5]. However, the colors of wheat ears, leaves, and stalks in the wheat field are similar. Moreover, the colors of the various parts of the wheat plant change as the wheat grows. Therefore, it was difficult to accurately identify wheat ears by using color features [6]. Some scholars have proposed counting wheat ears based on their texture and color characteristics [7,8]. However, in the heading stage, wheat ears and leaves have similar texture characteristics that affect detection accuracy. Therefore, the detection and counting of wheat ears in the natural environment still face great challenges.
In addition, the adhesion and occlusion between wheat ears severely limit the accuracy of wheat ear identification and counting. Some scholars have successfully detected targets by using segmentation techniques for adhering objects, such as morphology [9], concave point matching [10], and the watershed algorithm [11]. However, the morphology of wheat ears varies considerably within an image. Therefore, segmentation based on morphology cannot be used to accurately count the wheat ears in the adhesion area. Moreover, the concave point matching algorithm requires that the adhering objects have smooth edges. However, the edges of wheat ears are not smooth, and it is difficult to obtain a smooth edge for the image of wheat ears, even if its binary image undergoes a series of erosion and dilation operations. The watershed algorithm needs to calculate local extrema. However, because the texture of the wheat ear itself is pronounced, there are many local extrema, so using the watershed algorithm to detect wheat ears leads to over-segmentation. Although the corner points of wheat ears have been used for effective segmentation to detect wheat ears [12], different wheat ears have different corner patterns, which makes the approach hard to apply on a large scale. Therefore, how to accurately count wheat ears that occlude each other remains an open problem.
With the development of image processing technology, previous studies have shown that machine learning methods are used to build a classifier for wheat ear detection, thereby realizing wheat ear detection and counting [13]. Xu used the k-means algorithm to segment the wheat ears to achieve recognition [14]. Fernandez-Gallego et al. used Fourier filtering and Fourier transform to segment wheat ears and backgrounds [15]. Zhu et al. used the support vector machine method to successfully detect and count wheat ears [16]. Zhou et al. used the twin-support-vector machine segmentation method to segment and count wheat ears [17]. Although the recognition of wheat ears has been achieved based on machine learning methods, most methods still require prior knowledge to artificially set image features, which leads to insufficient robustness of features under noise interference such as uneven lighting and complex backgrounds in a field environment. Therefore, it is difficult to detect and count wheat ears in different scenarios based on traditional machine learning methods due to the lack of universality.
In the past ten years, deep learning has become a research hotspot in the field of pattern recognition. It has led to excellent achievements in many fields such as computer vision, image analysis, and multimedia applications [18]. Different from traditional pattern recognition methods, deep learning automatically learns features from big data instead of relying on manually designed features. In fact, the characteristics of wheat ears in the field are often combined in a non-linear manner under the influence of various complex factors. The key to deep learning is to successfully separate these factors through multi-layer nonlinear mapping [19]. Misra et al. integrated a local patch extraction network (LPNet) and a global mask refinement network (GMRNet) to achieve wheat ear segmentation and counting [20]. Xiong et al. used context-augmented local regression networks to detect and count wheat ears [21]. The research mentioned above shows that deep learning has strong robustness in detecting wheat ears. In recent years, the convolutional neural network (CNN), a type of deep learning, has achieved brilliant results [22-24] in the detection and counting of wheat ears. However, CNN-based detection of this kind relies on the region proposal method: a sliding window or a set of extracted proposals is first used to train the network, and classification is then performed within the region proposals. The limitation of this method is that the background area is often misdetected as a specific target in object recognition. The wheat images collected in a field environment are subject to many interferences such as high plant density, multiple overlaps, uneven lighting, and complex backgrounds. Therefore, wheat ear detection based on deep learning still has issues worth exploring.
YOLO (you only look once) is a high-precision target detection method, which can directly predict the location and attributes of the target for the entire image based on a single convolutional network [25]. With the development of the YOLO algorithm, YOLOv4 has attracted more attention. Although YOLOv4 was successfully used to detect apple flowers [26], we still found that YOLOv4 also has insufficient bounding box positioning and it is difficult to distinguish between overlapping detected objects. The emergence of the attention mechanism can effectively solve the above problems. When processing information, the attention module only pays attention to part of the regional information that is conducive to the realization of the task, and filters out secondary information to improve the model effect, which has been used in image classification [27], image segmentation [28], and image detection [29]. Therefore, CBAM-YOLOv4 was proposed in this research, which integrated the convolutional block attention module (CBAM) [30] into the YOLOv4 convolution module to achieve the learning of target features and location features in the channel dimension and the global space dimension, respectively. The improved YOLOv4 algorithm could dynamically enhance useful features (wheat ears) and suppress background noise (wheat stalks, wheat leaves, wheat seedlings, wheat awns, and soil). As far as we know, there are few reports on the detection and counting of wheat ears using the CBAM-YOLOv4 model.
Therefore, this study focused on the feasibility of the deep learning method for high-throughput wheat ear detection and counting under field conditions, and verified that the proposed model had the ability to quickly detect wheat ears from a complex background. The purpose of this study was to (1) train a CBAM-YOLOv4 model and fine-tune its parameters to achieve accurate detection of wheat ears, (2) improve the robustness of the model by using a dual (channel and spatial) attention mechanism to eliminate background interference, and (3) realize accurate counting of wheat ears under complex backgrounds to obtain high-throughput wheat parameters such as yield and above-ground nitrogen content.

Data Acquisition
To diversify the wheat ear samples, three different data sets were used in this study. The WD data set comes from the National Agricultural Science and Technology Innovation and Integration Demonstration Base in Anhui Province, China (31.25° N, 117.28° E). The wheat varieties are Wanmai 55 and Ningmai 15 with three nitrogen fertilizer treatments (0, 104, 150 kg/hm²). These images were used as the original data set and were taken with a digital camera (D5300, Nikon Corp, Tokyo, Japan). The images of the heading and maturity stages were taken manually in May 2019 at 50-80 cm above the top of the wheat canopy. A total of 190 images were selected from the collected images to construct the WD data set. All images were stored in JPG format according to the sRGB color standard, and the original resolution was 3456 × 4408 pixels. A single image contains wheat ears and leaves, and part of the wheat images is shown in Figure 1a. The wheat ear samples in these images included severe occlusion, slight occlusion, and no occlusion. The contour of a single wheat ear was incomplete when it was blocked by wheat leaves or other wheat ears. Generally, a wheat ear was considered severely occluded when more than 50% of its pixel area was covered, and lightly occluded when less than 50% was covered. The second data set was the public data set WEDD (Wheat Ears Detection Dataset) provided by Madec et al. [31]. It contains 236 high-resolution wheat images (6000 × 4000 pixels), with a total of 30,729 wheat ears. The data were collected using a Sony ILCE-6000 digital camera, which was fixed on a boom 2.9 m from the ground for shooting. Part of the wheat images from WEDD is shown in Figure 1b.
The third data set was the Global Wheat Head Detection Dataset (GWHDD) [32], including 3376 RGB images (1024 × 1024 pixels) with a total of 145,665 wheat ears. These wheat images come from different regions, including Europe (France, Switzerland, United Kingdom), North America (Canada), Oceania (Australia), and Asia (Japan, China). The acquired images differ greatly, covering different varieties, different planting conditions, and different image acquisition methods. Therefore, the wheat samples from GWHDD are diverse and typical. Part of the wheat images is shown in Figure 1c.
To reduce hardware load and unify the annotation requirements, the images in the data sets were cropped directly to a size of 1024 × 1024 pixels so as to include as many wheat ear samples as possible.
The processed data sets were divided into a training set, a validation set, and a test set, as shown in Table 1. The training set was randomly sampled from the overall data set with independent and identical distribution, and the test set and the validation set were mutually exclusive, which ensured the reliability of the later evaluation. The validation set was used to determine the hyperparameters of the model during the training process, and the test set was used to evaluate the generalization ability of the model.

Data Annotation
To realize the labeling of wheat ears, labelImg (https://github.com/tzutalin/labelImg, 10 January 2021) was used to label the collected data sets. Specifically, each wheat ear in the corresponding white box of an image was annotated with a rectangular box, which was represented by the coordinates of its four vertices. After marking all the wheat ears in the corresponding frame in an image, the corresponding XML file was generated, which included information such as the size of the image, the name of the label frame, and the location of the target frame. Figure 2 is an example of labeling of three different data sets (WD, WEDD, and GWHDD). Among them, the images in the first row of Figure 2 represent annotated samples of wheat ears. The images in the second row show a magnified portion of the annotated area in the first row of samples.
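As an illustration of what such an annotation file contains, the following minimal sketch reads a labelImg-style XML file in the Pascal VOC layout and converts each box to the normalized center format that Darknet expects. The file name, the label name `ear`, and the box coordinates are hypothetical; note that this layout stores each rectangle as its two opposite corners (`xmin`, `ymin`, `xmax`, `ymax`), from which the four vertices follow.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation in the Pascal VOC layout that labelImg can export.
SAMPLE_XML = """
<annotation>
  <filename>wheat_0001.jpg</filename>
  <size><width>1024</width><height>1024</height><depth>3</depth></size>
  <object>
    <name>ear</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>180</xmax><ymax>210</ymax></bndbox>
  </object>
  <object>
    <name>ear</name>
    <bndbox><xmin>400</xmin><ymin>512</ymin><xmax>455</xmax><ymax>640</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text):
    """Return (width, height, boxes); each box is (label, xmin, ymin, xmax, ymax)."""
    root = ET.fromstring(xml_text.strip())
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)
    boxes = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        coords = [int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.find("name").text, *coords))
    return width, height, boxes


def to_yolo(box, width, height):
    """Convert a corner box to normalized (cx, cy, w, h) as used by Darknet labels."""
    _, xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2 / width, (ymin + ymax) / 2 / height,
            (xmax - xmin) / width, (ymax - ymin) / height)


width, height, boxes = parse_voc(SAMPLE_XML)
yolo_boxes = [to_yolo(b, width, height) for b in boxes]
```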

Data Augmentation
To improve the robustness of the detection model, several data augmentation methods were adopted. The specific operations were as follows: (1) brightness conversion at different levels, in which the image brightness was scaled by factors of 1.3 and 0.7, so that the wheat detection model was less affected by the varying light in the field environment; (2) contrast adjustment, in which the contrast of the wheat image was scaled by factors of 1.2 and 0.8, so that the sharpness, gray level, and texture details of the wheat image could be better expressed; (3) random multi-angle rotation and flipping, such as rotation by 90° and 270°, horizontal flip, and mirror flip.
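The operations listed above can be sketched on a tiny grayscale image stored as nested lists. This is only an illustration under stated assumptions: the exact contrast formula is not given in the text, so scaling around the image mean is assumed here, and all function names are hypothetical. In a real pipeline the bounding boxes would also have to be transformed together with the rotated or flipped image.

```python
def _clip(value):
    """Round and clip a value to the 8-bit range."""
    return min(255, max(0, round(value)))


def adjust_brightness(img, factor):
    """Scale every pixel by `factor` (1.3 brightens, 0.7 darkens)."""
    return [[_clip(p * factor) for p in row] for row in img]


def adjust_contrast(img, factor):
    """Stretch (1.2) or compress (0.8) pixel values around the image mean."""
    flat = [p for row in img for p in row]
    mean = sum(flat) / len(flat)
    return [[_clip(mean + (p - mean) * factor) for p in row] for row in img]


def rotate90(img):
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate270(img):
    """Rotate 270 degrees clockwise (90 degrees counter-clockwise)."""
    return [list(row) for row in zip(*img)][::-1]


def hflip(img):
    """Flip left to right."""
    return [row[::-1] for row in img]


img = [[10, 100], [200, 250]]
augmented = {
    "bright": adjust_brightness(img, 1.3),
    "dark": adjust_brightness(img, 0.7),
    "contrast_up": adjust_contrast(img, 1.2),
    "contrast_down": adjust_contrast(img, 0.8),
    "rot90": rotate90(img),
    "rot270": rotate270(img),
    "hflip": hflip(img),
}
```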

YOLOv4 Model
The YOLO (You Only Look Once) network is a target detection algorithm that directly regresses the position and category of the bounding box based on a convolutional neural network. The advantage of the YOLO model is that it can better distinguish the target from the background area [25]. Earlier YOLO models include YOLOv1 [33], YOLOv2 [34], and YOLOv3 [35], which have achieved good results in many target detection tasks. The YOLOv4 target detection algorithm builds on the YOLOv3 architecture and improves the detection accuracy of the model by optimizing data processing, the backbone network, network training, the activation function, and the loss function [36]. Specifically, the YOLOv4 model retains the head of YOLOv3, changes the backbone network to CSPDarknet53, and uses the idea of SPP (spatial pyramid pooling) to expand the receptive field. YOLOv4 introduces a multi-scale feature extraction module to ensure strong detection performance for targets of different sizes [37]. The Path Aggregation Network (PANet) mainly integrates the features extracted by the backbone network. CBL (Convolution, Batch normalization, and Leaky ReLU) is the smallest component in the YOLOv4 network structure, and consists of a convolution, BN (batch normalization), and a Leaky ReLU function. Compared with the YOLOv3 model, the YOLOv4 model has a faster detection speed and good accuracy.

Channel Attention Module and Spatial Attention Module
CBAM consisted of a channel attention module and spatial attention module, as shown in Figure 3. For input feature F, the global information of each feature channel was obtained through global average pooling and maximum pooling operations, and then the feature channel attention vector was obtained through two fully connected layers, FC1 and FC2, which was used to weight the input feature F channel by channel to obtain the feature F'.
For the spatial attention module, the feature F' was subjected to the maximum pooling operation to obtain a feature map describing its spatial information. This map was then input into a 3 × 3 convolutional layer followed by the sigmoid function to obtain the spatial attention map, which was used to weight the feature F' and obtain the fused feature F'' [30].
The information for detecting wheat ear characteristics was usually concealed by leaves or other wheat ears. Therefore, the channel attention module could enhance the feature expression of the occluded target, and the spatial attention module could highlight areas in the feature map that are related to the current task.
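The two attention steps can be illustrated with a small pure-Python sketch. It follows the description in this section (channel attention from average and maximum pooling through two shared fully connected layers, then spatial attention from channel-wise maximum pooling and a 3 × 3 convolution with a sigmoid). The random weights, the reduction ratio of 2, the ReLU between FC1 and FC2, and the zero padding are assumptions made for illustration; this is not the trained module from the study.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_attention(feat, fc1, fc2):
    """feat: C channels, each an H x W nested list. Returns the reweighted F'."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]

    def shared_mlp(vec):  # FC1 -> ReLU -> FC2, shared by both pooled vectors
        hidden = [max(0.0, h) for h in matvec(fc1, vec)]
        return matvec(fc2, hidden)

    scale = [sigmoid(a + m) for a, m in zip(shared_mlp(avg), shared_mlp(mx))]
    return [[[s * p for p in row] for row in ch] for s, ch in zip(scale, feat)]


def spatial_attention(feat, kernel):
    """Weight every position of F' by a sigmoid-activated 3 x 3 convolution."""
    h, w = len(feat[0]), len(feat[0][0])
    # Channel-wise max pooling gives one H x W descriptor map.
    pooled = [[max(ch[i][j] for ch in feat) for j in range(w)] for i in range(h)]
    att = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # zero padding at the border
                        acc += kernel[di + 1][dj + 1] * pooled[y][x]
            att[i][j] = sigmoid(acc)
    return [[[att[i][j] * ch[i][j] for j in range(w)] for i in range(h)]
            for ch in feat]


random.seed(0)
C, H, W, R = 4, 5, 5, 2  # channels, height, width, reduction ratio (assumed)
feat = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
fc1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // R)]
fc2 = [[random.uniform(-1, 1) for _ in range(C // R)] for _ in range(C)]
kernel = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

f_prime = channel_attention(feat, fc1, fc2)          # F'
f_double_prime = spatial_attention(f_prime, kernel)  # F''
```

Because both attention maps pass through a sigmoid, every weight lies between 0 and 1, so the module can only rescale features, never change the shape of the feature map.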

Model of Wheat Ear Detection and Counting Based on CBAM-YOLOv4
As shown in Figure 4, the process of wheat ear detection and counting model mainly included two modules; one was the batch training module, the other was the detection and counting module. Among them, the batch training module included original data samples, data expansion, data labeling, divided data sets, and CBAM-YOLOv4 model construction; the detection and counting module includes wheat ear detection and counting.
The procedure was as follows. Firstly, the training set and the validation set were constructed using the manually labeled wheat ear data set. Secondly, the training set of wheat images was used to fine-tune the model based on the transfer learning method. Thirdly, the CBAM-YOLOv4 model was further adjusted, optimized, and verified using the validation set. Finally, the model was tested using the test set, and the detection and counting results of wheat ears were generated. The wheat ear recognition results were presented in the form of bounding boxes. Counting wheat ears was based on wheat ear identification, and the results were displayed with a bounding box and a serial number. According to the structure of the YOLOv4 detection model, we embedded the CBAM in the neck of YOLOv4, as shown in Figure 5. The wheat image input to the model has a size of 416 × 416 × 3. The output feature maps A1, B1, and C1 of CSPDarknet53 pass through the SPP network and the CBAM blocks F1, F2, and F3 to generate feature maps A2, B2, and C2, which carry the attention information, with sizes of 52 × 52 × 256, 26 × 26 × 512, and 13 × 13 × 1024, respectively. Feature maps A2, B2, and C2 then pass through the CBL × 5 modules of the PAN network to generate feature maps A3, B3, and C3, with sizes of 52 × 52 × 128, 26 × 26 × 256, and 13 × 13 × 512, respectively. Figure 5 shows the modified area of CBAM-YOLOv4.
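The spatial sizes quoted above follow from the downsampling factors of the three detection scales. As a short worked check, assuming the usual YOLO strides of 8, 16, and 32 (the strides are not stated in the text), a 416 × 416 input gives:

```python
INPUT_SIZE = 416
STRIDES = (8, 16, 32)                    # assumed downsampling factors per scale
CHANNELS_AFTER_CBAM = (256, 512, 1024)   # A2, B2, C2 as stated in the text
CHANNELS_AFTER_CBL5 = (128, 256, 512)    # A3, B3, C3 as stated in the text


def feature_shapes(input_size, strides, channels):
    """Return (height, width, channels) for each detection scale."""
    return [(input_size // s, input_size // s, c) for s, c in zip(strides, channels)]


shapes_a2_b2_c2 = feature_shapes(INPUT_SIZE, STRIDES, CHANNELS_AFTER_CBAM)
shapes_a3_b3_c3 = feature_shapes(INPUT_SIZE, STRIDES, CHANNELS_AFTER_CBL5)
```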

Method of Wheat Ear Detection and Counting
The detection and counting results of wheat ears are presented in the form of detection boxes, and the count is the number of detection boxes accepted as wheat ears. The counting method is as follows: traverse all the detection boxes and set a threshold N. When S > N, the box is considered to be a wheat ear target and is included in the count. When S ≤ N, the box is considered not to be a wheat ear target and is excluded from the count. A serial number is displayed in each counted detection box (starting with the number 1 and accumulating in sequence). The formula is:

$$\mathrm{Count} = \sum_{x} f(S_x), \qquad f(S_x) = \begin{cases} 1, & S_x > N \\ 0, & S_x \le N \end{cases}$$

where $S_x$ is the score of the x-th detection box and N is the set threshold; N = 0.5 in this study.
In addition, eight scholars in the field manually counted the wheat ears in the test set images ten times, and the average was taken as the ground truth number of wheat ears. CSRNet [38] was used to visualize the distribution of the number of wheat ears.
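The counting rule can be sketched in a few lines. The detection scores and boxes below are hypothetical; only the rule itself (keep a box when its score exceeds N = 0.5, and number the kept boxes from 1) comes from the text.

```python
def count_ears(detections, threshold=0.5):
    """detections: list of (score, box). Returns kept boxes as (serial, score, box)."""
    kept = []
    for score, box in detections:
        if score > threshold:  # S > N: the box takes part in the count
            kept.append((len(kept) + 1, score, box))
    return kept


# Hypothetical detector output: (score, (xmin, ymin, xmax, ymax)).
detections = [
    (0.91, (12, 30, 60, 110)),
    (0.50, (70, 15, 118, 96)),    # S == N, so it is not counted
    (0.49, (200, 40, 250, 130)),
    (0.77, (300, 220, 352, 310)),
]
counted = count_ears(detections)
ear_count = len(counted)
```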

Evaluation of the Model Performance
To test the effectiveness of the CBAM-YOLOv4 model and to verify the transfer performance of the attention information on the model, intersection over union (IOU) is used to evaluate the accuracy of the model according to the coincidence rate of the output box and the label box. Setting a different IOU threshold will result in different numbers of detection frames. Among them, a high threshold results in a small number of detection frames, and a low threshold results in a large number of detection frames. When the detected wheat ear target is small, if a larger threshold is set, the detection of the wheat ear may be missed. Therefore, the threshold value is 0.5 in this study.
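As a small worked example of the IOU criterion, the sketch below computes the overlap of axis-aligned boxes given as corner coordinates and applies the 0.5 threshold. The box coordinates are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


label_box = (0, 0, 10, 10)
loose_box = (5, 0, 15, 10)   # overlaps half of the label box
tight_box = (2, 0, 12, 10)   # overlaps most of the label box

IOU_THRESHOLD = 0.5
loose_match = iou(label_box, loose_box) >= IOU_THRESHOLD
tight_match = iou(label_box, tight_box) >= IOU_THRESHOLD
```

Here the loose box has an IOU of 50/150 and is rejected, while the tight box has an IOU of 80/120 and is accepted as a correct detection.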
In addition, precision, recall, F1-score, and mean average precision (mAP) were used as evaluation indicators for the trained model:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1\text{-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \sum_{k=1}^{N} P(k)\, \Delta R(k)$$

Here, true positives (TP) means that both the detection result and the true value are wheat ears, that is, the number of wheat ears detected correctly. False positives (FP) means that the detection result is a wheat ear but the true value is the background, that is, the number of wheat ears counted incorrectly. False negatives (FN) means that the detection result is the background but the true value is a wheat ear, that is, the number of wheat ears that are not counted.
"TP + FP" refers to the total number of wheat ears detected, and "TP + FN" refers to the total number of wheat ears in an image. The F1-score evaluates the performance of the method by balancing precision and recall. C is the number of categories, N is the number of all pictures in the test set, $P(k)$ is the precision when k pictures can be recognized, and $\Delta R(k)$ is the change of the recall value when the number of recognized pictures changes from k − 1 to k.
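The indicators described above can be checked numerically. The sketch below computes precision, recall, and F1-score from TP, FP, and FN, and accumulates P(k)·ΔR(k) over a ranked list. The counts and the ranked hit list are hypothetical, and ranking individual detections by confidence is an assumed convention. The last lines recompute the F1-score from the precision and recall reported for the WD test set (86.06% and 91.91%), which reproduces the reported 88.89%.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)


def average_precision(ranked_hits, n_ground_truth):
    """ranked_hits: 1 for a correct detection, 0 for a false one, best score first."""
    ap, tp, prev_recall = 0.0, 0, 0.0
    for k, hit in enumerate(ranked_hits, start=1):
        tp += hit
        precision_k = tp / k
        recall_k = tp / n_ground_truth
        ap += precision_k * (recall_k - prev_recall)  # P(k) * delta R(k)
        prev_recall = recall_k
    return ap


# Hypothetical counts for one image set.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)

# Hypothetical ranked detections for a set with four labeled ears.
ap = average_precision([1, 1, 0, 1], n_ground_truth=4)

# Consistency check against the WD values reported in this study.
f1_wd = f1_score(0.8606, 0.9191)
```

With a single category (wheat ear), C = 1 and the mAP equals the AP of that category.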
Coefficient of determination (R²), root mean square error (RMSE) and Bias are used as evaluation indicators to measure the counting performance of the model, which are defined as follows:  
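One common way to compute these three counting indicators is sketched below. Several conventions are assumptions rather than statements from the text: R² is computed as one minus the ratio of the residual to the total sum of squares with the manual count as reference (a study may instead report R² of a fitted regression line), and Bias is taken as the mean of predicted minus manual counts, so a positive value means over-counting. The counts are hypothetical.

```python
import math


def counting_metrics(manual, predicted):
    """Return (R2, RMSE, Bias) for paired manual and predicted ear counts."""
    n = len(manual)
    mean_manual = sum(manual) / n
    ss_res = sum((m - p) ** 2 for m, p in zip(manual, predicted))
    ss_tot = sum((m - mean_manual) ** 2 for m in manual)
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    bias = sum(p - m for m, p in zip(manual, predicted)) / n  # assumed sign
    return r2, rmse, bias


# Hypothetical counts for four images.
manual_counts = [30, 25, 40, 35]
model_counts = [31, 24, 42, 35]
r2, rmse, bias = counting_metrics(manual_counts, model_counts)
```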

Model Training
The model training and validation were performed using the training set and validation set in Table 1. The model training parameters were set as follows: learning rate = 0.001, max batches = 10,000, momentum = 0.9, decay = 0.0005, batch size = 16. Mini-batch gradient descent (MBGD) was used to optimize the model. The hardware used in the experiment was an Intel Core i7-8700 processor and an NVIDIA GeForce GTX 2080 GPU, and the experiments were implemented using the Darknet deep learning framework and Python. The CUDA 10.0 parallel computing framework and the cuDNN 7.5 deep neural network acceleration library were used in this study. Figure 6 shows the model validation accuracy and training loss values obtained in each iteration during the training process. It can be seen from Figure 6 that, as the number of iterations increased, the accuracy of the model gradually increased and the training loss gradually decreased. In the initial stage of training, the learning efficiency was high and the training loss curve converged quickly. As the number of iterations increased, the slope of the training loss curve gradually decreased. Finally, when the number of training iterations reached about 8500, the loss value stabilized and the corresponding accuracy no longer changed. The maximum value of mAP was 88.76%, indicating that the CBAM-YOLOv4 model did not suffer from problems such as over-fitting, under-fitting, or vanishing gradients. This model was the one used to detect and count wheat ears.
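For orientation, the settings listed above map naturally onto the `[net]` section of a Darknet configuration file. The sketch below renders such a section from a dictionary. The key names are those used in the public YOLOv4 configuration files; the 416 × 416 × 3 input size is taken from the model description, and anything not listed in the text (for example subdivisions, learning-rate schedule, or burn-in) is deliberately left out rather than guessed. This is an illustration, not the configuration file used in the study.

```python
# Training settings from the text, keyed by Darknet [net] option names.
HYPERPARAMS = {
    "batch": 16,
    "width": 416,
    "height": 416,
    "channels": 3,
    "momentum": 0.9,
    "decay": 0.0005,
    "learning_rate": 0.001,
    "max_batches": 10000,
}


def render_net_section(params):
    """Render a dictionary as the text of a Darknet-style [net] section."""
    lines = ["[net]"] + [f"{key}={value}" for key, value in params.items()]
    return "\n".join(lines)


cfg_text = render_net_section(HYPERPARAMS)
```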

Results of Detecting Wheat Ears in Different Data Sets
The trained CBAM-YOLOv4 model was tested using the test sets from the three data sets. There were 771 images in the test set, including 30 images in the WD test set, 48 images in the WEDD test set, and 673 images in the GWHDD test set. The smallest wheat ear considered was no less than 15 × 15 pixels; smaller wheat ears were ignored.
It can be seen from Table 2 that CBAM-YOLOv4 was effective in detecting wheat ears in the different data sets, even though the shape, color, and texture of the wheat ears in the images differed. For WD, WEDD, and GWHDD, the F1-score was 88.89%, 93.02%, and 89.25%, the mAP was 94%, 96.04%, and 93.11%, the precision was 86.06%, 89.73%, and 87.55%, and the recall was 91.91%, 96.55%, and 91.01%, respectively. Part of the test results is shown in Figure 7. Judging from the detection results, the CBAM-YOLOv4 model could detect wheat ears in all three data sets. It can be seen from Figure 7 that the detection effect was best for WEDD, and the detection accuracy for the WD and GWHDD test sets was slightly lower than that for WEDD.

Results of Counting Wheat Ears in Different Data Sets
To evaluate the proposed method for estimating the number of wheat ears, 20 images were randomly selected from each of the WD, WEDD, and GWHDD test sets, and 617, 535, and 742 wheat ears, respectively, were manually counted on the images. The CBAM-YOLOv4 model was then used to detect and count the wheat ears. The results are shown in Figure 8. The R² of the model was 0.8968 for WD, 0.9550 for WEDD, and 0.9884 for GWHDD. The corresponding RMSE values were 1.604, 1.070, and 2.097. Compared with the actual number of wheat ears, the counting result of CBAM-YOLOv4 had a certain deviation: the Bias was 2.6, −1.95, and −3.3 for WD, WEDD, and GWHDD, respectively. Figure 9 shows part of the counting results from the three data sets, which were obtained with the proposed method and compared with the ground truth. It can be seen that the proposed method was robust, and good counting results could be obtained even when the wheat ears were densely distributed. At the same time, we also found that the improved YOLOv4 method could detect wheat ears against a complex background, but some of the counts were higher than the ground truth. A possible reason is that the labeled samples were insufficient, which led to some wheat leaves being misjudged as wheat ears.

Comparison of The Effect of Wheat Ear Detection and Counting under Complex Background
The detection accuracy of the model is affected to a certain extent by the complex background of the natural environment. In particular, it is difficult to detect wheat ears when leaves cover the samples and the wheat ears overlap each other. To test the detection effect of the proposed model under a complex background, 20 images with severe occlusion were selected from WD as data set A, and 20 images with slightly occluded wheat ears were selected from WD as data set B. The degree of occlusion was used as the control variable, and the CBAM-YOLOv4 model was used to detect data sets A, B, and A + B, respectively. The detection results are shown in Table 3 and Figure 10. For the detection of lightly obscured wheat ears (data set A), the F1-score of the model reached 0.9253 and the mAP reached 95.38%. In a heavily occluded environment with dense targets (data set B), the model achieved an F1-score of 0.9019 and a mAP of 96.25%. When the two data sets were mixed into one data set (A + B), the F1-score and the mAP of the model reached 0.9124 and 93.92%. This showed that the CBAM-YOLOv4 model could effectively detect wheat ears in the natural field environment. In the density maps in Figure 10, the colors reflect the density value: the darker the color, the greater the density. It can be seen from Figure 10 that the detection and counting of wheat ears can meet the needs of conventional farmland management regardless of whether the wheat ears are severely or slightly occluded. On the one hand, high-density planting of wheat results in denser wheat ears and severe occlusion. On the other hand, the camera shooting angle also increases the number of samples occluded by wheat ears [30]. In fact, in fields with dense wheat ears, even experts must count the wheat ears multiple times to obtain reliable measurement results.

Comparison of Detection Effects Based on Different YOLO Methods
To evaluate the performance of the CBAM-YOLOv4 model proposed in this study, the wheat ear detection results of the typical convolutional neural networks YOLOv3, YOLOv4, and CBAM-YOLOv4 were compared using the test set. Some examples of the results on the GWHDD test set are shown in Figure 11. The rectangular boxes in Figure 11 are the detected wheat ears, the red circles mark false detections, and the blue circles mark missed wheat ears. Comparing the detections in the same image, Figure 11 shows that the YOLOv3 and YOLOv4 models produced both missed and false detections, whereas the CBAM-YOLOv4 model detected the wheat ears accurately, which showed that the model had good robustness and that the attention information was beneficial to the detection of wheat ear targets. In addition, it can be seen from Table 4 that the precision, recall, F1-score, and mAP of the CBAM-YOLOv4 model were the highest. The mAP of CBAM-YOLOv4 was 93.11%, which was 3.89% and 1.98% higher than that of the YOLOv3 and YOLOv4 models, respectively. The results showed that the detection effect of the CBAM-YOLOv4 model was better than that of the YOLOv3 and YOLOv4 models. Figure 12 shows the precision-recall curves of the three YOLO models on wheat ears. It can be seen from Figure 12 that when the recall of the three models was less than 0.1, the precision remained around 1.0 and the difference was not significant. However, as the recall increased, the advantage of the CBAM-YOLOv4 model gradually became obvious, and its precision was higher than that of the other two models, indicating that the spatial attention module improved the detection performance of the model.

Estimation of Yield and Aboveground Nitrogen Content Based on Wheat Ears
The number of wheat ears detected not only allows us to quickly predict wheat yield, but also to easily obtain other high-throughput parameters of wheat such as aboveground nitrogen content. The aboveground nitrogen content (ANC, kg·ha⁻¹) of wheat is determined by Equation (9), where SPNC is the sample plant nitrogen concentration (g·100 g⁻¹), m is the dry mass (kg·ha⁻¹), k is the number of samples, and #Ears is the number of wheat ears. In previous studies, the number of wheat ears needed to calculate ANC was collected manually. Now, images can be used to count the wheat ears in a given area of the field.
Moreover, the method in this study was used to quickly detect and count wheat ears. Based on the number of wheat ears, other high-throughput parameters of wheat grains, such as starch content and nitrogen content, can quickly be obtained by calculation; the details are shown in Figure 13.

Conclusions
We combined the convolutional neural network and the attention mechanism to propose a CBAM-YOLOv4 wheat ear detection and counting method. The model was trained, validated, and tested using public data sets (WEDD and GWHDD) and a data set (WD) that we collected ourselves. The key contribution was that the CBAM-YOLOv4 model had good robustness. With the channel and spatial attention mechanism integrated, the YOLOv4 model paid more attention to important wheat ear features in the image and suppressed unimportant features such as wheat leaves and wheat awns. Thus, the CBAM-YOLOv4 model improved the accuracy of detecting and counting wheat ears. Moreover, the model was verified and tested using wheat ear images from different countries and regions. The mAP of the method proposed in this study exceeded 91%, meaning that it can effectively detect and count wheat ears.