Weed Detection in Maize Fields by UAV Images Based on Crop Row Preprocessing and Improved YOLOv4

Abstract: Effective maize and weed detection plays an important role in farmland management, which helps to improve yield and save herbicide resources. Due to their convenience and high resolution, Unmanned Aerial Vehicles (UAVs) are widely used in weed detection. However, weed detection faces several challenging problems: (i) the cost of labeling is high, as each image contains many plants, making annotation time-consuming and labor-intensive; (ii) the number of maize plants is much larger than the number of weeds in the field, and this sample imbalance reduces recognition accuracy; and (iii) maize and weeds have similar colors, textures, and shapes, which are difficult to distinguish when a UAV flies at a comparatively high altitude. To solve these problems, we propose a new weed detection framework in this paper. First, to balance the samples and reduce the cost of labeling, the lightweight model YOLOv4-Tiny was exploited to detect and mask the maize rows so that only weeds needed to be labeled on the masked images. Second, an improved YOLOv4 was used as the weed detection model: we introduced the Meta-ACON activation function, added the Convolutional Block Attention Module (CBAM), and replaced Non-Maximum Suppression (NMS) with Soft Non-Maximum Suppression (Soft-NMS). Moreover, the distributions and counts of weeds were analyzed, which is useful for variable herbicide spraying. The results showed that the total number of labels for 1000 images decreased by half, from 33,572 to 17,126, and the improved YOLOv4 achieved a mean average precision (mAP) of 86.89%.


Introduction
Maize plays an important role in agriculture because of its nutritional value and high consumption [1]. Excessive weeds in maize fields affect the growth and yield of maize. Weeds not only compete with crops for living space but also lead to the spread of diseases and pests, resulting in crop failure [2][3][4]. Therefore, we need to identify weeds in maize fields and ascertain their distribution and quantity. At present, the main method of weeding is large-scale spraying with herbicides, which destroys the ecological environment, consumes resources, and affects food safety [5]. The application of UAV technology in agriculture has developed rapidly [6][7][8], as it provides unique capabilities that cannot be implemented on the ground [9], thus facilitating the detection of weeds in maize fields. UAVs can obtain complete field images with high resolution [10,11]. The advantage of UAV images is that they cover a wide collection area and can quickly provide complete field data for statistics and analysis [12]. The analysis of farmland images and the implementation of precise spraying are of great significance for controlling the growth of weeds.

The main contributions of this paper are as follows: (i) a new weed detection framework is proposed to deal with the high cost of labeling and the serious imbalance of samples; (ii) an optimized weed detection model is proposed, which has good recognition accuracy; and (iii) high-resolution farmland images collected by the UAV are analyzed to calculate the quantity, distribution, and density of maize and weeds over large areas.

Data Collection
We used a DJI Phantom 4 RTK for image collection in the experimental farm of the Yellow River Delta Agricultural Hi-Tech Industry Demonstration Zone of Dongying City, Shandong Province, China. The Phantom 4 RTK camera uses a 1-inch CMOS sensor to capture 20-megapixel imagery with a resolution of 4864 × 3648 pixels. During flight, a 3-axis gimbal on the Phantom 4 RTK provides a steady platform that keeps the attached camera pointed close to the nadir. The capture area is 40 mu (approximately 2.7 ha). The images were captured in the seedling stage (V1), in May 2021. We planned the flight route of the UAV according to the image resolution and its flight efficiency. We set the flying altitude to 25 m and the Ground Sampling Distance (GSD) to 0.685 cm/pixel. Finally, the visible photos were synthesized into a single farmland image through two-dimensional reconstruction using Pix4D [27] software.
The image collected by the UAV is shown in Figure 2. We cropped the edge of the non-farmland area to obtain an image of the whole farmland, as shown in Figure 3a. Finally, we cropped the image into tiles of a resolution suitable for the identification of maize and weeds, 416 × 416 pixels, as shown in Figure 3b.
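As a cross-check, the GSD can be derived from the camera geometry. The sketch below uses nominal Phantom 4 RTK parameters (13.2 mm sensor width, 8.8 mm focal length, 5472 px full image width); these values are assumptions for illustration, not taken from the text, and they reproduce roughly the 0.685 cm/pixel figure at 25 m.

```python
def ground_sampling_distance(altitude_m, sensor_width_mm=13.2,
                             focal_length_mm=8.8, image_width_px=5472):
    """GSD (cm/pixel) from camera geometry; the default parameter
    values are nominal Phantom 4 RTK specifications assumed here."""
    return (sensor_width_mm * altitude_m * 100.0) / (focal_length_mm * image_width_px)

print(f"GSD at 25 m: {ground_sampling_distance(25):.3f} cm/pixel")  # ~0.685
```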

YOLOv4 and YOLOv4-Tiny
Convolutional Neural Networks have been widely employed in object detection and have shown promising results [28]. YOLO is a one-stage object detection network that turns object detection into a regression problem, in contrast to standard two-stage detection techniques. By processing images with a single CNN, YOLO can directly extract the targets' categories and position coordinates. YOLO has the advantages of fast detection speed, a small model, and convenient deployment. The gains in efficiency and accuracy of YOLOv4 over YOLOv3 arise mainly from several improvements incorporated into the model: (i) the backbone extraction network is upgraded from Darknet53 to CSPDarknet53; (ii) the spatial pyramid pooling (SPP) module is introduced to significantly increase the receptive field; (iii) the Path Aggregation Network (PANet) is used as the parameter aggregation method; and (iv) Mosaic data augmentation and the Mish activation function are used to further improve accuracy.
YOLOv4-Tiny is a simplified version of YOLOv4 that greatly improves training and detection speeds by reducing the amount of network computation [29]. YOLOv4-Tiny omits the spatial pyramid pooling (SPP) module and the Path Aggregation Network (PANet), and uses the CSPDarknet53-Tiny network as the backbone to replace the CSPDarknet53 network used in YOLOv4 [30]. For feature fusion, YOLOv4-Tiny uses a feature pyramid network to extract and fuse features of different scales, which improves the accuracy of object detection. Other lightweight versions of YOLOv4 include YOLOv4-Mobilenetv1 [31], YOLOv4-Mobilenetv3 [32], and YOLOv4-Ghost [33]. In YOLOv4-Mobilenetv1 and YOLOv4-Mobilenetv3, the original CSPDarknet53 backbone is replaced by Mobilenetv1 and Mobilenetv3, respectively. In YOLOv4-Ghost, the core concept is to use cheap operations to generate redundant feature maps.

Crop Row Detection and Mask
Given the important fact that maize is grown in crop rows, we covered the maize by masking the crop rows to obtain images containing only inter-row weeds. Then, we selected an appropriate number of original images and masked images and labeled them using LabelImg [34], which solves the problem of imbalanced samples. Moreover, by reducing the annotation of maize, for which the demand for samples is low, a lot of time was saved. Therefore, the approach is worthwhile only if the training time of our crop row detection model satisfies Equation (1):
$$T_{train} + T_M + T_N < T_{M+N} \quad (1)$$

where $T_{train}$ is the time of training the crop row detection model, $T_M$ is the time of labeling the M original images, $T_N$ is the time of labeling the N masked images, and $T_{M+N}$ is the time of labeling all M + N images as original images. This meant that it was necessary to obtain the crop row dataset and train the crop row detection model quickly. Therefore, the requirements for the crop row detection model were as follows: (i) fewer labeled samples, (ii) simpler labeled content, (iii) shorter training time, and (iv) a lightweight model. The traditional method of crop row detection is semantic segmentation, which requires pixel-level labels and thus cannot meet the requirement of simpler labeled content. Therefore, an object detection method was adopted instead.
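To make the trade-off in Equation (1) concrete, the snippet below evaluates the condition for hypothetical timings; all numbers are illustrative, not measured values from this study.

```python
# Hypothetical timings (hours); illustrative only.
t_train = 0.5        # training the crop row detection model
t_m = 3.0            # labeling M original images (maize + weeds)
t_n = 4.0            # labeling N masked images (weeds only)
t_m_plus_n = 10.0    # labeling all M + N images as originals

# Equation (1): the masking pipeline pays off when the total time
# with masking is less than the time to label everything directly.
saves_time = (t_train + t_m + t_n) < t_m_plus_n
print(saves_time)  # True for these illustrative numbers
```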

Crop Row Detection Model Dataset
To meet the requirement of fewer labeled samples, only 150 images were randomly selected for labeling. We used the LabelImg tool to label the crop rows, as shown in Figure 4. Each image contained only 2–4 labels, and it took about 30 min to label the 150 images. Finally, we augmented the dataset to 750 images.

Crop Row Detection Model
Since we needed to quickly obtain the best model to save time when outputting masked images, the model training time was a key factor when choosing the model. We experimented with several lightweight models of YOLOv4, such as YOLOv4-Tiny, YOLOv4-Ghost, YOLOv4-Mobilenetv1, and YOLOv4-Mobilenetv3. Considering the model performance and training time, YOLOv4-Tiny is an ideal model for crop row detection.
Finally, according to the detection results from the crop row detection model, we located the crop row coordinates, drew the bounding boxes, and masked the bounding boxes so that we obtained images containing only inter-row weeds. An output masked image is shown in Figure 5.
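A minimal sketch of this masking step with OpenCV: given the crop-row bounding boxes predicted by YOLOv4-Tiny, each box is filled so that only inter-row weeds remain visible. Function and variable names here are illustrative, not from the original implementation.

```python
import cv2
import numpy as np

def mask_crop_rows(image, row_boxes, fill_color=(0, 0, 0)):
    """Fill every detected crop-row bounding box so that only
    inter-row weeds remain visible. `row_boxes` holds (x1, y1, x2, y2)
    pixel coordinates from the crop row detector."""
    masked = image.copy()
    for (x1, y1, x2, y2) in row_boxes:
        cv2.rectangle(masked, (int(x1), int(y1)), (int(x2), int(y2)),
                      fill_color, thickness=-1)  # thickness=-1 fills the box
    return masked

# Example: mask two detected rows in a 416 x 416 tile.
img = np.zeros((416, 416, 3), dtype=np.uint8)
out = mask_crop_rows(img, [(50, 0, 120, 416), (250, 0, 320, 416)])
```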

Weed Detection Model Dataset
A total of 1000 images were selected for annotation. We randomly selected 300 of the original images to label both maize and weeds, and 700 of the masked images to label only weeds. Labeling original and masked images at a ratio of 3:7 greatly reduced the number of maize labels with similar features and helped to achieve sample balance. In total, 7700 maize labels and 9426 weed labels were created with this method. Based on the label counts of the 300 original images, we estimated that labeling all 1000 original images would have required 25,690 maize labels and 7882 weed labels. The total number of labels for 1000 images thus decreased by half, from 33,572 to 17,126, and all of the removed labels were maize labels, for which the demand for samples is low. The proposed method not only saves a lot of labeling time but also achieves sample balance.
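The label-count bookkeeping can be verified directly from the figures above:

```python
# Labels actually created with the masking pipeline
actual = 7700 + 9426        # maize + weed labels -> 17,126

# Estimated labels if all 1000 images had been labeled as originals
estimated = 25690 + 7882    # maize + weed labels -> 33,572

print(actual, estimated, 1 - actual / estimated)  # ~0.49 reduction
```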
Before network training, data augmentation was performed on the dataset, including brightness enhancement and reduction, contrast enhancement and reduction, Gaussian noise addition, vertical flipping of 30% of the images, mirror flipping of 30% of the images, translation, rotation, and zooming. The purpose was to enrich the training set, effectively extract the image features, and avoid overfitting. After data augmentation, 3000 images yielded a total of 23,100 maize labels and 28,278 weed labels, which were used for network training and parameter adjustment.
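A sketch of such an augmentation pipeline using OpenCV and NumPy; the exact probabilities and parameter ranges below are assumptions for illustration, not the values used in this study.

```python
import cv2
import numpy as np

def augment(image, rng):
    """Randomly apply the augmentations described above; the
    probabilities and parameter ranges are illustrative."""
    out = image.copy()
    # Brightness / contrast: out = alpha * img + beta
    alpha = rng.uniform(0.7, 1.3)   # contrast factor
    beta = rng.uniform(-30, 30)     # brightness offset
    out = cv2.convertScaleAbs(out, alpha=alpha, beta=beta)
    # Gaussian noise
    noise = rng.normal(0, 8, out.shape)
    out = np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    # Vertical flip for 30% of images, mirror flip for another 30%
    if rng.random() < 0.3:
        out = cv2.flip(out, 0)      # flip around the x-axis
    if rng.random() < 0.3:
        out = cv2.flip(out, 1)      # mirror (flip around the y-axis)
    return out

rng = np.random.default_rng(0)
aug = augment(np.zeros((416, 416, 3), dtype=np.uint8), rng)
```

Note that geometric augmentations (flips, translation, rotation, zooming) must also transform the bounding-box labels accordingly; the sketch shows only the image side.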
The method of labeling the original image and masked image is shown in Figure 6a,b. Data augmentation of the original image and masked image are shown in Figure 6c,d. We randomly selected 70% of images as the training set, 10% of images as the validation set, and 20% of images as the test set.

Weed Detection Model
In this paper, we optimized the YOLOv4 model. The structure of YOLOv4 is composed of three parts: (i) backbone, (ii) neck, and (iii) YOLO head. First, a new activation function, Meta-ACON, was introduced into our model; it can adaptively determine the upper and lower bounds of the activation and dynamically control its degree of linearity and nonlinearity. Second, we added the CBAM and optimized it by changing the activation function of its channel attention module to Meta-ACON. Finally, the traditional NMS [35] function was replaced with the Soft-NMS function. The optimized model thus improves recognition accuracy and demonstrates the applicability of the Meta-ACON activation function, the optimized CBAM, and Soft-NMS in a weed detection model. In addition, we drew a distribution map of maize and weeds in the whole farmland and counted both classes to provide a basis for variable-rate spraying. The structure of the improved YOLOv4 is shown in Figure 7.
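For reference, a PyTorch sketch of the ACON-C activation with a per-channel "meta" branch that generates β, following our understanding of Ma et al.'s Meta-ACON; the layer widths and reduction ratio are assumptions, not details from this paper.

```python
import torch
import torch.nn as nn

class MetaACON(nn.Module):
    """ACON-C: (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x,
    where beta is generated per channel by a small meta network.
    The reduction ratio r is an assumption for illustration."""
    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(r, channels // r)
        self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x):
        # beta is computed from the globally average-pooled feature map.
        beta = torch.sigmoid(self.fc2(self.fc1(x.mean(dim=(2, 3), keepdim=True))))
        dp = (self.p1 - self.p2) * x
        return dp * torch.sigmoid(beta * dp) + self.p2 * x

y = MetaACON(64)(torch.randn(2, 64, 52, 52))  # output keeps the input shape
```

When β approaches zero the function degenerates toward the linear term p2·x, and for large β it approaches a switched nonlinearity, which is what "dynamically controlling the degree of linearity and nonlinearity" refers to.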

CBAM
The attention module can effectively eliminate interference factors, allowing the network to focus on areas that need more attention. We added the CBAM to the model and introduced the Meta-ACON activation function in its channel attention part, replacing ReLU, to improve the detection effect. To facilitate comparison with the original CBAM, we named the attention module with the attached Meta-ACON A_CBAM. As a lightweight convolutional block attention module, A_CBAM infers the attention map along two independent dimensions (channel and spatial) and then multiplies the attention map by the input feature map for adaptive feature refinement, improving the performance of the model through end-to-end training. The process and formulas of CBAM are shown in Appendix A(b).
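A PyTorch sketch of the A_CBAM idea: standard CBAM channel and spatial attention, with the ReLU in the channel-attention MLP swapped for Meta-ACON (reusing the MetaACON module sketched above). The reduction ratio and kernel size are the common CBAM defaults, assumed here for illustration.

```python
import torch
import torch.nn as nn

class A_CBAM(nn.Module):
    """CBAM with Meta-ACON replacing ReLU in the channel-attention MLP.
    `MetaACON` is the module defined in the earlier sketch."""
    def __init__(self, channels, ratio=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // ratio, 1, bias=False),
            MetaACON(channels // ratio),          # was nn.ReLU() in CBAM
            nn.Conv2d(channels // ratio, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Channel attention: shared MLP over global avg- and max-pooled maps.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: conv over channel-wise avg and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```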

Soft-NMS
The traditional NMS sorts all bounding boxes by their scores, selects the bounding box M with the highest score, and suppresses all other bounding boxes b_i that significantly overlap with M. The problem with NMS is that it sets the scores of the adjacent bounding boxes to zero, so an object lying within the overlap threshold can be missed. In farmland, maize and weeds are often adjacent, and the bounding box of the plant with the lower score can be missed. Soft-NMS considers both the score and the degree of overlap when performing non-maximum suppression: the detection scores of all other objects are decayed as a continuous function of their overlap with M, rather than being set directly to zero as in NMS. Therefore, no objects are eliminated in this process. The comparison between Soft-NMS and NMS is shown in Figure 8. The calculation formula of Soft-NMS is given in Appendix A(c).
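A NumPy sketch of Soft-NMS using the Gaussian decay variant; the choice of Gaussian over linear decay and the σ = 0.5 default are common settings assumed here, not necessarily those used in this paper.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: instead of zeroing the neighbours of the
    top-scoring box M, decay their scores by exp(-iou^2 / sigma)."""
    scores = scores.astype(float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        m = max(idxs, key=lambda i: scores[i])
        keep.append(m)
        idxs.remove(m)
        for i in idxs[:]:
            iou = box_iou(boxes[m], boxes[i])
            scores[i] *= np.exp(-(iou ** 2) / sigma)   # Gaussian decay
            if scores[i] < score_thresh:               # prune tiny scores
                idxs.remove(i)
    return keep
```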

Methods Evaluation Indicator
Precision and recall should both be addressed when developing a detection model, so measures such as precision, recall, F1-score, AP, and mAP were utilized to test the model's performance and evaluate the detection outcomes in this study. Precision, recall, and F1-score can be calculated with Formulas (2)–(4):
$$P_r = \frac{TP}{TP + FP} \quad (2)$$

$$R_e = \frac{TP}{TP + FN} \quad (3)$$

$$F1 = \frac{2 \times P_r \times R_e}{P_r + R_e} \quad (4)$$

where $P_r$ represents precision, $R_e$ represents recall, F1 represents the F1-score, TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. A PR curve can be drawn by taking different precision and recall values. The area under the PR curve is defined as the AP, and the mean of the AP values over all detection categories is the mAP. AP and mAP can be calculated with Formulas (5) and (6):

$$AP = \int_0^1 p(r)\,dr \quad (5)$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i \quad (6)$$

where p(r) represents the function of the PR curve, $AP_i$ represents the AP value of category i, and n is the number of categories.
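These definitions translate directly into code; a minimal sketch from TP/FP/FN counts and sampled PR points:

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return pr, re, 2 * pr * re / (pr + re)

def average_precision(recalls, precisions):
    """AP as the area under the PR curve, integrated with the
    trapezoidal rule over sampled (recall, precision) points."""
    order = np.argsort(recalls)
    return np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order])

# mAP is the mean of the per-category AP values, e.g. for the two
# classes reported later in this paper:
ap_maize, ap_weed = 0.8749, 0.8628
print((ap_maize + ap_weed) / 2)   # ~0.8689, matching the 86.89% mAP
```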

Results and Discussion
The tests were carried out on a system with an Intel Xeon Gold 5220 CPU and an NVIDIA Tesla V100 GPU, with cuDNN 7.6.5 and CUDA 10.2 as the acceleration environment. The crop row detection network and the weed detection network were trained and tested on the Ubuntu 18.04 operating system, with Python 3.7.0 as the programming language.

Crop Row Detection Model Experiment
We experimented with several lightweight models of YOLOv4, namely YOLOv4-Tiny, YOLOv4-Ghost, YOLOv4-Mobilenetv1, and YOLOv4-Mobilenetv3, and conducted nine experiments for each model, varying the batch size while keeping the other parameters the same. The results are shown in Table 1. It can be seen that each model achieved good results. YOLOv4 achieved the highest AP (93.15%) and the highest recall (90.77%), but the parameter size and training time of YOLOv4 were much larger than those of the lightweight models. YOLOv4-Tiny achieved the second-highest AP (92.97%) and the highest precision (99.28%), with the lowest parameter count and the fastest detection speed. According to Table 2, YOLOv4-Tiny had the shortest training time and could thus obtain the best model at the fastest speed. Combining the results of Tables 1 and 2, it can be concluded that YOLOv4-Tiny is the ideal model.

Weed Detection Model Ablation Experiment
To prove that the Meta-ACON activation function can effectively improve the performance of the model, we compared the evaluation indicators with those of the original YOLOv4. The comparison is shown in Table 3. The mAP of the improved model increased by 0.79% compared with the original YOLOv4. The AP of maize increased by 0.7%, reaching 86.67%, and the AP of weed increased by 0.89%, reaching 84.83%. The recall improved markedly, especially that of weed (by 4.15%). Since Meta-ACON can dynamically control the degree of linearity and nonlinearity of the activation function, the performance improved significantly.
We added the CBAM to the model and introduced the Meta-ACON activation function in the channel attention part. As the attention module is plug-and-play, we studied the effect of adding it at different positions. We conducted three ablation studies on CBAM and A_CBAM to evaluate the benefits of adding them after the three effective feature layers extracted from the backbone network and after upsampling. The comparison is shown in Table 4. It can be seen that adding the module after upsampling is better than adding it after the effective feature layers, and adding it at both positions is best. The mAP of A_CBAM was 0.31% higher than that of CBAM and 1.12% higher than that of the original YOLOv4. The AP of maize increased by 1.11%, reaching 87.08%, and the AP of weed increased by 1.14%, reaching 85.08%. These results show that the attention module focuses on important information and suppresses irrelevant details through its weight parameters, which can effectively improve the performance of the model.

Weed Detection Model Comparison Experiment
To further analyze the performance, we compared our method with several other object detection models: YOLOv3, SSD, and YOLOv4-Tiny. We used the same training, validation, and test sets to train and test these networks. Table 5 shows the comparison. It can be seen from Table 5 that the AP, recall, precision, and F1-score of the improved YOLOv4 were all higher than those of the other detection models. The mAP of our model was 1.93%, 5.13%, 6.34%, and 13.65% higher than those of the original YOLOv4, SSD, YOLOv3, and YOLOv4-Tiny, respectively. YOLOv4-Tiny performed worst, indicating that the lightweight model is not suitable for identifying complex targets such as maize and weeds in these images. The mAP values of YOLOv3 and SSD were 4.41% and 3.2% lower, respectively, than that of the original YOLOv4, indicating that selecting YOLOv4 as the baseline was correct. Thus, our improved model achieved the best results: the mAP was 86.89%, the AP of maize was 87.49%, and the AP of weed was 86.28%.
The PR curves of the different models are shown in Figure 9. Our proposed model had the best PR curve, especially for weed. The detection results of the different models are shown in Figure 10. The other models suffered from missed detections, and some even recognized maize as weed. The findings show that our proposed algorithm accurately and quickly detected maize and weeds in images collected under natural conditions.


Maize and Weed Distribution and Counts
We input the image to be predicted into the weed detection model, determined whether the object in the image was maize or weed, and drew bounding boxes for them. We counted the different bounding boxes separately to ascertain the amount of maize and weed, a method which can help estimate the yield of maize and the number of weeds. We spliced all the output images into a complete farmland image, thus obtaining the distribution of maize and weed for the whole farmland, as shown in Figure 11. The number of maize in the experimental field was 88,845, and the number of weeds was 16,976.

Figure 11. Distribution map of maize and weed.
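Counting then reduces to tallying the predicted boxes per class across all tiles. A sketch, assuming each detection is a (class_name, score, box) tuple; the representation and the score threshold are illustrative.

```python
from collections import Counter

def count_detections(all_tile_detections, score_thresh=0.5):
    """Tally maize and weed boxes over every 416 x 416 tile.
    Each detection is assumed to be a (class_name, score, box) tuple."""
    counts = Counter()
    for detections in all_tile_detections:
        for cls, score, _box in detections:
            if score >= score_thresh:
                counts[cls] += 1
    return counts

# e.g. Counter({'maize': 88845, 'weed': 16976}) for the whole field
```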

Regional Data Analysis
We divided the whole farmland into several areas, calculated the number and proportion of maize and weeds in each area, and plotted the data on the original map, as shown in Figure 12, where m represents the number of maize plants, w represents the number of weeds, and r represents the ratio of the number of weeds to the number of maize plants. A green area indicates that the ratio is less than 10%, blue between 10% and 20%, yellow between 20% and 30%, and red greater than 30%. The green areas account for 22% of the field, blue for 44%, yellow for 25%, and red for 9%. For the green areas, we consider that weeds do not affect the living space of maize, so no herbicide is sprayed there. For the blue, yellow, and red areas, the herbicide dose is increased in turn. Through variable spraying of herbicides, herbicide resources are saved and the environment is protected. Furthermore, the yield of maize is affected by many factors; several studies have shown that plant density is related to field yield [36,37]. By analyzing the maize statistics of each area, the plant density can be understood and adjusted, providing data support for experts.
Figure 12. The image is divided into small areas for data analysis.
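The per-area color coding follows directly from the weed-to-maize ratio r; a minimal sketch (function name is illustrative):

```python
def area_color(num_maize, num_weeds):
    """Map the weed-to-maize ratio r to the spray-level color
    classes used in Figure 12."""
    r = num_weeds / num_maize if num_maize else float("inf")
    if r < 0.10:
        return "green"    # no spraying needed
    elif r < 0.20:
        return "blue"     # low herbicide dose
    elif r < 0.30:
        return "yellow"   # medium dose
    return "red"          # high dose

print(area_color(500, 40))   # ratio 0.08 -> green
```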

Conclusions
In this paper, we identified weeds in maize fields from UAV images and achieved good results. A new weed detection framework was proposed. First, the lightweight model YOLOv4-Tiny was exploited to detect and mask the maize rows so that we only needed to label weeds on the masked images, a process that deals with the serious imbalance of samples and the high cost of labeling. Second, to improve recognition accuracy, the Meta-ACON activation function was used to modify the CBL module, and the A_CBAM was added to the network. Furthermore, the NMS function was replaced with the Soft-NMS function. The results showed that the proposed model had a maize AP of 87.49%, a weed AP of 86.28%, and an mAP of 86.89%, and can thus accurately identify maize and weeds. We drew a distribution map of maize and weeds, which provides data support for variable herbicide spraying and yield estimation. In future work, we will adopt newer versions, such as YOLOv5 [38], to further improve weed detection performance.


Conflicts of Interest:
The authors declare no conflict of interest.