Real-Time Detection of Seedling Maize Weeds in Sustainable Agriculture

Abstract: In recent years, automatic weed control has emerged as a promising alternative to conventional spraying for reducing the amount of herbicide applied to the field. This method helps reduce environmental pollution and supports sustainable agricultural development. Rapid and accurate detection of weeds at the maize seedling stage in natural environments is key to ensuring maize yield and to developing automatic weeding machines. Based on the lightweight YOLO v4-tiny model, a maize weed detection model that combines an attention mechanism and a spatial pyramid pooling structure was proposed. To verify the effectiveness of the proposed method, it was compared with five deep-learning algorithms: Faster R-CNN, SSD 300, YOLO v3, YOLO v3-tiny, and YOLO v4-tiny. The comparative results showed that the mAP (mean average precision) of the proposed method for detecting maize seedlings and their associated weeds was 86.69%, the detection speed was 57.33 f/s, and the model size was 34.08 MB. Furthermore, the detection performance for weeds under different weather conditions was discussed. The results indicated that the proposed method was robust to changes in weather and that it is feasible to apply it for real-time, accurate weed detection.


Introduction
With the development of animal husbandry and maize processing industry, and the application of ethanol gasoline using maize as raw material, maize has become an important food crop, fodder crop, and industrial raw material [1], occupying an increasingly important position in agricultural production. The level of maize output and its economic benefits directly affect food security and the development of agricultural production.
Weeds are one of the biggest threats to the growth and yield of such crops as maize. They compete with crops for nutrients, sunlight, space, and water [2]. Weed prevention, control and removal are still based on traditional full-drench spraying pesticides [3], but current agricultural spraying practices have been considered unsustainable. This method of evenly spraying herbicides cannot distinguish between crops and weeds [4,5], so it will result in considerable pesticide waste. Moreover, this method is costly, causes soil and water source pollution [6], and affects farmland productivity and crop growth [7,8].
To reduce environmental pollution from chemicals, the European Commission is advocating strategies such as Integrated Pest Management (IPM). One of the principal aims of IPM is that the chemicals applied be as specific as possible to the target, which benefits the sustainable development of agriculture. The current research trend is to use smart equipment for precise pesticide application or intelligent mechanical weeding to reduce pesticide waste and improve pesticide utilization. Computer vision is the core technology for the precise and rapid identification and positioning of weeds [9], and it is also the technical prerequisite for the precise control and management of weeds in the field. Therefore, intelligent equipment for precise pesticide application or intelligent mechanical weeding has broad application prospects.
Traditional target detection methods primarily start with image features obtained using machine vision and employ the differences in color, shape, or texture of crops, weeds, and even farmland backgrounds to perform image processing and manually extract features to recognize weed targets. Alchanatis et al. [10] selected two spectral channels from 100 channels of a hyperspectral sensor for soil-crop segmentation based on the acousto-optic tunable hyperspectral sensor and detection algorithm. Based on the texture features, weed detection of cotton plants was performed on the segmented image. The test results revealed that the accuracy rate could reach 86.00% under mixed conditions. Raja et al. [11] proposed a topical marker-based crop signaling technique by applying signaling compounds to lettuce seedlings prior to transplanting, and then using machine vision to recognize crop signals to distinguish lettuce from weeds. Rani et al. [12] proposed an intelligent weed detection system for sustainable agriculture. First, the features of the weeds and crops were extracted using speeded-up robust features and histogram of gradients, then logistic regression and support vector machine (SVM) algorithms were used for the classification of weeds and crops. The test results demonstrated that the accuracy of this method could reach 83.00%. These studies have effectively improved the level of weed detection and weed recognition. However, the image feature extraction based on the traditional method is cumbersome and the requirements for the extracted features are higher, which severely restricts the improvement of the automation level of weeding in maize fields. The realization of a rapid and accurate detection of weeds in maize fields in natural environments is of great significance for developing intelligent weeding equipment.
In recent years, machine learning has had a growing impact on science, health, and sustainable development, and deep-learning technology is considered to be the future development trend due to its strong learning ability, good adaptability, and high performance. It has begun to be applied in agriculture [13], such as crop classification [14], weed segmentation [15], weed detection [16,17], and other tasks [18]. Therefore, deep-learning method is the key technology and future trend to achieve an accurate and rapid response to weed detection in maize fields.
Milioto et al. [19] fed RGB images of sugar beets and near infra-red (NIR) images into an artificial neural network and segmented the plants based on their appearance, which did not require manually defining features of the classified crops, and greatly reduced manual labeling costs. Yu et al. [20] used VGGNet and DetectNet to detect weeds of specified classes, and they achieved an extremely high accuracy rate of 99% on the weeds' specific dataset. In order to achieve higher accuracy in the target area segmentation of complex natural environment, Dias et al. [21] used convolutional neural network (CNN) and SVM methods to automatically extract crop features in a complex background. This image segmentation method obtained relatively accurate crop region segmentation results. At present, weed detection method based on deep learning has a good effect in practical application, but there are still some problems. The research mainly focuses on three common problems: first, the appearance of crops and weeds is often quite different, and for early weed detection in farmland, the morphological characteristics of seedlings and weeds may be similar and cannot be distinguished well; secondly, the specific network only has a good result on specific weed species, but the network generalization performance is not good, and it is usually not applicable in complex natural environment; and finally, even in natural environment weed detection, the detection effect is also limited by the population density, that is, there are occlusion and overlap problems between overgrown weeds (either under the segmentation task or under the detection task). In the framework based on deep learning, how to improve the network to solve the above problems has become the beginning of research innovation.
Bakhshipour et al. [22] designed successive steps in a discrimination algorithm to determine the wavelet texture features of each image subdivision, which were then fed to an artificial neural network. The recognition algorithm was optimized by studying the two methods of single-machine wavelet transform and image segmentation to distinguish weeds from main crops. The results demonstrated that the recognition accuracy of this method for sugar beet and four types of weed could reach 96.00%. However, a small mobile terminal cannot complete such a large amount of calculation. Farjon et al. [23] used a two-stage deep-learning network based on the Faster R-CNN algorithm to detect apple blossoms, and the average detection accuracy of this method was 68%. Faster R-CNN comprises two parts: a classification network and a region proposal network (RPN). However, its amount of calculation is large, its detection speed is low, and real-time detection cannot be realized. Most deep learning-based methods take up a lot of memory, which hinders the deployment of these models in intelligent equipment and limits the development of sustainable agriculture. Agricultural intelligent equipment is an inevitable trend in sustainable agriculture, but detection networks designed for server-side applications often cannot achieve a high detection speed on such equipment. Therefore, how to make lightweight modifications to the network to adapt it to mobile applications has become a recent research topic.
The You Only Look Once (YOLO) [24][25][26] method unifies target classification and positioning into a regression problem. It is a one-stage object detection method, so it does not require an RPN. It only performs regression to detect the target, so it has faster detection speed. The most advanced version, which is named YOLO v4, has high detection accuracy, high speed, and good detection performance for small targets, and the newly improved lightweight model YOLO v4-tiny reduces the memory and improves the detection speed while ensuring high accuracy.
However, the YOLO v4-tiny model has not been applied to weed detection. After image data are passed into the convolutional neural network and after the convolution and downsampling operations, the feature map will gradually shrink, which makes it difficult for the model to extract the features of small objects. Due to the variable light conditions of maize fields in the natural environment, maize and weeds are densely distributed, and the phenomenon of overlapping leaves is also serious. Moreover, the size, shape, and cluster density of plants at different growth stages are also different. These problems have brought great challenges to detecting weeds in maize fields in the natural environment. In a complex and changeable environment, traditional detection methods are unsuitable for detecting weeds under different weather conditions.
In addition, it is challenging to balance the accuracy and real-time performance of deep-learning methods. Changes in light conditions in the natural environment, overlapping leaves of maize seedlings and weeds, and differences in plant appearance at different growth stages all degrade the weed detection performance of lightweight models. To address these problems, this study proposes a maize weed detection model based on the lightweight YOLO v4-tiny model that combines an attention mechanism and a spatial pyramid pooling structure. The model is trained, validated, and tested on the collected data, and its detection results are compared with those of other mainstream object detection models. The aim is to enable the improved model to detect maize seedlings and weeds in the natural environment and to promote the application of deep-learning methods to farmland target detection, thereby providing an accurate weed identification method for unmanned agricultural machinery operating in the field.

Materials and Methods
The workflow of this paper was as follows. First, images of maize and weeds were acquired with a camera and manually labeled. The images were randomly divided into a training set, a validation set, and a test set at a ratio of 3:1:1, and the training images were preprocessed. Then, the training set was input into differently improved YOLO v4-tiny network models for training, and the optimal weights of each network structure were obtained. Finally, the test set was used to test the network models with different structures. The model with the best detection effect was selected and compared with other mainstream object detection models. The content of this section is organized as follows. Section 2.1 introduces the collection of the image data, including the image collection process and the types of weeds. Section 2.2 describes the preprocessing of the image datasets. Section 2.3 introduces the process of labeling objects. Section 2.4 introduces the YOLO v4-tiny algorithm. Section 2.5 introduces the improved YOLO v4-tiny algorithm. Section 2.6 introduces the structure of the improved YOLO v4-tiny model. Section 2.7 introduces the loss function of the algorithm proposed in this paper. Section 2.8 introduces the model training in detail.

Image Data Acquisition
The experimental data were collected in an experimental maize field of the Northeast Agricultural University in Harbin, Heilongjiang Province, China, from May to July 2021. The maize variety was Dongnong275 (bred by the Northeast Agricultural University; the variety source is DN2710 × DN4206). The targets in this study were grown in natural conditions without human intervention, and the soil was also natural. The data collection was divided into 10 stages, with an interval of 3 to 5 days between stages. The camera used for data acquisition was a 12-megapixel color camera (model Sony IMX503, with a unit pixel size of 1.4 µm and a focal length of 26 mm). The camera was fixed on a horizontal telescopic pole of an uncrewed vehicle 50 cm above the ground, with the camera lens perpendicular to the ground. The image resolution was 3024 pixels (horizontal) by 4032 pixels (vertical), and all images were saved in JPG format. The images were taken on sunny, cloudy, and rainy days. The collection times were 10 am and 2 pm, with 50 pictures taken in each period. The Latin names and abbreviations of the weeds that were photographed are Abutilon theophrasti Medicus (Abutilon), Chenopodium album Linn (Chenopedium), Amaranthus blitum Linnaeus (Amaranthus), and Ipomoea purpurea Lam. seedlings (Morning glory seedlings). The data collection location, the appearance characteristics of the four weeds, the uncrewed vehicle, and the data collection process are illustrated in Figure 1.

Data Analysis and Preprocessing
In the collected images, the growth status of the weeds included weeds only, maize seedlings and weeds, densely distributed weeds, sparse weed distribution, and so on. A total of 1000 effective images were taken during this period. The growth and distribution of maize seedlings and weeds in the experimental field are presented in Figure 2. The data set was divided into a training set, a validation set, and a testing set with a ratio of 3:1:1. Rotation preprocessing was performed on the images of the training set, and the training set was expanded to enhance the richness of the experimental data set, as depicted in Figure 3. The expanded training data set has 2400 images, and the number of expanded types of data is listed in Table 1.
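The split and the expansion described above can be illustrated with a minimal sketch. The file names, the random seed, and the choice of 90°, 180°, and 270° rotations are assumptions made for illustration only; the text states only that rotation was used and that the 3:1:1 training portion grew to 2400 images.

```python
import random

def split_3_1_1(items, seed=0):
    """Shuffle items and split them into train/val/test at a 3:1:1 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 3 // 5
    n_val = len(items) // 5
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def expand_with_rotations(image):
    """Return the image plus three successive 90-degree rotations (assumed angles)."""
    out = [image]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out

names = [f"img_{i:04d}.jpg" for i in range(1000)]  # hypothetical file names
train, val, test = split_3_1_1(names)
print(len(train), len(val), len(test))  # 600 200 200

copies = expand_with_rotations([[1, 2], [3, 4]])  # toy 2 x 2 "image"
print(len(train) * len(copies))  # 2400
```

In a real pipeline the bounding-box labels would have to be rotated together with each image; that step is omitted here.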

Data Tag
Accurately labeling the targets of the detection task enables the weed detection model to better learn their features. The model targeted maize seedlings and various weeds in a test maize field, and the targets were marked with a minimum bounding box on each image. Each minimum bounding box should contain only one maize seedling or one weed and should minimize background pixels. An example of image tagging is illustrated in Figure 4, in which different objects are marked with boxes of different colors.
The test images were imported into LabelImg. Then, weeds and maize seedlings in the data were marked, as shown in Figure 4a. During the marking process, LabelImg automatically generated a text file, which included the weed type, maize or weed coordinates, and other data. As shown in Figure 4b, the numbers in the first red box represent the class, and the numbers in the second red box represent the center coordinates of the box and the width and height of the box, respectively. Objects with incomplete contours or unclear features were also marked with minimum bounding boxes to ensure the detection performance of the model.
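As an illustration of the annotation layout described above (class first, then box center, width, and height), the following sketch parses one label line and converts it to pixel corners. It assumes the values are normalized to the image size, as in LabelImg's YOLO export; the example line and the helper names are hypothetical and not taken from the paper's data.

```python
def parse_yolo_line(line):
    """Parse 'class cx cy w h' into (class_id, cx, cy, w, h)."""
    parts = line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    class_id = int(parts[0])
    cx, cy, w, h = (float(v) for v in parts[1:])
    return class_id, cx, cy, w, h

def to_pixel_corners(box, img_w, img_h):
    """Convert a normalized center-format box to pixel corner coordinates."""
    _, cx, cy, w, h = box
    x_min = (cx - w / 2) * img_w
    y_min = (cy - h / 2) * img_h
    x_max = (cx + w / 2) * img_w
    y_max = (cy + h / 2) * img_h
    return x_min, y_min, x_max, y_max

box = parse_yolo_line("1 0.5 0.5 0.25 0.5")  # made-up example line
corners = to_pixel_corners(box, img_w=3024, img_h=4032)  # resolution from Section 2.1
print(box)
print(corners)
```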

YOLO v4-Tiny Model
YOLO v4-tiny is a lightweight object detection model designed on the basis of YOLO v4. It occupies less memory and runs faster, which makes intelligent weeding equipment considerably more feasible. The backbone network of YOLO v4-tiny is CSPDarknet53-tiny, a lightweight version of the CSPDarknet53 network. The network structure of CSPDarknet53-tiny is shown in Figure 5. The CSPDarknet53-tiny network adopts the CSPBlock module of the cross-stage partial network instead of the residual block (Resblock) module of the residual network. The CSPBlock module divides the feature map into two parts and combines them through the cross-stage residual edge, allowing the gradient flow to propagate along two different network paths and increasing the relative difference in gradient information. Compared to the residual block module, the CSPBlock module has a stronger learning ability. Although this increases the amount of computation, it improves the feature extraction ability and the detection effect. The network also removes the computational bottleneck in the CSPBlock module, which accounts for considerable calculation, so that the accuracy of the YOLO v4-tiny method is improved while the amount of calculation remains unchanged or is even reduced. To further simplify the calculation process and increase the calculation speed of the network, the YOLO v4-tiny method uses the leaky rectified linear unit (Leaky ReLU) function in the CSPDarknet53-tiny network as the activation function, instead of the Mish function used in YOLO v4.
In terms of feature fusion, the YOLO v4-tiny model uses a feature pyramid network (FPN) structure to extract feature maps of different scales and fuses the two effective feature layers, which improves the target detection speed. Moreover, YOLO v4-tiny uses feature maps at two scales, 13 × 13 and 26 × 26, to predict the detection results.
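To make the difference between the two activation functions concrete, the sketch below implements both on scalars. Mish is x · tanh(softplus(x)); the negative slope of 0.1 for Leaky ReLU is an assumption (a common default) and is not stated in the text.

```python
import math

def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU: identity for positive x, small linear slope otherwise."""
    return x if x > 0 else negative_slope * x

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Mish: x * tanh(softplus(x)); needs exp, log and tanh per element."""
    return x * math.tanh(softplus(x))

for v in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={v:+.1f}  leaky_relu={leaky_relu(v):+.4f}  mish={mish(v):+.4f}")
```

Leaky ReLU needs only a comparison and a multiplication per element, whereas Mish evaluates several transcendental functions, which is consistent with the speed argument made above.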

Improvement of the Feature Extraction Network
The accuracy of weed detection is often interfered by various factors, such as light and occlusion, so it is critical to overcome these interferences. Given erroneous detection, missed detection, and low confidence of weed detection caused by partial occlusion and the influence of light on the YOLO v4-tiny model, the detection effect can be effectively improved by enhancing the receptive field and enhancing the feature extraction method.
The SE-Block [27] was designed based on the above ideas. As depicted in Figure 6, the SE-Block learns a weight for each channel. According to these weights, the important channel features are strengthened and the unimportant channel features are suppressed to improve the accuracy of the network model. The SE-Block primarily comprises the squeeze and excitation operations. For the squeeze operation, global average pooling of the input feature maps is performed, so that each two-dimensional feature channel in the spatial dimension becomes a single real number. This yields the global distribution of the channel responses and provides a global receptive field; the dimensionality of the output matches the number of feature channels of the input. The excitation operation then generates a weight for each feature channel, which measures the importance of that channel; different colors in Figure 6 represent different weight values.
Finally, the scale operation (Fscale) multiplies each channel feature by the corresponding weight generated in the previous step, recalibrating the original features in the channel dimension to strengthen effective features and weaken inefficient or ineffective ones. The extracted features are therefore more targeted, which improves the detection results. In Figure 6, X is the input feature map, and X' is the output feature map. In addition, W, H, and C denote the width, height, and channel number of the feature map, respectively.
This paper combines the Resblock and SE-Block to form a lightweight basic feature-extraction structure, Res-SE, so that the front-end basic network retains high accuracy while extracting features quickly, as displayed in Figure 7. Two Resblock structures were added at the end of the feature extraction network. Each Resblock consists of 1 × 1 and 3 × 3 convolutional layers, and the activation function is Mish. In addition, the SE-Block was added to the second Resblock to obtain a series of new feature outputs. The SE-Block acts as a bypass unit on the output features. After global average pooling, two fully connected layers, ReLU, and sigmoid normalization, the weight of each feature channel is obtained. Then, each channel of the original feature information is multiplied by its corresponding weight to complete the recalibration of channel information intensity.
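The squeeze, excitation, and scale steps can be traced on a toy feature map with the following plain-Python sketch. The weights, the single hidden unit (a reduction ratio of 2), and the input values are hypothetical and chosen only so the arithmetic is easy to follow; this is not the trained Res-SE structure of the paper.

```python
import math

def squeeze(x):
    """Global average pooling: one real number per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

def fully_connected(v, weights, biases):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, biases)]

def excitation(z, w1, b1, w2, b2):
    """FC -> ReLU -> FC -> sigmoid: one weight in (0, 1) per channel."""
    hidden = [max(0.0, h) for h in fully_connected(z, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s))
            for s in fully_connected(hidden, w2, b2)]

def scale(x, s):
    """Multiply every element of channel c by its weight s[c]."""
    return [[[v * s_c for v in row] for row in ch] for ch, s_c in zip(x, s)]

def se_block(x, w1, b1, w2, b2):
    return scale(x, excitation(squeeze(x), w1, b1, w2, b2))

# Toy input with C = 2 channels of size 2 x 2; weights are untrained placeholders.
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
w1, b1 = [[0.5, 0.5]], [0.0]          # C = 2 -> 1 hidden unit
w2, b2 = [[1.0], [-1.0]], [0.0, 0.0]  # 1 hidden unit -> C = 2
weights = excitation(squeeze(x), w1, b1, w2, b2)
y = se_block(x, w1, b1, w2, b2)
print(squeeze(x))  # [2.5, 2.0]
print(weights)
print(y)
```

The output has the same shape as the input; only the relative strength of each channel changes.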

Dense-SPP Structure
In most CNN architectures, fully connected layers follow the convolutional layers, so the network can only accept inputs of a fixed size. Thus, the input image is often cropped and zoomed to satisfy the input requirements. However, these operations cause a loss of information and distort the image, resulting in poor detection results. He et al. [28] designed SPP, a multiscale feature fusion model, to solve the problem of image distortion caused by different input image scales in the computer vision field. The SPP structure uses multilevel spatial windows to extract features of different scales from the same feature map and performs pooling operations, producing an output of the same scale for input images of any size.
Drawing on the idea of SPP, this paper proposes a dense SPP structure in which the multiscale SPP module has four branches: three consist of a max pooling layer with a different window size followed by a convolutional layer, and the fourth is a skip connection. The max pooling windows in the three branches are 13 × 13, 9 × 9, and 5 × 5, respectively, each with a stride of 1 and followed by a 1 × 1 convolutional layer. To reuse features, the feature map output by the 5 × 5 max pooling branch is concatenated with the feature map before pooling and fed into the 9 × 9 pooling window, and the feature map output through the 9 × 9 window undergoes the same operation before entering the 13 × 13 max pooling window, as shown by the arrows between the branches in Figure 8. Finally, the feature maps output by all branches are concatenated as the output of the module.
The SPP module combines the dense connection method [29], which increases the extracted multiscale local feature information and integrates it into subsequent global features to obtain richer feature representations and promote feature reuse. The number of parameters is reduced, and the detection accuracy is ultimately improved.
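One possible reading of the dense connections between the pooling branches is sketched below in plain Python. The wiring (each pooled output concatenated with the input of its own pooling layer before the next, larger window) is an interpretation of the description, not the authors' code. The 1 × 1 convolutions are omitted for brevity, and the function names are hypothetical.

```python
def max_pool_same(ch, k):
    """k x k max pooling with stride 1 and 'same' padding on one H x W channel."""
    h, w, r = len(ch), len(ch[0]), k // 2
    return [[max(ch[i][j]
                 for i in range(max(0, y - r), min(h, y + r + 1))
                 for j in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)] for y in range(h)]

def pool(fmap, k):
    """Apply the same pooling window to every channel of a feature map."""
    return [max_pool_same(ch, k) for ch in fmap]

def dense_spp(fmap):
    """fmap: list of channels. Returns the channel-wise concatenation of all branches."""
    p5 = pool(fmap, 5)
    in9 = fmap + p5             # 5 x 5 output spliced with the map before pooling
    p9 = pool(in9, 9)
    in13 = in9 + p9             # same operation before the 13 x 13 window
    p13 = pool(in13, 13)
    return fmap + p5 + p9 + p13  # skip branch plus the three pooled branches

# Toy input: one 19 x 19 channel with a single activated pixel in the centre.
fmap = [[[0.0] * 19 for _ in range(19)]]
fmap[0][9][9] = 1.0
out = dense_spp(fmap)
```

Because every pooling layer uses a stride of 1, all branches keep the 19 × 19 spatial size and can be concatenated along the channel axis; larger windows spread the central activation over a wider neighbourhood.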

Spatial Attention Module
The spatial attention module (SAM) [30], as illustrated in Figure 9, performs global average pooling and global max pooling along the channel dimension to generate two feature maps representing different information. The two maps are fused and reduced in dimensionality by a 7 × 7 convolutional layer, whose larger kernel provides a larger receptive field. Finally, the sigmoid operation generates a weight map, which is superimposed on the original input feature map to enhance the target area, as indicated in Equation (1):
$$M_s(F) = \sigma\left(f^{7 \times 7}\left([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\right)\right) \quad (1)$$
where σ represents the sigmoid operation, F is the input feature map, and $f^{7 \times 7}$ denotes the 7 × 7 convolution.
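The spatial attention computation described above can be illustrated with a small plain-Python sketch. The function name, the toy input, and the kernel values are hypothetical. Applying the weight map by element-wise multiplication is an assumption based on the common formulation of spatial attention, since the text only says the weight map is superimposed on the input.

```python
import math

def spatial_attention(x, kernel):
    """x: list of C channels (each H x W). kernel: 2 x k x k convolution weights."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Pool along the channel axis: one average map and one maximum map.
    avg = [[sum(x[n][i][j] for n in range(c)) / c for j in range(w)] for i in range(h)]
    mx = [[max(x[n][i][j] for n in range(c)) for j in range(w)] for i in range(h)]
    pooled = [avg, mx]
    k = len(kernel[0])
    r = k // 2
    weight = []
    for i in range(h):
        row = []
        for j in range(w):
            s = 0.0
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - r, j + dj - r
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding at borders
                            s += kernel[m][di][dj] * pooled[m][ii][jj]
            row.append(1.0 / (1.0 + math.exp(-s)))  # sigmoid
        weight.append(row)
    # Apply the weight map to every channel (element-wise product, an assumption).
    out = [[[x[n][i][j] * weight[i][j] for j in range(w)] for i in range(h)]
           for n in range(c)]
    return out, weight

# Hypothetical 7 x 7 kernel that only looks at the centre of the max map.
kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
kernel[1][3][3] = 1.0
x = [[[0.0, 2.0], [0.0, 0.0]],
     [[1.0, 0.0], [0.0, 0.0]]]
out, weight = spatial_attention(x, kernel)
```

Positions where some channel responds strongly receive a weight above 0.5, while positions with no response stay at 0.5, so salient locations are emphasized relative to the background.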
Figure 9. Spatial attention module structure.
In this paper, the residual block [31] is combined with the SAM to form the Res-SAM structure, as depicted in Figure 10. A Resblock structure, comprising two 1 × 1 convolutional layers with the Leaky ReLU activation function, was added after the dense SPP structure. The SAM was added to the second convolutional layer to obtain a series of new feature outputs.

YOLO v4-Weeds Network Structure
The complete network structure of the YOLO v4-weeds is presented in Figure 11. The YOLO v4-weeds network uses 608 × 608 × 3 as the input image size to improve the model's ability to handle high-resolution images. First, the backbone feature extraction network uses two 3 × 3 convolutional layers with a stride of 2 to extract global features and then performs down-sampling through the CSPBlocks in the network. The resulting output feature map is added to the more efficient output feature map extracted by the channel attention, and the combined feature is used as the output of the backbone feature extraction network. Then, two effective feature layers from the backbone are selected and fed into the feature fusion module, which consists of the dense SPP structure, the Res-SAM, and the FPN structure. In this way, a richer feature representation is obtained, and feature reuse is promoted. The feature fusion module produces two enhanced features: 38 × 38 feature maps for small target detection and 19 × 19 feature maps for large target detection. Finally, these two features are passed to the YOLO head to obtain the prediction results, which contain the labels of the different kinds of weeds and maize seedlings.
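The two output resolutions follow directly from the input size and the total down-sampling factor of each branch, as the short sketch below shows. The strides of 16 and 32 are inferred from 608/38 and 608/19. The three anchors per grid cell and the helper name `head_shape` are assumptions for illustration; the five classes correspond to the five detection targets of this paper.

```python
def head_shape(input_size, stride, num_classes, anchors_per_cell=3):
    """Shape of one YOLO head output: grid x grid x anchors * (4 box + 1 conf + classes)."""
    grid = input_size // stride
    return (grid, grid, anchors_per_cell * (5 + num_classes))

small_targets = head_shape(608, 16, num_classes=5)  # finer grid for small objects
large_targets = head_shape(608, 32, num_classes=5)  # coarser grid for large objects
```

A finer grid gives each cell a smaller area of the image, which is why the 38 × 38 map is used for small targets and the 19 × 19 map for large ones.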

Figure 11. Network structure of the YOLO v4-weeds.

Loss Function
The loss function of the YOLO v4-weeds consists of the bounding box location loss ($\mathrm{Loss}_{CIoU}$), the confidence loss ($\mathrm{Loss}_{confidence}$), and the classification loss ($\mathrm{Loss}_{class}$), as indicated in Equations (2)-(9). The YOLO v4-weeds uses the CIoU loss as the bounding box regression loss function to judge the distance between the predicted box (PB) and the ground truth (GT). CIoU addresses the situation in which PB or GT completely encloses the other, where convergence is otherwise too slow and the accuracy of the predicted box is low. The overlap area, the center point distance, and the aspect ratio of PB and GT are considered simultaneously, as noted in Equations (2) and (3):
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v \quad (2)$$
$$\mathrm{Loss}_{CIoU} = 1 - \mathrm{CIoU} \quad (3)$$
where $b$ and $b^{gt}$ represent the center points of PB and GT; $\rho(\cdot)$ is the Euclidean distance; $c$ denotes the diagonal length of the smallest box enclosing PB and GT; $\alpha$ represents a positive balance parameter; and $v$ denotes the consistency of the aspect ratios of PB and GT. The definitions of $v$ and $\alpha$ are given in Equations (4) and (5), where $w^{gt}$ and $h^{gt}$ represent the width and height of GT, and $w$ and $h$ represent those of PB:
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \quad (4)$$
$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \quad (5)$$
The total loss is the sum of the three terms:
$$\mathrm{Loss} = \mathrm{Loss}_{CIoU} + \mathrm{Loss}_{confidence} + \mathrm{Loss}_{class}$$
In Equations (5) and (6), S is the number of grids, and B is the anchor number corresponding to each grid. In Equation (8), K represents the weight, and its value is 1 if there is an object in the jth anchor of the ith grid; otherwise, its value is 0. In addition, n represents the actual and predicted classes of the jth anchor in the ith grid, respectively, and p is the probability that the object is a weed or maize.
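As a worked illustration of the bounding box term, the sketch below computes a CIoU loss for two boxes given as (center x, center y, width, height). It follows the standard CIoU definition, with the overlap, the center distance ρ, the enclosing-box diagonal c, the aspect-ratio term v, and the balance parameter α = v / ((1 − IoU) + v). The function name and box format are assumptions, and this is not the Darknet implementation.

```python
import math

def ciou_loss(pb, gt):
    """Boxes are (cx, cy, w, h). Returns 1 - CIoU for a predicted box and a ground truth."""
    def corners(b):
        return b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2

    px1, py1, px2, py2 = corners(pb)
    gx1, gy1, gx2, gy2 = corners(gt)
    # Overlap area and IoU.
    inter = (max(0.0, min(px2, gx2) - max(px1, gx1))
             * max(0.0, min(py2, gy2) - max(py1, gy1)))
    union = pb[2] * pb[3] + gt[2] * gt[3] - inter
    iou = inter / union
    # Squared distance between centers and squared diagonal of the enclosing box.
    rho2 = (pb[0] - gt[0]) ** 2 + (pb[1] - gt[1]) ** 2
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + (max(py2, gy2) - min(py1, gy1)) ** 2
    # Aspect-ratio consistency and its balance parameter.
    v = 4 / math.pi ** 2 * (math.atan(gt[2] / gt[3]) - math.atan(pb[2] / pb[3])) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))        # identical boxes
disjoint = ciou_loss((0, 0, 2, 2), (4, 0, 2, 2))    # no overlap, same shape
rotated = ciou_loss((0, 0, 4, 2), (0, 0, 2, 4))     # same center, different aspect ratio
```

Unlike a plain IoU loss, the disjoint pair still receives a distance-dependent penalty, and the pair with a shared center is penalized for its mismatched aspect ratio.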

Model Training Detail
The YOLO v4-weeds detection model used in this research was implemented by modifying the Darknet framework. For model training and testing, the software environment was Ubuntu 20.04 (Canonical, registered in the Isle of Man) and Python 3.8.5 (Python Software Foundation). Table 2 lists the training platform parameters of the improved lightweight weed detection model, YOLO v4-weeds. The model initialization parameters are presented in Table 3. Considering the server's memory constraints, the batch size was set to 16. To better analyze the training process, the number of iterations was set to 30,000. The momentum, initial learning rate, and weight decay follow the original settings of the YOLO v4-tiny model. The learning rate drops to 0.0001 after 20,000 steps and to 0.00001 after 25,000 steps. After training, the images in the test set were passed into the trained YOLO v4-weeds model to verify its performance.
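The step-wise learning-rate decay described above can be written as a small function. The base learning rate of 0.001 and the decay factor of 0.1 are assumptions chosen so that the values match the stated 0.0001 and 0.00001; the actual initial value is the one listed in Table 3. The function name is hypothetical.

```python
def learning_rate(step, base_lr=0.001, boundaries=(20000, 25000), factor=0.1):
    """Piecewise-constant schedule: multiply by `factor` at each boundary reached."""
    lr = base_lr
    for boundary in boundaries:
        if step >= boundary:
            lr *= factor
    return lr

# Learning rate at the start, after the first drop, and after the second drop.
schedule = [learning_rate(s) for s in (0, 20000, 25000)]
```

Lowering the learning rate late in training lets the model take smaller steps once the loss has largely converged.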

Evaluation Indicators and Results
For binary classification problems, the samples can be divided into four types according to the combinations of their true classes and the classes predicted by the learner: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The confusion matrix of classification results is presented in Table 4. This research uses four indicators to evaluate model performance: precision, recall, mean average precision (mAP), and detection speed. When the intersection over union (IoU) ≥ 0.5, it is a TP case. When the IoU < 0.5, it is an FP case, and when the IoU = 0, it is an FN case. The mAP is the mean of the average precision values obtained when weeds and maize are detected. A higher mAP value indicates better detection results for weeds. The calculation formulas for IoU, precision, recall, and mAP are presented in Equations (11)-(14). Taking the precision as the vertical axis and the recall as the horizontal axis, the precision-recall (P-R) curve can be obtained.
$$\mathrm{IoU} = \frac{R \cap R'}{R \cup R'} \quad (11)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (12)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (13)$$
$$\mathrm{mAP} = \frac{1}{C}\sum_{k=1}^{C} AP_k \quad (14)$$
where R denotes the detection area of the target bounding box; R' is the actual area of the target bounding box; $AP_k$ is the average precision of the kth class; and C is the number of detection types. Because there are five detection targets in this paper, C = 5.
The training loss curve is displayed in Figure 12a. In the early stage of model training, the model learns more efficiently and the training curve converges faster. The slope of the training curve gradually decreases as the training progresses. Finally, when the number of training iterations reaches about 21,000, the loss reaches a plateau. The total time for model training is 17.2 h.
The improved lightweight weed detection model was tested to verify its detection capability. The trained YOLO v4-weeds model was used to process the testing set, and the test took 3.49 s. The results are provided in Table 5. The precision of the model was 76.64%, the recall was 83.33%, the mAP was 86.69%, and the detection speed was 57.33 f/s. An example of the test result is shown in Figure 12b.
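The evaluation indicators can be computed with a few short functions, sketched here under the usual definitions (IoU as intersection over union of two boxes, precision as TP/(TP + FP), recall as TP/(TP + FN), and mAP as the mean of the per-class average precision). The function names, box format (x1, y1, x2, y2), and example numbers are illustrative and are not taken from the paper's results.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision(tp, fp):
    """Fraction of detections that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of ground-truth targets that are detected."""
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)

overlap = iou((0, 0, 2, 2), (1, 0, 3, 2))   # two 2 x 2 boxes shifted by 1
is_true_positive = overlap >= 0.5            # threshold used in this paper
```

In this toy case the overlap is one third, so the detection would not count as a true positive under the 0.5 threshold.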

Comparison of Different Target Detection Algorithms
Researchers have used detection methods based on deep learning in many target detection tasks [32][33][34][35]. This study compared five target detection algorithms, namely the Faster R-CNN [36], SSD 300 [37], YOLO v3, YOLO v3-tiny, and YOLO v4-tiny, with the proposed YOLO v4-weeds target detection algorithm to verify the effectiveness of the proposed method for weed detection in maize fields. The P-R curves of the different models obtained during the test are shown in Figure 13.
The testing set was used to evaluate the performance of the different detection algorithms. The results are provided in Table 6.
The rapid and accurate detection of weeds in maize fields helps protect the growth of crops and provides a technical reference for developing field weeding robots. This research compared five different target detection algorithms (i.e., Faster R-CNN, SSD 300, YOLO v3, YOLO v3-tiny, and YOLO v4-tiny) with the proposed method and analyzed their weed detection performance.
In terms of detection accuracy, the proposed method has higher detection accuracy than the other five detection algorithms. In terms of detection speed, although the YOLO v3-tiny and the YOLO v4-tiny are slightly faster than the proposed method, the detection speed of the proposed method can still meet real-time requirements. In terms of model size, the YOLO v4-weeds model is slightly larger than the YOLO v3-tiny model and the YOLO v4-tiny model, but is much smaller than the YOLO v3 model. In summary, by comparing the detection performance of the above five detection models with the proposed method on weeds in natural environments, the model proposed in this paper not only ensures high accuracy, but also reduces the size of the model and improves the detection speed. Therefore, the proposed model can achieve accurate and rapid detection of weeds in maize fields in natural environments and is easy to apply to embedded devices.

Comparison of Algorithms in Varying Weather
The model was further tested under actual conditions to assess its adaptability and effectiveness in natural scenarios. The images in the testing set, taken in natural scenes on sunny, cloudy, and rainy days, were divided into two parts, each forming a testing set: 437 images of sunny days were denoted as Type 1, and 563 images of cloudy or rainy days were denoted as Type 2. The YOLO v3, YOLO v3-tiny, and YOLO v4-tiny models were used for comparison, with precision, recall, and mAP as the evaluation indicators; the results are presented in Table 7. The detection results of the four models maintain high accuracy, but the proposed model's detection effect is better than that of the other three models in terms of the precision and mAP indicators. Table 7 reveals that the mAP of the proposed model is 1.3% and 0.4% higher than those of YOLO v3 in Type 1 and Type 2, respectively. Two images, one from a sunny day and one from a rainy day, were selected for comparison. The detection results and confidence levels of the four models under varying weather conditions are illustrated in Figure 14 and Table 8. As listed in Table 8, for sunny days, the proposed model can detect some occluded or partially missing objects that the other models miss, and it has a high average confidence. The proposed model also performs well on cloudy and rainy days, although slightly worse than the YOLO v3. In summary, the proposed model can accurately and rapidly detect weeds in maize fields under different weather conditions and is more suitable for application in intelligent devices in complex natural environments.

Discussion
Conditions in maize fields in the natural environment are complex, and the interference is substantial. The mAP of the proposed model is 86.69%, and the detection speed can reach 57.33 f/s. This study compared the proposed model with other crop detection models to more accurately verify the robustness and effectiveness of the algorithm. In addition, the main factors affecting the accuracy of weed detection in a natural environment were analyzed. The comparative analysis of some weed detection algorithms is shown in Table 9. The comparison covers object type, detection method, detection speed, and accuracy. The model structure, especially the depth of the feature extraction network, greatly influences weed detection results, because a deeper model has better nonlinear expression ability. Furthermore, deeper models can learn more complex transformations and can fit more complex feature inputs [38].

Analysis of Model Structure Influence on Weed Detection Results
Ying et al. [39] replaced the backbone network of the YOLO v4 with MobileNetV3-Small, combined depthwise separable convolution and inverted residual structures, and introduced a lightweight attention mechanism to detect carrot seedlings and weeds. The model accuracy on the two testing sets, one with only carrot seedling images and one with images including carrots and weeds, reaches 89.11% and 87.80%, respectively. The average accuracy is 1.77% higher than that of the proposed model, but the model size is 124.92 MB larger than that of the proposed model. The MobileNetV3-Small backbone network used by Ying et al. consists of 11 bneck structures. The bneck structure is composed of four parts: first, the channel dimension is expanded by a 1 × 1 convolution; second, a 3 × 3 depthwise separable convolution is applied; third, the result passes through the SE-Block; and finally, a 1 × 1 convolution reduces the dimension. The bneck structure reduces the computational burden of the model, enhances the attention paid to key channels, and improves the detection accuracy.
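The reduced computational burden of depthwise separable convolution can be illustrated by counting weights. The sketch below compares a standard k × k convolution with a depthwise k × k convolution followed by a 1 × 1 pointwise convolution. The layer sizes (64 input and 128 output channels) and function names are hypothetical, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

standard = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = standard / separable
```

For this hypothetical layer the separable form needs roughly eight times fewer weights, which is the kind of saving that makes such blocks attractive for lightweight detectors.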
McCool et al. [40] applied a DCNN (deep convolutional neural network) to classify crops and weeds in a carrot dataset and achieved an accuracy of 93.9%. However, the speed is 0.12 × 10⁻³ f/ms, which is not suitable for real-time application. The network used by McCool et al. is Inception-v3, a newer version of GoogLeNet. The network uses nine Inception modules with a total of 22 layers. The Inception structure consists of multiple 1 × 1 and 3 × 3 convolution layers combined with max pooling. At the end of the network, average pooling replaces the fully connected layer, and, to avoid vanishing gradients, the network adds two auxiliary softmax classifiers that help propagate the gradient.
Compared to the two feature extraction networks introduced above, the feature extraction network proposed in this paper has a simpler structure. First, the backbone feature extraction network uses two 3 × 3 convolutional layers, each with a stride of 2, to extract global features, which then pass through the CSPBlocks in the network for down-sampling. In addition, the SE-Block is used at the end of the backbone network to enhance the attention paid to key channels. Because the proposed feature extraction network is shallower, its feature extraction ability is slightly insufficient, which may explain the lower accuracy of the model. As shown in Figure 14i, for the test set, weeds with a smaller size or an incomplete shape are detected with lower confidence or missed.

Analysis of Influence of Natural Environmental Factors on Weed Detection Results
Natural environmental factors have a great influence on weed detection results. On the one hand, insufficient light will affect the image quality of the data. On the other hand, the appearance of different types of weeds is similar, and the overlap and occlusion between weeds are more serious, so it is difficult to obtain a clear and complete target image, resulting in difficulty in feature extraction.
Wu et al. [41] used the channel pruning-based YOLO v4 model to detect apple blossoms. They attributed missed detections to the mobile phone that automatically took the images blurring the background and the details in the center, making it difficult for the detection model to extract target features. In addition, the illumination also affected data collection. Lin et al. [42] proposed a decision tree-based machine learning technique based on hyperspectral images to classify maize and weeds and achieved an accuracy of 95%. However, this algorithm was not tested in areas with dense weeds, so its applicability to such areas could not be confirmed. Milioto et al. [19] fed RGB images and near-infrared (NIR) images of sugar beets into an artificial neural network and segmented the plants based on their appearance. Although the classification accuracy achieved was 95%, it took 50 ms ± 10 ms per blob in the image, which is very high compared to previous methods.
Tian et al. [43] used DenseNet combined with the YOLO v3 model to detect apples. They concluded that the main factors affecting the detection effect were partial occlusion by branches and leaves and overlaps between apples, which often occur in orchards. The missed detections in this paper are presented in Figure 14. For sunny days with sufficient light, the color and texture characteristics of the plants are obvious, and the contrast with the surrounding environment is clear. However, when the light is insufficient on cloudy and rainy days, the quality of the images captured by the digital devices is significantly reduced, resulting in unclear edges, color distortions, and a lack of texture features for the targeted objects, which hinders detection. The colors of the targeted plants are relatively similar, so targets are easily missed when overlapping or occlusion occurs. There is a gap between the test results of the proposed model and the YOLO v3 model under varying weather conditions, but the gap is small, indicating that the method can detect weeds in maize fields in complex environments. In addition, the results demonstrate that this algorithm is robust to data collected in different situations.
The visual attention mechanism is a method to simulate human attention perception. It can effectively integrate multiscale features and improve the performance of feature extraction. Therefore, other feature extraction algorithms can be developed to fuse multi-scale features to improve the situation where feature extraction is difficult due to overlapping or occlusion by weeds. Under the premise of ensuring speed, the network can be appropriately deepened to improve the ability of feature extraction.
The detection performance of the proposed model for a few object classes is constrained by class imbalance. Certain weed types have relatively few targets in the data set images; thus, the training samples are imbalanced. As a result, the features of some targets were not sufficiently extracted during model training, which can be addressed in future research. Future work will consider the detection of weeds in the later stages of maize growth and attempt to apply the model to other types of crop detection.

Conclusions
To identify various weeds in maize fields in a natural environment, this paper proposes a lightweight maize weed detection model based on the YOLO v4-tiny CNN, combining an attention mechanism and an improved SPP (dense SPP) structure. The model can be used to detect maize seedlings, Abutilon, Chenopodium, morning glory seedlings, and Amaranth. The mAP of the proposed YOLO v4-weeds model is 86.69%, the model size is 34.08 MB, and the detection speed is 57.33 f/s; its detection accuracy is higher than that of the Faster R-CNN, SSD 300, YOLO v3, YOLO v3-tiny, and YOLO v4-tiny models. This research indicates that the model can effectively detect maize seedlings and various weeds in complex scenes of maize fields and can be easily applied to embedded equipment, providing the foundation and basis for developing intelligent weeding equipment.
Funding: This study was funded by the Science and Technology Innovation 2030-"New Generation of Artificial Intelligence" major project (2021ZD0110904), and the Scholars Program of Northeast Agricultural University: Young talents (No. 20QC32).