SE-YOLOv5x: An Optimized Model Based on Transfer Learning and Visual Attention Mechanism for Identifying and Localizing Weeds and Vegetables

Abstract: Weeds in the field affect the normal growth of lettuce crops by competing with them for resources such as water and sunlight. The increasing costs of weed management and limited herbicide choices are threatening the profitability, yield, and quality of lettuce. The application of intelligent weeding robots is an alternative for controlling intra-row weeds. The prerequisite for automatic weeding is accurate differentiation and rapid localization of different plants. In this study, a squeeze-and-excitation (SE) network combined with You Only Look Once v5 (SE-YOLOv5x) is proposed for weed-crop classification and lettuce localization in the field. Compared with models including the classical support vector machine (SVM), YOLOv5x, single-shot multibox detector (SSD), and faster-RCNN, the SE-YOLOv5x exhibited the highest performance in weed and lettuce plant identification, with precision, recall, mean average precision (mAP), and F1-score values of 97.6%, 95.6%, 97.1%, and 97.3%, respectively. Based on plant morphological characteristics, the SE-YOLOv5x model detected the location of lettuce stem emerging points in the field with an accuracy of 97.14%. This study demonstrates the capability of SE-YOLOv5x for the classification of lettuce and weeds and the localization of lettuce, which provides theoretical and technical support for automated weed control.


Introduction
Lettuce is one of the most productive vegetables worldwide with high nutritional value. However, lettuce is very sensitive to competition from weeds. Weeds compete with the crop for surrounding water and sunlight to grow faster, which can reduce global agricultural production by 30% [1]. Compared with inter-row weeds, intra-row weeds are more difficult to remove [2]. With the increasing cost of manual weeding, weed management has largely relied on chemical methods. However, the long-term and widespread use of chemicals could result in numerous hazards, such as water and soil pollution and pesticide residues in vegetables. Due to the shortage of laborers for hand-weeding and the limited herbicide options, there is an urgent need to develop intelligent techniques to remove weeds in lettuce fields. A prerequisite for smart weeding is the accurate identification and localization of weeds and crops. Therefore, it is crucial to develop an effective and efficient approach for weed-crop classification and localization in the field.
The identity of a plant is mainly related to its properties such as color, shape, texture, and vein patterns. Computer vision (CV) and machine learning (ML) have been widely used for weed identification [3]. For instance, Ahmed et al. [4] proposed a weed identification method that applied support vector machine (SVM) algorithms to classify weeds with an accuracy of 97%. With the continuous development of deep learning, Ferreira et al. [5] developed software using AlexNet to identify broad-leaved grasses with an accuracy of 99.5% in soybean crop images. Ahmad et al. [6] compared different deep learning frameworks for the identification of weeds in corn and soybean fields. However, the mean average precision (mAP) was only 54.3% due to fewer images and unbalanced classes. In addition, a convolutional neural network (CNN) feature-based graph convolution network (GCN) method was proposed by Jiang et al. [7] to enhance weed and crop recognition accuracy. The proposed GCN-ResNet-101 method improved the identification accuracy of crops and weeds on limited labeled datasets, yielding a best accuracy of 97.8%.
In recent years, deep-learning-based object detection algorithms have been used to accelerate the development of precision agriculture [8][9][10][11]. You Only Look Once (YOLO) is a deep-structure learning algorithm based on machine vision, which can directly solve the problem of target detection. For example, an improved YOLOv5 algorithm was applied to identify the stem/calyx of apples in the study of [12]. The results showed that the improved YOLOv5 demonstrated the highest performance compared to other models (such as faster R-CNN, YOLOv3, SSD, and EfficientDet), with an F1-score of 0.851. Ref. [13] combined the YOLOv5 model with distance intersection over union non-maximum suppression (DIOU-NMS) to detect diseased wheat ears. The average detection accuracy of the improved YOLOv5 model was 90.67%. Based on swin transformer prediction heads (SPHs) and normalization-based attention modules (NAMs), [14] proposed an SPH-YOLOv5 model to detect small objects in public datasets, yielding a best mAP of 0.716. However, YOLOv5 has not been employed in weed identification, especially in lettuce fields.
In this paper, an optimized YOLOv5x model was built based on the squeeze-and-excitation (SE) network. The YOLOv5x is one version of the YOLOv5 models with a deeper network structure. The SE attention network was proposed by Hu et al. [15] to extract key features by reweighting each feature channel, which introduces a corresponding attention weight for each characteristic. The novelty of this study lies in the development of an integrated method for weed identification and crop localization. The specific objectives are as follows: (1) build an SE-YOLOv5x deep learning model to identify crops and weeds under complex backgrounds; (2) combine local binary pattern (LBP) with SVM for classification; (3) compare the performance of SVM and deep learning models on different datasets; (4) propose an effective method for detecting lettuce stem emerging points. To the best of our knowledge, this is the first study using the SE-YOLOv5x framework for weed-crop classification and lettuce localization.

Dataset Preparation
The lettuce and weed images were collected in Weifang, Shandong, China. The original dataset comprises 275 lettuce images and weed images of five species, including 52 Geminate Speedwell (GS), 51 Wild Oats (WO), 54 Malachium Aquaticum (MA), 87 Asiatic Plantain (AP), and 23 Sonchus Brachyotus (SB) weed images, as shown in Figure 1. Since machine learning requires sufficient data for model training, it is necessary to employ data augmentation methods to build robust models. The original images of weeds and lettuces in this study were enhanced in two ways: chroma adjustment (values in the range of 0.5 to 0.7 were randomly selected) and brightness adjustment (values in the range of 0.5 to 0.7 were randomly selected). For lettuce plants, 100 lettuce images were randomly selected for data augmentation, and the remaining 175 lettuce images were used as the test set. After data augmentation, the new dataset included 312 GS images, 306 WO images, 324 MA images, 408 AP images, 138 SB images, and 430 lettuce images. Three quarters of the plant images were selected for training while one quarter of the images were used for testing. Then, the texture features of different plants in the training set were extracted based on the local binary pattern (LBP) algorithm [16]. As shown in Figure 2, plant texture features were used to train the SVM model for class identification of each plant. In addition, the plant images in the new dataset were labeled using bounding boxes for training of different deep learning models including SSD [17], faster-RCNN [18], and YOLOv5x. The annotation of all images was performed manually using image-annotation software (LabelImg, https://github.com/tzutalin/labelImg, accessed on 30 October 2021). Image annotation consisted of two steps. In the first step, entire plants were annotated (Figure 3). The information (such as class label and position coordinates of the bounding box) for each bounding box annotation was saved in XML files. In the second step, the XML files were converted to TXT labels for training the models.
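The brightness adjustment described above can be sketched in Python. This is an illustrative example, not the authors' code: the function name is hypothetical, the image is assumed to be a NumPy array of 8-bit pixel values, and the chroma adjustment would be applied analogously in a hue-preserving color space.

```python
import random

import numpy as np

def adjust_brightness(image, low=0.5, high=0.7):
    """Scale pixel intensities by a random factor drawn from [low, high],
    mirroring the paper's randomly selected adjustment values."""
    factor = random.uniform(low, high)
    scaled = image.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)
```

Applying such a transform to each selected training image yields the augmented copies that enlarge the dataset.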

Texture Extraction
The rotation-invariant uniform local binary pattern (LBP$^{riu2}_{P,R}$) algorithm, improved by Ojala et al. [19], was used to extract the feature vectors of the plants. The main characteristics of this algorithm are invariance to monotonic grayscale transformation, illumination, and rotation. Compared with the original LBP algorithm, the improved LBP$^{riu2}_{P,R}$ algorithm can capture the crucial features. The LBP$^{riu2}_{P,R}$ can be expressed as follows:

$$LBP_{P,R}^{riu2} = \begin{cases} \sum_{p=0}^{P-1} s(g_p - g_c), & \text{if } U(LBP_{P,R}) \le 2 \\ P + 1, & \text{otherwise} \end{cases}$$

where $g_c$ expresses the gray value of the center pixel $(x_c, y_c)$, the $g_p$ are the gray values of the circularly sampled neighbors with $p \in \{0, 1, 2, \ldots, P-1\}$, $P$ is the number of pixels sampled on a circle of radius $R$, and $U(LBP_{P,R})$ counts the number of bitwise 0/1 transitions in the circular pattern. The thresholding function $s$ is defined as

$$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$$
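A minimal pure-Python sketch of the LBP$^{riu2}_{8,1}$ operator follows (illustrative only; the function name is an assumption, and practical feature extraction would use an optimized library implementation and then histogram the codes into a feature vector):

```python
import numpy as np

def lbp_riu2(gray, P=8, R=1):
    """Rotation-invariant uniform LBP for P=8, R=1 on an integer grayscale image.
    Uniform patterns (<= 2 bit transitions) map to their popcount; others to P+1."""
    h, w = gray.shape
    # The 8 neighbours at radius 1, in circular order
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gc = gray[i, j]
            # s(g_p - g_c): 1 where the neighbour is >= the centre pixel
            bits = [1 if gray[i + di, j + dj] >= gc else 0 for di, dj in offsets]
            # U: number of 0/1 transitions around the circle
            transitions = sum(bits[k] != bits[(k + 1) % P] for k in range(P))
            out[i - 1, j - 1] = sum(bits) if transitions <= 2 else P + 1
    return out
```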

Support Vector Machine (SVM)
SVM was used for classification based on the following equation:

$$y(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$$

where the parameters $\mathbf{w}$ and $b$ are the weight and bias, respectively. They were calculated from the training dataset of feature vectors $\mathbf{x}_1, \ldots, \mathbf{x}_N$ with the corresponding target values $t_1, \ldots, t_N$, where $t_i \in \{-1, 1\}$. New data $\mathbf{x}$ are classified based on the sign of $y(\mathbf{x})$. The SVM uses the margin concept to deal with the classification problem. The margin is defined as the smallest distance between the samples and the decision boundary. Maximizing the margin leads to the following optimization problem in the parameters $\mathbf{w}$ and $b$ [20]:

$$\arg\min_{\mathbf{w},\, b}\ \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{subject to} \quad t_{n}\left(\mathbf{w}^{T}\mathbf{x}_{n} + b\right) \ge 1, \quad n = 1, \ldots, N$$

To solve this constrained problem, the Lagrange function is needed:

$$L(\mathbf{w}, b, \mathbf{a}) = \frac{1}{2}\|\mathbf{w}\|^{2} - \sum_{n=1}^{N} a_{n}\left\{t_{n}\left(\mathbf{w}^{T}\mathbf{x}_{n} + b\right) - 1\right\} \qquad (6)$$

where $\mathbf{a}$ is a vector of Lagrange multipliers with elements $a_i \ge 0$, and $N$ is the number of training samples. Setting the derivatives of Equation (6) with respect to $\mathbf{w}$ and $b$ to zero gives

$$\mathbf{w} = \sum_{n=1}^{N} a_{n} t_{n} \mathbf{x}_{n}, \qquad \sum_{n=1}^{N} a_{n} t_{n} = 0$$

Then, using these conditions, Equation (6) can be rewritten in its dual form:

$$\widetilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_{n} - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_{n} a_{m} t_{n} t_{m} K(\mathbf{x}_{n}, \mathbf{x}_{m})$$

where $K$ is a kernel function, which can transform a nonlinearly separable space into a linearly separable one, and $a_i$ is the Lagrange multiplier.
The SVM was implemented using a computer (Intel® Core i7-6700HQ central processing unit (CPU) @ 2.60 GHz, and 8 GB random-access memory (RAM)) with Python 3.8.0. The performance of the SVM in plant classification was evaluated. After repeated experiments, the spatial and angular resolutions of the LBP operator were set to (P, R) = (8, 1). In this study, the linear kernel function was used to train the SVM model.
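The constrained margin-maximization above is usually handed to a library solver in its dual form; as a self-contained illustration of the same objective, a linear SVM can also be fitted by subgradient descent on the hinge loss. This is a sketch, not the authors' implementation; the function name and hyperparameters are placeholders.

```python
import numpy as np

def train_linear_svm(X, t, C=1.0, lr=0.01, epochs=200, seed=0):
    """Fit a linear SVM by stochastic subgradient descent on the hinge loss.
    X: (N, d) feature matrix; t: (N,) labels in {-1, +1}. Returns (w, b)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(N):
            margin = t[i] * (X[i] @ w + b)
            grad_w = w / (C * N)          # regularization term always applies
            if margin < 1:                # sample inside the margin: hinge active
                grad_w -= t[i] * X[i]
                b += lr * t[i]
            w -= lr * grad_w
    return w, b
```

New points are then classified by the sign of $\mathbf{w}^T\mathbf{x} + b$, exactly as in the decision rule above.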

Deep Learning

Equipment
The following hardware configuration was used for training and testing in this study: a computer with an Intel® Xeon® Platinum 8156 CPU, a 64-bit Linux operating system, and 24 GB of memory. The training speed was improved in graphics processing unit (GPU) mode (NVIDIA GeForce RTX 3090). In order to avoid the influence of the hyperparameters on the experimental results, the hyperparameters of each network were configured uniformly. After repeated experiments, the hyperparameters were determined as follows: learning rate 0.001, epoch number 150, and batch size 16.

Transfer Learning
The transfer learning method was implemented using pretrained weights (the pretrained weights were obtained by training the deep learning model on large-scale datasets). Transfer learning requires feature extraction from the pretrained weights. The output layer of the pretrained model is replaced by a fresh dense layer with an activation function. The fresh dense layer contains as many nodes as there are weed and crop classes to be classified. Because the pretrained model already exists, the time required for training a model using the transfer learning technique is shorter than creating a new model from scratch [21]. Figure 4 shows the general workflow of the transfer learning method for training a deep learning model.
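The replace-the-output-layer idea can be illustrated with a deliberately simplified NumPy sketch: a frozen feature extractor stands in for the pretrained backbone, and only a fresh softmax output layer is trained. All names here are hypothetical; the actual models in this paper are trained with a deep learning framework.

```python
import numpy as np

def transfer_train(features_fn, X, y, n_classes, lr=0.1, epochs=100):
    """Freeze a pretrained feature extractor (features_fn) and fit only a
    fresh dense softmax layer -- the core step of transfer learning."""
    Z = np.stack([features_fn(x) for x in X])   # frozen features: no gradient flows here
    W = np.zeros((Z.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                    # one-hot targets
    for _ in range(epochs):
        logits = Z @ W + b
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / len(X)                 # softmax cross-entropy gradient
        W -= lr * Z.T @ grad                    # only the new head is updated
        b -= lr * grad.sum(axis=0)
    return W, b
```

Because only the small head is optimized while the backbone stays fixed, training converges much faster than learning all weights from scratch.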


SE Network
In recent years, attention mechanism technologies have been extensively used in the deep learning field [22]. Great progress has been made in employing attention mechanisms in the domains of image segmentation and natural language processing [23]. An attention mechanism concentrates attention on interesting or useful areas to eliminate unnecessary features. Then, it can distribute various weights to every feature to improve the accuracy and efficiency of the model [24]. SE is a typical channel attention mechanism that has the benefits of a simple structure and convenient deployment. SE is primarily composed of three basic parts [25]: (1) The basic content of $F_{sq}$ (the squeeze operation) is global average pooling. In this way, global spatial information can be compressed into channel descriptors. A single two-dimensional feature channel is converted into a real number, which makes it a global feature. The output is a one-dimensional vector $Z = [z_1, z_2, \ldots, z_C]$, where the $c$-th element of $Z$ can be expressed as follows:

$$z_{c} = F_{sq}(u_{c}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_{c}(i, j)$$

where $H$ and $W$ are the height and width of the feature map of a single feature channel, respectively, and $u_c(i, j)$ is the value at each point on the feature map of channel $c$.
(2) Next, $F_{ex}$ (the excitation operation) generates different weights to assign to each channel. Then, it establishes the dependencies between the various channels. $z$ is the one-dimensional vector obtained by $F_{sq}$, and $W_1$ and $W_2$ are the fully connected layers. $F_{ex}$ can be interpreted as follows:

$$s = F_{ex}(z, W) = \sigma\left(W_{2}\,\delta\left(W_{1} z\right)\right)$$

where $\delta$ and $\sigma$ are the ReLU and sigmoid activation functions, respectively.
(3) $F_{ex}$ is applied to obtain the weights $s$ of the different channels and the output. After rescaling the transformation output $y_n$, the block output is obtained. $F_{scale}$ (the reweight operation) can be interpreted as follows:

$$\widetilde{x}_{n} = F_{scale}(y_{n}, s_{n}) = s_{n} \cdot y_{n}$$

where $s_n$ is the weight value of the $n$-th channel, $y_n$ is the two-dimensional matrix of the output of the $n$-th channel, and $\widetilde{x}_n$ is the output feature of the $n$-th channel after adding the weight. As shown in Figure 5, the SE network can obtain a global description of the input through the squeeze operation. Then, the weight of each feature channel is obtained by the excitation operation and the key features are extracted. The SE-YOLOv5x model is developed by embedding the SE network into the YOLOv5x model. The reconstructed model is shown in Figure 6.
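The three steps — squeeze, excitation, and scale — can be condensed into a short NumPy sketch. This is illustrative only: the real SE block operates on batched tensors inside the network, and `W1`/`W2` here are placeholder weight matrices for the two fully connected layers (with reduction ratio folded into their shapes).

```python
import numpy as np

def se_block(u, W1, W2):
    """Squeeze-and-excitation over a single (C, H, W) feature map.
    W1: (C/r, C) and W2: (C, C/r) play the roles of the two FC layers."""
    z = u.mean(axis=(1, 2))                     # squeeze: global average pooling -> (C,)
    relu = lambda a: np.maximum(a, 0.0)
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    s = sigmoid(W2 @ relu(W1 @ z))              # excitation: per-channel weights in (0, 1)
    return u * s[:, None, None]                 # scale: reweight each feature channel
```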

Localization of Lettuce Stem Emerging Point
The localization method was realized based on the bounding boxes generated by the deep learning models and the hue, saturation, and value (HSV) color space. Specifically, a lettuce plant anchored at a single location exhibits an overall, approximately radial symmetry. Based on this observation, the central position of the bounding box around the crop can be considered the estimated stem position in the image frame. The precision of the stem position is directly proportional to how exactly the bounding box locates the weeds or crops. Thus, identifying the tender leaves proved to be more accurate than directly identifying the whole seedling when locating the crop center. The deep learning model was used to identify the tender leaves located in the center of the lettuce, as shown in Figure 7b. Then, image processing was used to extract the center coordinates. In addition, green-leaf weeds can easily affect the extraction of the lettuce coordinates during the extraction process due to the similar color of crops and weeds. The HSV color space is a suitable choice for eliminating interference from the natural environment. Therefore, this research treated the lettuce and the bounding boxes separately in the HSV color space. The lettuce and bounding box can be segmented by transferring to the HSV color space and using the green and red channels separately. As shown in Figure 7c, the HSV color space was obtained from the following formulas:

$$V = \max(R, G, B)$$

$$S = \begin{cases} \dfrac{V - \min(R, G, B)}{V}, & V \neq 0 \\ 0, & V = 0 \end{cases}$$

$$H = \begin{cases} 60\,(G - B)/(V - \min(R, G, B)), & V = R \\ 120 + 60\,(B - R)/(V - \min(R, G, B)), & V = G \\ 240 + 60\,(R - G)/(V - \min(R, G, B)), & V = B \end{cases}$$

If H < 0, then H = H + 360. On output, 0 ≤ V ≤ 1, 0 ≤ S ≤ 1, and 0 ≤ H ≤ 360. In the formulas, R, G, and B represent the values in the red, green, and blue color channels, respectively. Then, the red and green channels are separated by a mask covering in the HSV color space.
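The per-pixel conversion above can be written directly for a single RGB triple in [0, 1] (a sketch; in practice a vectorized library routine such as OpenCV's color conversion would be used over the whole image):

```python
def rgb_to_hsv(r, g, b):
    """Convert RGB in [0, 1] to (H in [0, 360], S in [0, 1], V in [0, 1]),
    following the standard formulas given in the text."""
    v = max(r, g, b)
    c = v - min(r, g, b)                # chroma: V - min(R, G, B)
    s = 0.0 if v == 0 else c / v
    if c == 0:                          # achromatic pixel: hue is undefined, use 0
        h = 0.0
    elif v == r:
        h = 60.0 * (g - b) / c
    elif v == g:
        h = 120.0 + 60.0 * (b - r) / c
    else:
        h = 240.0 + 60.0 * (r - g) / c
    if h < 0:
        h += 360.0
    return h, s, v
```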

After the separation of the color channels, the bounding box and lettuce were extracted. In this process, some of the other boxes and impurities were extracted too. The method of filling the connected domain was used to remove the interference of weed bounding boxes and impurities, as can be seen in Figure 7(d1,d2). The connected domains were filtered as follows:

$$g(x, y) = \begin{cases} 1, & S_{min} \le area \le S_{max} \\ 0, & \text{otherwise} \end{cases} \qquad y(x, y) = \begin{cases} 1, & S_{min} \le area \le S_{max} \\ 0, & \text{otherwise} \end{cases}$$

In the formulas, $g(x, y)$ and $y(x, y)$ are the judgment functions of the bounding box and crop, respectively; $area$ represents the actual area of each connected domain in the image; and $S_{min}$ and $S_{max}$ are the minimum and maximum thresholds, respectively.
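Connected-domain filtering with area thresholds $S_{min}$ and $S_{max}$ can be sketched as follows (illustrative 4-connected labeling on a binary mask; the function name is an assumption, and production code would use an optimized library routine):

```python
from collections import deque

import numpy as np

def filter_connected_domains(mask, s_min, s_max):
    """Keep only 4-connected regions of a binary mask whose pixel area lies
    in [s_min, s_max]; smaller or larger regions (impurities) are zeroed."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    out = np.zeros_like(mask)
    for i in range(h):
        for j in range(w):
            if mask[i, j] and not seen[i, j]:
                # Breadth-first flood fill to collect one connected domain
                comp, queue = [], deque([(i, j)])
                seen[i, j] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
                if s_min <= len(comp) <= s_max:   # area threshold from the judgment function
                    for y, x in comp:
                        out[y, x] = 1
    return out
```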
The new lettuce and bounding box images were obtained after filtering out small connected areas. The images of the two channels were merged, as shown in Figure 7e. Finally, the lettuce stem center coordinates were located according to the bounding box and the morphological characteristics of lettuce, as shown in Figure 7f. The formulas are as follows:

$$X = \frac{x_{max} + x_{min}}{2}, \qquad Y = \frac{y_{max} + y_{min}}{2}$$

In the formulas, $(X, Y)$ represents the coordinates of the positioned lettuce center, and $x_{max}$, $x_{min}$, $y_{max}$, and $y_{min}$ are the corner coordinates of the rectangular bounding box surrounding the crop. This centripetal localization method based on the bounding box of the tender leaves can locate the practical central coordinate of the lettuce. It has the potential to control the opening and closing of weeding knives to realize the removal of intra-row weeds in the lettuce field. The conceptual method combining center localization and weeding knives is shown in Figure 8. The weeding knives create a safety zone during opening and closing, and the concept of this safety area was also mentioned by Perez-Ruiz et al. [26]. In addition, the lettuce stem is approximately considered to be cylindrical, so its top view is a circle. The center of the circle is the ideal center coordinate. A prediction is regarded as successful when the predicted center coordinate is located within the stem circle area.
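The center computation itself reduces to the midpoint of the bounding box (the function name is hypothetical):

```python
def stem_center(x_min, y_min, x_max, y_max):
    """Estimated stem emerging point: the centre of the tender-leaf bounding box."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
```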

Evaluation Indicators
As shown in Equations (22)-(26), detailed evaluation indicators are defined to evaluate the results of all the deep learning and SVM models used in this study. As the dataset used in this study is uneven and the number of images of each plant sample varies, the F1-score is also used to assess the image classification performance of the deep learning models. The F1-score is obtained by calculating recall, precision, true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP):

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

The mean average precision (mAP) is used to estimate the effect of the target detection model; it is the mean, over all classes, of the area under each class's precision-recall curve:

$$AP = \int_{0}^{1} p(r)\,dr, \qquad mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i}$$

In addition, the loss of the deep learning model is used to estimate the error between the prediction results of the model and the ground truths. The losses consist of three parameters: object loss (obj_loss), classification loss (cls_loss), and bounding box loss (box_loss).
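The precision, recall, and F1-score definitions can be computed directly from the confusion counts (a sketch; the function name is an assumption):

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1-score from true/false positive and
    false negative counts, per the standard definitions."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```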

SVM for Plant Classification
The SVM model was established to identify the different plants. Figure 9 shows the plant classification results based on SVM models developed with different datasets. When the SVM model was trained using the dataset containing six species of plants, the precision and F1-score of the SVM model were 39.4% and 50.3%, respectively. When the model was built using the data of three plants, the recall values of the SVM model were comparable, but the precision and F1-score of the model increased substantially, regardless of whether the data contained lettuce crops or not. As can be seen, the precision and F1-score of the model trained on the dataset of lettuce, GS, and WO were 78.6% and 78.1%, respectively. The precision and F1-score of the other model, trained on the dataset of MA, AP, and SB, were 60.0% and 66.5%, respectively. The results showed that the classification performance of the SVM worsened as the number of plant species increased, which means that the classical SVM was not suited for multiclassification tasks.


Training of Deep Learning Models
Figure 10a shows the variation curves of the training loss with epochs in weed-lettuce identification. In general, the training loss curves of each model gradually stabilized with increasing epoch values. Specifically, the SE-YOLOv5x and YOLOv5x converged faster and had lower loss values than the other four models. The corresponding loss curves of the SE-YOLOv5x and YOLOv5x were analyzed in detail, as shown in Figure 11. It was observed that the loss curves of both models gradually stabilized with increasing epochs. Specifically, the SE-YOLOv5x model converged the fastest, gradually stabilizing after 80 epochs. Figure 11 shows that each loss curve of the improved SE-YOLOv5x model decreased faster than that of the original YOLOv5x, and the loss curves of the improved SE-YOLOv5x had lower loss values. In conclusion, the improved SE-YOLOv5x model had better performance for the classification of lettuce and weeds compared with the original YOLOv5x model.

Deep Learning for Plant Classification
This research trained object detection algorithms including SE-YOLOv5x, YOLOv5x, faster-RCNN, and SSD. The backbone networks of some models were trained after being replaced by Mobilenetv2 [27] and Resnet50 [28]. The training results of these target detection algorithms are shown in Table 1. The precision, recall, mAP@0.5, and F1-score of the SE-YOLOv5x model were 0.976, 0.956, 0.971, and 0.973, respectively. The evaluation indicators of the SE-YOLOv5x model were the best of all the models. The results proved the effectiveness of the SE-YOLOv5x model in classifying weeds in the lettuce field. In addition, the weight size of the improved SE-YOLOv5x was only 105.9 MB, reduced by 39.52% compared with the original YOLOv5x weights. The recognition speed was 19.1 ms, improved by 34.14% compared with the original YOLOv5x. The SE-YOLOv5x model can enormously reduce the model size and increase the recognition speed while maintaining high recognition accuracy due to the use of the SE network. Some embedded mobile devices have limited RAM and insufficient computing power and cannot satisfy large-scale, high-intensity operations; the compact SE-YOLOv5x is therefore suitable for such devices. As shown in Table 2, the classification performance of SE-YOLOv5x in identifying individual plant species was further investigated. The classification results of the SE-YOLOv5x model show the precision, recall, and F1-score for each weed species and the lettuce crop. The highest F1-score was 99.8% for lettuce, and the lowest F1-score was 90.0% for MA. The precision values for GS, WO, MA, AP, SB, and lettuce were 100%, 98%, 89.8%, 98.7%, 99.5%, and 99.6%, respectively. For the identification of MA, the precision was slightly below 90%, but the model reached a significantly high accuracy for the rest of the plants. In summary, the SE-YOLOv5x model had a strong capability in the classification of weeds and lettuce plants.

Performance Comparison of SVM and Deep Learning Models
In this part, a comparison is presented between the classic SVM model and six deep learning models. The SVM and deep learning models were trained on the dataset comprising the six plant species. The SVM was trained on rotation-invariant uniform local binary pattern features (LBP, P = 8, R = 1, riu2) extracted from 256 × 256 images, with penalty parameter C = 1. Table 3 shows the modeling results of the SVM and deep learning methods. The mean performance of the deep learning models outperformed the SVM method; except for the recall value, the SVM performed the worst among all models. Moreover, the classification performance of the SSD models with different backbones (VGG and Mobilenetv2) was better than that of the faster-RCNN models with different backbones (VGG and Resnet50), although the recall of the SSD models was lower than that of the faster-RCNN models. Owing to the SE network, the SE-YOLOv5x model achieved the highest accuracies, with precision, recall, and F1-score of 97.6%, 95.6%, and 97.3%, respectively; its F1-score surpassed that of the SVM by 47%. The SE-YOLOv5x model showed excellent results in identifying weeds and lettuce. In conclusion, the deep learning models performed better than the classical SVM method in multi-class classification, and the SE-YOLOv5x model performed the best among all deep learning models.
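The LBP feature used for the SVM baseline can be sketched as follows. This is an illustrative reconstruction, not the paper's exact implementation: the circular P = 8, R = 1 neighbourhood is approximated by the 8-connected pixel grid, and the resulting 10-bin histogram per image would then be fed to an SVM such as sklearn.svm.SVC(C=1.0).

```python
import numpy as np

def lbp_riu2_hist(gray):
    """10-bin histogram of rotation-invariant uniform LBP codes (P = 8, R = 1),
    with the circular neighbourhood approximated by the 8-connected grid."""
    H, W = gray.shape
    center = gray[1:-1, 1:-1]
    # 8 neighbours listed in circular order around the centre pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    bits = np.stack([gray[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx] >= center
                     for dy, dx in offsets])
    ones = bits.sum(axis=0)                                  # number of 1s
    trans = (bits != np.roll(bits, 1, axis=0)).sum(axis=0)   # circular 0/1 flips
    # riu2 mapping: uniform patterns (<= 2 transitions) are labelled by their
    # number of 1s (0..8); every non-uniform pattern shares label 9
    codes = np.where(trans <= 2, ones, 9)
    hist = np.bincount(codes.ravel(), minlength=10).astype(float)
    return hist / hist.sum()
```

One such 10-dimensional vector per 256 × 256 image gives a very compact feature, which partly explains why the SVM baseline trains quickly but cannot match the deep models' accuracy.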

Determination of Lettuce Stem Emerging Point
Table 4 shows the test results of different deep learning models in lettuce stem emerging point localization. The average diameter of the lettuce stem at the 4-6 leaf stage was about 20 mm. As the tender leaves of lettuce grow from the middle of the rhizome, the area of the lettuce stem is marked with green circles in the lettuce image shown in Figure 12. Compared with the original YOLOv5x, faster-RCNN (Resnet50 and VGG), and SSD (Mobilenetv2 and VGG) models, the SE-YOLOv5x model located the lettuce stem emerging point more accurately, with an accuracy of 97.14%. When the different backbone networks (Resnet50 and VGG in faster-RCNN; Mobilenetv2 and VGG in SSD) were compared, the VGG backbone achieved higher accuracy than the others. Even though the faster-RCNN model was superior to the SSD model in weed and lettuce classification, it was inferior to the SSD model in stem emerging point localization. There were two main reasons for the decline in localization accuracy: first, the recognition accuracy of the deep learning model itself has an important influence on the localization results; in addition, the growth of the edge leaves is more irregular, which increases the difficulty of locating the stem emerging point. Therefore, the SE-YOLOv5x model was adopted for locating the stem emerging points of lettuce.
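Since the tender leaves grow from the middle of the rhizome, the stem emerging point can be estimated from the detected bounding box. The sketch below assumes an (x1, y1, x2, y2) box format, takes the box centre as the estimate, and scores a localization as correct within an assumed pixel tolerance; neither detail is stated explicitly in the text above.

```python
def stem_point(box):
    """Approximate the stem emerging point as the centre of a detected
    lettuce bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def localization_correct(pred, truth, tol_px=10.0):
    """Count a prediction as correct when it lies within tol_px pixels of the
    ground-truth stem point (Euclidean distance); 10 px is an assumed tolerance."""
    dx, dy = pred[0] - truth[0], pred[1] - truth[1]
    return (dx * dx + dy * dy) ** 0.5 <= tol_px
```

The accuracy in Table 4 would then be the fraction of test plants for which `localization_correct` returns True, which makes clear why irregular edge leaves hurt: they stretch the box and drag its centre away from the rhizome.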

Discussion
In this study, an optimized SE-YOLOv5x method was developed for weed-crop classification and localization of the lettuce stem emerging point against a complex background. The performance of different deep learning models (the improved SE-YOLOv5x, the original YOLOv5x, SSD with a VGG or Mobilenetv2 backbone, and faster-RCNN with a Resnet50 or VGG backbone) was compared for the classification of weeds and lettuce plants. Then, a unique method was proposed for localizing lettuce stem emerging points based on the bounding box location. The training results showed that the SE-YOLOv5x model was the best at weed-crop classification and plant localization in a lettuce field. In the future, more weed species should be considered. Moreover, data annotation is extraordinarily laborious and time-consuming, so automatic labeling should be realized through advanced annotation methods. In addition, equipment should be developed for real-time image acquisition under the variable natural environmental conditions (such as sunlight and wind) that occur in the crop field. In this study, only the YOLOv5x network framework was considered for object detection; other YOLOv5 variants (YOLOv5s, YOLOv5m, and YOLOv5l) should be considered in the future. Although the YOLOv5 network has four versions (YOLOv5s, v5m, v5l, and v5x), YOLOv5x achieved the best performance compared with the other versions on public datasets.
The SE-YOLOv5x model yielded an F1-score as high as 97.3% and a test time for a single image of 19.1 ms. Table 5 summarizes related research on plant identification based on CNNs in recent years. Specifically, Wang et al. [29] proposed a DeepSolanum-Net model to identify Solanum rostratum Dunal plants; the model achieved an F1-score of 0.901 with a test time of 131.88 ms. Zou et al. [30] developed a modified U-Net to segment green bristlegrass against a complex background, with an F1-score of 0.936 and a test time of 51.71 ms. Although the above studies performed well in segmenting the plants, their F1-scores were lower, and their test times longer, than those of the method proposed in the current study. Jin et al. [31] combined CenterNet and image processing to identify bok choy and Chinese white cabbage with a test time of 8.38 ms, but the F1-score was only 0.953. Garibaldi-Marquez et al. [20] designed a vision system to classify three classes of plants (Zea mays, narrow-leaf weeds, and broadleaf weeds) using the VGG16 model. The F1-score (97.7%) of VGG16 was higher than that of the SE-YOLOv5x in the current study, but its test time was much longer (194.56 ms). In addition, in the study of Veeranampalayam Sivakumar et al. [32], object detection-based faster-RCNN models were applied to low-altitude unmanned aerial vehicle (UAV) imagery for weed detection in soybean fields, yielding an F1-score of just 66.0% and a test time of 230 ms. Although Chen et al. [33] reported an accuracy (F1-score = 98.93 ± 0.34%) for the recognition of 15 weed classes in cotton fields similar to that of our study (F1-score = 97.3%), their test time (338.5 ± 0.1 ms) was very long. As shown in Table 5, the test times of the methods proposed by Wang et al. [34] and Jin et al. [35] are shorter than that of the SE-YOLOv5x, but their F1-scores are lower. The results of this research show that the improved SE-YOLOv5x model has great potential for weed-crop classification and localization. However, this study is initial work; further research should be conducted to develop a ground-based automated four-wheel-drive robot equipped with knives to remove the weeds. The ideal vehicle should be lightweight, low-cost, and robust enough to conduct automatic weed removal and deal with various abnormal situations [36][37][38][39][40].

Conclusions
In this study, the SE-YOLOv5x model was used for weed-crop classification and lettuce localization. The model was optimized with an attention mechanism and transfer learning. Compared with the classical SVM method and other deep learning models (such as SSD and faster-RCNN) with different backbones (VGG, Resnet50, and Mobilenetv2), the optimized SE-YOLOv5x model exhibited the highest precision, recall, mAP@0.5, and F1-score in weed-crop classification, with values of 97.6%, 95.6%, 97.1%, and 97.3%, respectively. The accuracy of localization of lettuce stem emerging points based on the SE-YOLOv5x model was 97.14%. The knowledge generated by this research will greatly facilitate the efficient identification and removal of weeds in fields.

Figure 1 .
Figure 1. Examples of lettuce and weed images. (a) An example of lettuce, (b) an example of Sonchus Brachyotus (SB), (c) an example of Asiatic Plantain (AP), (d) an example of Malachium Aquaticum (MA), (e) an example of Wild Oats (WO), and (f) an example of Geminate Speedwell (GS).

Figure 2 .
Figure 2. General scheme of the classification method based on support vector machine (SVM).

Figure 3 .
Figure 3. Bounding box annotation of an AP.

Figure 4 .
Figure 4. The process of transfer learning.

Figure 6 .
Figure 6. The structure of the SE-YOLOv5x model.

Figure 7 .
Figure 7. The visualization process of the center positioning algorithm: (a) the original image captured; (b) the detection result of the deep learning model; (c) the HSV color space image (the detection result converted from RGB to the HSV channels by image processing); (d1,d2) the separation results of the crop and the bounding box in different channels; (e) the result of merging the color channels; (f) the final localization result.
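The pipeline in Figure 7 (RGB to HSV conversion, channel separation, and merging to isolate the plant) can be approximated with a short sketch. This is an illustrative reconstruction, not the paper's exact procedure: the hue/saturation thresholds are assumed values, and the centroid of the green mask stands in for the final localization step.

```python
import numpy as np

def rgb_to_hsv(img):
    """Vectorised RGB -> HSV for a float image in [0, 1]; returns (h, s, v)
    with h in [0, 1)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    mx, mn = img.max(axis=-1), img.min(axis=-1)
    d = np.where(mx > mn, mx - mn, 1.0)            # avoid division by zero
    h = np.select([mx == r, mx == g],
                  [((g - b) / d) % 6.0, (b - r) / d + 2.0],
                  default=(r - g) / d + 4.0) / 6.0
    h = np.where(mx == mn, 0.0, h)                 # hue undefined for greys
    s = np.where(mx > 0, (mx - mn) / np.where(mx > 0, mx, 1.0), 0.0)
    return h, s, mx

def green_centroid(img):
    """Centroid (x, y) of plant-green pixels in an RGB crop; the hue and
    saturation thresholds here are assumed, not the paper's calibrated values."""
    h, s, v = rgb_to_hsv(img)
    mask = (h > 0.17) & (h < 0.45) & (s > 0.2) & (v > 0.1)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())
```

In practice an OpenCV-style conversion would replace `rgb_to_hsv`, but the channel-thresholding idea is the same.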

Figure 8 .
Figure 8. The working process of the weeding knives: on the right side of the figure, the weeding knives are in the closed position, close to each other, where they can kill all weeds in the intra-row area. In the center of the figure, the weeding knives are in the open position in order to bypass the lettuce plant.

Figure 9 .
Figure 9. Comparison of SVM modeling results using different datasets.

Figure 10a shows the variation of training loss with epochs in weed-lettuce identification. In general, the training loss curves of all models gradually stabilized as the epoch value increased. Specifically, the SE-YOLOv5x and YOLOv5x converged faster and had lower loss values than the other four models. In Figure 10, the SE-YOLOv5x and YOLOv5x gradually stabilized, and their training loss values were very close, after 20 epochs. The training loss curves of faster-RCNN with two different backbones (VGG and Resnet50) were also very close, gradually stabilizing around the 40th epoch; the training loss of faster-RCNN with the VGG backbone was slightly worse than that of faster-RCNN (Resnet50). The SSD models with different backbones (VGG and Mobilenetv2) stabilized around the 70th epoch, and SSD (VGG) had the highest training loss value after convergence. The variation of validation loss with epochs for the six models in identifying weeds and lettuce is shown in Figure 10b. Similar to the training loss curves, the validation loss curves decreased rapidly in the early training stage (the first 30 epochs) and then converged slowly at the end of training. As shown in Figure 10b, the SSD (VGG) model exhibited the highest loss value and converged slowly after 60 epochs, while SSD (Mobilenetv2) had the fastest convergence rate and stabilized around the 50th epoch. Compared with the SSD models, the faster-RCNN (Resnet50) model had a lower initial loss and eventually achieved a lower loss value.

Figure 10 .
Figure 10. (a) Training loss curves and (b) validation loss curves against the number of epochs in weed-lettuce classification.

Figure 11 .
Figure 11. (a) The training loss curves and (b) the validation loss curves of SE-YOLOv5x and YOLOv5x.

Figure 12 .
Figure 12. (a) The SE-YOLOv5x identification results of the tender leaves; (b) localization results.

Table 1 .
Lettuce and weeds classification results of different deep learning models.

Table 2 .
Classification results for the different plant species based on the SE-YOLOv5x model (PyTorch implementation).

Table 3 .
Training results of the SVM and different deep learning models for crop/weed identification.

Table 4 .
Test results of different deep learning models in lettuce stem emerging point localization.

Table 5 .
A summary of plant identification based on different CNN models.