An Improved U-Net Model Based on Multi-Scale Input and Attention Mechanism: Application for Recognition of Chinese Cabbage and Weed

Abstract: The accurate spraying of herbicides and intelligent mechanical weeding operations are the main ways to reduce the use of chemical pesticides in fields and achieve sustainable agricultural development, and an important prerequisite for achieving these is to identify field crops and weeds accurately and quickly. To this end, a semantic segmentation model based on an improved U-Net is proposed in this paper to address the issue of efficient and accurate identification of vegetable crops and weeds. First, a simplified Visual Geometry Group 16 (VGG16) network is used as the coding network of the improved model; the input images are then successively down-sampled using average pooling layers to create feature maps of various sizes, and these feature maps are laterally integrated into the coding network of the improved model. Then, the number of convolutional layers of the decoding network is cut and efficient channel attention (ECA) is introduced before the feature fusion of the decoding network, so that the feature maps from the skip connection in the encoding network and the up-sampled feature maps in the decoding network pass through the ECA module together before feature fusion. Finally, the study uses the obtained Chinese cabbage and weed images as a dataset to compare the improved model with the original U-Net model and the commonly used semantic segmentation models PSPNet and DeepLab V3+. The results show that the mean intersection over union and mean pixel accuracy of the improved model increased in comparison to the original U-Net model by 1.41 and 0.72 percentage points, respectively, to 88.96% and 93.05%, and the processing time of a single image increased by 9.36 percentage points to 64.85 ms. In addition, the improved model in this paper segments weeds that are close to and overlap with crops more accurately than the other three comparison models, which is a necessary condition for accurate spraying and accurate weeding. As a result, the improved model in this paper can offer strong technical support for the development of intelligent spraying robots and intelligent weeding robots.


Introduction
Weeds growing in fields not only raise the risk of agricultural diseases [1] but also compete with crops for sunlight, water, fertilizers, and other nutrients, which negatively impacts crop growth and yield [2,3]. As a result, timely and effective weed removal has historically been a key area of study. Currently, chemical and mechanical weeding are the two main methods used to manage weeds. Chemical weeding relies heavily on spraying herbicides evenly across the field regardless of the presence of weeds, which not only results in the excessive use of chemical pesticides and brings a series of environmental [...] beet and weeds using FCN with sequence information; Ma et al. [36] proposed a SegNet semantic segmentation model based on FCN and achieved high classification accuracy in the segmentation of rice seedlings and weeds. Kamath et al. [37] studied semantic segmentation models, such as PSPNet and SegNet, for the recognition of rice crops and weeds, and all obtained good results with over 90% accuracy. U-Net [38], a classical variant of the first semantic segmentation model FCN, is named after the "U" shape of its overall structure and was originally proposed for medical image segmentation. Compared with FCN, U-Net uses dimensional splicing (channel-wise concatenation) for feature fusion in the skip connection part, which retains more feature information and gives higher segmentation accuracy, and U-Net outperforms other coding-decoding structure networks on both small-target segmentation tasks and small-sample datasets. The VGG network uses small convolutional kernels repeatedly to deepen the network. Despite having a straightforward structure, it performs exceptionally well at image recognition. VGG16 is a typical structure in the VGG network and is frequently used as a feature extraction network for U-Net since it is well-suited for classification and localization tasks. Yu et al. [39] investigated the potential of the U-Net model to segment maize tassels, and the results showed that the segmentation accuracy of the U-Net model with VGG16 as the feature extraction network was better at all tasseling stages than that of the U-Net model with MobileNet; Sugirtha et al. [40] also confirmed that U-Net with the VGG16 encoder performs better than with the ResNet-50 encoder when segmenting urban streets. In order to accomplish the reliable detection of navigation lines in different growth periods of potato, Yang et al. [41] presented a fitting approach of feature midpoint modification and replaced the original U-Net's feature extraction structure with VGG16; Zou et al. [42] proposed an image-enhancement method based on the random synthesis of "foreground" and "background", and reduced the number of convolutional layers in the U-Net model network, achieving the semantic segmentation of field weed images. Qian et al. [43] also used VGG16 to replace the encoder in the original U-Net network and added a repeated criss-cross attention to the U-Net network's skip connection; the experiments showed that the segmentation accuracy indexes of the improved U-Net network were higher than those of other comparative algorithms. In order to achieve the online quality detection of machine-harvested soybean, Jin et al. [44] employed U-Net as the basic network structure, combined with the VGG16 network, and added the convolutional block attention module (CBAM) after the feature maps were extracted in the encoder. Zou et al. [45] pre-trained a decoding network using image segmentation tasks on similar datasets and effectively segmented field wheat crops and weeds based on an improved U-Net model.
Although the aforementioned studies produced promising findings, most of them increased the depth and width of the network to improve detection accuracy without considering the number of model parameters, model size, and recognition speed, which are crucial for reducing resource consumption and achieving real-time recognition in constrained hardware environments [46]. To remedy these deficiencies, this paper proposes a semantic segmentation model based on the U-Net network. The main contributions of this paper are the following: (1) A dataset of Chinese cabbage crops and weeds at the seedling stage was created; (2) To accomplish the effective, precise, and quick detection of Chinese cabbage crops and weeds, the U-Net model was enhanced by the lateral integration of multi-scale feature maps and the addition of efficient channel attention (ECA); (3) The revised U-Net model put forth in this study can operate in a lower hardware configuration than the original U-Net, which reduces memory costs and conserves resources. Additionally, the improved model processes images faster than the original U-Net, better meeting the demands of smart agriculture for the real-time detection of crops and weeds; (4) The proposed model segments weeds near and overlapping with crops more precisely, which can offer a strong technical foundation for the growth of precision agriculture. Overall, this study proposes a semantic segmentation model that can accurately identify weeds and Chinese cabbage crops, which can offer technical assistance for attaining sustainable agricultural development. The remainder of the article is organized as follows. Section 2 describes the dataset used for model training and explains in detail the individual strategies used to enhance the U-Net model. The results of this study are presented and discussed in Section 3. Section 4 summarizes the research results of this paper, points out the limitations of this study, and outlines future research directions.

Image Acquisition
The images needed for the study were taken between 16 August and 18 August 2022 in the Zhanlin Green Agricultural Picking Garden, Changchun City, Jilin Province, China (125°12′33″ E, 43°59′27″ N). The image acquisition site is located in the hinterland of the Northeast China Plain, which has a temperate monsoon climate, where the Chinese cabbage was planted in seedbeds, and transplanted at 4-6 leaves with plant spacing of 40-45 cm and row spacing of 55-60 cm. When the images were taken, the Chinese cabbage was in the seedling stage, 7-10 days after transplanting. The acquisition equipment is shown in Figure 1, and the RGB industrial camera (the camera is produced by Sichuan Weixin Vision Technology Co., Ltd., China, and the specific specifications of the camera are shown in Table 1) was mounted vertically on the mobile trolley with a height of 65 cm above the ground and an imaging area of 65 × 110 cm; the body of the mobile trolley and tires were excluded from the imaging area. The field image was continuously collected when the mobile trolley was moving, and a total of 345 pictures were gathered.

Image Annotation and Data Enhancement
As seen in Figure 2, the collected images were manually annotated using the image annotation program Labelme (version 3.16.7, run under Anaconda) to create label files for the regions of the images that represent the cabbage crops and weeds. As illustrated in Figure 3, four data enhancement techniques are used to enlarge the original image dataset, improving its variety and preventing the model from overfitting owing to a lack of data: (a) Gaussian noise: the image is augmented with random Gaussian noise with a mean of 0 and a variance of 0.05; (b) Random rotation: the image is rotated at random to the left or right, with a maximum angle of 30 degrees in each direction; (c) Random cropping: the cropped image area is 0.7 times the original image, and the cropped image is enlarged to the same size as the original image; (d) Random flip: the image is flipped horizontally, vertically, or diagonally, with a probability of 1/3 for each. Each enhancement method is activated with a probability of 0.5. The original dataset was finally expanded to four times its original size, with 1104 images as the training set and 276 images as the test set.
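The four enhancement operations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several points are assumptions made only for illustration: pixel values are scaled to [0, 1], "0.7 times the original image" is read as an area ratio (so each side is scaled by the square root of 0.7), a "diagonal" flip is read as flipping both axes, resampling is nearest-neighbour, and all function names are hypothetical. In a segmentation setting the geometric operations would also have to be applied identically to the label mask, which this sketch omits.

```python
import numpy as np

def add_gaussian_noise(img, rng, var=0.05):
    # Zero-mean Gaussian noise with the stated variance; clip back to [0, 1].
    noisy = img + rng.normal(0.0, np.sqrt(var), size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def random_rotate(img, rng, max_deg=30.0):
    # Rotate about the image centre by an angle in [-max_deg, max_deg],
    # using inverse mapping with nearest-neighbour sampling and zero fill.
    angle = np.deg2rad(rng.uniform(-max_deg, max_deg))
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.cos(angle) * (xs - cx) + np.sin(angle) * (ys - cy) + cx
    src_y = -np.sin(angle) * (xs - cx) + np.cos(angle) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def random_crop_resize(img, rng, area_ratio=0.7):
    # Crop a window covering `area_ratio` of the image (assumed reading),
    # then enlarge it back to the original size by index repetition.
    h, w = img.shape[:2]
    side = np.sqrt(area_ratio)
    ch, cw = int(round(h * side)), int(round(w * side))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[rows][:, cols]

def random_flip(img, rng):
    # Horizontal, vertical, or both-axis flip, each with probability 1/3.
    mode = rng.integers(0, 3)
    if mode == 0:
        return img[:, ::-1]
    if mode == 1:
        return img[::-1]
    return img[::-1, ::-1]

def augment(img, rng, p=0.5):
    # Each operation is activated independently with probability p.
    out = img
    for op in (add_gaussian_noise, random_rotate, random_crop_resize, random_flip):
        if rng.random() < p:
            out = op(out, rng)
    return out
```

Calling `augment(image, np.random.default_rng(seed))` returns an array with the same shape as `image`; generating three augmented copies per original image would yield the fourfold dataset size described above.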

Multi-Scale Feature Map Input
To help the model retain more detail, this paper uses average pooling to successively down-sample the input images, generating feature maps of various sizes that are fed laterally into the U-Net network. More specifically, the input RGB three-channel images are repeatedly pooled using a pooling kernel of size f = 2 and stride s = 2; the pooling principle is shown in Figure 4a. Each pooling halves the feature map's height and width, while the number of channels remains the same. In this study, the feature map of the second layer is 256 × 256, average pooling is applied three times, and the final feature map is 64 × 64, as shown in Figure 4b; together these maps constitute the multi-scale input of this study.
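The pooling pyramid described here can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' TensorFlow code; the function names are hypothetical, and the input height and width are assumed to be even at every level.

```python
import numpy as np

def avg_pool_2x2(x):
    # Average pooling with kernel f = 2 and stride s = 2 on an (H, W, C) array.
    # Height and width are halved; the channel count is unchanged.
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def multi_scale_inputs(image, levels=3):
    # Repeatedly pool the input image and collect each intermediate map.
    maps = []
    current = image
    for _ in range(levels):
        current = avg_pool_2x2(current)
        maps.append(current)
    return maps

if __name__ == "__main__":
    image = np.zeros((512, 512, 3))
    print([m.shape for m in multi_scale_inputs(image)])
```

For a 512 × 512 × 3 input, the three maps have shapes 256 × 256 × 3, 128 × 128 × 3, and 64 × 64 × 3, matching the sizes given in the text.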

Attention Mechanism
This study introduces the efficient channel attention (ECA) [47] mechanism before the feature fusion of the U-Net network; its structure is illustrated in Figure 5. This mechanism lets the model concentrate on extracting target features. For an input feature map $\chi \in \mathbb{R}^{W \times H \times C}$, the ECA module first aggregates the spatial information of each channel through global average pooling (GAP) to obtain a global descriptor of size 1 × 1 × C, and then uses a one-dimensional convolution with a kernel size of k to determine the weight of each channel. Finally, the resulting channel weights are multiplied with the corresponding elements of the input feature map to obtain the output feature map, which also lies in $\mathbb{R}^{W \times H \times C}$. Because ECA uses one-dimensional convolutional cross-channel interaction instead of fully connected layers, it reduces the computational effort and complexity of the model. Furthermore, the width, height, and number of channels of the feature map are left unchanged after the ECA module.
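The data flow through an ECA module can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the kernel weights would be learned during training but are passed in here as a plain array, zero padding is assumed at the channel boundaries, and the sigmoid gate follows the ECA design in [47] rather than anything stated explicitly above. The function name `eca` is hypothetical.

```python
import numpy as np

def eca(x, kernel):
    # x: feature map of shape (H, W, C); kernel: 1-D weights of odd length k.
    k = kernel.shape[0]
    # Global average pooling: one scalar per channel, shape (C,).
    pooled = x.mean(axis=(0, 1))
    # 1-D convolution across neighbouring channels (zero-padded, same length).
    padded = np.pad(pooled, k // 2)
    mixed = np.array([np.dot(padded[i:i + k], kernel)
                      for i in range(pooled.shape[0])])
    # Sigmoid gate turns the mixed descriptor into per-channel weights in (0, 1).
    weights = 1.0 / (1.0 + np.exp(-mixed))
    # Rescale every channel; the output keeps the (H, W, C) shape of the input.
    return x * weights
```

The only parameters are the k kernel weights, which is why the module adds very little to the model's parameter count compared with attention blocks built on fully connected layers.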

Overall Structure of the Model
This paper streamlines the 16-layer VGG16 network before introducing it into the model, to decrease the number of network parameters and increase the efficiency of network operation. First, the three fully connected layers of VGG16, which account for a significant portion of the network's parameters, are eliminated, and the number of convolutional layers is then reduced. As shown in Figure 6, the VGG16 network is finally reduced to six layers, with the final convolutional layer's channel count increased from 512 to 1024, and it is then incorporated into the coding network of the improved model. At the same time, the input feature maps in this study are down-sampled using both average pooling and maximum pooling: the three feature maps of various sizes generated by the 2 × 2 average pooling layer are input laterally and fused, by dimensional splicing (channel concatenation), with the feature maps produced by the 2 × 2 maximum pooling layer. The coding network of the improved model is as follows. The input RGB image has three channels and a size of 512 × 512. First, two 3 × 3 convolution layers adjust the number of channels to 64 and extract the valid information the image contains. Next, a 2 × 2 maximum pooling layer changes the size to 256 × 256, and the result is fused with the same-sized feature map from the lateral direction. After the fusion, the size remains unchanged and the number of channels increases to 67; the number of channels is then adjusted to 128 by 3 × 3 convolutional layers, the size is changed to 128 × 128 by a 2 × 2 maximum pooling layer, and fusion continues with the same-sized lateral feature map, and so forth. Finally, the feature map at the end of the coding network has a size of 32 × 32 and 1024 channels.
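The size and channel bookkeeping of the coding network can be traced with a short Python sketch. The values 64, 67, 128, and the final 32 × 32 × 1024 come from the text; the intermediate widths 256 and 512 are an assumption that the "and so forth" follows the usual VGG16 progression, and the function name is hypothetical.

```python
def encoder_trace(size=512, image_channels=3,
                  widths=(64, 128, 256, 512, 1024), lateral_levels=3):
    # Returns a list of (operation, spatial size, channel count) tuples.
    steps = [("conv", size, widths[0])]
    channels = widths[0]
    for level, width in enumerate(widths[1:], start=1):
        # 2 x 2 max pooling halves the spatial size, channels unchanged.
        size //= 2
        steps.append(("maxpool", size, channels))
        # The first three levels are concatenated with the average-pooled
        # three-channel input of the same size (the lateral input).
        if level <= lateral_levels:
            channels += image_channels
            steps.append(("concat", size, channels))
        # 3 x 3 convolution sets the channel count for this level.
        channels = width
        steps.append(("conv", size, channels))
    return steps

if __name__ == "__main__":
    for op, size, channels in encoder_trace():
        print(f"{op:8s} {size:3d} x {size:<3d} channels={channels}")
```

Under these assumptions the first fusion yields 64 + 3 = 67 channels at 256 × 256, and the trace ends at 32 × 32 with 1024 channels, in line with the description.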
The model employs four ECA modules in the decoding network, which operates as follows. The decoding network takes the feature map produced at the end of the coding network as its input. This map is first up-sampled by a 2 × 2 up-sampling layer, which increases its size to 64 × 64 while keeping the number of channels constant; it then passes through the ECA module together with the same-sized feature map from the skip connection, and the two are fused. Following the fusion, the size is maintained, while the number of channels increases to 1536. Next, 3 × 3 convolutional layers are applied to adjust the number of channels, and 2 × 2 up-sampling layers are applied to further increase the size. This process is repeated until the size reaches 512 × 512 and the number of channels is 64. Lastly, a 1 × 1 convolution adjusts the number of channels of the final feature map produced by the decoding network to the number of categories, which is 3 in this study, and each pixel of the image is then classified.
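The decoding path can be traced in the same way. In this sketch the starting point (32 × 32 × 1024), the first fused width (1536), the final 512 × 512 × 64 map, and the three output categories come from the text. The skip-connection widths (512, 256, 128, 64) and the post-convolution widths at intermediate stages are assumptions chosen to be consistent with those figures, and the function name is hypothetical.

```python
def decoder_trace(size=32, channels=1024,
                  skip_channels=(512, 256, 128, 64),
                  out_widths=(512, 256, 128, 64), num_classes=3):
    # Returns a list of (operation, spatial size, channel count) tuples.
    steps = []
    for skip, width in zip(skip_channels, out_widths):
        # 2 x 2 up-sampling doubles the spatial size, channels unchanged.
        size *= 2
        steps.append(("upsample", size, channels))
        # ECA leaves shapes unchanged; concatenation adds the skip channels.
        channels += skip
        steps.append(("eca+concat", size, channels))
        # 3 x 3 convolution reduces the channel count for this stage.
        channels = width
        steps.append(("conv", size, channels))
    # 1 x 1 convolution maps the final features to per-pixel class scores.
    steps.append(("1x1 conv", size, num_classes))
    return steps

if __name__ == "__main__":
    for op, size, channels in decoder_trace():
        print(f"{op:10s} {size:3d} x {size:<3d} channels={channels}")
```

The four loop iterations correspond to the four ECA modules, and the first fusion gives 1024 + 512 = 1536 channels at 64 × 64.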

Model Training Environment and Performance Evaluation
The deep learning framework TensorFlow was used for model training and testing. The computing hardware environment is as follows: AMD Ryzen 7 5800X 8-core processor, 3.80 GHz main frequency, 16 GB RAM; NVIDIA GeForce RTX 3060 graphics processor, 12 GB video memory. The operating system is Windows 10, together with CUDA 11.3, cuDNN 8.2.1, Python 3.7, and TensorFlow 2.5. The model's initial learning rate is $1 \times 10^{-4}$, its learning rate momentum is 0.9, the batch size is set to 4, the size of the input image is set to 512 × 512, and the number of iterations is 300. The Adam optimizer is used to optimize the network; it continually adapts the learning rate during training, which helps keep the model from settling into a local optimum.
In this study, the model's performance is assessed in four areas: segmentation accuracy, number of model parameters, model size, and segmentation speed. The segmentation speed is measured as the average time the model takes to process a single image. The mean intersection over union (MIOU) is the average, over all categories, of the intersection-over-union ratio between each category's true and predicted labels. Mean pixel accuracy (MPA) is the average, over all categories, of the proportion of correctly predicted pixels for each category relative to the total number of pixels. Segmentation accuracy is therefore indicated by MIOU and MPA. In their defining expressions, k denotes the total number of categories excluding the background category; because this study must distinguish between crops and weeds in addition to the background, k = 2. TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively.
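To make the two accuracy metrics concrete, the sketch below computes them from a confusion matrix over the k + 1 = 3 categories. It assumes the common per-class forms IoU = TP / (TP + FP + FN) and pixel accuracy = TP / (TP + FN), each averaged over all k + 1 categories; these forms are an assumption for illustration and may differ in detail from the expressions the authors used. The function names are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # Rows index the true class, columns the predicted class.
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def miou_mpa(cm):
    # Per-class TP, FN, FP read off the confusion matrix.
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    # Assumed per-class definitions (see lead-in); a class absent from both
    # labels and predictions would cause a division by zero here.
    iou = tp / (tp + fp + fn)
    pixel_acc = tp / (tp + fn)
    return iou.mean(), pixel_acc.mean()

if __name__ == "__main__":
    # Toy 2 x 2 label maps: 0 = background, 1 = crop, 2 = weed.
    y_true = np.array([[0, 0], [1, 2]])
    y_pred = np.array([[0, 1], [1, 2]])
    print(miou_mpa(confusion_matrix(y_true, y_pred, num_classes=3)))
```

In the toy example one background pixel is mislabelled as crop, so the per-class IoU values are 1/2, 1/2, and 1, and the per-class pixel accuracies are 1/2, 1, and 1.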

Ablation Experiment
An ablation experiment was carried out in this study to examine the contributions of VGG16+Cutting, the multi-scale input, and the ECA module to the enhanced U-Net. The results are shown in Table 2. VGG16+Cutting means employing the simplified VGG16 as the coding network of U-Net and cutting the number of convolutional layers in the decoding network of U-Net; the details of this approach can be seen in Figure 6. As can be observed in Table 2, the addition of the VGG16+Cutting module reduces the MIOU of the model by 1.13% but also decreases the model parameters by 49.82% and the single-image time consumption by 13.85 milliseconds. Because this makes the model lighter and better suited for real-time detection, we believe that a minor loss of accuracy is worthwhile. Adding the multi-scale input module increases the MIOU of the model by 0.41%, at the expense of an increase in single-image time consumption of 1.26 milliseconds and an increase of 0.08% in model parameters. The multi-scale input module increases the number of channels to retain more information, but in this study the number of channels was increased only briefly, during feature fusion, to avoid a surge in the number of model parameters; it was then immediately restored to that of the original U-Net network by 3 × 3 convolutional layers. In contrast, although refs. [35,46,48] also enhanced the model by boosting the number of image channels to achieve better segmentation, they did so by directly fusing RGB and NIR images to create a four-channel image input into the network, a method that would significantly increase the number of model parameters.
Furthermore, the MIOU of the U-Net model increased by 1.63 percentage points when the ECA module was included, showing that the ECA module can significantly improve the model's segmentation accuracy. The attention gate (AG) module, squeeze and excitation (SE) module, and convolutional block attention module (CBAM) were added to the U-Net model by John et al. [49], Yu et al. [50], and Jin et al. [44], respectively. Although the addition of these attention mechanism modules improves the segmentation accuracy of the model, it also introduces many new model parameters and increases the complexity of the network. In contrast, the ECA module used in this work is a lightweight module, and it can be seen in Table 2 that the number of model parameters is only slightly increased after the ECA module is added.

Comparison of the Overall Accuracy of the Model
The change curves of the mean intersection over union and mean pixel accuracy on the training set of the improved model in this paper and the original U-Net model as well as the current widely used semantic segmentation models PSPNet and DeepLab V3+ [51] are shown in Figure 7. The computational results are shown in Table 3, and the improved model in this paper is MSECA-Unet.
The change curves of the mean intersection over union and mean pixel accuracy on the training set of the improved model in this paper and the original U-Net model as well as the current widely used semantic segmentation models PSPNet and DeepLab V3+ [51] are shown in Figure 7. The computational results are shown in Table 3, and the improved model in this paper is MSECA-Unet.  In contrast to PSPNet, DeepLab V3+ and the original U-Net model, the MIOU and MPA of the MSECA-Unet model on the training set are higher, as shown in Figure 7. Additionally, as can be seen in Figure 7a, the improved MSECA-Unet model converged after 130 iterations, stabilized near the highest value earlier, and did so significantly more quickly than the other three comparison models. This is because, in this paper, the ECA module, which can successfully prevent the activation of irrelevant information and noise in the network, is introduced before the fusion of features in the U-Net network, so that it only fuses the feature information that requires attention, which decreases the time loss in feature fusion, and hence, quickens the model's convergence, which is consistent with the conclusions reached by Zhang et al. [29] when introducing the ECA module into the YOLOv4-Tiny network, and by Zhao et al. [52] when introducing the ECA module into DenseNet network.
As shown in Table 3, the MSECA-Unet model's MIOU is 88.95% and its MPA is 93.02% on the training set, which are higher than the 87.38% and 91.95% of the original U-Net model and also higher than the corresponding metrics of the other two commonly used semantic segmentation models. This indicates that the improvements made in this paper raise the model's segmentation accuracy, and that the MSECA-Unet model has a better segmentation effect on the Chinese cabbage and weed training set than the U-Net, PSPNet, and DeepLab V3+ models.
The MSECA-Unet model, as well as the U-Net, PSPNet, and DeepLab V3+ models, are also assessed on the test set in this work. The prediction results are displayed in Table 4, Table 5 shows the number of model parameters, model size, and prediction speed, and Table 6 shows the model accuracy, precision, and F1-score. As can be seen in Table 4, the intersection over union and pixel accuracy of all models for weed segmentation are much lower than the corresponding metrics for background and crop segmentation. This is because the weeds in the dataset collected in this study are dense and small in area, which makes them harder to segment than the background and the crop, which cover large areas and are few in number. In addition, Table 4 shows that the proposed MSECA-Unet model produced the best results for background, weeds, and crops, with intersection over union values of 99.24%, 73.62%, and 94.02% and pixel accuracy values of 99.64%, 82.58%, and 96.92%, respectively. The corresponding values for the original U-Net model are 99.16%, 69.87%, and 93.62% and 99.58%, 80.58%, and 96.84%, so the improvements are 0.08, 3.75, and 0.40 percentage points and 0.06, 2.00, and 0.08 percentage points, respectively. Thus, in addition to having a higher intersection over union and pixel accuracy than the original U-Net model for all categories in this study, the MSECA-Unet model also markedly increased these metrics for weeds, the hardest category to segment, which strongly supports the efficacy of the improvements made in this paper.
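For reference, per-class intersection over union and pixel accuracy are usually computed from a confusion matrix, and the mean metrics are the averages over classes. The sketch below assumes these common definitions (per-class pixel accuracy taken as correctly classified pixels divided by ground-truth pixels of that class); the confusion matrix contains hypothetical pixel counts, not data from this study. The last line checks that the three per-class IoU values quoted above for MSECA-Unet average to the reported test-set MIOU.

```python
def per_class_metrics(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    n = len(conf)
    iou, pa = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # class-c pixels that were missed
        fp = sum(conf[r][c] for r in range(n)) - tp   # other pixels labelled as class c
        iou.append(tp / (tp + fp + fn))
        pa.append(tp / (tp + fn))
    return iou, pa


def mean(values):
    return sum(values) / len(values)


# Hypothetical pixel counts for (background, weed, crop).
conf = [
    [900, 20, 10],
    [30, 60, 10],
    [5, 5, 160],
]
iou, pa = per_class_metrics(conf)
print("IoU per class:", [round(v, 4) for v in iou], "MIOU:", round(mean(iou), 4))
print("PA per class: ", [round(v, 4) for v in pa], "MPA: ", round(mean(pa), 4))

# Per-class IoU values quoted in the text for MSECA-Unet (background, weed, crop).
print(round(mean([99.24, 73.62, 94.02]), 2))  # 88.96
```

In the toy matrix, the small weed class has the lowest IoU because a few dozen misclassified pixels weigh heavily against its small area, which mirrors the pattern described for Table 4.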
In Tables 4 and 5, we can see that the MIOU of the original U-Net model is 87.55% and the MPA is 92.33%, while the MIOU of the MSECA-Unet model proposed in this paper is 88.96% and the MPA is 93.05%, which are improved by 1.41 and 0.72 percentage points, respectively. This is because the original U-Net model down-samples the feature map four times in order to obtain deeper feature information, which causes the network to lose a lot of detailed information that cannot be recovered by the subsequent up-sampling operation and thus lowers the segmentation accuracy of the network. In contrast, this study incorporates the multi-scale feature maps produced by average pooling into the network, which effectively addresses the aforementioned information loss and boosts the model's segmentation accuracy. Meanwhile, the original U-Net model uses jump connections to combine the spatial information from the up-sampling path with the spatial information from the down-sampling path. However, this brings in many redundant low-level features and noise, which affect the segmentation accuracy and speed of the network. In this paper, the ECA module is introduced before the network feature fusion. Increasing the weight of the target features and reducing the weight of useless or low-impact features makes the model focus more on target feature extraction and improves the model's feature extraction efficiency and accuracy.
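The multi-scale input described above can be illustrated with a small average-pooling pyramid. The sketch below repeatedly halves a single-channel image with non-overlapping 2 x 2 average pooling; the pooling window, the number of levels, and the single-channel input are assumptions made for illustration, and the lateral integration of each level into the coding network is not shown.

```python
def avg_pool_2x2(image):
    """Halve the height and width of a 2D list by averaging 2 x 2 blocks."""
    h, w = len(image) // 2, len(image[0]) // 2
    return [
        [
            (
                image[2 * r][2 * c]
                + image[2 * r][2 * c + 1]
                + image[2 * r + 1][2 * c]
                + image[2 * r + 1][2 * c + 1]
            )
            / 4.0
            for c in range(w)
        ]
        for r in range(h)
    ]


def input_pyramid(image, levels):
    """Return the image followed by `levels` successively down-sampled copies."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(avg_pool_2x2(pyramid[-1]))
    return pyramid


# Hypothetical 8 x 8 single-channel image.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
for level in input_pyramid(image, levels=3):
    print(len(level), "x", len(level[0]))
```

Because each level is derived directly from the input image rather than from deeper convolutional features, it retains averaged low-level detail at the resolution of the corresponding encoder stage, which is the information the paragraph above says is otherwise lost through repeated down-sampling.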
Additionally, the proposed MSECA-Unet model has 1.58 × 10^7 model parameters and a model size of 60.27 MB, which are both 49.68% less than the original U-Net model's 3.14 × 10^7 and 119.77 MB. Moreover, the proposed MSECA-Unet model's single-image time consumption is 64.85 ms, which is 9.36% less than the original U-Net model's 71.55 ms. This indicates that the proposed MSECA-Unet model has a faster segmentation speed than the original U-Net model, and that it is more capable of meeting the requirements of real-time crop and weed detection. This is because the model coding network is simplified according to the simple features of cabbage and weed in the images. The simplified coding network can maintain the same image feature extraction capability while consuming fewer computational resources. Meanwhile, this study also simplifies the model decoding network by reducing the number of convolutional layers, because the images in this study are not complex and the decoding network does not need more abstract features. The reduction in the number of convolutional layers of the coding and decoding networks decreases the number of model parameters and the model size, and speeds up the segmentation of the model. In contrast, refs. [39,40,43] directly use the VGG16 network as the encoding network for U-Net without simplifying VGG16, which also achieves better segmentation results but increases the width and depth of the network and requires a higher hardware configuration to run the model. Chen et al. [53] achieved the accurate segmentation of grains, branches, and straws in hybrid rice grain images by improving the U-Net model, but their improvement still relied on increasing the depth of the model so that it extracts richer semantic information. The improvements made in this work, however, aim to obtain the largest gain with as few additions to the model as possible.
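The relative reductions quoted above follow directly from the reported parameter counts, model sizes, and single-image times. The short check below recomputes them; the throughput figures in frames per second are derived here from the reported times and are not values stated in the paper.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline * 100.0


# Values reported in the text: original U-Net versus MSECA-Unet.
params_cut = relative_reduction(3.14e7, 1.58e7)   # number of parameters
size_cut = relative_reduction(119.77, 60.27)      # model size in MB
time_cut = relative_reduction(71.55, 64.85)       # single-image time in ms

print(f"parameters: {params_cut:.2f}%")  # 49.68%
print(f"model size: {size_cut:.2f}%")    # 49.68%
print(f"time:       {time_cut:.2f}%")    # 9.36%

# Derived throughput (frames per second), assuming one image at a time.
fps_unet = 1000.0 / 71.55
fps_mseca = 1000.0 / 64.85
print(f"U-Net: {fps_unet:.1f} fps, MSECA-Unet: {fps_mseca:.1f} fps")
```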
The ECA module introduced before the network feature fusion is a lightweight module with few parameters. In addition, the multi-scale feature maps are integrated into the backbone feature extraction network laterally, which avoids a significant growth in the number of network parameters. Due to these improvements, the model has fewer parameters and a smaller model size while still preserving the segmentation effect, which allows it to run on relatively low-specification hardware, reducing memory costs and resource consumption.
In addition, the MSECA-Unet model proposed in this paper also outperforms the current semantic segmentation models DeepLab V3+ and PSPNet. The MSECA-Unet model's MIOU and MPA are improved by 3.90% and 3.45%, respectively, over DeepLab V3+, while the number of model parameters, model size, and single-image time consumption are decreased by 61.74%, 61.96%, and 15.03%, respectively. In comparison to PSPNet, the MIOU and MPA of the MSECA-Unet model are increased by 14.03% and 13.38%, and the number of model parameters, model size, and single-image time consumption are reduced by 67.82%, 66.30%, and 3.90%, respectively. In summary, the segmentation speed (single-image time consumption) of the proposed MSECA-Unet model is faster than that of the other three semantic segmentation models, and its segmentation accuracy (MIOU and MPA) is also higher, with a substantial reduction in the number of model parameters and model size. This indicates that the proposed model is more suitable for application in the recognition of Chinese cabbage crops and weeds.
As can be seen in Table 6, the MSECA-Unet model proposed in this paper has the best accuracy, precision, and F1-score compared with U-Net, DeepLab V3+, and PSPNet. The accuracy, precision, and F1-score of the MSECA-Unet model increased by 0.4%, 1.17%, and 0.94%, respectively, when compared to U-Net. In order to make the model more lightweight and improve its segmentation speed, references [42,45] decreased the U-Net model's convolutional layer count in a manner similar to this study. Although the segmentation speed of those improved models for farmland weeds was considerably increased, the removal of a large number of model parameters resulted in a decrease in model precision, and the other improvements made to those models were unable to make up for this loss. The MSECA-Unet model's accuracy, precision, and F1-score increased in comparison to DeepLab V3+ by 1.05%, 1.74%, and 2.62%, respectively; in comparison to PSPNet, they increased by 4.4, 6.8, and 10.28 percentage points, respectively.
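As a reminder of how the metrics in Table 6 relate to one another, the sketch below computes accuracy, precision, recall, and F1-score for a single class from true/false positive and negative pixel counts. It uses the standard definitions, which are assumed rather than quoted from this paper, and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard single-class metrics from pixel counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)            # how many predicted positives are correct
    recall = tp / (tp + fn)               # how many true positives are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts for a small class such as weeds.
acc, prec, rec, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=890)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} F1={f1:.4f}")
```

With a dominant background class, accuracy stays high even when precision on a small class is modest, which is one reason precision and F1-score show larger differences between models than accuracy does.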

Comparison of Model Segmentation Effects
Randomly selected images in the test set are used as sample images to obtain their segmentation effects on each model. In order to observe the segmentation effect more clearly, the original image was fused with the predicted label image after reducing the transparency and the segmentation effect of each model was displayed in Figure 8. To facilitate the observation of the differences in segmentation effects between different models, certain regions in the segmentation effect map were locally enlarged and the weeds in the locally enlarged map were numbered in the labelled image, as shown in Figure 9. The Chinese cabbage crop is presented in red in the figure, and the weed is presented in green.
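The fusion of the original image with a semi-transparent predicted label image can be sketched as a per-pixel alpha blend. The colour coding follows the description above (crop in red, weed in green); the class ids, the blending factor, and the pure-Python image representation are hypothetical choices for illustration.

```python
# Hypothetical class ids: 0 = background (left unchanged), 1 = weed, 2 = crop.
CLASS_COLORS = {0: None, 1: (0, 255, 0), 2: (255, 0, 0)}


def overlay(image, labels, alpha=0.4):
    """Blend class colours onto an RGB image.

    image: H x W list of (r, g, b) tuples; labels: H x W list of class ids.
    alpha: opacity of the label colour (0 = invisible, 1 = opaque).
    """
    blended = []
    for image_row, label_row in zip(image, labels):
        row = []
        for pixel, cls in zip(image_row, label_row):
            color = CLASS_COLORS[cls]
            if color is None:
                row.append(pixel)
            else:
                row.append(
                    tuple(
                        round((1 - alpha) * p + alpha * c)
                        for p, c in zip(pixel, color)
                    )
                )
        blended.append(row)
    return blended


# Hypothetical 1 x 3 image: background, weed, and crop pixels.
image = [[(100, 100, 100), (100, 100, 100), (100, 100, 100)]]
labels = [[0, 1, 2]]
print(overlay(image, labels))
```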
According to Figure 8, the MSECA-Unet model proposed in this study has the best segmentation effect, and its result is the most similar to the labelled image. In contrast, the segmentation effect of the PSPNet model is the least satisfactory. In Figure 8f, the area segmented as Chinese cabbage by the PSPNet model clearly deviates from the true crop area in the image, and the mis-segmentation and under-segmentation of weeds are severe, making it impossible to identify weeds correctly.
The MSECA-Unet model has the best segmentation effect on weeds A2, A3, B4, and B5, according to the images of regions A, B, and C in Figure 9, while the DeepLab V3+ model has the poorest segmentation effect, segmenting weeds A3 and B4 only partially and failing to segment weeds A2 and B5. Additionally, for weeds A1, B2, and C1, which are close to the crop, the MSECA-Unet model can accurately segment the gap between them and the crop. The U-Net and DeepLab V3+ models, by contrast, incorrectly segment the background around weeds A1, B2, and C1 as crop or weed, so that the crop and weed prediction labels are merged without a gap between them. Moreover, the DeepLab V3+ model had the worst segmentation effect, merging not only weed B2 but also weed B3 with the crop. Additionally, the MSECA-Unet model had the best segmentation of weed B1, which overlapped with the crop. The U-Net and DeepLab V3+ models severely under-segmented weed B1: the U-Net model segmented only a very small region, and the DeepLab V3+ model did not segment it at all.
In summary, compared with the U-Net and DeepLab V3+ models, the MSECA-Unet model has the best performance: it can accurately segment weeds that overlap with crops, and it most accurately segments the gap between crops and weeds. The accurate segmentation of weeds that are close to or overlap with crops is an important prerequisite for accurate spraying and accurate weed control.

Conclusions
To solve the problem of the efficient and accurate identification of vegetables and weeds in the field, and to realize the accurate spraying of herbicides and intelligent weeding operations, a semantic segmentation model, MSECA-Unet, based on an improved U-Net is proposed in this paper. By laterally integrating multi-scale inputs and introducing the efficient channel attention (ECA) mechanism, the model improves segmentation accuracy and achieves efficient, accurate, and quick identification of Chinese cabbage crops and weeds, with a substantial reduction in the number of model parameters and model size.
The proposed MSECA-Unet model outperformed the currently popular semantic segmentation models PSPNet and DeepLab V3+, as well as the original U-Net model, with MIOU and MPA values of 88.96% and 93.05%, respectively, on the dataset for Chinese cabbage and weed. Compared with the original U-Net model, these values improved by 1.41 and 0.72 percentage points, respectively. Finally, by comparing the segmentation effects of the test set images on various models, it can be seen that the proposed MSECA-Unet model has more accurate segmentation effects on weeds close to and overlapping with the crop than the other three models, which is a necessary prerequisite for accurate spraying and accurate weeding. As a result, the proposed MSECA-Unet model can provide strong technical support for the development of intelligent spraying robots and intelligent weeding robots.
The MSECA-Unet model proposed in this paper is lightweight and has the advantage of fast recognition speed, but it also has some limitations. The model only identifies weeds and does not classify their species. Therefore, it cannot select the corresponding herbicide according to the type of weed when guiding an intelligent spraying robot to spray accurately. In addition, the model adapts poorly to new conditions: it needs to be retrained on a new dataset when used in other crop fields and cannot be applied to multiple crops at the same time. Therefore, in future work, we will consider the further classification of weed species and expand the training dataset to include more crop and weed species, in order to develop a model that can be adapted to different crops and weeds.