Transformer-Based Weed Segmentation for Grass Management

Weed control is among the most challenging issues for crop cultivation and turf grass management. In addition to hosting various insects and plant pathogens, weeds compete with crops for nutrients, water and sunlight. This results in problems such as the loss of crop yield, the contamination of food crops and the degradation of field aesthetics and practicality. Therefore, effective and efficient weed detection and mapping methods are indispensable. Deep learning (DL) techniques for the rapid recognition and localization of objects from images or videos have shown promising results in various areas of interest, including the agricultural sector. Attention-based Transformer models are a promising alternative to traditional convolutional neural networks (CNNs) and offer state-of-the-art results for multiple tasks in the natural language processing (NLP) domain. To this end, we exploited these models to address the aforementioned weed detection problem with potential applications in automated robots. Our weed dataset comprised 1006 images for 10 weed classes, which allowed us to develop deep learning-based semantic segmentation models for the localization of these weed classes. The dataset was further augmented to meet the Transformer models' need for a large sample set. A study was conducted to evaluate the results of three types of Transformer architectures, namely Swin Transformer, SegFormer and Segmenter, on the dataset, with SegFormer achieving a final Mean Accuracy (mAcc) and Mean Intersection over Union (mIoU) of 75.18% and 65.74%, respectively, while also being the least computationally expensive, with just 3.7 M parameters.


Introduction
Global population growth has resulted in an increase in food demand. To meet the anticipated demand, agricultural production needs to increase by approximately 70% [1]. Farm output and its quality, along with crop cultivation, are, however, adversely affected by a number of factors. Among these issues is the growth of weeds, which occurs simultaneously with crop growth. A variety of weed plants exist that spread quickly and thus negatively impact crop yield. These weeds directly compete with crops for resources such as water, nutrients and sunlight, which leaves the crops prone to a number of diseases. Studies show that vegetable yield decreases by 45% to 95% in the case of weed-vegetable competition [2]. This extends even beyond crop cultivation; weed growth is also a problem on turfed surfaces such as football pitches, golf courses, residential lawns, parks and sports fields. In order to tackle the issue of weed growth, appropriate measures must be taken. The focus of this paper is weed control for the latter case of turf grass management, but the same technology might be used for the former case of crop cultivation.
Weed control is a very challenging task. Various strategies can be employed by farmers for weed reduction in targeted areas. These methodologies can be divided into five main categories: (a) preventative-prevent weed growth preemptively; (b) mechanical-mowing, hand weeding and mulching; (c) cultural-maintaining field hygiene; (d) biological-utilizing weeds' natural adversaries, such as insects, grazing animals, etc.; (e) chemical-

1. Predicting accurate segmentation masks for weeds using Transformer-based architectures for the purpose of automating weed control with a focus on turf management.
2. Investigating a range of recent Transformer models using our weed dataset and making detailed comparisons in terms of performance and complexity.
In Section 2, we provide a detailed review of previously designed methods for similar purposes. Section 3 contains information about the Transformer model architectures employed in the study. Section 4 contains details about our dataset, along with the applied augmentations and evaluation metrics. Subsequently, in Section 5, we provide the details of our experiments, the comparison of different models and the extracted conclusions. Finally, in Section 6, we provide a brief summary of the work performed.

Related Studies
In this section, we provide a brief overview of the Transformer model and its use in computer vision, along with a review of previously proposed vision-based automatic weed detection methods.

Transformer Architecture
Originally introduced in the context of machine translation, Transformer models are now used to solve a wide variety of tasks in multiple domains. In the context of natural language processing, recurrent or convolutional neural network (CNN) models based on encoder and decoder architectures were common before the inception of Transformer models. The Transformer gets rid of the recurrent and convolution layers and proposes a simple model based entirely on the attention mechanism. Transformer models employ the self-attention mechanism, where each word attends to every other word in the same input sequence. As a result, the Transformer model takes significantly less time to train than its counterparts while achieving more parallelization [4].
Building upon the original Transformer architecture, researchers have tried importing the architecture into the domain of computer vision [9][10][11][12]. However, a naive port incurs a cost that is quadratic in the number of pixels, since self-attention on images treats each pixel as a separate token that attends to every other pixel in the image. This, in turn, makes the direct application of self-attention to images practically infeasible, given the huge number of pixels present in a single image. Various techniques have been designed to mitigate the quadratic cost of Vision Transformer models. Dosovitskiy et al. [12] applied a pure Transformer block to a sequence of image patches, an approach termed Vision Transformer (ViT). In that study, the input image was split into fixed-size patches, each treated as a single token. The patches were then flattened and passed through a trainable linear projection. Positional embedding vectors were added to each input patch, and these patches were then fed through a Transformer encoder for classification. Unlike CNNs, in this architecture, the authors did not include any explicit inductive bias about the 2D structure of images, except in the patch extraction and resolution adjustment step. Stand-Alone Self-Attention (SASA) [10] is a fully self-attentive model that replaces all the local convolution operations with self-attention instead of employing self-attention just as an augmentation over convolutions. This substitution is performed on the ResNet architecture. Vaswani et al. [13] introduced a new series of self-attention models called HaloNets that are built around the concept of blocked local self-attention, similar to SASA [10]. Swin Transformer [6] is another self-attention-based approach for visual detection. It involves splitting the image into windows whose configuration is shifted between layers, with self-attention applied inside these shifted windows.
Experiments exploring different locality patterns of the self-attention modules have also been performed [14][15][16].
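To make the cost argument above concrete, the following is a minimal NumPy sketch of ViT-style patch tokenization. It is an illustration only, not the implementation used in the cited works; the 224 × 224 input size, the helper names `patchify` and `attention_pairs`, and the omission of the trainable linear projection and positional embeddings are all assumptions made for brevity.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (N, patch * patch * C), one row per token.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "size must be divisible by patch"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * c)


def attention_pairs(num_tokens: int) -> int:
    """Number of token pairs scored by full (global) self-attention."""
    return num_tokens ** 2


# Hypothetical 224 x 224 RGB input, chosen only for illustration.
image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image, patch=16)

pixel_pairs = attention_pairs(224 * 224)          # one token per pixel
patch_pairs = attention_pairs(tokens.shape[0])    # one token per 16 x 16 patch
print(tokens.shape, pixel_pairs, patch_pairs)
```

With one token per pixel, the number of attention pairs grows with the square of the pixel count, whereas grouping pixels into 16 × 16 patches reduces the token count by a factor of 256 and the pair count by a factor of 256 squared.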

Deep Learning (DL) Models for Weed Detection and Transformer Models in the Agricultural Sector
Machine learning (ML) has proved to be very effective for the development of automatic weed detection and classification systems for deployment in a wide range of circumstances [17]. Here, we provide a brief overview of previously performed research in this context.
Historically, various image processing techniques were used for the classification of weeds and crops [18,19]. Different shape features are extracted, and the feature vectors are then evaluated using a single-layer perceptron classifier. In contrast to ML techniques that require substantial domain expertise to properly design feature extractors, DL allows the machine to automatically extract the most characteristic features of objects from raw images. DL is more robust, compared with traditional ML models, to different variations in the input images, leading to better classification results.
Espejo-Garcia et al. [20] performed crop (tomato and cotton)/weed (black nightshade and velvetleaf) identification using a combination of pre-trained convolutional neural networks with traditional machine learning classifiers. Jin et al. [21] identified weeds in a vegetable plantation setting. Contrary to other weed detection systems, their work focused on training a CenterNet model that was first used to detect vegetables and draw bounding boxes around them. Afterwards, the remaining green-colored objects that fell outside the bounding boxes were classified as weeds. The detection of a large variety of weed species is feasible using this methodology. In their work, the weeds were further extracted from the background using color-index-based segmentation. Vaidhehi et al. [22] developed a model for weed and paddy detection using regional convolutional neural networks (R-CNNs). The results from the R-CNNs were compared with conventional CNN models and other segmentation models. Wang et al. [22] investigated an encoder-decoder based network for the semantic segmentation of crops and weeds. The network was optimized using different input representations. In their experiments, the inclusion of NIR information significantly improved the segmentation accuracy. Youjie et al. [5] combined several techniques with a dilated CNN model to enhance the performance of weed segmentation. They employed hybrid dilated convolution, UFAB (Universal Function Approximation Block) and drop-block techniques in the network backbone, bridge attention blocks to link the encoder to the decoder and SPRB (Spatial Pyramid Attention Block) to refine the segmentation result.
Vision Transformer can be treated as an extension of such non-local attention techniques. Reedha et al. [23] explored the application of Vision Transformer (ViT) to weed and crop recognition. For this study, the images were collected using a high-resolution camera mounted on an unmanned aerial vehicle (UAV). The UAV was deployed in beet, parsley and spinach fields for dataset collection. Experiments were conducted to compare the effect of varying training and test set sizes. Similarly, Liang et al. [24] used ViT for the classification of soybean and weeds.
Although not directly related to weed detection, there exist studies where ViT models were applied to a diverse set of agricultural problems. In an effort to apply the Transformer model to plant pathology, P. S. Thakur et al. [25] proposed a model named PlantXViT. The proposed model combines the capabilities of traditional convolutional neural networks with Vision Transformer to efficiently identify a large number of plant diseases in several crops. W. Zhu et al. [26] proposed a method to fuse local and global features of images for feature analysis. They introduced the Transformer encoder as a convolutional operation into the improved model, thereby establishing dependencies between long-distance features and extracting the global features of disease images. The center loss was introduced as a penalty term to optimize the common cross-entropy loss, thus expanding the inter-class differences of crop disease features and narrowing their intra-class gaps. Y. Shen et al. [27] applied Transformer to the semantic segmentation of agricultural aerial images in an attempt to address the inadequate long-range information utilization associated with fully convolutional networks. A hybrid Transformer (MiT) is employed in the encoder stage to enhance the field anomaly pattern recognition capability, and a squeeze and excitation (SE) module is utilized in the decoder stage to improve the effectiveness of key channels. In order to solve the problems of complex crop disease background and small disease area, [28,29] proposed a lightweight ConvViT model, which combines the convolutional structure and the Transformer structure, and modified the patch embedding method to retain more image edge information for the purpose of facilitating patching information exchange between them. R. Reedha et al. [23] studied ViT for plant classification in unmanned aerial vehicle (UAV) images, demonstrating the potential of ViT for remote sensing image analysis tasks.
The aim of our study is to find the best model for the precise localization and classification of weeds so as to minimize weed removal efforts. To this end, this paper explores the application of recent ViT-based segmentation models, which include Swin Transformer, SegFormer and Segmenter, for the aforementioned purpose.

Methods
For our experiments, we selected three high-performing Transformer-based segmentation models, Swin Transformer, SegFormer and Segmenter. Public implementations were used for network training. A brief description of each model is provided in the following sections.

Swin Transformer
Swin Transformer is built by replacing the standard multi-head self-attention (MSA) module in a Transformer block with a module based on shifted windows, whereas the other layers are kept the same. As illustrated in Figure 1b, a Swin Transformer block consists of a shifted window-based MSA module, followed by a 2-layer Multilayer Perceptron (MLP) with GELU nonlinearity in between. A LayerNorm (LN) layer is applied before each MSA module and MLP, and a residual connection is also applied after each module. In addition, Swin Transformer also uses the hierarchical feature map constructed by the Patch Merging block to compute the representation of the input. The process of Patch Merging is shown in Figure 2.
As shown in Figure 1, the architecture alternates between Patch Merging and Swin Transformer blocks. Starting from an input image of size H × W, the initial Patch Splitting module splits the image into non-overlapping patches, each of which is then treated as a 'token' in the input sequence. Each patch is of size 4 × 4, giving a feature dimension of 4 × 4 × 3 = 48. A linear embedding is applied to these raw pixel-value vectors in order to project them into an arbitrary dimension C. Within the whole architecture, the Patch Merging module builds hierarchical feature maps by concatenating the features of each group of 2 × 2 neighboring patches, with the features of the 2 × 2 group placed in the channel dimension. This results in a 2× downsampling of resolution, so the H/4 × W/4 tokens, or patches, are reduced to H/8 × W/8. The number of tokens is further reduced in the subsequent modules, as visualized in Figure 1.
The features coming from the Patch Merging modules are passed through a Swin Transformer block that applies self-attention to the partitioned image. The input sequence length is preserved after the application of the attention blocks. Self-attention is implemented in two steps, Window-based Self-Attention (W-MSA) and Shifted Windows Self-Attention (SW-MSA), which are placed sequentially. In W-MSA, self-attention is applied locally within each window, so the complexity grows linearly with the number of windows or patches. This is an improvement over the previous ViT model, where attention was calculated between every pair of patches/tokens, which resulted in quadratic complexity with respect to the number of tokens. The SW-MSA approach introduces connections between neighboring non-overlapping windows coming from the previous layer by slightly shifting the window configuration.
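The Patch Merging step and the linear-versus-quadratic cost comparison described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the 224 × 224 input, the embedding dimension C = 96 and the window size M = 7 are hypothetical values chosen for the example, and the sketch omits the linear layer that, in the published Swin design, follows the concatenation to reduce the channel count.

```python
import numpy as np


def patch_merging(x: np.ndarray) -> np.ndarray:
    """Concatenate each 2 x 2 group of neighbouring tokens along channels.

    x: (H, W, C) token grid -> (H/2, W/2, 4C) token grid.
    """
    h, w, _ = x.shape
    assert h % 2 == 0 and w % 2 == 0, "token grid must have even height/width"
    top_left = x[0::2, 0::2, :]
    bottom_left = x[1::2, 0::2, :]
    top_right = x[0::2, 1::2, :]
    bottom_right = x[1::2, 1::2, :]
    return np.concatenate(
        [top_left, bottom_left, top_right, bottom_right], axis=-1
    )


def global_attention_pairs(h: int, w: int) -> int:
    """Token pairs scored when every token attends to every other token."""
    return (h * w) ** 2


def window_attention_pairs(h: int, w: int, m: int) -> int:
    """Token pairs scored when attention is restricted to m x m windows."""
    assert h % m == 0 and w % m == 0, "grid must tile evenly into windows"
    num_windows = (h // m) * (w // m)
    return num_windows * (m * m) ** 2


# Hypothetical example: 224 x 224 image, 4 x 4 patches -> 56 x 56 tokens.
tokens = np.zeros((56, 56, 96), dtype=np.float32)  # C = 96 is an assumption
merged = patch_merging(tokens)

full_cost = global_attention_pairs(56, 56)
windowed_cost = window_attention_pairs(56, 56, m=7)  # M = 7 is an assumption
print(merged.shape, full_cost, windowed_cost)
```

For a fixed window size, the windowed cost equals the number of tokens multiplied by M², which is why it grows linearly rather than quadratically as the image gets larger.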

SegFormer
SegFormer is an efficient semantic segmentation framework based upon the encoder and decoder concepts. The encoder outputs multi-scale features, and a simple All-MLP decoder aggregates this multi-scale information from different layers, combining both local and global attention to compute rich representations in order to perform semantic segmentation. Figure 3 shows the proposed architecture of SegFormer, which is divided into two sections, the encoder and the decoder. The input image is first divided into 4 × 4 patches, unlike ViT, which uses a patch size of 16 × 16. This results in better performance in dense prediction tasks. The Transformer block in the encoder is composed of three sub-modules: an efficient self-attention module, a feed-forward network (Mix-FFN) and an Overlap Patch Merging block. The self-attention module applies the sequence reduction process introduced in [7], which shortens the sequence length by a reduction ratio. This helps to lower the computational cost of the self-attention process. ViT uses fixed-resolution Position Encodings (PEs) in order to incorporate positional information, which reduces the performance when the test and training resolutions differ, since the positional code has to be interpolated for the new resolution. To solve this, SegFormer uses a 3 × 3 Conv in the feed-forward network for data-driven positional encoding. Lastly, the Overlap Patch Merging block is used to reduce the feature map size throughout the architecture. This results in hierarchical feature representation comprising high-resolution coarse features and low-resolution fine-grained features. Hierarchical feature maps of sizes 1/4, 1/8, 1/16 and 1/32 of the original image resolution are obtained as such.
The decoder modules contain a full-MLP layer, which takes the features from the encoder module and aggregates them together. The process is performed in four steps: (a) Multi-level features from the encoder go through an MLP layer to be unified in the channel dimension.
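As a rough illustration of the encoder ideas in this section, the NumPy sketch below computes the four hierarchical feature-map resolutions and shows one way a reduction ratio can shorten the key/value sequence before attention. It is a sketch under assumptions, not the SegFormer implementation: the reshape-and-project form of the reduction, the random weight matrix, the single attention head and the absence of learned query/key/value projections are all simplifications, and the names `stage_resolutions`, `reduce_sequence` and `efficient_self_attention` are hypothetical.

```python
import numpy as np


def stage_resolutions(h: int, w: int, strides=(4, 8, 16, 32)):
    """Feature-map sizes at 1/4, 1/8, 1/16 and 1/32 of the input size."""
    return [(h // s, w // s) for s in strides]


def reduce_sequence(x: np.ndarray, ratio: int, weight: np.ndarray) -> np.ndarray:
    """Shorten a token sequence from N to N / ratio.

    x: (N, C). Groups `ratio` consecutive tokens into one (N/ratio, C*ratio)
    row, then projects back to C channels with `weight` of shape (C*ratio, C).
    """
    n, c = x.shape
    assert n % ratio == 0, "sequence length must be divisible by the ratio"
    grouped = x.reshape(n // ratio, c * ratio)
    return grouped @ weight


def efficient_self_attention(x: np.ndarray, ratio: int, weight: np.ndarray):
    """Single-head attention with full-length queries and reduced keys/values."""
    kv = reduce_sequence(x, ratio, weight)            # (N/ratio, C)
    scores = x @ kv.T / np.sqrt(x.shape[1])           # (N, N/ratio)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)    # softmax over keys
    return attn @ kv, attn


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))                      # N = 64 tokens, C = 8
w_reduce = rng.standard_normal((8 * 4, 8)) * 0.1      # hypothetical weights
out, attn = efficient_self_attention(x, ratio=4, weight=w_reduce)

print(stage_resolutions(512, 512))
print(out.shape, attn.shape)
```

The attention matrix has N × N/R entries instead of N × N, so the cost of this step falls by the reduction ratio R while the output keeps the original sequence length.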

Segmenter
Segmenter is also a Transformer-based image segmentation model built upon the original Vision Transformer (ViT) that allows modeling global dependencies early on in the architecture. The decoder module of Segmenter is based on the Transformer framework. It adds K learnable class embeddings to Mask Transformer, which is input to Transformer as a patch embedding; then, a multiplication operation is performed between the class and the patch embedding, followed by softmax application and 2D feature conversion, with a restoration of the original input image size after upsampling in the end. The final class labels are obtained from these embeddings using a Point-wise Linear decoder or a Mask Transformer decoder. The structure of Segmenter is shown in Figure 4.
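The class-embedding and patch-embedding product described for the Mask Transformer decoder can be sketched in NumPy as follows. This is a minimal sketch under assumptions rather than the Segmenter implementation: the embeddings are random placeholders instead of Transformer outputs, nearest-neighbour repetition stands in for the upsampling step, and the function names and sizes (a 4 × 4 patch grid, 16-pixel patches, D = 8, K = 10) are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row of x to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def class_probability_maps(patch_emb, class_emb, grid_hw, patch_size):
    """Turn patch and class embeddings into per-pixel class probabilities.

    patch_emb: (N, D) patch embeddings, N = grid_h * grid_w.
    class_emb: (K, D) class embeddings, one per class.
    Returns an (H, W, K) array with H = grid_h * patch_size.
    """
    grid_h, grid_w = grid_hw
    # Scalar product between normalized patch and class embeddings: (N, K).
    scores = l2_normalize(patch_emb) @ l2_normalize(class_emb).T
    # Reshape the sequence of mask scores into a 2D grid: (grid_h, grid_w, K).
    masks = scores.reshape(grid_h, grid_w, -1)
    # Nearest-neighbour upsampling to image size (a simplification).
    masks = masks.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    # Softmax over the class dimension.
    masks = masks - masks.max(axis=-1, keepdims=True)
    probs = np.exp(masks)
    return probs / probs.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
patch_emb = rng.standard_normal((16, 8))   # 4 x 4 grid of patches, D = 8
class_emb = rng.standard_normal((10, 8))   # K = 10 classes (assumed)
probs = class_probability_maps(patch_emb, class_emb, (4, 4), patch_size=16)
labels = probs.argmax(axis=-1)             # per-pixel class index
print(probs.shape, labels.shape)
```

Because every pixel inside a patch receives that patch's scores, the label map is constant within each 16 × 16 block in this sketch; a smoother interpolation would blur the block boundaries.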
The input image, x ∈ R^{H×W×C}, is first split into a sequence of patches. The raw RGB values of each patch are then flattened, and these vectors are passed through a linear embedding to produce a sequence of patch embeddings. A learnable position embedding is added to each patch embedding individually to incorporate the location information. These embeddings are then passed through standard Transformer blocks consisting of multi-head self-attention and feed-forward layers to obtain a contextualized encoding containing rich semantic information.
This sequence of embeddings is then passed to the decoder, which learns to map these patch-level encodings to patch-level class scores, which are then upsampled using bilinear interpolation to obtain pixel-level scores. This can be performed using a Point-wise Linear decoder or a Mask Transformer decoder. For the Point-wise Linear decoder, a Point-wise Linear layer is applied to the encoder outputs to produce patch-level class logits. This sequence is reshaped into a 2D shape and upsampled to the original image size. Final segmentation maps are obtained by applying softmax to the class dimension. For the Mask Transformer decoder, a set of K learnable class embeddings, where K refers to the number of classes, is introduced. Each of these is assigned to a specific semantic class and is used to predict the class map. These class embeddings are processed together with the output embedding of the encoder. The decoder is a Transformer encoder by design that generates K masks by computing the scalar product between L2-normalized patch embeddings and the aforementioned class embeddings. A set of mask sequences is obtained, which are then reshaped into 2D masks and upsampled to the original image size. The final segmentation map is obtained after the application of softmax followed by LayerNorm.

Dataset
As part of the evaluation, we constructed a weed dataset that could be used to assess the model's performance. The dataset included 10 categories of weeds: clover (Trifolium repens), common ragweed (Ambrosia artemisiifolia), crabgrass (Digitaria), dandelion (Taraxacum), ground ivy (Glechoma hederacea), lambsquarter (Chenopodium album), pigweed (Amaranthus), plantain (Plantago), tall fescue (Festuca arundinacea) and unknown weed. The unknown weed category contained weeds with features different from those of other classes for general weed detection. An example case of every category is visualized in Figure 5, where we can see diverse colors, textures and weed shapes on grassy backgrounds. Note that the density and the colors of grass in the images are different in the cluttered background.

Figure 5. Example images of each weed category (in Figure 5, each picture has only one category of weed, but each picture in the actual collected dataset may contain multiple types of weeds).
The dataset contained 1006 images in total, as shown in Table 1. All images were taken by lab members using cell phone cameras in Jeonju and Wanju, Jeonbuk Province, in South Korea. As the images were taken in real fields instead of a laboratory, they involved a number of visual challenges, including complex background conditions, differing illumination settings, etc. In addition, the density and growth state of the grass varied between different fields. Furthermore, there also existed intra-class variations for each weed class in terms of color, texture and shape. In Figure 5, we can see complex backgrounds for clover and unknown weed and different illuminations between crabgrass and lambsquarter, along with various stages of grass growth in most of the images. In Figure 6, we can find examples of intra-class variations. All these challenges should be dealt with properly to achieve accurate weed segmentation.

Note that such diversity in a training dataset may help to train a model with high robustness, on the condition that its sample size is above a certain threshold. That is part of the reason why the training data should be augmented to enhance the diversity. For our training, we split the dataset into 805 and 201 images for training and testing, respectively.

Data Augmentation
Data augmentation is used to increase the amount of training data in order to avoid overfitting and develop powerful models from a limited number of initial training samples. However, the results of augmentation should look similar to the images captured in real fields. For augmentation, we used multi-scale training and geometric transforms, including random cropping, random flipping and random rotation, along with photometric distortions, including brightness and contrast changes. Figures 7 and 8 show some examples of augmented images using geometric transforms and photometric distortions.
In multi-scale training, an original image of size 512 × 512 is randomly rescaled to a size in the range of 512 to 2048 during training. Multi-scale training increases the robustness of the model by training it on images of different sizes.
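The augmentations listed above can be sketched in NumPy as follows. This is an illustrative sketch, not the pipeline used in the experiments: the crop size, flip probability, distortion ranges and function names are assumptions, rotation is restricted to multiples of 90 degrees for simplicity, and multi-scale resizing is omitted because it requires an interpolation routine. The point it demonstrates is that geometric transforms must be applied identically to the image and its segmentation mask, whereas photometric distortions are applied to the image only.

```python
import numpy as np


def geometric(image, mask, rng, crop=256):
    """Apply the same random crop, flip and rotation to image and mask."""
    h, w = mask.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    image = image[top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                      # random horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    k = int(rng.integers(0, 4))                 # rotate by k * 90 degrees
    return np.rot90(image, k), np.rot90(mask, k)


def photometric(image, rng, max_delta=32.0, contrast=(0.5, 1.5)):
    """Random contrast scaling and brightness shift; labels are untouched."""
    out = image.astype(np.float32) * rng.uniform(*contrast)
    out += rng.uniform(-max_delta, max_delta)
    return np.clip(out, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
# Synthetic 512 x 512 sample whose pixel values equal the class labels,
# so that image/mask alignment can be checked after the transforms.
mask = rng.integers(0, 10, size=(512, 512))
image = np.repeat(mask[..., None], 3, axis=-1).astype(np.uint8)

aug_image, aug_mask = geometric(image, mask, rng)
distorted = photometric(aug_image, rng)
print(aug_image.shape, aug_mask.shape, distorted.dtype)
```

Using a square crop keeps the output shape unchanged under 90-degree rotations; arbitrary rotation angles would additionally need interpolation for the image and nearest-neighbour sampling for the mask.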

Evaluation Metrics
We evaluated the semantic segmentation results in terms of two metrics, the pixel accuracy and IoU (Intersection over Union). It is important that the metrics reflect the purpose of weed segmentation. Since the segmentation results can be utilized to control a robot manipulator or to drive a weedicide spray nozzle, the exact localization of the weed area is important in order not to damage any healthy grass.

Pixel Accuracy (PA) and Mean PA (mPA)
The pixels belonging to a class are specified by the target mask, which can be compared with the results from test data. The pixel accuracy of a class can be calculated as the ratio of the number of correctly classified pixels to the total number of pixels of that class as
$$\mathrm{PA}_c = \frac{\sum_{p} \mathbb{1}[\hat{y}_p = c \wedge y_p = c]}{\sum_{p} \mathbb{1}[y_p = c]} \quad (1)$$
where $y_p$ and $\hat{y}_p$ are the target and predicted labels of pixel $p$, and $c$ is the class.
The class-wise PA can be averaged over all classes of weed objects to calculate the mean PA (mPA). Because the exact ground truth mask for weeds is impossible to specify due to their complicated boundaries, the PA can be treated as an approximate measure of the weed area.
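As a small worked example of the class-wise PA and its mean, the following sketch computes both from integer label maps. The function name `pixel_accuracy` and the toy 4 × 4 masks are illustrative assumptions, and the denominator is taken to be the number of target pixels of each class; classes absent from the target are skipped in the mean.

```python
import numpy as np

def pixel_accuracy(pred, target, num_classes):
    """Return (per-class PA, mPA) for two integer label maps of equal shape."""
    pa = np.full(num_classes, np.nan)
    for c in range(num_classes):
        in_class = target == c          # pixels of class c in the target mask
        total = in_class.sum()
        if total > 0:
            correct = np.logical_and(in_class, pred == c).sum()
            pa[c] = correct / total
    # Classes that never occur in the target stay NaN and are ignored here.
    return pa, np.nanmean(pa)

# Toy example: 0 = background, 1 and 2 = two hypothetical weed classes.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 2, 2],
                   [0, 0, 2, 2]])
pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 0],
                 [0, 0, 2, 2],
                 [0, 1, 2, 0]])

pa, mpa = pixel_accuracy(pred, target, num_classes=3)
print(pa)   # per-class PA: 7/8, 3/4, 3/4
print(mpa)  # mean of the three values
```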

Intersection over Union (IoU) and Mean IoU (mIoU)
The IoU is the area of overlap between the predicted segmentation mask and the ground truth divided by the area of their union. In segmentation, the area is measured as the number of pixels in a segment. In addition, the object-wise IoUs can be averaged over all objects included in an image to produce the mIoU. From an implementation point of view, the IoU for weed objects is important because removing a weed with a robot or weedicide requires the end effector to be localized exactly.
In this study, we focused on the weeds that needed to be removed, but the area of background grass is usually much wider than the sparse weed areas, resulting in the mIoU being larger than the IoU of each weed object.
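The effect of a large background on the mIoU can be reproduced with a short sketch. The helper names `confusion_matrix` and `iou_per_class` and the 10 × 10 toy image are assumptions made for illustration: a 2 × 2 weed patch is predicted one pixel too far to the right, which gives a low weed IoU while the background IoU stays close to 1.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the target class, columns the predicted class."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def iou_per_class(cm):
    """IoU = intersection / (target pixels + predicted pixels - intersection)."""
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    # Classes with an empty union get IoU 0 here (a simplifying choice).
    return inter / np.maximum(union, 1)

# Toy example: class 0 = background grass, class 1 = weed.
target = np.zeros((10, 10), dtype=int)
pred = np.zeros((10, 10), dtype=int)
target[4:6, 4:6] = 1   # ground-truth weed patch
pred[4:6, 5:7] = 1     # prediction shifted by one pixel

iou = iou_per_class(confusion_matrix(pred, target, num_classes=2))
miou = iou.mean()
print(iou, miou)  # background 94/98, weed 2/6, mIoU in between
```

In this example the weed IoU is 1/3 but the mIoU is roughly 0.65, so the mean alone overstates how well the weed itself is localized.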

Implementation Details
For every experiment, we used ImageNet pre-trained weights. For the comparisons, each model implemented in the experiments was the smallest version of its respective family, i.e., Swin Transformer-tiny, SegFormer_mit-b0 and Segmenter_vit-tiny. In the Segmenter implementation, the best results were produced using two classes for semantic embeddings, namely, background and weed.
We used a single 2080 Ti GPU for training, with the same training parameters for all models. The AdamW [30] optimizer was chosen with an initial learning rate of 6 × 10^−5 and a weight decay of 0.01. The scheduler used linear learning rate decay with a linear warm-up over the first 1500 iterations, for a total of 160 k iterations. For augmentation, we adopted the default settings of MMSegmentation [31], with random horizontal flip, random rescaling in the ratio range of [0.5, 2.0], random rotation in the range of [0, 360] degrees and random photometric distortion.
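The learning rate schedule can be sketched as a plain function of the iteration number. This is an illustration rather than the exact scheduler used in the experiments: the function name `learning_rate`, the warm-up starting ratio of 1e-6 and the decay to exactly zero at the final iteration are assumptions, while the base rate, warm-up length and total iteration count follow the values stated above.

```python
def learning_rate(step, base_lr=6e-5, warmup_iters=1500,
                  total_iters=160_000, warmup_ratio=1e-6):
    """Linear decay to zero, scaled by a linear warm-up factor early on."""
    # Linear decay from base_lr at step 0 to 0 at total_iters (assumed end value).
    decayed = base_lr * (1.0 - step / total_iters)
    if step < warmup_iters:
        # Warm-up factor grows linearly from warmup_ratio (assumed) to 1.
        factor = warmup_ratio + (1.0 - warmup_ratio) * step / warmup_iters
        return decayed * factor
    return decayed

for step in (0, 750, 1500, 80_000, 160_000):
    print(step, learning_rate(step))
```

The rate rises during the first 1500 iterations, peaks just below the base value and then falls linearly for the rest of training.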

Prediction Analysis
The overall segmentation results of each Transformer are summarized in Table 2, and the expanded results can be found in Table 3. As shown in Table 2, SegFormer achieved the best performance in terms of mIoU and pixel accuracy with the smallest number of parameters, although Swin Transformer produced comparable results. Table 3 shows the results of each model for every class individually. The weeds, including clover, common ragweed, dandelion and lambsquarter, had high IoU and pixel accuracy. In contrast, the results on crabgrass and dandelion were comparatively low.
Figure 9a shows the prediction for lambsquarter, where Swin Transformer and SegFormer produced almost perfect segmentation masks, except for the shadowy region, while Segmenter made a mistake on the boundary of the leaf. In Figure 9b, both crabgrass and tall fescue are included in one image. Swin Transformer and SegFormer produced precise masks for the weeds, whereas Segmenter failed to produce accurate results. In addition, the predicted masks followed the zigzagged boundary of the weed, even though the ground truth mask only approximated it smoothly. As shown in the figure, Swin Transformer provided the best result. In Figure 9c, the image contains clover and plantain weeds, and small areas of clover were not included in the ground truth. SegFormer found the clover areas missing from the ground truth mask, but Segmenter did not, which shows that the generalization ability of SegFormer was better than that of Segmenter. In Figure 9d, the ground truth mask only contained ground ivy with a small portion of dandelion towards the lower-left image boundary. Swin Transformer successfully found the dandelion portion that the others did not. In general, an object near the image boundary is hard to locate or identify, because only limited context information is available to make a proper inference. Swin Transformer was the best in terms of this generalization property with limited context information on the image boundary.
In conclusion, SegFormer produced the best results in terms of IoU and pixel accuracy of weed objects with the smallest number of parameters, but Swin Transformer was comparable to or better than SegFormer in terms of generalization ability, although it has almost 5× as many parameters as SegFormer.

Conclusions
The removal of weeds is essential to successful turf grass management and crop cultivation. Towards this goal, we developed deep learning-based Transformer models to autonomously detect and localize 10 classes of weeds. The dataset introduced in this study includes weed images taken under variable environmental conditions. Case studies were performed on the dataset using three Transformer models: Swin Transformer, SegFormer and Segmenter. The SegFormer model achieved a final Mean Accuracy (mAcc) and Mean Intersection over Union (mIoU) of 75.18% and 65.74%, respectively. The natural next step for this work is the incorporation of the trained models into automated robots for deployment.