Research on the Applicability of Transformer Model in Remote-Sensing Image Segmentation

Abstract: Transformer models have achieved great results in the field of computer vision over the past 2 years, drawing attention from within the field of remote sensing. However, there are still relatively few studies on this model in the field of remote sensing. Which method is more suitable for remote-sensing segmentation? In particular, how do different transformer models perform in the face of the high spatial resolution and the multispectral resolution of remote-sensing images? To explore these questions, this paper presents a comprehensive comparative analysis of three mainstream transformer models, including the segmentation transformer (SETRnet), SwinUnet


Introduction
The semantic segmentation of images using computers has a wide range of application scenarios in remote sensing, medicine [1], agriculture [2,3], and other fields. The semantic segmentation of remote-sensing images is an important part of processing and analyzing remote-sensing data and is one of the most widely used areas of remote-sensing applications [4][5][6]. The accurate and rapid classification of remote-sensing images is important for urban management, resource investigation, environmental monitoring, natural disaster assessment, and military reconnaissance because it provides managers with a source of information for more-robust decision-making [7,8]. For example, Misbah et al. used remote sensing to detect nitrogen, phosphorus, and potassium in widely grown crops in Africa for the purpose of protecting the environment while increasing food production [9]. Sataer et al. analyzed remotely sensed images of the Miami Park cliffs at the edge of East Lake Michigan to study the factors of cliff deformation and prevent disasters from cliff landslides [10]. However, the rich and complex features in remote-sensing images have been a challenge for segmentation.
Early semantic segmentation methods rely mainly on manual visual interpretation to classify remote-sensing images, which not only is time-consuming and laborious but also makes classification accuracy depend heavily on experience. As the resolution of remote-sensing images continues to improve, image-element-based and object-oriented semantic segmentation methods are becoming widely used [11]. The image-element-based method relies mainly on the spectral reflection information of remotely sensed features for classification, but it does not consider the relationship between adjacent image elements [12]. Object-oriented methods classify within the framework of object-based image analysis [13]; the problems with such methods are that they are prone to noise and that the classification scale of different features is difficult to determine [14]. Owing to the improvement of computer technology, machine learning has also been applied to the research of remote-sensing image segmentation [15]. Different from the previous methods, machine learning extracts a large number of remote-sensing image features with different classifiers, further reducing human interference. The main methods include decision trees, support vector machines (SVMs), and random forests (RFs) [16]. For example, Ujjwal et al. used a large number of different advanced support vector machines to fully learn previously unlabeled data with a view to providing guidance for the ensuing research on remote-sensing applications [17]. Du et al. used two effective methods, random forest and rotation forest, for classification in order to fully learn the texture features of polarimetric synthetic aperture radar remote-sensing images [18]. Although machine learning has significantly improved efficiency and accuracy, this improvement applies only to its ability to extract the shallow-feature information of remote-sensing images. The efficiency and the accuracy of classification are still low when faced with complex remote-sensing images [19,20].
Deep-learning methods have powerful and fast modeling capabilities that can improve segmentation by using the spectral information and texture features of remote-sensing images [21]. In remote-sensing image semantic segmentation, the convolutional neural network (CNN) is a popular deep-learning method that has significantly better semantic segmentation capability than previous methods and has been widely used in both academic and industrial fields [22]. Thanks to their excellent ability to express high-level semantic features, CNNs and their derivatives have shown potential in many image semantic segmentation tasks. For example, the feature pyramid module and the attention-feature-aggregation module have been combined to improve the feature-learning capability of a CNN and to accomplish the semantic segmentation of high-resolution remote-sensing images [23]. Other studies have segmented buildings from high-resolution imagery and LiDAR data with gated residual refinement networks [24] and have built a multichannel deep convolutional neural network that learns remote-sensing information in different bands to further improve the segmentation of urban land-use features [25]. It can be seen that CNNs are often used for semantic segmentation in remote-sensing images with high spatial and spectral resolution and that they have achieved remarkable results. However, CNNs perform only moderately when learning features at different scales, and many improved methods have been proposed, such as the spatial pyramid pooling model, the skip connection structure, and the multiscale feature fusion model [26][27][28]. Even so, the segmentation results still have issues, which are due to the inability of CNNs to fully learn contextual semantic information and retain more spatial features [29].
Recently, the transformer model has achieved excellent results in the field of semantic segmentation. Compared with CNNs, this model has a more outstanding ability to learn global semantic information [30,31]. The transformer model was originally used, and achieved remarkable results, in the field of natural language processing [32]. Researchers then began to apply the transformer model to the study of image semantic segmentation. The vision transformer is the first example of the transformer model applied to image classification. Although the researchers found that the classification accuracy of the transformer method was significantly better than that of the CNN method, the vision transformer did not address the image segmentation task. The segmentation transformer (SETR) is an improved model based on the vision transformer, and it has been applied to, and has performed well in, segmentation tasks [33]. Although SETR proves that the transformer is competent for the image semantic segmentation task, it does so at a high computational cost. Because of this, the Swin Transformer uses a hierarchical transformer model to obtain multiscale features and effectively reduce computational effort [34]. TransUnet learns the features of the input image with CNNs and then feeds the feature-learning results into the transformer model, effectively combining the advantages of both the CNN and transformer models [35].
Transformers have been widely used in image semantic segmentation tasks. However, their use in the field of remote-sensing images can be improved. Key questions include the following: (1) Which method is more suitable for remote-sensing segmentation? (2) How well does the transformer perform in remote-sensing images at different scales? (3) How do different transformer models perform in the face of the high spatial resolution and the multispectral resolution of remote-sensing images? To address these issues, three-channel and four-channel remote-sensing images are used as the data sets in this study, in which the NIR band data are added. By comparing different transformer models from the perspective of segmentation accuracy and training time, the results of this study are beneficial to the selection and understanding of transformer models and provide a reference for future researchers with a view to promoting the development of fine segmentation tasks for remote-sensing images.

Transformer Model
The transformer model, a network architecture first proposed by Vaswani et al. in 2017 [36], eschews recurrence and convolution and is based entirely on attention mechanisms, as shown in Figure 1.
Specifically, the transformer model is still an encoder-decoder network structure, but with recurrence and convolution abandoned, a multihead self-attention mechanism is added to both the encoder and the decoder modules. The multihead self-attention mechanism is a key component of the transformer model because it is able to capture long-range dependencies between elements and to encode interactions between sequential tokens. As shown in Figure 2b, the multihead self-attention mechanism allows the model to jointly attend to information from different subspaces at different locations. The objective of the multihead attention mechanism is to simultaneously perform multiple parallel attention functions. A single attention function (Figure 2a) maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The input is composed of queries and keys of dimension $d_k$ and values of dimension $d_v$. The dot products of the query with all the keys are calculated and divided by $\sqrt{d_k}$, and the weights of the values are obtained by applying the softmax function. During the calculation, the attention function is computed on a set of queries simultaneously, which are packed into a matrix Q. The keys and values are packed into matrices K and V, respectively, and the attention function can be expressed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{1}$$

The multihead self-attention module can acquire information from different representation subspaces at different locations. This cannot be done with single-head attention. The output of the multihead attention module can be expressed as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{2}$$

where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learnable projection matrices and $h$ is the number of heads.

The multihead self-attention mechanism plays a crucial role in the transformer model: it can not only improve the efficiency of remote-sensing image classification but also more accurately acquire the global and local features of remote-sensing images. However, if a network does not contain an attention layer, a transformer-based network cannot be implemented unless the network is changed, which defeats the original purpose of experimenting with high efficiency and accuracy.
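To make the single-head attention function concrete, the following is a minimal sketch in plain Python, not the implementation used in this study. The helper names (`softmax`, `matmul`, `transpose`, `attention`) and the toy matrices are assumptions introduced only for illustration; matrices are lists of row vectors.

```python
import math

def softmax(row):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    # Plain triple-loop matrix product on lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Hypothetical toy input: 2 queries, 3 key-value pairs, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the rows of V. A multihead module would run several such functions on separately projected copies of Q, K, and V and concatenate the results.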

SETRnet (Segmentation Transformer)
The segmentation transformer SETRnet, proposed by Zheng et al. in 2021 [33], is the first representative vision-transformer-based semantic segmentation model. The structure of SETRnet is shown in Figure 3. SETRnet abandons the stacked convolutional feature-extraction method in the encoder and instead uses a transformer-only feature-extraction method. In the model, the image is first sliced into patches, and then all the two-dimensional image patches are treated as a one-dimensional sequence and fed into the network as a whole. The input one-dimensional sequence is mapped to a one-dimensional sequence of feature embeddings. In each layer, the input of attention consists of a (query, key, value) triplet computed from $Z^{l-1} \in \mathbb{R}^{L \times C}$ (where L is the sequence length and C is the hidden channel size). The computed query, key, and value can be expressed as follows:

$$\mathrm{query} = Z^{l-1}W_Q, \quad \mathrm{key} = Z^{l-1}W_K, \quad \mathrm{value} = Z^{l-1}W_V$$

where $W_Q, W_K, W_V \in \mathbb{R}^{C \times d}$ are the learnable parameters of the three linear projection layers and d is the dimension of (query, key, value). Then the attention function of SETRnet can be expressed as follows:

$$\mathrm{SA}(Z^{l-1}) = Z^{l-1} + \mathrm{softmax}\left(\frac{Z^{l-1}W_Q\,(Z^{l-1}W_K)^{T}}{\sqrt{d}}\right)(Z^{l-1}W_V)$$

The output of SETRnet's multihead self-attention (MSA) module is transformed by a multilayer perceptron (MLP) module with a residual skip connection; the structure is shown in Figure 4.

$$Z^{l} = \mathrm{MSA}(Z^{l-1}) + \mathrm{MLP}(\mathrm{MSA}(Z^{l-1})) \in \mathbb{R}^{L \times C}$$

Appl. Sci. 2023, 13, x FOR PEER REVIEW

SETRnet's decoder is designed with three methods, namely plain upsampling (naïve), progressive upsampling (PUP), and multilevel feature aggregation (MLA). SETRnet constructs a semantic segmentation model from a new perspective. Compared with the traditional semantic segmentation model, SETRnet models the global context in each layer of the encoder by using the transformer as the encoder, which effectively obtains the global context information and removes the semantic segmentation network's dependence on convolution.
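The slicing step described above can be sketched as follows. This is an illustrative sketch only: the function name `image_to_patch_sequence` is hypothetical, the image is a single-channel list of rows, and the patch size of 16 used in the last line is an assumption (the text gives only the 224 × 224 image size).

```python
def image_to_patch_sequence(image, patch):
    """Slice an H x W image (list of rows) into non-overlapping patch x patch
    tiles, flatten each tile, and return the tiles as a 1D sequence
    of length (H / patch) * (W / patch)."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    sequence = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one tile in row-major order.
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            sequence.append(tile)
    return sequence

# Toy 4 x 4 image whose pixel values equal their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
seq = image_to_patch_sequence(image, 2)

# Sequence length for a 224 x 224 input, assuming 16 x 16 patches.
length_224 = len(image_to_patch_sequence([[0] * 224 for _ in range(224)], 16))
```

In a full model, each flattened tile would then pass through a learnable linear projection to give the C-dimensional embeddings that form $Z^{0}$.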

SwinUnet
SwinUnet is a transformer-based network proposed by Cao et al. in 2021 [37]. SwinUnet consists of an encoder, a bottleneck, a decoder, and skip connections, as shown in Figure 5. Inspired by the Unet network, SwinUnet is designed with a symmetric decoder with a patch-expanding layer based on the Swin Transformer. Unlike the traditional multihead self-attention (MSA), the Swin Transformer module is built on shifted windows. Figure 6 shows the structure of the Swin Transformer module, which consists of LayerNorm (LN) layers, a multihead self-attention module, residual connections, and a two-layer MLP with a Gaussian error linear unit (GELU) nonlinearity. The window-based multihead self-attention (W-MSA) module and the shifted-window-based multihead self-attention (SW-MSA) module are used in two consecutive Swin Transformer modules, respectively. For such a modular composition, the consecutive Swin Transformer modules can be expressed as follows:

$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}(\mathrm{LN}(z^{l-1})) + z^{l-1}, \\
z^{l} &= \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l})) + \hat{z}^{l}, \\
\hat{z}^{l+1} &= \text{SW-MSA}(\mathrm{LN}(z^{l})) + z^{l}, \\
z^{l+1} &= \mathrm{MLP}(\mathrm{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}
\end{aligned}
$$

where $\hat{z}^{l}$ and $z^{l}$ denote the outputs of the (S)W-MSA module and the MLP module of the lth block, respectively. Then the attention function can be expressed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$$

where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$ denote the matrices of queries, keys, and values; $M^{2}$ and d denote the number of patches in the window and the dimension of the query or key, respectively; and the values in B are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M+1)}$.

SwinUnet successfully puts the transformer module into the encoder and decoder, and together with the skip connections of the Unet network, the SwinUnet network can more quickly and comprehensively acquire the global and local feature information of images, avoiding the limitation of the CNN model, which cannot acquire global and long-range feature information.
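The effect of window-based attention can be illustrated with a small sketch. It is not taken from this study: the function names are hypothetical, the cost expressions are the ones reported in the Swin Transformer paper [34] for global MSA and W-MSA on an h × w token map with C channels and window size M, and the example setting (56 × 56 tokens, C = 96, M = 7) is an assumed configuration. The `window_ids` helper is a simplified view of the cyclic shift that omits the attention masking a real SW-MSA layer needs.

```python
def msa_cost(h, w, c):
    # Global self-attention: 4hwC^2 + 2(hw)^2 C, quadratic in the token count.
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c

def wmsa_cost(h, w, c, m):
    # Window attention: 4hwC^2 + 2 M^2 hw C, linear in the token count.
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c

def window_ids(h, w, m, shift=0):
    """Window index of every token after cyclically shifting the grid by
    `shift` positions along both axes (masking omitted for brevity)."""
    per_row = w // m
    return [[(((r - shift) % h) // m) * per_row + ((c - shift) % w) // m
             for c in range(w)]
            for r in range(h)]

# Assumed setting: 56 x 56 tokens, 96 channels, 7 x 7 windows.
ratio = msa_cost(56, 56, 96) / wmsa_cost(56, 56, 96, 7)

regular = window_ids(4, 4, 2)           # W-MSA partition of a 4 x 4 grid
shifted = window_ids(4, 4, 2, shift=1)  # SW-MSA partition, shift = M // 2
```

Comparing `regular` and `shifted` shows that tokens lying in different windows under the regular partition share a window after the shift, which is how alternating the two modules lets information cross window borders.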

TransUnet
TransUnet is a network proposed by Chen et al. in February 2021 [35]; unlike SwinUnet, whose structure is based entirely on the transformer, the encoder in TransUnet does not use a pure transformer but instead uses a hybrid CNN-transformer model. The network structure is shown in Figure 7.
TransUnet first uses a CNN as a feature extractor to generate the input feature maps. The patch-embedding module then extracts 1 × 1 patches from the feature map generated by the CNN, rather than from the original image. The transformer encoder in the TransUnet network consists of L layers of MSA and MLP modules (as shown in Figure 8), so the output of the lth layer can be expressed as follows:

$$
\begin{aligned}
z'_{l} &= \mathrm{MSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \\
z_{l} &= \mathrm{MLP}(\mathrm{LN}(z'_{l})) + z'_{l}
\end{aligned}
$$

where LN(·) denotes the layer normalization operator and $z_L$ indicates the encoded image representation.

By combining the CNN and transformer in the encoder, TransUnet not only avoids the limitation of the CNN method in acquiring long-range relational features, which is caused by the locality of convolutional operations, but also avoids the problem of coarse segmentation results, which is caused by the transformer's excessive focus on modeling global context.

Accuracy Comparison of Models
To verify which transformer model is more suitable for remote-sensing image segmentation, the three transformer models were compared on two data sets, namely the Vaihingen data set and the Potsdam data set, in three respects: the segmentation results, the segmentation accuracy, and the training time.

Training Time Comparison of Models
To assess training-time consumption, this study compares the training time, in seconds, of the three transformer models on the two data sets, namely the Vaihingen data set and the Potsdam data set (Tables 1 and 2).

Experimental Setup
In this study, the networks are constructed using the PyTorch framework and the Python language, on an Intel(R) Xeon(R) Gold 5218 CPU with a GeForce RTX 2080Ti GPU. The Adam optimizer is used to optimize the training process; every model uses the same learning rate of $3 \times 10^{-4}$; the number of epochs is set to 40; and the image size is set to 224 × 224.

Metrics
The segmentation-extraction results of different transformer models can be evaluated on the basis of both subjective and objective aspects [38]. The subjective aspects include whether the segmentation-extraction results of the remote-sensing image features are complete and whether the segmentation edges of the features are clear and consistent. The objective aspect can be quantitatively calculated on the basis of the evaluation criteria, and afterward, the model classification accuracy can be assessed. The following criteria were used in this study mainly to evaluate the classification performance of the three transformer models for the data in the training data set.

F1 score: the F1 score is a metric that evaluates the model on the basis of precision and recall and expresses the extent to which the true values overlap with the predicted pixels. The F1 score is the harmonic mean of precision and recall, as defined below:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is a positive sample predicted by the model as the positive class, FP is a negative sample predicted by the model as the positive class, and FN is a positive sample predicted by the model as the negative class.

Overall classification accuracy (OA): the ratio of the number of correctly classified samples to the number of all samples, defined as follows:

$$OA = \frac{TP + TN}{TP + FN + FP + TN}$$

where TN is a negative sample predicted by the model as the negative class.

Mean intersection over union (MIoU): the MIoU is the ratio of the intersection to the union of the set of true values and the set of predicted values, averaged over the c categories. It is a global evaluation of the image classification results, as defined below:

$$MIoU = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FP_i + FN_i}$$

Kappa coefficient: the Kappa coefficient can be used to measure classification accuracy and also to test consistency. In practical classification problems, the Kappa coefficient is often used as an indicator to evaluate the "bias" of the model if the consistency between samples is poor. If $P_o$ is the overall classification accuracy, the number of real samples in each category is $a_1, a_2, \ldots, a_c$, the predicted number of samples in each category is $b_1, b_2, \ldots, b_c$, and the total number of samples is n, then the agreement expected by chance, $P_e$, and the Kappa coefficient are defined as follows:

$$P_e = \frac{a_1 b_1 + a_2 b_2 + \cdots + a_c b_c}{n^{2}}, \quad \mathrm{Kappa} = \frac{P_o - P_e}{1 - P_e}$$
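The four criteria can all be derived from one confusion matrix. The sketch below is a minimal illustration under stated assumptions, not the evaluation code of this study: the function name `segmentation_metrics`, the convention that rows are true classes and columns are predicted classes, and the toy two-class matrix are all introduced here for the example.

```python
def segmentation_metrics(cm):
    """Compute per-class F1, OA, MIoU, and Kappa from a square confusion
    matrix cm, where cm[t][p] counts samples of true class t predicted as p."""
    c = len(cm)
    n = sum(sum(row) for row in cm)
    row_totals = [sum(cm[i]) for i in range(c)]                       # a_i: true counts
    col_totals = [sum(cm[r][i] for r in range(c)) for i in range(c)]  # b_i: predicted counts

    f1_scores, ious = [], []
    for i in range(c):
        tp = cm[i][i]
        fp = col_totals[i] - tp  # predicted as class i but truly another class
        fn = row_totals[i] - tp  # truly class i but predicted as another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)

    oa = sum(cm[i][i] for i in range(c)) / n
    p_e = sum(a * b for a, b in zip(row_totals, col_totals)) / n ** 2
    kappa = (oa - p_e) / (1 - p_e)
    return {"f1": f1_scores, "oa": oa, "miou": sum(ious) / c, "kappa": kappa}

# Hypothetical two-class example with 100 pixels.
metrics = segmentation_metrics([[40, 10],
                                [5, 45]])
```

With more than two categories, OA is the trace of the matrix divided by n, which reduces to the (TP + TN) form given above in the two-class case.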

Visual Analysis of Classification Results
In order to better analyze the gap between the three transformer models in terms of segmentation details, partially cropped images in the two data sets were selected for the analysis of segmentation results in this study.The results of the Vaihingen and Potsdam data sets are shown in Figures 11 and 12, respectively.
In the first set of results from the Vaihingen data set, the SETRnet, SwinUnet, and TransUnet networks all showed missegmentation at the junction of impervious surfaces and trees.At the junction of buildings and low vegetation, SwinUnet showed fragmentation.
In the second set of results, all three networks showed errors at the junction of trees and impervious surfaces in the upper right corner, but SETRnet and SwinUnet showed more classification errors. The classification results of TransUnet were more complete overall, and its contours were the most clearly continuous of the three networks.
In the third set of results, SETRnet better segmented the area compared with SwinUnet and TransUnet but misclassified at the junction of low vegetation and trees.TransUnet was the only network that correctly distinguished low vegetation from trees, while SwinUnet showed edge jaggedness.
In the fourth set of results, SETRnet and SwinUnet could not identify the vehicles well, resulting in broken classification results and distorted vehicle contours for both networks. TransUnet identified the vehicle contours well, but there were also misclassifications at the junction of vehicles and low vegetation.
The four sets of segmentation results show that when features lie close to each other, especially when a large-area feature adjoins a small-area feature, the transformer model cannot handle the relationship between such features well, and it often misclassifies small-area features as large-area features. At the same time, we can see that there are great differences between the three models for vehicle segmentation. TransUnet is the best for vehicle segmentation in terms of both vehicle profile and number, SwinUnet is the second best, and SETRnet is the worst.
The first set of results in the Potsdam data set shows severe fragmentation and profile breakage in SETRnet. TransUnet misclassified within the low vegetation area, while SwinUnet was successful in identifying the low vegetation and separating it from the impervious surface.
The second set of results indicated a clear error for TransUnet at the junction of road and background, while the classifications of SETRnet and SwinUnet were complete. SETRnet showed a wavy profile between the low vegetation area and the background area, and SwinUnet showed a jagged profile.
In the third set of results, SwinUnet was more complete and accurate in its overall classification than SETRnet and TransUnet were. TransUnet did not identify vehicles well, and SETRnet misclassified impervious surface areas and low vegetation areas.
The segmentation results in the figure show that SwinUnet has the best segmentation in the large-scale Potsdam data set, followed by SETRnet and then by TransUnet. TransUnet exhibits a very different segmentation result from that in the Vaihingen data set, with large feature confusion and more segmentation fragmentation. This indicates that the method combining a CNN with a transformer is not well suited to large-scale remote-sensing data sets. In the face of large-scale data sets, the method combining a CNN with a transformer, owing to its strong local feature-extraction ability, faces an explosion of feature information when extracting and processing features, which leads to a poor segmentation effect. The better segmentation effect of SwinUnet compared with SETRnet also indicates that enhancing the local attention of a transformer can, in the face of large-scale data, effectively improve its ability to extract global and local feature information. Among the four sets of segmentation results, different degrees of shadow exist on different features within the original image. Among the three models, only SwinUnet can reduce the shadow effect and segment the features well, while the other two models produced more incorrect segmentation in the shadowed parts, with SETRnet producing fewer such errors than TransUnet. This situation may be due to the transformer's insufficient learning ability for shadows. At the same time, we can see a significant improvement in the segmentation effect of SwinUnet and SETRnet for vehicles in the Potsdam data set. This shows that as the data set scale expands and the sample size of the ground objects increases, the segmentation accuracy of the transformer for small-scale ground objects also increases.

Accuracy Comparison of Results
The classification results on the Vaihingen test set were collated, and the accuracy evaluation results of the three models are compared in Table 3. In the precision comparison, TransUnet had the highest precision in all categories except the tree category, where SETRnet had the highest precision. In the recall comparison, SETRnet and SwinUnet were the highest in the low vegetation and impervious surfaces categories, respectively, and TransUnet had the highest recall in the remaining categories. In the F1 score comparison, TransUnet had the highest F1 value in all categories. In the Kappa comparison, SETRnet reached 73.80%, SwinUnet reached 77.50%, and TransUnet was the highest, at 80.54%. TransUnet also had the highest MIoU, at 56.25%, an improvement of 8.57% and 4.77% relative to SETRnet and SwinUnet, respectively. In the OA comparison, TransUnet remained the highest, at 85.55%, with SETRnet and SwinUnet at 80.50% and 83.29%, respectively.

The classification results on the Potsdam test set were collated in the same way, and the accuracy evaluation results of the three models are compared in Table 4. In the precision comparison, SETRnet was the highest in the impervious surfaces category, SwinUnet was the highest in the car and clutter/background categories, and TransUnet was the highest in the building, low vegetation, and tree categories. In the recall comparison, SETRnet was the highest in the building and clutter/background categories, SwinUnet was the highest in the low vegetation and tree categories, and TransUnet was the highest in the impervious surfaces and car categories. In the F1 score comparison, SETRnet obtained the highest value in the building category, SwinUnet in the impervious surfaces, low vegetation, and tree categories, and TransUnet in the car category. In the Kappa comparison, SwinUnet was the highest, at 76.47%, which was 4.4% and 8.29% higher than SETRnet and TransUnet, respectively. SwinUnet was also the highest in MIoU, at 63.62%, while SETRnet and TransUnet reached 59.97% and 58%, respectively. In the OA comparison, SwinUnet remained the highest, at 85.01%, an improvement of 6.5% and 9.25% relative to the OA values of SETRnet and TransUnet, respectively.

To demonstrate that the experimental results on the two data sets are not chance events, the Kappa coefficients of the models on the two data sets were tested for effect size. The results are shown in Table 5. Cohen's d measures the size of the difference between two groups; the larger the effect size, the greater the difference. Generally, 0.2 ≤ d < 0.5 is called a small effect, 0.5 ≤ d < 0.8 is called a medium effect, and d ≥ 0.8 is called a large effect. As can be seen in Table 5, the Cohen's d value for the Kappa coefficient across the two data sets is 0.795, which is a medium effect and very close to a large effect. This also indicates that the experimental results on both data sets are not chance events and are statistically significant.

The whole training process of SwinUnet was relatively stable: its accuracy increased steadily, and its loss decreased steadily (Figure 13). Both SETRnet and TransUnet fluctuated; however, all the models converged after about the 25th epoch.
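The accuracy measures compared in this section (precision, recall, F1 score, Kappa, MIoU, and OA) can all be derived from a single confusion matrix. The following is a minimal sketch, assuming rows index the true class and columns the predicted class; the function names and the toy matrix are illustrative and are not taken from the paper. The paper does not state how Cohen's d was computed on the Kappa coefficients, so the pooled-standard-deviation form shown here is an assumption.

```python
import numpy as np

def segmentation_metrics(cm):
    """Per-class and overall metrics from a confusion matrix.

    Assumption: cm[i, j] counts pixels of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=np.float64)
    total = cm.sum()
    tp = np.diag(cm)                 # correctly classified pixels per class
    true_count = cm.sum(axis=1)      # pixels that truly belong to each class
    pred_count = cm.sum(axis=0)      # pixels predicted as each class
    eps = 1e-12                      # guards against empty classes

    precision = tp / np.maximum(pred_count, eps)
    recall = tp / np.maximum(true_count, eps)
    f1 = 2 * precision * recall / np.maximum(precision + recall, eps)
    iou = tp / np.maximum(true_count + pred_count - tp, eps)

    oa = tp.sum() / total                              # overall accuracy
    pe = (true_count * pred_count).sum() / total ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)

    return {"precision": precision, "recall": recall, "f1": f1,
            "oa": oa, "kappa": kappa, "miou": iou.mean()}

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed form)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

if __name__ == "__main__":
    toy_cm = [[5, 1],
              [2, 2]]  # hypothetical two-class example
    for name, value in segmentation_metrics(toy_cm).items():
        print(name, value)
```

Working from the confusion matrix keeps all six measures consistent with one another, since they share the same per-class counts.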

Comparison of CNNs
Figure 14 compares the segmentation results of the transformers with those of Unet [27], DeepLab V3+ [39], and MAnet (multiscale attention net) [40] on the Vaihingen data set. SwinUnet and SETRnet were significantly better than the CNNs at segmenting large-scale features, which further indicates that a transformer improves the ability to learn large-scale features [41]. Feature confusion occurred less often with SwinUnet, whereas feature misclassification is common in the CNNs, mainly because the feature-extraction ability of convolution is not as good as that of transformers. However, the segmented edges of both pure transformer networks appear less fine than those of the convolutional networks, and TransUnet, which contains a CNN, produces better results than the other two transformers. This indicates that the transformer's edge-extraction capability still needs improvement and that its ability to learn spatial feature information must be strengthened.

Discussion
In the Potsdam data set experiment, SwinUnet was more suitable for segmenting features in large-scale remote-sensing images, while TransUnet performed relatively poorly, with an overall accuracy inferior to that of the other models. This is because the whole network structure of SwinUnet is built from transformers, which improves the model's ability to exchange global semantic information. Moreover, the fusion of features at different scales through skip connections enhances the model's ability to make pixel-level segmentation predictions [42]. In the Vaihingen data set experiment, TransUnet not only had the best overall accuracy but also had the highest accuracy for the individual feature types. This is because SwinUnet's transformer-based encoder is still inferior to a CNN at extracting features from small-scale images. TransUnet, by contrast, combines a CNN with a transformer, using an appropriate convolutional bias to obtain more local feature information, which strengthens the transformer and accelerates its convergence. Thus, TransUnet has better segmentation results on the Vaihingen data set and better segmentation details and contours for features [43]. Therefore, the scale of the remote-sensing images should be considered before selecting a transformer. SwinUnet is preferable for large-scale images and TransUnet for smaller-scale images, while SETRnet is not suitable as a remote-sensing image segmentation network. Meanwhile, the comparison between the transformers and the CNNs shows that a transformer is inferior to a CNN at segmenting edge features but significantly better at segmenting large-scale features. This may be because a transformer focuses too heavily on global features and therefore ignores some edge features.

Conclusions
In this study, we investigated which transformer model is more suitable for remote-sensing image feature segmentation by evaluating the performance of different transformer models. First, three transformer models were briefly described, and the network structure of each model was constructed separately. Experiments were then conducted on two data sets, namely the Vaihingen and Potsdam data sets, and the SETRnet, SwinUnet, and TransUnet models were compared through a visual analysis of their feature-segmentation results and an assessment of their accuracy and training time. The three models were further compared with CNNs. This research will aid in the understanding of different transformer models and in the selection of more-suitable transformer models for remote-sensing image feature segmentation in future experiments. The results indicated that SwinUnet performed better on the large-scale Potsdam data set thanks to its excellent global semantic interaction and pixel-level segmentation prediction ability. TransUnet benefits from a network structure built jointly from a transformer and a CNN, and it has the highest accuracy on the small-scale Vaihingen data set. Compared with SwinUnet and TransUnet, SETRnet is not suitable for the extraction of remote-sensing image features. At the same time, the experimental results show that a transformer has obvious advantages for the segmentation of large-scale objects but that a pure transformer structure is not suitable for remote-sensing image segmentation. For remote-sensing data of different scales, researchers need to choose appropriate transformer models and improvement methods.
In the future, we should pay more attention to the following two areas. First, the transformer model's ability to extract the edges of features is insufficient. We should address the model's tendency to focus too heavily on global semantic relationships while ignoring edge details. Second, we should invest in expanding the application of different transformer models to the segmentation and extraction of remote-sensing image features and further verify their effectiveness.

Figure 4. SETRnet's transformer layer structure.

SETRnet's decoder is designed with three methods, namely plain upsampling (naïve), progressive upsampling (PUP), and multilevel feature fusion (MLA). SETRnet constructs a new semantic segmentation model from a new perspective. Compared with the traditional semantic segmentation model, SETRnet models the global context in each layer of the encoder by using the transformer as the encoder, which effectively obtains the global context information and removes the semantic segmentation network's dependence on convolution.


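where $Q, K, V \in \mathbb{R}^{M^2 \times d}$ denote the matrices of queries, keys, and values; $M^2$ and $d$ denote the number of patches in the window and the dimension of the query or key, respectively; and $B$ comes from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$.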
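The symbols defined here belong to windowed self-attention with a relative position bias. A minimal NumPy sketch follows, assuming the equation these definitions refer to is the standard Swin Transformer form Attention(Q, K, V) = SoftMax(QK^T/√d + B)V. The function names, the single-head layout, and the random inputs are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def relative_position_bias(bias_table, M):
    """Expand a (2M-1, 2M-1) bias table into an (M*M, M*M) bias matrix B.

    Entry B[i, j] depends only on the row/column offset between patches
    i and j inside an M x M window.
    """
    coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"))
    coords = coords.reshape(2, -1)                    # (2, M*M) patch coordinates
    rel = coords[:, :, None] - coords[:, None, :]     # offsets in [-(M-1), M-1]
    rel = rel + (M - 1)                               # shift to [0, 2M-2]
    return bias_table[rel[0], rel[1]]                 # (M*M, M*M)

def window_attention(Q, K, V, bias_table, M):
    """Single-head attention inside one window: softmax(QK^T / sqrt(d) + B) V."""
    d = Q.shape[-1]
    B = relative_position_bias(bias_table, M)
    scores = Q @ K.T / np.sqrt(d) + B
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, d = 2, 4                                       # hypothetical window size and dimension
    Q, K, V = (rng.normal(size=(M * M, d)) for _ in range(3))
    table = rng.normal(size=(2 * M - 1, 2 * M - 1))   # learnable in a real model
    out, weights = window_attention(Q, K, V, table, M)
    print(out.shape, weights.sum(axis=-1))
```

Because the bias is indexed by relative offset, the table needs only (2M-1) × (2M-1) entries even though B itself has M² × M² entries.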

3.1.1. Data Set

In order to better compare the segmentation effect of transformers on remote-sensing data sets at different scales, the experimental data were obtained from two state-of-the-art airborne image data sets provided by ISPRS: the Vaihingen data set and the Potsdam data set. The two data sets have the same types of features but different scales.

Vaihingen data set: Vaihingen is a relatively small village in Germany with a number of detached buildings and small multistory buildings. The Vaihingen data set contains 33 remote-sensing images of different sizes covering an area of 1.38 km² in Vaihingen. The spatial resolution of the TOP (true orthophoto) image and DSM is 9 cm, and each image has four bands: near-infrared, red, green, and blue. We cropped the 33 remote-sensing images of different sizes into small images with a size of 224 × 224, and the cropped images were divided into a training set, a validation set, and a test set according to the ratio of 8:1:1. The percentage of each type of feature in the data set is shown in Figure 9.

Appl. Sci. 2023, 13, x FOR PEER REVIEW

Figure 9. Proportion of each type of feature in the Vaihingen data set.

Potsdam data set: Potsdam is a historic city with large buildings, narrow streets, and a dense settlement structure. The Potsdam data set contains 38 remotely sensed images of the same size (6000 × 6000) covering an area of 3.42 km² in Potsdam. The spatial resolution of the TOP (true orthophoto) image and DSM is 5 cm, and each image has four bands: near-infrared, red, green, and blue. Similarly, we cropped the 38 remote-sensing images into small images with a size of 224 × 224, and the cropped images were divided into a training set, a validation set, and a test set according to the ratio of 8:1:1. The percentage of each type of feature in the data set is shown in Figure 10.
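The preprocessing described for both data sets (cropping to 224 × 224 tiles and splitting 8:1:1) can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say whether tiles overlap, how partial edge tiles are handled, or how the split is randomized, so the sketch uses non-overlapping tiles, drops partial edge tiles, and shuffles with a fixed seed. The function names `crop_tiles` and `split_8_1_1` are hypothetical.

```python
import numpy as np

def crop_tiles(image, tile=224):
    """Cut an (H, W, C) image into non-overlapping tile x tile patches.

    Assumption: partial tiles at the right and bottom edges are dropped.
    """
    h, w = image.shape[:2]
    return [image[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

def split_8_1_1(n_items, seed=0):
    """Return shuffled index arrays for the train, validation, and test sets."""
    order = np.random.default_rng(seed).permutation(n_items)
    n_train = n_items * 8 // 10
    n_val = n_items // 10
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])      # the remainder goes to the test set

if __name__ == "__main__":
    # Small synthetic four-band image standing in for a real scene.
    scene = np.zeros((500, 700, 4), dtype=np.uint8)
    tiles = crop_tiles(scene)
    train, val, test = split_8_1_1(len(tiles))
    print(len(tiles), len(train), len(val), len(test))
```

Under these assumptions, a 6000 × 6000 Potsdam image yields 26 × 26 tiles, because 6000 is not a multiple of 224 and the remaining strip is discarded.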

Figure 10. Proportion of each feature type in the Potsdam data set.

Figure 11. Comparison of the transformers' segmentation results in the Vaihingen data set. The red boxes are where the differences between the three network segmentations are large.

Figure 12. Comparison of the transformers' segmentation results in the Potsdam data set. The red boxes are where the differences between the three network segmentations are large.

Table 1 shows the training time of the three models on the Vaihingen data set. SwinUnet has the shortest training time, followed by SETRnet, and TransUnet has the longest. Per epoch, SwinUnet is 36.79 s and 79.75 s faster than SETRnet and TransUnet, respectively.
3.2.5. Training Process of Different Transformers

Among the three models, SETRnet fluctuated at the beginning of training, and its final accuracy and loss values were the worst. TransUnet reached its best results directly after the fluctuation. This shows that SwinUnet is more robust and less difficult to train. The training result of TransUnet is better than those of the SwinUnet and SETRnet models, but it is relatively difficult to train: it is necessary to set an appropriate learning rate and to adjust the training strategy at the same time, including the number of epochs and the learning-rate decay strategy.
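The paper does not state which learning-rate decay strategy was used. As one common option, the sketch below shows a linear warmup followed by cosine decay; the schedule itself, the function name, and every default value (`base_lr`, `warmup_epochs`, `min_lr`) are assumptions chosen for illustration only.

```python
import math

def learning_rate(epoch, total_epochs, base_lr=1e-3, warmup_epochs=5, min_lr=1e-6):
    """Learning rate for a given epoch: linear warmup, then cosine decay.

    All default values are hypothetical and would need tuning per model.
    """
    if epoch < warmup_epochs:
        # Ramp up linearly so early updates do not destabilize training.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the decay phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

if __name__ == "__main__":
    for epoch in range(0, 50, 5):
        print(epoch, learning_rate(epoch, total_epochs=50))
```

A gradual warmup is often used to damp the kind of early fluctuation described above, although whether it would help these particular models is not established by the paper.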

Figure 13. Training process of transformers on the Vaihingen data set.

Figure 14. Classification results of CNN and transformers on the Vaihingen data set. The red boxes are where the differences between the five network segmentations are large.

Table 1. Training time of the Vaihingen data set.

Table 2 shows the training time of the three models on the Potsdam data set. The training times of SwinUnet and TransUnet are basically equal, and both are more than 10,000 s shorter than that of SETRnet, that is, more than 300 s faster per epoch on average.

Table 2. Training time of the Potsdam data set.

Table 3. Evaluation table of classification results in the Vaihingen data set. The bolded numbers indicate the model with the best performance in terms of accuracy.

Table 4. Evaluation table of classification results in the Potsdam data set. The bolded numbers indicate the model with the best performance in terms of accuracy.