TMNet: A Two-Branch Multi-Scale Semantic Segmentation Network for Remote Sensing Images

Pixel-level information from remote sensing images is of great value in many fields. CNNs have a strong ability to extract backbone image features, but because convolution is a local operation, it is difficult for them to directly capture global feature information and contextual semantic interactions, which makes it hard for a pure CNN model to achieve high-precision results in the semantic segmentation of remote sensing images. Inspired by the global feature-encoding capability of the Swin Transformer, we design a two-branch multi-scale semantic segmentation network (TMNet) for remote sensing images. The network adopts a dual-encoder, single-decoder structure. The Swin Transformer branch increases the ability to extract global feature information. A multi-scale feature fusion module (MFM) is designed to merge shallow spatial features from images at different scales into deep features. In addition, a feature enhancement module (FEM) and a channel enhancement module (CEM) are proposed and added to the dual encoder to strengthen feature extraction. Experiments on the WHDLD and Potsdam datasets verify the strong performance of TMNet.


Introduction
In the field of image processing, semantic segmentation of remote sensing images has become one of the most active research topics. It classifies each pixel in the image, and the resulting classification information is valuable in many fields, such as land planning [1], natural disaster assessment [2], and land surveying [3].
Remote sensing images contain a large amount of ground-object information, and the same class of objects can appear diverse and complex at different times or locations [4]. Compared with street scenes and everyday object images [5,6], the semantic segmentation of remote sensing images is considerably more difficult. Because these images are captured from long distances over large areas, object sizes are highly inconsistent, and large intra-class variance combined with small inter-class variance arises easily. Large size variation within the same target class, pronounced color and texture differences, and the difficulty of distinguishing mixed objects of different classes all make remote sensing image segmentation harder than segmentation of other images.
The rapid development of convolutional neural networks (CNNs) has played an important role in advancing research in image processing, but CNN architectures were long used to classify whole images rather than each pixel. The invention of the fully convolutional network (FCN) [7] was a milestone for pixel-level classification. The main contributions of this paper are as follows:
1. A two-branch multi-scale semantic segmentation network (TMNet) for remote sensing images is proposed, improving image classification performance through the global feature-encoding capability of the Swin Transformer.
2. The MFM is proposed to merge shallow spatial features of images at different scales into deep features, increasing the fusion of multi-scale deep image features so that finer details are recovered during classification.
3. The FEM and CEM are added to the main and auxiliary encoders, respectively, to enhance feature extraction. The FEM strengthens feature information interaction by computing the relationships between input features and updatable feature storage units; the CEM encodes the spatial information of the Swin Transformer by establishing inter-channel correlations, further increasing the spatial coherence of global features.
The remainder of this paper is organized as follows: Section 1 summarizes the current state of research in remote sensing image segmentation and the methodology and contributions of this paper. Section 2 presents work related to the proposed method. Section 3 describes the method in detail, and Section 4 is the experimental part, covering the parameters, results, and analysis of the experiments. Section 5 concludes the paper.

Classical Image Semantic Segmentation Network
In recent years, remote sensing image segmentation methods based on deep learning frameworks have made great progress. DenseASPP [15] uses dilated convolutions with different dilation rates to enhance its global modeling capability. PSPNet [16] uses a pyramid pooling module to acquire features and fuses features at multiple scales through pooling operations of different sizes. DenseU-Net [17] improves image classification accuracy by deepening the convolutional layers and using a U-Net architecture to aggregate small-scale features. BiSeNet [18] is a bilateral segmentation network designed to extract spatial information and generate high-resolution features with small strides, and to obtain a sufficient receptive field through a context path with a fast downsampling strategy. BSNet [19] enhances global contextual information interaction and boundary information extraction through dynamic gradient convolution and coordinate-sensitive attention. HRNet [20] enhances feature fusion through iterative information exchange among multi-level features and combines convolutions at multiple scales to improve spatial information accuracy. DBFNet [21] uses bilateral filtering to enhance the extraction of boundary information and filter noise, combining feature nonlinearities at different scales. OANet [22] is a semantic segmentation network with directional attention, which uses asymmetric convolutions to focus on features in different directions and explore their anisotropy, enhancing the semantic feature representation of the model. All the above methods fuse local CNN features to form global feature information, rather than directly encoding global features.

Work on Transformer
In recent years, the Transformer [23] has moved from natural language processing into image processing. Whereas a CNN obtains local feature information through convolution, the Transformer can capture long-range image dependencies and encode global feature information using a self-attention mechanism [24]. The Transformer divides the whole image into multiple tokens and uses a multi-head attention mechanism to explore the feature relationships among all tokens and encode global features. Its success in global relational feature modeling has inspired many research fields. SETR [25] first applied the Transformer architecture to image segmentation, introducing a sequence-to-sequence classification model that greatly alleviated the difficulty of obtaining a global receptive field. Segmenter [26] designs an encoder-decoder structure using the Transformer and converts the processed tokens into pixel-level annotations. PVT [27] combines the Transformer with a pyramid structure, using the pyramid module to reduce computation while densifying feature training. SegFormer [28] designs a hierarchical Transformer architecture with an improved MLP decoder that fuses features from different layers to improve classification performance. RSSFormer [29] embeds the Transformer into HRNet and designs adaptive multi-head attention and an MLP with dilated convolution to increase the foreground saliency of remote sensing images. DAFormer [30] designs a multi-level context-aware structure based on the Transformer, effectively enhancing contextual semantic interaction. However, because the Transformer computes attention over the entire image, the rapidly increasing training cost hinders its application when the image size is large [31].
The Swin Transformer [32] divides the image into different windows and restricts attention computation to within each window, which makes its complexity linear in image size. The shifting of window partitions between layers is a key component of the Swin Transformer architecture: the shifted windows connect the windows of the previous layer and significantly improve modeling capability. With only linear computational complexity, the Swin Transformer delivers advanced performance in a variety of fields such as video processing [33], image generation [34], and image segmentation [35].

Methods
In this section, we will explain in detail the architecture of TMNet and the main modules that make up this network architecture.

Network Structure
The overall network architecture of TMNet is shown in Figure 1. A dual-encoder, single-decoder architecture is adopted, in which a CNN module with powerful feature extraction capability serves as the main encoder and a Swin Transformer module with global feature-encoding capability serves as the auxiliary encoder. A given remote sensing image is first divided into patches, and linear embedding produces features of size 128 × H/2 × W/2 while enhancing the semantic features; the Swin Transformer module then strengthens global feature modeling and feature extraction. In addition, we add a CEM to each Swin Transformer block to enhance the feature relationships between channels; the patch merging layer reduces spatial resolution by downsampling and increases dimensionality by concatenating features. The three outputs of the auxiliary encoder, of sizes 128 × H/2 × W/2, 256 × H/4 × W/4, and 512 × H/8 × W/8, are summed with the corresponding outputs of the main encoder, enhancing the global modeling capability of the network. In the main encoder, the CNN modules extract the image backbone features, the FEM enhances feature extraction by computing relationships between the input data and updatable feature storage units, and the MFM fuses features at different scales through SoftPool operations of different sizes, especially enhancing the extraction of small-scale features. The decoder is implemented primarily with convolution and bilinear interpolation upsampling, and skip connections increase the contextual relationship between the encoder and decoder. After three decoding steps, a 1 × 1 convolutional layer and the argmax function produce the final predicted image.
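The dual-branch fusion described above can be sketched as follows. This is only an illustration of the element-wise summation of the two encoders' outputs at the three stated scales, assuming a 256 × 256 input; the tensors here stand in for the real encoder outputs.

```python
import torch

# Hypothetical stand-ins for the encoder outputs of a 256x256 input,
# following the three fusion scales stated in the text.
H = W = 256
main_feats = [torch.randn(1, 128, H // 2, W // 2),   # CNN-Block1 output
              torch.randn(1, 256, H // 4, W // 4),   # CNN-Block2 output
              torch.randn(1, 512, H // 8, W // 8)]   # CNN-Block3 output
aux_feats = [torch.randn(1, 128, H // 2, W // 2),    # Swin stage 1
             torch.randn(1, 256, H // 4, W // 4),    # Swin stage 2
             torch.randn(1, 512, H // 8, W // 8)]    # Swin stage 3

# Each auxiliary (Swin) output is summed element-wise with the matching
# main-encoder output before the decoder consumes it.
fused = [m + a for m, a in zip(main_feats, aux_feats)]
```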

CNN Blocks
CNNs have been successful in many fields due to their powerful feature extraction capabilities. The CNN blocks comprise three modules, and Table 1 shows the detailed parameters of each. The parameter order for Conv2d is in_channels, out_channels, kernel_size, stride, and padding, with all other parameters left at their defaults. MaxPool2d(2) indicates that max pooling halves the input size, and BN indicates BatchNorm.
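A minimal sketch of one such block, assuming the Conv2d-BN-ReLU-MaxPool layout implied by Table 1; the channel widths here are illustrative, not the exact values from the table.

```python
import torch
import torch.nn as nn

# One CNN block: Conv2d(in, out, kernel, stride, padding) -> BN -> ReLU,
# followed by MaxPool2d(2) to halve the spatial size.
def cnn_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch),          # "BN" in Table 1
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),                 # halves the input size
    )

block = cnn_block(3, 128)
y = block(torch.randn(1, 3, 256, 256))
```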

Multi-Scale Feature Fusion Module
CNN-based models downsample through convolutional operations during feature extraction, which can easily lead to the loss of small-scale features. To address this, we propose a multi-scale feature fusion module (MFM), as shown in Figure 2. We combine SoftPool [36] with convolution operations; SoftPool weights activations according to a probability distribution, retaining finer feature cues than AvgPool or MaxPool. The MFM uses SoftPool to extract multi-scale features by downsampling at scales of 2, 4, 8, and 16. A 1 × 1 convolution then refines the semantic feature information, and bilinear interpolation upsampling restores the feature size. Subsequently, a 3 × 3 convolution is applied, and the features are concatenated to fuse multi-scale semantic features for feature class recovery.

SoftPool is computed as follows:
ã = Σ_{i∈R} (e^{a_i} / Σ_{j∈R} e^{a_j}) · a_i,
where a_i are the activations in the pooling region R. The MFM execution process can be expressed as follows:
Y_n = SoftPool_n(X),
T_n = Bi(σ(Conv_{1×1}(Y_n))),
W_n = σ(Conv_{3×3}(T_n)),
Z = Concat(W_2, W_4, W_8, W_{16}),
where Bi(·) stands for bilinear interpolation upsampling and σ represents the ReLU function. The input features X ∈ R^{c×h×w} pass through SoftPool to give Y_n ∈ R^{c×h/n×w/n}, where n is the downsampling rate. The features Y_n are convolved with a 1 × 1 kernel and passed through the ReLU function, halving the number of channels, and bilinear interpolation upsampling restores the original size, giving T_n ∈ R^{c/2×h×w}. For further feature extraction, a 3 × 3 convolution and ReLU activation reduce the number of channels to one-fourth of the original, giving W_n ∈ R^{c/4×h×w}. Finally, the four different-scale features are concatenated for multi-scale feature fusion, and the number of channels returns to the original, giving the module output Z ∈ R^{c×h×w}.
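The MFM pipeline can be sketched in PyTorch as follows. This is a sketch under the channel arithmetic stated in the text (1 × 1 convolution halves channels, 3 × 3 convolution quarters them); the SoftPool here follows the exponentially weighted average definition of [36].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def softpool2d(x, kernel):
    """SoftPool [36]: average pooling weighted by softmax of activations."""
    w = torch.exp(x)
    return F.avg_pool2d(x * w, kernel) / F.avg_pool2d(w, kernel)

class MFM(nn.Module):
    def __init__(self, c):
        super().__init__()
        # 1x1 convs halve channels; 3x3 convs reduce them to c/4.
        self.reduce = nn.ModuleList(nn.Conv2d(c, c // 2, 1) for _ in range(4))
        self.refine = nn.ModuleList(
            nn.Conv2d(c // 2, c // 4, 3, padding=1) for _ in range(4))

    def forward(self, x):
        h, w = x.shape[2:]
        outs = []
        for n, conv1, conv3 in zip((2, 4, 8, 16), self.reduce, self.refine):
            y = softpool2d(x, n)                       # c x h/n x w/n
            t = F.interpolate(F.relu(conv1(y)), size=(h, w),
                              mode='bilinear', align_corners=False)
            outs.append(F.relu(conv3(t)))              # c/4 x h x w
        return torch.cat(outs, dim=1)                  # back to c channels

z = MFM(512)(torch.randn(1, 512, 32, 32))
```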

Feature Enhancement Module
CNN has strong feature detection capability, but its understanding of features is limited, and the spatial resolution of features gradually decreases as depth and layer count increase, which hinders the prediction of target locations in remote sensing images [37]. To address this problem, this paper proposes a feature enhancement module (FEM) to deepen the relationships between features and improve encoder performance. An FEM is added after each CNN module to further learn that module's output features, as shown in Figure 3. The FEM is designed with two updatable feature storage units that retain the feature relationships of the data, allowing the distribution weights of the features to be learned while keeping the feature size constant; adding normalization after each parameter unit further densifies the features and accelerates model convergence. At the same time, the FEM has few parameters, giving a good trade-off between model efficiency and accuracy. The FEM can be expressed as follows:
O = N(M_b ⊗ N(M_a ⊗ F)),
where the input features are F ∈ R^{c×h×w}, the output features are O ∈ R^{c×h×w}, the learnable feature storage units are M_a ∈ R^{c×c} and M_b ∈ R^{c×c}, N(·) denotes BatchNorm, and ⊗ denotes matrix multiplication.
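A sketch of the FEM under the symbol definitions given in the text. The exact composition of the two matrix products is an assumption on our part: each learnable c × c storage unit multiplies the flattened feature map and is followed by BatchNorm, with the identity initialization here being purely illustrative.

```python
import torch
import torch.nn as nn

class FEM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.Ma = nn.Parameter(torch.eye(c))   # learnable storage unit M_a
        self.Mb = nn.Parameter(torch.eye(c))   # learnable storage unit M_b
        self.n1 = nn.BatchNorm2d(c)            # N(.) after the first unit
        self.n2 = nn.BatchNorm2d(c)            # N(.) after the second unit

    def forward(self, f):
        b, c, h, w = f.shape
        x = f.flatten(2)                       # b x c x (h*w)
        x = self.n1((self.Ma @ x).view(b, c, h, w)).flatten(2)
        return self.n2((self.Mb @ x).view(b, c, h, w))

o = FEM(64)(torch.randn(2, 64, 16, 16))  # feature size is unchanged
```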

Swin Transformer
The standard Transformer establishes global relationships among all tokens. Its output at layer l can be expressed as:
ŝ^l = MSA(LN(s^{l-1})) + s^{l-1},
s^l = MLP(LN(ŝ^l)) + ŝ^l, (5)
where MSA denotes multi-head self-attention and LN denotes LayerNorm. Because the Transformer applies MSA to compute global feature information among all tokens, its computational complexity grows quadratically with image area, which is unfriendly to practical application.
Unlike the standard Transformer module, the Swin Transformer applies the attention mechanism within different windows divided from the image and enhances global feature relationships by shifting between windows, as shown in Figure 4b: the Swin Transformer block consists of a window multi-head self-attention (W-MSA) module connected to a shifted-window multi-head self-attention (SW-MSA) module. In the Swin Transformer, the image is first divided into patches by patch partition according to position, and the input is converted into sequences and embedded as tokens; the linear embedding layer then changes the dimensions, and the uniformly encoded tokens pass through Swin Transformer blocks and patch merging layers to generate a global feature relationship representation. The Swin Transformer uses multi-head self-attention to divide the features into multiple heads, where each head explores different local information relationships, and global information is obtained by fusing the results from the different heads. This can be expressed as follows:
ŝ^l = W-MSA(LN(s^{l-1})) + s^{l-1},
s^l = MLP(LN(ŝ^l)) + ŝ^l,
ŝ^{l+1} = SW-MSA(LN(s^l)) + s^l,
s^{l+1} = MLP(LN(ŝ^{l+1})) + ŝ^{l+1},
where ŝ^l and s^l denote the outputs of the lth (S)W-MSA and MLP modules, respectively. The number of layers in each stage of the Swin Transformer configured in this experiment is (2, 2, 2), the number of heads in each stage is (3, 6, 12), and the partition window size is set to 8.
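The residual layout of one W-MSA/SW-MSA block pair can be sketched as below. This is only a structural sketch: a real Swin block partitions tokens into 8 × 8 windows and cyclically shifts them before attention, whereas the attention here runs over all tokens as a stand-in.

```python
import torch
import torch.nn as nn

class SwinPair(nn.Module):
    # Pre-LayerNorm residual structure of the four equations above:
    # W-MSA -> MLP -> SW-MSA -> MLP, each with a skip connection.
    def __init__(self, dim, heads):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        self.wmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.swmsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                          nn.Linear(4 * dim, dim)) for _ in range(2))

    def forward(self, s):
        x = self.norms[0](s)
        s = self.wmsa(x, x, x)[0] + s            # s_hat^l
        s = self.mlps[0](self.norms[1](s)) + s   # s^l
        x = self.norms[2](s)
        s = self.swmsa(x, x, x)[0] + s           # s_hat^{l+1}
        return self.mlps[1](self.norms[3](s)) + s

out = SwinPair(128, 4)(torch.randn(1, 64, 128))  # batch x tokens x dim
```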

Channel Enhancement Module
The Swin Transformer constructs patch token relationships within a limited number of windows, significantly reducing computational complexity. However, this approach limits the window-to-window channel modeling capability to some extent, even with the alternating window and shifted-window strategy. In this paper, we propose the channel enhancement module (CEM) to further enhance window-to-window feature interaction: deep features are compressed per channel by AvgPool, and a 3 × 1 convolution lets the channel features interact to extract more accurate spatial location information. The design of the CEM considers not only patch-to-patch but also channel-to-channel relationships, compensating for the limited window-to-window modeling capability of the Swin Transformer and making the Transformer better suited to image segmentation tasks. Unlike other methods [38,39], which simply change how the Swin Transformer divides windows, our proposed CEM learns the deep relationships between channels so that its feature weight distribution is more accurate. The CEM is shown in Figure 5.
Channel information is first obtained by AvgPool, which is calculated as follows:
v_c = (1/(h × w)) Σ_{i=1}^{h} Σ_{j=1}^{w} s^l_{c,i,j},
where s^l represents the output of the lth Swin Transformer block and v ∈ R^{c×1×1} represents the feature matrix after AvgPool. For the average-pooled features, we convert the dimensions to 1 × c × 1, perform a convolution with a 3 × 1 kernel, and matrix-multiply the result with the original features to enhance their spatial characteristics. This is represented as follows:
ŝ^l = s^l ⊗ R(Conv_{3×1}(R(v))),
where R(·) represents the reshape operation.
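A sketch of the CEM under the operations named in the text (channel-wise AvgPool, a 3 × 1 convolution across the channel axis, and a multiplication back onto the original features). The sigmoid gating at the end is our assumption about how the learned channel weights rescale the input.

```python
import torch
import torch.nn as nn

class CEM(nn.Module):
    def __init__(self):
        super().__init__()
        # 3x1 convolution sliding across the channel axis, treating the
        # c channel statistics as a length-c sequence.
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)

    def forward(self, s):
        b, c, h, w = s.shape
        v = s.mean(dim=(2, 3))                  # AvgPool over space: b x c
        w_ch = self.conv(v.unsqueeze(1))        # reshape to b x 1 x c, conv
        w_ch = torch.sigmoid(w_ch).view(b, c, 1, 1)
        return s * w_ch                         # channel-weighted features

y = CEM()(torch.randn(2, 64, 16, 16))
```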

Decoder
The decoder primarily decodes features through convolution and upsampling, with skip connections to the main encoder to enhance global context information, as shown in the decoder part of Figure 1. First, the MFM output features are concatenated with the CNN-Block3 output to enhance feature interaction; a 3 × 3 convolution followed by the ReLU function then decodes the features and reduces the channel count, and bilinear interpolation upsampling doubles the feature size. After three upsampling steps, the features return to the original image size; a further 3 × 3 convolution refines the features and reduces the number of channels to the number of image classes, and the prediction is output by the argmax function.
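One decoder stage can be sketched as follows: concatenate the skip connection, decode with a 3 × 3 convolution plus ReLU, then double the spatial size by bilinear upsampling. The channel widths are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decode_stage(x, skip, conv):
    x = torch.cat([x, skip], dim=1)            # skip connection from encoder
    x = F.relu(conv(x))                        # 3x3 conv + ReLU decoding
    return F.interpolate(x, scale_factor=2,    # doubles the feature size
                         mode='bilinear', align_corners=False)

conv = nn.Conv2d(512 + 512, 256, 3, padding=1)
y = decode_stage(torch.randn(1, 512, 32, 32),
                 torch.randn(1, 512, 32, 32), conv)
```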

Loss Function
The loss function applied in this paper combines cross-entropy loss and Dice loss: cross-entropy loss primarily measures per-pixel classification accuracy, while Dice loss measures the accuracy of category regions.
Cross-entropy loss is expressed as follows:
L_ce = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_{i,k} log(p_{i,k}),
where N is the number of samples, K represents the number of categories, p_{i,k} represents the predicted probability of the kth category for the ith sample, and y_{i,k} indicates whether the prediction corresponds to the true label. The Dice loss is expressed as follows:
L_dice = 1 − (1/K) Σ_{k=1}^{K} (2 Σ_i s_i^k g_i^k) / (Σ_i s_i^k + Σ_i g_i^k),
where s_i^k and g_i^k denote the predicted classification result and the actual label value for the kth class of the ith sample, respectively.
The total loss function is expressed as:
L = L_ce + L_dice.
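The combined loss can be sketched as below, assuming an unweighted sum of the two terms; the small epsilon guarding the Dice denominator is an implementation detail on our part.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, k, eps=1e-6):
    # Per-class soft Dice over predicted probabilities vs. one-hot labels.
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, k).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1 - (2 * inter / (union + eps)).mean()

def total_loss(logits, target, k):
    # Cross-entropy for per-pixel accuracy + Dice for region accuracy.
    return F.cross_entropy(logits, target) + dice_loss(logits, target, k)

logits = torch.randn(2, 6, 64, 64)          # 6 classes, as in WHDLD
target = torch.randint(0, 6, (2, 64, 64))
loss = total_loss(logits, target, 6)
```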

Experiments and Results
To verify the efficiency of TMNet in the field of remote sensing image segmentation, we conducted experiments on two publicly available remote sensing image datasets and compared them with other classical methods.

Datasets
The two publicly available datasets used in this experiment are the Wuhan Dense Labeling Dataset (WHDLD) [40,41] and the Potsdam dataset [42].
The WHDLD was captured by the Gaofen-1 and ZY-3 satellites over Wuhan and consists of 4940 RGB images of size 256 × 256 that were fused and upsampled to a resolution of 2 m. The WHDLD has six pixel-wise annotated categories: bare soil, buildings, pavement, vegetation, roads, and water. The Potsdam dataset has 38 remote sensing images of size 6000 × 6000 with a resolution of 5 cm, covering 3.42 km² of the building complex structures in Potsdam. It contains six categories: impervious surface, building, low vegetation, tree, car, and clutter/background. In this paper, only 14 RGB images are selected for the experiment (image IDs: 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_14, 6_15, 7_13), and, as with WHDLD, to facilitate the experiment we cropped the Potsdam images to 256 × 256, for a total of 6877 images. We divided both datasets into training, validation, and test sets in the ratio 6:2:2. To make the model more accurate, we used random rotation, flipping, and Gaussian noise for data augmentation.
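The cropping step can be sketched as a simple tiling. The paper does not specify its exact cropping scheme (the stated total of 6877 patches implies some overlap or filtering), so the non-overlapping grid below, which drops the right/bottom remainder, is only one plausible reading.

```python
# Tile a 6000x6000 image into non-overlapping 256x256 patch origins.
def tile(img_h, img_w, patch=256):
    return [(r, c) for r in range(0, img_h - patch + 1, patch)
                   for c in range(0, img_w - patch + 1, patch)]

coords = tile(6000, 6000)   # 23 x 23 = 529 full patches per 6000x6000 image
```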

Training Setup and Evaluation Index
TMNet is built with PyTorch. We train the network using the SGD optimizer, with the momentum term set to 0.9 and the weight decay set to 1 × 10⁻⁴; the learning rate starts at 0.01 and is halved every 20 epochs so that the model reaches its optimum faster. The experiments are implemented on an NVIDIA GeForce RTX 3090 GPU with 24 GB RAM. The batch size is 16 and the maximum number of epochs is 150. We evaluate model performance using mean intersection over union (MIOU), mean F1 (MF1), and mean pixel accuracy (MPA). The three evaluation metrics are calculated from the confusion matrix, which has four components: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Based on these four terms, IOU is defined as the degree of overlap between the predicted and true values of an image and is calculated as follows:
IOU = TP / (TP + FP + FN).
The F1 score for each category is expressed as follows:
F1 = 2 × Precision × Recall / (Precision + Recall),
where Precision = TP/(TP + FP) and Recall = TP/(TP + FN). MIOU is the mean of IOU over all categories, MF1 is the mean of F1 over all categories, and MPA is the mean of precision over all categories; we use MIOU as the main evaluation metric.
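These metrics follow directly from the confusion matrix, as in the small worked example below (a 2-class matrix invented for illustration).

```python
import numpy as np

# Per-class IOU and F1 from a K x K confusion matrix C, where C[i, j]
# counts pixels of true class i predicted as class j.
def metrics(C):
    tp = np.diag(C).astype(float)
    fp = C.sum(axis=0) - tp          # predicted as class k but wrong
    fn = C.sum(axis=1) - tp          # true class k but missed
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou.mean(), f1.mean()     # MIOU, MF1

C = np.array([[8, 2],
              [1, 9]])               # illustrative 2-class counts
miou, mf1 = metrics(C)
```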

Comparison Results on WHDLD
In this section, we compare TMNet with other classical semantic segmentation networks, including DFANet [43], DenseASPP [15], PSPNet [16], Deeplabv3plus [44], DUNet [45], MAUNet [46], MSCFF [47], MUNet [48], HRNet [20], SegFormer [28], and HRVit [49], where DenseASPP, PSPNet, and Deeplabv3plus use ResNet-101 as the backbone. The experimental results are shown in Table 2 and Figure 6. From Table 2, it can be seen that TMNet outperforms the other networks. DFANet is based on a pure CNN architecture, DenseASPP improves its global modeling capability by using convolutions with different dilation rates, and PSPNet exploits context through its pyramid pooling module; all of these methods aggregate contextual information from local features. MSCFF uses trainable convolutional filters to densify the feature map and enhance small-scale features, and MAUNet subdivides features at different scales by increasing the number of downsampling steps and attention mechanisms, but neither achieves multi-scale feature fusion comparable to TMNet; Deeplabv3plus uses atrous spatial pyramid pooling and depthwise separable convolutions, but its performance is still inferior to TMNet. SegFormer designs a hierarchical Transformer architecture and fuses features at different scales using an improved MLP decoder. HRVit uses the Transformer to build cascaded converters that generate multi-scale feature representations. However, the lack of contextual information interaction makes their performance inferior to that of TMNet. In addition, we compared the parameters and FLOPs of each method; TMNet performs moderately in both respects, indicating that it does not simply pile up computation to obtain high accuracy. Compared with the three advanced models HRNet, SegFormer, and HRVit, TMNet improves MIOU by 0.80%, 0.78%, and 0.52% while using only 41.77%, 43.41%, and 71.80% of their parameters, respectively.
In terms of FLOPs, TMNet requires 3.65% and 39.95% less than HRNet and SegFormer, respectively. Figure 6 shows the prediction results of each method on the WHDLD. From Figure 6, it can be seen that the predictions of TMNet are closest to the ground truth and better capture small-scale features and edge details. The first row of Figure 6 shows that TMNet achieves the best classification results for the road category, owing to the powerful global modeling capability of the Swin Transformer, which enhances the feature extraction ability of the network. For the harder-to-classify bare soil category, TMNet also achieves good results, as can be seen in the third row of Figure 6. This demonstrates the excellent performance of TMNet in the semantic segmentation of remote sensing images.

Comparison Results on Potsdam
Table 3 shows the segmentation results of each method on the Potsdam dataset, further confirming the effectiveness of TMNet for remote sensing image segmentation: MIOU, MF1, and MPA reach 68.15%, 79.91%, and 78.77%, respectively, all higher than the other methods. Owing to the difference in dataset resolution, the segmentation accuracy on Potsdam is better than on WHDLD. Figure 7 shows the prediction results of each method on the Potsdam dataset; the classification results of TMNet are better than those of the other methods. For example, it is obvious from the first and fifth rows that TMNet is significantly better than the other methods at predicting the clutter/background category, which demonstrates its performance.

Ablation Study
In this section, to verify the effectiveness of each proposed module, we conducted ablation experiments on the WHDLD with the CNN blocks and decoder as the baseline; the results are shown in Table 4, where SW stands for the Swin Transformer. The first three rows of Table 4 show that adding the MFM improves MIOU by 1.18%, primarily because the MFM merges low-level spatial features of images at different scales into high-level semantic features, enriching multi-scale feature information and increasing the network's ability to detect minute features. MIOU improves by 1.03% when SW is added, primarily owing to the powerful global modeling capability of SW and the enhanced global context acquisition of the dual-branch structure. Figure 8 shows the results: after adding the MFM and SW, the segmentation capability improves significantly, especially for small, detailed image features. For example, the improvement in accuracy for the road category in the first and second rows of Figure 8 is clearly visible.

Effect of FEM and CEM
We refer to baseline + MFM + SW as baseline_1. As shown in the last four rows of Table 4, MIOU improves by 0.37% when the FEM is added, because the FEM combines the input data with learnable parameters to further enhance feature extraction. When the CEM is added, MIOU improves by 0.48%, owing to the CEM's strengthened window-to-window modeling that focuses on the relationships between channels. When both modules are added, MIOU improves by 0.82%, which shows the effect of the FEM and CEM on network performance. Figure 9 shows the segmentation results. From the first row, we can see that the building category is segmented more clearly after adding the two modules, and the confusion between the road and pavement categories in the second row is greatly reduced.

Conclusions
In this paper, a two-branch multi-scale semantic segmentation network for remote sensing images, TMNet, is proposed. It adopts an encoder-decoder structure and uses a dual-branch design with the Swin Transformer to enhance the network's global contextual interaction, compensating for the limited global modeling capability of CNNs; the MFM is proposed to improve the fusion of features at different scales. In addition, the FEM and CEM enhance the network's feature capture capability by computing feature information relationships. However, the contours predicted by TMNet still do not fully fit the actual results. We will explore new methods for improvement and will also work on simplifying the network to improve computational efficiency.
