Double Branch Parallel Network for Segmentation of Buildings and Waters in Remote Sensing Images

Abstract: The segmentation of buildings and waters is extremely important for the efficient planning and utilization of land resources. The temporal and spatial range of remote sensing images is growing. Because generic convolutional neural networks (CNNs) are insensitive to spatial position information in remote sensing images, certain location and edge details can be lost, leading to low segmentation accuracy. This research proposes a double-branch parallel interactive network to address these issues, fully using the global information interaction of a Swin Transformer network and integrating a CNN to capture deeper information. By building a cross-scale multi-level fusion module, the model combines features extracted by the CNN with features derived by the Swin Transformer, successfully extracting both spatial and contextual semantic information. An up-sampling module for multi-scale fusion is then proposed. It uses the output high-level feature information to guide the low-level feature information and recover high-resolution pixel-level features. According to experimental results, the proposed network maximizes the benefits of the two models and increases the precision of semantic segmentation of buildings and waters.


Introduction
Semantic segmentation is very important in many domains such as autonomous driving, land use, ecological environment monitoring, disaster monitoring and agricultural monitoring. Identifying building and water area types from remote sensing images can provide an efficient technical approach for regional map updating, land planning, risk management [1] and regional economic development forecasting. The spatial, spectral and temporal resolution of remote sensing images is increasing as modern computer vision and aerospace technology advance quickly [2]. Remote sensing is a highly effective and affordable way to map a large area [3]. The textural features and spatial structure of ground objects can be clearly expressed in high-resolution remote sensing images [4][5][6]. The development of remote sensing imaging technology is therefore of great significance for semantic segmentation tasks.
The three primary traditional approaches to remote sensing image segmentation are the threshold, clustering, and maximum likelihood methods. The maximum likelihood method computes the maximum likelihood discriminant function of each category from training data, substitutes the value of each pixel into that function, and finally evaluates the reliability of the classification outcomes. It places high demands on the training set, and an inadequate training set can easily lead to very poor estimation results. The threshold method assigns image regions to different targets by thresholding, but it is sensitive to image noise; in remote sensing images with a highly complicated background, the gray-level differences between targets are small and their gray values overlap, so an appropriate threshold is not easy to find. Clustering methods classify the pixels of the image, but some traditional clustering algorithms do not consider spatial information, which can easily cause incomplete segmentation and lower accuracy, and the clustering analysis itself is complicated. Other image segmentation methods include segmentation based on the genetic algorithm, region segmentation and edge segmentation [7]. Region segmentation tends to over-segment images. Edge segmentation cannot obtain a good regional structure, and it involves a trade-off between accuracy and noise immunity [8]. Usually, edge segmentation is combined with region segmentation to obtain a better result. In segmentation based on the genetic algorithm, it is difficult to determine the crossover and mutation probabilities, and selecting the fitness function is even more difficult. Machine learning methods such as decision trees, support vector machines, random forests and shallow artificial neural networks have been found unsuitable for massive quantities of data. Recently, fully supervised models have had great success in this area; however, a lack of annotated data results in significant performance loss [9]. In summary, traditional image segmentation methods are limited by the spectral characteristics of high-resolution remote sensing images, have limited ability to process massive data, and suffer from poor segmentation results and poor generalization ability.
High-resolution remote sensing images have complex features and are difficult to classify. Some deep learning methods currently perform well in natural image segmentation tasks, but applying them directly to remote sensing image segmentation raises several challenges:
1. Data volume and quality: unlike natural images, remote sensing datasets are usually small, and the images are affected by clouds, shadows, noise and other factors. The resulting unstable image quality may lead to overfitting or underfitting of deep learning algorithms and thereby reduce classification accuracy.
2. The multi-scale problem: remote sensing images usually involve multiple scales and resolutions, and an object has different appearance and semantic information at different scales. Multi-scale information therefore needs to be processed and fused to capture both global and local information, so that the model can classify targets more accurately.
3. Category imbalance: remote sensing images usually involve multiple categories, some of which may have few samples. This imbalance may bias the model, which then tends to predict the categories that are common in training and predicts rare categories inaccurately.
4. Spatial and temporal issues: in remote sensing images, temporal and spatial information are closely related. For example, images of the same area may be taken at different times, and the location, shape, and number of targets may change. Spatio-temporal information therefore needs to be considered in order to better capture change information [1][10][11][12][13][14][15][16].
Deep learning methods can extract richer and deeper feature information, which suits the classification of high-resolution images [17][18][19][20]. Long et al. [21] proposed the fully convolutional network (FCN), which transfers the feature extraction layers of a classification network to the segmentation task and updates their parameters through fine-tuning [22]. To achieve more precise segmentation results, it also uses a dedicated structure that combines shallow and high-level semantic information. Chen et al. [23] proposed a semantic pixel-level segmentation network (SegNet). SegNet is a lightweight neural network that uses upsampling layers to restore the segmentation result to the resolution of the input image. In addition, SegNet uses an encoder-decoder structure with skip connections and deconvolution layers to better preserve detail and semantic information. Zhao et al. [24] proposed the pyramid scene parsing network (PSPNet). PSPNet applies dilated convolution in its ResNet [25] backbone, so that throughout the encoder the features keep the same resolution after the initial pooling layer. An auxiliary loss is introduced in training to help ResNet learn, and the spatial pyramid pooling module integrates semantic information from different regions to obtain better global semantic information. In the same year, a multi-path refinement network (RefineNet) was proposed for high-resolution image segmentation [26]. It also aims to make full use of shallow and deep feature information by splitting the input and extracting features from each part separately. Sun et al. [27] proposed a high-resolution segmentation network (HRNet). HRNet maintains a high-resolution representation throughout the whole process. It obtains both strong semantic information and accurate location information by connecting high-resolution and low-resolution convolutional branches in parallel and continuously exchanging information between the branches. Pang et al. [8] proposed a lightweight building and water segmentation network (SGBNet), in which the channel pooling attention module extracts features through two different global pooling modules. The network improves segmentation accuracy while also being more efficient and lightweight. In conclusion, CNNs are usually composed of convolutional layers, pooling layers, and fully-connected layers. For semantic segmentation tasks, the encoder-decoder structure is generally used. The encoder encodes the input image into low-dimensional features. The decoder maps the low-dimensional features back to the original image size and outputs the classification result of each pixel. For each application field, an appropriate network structure and training strategy can be designed for the specific situation to obtain the best performance.
Recently, the Transformer [28] has shown great value in several fields. The Transformer is the first transduction model that computes its input and output representations purely with self-attention, without convolutional networks or sequence-aligned recurrent neural networks. In earlier convolutional sequence models, the number of operations needed to relate signals at two arbitrary input or output positions grows with the distance between them, linearly for ConvS2S and logarithmically for ByteNet, which makes it harder to learn dependencies between distant positions. In the Transformer, this is reduced to a constant number of operations. ViT was proposed as a Transformer for large-scale image recognition [29]. The core process of ViT has four parts: splitting the image into patches, patch embedding with position encoding, the Transformer encoder, and MLP classification. When transferred to smaller datasets, it can achieve better performance than CNNs, and it successfully recasts vision problems in the form used for natural language processing. Liu et al. [30] proposed a Transformer structure with shifted windows (Swin Transformer). Swin Transformer computes self-attention within shifted windows, which brings greater efficiency and performance. Shifting the windows also lets adjacent windows exchange information, thus achieving global modeling indirectly and giving good results in segmentation tasks. Swin-UNet [32] was proposed on the basis of Swin Transformer and UNet [31]. It builds a symmetric encoder-decoder structure with skip connections from Swin Transformer modules to perform pixel-level segmentation prediction. Lu et al. [22] proposed a bilateral branch model based on the traditional Transformer, using a strip convolution module in the encoder. The information gathered by the two branches guides each other and produces more accurate segmentation results. In conclusion, the Transformer model is a neural network structure based on a self-attention mechanism, usually composed of an encoder and a decoder. For semantic segmentation, the encoder of the Transformer model can be used as a feature extractor that produces a high-dimensional feature representation of the input image and passes it to the decoder for pixel-level classification. A CNN mainly models local features, while a Transformer can model global information. Combining the two, a Transformer module can be introduced into a CNN to capture longer-range contextual information. This can improve the edge and detail information of objects in image segmentation tasks.
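To make the "constant number of operations" property concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not the implementation used in this paper: the learned query, key and value projections are omitted, and the function names are hypothetical. Every output token is a weighted sum over all input tokens, so any two positions are related in a single step regardless of their distance.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention on lists of token vectors.

    queries, keys: lists of vectors of length d; values: list of vectors.
    Assumption for illustration: projections W_q, W_k, W_v are not modeled.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of all value vectors.
        dim = len(values[0])
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(dim)])
    return outputs
```

When all keys are identical the attention weights are uniform, so the output is simply the mean of the value vectors.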
The intricate spectral and spatial texture information in high-resolution remote sensing images not only enriches the fine detail of ground objects, but also makes semantic segmentation more challenging. Because the size of different types of ground objects differs greatly, neural networks need to extract features of ground objects effectively from different perspectives. Contributing factors include the shooting angle and distance of the images, the light intensity, the complexity of the terrain and the diversity of landforms, the difference in size between urban waters and natural lakes, and the contrast between dense, diverse urban buildings and sparse rural houses. These issues place more demanding requirements on semantic segmentation models. Due to the lack of global information interaction and their single calculation method, traditional algorithms produce a lot of noise in their predictions and have insufficient ability to detect and recognize areas whose pixels resemble non-target areas. Much information is lost for some edge details, and on the whole such algorithms are prone to misjudgment. Approaches based on convolutional neural networks have difficulty learning explicit global and long-range semantic relationships because of the intrinsic locality of convolution operations. This research introduces a novel double-branch parallel image segmentation network in response to the shortcomings of the existing approaches. The network is based on Swin Transformer and CNN. In the feature encoding stage, the designed cross-scale multi-level fusion module connects the two branches, and comprehensive semantic information and spatial semantic information are extracted using the CNN and the Swin Transformer. This fusion module lets the features extracted by the two branches guide each other, giving full play to Swin Transformer's global information interaction and compensating for the errors in judgment caused by the CNN's lack of global information and long-range semantic interaction. In the feature decoding stage, the designed multi-scale fusion module fuses the high-level feature information from the encoding stage with the low-level feature information extracted by the CNN; the high-level feature information guides the low-level feature information, and the result is upsampled step by step. Through the joint action of these modules, our network significantly increases segmentation precision.

Methods
At present, convolutional neural networks have a constrained receptive field and have difficulty capturing global information [33]. The Swin Transformer network adds a shifted sliding window to better capture global feature information and perform global information interaction. Convolutional neural networks have translation invariance and local correlation, while the Transformer structure is weak in these respects. Taking these factors into account, we propose a parallel combination of Swin Transformer and CNN, which greatly improves the segmentation accuracy and generalization ability of the model. In addition, it can effectively identify houses, waters and backgrounds in building and water tasks, and the segmented boundary details are finer. Figure 1 depicts the general design of the parallel network, which consists mainly of an encoder and a decoder. Figures 2 and 4 depict the detailed layout of the individual modules. A given image first enters the encoder, where the Swin Transformer and CNN branches extract feature information; the feature fusion module designed in this paper then fuses the extracted features and passes the fused features to the Swin Transformer for further feature extraction. We use Swin Transformer as one branch in this research. Compared with the traditional Transformer module, Swin Transformer computes self-attention within shifted windows, which brings greater efficiency and performance. Shifting the windows also allows feature information to be exchanged between adjacent windows, thus achieving global modeling indirectly. The encoding stage therefore yields highly detailed information together with a full view of the global information. For the decoding stage, we propose a step-by-step fusion upsampling module. The feature information obtained from the convolutional network and the feature information obtained from the encoder are upsampled step by step through the multi-scale fusion module. The sufficient interaction of global information and the guidance of low-level feature information mitigate some of the disadvantages described above. The four fusion upsampling modules, which gradually combine feature information from high level to low level, significantly enhance the model's performance.
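The shifted-window idea described above can be illustrated on a toy grid. The sketch below is plain Python with hypothetical helper names; it shows only the partitioning logic and assumes a cyclic shift of half the window size, as in the original Swin Transformer design. The attention computation and the masking of wrapped-around positions are deliberately left out.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and to the left by `shift` cells, wrapping around."""
    height, width = len(grid), len(grid[0])
    return [[grid[(r + shift) % height][(c + shift) % width]
             for c in range(width)]
            for r in range(height)]


def window_partition(grid, size):
    """Split a 2D grid into non-overlapping size x size windows (flattened)."""
    height, width = len(grid), len(grid[0])
    windows = []
    for top in range(0, height, size):
        for left in range(0, width, size):
            windows.append([grid[r][c]
                            for r in range(top, top + size)
                            for c in range(left, left + size)])
    return windows


# A 4 x 4 grid of token indices 0..15 with 2 x 2 windows.
tokens = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(tokens, 2)
shifted = window_partition(cyclic_shift(tokens, 1), 2)
```

In this toy case the first regular window holds tokens 0, 1, 4, 5, while the first shifted window holds tokens 5, 6, 9, 10, one from each of the four regular windows. Alternating the two partitions is what lets information flow between neighboring windows.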

Overall Structure
This article uses a parallel structure of Swin Transformer and a ResNet50 convolutional network to extract different levels of image information. Swin Transformer not only attends dynamically to areas of interest through its shifted windows, but also has a global receptive field and better generalization performance. A CNN with ResNet50 as the backbone has two characteristics: local perception and parameter sharing. Local perception means that each neuron only needs to sense local pixels in the image rather than all of them, and that this local information can subsequently be combined at a higher level to obtain a representation of the whole image. To enhance the performance of the model, we design related modules that fully exploit the advantages of both.

Cross-Scale Multi-Level Fusion Module
To improve the accuracy and predictive performance of models in building and water segmentation tasks, we propose a parallel network that combines the ResNet50 CNN with Swin Transformer. However, we find that simply combining the two structures does not improve the model noticeably, which does not meet our task requirements. Considering that the two branches output different levels of feature information, and in order to make full use of the advantages of the double-branch parallel network, we design a cross-scale multi-level fusion module (CMFM).
As shown in Figure 2, the fusion module we designed has two branches. Assume that the feature map f_1 output by the ResNet50 branch has size C_1 × H × W and that the feature map f_2 output by the Swin Transformer branch has size C_2 × H × W, where C_1 and C_2 are the numbers of channels. In the first branch, f_1 is passed through a global average pooling layer, which reduces the number of parameters. Global average pooling reflects global information well and helps avoid overfitting; it also aggregates global spatial information, making the module more robust to spatial transformations of the input image. The pooled features then pass through a 1 × 1 convolutional layer followed by BN and ReLU. Finally, a second 1 × 1 convolutional layer changes the number of channels from C_1 to C_2, giving a C_2 × 1 × 1 output. In the other branch, f_1 passes through a 3 × 3 convolutional layer, then BN and ReLU, and finally a 1 × 1 convolutional layer, giving a C_2 × H × W output. f_out1 is then obtained by adding the C_2 × 1 × 1 and C_2 × H × W outputs of the two 1 × 1 convolutions, and f_out2 is obtained by applying BN and ReLU again. Next, f_2, of size C_2 × H × W, is added through a residual connection to obtain f_out3. Finally, f_out3 is passed through the convolutional block attention module (CBAM) [34] to obtain the output Y. This process is described by Formulas (1)-(3), in which f_{1×1}(·) denotes a convolution with a 1 × 1 kernel, f_{3×3}(·) denotes a convolution with a 3 × 3 kernel, and δ(·) denotes ReLU activation. ReLU produces sparse activations, which helps the model extract relevant features and fit the training data more effectively. The specific calculation formula is shown in Formula (4).
where x denotes the input of the ReLU function. g_a(·) in Formula (1) denotes global average pooling, which computes the average of all pixels in the feature map of each channel and helps suppress overfitting. Here, it changes a feature map of size C × H × W into one of size C × 1 × 1. It compresses spatial feature information into the channel dimension and integrates the global spatial information so that global feature information can be fully utilized. The specific calculation formula is shown in Formula (5).
where p(i) represents the pixel value at the i-th position of the feature map. The calculation formula of CBAM(·) in Formula (3) is as follows (assuming the input is F): In the above formula, the weights W_0 and W_1 of the MLP are shared. AvgPool(·) and MaxPool(·) denote average pooling and maximum pooling operations. M_c(·) denotes the channel attention operation and M_s(·) denotes the spatial attention operation. f_{7×7}(·) denotes a convolution with a 7 × 7 kernel. ⊗ denotes element-wise multiplication. σ(·) denotes the sigmoid activation function, which normalizes its input to the range (0, 1). Its calculation formula is as follows: where x represents the input of the sigmoid function.
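The display equations referred to in this subsection did not survive in this copy of the text. The block below is a hedged reconstruction from the prose, not the authors' typeset formulas: the equation numbering is not recovered, the placement of batch normalization (written BN) is inferred from the description, and the CBAM lines follow the standard definition in the CBAM paper [34].

```latex
\begin{aligned}
% CMFM, reconstructed from the prose description (likely Formulas (1)-(3))
f_{out1} &= f_{1\times1}\big(\delta(\mathrm{BN}(f_{1\times1}(g_a(f_1))))\big)
          + f_{1\times1}\big(\delta(\mathrm{BN}(f_{3\times3}(f_1)))\big) \\
f_{out2} &= \delta\big(\mathrm{BN}(f_{out1})\big) \\
Y        &= \mathrm{CBAM}(f_{out3}), \qquad f_{out3} = f_{out2} + f_2 \\[4pt]
% Elementary functions
\delta(x) &= \max(0, x) \\
g_a       &= \frac{1}{H \times W} \sum_{i=1}^{H \times W} p(i) \\
\sigma(x) &= \frac{1}{1 + e^{-x}} \\[4pt]
% CBAM with input F (standard form)
M_c(F) &= \sigma\big(W_1(W_0(\mathrm{AvgPool}(F))) + W_1(W_0(\mathrm{MaxPool}(F)))\big) \\
M_s(F) &= \sigma\big(f_{7\times7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big) \\
F'     &= M_c(F) \otimes F \\
\mathrm{CBAM}(F) &= M_s(F') \otimes F'
\end{aligned}
```

In the first line, the pooled branch has size C_2 × 1 × 1 and is broadcast over the H × W positions of the second branch before the addition.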
Figure 3 shows the actual effect of our CMFM module: (a) is the original image, (b) is its label, and (c) and (d) are the feature heat maps of the backbone model without and with the CMFM module, respectively. The heat maps show that some regions originally received little or no attention. After the CMFM module was added to the backbone network, the network gave more weight to the pixels in these regions (the red portions of the feature heat map). This indicates that the designed module is effective.
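The data flow of the CMFM (pool one branch to a per-channel value, broadcast-add it to a full-resolution branch, apply ReLU, then add the Swin Transformer feature map residually) can be traced on toy values. The sketch below is plain Python under strong simplifying assumptions: the convolutions, batch normalization and CBAM are omitted, and all function names are hypothetical.

```python
def global_average_pool(fmap):
    """C x H x W nested list -> list of C per-channel means."""
    pooled = []
    for channel in fmap:
        values = [p for row in channel for p in row]
        pooled.append(sum(values) / len(values))
    return pooled


def broadcast_add(channel_vector, fmap):
    """Add a C x 1 x 1 descriptor to every position of a C x H x W map."""
    return [[[p + bias for p in row] for row in channel]
            for bias, channel in zip(channel_vector, fmap)]


def relu_map(fmap):
    """Element-wise ReLU, max(0, x)."""
    return [[[max(0.0, p) for p in row] for row in channel] for channel in fmap]


def add_maps(a, b):
    """Element-wise (residual) addition of two maps of equal shape."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


# One channel, 2 x 2 spatial size.
cnn_features = [[[1.0, 2.0], [3.0, 4.0]]]        # stands in for f_1
local_branch = [[[-3.0, 0.0], [1.0, -5.0]]]      # stands in for the 3 x 3 branch
swin_features = [[[1.0, 1.0], [1.0, 1.0]]]       # stands in for f_2

descriptor = global_average_pool(cnn_features)   # global branch, C x 1 x 1
fused = relu_map(broadcast_add(descriptor, local_branch))
output = add_maps(fused, swin_features)          # residual add with f_2
```

The broadcast addition is what lets a single global value per channel shift every spatial position of the local branch.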

Multi-Scale Fusion Module
In the decoding stage, recovering the output by simple, crude upsampling alone is bound to lose information, which degrades the model's performance and causes misjudgments. To take full advantage of the global interactivity of the Swin Transformer branch of our backbone network, and to let high-level semantic features guide low-level semantic features, we designed a fusion module similar to the CMFM.
Guiding at different scales allows feature representations of different scales to be captured. Figure 4 shows the structure of the multi-scale fusion module (MFM), whose input consists of two feature maps of the same size. We denote the two inputs by X_1 and X_2, each of size C × H × W, where C is the number of channels of the feature map and H and W are its height and width. First, we add the two inputs to obtain X_3, which then feeds two parallel branches, X_31 and X_32. In the first branch, X_3 passes through global average pooling, a 1 × 1 two-dimensional convolution, BN and ReLU, and finally another 1 × 1 convolution to give the output X_31. In the second branch, X_3 passes through a 3 × 3 two-dimensional convolution followed by BN and ReLU, and then a 1 × 1 convolution to give the output X_32. We then add the two outputs to obtain X_4, which is passed through a sigmoid to obtain the weight s. Since s lies in (0, 1) after sigmoid activation, we use s and (1 − s) as coefficients to weight X_1 and X_2, respectively, obtaining X_1out and X_2out, and then add them. To match the number of channels required by the next stage, the final output Y is obtained by changing the number of channels with a 1 × 1 two-dimensional convolution. The CBAM module in the dotted box is used only in the last MFM module, so Formula (15) applies only to the last MFM; its calculation has been described above and is not repeated here. The calculation formula of the above process can be represented as: where g(·) denotes global average pooling, f_{3×3}(·) denotes a 2D convolution with a 3 × 3 kernel, f_{1×1}(·) denotes a 2D convolution with a 1 × 1 kernel, δ(·) includes batch normalization and ReLU activation, ⊗ denotes element-wise multiplication, and σ(·) denotes the sigmoid activation function.
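The weighting step of the MFM, in which a sigmoid gate s blends the two inputs as s ⊗ X_1 + (1 − s) ⊗ X_2, can be sketched in plain Python. This is an illustration under assumptions: the feature maps are flattened to lists, the gate logits X_4 are given directly instead of being produced by the two convolutional branches, and the final 1 × 1 convolution is omitted. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(x1, x2, gate_logits):
    """Blend two flattened feature maps with complementary weights.

    x1, x2, gate_logits: lists of equal length.
    Returns s * x1 + (1 - s) * x2 element by element, s = sigmoid(logit).
    """
    fused = []
    for a, b, logit in zip(x1, x2, gate_logits):
        s = sigmoid(logit)
        fused.append(s * a + (1.0 - s) * b)
    return fused


high_level = [2.0, 2.0, 2.0]
low_level = [4.0, 4.0, 4.0]
blended = gated_fusion(high_level, low_level, [0.0, 20.0, -20.0])
```

A logit of 0 gives an equal blend, a large positive logit selects the first input, and a large negative logit selects the second. Because the two weights always sum to 1, the fused value stays between the two inputs.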
The above modules together constitute our network, which adopts a two-branch parallel design. The convolution branch extracts detailed information to obtain low-level semantic information, and the Swin Transformer branch extracts contextual information to obtain high-level semantic information. In the encoding stage, the CMFM module exploits the strengths of both branches so that they fully exchange contextual information. In the decoding stage, the MFM modules recover resolution through step-by-step fusion, which makes the segmentation boundaries finer and allows waters and buildings to be identified better against complex backgrounds.

Building and Water Dataset
To test the effectiveness of the model in the semantic segmentation of buildings and waters, this paper created a buildings and waters dataset to train and validate it. Compared with some other datasets, the dataset created for this experiment has a large spatial span, more viewing angles, and complex backgrounds due to the low angle of view, which calls for more capable algorithms. The dataset comprises 10,000 pairs of Google Earth images covering scenes such as riverfront residential complexes in China and private villas in North America. The images were divided into 224 × 224 tiles, and data augmentation was applied to the tiles. Three augmentation strategies were used: a 50% horizontal flip, a 50% vertical flip, and a 10% random rotation. Augmentation enlarges the dataset and adds perturbations to the training process, which improves the model's generalizability. The images were manually labeled with three object categories: building, water, and background. Figure 5 displays the sliced images together with their labels. Images containing only a single category were removed, and the remaining images were randomly split into training and validation sets at a ratio of 8:2. The dataset contains a rich variety of backgrounds, which meets the experimental requirements and ensures that the measured segmentation accuracy is not biased by a single type of data.
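The augmentation and splitting steps can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline. It assumes that the percentages are per-sample probabilities and that the rotation is a 90-degree turn; neither detail is stated in the text, and the function names are hypothetical. The key point is that the image and its label mask must receive the same transform.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def vertical_flip(image):
    """Mirror a 2D image top to bottom."""
    return image[::-1]


def rotate_90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_pair(image, label, rng, p_hflip=0.5, p_vflip=0.5, p_rotate=0.1):
    """Apply the same random transforms to an image and its label mask."""
    if rng.random() < p_hflip:
        image, label = horizontal_flip(image), horizontal_flip(label)
    if rng.random() < p_vflip:
        image, label = vertical_flip(image), vertical_flip(label)
    if rng.random() < p_rotate:
        image, label = rotate_90(image), rotate_90(label)
    return image, label


def train_val_split(samples, rng, train_fraction=0.8):
    """Shuffle and split samples into training and validation lists."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Passing a seeded `random.Random` instance as `rng` makes both the augmentation and the 8:2 split reproducible.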
The dataset has the following characteristics: (1) Objects such as vehicles, containers and some similar buildings differ greatly in color, as shown in Figure 5d, while some buildings are very similar in color to the surrounding background, as shown in Figure 5e. This places higher demands on the detection ability of the proposed model. (2) We selected a wide range of scenes to test the performance of our model more comprehensively. (3) Because the angle and spatial distance of the remote sensing satellite vary between shots, segmentation becomes more difficult to a certain extent. (4) In densely built areas, the shadows that high-rise buildings cast on nearby low-rise structures interfere with segmentation to a certain degree.

Waters Dataset
To test more thoroughly how the network handles edge features such as the boundaries of water areas, and to better validate the generalizability of the proposed model, this paper also uses a waters dataset. For this experiment, we selected multi-spectral images from China's HJ-1A and HJ-1B environmental remote sensing satellites.
The waters dataset uses three visible and near-infrared spectral bands from the charge-coupled device (CCD) cameras of the HJ-1A (HJ-1B) satellites. To make effective use of the water information, we combined bands 1, 2, and 4 to produce three-channel water images. To prevent overfitting and keep the measured segmentation accuracy reliable, we cut the original images into 256 × 256 tiles and randomly flipped, rotated, and scaled them. We created 8000 images in this way and divided them into training and validation sets at a ratio of 8:2. The waters dataset is a binary dataset: the model identifies two semantic categories, water and background. Figure 6 displays the cropped images and their labels.

Inria Dataset
To further verify the generalization ability of the proposed model, we selected the Inria Aerial Image Labeling Dataset. It is a public dataset for computer vision and machine learning research developed by the French National Institute for Research in Computer Science and Automation (Inria). The dataset contains a set of aerial images taken from high altitude, covering some cities in southern France, and the ground coverage types include buildings, trees and roads. Each image is 5000 × 5000 pixels. We cut the images into 256 × 256 tiles and divided them into training and validation sets at a ratio of 8:2. Figure 7 shows part of the dataset.
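As a worked example of the tiling step, the sketch below computes the crop boxes for cutting a 5000 × 5000 image into 256 × 256 tiles. It assumes non-overlapping tiles and discards the incomplete strip at the right and bottom edges; the text does not say how the authors handled the remainder, so padding or overlapping crops are equally plausible. The function name is hypothetical.

```python
def tile_boxes(height, width, tile):
    """Return (top, left, bottom, right) boxes of full, non-overlapping tiles.

    Assumption for illustration: partial tiles at the borders are dropped.
    """
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((top, left, top + tile, left + tile))
    return boxes


inria_boxes = tile_boxes(5000, 5000, 256)
tiles_per_side = 5000 // 256          # 19 full tiles fit along each side
unused_pixels = 5000 - 19 * 256       # 136 pixels remain along each side
```

Under these assumptions each 5000 × 5000 image yields 19 × 19 = 361 tiles, and a 136-pixel strip along two edges is left unused.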
All experiments in this study were carried out on a machine with an NVIDIA RTX3080 graphics card running Windows 10. The experimental models were built with the PyTorch (2017) deep learning framework. For the optimizer, this paper uses adaptive moment estimation (Adam), which combines the advantages of the momentum and RMSProp optimization algorithms: it considers both the first-order and second-order moment estimates of the gradient when computing the update step. The number of iterations is set to 300 in all experiments because, according to experimental observation, most experiments tend to converge after 200 iterations. The loss function used in the experiments is BCEWithLogitsLoss. Due to the memory limitations of the GPU, the batch size was set to 8.
Evaluation metrics are an important reference for judging the effect of the model. Here, we use pixel accuracy (PA), mean pixel accuracy (MPA) and mean intersection over union (MIOU) as evaluation indicators. The following are the MPA, PA, and MIOU formulae.
where k denotes the number of object classes (excluding background), P_i,i denotes the number of correctly classified pixels of category i, and P_i,j denotes the number of pixels that belong to category i but are predicted as category j.
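The three metrics can be computed from a confusion matrix P in which entry P[i][j] counts pixels of true class i predicted as class j. The sketch below uses the standard definitions of PA, MPA and MIoU; it is an illustration in plain Python with hypothetical function names, and it assumes that every class occurs at least once so that no denominator is zero.

```python
def pixel_accuracy(confusion):
    """PA: correctly classified pixels divided by all pixels."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def mean_pixel_accuracy(confusion):
    """MPA: per-class accuracy P_ii / sum_j P_ij, averaged over classes."""
    per_class = [confusion[i][i] / sum(confusion[i])
                 for i in range(len(confusion))]
    return sum(per_class) / len(per_class)


def mean_iou(confusion):
    """MIoU: P_ii / (sum_j P_ij + sum_j P_ji - P_ii), averaged over classes."""
    n = len(confusion)
    ious = []
    for i in range(n):
        true_count = sum(confusion[i])                         # pixels of class i
        predicted_count = sum(confusion[j][i] for j in range(n))  # predicted as i
        intersection = confusion[i][i]
        ious.append(intersection / (true_count + predicted_count - intersection))
    return sum(ious) / n


# Two classes, 10 pixels: rows are true classes, columns are predictions.
example = [[3, 1],
           [1, 5]]
pa = pixel_accuracy(example)
mpa = mean_pixel_accuracy(example)
miou = mean_iou(example)
```

For this example, PA is 8/10 = 0.8, MPA is (3/4 + 5/6)/2 ≈ 0.792, and MIoU is (3/5 + 5/7)/2 ≈ 0.657. MIoU is the strictest of the three because it penalizes both missed pixels and false detections of each class.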

Ablation Experiment
We first use ResNet50 as the backbone network, upsampling each layer and connecting the results for output. We then add a Swin Transformer branch in parallel with ResNet50; the corresponding layers of the two branches are added together and upsampled step by step to form the output. To verify the effectiveness of the model and modules proposed in this research, we then add each module to the model in turn. The primary evaluation metric here is MIOU. The ablation results are shown in Table 1.

(1) Parallel Swin Transformer branch: to extract the image's feature information more effectively, capture spatial location information at more scales, and carry out global information interaction, we first simply add the Swin Transformer branch in parallel to the ResNet50 branch. With the two branches extracting features in parallel, the MIOU rises to 84.40%.

(2) Cross-scale multi-level fusion module (CMFM): if the corresponding layers of the two parallel branches are simply added, some spatial and semantic position information is lost. To extract the feature information more fully and to strengthen the way high-level features direct low-level features, improving both the model's overall recognition ability and its handling of fine details, this paper designs a cross-scale multi-level fusion module. After adding the CMFM, the MIOU of the model rises to 85.53%.

(3) High-low information mutual guidance fusion module (MFM): since the shape and size of rivers vary in the building and water segmentation task, some previous methods do not handle river boundaries finely enough. To restore the edge features of rivers and of some buildings, we fuse the low-level feature information extracted by the convolutional network with the high-level feature information from the Swin Transformer, using the high-level features to direct the low-level features. The module also applies global feature pooling, which again brings in global feature information interaction and improves the accuracy of category region identification and overall model performance. With this module, the MIOU of our model reaches 87.86%.
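Since MIOU is the primary metric in the ablation study, and PA and MPA are reported alongside it later, the following minimal sketch shows how the three values are commonly computed from a confusion matrix. It uses the standard definitions, which are assumed to match the ones used in this paper; the function name `segmentation_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def segmentation_metrics(conf):
    """Compute PA, MPA and MIoU from a square confusion matrix.

    conf[i][j] is the number of pixels whose true class is i and whose
    predicted class is j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))
    pa = correct / total  # pixel accuracy over all pixels

    per_class_acc = []
    per_class_iou = []
    for i in range(n):
        tp = conf[i][i]
        true_i = sum(conf[i])                       # pixels labelled as class i
        pred_i = sum(conf[r][i] for r in range(n))  # pixels predicted as class i
        union = true_i + pred_i - tp
        if true_i > 0:          # skip classes absent from the labels
            per_class_acc.append(tp / true_i)
        if union > 0:           # skip classes absent from labels and predictions
            per_class_iou.append(tp / union)

    mpa = sum(per_class_acc) / len(per_class_acc)
    miou = sum(per_class_iou) / len(per_class_iou)
    return pa, mpa, miou


# Hypothetical 3-class example (background, building, water); counts are made up.
example = [
    [50, 2, 3],
    [4, 30, 1],
    [1, 0, 9],
]
pa, mpa, miou = segmentation_metrics(example)
print(f"PA={pa:.4f}  MPA={mpa:.4f}  MIoU={miou:.4f}")
```

Because MIoU averages the per-class intersection over union, a small class such as narrow rivers weighs as much as a large one, which is one reason it is a stricter indicator than PA.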

Contrast Experiment
In this part, we compare our model with several strong models for building and water segmentation, such as DABNet [35], FCN8s, PSPNet, DeepLabV3+ [36], ShuffleNetV2 [37], BiSeNetV2 [38] and Dual-branch [22]. The DeepLab series mainly uses dilated convolution and atrous spatial pyramid pooling (ASPP) [25]. By applying dilated convolutions with different rates to a given feature layer, it effectively resamples the features and builds convolution kernels with different receptive fields, capturing information about objects at multiple scales. The BiSeNet series fuses the extracted deep feature information and spatial information through a spatial branch and a semantic branch, and supervises model training through an auxiliary loss function. FCN addresses semantic-level image segmentation by replacing the fully connected layer of a traditional CNN with a convolutional layer, so that images are classified at the pixel level [39]. The pyramid pooling module, the primary component of PSPNet, gathers contextual information from several regions to strengthen the acquisition of global information [23]. The ShuffleNet series uses grouped convolution to split the features of the input layer into groups and convolves each group with its own kernels, which reduces the amount of convolution computation and makes the network lightweight. DABNet proposes a depth-wise asymmetric bottleneck module, which uses depth-wise asymmetric convolution and dilated convolution to build the bottleneck layer, producing a sufficient receptive field, making dense use of context information, and greatly reducing the number of parameters.
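Two of the mechanisms described above can be checked with simple arithmetic: a dilated kernel enlarges the receptive field without adding weights, and grouped convolution divides the weight count by the number of groups. The sketch below uses the standard formulas; the helper names, dilation rates and channel sizes are illustrative assumptions and are not taken from the compared networks.

```python
def effective_kernel_size(kernel_size, dilation):
    """Extent covered by a dilated kernel along one axis: k + (k - 1)(d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (no bias) in a 2-D convolution with `groups` groups."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels // groups input channels.
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


# Illustrative dilation rates: the same nine weights cover a growing area.
for d in (1, 6, 12, 18):
    size = effective_kernel_size(3, d)
    print(f"3x3 kernel, dilation {d}: covers {size} pixels per axis")

# Illustrative channel sizes: grouping cuts the weight count by the group count.
dense = conv_weight_count(256, 256, 3)
grouped = conv_weight_count(256, 256, 3, groups=8)
print(f"dense: {dense} weights, 8 groups: {grouped} weights")
```

This is why dilated convolutions suit multi-scale context aggregation, while grouped convolutions are mainly a tool for lightweight networks.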
Table 2 shows that SegNet has the worst segmentation effect, with MIOU and MPA values of only 80.06% and 89.06%. Among the general convolutional neural networks, PSPNet (with a ResNet50 backbone) has the highest segmentation accuracy, with MIOU and MPA values of 86.61% and 92.88%, respectively. We also compare the double-branch parallel network designed in this paper with Swin-UNet, a network built on the Swin Transformer: Swin-UNet reaches an MPA of 90.00% and an MIOU of 81.14%, against 94.11% and 87.86% for our network. Compared with these algorithms, our double-branch parallel network achieves the best value on all three indicators. The comparison shows that our approach significantly improves segmentation and is well suited to the semantic segmentation of buildings and waters. None of the semantic segmentation networks in the table loaded pre-trained weights, and the training parameters were set uniformly to ensure the fairness of the comparison experiment.

Figure 8 compares the prediction maps of some of the networks. We verify our network by comparing the prediction maps for seven remote sensing images. Figure 8i is the label map, Figure 8b-g are the results of the compared networks, and Figure 8h is the prediction of the proposed network. The comparison shows that the maps predicted by our network are more accurate in detecting buildings and waters, with no omissions overall. This is because the double-branch parallel structure fully exploits the strengths of both branches. On top of the global information interaction of the Swin Transformer, the two modules we designed apply global average pooling again, so that global information interaction continues throughout the network. In addition, by using high-level feature information to direct low-level feature information, some edge features of the segmented targets are repaired. Overall, the prediction maps of the double-branch parallel network are the closest to the label maps, and our prediction accuracy is the best.
In Figure 8, (a) is the remote sensing image, (b-h) are the predictions of DeepLabV3, FCN-8s, DABNet, PAN, PSPNet, UNet and the double-branch parallel network designed in this paper, and (i) is the label map. To show the advantage of our network over the other networks more directly, red boxes mark some areas for comparison. The double-branch parallel network clearly reduces misjudgments and missed detections by capturing global information and handling fine details. Edge information is also handled relatively well, and the segmentation accuracy is improved.

To assess the generalization ability of the double-branch parallel network, we also conducted experiments on the water dataset. Compared with the building and water dataset, the water dataset tests the segmentation ability of our model in more complex background environments. Comparing several network models on the water dataset increases the diversity of the experiments and demonstrates the advantages of the network more thoroughly.
Here, our model is compared with several land cover segmentation networks, such as ESPNetV2, SegNet, DeepLabV3+, PSPNet, FCN8s and other traditional land segmentation networks. We also compare it with some recent Transformer-based networks, such as PVT (Pyramid Vision Transformer), ViT (Vision Transformer), CvT and Conformer. All experiments were carried out under the same conditions. The segmentation accuracy is shown in Table 3. The table shows that the model proposed in this study achieves the highest segmentation accuracy, at 96.38%.
Figure 9 compares the predictions of various models on the waters dataset. Figure 9a is the remote sensing image, Figure 9b-h are the predictions of CvT, DeepLabV3+, DFN, FCN-8s, SegNet, ShuffleNetV2 and the double-branch parallel network designed in this paper, and Figure 9i is the label map. Some areas are marked with yellow boxes for comparison. The figure shows that our network still detects rivers well under different complex background conditions. Although the other models also detect the rivers, they always miss parts of some small tributaries, whereas our model detects small tributaries well. This is because the two branches of our model fully extract spatial feature information and detail feature information; with the CMFM module, the global feature information interacts fully, so the model captures global context better and locates the river accurately. Finally, the MFM module uses the high-level feature information to guide the low-level feature information, making up for missing feature information.

Inria Dataset
The main task of the model in this paper is to segment remote sensing images of buildings and waters. The generalization experiment on the waters dataset showed that our model segments waters well. To examine the ability of our model further, we conducted generalization experiments on the Inria dataset and compared it with several land segmentation network models, as shown in Table 4. The table shows that the model designed in this paper achieves the best value on all three indicators, which indicates strong generalization ability. Figure 10 compares the predictions of our model with those of the other models. The figure shows that our model still achieves good results on the public Inria dataset. Because the designed fusion modules let our two-branch model combine the complementary advantages of the CNN and the Swin Transformer, the prediction maps show good results both in the handling of edge details and in the reduction of misjudgments, which further supports the model's generalization ability.

About the Model
The network is based on the Swin Transformer and a CNN. In the feature encoding stage, the designed cross-scale multi-level fusion module connects the two branches, and comprehensive semantic information and spatial information are extracted using the CNN and the Swin Transformer. The fusion module lets the features extracted by the two branches guide each other, making full use of the Swin Transformer's global information interaction and compensating for the judgment errors caused by the CNN's lack of global information and long-range semantic interaction. In the feature decoding stage, the designed multi-scale fusion module fuses the high-level feature information from the encoding stage with the low-level feature information extracted by the CNN; the high-level features direct the low-level features, and the result is upsampled step by step. Through the joint action of these modules, our network significantly increases segmentation precision. The main contributions of this paper are as follows:

1. A double-branch parallel network of Swin Transformer and CNN is proposed. The two network structures extract feature information separately and then aggregate it, which improves the accuracy and generalization of segmentation. The Swin Transformer compensates for the limited receptive field of the convolutional neural network (CNN) and performs global information interaction better; in turn, the CNN compensates for the Transformer's lack of translation invariance.

2. Considering the differences between the feature information extracted by the two branches, a cross-scale bilateral feature aggregation module is proposed. It effectively aggregates feature information from different levels and lets the levels guide each other, so that more feature information interacts globally, which reduces misjudgments. In the upsampling stage, an aggregation module is also proposed, which uses high-level semantic information to direct low-level semantic information and recovers high-resolution pixel-level feature information and edge feature information.
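The idea of letting high-level features direct low-level features through global pooling can be illustrated with a generic channel-gating sketch. This is not the paper's MFM or CMFM, whose exact layers are not given in this section. It is a simplified stand-in that assumes feature maps stored as nested lists of shape [C][H][W], equal channel counts in both inputs, a sigmoid gate, and nearest-neighbour upsampling. All function names are hypothetical.

```python
import math


def global_average_pool(feature):
    """Mean of every channel of a [C][H][W] feature map."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def upsample_nearest(feature, scale):
    """Nearest-neighbour upsampling of a [C][H][W] map by an integer factor."""
    return [
        [[row[x // scale] for x in range(len(row) * scale)]
         for row in ch for _ in range(scale)]
        for ch in feature
    ]


def guided_fusion(high, low, scale=2):
    """Gate `low` with channel weights pooled from `high`, then add upsampled `high`.

    Assumption: `low` is exactly `scale` times larger than `high` in height
    and width, and both have the same number of channels.
    """
    # One gate per channel, derived from the global context of the high-level map.
    gates = [1.0 / (1.0 + math.exp(-m)) for m in global_average_pool(high)]
    high_up = upsample_nearest(high, scale)
    return [
        [[g * lo + hi for lo, hi in zip(low_row, high_row)]
         for low_row, high_row in zip(low_ch, high_ch)]
        for g, low_ch, high_ch in zip(gates, low, high_up)
    ]


# Toy inputs: one channel, a 1x1 high-level map and a 2x2 low-level map.
high = [[[2.0]]]
low = [[[1.0, 2.0], [3.0, 4.0]]]
for row in guided_fusion(high, low)[0]:
    print([round(v, 3) for v in row])
```

In this simplified form, the pooled high-level context decides how strongly each low-level channel contributes, while the upsampled high-level map restores semantic content at the finer resolution.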

About the Experiment
To verify the ability of our model, this study conducted comparative and generalization experiments on the building and water dataset, the water dataset and the public Inria dataset. In the comparative experiment, our model outperforms the other classical network models on all three indicators: the PA, MPA and MIOU values reach 93.64%, 94.11% and 87.86%, respectively. In the prediction maps, the model designed in this paper is compared with the other networks in terms of misjudgments and the handling of edge features, which reflects the advantages of our dual-branch network. With the fusion modules, the CNN and the Swin Transformer each play to their strengths: more feature information interacts globally, and high-level feature information guides low-level feature information, producing finer edge details. We also conducted generalization experiments on the water dataset and the Inria dataset. In both the numerical results and the prediction maps, our network model remains superior to the other models, which indicates better generalization ability.

Conclusions
In remote sensing images, buildings and waters are important geographical features. They have practical significance for land planning, water resources protection planning and geographic mapping. The segmentation of buildings and waters is also an important part of the land cover segmentation task. Existing image segmentation network models primarily employ CNNs to extract feature information from images. To compensate for the deficiencies of CNNs in feature extraction and to interact more fully with global semantic information, this paper presents a double-branch parallel network for the segmentation task. In the encoding process, we use ResNet50 and the Swin Transformer as the two feature extraction branches to obtain rich context information and spatial information, make full use of the strengths of both branches, and fuse the features extracted by the different branches through our fusion module, which provides rich pixel information for recovery during upsampling. In the decoding process, we use a dedicated fusion module to fuse the encoded high-level feature information with the features of the ResNet50 branch, using the high-level features to direct the low-level features. Upsampling gradually refines and restores high-resolution images and recovers more spatial detail. Compared with some current semantic segmentation network models, our model greatly improves the segmentation accuracy for buildings and waters. Its performance on the different datasets shows good robustness to interference and good recognition capability: it accurately locates waters and buildings in complex background environments, and the segmented edges are finer. In the future, to improve practical applicability, we will further optimize the model structure while maintaining segmentation precision, and improve the model training speed.

Figure 1. The overall structure of the double-branch parallel network.

Figure 3. Feature heat map comparison for the CMFM module: (a) the original image; (b) its label; (c,d) the results without and with the module, respectively.

Figure 5. Partial display of the land cover dataset. (a-e) show different remote sensing images and their corresponding labels.

Figure 6. Partial display of the waters dataset. (a-d) show different remote sensing images and their corresponding labels.

Figure 7. Partial display of the Inria dataset.

Limitations and Future Prospects of the Model
Since both CNN and Transformer models have a large computational overhead, our future work is to further optimize the structure of the model while maintaining its segmentation accuracy, and to design a more efficient model structure and more effective training strategies that reduce training complexity and difficulty.

Table 1. Results of ablation experiments on the land cover dataset.

Table 2. Experimental results compared with other algorithms.

Table 3. Comparison of experimental results for different models on the water dataset.

Table 4. Comparison of experimental results for different models on the Inria dataset.