MSNet: Multifunctional Feature-Sharing Network for Land-Cover Segmentation

Abstract: In recent years, the resolution of remote sensing images, especially aerial images, has become higher and higher, and their spans of time and space have become larger and larger. The phenomenon in which one class of objects can produce several kinds of spectra may lead to more errors in detection methods that are based on spectra. For different convolution methods, downsampling can provide some high-level information but leads to rough detail extraction, while too deep a network greatly increases the complexity and computation time of a model. To solve these problems, a multifunctional feature extraction model called MSNet (multifunctional feature-sharing network) is proposed, which is improved on two levels: deep feature extraction and feature fusion. Firstly, a residual shuffle reorganization branch is proposed; secondly, linear index upsampling with different levels is proposed; finally, the proposed edge feature attention module allows the recovery of detailed features. The combination of the edge feature attention module and linear index upsampling not only benefits the learning of detailed information, but also ensures the accuracy of deep feature extraction. The experiments showed that MSNet achieved 81.33% MIoU on the LandCover dataset.


Introduction
Remote sensing image processing technology has played an important role in the study of urban and rural land conditions [1,2]. Accurate information on land cover is a key data resource for urban planning and other fields [3]. Semantic segmentation of aerial orthophoto images is very important for detecting the real-time status of buildings, plants, and surface water. However, existing land-cover segmentation models still have some defects. Among early remote sensing image segmentation practices, evaluation measures from mathematical statistics were widely used: such methods obtain the mean and variance of each category in the target area by learning from the receptive field so as to obtain the classification results. Other relevant methods rely on the spectral discrimination ability [4] of a trained model to obtain the spatio-temporal features of an image. However, with the progress of science and technology, the resolution of remote sensing images continues to improve and spectral features are becoming more and more complex; as a result, small differences among objects of the same category can have a great impact on the segmentation results. Therefore, using only spectral features to extract targets is often not enough. Classification algorithms based on machine learning, such as broad learning [5] and non-deep neural networks [6], are not suitable for large amounts of data. When used for detection, target maps only undergo a small amount of linear or nonlinear transformation, so the classification of complex detailed information and high-order semantic information by these means is often poor. Especially for hyperspectral remote sensing maps with large feature differences and numerous spatio-temporal indices, the above classification results are not satisfactory. For this type of land detection analysis [7], aerial images are widely used.
There are many ways in which land is occupied, and the influence of tall architecture on low architecture is complex and changeable; for vegetation, areas covered by shrubs are difficult to separate from forest areas [8]. In addition, a wide variety of trees grow in different ways and soil types. Water is divided into living water and dead water, including natural pools and man-made ponds, but ditches and riverbeds are excluded. These characteristics make it very difficult to extract features from remote sensing images. Last but not least, the traditional methods above [3,5,9] usually require manual calculation of the statistics of the obtained parameters, which further increases the complexity of deeper feature learning. Remote sensing images are developing towards higher resolutions and larger space-time spans. The same object with different spectra and different objects with the same spectrum can make classification more difficult. To sum up, the conventional means of land detection have limited feature-mining abilities and fail to adapt to different datasets.
In recent years, deep learning has been widely used in the field of remote sensing image analysis for land cover [9][10][11][12] and other applications [13]. Since Long et al. [14] published the fully convolutional network (FCN, 2015), many scientific achievements based on pixel classification have emerged. For instance, Ronneberger et al. [15] proposed UNet, which can obtain detailed contextual information. The pyramid aggregation proposed by Zhao et al. [16] can integrate the contextual information of different regions so as to enhance the learning of overall characteristics, as in PSPNet [17] and DeepLabV3+ [18]. However, the excessive amount and complexity of the computations performed by such models restrict the experimental equipment and cause a certain waste of resources [19]. Therefore, Howard et al. proposed MobileNet (a lightweight network) to alleviate the computational pressure, but fully releasing the efficiency of a model is still a major problem faced by researchers. For example, Zhang et al. [20] proposed a channel shuffle module in 2017, which released the potential of the model by shuffling and recombining channels; Yu et al. proposed a feature fusion module (FFM) [21] and an attention refinement module (ARM) [22] in 2020 to balance accuracy and speed; the selective kernel proposed by Li et al. in 2019 applied an attention mechanism to the convolution kernels, allowing the network to select suitable kernels on its own. The difficulty in this kind of research is enhancing the accuracy of the model on the premise of limiting its weight. Fully releasing the performance of the model [23] is our research direction.
Considering the existing problems, a multifunctional feature-sharing network for land-cover segmentation is proposed in this paper. For deep feature mining, an SFR module is proposed. Inspired by the residual structure [24], we take the output of the shuffle unit as the input of ResNet and change the numbers of input and output channels. In this way, even if the SFR blocks are stacked many times, the amount of computation (FLOPs) can be strictly limited to about 1 G. Another branch that we propose is a linear index upsampling branch with different levels, which guides upsampling after continuous downsampling and saves the process of learning to upsample. At the same time, the outputs after two SFR modules and three SFR modules are processed by the EFAModule and then fused with the output of the last downsampling of this branch to extract logical features and recover high-resolution detailed information, improving the learning of detail features and edge information. The introduction of the EFAModule can alleviate the mutual occlusion caused by different objects with the same spectrum and can also greatly reduce the influence of the shadows of tall objects on low-lying waters. The experimental results show that this branch can ensure the accuracy of deep feature extraction, and the fusion of the two branches performs better: the mean intersection over union (MIoU) of the multifunctional feature-sharing network is higher than that of other networks. In general, this work makes three contributions:
• One branch is a linear index upsampling branch with different levels. There is no need to learn upsampling. A trainable convolution kernel is used for convolution operations to acquire a complex feature map, which not only limits the amount of computation, but also ensures the integrity of high-frequency information.
• The other branch combines a shuffle unit with a skip connection. Channel rearrangement makes the extraction of information more evenly distributed, and the residual structure ensures the accuracy of deep semantic information extraction. This branch extracts key features and attends to the dependencies between contexts by using the logical relationships [25] between intra-class and inter-class information.
• After processing by the two SFR modules, the EFAModule is introduced to extract logical features and recover high-resolution detailed information, achieving good results in learning the detailed information and edge information of a feature map.
The network structure enables the model to better integrate global information and enhances the extraction of intra-class and inter-class information. The MSNet model improves the mean intersection over union (MIoU) by 1.19-6.27%, with only 10-20% of the weight of other models such as PSPNet and DeepLabV3+.

Land-Cover Segmentation Methodology
With the improvement of remote sensing image resolution, the amount of detailed information in remote sensing images has also greatly increased. Therefore, frameworks that are applicable to land cover have great room for progress from the perspectives of detailed information and upsampling feature fusion [26]. Fully releasing the efficiency of models based on the above is the research direction of this paper.

Network Architecture
This paper proposes a special image segmentation network. Firstly, we propose a residual shuffle reorganization branch. This branch learns the deep-level information of images in the order of the channels, pays attention to the logical relationship between intra-class and inter-class information, and reduces misclassifications of the same object and misdetections of different objects. Secondly, we propose a linear index upsampling branch with different levels, which does not need to learn upsampling; a trainable convolution kernel is used for the convolution operation to generate a dense feature map and fully extract the semantic information of the target feature map. Then, the EFAModule is introduced to strengthen the recognition of class information and accurately segment edge information. The feature map processed by the SFR and EFAModule is fused with the downsampled feature map of the linear index upsampling branch, which effectively limits the amount of computation on the premise of ensuring the integrity of high-frequency information [27]. Finally, the output of the linear index upsampling branch is fused with the output of the SFR branch, and the final prediction map is generated [28]. The two-way fused MSNet has better performance, and its mean intersection over union (MIoU) is higher than that of other networks. The hidden units in each convolutional layer are explicitly indicated in Table 1, and the overall architecture is described in the following (Figure 1):

SFResidual
Inspired by ShuffleNet [29], we took the output of the shuffle unit as the input of the residual structure to form the SFResidual module, as shown in Figure 2. The shuffle unit's channel shuffling can cover the global information and make the extraction of information more uniform [30]; the residual structure uses the classic "skip connection", which can efficiently complete recognition tasks with a large number of classes, so the introduction of the residual structure can make up for the deficiencies of a lightweight network. The fusion of the two can greatly improve the spectral recognition ability, alleviate the problem of image misclassification, and greatly improve the accuracy of segmentation. The structure is shown in Figure 3.
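To make the channel rearrangement concrete, the following is a minimal NumPy sketch of the shuffle-plus-skip idea behind the SFResidual module. The function names and the toy `sf_residual` combination are illustrative assumptions, not the paper's exact implementation, which also includes convolutions and changes to the channel counts.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle the channels of an (N, C, H, W) tensor across groups.

    Reshape (N, C, H, W) -> (N, g, C//g, H, W), swap the group axis with
    the per-group channel axis, and flatten back, so that channels from
    different groups become interleaved.
    """
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap group and per-group channel axes
    return x.reshape(n, c, h, w)

def sf_residual(x, groups=2):
    """Toy SFResidual-style block: shuffled features plus an identity skip."""
    return channel_shuffle(x, groups) + x  # ResNet-style skip connection
```

With 4 channels and 2 groups, the channel order [0, 1, 2, 3] becomes [0, 2, 1, 3], which is how the shuffle spreads information across the groups before the skip connection adds the original features back.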

LIU Branch
Note that with more convolutional layers, more corresponding features will be extracted, but a very deep network will cause gradient vanishing and gradient explosion. VGG-16 uses convolutions to simulate fully connected layers, which can effectively alleviate this problem, so we propose a linear index upsampling branch with different levels (LIU) to optimize VGG-16 [31,32] and achieve a better improvement.
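The paper does not give the exact LIU formulation, but upsampling guided by recorded pooling indices (as in SegNet-style max unpooling) illustrates the general idea of restoring resolution without learning a deconvolution. The following is a simplified NumPy sketch under that interpretation; the function names are hypothetical.

```python
import numpy as np

def max_pool_with_indices(x, size=2):
    """2x2 max pooling over a 2D map; also record the flat argmax indices."""
    h, w = x.shape
    out = np.zeros((h // size, w // size))
    idx = np.zeros_like(out, dtype=int)
    for i in range(0, h, size):
        for j in range(0, w, size):
            window = x[i:i + size, j:j + size]
            k = int(window.argmax())
            out[i // size, j // size] = window.flat[k]
            # convert the within-window argmax back to a flat index in x
            idx[i // size, j // size] = (i + k // size) * w + (j + k % size)
    return out, idx

def index_upsample(pooled, idx, shape):
    """Place each pooled value back at its recorded position, zeros elsewhere."""
    up = np.zeros(shape)
    up.flat[idx.flatten()] = pooled.flatten()
    return up
```

Because the indices are recorded during downsampling, the upsampling step has no trainable parameters, which is consistent with the claim that this branch "saves the process of learning upsampling".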

EFAModule
Generally speaking, dilated convolution is used to prevent the loss of spatial hierarchical information. A convolution with a kernel of 3 and a dilation rate of 2 effectively becomes a 5 × 5 convolution, which can produce gridding effects while increasing the receptive field. In order to reduce the influence of such problems, an EFA unit is proposed based on the lightweight structure of BiSeNet [33]. As shown in Figure 4, we adopt a two-branch model composed of strip convolutions. One branch is used to obtain local information, and the other branch introduces dilation parameters to obtain edge semantic information. The comprehensive extraction of multiscale information enhances the module's ability to learn edge information. The output map (after ABunit processing) is pyramid pooled to obtain a feature map with C1 channels and reshaped into the dimensions (H · W, C1) and (C1, H · W), as shown in Figure 5; these are then cross-multiplied. The resulting map is processed with the softmax function to obtain a large feature map with dimensions of (H · W, H · W). This attention map is then cross-multiplied with the pre-softmax (H · W, C1) feature map and reshaped; finally, concatenation is performed on the channel dimension. After this fusion, the module's logic information extraction ability is significantly enhanced, and the accuracy of edge information and detailed information recognition is improved [34]. For example, buildings covered by tree shadows are no longer misclassified as background, and the segmentation of water edges is no longer affected by coasts and ships. The detailed structure of the edge feature attention module is shown in Figure 5. In summary, the linear index upsampling branch with different levels not only limits the amount of computation, but also ensures the integrity of high-frequency information.
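The cross-multiplication and softmax steps described above amount to a non-local, self-attention-style reweighting of spatial positions. A minimal NumPy sketch, with hypothetical names and omitting the pyramid pooling and final concatenation, might look like this:

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat):
    """Attention over spatial positions, as described for the EFAModule.

    feat: (C1, H, W). Reshape to (H*W, C1) and (C1, H*W), cross-multiply
    to get an (H*W, H*W) affinity map, softmax it, then use it to
    reweight the (H*W, C1) features and reshape back to (C1, H, W).
    """
    c1, h, w = feat.shape
    q = feat.reshape(c1, h * w).T          # (H*W, C1)
    k = feat.reshape(c1, h * w)            # (C1, H*W)
    affinity = softmax(q @ k, axis=-1)     # (H*W, H*W), rows sum to 1
    out = affinity @ q                     # (H*W, C1) reweighted features
    return out.T.reshape(c1, h, w)
```

Each output position is a convex combination of all positions' features, which is what lets the module relate edge pixels to their larger context.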
The SFResidual module extracts key features and pays attention to the logical relationship between information within a class and between classes so that it can more fully focus on the dependencies between contexts. The edge feature attention module can provide high-resolution detailed information, and it has achieved good results in the learning of the detailed information and edge information of a feature map.

Land-Cover Segmentation Experiment
We verified the model for land-cover segmentation proposed in this paper by using a dataset that we made. The dataset came from Google Earth, a virtual Earth application developed by Google that presents satellite photos, aerial photos, and GIS data in the form of three-dimensional models. The authors first obtained 1000 large images with a resolution of 1500 × 800 px from Google Earth on 20 March 2021; these were cut into 23,915 small images with a resolution of 224 × 224 px. These large images had a large spatial span and a variety of shooting angles. They were roughly divided into the following categories: private villas in wealthy areas of North America and Europe, villages and forests in Western European countries, and China's coastal rivers. In summary, the dataset covered a wide area, including many environments with complex terrain, and it was suitable for investigating the true detection capabilities of the model. As shown in Figure 6, these images were manually labeled into three types of objects, including buildings (white). The semantic segmentation of this dataset presented great difficulties, and there were some further problems [35]. Buildings are stationary objects, but the differences in their heights are large: the projected shadows of high buildings affect the edge contour segmentation of low buildings, and the same is true for tall trees. In remote sensing images, objects whose projections are similar in appearance are likely to be indistinguishable. As shown in the figures below, some vehicles were similar to buildings.
Although they are small in size, large areas of stationary vehicles are easily misclassified as buildings; a close-to-horizontal viewing angle can cause trees to hide the water, which makes a training set more difficult to learn; the tops of some buildings are similar in color to vegetation and can easily be misclassified as the background; the same water area (a private swimming pool) can have two colors, blue and green, making it more difficult to segment water objects. In summary, this dataset is relatively difficult to learn [36], and it is also difficult to use it to perform accurate land-cover segmentation and perfect target classification, as shown in Figure 7. To facilitate the experiments, all pictures were cut in a certain order (from left to right and from top to bottom), with no area overlap during the cutting; images with only one category were excluded to obtain a final dataset of more than 12,000 images (224 × 224 px). The dataset was randomly divided into a training set and a test set at a ratio of 7:3.
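The non-overlapping left-to-right, top-to-bottom cutting described above can be sketched as follows (a minimal illustration; the tile size and the dropping of incomplete edge tiles are assumptions):

```python
def tile_origins(img_h, img_w, tile=224):
    """Top-left corners of non-overlapping tiles, scanned left-to-right,
    top-to-bottom; partial tiles at the right/bottom edges are dropped."""
    return [(r, c)
            for r in range(0, img_h - tile + 1, tile)
            for c in range(0, img_w - tile + 1, tile)]
```

For example, a 448 × 672 px image yields 2 × 3 = 6 tiles of 224 × 224 px; filtering out single-category tiles would then happen as a separate pass over the cut images.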

Public Dataset
This dataset consisted of 310 aerial images in the Boston area, each with 1500 × 1500 pixels, and it contained hyperspectral, multispectral, and SAR-type images (the reader can search for the 'Massachusetts Roads Dataset' on the official website to find it easily). The complete dataset covered 34,000 hectares. We cut the dataset into 10,620 small images (256 × 256 px), which were divided into a training set and a verification set according to the ratio of 7:3. It contained only the buildings (white, RGB [255, 255, 255]) and the background (black, RGB [0, 0, 0]), as shown in Figure 8.

Four-Class Public Dataset
For the sake of verifying the effect of the proposed network in the segmentation task, the LandCover dataset [37] was used, as shown in Figure 9. The dataset had 33 pictures with a resolution of 25 cm (about 9000 × 9500 px) and 8 pictures with a resolution of 50 cm (about 4200 × 4700 px). It was definitely not easy to accurately segment this dataset. In addition to the difficulties mentioned above, there was still the problem of how to accurately define these four types of objects. "Building" refers to a regular solid object with a certain height that will not move; "vegetation" refers to tall trees, flower beds, green belts, etc., but does not include pure grassland; "water" includes rivers and streams, but does not include waterless ponds. The projections of these objects blocked each other, which could easily cause false detections. In Figure 9, the objects encircled by the yellow ellipse are single trees, which are easy to misclassify as forests. The low shrubs marked by the yellow rectangle are also easily misclassified as forests; buildings marked with blue rectangles are easily misclassified as background; the objects marked with pink ellipses are greenhouses, which can easily be mistaken for buildings but should be classified as background. To sum up, it is not easy to perfectly classify land cover in this dataset. We processed the dataset as follows: all images were cut in a certain order, without overlap or omission, into images of 224 × 224 px. Pure-color pictures were removed, and the remaining images were randomly divided into a training set, validation set, and test set according to the ratio of 7:3.

Evaluation Index
In this experiment, we selected three evaluation indicators: the pixel accuracy (PA), mean pixel accuracy (MPA), and mean intersection over union (MIoU). Let k be the number of categories and let p_{ij} denote the number of pixels whose true class is i and whose predicted class is j; then p_{ii} counts true positives, p_{ij} (i ≠ j) counts false negatives, and p_{ji} (i ≠ j) counts false positives. The indicators are calculated as follows:

PA = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}}, \qquad
MPA = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij}}, \qquad
MIoU = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}}

True positive: the model predicts a positive example, and it actually is positive. False negative: the model predicts a negative example, but it is actually positive. False positive: the model predicts a positive example, but it is actually negative. True negative: the model predicts a negative example, and it actually is negative. The per-class pixel accuracy indicates, for each of the three classes, the ratio of correctly predicted pixels to all pixels of that class [38,39]; the mean pixel accuracy averages this ratio over the classes [40]; the mean intersection over union is an important index for measuring the effect of land segmentation [41]. It is the ratio of the intersection to the union of the ground truth and the prediction, averaged over all classes [42]. This index reflects the quality of a network and the advantages and disadvantages of a model well.
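Given a k × k confusion matrix with rows as ground truth and columns as predictions, the three indicators can be computed directly. This is a NumPy sketch following the standard definitions of PA, MPA, and MIoU (the function name is illustrative):

```python
import numpy as np

def segmentation_metrics(conf):
    """PA, MPA, and MIoU from a k x k confusion matrix.

    conf[i, j] = number of pixels with true class i predicted as class j.
    """
    conf = conf.astype(float)
    tp = np.diag(conf)                       # true positives per class
    pa = tp.sum() / conf.sum()               # overall pixel accuracy
    mpa = np.mean(tp / conf.sum(axis=1))     # mean per-class accuracy
    # IoU per class: TP / (TP + FN + FP) = p_ii / (row sum + col sum - p_ii)
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return pa, mpa, float(iou.mean())
```

For a two-class matrix [[3, 1], [1, 3]], this gives PA = 0.75, MPA = 0.75, and MIoU = 0.6, which matches computing the ratios by hand.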

Supplementary Experimental Procedures
This experiment was based on the public platform PyTorch. In this work, we used the 'poly' learning rate strategy [43] and the Adam optimizer. We believed that the Adam optimizer was the most suitable for this dataset and network; for dense data, the SGD optimizer is often adopted, and although it takes a long time and is easily trapped at saddle points, it can quickly reach an extremum. In contrast, the Adam optimizer converges quickly, and its training curve rises relatively stably. The experiments showed that the Adam optimizer could make the most of the model given the density of the land segmentation dataset. Too high a learning rate leads to too large a step and easily overshoots the optimum; too low a learning rate leads to too slow a convergence speed. The training effect was best with a learning rate of 0.001 in this experiment, so the learning rate was set to 0.001. When the power was lower than 0.9, the rise over the first 100 epochs was too slow, and when the power was higher than 0.9, the last 150 epochs were completely saturated. So, the base learning rate was set to 0.001, the power was set to 0.9, and the upper iteration limit was set to 300 epochs. The momentum and weight decay rates were set to 0.9 and 0.0001, respectively. Considering the GPU memory available in this experiment, the training batch size was set to 4. All experiments were carried out on a Windows 10 system with an Intel(R) Core i5 10400F/10500 CPU at 2.90 GHz, 16 GB of memory, and an NVIDIA GeForce RTX 3070 (8 GB) graphics card. This experiment used Python 3.8 with CUDA 10.1. We used the cross-entropy loss function [14] to calculate the loss of the neural network. Shannon proposed that the information content of an event decreases as its probability of occurrence increases, and vice versa.
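The 'poly' learning rate strategy mentioned above is commonly defined as lr = base_lr · (1 − epoch/max_epoch)^power. A one-line sketch with the settings used here (base_lr = 0.001, power = 0.9, 300 epochs) might be:

```python
def poly_lr(base_lr, epoch, max_epoch, power=0.9):
    """'Poly' decay: lr = base_lr * (1 - epoch / max_epoch) ** power."""
    return base_lr * (1 - epoch / max_epoch) ** power
```

The schedule starts at the base learning rate, decays monotonically, and reaches zero at the final epoch, which is why a power above 0.9 saturates the last epochs while a lower power slows the early ones.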
If the probability of an event x is P(x), its information content is expressed as:

I(x) = -\log P(x)

Information entropy expresses the expectation of the information content:

H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)

If two separate probability distributions P(x) and Q(x) describe the same random variable, we can use the relative entropy (in this paper, between the predicted values and the labels) to quantify the difference between the two distributions:

D_{KL}(P \parallel Q) = \sum_{i=1}^{n} P(x_i) \log \frac{P(x_i)}{Q(x_i)}

where x_i is a sample, P and Q are two probability distributions of the random variable, and n is the number of samples. The gradient descent algorithm was used in the training process. By comparing labels and predictions, the parameters were updated through backpropagation. The optimal parameters of the trained model were saved.
For the problem of land-cover segmentation, the effect of the cross-entropy loss function was better than that of the mean square error loss function [44], so the cross-entropy loss function [45] was used in this experiment.
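As a small worked example of discrete cross-entropy, H(p, q) = −Σ p_i log q_i (a plain NumPy sketch, not the batched PyTorch loss actually used in training):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * log(q_i) for two discrete distributions.

    p is the target distribution (e.g. a one-hot label), q the prediction;
    q is clipped away from zero to avoid log(0).
    """
    q = np.clip(q, eps, 1.0)
    return float(-np.sum(p * np.log(q)))
```

For a one-hot label [1, 0, 0] and a prediction [0.5, 0.25, 0.25], the loss is −log 0.5 = ln 2 ≈ 0.693; it shrinks toward zero as the predicted probability of the true class approaches 1.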

Analysis of the Results
The optimal values are shown in bold. As shown in Table 2, the main module used the SFResidual module as the backbone network, and the training parameters of all models were set to the same values. According to the information in the table, the EFAModule brought an improvement of 1.03% because of its ability to recover detailed information and capture boundary information. The branch composed of the DCModule and EFAModule improved the result by 4.03%. The GDC_branch, which connected the DCModule and the EFAModule, was able to greatly improve the segmentation effect. This combination paid attention to contextual information, detail features, and boundary information at the same time, which improved the accuracy of the final MSNet by 5.71% (learning rate = 0.001, power = 0.9, weight decay rate = 0.0001, batch = 4, epoch = 300). The maps that included the EFAModule obviously avoided many misclassifications and achieved a better edge segmentation effect, which was beneficial in that it was possible to extract logical features and output high-resolution detailed information. In comparison with the SFRModule alone, the combination avoided misdetecting the ground as buildings. Obviously, the combination of the SFR, LIU, and EFA modules allowed essentially all of the misclassifications and edge blur to be avoided, which largely improved the segmentation effect. A diagram of the effects in the ablation experiment is shown in Figure 10. To compare the performance of each model, the models were tested under the same conditions. Figure 11 shows a chart comparing the effects of MSNet and the other models. In the first and second sets of images, under the influence of vehicles, lawns, and other objects, networks such as FCN and SegNet showed different degrees of misdetection. SegNet mistakenly recognized the background as buildings. The problem of edge blurring in images assessed with ExtremeC3Net was obvious, but our model achieved accurate segmentation.
In the third set of images, the low buildings above the swimming pool were easily recognized as the background, and networks such as SegNet and PSPNet had obvious omissions in their recognition of buildings. In the fourth and fifth groups of images, although there was interference from the boat and the cement on the land by the sea, our model still achieved a more accurate segmentation of the water, whereas the other models, especially SegNet and ExtremeC3Net, misclassified the boat and the cement on the land as buildings. In fact, the boat and the cement on the land should have been classified as background. In the sixth set of images, the blue buildings were easily recognized as water, as they were by ExtremeC3Net, but our model avoided such mistakes. This was due to the synchronous learning of intra-class and inter-class information by the SFRModule; by combining it with the EFAModule, the high-frequency detailed information was restored and the accuracy of the edge segmentation was ensured. Finally, the fusion with the LIU branch caused our model to achieve a great effect. As shown in Figure 11, the actual segmentation effect of MSNet was better than that of the other networks (learning rate = 0.001, power = 0.9, weight decay rate = 0.0001, batch = 4, epoch = 300), and the heat map for this dataset is shown in Figure 12. Table 3 shows the evaluation metrics of each model, where the PA represents the pixel accuracy of each category in the three categories. In comparison with FCN-32s, SegNet, DABNet, UNet [33], EspNet [46], ShuffleNetv1 [20], and the other models, MSNet was able to achieve the best results, which were 1.19% higher than the second-best index. The MIoU curve of the model is shown in Figure 13 below. In the first 50 generations, the growth rate of ExtremeC3Net [52] was very fast, better than that of MSNet, but after 100 generations, the MSNet curve was steadily maintained above those of the other models.
The same was true for the training loss curve (Figure 14): in the first 50 generations, MSNet's loss was significantly higher than that of DABNet, but it stabilized at the bottom of all curves after 100 generations. From the point of view of convergence speed and long-term effect, MSNet had great superiority. We provide a feature space analysis to show the segmentation of MSNet. In Figure 12, red represents the segmentation object and blue represents the background. Through graphic analysis of the heat map, it can be seen that MSNet's segmentation of the buildings at the lower-left corner of the first image was more accurate, but the shadows at the upper-left corner were wrongly detected as buildings. For the second image, MSNet was able to accurately detect the overall scope of the water area, but there was a false detection at the center of the water area.
For the sake of verifying the generalization ability of the model, further experiments were carried out on the public land-cover dataset. The dataset consisted of 310 aerial images in the Boston area, each with 1500 × 1500 pixels and covering an area of 225 hectares; the entire dataset covered about 34,000 hectares. It was cut into 10,620 small images (256 × 256 px), which were divided into a training set and a verification set at a ratio of 7:3 [45]; this dataset had only the building (white, RGB [255, 255, 255]) and background (black, RGB [0, 0, 0]) types. Without data enhancement, the settings of the various hyperparameters, except for the batch size of 3, were the same as those in the previous experiment. For the first set of images, it was obvious that MSNet's edge segmentation effect was much better than those of FCN32s and DeepLabV3Plus. For the second, third, and fourth sets of images, the abilities of PSPNet and DABNet to segment small and dense buildings were relatively poor, but our model accurately recognized those buildings, including their edge information and detailed information. In terms of indicators, MSNet was 1.01% better than the second-best model and 9.44% better than the lowest model. The experimental results are shown in Table 4 and Figure 15 (learning rate = 0.001, power = 0.9, weight decay rate = 0.0001, batch = 4, epoch = 300). To verify the performance of the model with other categories of datasets, further experiments were carried out on a public water-cover dataset (Figure 16). The data came from high-resolution remote sensing images selected from Google Earth, and the number of images was 26,200. In order to make the data more authentic, we used a wide range of distributions; in terms of river selection, we chose rivers with different widths and colors as well as small and rugged rivers.
On the other hand, we selected complex environments surrounding the rivers, including hills, forests, urban areas, farmlands, and other areas, which could fully test the generalization performance of the model. Some of the collected river images are shown in Figure 15. The average size of the Google Earth images was 4800 × 2742 pixels, and they were cut to 224 × 224 px for model training. The training set and test set contained 20,960 and 5240 images, respectively. This dataset had only the water (red, RGB [128, 0, 0]) and background (black, RGB [0, 0, 0]) types. Without data enhancement, the settings of the various hyperparameters, except for the batch size of 3, were the same as those in the previous experiment. For the first set of images, it was obvious that MSNet's edge segmentation effect was much better than those of SegNet and DeepLabV3Plus. There were also obvious fractures in SegNet's results for the first and second sets of images. For the third and fourth groups of maps, SegNet and DABNet mistakenly detected the grassland and buildings as water areas, and DeeplabV3+ mistakenly detected an intersection of rivers as the background. The edge detection accuracy of PSPNet was relatively low, and it also mistakenly detected a water area as the background. However, MSNet could not only distinguish water areas from grasslands, buildings, and other backgrounds, but could also accurately extract edge information. In terms of indicators, MSNet was 2.82% better than the second-best model and 10.4% higher than the worst model. The experimental results are shown in Table 5 and Figure 15 (learning rate = 0.001, power = 0.9, weight decay rate = 0.0001, batch = 4, epoch = 300). In order to include all objects in the scene, the four-class public dataset introduced above was selected for a generalization experiment. As shown in Figure 17, for the first set of images, UNet directly missed all buildings in the lower-left corner.
FCN32s and DeepLabV3Plus missed detections to varying degrees. DABNet mistakenly detected buildings as plants. MSNet did not have these problems, as it benefited from the SFR branch's synchronous learning of intra-class and inter-class information. For the second set of pictures, SegNet mistakenly classified the plants in the lower-left corner as buildings, and UNet, ExtremeC3Net, and the other networks mistakenly classified the background in the middle of the figure as plants; MSNet did not make this error. For the third and fourth sets of pictures, UNet confused plants with water, and its edge detection of the buildings was relatively fuzzy. The learning of edge information by DeepLabV3Plus and ExtremeC3Net was not ideal, and their errors were large. However, MSNet largely avoided the problems of edge blur and false recognition, which showed the superiority of the two-way fusion model. It can thus be seen intuitively that MSNet's MIoU was higher than that of UNet by 14.3% and higher than that of FCN32s by 1.14%, as shown in Table 6.  The results show that the mean intersection over union and the other indicators of MSNet were higher than those of the other models; therefore, the generalization ability and effectiveness of MSNet were proven. MSNet combined a shuffle unit with a skip connection: the channel rearrangement made the extraction of information more evenly distributed, and the residual structure ensured the accuracy of deep semantic feature extraction while attending to the logical relationship between intra-class and inter-class information. The combination of this branch and the LIU greatly improved the segmentation accuracy, allowed information to be extracted at a deeper level, and produced better results. A comparison of the actual segmentation results is shown in Figure 17, and the details of the indices are given in Table 6.
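The "shuffle unit with a skip connection" pattern credited above can be sketched as channel shuffle (as popularized by ShuffleNet) followed by a residual addition. The sketch below is illustrative only: `transform` is a placeholder for the branch's convolutions, and nothing here reproduces the paper's actual SFR implementation.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Rearrange channels across groups so information mixes between them.

    x has shape (N, C, H, W); C must be divisible by `groups`.
    """
    n, c, h, w = x.shape
    assert c % groups == 0
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)  # swap the group and per-group channel axes
    return x.reshape(n, c, h, w)

def shuffle_residual_block(x, transform, groups=4):
    """Shuffle channels, apply a (placeholder) transform, then add the input back.

    The residual addition preserves the original features, matching the text's
    claim that the skip connection protects deep semantic information.
    """
    return x + transform(channel_shuffle(x, groups))
```

With 8 channels and 2 groups, the shuffle interleaves the two halves of the channel axis, so every group sees features from every other group in the next layer.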
In this experiment, the hyperparameters were set as follows: learning rate = 0.0001, power = 0.9, weight decay rate = 0.0001, batch = 4, epoch = 300.

Conclusions
In this work, in order to optimize the effect of land division, a new three-way parallel feature fusion network called MSNet was proposed; it focuses on enhancing contextual information and on reconciling intra-class and inter-class information. The proposed LIU focuses on contextual features and strengthens the learning of detailed information, while the branch composed of the SFRModule and the EFAModule takes both intra-class and inter-class information into account, filters redundant information, extracts key features, and focuses on the learning of boundary information. The two-way feature-sharing network was proven to have a good segmentation effect. However, the segmentation effect of the network is not ideal when faced with a large number of categories and more complex datasets. When buildings are captured from different angles, the network cannot guarantee that the contours of the predicted map will perfectly match those of the original image; it can only ensure the accuracy of the location, and the same is true for water areas. Many studies have shown that adding an optimized transformer structure can significantly improve the segmentation accuracy of a model, so the next research direction is to consider how the transformer structure can be optimized for better fusion with a convolutional neural network. In addition, the network still needs to achieve a faster computing speed with fewer parameters.