DSSM: A Deep Neural Network with Spectrum Separable Module for Multi-Spectral Remote Sensing Image Segmentation

Over the past few years, deep learning algorithms have held immense promise for better multi-spectral (MS) optical remote sensing image (RSI) analysis. Most of the proposed models, based on convolutional neural network (CNN) and fully convolutional network (FCN), have been applied successfully on computer vision images (CVIs). However, there is still a lack of exploration of spectra correlation in MS RSIs. In this study, a deep neural network with a spectrum separable module (DSSM) is proposed for semantic segmentation, which enables the utilization of MS characteristics of RSIs. The experimental results obtained on Zurich and Potsdam datasets prove that the spectrumseparable module (SSM) extracts more informative spectral features, and the proposed approach improves the segmentation accuracy without increasing GPU consumption.


Introduction
With the rapid development of remote sensing (RS) technology, multi-spectral (MS) images are able to provide increasingly complicated and effective information. As one of the most significant steps in the interpretation of RSIs, segmentation is a comprehensive research topic that includes computer vision (CV), neural networks and RS fields. Segmentation tasks in RS generally focus on extracting a specific category, e.g., water, buildings or cars, or multiple categories all together [1,2]. Today, segmentation for RSIs plays a significant role in disaster prevention and control, land-use planning, urban sprawl detection, etc. [3][4][5][6].
As compared with CVIs, RSIs have two major feature dimensions: spatial features and spectral features. 2.
In the spatial dimension, CVIs generally have a lower resolution and a lower variety of objects. Correspondingly, the resolution in RSIs is generally hundreds of times higher than CVIs. Moreover, RSIs have a more complicated spatial distribution, more diverse object textures, and boundary patterns, and extremely unbalanced object categories. 3.
In the spectral dimension, CVIs consist of red, green, and blue spectra (RGB), which indicate unitary spectrum characteristics. However, aside from visible spectra, such as 1.
A model was designed to realize the self-learning fusion mechanism of MS features through a depth-separable convolution and attention mechanism, which takes dissimilar contributions of different spectra and features into account and reduces the misclassification errors.

2.
On the basis of the above model, an end-to-end MS image segmentation framework called DSSM was proposed to improve the segmentation ability of all surface elements in an experimental dataset and to improve the comprehensive segmentation accuracy.

3.
A series of experiments were conducted to verify the effectiveness and superiority of our proposed method. The experimental results provide a new baseline for further research.
The rest of this article is organized as follows: Section 2 illustrates related work. In Section 3, the proposed method is described in detail. The effectiveness is analyzed and compared with various state-of-the-art (SOTA) methods using the Zurich [34] and Potsdam [35] datasets in Section 4. Finally, Section 5 presents the conclusion.

Related Work
In this section, we firstly review the development of image segmentation approaches in the CV field, including the design of the infrastructure, the principle of various SOTA methods, and fine-tuning in terms of specific scenarios. Then, the image segmentation methods in the RS field are reviewed and compared with the proposed algorithms.
Image segmentation, which aims to automatically assign a category to each pixel, is an active research topic in the field of CV and RS. In the CV field, traditional solutions utilize the knowledge of digital image processing tools, topology and mathematics to segment images, which makes it difficult to adequately leverage the color, spatiality, shape, texture, and boundary features [36,37]. To efficiently integrate these diverse features, certain methods based on CNNs provide a new strategy with which to analyze and interpret images [38]. However, the convolution operation, which is the most significant component of CNNs, is an irreversible feature extraction process, and, thus, it is hard to obtain the pixel-level classification results [36]. To solve this problem, with the encoder-decoder architecture, FCN imposes an upsampling and skip-connection module to retrieve the feature map in its original size [8]. Consequently, FCN has become the benchmark in the image segmentation task, and, thus, correlative algorithms are discussed in the following.
Compared with FCN, UNet has a more graceful architecture and a more complicated skip-connection module, which is able to efficiently fuse multi-scale features [9]. On the basis of UNet, UNet++ optimizes the skip-connection module with a pruned, deep supervised subnetwork [10]. PSPNet fuses the multi-scale features through the pyramid pooling module and combines the structural information using CRF [39], which improves the segmentation performance [12]. On the basis of the pyramid pooling module, combined with dilated convolution, Google proposes an ASPP module, which extracts and fuses multi-scale features more effectively [13,[40][41][42].
In addition to the optimization of network structure, researchers also improved the convolution units. As regards enhancing the feature extraction capability, the attention mechanism was introduced to tackle the image processing task and to learn the significant correlation of multi-channel feature maps [43,44]. Inspired by this idea, we optimized the feature extraction module to reassign the weight of the feature maps for each spectrum. Furthermore, to decrease the convolution operation parameters, XCeption proposed depthwise separable convolution (DS-CNN), which reduces the dimension of 3D feature maps and accelerates the speed of training models [45]. On this basis, we conducted the spectrum separable module (SSM) to decouple the spectral information, which makes it possible to reassign the weight of spectral feature maps.
Most of the methods mentioned above obtained SOTA results on various image data sets. However, as mentioned before, considering the characteristics and differences between RSIs and CVIs, fine-tuning and embedding of RS expertise are still needed in RSI segmentation tasks [25,26,46]. Various researchers have made innovations and optimized the framework or structure, obtaining good results on certain RS datasets. More specifically, inspired by atrous convolution, an FCN-based method without downsampling is proposed to obtain and fuse features of different scales [47]. The holistically nested edge detection method, based on SegNet (HNED-SegNet), realizes RSI segmentation through an end-to-end edge detection scheme [48]. Thereafter, Pan proposed a dense pyramid network to enhance the low information flow between dimensional features by independently processing the digital surface model of the image using grouped convolution and connecting an effective data fusion method [49]. ScasNet, which adopted a coarse-to-fine refinement strategy, consists of a pre-trained encoder, various self-cascading convolution units, and a decoder component. It uses a pre-trained encoder to obtain more effective low-dimensional features, utilizes residual correction for multi-scale feature fusion, and obtained SOTA results on multiple challenging benchmark data sets [2]. Various researchers improved the loss function used in network training for the uneven distribution of ground objects of RSIs, such as focal loss [50] and balanced or weighted cross-entropy loss [21,51].
Furthermore, other researchers focus on fusing spectral and spatial features. More specifically, spectral features of each pixel are exploited by 1D CNN, which contains five layers, i.e., an input layer, a 1D convolutional layer, a max-pooling layer, a fully connected layer, and an output layer [28]. In [29,31,52,53], diversiform strategies are utilized to explore the effective representation of hyperspectral data in a lower dimension, including principal components analysis (PCA), 1D CNN, local discriminant embedding (LDE) and fractional order darwinian particle swarm optimization (FODPSO). A lightweight framework, with bag-of-features learning, that integrates 2D and 3D CNNs is proposed to learn the joint spatial-spectral features in [54]. In [55], a mini-graph convolutional network (miniGCN) is integrated with a CNN to extract fused spectral and spatial features in an efficient approach. In contrast, in our work, we focus on exploring the abstract correlation among spectra rather than finding a specific representation, and we model the spectral information in a global manner rather than in a local manner as a 3D CNN does.
In summary, beginning with the idea of decoupling spectral features, we built a spectral separable convolution module, and leveraged the attention mechanism to select effective features in a self-learning manner. Extensive experiments demonstrated that our module effectively improved the baseline segmentation accuracy, and the overall performance achieved SOTA results.

Methodology
In this section, we describe the proposed network structure in detail. First, we explain how the proposed spectral feature extraction strategy module explores the links between channels in MS images. Secondly, the detailed network structure of the DSSM is illustrated herein. Finally, the loss function formulation is elaborated.

Spectrum Separable Module
In this section, we describe the structure and principle of the SSM in detail, focusing on how SSM can improve the representation of spectral features.
In the sixth paragraph in Section 2, we discussed the shortcomings of CNNs in processing RSIs, because the inherent channel correlation inherent to CNNs ignores the different influence capabilities between spectra. Therefore, inspired by the depth-separable concept of DS-CNN, we proposed the spectrum separable module, as shown in Figure 1. The network structure of the SSM can be divided into three steps: spectrum-wise convolution, depth-wise attention and point-wise convolution.

Spectrum-Wise Convolution
In the spectrum-wise convolution step, I ∈ R H×W×4 refers to the input image. I i j represents the i-th convolutional block on the j-th spectrum where j ∈ {1, 2, 3, 4} and i ∈ {1, 2, . . . , n}, and where n is the number of convolutional blocks. Then, for each convolutional block, we adopt four groups of kernels with 64 kernels in each group. K k j ∈ R s×s denotes the k-th kernel in the j-th group where k ∈ {1, 2, . . . , 64}. Moreover, feature map F can be obtained by concatenating all the convolutional results and F x co is defined to conduct the concatenation operation upon dimension x. Therefore, the i-th value in F ∈ R H ×W ×256 can be indicated as: Different from DS-CNNs, we removed the constraint on the number of output feature graphs. We isolated each spectrum, treated them as separate H × W × 1 images, and then carried out the convolution operation. Since the input has only one channel, there is no redundant channel correlation. Here, we set the number of convolution kernels per channel to 64 for comparison with the baseline. By separating the spectrum, we convert one input into four inputs. Moreover, for each input, the network can learn the feature type required by each input through backpropagation, which is a capability that CNNs and DS-CNNs do not have. This approach allows us to obtain the feature maps for each spectrum, and fuse them together using a concatenation operation to produce the final feature maps.
Since the concatenation operation is conducted for feature fusion, the spectrum-based feature maps are equally dealt with in the following processes, which can cause a loss of accuracy. For example, the nearshore water may have a different color from the ordinary water in the RGB image due to the shallow depth. However, the corresponding pixel values for the same nearshore water in NIR may not be very different from ordinary water. To achieve better segmentation accuracy, a trade-off should be made to capture both the texture features in RGB and the near-infrared features in NIR for the water area. We expected the proposed network to learn the features based on certain points of focus, and thus an attention mechanism was utilized in the proposed model.

Depth-Wise Attention
In the first step, we obtained feature map F, which contains a number of channels representing multiple types of features. By applying the attention mechanism on F, we enhanced the important features and weakened the unimportant ones. In brief, we gave each channel the ability to self-learn its weights.
In order to obtain the global feature of a channel, the global average pooling method (GAP) was adopted to integrate the global information and obtain the compressed feature graph G ∈ 1 × 1 × 256. This process is represented as F sq , and then G can be calculated by: Since we want to obtain the weights of a complete channel, GAP is used to integrate the channel's internal information. Note that GAP is a relatively simple, but effective, pooling method to integrate global information. In addition to this, for example, maximum pooling and random pooling both cause more or less pixel point loss inside the channel, so we chose GAP.
Thereafter, two FC layers were adopted to integrate the multi-channel information, followed by the sigmoid function as the active function with which to generate the weight map S that can be applied to the original feature map. F ex is used to denote this operation, and S can be calculated by: It is necessary to use the FC to integrate the channel information here, because each channel value of G is the result of a GAP based entirely on a single channel. If it is directly applied to F, the process of obtaining the following T is completely without backpropagation and training, which makes it impossible to learn the law of channel enhancement and attenuation through all channel information.
Finally, the obtained weight map was multiplied directly by the channel corresponding to the original feature map, which is represented by F scale , with which we obtain the weighted feature map T. At this point, the feature map has acquired the ability to express the strength and weaknesses of multiple features. T can be expressed as:

Point-Wise Convolution
We applied the attention mechanism to a series of features corresponding to each spectrum, and the convolution kernel of 1 × 1 to integrate the channel correlations. On the one hand, this enabled it to learn nonlinear relationships between channels, and on the other hand, it allows it to configure subsequent input dimensions by specifying the number of convolution kernels.
In Equation (5), J ∈ R 1×1×256 refers to the kernels we adopted here, and the final feature map O ∈ R H ×W ×64 can be obtained by:

Computational Cost
To compare the computational cost between standard convolutional layers and DSSM when the input and output has the same shape, we used the following assumptions: 1.
Both the standard convolutional layer and DSSM take as input a D i × D i × M image I as input and produce a D i × D i × N feature map F, where D i is the spatial width and height of a square input image, M is the number of spectra in the input image, and N is the number of channels in the output feature map.

2.
The padding type is the same, which means the spatial size of output feature map should be the same as that of the input image.

3.
During the execution process of the model, the GPU consumption is mainly determined by the model size under the same conditions. In turn, the model size is proportional to the number of parameters. Therefore, in this section, the number of parameters of the module is used to evaluate the computational cost.
Thereafter, in the standard convolutional layer, N convolution kernels of size J are needed to obtain the feature map, where D k is the spatial length of the convolution kernel, and M is the number of the input image spectrum, and N is the number of output channels as defined previously. Thus, the standard convolution layer has the the computational cost of : where the cost is greatly influenced by the shape of the input and output. As we all know, the coupling between each spectrum leads to the inevitable equality in the standard convolution.
To solve this problem, SSM was proposed to break the coupling and adopt a 1×1 convolution to integrate the mixed feature map. Here, we need M kernel groups in SSM, then in each group, N convolution kernels K of size D k × D k × 1 are designed to obtain specific feature map group for each spectrum. Therefore, the computational cost of the first SSM step is: These groups are concatenated together, rather than added, to obtain a mixed feature map. Then, N kernels of size D k × D k × 1 are adopted to apply a linear combination of the mixed feature map to obtain the final feature map. Considering the cost of the above step together, SSM has the total cost of: Therefore, we calculate the following ratio by comparing SSM with the standard convolution: It is noteworthy that when we set N = 1 then the cost is the same as that in DS-CNN. Moreover, when N is set to be the same as N, we maintain the dimension of the feature map. Now, we solely need to focus on the component of N D 2 i , which is equal to 1 64 in our implementation, where N and D i are both 64. Hence, the computational cost was only 1.56% more than standard convolutions, but a much improved spectral feature extraction capability was successfully achieved.

Network Structure
In order to solve the semantic segmentation problems of MS images, we developed an end-to-end network framework based on Deeplab. The structure of the framework is shown in Figure 2. The DSSM is composed of the following steps: Step 1: Derive a feature map with spectral information The MS image is used as the model input and can be processed using the SSM, which results in a feature map with a strong expression ability for spectral information.
Step 2: Feature extraction using XCeption The feature map is transferred into XCeption for low-dimensional feature extraction.
Step 3: Multi-scale feature extraction using ASPP The extracted feature map is input into the ASPP module to extract the multi-scale feature, and the result is concatenated to obtain the high-dimensional feature.
Step 4: Upsampling and concatenating The high-dimensional feature is firstly upsampled and then concatenated with the lowdimensional feature. Finally, a simple convolution and upsampling operation is adopted to return the size of the feature to the input size.

Loss Function
In our proposed framework, the overall loss function can be indicated as:

Balanced Sparse Softmax Cross Entropy
In image segmentation, there is usually a large deviation in the proportion of pixels of different categories. To cope with the biased sampling class, we defined a balanced sparse softmax cross-entropy-based tradeoff strategy.
Assuming that there are S classes of surface object in our dataset, we adoptŶ ∈ R H×W×S to denote the true label, andŶ = {ŷ i , i = 1, 2, . . . , S} which means i is the object category. Moreover, eachŷ i ∈ R H×W is a binary map that only contains 1 and 0, which represents if the current pixel belongs to the i-th category or not. Similarly, Y ∈ R H×W×S is adopted to refer to the prediction map, where Y = {y i , i = 1, 2, . . . , S}. The difference is that the value of a specific pixel in y i is the probability of the pixel belonging to i-th category. Following a simple balanced strategy, and balanced sparse softmax cross entropy can be defined as: where A 1 denotes the L1 norm of matrix A and β i can be defined as:

Dice Coefficient
The dice coefficient is also a commonly used loss function applied in image segmentation tasks [21], which can be defined as:

Experiment
We designed and conducted three groups of experiments with the following three objectives:

1.
In order to verify the effectiveness of the proposed DSSM in an MS information fusion, firstly, we compared the segmentation performance of the baseline framework based on RGB, NIR, and RGB-NIR inputs to verify whether RGB-NIR contains more valuable information and improves the performance. Secondly, we compared the proposed DSSM with the baseline framework based on the same set of RGB-NIR inputs to evaluate its effectiveness.

2.
In order to further verify the validity of the proposed SSM, features from different levels were visualized and compared to demonstrate that more abstract and effective features can be extracted from the SSM.

3.
In order to verify the overall performance of the proposed framework, comparison experiments were carried out to compare with other SOTA methods.

Data Introduction
In image segmentation tasks, supervised learn-based frameworks generally require large amounts of data with high quality labels, which are often difficult to obtain. Fortunately, Michele released his self-labeled Zurich dataset in 2015, which includes 20 highresolution photos of Zurich, Switzerland with a 0.62-m GSD obtained from a QuickBird satellite in August 200. Surface objects in this dataset are classified into eight different categories: road, building, tree, grass, bare soil, water, railway, and swimming pool, and each object is annotated by a specific color. In order to reflect the real-world distribution, the number of pixels in various samples is unbalanced, as shown in Figure 3. In addition, since it was marked manually, these labels are not completely consistent with the real-world objects, and there are some errors or missing labels. The data set contains the original image and the corresponding labels. The original image is an MS image containing four channels of NIR and RGB, with a 16-bit depth. The 16-bit depth image contains a lot of colors and subtle differences among those colors that humans cannot recognize, but that neural networks can easily pick up. Another dataset we adopted is the Potsdam dataset provided by ISPRS as shown in Figure 4, which contain 38 images of Potsdam, Germany.

Sampling Strategy
For the Zurich dataset, as the resolution of the source photos ranges from 650 × 650 to 1700 × 1700, random sampling was conducted to take into account the impact of resource constraints, training efficiency, training effects, and other factors. The 20 photos were randomly sampled to produce 20,000 slices, 256 × 256 in size, preserving the maximum amount of detail and edge information in the original photos. Then, 80% of these participated in training and validation, and the remaining 20% served as test sets.

Data Augmentation
In order to improve the expression ability of the data and the generalization ability of the model, we applied a variety of data enhancement methods to the data, including rotation, flipping, slicing, Gaussian filtering, bilateral filtering, gamma transformation, and so on. In fact, the random slice method mentioned above, which increases the number of samples, was also used for data augmentation.

Evaluation Methods
In order to evaluate the effectiveness of the proposed framework, we adopted intersection over union (IoU) as an evaluation indicator for a single category of ground objects. IoU defines the similarity between the predicted area and the ground reality area of the objects in this set of images. The calculation formula is as follows: where TP, FP, and FN represent the counts of true positive, false positive, and false negative, respectively, and i refers to the category number. The result of the count is generally obtained by the confusion matrix between the predicted value and the ground reality. Moreover, FW IoU was used to evaluate the overall segmentation performance, which can be calculated as: where p i refers to the percentage of pixels of the i-th ground object. When calculating the average IoU, FW IoU also took into account the occurrence probability of ground objects, thus making the evaluation of the segmentation result more accurate.

Experimental Environments
The proposed architecture is implemented using the Tensorflow library. The hardware device is a Ubuntu server equipped with four GeForce RTX 2080 Ti GPUs (each has 12 GB of memory), one Intel i9-9960X CPU and 64 GB of RAM. The detailed hardware configuration and software requirements are shown in Table 1.

Comparison of Various Processing Strategies
In the introduction to SSM, we stated that the SSM actually provides a strategy for fusing different spectra from the input image. The purpose of this section of the experiment was to verify the effectiveness of the fusion strategy in the SSM. This group of experiments were conducted on the Zurich and Potsdam datasets.
Before we proposed the SSM, we put forward a hypothesis that different spectra would show different values on different ground objects. On the basis of this assumption, we treated each spectrum of the input image as an independent single spectral image. We convolved them separately, and then we established the nonlinear relationship among the feature maps generated by all the spectra through 1 × 1 convolution. This enabled the network to learn which features needed to be learned for each spectrum independently. In order to assess whether more valuable information is contained in RGB-NIR and the effectiveness of the SSM, we designed several processing strategies for comparison, which are as follows:

1.
We fed three spectra of red, green, and blue into the baseline (baseline with RGB).

2.
We fed the NIR into the baseline (baseline with NIR).

3.
We fed four spectra into baseline (baseline with RGB-NIR).

4.
We fed all four spectra into the proposed framework (DSSM with RGB-NIR).
In this group of experiments, the Deeplabv3+ network, which is fine-tuned in terms of cost function and the number of convolutional layers, was adopted as the baseline with which to compare the DSSM. Figure 5 shows the segmentation results of the baseline with RGB, the baseline with NIR, the baseline with RGB-NIR, and the DSSM with RGB-NIR. Figure 5a,b are the original images and their ground truths. Figure 5c-e refer to the baseline with RGB, the baseline with NIR and the baseline with RGB-NIR, respectively. Figure 5f represents the DSSM with RGB-NIR inputs.
As a result of the large receptive field of the original image, the contrast effect of the whole image is not obvious. Therefore, various small image blocks with a size of 200 × 200 were cut from the original image. These image blocks were selected from certain parts with extreme differences in segmentation results in typical ground objects.
Firstly, we compared the segmentation differences of different inputs on small isolated trees, as shown in the yellow box in the first line of Figure 5, where three small isolated trees are shown. The baseline with RGB and the baseline with NIR strategies lost the ability to identify small trees, and they roughly identified the area as the background. The RGB-NIR strategy identified a tree, but its outline is quite different from the ground truth. The DSSM with the RGB-NIR strategy identified three complete trees. Although the boundary between the first tree and the second tree is connected, the outline of the two trees can be seen, and the shape of the identified trees is closer to the ground truth than that of the other three strategies. In the yellow box in the second line of Figure 5, we selected small roads, which are bridges marked as roads. Curiously, the two strategies, RGB and NIR, were able to coarsely divide the bridge. However, when they were mixed, i.e., when the RGB-NIR strategy was used, the recognition ability actually reduced. Their simultaneous use may inhibit the other's recognition ability. Of course, the RGB-NIR with DSSM strategy also correctly divided the bridge.
In the yellow box in the third line of Figure 5, there is a small and shallow swimming pool. The RGB strategy completely lost the ability to recognize this object, and even identified the shore of the swimming pool as a road. The baseline with NIR, the baseline with RGB-NIR and the DSSM with RGB-NIR were all able to identify the swimming pool, but the predicted contour of the DSSM RGB-NIR was closest to the ground truth.
In the yellow box in the fourth row of Figure 5, there is a road with a shaded area. Both the baseline with NIR and the baseline with RGB-NIR accurately identified the road with a shaded area, while the baseline with RGB and the baseline with RGB-NIR consider that to be the background.
In a previous analysis, we compared four classes in which obvious differences can be observed with the human eye. Since each image block was only 200 × 200 in size, the visual effects in most areas using different processing strategies were similar. In Table 2, IoU and FW IoU of different strategies on each ground object are quantitatively given. It can be clearly seen from Table 2 that the fusion strategy of the DSSM proposed by us greatly improved the segmentation effect of each type of ground object. Firstly, the baseline with RGB-NIR performed better than the baseline with RGB and the baseline with NIR by 1.18% in terms of FW IoU using the Zurich dataset. However, as compared with the first three strategies, the DSSM with RGB-NIR consistently outperformed them, registering increased FW IoUs of approximately 3.47%. Furthermore, the confusion matrix for the prediction is shown in Figure 6. The wrong classification frequently occurred for the background; this does not represent an inter-object mistake but a difficulty in distinguishing between background and non-background objects.  To further verify the efficiency of the DSSM, we also conducted other experiments using the Potsdam dataset. Figure 7 and Table 3 show the effects of the four strategies using the Potsdam dataset. The DSSM with RGB-NIR continued to demonstrate a better segmentation capability than the other three strategies, which increased by 1.92% in terms of FW IoU.  In addition to comparing the segmentation accuracy of different strategies, we also compared the consumption cost of each strategy. As can be seen from Table 4, for the first three schemes, we maintained the same size and number of convolution kernels in the baseline. At this point, the number of spectra of the input image only affects the thickness of the convolution kernel, and we already know from the previous analysis that the thickness of the convolution kernel in the CNN does not affect the parameters of the model; thus, their consumption cost remained the same. As compared with the aforementioned three strategies, the consumption cost of the DSSM with RGB-NIR increased by about 20%. Considering that this strategy provides a great improvement in segmentation accuracy, we believe that the increase in consumption is acceptable. In conclusion, the fusion strategy in our proposed SSM is more effective than that in those strategies that use the pure CNN to obtain the image feature, and it provides an acceptable consumption cost. Another interesting fact is that using the RGB-NIR strategy is sometimes less effective than directly using the NIR strategy, which can be seen in both the qualitative and quantitative results. In the next section, we discuss why the SSM improves the segmentation accuracy in more detail.

Comparison of Features Extracted from Different Levels
In this group of experiments, we assessed the validity of the SSM from another perspective. Different levels of features extracted from the baseline and DSSM were visualized and contrasted. This group of experiments was conducted using the Zurich dataset. Figure 8 shows the low-level features extracted from the baseline and DSSM, respectively. Figure 8a,b are the original images and their ground truth. Figure 8c shows the low-level features extracted in the baseline where spectra are convolved in a weight-sharing manner. Figure 8d-g represent the low-level features explored in DSSM in a spectrumseparable way. Concentrating on the yellow boxes in Figure 8d-g, different features are revealed in different spectra. In other words, diversiform characteristics under different spectra can be captured by the SSM. However, as shown in Figure 8c, differentiation among objects becomes less distinct due to the weight-sharing strategy.
In Figure 9, we visualize different level of features. Figure 9a,b are the original images and their ground truth. Figure 9c,d show the low-level features. They are extracted by one convolutional layer in the baseline and the SSM, respectively, in the proposed method. Obviously, as shown in the yellow boxes, fused low-level features in the DSSM provide a sharper and more legible pattern and information. Figure 9e,f illustrate the high-level features obtained from the ASPP in the baseline and DSSM, respectively. The higher the level is, the more abstract and unrecognizable the pattern will be. As shown in Figure 9e,f, although the features are abstract and hard to recognize, high-level features in the DSSM contain more complicated manifestations.

Comparison of Other Popular Segmentation Methods
In order to verify the overall performance of the DSSM, we compared the DSSM with UNet++, DeeplabV3+, HNED-SegNet, ScasNet, and miniGCN using the Zurich and Potsdam datasets. As discussed in Section 2, UNet++ and DeeplabV3+ are CV-field methods and HNED-SegNet, ScasNet, and miniGCN are RS-field methods. In order to obtain more effective results, only the network structure was customized in the experiment using each method.
As shown in Table 5, the DSSM outperformed the other methods in most surface elements, and provided a satisfying improvement of 5.78% for trees in terms of IoU. Vegetation areas, such as trees, have a similar pattern as other surface elements in a single spectrum. Through establishing the correlations among spectra, we improved the segmentation efficiency of these areas. However, as a result of the extremely irregular shapes and the inaccurate labels of trees, the IoU remained lower than the FW IoU. Moreover, the segmentation accuracy improved by 2.99% for water and pools for the same reason. Other surface elements improved by 1.53% overall when background and rails were excluded.
Furthermore, the IoU of the DSSM for rails was 1% lower than that of UNet++, because the multi-level skip-connection structure of UNet++ is more effective on these narrow and slender surface elements. Overall, the FW IoU is increased by 2.19%, which demonstrates that the segmentation accuracy of the DSSM in terms of segmenting MS RSIs is generally better than that of other SOTA methods. The quantitative results using the Potsdam dataset are shown in Table 6, which are similar to those using the Zurich dataset. It is worth noting that our method still has a lot of room for improvement in the results of the buildings. Overall, the FW IoU is increased by 0.19% compared with the best method, which is HNED-SegNet. We also compared the execution time with other methods as shown in Table 7. We define the unit of execution time as the time it takes for the model to predict each batch. Model and data loading times have been excluded from the results. In the actual prediction, each batch contains 64 images of 256 × 256 size. UNet++ takes the shortest time, thanks to its streamlined network structure. Our method takes about 10.85 ms, which is 1.54 ms slower than UNet++. The extra time consumption of the proposed method is acceptable when taking its performance improvement into account.

Conclusions
In this paper, we propose a deep-learning based, end-to-end network structure DSSM for the semantic segmentation of MS optical RSIs. The framework is mainly composed of an SSM module and a deep neural network.
The SSM is based on a DS-CNN and optimizes the spectral feature extraction strategy. First, features are independently extracted through spectrum-wise convolution, and then the importance of each feature is studied using a depth-wise attention module. Finally, a nonlinear relationship between features is established through point-wise convolution to generate the final feature map. These extracted features not only contain spatial information, but also have a stronger ability to express spectral correlation. We applied the SSM as a prefeature extraction module in deep neural network. The experimental results show that the DSSM has a better segmentation capability than other SOTA methods, and provides an improvement of 2.19% in terms of FW IoU. Moreover, our proposed SSM can be easily grafted onto other deep-learning-based networks.
In future work, we will focus on transforming the processing techniques proposed in the SSM into hyper-spectral RSIs, reducing the complexity of the network and consumption cost, and applying the strategy to other RS tasks, such as change detection.