Combining Deep Semantic Segmentation Network and Graph Convolutional Neural Network for Semantic Segmentation of Remote Sensing Imagery

Abstract: Although the deep semantic segmentation network (DSSN) has been widely used in remote sensing (RS) image semantic segmentation, it still does not fully exploit the spatial relationship cues between objects when extracting deep visual features through convolutional filters and pooling layers. In fact, the spatial distributions of objects from different classes are strongly correlated. For example, buildings tend to be close to roads. In view of the strong appearance extraction ability of DSSN and the powerful topological relationship modeling capability of the graph convolutional neural network (GCN), a DSSN-GCN framework, which combines the advantages of DSSN and GCN, is proposed in this paper for RS image semantic segmentation. To improve the appearance extraction ability, this paper proposes a new DSSN called the attention residual U-shaped network (AttResUNet), which leverages residual blocks to encode feature maps and an attention module to refine the features. As for the GCN, a graph is built in which the nodes are superpixels and the edge weights are calculated from the spectral and spatial information of the nodes. The AttResUNet is trained to extract the high-level features that initialize the graph nodes. Then the GCN combines the features and the spatial relationships between nodes to conduct classification. It is worth noting that the use of spatial relationship knowledge boosts the performance and robustness of the classification module. In addition, because the GCN is modeled at the superpixel level, the boundaries of objects are restored to a certain extent and there is less pixel-level noise in the final classification result. Extensive experiments on two publicly available datasets show that the DSSN-GCN model outperforms the competitive baseline (i.e., the DSSN model) and that DSSN-GCN with AttResUNet achieves the best performance, which demonstrates the advantage of our method.


Introduction
As the fundamental task of geographic information interpretation, remote sensing (RS) image semantic segmentation is the basis for other RS research and applications, such as natural resource protection, land cover mapping and land use change detection [1,2]. Although it has received considerable attention in the past decade, semantic segmentation of high-resolution RS images remains challenging [3][4][5][6], because the structural complexity of RS images leads to interclass similarity and intraclass variability [7][8][9].
With recent developments in deep learning [10][11][12][13][14], the deep semantic segmentation network (DSSN) has brought remarkable improvements to RS image semantic segmentation [15] compared with traditional methods, such as random forest (RF), decision trees (DT) and support vector machines (SVMs) [16]. Long et al. [17] proposed the fully convolutional network (FCN), the first end-to-end semantic segmentation network, by adding deconvolution layers [18] to the convolutional neural network (CNN). As the representative of the encoder-decoder architecture, U-Net [19] used skip connections to take advantage of multiscale information. U-Net achieved promising performance because it strengthened the feature maps by combining low-level detail information and high-level semantic information through the skip connections. Moreover, SegNet [20] recorded the indices of max pooling in the encoder to perform nonlinear upsampling in the decoder. After that, many other DSSNs [21][22][23][24][25] were proposed, including the DeepLab V3+ network, which adopted atrous separable convolution for image semantic segmentation and achieved state-of-the-art results. In a DSSN, the extracted deep features are used to specify the category of each pixel, so the quality of the features is crucial for semantic segmentation. Obtaining powerful features requires further improving the expressive ability of the network. Like the human visual system, the attention mechanism helps to boost meaningful features while suppressing weak ones [26]. In the channel domain, features are selected along the channel dimension according to their importance. Hu et al. [27] proposed the squeeze-and-excitation (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels. Spatial domain attention introduces spatial context by assigning different weights to pixels at different positions. Attention U-Net [28] used an attention gate module to control the importance of features at different spatial locations. The semantic segmentation network with spatial and channel attention (SCAttNet) [29] was proposed for RS image semantic segmentation; it adopted the convolutional block attention module (CBAM) [30], which applies channel attention and spatial attention in sequence. However, this cascading mechanism could cause aliasing of spatial information and channel information. Therefore, the concurrent spatial and channel 'squeeze & excitation' attention module (scSE) [26] was proposed for medical image segmentation. It uses concurrent spatial and channel SE blocks [27] that recalibrate the feature maps separately along the channel and spatial dimensions, which speeds up the transfer of information.
In the field of RS, many studies on DSSN-based RS image semantic segmentation have emerged in recent years. Many researchers applied FCN to the semantic segmentation of RS images [31][32][33][34]. Kampffmeyer et al. [35] proposed a novel DSSN for urban land cover mapping. Wang et al. [36] used an ensemble multiscale residual deep learning method based on the U-Net architecture to extract buildings. Audebert et al. [37] trained a variant of SegNet and used multi-kernel convolutional layers to quickly aggregate predictions at multiple scales. Zhang et al. [38] proposed the dual multiscale manifold ranking (DMSMR) network to further improve segmentation performance. Pan et al. [39] performed semantic labeling of high-resolution aerial imagery with a fine segmentation network. To exploit multisensor data, such as the digital surface model (DSM), the RGB image and the multimodal data are combined to provide more information for the DSSN [40]. Marmanis et al. [33] proposed a Siamese network to handle the images and the DSM data, and combined edge detection and semantic segmentation in the improved version [41]. He et al. [42] introduced edge information into the DSSN to revise the segmentation results. Moreover, prior knowledge has been used for RS image semantic segmentation. Alirezaie et al. [43] applied U-Net to achieve fast and accurate pixel-level classification, followed by knowledge-based post-processing. However, when extracting deep features through convolutional and pooling layers, the DSSN ignores the spatial relationships between objects, which play a key role in classification from the perspective of biological vision [44]. This problem is more serious for high-resolution RS images, which contain many objects and rich spatial relationships.
Because ground objects are irregularly distributed and are connected to each other only through their mutual relations, the objects and their relationships form a graph, where the graph nodes represent objects and the connecting edges between the graph nodes denote the spatial relationships between the objects [45], such as neighboring, intersecting and separating. Although DSSNs have achieved great success in processing Euclidean data, their performance is still unsatisfactory on graph data, which lie in a non-Euclidean space. As an application of deep learning to graph data, the graph convolutional neural network (GCN) has obvious advantages in extracting features from irregular graph data by graph convolution. The core idea of graph convolution is to use edge connections to aggregate node information and generate new node representations, so the GCN has a strong ability to model the dependency relationships between graph nodes. These advantages have promoted breakthroughs in research on graph analysis [46]. Kipf et al. [47] proposed an effective layer-wise propagation graph model, which operates directly on graph data by convolution in the spectral domain. NN4G [48] realized graph convolution in the spatial domain by directly accumulating the information of neighbor nodes. Because GCNs were limited to shallow layers, Li et al. [49] constructed Deep-GCN by borrowing techniques from CNNs, including residual connections and dilated convolution, and adapting them to the GCN architecture to alleviate vanishing gradients. The graph attention network (GAT) [50] used the attention mechanism to determine the weight of each neighbor node with respect to the central node when aggregating the neighbor information. Compared to GAT, GCN pays more attention to spatial relations than to the similarity represented by weights. To make full use of the local position of each pixel, Yi et al. [51] proposed a pixel-based GCN model initialized by a fully convolutional network (FCN) for the semantic segmentation of natural images. Although every pixel has a local position, a pixel cannot really represent a ground object, and the strength of the spatial relationship is ignored. To mine both the object information and the topological relationships among multiple objects, Li et al. [52] presented a CNN-GCN framework to address multilabel aerial image scene classification. The abstract features from a CNN are conducive to scene classification [2], whereas pixel-level semantic segmentation requires details to specify the category of each pixel.
To address these problems, we propose a DSSN-GCN framework that focuses on spatial relationship modeling and offers a generic way to perform RS image semantic segmentation by combining DSSN and GCN. The DSSN is trained to extract high-level features that semantically initialize the graph nodes. As the deep features extracted by the DSSN are not sufficient for semantic segmentation, we adopt a GCN to make full use of detailed information such as the spatial relationships modeled by a region adjacency graph, in which the regions obtained by applying an unsupervised superpixel segmentation algorithm to the images are the graph nodes and the spatial relationships between nodes are the graph edges. Considering that different nodes in a neighborhood contribute differently to the central node, the strength of the spatial relationship is determined by the spectral similarity and the spatial location. Then the GCN uses the features and the spatial relationships to classify the graph nodes. Moreover, to extract better features and improve the accuracy of semantic segmentation, the attention residual U-Net (AttResUNet) is proposed in this paper, which integrates residual blocks and an attention module into the U-shaped architecture with skip connections [19]. In AttResUNet, the residual blocks [53] help to extract features effectively from deep networks, and the attention mechanism [26] with spatial attention and channel attention is adopted to refine the feature maps automatically. Extensive experiments on the UC Merced Land-Use Dataset (the UCM dataset) [54] and the land cover classification dataset of the DeepGlobe Challenge (the DeepGlobe dataset) [55] show that our proposed approach outperforms the competitive baseline (i.e., the DSSN model), which demonstrates that spatial relationship knowledge can boost the performance and robustness of the classifier. In addition, the segmentation results show that object-level modeling helps to reduce pixel-level noise and restore the boundaries of objects [56,57].
The main contributions of this paper are twofold:
(1) A DSSN-GCN framework is proposed to combine DSSN and GCN for RS image semantic segmentation, where the strength of the spatial relationship is quantified by considering the spectral and spatial information of ground objects. The spatial relationship introduced by the GCN boosts the performance and robustness of the classification module. In addition, we convert pixel-level semantic segmentation into superpixel-level node classification by graph modeling, which helps to reduce pixel-level noise and restore the boundaries of ground objects.
(2) We propose a new DSSN (AttResUNet), which has a U-shaped architecture that uses residual blocks to encode feature maps and an attention module to refine the features. Experiments on two publicly available datasets show that DSSN-GCN with AttResUNet achieves the best performance, which demonstrates the advantage of our method.
The remainder of this paper is organized as follows. The proposed method is described in Section 2, including introduction of the proposed AttResUNet, feature extraction based on DSSN, graph construction and node classification via GCN. Section 3 presents the experiments and results. Finally, Sections 4 and 5 provide the discussion and the conclusion respectively.

Materials and Methods
In this section, the workflow of the DSSN-GCN framework is described first. Then, the proposed AttResUNet is introduced in detail, together with how features are extracted based on the DSSN. Next, the construction of the graph model is presented. Finally, node classification via GCN is described.

The workflow of the proposed DSSN-GCN framework is visually shown in Figure 1. In the graph module, in order to make full use of the object-level spatial relationship and reduce the impact of pixel-level random noise, objects (or superpixels) segmented by the superpixel segmentation algorithm represent graph nodes. Topological spatial relationships between the objects denote the connecting edges of the graph. In DSSN module, considering the powerful ability of feature extraction of DSSN, the DSSN is trained to extract the deep features to semantically initialize the intrinsic content of the graph nodes. In the GCN module, the GCN, which is good at modeling the irregular dependency relationship, converts the semantic segmentation task into the graph node classification task. With the help of the topological spatial relationships between objects and the deep features with strong generalization, the GCN models the relationships between graph nodes and classifies all nodes.

The Proposed DSSN: AttResUNet
The proposed DSSN is shown in Figure 2. It consists of three parts: the U-shaped architecture with skip connections, the encoder based on residual blocks and the attention module, which is composed of spatial attention and channel attention in parallel. The attention module is placed after each block of AttResUNet to refine the extracted features.
To obtain good semantic segmentation results, it is very important to take low-level details into consideration while retaining high-level semantic information, especially for RS images, which contain richer detail than natural images. The U-shaped architecture with skip connections strengthens the feature maps by combining low-level detail information from the encoder and high-level semantic information from the decoder, which allows the DSSN to use both kinds of information in segmentation. In general, a deeper network extracts better features, but depth can hamper training because of vanishing gradients. He et al. [53] solved this problem with the residual neural network (ResNet), which consists of a series of stacked residual blocks, as shown in Figure 2b. The short connections allow high-level gradients to be backpropagated directly, which facilitates training and the learning of better features. The following formula describes the residual block in detail:
$$x_{l+1} = f\left(h(x_l) + F(x_l, W_l)\right)$$
where $x_l$ and $x_{l+1}$ respectively represent the input and output of the $l$-th residual block. Each residual block generally contains a multilayer structure. $F(\cdot)$ is the residual function, which generates the residual by using the layer weights $W_l$; $h(\cdot)$ is usually the identity mapping, $h(x) = x$; and $f(\cdot)$ is the rectified linear unit (ReLU) activation function.
RS images contain abundant ground objects with an extremely complex spatial distribution, so it is important to select regions of interest automatically according to the task. To select and refine the features, the attention module with channel attention and spatial attention in [26] is applied to AttResUNet. Like the human visual system, the attention module can enhance meaningful features and suppress useless ones, which is achieved by adjusting the weights of the corresponding features. In the attention module shown in Figure 2c, spatial attention and channel attention work in parallel. Features (C × H × W) from the convolution layers are used as input. The spatial attention performs a 1 × 1 × 1 convolution and a sigmoid activation on the input feature to learn the spatial attention map (1 × H × W), which represents the importance of different spatial positions. The input is then multiplied by the spatial attention map to strengthen the expression of spatial semantic information. In the channel attention, global average pooling is first performed on the input feature. Then, two 1 × 1 × 1 convolutions and a sigmoid activation calculate the channel attention mask (C × 1 × 1). Finally, channel-wise multiplication is applied between the input feature and the mask to weight each channel according to its usefulness. At the end of the attention module, the outputs of spatial attention and channel attention are added together to fuse the channel information and the spatial information for semantic segmentation.
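The parallel attention module described above can be illustrated with a minimal pure-Python sketch on nested lists. It is not the AttResUNet implementation: the function names, the omission of biases and the ReLU placed between the two channel convolutions (borrowed from the SE block design) are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def spatial_attention(feat, w_s):
    """feat: C x H x W nested lists; w_s: C weights of a 1x1 conv with one output map."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # 1x1 convolution across channels followed by a sigmoid -> map of size H x W
    att = [[sigmoid(sum(w_s[c] * feat[c][i][j] for c in range(C)))
            for j in range(W)] for i in range(H)]
    # every channel is reweighted by the same spatial map
    return [[[feat[c][i][j] * att[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]

def channel_attention(feat, w1, w2):
    """w1: C_r x C and w2: C x C_r weights of two 1x1 convs on the pooled vector."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    # global average pooling: one value per channel
    pooled = [sum(map(sum, feat[c])) / (H * W) for c in range(C)]
    # assumption: ReLU between the two convolutions, as in the SE block
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w1]
    mask = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # channel-wise multiplication with the mask of size C
    return [[[feat[c][i][j] * mask[c] for j in range(W)]
             for i in range(H)] for c in range(C)]

def attention_module(feat, w_s, w1, w2):
    """Adds the outputs of the two parallel branches element-wise."""
    s = spatial_attention(feat, w_s)
    c = channel_attention(feat, w1, w2)
    return [[[a + b for a, b in zip(row_s, row_c)]
             for row_s, row_c in zip(plane_s, plane_c)]
            for plane_s, plane_c in zip(s, c)]

# toy input with C = 2, H = 2, W = 2 and hypothetical weights
feat = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = attention_module(feat, w_s=[0.1, -0.2], w1=[[0.5, 0.5]], w2=[[1.0], [-1.0]])
```

With all weights set to zero, both branches scale the input by sigmoid(0) = 0.5, so their sum returns the input unchanged, which is a convenient sanity check.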

Feature Extraction Based on DSSN
In general, a DSSN consists of an encoder and a decoder. The encoder extracts features with convolutional and pooling layers. The decoder restores the feature maps to the original image size by upsampling. After decoding, the segmentation result $Y \in \mathbb{R}^{w \times h}$ is generated from the input data $X_0 \in \mathbb{R}^{w \times h \times channel}$ by applying the encoder $\varphi$ followed by the decoder $\phi$, where $w$ and $h$ are the width and height of both the input and the output. The probability of a result belonging to class $c$ is $p_c$, and the total number of classes is $n$. During backpropagation, the neural network reduces the training loss by continuously adjusting the learnable parameters $\theta$ to optimize the results. The cross-entropy loss function $L$ is often used for the semantic segmentation task:
$$L = -\sum_{i=1}^{w} \sum_{j=1}^{h} \sum_{c=1}^{n} y_{ij}^{c} \log p_{ij}^{c}$$
where $p_{ij}^{c}$ is the probability that pixel $(i, j)$ belongs to class $c$. For pixel $(i, j)$, the prediction from the forward propagation of the network is $Y_{ij} \in Y$; if $Y_{ij} = c$, then $y_{ij}^{c} = 1$, otherwise $y_{ij}^{c} = 0$. After the supervised training, the DSSN has learned how to extract features that are effective for semantic segmentation. Features can be extracted from either the encoder or the decoder. Although the feature from the encoder is highly abstract, its size is small and details such as the spatial relationship are lost after convolution and pooling.
Such a highly abstract feature is helpful for a one-hot task such as image classification, but is not suitable for semantic segmentation. In contrast, the decoder recovers the original size of the features and restores some details by upsampling. Therefore, we chose the features extracted by the decoder to initialize the graph.
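As an illustration of the pixel-wise cross-entropy loss discussed in this subsection, the following sketch computes it for a tiny probability map. Two assumptions are made here that the text does not state: the loss is averaged over pixels, and the one-hot indicator is taken from a reference label map. The function names are hypothetical.

```python
import math

def softmax(scores):
    """Turns raw class scores of one pixel into probabilities."""
    m = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_cross_entropy(probs, labels):
    """probs[i][j][c]: probability of class c at pixel (i, j); labels[i][j]: class index.

    Because the indicator is one-hot, only the log-probability of the
    labelled class contributes for each pixel. Assumption: mean over pixels.
    """
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, labels):
        for p, c in zip(prob_row, label_row):
            total -= math.log(p[c])
            count += 1
    return total / count

# toy 1 x 2 image with n = 3 classes and hypothetical decoder scores
scores = [[[2.0, 0.5, 0.1], [0.2, 0.3, 3.0]]]
probs = [[softmax(s) for s in row] for row in scores]
loss = pixel_cross_entropy(probs, labels=[[0, 2]])
```

A uniform prediction over two classes yields a loss of log 2, which gives a quick reference value when checking an implementation.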

Graph Construction
The objects and the relationships constitute an unstructured graph. The graph is represented by a tuple Graph = (V, E), where V is the set of graph nodes and E is the set of edges representing the connection between nodes. If e ij ∈ E, node v i connects to node v j with edge e ij . In the RS image, V is a set of ground objects and E represents the relationships between the objects. In order to construct the graph, as depicted in Figure 3, superpixels segmented by the unsupervised segmentation algorithm are used as graph nodes. Each superpixel is composed of a set of adjacent pixels with consistent characteristics. In addition, the first-order adjacency relationship (with common edge) between superpixels is regarded as graph edge to take the topological spatial relationship into consideration.
The unsupervised segmentation divides the input image $X_0$ into $K$ superpixels, where $S_i$ denotes the $i$-th superpixel and $K$ is also the total number of graph nodes.
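The edge construction can be sketched as follows, assuming the superpixel segmentation is available as a 2D map of superpixel indices and that a "common edge" means that two superpixels contain 4-connected neighboring pixels. The function name and the index-map representation are illustrative assumptions, and the superpixel algorithm itself is not shown.

```python
def build_adjacency_edges(segments):
    """segments[i][j] is the superpixel index of pixel (i, j).

    Returns the set of undirected edges (a, b), a < b, between superpixels
    that touch along a pixel border (assumption: 4-connectivity, so purely
    diagonal contact does not create an edge).
    """
    H, W = len(segments), len(segments[0])
    edges = set()
    for i in range(H):
        for j in range(W):
            a = segments[i][j]
            # looking only right and down visits every border exactly once
            for di, dj in ((0, 1), (1, 0)):
                ni, nj = i + di, j + dj
                if ni < H and nj < W:
                    b = segments[ni][nj]
                    if a != b:
                        edges.add((min(a, b), max(a, b)))
    return edges

# toy 3 x 3 index map with three superpixels
segments = [[0, 0, 1],
            [0, 2, 1],
            [2, 2, 1]]
edges = build_adjacency_edges(segments)
```

In this toy map all three superpixels touch each other, so the result contains the three edges (0, 1), (0, 2) and (1, 2).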
Figure 3. A toy example of the graph construction process. After the unsupervised superpixel segmentation, the remote sensing (RS) image (a) consists of superpixels (b), which are regarded as graph nodes (c). Additionally, the first-order adjacency relationship (with a common edge) between superpixels denotes the graph edges (c).
In the region of the superpixel corresponding to node $v_i$, the average value of each channel of the deep features extracted by the DSSN is taken as the node feature vector $x_i$. The feature matrix of the graph is $X_G = [x_1, x_2, \dots, x_N] \in \mathbb{R}^{N \times D}$, where $N$ is the number of graph nodes (the $K$ superpixels) and $D$ is the dimension of the feature vector of a graph node.
For each node, we need to determine which category it belongs to. The class of each pixel is known from the label image. In general, a homogeneous region, such as a superpixel, can be represented by its main characteristics, so it makes sense to assign the majority class to the region. Specifically, we counted the classes of all pixels in each node and assigned the class that covers the largest number of pixels as the label of the node.
The spatial adjacency relationships between the graph nodes are denoted by the adjacency matrix $A \in \mathbb{R}^{N \times N}$. Considering that different nodes in a neighborhood contribute differently to the central node, the strength of the spatial relationship between them is determined by the spectral similarity and the spatial location. If node $v_i$ is adjacent to node $v_j$, the connecting edge between them has a weight $a_{ij} \in A$. To express the strength of the spatial relationship between $v_i$ and $v_j$ quantitatively, $a_{ij}$ is set to be larger when the spectral similarity between the two nodes is greater. In this weight, $lab$ denotes the average color value of a node in the LAB color space of the Commission Internationale de l'Eclairage (CIELAB) and $\sigma_w^2$ controls the range of the weight.
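The node initialization described in this subsection can be sketched in a few lines of Python. The mean-pooling of features and the majority vote follow the text. The exact form of the edge weight is not recoverable here, so the Gaussian kernel in `edge_weight` is only an assumed example of a weight that grows with spectral similarity and whose range is controlled by $\sigma_w^2$. All function names are hypothetical, and every node index is assumed to cover at least one pixel.

```python
import math
from collections import Counter

def node_features(features, segments, num_nodes):
    """features[c][i][j]: D feature maps; returns a num_nodes x D list of channel means."""
    D = len(features)
    sums = [[0.0] * D for _ in range(num_nodes)]
    counts = [0] * num_nodes
    for i, row in enumerate(segments):
        for j, k in enumerate(row):
            counts[k] += 1
            for c in range(D):
                sums[k][c] += features[c][i][j]
    return [[s / counts[k] for s in sums[k]] for k in range(num_nodes)]

def node_labels(label_map, segments, num_nodes):
    """Assigns to each node the majority class of the pixels it covers."""
    votes = [Counter() for _ in range(num_nodes)]
    for seg_row, lab_row in zip(segments, label_map):
        for k, y in zip(seg_row, lab_row):
            votes[k][y] += 1
    return [v.most_common(1)[0][0] for v in votes]

def edge_weight(lab_i, lab_j, sigma_w2):
    """Assumed Gaussian kernel on the mean CIELAB colours of two adjacent nodes."""
    d2 = sum((a - b) ** 2 for a, b in zip(lab_i, lab_j))
    return math.exp(-d2 / sigma_w2)

# toy 2 x 3 image with two superpixels and one feature channel
segments = [[0, 0, 0], [1, 1, 1]]
features = [[[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]]
label_map = [[2, 2, 1], [0, 0, 3]]
X_G = node_features(features, segments, num_nodes=2)
T = node_labels(label_map, segments, num_nodes=2)
w = edge_weight((50.0, 0.0, 0.0), (60.0, 0.0, 0.0), sigma_w2=100.0)
```

Under this assumed kernel, identical mean colours give a weight of 1 and the weight decays toward 0 as the colour distance grows.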

Node Classification via GCN
We adopted a GCN, denoted as $g(X, A)$, to obtain the classification $Z$ of the graph nodes. Like a CNN, a GCN extracts features from graph-structured data by convolution, which is called graph convolution. As shown in Figure 4, graph convolution consists of three steps. In the first step, each node sends its characteristic information to its neighboring nodes. This step extracts the characteristic information of the node. In the second step, every node collects the characteristic information from its neighboring nodes and fuses the local structure with the characteristic information. In the third step, the collected information is gathered and passed through a nonlinear transformation to increase the expressive ability of the model.
Figure 4. Visual example of the three steps in the graph convolution of GCN: information broadcasting, information collection and information aggregation. For example, node 1 first broadcasts its feature to its neighboring nodes 2, 3, 4 and 5. Second, node 2 collects features from nodes 2, 3, 4 and 5. Third, the information is aggregated by gathering the fused features and applying a nonlinear activation.
Graph convolution aggregates the information of a node from its neighbor nodes to generate a new representation. Therefore, using graph convolution on ground objects is useful for aggregating spatial information, which means that the matrix $A$ representing the relationships between nodes must be taken into consideration when calculating the features $H$ of each layer of the GCN:
$$H^{(l+1)} = \sigma\left(L_G H^{(l)} W\right), \qquad L_G = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$$
where $H^{(l)}$ is the feature of layer $l$ of the GCN and, for the input layer, $H^{(0)} = X_G$. $\sigma$, $W$ and $L_G$ are the nonlinear activation function, the learnable parameter matrix and the Laplacian matrix respectively. $\tilde{A} = A + I$, where $I$ is the identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$.
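A single graph convolution layer can be sketched as follows. The sketch assumes the symmetric normalization of Kipf et al. [47], that is, the matrix with self-loops is scaled by the inverse square roots of its node degrees, and it assumes ReLU as the nonlinear activation. Plain nested lists are used instead of a tensor library, and the helper names are hypothetical.

```python
import math

def normalized_adjacency(A):
    """Adds self-loops to A and applies symmetric degree normalization (assumed form)."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_tilde]          # diagonal of the degree matrix
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(P, Q):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def gcn_layer(L_G, H, W, activation=lambda v: max(0.0, v)):
    """One propagation step: aggregate neighbours, project, then activate (ReLU assumed)."""
    Z = matmul(matmul(L_G, H), W)
    return [[activation(v) for v in row] for row in Z]

# toy graph: two connected nodes, one input feature, one output feature
A = [[0.0, 1.0], [1.0, 0.0]]
L_G = normalized_adjacency(A)
H1 = gcn_layer(L_G, H=[[1.0], [3.0]], W=[[1.0]])
```

For the two-node toy graph every entry of the normalized matrix is 0.5, so each node ends up with the mean of the two input features.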
In the training process, the GCN adjusts $W$ by continuously reducing the loss between the predicted node classes and the ground truth, thereby optimizing the output. Here, $T_i$ is the ground truth of training sample $i$, $n$ is the number of samples, and $L$ is the loss function, such as the cross-entropy loss function. After training, the GCN combines the features $X_G$ and the adjacency matrix $A$ to classify the graph nodes. As the nodes come from superpixels, each node corresponds to a region of the image. All pixels in the region receive the category of the corresponding node, which completes the semantic segmentation of the entire image.
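The final projection from node classes back to a pixel map can be sketched as below, assuming the GCN output is a list of per-class scores for each node and that the superpixel index map is available. Both function names are hypothetical.

```python
def classify_nodes(node_scores):
    """Picks, for every node, the index of the class with the highest score."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in node_scores]

def nodes_to_pixels(segments, node_classes):
    """Gives every pixel the class of the superpixel (node) it belongs to."""
    return [[node_classes[k] for k in row] for row in segments]

# toy example: two nodes, two classes, 2 x 2 image
node_classes = classify_nodes([[0.1, 0.9], [0.8, 0.2]])
seg_map = nodes_to_pixels([[0, 0], [1, 1]], node_classes)
```

Because every pixel inherits the class of its superpixel, isolated misclassified pixels inside a region cannot occur in the output map.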
In general, the DSSN-GCN framework is composed of the three steps above, namely feature extraction based on DSSN, graph construction and node classification via GCN, which are briefly summarized in Algorithm 1.
4. Construct the graph edges E. Take the first-order adjacency relationships (with common edge) between the graph nodes as the graph edges and calculate the strength of the edges.
5. After the training of the GCN, adopt the GCN to perform classification on the graph nodes.
6. Get the maps of semantic segmentation. Assign the category of each node to the pixels located in the node.
Output: the maps of semantic segmentation.

Experiments
In this section, the datasets and the details of the experimental settings are introduced first. The experimental results and analysis are given after that.

Datasets and Evaluation Metrics
To test our method, experiments were performed on the UCM dataset [54] and the DeepGlobe dataset [55].
As illustrated in Figure 5, the UCM dataset contains 2100 aerial images with a spatial resolution of 0.3 m and a size of 256 × 256 pixels, which were labeled into 17 categories for semantic segmentation in the DLRSD dataset [54]. To reduce the similarity between the classes [43], we merged the 17 classes into 8 classes, namely vegetation (trees and grass), ground (bare soil, sand and chaparral), pavement (pavement and dock), building (building, mobile home and tank), water (water and sea), airplane (airplane), car (car) and ship (ship), and removed the images containing a field or tennis court. Each category is a combination of the original categories in the parentheses. The filtered images were randomly divided into a training set, a validation set and a test set with 1513, 189 and 190 images, in proportions of 80%, 10% and 10% respectively. The class distribution of the UCM dataset can be seen in Table 1. Vegetation, pavement, ground and building rank first, second, third and fourth respectively. Together, the top four categories account for more than 85% of the pixels, and airplane has the smallest share (less than 0.5%).
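The class merging described above amounts to a lookup table. The sketch below writes it out with the class names given in the text; the variable names and the `keep_image` helper are hypothetical, and the exact label strings used in the DLRSD annotation files may differ.

```python
# merged class -> original classes, as listed in the text
MERGE = {
    "vegetation": ["trees", "grass"],
    "ground": ["bare soil", "sand", "chaparral"],
    "pavement": ["pavement", "dock"],
    "building": ["building", "mobile home", "tank"],
    "water": ["water", "sea"],
    "airplane": ["airplane"],
    "car": ["car"],
    "ship": ["ship"],
}

# images containing these original classes are removed from the dataset
REMOVED = {"field", "tennis court"}

# inverted table: original class -> merged class
ORIGINAL_TO_MERGED = {orig: merged
                      for merged, origs in MERGE.items()
                      for orig in origs}

def keep_image(classes_in_image):
    """True if the image contains none of the removed classes."""
    return not (set(classes_in_image) & REMOVED)

merged = [ORIGINAL_TO_MERGED[c] for c in ["trees", "sea", "mobile home"]]
```

The 15 mapped classes plus the 2 removed ones account for the 17 original categories.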
The DeepGlobe dataset [55], shown in Figure 6, provides 1146 submeter high-resolution images with a size of 2448 × 2448. Seven categories are manually labeled: urban, agriculture, rangeland, forest, water, barren and unknown. As shown in Table 2, the dataset is strongly imbalanced: agriculture covers more than 50% of the area, while urban and forest account for 10.93% and 9.98%, respectively. Moreover, the proportion of unknown is close to 0%, meaning that the ground truth masks contain very few unknown pixels. The entire dataset was divided into a training set, a validation set and a test set containing 803, 171 and 172 images, respectively. Images with a size of 256 × 256 were uniformly cropped from every raw image. These cropped images were randomly divided into a training set, a validation set and a test set of 10,272, 1280 and 1296 images, i.e., proportions of 80%, 10% and 10%, respectively.
In this paper, the overall accuracy (OA), the intersection over union (IoU) and the frequency weighted intersection over union (FWIoU) are adopted as the evaluation metrics [58]. The OA is computed as
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP and FN are the number of true positive points, true negative points, false positive points and false negative points respectively. n is the number of classes.
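The three metrics can be computed from a confusion matrix, as in the sketch below. It assumes the standard multi-class definitions, which may differ in notation from the formulas of the paper: OA as the fraction of correctly labeled pixels, per-class IoU as TP / (TP + FP + FN), and FWIoU as the per-class IoU weighted by the ground-truth frequency of each class. MIoU, the unweighted mean of the per-class IoUs, is returned for comparison. The function names are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = y_true.ravel() * n_classes + y_pred.ravel()
    counts = np.bincount(idx, minlength=n_classes ** 2)
    return counts.reshape(n_classes, n_classes)


def segmentation_metrics(cm):
    """Return (OA, per-class IoU, MIoU, FWIoU) from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class but labeled otherwise
    fn = cm.sum(axis=1) - tp  # labeled as the class but predicted otherwise
    total = cm.sum()

    oa = tp.sum() / total
    denom = tp + fp + fn
    # Classes absent from both prediction and ground truth get an IoU of 0.
    iou = np.divide(tp, denom, out=np.zeros_like(tp), where=denom > 0)
    freq = cm.sum(axis=1) / total  # ground-truth share of each class
    fwiou = (freq * iou).sum()
    miou = iou.mean()
    return oa, iou, miou, fwiou
```

Because FWIoU weights each class by its pixel share, a rare class such as airplane contributes little to it, whereas the same class contributes 1/n to MIoU.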

Implementation Details
In this paper, we used three representative backbones and one proposed network as the DSSN for feature extraction: U-Net [19], SegNet [20], DeepLab V3+ [25] and the proposed AttResUNet. On this basis, four networks are proposed: DSSN-GCN V1 with U-Net, DSSN-GCN V2 with SegNet, DSSN-GCN V3 with DeepLab V3+ and DSSN-GCN V4 with AttResUNet. U-Net and SegNet are widely used as baseline models for RS image semantic segmentation. DeepLab V3+ is a more recent method that has achieved state-of-the-art semantic segmentation results on natural images. The network structure of AttResUNet is shown in Table 3. We adopted residual blocks 1-4 of ResNet-101 [53], pretrained on the ImageNet dataset, as the encoder. The decoder uses 3 × 3 convolutions and 4 × 4 transposed convolutions to recover the original input size. For the training of the DSSN, stochastic gradient descent (SGD) and the cross-entropy were adopted as the optimizer and the loss function, respectively.
Simple linear iterative clustering (SLIC) [59] was used to segment the images into ground objects (or superpixels). The feature of each object was the mean of the feature maps of the upsampling layer within the region of that object. For DSSN-GCN V1, V2, V3 and V4 alike, the feature maps were taken from the last layer of the corresponding DSSN. To initialize the graph, the features of the objects were used as the features of the graph nodes in the GCN [47]. The SGD optimizer and the cross-entropy loss function were used to train the GCN. With these settings, extensive experiments, including a sensitivity analysis, were performed to choose the best values of the critical parameters and to examine the effectiveness of the proposed DSSN-GCN model. All experiments were conducted in the PyTorch framework on an NVIDIA 1080Ti GPU.
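The node initialization described above, averaging a feature map within each superpixel, can be sketched as follows. This is an illustrative NumPy sketch rather than the authors' implementation; the function names are hypothetical, and the superpixel labels are assumed to be contiguous integers 0..k-1 (labels produced by an SLIC implementation may need to be remapped first).

```python
import numpy as np


def superpixel_node_features(feature_map, segments):
    """Average a (C, H, W) feature map inside each superpixel.

    segments is an (H, W) integer array with labels 0..k-1.
    Returns a (k, C) array holding one feature vector per graph node.
    """
    c = feature_map.shape[0]
    labels = segments.ravel()
    k = int(labels.max()) + 1
    flat = feature_map.reshape(c, -1).T  # (H*W, C), one row per pixel
    sums = np.zeros((k, c), dtype=np.float64)
    np.add.at(sums, labels, flat)  # accumulate pixel features per superpixel
    counts = np.bincount(labels, minlength=k).astype(np.float64)
    return sums / np.maximum(counts, 1.0)[:, None]


def nodes_to_pixels(node_labels, segments):
    """Broadcast per-node class predictions back to an (H, W) label map."""
    return node_labels[segments]
```

Mapping node predictions back through `segments` assigns a single class to every pixel of a superpixel, which is why object-level classification suppresses isolated pixel-level noise.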

Sensitivity Analysis of Critical Parameters
The hyperparameters of the proposed method mainly include the number of superpixels k, the similarity factor σ_w^2 and the number of GCN layers l. k determines the number of ground objects (or superpixels) in the graph, σ_w^2 controls the value range of the adjacency matrix of the graph, and l is the number of GCN layers, which represents the depth of the network. As the representation of a ground object in the image, a superpixel retains the boundary of the ground object and is composed of a set of pixels with similar characteristics such as color, brightness and texture. Therefore, converting pixel-by-pixel classification into object-based node classification not only reduces the time consumption of classification, but also reduces noise and restores the boundaries of the ground objects to a certain degree. The number of superpixels in an image determines the size of each superpixel: the smaller k, the larger the area corresponding to one superpixel. If the superpixels are too small, the characteristic information of each node is insufficient; if they are too large, details are lost. Figure 7a,b shows that the best k was 700 for both the UCM dataset and the DeepGlobe dataset.
The adjacency matrix of the graph depicts the spatial relationship between nodes. The matrix values reflect the strength of the relationship, and their distribution is controlled by the similarity factor σ_w^2. Figure 8a,b illustrates the segmentation accuracy for different values of σ_w^2. As the similarity factor increased, both OA and FWIoU first rose and then decreased. The best performance was obtained with σ_w^2 = 2 on the UCM dataset and σ_w^2 = 3 on the DeepGlobe dataset.
The number of layers is positively correlated with the complexity of the neural network. In general, the deeper the model structure, the better the fitting effect. However, due to gradient vanishing, GCNs are usually limited to shallow depths (l = 2-4) [50]. In view of this problem, we tested our DSSN-GCN model with 1-5 GCN layers. As shown in Figure 9a,b, OA and FWIoU both increased with the number of GCN layers until they reached their maximum. The segmentation accuracy was best with l = 2 on the UCM dataset and l = 4 on the DeepGlobe dataset. It is also worth noting that OA and FWIoU were both poor when the GCN had only one layer, because such a network is too shallow to learn a good feature representation.
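To make the role of the number of GCN layers concrete, the sketch below stacks graph convolution layers in NumPy. It assumes the widely used propagation rule H' = ReLU(Â H W) with the symmetrically normalized adjacency Â = D^(-1/2) (A + I) D^(-1/2); whether the paper uses exactly this rule is not stated in this section, and the function names are hypothetical. The depth l equals the number of weight matrices passed in.

```python
import numpy as np


def normalize_adjacency(a):
    """Add self-loops and apply symmetric degree normalization."""
    a_hat = a + np.eye(a.shape[0])
    deg = a_hat.sum(axis=1)  # positive because of the self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(x, a, weights):
    """Run an l-layer GCN, where l = len(weights).

    x: (k, C) node features, a: (k, k) non-negative edge weights,
    weights: list of weight matrices with compatible shapes.
    Returns the (k, n_classes) logits of the last layer.
    """
    a_norm = normalize_adjacency(a)
    h = x
    for i, w in enumerate(weights):
        h = a_norm @ h @ w
        if i < len(weights) - 1:  # no activation on the output layer
            h = np.maximum(h, 0.0)
    return h
```

Each layer mixes a node's features with those of its first-order neighbors, so an l-layer network aggregates information from nodes up to l hops away. With a single layer, only direct neighbors are seen.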

Comparison With the State-of-the-Art Method
To evaluate the performance of the proposed DSSN-GCN model, experiments were conducted on the UCM dataset and the DeepGlobe dataset. We tested four DSSN-GCN networks: DSSN-GCN V1 based on U-Net, DSSN-GCN V2 based on SegNet, DSSN-GCN V3 based on DeepLab V3+ and DSSN-GCN V4 based on the proposed AttResUNet.


Results on the UCM Dataset
The overall accuracy of semantic segmentation on the UCM dataset is shown in the last column of Tables 4 and 5. The OA/FWIoU increased by 1.99%/2.94% for DSSN-GCN V1 compared to U-Net, by 0.89%/1.23% for DSSN-GCN V2 compared to SegNet, by 0.68%/0.84% for DSSN-GCN V3 compared to DeepLab V3+, and by 0.69%/1% for DSSN-GCN V4 compared to AttResUNet. Each DSSN-GCN outperformed its DSSN, which demonstrates the effectiveness of integrating DSSN and GCN in the DSSN-GCN framework. Moreover, Tables 4 and 5 show that the proposed AttResUNet was better than all reference methods and that DSSN-GCN V4 achieved the best semantic segmentation results, which demonstrates the advantage of our AttResUNet.
Tables 4 and 5 report the OA and the IoU, respectively, of each category on the UCM dataset. Due to the class imbalance, the main categories (vegetation, ground, pavement and building) make up more than 85% of the dataset (Table 1). Considering these top four categories, DSSN-GCN V1 improved the OA/IoU of vegetation, ground and building by 3.01%/1.13%, 7.63%/2.89% and 6.75%/6.07%, respectively, compared with U-Net; DSSN-GCN V2 improved the OA/IoU of vegetation, pavement and building by 2.09%/0.85%, 0.87%/1.16% and 4.04%/2.22%, respectively, compared with SegNet; DSSN-GCN V3 improved the OA/IoU of vegetation, ground and pavement by 2.17%/0.99%, 1.02%/0.47% and 4.48%/1.1%, respectively, compared with DeepLab V3+; and DSSN-GCN V4 improved the OA/IoU of the main categories by 0.9%/1.05%, 1.44%/1.31%, 1.64%/0.76% and 0.04%/2.22%, respectively. As shown in Table 5, the IoUs of the top five classes (the top four categories and water) of the proposed DSSN-GCN models (V1, V2, V3 and V4) were 1-6% higher than those of the backbones (U-Net, SegNet, DeepLab V3+ and AttResUNet). For airplane, however, the IoUs of the backbones exceeded those of the DSSN-GCN models. Although airplane makes up a very small share (0.36%) of the UCM dataset, its IoU still contributes 1/8 (there are eight categories in the UCM dataset) to the MIoU. Therefore, it is reasonable to adopt the FWIoU metric to evaluate semantic segmentation methods under class imbalance.
To compare the results of all the models above, we visualized the results of the proposed DSSN-GCN models and the reference methods. Figure 10c-j shows the results of U-Net, DSSN-GCN V1, SegNet, DSSN-GCN V2, DeepLab V3+, DSSN-GCN V3, the proposed AttResUNet and AttResUNet-GCN (DSSN-GCN V4), respectively. The segmentation of each DSSN-GCN model was more accurate and consistent than that of its backbone network, as can be seen by comparing (d) with (c), (f) with (e), (h) with (g) and (j) with (i), which confirms that the proposed DSSN-GCN model improves the results of semantic segmentation. AttResUNet-GCN, with the best FWIoU (73.99%), achieved the best results, and the results of AttResUNet were better than those of the other backbones, which demonstrates the advantage of the proposed AttResUNet. In addition, the results of the DSSN-GCN models were less noisy and had more accurate boundaries, especially for building, car and pavement.

Results on the DeepGlobe Dataset
The last column of Tables 6 and 7 shows the overall accuracy of semantic segmentation on the DeepGlobe dataset.
The OA and the IoU of each category on the DeepGlobe dataset are presented in Tables 6 and 7, respectively. The dataset is strongly imbalanced: the main categories, agriculture (58%), urban (11%) and forest (11%), make up more than 80% of it (Table 2). Considering these top three categories, the DSSN-GCNs had an advantage in OA over the corresponding DSSNs. Furthermore, the IoUs of the top four categories of the DSSN-GCN models were almost always better than those of their backbones. In addition, DSSN-GCN V1 increased the IoU of all categories, most notably with a 13.76%/7.11% improvement in the OA/IoU of urban. However, in the next to last column of Tables 6 and 7, the OA/IoU of unknown is 0 for all models. This is because the proportion of unknown in the dataset (accounting for 0.05%) was too low for the class to be learned and recognized by the classifier.
The segmentation maps of the DeepGlobe dataset are presented to verify the conclusions of the previous paragraph qualitatively. Figure 11c-j shows the segmentation maps of U-Net, DSSN-GCN V1, SegNet, DSSN-GCN V2, DeepLab V3+, DSSN-GCN V3, the proposed AttResUNet and AttResUNet-GCN (DSSN-GCN V4), respectively. The DSSN-GCN models achieved more accurate and consistent segmentation than their backbone networks, as can be seen by comparing (d) with (c), (f) with (e), (h) with (g) and (j) with (i), which demonstrates that the proposed DSSN-GCN model effectively improves the results of semantic segmentation. Moreover, the proposed AttResUNet achieved better results than the other backbones, and AttResUNet-GCN, with the best OA and FWIoU on this dataset, gave the best result.

Discussion
In this paper, we analyzed the sensitivity of the critical parameters, namely the number of superpixels k, the similarity factor σ_w^2 and the number of GCN layers l, which have an important influence on the proposed DSSN-GCN model. In the model, k determines the size and number of graph nodes, σ_w^2 controls the value range of the strength of the spatial relationship, and l reflects the expressive ability of the GCN. To build the graph for the GCN, we constructed the graph nodes through superpixel segmentation and transformed pixel-level segmentation into node classification. The number of superpixels should be selected carefully: if the superpixels are too small, the characteristic information of each node is insufficient; if they are too large, details are lost. According to the experimental results, the best k of our model was 700 for both the UCM dataset and the DeepGlobe dataset. Nevertheless, the speed and accuracy of the methods for node construction still need to be improved. The relationships between nodes are also important, because they constitute the paths of information transmission in the GCN. When constructing the edges between nodes, we jointly considered the spatial relationship (the first-order adjacency relationship) and the spectral information (values in the CIE LAB color space) to quantify the strength of the spatial relationships, and adopted the similarity factor σ_w^2 to control the value range of these edges. As a result, the strength of the spatial relationship differs between pairs of nodes. Moreover, the number of layers is a key parameter of the neural network. Due to the problem of gradient vanishing, GCNs are limited to shallow depths. Although there is some work on constructing deep GCNs [50], their classification accuracy still needs further improvement. In view of this problem, we tested the proposed DSSN-GCN model with 1-5 GCN layers and obtained the best performance with l = 2 on the UCM dataset and l = 4 on the DeepGlobe dataset.
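One way to combine first-order adjacency with CIE LAB color information when weighting edges is sketched below. The exact weighting formula is not given in this section, so the Gaussian kernel exp(-d² / σ_w^2) on the distance between mean LAB colors is an assumption, as are the function names and the use of 4-connectivity to detect neighboring superpixels.

```python
import numpy as np


def first_order_adjacency(segments):
    """Boolean (k, k) matrix: True where two superpixels share a pixel border.

    segments is an (H, W) integer array with labels 0..k-1; horizontally
    and vertically neighboring pixels are compared (4-connectivity).
    """
    k = int(segments.max()) + 1
    adj = np.zeros((k, k), dtype=bool)
    pairs = (
        (segments[:, :-1], segments[:, 1:]),  # left/right neighbors
        (segments[:-1, :], segments[1:, :]),  # top/bottom neighbors
    )
    for a, b in pairs:
        border = a != b
        adj[a[border], b[border]] = True
        adj[b[border], a[border]] = True
    return adj


def edge_weights(mean_lab, adj, sigma2_w):
    """Weight adjacent node pairs by the similarity of their mean LAB colors.

    mean_lab: (k, 3) mean CIE LAB color per superpixel.
    Assumed form: w_ij = exp(-||c_i - c_j||^2 / sigma2_w) if adjacent, else 0.
    """
    diff = mean_lab[:, None, :] - mean_lab[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    return np.where(adj, np.exp(-dist2 / sigma2_w), 0.0)
```

Under this assumed form, a larger σ_w^2 pushes the weights of all adjacent pairs toward 1, while a smaller value makes the weights drop off quickly with color difference. This is one way to read the statement that the similarity factor controls the value range of the edges.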
To verify the effectiveness of our model, we designed four DSSN-GCN models based on different backbones: the proposed AttResUNet, the classic U-Net and SegNet models, and DeepLab V3+, a state-of-the-art model for natural image semantic segmentation. The DSSN-GCN models (V1, V2, V3 and V4) and the comparison methods were evaluated in the experiments. We chose OA and FWIoU to measure the results of semantic segmentation. FWIoU was adopted because it is widely used in semantic segmentation and is more reasonable than MIoU under class imbalance. The segmentation results on the two datasets in Tables 4-7 show that each DSSN-GCN model outperformed its backbone and that DSSN-GCN V4 achieved the best performance on both datasets, which demonstrates the effectiveness of our DSSN-GCN model. They also show the importance of spatial relationships for high-precision semantic segmentation and the relationship modeling ability of the GCN. Moreover, the proposed AttResUNet improved considerably on the other advanced DSSNs, which shows its advantage. The OA/IoU of AttResUNet for small objects, such as car, ship and airplane in the UCM dataset, increased significantly compared to U-Net. However, the performance of DSSN-GCN V4 on small objects such as car, ship and airplane was inferior to that of the proposed AttResUNet. On the one hand, with the help of the residual-block-based encoder and the attention module, AttResUNet can extract more expressive deep features and select regions of interest automatically, which helps the segmentation of small objects. On the other hand, few samples were available for training the GCN, and some detailed information is lost in the object-level modeling of DSSN-GCN, which has a negative impact on the recognition of small objects.
According to Tables 1 and 2, airplane samples account for only 0.36% of the UCM dataset, and the unknown class is similarly rare (0.05%) in the DeepGlobe dataset. These samples were too few for the DSSN to learn a representation of the class, which is why the OA and the IoU of airplane were quite low in Tables 4 and 5 and the OA and the IoU of unknown were all 0% in Tables 6 and 7. This shows that the DSSN model is not good at extracting features of minority classes and recognizing them when few training samples of these classes are available. This weakness of the DSSN in turn led to the poor performance of DSSN-GCN on the minority categories.

As shown in Figures 10 and 11, the segmentation results of the DSSN-GCN models were significantly better than those of their backbones on both datasets. In addition, there was less noise in the results of DSSN-GCN, because the object-based classification suppresses pixel-level noise. Moreover, since a superpixel is composed of a series of adjacent pixels with consistent characteristics and thus retains the boundary of the object, the boundaries produced by DSSN-GCN were more closely aligned with the real contours of the ground objects. As shown in the red circles of Figure 12, the results of DSSN-GCN in (d) and (f) were better than those in (c) and (e), respectively; the boundaries in (d) and (f) even fit the real boundaries more closely than the ground truth (b), especially for buildings.

Conclusions
To introduce spatial relationship information into RS image semantic segmentation, this paper proposed the DSSN-GCN framework, which combines DSSN and GCN. To improve the appearance extraction ability, we also proposed a new DSSN (AttResUNet), which has a U-shaped architecture that uses residual blocks to encode feature maps and an attention module to refine the features. In the framework, a graph is built whose nodes are the superpixels. The graph weight, which denotes the strength of the spatial relationship, is calculated from the spectral information and the spatial information of the nodes. The GCN then combines the node features extracted by the DSSN with the graph weights to classify all the nodes, which converts semantic segmentation into node classification. On the basis of the DSSN-GCN framework, we designed four networks, namely DSSN-GCN V1, V2, V3 and V4. Extensive experiments on the UCM dataset and the DeepGlobe dataset show the effectiveness of the DSSN-GCN framework and the advantage of the proposed AttResUNet. In addition, the superpixel-level modeling through the GCN helped to reduce pixel-level noise and to restore the boundaries of ground objects.
This paper shows that the spatial relationship information introduced by the GCN enhances the performance and robustness of the classifier. Spatial relationship information is essential for high-precision and interpretable semantic segmentation. How to use spatial relationship information and other prior knowledge effectively, and how to learn knowledge automatically to interpret RS images intelligently, requires further research.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: http://weegee.vision.ucmerced.edu/datasets/landuse.html, http://deepglobe.org/challenge.html.